cs.CL @ 2025-07-04: 659
-
00 07-03 (4) Requirements Elicitation Follow-Up Question Generation Generierung von Folgefragen bei der Anforderungserhebung 需求获取的后续问题生成 2507.02858v1 -
01 07-03 Answer Matching Outperforms Multiple Choice for Language Model Evaluation Answer Matching übertrifft Multiple Choice bei der Bewertung von Sprachmodellen 在语言模型评估中答案匹配优于多项选择 2507.02856v1 -
02 07-03 MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs MOTIF: Modulares Denken durch Verstärkung Feinabstimmung in LLMs MOTIF:通过强化微调在LLM中进行模块思考 2507.02851v1 -
03 07-03 LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users LLM Hypnose: Nutzung des Benutzerfeedbacks für unautorisierte Wissensinjektion für alle Benutzer LLM Hypnosis:利用用户反馈向所有用户进行未经授权的知识注入 2507.02850v1 -
04 07-03 Legal Requirements Translation from Law Rechtliche Voraussetzungen Übersetzung aus dem Recht 法律要求译自法律 2507.02846v1 -
05 07-03 Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection Visual Contextual Attack: Jailbreaking MLLMs mit Image-Driven Context Injection 视觉上下文攻击:利用图像驱动的上下文注入越狱MLLMs 2507.02844v1 -
06 07-03 Improved Unbiased Watermark for Large Language Models Verbessertes unvoreingenommenes Wasserzeichen für große Sprachmodelle 改进大语言模型的无偏见水印 2502.11268v2 -
07 07-03 StepHint: Multi-level Stepwise Hints Enhance Reinforcement Learning to Reason StepHint: Mehrstufige stufenweise Hinweise stärken das Lernen zur Vernunft 步进提示:多级分步骤将强化学习提升到合理 2507.02841v1 -
08 07-03 From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents Von der Web-Suche in Richtung Agentic Deep Research: Incentivizing Search with Reasoning Agents 从网络搜索到代理深层研究:激励使用理性代理进行搜索 2506.18959v3 -
09 07-03 ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning ExPO: Entsperren harter Vernunft mit selbsterklärungsgeführtem Verstärkungslernen ExPO: 以自我解释引导的强化学习解锁困难推理 2507.02834v1 -
10 07-03 Generalizing Verifiable Instruction Following Verallgemeinern der überprüfbaren Anleitung 普遍适用的可核实说明 2507.02833v1 -
11 07-03 SynapseRoute: An Auto-Route Switching Framework on Dual-State Large Language Model SynapseRoute: Ein Auto-Routen-Schaltrahmen für das Dual-State Large Language Model SynapseRoute:面向双状态大语言模型的自动路由切换框架 2507.02822v1 -
12 07-03 Multimodal Mathematical Reasoning with Diverse Solving Perspective Multimodale mathematische Vernunft mit unterschiedlicher Lösungsperspektive 具有不同解决视角的多模式数学理由 2507.02804v1 -
13 07-03 Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models Ist Reasoning alles, was man braucht? Untersuchung von Bias im Zeitalter der Reasoning-Sprachmodelle 推理就是你所需要的一切吗?探究推理语言模型时代的偏见 2507.02799v1 -
14 07-03 From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding Von langen Videos zu Clips: Ein von Menschen inspiriertes Video-Editing-Framework mit multimodalem Narrative Understanding 从长视频到启动剪贴板:由人启发的视频编辑框架,包含多模式叙述理解 2507.02790v1 -
15 07-03 GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling GPAS: Beschleunigung der Konvergenz des LLM-Vortrainings durch Gradient-Preserving Activation Scaling GPAS:通过梯度保持的激活缩放加速LLM预训练的收敛 2506.22049v2 -
16 07-03 Enhancing Clinical Multiple-Choice Questions Benchmarks with Knowledge Graph Guided Distractor Generation Verbesserung klinischer Multiple-Choice-Fragen Benchmarks mit Knowledge Graph Guided Distractor Generierung 利用知识图谱引导的干扰项生成增强临床多选题基准 2506.00612v3 -
17 07-03 Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs Self-Correction Bench: Enthüllung und Adressierung des Selbstkorrektur-Blindflecks in LLMs Self-Correction Bench:揭示并解决LLM中的自我校正盲点 2507.02778v1 -
18 07-03 DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment DeSTA2.5-Audio: Auf dem Weg zu einem General-Purpose Large Audio Language Model mit selbsterzeugter Cross-Modal Alignment DeSTA2.5-Audio:努力建立具有自发跨模式一致的通用大型音频语言模型 2507.02768v1 -
19 07-03 Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression Batch-Max: Höherer LLM-Durchsatz mit größeren Batch-Größen und KV Cache-Kompression 批量最大:使用大批量大小和 KV缓存压缩的高级 LLM 输送量 2412.05693v3 -
20 07-03 Measurement of the Granularity of Vowel Production Space By Just Producible Different (JPD) Limens Messung der Granularität des Vokalproduktionsraums durch Just Producible Different (JPD) Limens 用最小可产出差异(JPD)阈限测量元音产出空间的粒度 2507.02744v1 -
21 07-03 Early Signs of Steganographic Capabilities in Frontier LLMs Frühe Anzeichen von steganographischen Fähigkeiten in Frontier LLMs 前沿LLM隐写能力的早期迹象 2507.02737v1 -
22 07-03 Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge Mind2Web 2: Agentische Suche mit Agent-as-a-Judge bewerten Mind2Web 2: 与代理法官评估代理搜索 2506.21506v2 -
23 07-03 On Characterizations for Language Generation: Interplay of Hallucinations, Breadth, and Stability Über Charakterisierungen für die Sprachgenerierung: Zusammenspiel von Halluzinationen, Breite und Stabilität 语言生成特征:幻觉、广度和稳定性之间的相互作用 2412.18530v2 -
24 07-03 Next-Token Prediction Task Assumes Optimal Data Ordering for LLM Training in Proof Generation Next-Token-Vorhersage-Aufgabe setzt eine optimale Datenbestellung für LLM-Training in Proof Generation voraus 假定为校实生成的LLM培训提供最佳数据排序 2411.00863v2 -
25 07-03 Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers Können LLMs kritische Einschränkungen innerhalb der wissenschaftlichen Forschung identifizieren? Eine systematische Bewertung von KI-Forschungspapieren LLMs能否查明科学研究中的关键限制? 对AI研究文件的系统评估 2507.02694v1 -
26 07-03 Exploring Gender Bias Beyond Occupational Titles Erforschen von Gender-Bias über Berufsbezeichnungen hinaus 探索职业职称之外的性别偏见 2507.02679v1 -
27 07-03 Code2Logic: Game-Code-Driven Data Synthesis for Enhancing VLMs General Reasoning Code2Logic: Game-Code-getriebene Datensynthese zur Verbesserung von VLMs Allgemeine Begründung Code2Logic: 用于增强VLMs一般理由的游戏-代码-驱动数据合成 2505.13886v2 -
28 07-03 ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning ASDA: Audio-Spektrogramm Differential-Achtungsmechanismus für selbstüberwachtes Repräsentationslernen ASDA:自我监督代表制学习的听觉分光差异关注机制 2507.02666v1 -
29 07-03 OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding OmniDraft: Ein vokabularübergreifender, online-adaptiver Drafter für spekulatives Dekodieren auf dem Gerät OmniDraft:面向端侧推测解码的跨词表在线自适应草稿模型 2507.02659v1 -
30 07-03 Decoupled Planning and Execution: A Hierarchical Reasoning Framework for Deep Search Entkoppelte Planung und Ausführung: Ein Hierarchisches Reasoning-Framework für tiefe Suche 解耦的规划和执行:深度搜索的分层推理框架 2507.02652v1 -
31 07-03 Strategic Intelligence in Large Language Models: Evidence from evolutionary Game Theory Strategische Intelligenz in großen Sprachmodellen: Beweise aus der evolutionären Spieltheorie 大语言模型战略情报:进化游戏理论的证据 2507.02618v1 -
32 07-03 Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure Erklärbare Compliance-Erkennung mit Multi-Hop-Natural Language-Schlussfolgerung zur Assurance-Fallstruktur 以多种自然语言对保证案例结构的多重语言推断进行可解释的合规检测 2506.08713v2 -
33 07-03 Direct Preference Optimization Using Sparse Feature-Level Constraints Direkte Preference-Optimierung mit Sparse-Feature-Level-Beschränkungen 使用粗简地物限制的直接优惠优化 2411.07618v2 -
34 07-03 Symbolic or Numerical? Understanding Physics Problem Solving in Reasoning LLMs Symbolisch oder numerisch? Das Lösen von Physikproblemen in Reasoning-LLMs verstehen 符号还是数值?理解推理LLM中的物理问题求解 2507.01334v2 -
35 07-03 MPF: Aligning and Debiasing Language Models post Deployment via Multi Perspective Fusion MPF: Sprachmodelle nach der Bereitstellung über Multi Perspective Fusion ausrichten und abgrenzen MPF:通过多视角融合进行部署后调整和取消对语言模式的偏见 2507.02595v1 -
36 07-03 MedAide: Information Fusion and Anatomy of Medical Intents via LLM-based Agent Collaboration MedAide: Informationsfusion und Anatomie von medizinischen Intents über LLM-basierte Agent Collaboration MedAide:通过基于LLM的智能体协作进行医疗意图的信息融合与剖析 2410.12532v3 -
37 07-03 Revisiting Active Learning under (Human) Label Variation Aktives Lernen unter (menschlichen) Label-Varianten 在(人)标签标签变换下重新审查积极学习 2507.02593v1 -
38 07-03 WebSailor: Navigating Super-human Reasoning for Web Agent WebSailor: Navigation übermenschlichen Reasonings für Web-Agenten WebSailor: 为 Web 智能体导航超人推理 2507.02592v1 -
39 07-03 AI Flow: Perspectives, Scenarios, and Approaches AI Flow: Perspektiven, Szenarien und Ansätze AI 流动:观点、设想和方法 2506.12479v2 -
40 07-03 Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs Sprachenübergreifendes Reisen: Benchmarking Cross-Lingual Consistency in multimodalen LLMs 跨语言旅行:多模式LLM中跨语言一致基准 2505.15075v2 -
41 07-03 Self-Guided Process Reward Optimization with Redefined Step-wise Advantage for Process Reinforcement Learning Selbstgesteuerte Prozess-Reward-Optimierung mit neu definiertem Schrittweiser Vorteil für Prozess-Verstärkungs-Lernen 自指导流程向上优化,具有重新定义的逐步改进的流程强化学习优势 2507.01551v2 -
42 07-03 IndianBailJudgments-1200: A Multi-Attribute Dataset for Legal NLP on Indian Bail Orders IndianBailJudgments-1200: Ein Multi-Attribut-Datensatz für juristisches NLP zu indischen Kautionsbeschlüssen IndianBailJudgments-1200:面向印度保释令法律NLP的多属性数据集 2507.02506v1 -
43 07-03 Robustness of Misinformation Classification Systems to Adversarial Examples Through BeamAttack Robustheit von Fehlinformations-Klassifikationssystemen zu Adversarial-Beispielen durch BeamAttack 通过“BeamAttack”进行错误信息分类系统对反向实例的强力 2506.23661v2 -
44 07-03 Task Prompt Vectors: Effective Initialization through Multi-Task Soft-Prompt Transfer Task Prompt Vektoren: Effektive Initialisierung durch Multi-Task Soft-Prompt Transfer 任务提示矢量 : 通过多任务软性即时传输实现有效的初始化 2408.01119v3 -
45 07-03 Crafting Hanzi as Narrative Bridges: An AI Co-Creation Workshop for Elderly Migrants Hanzi als Narrative Bridges herstellen: Ein KI-Co-Creation-Workshop für ältere Migranten 将汉字打造为叙事桥梁:面向老年移民的AI共创工作坊 2507.01548v2 -
46 07-03 A Cookbook for Community-driven Data Collection of Impaired Speech in LowResource Languages Ein Kochbuch für die gemeinschaftsorientierte Datenerfassung beeinträchtigter Sprache in ressourcenarmen Sprachen 社区驱动的低资源语言受损言语数据收集手册 2507.02428v1 -
47 07-03 Delving into LLM-assisted writing in biomedical publications through excess vocabulary Eintauchen in LLM-unterstütztes Schreiben in biomedizinischen Publikationen durch überschüssiges Vokabular 通过超量词汇,在生物医学出版物中进行LLM协助撰写 2406.07016v5 -
48 07-03 Benchmarking Akan ASR Models Across Domain-Specific Datasets: A Comparative Evaluation of Performance, Scalability, and Adaptability Benchmarking Akan ASR-Modelle über Domain-spezifische Datensätze: Eine vergleichende Bewertung von Leistung, Skalierbarkeit und Anpassungsfähigkeit 确定Akan ASR模型基准的全域具体数据集:业绩比较评价、可缩放性和可调适性 2507.02407v1 -
49 07-03 AIn’t Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation AIn’t Nothing But a Survey? Einsatz großer Sprachmodelle zur Codierung deutscher offener Umfrageantworten zur Umfragemotivation AIn’t Nothing But a Survey?使用大语言模型对德国关于调查动机的开放式调查答复进行编码 2506.14634v3 -
50 07-03 JoyTTS: LLM-based Spoken Chatbot With Voice Cloning JoyTTS: LLM-basierter gesprochener Chatbot mit Voice Cloning 以LLM为基地的 “ 配有语音克隆的口音聊天机器人 “ 2507.02380v1 -
51 07-03 Efficient Code LLM Training via Distribution-Consistent and Diversity-Aware Data Selection Effizientes Code-LLM-Training durch verteilungskonsistente und diversitätsbewusste Datenauswahl 通过分布一致且多样性感知的数据选择进行高效的代码LLM训练 2507.02378v1 -
52 07-03 QFFN-BERT: An Empirical Study of Depth, Performance, and Data Efficiency in Hybrid Quantum-Classical Transformers QFFN-BERT: Eine empirische Studie über Tiefe, Leistung und Dateneffizienz in hybriden Quantum-Klassischen Transformern QFFN-BERT:对混合量子-经典Transformer的深度、性能和数据效率的实证研究 2507.02364v1 -
53 07-03 Improving the Robustness of Distantly-Supervised Named Entity Recognition via Uncertainty-Aware Teacher Learning and Student-Student Collaborative Learning Verbesserung der Robustheit der distanzüberwachten Named-Entity-Erkennung durch unsicherheitsbewusstes Lehrerlernen und kollaboratives Lernen zwischen Studenten 通过不确定性感知的教师学习和学生-学生协作学习提高远程监督命名实体识别的鲁棒性 2311.08010v3 -
54 07-03 Coling-UniA at SciVQA 2025: Few-Shot Example Retrieval and Confidence-Informed Ensembling for Multimodal Large Language Models Coling-UniA bei SciVQA 2025: Few-Shot-Beispiel-Retrieval und konfidenzbasiertes Ensembling für multimodale große Sprachmodelle Coling-UniA在SciVQA 2025:面向多模态大语言模型的少样本示例检索与置信度引导的集成 2507.02357v1 -
55 07-03 Incorporating LLMs for Large-Scale Urban Complex Mobility Simulation Einschließlich LLMs für großräumige Urban Complex Mobility Simulation 大型城市综合流动模拟项目LLMs 2505.21880v2 -
56 07-03 Decision-Oriented Text Evaluation Entscheidungsorientierte Textbewertung 注重决定的案文评价 2507.01923v2 -
57 07-03 Token Prepending: A Training-Free Approach for Eliciting Better Sentence Embeddings from LLMs Token Prepending: Ein trainingsfreier Ansatz zur Gewinnung besserer Satz-Embeddings aus LLMs Token Prepending:一种从LLM中获得更好句子嵌入的免训练方法 2412.11556v2 -
58 07-03 Layered Insights: Generalizable Analysis of Authorial Style by Leveraging All Transformer Layers Layered Insights: Generalisierbare Analyse des Autorial Styles durch Hebelisierung aller Transformer Layers 图层透视: 通过利用所有变换层对文件样式的通用分析 2503.00958v2 -
59 07-03 Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy Skywork-Reward-V2:通过人类-AI协同增强优先数据曲线 2507.01352v2 -
60 07-03 Aligning Frozen LLMs by Reinforcement Learning: An Iterative Reweight-then-Optimize Approach Ausrichten von gefrorenen LLMs durch Verstärkungslernen: Ein iterativer Reweight-then-Optimize-Ansatz 通过强化学习对齐冻结的LLM:一种迭代式先重加权后优化的方法 2506.17828v2 -
61 07-03 Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding Fast-dLLM: Trainingsfreie Beschleunigung von Diffusion LLM durch Ermöglichen von KV Cache und Paralleldecoding Fast-dLLM:通过启用 KV 缓存和并行解码实现扩散LLM的免训练加速 2505.22618v3 -
62 07-03 Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient Bypass Back-Propagation: Optimierungsbasiertes Structural Pruning für große Sprachmodelle über Policy Gradient 绕过反向传播:通过策略梯度对大语言模型进行基于优化的结构化剪枝 2406.10576v3 -
63 07-03 REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models REINFORCE++: Effizienter RLHF-Algorithmus mit Robustheit sowohl für Prompt- als auch für Reward-Modelle REINFORCE++: 对提示模型和奖励模型均具鲁棒性的高效RLHF算法 2501.03262v4 -
64 07-03 DoMIX: An Efficient Framework for Exploiting Domain Knowledge in Fine-Tuning DoMIX: Ein effizientes Framework zur Nutzung von Domain-Wissen im Feintuning DoMIX:一个在微调中利用域知识的有效框架 2507.02302v1 -
65 07-03 Commander-GPT: Fully Unleashing the Sarcasm Detection Capability of Multi-Modal Large Language Models Commander-GPT: Die Sarkasmus-Erkennungsfähigkeit multimodaler großer Sprachmodelle vollständig entfesseln Commander-GPT:充分释放多模态大语言模型的讽刺检测能力 2503.18681v3 -
66 07-03 Prompt-Guided Turn-Taking Prediction Prompt-geführte Turn-Taking-Vorhersage 即时指导的回转预测 2506.21191v2 -
67 07-03 Optimal strategies to perform multilingual analysis of social content for a novel dataset in the tourism domain Optimale Strategien zur mehrsprachigen Analyse sozialer Inhalte für einen neuartigen Datensatz im Tourismusbereich 为旅游领域新数据集的社会内容进行多语种社会内容分析的最佳最佳战略 2311.14727v2 -
68 07-03 Seeing Through Green: Text-Based Classification and the Firm’s Returns from Green Patents Durch Grün sehen: Textbasierte Klassifizierung und die Erträge des Unternehmens aus grünen Patenten 透视绿色:基于文本的分类与企业绿色专利的回报 2507.02287v1 -
69 07-03 Causal Representation Learning with Generative Artificial Intelligence: Application to Texts as Treatments Kausales Repräsentationslernen mit generativer Künstlicher Intelligenz: Anwendung auf Texte als Behandlungen 产生人工智能的因果代表性学习:应用文字作为治疗 2410.00903v3 -
70 07-03 SMARTe: Slot-based Method for Accountable Relational Triple extraction SMARTe: Slot-basierte Methode für die relationale Triple-Extraktion SMARTe: 基于槽位的可问责关系三元组抽取方法 2504.12816v3 -
71 07-03 MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent MemAgent: Umgestalten von Langkontext-LLM mit Multi-Conv RL-basierten Speicheragenten MemAgent: 利用基于多Conv RL的记忆智能体重塑长上下文LLM 2507.02259v1 -
72 07-03 Circuit-tuning: A Mechanistic Approach for Identifying Parameter Redundancy and Fine-tuning Neural Networks Schaltungs-Tuning: Mechanistischer Ansatz zur Identifizierung von Parameter Redundanz und Feinsteuerung neuraler Netzwerke 电路调控:确定参数冗余和精微调整神经网络的机械化方法 2502.06106v2 -
73 07-03 Mixture of Reasonings: Teach Large Language Models to Reason with Adaptive Strategies Mixture of Reasonings: Große Sprachmodelle mit adaptiven Strategien zur Vernunft bringen 理由混合:与适应战略一道教授大语言模式 2507.00606v2 -
74 07-03 GDC Cohort Copilot: An AI Copilot for Curating Cohorts from the Genomic Data Commons GDC Cohort Copilot: Ein KI-Copilot für die Kuratierung von Kohorten aus den Genomic Data Commons GDC Cohort Copilot:用于从Genomic Data Commons筛选队列的AI副驾驶 2507.02221v1 -
75 07-03 SciGA: A Comprehensive Dataset for Designing Graphical Abstracts in Academic Papers SciGA: Ein umfassender Datensatz zur Gestaltung grafischer Abstracts in wissenschaftlichen Papieren SciGA: 用于设计学术论文制图摘要的综合数据集 2507.02212v1 -
76 07-02 (3) SHuBERT: Self-Supervised Sign Language Representation Learning via Multi-Stream Cluster Prediction SHuBERT: Selbstüberwachte Sign Language Representation Lernen über Multi-Stream Cluster Prediction SHuBERT:通过多流聚类预测进行自监督手语表示学习 2411.16765v3 -
77 07-02 ESTR-CoT: Towards Explainable and Accurate Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning ESTR-CoT: Auf dem Weg zu einer erklärbaren und präzisen Ereignisstrom-basierten Szenetexterkennung mit Chain-of-Thought-Reasoning ESTR-CoT: 争取实现可解释和准确事件流的基于现场的文本识别,并附有研究链理由 2507.02200v1 -
78 07-02 Latent Chain-of-Thought? Decoding the Depth-Recurrent Transformer Latent Chain-of-Thought? Dekodierung des Tiefen-Recurrent Transformers 潜在思维链?解码深度循环Transformer 2507.02199v1 -
79 07-02 Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis Analyse und Verbesserung der Speaker-Ähnlichkeitsbewertung für Sprachsynthese 分析和改进语音合成的说话人相似度评估 2507.02176v1 -
80 07-02 Beyond Scale: The Diversity Coefficient as a Data Quality Metric for Variability in Natural Language Data Beyond Scale: Der Diversity-Koeffizient als Data Quality Metric für Variabilität in natürlichen Sprachdaten 超越尺度:多样性系数作为衡量自然语言数据可变性的数据质量计量标准 2306.13840v4 -
81 07-02 Rethinking LLM Training through Information Geometry and Quantum Metrics Rethinking LLM Training durch Informationsgeometrie und Quantenmetrics 通过信息几何和量度测量重新思考LLM培训 2506.15830v3 -
82 07-02 Quantifying the Importance of Data Alignment in Downstream Model Performance Quantifizierung der Bedeutung der Datenausrichtung in Downstream-Modellleistung 量化数据协调在下游模式绩效中的重要性 2501.08496v3 -
83 07-02 Reasoning or Not? A Comprehensive Evaluation of Reasoning LLMs for Dialogue Summarization Reasoning oder nicht? Eine umfassende Bewertung von Reasoning-LLMs für die Dialogzusammenfassung 推理与否?面向对话摘要的推理LLM全面评估 2507.02145v1 -
84 07-02 Dissecting the Impact of Mobile DVFS Governors on LLM Inference Performance and Energy Efficiency Die Auswirkungen mobiler DVFS-Governors auf LLM-Inferenzleistung und Energieeffizienz analysieren 剖析移动端DVFS调节器对LLM推理性能和能效的影响 2507.02135v1 -
85 07-02 De-mark: Watermark Removal in Large Language Models De-mark: Wasserzeichenentfernung in großen Sprachmodellen De-mark:大语言模型中的水印去除 2410.13808v2 -
86 07-02 Energy-Based Transformers are Scalable Learners and Thinkers Energiebasierte Transformer sind skalierbare Lernende und Denker 以能源为基础的变换器是可缩放的学习者和思想家 2507.02092v1 -
87 07-02 McBE: A Multi-task Chinese Bias Evaluation Benchmark for Large Language Models McBE: Ein Multi-Task Chinese Bias Evaluation Benchmark für große Sprachmodelle McBE: 面向大语言模型的多任务中文偏见评估基准 2507.02088v1 -
88 07-02 Evaluating the Promise and Pitfalls of LLMs in Hiring Decisions Bewertung der Chancen und Fallstricke von LLMs bei Einstellungsentscheidungen 评估LLM在招聘决策中的前景与陷阱 2507.02087v1 -
89 07-02 Sequential Diagnosis with Language Models Sequentielle Diagnose mit Sprachmodellen 基于语言模型的序贯诊断 2506.22405v2 -
90 07-02 Test-Time Scaling with Reflective Generative Model Test-Zeit-Skalierung mit reflektierendem Generativem Modell 基于反思式生成模型的测试时扩展 2507.01951v1 -
91 07-02 The Thin Line Between Comprehension and Persuasion in LLMs Die dünne Linie zwischen Verständnis und Überzeugung in LLMs LLM中理解与说服之间的细线 2507.01936v1 -
92 07-02 Adaptability of ASR Models on Low-Resource Language: A Comparative Study of Whisper and Wav2Vec-BERT on Bangla Anpassungsfähigkeit von ASR-Modellen auf Low-Resource-Sprache: Eine vergleichende Studie von Whisper und Wav2Vec-BERT auf Bangla 低资源语言ASR模型的可调适性:孟加拉语上Whisper和Wav2Vec-BERT的比较研究 2507.01931v1 -
93 07-02 NaturalThoughts: Selecting and Distilling Reasoning Traces for General Reasoning Tasks NaturalThoughts: Auswählen und Destillieren von Reasoning-Traces für allgemeine Reasoning-Aufgaben NaturalThoughts:为通用推理任务选择和蒸馏推理轨迹 2507.01921v1 -
94 07-02 Gradient-Adaptive Policy Optimization: Towards Multi-Objective Alignment of Large Language Models Gradient-Adaptive Policy Optimization: Auf dem Weg zu einer multi-objektiven Ausrichtung großer Sprachmodelle 渐进式政策优化:实现大语言模式多目标一致 2507.01915v1 -
95 07-02 AI4Research: A Survey of Artificial Intelligence for Scientific Research AI4Research: Eine Untersuchung der Künstlichen Intelligenz für die wissenschaftliche Forschung AI4Research:面向科学研究的人工智能综述 2507.01903v1 -
96 07-02 High-Layer Attention Pruning with Rescaling High-Layer Attention Pruning mit Rescaling 带重缩放的高层注意力剪枝 2507.01900v1 -
97 07-02 Recursive Training Loops in LLMs: How training data properties modulate distribution shift in generated data? Rekursive Trainingsschleifen in LLMs: Wie modulieren Trainingsdateneigenschaften die Verteilungsverschiebung in generierten Daten? LLM中的递归训练循环:训练数据特性如何调节生成数据的分布偏移? 2504.03814v3 -
98 07-02 MiCoTA: Bridging the Learnability Gap with Intermediate CoT and Teacher Assistants MiCoTA: Die Lernfähigkeitslücke mit Intermediate CoT und Lehrerassistenten überbrücken MiCoTA: 利用中间CoT和教师助理弥合可学习性差距 2507.01887v1 -
99 07-02 Towards Universal Semantics With Large Language Models Hin zu universeller Semantik mit großen Sprachmodellen 走向具有大语言模式的普遍语义 2505.11764v2 -
100 07-02 LinguaSynth: Heterogeneous Linguistic Signals for News Classification LinguaSynth: Heterogene linguistische Signale für Nachrichtenklassifikation LinguaSynth:用于新闻分类的异构语言信号 2506.21848v2 -
101 07-02 DIY-MKG: An LLM-Based Polyglot Language Learning System DIY-MKG: Ein LLM-basiertes Polyglotte-Sprachlernsystem DIY-MKG:一个基于LLM的多语言学习系统 2507.01872v1 -
102 07-02 Eka-Eval : A Comprehensive Evaluation Framework for Large Language Models in Indian Languages Eka-Eval : Ein umfassender Evaluierungsrahmen für große Sprachmodelle in indischen Sprachen Eka-Eval:印度语大语言模式综合评价框架 2507.01853v1 -
103 07-02 Low-Perplexity LLM-Generated Sequences and Where To Find Them Low-Perplexity LLM-generierte Sequenzen und wo sie zu finden sind 低困惑度 LLM 生成序列及其查找地点 2507.01844v1 -
104 07-02 Guaranteed Generation from Large Language Models Garantierte Generation aus großen Sprachmodellen 从大语言模式中担保产生 2410.06716v2 -
105 07-02 QAEncoder: Towards Aligned Representation Learning in Question Answering Systems QAEncoder: Auf dem Weg zu einem ausgerichteten Repräsentationslernen in Frage-Antwort-Systemen QAEncoder:在问答系统中实现对齐的表示学习 2409.20434v3 -
106 07-02 Evaluating Structured Output Robustness of Small Language Models for Open Attribute-Value Extraction from Clinical Notes Bewertung der Robustheit von kleinen Sprachmodellen für eine offene Attribut-Wert-Extraktion aus klinischen Anmerkungen 评价从临床说明中公开属性价值提取的小型语言模式的结构化产出强强度 2507.01810v1 -
107 07-02 LoRA Fine-Tuning Without GPUs: A CPU-Efficient Meta-Generation Framework for LLMs LoRA Feintuning ohne GPUs: Ein CPU-effizientes Meta-Generation-Framework für LLMs 无需GPU的LoRA微调:面向LLM的CPU高效元生成框架 2507.01806v1 -
108 07-02 The Anatomy of Evidence: An Investigation Into Explainable ICD Coding Die Anatomie der Beweise: Eine Untersuchung zur erklärbaren ICD-Kodierung 证据解剖学:调查可解释的 ICD 编码 2507.01802v1 -
109 07-02 How Do Vision-Language Models Process Conflicting Information Across Modalities? Wie verarbeiten Vision-Language-Modelle widersprüchliche Informationen über Modalitäten hinweg? 愿景-语言模型如何以不同方式处理信息冲突问题? 2507.01790v1 -
110 07-02 Probing Evaluation Awareness of Language Models Untersuchung des Evaluationsbewusstseins von Sprachmodellen 探测语言模型的评估意识 2507.01786v1 -
111 07-02 MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining MuRating: Ein qualitativ hochwertiger Datenauswahlansatz zum mehrsprachigen Vortraining großer Sprachmodelle MuRating:多语言大语言模型预训练的高质量数据选择方法 2507.01785v1 -
112 07-02 Data interference: emojis, homoglyphs, and issues of data fidelity in corpora and their results Dateninterferenzen: Emojis, Homoglyphen und Fragen der Datentreue in Korpora und deren Ergebnisse 数据干扰:表情符号、同形字符以及语料库中的数据保真度问题及其结果 2507.01764v1 -
113 07-02 Tuning without Peeking: Provable Privacy and Generalization Bounds for LLM Post-Training Tuning ohne Peeking: Provable Privacy and Generalization Bounds for LLM Post-Training 不偷看的调优:LLM后训练的可证明隐私与泛化界 2507.01752v1 -
114 07-02 ECCV 2024 W-CODA: 1st Workshop on Multimodal Perception and Comprehension of Corner Cases in Autonomous Driving ECCV 2024 W-CODA: 1. Workshop zur multimodalen Wahrnehmung und Verständlichkeit von Eckfällen im autonomen Fahren ECCV 2024 W-CODA:第一次关于自主驾驶时对拐角案例的多模式认识和了解的讲习班 2507.01735v1 -
115 07-02 LLMs for Legal Subsumption in German Employment Contracts LLMs für rechtliche Subsumtion in deutschen Arbeitsverträgen 用于德国雇佣合同法律涵摄的LLM 2507.01734v1 -
116 07-02 Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models Unified Triplet-Level Halluzination Evaluation für große Vision-Sprache Modelle 大型视觉语言模型统一三维级幻觉评价 2410.23114v3 -
117 07-02 Stereotype Detection as a Catalyst for Enhanced Bias Detection: A Multi-Task Learning Approach Stereotyp-Erkennung als Katalysator für verbesserte Bias-Erkennung: Ein Multi-Task-Lernansatz 刻板印象检测作为增强偏见检测的催化剂:一种多任务学习方法 2507.01715v1 -
118 07-02 AdamMeme: Adaptively Probe the Reasoning Capacity of Multimodal Large Language Models on Harmfulness AdamMeme: Adaptiv die Reasoning-Fähigkeit multimodaler großer Sprachmodelle in Bezug auf Schädlichkeit untersuchen AdamMeme:自适应地探测多模态大语言模型在有害性方面的推理能力 2507.01702v1 -
119 07-02 Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling Mischen von Supervised und Verstärkung Feintuning mit Präfix-Sampling 通过前缀采样融合监督微调与强化微调 2507.01679v1 -
120 07-02 On the Fundamental Impossibility of Hallucination Control in Large Language Models Über die grundsätzliche Unmöglichkeit der Halluzinationskontrolle in großen Sprachmodellen 关于大语言模型中幻觉控制的根本不可能性 2506.06382v2 -
121 07-02 Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions Achtung vor der Umwelt: Multimodale Agenten sind anfällig für Ablenkungen aus der Umgebung 注意环境:多模态智能体易受环境干扰 2408.02544v2 -
122 07-02 Adapting Language Models to Indonesian Local Languages: An Empirical Study of Language Transferability on Zero-Shot Settings Anpassung von Sprachmodellen an indonesische lokale Sprachen: Eine empirische Studie zur Sprachübertragbarkeit in Zero-Shot-Settings 调整语言模型以适应印度尼西亚当地语言:零样本设置下语言可迁移性的实证研究 2507.01645v1 -
123 07-02 Confidence and Stability of Global and Pairwise Scores in NLP Evaluation Vertrauen und Stabilität von Global und Pairwise Scores in NLP-Evaluation NLP评估中全局分数和成对分数的置信度和稳定性 2507.01633v1 -
124 07-02 Chart Question Answering from Real-World Analytical Narratives Diagramm Frage-Antworten von Real-World Analytical Narratives 从真实世界分析叙述中回答的图表问题 2507.01627v1 -
125 07-02 Developing ChemDFM as a large language foundation model for chemistry ChemDFM als großes Sprach-Grundmodell für die Chemie entwickeln 开发ChemDFM作为化学领域的大型语言基础模型 2401.14818v6 -
126 07-02 Data Agent: A Holistic Architecture for Orchestrating Data+AI Ecosystems Data Agent: Eine ganzheitliche Architektur für die Orchestrierung von Daten+AI-Ökosystemen Data Agent:用于编排Data+AI生态系统的整体架构 2507.01599v1 -
127 07-02 T3DM: Test-Time Training-Guided Distribution Shift Modelling for Temporal Knowledge Graph Reasoning T3DM: Test-Time Training-Guided Distribution Shift Modellierung für zeitliche Wissensdiagramm-Reasoning T3DM: 试验时间培训指导分布分布变化模型,用于时间知识图表推理 2507.01597v1 -
128 07-02 Emotionally Intelligent Task-oriented Dialogue Systems: Architecture, Representation, and Optimisation Emotional intelligente, aufgabenorientierte Dialogsysteme: Architektur, Repräsentation und Optimierung 具有情感智能的任务导向对话系统:架构、表示与优化 2507.01594v1 -
129 07-02 Is External Information Useful for Stance Detection with LLMs? Ist externe Informationen nützlich für Stance Detection mit LLMs? 外部信息是否对利用LLMS探测 Stance有用? 2507.01543v1 -
130 07-02 Efficient Out-of-Scope Detection in Dialogue Systems via Uncertainty-Driven LLM Routing Effiziente Out-of-Scope-Erkennung in Dialogsystemen durch unsicherheitsgesteuertes LLM-Routing 通过不确定性驱动的LLM路由在对话系统中高效检测超出范围的请求 2507.01541v1 -
131 07-02 Following the Clues: Experiments on Person Re-ID using Cross-Modal Intelligence Den Hinweisen folgen: Experimente zur Person Re-ID mit Cross-Modal Intelligence 追随线索:利用跨模态智能进行行人重识别的实验 2507.01504v1 -
132 07-02 Evaluating the Effectiveness of Direct Preference Optimization for Personalizing German Automatic Text Simplifications for Persons with Intellectual Disabilities Bewertung der Wirksamkeit der Direktpräferenzoptimierung zur Personalisierung von deutschen automatischen Textvereinfachungen für Personen mit intellektuellen Behinderungen 评估直接优惠优化使德国残疾人自动文本简化措施个人化的效果 2507.01479v1 -
133 07-02 Unifying Global and Near-Context Biasing in a Single Trie Pass Vereinheitlichung von globalem und kontextnahem Biasing in einem einzigen Trie-Durchlauf 在单次Trie遍历中统一全局和近上下文偏置 2409.13514v2 -
134 07-02 BIS Reasoning 1.0: The First Large-Scale Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning BIS Reasoning 1.0: Der erste groß angelegte japanische Benchmark für glaubensinkonsistentes syllogistisches Reasoning BIS Reasoning 1.0:首个面向信念不一致三段论推理的大规模日语基准 2506.06955v3 -
135 07-02 LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation LogitSpec: Beschleunigung der retrieval-basierten spekulativen Dekodierung über die Spekulation des übernächsten Tokens LogitSpec: 通过下下个词元推测加速基于检索的推测解码 2507.01449v1 -
136 07-02 DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues DICE-BENCH: Bewertung der Tool-Use-Fähigkeiten von großen Sprachmodellen in Multi-Round-, Multi-Party-Dialogen DICE-BENCH:评估大语言模型在多轮、多方对话中的工具使用能力 2506.22853v2 -
137 07-02 Clinical NLP with Attention-Based Deep Learning for Multi-Disease Prediction Klinische NLP mit aufmerksamkeitsbasiertem Deep Learning für Multi-Disease-Vorhersage 基于注意力深度学习的多疾病预测临床NLP 2507.01437v1 -
138 07-02 VLM2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues VLM2-Bench: Ein genauerer Blick darauf, wie gut VLMs explizit übereinstimmende visuelle Hinweise implizit verknüpfen VLM2-Bench:更仔细地审视VLMs隐式关联显式匹配视觉线索的能力 2502.12084v4 -
139 07-02 Pensieve Grader: An AI-Powered, Ready-to-Use Platform for Effortless Handwritten STEM Grading Pensieve Grader: Eine KI-Powered, Ready-to-Use Plattform für mühelose handschriftliche STEM-Grading Pensieve Grader: 一个AI驱动、开箱即用的轻松手写STEM评分平台 2507.01431v1 -
140 07-02 Don’t Say No: Jailbreaking LLM by Suppressing Refusal Sagen Sie nicht Nein: Jailbreaking LLM durch Unterdrückung der Weigerung 不要说不:通过抑制拒绝来越狱LLM 2404.16369v3 -
141 07-02 Transferable Modeling Strategies for Low-Resource LLM Tasks: A Prompt and Alignment-Based Approach Übertragbare Modellierungsstrategien für LLM-Aufgaben mit geringem Ressourcenbedarf: Ein Prompt- und Alignment-basierter Ansatz 低资源LLM任务的可迁移建模策略:一种基于提示与对齐的方法 2507.00601v2 -
142 07-02 Text to Band Gap: Pre-trained Language Models as Encoders for Semiconductor Band Gap Prediction Text zu Band Gap: Vortrainierte Sprachmodelle als Encoder für Semiconductor Band Gap Prediction 从文本到带隙:将预训练语言模型用作半导体带隙预测的编码器 2501.03456v2 -
143 07-02 Delving into Multilingual Ethical Bias: The MSQAD with Statistical Hypothesis Tests for Large Language Models Mehrsprachige Ethische Bias: Der MSQAD mit statistischen Hypothesentests für große Sprachmodelle 深入探究多语言伦理偏见:面向大语言模型的带统计假设检验的MSQAD 2505.19121v2 -
144 07-02 Multi-interaction TTS toward professional recording reproduction Multi-Interaktion TTS für professionelle Aufnahmewiedergabe 关于专业记录复制的多互动TTS 2507.00808v2 -
145 07-02 olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models olmOCR: Entsperren von Billionen von Token in PDFs mit Vision Language Models olmOCR:用愿景语言模型在PDF中解锁数万亿托肯 2502.18443v3 -
146 07-02 Direct Quantized Training of Language Models with Stochastic Rounding Direkte Quantisierte Schulung von Sprachmodellen mit stochastischer Rundung 直接量化的语言模式直接量化培训,并进行盘点四舍四入 2412.04787v2 -
147 07-02 MassTool: A Multi-Task Search-Based Tool Retrieval Framework for Large Language Models MassTool: Ein Multi-Task Search-Based Tool Retrieval Framework für große Sprachmodelle MassTool:一个用于大语言模型的多任务搜索工具检索框架 2507.00487v2 -
148 07-02 Pre-training Large Memory Language Models with Internal and External Knowledge Vorschulung großer Speicher Sprachmodelle mit internem und externem Wissen 具有内部和外部知识的大型记忆语言模型 2505.15962v2 -
149 07-02 KatFishNet: Detecting LLM-Generated Korean Text through Linguistic Feature Analysis KatFishNet: LLM-generierter koreanischer Text durch Linguistik-Feature-Analyse erkennen KatFishNet:通过语言特征分析检测LLM-发光的韩文文本 2503.00032v4 -
150 07-02 LEDOM: An Open and Fundamental Reverse Language Model LEDOM: Ein offenes und grundlegendes Reverse Language Modell LEDOM: 开放和基本反向语言模式 2507.01335v1 -
151 07-02 La RoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation La RoSA: Steigerung der LLM-Effizienz durch schichtweise rotierte Sparse-Aktivierung La RoSA:通过图层旋转的分散启动提高LLM效率 2507.01299v1 -
152 07-02 Frustratingly Simple Retrieval Improves Challenging, Reasoning-Intensive Benchmarks Frustrierend Einfaches Retrieval verbessert anspruchsvolle, vernünftig-intensive Benchmarks 令人沮丧的简单检索改进挑战、理由说明和密集基准 2507.01297v1 -
153 07-02 Rethinking All Evidence: Enhancing Trustworthy Retrieval-Augmented Generation via Conflict-Driven Summarization Alle Beweise neu denken: Vertrauenswürdige retrieval-angereicherte Generation durch konfliktgetriebene Zusammenfassung verbessern 重新思考所有证据:通过冲突驱动的总结,加强可信赖的回溯可信赖的一代人 2507.01281v1 -
154 07-02 Evaluating Large Language Models for Multimodal Simulated Ophthalmic Decision-Making in Diabetic Retinopathy and Glaucoma Screening Bewertung großer Sprachmodelle für multimodale simulierte ophthalmische Entscheidungsfindung in diabetischer Retinopathie und Glaukom-Screening 评估糖尿病病理病理和青光眼筛查中多式模拟眼部模拟决策的大型语言模型 2507.01278v1 -
155 07-02 $μ^2$Tokenizer: Differentiable Multi-Scale Multi-Modal Tokenizer for Radiology Report Generation $μ^2$Tokenizer: Differentiable Multi-Scale Multi-Modal Tokenizer für Radiologie Report Generation $μ^2$Tokenizer:用于放射学报告生成的可微分多尺度多模态分词器 2507.00316v2 -
156 07-02 Combating Confirmation Bias: A Unified Pseudo-Labeling Framework for Entity Alignment Bekämpfung von Konfirmations-Bias: Ein einheitliches Pseudo-Labeling-Rahmenwerk für die Ausrichtung von Unternehmen 打击确认的偏见:统一实体统一化框架 2307.02075v4 -
157 07-02 GAIus: Combining Genai with Legal Clauses Retrieval for Knowledge-based Assistant GAIus: Genai mit rechtlichen Klauseln verbinden Rückzug für wissensbasierte Assistentin GAIus:将热奈与法律条款相结合,为知识型助理提供法律条款检索服务 2507.01259v1 -
158 07-02 Towards Safety Evaluations of Theory of Mind in Large Language Models Zu Sicherheitsbewertungen der Geistestheorie in großen Sprachmodellen 争取对大语言模式中思想理论进行安全评价 2506.17352v2 -
159 07-01 (2) The Medium Is Not the Message: Deconfounding Text Embeddings via Linear Concept Erasure Das Medium ist nicht die Botschaft: Deconfounding Text-Embeddings via Linear Concept Erasure 介质不是信息:通过线性概念时代的沉降文本嵌入 2507.01234v1 -
160 07-01 MEGA: xLSTM with Multihead Exponential Gated Fusion for Precise Aspect-based Sentiment Analysis MEGA: xLSTM mit Multihead Exponential Gated Fusion für präzise aspektbasierte Sentimentanalyse MEGA:xLSTM, 带有多头辐射光度G化聚合, 用于基于频谱的感应分析 2507.01213v1 -
161 07-01 A Survey on Uncertainty Quantification of Large Language Models: Taxonomy, Open Research Challenges, and Future Directions Eine Umfrage zur Unsicherheit Quantifizierung großer Sprachmodelle: Taxonomie, offene Forschungsherausforderungen und zukünftige Richtungen 关于大语言模型不确定性量化调查:分类学、开放研究挑战和未来方向 2412.05563v2 -
162 07-01 Reasoning about Uncertainty: Do Reasoning Models Know When They Don’t Know? Vernunft über Ungewissheit: Wissen Vernunftmodelle, wenn sie es nicht wissen? 关于不确定性的原因:理性模型知道他们不知道什么时候知道吗? 2506.18183v2 -
163 07-01 STELLA: Self-Evolving LLM Agent for Biomedical Research STELLA: Selbstständiger LLM-Agent für biomedizinische Forschung STELLA: 生物医学研究代理公司 2507.02004v1 -
164 07-01 Matching and Linking Entries in Historical Swedish Encyclopedias Passende und verbindende Einträge in historischen schwedischen Enzyklopädien 瑞典历史百科全书中的匹配和链接条目 2507.01170v1 -
165 07-01 Event-based evaluation of abstractive news summarization Eventbasierte Auswertung der abstrakten News-Zusammenfassung 以活动为基础对抽象新闻摘要总结的评价 2507.01160v1 -
166 07-01 Squat: Quant Small Language Models on the Edge Squat: Quant kleine Sprachmodelle am Rand Squat: 边缘端的量化小语言模型 2402.10787v2 -
167 07-01 Self-reflective Uncertainties: Do LLMs Know Their Internal Answer Distribution? Selbstreflektierende Unsicherheiten: Kennen LLMs ihre interne Antwortverteilung? 自我反感的不确定性:LLMs知道他们的内部答案分布吗? 2505.20295v2 -
168 07-01 Divergent Creativity in Humans and Large Language Models Unterschiedliche Kreativität in Menschen und großen Sprachmodellen 人类和大语言模式的不同创造性 2405.13012v2 -
169 07-01 BioPars: A Pretrained Biomedical Large Language Model for Persian Biomedical Text Mining BioPars: Ein vorgebildetes biomedizinisches Großsprachmodell für persischen biomedizinischen Textbergbau BioPars:波斯生物医学材料开采的预先培训的生物医学大语言模型 2506.21567v2 -
170 07-01 SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks SciArena: Eine offene Bewertungsplattform für Stiftungsmodelle in wissenschaftlichen Literaturaufgaben SciArena:科学文献任务基础模型公开评价平台 2507.01001v1 -
171 07-01 Capturing Visualization Design Rationale Capturing Visualization Design Rationale 模拟可视化设计 2506.16571v2 -
172 07-01 Flow-Modulated Scoring for Semantic-Aware Knowledge Graph Completion Flow-Modulated Scoring für semantisch-bewusste Wissensgraphenvervollständigung 用于语义智能知识图补全的流动移动模型拼图 2506.23137v2 -
173 07-01 La Leaderboard: A Large Language Model Leaderboard for Spanish Varieties and Languages of Spain and Latin America La Leaderboard: Großes Sprachmodell für spanische Sorten und Sprachen Spaniens und Lateinamerikas 领头板:西班牙和拉丁美洲西班牙语品种和语言大语言示范板 2507.00999v1 -
174 07-01 Should We Still Pretrain Encoders with Masked Language Modeling? Sollten wir noch Encoder mit maskierten Sprachmodellen vortrainieren? 我们是否仍应该为带有隐蔽语言建模的编程者预作准备? 2507.00994v1 -
175 07-01 Discourse Heuristics For Paradoxically Moral Self-Correction Diskurs Heuristik für paradoxerweise sittliche Selbstkorrektion 反相矛盾道德自我自我修正的超常性理论 2507.00985v1 -
176 07-01 Enhancing LLM Agent Safety via Causal Influence Prompting Verbesserung der Sicherheit von LLM-Agenten durch ursächlichen Einfluss 通过原因影响促进增强LLM代理安全 2507.00979v1 -
177 07-01 Large Language Model Confidence Estimation via Black-Box Access Große Sprachmodell-Konfidenzschätzung über Black-Box-Zugriff 通过黑箱访问大语言模型信任度估计 2406.04370v4 -
178 07-01 MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research MLR-Bench: Bewertung von KI-Agenten auf Open-Ended Machine Learning Research MLR-Bench:评估AI公司在开放式机械学习研究方面的代理机构 2505.19955v2 -
179 07-01 Intertextual Parallel Detection in Biblical Hebrew: A Transformer-Based Benchmark Intertextuelle Parallelerkennung im biblischen Hebräisch: Ein transformerbasierter Benchmark 《圣经希伯来文:以变换者为基础的基准》 2506.24117v2 -
180 07-01 The Cognate Data Bottleneck in Language Phylogenetics Der Cognate Data Bottleneck in der Sprache Phylogenetik 语言哲学遗传学中的 Cognate 数据瓶颈 2507.00911v1 -
181 07-01 ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models NUR: One-Layer-Intervention Genügend mildert Halluzinationen in großen Vision-Sprachen-Modellen 仅:在大型视觉语言模型中,单声道干预足以减少幻觉 2507.00898v1 -
182 07-01 MemeCMD: An Automatically Generated Chinese Multi-turn Dialogue Dataset with Contextually Retrieved Memes MemeCMD: Ein automatisch generierter chinesischer Multiturn Dialogue Datensatz mit kontextuell abgerufenen Memes MemeCMD: 一个自动生成的中文多方向对话框数据集, 带有上下文检索的Memes 2507.00891v1 -
183 07-01 Scaling Laws Are Unreliable for Downstream Tasks: A Reality Check Skalierungsgesetze sind für Downstream-Aufgaben unzuverlässig: Ein Realitätscheck 增强法律对下流任务不可靠:一个现实检查 2507.00885v1 -
184 07-01 Mathematics Isn’t Culture-Free: Probing Cultural Gaps via Entity and Scenario Perturbations Mathematik ist nicht kulturfrei: Probing Cultural Gaps via Entity und Scenario Perturbations 数学不是没有文化的:通过实体和假想干扰来证明文化差距。 2507.00883v1 -
185 07-01 Benchmarking the Pedagogical Knowledge of Large Language Models Benchmarking der pädagogischen Kenntnisse großer Sprachmodelle 确定大语言模式教学知识基准 2506.18710v3 -
186 07-01 Verifiable Natural Language to Linear Temporal Logic Translation: A Benchmark Dataset and Evaluation Suite Überprüfbare natürliche Sprache zur linearen Zeitlogik Übersetzung: Ein Benchmark-Datensatz und Bewertungs-Suite 线性时时逻辑翻译的可核实自然语言:基准数据集和评价套件 2507.00877v1 -
187 07-01 TransLaw: Benchmarking Large Language Models in Multi-Agent Simulation of the Collaborative Translation TransLaw: Benchmarking von großen Sprachmodellen in der Multi-Agenten-Simulation der Kollaborativen Übersetzung TransLaw:在多方代理模拟协作翻译时确定大语言模式基准 2507.00875v1 -
188 07-01 Text Production and Comprehension by Human and Artificial Intelligence: Interdisciplinary Workshop Report Textproduktion und -verständnis durch menschliche und künstliche Intelligenz: Interdisziplinärer Workshop-Bericht 人文和人工情报的文字制作和理解:跨学科讲习班报告 2506.22698v2 -
189 07-01 Stylometry recognizes human and LLM-generated texts in short samples Stylometrie erkennt menschliche und LLM-generierte Texte in kurzen Proben 文体计量学在短样本中识别人类和LLM生成的文本 2507.00838v1 -
190 07-01 ProxAnn: Use-Oriented Evaluations of Topic Models and Document Clustering ProxAnn: Use-Oriented Assessments of Topic Models and Document Clustering ProxAnn:专题模型和文件分类组合的使用评价 2507.00828v1 -
191 07-01 A Study of In-Context-Learning-Based Text-to-SQL Errors Eine Studie über In-Context-Learning-basierte Text-zu-SQL-Fehler 文中学习基于文本到SQL错误的研究 2501.09310v2 -
192 07-01 Many LLMs Are More Utilitarian Than One Viele LLMs sind utilitaristischer als eines 多个LLM比单个LLM更功利 2507.00814v1 -
193 07-01 OM4OV: Leveraging Ontology Matching for Ontology Versioning OM4OV: Ontologie für die Ontologie-Versionierung OM4OV:利用本体学匹配本体学版本的本体学 2409.20302v4 -
194 07-01 Generative AI and the future of scientometrics: current topics and future questions Generative KI und die Zukunft der Scientometrics: aktuelle Themen und Zukunftsfragen 生成的人工智能和科学计量法的未来:当前专题和今后的问题 2507.00783v1 -
195 07-01 A Diagrammatic Calculus for a Functional Model of Natural Language Semantics Ein diagrammatischer Kalkulus für ein funktionelles Modell der natürlichen Sprachsemantik 自然语言语义学功能模型的图表计算 2507.00782v1 -
196 07-01 LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing LitBench: Ein Benchmark und Datensatz für eine zuverlässige Bewertung des kreativen Schreibens 《创意书写:可靠评价基准和数据集》 2507.00769v1 -
197 07-01 Safe Low Bandwidth SPV: A Formal Treatment of Simplified Payment Verification Protocols and Security Bounds Safe Low Bandwidth SPV: Eine formale Behandlung von vereinfachten Zahlungsverifikationsprotokollen und Sicherheitsbunden 安全低频带宽度SPV:对简化付款核查议定书和安全圈的正式处理 2507.00740v1 -
198 07-01 HyperCLOVA X THINK Technical Report HyperCLOVA X THINK Technischer Bericht HyperCLOVA X THINK 技术报告 2506.22403v2 -
199 07-01 AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models AudioTrust: Benchmarking der vielfältigen Vertrauenswürdigkeit von Audio Large Language Models 音频信任:确定音频大语言模式多面信任度基准 2505.16211v2 -
200 07-01 AI Analyst: Framework and Comprehensive Evaluation of Large Language Models for Financial Time Series Report Generation KI Analyst: Rahmen und umfassende Bewertung von großen Sprachmodellen für die Erstellung von Finanzzeitreihen AI分析员:财务时间系列报告编制大语言模式框架和综合评价 2507.00718v1 -
201 07-01 Quasi-symbolic Semantic Geometry over Transformer-based Variational AutoEncoder Quasi-symbolische Semantische Geometrie über Transformer-basierte Variational AutoEncoder 相对于基于变压器的变异自动编码器的 准正对立线语义学几何测量 2210.06230v3 -
202 07-01 Contrasting Cognitive Styles in Vision-Language Models: Holistic Attention in Japanese Versus Analytical Focus in English Kontrastierende kognitive Stile in Vision-Language-Modellen: Ganzheitliche Aufmerksamkeit im japanischen Vers Analytischen Fokus auf Englisch 视觉语言模型中相互矛盾的认知模式:日本口述分析重点中的整体关注英语 2507.00700v1 -
203 07-01 T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT T2I-R1: Verstärkung der Bildgenerierung mit kollaborativem Semantik- und Token-Level CoT T2I-R1:与合作语义级和Token 级COT加强图像生成 2505.00703v2 -
204 07-01 Leveraging Large Language Models for Spontaneous Speech-Based Suicide Risk Detection Nutzung von großen Sprachmodellen für spontane sprachbasierte Suizidrisikoerkennung 利用大型语言模型进行自发语音自杀风险探测 2507.00693v1 -
205 07-01 Iterative Resolution of Prompt Ambiguities Using a Progressive Cutting-Search Approach Iterative Auflösung von Prompt-Ambiguitäten mittels eines progressiven Cutting-Search-Ansatzes 采用逐步切割和搜寻办法迅速解决问题 2505.02952v2 -
206 07-01 Why Multi-Interest Fairness Matters: Hypergraph Contrastive Multi-Interest Learning for Fair Conversational Recommender System Warum Multi-Interest Fairness-Materies: Hypergraph Kontrastives Multi-Interest-Lernen für faires Gesprächs-Empfängersystem 为何多利公平问题:为公平对话建议系统进行高频对抗多利学习 2507.02000v1 -
207 07-01 Not Minds, but Signs: Reframing LLMs through Semiotics Nicht Gedanken, sondern Zeichen: LLMs durch Semiotik abwehren 不是心灵,而是符号:通过非美学重新组合LMS 2505.17080v2 -
208 07-01 SAFER: Probing Safety in Reward Models with Sparse Autoencoder SAFER: Prüfen von Sicherheit in Prämienmodellen mit Sparse Autoencoder SAFER: 使用 Sparse Autoencoder 的奖分模型中测试安全性 2507.00665v1 -
209 07-01 Positional Bias in Binary Question Answering: How Uncertainty Shapes Model Preferences Positionale Bias in Binärfragebeantwortung: Wie Unsicherheit Formen Modelleinstellungen 二进制问题解答中的位置偏差: 不确定形状的模型首选项 2506.23743v2 -
210 07-01 Fact Recall, Heuristics or Pure Guesswork? Precise Interpretations of Language Models for Fact Completion Fact Recall, Heuristik oder reines Guesswork? Präzise Interpretationen von Sprachmodellen für die Fact Completion 事实召回、维力主义或纯粹的猜测?事实完成对语言模式的精确解释 2410.14405v4 -
211 07-01 Efficient Domain-adaptive Continual Pretraining for the Process Industry in the German Language Effiziente Domain-Adaptive Kontinuierliche Vorschulung für die Prozessindustrie in der deutschen Sprache 以德语为加工工业提供高效的、适应性强的连续连续培训 2504.19856v3 -
212 07-01 Integrating Expert Labels into LLM-based Emission Goal Detection: Example Selection vs Automatic Prompt Design Integration von Experten-Etiketten in LLM-basierte Emissionszielerkennung: Beispielauswahl vs Automatisches Prompt-Design 将专家标签纳入基于LLM的LLM排放目标探测:选择实例与自动即时设计 2412.06432v2 -
213 07-01 TUM-MiKaNi at SemEval-2025 Task 3: Towards Multilingual and Knowledge-Aware Non-factual Hallucination Identification TUM-MiKaNi am SemEval-2025 Aufgabe 3: Mehrsprachige und wissensbasierte nicht-faktische Halluzinationsidentifikation SemEval-2025任务的TUM-MIKANi 任务3:多语种和知识-知识-软件非事实幻觉识别 2507.00579v1 -
214 07-01 DiReCT: Diagnostic Reasoning for Clinical Notes via Large Language Models DiReCT: Diagnostische Begründung für klinische Anmerkungen über große Sprachmodelle DiReCT:通过大语言模型诊断临床说明的诊断理由 2408.01933v6 -
215 07-01 Methodological Rigour in Algorithm Application: An Illustration of Topic Modelling Algorithm Methodologische Rigour in Algorithmen Anwendung: Eine Illustration der Themenmodellierung Algorithmen 算法应用中的方法严谨性:主题建模算法示例 2507.00547v1 -
216 07-01 An evaluation of LLMs and Google Translate for translation of selected Indian languages via sentiment and semantic analyses Eine Auswertung von LLMs und Google Translate zur Übersetzung ausgewählter indischer Sprachen über Sentiment und semantische Analysen 对LLM和Google Translate通过情感和语义分析翻译部分印度语言的评价 2503.21393v3 -
217 07-01 Capsule Network-Based Semantic Intent Modeling for Human-Computer Interaction Capsule Network-based Semantic Intent Modellierung für Mensch-Computer-Interaktion Capsule 网络基于网络的人类-计算机相互作用的语义内涵建模模型 2507.00540v1 -
218 07-01 NIRANTAR: Continual Learning with New Languages and Domains on Real-world Speech Data NIRANTAR: Kontinuierliches Lernen mit neuen Sprachen und Domänen auf Real-World Speech Data NIRANTAR: 关于现实世界语言数据的新语言和新域域的不断学习 2507.00534v1 -
219 07-01 SAGE: Steering Dialog Generation with Future-Aware State-Action Augmentation SAGE: Steuerungsdialog-Generierung mit zukunftssicherer State-Action-Erweiterung SAGE: 具有未来意识的国家行动增强作用的引导对话生成 2503.03040v2 -
220 07-01 TeamCMU at Touché: Adversarial Co-Evolution for Advertisement Integration and Detection in Conversational Search TeamCMU bei Touché: Adversariale Co-Evolution für Werbung Integration und Detektion in der Conversational Search CMU 接触问题小组:在谈话搜索中进行广告融合和探测的反向共同革命 2507.00509v1 -
221 07-01 Learning-to-Context Slope: Evaluating In-Context Learning Effectiveness Beyond Performance Illusions Learning-to-Context Slope: Bewertung von In-Context-Lerneffektivität jenseits von Performance-Illusionen 学习到文字表达式:评价除了业绩幻觉之外在学习中的效果 2506.23146v2 -
222 07-01 ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition 研究场所:通过基于灵感的分解任务,为科学发现中的科学发现中LLMs制定基准 2503.21248v2 -
223 07-01 ComRAG: Retrieval-Augmented Generation with Dynamic Vector Stores for Real-time Community Question Answering in Industry ComRAG: Retrieval-Augmented Generation mit dynamischen Vector Stores für Echtzeit-Community-Frageantworten in der Industrie ComRAG: 利用动态矢量储存库实时社区工业问题回答实时社区问题的回收-原始一代 2506.21098v2 -
224 07-01 Beat and Downbeat Tracking in Performance MIDI Using an End-to-End Transformer Architecture Beat- und Downbeat-Tracking in Performance-MIDI mit End-to-End-Transformer-Architektur 利用端对端转换器架构进行实绩跟踪的MIDI 2507.00466v1 -
225 07-01 Pitfalls of Evaluating Language Models with Open Benchmarks Lücken bei der Bewertung von Sprachmodellen mit offenen Benchmarks 具有开放基准的评价语言模式的空洞 2507.00460v1 -
226 07-01 Overcoming Long-Context Limitations of State-Space Models via Context-Dependent Sparse Attention Überwinden von Langkontext-Grenzen von State-Space-Modellen über Kontext-Abhängige Sparse-Achtung 克服国家空间模型通过环境依赖性分散关注而克服国家空间模型的长文限制 2507.00449v1 -
227 07-01 Revisiting Epistemic Markers in Confidence Estimation: Can Markers Accurately Reflect Large Language Models’ Uncertainty? Epistemische Marker in der Einschätzung von Vertrauen wiedersehen: Können Marker die Ungewissheit großer Sprachmodelle genau widerspiegeln? 重新审视信心估计中的亮点标记:标记能否准确地反映大语言模型的不确定性? 2505.24778v2 -
228 07-01 Beyond Sociodemographic Prompting: Using Supervision to Align LLMs with Human Response Distributions Beyond Sociodemographic Prompting: Mit Supervision LLMs mit menschlichen Response-Distributionen ausrichten 超越社会人口人口加速:利用监督使LMs与人的反应分布相匹配 2507.00439v1 -
229 07-01 Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning Verbessert Mathe-Reasoning die allgemeinen LLM-Fähigkeiten? Verstehen der Übertragbarkeit von LLM-Reasoning 数学理由是否提高一般LLM能力? 理解LLM理由的可转让性 2507.00432v1 -
230 07-01 RadZero: Similarity-Based Cross-Attention for Explainable Vision-Language Alignment in Radiology with Zero-Shot Multi-Task Capability RadZero: Ähnlichkeitsbasierte Cross-Attention für erklärbare Vision-Sprachenausrichtung in der Radiologie mit Zero-Shot-Multi-Task-Fähigkeit RadZero:在无热多任务能力的放射学中,对可解释的视觉-语言协调进行基于相似的交叉关注 2504.07416v2 -
231 07-01 Flexible Language Modeling in Continuous Space with Transformer-based Autoregressive Flows Flexible Sprachmodellierung im kontinuierlichen Raum mit transformerbasierten autoregressiven Strömungen 具有以变换器为基础的自动递减流动的连续空间灵活语言建模 2507.00425v1 -
232 07-01 Generative Representational Learning of Foundation Models for Recommendation Generatives repräsentatives Lernen von Stiftungsmodellen zur Empfehlung 产生基础基础建议模式的代言人学习 2506.11999v3 -
233 07-01 Pipelined Decoder for Efficient Context-Aware Text Generation Pipelined Decoder für effiziente Textgenerierung im Kontext 高效生成内容软件的管道解码器 2506.23431v2 -
234 07-01 ASTRO: Teaching Language Models to Reason by Reflecting and Backtracking In-Context ASTRO: Sprachmodelle zur Vernunft lehren durch Reflektieren und Zurückverfolgen im Kontext ASTRO:通过反映和回溯文文体,将语言模式教成理论 2507.00417v1 -
235 07-01 Parameter-Efficient Fine-Tuning via Circular Convolution Parameter-Effizient Feintuning über Kreiskonvolution 通过循环革命提高参数效率 2407.19342v4 -
236 07-01 Two-Stage Regularization-Based Structured Pruning for LLMs Zweistufiges Regularisierungs-basierendes strukturiertes Pruning für LLMs LLMM 双级正规化和结构化 2505.18232v2 -
237 07-01 Graft: Integrating the Domain Knowledge via Efficient Parameter Synergy for MLLMs Graft: Integration des Domainwissens über effiziente Parametersynergie für MLLMs Graft: 通过MLLM 高效参数协同将域知识整合 2506.23940v2 -
238 07-01 BlockDialect: Block-wise Fine-grained Mixed Format Quantization for Energy-Efficient LLM Inference BlockDialect: Blockweise feinkörnige Mischformat-Quantisierung für energieeffiziente LLM-Inferenz BlockDialect: 节能LLM 推论的粗件精细混合格式量化 2501.01144v4 -
239 07-01 Causal Prompting for Implicit Sentiment Analysis with Large Language Models Causal Prompting für Implizite Sentiment-Analyse mit großen Sprachmodellen 利用大语言模型进行隐含语言分析的诱导原因 2507.00389v1 -
240 07-01 DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning DALR: Dual Level Alignment Learning für multimodales Sentence Representative Learning DALR: 双级统一学习促进多式判决代表制学习 2506.21096v2 -
241 07-01 Flexora: Flexible Low Rank Adaptation for Large Language Models Flexora: Flexible Low-Rank-Anpassung für große Sprachmodelle 灵活度:针对大语言模式的灵活低级别适应 2408.10774v4 -
242 07-01 SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning SPIRAL: Selbst-Spiel auf Null-Sum-Spiele Anreize zur Vernunft durch Multi-Agent Multi-Turn Verstärkungs-Lernen SPIRAL: 零和博弈中的自我对弈通过多智能体多轮强化学习激励推理 2506.24119v2 -
243 07-01 Gregorian melody, modality, and memory: Segmenting chant with Bayesian nonparametrics Gregorianische Melodie, Modalität und Erinnerung: Segmentierungsgesang mit Bayesischen Nonparametrics Gregorian 旋律、 模式和记忆: 与巴耶斯非参数分隔的口号 2507.00380v1 -
244 07-01 Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples Lehren von Audio-Bewusst Große Sprachmodelle Was nicht hört: Halluzinationen durch synthesierte Negativproben abmildern 教授听觉大语言模型:通过合成负样本减少幻觉 2505.14518v2 -
245 07-01 Seeking and Updating with Live Visual Knowledge Suchen und Aktualisieren mit Live Visual Knowledge 利用实况视觉知识探索和更新 2504.05288v2 -
246 07-01 SPADE: Structured Prompting Augmentation for Dialogue Enhancement in Machine-Generated Text Detection SPADE: Strukturierte Prompting Augmentation für Dialog-Verbesserung bei maschinengenerierter Texterkennung SPADE: 在机器生成的文本探测中促进对话的结构性快速增强 2503.15044v2 -
247 07-01 Question Decomposition for Retrieval-Augmented Generation Zersetzung der Fragestellung für retrieval-augmented Generation 问题 后继子孙分解问题 2507.00355v1 -
248 07-01 Modeling Data Diversity for Joint Instance and Verbalizer Selection in Cold-Start Scenarios Modellierung der Datenvielfalt für die gemeinsame Instanz und die Auswahl der Verbalisatoren in Kaltstart-Szenarien 在 “ 冷开端 “ 情景下为联合试审和镇温器选择建立数据多样性模型 2507.00330v1 -
249 06-30 (1) Failure by Interference: Language Models Make Balanced Parentheses Errors When Faulty Mechanisms Overshadow Sound Ones Fehler durch Interferenz: Sprachmodelle machen ausgeglichene Klammern Fehler, wenn fehlerhafte Mechanismen Klangeindrücke überschatten 被干扰失败:语言模型在错误机制压倒阴影声音一号时造成平衡括号错误 2507.00322v1 -
250 06-30 ETTA: Elucidating the Design Space of Text-to-Audio Models ETTA: Erklärung des Designraums von Text-zu-Audio-Modellen ETTA: 说明文本到模拟模型的设计空间 2412.19351v2 -
251 06-30 Breaking mBad! Supervised Fine-tuning for Cross-Lingual Detoxification Breaking mBad! Supervised Feinabstimmung für Cross-Lingual Entgiftung 监督跨语言解毒微调 2505.16722v2 -
252 06-30 Open-ended Scientific Discovery via Bayesian Surprise Offene wissenschaftliche Entdeckung über Bayesian Surprise 通过贝叶斯惊喜的不限名额科学发现 2507.00310v1 -
253 06-30 Natural language processing for African languages Natürliche Sprachverarbeitung für afrikanische Sprachen 非洲语言的自然语言处理 2507.00297v1 -
254 06-30 The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements Der Automatisierte LLM Speedrunning Benchmark: NanoGPT-Verbesserungen reproduzieren 自动LLM快速运行基准:复制纳米GPT改进 2506.22419v2 -
255 06-30 Can LLMs Evaluate Complex Attribution in QA? Automatic Benchmarking using Knowledge Graphs Kann LLM komplexe Attribution in QA auswerten? Automatisches Benchmarking mit Wissensgraphen 利用知识图自动确定基准 2401.14640v2 -
256 06-30 From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning Von Tokens zu Gedanken: Wie LLMs und Menschen Kompression für Bedeutung traden 从Tokens到思想:LLM和人类如何用贸易压缩来达到意义 2505.17117v3 -
257 06-30 ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling EKG-Byte: Ein Tokenizer für die End-to-End Generative Elektrokardiogramm-Sprachenmodellierung ECG-Byte: 用于端到端生成式心电图语言建模的分词器 2412.14373v2 -
258 06-30 Impact of Fine-Tuning Methods on Memorization in Large Language Models Auswirkungen von Feintuning-Methoden auf die Erinnerung an große Sprachmodelle 大语言模型中微调教学方法对记忆化的影响 2507.00258v1 -
259 06-30 Llama-Nemotron: Efficient Reasoning Models Llama-Nemotron: Effiziente Denkmodelle Llama-Nemotron: 高效推理模型 2505.00949v4 -
260 06-30 Developing Lightweight DNN Models With Limited Data For Real-Time Sign Language Recognition Entwicklung leichter DNN-Modelle mit begrenzten Daten für Echtzeit-Sign Language-Erkennung 开发轻型DNN模型,具有实时手语识别的有限数据 2507.00248v1 -
261 06-30 EfficientXLang: Towards Improving Token Efficiency Through Cross-Lingual Reasoning EfficientXLang: Verbesserung der Token-Effizienz durch Cross-Lingual Reasoning 高效XLang:通过跨语言理由提高当量效率 2507.00246v1 -
262 06-30 Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension Skalierung der Inferenz-Zeit-Suche mit Vision Value Model für verbesserte visuelle Wahrnehmung 增强视觉理解的视觉价值模型的增强推论-实时搜索 2412.03704v3 -
263 06-30 The Algebraic Structure of Morphosyntax Die algebraische Struktur von Morphosyntax 形态句法的代数结构 2507.00244v1 -
264 06-30 Linearly Decoding Refused Knowledge in Aligned Language Models Lineare Dekodierung Verstärktes Wissen in ausgerichteten Sprachmodellen 在统一语言模型中线性解码拒绝的知识 2507.00239v1 -
265 06-30 Interpretable AI for Time-Series: Multi-Model Heatmap Fusion with Global Attention and NLP-Generated Explanations Interpretierbare KI für die Time-Serie: Multi-Model Heatmap Fusion mit globaler Aufmerksamkeit und NLP-generierten Erklärungen 时间序列可解释的 AI:全球关注的多模型热图融合和NLP - 引人注意的解释 2507.00234v1 -
266 06-30 A Graph-Based Classical and Quantum Approach to Deterministic L-System Inference Ein auf Graphen basierender klassischer und Quantumansatz zur deterministischen L-System-Inferenz 以图表为基础的确定性L-系统系统推断法的古学和量法 2411.19906v3 -
267 06-30 Towards Style Alignment in Cross-Cultural Translation Auf dem Weg zur Stilausrichtung in kulturübergreifender Übersetzung 实现跨文化翻译的风格一致 2507.00216v1 -
268 06-30 Two-Stage Reasoning-Infused Learning: Improving Classification with LLM-Generated Reasoning Zweistufiges Reasoning-infused Learning: Verbesserung der Klassifizierung mit LLM-generierter Reasoning 双级推理学习:改进以LLM为主的理由分类 2507.00214v1 -
269 06-30 LineRetriever: Planning-Aware Observation Reduction for Web Agents LineRetriever: Planning-Aware-Beobachtungsreduktion für Web-Agenten 线检索: 网络代理的规划-软件观测减少 2507.00210v1 -
270 06-30 Language Models Might Not Understand You: Evaluating Theory of Mind via Story Prompting Sprachmodelle verstehen dich vielleicht nicht: Theorie des Geistes über Story Prompting bewerten 语言模型可能无法理解你:通过故事提示评估心理理论 2506.19089v2 -
271 06-30 RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression RocketKV: Beschleunigung der Langkontext-LLM-Inferenz über zweistufige KV-Cache-Kompression RocketKV: 通过两步KV缓存压缩加速长文本LLM推导 2502.14051v2 -
272 06-30 Evaluating Deduplication Techniques for Economic Research Paper Titles with a Focus on Semantic Similarity using NLP and LLMs Bewertung von Deduplikationstechniken für Wirtschaftsforschungspapiertitel mit Fokus auf semantische Ähnlichkeit mit NLP und LLM 利用NLP和LLMs评估经济研究论文标题的应用技术,重点是语义相似性 2410.01141v3 -
273 06-30 Prompting as Scientific Inquiry Als wissenschaftliche Untersuchung prompt 作为科学调查 2507.00163v1 -
274 06-30 Table Understanding and (Multimodal) LLMs: A Cross-Domain Case Study on Scientific vs. Non-Scientific Data Tabelle Verständnis und (Multimodale) LLMs: Eine Cross-Domain-Fallstudie zu wissenschaftlichen vs. nicht wissenschaftlichen Daten 理学与非科学数据交叉案例研究 2507.00152v1 -
275 06-30 On the Predictive Power of Representation Dispersion in Language Models Zur vorausschauenden Macht der Repräsentationsdispersion in Sprachmodellen 语文模式代表性分布的预测力 2506.24106v1 -
276 06-30 Knowing You Don’t Know: Learning When to Continue Search in Multi-round RAG through Self-Practicing Wissen, dass Sie nicht wissen: Lernen, wann Sie die Suche in Multi-round RAG durch Selbst-Praktiken fortsetzen 了解您不知道: 学习何时通过自我实践在多轮RAG中继续搜索 2505.02811v2 -
277 06-30 SEUF: Is Unlearning One Expert Enough for Mixture-of-Experts LLMs? SEUF: Reicht es für LLMs für Mixture-of-Experts aus, einen Experten zu lernen? SEUF:不学习一位专家是否足以使混合专家LLM公司受益? 2411.18797v2 -
278 06-30 MotionGPT3: Human Motion as a Second Modality MotionGPT3: Menschliche Bewegung als zweite Modalität MotionGPT3:人类运动作为第二模式 2506.24086v1 -
279 06-30 STACK: Adversarial Attacks on LLM Safeguard Pipelines STACK: Adversariale Angriffe auf LLM Safeguard Pipelines STACK:对LLM保障管道的对抗攻击 2506.24068v1 -
280 06-30 Logit-Gap Steering: Efficient Short-Suffix Jailbreaks for Aligned Large Language Models Logit-Gap Steering: Effiziente Short-Suffix Jailbreaks für ausgerichtete große Sprachmodelle Logit-Gap 引导:针对对齐大语言模型的高效短后缀越狱 2506.24056v1 -
281 06-30 KMI: A Dataset of Korean Motivational Interviewing Dialogues for Psychotherapy KMI: Ein Datensatz koreanischer Motivationsinterviews für Psychotherapie KMI:韩国精神疗法动机访谈对话数据集 2502.05651v2 -
282 06-30 Position: Machine Learning Conferences Should Establish a “Refutations and Critiques” Track Position: Machine Learning Konferenzen sollten einen “Refutations and Critiques” Track erstellen 职位:机器学习会议应建立“反驳和批评”轨道 2506.19882v2 -
283 06-30 Riddle Me This! Stealthy Membership Inference for Retrieval-Augmented Generation Befreien Sie mich damit! Stealthy Mitgliedschaft Inferenz für Retrieval-Augmented Generation 中我这个! 偷盗会员身份的回溯性 被支持的一代人的推论 2502.00306v2 -
284 06-30 LibVulnWatch: A Deep Assessment Agent System and Leaderboard for Uncovering Hidden Vulnerabilities in Open-Source AI Libraries LibVulnWatch: Ein Deep Assessment Agent System und Leaderboard für die Entdeckung versteckter Schwachstellen in Open-Source-KI-Bibliotheken LibVuln Watch: 深入评估代理系统和开放源的AI图书馆中发现隐藏的弱点的主导板 2505.08842v2 -
285 06-30 Ella: Embodied Social Agents with Lifelong Memory Ella: Verkörperte Sozialagenten mit lebenslangem Gedächtnis Ella:有终身记忆的社会代理人 2506.24019v1 -
286 06-30 EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations EXPERT: Eine erklärbare Bildunterschrift Auswertung Metric mit strukturierten Erklärungen 具有结构性解释的可解释图像说明评价计量 2506.24016v1 -
287 06-30 Large Language Models Don’t Make Sense of Word Problems. A Scoping Review from a Mathematics Education Perspective Große Sprachmodelle machen keinen Sinn für Wortprobleme. Ein Scoping Review aus einer mathematischen Bildungsperspektive 大语言模型不能引起对字问题的看法。从数学教育角度进行范围界定审查。 2506.24006v1 -
288 06-30 Auto-TA: Towards Scalable Automated Thematic Analysis (TA) via Multi-Agent Large Language Models with Reinforcement Learning Auto-TA: Auf dem Weg zu einer skalierbaren Automatisierten Thematischen Analyse (TA) über Multi-Agent Large Language Models mit Verstärkungslernen Auto-TA:通过具有强化学习的多代理大语言模式逐步实现可缩放自动主题分析(TA) 2506.23998v1 -
289 06-30 TTRL: Test-Time Reinforcement Learning TTRL: Test-Zeit-Verstärkungs-Lernen TTRL: 试验时间强化学习 2504.16084v3 -
290 06-30 Machine Understanding of Scientific Language Maschinelles Verständnis der wissenschaftlichen Sprache 科学语言机器理解 2506.23990v1 -
291 06-30 TaP: A Taxonomy-Guided Framework for Automated and Scalable Preference Data Generation TaP: Ein taxonomy-geführtes Framework für automatisierte und skalierbare Präferenzdatengenerierung TaP: 自动和可缩放的首选数据生成分类-指导框架 2506.23979v1 -
292 06-30 LLM Agents Are the Antidote to Walled Gardens LLM-Agenten sind das Gegenmittel zu ummauerten Gärten LLM 药剂是被围墙隔绝的花园的抗药剂 2506.23978v1 -
293 06-30 Unveiling Decision-Making in LLMs for Text Classification : Extraction of influential and interpretable concepts with Sparse Autoencoders Enthüllen der Entscheidungsfindung in LLMs für Textklassifikation : Extraktion einflussreicher und interpretierbarer Konzepte mit Sparse Autoencodern 文本分类LLMs的不懈决策:与Sparse Autoencoders分离具有影响力和可解释的概念 2506.23951v1 -
294 06-30 Leveraging the Potential of Prompt Engineering for Hate Speech Detection in Low-Resource Languages Nutzung des Potenzials der Prompt-Engineering für die Hate Speech Detection in Low-Resource-Sprachen 利用迅速工程的潜力,在低资源语言中发现仇恨言论 2506.23930v1 -
295 06-30 IMPACT: Inflectional Morphology Probes Across Complex Typologies IMPACT: Beugungsmorphologie über komplexe Typologien hinweg IMPACT: 跨越复杂类型 2506.23929v1 -
296 06-30 The Trilemma of Truth in Large Language Models Das Trilemma der Wahrheit in großen Sprachmodellen 大语言模型中的真理三边 2506.23921v1 -
297 06-30 Empirical evidence of Large Language Model’s influence on human spoken communication Empirische Beweise für den Einfluss von Large Language Model auf die menschliche gesprochene Kommunikation 大语言模式对人口交流的影响的经验证据 2409.01754v2 -
298 06-30 Advancing Multi-Step Mathematical Reasoning in Large Language Models through Multi-Layered Self-Reflection with Auto-Prompting Mehrstufige mathematische Reasonierung in großen Sprachmodellen durch mehrschichtige Selbstreflexion mit Auto-Prompting 通过使用自动促进的多语言自评,在大语言模型中推进多层次多语种数学理由 2506.23888v1 -
299 06-30 Garbage In, Reasoning Out? Why Benchmark Scores are Unreliable and What to Do About It Warum Benchmark-Scores unzuverlässig sind und was dagegen zu tun ist 垃圾垃圾, 合理解释? 为什么基准分数不可靠? 如何做呢? 2506.23864v1 -
300 06-30 GeometryZero: Improving Geometry Solving for LLM with Group Contrastive Policy Optimization GeometrieZero: Verbesserung der Geometrie-Lösung für LLM mit Gruppen-Kontrast-Policy-Optimierung 几何零:改进与集团反竞争政策优化相结合的LLM的几何解决办法 2506.07160v2 -
301 06-30 Use Sparse Autoencoders to Discover Unknown Concepts, Not to Act on Known Concepts Verwenden Sie Sparse Autoencoder, um unbekannte Konzepte zu entdecken, nicht um auf bekannte Konzepte zu handeln 使用粗略自动编码器发现未知概念, 而不是对已知概念采取行动 2506.23845v1 -
302 06-30 Do Thinking Tokens Help or Trap? Towards More Efficient Large Reasoning Model Helfen oder Fallen denkende Token? Auf dem Weg zu einem effizienteren, großen, vernünftigen Modell 思考 Tok 帮助还是陷阱? 迈向更高效的大理由模型 2506.23840v1 -
303 06-30 Explainable Sentiment Analysis with DeepSeek-R1: Performance, Efficiency, and Few-Shot Learning Erklärbare Sentiment-Analyse mit DeepSeek-R1: Leistung, Effizienz und wenig scharfes Lernen “深搜索-R1:性能、效率和很少热学习”的可解释的感官分析 2503.11655v2 -
304 06-30 Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph Benchmarking Uncertainty Quantification Methods for Large Language Models mit LM-Polygraph 与LM-Polygraph 参照大语言模型的不确定性量化方法 2406.15627v4 -
305 06-30 Computational Analysis of Character Development in Holocaust Testimonies Computational Analyse der Charakterentwicklung in Holocaust-Zeugnissen 大屠杀证词特征发展计算分析 2412.17063v3 -
306 06-30 AutoEvoEval: An Automated Framework for Evolving Close-Ended LLM Evaluation Data AutoEvoEval: Ein Automatisiertes Framework für die Evolving Close-Ended LLM-Evaluierungsdaten AutoEvoEval:发展近端LLM评价数据自动框架 2506.23735v1 -
307 06-30 CSC-SQL: Corrective Self-Consistency in Text-to-SQL via Reinforcement Learning CSC-SQL: Korrektive Selbstkonsistenz im Text-zu-SQL durch Verstärkungslernen CSC-SQL:通过强化学习在文本到SQL中实现校正的自我统一 2505.13271v2 -
308 06-30 Towards an Automated Multimodal Approach for Video Summarization: Building a Bridge Between Text, Audio and Facial Cue-Based Summarization Auf dem Weg zu einem automatisierten multimodalen Ansatz für die Videozusammenfassung: Eine Brücke zwischen Text, Audio und Gesichtsqueue-basierter Zusammenfassung bauen 采用自动多式方式进行视频摘要描述:在文字、音频和基于面轴的缩写之间架建桥梁 2506.23714v1 -
309 06-30 Attestable Audits: Verifiable AI Safety Benchmarks Using Trusted Execution Environments Testbare Audits: Überprüfbare KI-Sicherheits-Benchmarks unter Verwendung von Trusted Execution Environments 可检验的审计:使用可信赖的执行环境的可核实的AI安全基准 2506.23706v1 -
310 06-30 Thinking About Thinking: SAGE-nano’s Inverse Reasoning for Self-Aware Language Models Denken über das Denken: SAGE-nano’s Inverse Reasoning for Self-Aware Language Models 思考思考:SAGE-nano 自我意识语言模型的反向理由 2507.00092v1 -
311 06-30 Sparsing Law: Towards Large Language Models with Greater Activation Sparsity Sparsing Law: Auf dem Weg zu großen Sprachmodellen mit größerer Aktivierungssparsität 评分法:走向大语言模式,具有更大的激活率平等性 2411.02335v4 -
312 06-30 Efficient Interleaved Speech Modeling through Knowledge Distillation Effiziente interleaved Speech Modeling durch Wissensdestillation 通过知识蒸馏建模建立知识蒸馏模式 2506.23670v1 -
313 06-30 L0: Reinforcement Learning to Become General Agents L0: Stärkung des Lernens, Generalagenten zu werden L0:加强学习成为一般代理 2506.23667v1 -
314 06-30 Zero-Shot Contextual Embeddings via Offline Synthetic Corpus Generation Zero-Shot Kontextuelle Einbettungen über Offline Synthetische Corpus-Generierung 通过离线合成机体生成零零热背景嵌入 2506.23662v1 -
315 06-30 Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization Robustes LLM-Unlearning mit MUDMAN: Meta-Unlearning mit Disruptionsmasken und Normalisierung 与 MUDMAN 一起重新学习: 以干扰蒙蔽和正常化的方式重新学习 2506.12484v3 -
316 06-30 Evaluating K-Fold Cross Validation for Transformer Based Symbolic Regression Models Bewertung der K-Fold Cross-Validierung für Transformer-basierte symbolische Regressionsmodelle 评估基于变换器的符号回归模型的 K- Fold 交叉验证 2410.21896v2 -
317 06-30 Evaluating the Simulation of Human Personality-Driven Susceptibility to Misinformation with LLMs Bewertung der Simulation menschlicher Persönlichkeits-getriebener Anfälligkeit für Fehlinformationen mit LLMs 评估模拟人类个性-驱动人对与LLMs的错误信息可视性 2506.23610v1 -
318 06-30 KAG-Thinker: Interactive Thinking and Deep Reasoning in LLMs via Knowledge-Augmented Generation KAG-Thinker: Interactive Thinking und Deep Reasoning in LLMs über wissensbasierte Generation KAG- Thinker: 通过知识型一代在LLMs中互动思考和深智 2506.17728v3 -
319 06-30 Semantic-guided Diverse Decoding for Large Language Model Semantisch-geführte Diverse Dekodierung für großes Sprachmodell 用于大语种的语义制导多种解码模型 2506.23601v1 -
320 06-30 FedEx-LoRA: Exact Aggregation for Federated and Efficient Fine-Tuning of Foundation Models FedEx-LoRA: Exakte Aggregation für Federated and Efficient Fine-Tuning of Foundation Models FedEx-LoRA:基金会模型的联邦和高效精度 2410.09432v4 -
321 06-30 Reachability in symmetric VASS Erreichbarkeit in symmetrischer VASS 对称VASS的可达性 2506.23578v1 -
322 06-30 MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI MMReason: Ein offenes Multi-Modal Multi-Step-Reason-Benchmark für MLLMs in Richtung AGI MMReason:面向AGI的MLLMs的开放性多模式多模式多步多步理由基准 2506.23563v1 -
323 06-30 From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data Von der Ausrichtung zur Weiterentwicklung: Bootstrapping Audio-Language Alignment mit synthetischen Daten 从对齐到推进: 用合成数据推动音频语言对齐 2505.20166v2 -
324 06-30 FlexRAG: A Flexible and Comprehensive Framework for Retrieval-Augmented Generation FlexRAG: Ein flexibler und umfassender Rahmen für die Retrieval-Augmented Generation FlexRAG: 灵活和综合的回回回一代人框架 2506.12494v2 -
325 06-30 On Recipe Memorization and Creativity in Large Language Models: Is Your Model a Creative Cook, a Bad Cook, or Merely a Plagiator? Auf Rezept Erinnerung und Kreativität in großen Sprachmodellen: Ist Ihr Modell ein kreativer Koch, ein schlechter Koch oder nur ein Plagiator? “大语言模型中的食谱记忆和创造性:你的模型是创意烹饪,坏烹饪,还是仅仅一个粉刷器?” 2506.23527v1 -
326 06-30 NEU-ESC: A Comprehensive Vietnamese dataset for Educational Sentiment analysis and topic Classification toward multitask learning NEU-ESC: Ein umfassender vietnamesischer Datensatz für die Analyse der Lernfähigkeit und das Thema Klassifizierung in Richtung Multitask-Lernen NEU-ESC:越南综合数据集,用于教育敏感分析和多任务学习的专题分类 2506.23524v1 -
327 06-30 A Comprehensive Evaluation of Semantic Relation Knowledge of Pretrained Language Models and Humans Eine umfassende Bewertung der semantischen Beziehungskenntnisse von vorgebildeten Sprachmodellen und Menschen 全面评价未受过训练语言模式和人文的语义关系知识 2412.01131v3 -
328 06-30 Assessing GPTZero’s Accuracy in Identifying AI vs. Human-Written Essays Beurteilung der Genauigkeit von GPTZero bei der Identifizierung von KI gegen von Menschen geschriebene Essays 评估GPTZero在识别AI与人类撰写文章中的准确性 2506.23517v1 -
329 06-30 Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding Gumiho: Eine hybride Architektur, um frühe Token in spekulativer Dekodierung zu priorisieren Gumiho:在投机下限中优先考虑早期物料的混合结构 2503.10135v2 -
330 06-30 LLM Braces: Straightening Out LLM Predictions with Relevant Sub-Updates LLM-Bremsen: LLM-Vorhersagen mit relevanten Sub-Updates ausgleichen LLM Braces: 利用相关的子更新实现LLM预测 2503.16334v2 -
331 06-30 Reinforcement Fine-Tuning Enables MLLMs Learning Novel Tasks Stably Verstärkte Feinsteuerung ermöglicht MLLMs das Erlernen neuartiger Aufgaben stabil 强化精细调整启用 MLLMS 学习新创任务 2506.23508v1 -
332 06-30 FinEval-KR: A Financial Domain Evaluation Framework for Large Language Models’ Knowledge and Reasoning FinEval-KR: Ein Financial Domain Evaluation Framework für das Wissen und die Vernunft großer Sprachmodelle FinEval-KR:大语言模式知识和理由说明的财务域评价框架 2506.21591v2 -
333 06-30 Thought-Augmented Planning for LLM-Powered Interactive Recommender Agent Thought-Augmented Planung für LLM-Powered Interactive Recommender Agent LLM 授权互动建议代理商的集思广益规划 2506.23485v1 -
334 06-30 CTISum: A New Benchmark Dataset For Cyber Threat Intelligence Summarization CTISum: Ein neuer Benchmark-Datensatz für Cyber Threat Intelligence Zusammenfassung CTISum:网络威胁情报总结的新基准数据集 2408.06576v2 -
335 06-30 Federated Learning-Enabled Hybrid Language Models for Communication-Efficient Token Transmission Federated Learning-Enabled Hybrid Language Models für kommunikationseffiziente Token-Übertragung 通信-高效调式传真传播的联邦学习-进进混合语言模式 2507.00082v1 -
336 06-30 Parenting: Optimizing Knowledge Selection of Retrieval-Augmented Language Models with Parameter Decoupling and Tailored Tuning Parenting: Optimierung der Wissensauswahl von Retrieval-Augmented Language Models mit Parameterentkopplung und maßgeschneidertem Tuning 亲子关系: 优化使用参数分离和定制调试的检索增强语言模型的知识选择 2410.10360v3 -
337 06-30 What to Keep and What to Drop: Adaptive Table Filtering Framework Was zu halten und was zu fallen: Adaptive Tabelle Filterung Rahmen 保持和放下什么:适应性表格过滤框架 2506.23463v1 -
338 06-30 State and Memory is All You Need for Robust and Reliable AI Agents Zustand und Gedächtnis sind alles, was Sie für robuste und zuverlässige KI-Agenten brauchen 国家记忆是强力和可靠的AI代理所需要的一切 2507.00081v1 -
339 06-30 Bridge: A Unified Framework to Knowledge Graph Completion via Language Models and Knowledge Representation Brücke: Ein einheitliches Framework zur Wissensgraphenvervollständigung über Sprachmodelle und Wissensdarstellung 桥梁:通过语言模式和知识代表性完成知识图的统一框架 2411.06660v3 -
340 06-30 Mechanistic Interpretability of Emotion Inference in Large Language Models Mechanistische Interpretation von Emotionsinferenzen in großen Sprachmodellen 大语言模型情感引因的可解释性 2502.05489v2 -
341 06-29 (7) TuCo: Measuring the Contribution of Fine-Tuning to Individual Responses of LLMs TuCo: Messung des Beitrags von Feinsteuerung zu individuellen Reaktionen von LLMs TuCo:衡量微调对LLM个人对策的贡献 2506.23423v1 -
342 06-29 Datasets for Fairness in Language Models: An In-Depth Survey Datensätze für Fairness in Sprachmodellen: Eine In-Depth-Umfrage 语言模型公平性数据集:内部调查 2506.23411v1 -
343 06-29 Automating Adjudication of Cardiovascular Events Using Large Language Models Automatisieren der Adjudikation von Herz-Kreislauf-Ereignissen mit großen Sprachmodellen 使用大语言模型自动裁决心血管事件 2503.17222v2 -
344 06-29 Teaching a Language Model to Speak the Language of Tools Ein Sprachmodell lehren, um die Sprache der Werkzeuge zu sprechen 教授一种语言模式,讲工具语言 2506.23394v1 -
345 06-29 Hierarchical Memory Organization for Wikipedia Generation Hierarchische Speicherorganisation für Wikipedia Generation 维基百科世代等级记忆组织 2506.23393v1 -
346 06-29 Comparative Evaluation of ChatGPT and DeepSeek Across Key NLP Tasks: Strengths, Weaknesses, and Domain-Specific Performance Vergleichende Bewertung von ChatGPT und DeepSeek über zentrale NLP-Aufgaben: Stärken, Schwächen und Domain-spezifische Leistung 国家劳工政策关键任务:力量、弱点和具体具体绩效 2506.18501v2 -
347 06-29 Perspective Dial: Measuring Perspective of Text and Guiding LLM Outputs Perspective Dial: Perspective of Text and Guiding LLM Outputs messen 计量文字和引导性LLM产出 2506.23377v1 -
348 06-29 Emotional RAG LLMs: Reading Comprehension for the Open Internet Emotionale RAG LLMs: Leseverständnis für das offene Internet 情感性RAG LLM: 阅读开放互联网理解 2408.11189v2 -
349 06-29 You Sound a Little Tense: L2 Tailored Clear TTS Using Durational Vowel Properties Sie klingen eine kleine Tense: L2 Maßgeschneiderte klare TTS Verwendung von Durational Vowel Properties 你听起来有点紧张: L2 使用时空声波属性的 L2 定制的清除 TTS 2506.23367v1 -
350 06-29 Density, asymmetry and citation dynamics in scientific literature Dichte, Asymmetrie und Zitierdynamik in der wissenschaftlichen Literatur 科学文献中的密度、不对称和引用动态 2506.23366v1 -
351 06-29 ChipXplore: Natural Language Exploration of Hardware Designs and Libraries ChipXplore: Natural Language Exploration von Hardware-Designs und Bibliotheken ChipXplore: 硬件设计和图书馆的自然语言探索 2407.12749v3 -
352 06-29 Distillation and Refinement of Reasoning in Small Language Models for Document Re-ranking Destillieren und Verfeinern von Vernunft in kleinen Sprachmodellen für die Neurangierung von Dokumenten 用于文件排序的小型语文模式中理由推理的提炼和精炼 2504.03947v3 -
353 06-29 Potemkin Understanding in Large Language Models Potemkin Verständnis in großen Sprachmodellen 大语言模型中的波坦金理解 2506.21521v2 -
354 06-29 I see what you mean: Co-Speech Gestures for Reference Resolution in Multimodal Dialogue Ich verstehe, was Sie meinen: Co-Speech Gestures for Reference Resolution in Multimodal Dialogue 我理解你的意思:在多模式对话中,用共同语音手势解决参考问题 2503.00071v3 -
355 06-29 TigerLLM - A Family of Bangla Large Language Models TigerLLM - Eine Familie von Bangla Große Sprachmodelle TigerLLM - 孟加拉大语言模式大家庭 2503.10995v3 -
356 06-29 WebDancer: Towards Autonomous Information Seeking Agency WebDancer: Auf dem Weg zu einer autonomen Informationsagentur WebDancer:走向自主信息搜索机构 2505.22648v2 -
357 06-29 ATGen: A Framework for Active Text Generation ATGen: Ein Framework für die aktive Textgenerierung ATGen: 主动生成文本的框架 2506.23342v1 -
358 06-29 Information Loss in LLMs’ Multilingual Translation: The Role of Training Data, Language Proximity, and Language Family Informationsverlust in der Mehrsprachigen Übersetzung von LLMs: Die Rolle von Trainingsdaten, Sprachnähe und Sprachfamilie LLM女士多种语文翻译信息损失:培训数据的作用、语言接近和语言家庭 2506.23340v1 -
359 06-29 Tracing Intricate Cues in Dialogue: Joint Graph Structure and Sentiment Dynamics for Multimodal Emotion Recognition Intricate Cues im Dialog zu verfolgen: Gemeinsame Graphenstruktur und Stimmungsdynamik für multimodale Emotionserkennung 对话中的追踪源数:多模式情感认知的联合图表结构和感知动态 2407.21536v2 -
360 06-29 Automated Vehicles Should be Connected with Natural Language Automatisierte Fahrzeuge sollten mit natürlicher Sprache verbunden werden 自动车辆应与自然语言连接 2507.01059v1 -
361 06-29 GaussMaster: An LLM-based Database Copilot System GaußMaster: Ein LLM-basiertes Datenbank-Copilot-System GaussMaster:以LLM为基础的数据库联合试验系统 2506.23322v1 -
362 06-29 Creativity in AI: Progresses and Challenges Kreativität in der KI: Fortschritte und Herausforderungen 大赦国际的创造性:进展和挑战 2410.17218v5 -
363 06-29 AutoToM: Scaling Model-based Mental Inference via Automated Agent Modeling AutoToM: Skalierung modellbasierter mentaler Schlussfolgerungen über Automatisierte Agentenmodellierung AutoToM:通过自动代理建模增强基于模型的心理推断 2502.15676v2 -
364 06-29 Ensemble BERT for Medication Event Classification on Electronic Health Records (EHRs) Ensemble BERT für Medikationsveranstaltungsklassifikation auf elektronischen Gesundheitsakten (EHRs) 电子健康记录(EHRs)药品事件分类集合BERT 2506.23315v1 -
365 06-29 AI Awareness KI-Bewusstsein AIA 认识 2504.20084v2 -
366 06-29 Know You First and Be You Better: Modeling Human-Like User Simulators via Implicit Profiles Kennen Sie zuerst und werden Sie besser: Modellierung von Mensch-ähnlichen Benutzer-Simulatoren über Implizite Profile “先知你,再善待你:通过隐含描述文件模拟人像用户模拟器” 2502.18968v4 -
367 06-29 A Context-aware Framework for Translation-mediated Conversations Ein Context-aware Framework für translation-mediated conversations 翻译调解对话的背景意识框架 2412.04205v2 -
368 06-29 Objective-Free Local Learning and Emergent Language Structure in Thinking Machines Zielfreies lokales Lernen und neue Sprachstrukturen in denkenden Maschinen 考虑机器中无目标的地方学习和新兴语言结构 2506.23293v1 -
369 06-29 Two Spelling Normalization Approaches Based on Large Language Models Zwei Rechtschreibungs-Normalisierungsansätze basierend auf großen Sprachmodellen 基于大语言模式的两种拼法正常化办法 2506.23288v1 -
370 06-29 Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models Beispiel dann Identifizieren: Ein allgemeiner Rahmen für die Risikokontrolle und Bewertung in multimodalen großen Sprachmodellen 确定:多式大语言模式风险管理和评估总框架 2410.08174v3 -
371 06-29 Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games Beschädigt durch Reasoning: Reasoning Sprachmodelle werden Free-Riders in Public Goods Games 原因:在公共货物运动会中,理性语言模式成为自由骑手 2506.23276v1 -
372 06-29 Agentic Medical Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge Agentisches medizinisches Wissen Grafiken verbessern medizinische Frageantworten: Die Lücke zwischen LLMs und sich entwickelndem medizinischem Wissen überbrücken 药用知识图加强医疗问题的回答:缩小LLMM与不断发展的医学知识之间的差距 2502.13010v3 -
373 06-29 RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing RAG und RAU: Eine Umfrage zum retrieval-augmentierten Sprachmodell in der natürlichen Sprachverarbeitung RAG和RAU:关于自然语言处理中检索增强语言模式的调查 2404.19543v2 -
374 06-29 The language of time: a language model perspective on time-series foundation models Die Sprache der Zeit: ein Sprachmodell Perspektive auf Zeitreihen Grundmodelle 时间语言:时间序列基础模型的语言模式视角 2507.00078v1 -
375 06-29 Generalist Reward Models: Found Inside Large Language Models Generalist Reward Models: In großen Sprachmodellen gefunden 通用奖赏模式:在大语言模式内建立起来 2506.23235v1 -
376 06-29 Masked Gated Linear Unit Maskierte gezahnte Lineareinheit 面罩线条股 2506.23225v1 -
377 06-29 UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding UrbanLLaVA: Ein multimodales Large Language Model für urbane Intelligenz mit räumlicher Vernunft und Verständnis UrbanLLaVA:具有空间合理性和理解性的城市情报多模式大语言模式 2506.23219v1 -
378 06-29 RiverText: A Python Library for Training and Evaluating Incremental Word Embeddings from Text Data Streams RiverText: Eine Python-Bibliothek für das Training und Evaluieren inkrementaler Word-Einbettungen aus Textdatenströmen RiverText:一个培训和评价来自文本数据流的递增单词嵌入的Python图书馆 2506.23192v1 -
379 06-29 Redefining Evaluation Standards: A Unified Framework for Evaluating the Korean Capabilities of Language Models Neudefinition von Bewertungsstandards: Ein einheitlicher Rahmen für die Bewertung der koreanischen Fähigkeiten von Sprachmodellen 重新界定评价标准:评价韩国语言模式能力的统一框架 2503.22968v3 -
380 06-29 FinAI-BERT: A Transformer-Based Model for Sentence-Level Detection of AI Disclosures in Financial Reports FinAI-BERT: Ein transformerbasiertes Modell für die Sentence-Level-Erkennung von KI-Enthüllungen in Finanzberichten FinAI-BERT:以判决为基础在判决一级侦查财务报告中AI披露的变换模式 2507.01991v1 -
381 06-29 The Effectiveness of LLMs as Annotators: A Comparative Overview and Empirical Analysis of Direct Representation Die Wirksamkeit von LLMs als Annotatoren: Eine vergleichende Übersicht und empirische Analyse der direkten Repräsentation LLMs作为说明人的效力:直接代表的比较概览和经验分析 2405.01299v2 -
382 06-29 V-SYNTHESIS: Task-Agnostic Synthesis of Consistent and Diverse In-Context Demonstrations from Scratch via V-Entropy V-SYNTHESIS: Task-Agnostische Synthese von konsistenten und unterschiedlichen In-Context-Demonstrationen von Scratch über V-Entropie V-SYNTHESIS:关于通过V-Entropy从Scratch到V-Entropy的一致和多样化的文体演示的 任务-不可知综合 2506.23149v1 -
383 06-29 Brevity is the soul of sustainability: Characterizing LLM response lengths Brevity ist die Seele der Nachhaltigkeit: Charakterisierende LLM-Responselängen 博利是可持续性的灵魂:确定LLM 反应长度 2506.08686v2 -
384 06-29 Benchmarking Deep Search over Heterogeneous Enterprise Data Benchmarking Deep Search über heterogene Unternehmensdaten 确定对不同不同企业数据进行深度搜索的基准 2506.23139v1 -
385 06-29 LLM-Assisted Question-Answering on Technical Documents Using Structured Data-Aware Retrieval Augmented Generation LLM-Assisted Question-Answering on Technical Documents Using Structured Data-Aware Retrieval Augmented Generation 利用结构化数据检索软件增强型一代技术文件的LLM协助问题查询 2506.23136v1 -
386 06-29 Evaluating Rare Disease Diagnostic Performance in Symptom Checkers: A Synthetic Vignette Simulation Approach Bewertung der Diagnoseleistung bei seltenen Krankheiten bei Symptomen: Ein synthetischer Vignette-Simulationsansatz 评价症状检查器中的罕见疾病诊断性能: 合成Vignette模拟方法 2506.19750v4 -
387 06-29 Format-Adapter: Improving Reasoning Capability of LLMs by Adapting Suitable Format Format-Adapter: Verbesserung der Kapazität von LLMs durch Anpassung des geeigneten Formats 格式设计师:通过调整适当格式,提高LLMs的理据能力 2506.23133v1 -
388 06-29 Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding Time-R1: Nach dem Training Großer Vision-Sprachenmodell für die zeitliche Videoerdung 时间-R1:培训后用于实时视频定位的大型视觉语言模型 2503.13377v3 -
389 06-29 Unleashing Embodied Task Planning Ability in LLMs via Reinforcement Learning Entleashing Embodyd Task Planning Fähigkeit in LLMs durch Verstärkung Learning 通过强化学习,在LLMs中释放未穿衣任务规划能力 2506.23127v1 -
390 06-29 Beware of Calibration Data for Pruning Large Language Models Hüten Sie sich vor Kalibrierdaten für das Pruning von großen Sprachmodellen 注意为粗略大语言模型提供校准数据 2410.17711v2 -
391 06-29 Decoding Memes: Benchmarking Narrative Role Classification across Multilingual and Multimodal Models Decoding Memes: Benchmarking Narrative Role Classification für multilinguale und multimodale Modelle 代码模式:多语种和多模式模式的 “ 示范 “ 和 “ 多语种和多模式模式 “ 的 “ 示范作用分类基准 “ 2506.23122v1 -
392 06-29 Enough Coin Flips Can Make LLMs Act Bayesian Genug Münze Flips kann LLMs Act Bayesian 足够多的硬币翻翻可以制造长效LLM 贝叶斯女士 2503.04722v2 -
393 06-29 A Survey of Test-Time Compute: From Intuitive Inference to Deliberate Reasoning Eine Übersicht über die Berechnung der Testzeit: Vom intuitiven Rückschluss zur überlegten Vernunft 试验时间计算调查:从直觉推理到故意推理 2501.02497v3 -
394 06-29 MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings MoCa: Modality-aware Continual Pre-Training macht bidirektionale multimodale Einbettungen besser MoCa: 模式 – – 有意识的连续培训前预培训使双向双向多模式嵌入更佳 2506.23115v1 -
395 06-29 FairI Tales: Evaluation of Fairness in Indian Contexts with a Focus on Bias and Stereotypes FairI Tales: Bewertung von Fairness in indischen Kontexten mit Fokus auf Bias und Stereotypen FairI Tales:以偏见和陈规定型观念为重点,评价印度背景下的公平性 2506.23111v1 -
396 06-29 From Individuals to Interactions: Benchmarking Gender Bias in Multimodal Large Language Models from the Lens of Social Relationship Von Individuen zu Interaktionen: Benchmarking von Gender Bias in multimodalen großen Sprachmodellen aus dem Bereich der sozialen Beziehung 从个人到互动:从社会关系的角度衡量多模式大语言模式中的性别偏见 2506.23101v1 -
397 06-29 Learning Dynamics of LLM Finetuning Dynamisches Lernen der LLM-Feinsteuerung LLM 微调的学习动态 2407.10490v4 -
398 06-29 MMInA: Benchmarking Multihop Multimodal Internet Agents MMInA: Benchmarking Multihop Multimodale Internet-Agenten MMInA: 确定多速多式互联网代理商的基准 2404.09992v2 -
399 06-29 TyphoFormer: Language-Augmented Transformer for Accurate Typhoon Track Forecasting TyphoFormer: Sprachgesteigerter Transformer für präzise Typhoon-Track-Prognose 台风前台风:用于准确预报台风轨道的语文增强变换器 2506.17609v2 -
400 06-29 DReSS: Data-driven Regularized Structured Streamlining for Large Language Models DReSS: Datengesteuerte Regularisierte Strukturierte Straffung für große Sprachmodelle DReSS: 数据驱动的大型语文模式正规化结构精简 2501.17905v3 -
401 06-29 Text2VectorSQL: Bridging Text-to-SQL and Vector Search for Unified Natural Language Queries Text2VectorSQL: Überbrückung Text-zu-SQL und Vektor Suche nach Unified Natural Language Queries Text2VectorSQL: 连接文本到SQL和矢量搜索统一自然语言查询 2506.23071v1 -
402 06-29 Multimodal Medical Code Tokenizer Multimodaler medizinischer Code Tokenizer 多式联运医疗法典化器 2502.04397v3 -
403 06-29 Boosting LLM’s Molecular Structure Elucidation with Knowledge Enhanced Tree Search Reasoning Förderung der molekularen Struktur von LLM mit Wissensverstärkung der Baumsuche 推动LLM的分子结构 2506.23056v1 -
404 06-29 MariNER: A Dataset for Historical Brazilian Portuguese Named Entity Recognition MariNER: Ein Datensatz für die historische brasilianische portugiesische Identitätserkennung MariNER:巴西历史上葡萄牙命名实体识别数据集 2506.23051v1 -
405 06-29 AURA: Agent for Understanding, Reasoning, and Automated Tool Use in Voice-Driven Tasks AURA: Agent für Verständnis, Vernunft und automatisierte Werkzeugnutzung in stimmgesteuerten Aufgaben AURA: 语音驱动任务中理解、解释和自动工具使用代理 2506.23049v1 -
406 06-29 SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions SoMi-ToM: Bewertung der multiperspektiven Theorie des Geistes in körpereigenen sozialen Interaktionen SoMi-ToM:评估社会互动中的多视角思维理论 2506.23046v1 -
407 06-29 MetaSynth: Meta-Prompting-Driven Agentic Scaffolds for Diverse Synthetic Data Generation MetaSynth: Meta-prompting-Driven Agentic Scaffolds für vielfältige synthetische Datengenerierung MetaSynth: 用于多种合成数据生成的元- 制造- 挥发剂脚架 2504.12563v2 -
408 06-29 CHARTOM: A Visual Theory-of-Mind Benchmark for LLMs on Misleading Charts CHARTOM: Ein visueller Theorie-von-Mind-Benchmark für LLMs auf irreführenden Diagrammen 错误领导图表LLML女士的视觉理论基准 2408.14419v3 -
409 06-28 (6) Multimodal Contrastive Representation Learning in Augmented Biomedical Knowledge Graphs Multimodales Kontrastives Repräsentationslernen in Augmented Biomedical Knowledge Graphs 生物医学知识强化图中多模式差异代表性学习 2501.01644v2 -
410 06-28 The Limited Impact of Medical Adaptation of Large Language and Vision-Language Models Die begrenzten Auswirkungen medizinischer Anpassung von großen Sprach- und Visions-Sprachenmodellen 大语言和视觉语言模式医学适应的有限影响 2411.08870v3 -
411 06-28 MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning MARBLE: Ein harter Maßstab für multimodale räumliche Vernunft und Planung 多式联运空间理由和规划的硬基准 2506.22992v1 -
412 06-28 Time-MQA: Time Series Multi-Task Question Answering with Context Enhancement Time-MQA: Zeitreihe Multi-Task-Fragebeantwortung mit Kontextverbesserung 时间-MQA:时间系列多任务问题,加强背景回答 2503.01875v2 -
413 06-28 A Systematic Study of Compositional Syntactic Transformer Language Models Eine systematische Studie kompositorischer syntaktischer Transformer-Sprachmodelle 系统研究合成同步转换器语言模型 2506.22978v1 -
414 06-28 On the Generalizability of “Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals” Zur Verallgemeinerbarkeit von “Wettbewerb von Mechanismen: Aufspüren, wie Sprachmodelle mit Fakten und Gegenfakten umgehen” 关于“机制的竞争:追踪语言模式如何处理事实和反事实”的一般性 2506.22977v1 -
415 06-28 Interpretable LLM-based Table Question Answering Interpretierbare LLM-basierte Tabellenfragebeantwortung 基于表问题的回答 2412.12386v3 -
416 06-28 MLAN: Language-Based Instruction Tuning Preserves and Transfers Knowledge in Multimodal Language Models MLAN: Sprachbasierte Anleitung Tuning bewahrt und überträgt Wissen in multimodalen Sprachmodellen MLAN: 多种语文模式中基于语文的指导指示图示保留和转让知识 2411.10557v3 -
417 06-28 Truth Neurons Wahrheit Neuronen 真理中世纪 2505.12182v2 -
418 06-28 What can large language models do for sustainable food? Was können große Sprachmodelle für nachhaltige Lebensmittel tun? 大型语言模式对于可持续食物能做些什么? 2503.04734v2 -
419 06-28 Enabling Precise Topic Alignment in Large Language Models Via Sparse Autoencoders Präzise Topic Alignment in großen Sprachmodellen über Sparse Autoencoder aktivieren 启用大语言模型中的精确主题对齐 2506.12576v2 -
420 06-28 Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models Agent-to-Agent Theorie des Geistes: Testen Gesprächspartner Bewusstsein unter großen Sprachmodellen 精神感官理论:测试大语言模型间对话者的认识 2506.22957v1 -
421 06-28 HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation HalluSegBench: Counterfactual Visual Reasoning for Segmentation Halluzination Evaluation HalluSegBench:截肢幻觉评价的反事实视觉理由 2506.21546v2 -
422 06-28 SConU: Selective Conformal Uncertainty in Large Language Models SConU: Selektive konforme Unsicherheit in großen Sprachmodellen SConU:大语言模式中选择性的形式不确定性 2504.14154v2 -
423 06-28 MOTOR: Multimodal Optimal Transport via Grounded Retrieval in Medical Visual Question Answering MOTOR: Multimodaler Optimaler Transport über geschliffenes Retrieval in der medizinischen visuellen Fragestellung 在医疗视觉问题解答中通过定地检索进行多式最佳交通 2506.22900v1 -
424 06-28 From Outcomes to Processes: Guiding PRM Learning from ORM for Inference-Time Alignment Von Ergebnissen zu Prozessen: Leitende PRM-Lernen von ORM für die Schlussfolgerungs-Zeit-Ausrichtung 从结果到过程:指导程序程序管理从ORM学习,以推断-时间协调 2506.12446v2 -
425 06-28 Which Programming Language and What Features at Pre-training Stage Affect Downstream Logical Inference Performance? Welche Programmiersprache und welche Features bei Pre-Training Stage beeinflussen Downstream Logical Inferenz Performance? 培训前阶段哪些语言和特点影响下游逻辑推论性能? 2410.06735v2 -
426 06-28 Arabic Dialect Classification using RNNs, Transformers, and Large Language Models: A Comparative Analysis Arabische Dialektklassifikation mit RNNs, Transformern und großen Sprachmodellen: Eine vergleichende Analyse 使用RNN、变换器和大语言模式的阿拉伯语方言分类:比较分析 2506.19753v2 -
427 06-28 PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models PRMBench: Ein feinkörniger und anspruchsvoller Benchmark für Prozess-Level-Reward-Modelle PRMBench:进程一级奖励模式的精细和质疑基准 2501.03124v5 -
428 06-28 Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval Mask-aware Text-to-Image Retrieval: Referenzierung der Expression-Segmentierung trifft modales Retrieval Mask-aware 文本到图像检索: 参考表达式分解会遇到交叉模式检索 2506.22864v1 -
429 06-28 MANTA: Cross-Modal Semantic Alignment and Information-Theoretic Optimization for Long-form Multimodal Understanding MANTA: Cross-Modal Semantic Alignment und informationstheoretische Optimierung für langformiges multimodales Verständnis MANTA:跨模式的语义一致和信息理论优化,促进长期多式联运理解 2507.00068v1 -
430 06-28 Mind the Gap: Entity-Preserved Context-Aware ASR Structured Transcriptions Mind the Gap: Entity-Context-Aware ASR Strukturierte Transkriptionen 牢记差距:实体提供的背景软件ASR结构化分类 2506.22858v1 -
431 06-28 Knowledge Augmented Finetuning Matters in both RAG and Agent Based Dialog Systems Wissen Augmented Finetuning Matters in RAG und Agent Based Dialog Systems 在区域咨询组和代理人基础对话系统中增加知识的微调问题 2506.22852v1 -
432 06-28 Boosting CTC-Based ASR Using LLM-Based Intermediate Loss Regularization Steigerung der CTC-basierten ASR-Nutzung durch LLM-basierte Intermediate Loss Regularisierung 利用基于LLM的中间损失规范化,促进基于反恐委员会的ASR 2506.22846v1 -
433 06-28 Better Aligned with Survey Respondents or Training Data? Unveiling Political Leanings of LLMs on U.S. Supreme Court Cases Besser ausgerichtet mit Umfragegegnern oder Trainingsdaten? Enthüllung politischer Leanings von LLMs in US Supreme Court Cases 与美国最高法院案件调查答卷人或培训数据更加一致? 2502.18282v3 -
434 06-28 Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback Margin Matching Preference Optimization: Verbesserte Modellausrichtung mit Granular Feedback 边际匹配优先优化:用颗粒反馈增强模型协调 2410.03145v2 -
435 06-28 Selecting and Merging: Towards Adaptable and Scalable Named Entity Recognition with Large Language Models Auswahl und Zusammenführung: Auf dem Weg zu einer anpassungsfähigen und skalierbaren Namenskanzlei-Erkennung mit großen Sprachmodellen 选择和合并:努力以大语言模式识别可适应和可缩放命名实体 2506.22813v1 -
436 06-28 BayesLoRA: Task-Specific Uncertainty in Low-Rank Adapters BayesLoRA: Aufgabenspezifische Unsicherheit in Low-Rank-Adaptern BayesLoRA:低兰克适应器中任务具体不确定性 2506.22809v1 -
437 06-28 MedEthicsQA: A Comprehensive Question Answering Benchmark for Medical Ethics Evaluation of LLMs MedEthicsQA: Eine umfassende Frage-Antwort-Benchmark für medizinische Ethik-Bewertung von LLMs MedEthicsQA:LLMs医学道德评价的全面回答问题基准 2506.22808v1 -
438 06-28 Enhancing the Capability and Robustness of Large Language Models through Reinforcement Learning-Driven Query Refinement Verbesserung der Fähigkeit und Robustheit von großen Sprachmodellen durch verstärkte Learning-Driven Query Refinement 通过强化学习-驱动查询改进,加强大语言模式的能力和健全性 2407.01461v3 -
439 06-28 Finding the Sweet Spot: Preference Data Construction for Scaling Preference Optimization Den Sweet Spot finden: Präferenzdatenkonstruktion für Scaling Preference Optimierung 寻找甜点:扩大优惠优化的优先数据构建 2502.16825v3 -
440 06-28 Kalahi: A handcrafted, grassroots cultural LLM evaluation suite for Filipino Kalahi: Eine handgemachte, basis-kulturelle LLM-Evaluierungssuite für Filipino Kalahi:为菲律宾人设计的手工、基层文化LLM评价套套 2409.15380v4 -
441 06-28 ContextCache: Context-Aware Semantic Cache for Multi-Turn Queries in Large Language Models ContextCache: Kontext-Bewusst Semantischer Cache für Multi-Turn-Abfragen in großen Sprachmodellen 上下文缓存: 用于大语言模式多发查询的背景软件语义缓存 2506.22791v1 -
442 06-28 PhonemeFake: Redefining Deepfake Realism with Language-Driven Segmental Manipulation and Adaptive Bilevel Detection PhonemeFake: Deepfake Realism mit sprachgetriebener Segmentmanipulation und adaptiver Bilevel-Erkennung neu definieren PhonemeFake: 重新定义“深假”现实主义, 使用语言驱动的分部分操纵和适应性双级检测 2506.22783v1 -
443 06-28 Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning Lehrmodelle zu verbalisieren Belohnung Hacking in Chain-of-Thought-Reasoning 教导模型在思维链推理中言明奖励黑客行为 2506.22777v1 -
444 06-28 PromptDSI: Prompt-based Rehearsal-free Continual Learning for Document Retrieval PromptDSI: Prompt-basiert Probefreies Kontinuales Lernen für Dokument-Retrieval 快速检索:为检索文件而进行基于即时的无排练的持续学习 2406.12593v4 -
445 06-28 Decide less, communicate more: On the construct validity of end-to-end fact-checking in medicine Entscheiden Sie weniger, kommunizieren Sie mehr: Auf dem Konstrukt Gültigkeit der End-to-End-Fact-Checking in der Medizin 少决定,多交流:论医学领域端到端事实核查的建构效度 2506.20876v2 -
446 06-28 Detecting Sockpuppetry on Wikipedia Using Meta-Learning Sockpuppetry auf Wikipedia erkennen Mit Meta-Learning 在维基百科上用元学习探测袜子布料 2506.10314v2 -
447 06-28 Double Entendre: Robust Audio-Based AI-Generated Lyrics Detection via Multi-View Fusion Doppelentendre: Robuste audiobasierte KI-generierte Lyrics-Erkennung über Multi-View Fusion 双向内容: 强力音频根据 AI 生成的音频通过多视图组合探测 2506.15981v2 -
448 06-28 Jan-nano Technical Report Jan-nano Technischer Bericht Jan-nano技术报告 2506.22760v1 -
449 06-28 AI-Generated Song Detection via Lyrics Transcripts AI-Generated Song Detection via Lyrics Transcripts AI 创名歌曲通过歌词谱状探测 2506.18488v2 -
450 06-28 ScienceMeter: Tracking Scientific Knowledge Updates in Language Models ScienceMeter: Nachvollziehen wissenschaftlicher Wissensaktualisierungen in Sprachmodellen ScienceMeter: 语言模式科学知识最新跟踪 2505.24302v2 -
451 06-28 S^3cMath: Spontaneous Step-level Self-correction Makes Large Language Models Better Mathematical Reasoners S^3cMath: Spontane Step-Level Selbstkorrektur macht große Sprachmodelle besser Mathematische Reasoner S^3cMath:自发的逐步自我校正使大语言模型更好地解释数学理由 2409.01524v3 -
452 06-28 Know Your Mistakes: Towards Preventing Overreliance on Task-Oriented Conversational AI Through Accountability Modeling Kennen Sie Ihre Fehler: Auf dem Weg zu verhindern, dass übermäßige Abhängigkeit auf Task-Oriented Conversational AI durch Accountability Modeling 了解你的错误:通过建立问责制模式,努力防止过度依赖以任务为导向的对话AI 2501.10316v4 -
453 06-28 LegiGPT: Party Politics and Transport Policy with Large Language Model LegiGPT: Parteipolitik und Verkehrspolitik mit großem Sprachmodell 友好社:具有大语言模式的党政治和交通政策 2506.16692v2 -
454 06-28 How to Retrieve Examples in In-context Learning to Improve Conversational Emotion Recognition using Large Language Models? Wie kann man Beispiele im In-Context-Lernen abrufen, um die Erkennung von Konversationsgefühlen mit großen Sprachmodellen zu verbessern? 如何利用大语言模式获取学习内文中的实例, 2506.20199v2 -
455 06-28 Improved Supervised Fine-Tuning for Large Language Models to Mitigate Catastrophic Forgetting Verbessertes Supervised-Fine-Tuning für große Sprachmodelle, um Katastrophenvergessenheit zu vermeiden 改进对大语言模型改进监督的微调,以缓解灾难性遗忘 2506.09428v2 -
456 06-28 Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs Die Hitze aufdrehen: Min-p-Sampling für kreative und kohärente LLM-Ausgaben 翻开热热:创意和一致的LLM产出的最小抽样 2407.01082v7 -
457 06-28 The Translation Barrier Hypothesis: Multilingual Generation with Large Language Models Suffers from Implicit Translation Failure Die Übersetzungsbarriere Hypothese: Mehrsprachige Generation mit großen Sprachmodellen leidet unter Implizitem Übersetzungsfehler 《翻译障碍假设:具有大语言模型的多语言一代人因隐含翻译失败而遭受的痛苦》 2506.22724v1 -
458 06-28 BEST-Route: Adaptive LLM Routing with Test-Time Optimal Compute BEST-Route: Adaptives LLM Routing mit Test-Time Optimal Compute 最佳选择:用测试时最佳计算法运行的适应性LLM 2506.22716v1 -
459 06-28 Residual Matrix Transformers: Scaling the Size of the Residual Stream Residual Matrix Transformers: Skalierung der Größe des Residual Stream 残余矩阵变异器:扩大残余流的规模 2506.22696v1 -
460 06-28 Reasoner Outperforms: Generative Stance Detection with Rationalization for Social Media Reasoner Outperforms: Generative Stance Detection mit Rationalisierung für Social Media 理性外向表现:社会媒体合理化的 “ 产生式发现 “ 和 “ 社会媒体合理化 “ 。 2412.10266v2 -
461 06-28 VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs VOCABTRIM: Vokabelabgleich für effizientes spekulatives Decodieren in LLMs VOCABTRIM: 有效投机下限的词汇 2506.22694v1 -
462 06-28 Scaling Data-Constrained Language Models Skalierung von datengebundenen Sprachmodellen 受数据约束的语言模式 2305.16264v5 -
463 06-27 (5) Organize the Web: Constructing Domains Enhances Pre-Training Data Curation Organisation des Webs: Aufbau von Domains verbessert die Vorschulung von Daten-Curation 组织网络: 构建域域 增强培训前数据曲线 2502.10341v2 -
464 06-27 PriorDiffusion: Leverage Language Prior in Diffusion Models for Monocular Depth Estimation PriorDiffusion: Leverage Language Prior in Diffusionsmodellen für monookulare Tiefenschätzung 先前传播:在单人深度估算扩散模型中先使用语言 2411.16750v3 -
465 06-27 Assessing the feasibility of Large Language Models for detecting micro-behaviors in team interactions during space missions Bewertung der Machbarkeit von Large Language Models zur Erkennung von Mikroverhalten in Teaminteraktionen während Weltraummissionen 评估大语言模型在空间飞行任务期间在团队互动中探测微型行为力模型的可行性 2506.22679v1 -
466 06-27 Can LLMs Interpret and Leverage Structured Linguistic Representations? A Case Study with AMRs Kann LLMs Dolmetschen und Leverage strukturierte sprachliche Repräsentationen? Eine Fallstudie mit AMRs LLMs 能够解释和利用结构化语言代表吗? 2504.04745v4 -
467 06-27 VERA: Variational Inference Framework for Jailbreaking Large Language Models VERA: Variationaler Bezugsrahmen für Jailbreaking große Sprachmodelle VERA:破碎大型语言模型变化推断框架 2506.22666v1 -
468 06-27 Demystifying Singular Defects in Large Language Models Entmystifizieren von Singularfehlern in großen Sprachmodellen 解开大语言模型中奇异的奇特缺陷 2502.07004v2 -
469 06-27 Evaluating Hybrid Retrieval Augmented Generation using Dynamic Test Sets: LiveRAG Challenge Bewertung der Hybrid Retrieval Augmented Generation mit Dynamic Test Sets: LiveRAG Challenge 使用动态测试组评估混合回收增殖下一代:LiveRAG挑战 2506.22644v1 -
470 06-27 Temperature Matters: Enhancing Watermark Robustness Against Paraphrasing Attacks Temperaturfaktoren: Verbesserung der Robustheit des Wasserzeichens gegen paraphrasierende Angriffe 温度事项:加强水印力,防止袭击 2506.22623v1 -
471 06-27 RExBench: Can coding agents autonomously implement AI research extensions? RExBench: Können Codierer KI-Forschungserweiterungen autonom implementieren? RExBench:编码代理商能否自主实施AI研究扩展? 2506.22598v1 -
472 06-27 What Makes the Preferred Thinking Direction for LLMs in Multiple-choice Questions? Was macht die bevorzugte Denkrichtung für LLMs in Multiple-Choice-Fragen? ” 多种选择问题 “ 中LLMs的首选思维方向是什么? 2502.18435v3 -
473 06-27 Data Quality Issues in Multilingual Speech Datasets: The Need for Sociolinguistic Awareness and Proactive Language Planning Datenqualitätsfragen in mehrsprachigen Sprachdatensätzen: Der Bedarf an soziolinguistischer Sensibilisierung und proaktiver Sprachplanung 多语言语言数据集的数据质量问题:社会语言意识和前瞻性语言规划的必要性 2506.17525v2 -
474 06-27 Refining Czech GEC: Insights from a Multi-Experiment Approach Refining Czech GEC: Einblicke aus einem Multi-Experiment-Ansatz 完善捷克的GEC:从多种经验方法中得出的看法 2506.22402v1 -
475 06-27 Metadata Conditioning Accelerates Language Model Pre-training Metadatenkonditionierung beschleunigt Sprachmodell Vortraining 训练前训练模式 2501.01956v3 -
476 06-27 QuickSilver – Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization QuickSilver – Beschleunigung der LLM-Inferenz durch dynamisches Token-Halten, KV-Überspringen, Kontext-Token-Fusion und adaptive Matryoshka-Quantisierung QuickSilver – – 通过动态声调停止、 KV 跳过、 上下文声调融合和适应性 Matryoshka 量化加速LLM 推断 2506.22396v1 -
477 06-27 How to Train Long-Context Language Models (Effectively) Wie man Langkontext-Sprachenmodelle ausbildet (effektiv) 如何培训长文本语言模型(有效) 2410.02660v3 -
478 06-27 Can Video Large Multimodal Models Think Like Doubters-or Double-Down: A Study on Defeasible Video Entailment Kann Video Große multimodale Modelle denken wie Doppel-oder Doppel-Down: Eine Studie über defensible Video Entailment 视频大型多模式模型思考像质疑者或双向下:关于失败视频内容的研究 2506.22385v1 -
479 06-27 Oldies but Goldies: The Potential of Character N-grams for Romanian Texts Oldies but Goldies: Das Potential des Charakters N-Gramms für rumänische Texte 旧的但金的:罗马尼亚文本的字符N克潜力 2506.15650v2 -
480 06-27 Probabilistic Optimality for Inference-time Scaling Probabilistische Optimalität für Inferenz-Zeitskalierung 推推时间缩放的概率概率优化度 2506.22376v1 -
481 06-27 Towards Fair Rankings: Leveraging LLMs for Gender Bias Detection and Measurement Auf dem Weg zu fairen Rankings: LLM-Leveraging für Gender-Bias-Erkennung und -Messung 争取公平评分:利用 “ 性别比重 “ 检测和计量的杠杆作用LMs 2506.22372v1 -
482 06-27 Robust Detection of Watermarks for Large Language Models Under Human Edits Robuste Erkennung von Wasserzeichen für große Sprachmodelle unter menschlichen Bearbeitungen 人类版下大型语言模型水印的强力探测 2411.13868v2 -
483 06-27 Why Are Parsing Actions for Understanding Message Hierarchies Not Random? Warum sind Parsing-Maßnahmen, um Botschaftshierarchien zu verstehen, nicht zufällig? 为什么为了解信件等级而采取分析行动不是随机的? 2506.22366v1 -
484 06-27 Optimal Estimation of Watermark Proportions in Hybrid AI-Human Texts Optimale Schätzung von Wasserzeichenanteilen in Hybrid-KI-Humantexten 对混合的AI-人类文案文中水标记比例的最佳估计 2506.22343v1 -
485 06-27 Multi-Turn Code Generation Through Single-Step Rewards Multi-Turn-Code-Generierung durch Single-Step-Rewards 通过单级奖励生成多发代码 2502.20380v2 -
486 06-27 Evaluating Scoring Bias in LLM-as-a-Judge Bewertung von Bias in LLM-as-a-Richter 以LLM-as-a-Judge方式评价偏见 2506.22316v1 -
487 06-27 Conceptual Topic Aggregation Begriffliche Aggregation 专题汇总概念 2506.22309v1 -
488 06-27 Detection of Personal Data in Structured Datasets Using a Large Language Model Erkennung personenbezogener Daten in strukturierten Datensätzen mittels eines großen Sprachmodells 利用大语言模式在结构化数据集中探测个人数据 2506.22305v1 -
489 06-27 All Entities are Not Created Equal: Examining the Long Tail for Ultra-Fine Entity Typing Alle Entities sind nicht gleich: Prüfung des langen Tails für Ultra-Fine Entity Typing 并非所有实体都平等创建:检查超功能实体打字的长尾 2410.17355v2 -
490 06-27 COOCO – Common Objects Out-of-Context – Semantic Violation in Scenes: Investigating Multimodal Context in Referential Communication COOCO – Common Objects Out-of-Context – Semantische Verletzung in Szenen: Untersuchung multimodaler Kontexte in referenzieller Kommunikation COOCO – – 共同点 – – 文本外的公用物体 – – 现场的语义违反:在公用通信中调查多模式背景 2506.22274v1 -
491 06-27 KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding KITAB-Bench: Ein umfassender Multi-Domain-Benchmark für arabisches OCR und Dokumentenverständnis KITAB-Bench:阿拉伯文OCR和文件理解的综合多领域综合基准 2502.14949v2 -
492 06-27 Projected Compression: Trainable Projection for Efficient Transformer Compression Projektierte Kompression: Trainierbare Projektion für effiziente Transformer-Kompression 预计压缩:高效变压器压缩培训预测 2506.22255v1 -
493 06-27 Quantum-Enhanced Attention Mechanism in NLP: A Hybrid Classical-Quantum Approach Quantenverstärkter Aufmerksamkeitsmechanismus in NLP: Hybrid-Klassisch-Quantum-Ansatz NLP中加强的注意机制:分类-量子混合办法 2501.15630v2 -
494 06-27 Fine-Tuning MIDI-to-Audio Alignment using a Neural Network on Piano Roll and CQT Representations Feintuning MIDI-to-Audio Alignment mit einem neuralen Netzwerk auf Klavierrolle und CQT-Darstellungen 利用钢琴卷和CQT代表的神经网络,将MIDI至Audio对齐 2506.22237v1 -
495 06-27 Leveraging In-Context Learning for Political Bias Testing of LLMs Leveraging In-Context Learning for Political Bias Testing of LLMs 利用知识学习促进LLMs的政治偏见测试 2506.22232v1 -
496 06-27 TableLoRA: Low-rank Adaptation on Table Structure Understanding for Large Language Models TableLoRA: Niedrigrank-Anpassung an das Verständnis der Tabellenstruktur für große Sprachmodelle 表LORA:关于大语言模式表格结构理解的低调适应 2503.04396v2 -
497 06-27 Plant in Cupboard, Orange on Rably, Inat Aphone. Benchmarking Incremental Learning of Situation and Language Model using a Text-Simulated Situated Environment Plant in Cupboard, Orange auf Rably, Inat Aphone. Benchmarking Incremental Lernen von Situation und Sprachmodell mit einer text-simulierten Umgebung Inat Aphone. 使用文本模拟比照环境对状况和语言模式逐步学习进行基准评估 2502.11733v3 -
498 06-27 Exploring Modularity of Agentic Systems for Drug Discovery Erforschung der Modularität von Wirkstoffsystemen für die Drogenentdeckung 探索药物发现剂系统模式 2506.22189v1 -
499 06-27 LLM as GNN: Graph Vocabulary Learning for Text-Attributed Graph Foundation Models LLM als GNN: Graph Vocabulary Learning für text-Attributed Graph Foundation Models 作为GNN的LLM:文字图表基础模型图表词汇学习 2503.03313v2 -
500 06-27 Refining Salience-Aware Sparse Fine-Tuning Strategies for Language Models Verfeinerung von Salience-Aware Sparse Feintuning-Strategien für Sprachmodelle 改进语言模型的显著性感知稀疏微调策略 2412.13488v2 -
501 06-27 MisinfoTeleGraph: Network-driven Misinformation Detection for German Telegram Messages MisinfoTeleGraph: Netzwerkgesteuerte Fehlinformationserkennung für deutsche Telegrammnachrichten MisinfoTeleGraph:德国电讯用网络驱动的错误信息探测 2506.22529v1 -
502 06-27 Training Language Model to Critique for Better Refinement Training Sprachmodell zu Kritik für eine bessere Verfeinerung 改进改进工作简化语言培训模式培训语言模式 2506.22157v1 -
503 06-27 MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot MedRAG: Verbesserung der retrieval-augmentierten Generation mit Wissen Graph-Elicited Reasoning für Healthcare Copilot MedRAG:加强利用知识图图获取保健理由的回收养殖业 2502.04413v2 -
504 06-27 Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX Auge des Urteils: Die Bewertung der russischsprachigen LLMs mit POLLUX 判断之眼:用POLLUX对讲俄语的LLMs的评价进行分解 2505.24616v3 -
505 06-27 SAGE: Spliced-Audio Generated Data for Enhancing Foundational Models in Low-Resource Arabic-English Code-Switched Speech Recognition SAGE: Spliced-Audio Generated Data for Enhanced Foundational Models in Low-Resource Arabisch-Englisch Code-Switched Speech Recognition SAGE:用于加强低资源阿拉伯语-英语代码转换语音识别中基础模型的 2506.22143v1 -
506 06-27 DAPFAM: A Domain-Aware Patent Retrieval Dataset Aggregated at the Family Level DAPFAM: A Domain-Aware Patent Retrieval Dataset Aggregiert auf Familienebene DAPFAM: 家庭一级综合域-软件专利检索数据集 2506.22141v1 -
507 06-27 iPrOp: Interactive Prompt Optimization for Large Language Models with a Human in the Loop iPrOp: Interaktive Prompt-Optimierung für große Sprachmodelle mit einem Menschen in der Schleife iPrOp: 大语言模型与环中人类互动快速优化 2412.12644v2 -
508 06-27 Llama See, Llama Do: A Mechanistic Perspective on Contextual Entrainment and Distraction in LLMs Llama See, Llama Do: Eine mechanistische Perspektive auf die kontextabhängige Beanspruchung und Ablenkung in LLMs Llama See, Llama Do:LLM中背景教育和遭遇的机械视角 2505.09338v2 -
509 06-27 Identifying a Circuit for Verb Conjugation in GPT-2 Identifizierung eines Kreises für Verbkonjugation in GPT-2 在 GPT-2 中确定 Verb 混和的电路 2506.22105v1 -
510 06-27 English K_Quantization of LLMs Does Not Disproportionately Diminish Multilingual Performance Englische K_Quantisierung von LLMs nicht disproportional diminish Mehrsprachige Leistung LLM的英语K_量化不会不成比例地削弱多语言性能 2503.03592v3 -
511 06-27 Beyond Fixed Length: Bucket Pre-training is All You Need Jenseits der festen Länge: Eimer Vor-Training ist alles, was Sie brauchen 超过固定长度: 巴克特预训练是你们需要的 2407.07495v2 -
512 06-27 Involvement drives complexity of language in online debates Einbeziehung treibt die Komplexität der Sprache in Online-Debatten an 在线辩论语言的复杂性驱动参与驱动因素 2506.22098v1 -
513 06-27 Large Language Models in Argument Mining: A Survey Große Sprachmodelle im Argumentbergbau: Eine Umfrage 争议采矿大语言模型:调查 2506.16383v2 -
514 06-27 Benchmarking Vision Language Models on German Factual Data Benchmarking von Vision Language Models auf deutschen Factual Data 制定德国事实数据愿景语言模型基准 2504.11108v2 -
515 06-27 VLM@school – Evaluation of AI image understanding on German middle school knowledge VLM@school – Auswertung des KI-Bildverständnisses über deutsche Mittelschulkenntnisse VLM@school – – 评价AI关于德国中学知识的图像理解 2506.11604v2 -
516 06-27 Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency Jailbreaking Multimodale große Sprachmodelle über Shuffle Inkonsistenz 通过破碎不连贯的打碎和不连贯的多式多式大语言模型 2501.04931v2 -
517 06-27 MDC-R: The Minecraft Dialogue Corpus with Reference MDC-R: Der Minecraft Dialogue Corpus mit Referenz MDC-R: 采矿对话公司(参考) 2506.22062v1 -
518 06-27 Lost at the Beginning of Reasoning Verloren am Anfang der Vernunft 迷失在理性的开始 2506.22058v1 -
519 06-27 Language in Vivo vs. in Silico: Size Matters but Larger Language Models Still Do Not Comprehend Language on a Par with Humans Due to Impenetrable Semantic Reference Sprache in Vivo vs. in Silico: Größenelemente aber größere Sprachmodelle verstehen die Sprache noch nicht auf einem Par mit Menschen aufgrund undurchdringlicher semantischer Referenz Vivo语与Silico语:大小问题,但大语言模型仍然不理解人与人之间的语言,因为不可排除的语义参考 2404.14883v3 -
520 06-27 Decoding Machine Translationese in English-Chinese News: LLMs vs. NMTs Decoding Machine Translationese in Englisch-Chinesisch Nachrichten: LLMs vs. NMTs 《中英新闻:LLMS诉NMTs》 2506.22050v1 -
521 06-27 ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows ScienceBoard: Bewertung multimodaler autonomer Agenzien in realistischen wissenschaftlichen Workflows 科学理事会:评估现实科学工作流程中的多式联运自治机构 2505.19897v2 -
522 06-27 Can Peter Pan Survive MT? A Stylometric Study of LLMs, NMTs, and HTs in Children’s Literature Translation Kann Peter Pan MT überleben? Eine stylometrische Studie von LLMs, NMTs und HTs in der Kinderliteratur Übersetzung Peter Pan Pan Survive MT? 儿童文学翻译中LLMS、NMTs和HTs的理学研究 2506.22038v1 -
523 06-27 Towards Reproducible LLM Evaluation: Quantifying Uncertainty in LLM Benchmark Scores Auf dem Weg zu einer reproduzierbaren LLM-Bewertung: Quantifizierung der Unsicherheit in LLM-Benchmark-Scores 走向可复制的LLM评价:量化LLM基准分数中的不确定性 2410.03492v2 -
524 06-27 ACORD: An Expert-Annotated Retrieval Dataset for Legal Contract Drafting ACORD: Ein sachverständiger Datensatz für die Erstellung von Verträgen ACORD: 法律合同起草专家附加说明的检索数据集 2501.06582v3 -
525 06-27 ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference ChunkKV: Semantisch-bewahrende KV-Cache-Kompression für effiziente Lang-Kontext-LLM-Inferenz ChunkKV: 为高效长文本LLM 推断而保存 KV缓存压缩 2502.00299v3 -
526 06-27 Robust and Efficient Autoregressive Speech Synthesis with Dynamic Chunk-wise Prediction Policy Robuste und effiziente autoregressive Sprachsynthese mit dynamischer Chunk-weiser Vorhersagepolitik 强力和高效的自动递减语音合成,带有动态整节预测政策 2506.22023v1 -
527 06-27 MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration MMBoundary: MLLM-Wissensgrenzen-Bewusstsein durch vernünftige Schritt-Vertrauens-Kalibrierung MMBoundary:通过合理步骤信任校准提高MLLM知识边界认识 2505.23224v3 -
528 06-27 Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs Kann den Wald für die Bäume nicht sehen: Aufruf von Heuristik und Biase zu Elicit Irrationale Wahlmöglichkeiten von LLMs 无法看到树的森林: 引用光量和比喻来选择LLM 的不合理选择 。 2505.02862v3 -
529 06-27 Advancing Language Multi-Agent Learning with Credit Re-Assignment for Interactive Environment Generalization Advancing Language Multi-Agent Learning mit Kredit-Re-Zuweisung für interaktive Umwelt Verallgemeinerung 推进多语言多机构学习,通过信用再分配促进互动环境通用化 2502.14496v2 -
530 06-27 OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis OS-Genese: Automatisieren der GUI Agent Trajectory Construction über Reverse Task Synthesis OS-主题:通过反向任务合成实现图形界面代理轨迹构造自动化 2412.19723v3 -
531 06-27 Detecting Knowledge Boundary of Vision Large Language Models by Sampling-Based Inference Erkenntnisgrenzen von Visionsgrößen-Sprachmodellen durch Sampling-basierte Schlussfolgerungen erkennen 通过基于抽样的推断,检测大语言视觉模型的知识范围 2502.18023v2 -
532 06-27 Federated Data-Efficient Instruction Tuning for Large Language Models Federated Data-Efficient Instruction Tuning für große Sprachmodelle 大语言模式联邦数据效率指示图示 2410.10926v2 -
533 06-27 EasyDistill: A Comprehensive Toolkit for Effective Knowledge Distillation of Large Language Models EasyDistill: Ein umfassendes Toolkit für effektive Wissensdestillation von großen Sprachmodellen 简易蒸馏:大语言模式有效知识蒸馏综合工具箱 2505.20888v2 -
534 06-27 Analyzing and Fine-Tuning Whisper Models for Multilingual Pilot Speech Transcription in the Cockpit Analysieren und Feintuning-Flüsternmodelle für mehrsprachige Pilot-Sprachtranskription im Cockpit 分析并精细调校车舱多语种试验性语音翻译多语种试听模式 2506.21990v1 -
535 06-27 BeamLLM: Vision-Empowered mmWave Beam Prediction with Large Language Models BeamLLM: Vision-Empowered mmWave Beam Prediction mit großen Sprachmodellen BeamLLM: 具有大语言模型的视觉-电子动力毫米 2503.10432v2 -
536 06-27 Don’t Trust Generative Agents to Mimic Communication on Social Networks Unless You Benchmarked their Empirical Realism Vertrauen Sie Generative Agents nicht auf die Kommunikation über soziale Netzwerke, es sei denn, Sie haben ihren Empirischen Realismus Benchmarking 不要相信社会网络移动通信的创造者,除非以其经验现实主义为基准。 2506.21974v1 -
537 06-27 STAIR: Improving Safety Alignment with Introspective Reasoning STAIR: Verbesserung der Sicherheitsausrichtung mit introspektiver Begründung STAIR: 提高安全一致性,以内反省理由 2502.02384v2 -
538 06-27 Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses Verbesserung der Strategien des Jailbreaks: Ein hybrider Ansatz, um LLM-Verletzungen auszunutzen und moderne Verteidigungen zu umgehen 推进破牢战略:利用LLM脆弱性和绕过现代防御的混合办法 2506.21972v1 -
539 06-27 ShifCon: Enhancing Non-Dominant Language Capabilities with a Shift-based Multilingual Contrastive Framework ShifCon: Verbesserung nicht-dominanter Sprachfähigkeiten mit einem Shift-basierten mehrsprachigen Kontrastrahmen ShifCon:利用基于轮班的多语言竞争框架,提高非主要语言能力 2410.19453v6 -
540 06-27 More Vulnerable than You Think: On the Stability of Tool-Integrated LLM Agents Schwacher als Sie denken: Zur Stabilität von werkzeugintegrierten LLM-Agenten 比你想象的更加脆弱:关于工具集成LLM剂稳定问题 2506.21967v1 -
541 06-27 Using Large Language Models to Suggest Informative Prior Distributions in Bayesian Statistics Große Sprachmodelle verwenden, um informative vorherige Distributionen in Bayesian Statistics vorzuschlagen Bayesian统计中利用大语言模型建议事先知情分配 2506.21964v1 -
542 06-27 PapersPlease: A Benchmark for Evaluating Motivational Values of Large Language Models Based on ERG Theory PapersPlease: Ein Benchmark für die Bewertung von Motivationswerten von großen Sprachmodellen basierend auf der ERG-Theorie 请文件:根据紧急和紧急和紧急需要理论评价大语言模式动力价值的基准 2506.21961v1 -
543 06-27 EUR-USD Exchange Rate Forecasting Based on Information Fusion with Large Language Models and Deep Learning Methods EUR-USD Wechselkursprognose basierend auf Informationsfusion mit großen Sprachmodellen und Deep-Learning-Methoden 基于与大语言模式和深学习方法信息融合的信息的汇率预测 2408.13214v2 -
544 06-27 A Survey of Large Language Models in Psychotherapy: Current Landscape and Future Directions Eine Umfrage zu großen Sprachmodellen in der Psychotherapie: Aktuelle Landschaft und zukünftige Richtungen 心理治疗中大语言模式调查:当前景观和未来方向 2502.11095v3 -
545 06-27 Dynamic Adaptive Rank Space Exploration for Efficient Sentiment Analysis with Large Language Models Dynamische adaptive Rank Space Exploration für effiziente Sentiment-Analyse mit großen Sprachmodellen 利用大语言模型进行高效情感分析的空间探索 2410.16589v2 -
546 06-27 Embedding-based Approaches to Hyperpartisan News Detection Einbetten-basierte Ansätze zu Hyperparteien-Nachrichten-Erkennung 以嵌入式办法探测超党派新闻 2501.01370v2 -
547 06-27 PQ-GCN: Enhancing Text Graph Question Classification with Phrase Features PQ-GCN: Verbesserung der Textgraphen-Frageklassifikation mit Phrase-Features PQ-GCN:用词组特征加强文本图问题分类 2409.02481v3 -
548 06-27 LRP4RAG: Detecting Hallucinations in Retrieval-Augmented Generation via Layer-wise Relevance Propagation LRP4RAG: Halluzinationen in der retrieval-angereicherten Generation mittels schichtweiser Relevanzvermehrung erkennen LRP4RAG:通过多层相关性传导探测回溯性养殖中的幻觉 2408.15533v3 -
549 06-27 Dynamic Adaptive Optimization for Effective Sentiment Analysis Fine-Tuning on Large Language Models Dynamische Adaptive Optimierung für effektive Sentimentanalyse Feintuning bei großen Sprachmodellen 动态优化优化,对大语言模型进行有效的感性分析,对大语言模型进行微调 2408.11856v3 -
550 06-27 ARAG: Agentic Retrieval Augmented Generation for Personalized Recommendation ARAG: Agentische Retrieval Augmented Generation für Personalisierte Empfehlung ARAG: 个人化推荐的 “ 危险回收增加的一代人 “ 2506.21931v1 -
551 06-27 HyReC: Exploring Hybrid-based Retriever for Chinese HyReC: Hybrid-basiertes Retriever für Chinesen erforschen HyReC: 探索以混合方式为中国人寻找 2506.21913v1 -
552 06-27 AutoMixer: Checkpoint Artifacts as Automatic Data Mixers AutoMixer: Checkpoint-Artefakte als automatische Datenmischer 自动混音器: 将检查点异形作为自动数据混音器 2506.21910v1 -
553 06-27 Collective Reasoning Among LLMs: A Framework for Answer Validation Without Ground Truth Kollektive Begründung unter LLMs: Ein Rahmen für die Validierung von Antworten ohne Grundwahrheit LLM女士的集体理由:无事实根据的回答验证框架 2502.20758v2 -
554 06-27 Round Attention: A Novel Round-Level Attention Mechanism to Accelerate LLM Inference Runde Aufmerksamkeit: Ein neuartiger Aufmerksamkeitsmechanismus auf runder Ebene, um die LLM-Inferenz zu beschleunigen 圆桌关注:加速LLM推断的新一轮圆桌关注机制 2502.15294v3 -
555 06-27 A Dual-Layered Evaluation of Geopolitical and Cultural Bias in LLMs Eine Dual-Layer-Bewertung von geopolitischen und kulturellen Bias in LLMs 对LLM中地缘政治和文化偏见的双重评价 2506.21881v1 -
556 06-27 Grammar and Gameplay-aligned RL for Game Description Generation with LLMs Grammatik und Gameplay-aligned RL für Game Description Generation mit LLMs 使用 LLM 生成游戏描述生成的语法和游戏比对RL 2503.15783v2 -
557 06-27 Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation Haben Vision-Sprache Modelle interne Weltmodelle? Auf dem Weg zu einer Atom-Bewertung 愿景-语言模型有内部世界模型吗? 2506.21876v1 -
558 06-27 WildSpeech-Bench: Benchmarking Audio LLMs in Natural Speech Conversation WildSpeech-Bench: Benchmarking von Audio-LLMs im natürlichen Sprachgespräch WildSpeech-Bench:为自然演讲对话中的音频LMs设定基准 2506.21875v1 -
559 06-27 Time is On My Side: Dynamics of Talk-Time Sharing in Video-chat Conversations Zeit ist auf meiner Seite: Dynamik des Gesprächs-Zeit-Sharing in Video-Chat-Gesprächen 时间就在我身边:视频聊天中的谈话时间分享动态 2506.20474v2 -
560 06-27 Bridging Compositional and Distributional Semantics: A Survey on Latent Semantic Geometry via AutoEncoder Bridging Compositional and Distributional Semantics: Eine Umfrage zur latenten Semantischen Geometrie über AutoEncoder 搭桥构成和分布式语义学:通过 AutoEncoder 进行边端语义几何测量调查 2506.20083v2 -
561 06-27 RiverEcho: Real-Time Interactive Digital System for Ancient Yellow River Culture RiverEcho: Real-Time Interactive Digital System für die alte gelbe Flusskultur RiverEcho:古黄河文化实时互动数字系统 2506.21865v1 -
562 06-27 DeepTalk: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE DeepTalk: Auf dem Weg zu nahtloser und intelligenter Sprachinteraktion mit adaptiver Modalität-spezifischer MoE 深谈:实现与适应型模式具体部的无缝和智能语音互动 2506.21864v1 -
563 06-27 Derivational Probing: Unveiling the Layer-wise Derivation of Syntactic Structures in Neural Language Models Derivational Probing: Enthüllen der schichtweisen Ableitung syntaktischer Strukturen in neuralen Sprachmodellen 派生实验:神经语言模型中同步教学结构图层和图层推算 2506.21861v1 -
564 06-27 Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation Leveraging Online-Olympiade-Level-Mathematik Probleme für LLMs Training und Kontaminierung-Resistent Evaluation 利用在线奥林匹克层面的数学问题促进LLM女士的培训和污染 – – 评估 2501.14275v2 -
565 06-27 The Consistency Hypothesis in Uncertainty Quantification for Large Language Models Die Kohärenzhypothese in der Unsicherheitsquantifizierung für große Sprachmodelle 《大语言模型不确定性量化不确定性的一致假设》 2506.21849v1 -
566 06-27 3Description: An Intuitive Human-AI Collaborative 3D Modeling Approach 3Beschreibung: Ein intuitiver human-AI-Kollaborativer 3D-Modellierungsansatz 3 说明:直观的人类-大赦国际合作3D建模方法 2506.21845v1 -
567 06-27 MMCR: Benchmarking Cross-Source Reasoning in Scientific Papers MMCR: Benchmarking quellübergreifender Begründungen in wissenschaftlichen Arbeiten MMCR: 科学文件的跨来源理由基准 2503.16856v2 -
568 06-27 PARSI: Persian Authorship Recognition via Stylometric Integration PARSI: Persische Anerkennung durch stylometrische Integration PARSI: 通过星体集成承认波斯语授权 2506.21840v1 -
569 06-27 GenEscape: Hierarchical Multi-Agent Generation of Escape Room Puzzles GenEscape: Hierarchische Multi-Agenten-Generation von Escape Room Puzzles GenEscape: 相向室谜题的等级化多代理生成 2506.21839v1 -
570 06-27 Strengthening False Information Propagation Detection: Leveraging SVM and Sophisticated Text Vectorization Techniques in comparison to BERT Stärkung der falschen Information Propagation Detection: Leveraging SVM und ausgefeilte Text-Vektorisierungstechniken im Vergleich zu BERT 加强虚假信息传播探测:利用SVM和高频文本矢量技术与BERT相比 2411.12703v2 -
571 06-27 RLSF: Fine-tuning LLMs via Symbolic Feedback RLSF: Feinjustierende LLMs über symbolisches Feedback RLSF:通过符号反馈对LLMs进行微调 2405.16661v3 -
572 06-26 (4) Exploring the change in scientific readability following the release of ChatGPT Erforschung der Veränderung der wissenschaftlichen Lesbarkeit nach der Veröffentlichung von ChatGPT 探讨在ChatGPT发布后科学可读性的变化 2506.21825v1 -
573 06-26 Exploring the Structure of AI-Induced Language Change in Scientific English Erforschung der Struktur des KI-induzierten Sprachwandels im wissenschaftlichen Englisch 探索AI-引自AI的英语科学语言变化结构 2506.21817v1 -
574 06-26 Towards Transparent AI: A Survey on Explainable Large Language Models Auf dem Weg zu transparenter KI: Eine Umfrage zu erklärbaren großen Sprachmodellen 走向透明AI:关于可解释的大型语言模式的调查 2506.21812v1 -
575 06-26 A suite of allotaxonometric tools for the comparison of complex systems using rank-turbulence divergence Eine Reihe allotaxonometrischer Werkzeuge für den Vergleich komplexer Systeme mit Rang-Turbulenz-Divergenz 一套用于比较复杂系统、使用降压扰动差异比较的 Alsotalogon 测量工具套套套 2506.21808v1 -
576 06-26 CitySim: Modeling Urban Behaviors and City Dynamics with Large-Scale LLM-Driven Agent Simulation CitySim: Modellierung städtischer Verhaltensmuster und Stadtdynamik mit großformatiger LLM-Driven Agent Simulation 城市模拟:利用大型LLM-驱动剂模拟模拟模型模拟城市行为和城市动态 2506.21805v1 -
577 06-26 Offensive Language Detection on Social Media Using XLNet Offensive Spracherkennung auf Social Media mit XLNet 使用XLNet在社交媒体上发现攻击性语言 2506.21795v1 -
578 06-26 Evaluating List Construction and Temporal Understanding capabilities of Large Language Models Bewertung der Listenkonstruktion und des zeitlichen Verständnisses von großen Sprachmodellen 评价大语言模型的建筑和时间理解能力清单 2506.21783v1 -
579 06-26 Are Triggers Needed for Document-Level Event Extraction? Sind Auslöser für die Dokument-Level-Ereignisextraktion erforderlich? 需要触发文件级活动吗? 2411.08708v2 -
580 06-26 (Fact) Check Your Bias (Fakt) Prüfen Sie Ihre Bias (事实) 检查您的比亚 2506.21745v1 -
581 06-26 Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought Bias-Augmented Consistency Training reduziert biased Reasoning in Chain-of-Thought 避免和强化的一致培训减少在寻求的连锁努力中造成不利和 不利理由 2403.05518v3 -
582 06-26 Identifying Speaker Information in Feed-Forward Layers of Self-Supervised Speech Transformers Identifizierung von Sprecherinformationen in Feed-Forward-Schichten von selbstüberwachten Sprachtransformatoren 识别自我支持的语音变换者向往进进言层中的演讲者信息 2506.21712v1 -
583 06-26 End-to-End Long Document Summarization using Gradient Caching End-to-End-Langdokumentzusammenfassung mit Gradient Caching 使用梯度缓存对端到 End 长文档的缩写 2501.01805v2 -
584 06-26 Introducing MAPO: Momentum-Aided Gradient Descent Prompt Optimization Einführung von MAPO: Momentum-Aided Gradient Descent Prompt Optimization 介绍MAPO: 动力-援助渐变人后裔快速优化 2410.19499v3 -
585 06-26 Multimodal Misinformation Detection Using Early Fusion of Linguistic, Visual, and Social Features Multimodale Fehlinformationserkennung mittels frühzeitiger Fusion sprachlicher, visueller und sozialer Merkmale 利用语言、视觉和社会特征的早期融合来进行多模式错误信息探测 2507.01984v1 -
586 06-26 ANUBHUTI: A Comprehensive Corpus For Sentiment Analysis In Bangla Regional Languages ANUBHUTI: Ein umfassender Corpus für die Sentimentanalyse in Bangla Regionalsprachen ANUBUHUTI:孟加拉语地区语言中感应分析综合整体体 2506.21686v1 -
587 06-26 Cohort Retrieval using Dense Passage Retrieval Cohort Retrieval mit Dense Passage Retrieval 使用毒气通过通过访问检索的 Cohort 获取地址 2507.01049v1 -
588 06-26 Do We Really Need GNNs with Explicit Structural Modeling? MLPs Suffice for Language Model Representations Brauchen wir wirklich GNNs mit expliziter Strukturmodellierung? MLPs Mangel an Sprachmodelldarstellungen 我们真的需要具有明确结构模型的GNNs吗? 2506.21682v1 -
589 06-26 Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs Feinkörnige Preference-Optimierung verbessert räumliche Vernunft in VLMs 优化优化优化优化改进甚低LMs的空间理性 2506.21656v1 -
590 06-26 Data Efficacy for Language Model Training Dateneffizienz für Sprachmodellschulungen 语文示范培训的数据效率 2506.21545v1 -
591 06-26 “What’s Up, Doc?”: Analyzing How Users Seek Health Information in Large-Scale Conversational AI Datasets “Was ist los, Doc?”: Analysieren, wie Nutzer Gesundheitsinformationen in groß angelegten KI-Datensätzen suchen “怎么了,医生?” :分析用户如何在大型对话的AI数据集中寻求健康信息。 2506.21532v1 -
592 06-26 OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages OpenNER 1.0: Standardisierte Open-Access-Datensätze für die Entity-Erkennung in 50+ Sprachen OpenNER 1.0:标准化的开放获取实体识别数据集,50+语言 2412.09587v2 -
593 06-26 Weak-to-Strong GraphRAG: Aligning Weak Retrievers with Large Language Models for Graph-based Retrieval Augmented Generation Weak-to-Strong GraphRAG: Richten von schwachen Retrievern mit großen Sprachmodellen für graphisch basierte Retrieval Augmented Generation 弱至强强石图RAG:与基于图的回取增代大语言模型对齐 2506.22518v1 -
594 06-26 skLEP: A Slovak General Language Understanding Benchmark sklep: Ein slowakisches allgemeines Sprachverständnis Benchmark SkLEP:斯洛伐克一般语言理解基准 2506.21508v1 -
595 06-26 Enhancing User Engagement in Socially-Driven Dialogue through Interactive LLM Alignments Verbesserung des Nutzerengagements im sozial-gesteuerten Dialog durch interaktive LLM-Alignments 通过互动LLM调整,加强用户参与社会驱动对话 2506.21497v1 -
596 06-26 Bridging Offline and Online Reinforcement Learning for LLMs Überbrückung Offline- und Online-Verstärkungslernen für LLMs 为LLMMs搭桥离线和在线加强学习 2506.21495v1 -
597 06-26 Prompting with Phonemes: Enhancing LLMs’ Multilinguality for Non-Latin Script Languages Mit Phonemes: Mehrsprachigkeit von LLMs für nicht-lateinische Script-Sprachen verbessern 以电话提示:提高LLMS的非拉丁文拼写语言多重语言质量 2411.02398v3 -
598 06-26 Logios : An open source Greek Polytonic Optical Character Recognition system Logios : Ein offenes griechisches Polytonisches optisches Zeichenerkennungssystem Logios: 开放源码希腊多元光学特征识别系统 2506.21474v1 -
599 06-26 TopK Language Models TopK-Sprachenmodelle 顶 K 语言模式 2506.21468v1 -
600 06-26 Aligning Spoken Dialogue Models from User Interactions Ausrichten von gesprochenen Dialogmodellen aus Benutzerinteraktionen 校对用户互动中的口语对话框模型 2506.21463v1 -
601 06-26 Spatial Mental Modeling from Limited Views Räumliche mentale Modellierung aus begrenzten Ansichten 根据有限观点进行空间精神建模 2506.21458v1 -
602 06-26 Text2Cypher Across Languages: Evaluating Foundational Models Beyond English Text2Cypher Across Sprachen: Bewertung von Grundmodellen jenseits des Englischen 跨语言文本:评价超越英语的基础模型 2506.21445v1 -
603 06-26 Domain Knowledge-Enhanced LLMs for Fraud and Concept Drift Detection Domänenwissen-verbesserte LLMs für Betrug und Konzept-Drift-Erkennung 防止欺诈和概念漂流探测的有知识增强的有限LMs 2506.21443v1 -
604 06-26 Explainability of Large Language Models using SMILE: Statistical Model-agnostic Interpretability with Local Explanations Erklärbarkeit großer Sprachmodelle mit SMILE: Statistische Modell-agnostische Interpretierbarkeit mit lokalen Erklärungen 使用SMILE解释大语言模型的可解释性:统计模型 – – 与当地解释的可解释性 2505.21657v3 -
605 06-26 Scalable Bayesian Low-Rank Adaptation of Large Language Models via Stochastic Variational Subspace Inference Skalierbare Bayesische Low-Rank-Anpassung von großen Sprachmodellen über stochastische Variations-Subraum-Inferenz 通过Stochastic变异性子空间推断,对大语言模型进行可缩放的Bayesian低Rank 2506.21408v1 -
606 06-26 DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation DiffuCoder: Maskierte Difffusionsmodelle für die Codegenerierung verstehen und verbessern DiffuCoder:理解和改进代代码生成的蒙面传播模式 2506.20639v2 -
607 06-26 Hybrid Deep Learning and Signal Processing for Arabic Dialect Recognition in Low-Resource Settings Hybrides Deep Learning und Signalverarbeitung für die arabische Dialekterkennung in Low-Resource-Einstellungen 低资源设置中阿拉伯语语音识别的混合深深学习和信号处理 2506.21386v1 -
608 06-26 Leveraging LLM-Assisted Query Understanding for Live Retrieval-Augmented Generation Leveraging LLM-Assisted Query Understanding for Live Retrieval-Augmented Generation 利用LLM协助的对活检索一代人查询了解 2506.21384v1 -
609 06-26 Structuralist Approach to AI Literary Criticism: Leveraging Greimas Semiotic Square for Large Language Models Structuralist Approach to AI Literary Criticism: Leveraging Greimas Semiotic Square for Large Language Models AI 文学批评主义的结构性方法:大语言模型利用Greimas半语言广场 2506.21360v1 -
610 06-26 Latent Prototype Routing: Achieving Near-Perfect Load Balancing in Mixture-of-Experts Latent Prototype Routing: Erzielen einer nahezu perfekten Lastabgleichung in Mixture-of-Experts 原型原型路由:在混合专家中实现近效果负载平衡 2506.21328v1 -
611 06-26 Exploring Adapter Design Tradeoffs for Low Resource Music Generation Erforschung von Adapter-Design-Tradeoffs für Low Resource Music Generation 探索用于低资源音乐制作的适应设计取舍 2506.21298v1 -
612 06-26 Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models Erkennung von Verweisen auf Ausdrücke im visuell begründeten Dialog mit autoregressiven Sprachmodellen 与自动递减语言模型进行视觉基础对话中检测引用表达式 2506.21294v1 -
613 06-26 Small Encoders Can Rival Large Decoders in Detecting Groundedness Kleine Encoder können große Decoder bei der Erkennung von Erdlichkeit rivalisieren 在地面探测中能够使大型分离器在探测地面时发生迭接 2506.21288v1 -
614 06-26 Thinkless: LLM Learns When to Think Denklos: LLM lernt, wann man denkt 无思想:LLM学习思考时间 2505.13379v2 -
615 06-26 Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning Double-Checker: Bessere Begründung von langsam denkenden LLMs über selbstkritische Feinsteuerung 双重检查者:通过自批评性微调,加强慢思考低迷LMs的理由 2506.21285v1 -
616 06-26 HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context HumanOmniV2: Vom Verständnis zur Omni-Modalen Vernunft mit Kontext HumanOmniV2:从理解到以上下文为根据的全方位模式 2506.21277v1 -
617 06-26 Can “consciousness” be observed from large language model (LLM) internal states? Dissecting LLM representations obtained from Theory of Mind test with Integrated Information Theory and Span Representation analysis Kann “Bewusstsein” von großen Sprachmodellen (LLM) innerhalb von Zuständen beobachtet werden? 从大型语言模型内部状态观察到“意识”吗?通过综合信息理论和全方位代表分析,将从思维理论测试中获得的LLM表示法解析 2506.22516v1 -
618 06-26 Cat and Mouse – Can Fake Text Generation Outpace Detector Systems? Katze und Maus – Kann die Textgenerierung ausfallende Detektorsysteme fälschen? 猫和老鼠 – – 假文本生成能否超越检测器系统? 2506.21274v1 -
619 06-26 A Troublemaker with Contagious Jailbreak Makes Chaos in Honest Towns Ein Troublemaker mit ansteckenden Jailbreak macht Chaos in ehrlichen Städten 一个麻烦制造者 与贪婪的监狱破碎 制造混乱 在诚实的城镇 2410.16155v2 -
620 06-26 DiLoCoX: A Low-Communication Large-Scale Training Framework for Decentralized Cluster DiLoCoX: Ein kommunikationsarmer groß angelegter Ausbildungsrahmen für dezentralisierte Cluster DILOCOX:权力下放小组的低通信大范围培训框架 2506.21263v1 -
621 06-26 Simulating Hard Attention Using Soft Attention Simulation der harten Aufmerksamkeit mit weicher Aufmerksamkeit 使用软关注模拟硬关注 2412.09925v2 -
622 06-26 Agent-RewardBench: Towards a Unified Benchmark for Reward Modeling across Perception, Planning, and Safety in Real-World Multimodal Agents Agent-RewardBench: Auf dem Weg zu einem einheitlichen Benchmark für Prämienmodellierung über Wahrnehmung, Planung und Sicherheit in multimodalen Real-World-Agenten Agent-RewardBench:建立一个统一基准,用于在现实世界多式联运代理中建立跨认知、规划和安全概念、规划与安全的奖励模型 2506.21252v1 -
623 06-26 Capturing Style in Author and Document Representation Stil in der Autor- und Dokumentdarstellung erfassen 在作者和文件代表中获取样式 2407.13358v2 -
624 06-26 Enhancing Automatic Term Extraction with Large Language Models via Syntactic Retrieval Automatische Termextraktion mit großen Sprachmodellen durch syntactic Retrieval verbessern 通过同步检索增强使用大语言模型的自动定期抽取功能 2506.21222v1 -
625 06-26 Complexity-aware fine-tuning Komplexitätsbewusste Feinabstimmung 复杂度认知微调 2506.21220v1 -
626 06-26 Unveiling Causal Reasoning in Large Language Models: Reality or Mirage? Kausale Vernunft in großen Sprachmodellen enthüllen: Realität oder Mirage? 大语言模型中未解的因果理由:现实还是幻影? 2506.21215v1 -
627 06-26 TAPS: Tool-Augmented Personalisation via Structured Tagging TAPS: Tool-Augmented Personalisierung durch strukturiertes Tagging TAPS: 通过结构拖网提高工具的个性化 2506.20409v2 -
628 06-26 LLM-Based Human-Agent Collaboration and Interaction Systems: A Survey LLM-basierte human-agente Kooperations- und Interaktionssysteme: Eine Umfrage 以LLM为基础的人类-机构协作和互动系统:调查 2505.00753v4 -
629 06-26 Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks Beibehaltung des MTEB: Langfristige Nutzungsfähigkeit und Reproduzierbarkeit von Einbettungs-Benchmarks 维持MDEB:实现长期使用和可复制嵌入基准 2506.21182v1 -
630 06-26 Compressed and Smooth Latent Space for Text Diffusion Modeling Komprimierter und glatter Latent-Raum für Text-Diffusionsmodellierung 压缩和平滑的文本传播中缓流空间模型模型 2506.21170v1 -
631 06-26 CVC: A Large-Scale Chinese Value Rule Corpus for Value Alignment of Large Language Models CVC: Eine groß angelegte chinesische Wertregel Corpus zur Wertausrichtung großer Sprachmodelle CVC: 大型中文大语言模式价值调整大型中国价值规则公司 2506.01495v4 -
632 06-26 Do Large Language Models Advocate for Inferentialism? Befürworten große Sprachmodelle den Inferentialismus? 大语言模型是否为推定主义辩护? 2412.14501v2 -
633 06-26 Learning Evaluation Models from Large Language Models for Sequence Generation Learning Evaluation Models aus großen Sprachmodellen für die Sequenzgenerierung 序列生成大语言模式学习评价模式 2308.04386v3 -
634 06-26 Progtuning: Progressive Fine-tuning Framework for Transformer-based Language Models Progtuning: Progressives Fine-Tuning-Framework für transformerbasierte Sprachmodelle 改进:基于变换器的语文模式逐步微调框架 2506.21119v1 -
635 06-26 Learning to Skip the Middle Layers of Transformers Lernen, die mittleren Schichten der Transformer zu überspringen 学习跳过变换器的中层 2506.21103v1 -
636 06-26 HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics HERMES: zeitlich-zusammenhängendes lang-für-M Verständnis mit Episoden und Semantik HERMES: 与分数和语义学的理解 2408.17443v4 -
637 06-26 Enhancing LLM Tool Use with High-quality Instruction Data from Knowledge Graph Verbesserung der LLM-Tool-Nutzung mit hochwertigen Instruktionsdaten aus Wissensgrafik 利用来自知识图的高质量教学数据加强LLM工具的使用 2506.21071v1 -
638 06-26 MT2-CSD: A New Dataset and Multi-Semantic Knowledge Fusion Method for Conversational Stance Detection MT2-CSD: Eine neue Datensatz- und Multi-Semantic Knowledge Fusion Methode zur konversatorischen Stance-Erkennung MT2-CSD: 用于语音稳定探测的新数据集和多语层知识融合方法 2506.21053v1 -
639 06-26 Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs Suche und Verfeinerung während des Denkens: Autonome Retrieval-Augmented Reasoning von LLMs 思考期间的搜索和记忆:自主检索-强化理据(LLM) 2505.11277v3 -
640 06-26 A Semi-supervised Scalable Unified Framework for E-commerce Query Classification Ein halbüberwachtes skalierbares Unified Framework für die E-Commerce Query Classification 半监督的电子商务查询分类可扩展统一框架 2506.21049v1 -
641 06-26 MockLLM: A Multi-Agent Behavior Collaboration Framework for Online Job Seeking and Recruiting MockLLM: Ein Multi-Agent-Behavior-Kooperationsrahmen für Online-Jobsuche und Recruiting MockLLLM:网上求职和招聘多代理行为协作框架 2405.18113v2 -
642 06-26 SceneGenAgent: Precise Industrial Scene Generation with Coding Agent SceneGenAgent: Präzise industrielle Szenegenerierung mit Coding Agent SceneGenerAgenti: 精密工业场景与编码剂生成 2410.21909v3 -
643 06-26 Large Language Models Acing Chartered Accountancy Große Sprachmodelle Aking Chartered Accountancy 特许会计会计 2506.21031v1 -
644 06-26 SAC: A Framework for Measuring and Inducing Personality Traits in LLMs with Dynamic Intensity Control SAC: Ein Rahmen für die Messung und Induktion von Persönlichkeitseigenschaften in LLMs mit dynamischer Intensitätskontrolle SAC: 具有动态强度控制的LMLM中测量和诱导个性轨迹的框架 2506.20993v1 -
645 06-26 SharpZO: Hybrid Sharpness-Aware Vision Language Model Prompt Tuning via Forward-Only Passes SharpZO: Hybrid Sharpness-Aware Vision Sprachmodell Prompt Tuning via Forward-Only Passes SharpZO: 混合尖锐-敏锐视觉语言模型,通过前向-单行道快速调试 2506.20990v1 -
646 06-26 SACL: Understanding and Combating Textual Bias in Code Retrieval with Semantic-Augmented Reranking and Localization SACL: Verständnis und Bekämpfung von Textbias im Code Retrieval mit semantisch-angereicherter Reranking und Lokalisierung SACL: 理解和打击《规则》中与语义-增强的重新排级和本地化相结合的 “ 检索法 “ 中的 “ 理解和打击 “ 理论上的 “ 种族 “ 行为 2506.20081v2 -
647 06-26 Can Gradient Descent Simulate Prompting? Kann Gradient Descent Simulate Prompting? 梯子源模拟能刺激吗? 2506.20989v1 -
648 06-26 Comparing Retrieval-Augmentation and Parameter-Efficient Fine-Tuning for Privacy-Preserving Personalization of Large Language Models Vergleich von Retrieval-Augmentation und Parameter-Effizient Fine-Tuning für Datenschutz-Erhaltung Personalisierung von großen Sprachmodellen 比较大语言模型的检索增强和参数有效微量微量美化,促进保护隐私和保持个人特征化 2409.09510v2 -
649 06-26 Towards Text-free Graph Foundation Models: Rethinking Multi-Domain Graph Contrastive Learning Auf dem Weg zu textfreien Graph Foundation-Modellen: Multi-Domain-Graph Kontrastives Lernen neu denken 走向无文本图表基础模型:重新思考多领域图表对比学习 2506.22510v1 -
650 06-26 Reward-Guided Speculative Decoding for Efficient LLM Reasoning Belohnungsgeführte spekulative Dekodierung für effiziente LLM-Reasoning 高效 LLM 理由说明的受奖励指导的投机性说明 2501.19324v3 -
651 06-26 Learning to Rank for Multiple Retrieval-Augmented Models through Iterative Utility Maximization Ranken lernen für mehrere Retrieval-Augmented Modelle durch iterative Utility Maximierung 通过迭代功用最大化学习多重检索增强型号排名 2410.09942v2 -
652 06-26 AgentStealth: Reinforcing Large Language Model for Anonymizing User-generated Text AgentStealth: Verstärkung des Large Language Models zur Anonymisierung von benutzergeneriertem Text AgentStealth:加强用户生成文本匿名大语言模式 2506.22508v1 -
653 06-26 Beyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon Simulation Jenseits der reaktiven Sicherheit: Risiko-Bewusst LLM-Ausrichtung über Long-Horizon Simulation 超越反应安全性:通过长休松模拟使风险-警用LLM对齐 2506.20949v1 -
654 06-26 Evaluating Large Language Models for Automated Clinical Abstraction in Pulmonary Embolism Registries: Performance Across Model Sizes, Versions, and Parameters Bewertung großer Sprachmodelle für automatisierte klinische Abstraktion in pulmonalen Embolism Registries: Performance Across Modellgrößen, Versionen und Parameter 评价肺部新陈代谢登记簿自动临床抽象化的大型语言模型:不同模型大小、版本和参数的性能 2503.21004v2 -
655 06-26 PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks PP-DocBee: Multimodales Dokumentenverständnis durch Tricks verbessern PP-Docbee:通过一袋小把戏改进多式文件理解 2503.04065v3 -
656 06-26 KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model KaLM-Embedding-V2: Überlegene Trainingstechniken und Daten inspirieren ein vielseitiges Einbettungsmodell KaLM-Embedding-V2:高级培训技术和数据预报 2506.20923v1 -
657 06-26 FineWeb2: One Pipeline to Scale Them All – Adapting Pre-Training Data Processing to Every Language FineWeb2: Eine Pipeline, um sie alle zu skalieren – Anpassung der Vorschulungsdatenverarbeitung an jede Sprache FineWeb2: 将全部标准缩放的一条管道 – – 将培训前数据处理适应于每种语言 2506.20920v1 -
658 06-26 Optimising Language Models for Downstream Tasks: A Post-Training Perspective Sprachmodelle für Downstream-Aufgaben optimieren: Eine Perspektive nach dem Training 优化下游任务的语言模式:培训后展望 2506.20917v1
Article 0
Title@2025-07-03 (4): Requirements Elicitation Follow-Up Question Generation
Title: Requirements Elicitation Follow-Up Question Generation | Voraussetzungen Elicitation Follow-Up Question Generation | 问询后查询 2507.02858v1 |
Authors (3): Yuchen Shen, Anmol Singhal, Travis Breaux
Interviews are a widely used technique in eliciting requirements to gather stakeholder needs, preferences, and expectations for a software system. Effective interviewing requires skilled interviewers to formulate appropriate interview questions in real time while facing multiple challenges, including lack of familiarity with the domain, excessive cognitive load, and information overload that hinders how humans process stakeholders’ speech. Recently, large language models (LLMs) have exhibited state-of-the-art performance in multiple natural language processing tasks, including text summarization and entailment. To support interviewers, we investigate the application of GPT-4o to generate follow-up interview questions during requirements elicitation by building on a framework of common interviewer mistake types. In addition, we describe methods to generate questions based on interviewee speech. We report a controlled experiment to evaluate LLM-generated and human-authored questions with minimal guidance, and a second controlled experiment to evaluate the LLM-generated questions when generation is guided by interviewer mistake types. Our findings demonstrate that, for both experiments, the LLM-generated questions are no worse than the human-authored questions with respect to clarity, relevancy, and informativeness. In addition, LLM-generated questions outperform human-authored questions when guided by common mistake types. This highlights the potential of using LLMs to help interviewers improve the quality and ease of requirements elicitation interviews in real time.
有效的面试需要熟练的面试人员在面临多种挑战时实时地提出适当的访谈问题,包括缺乏对域名的熟悉程度、过度的认知负荷和信息超负荷,从而妨碍人类处理利益攸关方的演讲。最近,大型语言模型(LLMS)在多种自然语言处理任务中表现出了最先进的表现,包括文字总结和要求。为了支持访谈人员,我们调查GPT-4o的应用,以便在需求期间,通过建立共同的访谈错误类型框架,提出后续访谈问题。此外,我们描述根据访谈者演讲产生问题的方法。我们报告有控制的实验,以最低限度的指导评价LLM所产生和人为的问题,以及第二次有控制的实验,在生成时以访谈者错误类型为指导,评价LMM产生的问题。我们的研究结果表明,对于这两个实验,LMM公司产生的问题并不比人类授权的关于清晰度、相关性和了解性能等问题更差。此外,我们还描述了基于受访者演讲者演讲结果的透明性要求。此外,LMM公司还用普通的深度问题来改进访问。
Article 1
Title@2025-07-03 (4): Answer Matching Outperforms Multiple Choice for Language Model Evaluation
Title: Answer Matching Outperforms Multiple Choice for Language Model Evaluation | Antwort Matching Outperforms Multiple Choice für Sprachmodell-Bewertung | 语言模式评价的多种选择 2507.02856v1 |
Authors (5): Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping
Multiple choice benchmarks have long been the workhorse of language model evaluation because grading multiple choice is objective and easy to automate. However, we show that multiple choice questions from popular benchmarks can often be answered without even seeing the question. These shortcuts arise from a fundamental limitation of discriminative evaluation not shared by evaluations of the model’s free-form, generative answers. Until recently, there appeared to be no viable, scalable alternative to multiple choice, but we show that this has changed. We consider generative evaluation via what we call answer matching: give the candidate model the question without the options, have it generate a free-form response, then use a modern language model with the reference answer to determine if the response matches the reference. To compare the validity of different evaluation strategies, we annotate MMLU-Pro and GPQA-Diamond to obtain human grading data, and measure the agreement of each evaluation approach. We find that answer matching using recent models, even small ones, achieves near-perfect agreement, within the range of inter-annotator agreement. In contrast, both multiple choice evaluation and using LLM-as-a-judge without reference answers align poorly with human grading. Improving evaluations via answer matching is not merely a conceptual concern: the rankings of several models change significantly when evaluating their free-form responses with answer matching. In light of these findings, we discuss how to move the evaluation ecosystem from multiple choice to answer matching.
长期以来,多种选择基准一直是语言模式评估的工序,因为分级的多重选择是客观的,而且容易自动化。然而,我们展示了大众基准的多重选择问题,往往可以在不看到问题的情况下解答。这些捷径来自对模式自由形式和生成式答案的评价所没有分享的判别式评价的根本局限性。直到最近,似乎没有可行的、可伸缩的替代方法可以替代多重选择,但是,我们证明这已经发生了变化。我们通过我们所谓的“答案匹配”来考虑生成式评估:给候选人模型提供没有选项的问题,让它产生一个自由的响应,然后使用具有参考答案的现代语言模型来确定答复是否与参考匹配。为了比较不同评价战略的有效性,我们对MMLU-Pro和GPQA-Diamond进行了批注,以获取人类评分数据,衡量每一种评价方法的一致性。我们发现,使用最近的模型(甚至是小型模型)进行答案匹配,可达到接近完美的一致性,处于标注者间一致性的范围内。相比之下,多项选择评估和不使用参考答案的LLM-as-a-judge与人工评分的一致性都较差。通过答案匹配改进评估并不只是概念上的问题:用答案匹配评估自由形式回答时,若干模型的排名发生显著变化。鉴于这些发现,我们讨论如何将评估生态系统从多项选择转向答案匹配。
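The answer-matching procedure described in this abstract (ask the question without options, collect a free-form response, then compare it with the reference answer) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `Item`, `normalize`, `exact_matcher`, and `answer_matching_accuracy` are hypothetical names, and the normalized string comparison is only a stand-in for the language-model matcher that the paper uses.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Item:
    question: str   # question text only, with the answer options removed
    reference: str  # reference (gold) answer


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and trim surrounding spaces and periods."""
    return " ".join(text.lower().split()).strip(" .")


def exact_matcher(question: str, response: str, reference: str) -> bool:
    """Hypothetical stand-in for the matcher.

    The abstract uses a language model that sees the reference answer;
    a normalized string comparison is assumed here to keep the sketch
    self-contained.
    """
    return normalize(response) == normalize(reference)


def answer_matching_accuracy(
    items: Sequence[Item],
    candidate: Callable[[str], str],
    matcher: Callable[[str, str, str], bool] = exact_matcher,
) -> float:
    """Fraction of items whose free-form response matches the reference."""
    if not items:
        raise ValueError("items must not be empty")
    correct = 0
    for item in items:
        response = candidate(item.question)  # free-form answer, no options shown
        if matcher(item.question, response, item.reference):
            correct += 1
    return correct / len(items)


# Toy usage: a lookup table plays the role of the candidate model.
toy_answers = {"Capital of France?": "paris.", "2 + 2?": "five"}
toy_items = [Item("Capital of France?", "Paris"), Item("2 + 2?", "4")]
toy_accuracy = answer_matching_accuracy(toy_items, lambda q: toy_answers[q])
```

Passing the matcher as a parameter keeps the scoring loop unchanged if the string comparison is later swapped for a model-based judgment.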
Article 2
Title@2025-07-03 (4): MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs
Title: MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs | MOTIF: Modulares Denken durch Verstärkung Feinabstimmung in LLMs | MOTIF:通过强化微调在LLM中进行模块思考 2507.02851v1 |
Authors (2): Purbesh Mitra, Sennur Ulukus
Recent advancements in the reasoning capabilities of large language models (LLMs) show that employing the group relative policy optimization (GRPO) algorithm for reinforcement learning (RL) training allows the models to use more thinking/reasoning tokens for generating better responses. However, LLMs can generate only a finite number of tokens while maintaining attention to the previously generated tokens. This limit, also known as the context size of an LLM, is a bottleneck in LLM reasoning with an arbitrarily large number of tokens. To think beyond the limit of context size, an LLM must employ a modular thinking strategy to reason over multiple rounds. In this work, we propose MOTIF: Modular Thinking via Reinforcement Finetuning, an RL training method for generating thinking tokens in multiple rounds, effectively allowing the model to think with additional context size. We trained the open-source model Qwen2.5-3B-Instruct on the GSM8K dataset via parameter-efficient fine-tuning and tested its accuracy on the MATH500 and AIME2024 benchmarks. Our experiments show 3.8% and 3.3% improvements over vanilla GRPO-based training in the respective benchmarks. Furthermore, this improvement was achieved with only 15% of samples, thus demonstrating the sample efficiency of MOTIF. Our code and models are available at https://github.com/purbeshmitra/MOTIF and https://huggingface.co/purbeshmitra/MOTIF, respectively.
大型语言模型(LLMS)推理能力最近的进步表明,使用集体相对政策优化算法(GROP)进行强化学习(RL)培训,使模型能够使用更多的思考/推理符号来产生更好的回应。然而,LLMS只能产生有限的象征性,同时保持对先前产生的符号的注意。这一限制,又称为LLM的上下文大小,是LLM推理中的一个瓶颈,它含有任意大量符号。要超越范围范围,LMM必须采用模块式思维战略,在多个回合中解释。在这项工作中,我们提议用$textbf{MOIF:通过强化微调进行模思考,这是在多回合中生成思维符号的一种有限数量的培训方法,实际上允许模型以额外的上下文尺寸进行思考。我们通过高效的参数微调和测试了MATHTH500和AIME2024基准的精确度。我们实验显示的是3.88和3.3_BEGROBS的改进,因此在15个样本中展示了我们现有的MAF/MO标准。
Article 3
Title@2025-07-03 (4): LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users
Title: LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users | LLM Hypnose: Nutzung des Benutzerfeedbacks für unautorisierte Wissensinjektion für alle Benutzer | LLM Hypnisis:利用用户反馈,为所有用户提供未经授权知识注射 2507.02850v1 |
Authors (4): Almog Hilel, Idan Shenfeld, Leshem Choshen, Jacob Andreas
We describe a vulnerability in language models (LMs) trained with user feedback, whereby a single user can persistently alter LM knowledge and behavior given only the ability to provide prompts and upvote / downvote feedback on LM outputs. To implement the attack, the attacker prompts the LM to stochastically output either a “poisoned” or benign response, then upvotes the poisoned response or downvotes the benign one. When feedback signals are used in a subsequent preference tuning behavior, LMs exhibit increased probability of producing poisoned responses even in contexts without malicious prompts. We show that this attack can be used to (1) insert factual knowledge the model did not previously possess, (2) modify code generation patterns in ways that introduce exploitable security flaws, and (3) inject fake financial news. Our findings identify both a new qualitative feature of language model preference tuning (showing that even highly restricted forms of preference data can be used to exert fine-grained control over behavior) and a new attack mechanism for LMs trained with user feedback (extending work on pretraining-time data poisoning and deployment-time prompt injection).
我们用用户反馈来描述语言模型中的脆弱之处,根据这些模型,一个用户可以持续改变LM的知识和行为,因为只有能够提供有关LM产出的提示和上投/下投反馈。为了实施袭击,攻击者促使LM以“受污染”或良性反应的方式进行随机输出,然后将中毒反应或下投良性反应。当反馈信号用于随后的偏好调控行为时,LMS显示即使在没有恶意提示的情况下也更有可能产生中毒反应。我们表明,这次攻击可以被用来:(1) 插入该模型以前没有的事实知识,(2) 修改代码生成模式,以引入可以利用的安全缺陷,(3) 输入假金融新闻。我们发现,语言模式偏好调整的新定性特征(显示甚至高度有限的偏好形式数据可以用来对行为进行精细控制),以及经过用户反馈培训的LMS新的攻击机制(延长培训前数据中毒和部署时即时及时注射工作)。
Article 4
Title@2025-07-03 (4): Legal Requirements Translation from Law
Title: Legal Requirements Translation from Law | Rechtliche Voraussetzungen Übersetzung aus dem Recht | 法律要求译自法律 2507.02846v1 |
Authors (2): Anmol Singhal, Travis Breaux
Software systems must comply with legal regulations, which is a resource-intensive task, particularly for small organizations and startups lacking dedicated legal expertise. Extracting metadata from regulations to elicit legal requirements for software is a critical step to ensure compliance. However, it is a cumbersome task due to the length and complex nature of legal text. Although prior work has pursued automated methods for extracting structural and semantic metadata from legal text, key limitations remain: they do not consider the interplay and interrelationships among attributes associated with these metadata types, and they rely on manual labeling or heuristic-driven machine learning, which does not generalize well to new documents. In this paper, we introduce an approach based on textual entailment and in-context learning for automatically generating a canonical representation of legal text, encodable and executable as Python code. Our representation is instantiated from a manually designed Python class structure that serves as a domain-specific metamodel, capturing both structural and semantic legal metadata and their interrelationships. This design choice reduces the need for large, manually labeled datasets and enhances applicability to unseen legislation. We evaluate our approach on 13 U.S. state data breach notification laws, demonstrating that our generated representations pass approximately 89.4% of test cases and achieve a precision and recall of 82.2 and 88.7, respectively.
软件系统必须遵守法规,这是一个资源密集型的任务,对于小型组织和缺乏专门法律专门知识的初创机构来说尤其如此。从规章中提取元数据,以引起对软件的法律要求,是确保遵守的关键步骤。然而,由于法律文本的长度和复杂性质,这是一个繁琐的任务。虽然以前的工作寻求的是从法律文本中提取结构性和语义性元数据的自动化方法,但主要限制仍然存在:它们不考虑与这些元数据类型相关的属性之间的相互作用和相互关系,它们依赖人工标签或超自然驱动的机器学习,这并不能很好地概括新的文件。在本文件中,我们采用基于文字要求和文内文学习的方法,以自动生成法律文本的可理解性表述、可隐含和可执行的Python代码。我们的代表来自一个手工设计的Python类结构,这个结构是一个特定域的元模型,既捕捉结构性和语义性法律元数据,也依靠它们的相互关系。这一设计减少了对大型、手工标签数据设置和内文学习的需求,并增强了对隐性法律的可应用性解释性解释性解释性法律。我们评估了13项 和精确性法律的测试性案例。我们评估了13项 和精确性法律。我们评估了我们的数据和精确性检验性案例。
Article 5
Title@2025-07-03 (4): Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection
Title: Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection | Visual Contextual Attack: Jailbreaking MLLMs mit Image-Driven Context Injection | 视觉上下文攻击:带有图像驱动背景注射的破狱MLLMs MLLMs 2507.02844v1 |
Authors (4): Ziqi Miao, Yi Ding, Lijun Li, Jing Shao
With the emergence of strong visual-language capabilities, multimodal large language models (MLLMs) have demonstrated tremendous potential for real-world applications. However, the security vulnerabilities exhibited by the visual modality pose significant challenges to deploying such models in open-world environments. Recent studies have successfully induced harmful responses from target MLLMs by encoding harmful textual semantics directly into visual inputs. However, in these approaches, the visual modality primarily serves as a trigger for unsafe behavior, often exhibiting semantic ambiguity and lacking grounding in realistic scenarios. In this work, we define a novel setting: visual-centric jailbreak, where visual information serves as a necessary component in constructing a complete and realistic jailbreak context. Building on this setting, we propose the VisCo (Visual Contextual) Attack. VisCo fabricates contextual dialogue using four distinct visual-focused strategies, dynamically generating auxiliary images when necessary to construct a visual-centric jailbreak scenario. To maximize attack effectiveness, it incorporates automatic toxicity obfuscation and semantic refinement to produce a final attack prompt that reliably triggers harmful responses from the target black-box MLLMs. Specifically, VisCo achieves a toxicity score of 4.78 and an Attack Success Rate (ASR) of 85% on MM-SafetyBench against GPT-4o, significantly outperforming the baseline, which achieves a toxicity score of 2.48 and an ASR of 22.2%. The code is available at https://github.com/Dtc7w3PQ/Visco-Attack.
随着强大的视觉语言能力的出现,多式大型语言模型(MLLM)显示出了在现实世界应用中的巨大潜力。然而,视觉模式所展示的安全脆弱性给在开放世界环境中部署这类模型带来了重大挑战。最近的研究成功地引导了目标MLLMs的有害反应,将有害的文字语义直接纳入视觉投入中。但是,在这些方法中,视觉模式主要触发了不安全行为,往往表现出语义模糊,缺乏现实情景的基础。在这项工作中,我们定义了一个新颖的设置:视觉中心破狱,视觉信息是构建完整和现实的破狱环境的必要组成部分。在这个设置上,我们提议Visco(视觉背景背景)攻击。 VisCo构建了背景对话,使用了四种不同的视觉重点战略,在建立以视觉为中心的监狱破解情景时,动态模式主要产生了辅助图像。为了最大限度地发挥攻击效果,它包含自动毒性模糊和语义改进,以产生最终攻击,从而可靠地触发目标黑盒子MLLMS3的有害反应。具体地, Visco-COA 和22-PTA的毒性等级标准是G-MS的等级标准。
Article 6
Title@2025-07-03 (4): Improved Unbiased Watermark for Large Language Models
Title: Improved Unbiased Watermark for Large Language Models | Verbessertes unvoreingenommenes Wasserzeichen für große Sprachmodelle | 改进大语言模型的无偏见水印 2502.11268v2 |
Authors (4): Ruibo Chen, Yihan Wu, Junfeng Guo, Heng Huang
As artificial intelligence surpasses human capabilities in text generation, the necessity to authenticate the origins of AI-generated content has become paramount. Unbiased watermarks offer a powerful solution by embedding statistical signals into language model-generated text without distorting the quality. In this paper, we introduce MCmark, a family of unbiased, Multi-Channel-based watermarks. MCmark works by partitioning the model’s vocabulary into segments and promoting token probabilities within a selected segment based on a watermark key. We demonstrate that MCmark not only preserves the original distribution of the language model but also offers significant improvements in detectability and robustness over existing unbiased watermarks. Our experiments with widely-used language models demonstrate an improvement in detectability of over 10% using MCmark, compared to existing state-of-the-art unbiased watermarks. This advancement underscores MCmark’s potential in enhancing the practical application of watermarking in AI-generated texts.
由于人工智能在文本生成方面超过了人的能力,认证人工智能生成的内容来源的必要性已成为至高无上。无偏见的水印通过将统计信号嵌入语言模型生成的文本而不扭曲质量,提供了一个强有力的解决方案。在本文中,我们引入了Mcmark,这是一个没有偏见的多通道水印的大家庭。Mcmark通过将模型的词汇分成几个部分,并在基于水标记键的选定部分中推广象征性概率。我们证明,Mcmark不仅保留了语言模型的原始分布,而且还大大改善了现有不偏倚的水标记的可探测性和稳健性。我们对广泛使用的语言模型的实验表明,与现有最先进的不偏倚水标记相比,使用Mcmark的可探测性提高了10%以上。这一进步凸显了Mcmark在加强在人工生成文本中实际应用水标记的潜力。
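The MCmark abstract describes partitioning the vocabulary into segments and promoting tokens in a key-selected segment while preserving the model's distribution in expectation. The sketch below illustrates that idea under strong simplifying assumptions: only two segments, a uniformly random one-bit key, and a hypothetical function `promote_segment` whose weighting rule is chosen for illustration and is not taken from the paper.

```python
def promote_segment(probs, segment_of, key):
    """Reweight next-token probabilities toward the segment chosen by `key`.

    probs      -- original next-token probabilities (sum to 1)
    segment_of -- segment id (0 or 1) of each token (assumed partition)
    key        -- 0 or 1, assumed uniformly random and known to the detector
    """
    mass = [0.0, 0.0]
    for p, s in zip(probs, segment_of):
        mass[s] += p
    promoted, other = key, 1 - key
    # Promoted tokens are scaled up by at most 2x, or until they hold all the mass.
    up = min(2.0, 1.0 / mass[promoted]) if mass[promoted] > 0 else 0.0
    # Whatever mass is left is spread proportionally over the other segment.
    left = max(0.0, 1.0 - 2.0 * mass[promoted])
    down = left / mass[other] if mass[other] > 0 else 0.0
    return [p * (up if s == promoted else down) for p, s in zip(probs, segment_of)]


probs = [0.1, 0.2, 0.3, 0.4]
segment_of = [0, 1, 0, 1]
q0 = promote_segment(probs, segment_of, key=0)
q1 = promote_segment(probs, segment_of, key=1)
# Averaging over both equally likely keys recovers the original distribution.
average = [(a + b) / 2 for a, b in zip(q0, q1)]
```

Because the average over keys equals `probs`, a reader without the key sees the original distribution, while a detector holding the key can count how often sampled tokens fall in the promoted segment.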
Article 7
Title@2025-07-03 (4): StepHint: Multi-level Stepwise Hints Enhance Reinforcement Learning to Reason
Title: StepHint: Multi-level Stepwise Hints Enhance Reinforcement Learning to Reason | StepHint: Mehrstufige stufenweise Hinweise stärken das Lernen zur Vernunft | 步进提示:多级分步骤将强化学习提升到合理 2507.02841v1 |
Authors (7): Kaiyi Zhang, Ang Lv, Jinpeng Li, Yongbo Wang, Feng Wang, Haoyuan Hu, Rui Yan
Reinforcement learning with verifiable rewards (RLVR) is a promising approach for improving the complex reasoning abilities of large language models (LLMs). However, current RLVR methods face two significant challenges: the near-miss reward problem, where a small mistake can invalidate an otherwise correct reasoning process, greatly hindering training efficiency; and exploration stagnation, where models tend to focus on solutions within their "comfort zone," lacking the motivation to explore potentially more effective alternatives. To address these challenges, we propose StepHint, a novel RLVR algorithm that utilizes multi-level stepwise hints to help models explore the solution space more effectively. StepHint generates valid reasoning chains from stronger models and partitions these chains into reasoning steps using our proposed adaptive partitioning method. The initial few steps are used as hints, and simultaneously, multiple-level hints (each comprising a different number of steps) are provided to the model. This approach directs the model's exploration toward a promising solution subspace while preserving its flexibility for independent exploration. By providing hints, StepHint mitigates the near-miss reward problem, thereby improving training efficiency. Additionally, the external reasoning pathways help the model develop better reasoning abilities, enabling it to move beyond its "comfort zone" and mitigate exploration stagnation. StepHint outperforms competitive RLVR enhancement methods across six mathematical benchmarks, while also demonstrating superior generalization and excelling over baselines on out-of-domain benchmarks.
以可核查的奖励加强学习(RLVR)是提高大型语言模型复杂推理能力的一个很有希望的方法。然而,目前的RLVR方法面临两大挑战:一是近乎的奖赏问题,一个小错误可能使原本正确的推理过程失效,大大妨碍培训效率;二是探索停滞,其中模型往往侧重于“舒适地带”内的解决办法,缺乏探索可能更加有效的替代方案的积极性。为了应对这些挑战,我们提议Stephint,这是一个新的RLVR算法,它利用多层次的分级提示帮助模型更有效地探索解决方案空间。StepHint从较强的模型中产生有效的推理链,用我们提议的适应性分解法将这些链分成为推理步骤。最初的几个步骤被用作提示,同时向模型提供多重提示(每个步骤由不同的步骤组成),该方法引导模型的探索走向一个有希望的通用解决方案子空间,同时保持其独立探索的灵活性。通过提供提示, StepHper 减轻近乎失败的奖赏问题,从而提高培训效率。此外,外部推理学路径将它推向更高的升级,同时推导出其升级,同时推向更高的区域。
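The StepHint abstract says the first few steps of a partitioned reasoning chain serve as hints, with several hint levels that each contain a different number of steps. A minimal sketch of building such prefixes follows. The function name `multi_level_hints`, the even spacing of hint lengths, and the choice never to reveal the final step are all assumptions made for illustration; the paper's adaptive partitioning method is not reproduced here.

```python
def multi_level_hints(steps, num_levels):
    """Return hint prefixes of strictly increasing length from a list of reasoning steps."""
    if num_levels < 1 or not steps:
        return []
    # Assumption: keep at least the last step hidden so the model still has to finish.
    max_len = max(1, len(steps) - 1)
    levels = min(num_levels, max_len)
    hints = []
    for level in range(1, levels + 1):
        # Spread hint lengths evenly between roughly max_len / levels and max_len.
        length = (level * max_len) // levels
        hints.append(steps[:length])
    return hints


chain = ["define variables", "set up equation", "simplify", "solve", "state answer"]
hints = multi_level_hints(chain, num_levels=2)
# hints[0] holds the first 2 steps, hints[1] the first 4 steps
```

Each prefix would then be prepended to the prompt for a separate rollout, so the policy sees the same problem with different amounts of guidance.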
Article 8
Title@2025-07-03 (4): From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents
Title: From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents | Von der Web-Suche in Richtung Agentic Deep Research: Incentivizing Search with Reasoning Agents | 从网络搜索到代理深层研究:激励使用理性代理进行搜索 2506.18959v3 |
Authors (23): Weizhi Zhang, Yangning Li, Yuanchen Bei, Junyu Luo, Guancheng Wan, Liangwei Yang, Chenxuan Xie, Yuyao Yang, Wei-Chieh Huang, Chunyu Miao, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Yankai Chen, Chunkit Chan, Peilin Zhou, Xinyang Zhang, Chenwei Zhang, Jingbo Shang, Ming Zhang, Yangqiu Song, Irwin King, Philip S. Yu
Information retrieval is a cornerstone of modern knowledge acquisition, enabling billions of queries each day across diverse domains. However, traditional keyword-based search engines are increasingly inadequate for handling complex, multi-step information needs. Our position is that Large Language Models (LLMs), endowed with reasoning and agentic capabilities, are ushering in a new paradigm termed Agentic Deep Research. These systems transcend conventional information search techniques by tightly integrating autonomous reasoning, iterative retrieval, and information synthesis into a dynamic feedback loop. We trace the evolution from static web search to interactive, agent-based systems that plan, explore, and learn. We also introduce a test-time scaling law to formalize the impact of computational depth on reasoning and search. Supported by benchmark results and the rise of open-source implementations, we demonstrate that Agentic Deep Research not only significantly outperforms existing approaches, but is also poised to become the dominant paradigm for future information seeking. All the related resources, including industry products, research papers, benchmark datasets, and open-source implementations, are collected for the community in https://github.com/DavidZWZ/Awesome-Deep-Research.
信息检索是现代知识获取的基石,每天在各个领域支撑着数十亿次查询。然而,传统的基于关键词的搜索引擎越来越难以应对复杂的多步骤信息需求。我们的立场是,具备推理和智能体能力的大型语言模型(LLMs)正在开启一种新范式,称为 Agentic Deep Research。这些系统将自主推理、迭代检索和信息综合紧密整合为动态反馈循环,从而超越传统的信息搜索技术。我们追溯了从静态网络搜索到能够规划、探索和学习的交互式智能体系统的演变。我们还引入了测试时缩放定律,以形式化计算深度对推理和搜索的影响。在基准结果和开源实现兴起的支持下,我们证明 Agentic Deep Research 不仅显著优于现有方法,而且有望成为未来信息获取的主导范式。所有相关资源,包括行业产品、研究论文、基准数据集和开源实现,均已为社区收集在 https://github.com/DavidZWZ/Awesome-Deep-Research。
Article 9
Title@2025-07-03 (4): ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning
Title: ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning | ExPO: Entsperren harter Vernunft mit selbsterklärungsgeführtem Verstärkungslernen | ExPO: 以自我剥削指导强化学习来解锁困难理由 2507.02834v1 |
Authors (4): Ruiyang Zhou, Shuozhe Li, Amy Zhang, Liu Leqi
Recent advances in large language models have been driven by reinforcement learning (RL)-style post-training, which improves reasoning by optimizing model outputs based on reward or preference signals. GRPO-style approaches implement this by using self-generated samples labeled by an outcome-based verifier. However, these methods depend heavily on the model’s initial ability to produce positive samples. They primarily refine what the model already knows (distribution sharpening) rather than enabling the model to solve problems where it initially fails. This limitation is especially problematic in early-stage RL training and on challenging reasoning tasks, where positive samples are unlikely to be generated. To unlock reasoning ability in such settings, the model must explore new reasoning trajectories beyond its current output distribution. Such exploration requires access to sufficiently good positive samples to guide the learning. While expert demonstrations seem like a natural solution, we find that they are often ineffective in RL post-training. Instead, we identify two key properties of effective positive samples: they should (1) be likely under the current policy, and (2) increase the model’s likelihood of predicting the correct answer. Based on these insights, we propose Self-Explanation Policy Optimization (ExPO), a simple and modular framework that generates such samples by conditioning on the ground-truth answer. ExPO enables efficient exploration and guides the model to produce reasoning trajectories more aligned with its policy than expert-written CoTs, while ensuring higher quality than its own (incorrect) samples. Experiments show that ExPO improves both learning efficiency and final performance on reasoning benchmarks, surpassing expert-demonstration-based methods in challenging settings such as MATH level-5, where the model initially struggles the most.
大型语言模型的最近进步是由强化学习(RL)式的后培训驱动的,这种培训通过优化基于奖赏或优惠信号的模型产出来改进推理。GROPO式的方法通过使用由基于结果的核查者贴上标签的自产样本来实施。然而,这些方法在很大程度上取决于模型最初生产积极样本的能力。它们主要是完善模型已经知道的(分配的精锐化),而不是使模型能够解决最初失败的问题。这种限制在RL早期质量培训和具有挑战性推理任务中特别成问题,因为不可能产生正面的样本。为了在这种环境下释放推理能力,模型必须探索超出其当前产出分布范围的新推理轨。这种探索需要获得足够好的正面样本来指导学习。专家演示似乎是一种自然解决方案,但我们发现它们往往在RL后培训中无效。相反,我们找出有效正选样本的两个关键属性:它们应该基于当前政策的基础(1) , 改进模型, 而不是用于预测正确答案的可能性。基于这些更高级的推理能力, 我们提议在最高级的推理学前的推算中, 和最精确的推算的推算方法显示这样的推算。
Article 10
Title@2025-07-03 (4): Generalizing Verifiable Instruction Following
Title: Generalizing Verifiable Instruction Following | Verallgemeinern der überprüfbaren Anleitung | 普遍适用的可核实说明 2507.02833v1 |
Authors (8): Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, Hannaneh Hajishirzi
A crucial factor for successful human and AI interaction is the ability of language models or chatbots to follow human instructions precisely. A common feature of instructions is output constraints like "only answer with yes or no" or "mention the word 'abrakadabra' at least 3 times" that the user adds to craft a more useful answer. Even today’s strongest models struggle with fulfilling such constraints. We find that most models strongly overfit on a small set of verifiable constraints from the benchmarks that test these abilities, a skill called precise instruction following, and are not able to generalize well to unseen output constraints. We introduce a new benchmark, IFBench, to evaluate precise instruction following generalization on 58 new, diverse, and challenging verifiable out-of-domain constraints. In addition, we perform an extensive analysis of how and on what data models can be trained to improve precise instruction following generalization. Specifically, we carefully design constraint verification modules and show that reinforcement learning with verifiable rewards (RLVR) significantly improves instruction following. In addition to IFBench, we release 29 additional new hand-annotated training constraints and verification functions, RLVR training prompts, and code.
语言模型或聊天爱好者能够精确地遵循人类的指示,这是人类和AI互动取得成功的关键因素。教学的一个共同特点是产出限制,例如“只回答是或否”或“至少3次”用户为制定更有用的答案而增加的“brakadabra”一词的“只回答”或“至少3次”等产出限制。即使是今天最强大的模型也为克服这些限制而奋斗。我们发现,大多数模型在测试这些能力的基准中,大大超过一套小的可核实限制,即测试这些能力的标准,这种技能称为精确指导,不能很好地概括到看不见的产出限制。我们采用了一个新的基准,即IFBench,在对58个新的、多样化的和具有挑战性的可核查的外部限制进行概括之后,对精确的指令进行评估。此外,我们还广泛分析如何和如何培训数据模型,以便在概括后改进精确的教学。具体地说,我们仔细设计制约核查模块,并表明以可核查的奖励加强学习(RVER)将大大改进随后的教学。除了IFBench外,我们还发布了29个新的附加说明性的培训限制和核查功能。
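The IFBench abstract mentions constraint verification modules for instructions such as "only answer with yes or no" and "mention the word 'abrakadabra' at least 3 times". The sketch below shows what such verifiers could look like for those two quoted constraints. All function names are hypothetical, and the reward shape (fraction of satisfied constraints) is an assumption, not the released IFBench code.

```python
import re


def mentions_word_at_least(response, word, n):
    """True if `word` appears as a whole word at least `n` times (case-insensitive)."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= n


def answers_only_yes_or_no(response):
    """True if the response is just 'yes' or 'no', ignoring case, spaces and '.' or '!'."""
    return response.strip().strip(".!").lower() in {"yes", "no"}


def constraint_reward(response, checks):
    """Fraction of verifiable constraints the response satisfies (assumed reward shape)."""
    if not checks:
        return 0.0
    return sum(1 for check in checks if check(response)) / len(checks)


checks = [
    answers_only_yes_or_no,
    lambda r: mentions_word_at_least(r, "abrakadabra", 3),
]
reward = constraint_reward("Yes.", checks)  # one of two constraints holds
```

Because each check is a plain function returning a boolean, the same modules can score benchmark outputs and supply a verifiable reward signal during RLVR training.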
Article 11
Title@2025-07-03 (4): SynapseRoute: An Auto-Route Switching Framework on Dual-State Large Language Model
Title: SynapseRoute: An Auto-Route Switching Framework on Dual-State Large Language Model | SynapseRoute: Ein Auto-Routen-Schaltrahmen für das Dual-State Large Language Model | SynapseRoute:关于两州大语言模式的自动运行切换框架 2507.02822v1 |
Authors (12): Wencheng Zhang, Shiqin Qiao, Lingjie Luo, Yinfeng Li, Chuanyang Zheng, Qian Xu, Meng Li, Yong Gui, Yijun He, Jianing Qiu, Jindong Hong, Jiankai Sun
With the widespread adoption of large language models (LLMs) in practical applications, selecting an appropriate model requires balancing not only performance but also operational cost. The emergence of reasoning-capable models has further widened the cost gap between “thinking” (high reasoning) and “non-thinking” (fast, low-cost) modes. In this work, we reveal that approximately 58% of medical questions can be accurately answered by the non-thinking mode alone, without requiring the high-cost reasoning process. This highlights a clear dichotomy in problem complexity and suggests that dynamically routing queries to the appropriate mode based on complexity could optimize accuracy, cost-efficiency, and overall user experience. Based on this, we further propose SynapseRoute, a machine learning-based dynamic routing framework that intelligently assigns input queries to either thinking or non-thinking modes. Experimental results on several medical datasets demonstrate that SynapseRoute not only improves overall accuracy (0.8390 vs. 0.8272) compared to the thinking mode alone but also reduces inference time by 36.8% and token consumption by 39.66%. Importantly, qualitative analysis indicates that over-reasoning on simpler queries can lead to unnecessary delays and even decreased accuracy, a pitfall avoided by our adaptive routing. Finally, this work further introduces the Accuracy-Inference-Token (AIT) index to comprehensively evaluate the trade-offs among accuracy, latency, and token cost.
随着大型语言模型(LLMs)在实际应用中的广泛采用,选择合适的模型不仅需要权衡性能,还需要权衡运行成本。具备推理能力的模型的出现进一步拉大了“思考”(高推理)模式与“非思考”(快速、低成本)模式之间的成本差距。在这项工作中,我们发现约58%的医学问题仅靠非思考模式即可准确回答,无需高成本的推理过程。这凸显了问题复杂度上的明显二分,并表明根据复杂度将查询动态路由到合适的模式可以优化准确性、成本效率和整体用户体验。基于此,我们进一步提出 SynapseRoute,一个基于机器学习的动态路由框架,可智能地将输入查询分配给思考或非思考模式。在多个医学数据集上的实验结果表明,与仅使用思考模式相比,SynapseRoute 不仅提高了整体准确率(0.8390 对 0.8272),还将推理时间减少了36.8%,token 消耗减少了39.66%。重要的是,定性分析表明,对较简单的查询过度推理会导致不必要的延迟,甚至降低准确率,而我们的自适应路由避免了这一问题。最后,本工作还引入了准确率-推理-Token(AIT)指数,以全面评估准确率、延迟和 token 成本之间的权衡。
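The SynapseRoute abstract describes assigning each query to a thinking or non-thinking mode according to its complexity. A minimal sketch of such a router is shown below. The names `route_query` and `toy_complexity`, the 0.5 threshold, and the word-count heuristic are hypothetical stand-ins for the trained classifier the paper refers to.

```python
def route_query(query, complexity_score, threshold=0.5):
    """Pick an inference mode from a complexity score in [0, 1]."""
    score = complexity_score(query)
    return "thinking" if score >= threshold else "non-thinking"


def toy_complexity(query):
    """Hypothetical stand-in for a learned classifier: longer questions count as harder."""
    return min(1.0, len(query.split()) / 40)


short_question = "What is the normal adult resting heart rate?"
mode = route_query(short_question, toy_complexity)  # "non-thinking"
```

In a real system the scorer would be a model trained on labeled questions, and the threshold would be tuned against the accuracy, latency and token-cost trade-off the abstract discusses.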
Article 12
Title@2025-07-03 (4): Multimodal Mathematical Reasoning with Diverse Solving Perspective
Title: Multimodal Mathematical Reasoning with Diverse Solving Perspective | Multimodale mathematische Vernunft mit unterschiedlicher Lösungsperspektive | 具有不同解决视角的多模式数学理由 2507.02804v1 |
Authors (6): Wenhao Shi, Zhiqiang Hu, Yi Bin, Yang Yang, See-Kiong Ng, Heng Tao Shen
Recent progress in large-scale reinforcement learning (RL) has notably enhanced the reasoning capabilities of large language models (LLMs), especially in mathematical domains. However, current multimodal LLMs (MLLMs) for mathematical reasoning often rely on one-to-one image-text pairs and single-solution supervision, overlooking the diversity of valid reasoning perspectives and internal reflections. In this work, we introduce MathV-DP, a novel dataset that captures multiple diverse solution trajectories for each image-question pair, fostering richer reasoning supervision. We further propose Qwen-VL-DP, a model built upon Qwen-VL, fine-tuned with supervised learning and enhanced via group relative policy optimization (GRPO), a rule-based RL approach that integrates correctness discrimination and diversity-aware reward functions. Our method emphasizes learning from varied reasoning perspectives and distinguishing between correct yet distinct solutions. Extensive experiments on the MathVista’s minitest and Math-V benchmarks demonstrate that Qwen-VL-DP significantly outperforms prior base MLLMs in both accuracy and generative diversity, highlighting the importance of incorporating diverse perspectives and reflective reasoning in multimodal mathematical reasoning.
大规模强化学习(LLL)的近期进展明显提高了大型语言模型(LLMS)的推理能力,特别是在数学领域,然而,目前数学推理的多式LMS(MLLM)往往依赖于一对一图像文本和单一解决方案监督,忽视了各种有效的推理观点和内部反省。在这项工作中,我们引入了数学V-DP,这是一个新颖的数据集,为每对图像-问题收集多种不同的解答轨迹,促进更丰富的推理监督。我们进一步提议Qwen-VL-DP,该模型以Quen-VLL为基础,与监督性学习相调整,并通过集体相对政策优化(GROPO)加强。基于规则的RL方法将正确性歧视和多样性奖励功能结合起来。我们的方法强调从不同推理角度学习,区分正确而独特的解决办法。关于MathVista的微型测试和数学-V基准的广泛实验表明,Quen-VL-DP在准确性和基因化多样性两方面都明显超越了以前的基本MLLLMs,强调将不同观点纳入数学推理学的重要性。
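The Qwen-VL-DP abstract relies on group relative policy optimization (GRPO) with correctness and diversity-aware rewards. The sketch below shows the group-relative advantage normalization commonly associated with GRPO, applied to a hypothetical combined reward. The `combined_reward` shape, its `diversity_weight`, and the use of the population standard deviation are assumptions for illustration, not the paper's reward functions.

```python
from statistics import mean, pstdev


def combined_reward(is_correct, diversity, diversity_weight=0.5):
    """Assumed reward: correctness, plus a diversity bonus only for correct solutions."""
    return (1.0 + diversity_weight * diversity) if is_correct else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled solutions for one image-question pair: (is_correct, diversity in [0, 1]).
samples = [(True, 0.8), (True, 0.1), (False, 0.9), (False, 0.2)]
rewards = [combined_reward(c, d) for c, d in samples]
advantages = group_relative_advantages(rewards)
```

With this shape, a correct solution that differs from the others in its group receives the largest advantage, while incorrect solutions gain nothing from being unusual.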
Article 13
Title@2025-07-03 (4): Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models
Title: Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models | Ist Vernunft alles, was Sie brauchen? Probieren von Bias im Zeitalter der Vernunft Sprachmodelle | 需要什么理由就需要什么理由吗? 2507.02799v1 |
Authors (4): Riccardo Cantini, Nicola Gabriele, Alessio Orsino, Domenico Talia
Reasoning Language Models (RLMs) have gained traction for their ability to perform complex, multi-step reasoning tasks through mechanisms such as Chain-of-Thought (CoT) prompting or fine-tuned reasoning traces. While these capabilities promise improved reliability, their impact on robustness to social biases remains unclear. In this work, we leverage the CLEAR-Bias benchmark, originally designed for Large Language Models (LLMs), to investigate the adversarial robustness of RLMs to bias elicitation. We systematically evaluate state-of-the-art RLMs across diverse sociocultural dimensions, using an LLM-as-a-judge approach for automated safety scoring and leveraging jailbreak techniques to assess the strength of built-in safety mechanisms. Our evaluation addresses three key questions: (i) how the introduction of reasoning capabilities affects model fairness and robustness; (ii) whether models fine-tuned for reasoning exhibit greater safety than those relying on CoT prompting at inference time; and (iii) how the success rate of jailbreak attacks targeting bias elicitation varies with the reasoning mechanisms employed. Our findings reveal a nuanced relationship between reasoning capabilities and bias safety. Surprisingly, models with explicit reasoning, whether via CoT prompting or fine-tuned reasoning traces, are generally more vulnerable to bias elicitation than base models without such mechanisms, suggesting reasoning may unintentionally open new pathways for stereotype reinforcement. Reasoning-enabled models appear somewhat safer than those relying on CoT prompting, which are particularly prone to contextual reframing attacks through storytelling prompts, fictional personas, or reward-shaped instructions. These results challenge the assumption that reasoning inherently improves robustness and underscore the need for more bias-aware approaches to reasoning design.
理性语言模型(RLMs)因其通过“链式安全评分”等机制或经微调推理的推理痕迹等机制执行复杂、多步推理任务的能力而获得了牵引力。虽然这些能力有望提高可靠性,但其对社会偏见的稳健性影响仍然不明确。在这项工作中,我们利用最初为大语言模型(LLMs)设计的CLEAR-Bas基准来调查RLM的对立强度,以调查RLM的对抗性强度,以得出偏差。我们系统地评估了在多种社会文化层面执行复杂、多步推理性推理任务的能力。我们使用LLM-A-判断式的自动安全评分法,利用破狱技术来评估内建安全机制的力度。我们的评价涉及三个主要问题:(一) 引入推理能力如何影响模型的公平和稳健性。 (二) 用于推理的模型是否比依赖COT的推理更安全? (三) 针对偏向偏差的越入式攻击的成功率比用于推理机制的不同。我们的调查结果显示,在逻辑推理判断性推理学模型和直判能力上更精确推理学模型之间,这些推理的越好的关系需要更精确,这些推理,这些推理,而不是于推理性推理的推理的推理的推理。
Article 14
Title@2025-07-03 (4): From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding
Title: From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding | Von langen Videos zu Clips: Ein von Menschen inspiriertes Video-Editing-Framework mit multimodalem Narrative Understanding | 从长视频到启动剪贴板:由人启发的视频编辑框架,包含多模式叙述理解 2507.02790v1 |
Authors (11): Xiangfeng Wang, Xiao Li, Yadong Wei, Xueyu Song, Yang Song, Xiaoqiang Xia, Fangrui Zeng, Zaiyi Chen, Liu Liu, Gu Xu, Tong Xu
The rapid growth of online video content, especially on short video platforms, has created a growing demand for efficient video editing techniques that can condense long-form videos into concise and engaging clips. Existing automatic editing methods predominantly rely on textual cues from ASR transcripts and end-to-end segment selection, often neglecting the rich visual context and leading to incoherent outputs. In this paper, we propose a human-inspired automatic video editing framework (HIVE) that leverages multimodal narrative understanding to address these limitations. Our approach incorporates character extraction, dialogue analysis, and narrative summarization through multimodal large language models, enabling a holistic understanding of the video content. To further enhance coherence, we apply scene-level segmentation and decompose the editing process into three subtasks: highlight detection, opening/ending selection, and pruning of irrelevant content. To facilitate research in this area, we introduce DramaAD, a novel benchmark dataset comprising over 800 short drama episodes and 500 professionally edited advertisement clips. Experimental results demonstrate that our framework consistently outperforms existing baselines across both general and advertisement-oriented editing tasks, significantly narrowing the quality gap between automatic and human-edited videos.
在线视频内容的迅速增长,特别是在短视频平台上,产生了对高效视频编辑技术的日益增长的需求,这种技术可以将长式视频压缩为简洁和有吸引力的剪辑。现有的自动编辑方法主要依赖来自ASR记录誊本和端至端段选择的文字提示,往往忽视丰富的视觉背景,导致产出不连贯。在本文件中,我们提议了一个由人启发的自动视频编辑框架(HIVE),利用多式联运叙述性理解来克服这些限制。我们的方法包括了字符提取、对话分析和通过多式联运大型语言模型的叙述性归纳,从而能够全面理解视频内容。为了进一步加强一致性,我们应用了场景一级的分解,并将编辑过程分解成三个子任务:突出探测、打开/结束选择以及剪切不相干的内容。为了便利这方面的研究,我们介绍了由800多场短剧和500个经过专业编辑的广告剪辑组成的新型基准数据集DramaADread。实验结果表明,我们的框架始终超越了一般和面向广告编辑任务的现有基线,大大缩小了自动和人类编辑录像的质量差距。
Article 15
Title@2025-07-03 (4): GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling
Title: GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling | GPAS: Beschleunigung der Konvergenz des LLM-Vortrainings durch Gradient-Preserving Activation Scaling | GPAS:通过 “ 渐进式保留动力扩增 “ 加速汇集LLM预备训练 2506.22049v2 |
Authors (15): Tianhao Chen, Xin Xu, Zijing Liu, Pengxiang Li, Xinyuan Song, Ajay Kumar Jaiswal, Fan Zhang, Jishan Hu, Yang Wang, Hao Chen, Shizhe Diao, Shiwei Liu, Yu Li, Lu Yin, Can Yang
Modern Large Language Models, such as the LLaMA, Qwen and DeepSeek series, predominantly adopt the Pre-LayerNorm (Pre-LN) Transformer architecture. While being stable during pretraining and scalable to large model sizes, Pre-LN suffers from an exponential growth in activation variance across layers, causing the shortcut to dominate over sub-layer outputs in the residual connection and limiting the learning capacity of deeper layers. To mitigate this issue, we propose Gradient-Preserving Activation Scaling (GPAS), a simple technique that can be used in combination with existing approaches. GPAS works by scaling down the intermediate activations while keeping their gradients unchanged. This leaves information in the activations intact, and avoids the gradient vanishing problem associated with gradient downscaling. Extensive experiments across various model sizes from 71M to 1B show that GPAS achieves consistent performance gains. Beyond enhancing Pre-LN Transformers, GPAS also shows promise in improving alternative architectures such as Sandwich-LN and DeepNorm, demonstrating its versatility and potential for improving training dynamics in a wide range of settings. Our code is available at https://github.com/dandingsky/GPAS.
LLAMA、Quen和DeepSeek等现代大型语言模型,主要采用LLAMA、Quen和DeepSeek系列,主要采用LayerNorm (Pre-LN) 变异器结构。在培训前稳定且可缩至大模型大小的同时,LN前会因不同层引发差异的指数性增长而受害,导致在剩余连接方面对次层输出的捷径占支配地位,并限制了更深层的学习能力。为了缓解这一问题,我们建议采用一种简单技术,即渐进式保留活化增强(GPAS),这一技术可以与现有方法相结合使用。GPASS通过缩小中间激活,同时保持其梯度不变。这使激活中的信息保持不变,避免了与梯度降幅缩小相关的梯度问题逐渐消失。从71M到1B的各种模型规模的广泛实验表明,GOS取得了一致的绩效收益。除了加强LN前变异器外,GOS还显示出改进其他结构如桑威奇-LN和Deep Norm(D)的希望改进其他结构,表明其多功能和潜力,在广泛的环境中改进培训动态。我们的代码可在 http://G。
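The GPAS abstract describes scaling down intermediate activations while leaving their gradients unchanged. In an autodiff framework this kind of behavior is typically obtained with a stop-gradient operator, for example y = x - s * stop_gradient(x). The sketch below writes the forward and backward passes out by hand to make the difference from ordinary scaling visible. The fixed scalar `s` and all function names are assumptions; the paper's exact parameterization is not reproduced.

```python
def scaled_forward(x, s):
    """Forward pass of y = x - s * stop_gradient(x); numerically equal to (1 - s) * x."""
    return [(1.0 - s) * xi for xi in x]


def backward_gradient_preserving(grad_y, s):
    """Backward pass with stop-gradient: the s * x term is treated as a constant,
    so dy/dx = 1 and the upstream gradient passes through unchanged."""
    return list(grad_y)


def backward_plain_scaling(grad_y, s):
    """Backward pass of ordinary scaling y = (1 - s) * x, shown for contrast:
    the gradient shrinks by the same factor as the activation."""
    return [(1.0 - s) * g for g in grad_y]


x = [4.0, -2.0]
upstream = [1.0, 1.0]
y = scaled_forward(x, s=0.5)                           # activations are halved
g_kept = backward_gradient_preserving(upstream, 0.5)   # gradient unchanged
g_shrunk = backward_plain_scaling(upstream, 0.5)       # gradient halved
```

Stacking many layers of plain scaling would multiply the shrink factors together, which is the vanishing-gradient issue the abstract says the gradient-preserving variant avoids.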
Article 16
Title@2025-07-03 (4): Enhancing Clinical Multiple-Choice Questions Benchmarks with Knowledge Graph Guided Distractor Generation
Title: Enhancing Clinical Multiple-Choice Questions Benchmarks with Knowledge Graph Guided Distractor Generation | Verbesserung klinischer Multiple-Choice-Fragen Benchmarks mit Knowledge Graph Guided Distractor Generierung | 加强具有知识图导引引产生体的临床多选择问题基准,加强临床多选择问题基准 2506.00612v3 |
Authors (5): Running Yang, Wenlong Deng, Minghui Chen, Yuyin Zhou, Xiaoxiao Li
Clinical tasks such as diagnosis and treatment require strong decision-making abilities, highlighting the importance of rigorous evaluation benchmarks to assess the reliability of large language models (LLMs). In this work, we introduce a knowledge-guided data augmentation framework that enhances the difficulty of clinical multiple-choice question (MCQ) datasets by generating distractors (i.e., incorrect choices that are similar to the correct one and may confuse existing LLMs). Using our KG-based pipeline, the generated choices are both clinically plausible and deliberately misleading. Our approach involves multi-step, semantically informed walks on a medical knowledge graph to identify distractor paths (associations that are medically relevant but factually incorrect), which then guide the LLM in crafting more deceptive distractors. We apply the designed knowledge graph guided distractor generation (KGGDG) pipeline to six widely used medical QA benchmarks and show that it consistently reduces the accuracy of state-of-the-art LLMs. These findings establish KGGDG as a powerful tool for enabling more robust and diagnostic evaluations of medical LLMs.
诊断和治疗等临床任务需要强大的决策能力,强调严格的评估基准对于评估大型语言模型(LLMs)可靠性的重要性。在这项工作中,我们引入了一个知识引导数据增强框架,通过产生分散处理器(即与正确的选择相类似,并可能混淆现有的LMs),增加临床多选择问题数据集的困难(即不正确的选择,与正确的选择相近,可能混淆现有LMs ) 。利用我们基于KG的管道,所产生的选择在临床上是可信的,蓄意误导。我们的方法包括多步、有语义依据地在医学知识图上行走,以识别在医学上相关但事实上不正确的分散式路径协会,然后指导LM设计出更具有欺骗性的分散式分散处理器。我们将设计的知识图表用于引导分散处理器生成(KGGDG) Pipline的六种广泛使用的医学QA基准,并表明它不断降低最新LMs的准确性。这些发现KGGDGD是能够更可靠和诊断性地评估医学LMs的有力工具。
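The KGGDG abstract describes multi-step walks on a medical knowledge graph to find associations that are relevant to the question but are not the correct answer. The sketch below substitutes a plain breadth-first exploration for the paper's semantically informed walks, on an invented toy graph with placeholder node names. The function `distractor_candidates` and every node, edge and type in the example are hypothetical.

```python
def distractor_candidates(graph, node_type, start, answer, max_hops=2):
    """Collect nodes within `max_hops` of `start` that share the answer's type
    but are not the answer itself."""
    frontier, seen, found = [start], {start}, []
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for neighbor in graph.get(node, []):
                if neighbor in seen:
                    continue
                seen.add(neighbor)
                next_frontier.append(neighbor)
                same_type = node_type.get(neighbor) == node_type.get(answer)
                if neighbor != answer and same_type:
                    found.append(neighbor)
        frontier = next_frontier
    return found


# Invented placeholder graph: adjacency lists plus a type label per node.
graph = {
    "symptom_A": ["disease_X", "disease_Y"],
    "disease_Y": ["drug_P", "disease_Z"],
}
node_type = {
    "symptom_A": "symptom", "disease_X": "disease", "disease_Y": "disease",
    "disease_Z": "disease", "drug_P": "drug",
}
candidates = distractor_candidates(graph, node_type, "symptom_A", "disease_X")
```

The returned nodes are related to the question entity yet differ from the answer, which is the kind of path the abstract says is then handed to the LLM to phrase as a distractor.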
Article 17
Title@2025-07-03 (4): Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs
Title: Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs | Selbstkorrektionsbank: Enthüllung und Adressierung des Selbstkorrektions-Blindflecks in LLMs | 自我校正法官:在LLMs中披露和处理自我校正的盲人点 2507.02778v1 |
Authors (1): Ken Tsui
Although large language models (LLMs) have become transformative, they still make mistakes and can explore unproductive reasoning paths. Self-correction is an important capability for a trustworthy LLM, particularly an autoregressive LLM. While LLMs can identify errors in user input, they exhibit a systematic ‘Self-Correction Blind Spot’: they fail to correct identical errors in their own outputs. To study this phenomenon, we introduce Self-Correction Bench, a systematic framework that measures it through controlled error injection at three complexity levels. Testing 14 models, we find an average blind spot rate of 64.5%. We find multiple lines of evidence that this limitation relates to training data composition: human training demonstrations predominantly show error-free responses rather than error-correction sequences, unlike RL-trained models that learn error correction through outcome feedback. Remarkably, simply appending “Wait” reduces blind spots by 89.3%, suggesting that the capability exists but requires activation. Our work highlights a critical limitation in current LLMs and offers potential avenues for improving their reliability and trustworthiness.
尽管大型语言模型(LLMS)已经转型,但它们仍然会犯错误,并且可以探索非生产性的推理路径。自我校正是值得信赖的LLM(LLM)的重要能力,特别是自动递减的LLM。虽然LLMS可以识别用户输入中的错误,但它们展示了系统化的“自我校正盲点 ” — — 无法纠正自己的产出中的相同错误。为了系统地研究这一现象,我们引入了自我校正法官,这是一个系统化的框架,通过在三个复杂层次的受控错误注入来测量这一现象。测试了14个模型,我们发现平均64.5%的盲点率。我们发现了许多证据,证明这一限制与数据构成有关:人类培训演示主要显示无误反应而不是错误校正序列,而不像受RL培训的模型那样通过结果反馈来学习错误校正。值得注意的是,仅仅附加“等待”就能将盲点减少89.3%,这表明存在这种能力,但需要激活。我们的工作突出了当前LMS的关键性限制,并且提供了提高可靠性和可信度的潜在途径。
Article 18
Title@2025-07-03 (4): DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment
Title: DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment | DeSTA2.5-Audio: Auf dem Weg zu einem General-Purpose Large Audio Language Model mit selbsterzeugter Cross-Modal Alignment | DeSTA2.5-Audio:努力建立具有自发跨模式一致的通用大型音频语言模型 2507.02768v1 |
Authors (28): Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Sung-Feng Huang, Chih-Kai Yang, Chee-En Yu, Chun-Wei Chen, Wei-Chih Chen, Chien-yu Huang, Yi-Cheng Lin, Yu-Xiang Lin, Chi-An Fu, Chun-Yi Kuan, Wenze Ren, Xuanjun Chen, Wei-Ping Huang, En-Pei Hu, Tzu-Quan Lin, Yuan-Kuei Wu, Kuan-Po Huang, Hsiao-Ying Huang, Huang-Cheng Chou, Kai-Wei Chang, Cheng-Han Chiang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee
We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM) designed for robust auditory perception and instruction-following, without requiring task-specific audio instruction-tuning. Recent LALMs typically augment Large Language Models (LLMs) with auditory capabilities by training on large-scale, manually curated or LLM-synthesized audio-instruction datasets. However, these approaches have often suffered from the catastrophic forgetting of the LLM’s original language abilities. To address this, we revisit the data construction pipeline and propose DeSTA, a self-generated cross-modal alignment strategy in which the backbone LLM generates its own training targets. This approach preserves the LLM’s native language proficiency while establishing effective audio-text alignment, thereby enabling zero-shot generalization without task-specific tuning. Using DeSTA, we construct DeSTA-AQA5M, a large-scale, task-agnostic dataset containing 5 million training samples derived from 7,000 hours of audio spanning 50 diverse datasets, including speech, environmental sounds, and music. DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks, including Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench. Comprehensive comparative studies demonstrate that our self-generated strategy outperforms widely adopted data construction and training strategies in both auditory perception and instruction-following capabilities. Our findings underscore the importance of carefully designed data construction in LALM development and offer practical insights for building robust, general-purpose LALMs.
我们提出 DeSTA2.5-Audio,一个通用的大型音频语言模型(LALM),旨在实现稳健的听觉感知和指令遵循,且无需针对特定任务的音频指令微调。近期的 LALM 通常通过在大规模、人工整理或由 LLM 合成的音频指令数据集上训练,为大型语言模型(LLMs)增添听觉能力。然而,这些方法往往会导致 LLM 原有语言能力的灾难性遗忘。为此,我们重新审视数据构建流程,提出 DeSTA,一种自生成的跨模态对齐策略,由主干 LLM 生成自身的训练目标。该方法在建立有效的音频-文本对齐的同时保留了 LLM 原有的语言能力,从而无需特定任务微调即可实现零样本泛化。利用 DeSTA,我们构建了 DeSTA-AQA5M,一个大规模、任务无关的数据集,包含500万个训练样本,来源于涵盖语音、环境声音和音乐的50个不同数据集中的7000小时音频。DeSTA2.5-Audio 在 Dynamic-SUPERB、MMAU、SAKURA、Speech-IFEval 和 VoiceBench 等众多音频语言基准上取得了最先进或具有竞争力的性能。全面的对比研究表明,我们的自生成策略在听觉感知和指令遵循能力上均优于广泛采用的数据构建和训练策略。我们的研究结果强调了精心设计的数据构建在 LALM 开发中的重要性,并为构建稳健的通用 LALM 提供了实用的见解。
Article 19
Title@2025-07-03 (4): Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression
Title: Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression | Batch-Max: Höherer LLM-Durchsatz mit größeren Batch-Größen und KV Cache-Kompression | 批量最大:使用大批量大小和 KV缓存压缩的高级 LLM 输送量 2412.05693v3 |
Authors (3): Michael R. Metel, Boxing Chen, Mehdi Rezagholizadeh
Several works have developed eviction policies to remove key-value (KV) pairs from the KV cache for more efficient inference. The focus has been on compressing the KV cache after the input prompt has been processed for faster token generation. In settings with limited GPU memory, and when the input context is longer than the generation length, we show that by also compressing the KV cache during the input processing phase, larger batch sizes can be used resulting in significantly higher throughput while still maintaining the original model’s accuracy.
一些工程制定了驱逐政策,从 KV 缓存中去除关键值对配对, 以便更有效地推断。 重点是在输入提示处理后压缩 KV 缓存, 以便更快的代号生成 。 在 GPU 内存有限的情况下, 当输入环境长于生成时间长度时, 我们显示, 通过在输入处理阶段压缩 KV 缓存, 还可以使用更大的批量大小, 从而在保持原始模型准确性的同时, 导致显著更高的吞吐量 。
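The Batch-Max abstract argues that compressing the KV cache already during input processing frees enough memory for larger batches. The arithmetic behind that claim can be sketched as follows. The model dimensions, the 8 GiB budget, and the `keep_ratio` parameter are hypothetical, and the estimate ignores weights, activations and the temporary peak before eviction takes effect.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache per token: one key and one value vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_batch_size(memory_budget_bytes, tokens_per_sequence, per_token_bytes, keep_ratio=1.0):
    """Largest batch whose retained KV pairs fit in the budget.

    keep_ratio is the assumed fraction of KV pairs kept after eviction.
    """
    kept_tokens = max(1, int(tokens_per_sequence * keep_ratio))
    return memory_budget_bytes // (kept_tokens * per_token_bytes)


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, fp16 values.
per_token = kv_bytes_per_token(32, 8, 128)               # 131072 bytes = 128 KiB
budget = 8 * 2**30                                       # 8 GiB reserved for the KV cache
full = max_batch_size(budget, 4096, per_token)           # 16 sequences
compressed = max_batch_size(budget, 4096, per_token, keep_ratio=0.25)  # 64 sequences
```

Under these assumptions, keeping a quarter of the KV pairs for a 4096-token context quadruples the batch that fits, which is the mechanism the abstract credits for higher throughput.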
Article 20
Title@2025-07-03 (4): Measurement of the Granularity of Vowel Production Space By Just Producible Different (JPD) Limens
Title: Measurement of the Granularity of Vowel Production Space By Just Producible Different (JPD) Limens | Messung der Granularität des Vowel-Produktionsraumes durch einfach nur produzierbare unterschiedliche (JPD) Limens | 仅用可制成差异(JPD)激光测量Vowel 生产空间的颗粒度 2507.02744v1 |
Authors (1): Peter Viechnicki
A body of work over the past several decades has demonstrated that the complex and coordinated articulatory movements of human vowel production are governed (at least in part) by control mechanisms whose targets are regions of auditory space. Within the target region, control at the sub-phonemic level has also been demonstrated, but the degree of accuracy of that control is unknown. The current work investigates this question by asking how far apart two vowel stimuli must lie in auditory space in order to yield reliably different imitations. This distance is termed the ‘Just Producible Difference’ (JPD). The current study uses a vowel mimicry paradigm to derive the first measurement of JPD among two sets of English speakers during front vowel production. JPD is estimated at between 14 and 51 mels in F1 × F2 space. This finding has implications for episodic theories of speech production. It also clarifies the possible structures of human vowel systems by setting a theoretical lower bound for how close two vowel phonemes may be in a speaker’s formant space, and hence offers a psychophysical explanation of observed trends in the number and patterns of possible vowel phonemes.
过去几十年来的一项工作表明,人类元音制作的复杂而协调的脉动运动由控制机制(至少部分由部分)控制,其目标区域是听觉空间区域。在亚声层的目标区域控制范围内,也已经证明。但这种控制的准确度尚不得而知。目前的工作对这个问题进行了调查,询问在听觉空间必须有两个元音模拟器,才能产生可靠的不同模仿效果?这种距离被称为“可实现的公平差异 ” ( JPD ) 。目前的研究使用一个元音模拟模式,在先发音制作时在两组英语发言者中得出JPD的首次测量结果。在F1 X F2 空间,JPD估计在14到51毫秒之间。这一发现对语言制作的直观理论有影响。还澄清了人类元音系统可能的结构,从理论上上看,降低了两个元音调器在发言者的形态空间中如何接近的界限,从而从心理物理角度解释了在可能使用的元音频电话的数量和模式上观察到的趋势。
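To make the "mels in F1 × F2 space" unit concrete, the following sketch converts formant frequencies from Hz to mels and measures the Euclidean distance between two vowels. It assumes the common 2595·log10(1 + f/700) mel formula and a Euclidean metric; the abstract does not say which mel variant or distance the study uses, and the formant values below are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """Convert Hz to mels with the widely used 2595 * log10(1 + f/700) formula."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_distance(vowel_a, vowel_b):
    """Euclidean distance between two (F1, F2) pairs, measured in mels."""
    (f1a, f2a), (f1b, f2b) = vowel_a, vowel_b
    return math.hypot(hz_to_mel(f1a) - hz_to_mel(f1b),
                      hz_to_mel(f2a) - hz_to_mel(f2b))

# Hypothetical front-vowel stimuli given as (F1, F2) in Hz.
stimulus_a = (300.0, 2300.0)
stimulus_b = (320.0, 2250.0)
d = mel_distance(stimulus_a, stimulus_b)
# 14 and 51 mels are the bounds of the JPD estimate reported in the abstract.
print(f"distance = {d:.1f} mels; above lower JPD bound: {d >= 14}; above upper: {d >= 51}")
```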
Article 21
Title@2025-07-03 (4): Early Signs of Steganographic Capabilities in Frontier LLMs
Title: Early Signs of Steganographic Capabilities in Frontier LLMs | Frühe Anzeichen von Steganographischen Fähigkeiten in Frontier LLMs | 边疆长长体动物能力早期信号 2507.02737v1 |
Authors (5): Artur Zolkowski, Kei Nishimura-Gasparian, Robert McCarthy, Roland S. Zimmermann, David Lindner
Monitoring Large Language Model (LLM) outputs is crucial for mitigating risks from misuse and misalignment. However, LLMs could evade monitoring through steganography: Encoding hidden information within seemingly benign generations. In this paper, we evaluate the steganography capabilities in frontier LLMs to better understand the risk they pose. We focus on two types of steganography: passing encoded messages and performing encoded reasoning. We find that current models are unable to encode short messages in their outputs without a monitor noticing under standard affordances. They can succeed, however, if given additional affordances such as using an unmonitored scratchpad and coordinating on what encoding scheme to use. We additionally find early signs that models can perform basic encoded reasoning in a simple state-tracking problem. This includes some ability to reason with their own and pre-defined schemes, including encoding schemes such as Hexadecimal. Despite this, they can rarely hide reasoning subtly within a cover task to fool a monitor. Overall, our results indicate that current LLMs exhibit nascent steganographic capabilities. While these capabilities are likely insufficient to bypass well-designed monitors at present, this could change in the future.
监测大语言模型(LLM) 输出对于减少误用和错配的风险至关重要。 但是, LLMs 可以通过星座学逃避监测: 将隐蔽的信息编码在看似友好的几代人中。 在本文中, 我们评估了边界LLMs 中的隐蔽信息能力, 以更好地了解它们构成的风险。 我们关注两种类型的星座学: 传递编码信息, 并进行编码推理。 我们发现, 目前的模型无法在它们的输出中编码短信息, 而没有在标准价格下进行显示。 但是, 如果给它们额外费用, 如使用未监测的抓取和协调编码方法, 它们可以成功。 我们还发现一些早期迹象, 显示模型可以在简单的状态跟踪问题中进行基本的编码推理。 这包括使用自己和预设的编码方法, 包括 HexaDecimal 等编码方法的某些能力。 尽管如此, 它们很难在掩盖任务中隐含精细的推理。 总的来说, 我们的结果表明, 目前的LLMS 显示, 将展示新生的视觉能力显示, 而这些能力可能不足以绕过。
Article 22
Title@2025-07-03 (4): Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge
Title: Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge | Mind2Web 2: Agentische Suche mit Agent-as-a-Judge bewerten | Mind2Web 2: 与代理法官评估代理搜索 2506.21506v2 |
Authors (26): Boyu Gou, Zanming Huang, Yuting Ning, Yu Gu, Michael Lin, Weijian Qi, Andrei Kopanev, Botao Yu, Bernal Jiménez Gutiérrez, Yiheng Shu, Chan Hee Song, Jiaman Wu, Shijie Chen, Hanane Nour Moussa, Tianshu Zhang, Jian Xie, Yifei Li, Tianci Xue, Zeyi Liao, Kai Zhang, Boyuan Zheng, Zhaowei Cai, Viktor Rozgic, Morteza Ziyadi, Huan Sun, Yu Su
Agentic search such as Deep Research systems, where agents autonomously browse the web, synthesize information, and return comprehensive citation-backed answers, represents a major shift in how users interact with web-scale information. While agentic search promises greater efficiency and cognitive offloading, its growing complexity and open-endedness have outpaced existing evaluation benchmarks and methodologies, which largely assume short search horizons and static answers. In this paper, we introduce Mind2Web 2, a benchmark of 130 realistic, high-quality, and long-horizon tasks that require real-time web browsing and extensive information synthesis, constructed with over 1000 hours of human labor. To address the challenge of evaluating time-varying and complex answers, we propose a novel Agent-as-a-Judge framework. Our method constructs task-specific judge agents based on a tree-structured rubric design to automatically assess both answer correctness and source attribution. We conduct a comprehensive evaluation of ten frontier agentic search systems and human performance, along with a detailed error analysis to draw insights for future development. The best-performing system, OpenAI Deep Research, can already achieve 50-70% of human performance while spending half the time, highlighting its great potential. Altogether, Mind2Web 2 provides a rigorous foundation for developing and benchmarking the next generation of agentic search systems.
深层研究系统等代理搜索,例如深层研究系统的代理自动浏览网络,综合信息,并返回全面引用支持的答案,这是用户与网络规模信息互动方式的重大转变。虽然积极搜索日益复杂和开放,速度超过了现有的评价基准和方法,这些基准和方法基本上假定短搜索视野和静态答案。在本文中,我们引入了Mind2Web 2, 基准为130个现实、高质量和长视线,需要实时网络浏览和广泛的信息合成,以人类劳动10小时以上的方式构建。为了应对对时间变化和复杂答案进行评估的挑战,我们提出了一个新的代理As-Judge框架。我们的方法根据树形结构设计设计构建了具体任务法官代理人,以自动评估答案的正确性和来源归属。我们全面评价了10个前沿代理搜索系统和人类业绩,并详细分析了未来发展的真知灼见。最佳系统,OpenAI 深层研究,在开发50-70年期基础时,可以提供最强的模型基础。
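A tree-structured rubric of the kind described above can be sketched as a recursive score aggregation. The node fields, the averaging rule, and the "critical node" gating below are assumptions made for illustration; the abstract does not specify how Mind2Web 2 aggregates leaf verdicts, and in practice the leaf scores would come from a judge agent rather than being hard-coded.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RubricNode:
    name: str
    score: Optional[float] = None   # leaf verdict in [0, 1], assumed to come from a judge
    critical: bool = False          # assumed rule: a failed critical child zeroes its parent
    children: List["RubricNode"] = field(default_factory=list)

def evaluate(node: RubricNode) -> float:
    """Aggregate leaf verdicts bottom-up (mean of children, gated by critical failures)."""
    if not node.children:
        if node.score is None:
            raise ValueError(f"leaf '{node.name}' has no verdict")
        return node.score
    child_scores = [(child, evaluate(child)) for child in node.children]
    if any(child.critical and s == 0.0 for child, s in child_scores):
        return 0.0
    return sum(s for _, s in child_scores) / len(child_scores)

# Hypothetical rubric for one task: correctness and attribution are checked separately.
rubric = RubricNode("task", children=[
    RubricNode("answer_correct", children=[
        RubricNode("item_1_matches", score=1.0),
        RubricNode("item_2_matches", score=0.0),
    ]),
    RubricNode("sources_support_answer", score=1.0, critical=True),
])
print(evaluate(rubric))
```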
Article 23
Title@2025-07-03 (4): On Characterizations for Language Generation: Interplay of Hallucinations, Breadth, and Stability
Title: On Characterizations for Language Generation: Interplay of Hallucinations, Breadth, and Stability | Über Charakterisierungen für die Sprachgenerierung: Interplay von Halluzinationen, Breadth und Stabilität | 语言生成特征:幻觉、面包和稳定之间的相互作用 2412.18530v2 |
Authors (3): Alkis Kalavasis, Anay Mehrotra, Grigoris Velegkas
We study language generation in the limit, introduced by Kleinberg and Mullainathan [KM24] and building on classical works of Gold [Gol67] and Angluin [Ang79]. [KM24]’s main result is an algorithm for generating from any countable language collection in the limit. While their algorithm eventually generates unseen strings from the target language $K$, it sacrifices coverage or breadth, i.e., its ability to generate a rich set of strings. Recent work introduces different notions of breadth and explores when generation with breadth is possible, leaving a full characterization of these notions open. Our first set of results settles this by characterizing generation for existing notions of breadth and their natural extensions. Interestingly, our lower bounds are very flexible and hold for many performance metrics beyond breadth: for instance, they show that, in general, it is impossible to train generators which achieve a higher perplexity or lower hallucination rate for $K$ compared to other languages. Next, we study language generation with breadth and stable generators (algorithms that eventually stop changing after seeing an arbitrary but finite number of strings) and prove unconditional lower bounds for such generators, strengthening the results of [KMV25] and demonstrating that generation with many existing notions of breadth becomes equally hard when stability is required. This gives a separation for generation with approximate breadth, between stable and unstable generators, highlighting the rich interplay between breadth, stability, and consistency in language generation.
克莱伯格和穆莱纳坦(KM24)在古典Gold[Gol67]和安格鲁因[Ang79]古典作品的基础上,在极限范围内,我们研究语言的生成,这是克莱伯格和穆莱纳坦[KM24]提出的。[KM24]的主要结果是从任何可计算的语言收集中产生一个算法。虽然他们的算法最终从目标语言中产生看不见的字符串,但是它牺牲了覆盖面或广度,即它能够产生一套丰富的字符串。最近的工作引入了不同宽度的概念,并在有可能产生宽度时探索这些概念,从而对这些概念的完整定性开放。我们的第一套结果通过将现有宽度概念及其自然延伸的生成特征化来解决这个问题。有趣的是,我们的下限非常灵活,而且对超出广度的许多性语言的生成量值都持有。 举例说,一般来说,不可能对发电机进行更难理解或更低的错觉觉悟率。 其次,我们用宽度和更稳定的语言生成方法最终在看到任意但有限的字符串数之后停止改变。 并且证明,我们较低的广度的频度是,这种稳定的生成的深度的频率和深度的生成的深度使得稳定的生成更加稳定,这种稳定的生成变得更加稳定。
Article 24
Title@2025-07-03 (4): Next-Token Prediction Task Assumes Optimal Data Ordering for LLM Training in Proof Generation
Title: Next-Token Prediction Task Assumes Optimal Data Ordering for LLM Training in Proof Generation | Next-Token-Vorhersage-Aufgabe setzt eine optimale Datenbestellung für LLM-Training in Proof Generation voraus | 假定为校实生成的LLM培训提供最佳数据排序 2411.00863v2 |
Authors (11): Chenyang An, Shima Imani, Feng Yao, Chengyu Dong, Ali Abbasi, Harsh Shrivastava, Samuel Buss, Jingbo Shang, Gayathri Mahalingam, Pramod Sharma, Maurice Diesendruck
In the field of large language model (LLM)-based proof generation, despite extensive training on large datasets such as ArXiv, LLMs still exhibit only modest performance on proving tasks of moderate difficulty. We believe that this is partly due to the widespread presence of suboptimal ordering within the data for each proof used in training. For example, published proofs often follow a purely logical order, where each step logically proceeds from the previous steps based on the deductive rules. This order is designed to facilitate the verification of the proof’s soundness, rather than to help people and models learn the discovery process of the proof. In proof generation, we argue that the optimal order for one training data sample occurs when the relevant intermediate supervision for a particular proof step in the proof is always positioned to the left of that proof step. We call such an order the intuitively sequential order. We validate our claims using two tasks: intuitionistic propositional logic theorem-proving and digit multiplication. Our experiments verify the order effect and provide support for our explanations. We demonstrate that training is most effective when the proof is in the intuitively sequential order. Moreover, the order effect and the performance gap between models trained on different data orders can be substantial – with an 11 percent improvement in proof success rate observed in the propositional logic theorem-proving task, between models trained on the optimal order compared to the worst order. Lastly, we define a common type of order issue in advanced math proofs and find that 17.3 percent of theorems with nontrivial proofs in the first two chapters of a widely used graduate-level mathematics textbook suffer from this issue. A detailed list of those proofs is provided in the appendix.
在大型语言模型(LLM)证据生成领域,尽管在ArXiv等大型数据集方面进行了大量培训,但LLMs在证明中度困难的任务方面的表现仍然有限。我们认为,部分原因在于培训中所用每份证据的数据中存在不最优化的排序。例如,公布的每份证据通常都遵循纯粹逻辑顺序,每一步骤都从先前基于推理规则的步骤中合乎逻辑地从先前的步骤中得出。这一顺序的设计是为了便利验证证据的正确性,而不是帮助人们和模型了解发现证据的过程。在证据生成方面,我们争辩说,一个培训数据样本的最佳排序是广泛的,而一个培训数据样本的最佳排序总是在证据步骤的左边。我们称之为这种直观的顺序顺序。我们用两个任务来验证我们的索赔:直观的基底线逻辑推理推理推理推理和数字乘法乘法乘法。我们实验核实了最坏的顺序效果,并为我们的解释提供了支持。我们证明最差的路径表显示,当一个培训最有效的时候,当证据在不精确的顺序排序一级,我们发现一个普遍的非抽样抽样抽样,而使用最精确的数学排序的顺序排序中,在经过训练的顺序排序中,在最精确的顺序中,在最接近的排序中可以界定的排序中,在最短的排序中,在最接近的顺序中,在经过的排序和最接近的排序中,在最接近的顺序上,在最短的模型中,在最短的排序中,在最短的模型中,在最接近。
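The ordering argument can be illustrated on the digit-multiplication task with a small sketch that formats the same training sample in two orders. The string formats and function names are hypothetical and are not taken from the paper; the point is only that in the first format every intermediate result sits to the left of the token that depends on it, whereas in the second the answer must be predicted before its supporting steps appear.

```python
def partial_products(a, b):
    """Shifted partial products of a * b, one per digit of b (least significant first)."""
    return [a * int(digit) * 10 ** place
            for place, digit in enumerate(reversed(str(b)))]

def sequential_sample(a, b):
    """Intermediate supervision placed before (to the left of) the final answer."""
    steps = " + ".join(str(p) for p in partial_products(a, b))
    return f"{a} * {b} = {steps} = {a * b}"

def answer_first_sample(a, b):
    """Same content, but the answer precedes the steps that justify it."""
    steps = " + ".join(str(p) for p in partial_products(a, b))
    return f"{a} * {b} = {a * b} because {steps}"

print(sequential_sample(12, 34))
print(answer_first_sample(12, 34))
```

Under next-token prediction, a model trained on the second format has to emit the product before any partial product is in its context, which is the kind of suboptimal ordering the abstract describes.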
Article 25
Title@2025-07-03 (4): Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers
Title: Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers | Können LLMs kritische Einschränkungen innerhalb der wissenschaftlichen Forschung identifizieren? Eine systematische Bewertung von KI-Forschungspapieren | LLMs能否查明科学研究中的关键限制? 对AI研究文件的系统评估 2507.02694v1 |
Authors (5): Zhijian Xu, Yilun Zhao, Manasi Patwardhan, Lovekesh Vig, Arman Cohan
Peer review is fundamental to scientific research, but the growing volume of publications has intensified the challenges of this expertise-intensive process. While LLMs show promise in various scientific tasks, their potential to assist with peer review, particularly in identifying paper limitations, remains understudied. We first present a comprehensive taxonomy of limitation types in scientific research, with a focus on AI. Guided by this taxonomy, we present LimitGen, the first comprehensive benchmark for evaluating LLMs’ capability to support early-stage feedback and complement human peer review. Our benchmark consists of two subsets: LimitGen-Syn, a synthetic dataset carefully created through controlled perturbations of high-quality papers, and LimitGen-Human, a collection of real human-written limitations. To improve the ability of LLM systems to identify limitations, we augment them with literature retrieval, which is essential for grounding the identification of limitations in prior scientific findings. Our approach enhances the capabilities of LLM systems to generate limitations in research papers, enabling them to provide more concrete and constructive feedback.
对科学研究来说,同侪审查至关重要,但越来越多的出版物加剧了这一专门知识密集型进程的挑战。虽然LLMs在各种科学任务中表现出希望,但它们协助同侪审查,特别是确定纸张限制的潜力仍然不足。我们首先展示了科学研究中限制类型的全面分类,重点是AI。在这个分类学的指导下,我们展示了限制Gen,这是评价LLMs支持早期反馈和补充人类同侪审查能力的第一个全面基准。我们的基准包括两个子集:限制Gen-Syn,这是一个通过受控地干扰高质量论文而精心制作的合成数据集,以及LimeGen-Human,这是一套真正的人为限制。为了提高LMM系统识别限制的能力,我们用文献检索来补充这些限制,这对于确定先前科学发现中的局限性至关重要。我们的方法加强了LM系统在研究文件中产生限制的能力,使其能够提供更具体和建设性的反馈。
Article 26
Title@2025-07-03 (4): Exploring Gender Bias Beyond Occupational Titles
Title: Exploring Gender Bias Beyond Occupational Titles | Erforschen von Gender-Bias über Berufsbezeichnungen hinaus | 探索职业职称之外的性别偏见 2507.02679v1 |
Authors (2): Ahmed Sabir, Rajesh Sharama
In this work, we investigate the correlation between gender and contextual biases, focusing on elements such as action verbs, object nouns, and particularly on occupations. We introduce a novel dataset, GenderLexicon, and a framework that can estimate contextual bias and its related gender bias. Our model can interpret the bias with a score and thus improve the explainability of gender bias. Also, our findings confirm the existence of gender biases beyond occupational stereotypes. To validate our approach and demonstrate its effectiveness, we conduct evaluations on five diverse datasets, including a Japanese dataset.
在这项工作中,我们调查性别和背景偏见之间的相互关系,侧重于行动动词、目标名词、特别是职业等要素。我们引入了新的数据集GenderLexicon,以及一个能够估计背景偏见及其相关的性别偏见的框架。我们的模型可以以得分来解释偏见,从而改进性别偏见的解释。此外,我们的调查结果证实存在性别偏见,超出了职业陈规定型观念。为了验证我们的方法并展示其有效性,我们评估了五个不同的数据集,包括日本数据集。
Article 27
Title@2025-07-03 (4): Code2Logic: Game-Code-Driven Data Synthesis for Enhancing VLMs General Reasoning
Title: Code2Logic: Game-Code-Driven Data Synthesis for Enhancing VLMs General Reasoning | Code2Logic: Game-Code-getriebene Datensynthese zur Verbesserung von VLMs Allgemeine Begründung | 代码2Llogic: 用于增强VLMs一般理由的游戏-代码-驱动数据合成 2505.13886v2 |
Authors (26): Jingqi Tong, Jixin Tang, Hangcheng Li, Yurong Mou, Ming Zhang, Jun Zhao, Yanbo Wen, Fan Song, Jiahao Zhan, Yuyang Lu, Chaoran Tao, Zhiyuan Guo, Jizhou Yu, Tianhao Cheng, Changhao Jiang, Zhen Wang, Tao Liang, Zhihui Fei, Mingyang Wan, Guojun Ma, Weifeng Ge, Guanhua Chen, Tao Gui, Xipeng Qiu, Qi Zhang, Xuanjing Huang
Visual-language Chain-of-Thought (CoT) data resources are relatively scarce compared to text-only counterparts, limiting the improvement of reasoning capabilities in Vision Language Models (VLMs). However, high-quality vision-language reasoning data is expensive and labor-intensive to annotate. To address this issue, we leverage a promising resource: game code, which naturally contains logical structures and state transition processes. Therefore, we propose Code2Logic, a novel game-code-driven approach for multimodal reasoning data synthesis. Our approach leverages Large Language Models (LLMs) to adapt game code, enabling automatic acquisition of reasoning processes and results through code execution. Using the Code2Logic approach, we developed the GameQA dataset to train and evaluate VLMs. GameQA is cost-effective and scalable, offers controllable difficulty gradation, and is diverse, with 30 games and 158 tasks. Surprisingly, despite training solely on game data, VLMs demonstrated out-of-domain generalization; specifically, Qwen2.5-VL-7B improved performance by 2.33% across 7 diverse vision-language benchmarks. Our code, dataset and models are available at https://github.com/tongjingqi/Code2Logic.
与只使用文本的数据相比,视觉语言链数据资源相对稀缺,限制了提高视觉语言模型(VLMS)的推理能力。然而,高质量的视觉语言推理数据昂贵,需要大量劳动才能说明问题。为了解决这一问题,我们利用了一个大有希望的资源:游戏代码,它自然包含逻辑结构和州过渡过程。因此,我们提议了代码2Logic,这是一种由游戏代码驱动的新颖的方法,用于多式推理数据合成。我们的方法利用了大语言模型(LLLMS)来调整游戏代码,从而能够通过执行代码自动获取推理过程和结果。我们利用代码2Logic方法开发了游戏QA数据集,用于培训和评估VLMS。游戏QA具有成本效益和可扩展性,提供了可控制的难度升级,并且有30场游戏和158项任务。令人惊讶的是,尽管仅就游戏数据进行了培训,但VLMS展示了域通用,特别是Qwen2.5-L-7B,能够通过7个不同的视觉语言基准将业绩提高2.33%。我们的代码、数据设置和模型在http://Lgiqiquc.
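How executing game code can yield a question, a reasoning trace, and a verified answer is sketched below with a toy grid game. The game, the question wording, and the function names are hypothetical and text-only; GameQA itself pairs such questions with rendered game images, which this sketch omits.

```python
# Toy game: a token moves on a size x size grid and is blocked by the border.
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def step(pos, move, size):
    """State transition: apply one move, staying in place if it would leave the grid."""
    dr, dc = MOVES[move]
    r, c = pos[0] + dr, pos[1] + dc
    return (r, c) if 0 <= r < size and 0 <= c < size else pos

def make_qa(start, moves, size=3):
    """Run the game code and record every transition as a reasoning step."""
    pos, trace = start, []
    for m in moves:
        new_pos = step(pos, m, size)
        trace.append(f"{m}: {pos} -> {new_pos}")
        pos = new_pos
    question = (f"On a {size}x{size} grid the token starts at {start}. "
                f"Where is it after the moves {' '.join(moves)}?")
    return {"question": question, "reasoning": trace, "answer": pos}

sample = make_qa((0, 0), "RRD")
print(sample["question"])
print(sample["reasoning"])
print(sample["answer"])
```

Because the answer and every intermediate state come from running the code, the reasoning trace is correct by construction, and difficulty can be varied through parameters such as grid size or move count.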
Article 28
Title@2025-07-03 (4): ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning
Title: ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning | ASDA: Audio-Spektrogramm Differential-Achtungsmechanismus für selbstüberwachtes Repräsentationslernen | ASDA:自我监督代表制学习的听觉分光差异关注机制 2507.02666v1 |
Authors (5): Junyu Wang, Tianrui Wang, Meng Ge, Longbiao Wang, Jianwu Dang
In recent advancements in audio self-supervised representation learning, the standard Transformer architecture has emerged as the predominant approach, yet its attention mechanism often allocates a portion of attention weights to irrelevant information, potentially impairing the model’s discriminative ability. To address this, we introduce a differential attention mechanism, which effectively mitigates ineffective attention allocation through the integration of dual-softmax operations and appropriately tuned differential coefficients. Experimental results demonstrate that our ASDA model achieves state-of-the-art (SOTA) performance across multiple benchmarks, including audio classification (49.0% mAP on AS-2M, 41.5% mAP on AS20K), keyword spotting (98.3% accuracy on SPC-2), and environmental sound classification (96.1% accuracy on ESC-50). These results highlight ASDA’s effectiveness in audio tasks, paving the way for broader applications.
在最近的音频自我监督代表性学习进展中,标准变换器结构已成为主导方法,但其关注机制往往将部分关注权分配给不相关的信息,有可能损害模型的歧视性能力。为此,我们引入了差异关注机制,通过整合双软操作和适当调控差异系数,有效减轻了无效的注意力分配。实验结果显示,我们的自动自控模型在多个基准中取得了最先进的性能,包括音频分类(AS-2M为49.0%,AS-20K为41.5%)、关键词识别(SPC-2为98.3%的准确性)和环境声学分类(ESC-50为96.1%)。这些结果突出表明了ADA在音频任务中的效率,为更广泛的应用铺平了道路。
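The "dual-softmax operations and appropriately tuned differential coefficients" can be read as subtracting one softmax attention map from another. The sketch below assumes the form softmax(Q1·K1ᵀ/√d) − λ·softmax(Q2·K2ᵀ/√d) for a single head with a fixed scalar λ; the projections, normalization, and how λ is tuned in ASDA are not given in the abstract, so these are assumptions.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """Single-head differential attention sketch.

    q1, k1, q2, k2: (seq_len, d) query/key pairs for the two softmax maps.
    v: (seq_len, d_v) values. lam: assumed scalar differential coefficient.
    Returns the output and the combined attention weights.
    """
    d = q1.shape[-1]
    a1 = softmax(q1 @ k1.T / np.sqrt(d))
    a2 = softmax(q2 @ k2.T / np.sqrt(d))
    weights = a1 - lam * a2   # attention mass common to both maps is reduced
    return weights @ v, weights

rng = np.random.default_rng(0)
q1, k1, q2, k2 = rng.normal(size=(4, 6, 8))
v = rng.normal(size=(6, 8))
out, w = differential_attention(q1, k1, q2, k2, v, lam=0.5)
print(out.shape, w.sum(axis=-1))
```

Each row of the combined weights sums to 1 − λ, and individual weights can be negative, which is how the subtraction suppresses attention that both maps assign to the same positions.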
Article 29
Title@2025-07-03 (4): OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding
Title: OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding | OmniDraft: Ein Cross-Vocabulary, Online Adaptive Drafter für die gerätespezifische Dekodierung | 总括草案:跨词汇、在线在线可调适性套用投机下限设计图纸 2507.02659v1 |
Authors (7): Ramchalam Kinattinkara Ramakrishnan, Zhaocong Yuan, Shaojie Zhuo, Chen Feng, Yicheng Lin, Chenzheng Su, Xiaopeng Zhang
Speculative decoding generally dictates having a small, efficient draft model that is either pretrained or distilled offline to a particular target model series, for instance, Llama or Qwen models. However, within online deployment settings, there are two major challenges: 1) usage of a target model that is incompatible with the draft model; 2) expectation of latency improvements over usage and time. In this work, we propose OmniDraft, a unified framework that enables a single draft model to operate with any target model and adapt dynamically to user data. We introduce an online n-gram cache with hybrid distillation fine-tuning to address the cross-vocabulary mismatch across draft and target models; and further improve decoding speed by leveraging adaptive drafting techniques. OmniDraft is particularly suitable for on-device LLM applications where model cost, efficiency and user customization are the major points of contention. This further highlights the need to tackle the above challenges and motivates the “one drafter for all” paradigm. We showcase the proficiency of the OmniDraft framework by performing online learning on math reasoning, coding and text generation tasks. Notably, OmniDraft enables a single Llama-68M model to pair with various target models including Vicuna-7B, Qwen2-7B and Llama3-8B models for speculative decoding; and additionally provides up to 1.5-2x speedup.
在网上部署环境中,存在两大挑战:(1) 使用与模型草案不相符的目标模型;(2) 期望在使用和时间上提高延缓度。在这项工作中,我们提出Omnisuster,这是一个统一框架,使单一的模型草案能够与任何目标模型一起运作,并动态地适应用户数据。我们推出一个带有混合蒸馏微调的在线 ng 缓存,同时进行混合蒸馏微调,以解决草案和目标模型之间的交叉蒸汽错配;通过利用适应性起草技术进一步提高解码速度。在模型成本、效率和用户定制是主要争议点的地方,Ommicomicomicomi of特别适合LM应用程序。这进一步突显了应对上述挑战的必要性,并激励所有模式的Textitone起草者更新。我们通过在线学习数学推理学模型、计算模型、数字组合和指标生成LOma-B(包括数字、数字分析和指标生成),展示了OmniMI草案框架的精度。
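One way to picture an n-gram cache bridging two vocabularies is a lookup that merges runs of draft-model tokens into single target-model tokens before verification. The sketch below is a guess at the general shape of such a cache, using greedy longest-match lookup over string tokens; the cache contents, the matching rule, and the function name are hypothetical and should not be read as OmniDraft's actual mechanism.

```python
def translate_draft_tokens(draft_tokens, ngram_cache, max_n=3):
    """Map draft-vocabulary tokens to target-vocabulary tokens.

    ngram_cache: dict from tuples of draft tokens to one target token
    (hypothetical; in an online setting it would be filled during decoding).
    Uses greedy longest match; unmatched single tokens pass through unchanged.
    """
    out, i = [], 0
    while i < len(draft_tokens):
        for n in range(min(max_n, len(draft_tokens) - i), 0, -1):
            key = tuple(draft_tokens[i:i + n])
            if n == 1 or key in ngram_cache:
                out.append(ngram_cache.get(key, draft_tokens[i]))
                i += n
                break
    return out

# Hypothetical cache entry: three draft tokens correspond to one target token.
cache = {("un", "believ", "able"): "unbelievable"}
print(translate_draft_tokens(["the", "un", "believ", "able", "cat"], cache))
```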
Article 30
Title@2025-07-03 (4): Decoupled Planning and Execution: A Hierarchical Reasoning Framework for Deep Search
Title: Decoupled Planning and Execution: A Hierarchical Reasoning Framework for Deep Search | Entkoppelte Planung und Ausführung: Ein Hierarchisches Reasoning-Framework für tiefe Suche | 分解的规划和执行:深海搜索的等级理据框架 2507.02652v1 |
Authors (8): Jiajie Jin, Xiaoxi Li, Guanting Dong, Yuyao Zhang, Yutao Zhu, Yang Zhao, Hongjin Qian, Zhicheng Dou
Complex information needs in real-world search scenarios demand deep reasoning and knowledge synthesis across diverse sources, which traditional retrieval-augmented generation (RAG) pipelines struggle to address effectively. Current reasoning-based approaches suffer from a fundamental limitation: they use a single model to handle both high-level planning and detailed execution, leading to inefficient reasoning and limited scalability. In this paper, we introduce HiRA, a hierarchical framework that separates strategic planning from specialized execution. Our approach decomposes complex search tasks into focused subtasks, assigns each subtask to domain-specific agents equipped with external tools and reasoning capabilities, and coordinates the results through a structured integration mechanism. This separation prevents execution details from disrupting high-level reasoning while enabling the system to leverage specialized expertise for different types of information processing. Experiments on four complex, cross-modal deep search benchmarks demonstrate that HiRA significantly outperforms state-of-the-art RAG and agent-based systems. Our results show improvements in both answer quality and system efficiency, highlighting the effectiveness of decoupled planning and execution for multi-step information seeking tasks. Our code is available at https://github.com/ignorejjj/HiRA.
现实世界搜索情景的复杂信息需求要求不同来源的深度推理和知识综合,而传统检索-增强的一代(RAG)管道则难以有效解决。当前基于推理的方法存在一个根本性的限制:它们使用单一模式处理高级别规划和详细执行,导致低效率推理和可扩缩性。在本文中,我们引入了Hira,这是一个将战略规划与专门执行分开的分级框架。我们的方法将复杂的搜索任务分解为重点子任务,将每个子任务指派给配备外部工具和推理能力的域特有代理,并通过结构化集成机制协调结果。这种分离防止执行细节干扰高级别推理,同时使系统能够为不同类型的信息处理利用专门知识。对四种复杂、跨模式的深度搜索基准的实验表明,Hira大大超越了现代搜索和代理系统的现状。我们的结果显示,在回答质量和系统效率方面都得到了改进,突出了为寻求多步骤信息而进行分解组合规划和执行的有效性。我们的代码可在 https://giuthus/hibugh.
Article 31
Title@2025-07-03 (4): Strategic Intelligence in Large Language Models: Evidence from evolutionary Game Theory
Title: Strategic Intelligence in Large Language Models: Evidence from evolutionary Game Theory | Strategische Intelligenz in großen Sprachmodellen: Beweise aus der evolutionären Spieltheorie | 大语言模型战略情报:进化游戏理论的证据 2507.02618v1 |
Authors (2): Kenneth Payne, Baptiste Alloui-Cros
Are Large Language Models (LLMs) a new form of strategic intelligence, able to reason about goals in competitive settings? We present compelling supporting evidence. The Iterated Prisoner’s Dilemma (IPD) has long served as a model for studying decision-making. We conduct the first ever series of evolutionary IPD tournaments, pitting canonical strategies (e.g., Tit-for-Tat, Grim Trigger) against agents from the leading frontier AI companies OpenAI, Google, and Anthropic. By varying the termination probability in each tournament (the “shadow of the future”), we introduce complexity and chance, confounding memorisation. Our results show that LLMs are highly competitive, consistently surviving and sometimes even proliferating in these complex ecosystems. Furthermore, they exhibit distinctive and persistent “strategic fingerprints”: Google’s Gemini models proved strategically ruthless, exploiting cooperative opponents and retaliating against defectors, while OpenAI’s models remained highly cooperative, a trait that proved catastrophic in hostile environments. Anthropic’s Claude emerged as the most forgiving reciprocator, showing remarkable willingness to restore cooperation even after being exploited or successfully defecting. Analysis of nearly 32,000 prose rationales provided by the models reveals that they actively reason about both the time horizon and their opponent’s likely strategy, and we demonstrate that this reasoning is instrumental to their decisions. This work connects classic game theory with machine psychology, offering a rich and granular view of algorithmic decision-making under uncertainty.
大型语言模型(LLMS)是否是一种新型的战略智能,能够在竞争性环境下解释目标?我们提出令人信服的支持证据。迭代囚犯的Diillemma(IPD)长期以来一直是研究决策的典范。我们举办了有史以来第一系列进化的IPD锦标赛,对领先的前沿AI公司OpenAI、Google和Anthroopic的代理人的卡通策略(例如Tit-for-Tat、Grim Triggger)进行尖锐的卡通策略(比如Tit-for-Tat、Grim Trigger)。通过在每次比赛中(“未来阴影” ),我们引入了复杂和机会,混淆了记忆。我们的结果表明LLOMS在这些复杂的生态系统中具有高度竞争力,持续生存,有时甚至扩散。此外,它们展示了独特和持久的“战略指纹”:谷歌的Gemini模型在战略上冷酷无情,利用合作对手,对叛逆者进行报复,而OpenAI的模型则保持高度合作,在敌对环境中造成灾难性的特征。即使Claudeculational-rocial ex 也是最令人发ncial-comcial exal exal ex excial excial ex excience excience 也展示了一种最令人难以置信的逻辑上的理论, 推论的逻辑, 推论的理论论论论,在这种推论, 3 。
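The tournament setup can be made concrete with a minimal Iterated Prisoner's Dilemma match between canonical strategies, where each round ends the game with a fixed termination probability (the "shadow of the future"). The payoff values 5/3/1/0 are the conventional textbook ones and are an assumption here, as the abstract does not state the payoffs used; no LLM agents are involved in this sketch.

```python
import random

# Conventional payoffs (assumed): (my_payoff, opponent_payoff) per joint action.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(own_history, opp_history):
    """Cooperate first, then copy the opponent's previous move."""
    return opp_history[-1] if opp_history else "C"

def grim_trigger(own_history, opp_history):
    """Cooperate until the opponent defects once, then defect forever."""
    return "D" if "D" in opp_history else "C"

def always_defect(own_history, opp_history):
    return "D"

def play_match(strategy_a, strategy_b, termination_prob, rng, max_rounds=10_000):
    """Play until a random stop; returns (score_a, score_b, rounds_played)."""
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    while True:
        a = strategy_a(hist_a, hist_b)
        b = strategy_b(hist_b, hist_a)
        pa, pb = PAYOFF[(a, b)]
        score_a, score_b = score_a + pa, score_b + pb
        hist_a.append(a)
        hist_b.append(b)
        if rng.random() < termination_prob or len(hist_a) >= max_rounds:
            return score_a, score_b, len(hist_a)

rng = random.Random(0)
print(play_match(tit_for_tat, grim_trigger, 0.1, rng))
print(play_match(tit_for_tat, always_defect, 0.1, rng))
```

A lower termination probability lengthens the expected match, which raises the value of sustained cooperation relative to a one-off defection.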
Article 32
Title@2025-07-03 (4): Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure
Title: Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure | Erklärbare Compliance-Erkennung mit Multi-Hop-Natural Language-Schlussfolgerung zur Assurance-Fallstruktur | 以多种自然语言对保证案例结构的多重语言推断进行可解释的合规检测 2506.08713v2 |
Authors (2): Fariz Ikhwantri, Dusica Marijan
Ensuring complex systems meet regulations typically requires checking the validity of assurance cases through a claim-argument-evidence framework. Some challenges in this process include the complicated nature of legal and technical texts, the need for model explanations, and limited access to assurance case data. We propose a compliance detection approach based on Natural Language Inference (NLI): EXplainable CompLiance detection with Argumentative Inference of Multi-hop reasoning (EXCLAIM). We formulate the claim-argument-evidence structure of an assurance case as a multi-hop inference for explainable and traceable compliance detection. We address the limited number of assurance cases by generating them using large language models (LLMs). We introduce metrics that measure the coverage and structural consistency. We demonstrate the effectiveness of the generated assurance case from GDPR requirements in a multi-hop inference task as a case study. Our results highlight the potential of NLI-based approaches in automating the regulatory compliance process.
确保复杂的系统符合规章条例,通常需要通过索赔-论证-证据框架检查保证案件的有效性。这一过程的一些挑战包括法律和技术文本性质复杂,需要示范解释,以及获得保证案件数据的机会有限。我们建议根据自然语言推断(NLI):利用多点推理推理(EXCLAIM)的推理推理(EXCLAIM)进行可推广共测。我们将保证案件的索赔-论证-证据结构作为可解释和可追踪的遵守检测的多重推理。我们用大型语言模型(LLLMs)生成的保证案件数量有限。我们提出了衡量覆盖面和结构一致性的衡量标准。我们作为案例研究,展示了GDPR要求产生的保证案件的有效性。我们的结果突出了以NLI为基础的方法在监管遵守过程自动化方面的潜力。
Article 33
Title@2025-07-03 (4): Direct Preference Optimization Using Sparse Feature-Level Constraints
Title: Direct Preference Optimization Using Sparse Feature-Level Constraints | Direkte Preference-Optimierung mit Sparse-Feature-Level-Beschränkungen | 使用粗简地物限制的直接优惠优化 2411.07618v2 |
Authors (11): Qingyu Yin, Chak Tou Leong, Hongbo Zhang, Minjun Zhu, Hanqi Yan, Qiang Zhang, Yulan He, Wenjie Li, Jun Wang, Yue Zhang, Linyi Yang
The alignment of large language models (LLMs) with human preferences remains a key challenge. While post-training techniques like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have achieved notable success, they often introduce computational inefficiencies and training instability. In this paper, we propose Feature-level constrained Preference Optimization (FPO), a novel method designed to simplify the alignment process while ensuring stability. FPO leverages pre-trained Sparse Autoencoders (SAEs) and introduces feature-level constraints, allowing for efficient, sparsity-enforced alignment. Our approach enjoys efficiency by using sparse features activated in a well-trained sparse autoencoder and the quality of sequential KL divergence by using the feature-level offline reference. Experimental results on benchmark datasets demonstrate that FPO achieves a 5.08% absolute improvement in win rate with much lower computational cost compared to state-of-the-art baselines, making it a promising solution for efficient and controllable LLM alignments.
大型语言模式(LLMS)与人类偏好相匹配仍然是一个关键的挑战。尽管培训后技术(如“加强人类反馈学习”和“直接偏好优化”)取得了显著成功,但它们往往导致计算效率低下和训练不稳定。在本文中,我们提出“特级有限偏好优化”(FPO)新颖方法,旨在简化校准进程,同时确保稳定性。FPO利用预先培训的“Sparse Autoencors (SAEs)” ,引入了特质级限制,允许高效、宽度强化校准。我们的方法通过使用精密的稀疏特性在训练有素的稀疏的稀释自动校对器中激活,以及使用地层离线参考的相继KL差异的质量而效率得到提高。基准数据集的实验结果表明,FPO在赢率方面实现了5.08%的绝对改善,其计算成本比最先进的基线低得多,这为高效、可控的LM校准的解决方案带来了希望。
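For context, the standard DPO objective that FPO builds on can be written for a single preference pair as below. This is the plain DPO loss only; FPO's sparse-autoencoder feature-level constraint is not described in enough detail in the abstract to sketch, so it is deliberately left out. The log-probability values in the usage lines are made up.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log sigmoid(beta * (chosen_gain - rejected_gain)).

    Each argument is a summed sequence log-probability under the policy
    (logp_*) or the frozen reference model (ref_logp_*).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Made-up log-probabilities: the policy already prefers the chosen response slightly.
print(dpo_loss(-12.0, -15.0, -13.0, -14.0))
```

The loss equals log 2 when the policy and reference agree on the pair and decreases as the policy raises the chosen response's log-probability relative to the rejected one.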
Article 34
Title@2025-07-03 (4): Symbolic or Numerical? Understanding Physics Problem Solving in Reasoning LLMs
Title: Symbolic or Numerical? Understanding Physics Problem Solving in Reasoning LLMs | Symbolisch oder numerisch? Physik-Probleme verstehen, die LLMs aufklären | 理解在理赔中解决物理问题 2507.01334v2 |
Authors (3): Nifu Dan, Yujun Cai, Yiwei Wang
Navigating the complexities of physics reasoning has long been a difficult task for Large Language Models (LLMs), requiring a synthesis of profound conceptual understanding and adept problem-solving techniques. In this study, we investigate the application of advanced instruction-tuned reasoning models, such as Deepseek-R1, to address a diverse spectrum of physics problems curated from the challenging SciBench benchmark. Our comprehensive experimental evaluation reveals the remarkable capabilities of reasoning models. Not only do they achieve state-of-the-art accuracy in answering intricate physics questions, but they also generate distinctive reasoning patterns that emphasize symbolic derivation. Furthermore, our findings indicate that even for these highly sophisticated reasoning models, the strategic incorporation of few-shot prompting can still yield measurable improvements in overall accuracy, highlighting the potential for continued performance gains.
评估物理学推理的复杂性长期以来一直是大语言模型(LLMs)的一项艰巨任务,需要综合深刻的概念理解和恰当的解决问题技术。在本研究中,我们调查了应用高级指导调整推理模型(如Deepseek-R1),以解决从具有挑战性的SciBench基准中归纳出来的多种多样的物理问题。我们的全面实验评估揭示了推理模型的非凡能力。这些模型不仅在回答复杂的物理学问题时达到了最先进的准确性,而且还产生了强调象征性衍生的独特的推理模式。此外,我们的调查结果表明,即使这些高度复杂的推理模型,从战略上整合几发提示仍然能够取得可衡量的总体准确性改进,突出持续取得绩效收益的潜力。
Article 35
Title@2025-07-03 (4): MPF: Aligning and Debiasing Language Models post Deployment via Multi Perspective Fusion
Title: MPF: Aligning and Debiasing Language Models post Deployment via Multi Perspective Fusion | MPF: Sprachmodelle nach der Bereitstellung über Multi Perspective Fusion ausrichten und abgrenzen | MPF:通过多视角融合进行部署后调整和取消对语言模式的偏见 2507.02595v1 |
Authors (7): Xin Guan, PeiHsin Lin, Zekun Wu, Ze Wang, Ruibo Zhang, Emre Kazim, Adriano Koshiyama
Multiperspective Fusion (MPF) is a novel posttraining alignment framework for large language models (LLMs) developed in response to the growing need for easy bias mitigation. Built on top of the SAGED pipeline, an automated system for constructing bias benchmarks and extracting interpretable baseline distributions, MPF leverages multiperspective generations to expose and align biases in LLM outputs with nuanced, humanlike baselines. By decomposing a baseline, such as sentiment distributions from HR professionals, into interpretable perspective components, MPF guides generation through sampling and balancing of responses, weighted by the probabilities obtained in the decomposition. Empirically, we demonstrate its ability to align LLM sentiment distributions with both counterfactual baselines (absolute equality) and the HR baseline (biased for Top University), resulting in small KL divergence, reduction of calibration error and generalization to unseen questions. This shows that MPF offers a scalable and interpretable method for alignment and bias mitigation, compatible with deployed LLMs and requiring no extensive prompt engineering or finetuning.
多视角融合(MPF)是一种面向大语言模型(LLM)的新型后训练对齐框架,旨在满足对简便偏见缓解方法日益增长的需求。MPF建立在SAGED流程(一个用于构建偏见基准并提取可解释基线分布的自动化系统)之上,利用多视角生成来揭示LLM输出中的偏见,并使其与细致的、类人的基线对齐。通过将基线(例如来自人力资源专业人员的情感分布)分解为可解释的视角成分,MPF按分解得到的概率加权,对回答进行采样和平衡,从而引导生成。实证结果表明,MPF能够将LLM的情感分布与反事实基线(绝对平等)和人力资源基线(偏向顶尖大学)对齐,KL散度小,校准误差降低,并能泛化到未见过的问题。这表明MPF为对齐和偏见缓解提供了一种可扩展、可解释的方法,与已部署的LLM兼容,且无需大量提示工程或微调。
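To make the quantities in the MPF abstract concrete, the following is a minimal sketch of two pieces it mentions: the KL divergence between a model's sentiment distribution and a baseline, and sampling perspectives in proportion to decomposition weights. The three-bin histograms, the perspective names, and the weights are invented for illustration; this is not the authors' implementation or the SAGED API.

```python
import math
import random

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def sample_perspectives(weights, n, seed=0):
    """Draw n perspective names with probability proportional to their weights."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)

# Hypothetical sentiment histograms over (negative, neutral, positive) bins.
baseline = [0.2, 0.5, 0.3]
model = [0.1, 0.4, 0.5]

# Hypothetical decomposition of the baseline into weighted perspectives.
weights = {"optimist": 0.5, "realist": 0.3, "skeptic": 0.2}

divergence = kl_divergence(model, baseline)
drawn = sample_perspectives(weights, 10)
```

A smaller `divergence` means the model's sentiment distribution sits closer to the chosen baseline, which is the sense in which the abstract reports "small KL divergence".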
Article 36
Title@2025-07-03 (4): MedAide: Information Fusion and Anatomy of Medical Intents via LLM-based Agent Collaboration
Title: MedAide: Information Fusion and Anatomy of Medical Intents via LLM-based Agent Collaboration | MedAide: Informationsfusion und Anatomie von medizinischen Intents über LLM-basierte Agent Collaboration | 医学辅助:通过以LLM为基地的合作公司代理进行医疗成瘾者的信息汇集和解剖 2410.12532v3 |
Authors (11): Dingkang Yang, Jinjie Wei, Mingcheng Li, Jiyao Liu, Lihao Liu, Ming Hu, Junjun He, Yakun Ju, Wei Zhou, Yang Liu, Lihua Zhang
In healthcare intelligence, the ability to fuse heterogeneous, multi-intent information from diverse clinical sources is fundamental to building reliable decision-making systems. Large Language Model (LLM)-driven information interaction systems are currently showing promise in the healthcare domain. Nevertheless, they often suffer from information redundancy and coupling when dealing with complex medical intents, leading to severe hallucinations and performance bottlenecks. To this end, we propose MedAide, an LLM-based medical multi-agent collaboration framework designed to enable intent-aware information fusion and coordinated reasoning across specialized healthcare domains. Specifically, we introduce a regularization-guided module that combines syntactic constraints with retrieval augmented generation to decompose complex queries into structured representations, facilitating fine-grained clinical information fusion and intent resolution. Additionally, a dynamic intent prototype matching module is proposed to utilize dynamic prototype representation with a semantic similarity matching mechanism to achieve adaptive recognition and updating of the agent’s intent in multi-round healthcare dialogues. Ultimately, we design a rotation agent collaboration mechanism that introduces dynamic role rotation and decision-level information fusion across specialized medical agents. Extensive experiments are conducted on four medical benchmarks with composite intents. Experimental results from automated metrics and expert doctor evaluations show that MedAide outperforms current LLMs and improves their medical proficiency and strategic reasoning.
在保健情报方面,整合来自不同临床来源的多种多功能信息的能力对于建立可靠的决策系统至关重要。目前,大型语言模型驱动的信息互动系统在保健领域具有潜在的希望。然而,在处理复杂的医疗意图时,它们往往会遇到信息冗余和混合,导致严重的幻觉和性能瓶颈。为此,我们提议MedAide,一个基于LLM的医疗多剂协作框架,目的是在各种专业保健领域实现意向意识信息融合和协调推理。具体地说,我们引入一个正规化指导的模块,将综合技术限制与检索增强的生成相结合,将复杂的查询分解成结构化的表述,促进细化的临床信息融合和意向解决方案。此外,还提议一个动态意图匹配模块,利用具有语义相似匹配机制的动态原型代表,实现适应性承认和更新该代理人在多方面保健对话中的意向。我们设计了一个轮换代理协作机制,在专业医疗代理人之间引入动态角色轮换和决策级信息融合。在四个医学基准上进行了广泛的实验,将复杂的询问分为四个医学基准,并且从综合意图上展示了医学测试结果。
Article 37
Title@2025-07-03 (4): Revisiting Active Learning under (Human) Label Variation
Title: Revisiting Active Learning under (Human) Label Variation | Aktives Lernen unter (menschlichen) Label-Varianten | 在(人)标签标签变换下重新审查积极学习 2507.02593v1 |
Authors (6): Cornelia Gruber, Helen Alber, Bernd Bischl, Göran Kauermann, Barbara Plank, Matthias Aßenmacher
Access to high-quality labeled data remains a limiting factor in applied supervised learning. While label variation (LV), i.e., differing labels for the same instance, is common, especially in natural language processing, annotation frameworks often still rest on the assumption of a single ground truth. This overlooks human label variation (HLV), the occurrence of plausible differences in annotations, as an informative signal. Similarly, active learning (AL), a popular approach to optimizing the use of limited annotation budgets in training ML models, often relies on at least one of several simplifying assumptions, which rarely hold in practice when acknowledging HLV. In this paper, we examine foundational assumptions about truth and label nature, highlighting the need to decompose observed LV into signal (e.g., HLV) and noise (e.g., annotation error). We survey how the AL and (H)LV communities have addressed – or neglected – these distinctions and propose a conceptual framework for incorporating HLV throughout the AL loop, including instance selection, annotator choice, and label representation. We further discuss the integration of large language models (LLM) as annotators. Our work aims to lay a conceptual foundation for HLV-aware active learning, better reflecting the complexities of real-world annotation.
虽然标签差异(LV),即同一情况的不同标签是常见的,但在自然语言处理中,批注框架往往仍然以单一地面真理的假设为基础。这忽略了人类标签差异(HLV),说明中出现可信的差异,作为信息信号。同样,积极学习(AL),是一种在培训ML模型中优化使用有限注解预算的流行方法,通常依赖至少几个简化假设中的一个,在承认HLV时,这些假设在实践中很少存在。我们在本文件中,审查关于真相和标签性质的基本假设,强调将观察到的LV分解成信号(如HLV)和噪音(如注解错误)的必要性。我们调查AL和(H)LV社区如何处理这些区别,或忽略了这些区别,并提议一个概念框架,将HLV纳入整个LV循环,包括实例选择、说明选择和标签代表。我们进一步讨论了将大型语言模型(LLLMM)整合成一个更能反映全球概念复杂性的大型学习基础。我们的工作的目的是为HLV奠定一个更好的学习基础。
Article 38
Title@2025-07-03 (4): WebSailor: Navigating Super-human Reasoning for Web Agent
Title: WebSailor: Navigating Super-human Reasoning for Web Agent | WebSailor: Navigieren Super-Mensch Vernunft für Web Agent | Web 服务员: 为 Web 代理导航超人理由 2507.02592v1 |
Authors (19): Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, Weizhou Shen, Junkai Zhang, Dingchu Zhang, Xixi Wu, Yong Jiang, Ming Yan, Pengjun Xie, Fei Huang, Jingren Zhou
Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents’ performance and closing the capability gap.
超越人类认知局限是LLM训练的关键前沿。DeepResearch等专有智能体系统已在BrowseComp等极其复杂的信息搜索基准上展现出超人的能力,这在以前是无法实现的。我们认为,它们的成功取决于开源模型所缺乏的一种精密推理模式:在探索广阔的信息空间时系统地降低极端不确定性的能力。基于这一认识,我们提出了WebSailor,这是一套旨在培养这一关键能力的完整后训练方法。我们的方法包括通过结构化采样和信息模糊化生成新颖的高不确定性任务、RFT冷启动,以及一种高效的智能体强化学习训练算法:复制采样策略优化(Duplicating Sampling Policy Optimization,DUPO)。借助这一集成流程,WebSailor在复杂信息搜索任务中显著优于所有开源智能体,达到了与专有智能体相当的性能,缩小了能力差距。
Article 39
Title@2025-07-03 (4): AI Flow: Perspectives, Scenarios, and Approaches
Title: AI Flow: Perspectives, Scenarios, and Approaches | AI Flow: Perspektiven, Szenarien und Ansätze | AI 流动:观点、设想和方法 2506.12479v2 |
Authors (14): Hongjun An, Wenhan Hu, Sida Huang, Siqi Huang, Ruanjun Li, Yuanzhi Liang, Jiawei Shao, Yiliang Song, Zihan Wang, Cheng Yuan, Chi Zhang, Hongyuan Zhang, Wenhao Zhuang, Xuelong Li
Pioneered by the foundational information theory of Claude Shannon and the visionary framework of machine intelligence of Alan Turing, the convergent evolution of information and communication technologies (IT/CT) has created an unbroken wave of connectivity and computation. This synergy has sparked a technological revolution, now reaching its peak with large artificial intelligence (AI) models that are reshaping industries and redefining human-machine collaboration. However, the realization of ubiquitous intelligence faces considerable challenges due to substantial resource consumption in large models and high communication bandwidth demands. To address these challenges, AI Flow has been introduced as a multidisciplinary framework that integrates cutting-edge IT and CT advancements, with a particular emphasis on the following three key points. First, a device-edge-cloud framework serves as the foundation, integrating end devices, edge servers, and cloud clusters to optimize scalability and efficiency for low-latency model inference. Second, we introduce the concept of familial models, which refers to a series of different-sized models with aligned hidden features, enabling effective collaboration and the flexibility to adapt to varying resource constraints and dynamic scenarios. Third, connectivity- and interaction-based intelligence emergence is a novel paradigm of AI Flow. By leveraging communication networks to enhance connectivity, the collaboration among AI models across heterogeneous nodes achieves emergent intelligence that surpasses the capability of any single model. The innovations of AI Flow provide enhanced intelligence, timely responsiveness, and ubiquitous accessibility to AI services, paving the way for the tighter fusion of AI techniques and communication systems.
由于克劳德·香农的基本信息理论和艾伦·图灵的机智智能远见框架的开创性,信息和通信技术(IT/CT)的趋同性演进形成了一个不间断的连通和计算浪潮,这种协同效应引发了技术革命,现在随着大型人工智能(AI)模型的重新塑造工业和重新界定人体机械合作而达到顶峰。然而,由于大型模型中大量资源消耗和高通信带宽需求,实现无处不在的情报面临巨大挑战。为了应对这些挑战,AI流动被引入了多学科框架,将先进的信息技术和CT进步结合起来,特别强调以下三个关键点。首先,装置-顶尖的云形框架作为基础,将终端装置、边缘服务器和云层集群结合起来,优化低电流模型的伸缩性和效率。第二,我们引入了家庭模型的概念,即一系列规模不同的模型,与一致的隐蔽性特征相适应,使得有效的合作和灵活性能够适应不同的资源限制和动态情景。第三,连接性和互动性框架作为基础基础,将连接性和互动性框架作为基础,将最终的智能升级性模型,从而提升AI系统。
Article 40
Title@2025-07-03 (4): Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs
Title: Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs | Sprachenübergreifendes Reisen: Benchmarking Cross-Lingual Consistency in multimodalen LLMs | 跨语言旅行:多模式LLM中跨语言一致基准 2505.15075v2 |
Authors (5): Hao Wang, Pinzhi Huang, Jihan Yang, Saining Xie, Daisuke Kawahara
The rapid evolution of multimodal large language models (MLLMs) has significantly enhanced their real-world applications. However, achieving consistent performance across languages, especially when integrating cultural knowledge, remains a significant challenge. To better assess this issue, we introduce two new benchmarks: KnowRecall and VisRecall, which evaluate cross-lingual consistency in MLLMs. KnowRecall is a visual question answering benchmark designed to measure factual knowledge consistency in 15 languages, focusing on cultural and historical questions about global landmarks. VisRecall assesses visual memory consistency by asking models to describe landmark appearances in 9 languages without access to images. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, still struggle to achieve cross-lingual consistency. This underscores the need for more robust approaches that produce truly multilingual and culturally aware models.
多模态大语言模型(MLLM)的快速发展显著增强了其实际应用。然而,在不同语言之间取得一致的表现,尤其是在融入文化知识时,仍然是一项重大挑战。为了更好地评估这一问题,我们引入了两个新基准:KnowRecall和VisRecall,用于评估MLLM的跨语言一致性。KnowRecall是一个视觉问答基准,旨在衡量15种语言中事实知识的一致性,重点关注有关全球地标的文化和历史问题。VisRecall通过要求模型在无法访问图像的情况下用9种语言描述地标外观,来评估视觉记忆的一致性。实验结果表明,包括专有模型在内的最先进MLLM仍然难以实现跨语言一致性。这凸显了需要更稳健的方法,以构建真正多语言且具备文化意识的模型。
Article 41
Title@2025-07-03 (4): Self-Guided Process Reward Optimization with Redefined Step-wise Advantage for Process Reinforcement Learning
Title: Self-Guided Process Reward Optimization with Redefined Step-wise Advantage for Process Reinforcement Learning | Selbstgesteuerte Prozess-Reward-Optimierung mit neu definiertem Schrittweiser Vorteil für Prozess-Verstärkungs-Lernen | 自指导流程向上优化,具有重新定义的逐步改进的流程强化学习优势 2507.01551v2 |
Authors (8): Wu Fei, Hao Kong, Shuxian Liang, Yang Lin, Yibo Yang, Jing Tang, Lei Chen, Xiansheng Hua
Process Reinforcement Learning (PRL) has demonstrated considerable potential in enhancing the reasoning capabilities of Large Language Models (LLMs). However, introducing additional process reward models incurs substantial computational overhead, and there is no unified theoretical framework for process-level advantage estimation. To bridge this gap, we propose Self-Guided Process Reward Optimization (SPRO), a novel framework that enables process-aware RL through two key innovations: (1) we first theoretically demonstrate that process rewards can be derived intrinsically from the policy model itself, and (2) we introduce well-defined cumulative process rewards and Masked Step Advantage (MSA), which facilitates rigorous step-wise action advantage estimation within shared-prompt sampling groups. Our experimental results demonstrate that SPRO outperforms vanilla GRPO with 3.4x higher training efficiency and a 17.5% test accuracy improvement. Furthermore, SPRO maintains a stable and elevated policy entropy throughout training while reducing the average response length by approximately 1/3, evidencing sufficient exploration and prevention of reward hacking. Notably, SPRO incurs no additional computational overhead compared to outcome-supervised RL methods such as GRPO, which benefits industrial implementation.
过程强化学习(PRL)在提升大语言模型(LLM)推理能力方面展现出相当大的潜力。然而,引入额外的过程奖励模型会带来大量计算开销,而且目前尚无用于过程级优势估计的统一理论框架。为弥补这一差距,我们提出自引导过程奖励优化(Self-Guided Process Reward Optimization,SPRO),这是一个通过两项关键创新实现过程感知强化学习的新框架:(1)我们首先从理论上证明,过程奖励可以从策略模型自身内在地推导出来;(2)我们引入定义明确的累积过程奖励和掩码步骤优势(Masked Step Advantage,MSA),以便在共享提示的采样组内进行严格的逐步动作优势估计。实验结果表明,SPRO优于原始GRPO,训练效率提高3.4倍,测试准确率提升17.5%。此外,SPRO在整个训练过程中保持稳定且较高的策略熵,同时将平均响应长度缩短约三分之一,表明其探索充分并能防止奖励作弊。值得注意的是,与GRPO等结果监督的强化学习方法相比,SPRO不产生额外的计算开销,这有利于工业落地。
Article 42
Title@2025-07-03 (4): IndianBailJudgments-1200: A Multi-Attribute Dataset for Legal NLP on Indian Bail Orders
Title: IndianBailJudgments-1200: A Multi-Attribute Dataset for Legal NLP on Indian Bail Orders | IndianBailJudgments-1200: Ein Multi-Attribut-Datensatz für legale NLP auf indischen Bail-Aufträgen | IndianBailJudgments-1200:印度保释令法律自然语言处理多属性数据集 2507.02506v1 |
Authors (2): Sneha Deshmukh, Prathmesh Kamble
Legal NLP remains underdeveloped in regions like India due to the scarcity of structured datasets. We introduce IndianBailJudgments-1200, a new benchmark dataset comprising 1200 Indian court judgments on bail decisions, annotated across 20+ attributes including bail outcome, IPC sections, crime type, and legal reasoning. Annotations were generated using a prompt-engineered GPT-4o pipeline and verified for consistency. This resource supports a wide range of legal NLP tasks such as outcome prediction, summarization, and fairness analysis, and is the first publicly available dataset focused specifically on Indian bail jurisprudence.
在印度等地区,由于缺少结构化数据集,国家法律援助方案仍然不发达。我们引入了印度Bail Judgments-1200(IndianBailJudgments-1200)这一新的基准数据集,其中包括1200个印度法院关于保释判决的判决,20+属性的附加说明,包括保释结果、国际刑法委员会各科、犯罪类型和法律推理。说明是利用迅速设计的GPT-4o(GPT-4o)管道生成的,并核实一致性。这一资源支持了国家法律援助方案范围广泛的法律任务,如结果预测、总结和公平分析,也是第一个专门侧重于印度保释判例的公开数据集。
Article 43
Title@2025-07-03 (4): Robustness of Misinformation Classification Systems to Adversarial Examples Through BeamAttack
Title: Robustness of Misinformation Classification Systems to Adversarial Examples Through BeamAttack | Robustheit von Fehlinformations-Klassifikationssystemen zu Adversarial-Beispielen durch BeamAttack | 通过“BeamAttack”进行错误信息分类系统对反向实例的强力 2506.23661v2 |
Authors (4): Arnisa Fazla, Lucas Krauter, David Guzman Piedrahita, Andrianos Michail
We extend BeamAttack, an adversarial attack algorithm designed to evaluate the robustness of text classification systems through word-level modifications guided by beam search. Our extensions include support for word deletions and the option to skip substitutions, enabling the discovery of minimal modifications that alter model predictions. We also integrate LIME to better prioritize word replacements. Evaluated across multiple datasets and victim models (BiLSTM, BERT, and adversarially trained RoBERTa) within the BODEGA framework, our approach achieves an attack success rate of over 99% while preserving the semantic and lexical similarity of the original texts. Through both quantitative and qualitative analysis, we highlight BeamAttack’s effectiveness and its limitations. Our implementation is available at https://github.com/LucK1Y/BeamAttack
我们扩展BeamAttack, 这是一种对抗性攻击算法,目的是通过光束搜索指导的字级修改来评价文本分类系统的稳健性。我们的扩展包括支持删除字词和跳过替代的选项,从而能够发现改变模型预测的最低限度的修改。 我们还将LimAttack纳入LIME, 以更好地确定替换字词的优先次序。在BODEGA框架内,通过多个数据集和受害者模型(BILSTM、BERTER和经过对抗性训练的RoBERTA)进行评估,我们的方法在保存原始文本的语义和词汇相似性的同时,达到了99攻击成功率以上。我们通过定量和定性分析,强调BeamAttack的有效性及其局限性。我们的实施可以在 https://github.com/LucK1Y/BeamAttack上查阅。
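The general idea of a beam-search word-level attack with deletions and substitutions can be sketched as follows. This is only an illustration of the search pattern: `toy_score`, the trigger words, the substitute table, and the threshold are all hypothetical, and the sketch does not reproduce the BeamAttack code, the BODEGA framework, or the LIME-based prioritization.

```python
def toy_score(words):
    """Hypothetical victim model: fraction of trigger words, read as a class score."""
    triggers = {"shocking", "hoax", "secret"}
    return sum(w in triggers for w in words) / max(len(words), 1)

def beam_attack(words, substitutes, beam_width=2, max_steps=3, threshold=0.2):
    """Search for a small set of word edits that pushes the score below threshold."""
    beam = [(toy_score(words), list(words), 0)]  # (score, candidate text, edits made)
    for _ in range(max_steps):
        candidates = []
        for _score, cand, edits in beam:
            for i, w in enumerate(cand):
                # Option 1: delete the word at position i.
                options = [cand[:i] + cand[i + 1:]]
                # Option 2: replace it with each allowed substitute.
                options += [cand[:i] + [s] + cand[i + 1:] for s in substitutes.get(w, [])]
                for new in options:
                    candidates.append((toy_score(new), new, edits + 1))
        if not candidates:
            break
        # Keep only the beam_width candidates with the lowest victim score.
        candidates.sort(key=lambda c: c[0])
        beam = candidates[:beam_width]
        if beam[0][0] < threshold:
            return beam[0][1], beam[0][2]
    return None  # no successful attack within the edit budget

text = "shocking secret report about the vaccine".split()
subs = {"shocking": ["surprising"], "secret": ["private"]}
result = beam_attack(text, subs)
```

Because the beam is re-ranked after every edit, the search stops at the first step where any candidate crosses the threshold, which favors attacks with few modifications.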
Article 44
Title@2025-07-03 (4): Task Prompt Vectors: Effective Initialization through Multi-Task Soft-Prompt Transfer
Title: Task Prompt Vectors: Effective Initialization through Multi-Task Soft-Prompt Transfer | Task Prompt Vektoren: Effektive Initialisierung durch Multi-Task Soft-Prompt Transfer | 任务提示矢量 : 通过多任务软性即时传输实现有效的初始化 2408.01119v3 |
Authors (4): Robert Belanec, Simon Ostermann, Ivan Srba, Maria Bielikova
Prompt tuning is an efficient solution for training large language models (LLMs). However, current soft-prompt-based methods often sacrifice multi-task modularity, requiring the training process to be fully or partially repeated for each newly added task. While recent work on task vectors applied arithmetic operations on full model weights to achieve the desired multi-task performance, a similar approach for soft-prompts is still missing. To this end, we introduce Task Prompt Vectors, created by taking the element-wise difference between the weights of tuned soft-prompts and their random initialization. Experimental results on 12 NLU datasets show that task prompt vectors can be used in low-resource settings to effectively initialize prompt tuning on similar tasks. In addition, we show that task prompt vectors are independent of the random initialization of prompt tuning on 2 different language model architectures. This allows prompt arithmetic with the pre-trained vectors from different tasks. In this way, we provide a competitive alternative to state-of-the-art baselines by arithmetic addition of task prompt vectors from multiple tasks.
快速调试是培训大型语言模型(LLMS)的一个有效解决方案。然而,目前基于软性即时方法往往牺牲了多任务模块性,要求每个新增任务都完全或部分重复培训过程。虽然最近关于任务矢量的工作应用了全模型加权算术操作,以达到理想的多任务性能,但对于软性促进器,仍然缺少类似的方法。为此,我们引入了任务快速矢量,这是调控软性促进器的重量与其随机初始化之间的分值之间的分值造成的。12个NLU数据集的实验结果表明,任务快速矢量可在低资源环境下使用,以有效启动对类似任务的快速调控量。此外,我们表明任务快速矢量独立于对两个不同语言模型结构的随机初始调整。这允许对不同任务中经过预先训练的矢量进行快速算术。我们通过对多个任务的任务任务中的任务快速添加的任务提示矢量进行算术,为状态的基线提供了一种竞争性替代方法。
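The two operations the abstract describes, an element-wise difference and an arithmetic addition, can be sketched with plain nested lists. The 2x2 "soft prompts" and the function names are invented for illustration; real soft prompts would be tensors of shape (prompt length, embedding size), and the authors' code may differ.

```python
def task_prompt_vector(tuned, init):
    """Element-wise difference between a tuned soft prompt and its initialization."""
    return [[t - i for t, i in zip(t_row, i_row)] for t_row, i_row in zip(tuned, init)]

def apply_vectors(init, vectors):
    """Add one or more task prompt vectors to an initialization (prompt arithmetic)."""
    out = [row[:] for row in init]
    for vec in vectors:
        for r, row in enumerate(vec):
            for c, value in enumerate(row):
                out[r][c] += value
    return out

# Hypothetical soft prompts: 2 prompt tokens, embedding size 2.
init = [[0.0, 1.0], [2.0, 3.0]]
tuned_a = [[1.0, 1.0], [2.0, 5.0]]   # after prompt tuning on task A
tuned_b = [[0.0, 2.0], [3.0, 3.0]]   # after prompt tuning on task B

vec_a = task_prompt_vector(tuned_a, init)
vec_b = task_prompt_vector(tuned_b, init)
combined = apply_vectors(init, [vec_a, vec_b])
```

Applying a single vector to its own initialization recovers the tuned prompt, while applying several at once gives a multi-task starting point for further prompt tuning.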
Article 45
Title@2025-07-03 (4): Crafting Hanzi as Narrative Bridges: An AI Co-Creation Workshop for Elderly Migrants
Title: Crafting Hanzi as Narrative Bridges: An AI Co-Creation Workshop for Elderly Migrants | Hanzi als Narrative Bridges herstellen: Ein KI-Co-Creation-Workshop für ältere Migranten | 将Hanzi编成叙述性桥梁:大赦国际为老年移民举办的共同创造讲习班 2507.01548v2 |
Authors (4): Wen Zhan, Ziqun Hua, Peiyue Lin, Yunfei Chen
This paper explores how older adults, particularly aging migrants in urban China, can engage AI-assisted co-creation to express personal narratives that are often fragmented, underrepresented, or difficult to verbalize. Through a pilot workshop combining oral storytelling and the symbolic reconstruction of Hanzi, participants shared memories of migration and recreated new character forms using Xiaozhuan glyphs, suggested by the Large Language Model (LLM), together with physical materials. Supported by human facilitation and a soft AI presence, participants transformed lived experience into visual and tactile expressions without requiring digital literacy. This approach offers new perspectives on human-AI collaboration and aging by repositioning AI not as a content producer but as a supportive mechanism, and by supporting narrative agency within sociotechnical systems.
本文探讨了老年人,特别是中国城市的老龄移民,如何通过人工智能协助共同创作来表达往往支离破碎、代表性不足或难以言语的个人叙事,通过将口头讲故事和汉子象征性重建相结合的试点讲习班,与会者分享了移徙记忆,并用大语言模型(LLM)建议的“小泉格字”和实物材料重新创造了新的性格形式。在人文便利和软的人工智能存在的支持下,参与者将生活经验转化为视觉和触摸的表达方式,而不需要数字扫盲。这一方法通过将AI重新定位为内容制作人,而作为一种支持机制,以及支持社会技术系统内的叙述机构,为人类-大赦国际的合作和老龄化提供了新的视角。
Article 46
Title@2025-07-03 (4): A Cookbook for Community-driven Data Collection of Impaired Speech in LowResource Languages
Title: A Cookbook for Community-driven Data Collection of Impaired Speech in LowResource Languages | Ein Kochbuch für die gemeinschaftsorientierte Datenerfassung von schwachen Sprachkenntnissen in LowResource-Sprachen | 社区驱动的低成本低资源语言有缺陷演讲数据收集手册 2507.02428v1 |
Authors (10): Sumaya Ahmed Salihs, Isaac Wiafe, Jamal-Deen Abdulai, Elikem Doe Atsakpo, Gifty Ayoka, Richard Cave, Akon Obu Ekpezu, Catherine Holloway, Katrin Tomanek, Fiifi Baffoe Payin Winful
This study presents an approach for collecting speech samples to build Automatic Speech Recognition (ASR) models for impaired speech, particularly in low-resource languages. It aims to democratize ASR technology and data collection by developing a “cookbook” of best practices and training for community-driven data collection and ASR model building. As a proof-of-concept, this study curated the first open-source dataset of impaired speech in Akan, a widely spoken indigenous language in Ghana. The study involved participants from diverse backgrounds with speech impairments. The resulting dataset, along with the cookbook and open-source tools, is publicly available to enable researchers and practitioners to create inclusive ASR technologies tailored to the unique needs of speech impaired individuals. In addition, this study presents the initial results of fine-tuning open-source ASR models to better recognize impaired speech in Akan.
这项研究提出一种方法,收集语言样本,以建立障碍语言,特别是低资源语言的自动语音识别模型,目的是通过开发社区驱动数据收集和ASR模型建设最佳做法和培训的“烹饪手册”,使ASR技术和数据收集民主化,作为概念的证明,这项研究整理了Akan的首套障碍语言开放源数据集:加纳一种广泛使用的土著语言;来自不同背景的有语言障碍的参与者参与了这项研究;由此形成的数据集,连同烹饪手册和开放源工具,公开提供,使研究人员和从业人员能够根据语言受损个人的独特需求,开发包容性的ASR技术;此外,这项研究还介绍了对开放源语言的ASR模型进行微调的初步结果,以更好地识别Akan的受损语言。
Article 47
Title@2025-07-03 (4): Delving into LLM-assisted writing in biomedical publications through excess vocabulary
Title: Delving into LLM-assisted writing in biomedical publications through excess vocabulary | Eintauchen in LLM-unterstütztes Schreiben in biomedizinischen Publikationen durch überschüssiges Vokabular | 通过超量词汇,在生物医学出版物中进行LLM协助撰写 2406.07016v5 |
Authors (4): Dmitry Kobak, Rita González-Márquez, Emőke-Ágnes Horvát, Jan Lause
Large language models (LLMs) like ChatGPT can generate and revise text with human-level performance. These models come with clear limitations: they can produce inaccurate information, reinforce existing biases, and be easily misused. Yet, many scientists use them for their scholarly writing. But how widespread is such LLM usage in the academic literature? To answer this question for the field of biomedical research, we present an unbiased, large-scale approach: we study vocabulary changes in over 15 million biomedical abstracts from 2010–2024 indexed by PubMed, and show how the appearance of LLMs led to an abrupt increase in the frequency of certain style words. This excess word analysis suggests that at least 13.5% of 2024 abstracts were processed with LLMs. This lower bound differed across disciplines, countries, and journals, reaching 40% for some subcorpora. We show that LLMs have had an unprecedented impact on scientific writing in biomedical research, surpassing the effect of major world events such as the Covid pandemic.
象 ChatGPT 这样的大语言模型(LLMs) 能够产生和修改具有人性性性能的文本。 这些模型具有明显的局限性:它们可以产生不准确的信息,强化现有的偏见,并容易被滥用。 然而,许多科学家在学术著作中使用了这些信息。 但是,在学术文献中,这种LLM的使用范围有多广? 为了回答生物医学研究领域的这个问题,我们提出了一种不带偏见的大规模方法:我们研究2010-2024年由PubMed索引的1500多万生物医学摘要的词汇变化,并表明LMs的外观如何导致某些风格词的频率突然增加。这种超量的字数分析表明,2024年摘要中至少有13.5%是用LMs处理的。这种较低的约束在学科、国家和期刊中是不同的,在某些子公司中达到40%。 我们表明LMs对生物医学研究的科学写作产生了前所未有的影响,超过了Covid大流行病等重大世界事件的影响。
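An excess-vocabulary measurement of this kind compares a word's observed frequency with the frequency expected had earlier trends continued. The sketch below assumes a simple linear extrapolation from two earlier years and uses invented frequencies; the function names and numbers are hypothetical and are not taken from the paper.

```python
def expected_frequency(f_prev2, f_prev1, steps_ahead=1):
    """Linear extrapolation from two consecutive earlier years, clipped at zero."""
    return max(f_prev1 + steps_ahead * (f_prev1 - f_prev2), 0.0)

def excess(observed, expected):
    """Return the excess frequency gap and the frequency ratio for one word."""
    gap = observed - expected
    ratio = observed / expected if expected > 0 else float("inf")
    return gap, ratio

# Hypothetical share of abstracts containing one style word.
f_2021, f_2022, f_2024 = 0.010, 0.012, 0.050

q = expected_frequency(f_2021, f_2022, steps_ahead=2)  # counterfactual 2024 share
gap, ratio = excess(f_2024, q)
```

A large `gap` flags common words whose usage jumped, while a large `ratio` flags rare words that became markedly more frequent; either can serve as evidence of a vocabulary shift.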
Article 48
Title@2025-07-03 (4): Benchmarking Akan ASR Models Across Domain-Specific Datasets: A Comparative Evaluation of Performance, Scalability, and Adaptability
Title: Benchmarking Akan ASR Models Across Domain-Specific Datasets: A Comparative Evaluation of Performance, Scalability, and Adaptability | Benchmarking Akan ASR-Modelle über Domain-spezifische Datensätze: Eine vergleichende Bewertung von Leistung, Skalierbarkeit und Anpassungsfähigkeit | 确定Akan ASR模型基准的全域具体数据集:业绩比较评价、可缩放性和可调适性 2507.02407v1 |
Authors (8): Mark Atta Mensah, Isaac Wiafe, Akon Ekpezu, Justice Kwame Appati, Jamal-Deen Abdulai, Akosua Nyarkoa Wiafe-Akenten, Frank Ernest Yeboah, Gifty Odame
Most existing automatic speech recognition (ASR) research evaluates models using in-domain datasets and seldom examines how those models generalize across diverse speech contexts. This study addresses this gap by benchmarking seven Akan ASR models built on transformer architectures, such as Whisper and Wav2Vec2, using four Akan speech corpora to determine their performance. These datasets encompass various domains, including culturally relevant image descriptions, informal conversations, biblical scripture readings, and spontaneous financial dialogues. A comparison of the word error rate and character error rate highlighted domain dependency, with models performing optimally only within their training domains while showing marked accuracy degradation in mismatched scenarios. This study also identified distinct error behaviors between the Whisper and Wav2Vec2 architectures. Whereas fine-tuned Whisper Akan models led to more fluent but potentially misleading transcription errors, Wav2Vec2 produced more obvious yet less interpretable outputs when encountering unfamiliar inputs. This trade-off between readability and transparency in ASR errors should be considered when selecting architectures for low-resource language (LRL) applications. These findings highlight the need for targeted domain adaptation techniques, adaptive routing strategies, and multilingual training frameworks for Akan and other LRLs.
大多数现有的自动语音识别(ASR)研究都评估了使用内域数据集的模型,但很少评价它们如何对不同演讲背景加以概括。本研究通过对建立在变压器结构上,例如Whisper和Wav2Vec2的7个Akan ASR模型进行基准化来弥补这一差距,使用4个Akan 语音识别(ASR)研究来确定其性能。这些数据集包含不同的领域,包括文化上相关的图像描述、非正式对话、圣经读取和自发的财务对话。对单词误差率和字符误差率的比较突出显示了域依赖性,模型只在培训领域最优化地运行,同时显示不匹配情景中明显的准确性退化。本研究还查明了Whisper和Wav2Vec2结构之间的明显错误行为。而微调Whisper Akan模型导致更多流利,但可能产生误导性的校正错误。当遇到不熟悉的投入时,Wav2Vec2生成了更明显但解释性更小的产出。在ASR的可读性和透明度错误之间,在选择低资源语言结构(LKAN)应用和适应性应用领域技术时,这些结论强调其他适应性框架需要。
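Word error rate (WER) and character error rate (CER), the two metrics compared in this abstract, are both edit distances normalized by reference length. A minimal sketch follows, assuming whitespace tokenization and a non-empty reference; the example sentences are English placeholders, not data from the study.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete = 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference word count."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

example_wer = wer("the cat sat down", "the cat sit")
example_cer = cer("abc", "abd")
```

A fluent but wrong transcript can score a WER similar to an obviously broken one, which is why the abstract distinguishes error rates from how readable or transparent the errors are.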
Article 49
Title@2025-07-03 (4): AIn’t Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation
Title: AIn’t Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation | AIn’t Not Nothing But a Survey? Mit großen Sprachmodellen für die Codierung Deutsch Open-Ended Survey Responses on Survey Motivation | 使用大语言模型对德国关于调查动机的开放式调查答复进行编码 2506.14634v3 |
Authors (4): Leah von der Heyde, Anna-Carolina Haensch, Bernd Weiß, Jessica Daikeler
The recent development and wider accessibility of LLMs have spurred discussions about how they can be used in survey research, including classifying open-ended survey responses. Due to their linguistic capacities, it is possible that LLMs are an efficient alternative to time-consuming manual coding and the pre-training of supervised machine learning models. As most existing research on this topic has focused on English-language responses relating to non-complex topics or on single LLMs, it is unclear whether its findings generalize and how the quality of these classifications compares to established methods. In this study, we investigate to what extent different LLMs can be used to code open-ended survey responses in other contexts, using German data on reasons for survey participation as an example. We compare several state-of-the-art LLMs and several prompting approaches, and evaluate the LLMs’ performance by using human expert codings. Overall performance differs greatly between LLMs, and only a fine-tuned LLM achieves satisfactory levels of predictive performance. Performance differences between prompting approaches are conditional on the LLM used. Finally, LLMs’ unequal classification performance across different categories of reasons for survey participation results in different categorical distributions when not using fine-tuning. We discuss the implications of these findings, both for methodological research on coding open-ended responses and for their substantive analysis, and for practitioners processing or substantively analyzing such data. Finally, we highlight the many trade-offs researchers need to consider when choosing automated methods for open-ended response classification in the age of LLMs. In doing so, our study contributes to the growing body of research about the conditions under which LLMs can be efficiently, accurately, and reliably leveraged in survey research.
由于LLMs在语言上能够有效地替代耗费时间的手工编码和受监督的机器学习模式的预培训,由于关于这个专题的大多数现有研究侧重于与非复杂专题或单一LLMs有关的英语答复,因此不清楚它的调查结果是否概括了这些分类的质量,以及这些分类的质量如何与既定方法相比较。在本研究中,我们调查在多大程度上可以使用不同的LMs来规范其他情况下的开放式调查答复,利用德国关于参与调查的原因的数据作为实例。我们比较了一些最先进的LLMs和一些快速的方法,并通过使用人类专家的编码来评价LMs的业绩。LMs的总体业绩差异很大,只有经过精细调的LMM才能达到令人满意的预测性业绩水平。在使用准确的LMM方法时,对业绩的差别性差异以准确的LM方法为条件。最后,LMs在调查的不同类别下,对参与调查原因的不平等的分类工作表现,作为参与的原因,作为一个例子,我们比较一些最新的LLMsms和一些快速的方法,在进行这种分析时,我们需要对这些研究的公开性分析。
Article 50
Title@2025-07-03 (4): JoyTTS: LLM-based Spoken Chatbot With Voice Cloning
Title: JoyTTS: LLM-based Spoken Chatbot With Voice Cloning | JoyTTS: LLM-basierter gesprochener Chatbot mit Voice Cloning | 以LLM为基地的 “ 配有语音克隆的口音聊天机器人 “ 2507.02380v1 |
Authors (3): Fangru Zhou, Jun Zhao, Guoxin Wang
JoyTTS is an end-to-end spoken chatbot that combines large language models (LLMs) with text-to-speech (TTS) technology, featuring voice cloning capabilities. This project is built upon the open-source MiniCPM-o and CosyVoice2 models and trained on 2000 hours of conversational data. We have also provided the complete training code to facilitate further development and optimization by the community. On the seed-tts-zh test set, it achieves an SS (speaker similarity) score of 0.73 and a WER (Word Error Rate) of 5.09. The code and models, along with training and inference scripts, are available at https://github.com/jdh-algo/JoyTTS.git.
JoyTTS是一个端到端的语音聊天机器人,将大语言模型(LLM)与文本转语音(TTS)技术相结合,具有语音克隆能力。该项目以开源的MiniCPM-o和CosyVoice2模型为基础,在2000小时的对话数据上进行了训练。我们还提供了完整的训练代码,以便社区进一步开发和优化。在seed-tts-zh测试集上,它取得了0.73的SS(说话人相似度)得分和5.09的WER(词错误率)。代码和模型以及训练和推理脚本可在https://github.com/jdh-algo/JoyTTS.git上获取。
Article 51
Title@2025-07-03 (4): Efficient Code LLM Training via Distribution-Consistent and Diversity-Aware Data Selection
Title: Efficient Code LLM Training via Distribution-Consistent and Diversity-Aware Data Selection | Effiziente Code-LLM-Schulung über Distribution-Konsistenz und Diversity-Aware-Datenauswahl | 通过分配和多样性软件数据选择进行高效率的守则LLM培训 2507.02378v1 |
Authors (3): Weijie Lyu, Sheng-Jun Huang, Xuan Xia
Recent advancements in large language models (LLMs) have significantly improved code generation and program comprehension, accelerating the evolution of software engineering. Current methods primarily enhance model performance by leveraging vast amounts of data, focusing on data quantity while often overlooking data quality, thereby reducing training efficiency. To address this, we introduce an approach that utilizes a parametric model for code data selection, aimed at improving both training efficiency and model performance. Our method optimizes the parametric model to ensure distribution consistency and diversity within the selected subset, guaranteeing high-quality data. Experimental results demonstrate that, using only 10K samples, our method achieves gains of 2.4% (HumanEval) and 2.3% (MBPP) over the 92K full-sampled baseline, outperforming other sampling approaches in both performance and efficiency. This underscores that our method effectively boosts model performance while significantly reducing computational costs.
近期在大型语言模型(LLMs)方面的进步大大提高了代码生成和方案理解,加快了软件工程的演变。目前的方法主要是通过利用大量数据提高模型性能,注重数据数量,同时往往忽略数据质量,从而降低培训效率。为了解决这一问题,我们采用了一种方法,利用参数模型选择代码数据,以提高培训效率和模型性能。我们的方法优化了参数模型,以确保选定子集的分布一致性和多样性,保证高质量的数据。实验结果显示,仅使用10K样本,我们的方法就取得了超过92K完整抽样基线的2.4%(人类Val)和2.3%(MBPPP)的收益,在业绩和效率两方面都超过了其他抽样方法。这突出表明,我们的方法有效地提高了模型性能,同时大幅降低了计算成本。
Article 52
Title@2025-07-03 (4): QFFN-BERT: An Empirical Study of Depth, Performance, and Data Efficiency in Hybrid Quantum-Classical Transformers
Title: QFFN-BERT: An Empirical Study of Depth, Performance, and Data Efficiency in Hybrid Quantum-Classical Transformers | QFFN-BERT: Eine empirische Studie über Tiefe, Leistung und Dateneffizienz in hybriden Quantum-Klassischen Transformern | QFFN-BERT:对混合量子-分类变异器的深度、性能和数据效率的经验研究 2507.02364v1 |
Authors (1): Pilsung Kang
Parameterized quantum circuits (PQCs) have recently emerged as promising components for enhancing the expressibility of neural architectures. In this work, we introduce QFFN-BERT, a hybrid quantum-classical transformer where the feedforward network (FFN) modules of a compact BERT variant are replaced by PQC-based layers. This design is motivated by the dominant parameter contribution of FFNs, which account for approximately two-thirds of the parameters within standard Transformer encoder blocks. While prior studies have primarily integrated PQCs into self-attention modules, our work focuses on the FFN and systematically investigates the trade-offs between PQC depth, expressibility, and trainability. Our final PQC architecture incorporates a residual connection, both $R_Y$ and $R_Z$ rotations, and an alternating entanglement strategy to ensure stable training and high expressibility. Our experiments, conducted on a classical simulator, on the SST-2 and DBpedia benchmarks demonstrate two key findings. First, a carefully configured QFFN-BERT achieves up to 102.0% of the baseline accuracy, surpassing its classical counterpart in a full-data setting while reducing FFN-specific parameters by over 99%. Second, our model exhibits a consistent and competitive edge in few-shot learning scenarios, confirming its potential for superior data efficiency. These results, supported by an ablation study on a non-optimized PQC that failed to learn, confirm that PQCs can serve as powerful and parameter-efficient alternatives to classical FFNs when co-designed with foundational deep learning principles.
参数化量子电路(PQC)近来成为增强神经网络架构表达能力的有前景的组件。在这项工作中,我们提出了QFFN-BERT,这是一种混合量子-经典Transformer,其中紧凑型BERT变体的前馈网络(FFN)模块被基于PQC的层所取代。这一设计的动机在于FFN的参数占比最大,约占标准Transformer编码器块参数的三分之二。以往的研究主要将PQC集成到自注意力模块中,而我们的工作聚焦于FFN,并系统地研究PQC深度、表达能力和可训练性之间的权衡。我们最终的PQC架构包含残差连接、$R_Y$和$R_Z$旋转以及交替纠缠策略,以确保训练稳定和高表达能力。我们在经典模拟器上针对SST-2和DBpedia基准进行的实验得出两个关键发现。第一,经过精心配置的QFFN-BERT最高可达到基线准确率的102.0%,在全数据设置下超过其经典对应模型,同时将FFN相关参数减少99%以上。第二,我们的模型在少样本学习场景中表现出稳定且有竞争力的优势,证实了其在数据效率方面的潜力。这些结果得到了一项消融研究的支持(未经优化的PQC无法学习),证实了在与深度学习基本原则协同设计时,PQC可以作为经典FFN的强大且参数高效的替代方案。
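The "approximately two-thirds" figure quoted in the abstract can be checked with a back-of-the-envelope count. The derivation below is a sketch under stated assumptions, not the paper's own accounting: it takes a standard encoder block with hidden size $d$, assumes the common BERT convention $d_{\text{ff}} = 4d$, and ignores biases and layer-norm parameters.

```latex
\begin{align*}
P_{\text{attn}} &= 4d^2 && (W_Q, W_K, W_V, W_O \in \mathbb{R}^{d \times d}) \\
P_{\text{FFN}} &= 2 \cdot d \cdot d_{\text{ff}} = 8d^2 && (d_{\text{ff}} = 4d) \\
\frac{P_{\text{FFN}}}{P_{\text{attn}} + P_{\text{FFN}}} &= \frac{8d^2}{12d^2} = \frac{2}{3}
\end{align*}
```

With a different ratio $d_{\text{ff}}/d$ the share changes accordingly, which is why the abstract hedges with "approximately".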
Article 53
Title@2025-07-03 (4): Improving the Robustness of Distantly-Supervised Named Entity Recognition via Uncertainty-Aware Teacher Learning and Student-Student Collaborative Learning
Title: Improving the Robustness of Distantly-Supervised Named Entity Recognition via Uncertainty-Aware Teacher Learning and Student-Student Collaborative Learning | Verbesserung der Robustheit der distantly-überwachten Anerkennung von Personen mit Namen durch unsicheres Lehrerlernen und studentisch-studentisches kollaboratives Lernen | 通过不确定-软件教师学习和学生-学生合作学习,提高以不确定-软件教师学习和学生-学生合作学习的方式,提高以不确定-软件命名的实体识别的力度 2311.08010v3 |
Authors (7): Shuzheng Si, Helan Hu, Haozhe Zhao, Shuang Zeng, Kaikai An, Zefan Cai, Baobao Chang
Distantly-Supervised Named Entity Recognition (DS-NER) is widely used in real-world scenarios. It can effectively alleviate the burden of annotation by matching entities in existing knowledge bases with snippets in the text, but it suffers from label noise. Recent works attempt to adopt the teacher-student framework to gradually refine the training labels and improve the overall robustness. However, these teacher-student methods achieve limited performance because the poor calibration of the teacher network produces incorrectly pseudo-labeled samples, leading to error propagation. Therefore, we propose: (1) Uncertainty-Aware Teacher Learning that leverages the prediction uncertainty to reduce the number of incorrect pseudo labels in the self-training stage; (2) Student-Student Collaborative Learning that allows the transfer of reliable labels between two student networks instead of indiscriminately relying on all pseudo labels from its teacher, and further enables a full exploration of mislabeled samples rather than simply filtering unreliable pseudo-labeled samples. We evaluate our proposed method on five DS-NER datasets, demonstrating that our method is superior to the state-of-the-art DS-NER methods.
在现实世界情景中广泛使用隐蔽的命名实体识别(DS-NER),通过将现有知识库中的实体与文本中的片段相匹配,但受到标签噪音的影响,可以有效地减轻批注负担。最近的工作试图采用师生框架,逐步完善培训标签,提高总体稳健性。然而,这些师生方法绩效有限,因为教师网络的校准差,产生了错误的假标签样本,导致错误的传播。 因此,我们提议:(1) 不确定性教师学习,利用预测不确定性来减少自培训阶段不正确的假标签的数量;(2) 学生-学生协作学习,允许两个学生网络之间转让可靠的标签,而不是不加区别地依赖教师的所有假标签,进一步允许充分探索标签错误的样本,而不是简单地过滤不可靠的伪标签样本。我们评估了我们提议的关于DS-NER数据集的各种方法,表明我们的方法优于先进的DS-NER方法。
Article 54
Title@2025-07-03 (4): Coling-UniA at SciVQA 2025: Few-Shot Example Retrieval and Confidence-Informed Ensembling for Multimodal Large Language Models
Title: Coling-UniA at SciVQA 2025: Few-Shot Example Retrieval and Confidence-Informed Ensembling for Multimodal Large Language Models | Coling-UniA bei SciVQA 2025: Wenig-heißes Beispiel Retrieval und Vertrauen-informierte Montage für multimodale große Sprachmodelle | 在SciVQA 2025 SciVQA 的Coling-UniA:多式大语言模型的很少热实例检索和信任化组合 2507.02357v1 |
Authors (3): Christian Jaumann, Annemarie Friedrich, Rainer Lienhart
This paper describes our system for the SciVQA 2025 Shared Task on Scientific Visual Question Answering. Our system employs an ensemble of two Multimodal Large Language Models and various few-shot example retrieval strategies. The model and few-shot setting are selected based on the figure and question type. We also select answers based on the models’ confidence levels. On the blind test data, our system ranks third out of seven with an average F1 score of 85.12 across ROUGE-1, ROUGE-L, and BERTScore. Our code is publicly available.
本文描述我们2025年SciVQA科学视觉问题回答共同任务系统。我们的系统使用两种多式大语言模型和各种微小例子检索策略的组合。模型和短镜头设置是根据数字和问题类型选择的。我们还根据模型的自信度选择答案。在盲点测试数据中,我们的系统在七分中排在第三位,在ROUGE-1、ROUGE-L和BERTS中平均F1分为85.12分。我们的代码是公开的。
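The abstract states that answers are selected based on the models' confidence levels but gives no details. The snippet below is a minimal sketch of one plausible reading, assuming each ensemble member returns an answer together with a scalar confidence in [0, 1] (for example a mean token probability) and that the most confident answer wins. The function name `select_answer` and the score format are hypothetical, not taken from the paper.

```python
from typing import List, Tuple


def select_answer(candidates: List[Tuple[str, float]]) -> str:
    """Return the answer with the highest confidence score.

    Each candidate is (answer_text, confidence). The confidence definition
    is an assumption of this sketch, not the system's documented metric.
    Ties are resolved in favour of the earlier candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate answer")
    best_answer, _ = max(candidates, key=lambda pair: pair[1])
    return best_answer


if __name__ == "__main__":
    # Two hypothetical model outputs for the same figure question.
    print(select_answer([("42.5", 0.61), ("43", 0.87)]))
```

A real system would likely also calibrate the scores per model before comparing them, since raw confidences from different models are not directly comparable.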
Article 55
Title@2025-07-03 (4): Incorporating LLMs for Large-Scale Urban Complex Mobility Simulation
Title: Incorporating LLMs for Large-Scale Urban Complex Mobility Simulation | Einschließlich LLMs für großräumige Urban Complex Mobility Simulation | 大型城市综合流动模拟项目LLMs 2505.21880v2 |
Authors (8): Yu-Lun Song, Chung-En Tsern, Che-Cheng Wu, Yu-Ming Chang, Syuan-Bo Huang, Wei-Chu Chen, Michael Chia-Liang Lin, Yu-Ta Lin
This study presents an innovative approach to urban mobility simulation by integrating a Large Language Model (LLM) with Agent-Based Modeling (ABM). Unlike traditional rule-based ABM, the proposed framework leverages LLM to enhance agent diversity and realism by generating synthetic population profiles, allocating routine and occasional locations, and simulating personalized routes. Using real-world data, the simulation models individual behaviors and large-scale mobility patterns in Taipei City. Key insights, such as route heat maps and mode-specific indicators, provide urban planners with actionable information for policy-making. Future work focuses on establishing robust validation frameworks to ensure accuracy and reliability in urban planning applications.
与传统的基于规则的反弹道导弹框架不同,拟议框架利用LLM,通过制作合成人口概况、分配常规和偶发地点以及模拟个人化路线,加强代理人多样性和现实主义。 利用台北市的现实世界数据、模拟模型个人行为和大规模流动模式,重要见解,如路线热图和模式特定指标,为城市规划者提供了可供决策使用的信息。未来工作的重点是建立强有力的验证框架,以确保城市规划应用的准确性和可靠性。
Article 56
Title@2025-07-03 (4): Decision-Oriented Text Evaluation
Title: Decision-Oriented Text Evaluation | Entscheidungsorientierte Textbewertung | 注重决定的案文评价 2507.01923v2 |
Authors (3): Yu-Shiang Huang, Chuan-Ju Wang, Chung-Chi Chen
Natural language generation (NLG) is increasingly deployed in high-stakes domains, yet common intrinsic evaluation methods, such as n-gram overlap or sentence plausibility, weakly correlate with actual decision-making efficacy. We propose a decision-oriented framework for evaluating generated text by directly measuring its influence on human and large language model (LLM) decision outcomes. Using market digest texts (including objective morning summaries and subjective closing-bell analyses) as test cases, we assess decision quality based on the financial performance of trades executed by human investors and autonomous LLM agents informed exclusively by these texts. Our findings reveal that neither humans nor LLM agents consistently surpass random performance when relying solely on summaries. However, richer analytical commentaries enable collaborative human-LLM teams to outperform individual human or agent baselines significantly. Our approach underscores the importance of evaluating generated text by its ability to facilitate synergistic decision-making between humans and LLMs, highlighting critical limitations of traditional intrinsic metrics.
自然语言生成(NLG)越来越多地用于高接触领域,但共同的内在评价方法,如n-gram重叠或句子可信度,与实际决策效率关系不大。我们提议了一个以决策为导向的框架,通过直接衡量对人文和大语言模型(LLM)决策结果的影响来评价产生的文本。我们利用市场摘要文本,包括客观的上午摘要和主观的结束铃式分析,作为测试案例,根据完全以这些文本为依据的投资者和自主LLM代理商所从事贸易的财务业绩评估决定质量。我们的调查结果显示,人类或LLM代理商在仅仅依赖摘要时,总是不总是超过随机性业绩。然而,较丰富的分析评注使得人文-LLM团队能够大大超过单个人或代理人的基线。我们的方法强调,通过能够促进人与LMs之间的协同决策,突出传统内在指标的关键局限性,对产生的文本进行评估的重要性。
Article 57
Title@2025-07-03 (4): Token Prepending: A Training-Free Approach for Eliciting Better Sentence Embeddings from LLMs
Title: Token Prepending: A Training-Free Approach for Eliciting Better Sentence Embeddings from LLMs | Token Prepending: Ein trainingsfreier Ansatz zur Eliziierung besserer Sentence-Embeddings von LLMs | Token Predudo:从LLM女士那里采用不培训办法,使判刑内容更好地嵌入Elibear 2412.11556v2 |
Authors (7): Yuchen Fu, Zifeng Cheng, Zhiwei Jiang, Zhonghui Wang, Yafeng Yin, Zhengliang Li, Qing Gu
Extracting sentence embeddings from large language models (LLMs) is a promising direction, as LLMs have demonstrated stronger semantic understanding capabilities. Previous studies typically focus on prompt engineering to elicit sentence embeddings from LLMs by prompting the model to encode sentence information into the embedding of the last token. However, LLMs are mostly decoder-only models with causal attention, so the earlier tokens in the sentence cannot attend to the later tokens, resulting in biased encoding of sentence information and cascading effects on the final decoded token. To this end, we propose a novel Token Prepending (TP) technique that prepends each layer’s decoded sentence embedding to the beginning of the sentence in the next layer’s input, allowing earlier tokens to attend to the complete sentence information under the causal attention mechanism. The proposed TP technique is plug-and-play and training-free, which means it can be seamlessly integrated with various prompt-based sentence embedding methods and autoregressive LLMs. Extensive experiments on various Semantic Textual Similarity (STS) tasks and downstream classification tasks demonstrate that our proposed TP technique can significantly improve the performance of existing prompt-based sentence embedding methods across different LLMs, while incurring negligible additional inference cost.
从大语言模型(LLMS)中摘取的句子是一个很有希望的方向,因为LLMS已经展示出更强的语义理解能力。以前的研究通常侧重于快速工程,通过推动模型将句子信息编码到最后象征的嵌入中,从LLLMS中引出句子嵌入句子。然而,LLMS大多是具有因果关注和句子中早期符号的解码模式,无法触及后一种符号,导致对判决信息进行有偏见的编码,并对最后解码符号产生连锁效应。为此,我们提议了一种新的Token Preduction(TP)技术,将各层解码的句子嵌入下句子的开始,从而促使该模型将句子信息编码入最后象征的嵌入最后象征;然而,拟议的TP技术是一种插接和游戏及培训的早期符号,意味着它可以与各种基于即时的句子嵌入法和自动递增式LMSMS. 进行广泛的实验,将每个层的句子解码化的句子(STSS)任务和下游分解法中的新版本化方法可以大大改进我们现有的递增的递增的递制的递制的递减法。
Article 58
Title@2025-07-03 (4): Layered Insights: Generalizable Analysis of Authorial Style by Leveraging All Transformer Layers
Title: Layered Insights: Generalizable Analysis of Authorial Style by Leveraging All Transformer Layers | Layered Insights: Generalisierbare Analyse des Autorial Styles durch Hebelisierung aller Transformer Layers | 图层透视: 通过利用所有变换层对文件样式的通用分析 2503.00958v2 |
Authors (5): Milad Alshomary, Nikhil Reddy Varimalla, Vishal Anand, Smaranda Muresan, Kathleen McKeown
We propose a new approach for the authorship attribution task that leverages the various linguistic representations learned at different layers of pre-trained transformer-based models. We evaluate our approach on three datasets, comparing it to a state-of-the-art baseline in in-domain and out-of-domain scenarios. We found that utilizing various transformer layers improves the robustness of authorship attribution models when tested on out-of-domain data, resulting in new state-of-the-art results. Our analysis gives further insights into how our model’s different layers get specialized in representing certain stylistic features that benefit the model when tested out of the domain.
我们为作者归属任务提出了一个新办法,利用在经过培训的以变压器为基础的不同层面所学到的各种语言表述方法。我们评估了我们关于三个数据集的方法,将其与在域内和域外情景中最先进的基线进行比较。我们发现,在对域外数据进行测试时,利用不同的变压器层可以提高作者归属模型的稳健性,从而产生新的最新结果。我们的分析进一步揭示了我们模型的不同层面如何专门代表某些在域外测试时有益于模型的典型特征。
Article 59
Title@2025-07-03 (4): Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy
Title: Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy | Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy | Skywork-Reward-V2:通过人类-AI协同增强优先数据曲线 2507.01352v2 |
Authors (12): Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, Yahui Zhou
Despite the critical role of reward models (RMs) in reinforcement learning from human feedback (RLHF), current state-of-the-art open RMs perform poorly on most existing evaluation benchmarks, failing to capture the spectrum of nuanced and sophisticated human preferences. Even approaches that incorporate advanced training techniques have not yielded meaningful performance improvements. We hypothesize that this brittleness stems primarily from limitations in preference datasets, which are often narrowly scoped, synthetically labeled, or lack rigorous quality control. To address these challenges, we present a large-scale preference dataset comprising 40 million preference pairs, named SynPref-40M. To enable data curation at scale, we design a human-AI synergistic two-stage pipeline that leverages the complementary strengths of human annotation quality and AI scalability. In this pipeline, humans provide verified annotations, while large language models perform automatic curation based on human guidance. Training on this preference mixture, we introduce Skywork-Reward-V2, a suite of eight reward models ranging from 0.6B to 8B parameters, trained on a carefully curated subset of 26 million preference pairs from SynPref-40M. We demonstrate that Skywork-Reward-V2 is versatile across a wide range of capabilities, including alignment with human preferences, objective correctness, safety, resistance to stylistic biases, and best-of-N scaling, achieving state-of-the-art performance across seven major reward model benchmarks. Ablation studies confirm that the effectiveness of our approach stems not only from data scale but also from high-quality curation. The Skywork-Reward-V2 series represents substantial progress in open reward models, highlighting the untapped potential of existing preference datasets and demonstrating how human-AI curation synergy can unlock significantly higher data quality.
尽管奖励模型(RM)在基于人类反馈的强化学习(RLHF)中发挥着关键作用,但目前最先进的开源RM在大多数现有评测基准上表现不佳,未能捕捉细致而复杂的人类偏好。即使采用先进训练技术的方法也没有带来有意义的性能提升。我们推测,这种脆弱性主要源于偏好数据集的局限:这些数据集往往范围狭窄、由合成方式标注,或缺乏严格的质量控制。为应对这些挑战,我们提出了一个包含4000万个偏好对的大规模偏好数据集,名为SynPref-40M。为了实现大规模数据整理,我们设计了一个人机协同的两阶段流程,利用人工标注质量与AI可扩展性的互补优势。在该流程中,人类提供经过验证的标注,大型语言模型则根据人类指导进行自动整理。基于这一偏好数据混合进行训练,我们推出了Skywork-Reward-V2,这是一套参数规模从0.6B到8B的八个奖励模型,在从SynPref-40M中精心筛选的2600万个偏好对子集上训练。我们证明Skywork-Reward-V2在广泛的能力上表现全面,包括与人类偏好的对齐、客观正确性、安全性、对风格偏差的抵抗力以及best-of-N扩展,并在七个主要奖励模型基准上达到最先进的性能。消融研究证实,我们方法的有效性不仅来自数据规模,也来自高质量的数据整理。Skywork-Reward-V2系列代表了开源奖励模型的实质性进展,凸显了现有偏好数据集尚未发掘的潜力,并展示了人机协同整理如何显著提升数据质量。
Article 60
Title@2025-07-03 (4): Aligning Frozen LLMs by Reinforcement Learning: An Iterative Reweight-then-Optimize Approach
Title: Aligning Frozen LLMs by Reinforcement Learning: An Iterative Reweight-then-Optimize Approach | Ausrichten von gefrorenen LLMs durch Verstärkungslernen: Ein iteratives Reweight-then-Optimize-Ansatz | 通过强化学习将冻结的LLMs与 “ 强化学习:一种过渡性再加权再优化方法 “ 相匹配 2506.17828v2 |
Authors (9): Xinnan Zhang, Chenliang Li, Siliang Zeng, Jiaxiang Li, Zhongruo Wang, Kaixiang Lin, Songtao Lu, Alfredo Garcia, Mingyi Hong
Aligning large language models (LLMs) with human preferences usually requires fine-tuning methods such as RLHF and DPO. These methods directly optimize the model parameters, so they cannot be used in test-time to improve model performance, nor are they applicable when the model weights are not accessible. In contrast, test-time methods sidestep weight updates by leveraging reward functions to guide and improve output quality. However, they incur high inference costs, and their one-shot guidance is often based on imperfect reward or value functions, leading to suboptimal outputs. In this work, we present a method named Iterative Reweight-then-Optimize (IRO), a reinforcement learning (RL) framework that performs RL-style alignment of the (frozen) base model without touching its parameters. During training, each iteration (i) samples candidates from the base model, (ii) resamples using current value functions, and (iii) trains a new lightweight value function that guides the next decoding pass. At test time, the value functions are used to guide the base model generation via a search-based optimization process. Notably, users can apply IRO to align a model on their own dataset, similar to OpenAI’s reinforcement fine-tuning (RFT), but without requiring access to the model weights.
将大型语言模型(LLMS)与人类偏好对齐通常要求微调方法,如 RLHF 和 DPO 。 这些方法直接优化模型参数。 这些方法直接优化模型参数, 因而无法在测试时用于改进模型性能, 当模型重量无法使用时, 也不适用这些参数。 相反, 测试时间方法通过利用奖赏功能来更新边际权重, 以引导和改进产出质量, 产生高的推论成本, 他们的单向指导通常基于不完善的奖赏或价值功能, 导致次优化产出。 在这项工作中, 我们提出了一个名为“ 迭代超重时优化( IRO) ” 的方法, 一个强化学习( RL) 框架, 用来在不触动模型参数的情况下进行 RL 型对齐 。 相比之下, 测试期间, 测试对象( i) 利用当前价值函数进行抽样, 以及 (iii) 培训一个新的轻度值值值值值值值值函数, 引导下一个解码。 测试时, 将值函数用于指导基础模型生成, 模型生成, 通过基于基于搜索的搜索- RFSBSE , 。 用户可以应用到 IP AS AS 。
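Step (ii) of the training loop in the abstract, resampling candidates using the current value functions, can be illustrated with a small sketch. The abstract does not specify the weighting rule, so the softmax-over-values weighting, the `temperature` parameter, and the function names below are assumptions made for illustration only.

```python
import math
import random
from typing import List, Optional, Sequence


def value_weights(values: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Turn value estimates into normalized sampling weights (softmax).

    The maximum is subtracted before exponentiating for numerical stability.
    """
    top = max(values)
    exps = [math.exp((v - top) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def resample(candidates: Sequence[str], values: Sequence[float], k: int,
             temperature: float = 1.0,
             rng: Optional[random.Random] = None) -> List[str]:
    """Draw k candidates with replacement, favouring high-value ones."""
    rng = rng or random.Random(0)
    weights = value_weights(values, temperature)
    return rng.choices(list(candidates), weights=weights, k=k)


if __name__ == "__main__":
    # Hypothetical base-model samples and value-function scores.
    samples = ["draft A", "draft B", "draft C"]
    scores = [0.1, 1.2, 0.4]
    print(resample(samples, scores, k=4))
```

The base model is never updated in this sketch; only the reweighting changes which candidates are carried into the next iteration, which mirrors the frozen-model setting the abstract describes.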
Article 61
Title@2025-07-03 (4): Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding
Title: Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding | Fast-dLLM: Trainingsfreie Beschleunigung von Diffusion LLM durch Ermöglichen von KV Cache und Paralleldecoding | 快速dLLM:通过授权 KV 缓存和平行编码加速免培训传播LLM 2505.22618v3 |
Authors (9): Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, Enze Xie
Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. However, the practical inference speed of open-sourced Diffusion LLMs often lags behind autoregressive models due to the lack of Key-Value (KV) Cache and quality degradation when decoding multiple tokens simultaneously. To bridge this gap, we introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop. Additionally, we identify the root cause of generation quality degradation in parallel decoding as the disruption of token dependencies under the conditional independence assumption. To address this, we propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality. Experimental results on LLaDA and Dream models across multiple LLM benchmarks demonstrate up to 27.6× throughput improvement with minimal accuracy loss, closing the performance gap with autoregressive models and paving the way for practical deployment of Diffusion LLMs.
基于扩散的大型语言模型(Diffusion LLM)在具有并行解码能力的非自回归文本生成方面展现出潜力。然而,由于缺少键值(KV)缓存,且同时解码多个词元时质量下降,开源Diffusion LLM的实际推理速度往往落后于自回归模型。为弥补这一差距,我们提出了一种为双向扩散模型量身定制的新型分块近似KV缓存机制,使缓存得以复用,而性能下降可以忽略不计。此外,我们发现并行解码中生成质量下降的根本原因在于条件独立假设破坏了词元之间的依赖关系。为此,我们提出了一种置信度感知的并行解码策略,有选择地解码置信度超过阈值的词元,从而减轻对依赖关系的破坏并保持生成质量。在LLaDA和Dream模型上针对多个LLM基准的实验结果表明,吞吐量最高可提升27.6倍,而准确率损失极小,缩小了与自回归模型的性能差距,为Diffusion LLM的实际部署铺平了道路。
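The confidence-aware parallel decoding idea in the abstract (commit only tokens whose confidence exceeds a threshold) can be sketched in a few lines. This is an illustrative toy, not the Fast-dLLM implementation: the data layout, the function name `confident_positions`, and the fallback of committing the single most confident position when nothing clears the threshold are all assumptions of this sketch.

```python
from typing import Dict, List


def confident_positions(probs: Dict[int, List[float]],
                        threshold: float) -> Dict[int, int]:
    """Choose which masked positions to commit in one parallel step.

    probs maps each still-masked position to a probability vector over the
    vocabulary. A position is committed only if its top probability exceeds
    the threshold. If no position qualifies, the most confident one is
    committed so decoding always makes progress (an assumption here).
    Returns a mapping position -> chosen token id.
    """
    # Most likely token id at every masked position.
    top = {pos: max(range(len(p)), key=p.__getitem__) for pos, p in probs.items()}
    chosen = {pos: tok for pos, tok in top.items() if probs[pos][tok] > threshold}
    if not chosen and probs:
        best = max(probs, key=lambda pos: probs[pos][top[pos]])
        chosen = {best: top[best]}
    return chosen


if __name__ == "__main__":
    step_probs = {0: [0.1, 0.9], 1: [0.5, 0.5], 2: [0.05, 0.95]}
    # Positions 0 and 2 are confident enough; position 1 stays masked.
    print(confident_positions(step_probs, threshold=0.8))
```

Positions left out of the returned mapping remain masked and are revisited in the next step, when the newly committed tokens provide additional context.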
Article 62
Title@2025-07-03 (4): Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient
Title: Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient | Bypass Back-Propagation: Optimierungsbasiertes Structural Pruning für große Sprachmodelle über Policy Gradient | Bypass 后回通信:通过 “ 政策梯度 “ 优化基于优化的结构结构,为大语言模式提供缓冲 2406.10576v3 |
Authors (5): Yuan Gao, Zujing Liu, Weizhong Zhang, Bo Du, Gui-Song Xia
Recent pruning methods for Large Language Models (LLMs) typically operate at the post-training phase without expensive weight fine-tuning. However, their pruning criteria often rely on heuristically hand-crafted metrics, potentially leading to suboptimal performance. We instead propose a novel optimization-based structural pruning that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. To preserve efficiency, our method eliminates the back-propagation through the LLM per se during optimization, requiring only the forward pass of the LLM. We achieve this by learning an underlying Bernoulli distribution to sample binary pruning masks, where we decouple the Bernoulli parameters from the LLM loss, facilitating efficient optimization via a policy gradient estimator without back-propagation. Thus, our method can 1) support global and heterogeneous pruning (i.e., automatically determine different redundancy for different layers), and 2) optionally initialize with a metric-based method (for our Bernoulli distributions). Extensive experiments conducted on LLaMA, LLaMA-2, LLaMA-3, Vicuna, and Mistral models using the C4 and WikiText2 datasets demonstrate the promising performance of our method in efficiency and effectiveness. Code is available at https://github.com/ethanygao/backprop-free_LLM_pruning.
最近的大型语言模型(LLMS)修剪方法通常在培训后阶段运作,没有昂贵的重量微调,但是,它们的修剪标准往往依赖于超自然手工制作的模量,可能导致业绩不理想。我们建议采用新的优化结构裁剪方法,直接通过优化修剪模型的丢失,在概率空间学习修剪面罩。为了保持效率,我们的方法在优化期间消除了通过LLM 本身的反向调整,只需要LM的向前传。 我们通过学习Bernoulli向模范双向模范遮罩的配送实现这一点,我们从LLM损失中分解Bernoulli参数,通过政策梯度估测仪促进高效优化,而无需反调。因此,我们的方法可以(1) 支持全球和混杂的修剪裁(即自动确定不同层次的不同冗余),以及(2) 采用基于标准的方法(我们的Bernoulli分发),我们通过Bernoulli来做到这一点。 在LMAMA、LAMA 2 和LMA 2 数据展示具有前景性的方法中,在Misabia-LMA 2中进行广泛的试验。
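The core mechanism in the abstract, learning Bernoulli mask probabilities with a policy gradient estimator that needs only forward evaluations of the loss, can be shown on a toy problem. The sketch below is a plain REINFORCE update with a running-mean baseline. The toy loss, the logit parametrization, the clamping, and all names are assumptions for illustration and do not reproduce the authors' code.

```python
import math
import random
from typing import Callable, List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def learn_mask_probs(loss_fn: Callable[[List[int]], float], n_units: int,
                     steps: int = 2000, lr: float = 0.5, seed: int = 0) -> List[float]:
    """Learn keep-probabilities for n_units using score-function gradients.

    loss_fn is only evaluated (a "forward pass"); it is never differentiated.
    """
    rng = random.Random(seed)
    logits = [0.0] * n_units
    baseline = 0.0
    for step in range(1, steps + 1):
        probs = [sigmoid(s) for s in logits]
        mask = [1 if rng.random() < p else 0 for p in probs]
        loss = loss_fn(mask)
        advantage = loss - baseline
        # Gradient of log Bernoulli(mask; sigmoid(logit)) w.r.t. logit is (mask - prob).
        logits = [max(-30.0, min(30.0, s - lr * advantage * (m - p)))
                  for s, m, p in zip(logits, mask, probs)]
        baseline += (loss - baseline) / step  # running mean reduces variance
    return [sigmoid(s) for s in logits]


def toy_loss(mask: List[int]) -> float:
    """Hypothetical stand-in for the pruned model's loss plus a sparsity cost."""
    importance = [1.0, 0.0, 0.8]
    dropped = sum(w * (1 - m) for w, m in zip(importance, mask))
    return dropped + 0.3 * sum(mask)


if __name__ == "__main__":
    print([round(p, 2) for p in learn_mask_probs(toy_loss, n_units=3)])
```

In this toy setting the useful units (first and third) should end up with high keep-probability and the useless one with low keep-probability, because dropping a useful unit costs more than the sparsity penalty for keeping it.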
Article 63
Title@2025-07-03 (4): REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models
Title: REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models | REINFORCE++: Effizienter RLHF-Algorithmus mit Robustheit sowohl für Prompt- als auch für Reward-Modelle | REINFORCE++: 高效的RLHF对快速模型和奖励模型具有强力的测算法 2501.03262v4 |
Authors (14): Jian Hu, Xibin Wu, Wei Shen, Jason Klein Liu, Zilin Zhu, Weixun Wang, Songlin Jiang, Haoran Wang, Hao Chen, Bin Chen, Weikai Fang, Xianyu, Yu Cao, Haotian Xu
Large Language Models (LLMs) fine-tuned via Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) significantly improve the alignment of human-AI values and further raise the upper bound of AI capabilities, particularly in reasoning-intensive, long-context Chain-of-Thought (long-CoT) tasks. However, existing RLHF (or RLVR) frameworks commonly face challenges such as inference bottlenecks and complexity barriers, restricting their accessibility for newcomers. To bridge this gap, we introduce OpenRLHF, a user-friendly, scalable, and easy-to-learn open-source RLHF framework built upon Ray, vLLM, DeepSpeed, and HuggingFace Transformers, featuring a simplified design, clear code structure, and comprehensive documentation to facilitate entry for researchers and practitioners. Experimental results show that OpenRLHF achieves superior training efficiency with speedups ranging from 1.22x to 1.68x across different model sizes compared to state-of-the-art frameworks, while requiring significantly fewer lines of code for implementation. OpenRLHF is publicly available at https://github.com/OpenRLHF/OpenRLHF, and has already been adopted by leading institutions to accelerate RLHF research and learning.
大型语言模型(LLM)通过 “ 人类反馈强化学习(RLHF) “ 和 “ 可验证的奖励强化学习(RLVR) “ 进行微调,大大改进了人类-AI价值的一致性,进一步提高了AI能力的上限,特别是在推理密集型、长文本链(Long-CoT)任务方面。但是,现有的RLHF(或RLVR)框架通常面临一些挑战,如推论瓶颈和复杂障碍,限制了新来者进入。为了缩小这一差距,我们引入了 “ 人类-AI “ 的强化学习(LLLHF),这是一个方便用户、可扩展和易读的开放源的开放RLHF框架,在Ray、vLLM、DeepSpeed和Hugg Face变异体(LFace Grofters)上建了一个方便研究人员进入的简化设计、清晰的代码结构和综合文件。实验结果表明,OpreloadRHF在与州-Rart框架相比,在不同的模型规模从1.22到1.6x-RHFMFML实施方面实现了优优优优优优等标准。
Article 64
Title@2025-07-03 (4): DoMIX: An Efficient Framework for Exploiting Domain Knowledge in Fine-Tuning
Title: DoMIX: An Efficient Framework for Exploiting Domain Knowledge in Fine-Tuning | DoMIX: Ein effizientes Framework zur Nutzung von Domain-Wissen im Feintuning | DoMIX:一个在微调中利用域知识的有效框架 2507.02302v1 |
Authors (3): Dohoon Kim, Donghun Kang, Taesup Moon
Domain-Adaptive Pre-training (DAP) has recently gained attention for its effectiveness in fine-tuning pre-trained models. Building on this, continual DAP has been explored to develop pre-trained models capable of incrementally incorporating different domain datasets. However, existing continual DAP methods face several limitations: (1) high computational cost and GPU memory usage during training; (2) sensitivity to incremental data order; and (3) providing a single, generalized model for all end tasks, which contradicts the essence of DAP. In this paper, we propose DoMIX, a novel approach that addresses these challenges by leveraging LoRA modules, a representative parameter-efficient fine-tuning (PEFT) method. Our approach enables efficient and parallel domain-adaptive pre-training that is robust to domain order and effectively utilizes accumulated knowledge to provide tailored pre-trained models for specific tasks. We also demonstrate that our method can be extended beyond the DAP setting to standard LLM fine-tuning scenarios. Code is available at https://github.com/dohoonkim-ai/DoMIX.
最近,在经过培训的模型的微调方面,培训前的适应性(DAP)取得了成效,这一点最近引起了人们的注意。在此基础上,对持续的培训前模式进行了探索,以开发能够逐步纳入不同领域数据集的经过培训的模型,然而,现有的持续的培训前模式面临若干限制:(1) 培训期间的计算成本高和GPU记忆使用率高;(2) 对增量数据顺序的敏感性;(3) 为所有最终任务提供单一的通用模式,这与培训前模式的本质相矛盾。我们在本文件中提议DoMIX,这是通过利用具有代表性的参数效率微调(PEFT)模块来应对这些挑战的一种新颖方法。我们的方法使得高效和平行的对域适应性培训前方法能够对域的秩序进行有力利用,并有效地利用积累的知识为具体任务提供经过专门培训的模型。我们还表明,我们的方法可以超越DAP设置的范围,扩大到标准的LM微调情景。代码可在 https://github.com/dohoonkim-ai/DoMIX查阅。
Article 65
Title@2025-07-03 (4): Commander-GPT: Fully Unleashing the Sarcasm Detection Capability of Multi-Modal Large Language Models
Title: Commander-GPT: Fully Unleashing the Sarcasm Detection Capability of Multi-Modal Large Language Models | Commander-GPT: Die Fähigkeit von Multi-Modal Large Language Models, den Sarkasmus vollständig zu entleeren | GPT指挥官:完全解除多模式大语言模型的讽刺性探测能力 2503.18681v3 |
Authors (4): Yazhou Zhang, Chunwang Zou, Bo Wang, Jing Qin
Sarcasm detection, as a crucial research direction in the field of Natural Language Processing (NLP), has attracted widespread attention. Traditional sarcasm detection tasks have typically focused on single-modal approaches (e.g., text), but due to the implicit and subtle nature of sarcasm, such methods often fail to yield satisfactory results. In recent years, researchers have shifted the focus of sarcasm detection to multi-modal approaches. However, effectively leveraging multi-modal information to accurately identify sarcastic content remains a challenge that warrants further exploration. Leveraging the powerful integrated processing capabilities of Multi-Modal Large Language Models (MLLMs) for various information sources, we propose an innovative multi-modal Commander-GPT framework. Inspired by military strategy, we first decompose the sarcasm detection task into six distinct sub-tasks. A central commander (decision-maker) then assigns the best-suited large language model to address each specific sub-task. Ultimately, the detection results from each model are aggregated to identify sarcasm. We conducted extensive experiments on MMSD and MMSD 2.0, utilizing four multi-modal large language models and six prompting strategies. Our experiments demonstrate that our approach achieves state-of-the-art performance, with a 19.3% improvement in F1 score, without necessitating fine-tuning or ground-truth rationales.
作为自然语言处理(NLP)领域一个至关重要的研究方向,传统讽刺学探测任务通常侧重于单一模式方法(如文本),但由于讽刺学隐含和微妙性质,这些方法往往不能产生令人满意的结果。近年来,研究人员将讽刺学探测的重点转向了多种模式方法。然而,有效利用多种模式信息来准确识别讽刺性内容仍然是一个需要进一步探讨的挑战。利用多种模式(MLLM)对各种信息来源的强大综合处理能力,我们提议一个创新的多模式指挥官-GPT框架。在军事战略的启发下,我们首先将讽刺学探测任务分为6个不同的子任务。中央指挥官(决策者)然后指派最合适的大语言模型来应对每一个具体的子任务。最终,每种模型的检测结果将汇集到确定讽刺学的精细处理能力。我们在不使用MMDSD1和MMSD3的大规模基础实验中进行了广泛的实验,在不使用MDSD1和MMSD3的大规模地面战略中,我们用4个基础模型来展示了我们的最新成绩。
Article 66
Title@2025-07-03 (4): Prompt-Guided Turn-Taking Prediction
Title: Prompt-Guided Turn-Taking Prediction | Prompt-geführte Turn-Taking-Vorhersage | 即时指导的回转预测 2506.21191v2 |
Authors (7): Koji Inoue, Mikey Elmers, Yahui Fu, Zi Haur Pang, Divesh Lala, Keiko Ochi, Tatsuya Kawahara
Turn-taking prediction models are essential components in spoken dialogue systems and conversational robots. Recent approaches leverage transformer-based architectures to predict speech activity continuously and in real time. In this study, we propose a novel model that enables turn-taking prediction to be dynamically controlled via textual prompts. This approach allows intuitive and explicit control through instructions such as “faster” or “calmer”, adapting dynamically to conversational partners and contexts. The proposed model builds upon a transformer-based voice activity projection (VAP) model, incorporating textual prompt embeddings into both channel-wise transformers and a cross-channel transformer. We evaluated the feasibility of our approach using over 950 hours of human-human spoken dialogue data. Since textual prompt data for the proposed approach was not available in existing datasets, we utilized a large language model (LLM) to generate synthetic prompt sentences. Experimental results demonstrated that the proposed model improved prediction accuracy and effectively varied turn-taking timing behaviors according to the textual prompts.
转换预测模型是语音对话系统和谈话机器人中必不可少的组成部分。 最近的方法利用变压器结构来连续和实时预测语音活动。 在本研究中,我们提出了一个新的模型,使变压器预测能够通过文本提示进行动态控制。 这种方法通过“ 加速” 或“ 加速” 等指示,动态地适应对话伙伴和背景,允许直觉和明确控制。 拟议的模型基于变压器声音活动预测模型,将文字快速嵌入频道变压器和跨通道变压器。 我们用超过950小时的人类语音对话数据评估了我们的方法的可行性。 由于现有数据集没有关于拟议方法的文字提示数据,我们使用一个大语言模型(LLM)来生成合成快速的句子。 实验结果显示,拟议的模型根据文字提示提高了预测的准确性,并有效地改变了变速计时行为。
Article 67
Title@2025-07-03 (4): Optimal strategies to perform multilingual analysis of social content for a novel dataset in the tourism domain
Title: Optimal strategies to perform multilingual analysis of social content for a novel dataset in the tourism domain | Optimale Strategien zur mehrsprachigen Analyse sozialer Inhalte für einen neuartigen Datensatz im Tourismusbereich | 为旅游领域新数据集的社会内容进行多语种社会内容分析的最佳最佳战略 2311.14727v2 |
Authors (6): Maxime Masson, Rodrigo Agerri, Christian Sallaberry, Marie-Noelle Bessagnet, Annig Le Parc Lacayrelle, Philippe Roose
The rising influence of social media platforms in various domains, including tourism, has highlighted the growing need for efficient and automated Natural Language Processing (NLP) strategies to take advantage of this valuable resource. However, the transformation of multilingual, unstructured, and informal texts into structured knowledge still poses significant challenges, most notably the never-ending requirement for manually annotated data to train deep learning classifiers. In this work, we study different NLP techniques to establish the best ones to obtain competitive performance while keeping the need for annotated training data to a minimum. To do so, we built the first publicly available multilingual dataset (French, English, and Spanish) for the tourism domain, composed of tourism-related tweets. The dataset includes multilayered, manually revised annotations for Named Entity Recognition (NER) for Locations and Fine-grained Thematic Concepts Extraction mapped to the Thesaurus of Tourism and Leisure Activities of the World Tourism Organization, as well as for Sentiment Analysis at the tweet level. Extensive experimentation comparing various few-shot and fine-tuning techniques with modern language models demonstrates that modern few-shot techniques allow us to obtain competitive results for all three tasks with very little annotation data: 5 tweets per label (15 in total) for Sentiment Analysis, 30 tweets for Named Entity Recognition of Locations, and 1K tweets annotated with fine-grained thematic concepts, a highly fine-grained sequence labeling task based on an inventory of 315 classes. We believe that our results, grounded in a novel dataset, pave the way for applying NLP to new domain-specific applications, reducing the need for manual annotations and circumventing the complexities of rule-based, ad-hoc solutions.
社交媒体平台在包括旅游在内的各个领域的影响不断提高,突出表明越来越需要利用这一宝贵资源,制定高效和自动化的自然语言处理(NLP)战略,以利用这一宝贵资源;然而,将多语言、无结构文本和非正式文本转换为结构化知识,这仍构成重大挑战,其中最突出的是,对人工编制附加说明的数据,以培训深层次学习分类人员的要求是永无止尽的。在这项工作中,我们研究不同的NLP技术,以建立最佳数据获得有竞争力的业绩,同时将附加说明的数据培训保持在最低限度。为此,我们为旅游领域建立了第一个公开提供的多语言(法语、英语和西班牙语)数据集(法语、英语和西班牙语),由旅游相关推文构成的旅游领域。该数据集包括多层次、手工修订的实体识别(NER)图示,用于地点和精细化主题概念,用于培训深层次的学习分类。我们研究了不同的NLP技术,以及基于推文的解解决方案。将各种新点和微调整技术与现代语言模型进行了广泛的实验,用现代的精细图解推理学应用技术降低了我们网站上的推算,使得我们得以获得高额标签的SEN的SEN数据,用于所有3号的SEN的排名。
Article 68
Title@2025-07-03 (4): Seeing Through Green: Text-Based Classification and the Firm’s Returns from Green Patents
Title: Seeing Through Green: Text-Based Classification and the Firm’s Returns from Green Patents | Durch Grün sehen: Textbasierte Klassifizierung und die Rückkehr der Firma aus grünen Patenten | 通过 “ 绿色观光:基于文本的分类和公司从绿色专利的回报 “ 2507.02287v1 |
Authors (3): Lapo Santarlasci, Armando Rungi, Antonio Zinilli
This paper introduces Natural Language Processing for identifying "true" green patents from official supporting documents. We start our training on about 12.4 million patents that had been classified as green from previous literature. Thus, we train a simple neural network to enlarge a baseline dictionary through vector representations of expressions related to environmental technologies. After testing, we find that "true" green patents represent about 20% of the total of patents classified as green from previous literature. We show heterogeneity by technological classes, and then check that "true" green patents are about 1% less cited by following inventions. In the second part of the paper, we test the relationship between patenting and a dashboard of firm-level financial accounts in the European Union. After controlling for reverse causality, we show that holding at least one "true" green patent raises sales, market shares, and productivity. If we restrict the analysis to high-novelty "true" green patents, we find that they also yield higher profits. Our findings underscore the importance of using text analyses to gauge finer-grained patent classifications that are useful for policymaking in different domains.
本文介绍“ 自然语言处理” , 以从官方辅助文件中识别“ 真正的” 绿色专利。 我们开始培训大约1 240万个从以前的文献中被列为绿色的专利。 因此, 我们训练一个简单的神经网络, 通过与环境技术有关的表达方式的矢量表示来扩大基线字典。 经过测试, 我们发现“ 真正的” 绿色专利代表了从以前的文献中被列为绿色的专利总量的大约20 % 。 我们按技术类别来显示异质性, 然后检查“ 真正的” 绿色专利在发明之后的引用量大约为1 % 。 在论文第二部分, 我们测试了欧盟公司级金融账户的专利和仪表板之间的关系。 在对反向因果关系进行控制后, 我们显示至少持有一个“ 真理” 绿色专利可以提高销售、 市场份额和生产率。 如果我们将分析限制在高新水平的“ 真理” 绿色专利, 我们发现它们也产生更高的利润。 我们的研究结果强调, 使用文本分析来测量精细的专利分类的重要性, 这对于不同领域的决策有用 。
Article 69
Title@2025-07-03 (4): Causal Representation Learning with Generative Artificial Intelligence: Application to Texts as Treatments
Title: Causal Representation Learning with Generative Artificial Intelligence: Application to Texts as Treatments | Kausales Repräsentationslernen mit generativer Künstlicher Intelligenz: Anwendung auf Texte als Behandlungen | 产生人工智能的因果代表性学习:应用文字作为治疗 2410.00903v3 |
Authors (2): Kosuke Imai, Kentaro Nakamura
In this paper, we demonstrate how to enhance the validity of causal inference with unstructured high-dimensional treatments like texts, by leveraging the power of generative Artificial Intelligence (GenAI). Specifically, we propose to use a deep generative model such as large language models (LLMs) to efficiently generate treatments and use their internal representation for subsequent causal effect estimation. We show that the knowledge of this true internal representation helps disentangle the treatment features of interest, such as specific sentiments and certain topics, from other possibly unknown confounding features. Unlike existing methods, the proposed GenAI-Powered Inference (GPI) methodology eliminates the need to learn causal representation from the data, and hence produces more accurate and efficient estimates. We formally establish the conditions required for the nonparametric identification of the average treatment effect, propose an estimation strategy that avoids the violation of the overlap assumption, and derive the asymptotic properties of the proposed estimator through the application of double machine learning. Finally, using an instrumental variables approach, we extend the proposed methodology to the settings in which the treatment feature is based on human perception. The proposed GPI methodology is also applicable to text reuse where an LLM is used to regenerate existing texts. We conduct simulation and empirical studies, using the generated text data from an open-source LLM, Llama 3, to illustrate the advantages of our estimator over state-of-the-art causal representation learning algorithms.
在本文件中,我们展示了如何通过利用基因化人工智能(GenAI)的力量,提高因果推断与文本等非结构化高层次处理方法(GenAI)的有效性。具体地说,我们提议使用一个深层次的基因模型,如大型语言模型(LLMS),以有效产生治疗方法,并使用内部代表方法来进行随后的因果关系估计。我们表明,了解这种真正的内部代表方法有助于将感兴趣的治疗特征,例如具体情感和某些议题,与其他可能未知的混杂特征脱钩。与现有方法不同,拟议的GenAI权力推算法(GIGPI)方法消除了从数据中学习因果表述的必要性,从而得出更准确、更高效的估计数。我们正式确定对平均治疗效果进行非参数识别的必要条件,提出避免违反重叠假设的估计战略,并通过应用双机学习来得出拟议估算师的无约束性特性。最后,我们将拟议的方法推广到基于人类认知的处理特征的环境,即从现有LIM法文本到使用现有理论再利用的理论分析法。
Article 70
Title@2025-07-03 (4): SMARTe: Slot-based Method for Accountable Relational Triple extraction
Title: SMARTe: Slot-based Method for Accountable Relational Triple extraction | SMARTe: Slot-basierte Methode für die relationale Triple-Extraktion | SMARTE: 衡算关系三重采掘的基于固态方法 2504.12816v3 |
Authors (2): Xue Wen Tan, Stanley Kok
Relational Triple Extraction (RTE) is a fundamental task in Natural Language Processing (NLP). However, prior research has primarily focused on optimizing model performance, with limited efforts to understand the internal mechanisms driving these models. Many existing methods rely on complex preprocessing to induce specific interactions, often resulting in opaque systems that may not fully align with their theoretical foundations. To address these limitations, we propose SMARTe: a Slot-based Method for Accountable Relational Triple extraction. SMARTe introduces intrinsic interpretability through a slot attention mechanism and frames the task as a set prediction problem. Slot attention consolidates relevant information into distinct slots, ensuring all predictions can be explicitly traced to learned slot representations and the tokens contributing to each predicted relational triple. While emphasizing interpretability, SMARTe achieves performance comparable to state-of-the-art models. Evaluations on the NYT and WebNLG datasets demonstrate that adding interpretability does not compromise performance. Furthermore, we conducted qualitative assessments to showcase the explanations provided by SMARTe, using attention heatmaps that map to their respective tokens. We conclude with a discussion of our findings and propose directions for future research. Our code is available at https://github.com/Chen-XueWen/SMARTe.
然而,先前的研究主要侧重于优化模型性能,只有有限的努力来理解驱动这些模型的内部机制。许多现有方法依靠复杂的预处理来促成具体的互动,往往导致不透明的系统,而这种系统可能与其理论基础不完全一致。为了解决这些局限性,我们提议SMARTe:一种基于细小的可核算性三方提取方法。SMARTe通过一个时间档关注机制引入内在的解释性,并将任务作为设定的预测问题来设置。SMARTe将相关信息整合到不同的位置,确保所有预测都可明确追溯到学习到学习到的时间档表示和每个预测关系三重的象征。在强调可解释性的同时,SMARTe实现了与最新模型相当的业绩。对NMYT和WebNLG数据集的评估表明,增加解释性不会损害业绩。此外,我们进行了定性评估,以展示SMARTe提供的解释性,利用地图上的热映射图,将相关信息明确追溯到各个位置,确保所有预测都可追溯到学习到每个预测关系三重的代号。我们在强调可解释性的同时,还得出了我们的调查结果和建议未来方向。
Article 71
Title@2025-07-03 (4): MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
Title: MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent | MemAgent: Umgestalten von Langkontext-LLM mit Multi-Conv RL-basierten Speicheragenten | MemerAgent: 与基于多Conv RL的内存代理重塑长文本LLM 2507.02259v1 |
Authors (11): Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, Hao Zhou
Despite improvements from length extrapolation, efficient attention, and memory modules, handling infinitely long documents with linear complexity and without performance degradation during extrapolation remains the ultimate challenge in long-text processing. We directly optimize for long-text tasks in an end-to-end fashion and introduce a novel agent workflow, MemAgent, which reads text in segments and updates the memory using an overwrite strategy. We extend the DAPO algorithm to facilitate training via independent-context multi-conversation generation. MemAgent demonstrates superb long-context capabilities: it extrapolates from an 8K context trained on 32K text to a 3.5M QA task with performance loss < 5%, and it achieves 95%+ on the 512K RULER test.
尽管在长度外推法、高效关注和记忆模块方面有所改进,但在外推过程中处理线性复杂且不出现性能退化的无限长文件仍然是长文本处理过程中的终极挑战。我们直接优化以端到端的方式执行长文本任务,并引入新的代理工作流程MemAgenti,该流程以部分方式阅读文字,并使用覆盖式战略更新记忆。我们扩展了DAPO算法,以通过独立通文多变量生成促进培训。MeAgency展示了超强长文本能力,能够从经过32K文本培训的8K环境推断为3.5M QA任务,其性能损失 < 5%,并在512K RULER测试中达到95。
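The segment-wise reading with an overwritten, fixed-size memory described in the MemAgent abstract above can be illustrated with a toy sketch. This is not the authors' implementation: `read_with_overwrite_memory`, `keep_capitalized`, and the parameters `segment_len` and `memory_len` are hypothetical names, and the update function is a trivial stand-in for the LLM call that would rewrite the memory.

```python
from typing import Callable, List

Tokens = List[str]


def split_into_segments(tokens: Tokens, segment_len: int) -> List[Tokens]:
    """Cut a long token list into consecutive fixed-length segments."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def read_with_overwrite_memory(
    tokens: Tokens,
    segment_len: int,
    memory_len: int,
    update_fn: Callable[[Tokens, Tokens], Tokens],
) -> Tokens:
    """Read the document segment by segment, replacing the memory each step.

    Each step sees at most memory_len + segment_len tokens, so the total
    work grows linearly with the document length.
    """
    memory: Tokens = []
    for segment in split_into_segments(tokens, segment_len):
        # The old memory is discarded and replaced by the new, size-capped one.
        memory = update_fn(memory, segment)[:memory_len]
    return memory


def keep_capitalized(memory: Tokens, segment: Tokens) -> Tokens:
    """Hypothetical stand-in for the model: remember capitalized tokens."""
    return memory + [tok for tok in segment if tok[:1].isupper()]


document = "the cat met Alice near the old bridge and Bob waved".split()
final_memory = read_with_overwrite_memory(
    document, segment_len=4, memory_len=3, update_fn=keep_capitalized
)
print(final_memory)
```

In a real system the update function would be a prompted model that decides what to keep, and the abstract indicates that this behavior is trained with reinforcement learning rather than hand-written.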
Article 72
Title@2025-07-03 (4): Circuit-tuning: A Mechanistic Approach for Identifying Parameter Redundancy and Fine-tuning Neural Networks
Title: Circuit-tuning: A Mechanistic Approach for Identifying Parameter Redundancy and Fine-tuning Neural Networks | Schaltungs-Tuning: Mechanistischer Ansatz zur Identifizierung von Parameter Redundanz und Feinsteuerung neuraler Netzwerke | 电路调控:确定参数冗余和精微调整神经网络的机械化方法 2502.06106v2 |
Authors (4): Yueyan Li, Wenhao Gao, Caixia Yuan, Xiaojie Wang
The study of mechanistic interpretability aims to reverse-engineer a model to explain its behaviors. While recent studies have focused on the static mechanism of a certain behavior, the learning dynamics inside a model remain to be explored. In this work, we develop an interpretable fine-tuning method for analyzing the mechanism behind learning. We first introduce the concept of node-level intrinsic dimensionality to describe the learning process of a model in a computational graph. Based on our theory, we propose circuit-tuning, a two-stage algorithm that iteratively builds the minimal subgraph for a specific task and updates the key parameters in a heuristic way. Experimental results confirm the existence of the intrinsic dimensionality at the node level and demonstrate the effectiveness of our method for transparent and interpretable fine-tuning. We visualize and analyze the circuits before, during, and after fine-tuning, providing new insights into the self-organization mechanism of a neural network in the learning process.
机械学解释性研究旨在逆向设计一种模型来解释其行为。 虽然最近的研究侧重于某种行为的静态机制, 模型内的学习动态仍有待探索。 在这项工作中, 我们开发了一种可解释的微调方法来分析学习后的机制。 我们首先引入了节点的内在维度概念来描述计算图中模型的学习过程。 我们根据我们的理论, 提出了电路调节, 这是一种两阶段的算法, 迭接地为特定任务构建了最起码的子集, 并以超常方式更新了关键参数。 实验结果证实节点一级存在内在的维度, 并展示了我们透明、 可解释的微调方法的有效性。 我们视觉化和分析了计算图之前、 期间 和 之后 的电路, 为学习过程中神经网络的自我组织机制提供了新的洞察力。
Article 73
Title@2025-07-03 (4): Mixture of Reasonings: Teach Large Language Models to Reason with Adaptive Strategies
Title: Mixture of Reasonings: Teach Large Language Models to Reason with Adaptive Strategies | Mixture of Reasonings: Große Sprachmodelle mit adaptiven Strategien zur Vernunft bringen | 理由混合:与适应战略一道教授大语言模式 2507.00606v2 |
Authors (4): Tao Xiong, Xavier Hu, Wenyan Fan, Shengyu Zhang
Large language models (LLMs) excel in complex tasks through advanced prompting techniques like Chain-of-Thought (CoT) and Tree-of-Thought (ToT), but their reliance on manually crafted, task-specific prompts limits adaptability and efficiency. We introduce Mixture of Reasoning (MoR), a training framework that embeds diverse reasoning strategies into LLMs for autonomous, task-adaptive reasoning without external prompt engineering. MoR has two phases: Thought Generation, creating reasoning chain templates with models like GPT-4o, and SFT Dataset Construction, pairing templates with benchmark datasets for supervised fine-tuning. Our experiments show that MoR significantly enhances performance, with MoR150 achieving 0.730 (2.2% improvement) using CoT prompting and 0.734 (13.5% improvement) compared to baselines. MoR eliminates the need for task-specific prompts, offering a generalizable solution for robust reasoning across diverse tasks.
大型语言模型(LLMS)通过先进的催化技术(如“CoT”和“TTO”等先进催化技术),在复杂的任务中出类拔萃,但依赖手工制作的、具体任务快速度限制了适应性和效率。我们引入了“解释混合”(MOR),这是一个培训框架,将各种推理战略纳入“LLMS”,用于自主、任务适应性推理,而没有外部迅速工程。MOR分两个阶段:“思想生成”,以GPT-4o和SFT数据集构建等模型创建推理链模板,将模板与基准数据集对齐,用于监管的微调。我们的实验表明,MOR大大提高了绩效,因为MR150利用“提示”实现了0.730(2.2%改进),与基线相比,实现了0.734(13.5%改进)。MOR消除了对具体任务提示的需求,为各种任务的强力推理提供了普遍的解决办法。
Article 74
Title@2025-07-03 (4): GDC Cohort Copilot: An AI Copilot for Curating Cohorts from the Genomic Data Commons
Title: GDC Cohort Copilot: An AI Copilot for Curating Cohorts from the Genomic Data Commons | GDC Cohort Copilot: Ein KI-Copilot für die Kuratierung von Kohorten aus den Genomic Data Commons | GDC Cohort Cohort 副驾驶:AI 基因组数据共同点的Curate Choorts联合驾驶员 2507.02221v1 |
Authors (5): Steven Song, Anirudh Subramanyam, Zhenyu Zhang, Aarti Venkat, Robert L. Grossman
Motivation: The Genomic Data Commons (GDC) provides access to high quality, harmonized cancer genomics data through a unified curation and analysis platform centered around patient cohorts. While GDC users can interactively create complex cohorts through the graphical Cohort Builder, users (especially new ones) may struggle to find specific cohort descriptors across hundreds of possible fields and properties. However, users may be better able to describe their desired cohort in free-text natural language. Results: We introduce GDC Cohort Copilot, an open-source copilot tool for curating cohorts from the GDC. GDC Cohort Copilot automatically generates the GDC cohort filter corresponding to a user-input natural language description of their desired cohort, before exporting the cohort back to the GDC for further analysis. An interactive user interface allows users to further refine the generated cohort. We develop and evaluate multiple large language models (LLMs) for GDC Cohort Copilot and demonstrate that our locally-served, open-source GDC Cohort LLM achieves better results than GPT-4o prompting in generating GDC cohorts. Availability and implementation: The standalone docker image for GDC Cohort Copilot is available at https://quay.io/repository/cdis/gdc-cohort-copilot. Source code is available at https://github.com/uc-cdis/gdc-cohort-copilot. GDC Cohort LLM weights are available at https://huggingface.co/uc-ctds.
动力: 基因组数据共享(GDC) 提供高质量的、统一的癌症基因组数据。 虽然GDC用户可以通过图形 Cohort 构建器互动创建复杂的组群, 用户( 特别是新用户)可能很难在数百种可能的字段和属性中找到特定的组群描述器。 然而, 用户可能更有能力以自由文本自然语言描述他们想要的组群。 结果 : 我们引入了 GDC Cohort Copilot, 一个用于治疗GDC的组群的开源共同试点工具。 GDC Cohort Coopil自动生成GDC组群过滤器, 与他们想要组群的用户输入自然语言描述相对。 用户( 特别是新用户) 可能很难在数百种可能的字段和属性中找到特定的组群描述器。 我们为 GDC Cohort Colt 演示并评估了多个大型语言模型( LOMS) 。 我们本地服务、 开源GDC Cohort LMM 将比 GPT-4- 催化GDC / collictoryLM 在GDC GC grows/ Calctors 提供GDGDGDGs/ Colupress 的GLOs/ GLS.
Article 75
Title@2025-07-03 (4): SciGA: A Comprehensive Dataset for Designing Graphical Abstracts in Academic Papers
Title: SciGA: A Comprehensive Dataset for Designing Graphical Abstracts in Academic Papers | SciGA: Ein umfassender Datensatz zur Gestaltung grafischer Abstracts in wissenschaftlichen Papieren | SciGA: 用于设计学术论文制图摘要的综合数据集 2507.02212v1 |
Authors (4): Takuro Kawada, Shunsuke Kitada, Sota Nemoto, Hitoshi Iyatomi
Graphical Abstracts (GAs) play a crucial role in visually conveying the key findings of scientific papers. While recent research has increasingly incorporated visual materials such as Figure 1 as de facto GAs, their potential to enhance scientific communication remains largely unexplored. Moreover, designing effective GAs requires advanced visualization skills, creating a barrier to their widespread adoption. To tackle these challenges, we introduce SciGA-145k, a large-scale dataset comprising approximately 145,000 scientific papers and 1.14 million figures, explicitly designed for supporting GA selection and recommendation as well as facilitating research in automated GA generation. As a preliminary step toward GA design support, we define two tasks: 1) Intra-GA recommendation, which identifies figures within a given paper that are well-suited to serve as GAs, and 2) Inter-GA recommendation, which retrieves GAs from other papers to inspire the creation of new GAs. We provide reasonable baseline models for these tasks. Furthermore, we propose Confidence Adjusted top-1 ground truth Ratio (CAR), a novel recommendation metric that offers a fine-grained analysis of model behavior. CAR addresses limitations in traditional ranking-based metrics by considering cases where multiple figures within a paper, beyond the explicitly labeled GA, may also serve as GAs. By unifying these tasks and metrics, our SciGA-145k establishes a foundation for advancing visual scientific communication while contributing to the development of AI for Science.
图像摘要(GAs)在直观地传达科学论文的关键结论方面发挥着关键作用。虽然最近的研究越来越多地将图1等视觉材料作为事实上的GAs纳入,但其加强科学通信的潜力基本上尚未探索。此外,设计有效的GA需要先进的视觉化技能,为广泛采用这些技能制造障碍。为了应对这些挑战,我们引入了由大约145 000份科学论文和114万数字组成的大型数据集SciGA-145k,这是一个大型数据集,明确旨在支持GA的挑选和建议,并促进自动的GA一代的研究。作为大会设计支持的初步步骤,我们界定了两项任务:(1) GA内部建议,其中确定了适合作为GA的某一文件中的数字;(2) GA之间的建议,其中从其他文件中检索GAs,以激励创建新的GA。我们为这些任务提供了合理的基线模型模型。此外,我们提出了信任调整后头一地面真相比值(CAR),这是对模型行为进行精确分析的新建议指标。CARC处理基于传统排名的衡量标准中的局限性,建议确定了适合作为GA的数值;145,同时考虑将多种数字作为基础,为GA的参考,并明确确定为GAA的标准化的图像基础。
Article 76
Title@2025-07-02 (3): SHuBERT: Self-Supervised Sign Language Representation Learning via Multi-Stream Cluster Prediction
Title: SHuBERT: Self-Supervised Sign Language Representation Learning via Multi-Stream Cluster Prediction | SHuBERT: Selbstüberwachte Sign Language Representation Lernen über Multi-Stream Cluster Prediction | 通过多系统集群预测进行自上自上手语代表制学习 2411.16765v3 |
Authors (5): Shester Gueuwou, Xiaodan Du, Greg Shakhnarovich, Karen Livescu, Alexander H. Liu
Sign language processing has traditionally relied on task-specific models, limiting the potential for transfer learning across tasks. Pre-training methods for sign language have typically focused on either supervised pre-training, which cannot take advantage of unlabeled data, or context-independent (frame or video segment) representations, which ignore the effects of relationships across time in sign language. We introduce SHuBERT (Sign Hidden-Unit BERT), a self-supervised contextual representation model learned from approximately 1,000 hours of American Sign Language video. SHuBERT adapts masked token prediction objectives to multi-stream visual sign language input, learning to predict multiple targets corresponding to clustered hand, face, and body pose streams. SHuBERT achieves state-of-the-art performance across multiple tasks including sign language translation, isolated sign language recognition, and fingerspelling detection.
手语的处理历来依赖特定任务模式,限制了跨任务转移学习的可能性,手语的预培训方法通常侧重于监督的预培训,无法利用未贴标签的数据,或者根据具体情况进行(框架或视频段)的介绍,忽视了手语不同时间关系的影响。我们引入了ShuBERT(签名隐藏单位BERT),这是从大约1,000小时的美国手语视频中学习的自我监督背景代表模式。 ShuBERT将隐含的象征性预测目标与多流视觉手语输入相适应,学习预测与手、脸和身体构成流相对应的多重目标。ShuBERT在包括手语翻译、孤立手语识别和手指拼写探测在内的多种任务中取得了最先进的表现。
Article 77
Title@2025-07-02 (3): ESTR-CoT: Towards Explainable and Accurate Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning
Title: ESTR-CoT: Towards Explainable and Accurate Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning | ESTR-CoT: Auf dem Weg zu einer erklärbaren und präzisen Ereignisstrom-basierten Szenetexterkennung mit Chain-of-Thought-Reasoning | ESTR-CoT: 争取实现可解释和准确事件流的基于现场的文本识别,并附有研究链理由 2507.02200v1 |
Authors (8): Xiao Wang, Jingtao Jiang, Qiang Chen, Lan Chen, Lin Zhu, Yaowei Wang, Yonghong Tian, Jin Tang
Event-stream-based scene text recognition is a newly emerging research topic that performs better than recognition with widely used RGB cameras in extremely challenging scenarios, especially low illumination and fast motion. Existing works adopt either an end-to-end encoder-decoder framework or large language models for enhanced recognition; however, they are still limited by insufficient interpretability and weak contextual logical reasoning. In this work, we propose a novel chain-of-thought-reasoning-based event stream scene text recognition framework, termed ESTR-CoT. Specifically, we first adopt the vision encoder EVA-CLIP (ViT-G/14) to transform the input event stream into tokens and utilize a Llama tokenizer to encode the given generation prompt. A Q-former is used to align the vision tokens to the pre-trained large language model Vicuna-7B, which outputs both the answer and the chain-of-thought (CoT) reasoning process simultaneously. Our framework can be optimized using supervised fine-tuning in an end-to-end manner. In addition, we propose a large-scale CoT dataset to train our framework via a three-stage process (i.e., generation, polishing, and expert verification). This dataset provides a solid data foundation for the development of subsequent reasoning-based large models. Extensive experiments on three event stream STR benchmark datasets (i.e., EventSTR, WordArt*, IC15*) fully validate the effectiveness and interpretability of our proposed framework. The source code and pre-trained models will be released on https://github.com/Event-AHU/ESTR-CoT.
基于事件流的场景文本识别是近年来新出现的研究课题,在极具挑战性的情景中,特别是在低光度、快速运动的情况下,比广泛使用的 RGB 相机表现更好。现有的作品要么采用端到端编码解码器-解码器框架,要么采用大语言模型来强化识别,然而,由于解释能力不足和背景逻辑推理能力薄弱的挑战,这些模型仍然有限。在这项工作中,我们提议了一个以思考链为基础的事件流文本识别框架,称为 ESTRA-COT。具体地说,我们首先采用愿景编码器 EVA-CLIP (VT-G/14) 将输入事件流转换为代号,并利用Llama代号仪来为特定代代代码。使用一个Qexerut来将愿景标记与预先培训的大语言模型Vicuna-7B相匹配,同时输出答案和思考链(CoT) 逻辑推理过程。我们的框架可以使用监督的从端到端到端的微调来优化。此外,我们还提议一个大型的COT数据设置,通过三个阶段来培训我们的框架的 Ereval 数据流分析基础的模型, 提供一个长期数据基础。
Article 78
Title@2025-07-02 (3): Latent Chain-of-Thought? Decoding the Depth-Recurrent Transformer
Title: Latent Chain-of-Thought? Decoding the Depth-Recurrent Transformer | Latent Chain-of-Thought? Dekodierung des Tiefen-Recurrent Transformers | 点解深度- Rent 变换器 2507.02199v1 |
Authors (5): Wenquan Lu, Yuechuan Yang, Kyle Lee, Yanshu Li, Enqi Liu
Chain-of-thought (CoT) reasoning has enabled transformer-based language models to excel at complex mathematics and multi-step planning. However, in standard decoder-only architectures, these reasoning steps are externalized in natural language, improving interpretability at the cost of efficiency. To capture reasoning that is not easily represented in words, many works have explored recurrent architectures that aim to internalize reasoning in latent space, potentially supporting latent CoT. In this paper, we investigate whether such reasoning structures emerge in Huginn-3.5B, a depth-recurrent Transformer that reuses layers at inference time without increasing parameter count. We examine the model’s internal behavior on arithmetic tasks using a suite of probing techniques including the Logit Lens and Coda Lens. Our findings reveal limited evidence of interpretable latent CoT by tracking rank trajectories of final and intermediate result tokens. Furthermore, we uncover significant probing inconsistencies across recurrent blocks, where the interpretability of hidden states depends heavily on both the layer index and the decoding method. Finally, we empirically show that increasing recurrence depth yields only marginal gains and falls well short of models that explicitly externalize reasoning steps. The code is available at https://github.com/wenquanlu/huginn-latent-cot.
思维链(CoT)推理使得基于变压器的语言模型在复杂的数学和多步规划中能够出类拔萃。然而,在标准的解码器专用结构中,这些推理步骤在自然语言中被外部化,提高了解释效率。为了捕捉不易用文字表达的推理,许多作品探索了旨在将潜在空间的推理内在化的经常性结构,可能支持潜在的COT。在本文件中,我们调查了这种推理结构是否出现在Huginn-3.5B中,这是一个深度周期性变异器,在推断时重复使用层,而不增加参数计数。我们利用包括Logit Lens 和 Coda Lens 在内的一系列试算技术来审查模型的计算内部行为。我们的调查结果显示,通过跟踪最终和中间结果符号的级轨迹,可解释的潜在 CoT 证据有限。 此外,我们发现,在经常区块中,隐藏状态的可解释性严重取决于层次指数和解码方法。最后,我们从经验上表明,不断重复的深度只产生边际收益,而远为外部推理。
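The rank-trajectory probe mentioned in the abstract above (tracking how highly a result token is ranked when intermediate hidden states are decoded with the Logit Lens) can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes the hidden states and the unembedding matrix are already available as NumPy arrays, it omits the final normalization layer that a real Logit Lens applies, and `token_rank` and `rank_trajectory` are hypothetical helper names.

```python
import numpy as np


def token_rank(logits: np.ndarray, token_id: int) -> int:
    """Rank of token_id among all vocabulary logits (0 = top prediction)."""
    return int(np.sum(logits > logits[token_id]))


def rank_trajectory(hidden_states: np.ndarray, unembed: np.ndarray, token_id: int) -> list:
    """Decode every intermediate state and record the rank of one token.

    hidden_states: (num_steps, d_model), one row per layer or recurrence step.
    unembed:       (d_model, vocab_size) output projection.
    """
    return [token_rank(h @ unembed, token_id) for h in hidden_states]


# Toy data: 3 recurrence steps, d_model = vocab_size = 4, identity unembedding.
toy_hidden = np.array([
    [0.1, 0.5, 0.2, 0.9],
    [0.1, 0.5, 0.6, 0.8],
    [0.0, 0.2, 1.0, 0.3],
])
toy_unembed = np.eye(4)
trajectory = rank_trajectory(toy_hidden, toy_unembed, token_id=2)
print(trajectory)  # rank of token 2 at each step
```

A rank that falls toward 0 for an intermediate result token before the final answer token takes over would be the kind of pattern one looks for as evidence of latent step-by-step computation; per the abstract, the authors found only limited evidence of it.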
Article 79
Title@2025-07-02 (3): Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis
Title: Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis | Analyse und Verbesserung der Speaker-Ähnlichkeitsbewertung für Sprachsynthese | 分析和改进议长对发言综述的相似性评估 2507.02176v1 |
Authors (6): Marc-André Carbonneau, Benjamin van Niekerk, Hugo Seuté, Jean-Philippe Letendre, Herman Kamper, Julian Zaïdi
Modeling voice identity is challenging due to its multifaceted nature. In generative speech systems, identity is often assessed using automatic speaker verification (ASV) embeddings, designed for discrimination rather than characterizing identity. This paper investigates which aspects of a voice are captured in such representations. We find that widely used ASV embeddings focus mainly on static features like timbre and pitch range, while neglecting dynamic elements such as rhythm. We also identify confounding factors that compromise speaker similarity measurements and suggest mitigation strategies. To address these gaps, we propose U3D, a metric that evaluates speakers’ dynamic rhythm patterns. This work contributes to the ongoing challenge of assessing speaker identity consistency in the context of ever-better voice cloning systems. We publicly release our code.
建模语音特征因其多面性而具有挑战性。在基因化语音系统中,对身份的评估往往使用自动语音校验(ASV)嵌入器来评估,这种校验是针对歧视的,而不是特征的。本文调查了在这种表述中反映声音的哪些方面。我们发现,广泛使用的ASV嵌入器主要侧重于静态特征,如小音和小音范围,同时忽略了节奏等动态要素。我们还找出了影响发言者类似测量的混杂因素,并建议了缓解战略。为了弥补这些差距,我们建议了U3D,这是评估发言者动态节奏模式的尺度。这项工作有助于应对当前挑战,即评估发言者身份的一致性,在越来越好的语音克隆系统中。我们公开发布我们的代码。
Article 80
Title@2025-07-02 (3): Beyond Scale: The Diversity Coefficient as a Data Quality Metric for Variability in Natural Language Data
Title: Beyond Scale: The Diversity Coefficient as a Data Quality Metric for Variability in Natural Language Data | Beyond Scale: Der Diversity-Koeffizient als Data Quality Metric für Variabilität in natürlichen Sprachdaten | 超越尺度:多样性系数作为衡量自然语言数据可变性的数据质量计量标准 2306.13840v4 |
Authors (7): Brando Miranda, Alycia Lee, Sudharsan Sundar, Allison Casasola, Rylan Schaeffer, Elyas Obbad, Sanmi Koyejo
Current trends in pre-training Large Language Models (LLMs) primarily focus on the scaling of model and dataset size. While the quality of pre-training data is considered an important factor for training powerful LLMs, it remains a nebulous concept that has not been rigorously characterized. To this end, we propose a formalization of one key aspect of data quality – measuring the variability of natural language data – specifically via a measure we call the diversity coefficient. Our empirical analysis shows that the proposed diversity coefficient aligns with the intuitive properties of diversity and variability, e.g., it increases as the number of latent concepts increases. Then, we measure the diversity coefficient of publicly available pre-training datasets and demonstrate that their formal diversity is high compared to theoretical lower and upper bounds. Finally, we conduct a comprehensive set of controlled interventional experiments with GPT-2 and LLaMAv2 that demonstrate the diversity coefficient of pre-training data characterizes useful aspects of downstream model evaluation performance – totaling 44 models of various sizes (51M to 7B parameters). We conclude that our formal notion of diversity is an important aspect of data quality that captures variability and causally leads to improved evaluation performance.
培训前大语言模型(LLMS)目前的趋势主要侧重于模型和数据集规模的扩大。培训前数据的质量被视为培训强大的LLMS的一个重要因素,但它仍然是一个没有严格定性的模糊概念。为此,我们提议正式确定数据质量的一个关键方面 – – 衡量自然语言数据的变异性 – – 特别是通过我们称之为多样性系数的措施。我们的经验分析表明,拟议的多样性系数与多样性和变异性等直观特性相一致,例如随着潜在概念数量的增加而增加。然后,我们衡量公开提供的训练前数据集的多样性系数,并表明其形式多样性高于理论下限和上限。最后,我们用GPT-2和LLMAv2进行一套全面的有控制的干预实验,以显示培训前数据的多样性系数,这是下游模式评估业绩的有用方面 – – 总计44个不同规模的模型(51M至7B参数)。我们的结论是,我们正式的多样性概念是数据质量的一个重要方面,可以捕捉变异性和因果性地导致评估的改进。
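The abstract above does not give a formula for the diversity coefficient. The sketch below assumes, as a labeled assumption only, that it can be approximated by the average pairwise cosine distance between embeddings of data batches. The embeddings here are made-up toy vectors, and `diversity_coefficient` is a hypothetical helper name, not the authors' API.

```python
from itertools import combinations

import numpy as np


def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    """1 - cosine similarity between two non-zero vectors."""
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def diversity_coefficient(batch_embeddings: np.ndarray) -> float:
    """Mean pairwise cosine distance over batch embeddings (assumed definition).

    batch_embeddings: (num_batches, dim), with num_batches >= 2.
    """
    pairs = combinations(range(len(batch_embeddings)), 2)
    distances = [
        cosine_distance(batch_embeddings[i], batch_embeddings[j]) for i, j in pairs
    ]
    return float(np.mean(distances))


# One underlying "concept": every batch points in the same direction.
single_concept = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
# Two unrelated "concepts": the batches are orthogonal.
two_concepts = np.array([[1.0, 0.0], [0.0, 1.0]])

print(diversity_coefficient(single_concept))
print(diversity_coefficient(two_concepts))
```

Under this assumed definition the toy data with more distinct directions scores higher, which matches the intuitive property stated in the abstract that the coefficient grows with the number of latent concepts.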
Article 81
Title@2025-07-02 (3): Rethinking LLM Training through Information Geometry and Quantum Metrics
Title: Rethinking LLM Training through Information Geometry and Quantum Metrics | Rethinking LLM Training durch Informationsgeometrie und Quantenmetrics | 通过信息几何和量度测量重新思考LLM培训 2506.15830v3 |
Authors (1): Riccardo Di Sipio
Optimization in large language models (LLMs) unfolds over high-dimensional parameter spaces with non-Euclidean structure. Information geometry frames this landscape using the Fisher information metric, enabling more principled learning via natural gradient descent. Though often impractical, this geometric lens clarifies phenomena such as sharp minima, generalization, and observed scaling laws. We argue that curvature-aware approaches deepen our understanding of LLM training. Finally, we speculate on quantum analogies based on the Fubini-Study metric and Quantum Fisher Information, hinting at efficient optimization in quantum-enhanced systems.
大型语言模型(LLMs)的最佳化(LLMs)在具有非欧洲语言结构的高维参数空间上展开。信息几何用Fisher信息度量来测量这一景观,从而能够通过自然梯度下降进行更有原则的学习。虽然这种几何透镜通常不切实际,但澄清了尖锐迷你、一般化和观察到的测量法等现象。我们争辩说,曲线认知法的方法加深了我们对LLM培训的理解。 最后,我们根据Fubini-Study 度量子和Quantum Fisher信息对量子类比进行推测,暗示了量子强化系统中的高效优化。
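Natural gradient descent, as referenced in the abstract above, preconditions the ordinary gradient with the inverse Fisher information matrix: theta <- theta - lr * F^{-1} * grad. The sketch below is a generic textbook illustration, not code from the paper. It uses a Gaussian with known covariance, where the Fisher matrix for the mean is the precision matrix; `natural_gradient_step` and the `damping` parameter are illustrative names.

```python
import numpy as np


def natural_gradient_step(theta, grad, fisher, lr=1.0, damping=0.0):
    """One update theta - lr * (F + damping * I)^{-1} grad."""
    precond = fisher + damping * np.eye(len(theta))
    # Solve the linear system instead of forming the explicit inverse.
    return theta - lr * np.linalg.solve(precond, grad)


# Gaussian with known, badly conditioned covariance; we fit only the mean.
cov = np.array([[4.0, 0.0], [0.0, 0.25]])
precision = np.linalg.inv(cov)
sample_mean = np.array([1.0, -2.0])

mu = np.zeros(2)
# Gradient of the average negative log-likelihood with respect to the mean.
grad = precision @ (mu - sample_mean)
# For this model the Fisher information of the mean is the precision matrix.
fisher = precision

mu_natural = natural_gradient_step(mu, grad, fisher)
mu_plain = mu - 1.0 * grad  # ordinary gradient step with the same learning rate

print(mu_natural, mu_plain)
```

In this toy case a single natural-gradient step lands on the sample mean regardless of the covariance scaling, while the plain gradient step is distorted by the curvature. For LLM-scale models the Fisher matrix is far too large to form or solve exactly, which is the impracticality the abstract acknowledges.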
Article 82
Title@2025-07-02 (3): Quantifying the Importance of Data Alignment in Downstream Model Performance
Title: Quantifying the Importance of Data Alignment in Downstream Model Performance | Quantifizierung der Bedeutung der Datenausrichtung in Downstream-Modellleistung | 量化数据协调在下游模式绩效中的重要性 2501.08496v3 |
Authors (7): Krrish Chawla, Aryan Sahai, Mario DePavia, Sudharsan Sundar, Brando Miranda, Elyas Obbad, Sanmi Koyejo
Contrary to the conventional emphasis on dataset size, we explore the role of data alignment – an often overlooked aspect of data quality – in training capable Large Language Models (LLMs). To do so, we use the Task2Vec-based alignment coefficient, a quantitative measure of the similarity between two datasets, to quantify the impact of alignment between training data and evaluation data on downstream performance. In particular, we conduct controlled interventional experiments for two settings: 1. the impact of increased alignment coefficients between various pre-training (pt) datasets and evaluation datasets, and 2. the impact of increased alignment coefficients between domain-specific fine-tuning (ft) datasets and domain-specific evaluation datasets. The domain-specific task we explore is Autoformalization – the machine translation task between natural language and code for formal verification. In both settings, we find a strong, predictable negative correlation between the alignment coefficient of a model’s training and evaluation data and the model’s loss/perplexity on the respective downstream task. These findings suggest a re-evaluation of LLM training approaches, demonstrating the relevance of data alignment compared to data quantity, especially in specialized downstream tasks such as Autoformalization.
与传统强调数据集大小相反,我们探索了数据调整的作用 – – 这是数据质量中经常被忽视的一个方面 – – 培训有能力的大型语言模型(LLM)的作用。为此,我们使用基于TH2Vec的校准系数,这是衡量两个数据集之间相似性的量化尺度,以量化培训数据与下游业绩评价数据之间一致的影响。特别是,我们为两个环境进行了控制下对以下两个环境的测试:1. 提高培训前(试用)对评价数据集的校准系数的影响,2. 提高具体领域微调(软)对具体评价的校准系数的影响。我们探讨的领域具体任务是自动正规化 – – 自然语言与正式核查代码之间的机器翻译任务。在这两种情况下,我们发现模型的培训和评价数据的校准系数与模型对各下游任务的损失/难度之间存在强烈、可预测的负相关关系。这些结论表明,对LLM培训方法的重新评价,表明数据与数据数量的相关性,特别是在诸如自动正规化等专门的下游任务中。
Article 83
Title@2025-07-02 (3): Reasoning or Not? A Comprehensive Evaluation of Reasoning LLMs for Dialogue Summarization
Title: Reasoning or Not? A Comprehensive Evaluation of Reasoning LLMs for Dialogue Summarization | Eine umfassende Bewertung von LLMs für den Dialog Zusammenfassung | 全面评价对话总结说明说明说明理由的理由 2507.02145v1 |
Authors (7): Keyan Jin, Yapeng Wang, Leonel Santos, Tao Fang, Xu Yang, Sio Kei Im, Hugo Gonçalo Oliveira
Dialogue summarization is a challenging task with significant practical value in customer service, meeting analysis, and conversational AI. Although large language models (LLMs) have achieved substantial progress in summarization tasks, the performance of step-by-step reasoning architectures (specifically Long Chain-of-Thought (CoT) implementations such as OpenAI-o1 and DeepSeek-R1) remains unexplored for dialogue scenarios requiring concurrent abstraction and conciseness. In this work, we present the first comprehensive and systematic evaluation of state-of-the-art reasoning LLMs and non-reasoning LLMs across three major paradigms: generic, role-oriented, and query-oriented dialogue summarization. Our study spans diverse languages, domains, and summary lengths, leveraging strong benchmarks (SAMSum, DialogSum, CSDS, and QMSum) and advanced evaluation protocols that include both LLM-based automatic metrics and human-inspired criteria. Contrary to trends in other reasoning-intensive tasks, our findings show that explicit stepwise reasoning does not consistently improve dialogue summarization quality. Instead, reasoning LLMs are often prone to verbosity, factual inconsistencies, and less concise summaries compared to their non-reasoning counterparts. Through scenario-specific analyses and detailed case studies, we further identify when and why explicit reasoning may fail to benefit, or even hinder, summarization in complex dialogue contexts. Our work provides new insights into the limitations of current reasoning LLMs and highlights the need for targeted modeling and evaluation strategies for real-world dialogue summarization.
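对话摘要是一项具有挑战性的任务,在客户服务、会议分析和对话式人工智能中具有重要的实用价值。尽管大型语言模型(LLMs)在摘要任务上取得了实质性进展,但逐步推理架构(特别是OpenAI-o1和DeepSeek-R1等长思维链(CoT)实现)在需要同时兼顾抽象与简洁的对话场景中的表现仍未得到探索。在这项工作中,我们首次对最先进的推理型与非推理型LLMs在通用、面向角色和面向查询三大对话摘要范式上进行了全面而系统的评估。我们的研究涵盖多种语言、领域和摘要长度,利用强有力的基准(SAMSum、DialogSum、CSDS和QMSum)和先进的评价协议,其中包括基于LLM的自动指标和受人类启发的标准。与其他推理密集型任务的趋势相反,我们的结果表明,显式的逐步推理并不能持续提高对话摘要的质量。相反,与非推理模型相比,推理型LLMs往往更冗长、更容易出现事实不一致,摘要也不够简洁。通过针对具体场景的分析和详细的案例研究,我们进一步确定了显式推理在复杂对话情境中何时以及为何无法帮助甚至妨碍摘要。我们的工作为当前推理型LLMs的局限性提供了新的见解,并强调了针对真实世界对话摘要制定有针对性的建模和评价策略的必要性。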
Article 84
Title@2025-07-02 (3): Dissecting the Impact of Mobile DVFS Governors on LLM Inference Performance and Energy Efficiency
Title: Dissecting the Impact of Mobile DVFS Governors on LLM Inference Performance and Energy Efficiency | Analyse der Auswirkungen mobiler DVFS-Governors auf LLM-Inferenzleistung und Energieeffizienz | 剖析移动端DVFS调速器对LLM推理性能和能效的影响 2507.02135v1 |
Authors (6): Zongpu Zhang, Pranab Dash, Y. Charlie Hu, Qiang Xu, Jian Li, Haibing Guan
Large Language Models (LLMs) are increasingly being integrated into various applications and services running on billions of mobile devices. However, deploying LLMs on resource-limited mobile devices faces a significant challenge due to their high demand for computation, memory, and ultimately energy. While current LLM frameworks for mobile use three power-hungry components (CPU, GPU, and memory) even when running primarily-GPU LLM models, the optimized DVFS governors for CPU, GPU, and memory featured in modern mobile devices operate independently and are oblivious of each other. Motivated by this observation, we first measure the energy efficiency of a SOTA LLM framework with various LLM models on mobile phones; the measurements show that the triplet of mobile governors results in up to 40.4% longer prefilling and decoding latency compared to optimal combinations of CPU, GPU, and memory frequencies with the same energy consumption for sampled prefill and decode lengths. Second, we conduct an in-depth measurement study to uncover how the intricate interplay (or lack thereof) among the mobile governors causes the above inefficiency in LLM inference. Finally, based on these insights, we design FUSE, a unified energy-aware governor for optimizing the energy efficiency of LLM inference on mobile devices. Our evaluation using a ShareGPT dataset shows that FUSE reduces the time-to-first-token and time-per-output-token latencies by 7.0%-16.9% and 25.4%-36.8% on average, respectively, with the same energy-per-token for various mobile LLM models.
大型语言模型(LLMs)正日益被集成到运行在数十亿移动设备上的各种应用和服务中。然而,由于对计算、内存以及最终对能源的高需求,在资源有限的移动设备上部署LLMs面临重大挑战。虽然当前的移动端LLM框架即使在运行以GPU为主的LLM模型时也会使用CPU、GPU和内存这三个高耗电组件,但现代移动设备中针对CPU、GPU和内存优化的DVFS调速器各自独立运行,彼此互不感知。基于上述观察,我们首先测量了一个SOTA LLM框架在手机上运行多种LLM模型时的能效,结果表明,在采样的预填充和解码长度下,与能耗相同的CPU、GPU和内存频率最优组合相比,三个移动调速器的组合会使预填充和解码延迟最多增加40.4%。其次,我们进行了深入的测量研究,以揭示移动调速器之间复杂的相互作用(或缺乏相互作用)如何导致LLM推理中的上述低效。最后,基于这些见解,我们设计了FUSE,一个统一的能量感知调速器,用于优化移动设备上LLM推理的能效。我们使用ShareGPT数据集的评估表明,在每词元能耗相同的情况下,FUSE使多种移动端LLM模型的首词元时间和每输出词元时间延迟平均分别降低7.0%-16.9%和25.4%-36.8%。
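The comparison against "optimal combinations of CPU, GPU, and memory frequencies with the same energy consumption" can be pictured as a constrained search over frequency triples. The sketch below is only an illustration under stated assumptions: `toy_measure`, its cost model, and the frequency values are hypothetical stand-ins for on-device measurements, and none of this is the FUSE governor itself.

```python
from itertools import product

def toy_measure(cfg):
    """Hypothetical cost model (an assumption, not measured data):
    returns (latency, energy) for a (cpu, gpu, mem) frequency triple."""
    cpu, gpu, mem = cfg
    latency = 1.0 / cpu + 4.0 / gpu + 2.0 / mem  # higher frequency, lower latency
    energy = cpu + gpu + mem                      # higher frequency, more energy
    return latency, energy

def best_config(cpu_freqs, gpu_freqs, mem_freqs, measure, energy_budget):
    """Exhaustively pick the lowest-latency triple whose energy fits the budget."""
    best = None
    for cfg in product(cpu_freqs, gpu_freqs, mem_freqs):
        latency, energy = measure(cfg)
        if energy <= energy_budget and (best is None or latency < best[1]):
            best = (cfg, latency, energy)
    return best  # None if no configuration fits the budget

if __name__ == "__main__":
    print(best_config([1, 2], [1, 2], [1, 2], toy_measure, energy_budget=5))
```

An exhaustive search is only practical for small frequency grids; it is used here because it makes the "same energy, lower latency" comparison explicit.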
Article 85
Title@2025-07-02 (3): De-mark: Watermark Removal in Large Language Models
Title: De-mark: Watermark Removal in Large Language Models | Markierung: Wasserzeichenentfernung in großen Sprachmodellen | 标记:大语言模型中去除水印 2410.13808v2 |
Authors (4): Ruibo Chen, Yihan Wu, Junfeng Guo, Heng Huang
Watermarking techniques offer a promising way to identify machine-generated content via embedding covert information into the contents generated from language models (LMs). However, the robustness of the watermarking schemes has not been well explored. In this paper, we present De-mark, an advanced framework designed to remove n-gram-based watermarks effectively. Our method utilizes a novel querying strategy, termed random selection probing, which aids in assessing the strength of the watermark and identifying the red-green list within the n-gram watermark. Experiments on popular LMs, such as Llama3 and ChatGPT, demonstrate the efficiency and effectiveness of De-mark in watermark removal and exploitation tasks.
水印技术通过将隐蔽信息嵌入语言模型(LMs)生成的内容中,为识别机器生成的内容提供了一种有前景的方法。然而,水印方案的稳健性尚未得到充分探讨。在本文中,我们提出了De-mark,一个旨在有效去除基于n-gram的水印的先进框架。我们的方法采用了一种称为随机选择探测的新型查询策略,它有助于评估水印的强度并识别n-gram水印中的红绿列表。在Llama3和ChatGPT等流行语言模型上的实验表明了De-mark在水印去除和利用任务中的效率和有效性。
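For readers unfamiliar with the "red-green list" the abstract refers to, the following is a minimal sketch of a generic n-gram (here bigram) red-green watermark detector. It is not De-mark and not any specific paper's scheme: the SHA-256 based split, the `gamma` fraction, and the function names are all assumptions chosen for illustration.

```python
import hashlib
import math

def is_green(prev_token: str, token: str, gamma: float = 0.5) -> bool:
    """Pseudo-randomly place `token` on the green list, keyed on the previous token.
    The hash-based split is an illustrative assumption."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode("utf-8")).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # uniform-looking value in [0, 1)
    return u < gamma

def green_z_score(tokens, gamma: float = 0.5) -> float:
    """z-score of the green-token count against the no-watermark expectation."""
    n = len(tokens) - 1  # number of (previous, current) pairs
    if n <= 0:
        raise ValueError("need at least two tokens")
    green = sum(is_green(p, t, gamma) for p, t in zip(tokens, tokens[1:]))
    return (green - gamma * n) / math.sqrt(n * gamma * (1.0 - gamma))
```

A watermarking sampler that favors green tokens pushes this z-score up; a removal attack of the kind the abstract describes would aim to bring it back toward zero.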
Article 86
Title@2025-07-02 (3): Energy-Based Transformers are Scalable Learners and Thinkers
Title: Energy-Based Transformers are Scalable Learners and Thinkers | Energiebasierte Transformer sind skalierbare Lernende und Denker | 以能源为基础的变换器是可缩放的学习者和思想家 2507.02092v1 |
Authors (10): Alexi Gladstone, Ganesh Nanduru, Md Mofijul Islam, Peixuan Han, Hyeonjeong Ha, Aman Chadha, Yilun Du, Heng Ji, Jundong Li, Tariq Iqbal
Inference-time computation techniques, analogous to human System 2 Thinking, have recently become popular for improving model performances. However, most existing approaches suffer from several limitations: they are modality-specific (e.g., working only in text), problem-specific (e.g., verifiable domains like math and coding), or require additional supervision/training on top of unsupervised pretraining (e.g., verifiers or verifiable rewards). In this paper, we ask the question “Is it possible to generalize these System 2 Thinking approaches, and develop models that learn to think solely from unsupervised learning?” Interestingly, we find the answer is yes, by learning to explicitly verify the compatibility between inputs and candidate-predictions, and then re-framing prediction problems as optimization with respect to this verifier. Specifically, we train Energy-Based Transformers (EBTs) – a new class of Energy-Based Models (EBMs) – to assign an energy value to every input and candidate-prediction pair, enabling predictions through gradient descent-based energy minimization until convergence. Across both discrete (text) and continuous (visual) modalities, we find EBTs scale faster than the dominant Transformer++ approach during training, achieving an up to 35% higher scaling rate with respect to data, batch size, parameters, FLOPs, and depth. During inference, EBTs improve performance with System 2 Thinking by 29% more than the Transformer++ on language tasks, and EBTs outperform Diffusion Transformers on image denoising while using fewer forward passes. Further, we find that EBTs achieve better results than existing models on most downstream tasks given the same or worse pretraining performance, suggesting that EBTs generalize better than existing approaches. Consequently, EBTs are a promising new paradigm for scaling both the learning and thinking capabilities of models.
类似于人类系统2思维的推理时计算技术最近因能提升模型性能而流行起来。然而,大多数现有方法存在若干局限:它们针对特定模态(例如仅适用于文本)、针对特定问题(例如数学和编程等可验证领域),或者需要在无监督预训练之上进行额外的监督或训练(例如验证器或可验证奖励)。在本文中,我们提出这样一个问题:“能否推广这些系统2思维方法,并开发仅通过无监督学习就学会思考的模型?”有趣的是,我们发现答案是肯定的:方法是学习显式验证输入与候选预测之间的兼容性,然后将预测问题重新表述为针对该验证器的优化问题。具体而言,我们训练基于能量的Transformer(EBTs),这是一类新的基于能量的模型(EBMs),为每个输入与候选预测对分配一个能量值,并通过基于梯度下降的能量最小化直至收敛来进行预测。在离散(文本)和连续(视觉)两种模态上,我们发现EBTs在训练期间的扩展速度快于占主导地位的Transformer++方法,在数据、批量大小、参数、FLOPs和深度方面的扩展率最多高出35%。在推理期间,EBTs在语言任务上通过系统2思维带来的性能提升比Transformer++多29%,并且EBTs在图像去噪上以更少的前向传播次数优于Diffusion Transformers。此外,我们发现在预训练性能相同或更差的情况下,EBTs在大多数下游任务上取得了比现有模型更好的结果,这表明EBTs的泛化能力优于现有方法。因此,EBTs是一种有前景的新范式,可同时扩展模型的学习和思考能力。
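The idea of "predictions through gradient descent-based energy minimization until convergence" can be shown on a toy problem. This sketch assumes a hand-written quadratic energy in place of a trained Transformer; `energy`, `energy_grad_y`, and `predict` are hypothetical names, and the step size and tolerance are arbitrary.

```python
def energy(x: float, y: float) -> float:
    """Toy energy: low when candidate y is compatible with input x (here, y close to 2x)."""
    return (y - 2.0 * x) ** 2

def energy_grad_y(x: float, y: float) -> float:
    """Analytic gradient of the toy energy with respect to the candidate y."""
    return 2.0 * (y - 2.0 * x)

def predict(x: float, y0: float = 0.0, step: float = 0.1,
            tol: float = 1e-8, max_steps: int = 1000):
    """Refine the candidate y by gradient descent on the energy until the gradient is tiny."""
    y, steps = y0, 0
    while steps < max_steps and abs(energy_grad_y(x, y)) >= tol:
        y -= step * energy_grad_y(x, y)
        steps += 1
    return y, steps

if __name__ == "__main__":
    y, steps = predict(3.0)
    print(f"prediction={y:.6f} after {steps} refinement steps")
```

The number of refinement steps plays the role of the "thinking" budget: harder inputs (or tighter tolerances) simply take more descent steps.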
Article 87
Title@2025-07-02 (3): McBE: A Multi-task Chinese Bias Evaluation Benchmark for Large Language Models
Title: McBE: A Multi-task Chinese Bias Evaluation Benchmark for Large Language Models | McBE: Ein Multi-Task Chinese Bias Evaluation Benchmark für große Sprachmodelle | McBE:面向大型语言模型的多任务中文偏见评估基准 2507.02088v1 |
Authors (7): Tian Lan, Xiangdong Su, Xu Liu, Ruirui Wang, Ke Chang, Jiang Li, Guanglai Gao
As large language models (LLMs) are increasingly applied to various NLP tasks, their inherent biases are gradually disclosed. Therefore, measuring biases in LLMs is crucial to mitigating their ethical risks. However, most existing bias evaluation datasets focus on English and North American culture, and their bias categories are not fully applicable to other cultures. Datasets grounded in the Chinese language and culture are scarce. More importantly, these datasets usually support only single evaluation tasks and cannot evaluate bias in LLMs from multiple aspects. To address these issues, we present a Multi-task Chinese Bias Evaluation Benchmark (McBE) that includes 4,077 bias evaluation instances, covering 12 single bias categories and 82 subcategories and introducing 5 evaluation tasks, providing extensive category coverage, content diversity, and measuring comprehensiveness. Additionally, we evaluate several popular LLMs from different series and with different parameter sizes. In general, all these LLMs demonstrated varying degrees of bias. We conduct an in-depth analysis of results, offering novel insights into bias in LLMs.
随着大型语言模型(LLMs)越来越多地应用于各种NLP任务,它们固有的偏见逐渐显露出来。因此,衡量LLMs中的偏见对于降低其伦理风险至关重要。然而,大多数现有的偏见评估数据集侧重于英语和北美文化,其偏见类别并不完全适用于其他文化。基于中文语言和文化的数据集十分稀缺。更重要的是,这些数据集通常只支持单一评估任务,无法从多个方面评估LLMs中的偏见。为了解决这些问题,我们提出了一个多任务中文偏见评估基准(McBE),其中包括4,077个偏见评估实例,涵盖12个单一偏见类别和82个子类别,并引入5个评估任务,提供了广泛的类别覆盖、内容多样性和测量全面性。此外,我们评估了来自不同系列、具有不同参数规模的若干流行LLMs。总体而言,所有这些LLMs都表现出不同程度的偏见。我们对结果进行了深入分析,为LLMs中的偏见提供了新的见解。
Article 88
Title@2025-07-02 (3): Evaluating the Promise and Pitfalls of LLMs in Hiring Decisions
Title: Evaluating the Promise and Pitfalls of LLMs in Hiring Decisions | Bewertung der Versprechen und Fallstricke von LLMs bei Einstellungsentscheidungen | 评估LLMs在招聘决策中的前景与陷阱 2507.02087v1 |
Authors (4): Eitan Anzenberg, Arunava Samajpati, Sivasankaran Chandrasekar, Varun Kacholia
The use of large language models (LLMs) in hiring promises to streamline candidate screening, but it also raises serious concerns regarding accuracy and algorithmic bias where sufficient safeguards are not in place. In this work, we benchmark several state-of-the-art foundational LLMs, including models from OpenAI, Anthropic, Google, Meta, and DeepSeek, and compare them with our proprietary domain-specific hiring model (Match Score) for job candidate matching. We evaluate each model’s predictive accuracy (ROC AUC, Precision-Recall AUC, F1-score) and fairness (impact ratio of cut-off analysis across declared gender, race, and intersectional subgroups). Our experiments on a dataset of roughly 10,000 real-world recent candidate-job pairs show that Match Score outperforms the general-purpose LLMs on accuracy (ROC AUC 0.85 vs 0.77) and achieves significantly more equitable outcomes across demographic groups. Notably, Match Score attains a minimum race-wise impact ratio of 0.957 (near-parity), versus 0.809 or lower for the best LLMs (0.906 vs 0.773, respectively, for the intersectional subgroups). We discuss why pretraining biases may cause LLMs with insufficient safeguards to propagate societal biases in hiring scenarios, whereas a bespoke supervised model can more effectively mitigate these biases. Our findings highlight the importance of domain-specific modeling and bias auditing when deploying AI in high-stakes domains such as hiring, and caution against relying on off-the-shelf LLMs for such tasks without extensive fairness safeguards. Furthermore, we show with empirical evidence that there shouldn’t be a dichotomy between choosing accuracy and fairness in hiring: a well-designed algorithm can achieve both accuracy in hiring and fairness in outcomes.
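在招聘中使用大型语言模型(LLMs)有望简化候选人筛选,但在缺乏足够保障措施的情况下,也引发了对准确性和算法偏见的严重担忧。在这项工作中,我们对若干最先进的基础LLMs(包括来自OpenAI、Anthropic、Google、Meta和DeepSeek的模型)进行了基准测试,并将它们与我们专有的领域特定招聘模型(Match Score)在职位候选人匹配上进行比较。我们评估了每个模型的预测准确性(ROC AUC、Precision-Recall AUC、F1-score)和公平性(在申报的性别、种族和交叉子群体上进行截断分析得到的影响比率)。我们在约10,000个真实世界的近期候选人与职位配对数据集上的实验表明,Match Score在准确性上优于通用LLMs(ROC AUC 0.85对0.77),并在各人口群体之间取得了明显更公平的结果。值得注意的是,Match Score的最低种族影响比率达到0.957(接近均等),而最好的LLMs为0.809或更低(交叉子群体分别为0.906对0.773)。我们讨论了为什么预训练偏见可能导致缺乏足够保障措施的LLMs在招聘场景中传播社会偏见,而定制的监督模型可以更有效地缓解这些偏见。我们的发现强调了在招聘等高风险领域部署AI时进行领域特定建模和偏见审计的重要性,并提醒不要在没有充分公平性保障的情况下依赖现成的LLMs来完成此类任务。此外,我们以实证证据表明,在招聘中不应存在准确性与公平性之间的二选一:设计良好的算法可以同时实现招聘的准确性和结果的公平性。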
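The fairness metric quoted above, the impact ratio at a score cut-off, is commonly computed as each group's selection rate divided by the highest group's selection rate. The sketch below assumes that common definition; the abstract does not spell out its exact formula, and the function name, record layout, and sample scores are hypothetical.

```python
from collections import defaultdict

def impact_ratios(records, cutoff):
    """records: iterable of (group, score) pairs.
    Returns {group: selection_rate / highest_selection_rate} at the given cut-off."""
    selected = defaultdict(int)
    total = defaultdict(int)
    for group, score in records:
        total[group] += 1
        if score >= cutoff:
            selected[group] += 1
    rates = {g: selected[g] / total[g] for g in total}
    best = max(rates.values())
    if best == 0:
        raise ValueError("no candidate passes the cut-off")
    return {g: rate / best for g, rate in rates.items()}

if __name__ == "__main__":
    # Made-up scores for two hypothetical groups, for illustration only.
    sample = [("A", 0.9), ("A", 0.8), ("A", 0.2), ("A", 0.7),
              ("B", 0.9), ("B", 0.1), ("B", 0.3), ("B", 0.6)]
    ratios = impact_ratios(sample, cutoff=0.5)
    print(ratios, "minimum:", min(ratios.values()))
```

The minimum over groups is the single number usually reported; a value of 1.0 means every group is selected at the same rate.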
Article 89
Title@2025-07-02 (3): Sequential Diagnosis with Language Models
Title: Sequential Diagnosis with Language Models | Sequentielle Diagnose mit Sprachmodellen | 语言模型的序列分析 2506.22405v2 |
Authors (15): Harsha Nori, Mayank Daswani, Christopher Kelly, Scott Lundberg, Marco Tulio Ribeiro, Marc Wilson, Xiaoxuan Liu, Viknesh Sounderajah, Jonathan Carlson, Matthew P Lungren, Bay Gross, Peter Hames, Mustafa Suleyman, Dominic King, Eric Horvitz
Artificial intelligence holds great promise for expanding access to expert medical knowledge and reasoning. However, most evaluations of language models rely on static vignettes and multiple-choice questions that fail to reflect the complexity and nuance of evidence-based medicine in real-world settings. In clinical practice, physicians iteratively formulate and revise diagnostic hypotheses, adapting each subsequent question and test to what they’ve just learned, and weigh the evolving evidence before committing to a final diagnosis. To emulate this iterative process, we introduce the Sequential Diagnosis Benchmark, which transforms 304 diagnostically challenging New England Journal of Medicine clinicopathological conference (NEJM-CPC) cases into stepwise diagnostic encounters. A physician or AI begins with a short case abstract and must iteratively request additional details from a gatekeeper model that reveals findings only when explicitly queried. Performance is assessed not just by diagnostic accuracy but also by the cost of physician visits and tests performed. We also present the MAI Diagnostic Orchestrator (MAI-DxO), a model-agnostic orchestrator that simulates a panel of physicians, proposes likely differential diagnoses and strategically selects high-value, cost-effective tests. When paired with OpenAI’s o3 model, MAI-DxO achieves 80% diagnostic accuracy–four times higher than the 20% average of generalist physicians. MAI-DxO also reduces diagnostic costs by 20% compared to physicians, and 70% compared to off-the-shelf o3. When configured for maximum accuracy, MAI-DxO achieves 85.5% accuracy. These performance gains with MAI-DxO generalize across models from the OpenAI, Gemini, Claude, Grok, DeepSeek, and Llama families. We highlight how AI systems, when guided to think iteratively and act judiciously, can advance diagnostic precision and cost-effectiveness in clinical care.
人工智能在扩大专家医学知识和推理的可及性方面前景广阔。然而,对语言模型的大多数评估依赖于静态病例摘要和多项选择题,无法反映真实世界环境中循证医学的复杂性和细微差别。在临床实践中,医生会迭代地提出和修正诊断假设,根据刚刚了解到的情况调整后续的每个问题和检查,并在做出最终诊断之前权衡不断变化的证据。为了模拟这一迭代过程,我们引入了序贯诊断基准(Sequential Diagnosis Benchmark),它将304个具有诊断挑战性的《新英格兰医学杂志》临床病理讨论会(NEJM-CPC)病例转化为逐步进行的诊断过程。医生或AI从一段简短的病例摘要开始,必须迭代地向一个守门人模型请求更多细节,该模型只在被明确询问时才披露结果。性能评估不仅依据诊断准确性,还依据就诊和所做检查的成本。我们还提出了MAI Diagnostic Orchestrator(MAI-DxO),这是一个与模型无关的编排器,它模拟一个医生小组,提出可能的鉴别诊断,并有策略地选择高价值、具成本效益的检查。与OpenAI的o3模型配合使用时,MAI-DxO达到80%的诊断准确率,是全科医生20%平均水平的四倍。与医生相比,MAI-DxO还将诊断成本降低了20%,与现成的o3相比降低了70%。在配置为追求最高准确率时,MAI-DxO达到85.5%的准确率。MAI-DxO带来的这些性能提升可推广到OpenAI、Gemini、Claude、Grok、DeepSeek和Llama系列的模型。我们强调,当AI系统被引导进行迭代思考并审慎行动时,可以提高临床医疗中的诊断精度和成本效益。
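The benchmark's interaction pattern, a gatekeeper that reveals findings only when asked while costs accumulate, can be sketched as a small loop. Everything below is a hypothetical illustration: the `Gatekeeper` class, the policy interface, the test names, and the cost figures are assumptions, not the benchmark's actual implementation.

```python
class Gatekeeper:
    """Reveals a finding only when that specific test is explicitly requested."""

    def __init__(self, findings, costs):
        self.findings = findings  # test name -> result text
        self.costs = costs        # test name -> cost (hypothetical units)

    def query(self, test_name):
        result = self.findings.get(test_name, "not available")
        return result, self.costs.get(test_name, 0)

def run_encounter(gatekeeper, policy, visit_cost, max_turns=10):
    """Alternate between the diagnostic policy and the gatekeeper, tracking total cost.
    `policy(known)` returns ("test", name) or ("diagnose", label)."""
    known = {}
    total_cost = visit_cost
    for _ in range(max_turns):
        kind, value = policy(known)
        if kind == "diagnose":
            return value, total_cost
        result, cost = gatekeeper.query(value)
        known[value] = result
        total_cost += cost
    return None, total_cost  # no diagnosis committed within the turn budget

def toy_policy(known):
    """Hypothetical fixed policy: order two tests, then commit to a diagnosis."""
    for test in ("cbc", "biopsy"):
        if test not in known:
            return ("test", test)
    return ("diagnose", "lymphoma")
```

Scoring both the returned diagnosis and `total_cost` mirrors the abstract's point that accuracy alone is not the whole evaluation.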
Article 90
Title@2025-07-02 (3): Test-Time Scaling with Reflective Generative Model
Title: Test-Time Scaling with Reflective Generative Model | Test-Zeit-Skalierung mit reflektierendem Generativem Modell | 基于反思生成模型的测试时扩展 2507.01951v1 |
Authors (11): Zixiao Wang, Yuxin Wang, Xiaorui Wang, Mengting Xing, Jie Gao, Jianjun Xu, Guangcan Liu, Chenhui Jin, Zhuo Wang, Shengzhuo Zhang, Hongtao Xie
We introduce our first reflective generative model MetaStone-S1, which obtains OpenAI o3’s performance via the self-supervised process reward model (SPRM). By sharing the backbone network and using task-specific heads for next-token prediction and process scoring respectively, SPRM successfully integrates the policy model and process reward model (PRM) into a unified interface without extra process annotation, reducing PRM parameters by over 99% for efficient reasoning. Equipped with SPRM, MetaStone-S1 is naturally suitable for test-time scaling (TTS), and we provide three reasoning effort modes (low, medium, and high), based on the controllable thinking length. Moreover, we empirically establish a scaling law that reveals the relationship between total thinking computation and TTS performance. Experiments demonstrate that our MetaStone-S1 achieves performance comparable to the OpenAI o3-mini series with only 32B parameters. To support the research community, we have open-sourced MetaStone-S1 at https://github.com/MetaStone-AI/MetaStone-S1.
我们推出了首个反思生成模型MetaStone-S1,它通过自监督过程奖励模型(SPRM)达到了OpenAI o3的性能。通过共享骨干网络并分别使用任务特定的头进行下一词元预测和过程评分,SPRM成功地将策略模型和过程奖励模型(PRM)整合到统一的接口中,无需额外的过程标注,并将PRM参数减少99%以上以实现高效推理。配备SPRM后,MetaStone-S1天然适合测试时扩展(TTS),我们基于可控的思考长度提供三种推理强度模式(低、中、高)。此外,我们通过实验建立了一条扩展定律,揭示了总思考计算量与TTS性能之间的关系。实验表明,我们的MetaStone-S1仅以32B的参数规模就取得了与OpenAI o3-mini系列相当的性能。为了支持研究社区,我们已在 https://github.com/MetaStone-AI/MetaStone-S1 开源了MetaStone-S1。
Article 91
Title@2025-07-02 (3): The Thin Line Between Comprehension and Persuasion in LLMs
Title: The Thin Line Between Comprehension and Persuasion in LLMs | Die dünne Linie zwischen Verständnis und Überzeugung in LLMs | LLMM 理解与劝导之间的细细线 2507.01936v1 |
Authors (2): Adrian de Wynter, Tangming Yuan
Large language models (LLMs) are excellent at maintaining high-level, convincing dialogues. They are being rapidly deployed as chatbots and evaluators in sensitive areas, such as peer review and mental health applications. This, along with the disparate accounts on their reasoning capabilities, calls for a closer examination of LLMs and their comprehension of dialogue. In this work we begin by evaluating LLMs’ ability to maintain a debate, one of the purest yet most complex forms of human communication. Then we measure how this capability relates to their understanding of what is being talked about, namely, their comprehension of dialogical structures and the pragmatic context. We find that LLMs are capable of maintaining coherent, persuasive debates, often swaying the beliefs of participants and audiences alike. We also note that awareness or suspicion of AI involvement encourages people to be more critical of the arguments made. When polling LLMs on their comprehension of deeper structures of dialogue, however, they cannot demonstrate said understanding. Our findings tie the shortcomings of LLMs-as-evaluators to their (in)ability to understand the context. More broadly, for the field of argumentation theory we posit that, if an agent can convincingly maintain a dialogue, it is not necessary for it to know what it is talking about. Hence, the modelling of pragmatic context and coherence is secondary to effectiveness.
大型语言模型(LLMs)擅长维持高水平、令人信服的对话。它们正被迅速部署为聊天机器人和评估者,用于同行评审和心理健康应用等敏感领域。这一点,加上关于其推理能力的各种不一致说法,要求我们更仔细地审视LLMs及其对对话的理解。在这项工作中,我们首先评估LLMs维持辩论的能力,辩论是人类交流中最纯粹却又最复杂的形式之一。然后,我们衡量这种能力与它们对所谈内容的理解之间的关系,即它们对对话结构和语用语境的理解。我们发现,LLMs能够维持连贯、有说服力的辩论,常常能左右参与者和听众的看法。我们还注意到,意识到或怀疑有AI参与会促使人们对所提出的论点更加挑剔。然而,当就更深层次的对话结构询问LLMs时,它们无法表现出这种理解。我们的发现将LLMs作为评估者的缺陷与它们(无法)理解语境联系起来。更广泛地说,对于论证理论领域,我们认为,如果一个智能体能够令人信服地维持对话,它就不必知道自己在谈论什么。因此,对语用语境和连贯性的建模相对于有效性而言是次要的。
Article 92
Title@2025-07-02 (3): Adaptability of ASR Models on Low-Resource Language: A Comparative Study of Whisper and Wav2Vec-BERT on Bangla
Title: Adaptability of ASR Models on Low-Resource Language: A Comparative Study of Whisper and Wav2Vec-BERT on Bangla | Anpassungsfähigkeit von ASR-Modellen auf Low-Resource-Sprache: Eine vergleichende Studie von Whisper und Wav2Vec-BERT auf Bangla | 低资源语言ASR模型的适应性:Whisper与Wav2Vec-BERT在孟加拉语上的比较研究 2507.01931v1 |
Authors (3): Md Sazzadul Islam Ridoy, Sumi Akter, Md. Aminur Rahman
In recent years, neural models trained on large multilingual text and speech datasets have shown great potential for supporting low-resource languages. This study investigates the performances of two state-of-the-art Automatic Speech Recognition (ASR) models, OpenAI’s Whisper (Small & Large-V2) and Facebook’s Wav2Vec-BERT on Bangla, a low-resource language. We have conducted experiments using two publicly available datasets: Mozilla Common Voice-17 and OpenSLR to evaluate model performances. Through systematic fine-tuning and hyperparameter optimization, including learning rate, epochs, and model checkpoint selection, we have compared the models based on Word Error Rate (WER), Character Error Rate (CER), Training Time, and Computational Efficiency. The Wav2Vec-BERT model outperformed Whisper across all key evaluation metrics, demonstrated superior performance while requiring fewer computational resources, and offered valuable insights to develop robust speech recognition systems in low-resource linguistic settings.
近年来,在大型多语种文本和语音数据集方面受过培训的神经模型显示出支持低资源语言的巨大潜力,这项研究调查了两种最先进的自动语音识别模型(ASR)的性能,即OpenAI的耳语(Small & large-V2)和Facebook的Wav2Vec-BERT关于低资源语言孟加拉语的功能。我们利用两个公开的数据集(Mozilla Common Voice-17和OpenSLR)进行了实验,以评估模型的性能。通过系统的微调和超参数优化,包括学习率、时代和示范检查站选择,我们比较了基于单词错误率(WER)、字符错误率(CER)、培训时间和计算效率的模型。Wav2Vec-BERT模型在所有关键评价指标中都优于Wisper,在需要较少计算资源的情况下展示了优秀的性能,并为在低资源语言环境中开发强大的语音识别系统提供了宝贵的洞察力。
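Word Error Rate and Character Error Rate, the two accuracy metrics named above, are both edit distances normalized by reference length. A minimal sketch follows, assuming whitespace tokenization for words and no text normalization (real ASR evaluations usually normalize case and punctuation first); the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("empty reference")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance divided by reference length."""
    if not reference:
        raise ValueError("empty reference")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because both metrics divide by the reference length, they can exceed 1.0 when the hypothesis contains many insertions.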
Article 93
Title@2025-07-02 (3): NaturalThoughts: Selecting and Distilling Reasoning Traces for General Reasoning Tasks
Title: NaturalThoughts: Selecting and Distilling Reasoning Traces for General Reasoning Tasks | NaturalThoughts: Auswählen und Destillieren von Rückschlüssen für allgemeine Aufgaben | 自然探索:为一般理由任务选择和保留合理的理由线索 2507.01921v1 |
Authors (11): Yang Li, Youssef Emad, Karthik Padthe, Jack Lanchantin, Weizhe Yuan, Thao Nguyen, Jason Weston, Shang-Wen Li, Dong Wang, Ilia Kulikov, Xian Li
Recent work has shown that distilling reasoning traces from a larger teacher model via supervised finetuning outperforms reinforcement learning with the smaller student model alone (Guo et al. 2025). However, there has not been a systematic study of what kind of reasoning demonstrations from the teacher are most effective in improving the student model’s reasoning capabilities. In this work we curate high-quality “NaturalThoughts” by selecting reasoning traces from a strong teacher model based on a large pool of questions from NaturalReasoning (Yuan et al. 2025). We first conduct a systematic analysis of factors that affect distilling reasoning capabilities, in terms of sample efficiency and scalability for general reasoning tasks. We observe that simply scaling up data size with random sampling is a strong baseline with steady performance gains. Further, we find that selecting difficult examples that require more diverse reasoning strategies is more sample-efficient to transfer the teacher model’s reasoning skills. Evaluated on both Llama and Qwen models, training with NaturalThoughts outperforms existing reasoning datasets such as OpenThoughts, LIMO, etc. on general STEM reasoning benchmarks including GPQA-Diamond, MMLU-Pro and SuperGPQA.
最近的工作表明,通过监督微调从更大的教师模型中蒸馏推理轨迹,其效果优于仅对较小的学生模型进行强化学习(Guo et al. 2025)。然而,教师模型的哪类推理示范对提升学生模型的推理能力最有效,尚无系统研究。在这项工作中,我们基于NaturalReasoning(Yuan et al. 2025)的大量问题,从强大的教师模型中挑选推理轨迹,构建了高质量的“NaturalThoughts”。我们首先从样本效率和可扩展性的角度,对影响通用推理任务中推理能力蒸馏的因素进行了系统分析。我们观察到,仅通过随机采样扩大数据规模就是一个强有力的基线,能带来稳定的性能提升。此外,我们发现,选择需要更多样化推理策略的困难样本,在迁移教师模型的推理技能方面具有更高的样本效率。在Llama和Qwen模型上的评估表明,使用NaturalThoughts进行训练在GPQA-Diamond、MMLU-Pro和SuperGPQA等通用STEM推理基准上优于OpenThoughts、LIMO等现有推理数据集。
Article 94
Title@2025-07-02 (3): Gradient-Adaptive Policy Optimization: Towards Multi-Objective Alignment of Large Language Models
Title: Gradient-Adaptive Policy Optimization: Towards Multi-Objective Alignment of Large Language Models | Gradient-Adaptive Policy Optimization: Auf dem Weg zu einer multi-objektiven Ausrichtung großer Sprachmodelle | 渐进式政策优化:实现大语言模式多目标一致 2507.01915v1 |
Authors (6): Chengao Li, Hanyu Zhang, Yunkun Xu, Hongyan Xue, Xiang Ao, Qing He
Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful technique for aligning large language models (LLMs) with human preferences. However, effectively aligning LLMs with diverse human preferences remains a significant challenge, particularly when they conflict. To address this issue, we frame human value alignment as a multi-objective optimization problem, aiming to maximize a set of potentially conflicting objectives. We introduce Gradient-Adaptive Policy Optimization (GAPO), a novel fine-tuning paradigm that employs multiple-gradient descent to align LLMs with diverse preference distributions. GAPO adaptively rescales the gradients for each objective to determine an update direction that optimally balances the trade-offs between objectives. Additionally, we introduce P-GAPO, which incorporates user preferences across different objectives and achieves Pareto solutions that better align with the user’s specific needs. Our theoretical analysis demonstrates that GAPO converges towards a Pareto optimal solution for multiple objectives. Empirical results on Mistral-7B show that GAPO outperforms current state-of-the-art methods, achieving superior performance in both helpfulness and harmlessness.
从人类反馈中强化学习(RLHF)已成为使大型语言模型(LLMs)与人类偏好相匹配的有力技术,然而,将LLMs与人类不同偏好有效结合仍然是一个重大挑战。为了解决这一问题,我们把人类价值结合作为一个多目标优化问题,目的是最大限度地实现一系列可能相互冲突的目标。我们引入了一种新型微调模式,即 “ 渐进-偏向政策优化 “ (GAPO),它使用多级血统使LLMs与不同偏爱分布相匹配。GAPO对每个目标的梯度进行了适应性调整,以确定最佳平衡目标之间取舍的更新方向。此外,我们引入了P-GAPO,它将用户的偏好纳入不同目标,并实现更符合用户具体需要的Pareto解决方案。我们的理论分析表明,GAPO为多个目标的Pareto最佳解决方案趋于一致。Mistral-7B的实证结果显示,GAPO超越了当前的最新方法,在有用性和无害性两方面都取得了优异性业绩。
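Multiple-gradient descent, which the abstract builds on, has a closed form for two objectives: take the minimum-norm point on the segment between the two gradients. The sketch below shows only that textbook two-objective case; it is an assumption-laden illustration and omits GAPO's own adaptive gradient rescaling, whose details the abstract does not give.

```python
def dot(a, b):
    """Inner product of two equal-length vectors given as lists."""
    return sum(x * y for x, y in zip(a, b))

def min_norm_direction(g1, g2):
    """Return (d, alpha) where d = alpha*g1 + (1-alpha)*g2 has minimum norm, alpha in [0, 1].
    The resulting d satisfies dot(d, g1) >= dot(d, d) and dot(d, g2) >= dot(d, d),
    so it never opposes either objective's gradient."""
    diff = [b - a for a, b in zip(g1, g2)]  # g2 - g1
    denom = dot(diff, diff)
    if denom == 0.0:                        # identical gradients: nothing to balance
        return list(g1), 1.0
    alpha = min(1.0, max(0.0, dot(diff, g2) / denom))
    d = [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]
    return d, alpha

if __name__ == "__main__":
    # Two conflicting toy gradients, e.g. one per preference objective.
    d, alpha = min_norm_direction([1.0, 0.0], [0.0, 1.0])
    print(d, alpha)
```

When `d` comes out as the zero vector, no direction improves both objectives at once, which is the first-order condition for a Pareto-stationary point.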
Article 95
Title@2025-07-02 (3): AI4Research: A Survey of Artificial Intelligence for Scientific Research
Title: AI4Research: A Survey of Artificial Intelligence for Scientific Research | AI4Research: Eine Untersuchung der Künstlichen Intelligenz für die wissenschaftliche Forschung | AI4Research:科学研究人工情报调查 2507.01903v1 |
Authors (16): Qiguang Chen, Mingda Yang, Libo Qin, Jinhao Liu, Zheng Yan, Jiannan Guan, Dengyun Peng, Yiyan Ji, Hanjing Li, Mengkang Hu, Yimeng Zhang, Yihao Liang, Yuhang Zhou, Jiaqi Wang, Zhi Chen, Wanxiang Che
Recent advancements in artificial intelligence (AI), particularly in large language models (LLMs) such as OpenAI-o1 and DeepSeek-R1, have demonstrated remarkable capabilities in complex domains such as logical reasoning and experimental coding. Motivated by these advancements, numerous studies have explored the application of AI in the innovation process, particularly in the context of scientific research. These AI technologies primarily aim to develop systems that can autonomously conduct research processes across a wide range of scientific disciplines. Despite these significant strides, a comprehensive survey on AI for Research (AI4Research) remains absent, which hampers our understanding and impedes further development in this field. To address this gap, we present a comprehensive survey and offer a unified perspective on AI4Research. Specifically, the main contributions of our work are as follows: (1) Systematic taxonomy: We first introduce a systematic taxonomy to classify five mainstream tasks in AI4Research. (2) New frontiers: Then, we identify key research gaps and highlight promising future directions, focusing on the rigor and scalability of automated experiments, as well as the societal impact. (3) Abundant applications and resources: Finally, we compile a wealth of resources, including relevant multidisciplinary applications, data corpora, and tools. We hope our work will provide the research community with quick access to these resources and stimulate innovative breakthroughs in AI4Research.
最近人工智能(AI),特别是OpenAI-o1和DeepSeek-R1等大型语言模型(LLMS)的进步,在逻辑推理和实验编码等复杂领域表现出了非凡的能力。在这些进步的推动下,许多研究探索了AI在创新过程中的应用,特别是在科学研究方面。这些AI技术主要旨在开发能够自主地在广泛的科学学科中开展研究进程的系统。尽管取得了这些重大进步,但关于AI研究(AI4Research)的全面调查(AI4Research)仍然缺乏,这妨碍了我们的理解和这一领域的进一步发展。为弥补这一差距,我们提出了全面调查,并就AI4Researchation提供了统一的观点。具体地说,我们工作的主要贡献如下:(1) 系统分类:我们首先采用系统化的分类方法,对AI4研究中的五项主流任务进行分类。(2) 新疆界:然后,我们找出关键的研究差距,突出有希望的未来方向,重点是自动化实验的固定性和可缩缩略性,以及社会影响。(3) 为ABINT的应用程序和资源提供了一种丰富的研究工具,我们将利用这些专业性资源,从而激发了我们的革命性研究工具。
Article 96
Title@2025-07-02 (3): High-Layer Attention Pruning with Rescaling
Title: High-Layer Attention Pruning with Rescaling | Pruning der Attention in hohen Schichten mit Reskalierung | 带重缩放的高层注意力剪枝 2507.01900v1 |
Authors (2): Songtao Liu, Peng Liu
Pruning is a highly effective approach for compressing large language models (LLMs), significantly reducing inference latency. However, conventional training-free structured pruning methods often employ a heuristic metric that indiscriminately removes some attention heads across all pruning layers, without considering their positions within the network architecture. In this work, we propose a novel pruning algorithm that strategically prunes attention heads in the model’s higher layers. Since the removal of attention heads can alter the magnitude of token representations, we introduce an adaptive rescaling parameter that calibrates the representation scale post-pruning to counteract this effect. We conduct comprehensive experiments on a wide range of LLMs, including LLaMA3.1-8B, Mistral-7B-v0.3, Qwen2-7B, and Gemma2-9B. Our evaluation includes both generation and discriminative tasks across 27 datasets. The results consistently demonstrate that our method outperforms existing structured pruning methods. This improvement is particularly notable in generation tasks, where our approach significantly outperforms existing baselines.
在压缩大型语言模型(LLMS)时,普鲁宁是一种非常有效的方法,可以大幅降低推导延迟度。然而,常规的无培训结构裁剪方法通常使用一种超光度测量法,不加考虑地将一些注意力排出所有运行层,而没有考虑其在网络结构中的位置。在这项工作中,我们建议一种新型的裁剪算法,在战略上将注意力排入模型较高层中。由于转移注意力头可以改变象征性表达的大小,我们引入了适应性调整参数,校准代表比例的调整后调整,以抵消这一效应。我们在广泛的LLLMS中进行了全面实验,包括LLAMA3.1-8B、Mistral-7B-v0.3、Qwen2-7B和Gemma2-9B。我们的评价包括27个数据集的生成和歧视性任务。结果始终表明,我们的方法超过了现有的结构化的运行方法。这种改进在生成任务中特别显著地突出,我们的方法大大超过现有的基准。
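The rescaling step can be pictured as follows: removing attention heads shrinks the summed head output, so a scalar is applied to restore its magnitude. This is a guess at the general shape only. The norm-matching rule, the function names, and the toy vectors are assumptions; the paper's actual calibration of the rescaling parameter may differ.

```python
import math

def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def prune_and_rescale(head_outputs, keep):
    """head_outputs: one output vector per attention head for a single token.
    keep: indices of the heads that survive pruning.
    Returns (rescaled_output, alpha), where alpha restores the pre-pruning norm."""
    dim = len(head_outputs[0])
    full = [sum(h[i] for h in head_outputs) for i in range(dim)]
    pruned = [sum(head_outputs[k][i] for k in keep) for i in range(dim)]
    pruned_norm = l2_norm(pruned)
    alpha = l2_norm(full) / pruned_norm if pruned_norm > 0 else 1.0
    return [alpha * x for x in pruned], alpha

if __name__ == "__main__":
    heads = [[3.0, 0.0], [0.0, 4.0], [3.0, 4.0]]  # toy per-head outputs
    print(prune_and_rescale(heads, keep=[0, 1]))
```

In practice such a factor would be estimated once on calibration data rather than recomputed per token, since the full-model output is not available after pruning.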
Article 97
Title@2025-07-02 (3): Recursive Training Loops in LLMs: How training data properties modulate distribution shift in generated data?
Title: Recursive Training Loops in LLMs: How training data properties modulate distribution shift in generated data? | Rekursive Trainingsschleifen in LLMs: Wie modulieren Trainingsdateneigenschaften die Verteilungsverschiebung in generierten Daten? | LLMM中的递归培训循环:培训数据特性如何调整生成数据的分布变化? 2504.03814v3 |
Authors (5): Grgur Kovač, Jérémy Perez, Rémy Portelas, Peter Ford Dominey, Pierre-Yves Oudeyer
Large language models (LLMs) are increasingly used in the creation of online content, creating feedback loops as subsequent generations of models will be trained on this synthetic data. Such loops were shown to lead to distribution shifts: models misrepresenting the true underlying distributions of human data (also called model collapse). However, how human data properties affect such shifts remains poorly understood. In this paper, we provide the first empirical examination of the effect of such properties on the outcome of recursive training. We first confirm that using different human datasets leads to distribution shifts of different magnitudes. Through exhaustive manipulation of dataset properties combined with regression analyses, we then identify a set of properties predicting distribution shift magnitudes. Lexical diversity is found to amplify these shifts, while semantic diversity and data quality mitigate them. Furthermore, we find that these influences are highly modular: data scraped from a given internet domain has little influence on the content generated for another domain. Finally, experiments on political bias reveal that human data properties affect whether the initial bias will be amplified or reduced. Overall, our results portray a novel view, where different parts of the internet may undergo different types of distribution shift.
大型语言模型(LLMS)越来越多地用于创建在线内容,创建反馈循环,因为随后几代模型将接受有关这一合成数据的培训。这种循环显示会导致分布变化 — — 扭曲人类数据真实基本分布的模型(也称为模型崩溃 ) 。然而,人类数据属性如何影响这种变化仍然不甚为人理解。在本文中,我们首次对此类属性对循环培训结果的影响进行了经验性审查。我们首先确认,使用不同的人类数据集会导致不同规模的分布变化。我们通过对数据集属性的详尽操作,加上回归分析,我们随后确定了一套预测分布变化大小的属性。发现,超文本多样性可以扩大这些变化,同时,语义多样性和数据质量可以减轻这些变化。此外,我们发现这些影响是高度模块化的:从特定互联网域中分离的数据对另一个域生成的内容没有多大影响。最后,关于政治偏见的实验表明,人类数据属性会影响最初的偏差是否会扩大或缩小。总体而言,我们的结果描绘了一种新颖的视角,因为互联网的不同部分可能会发生不同类型的分布变化。
Article 98
Title@2025-07-02 (3): MiCoTA: Bridging the Learnability Gap with Intermediate CoT and Teacher Assistants
Title: MiCoTA: Bridging the Learnability Gap with Intermediate CoT and Teacher Assistants | MiCoTA: Die Lernfähigkeitslücke mit Intermediate CoT und Lehrerassistenten überbrücken | MiCoTA:利用中等长度CoT和教师助理弥合可学习性差距 2507.01887v1 |
Authors (6): Dongyi Ding, Tiannan Wang, Chenghao Zhu, Meiling Tao, Yuchen Eleanor Jiang, Wangchunshu Zhou
Large language models (LLMs) excel at reasoning tasks requiring long thought sequences for planning, reflection, and refinement. However, their substantial model size and high computational demands are impractical for widespread deployment. Yet, small language models (SLMs) often struggle to learn long-form CoT reasoning due to their limited capacity, a phenomenon we refer to as the “SLMs Learnability Gap”. To address this, we introduce Mid-CoT Teacher Assistant Distillation (MiCoTA), a framework for improving long CoT distillation for SLMs. MiCoTA employs intermediate-sized models as teacher assistants and utilizes intermediate-length CoT sequences to bridge both the capacity and reasoning length gaps. Our experiments on downstream tasks demonstrate that although SLMs distilled from large teachers can perform poorly, by applying MiCoTA, they achieve significant improvements in reasoning performance. Specifically, Qwen2.5-7B-Instruct and Qwen2.5-3B-Instruct achieve an improvement of 3.47 and 3.93 respectively on average score on AIME2024, AMC, Olympiad, MATH-500 and GSM8K benchmarks. To better understand the mechanism behind MiCoTA, we perform a quantitative experiment demonstrating that our method produces data more closely aligned with base SLM distributions. Our insights pave the way for future research into long-CoT data distillation for SLMs.
大型语言模型(LLM)擅长需要长思维序列来进行规划、反思和完善的推理任务。然而,其庞大的模型规模和高计算需求使其难以广泛部署。另一方面,小型语言模型(SLM)由于容量有限,往往难以学习长篇CoT推理,我们将这一现象称为“SLM可学习性差距”。为了解决这一问题,我们提出了Mid-CoT Teacher Assistant Distillation(MiCoTA),一个用于改进面向SLM的长CoT蒸馏的框架。MiCoTA采用中等规模的模型作为教师助理,并利用中等长度的CoT序列来弥合容量和推理长度两方面的差距。我们在下游任务上的实验表明,虽然从大型教师模型蒸馏得到的SLM可能表现不佳,但应用MiCoTA后,其推理性能得到显著提升。具体而言,Qwen2.5-7B-Instruct和Qwen2.5-3B-Instruct在AIME2024、AMC、Olympiad、MATH-500和GSM8K基准上的平均得分分别提高了3.47和3.93。为了更好地理解MiCoTA背后的机制,我们进行了定量实验,表明我们的方法生成的数据与基础SLM的分布更为一致。我们的见解为未来面向SLM的长CoT数据蒸馏研究铺平了道路。
Article 99
Title@2025-07-02 (3): Towards Universal Semantics With Large Language Models
Title: Towards Universal Semantics With Large Language Models | Hin zu universeller Semantik mit großen Sprachmodellen | 走向具有大语言模式的普遍语义 2505.11764v2 |
Authors (5): Raymond Baartmans, Matthew Raffel, Rahul Vikram, Aiden Deringer, Lizhong Chen
The Natural Semantic Metalanguage (NSM) is a linguistic theory based on a universal set of semantic primes: simple, primitive word-meanings that have been shown to exist in most, if not all, languages of the world. According to this framework, any word, regardless of complexity, can be paraphrased using these primes, revealing a clear and universally translatable meaning. These paraphrases, known as explications, can offer valuable applications for many natural language processing (NLP) tasks, but producing them has traditionally been a slow, manual process. In this work, we present the first study of using large language models (LLMs) to generate NSM explications. We introduce automatic evaluation methods, a tailored dataset for training and evaluation, and fine-tuned models for this task. Our 1B and 8B models outperform GPT-4o in producing accurate, cross-translatable explications, marking a significant step toward universal semantic representation with LLMs and opening up new possibilities for applications in semantic analysis, translation, and beyond.
自然语义代言语(NSM)是一种语言理论,它基于一套通用的语义质素材:简单、原始的字意,在世界上大多数语言中,即使不是所有语言中,也存在。根据这个框架,任何单词,无论复杂程度如何,都可以使用这些字句进行引言,揭示一个明确和普遍的可翻转的含义。这些词句被称为解说,可以为许多自然语言处理(NLP)任务提供宝贵的应用,但制作它们历来是一个缓慢的手工过程。在这项工作中,我们介绍了关于使用大型语言模型(LLMS)来生成NSM解说的首项研究。我们采用了自动评价方法、为培训和评估定制的数据集,以及为这项任务的微调模型。我们的1B和8B模型在制作准确、交叉翻译的GPT-4o时,超越了GPTT-4o,标志着与LMs一起走向普遍语义表述的重要一步,并为语义分析、翻译和以后的应用开辟了新的可能性。
Article 100
Title@2025-07-02 (3): LinguaSynth: Heterogeneous Linguistic Signals for News Classification
Title: LinguaSynth: Heterogeneous Linguistic Signals for News Classification | LinguaSynth: Heterogene linguistische Signale für Nachrichtenklassifikation | LinguaSynth:不同语言信号用于新闻分类 2506.21848v2 |
Authors (2): Duo Zhang, Junyi Mo
Deep learning has significantly advanced NLP, but its reliance on large black-box models introduces critical interpretability and computational efficiency concerns. This paper proposes LinguaSynth, a novel text classification framework that strategically integrates five complementary linguistic feature types: lexical, syntactic, entity-level, word-level semantics, and document-level semantics within a transparent logistic regression model. Unlike transformer-based architectures, LinguaSynth maintains interpretability and computational efficiency, achieving an accuracy of 84.89 percent on the 20 Newsgroups dataset and surpassing a robust TF-IDF baseline by 3.32 percent. Through rigorous feature interaction analysis, we show that syntactic and entity-level signals provide essential disambiguation and effectively complement distributional semantics. LinguaSynth sets a new benchmark for interpretable, resource-efficient NLP models and challenges the prevailing assumption that deep neural networks are necessary for high-performing text classification.
深度学习显著推进了NLP的发展,但其对大型黑盒模型的依赖带来了关键的可解释性和计算效率问题。本文提出LinguaSynth,一个新颖的文本分类框架,在透明的逻辑回归模型中有策略地整合了五种互补的语言特征类型:词汇、句法、实体级、词级语义和文档级语义。与基于Transformer的架构不同,LinguaSynth保持了可解释性和计算效率,在20 Newsgroups数据集上达到84.89%的准确率,比稳健的TF-IDF基线高出3.32%。通过严格的特征交互分析,我们表明句法和实体级信号提供了必要的消歧作用,并有效补充了分布式语义。LinguaSynth为可解释、资源高效的NLP模型设立了新的基准,并挑战了深度神经网络是高性能文本分类所必需的这一普遍假设。
Article 101
Title@2025-07-02 (3): DIY-MKG: An LLM-Based Polyglot Language Learning System
Title: DIY-MKG: An LLM-Based Polyglot Language Learning System | DIY-MKG: Ein LLM-basiertes Polyglotte-Sprachlernsystem | DIY-MKG:一个基于LLM的多金语言学习系统 2507.01872v1 |
Authors (3): Kenan Tang, Yanhong Li, Yao Qin
Existing language learning tools, even those powered by Large Language Models (LLMs), often lack support for polyglot learners to build linguistic connections across vocabularies in multiple languages, provide limited customization for individual learning paces or needs, and suffer from detrimental cognitive offloading. To address these limitations, we design Do-It-Yourself Multilingual Knowledge Graph (DIY-MKG), an open-source system that supports polyglot language learning. DIY-MKG allows the user to build personalized vocabulary knowledge graphs, which are constructed by selective expansion with related words suggested by an LLM. The system further enhances learning through rich annotation capabilities and an adaptive review module that leverages LLMs for dynamic, personalized quiz generation. In addition, DIY-MKG allows users to flag incorrect quiz questions, simultaneously increasing user engagement and providing a feedback loop for prompt refinement. Our evaluation of LLM-based components in DIY-MKG shows that vocabulary expansion is reliable and fair across multiple languages, and that the generated quizzes are highly accurate, validating the robustness of DIY-MKG.
现有语言学习工具,即使是由大语言模型(LLMS)驱动的现有语言学习工具,也往往缺乏对多语种学习者的支持,无法在多种语言的词汇中建立语言联系,为个人学习速度或需要提供有限的定制,并遭受有害的认知卸载。为了解决这些限制,我们设计了Do-It- yourself多语言知识图(DIY-MKG),这是一个支持多语种学习的开放源系统。DIY-MKG允许用户建立个性化词汇知识图,该图是用LLMM提出的相关词进行选择性扩展而构建的。这个系统通过丰富的批注能力和适应性审查模块进一步加强学习,利用LLMS进行动态、个性化的问答生成。此外,DIY-MKG允许用户提出错误的测试问题,同时增加用户的参与,并为迅速完善提供反馈回路。我们对DIY-MKG中基于LM的组件的评价表明,词汇扩展在多种语言中是可靠和公平的,产生的问答非常准确,验证了DIY-MKG的准确性。
Article 102
Title@2025-07-02 (3): Eka-Eval : A Comprehensive Evaluation Framework for Large Language Models in Indian Languages
Title: Eka-Eval : A Comprehensive Evaluation Framework for Large Language Models in Indian Languages | Eka-Eval : Ein umfassender Evaluierungsrahmen für große Sprachmodelle in indischen Sprachen | Eka-Eval:印度语大语言模式综合评价框架 2507.01853v1 |
Authors (4): Samridhi Raj Sinha, Rajvee Sheth, Abhishek Upperwal, Mayank Singh
The rapid advancement of Large Language Models (LLMs) has intensified the need for evaluation frameworks that go beyond English-centric benchmarks and address the requirements of linguistically diverse regions such as India. We present EKA-EVAL, a unified and production-ready evaluation framework that integrates over 35 benchmarks, including 10 Indic-specific datasets, spanning categories like reasoning, mathematics, tool use, long-context understanding, and reading comprehension. Compared to existing Indian language evaluation tools, EKA-EVAL offers broader benchmark coverage, with built-in support for distributed inference, quantization, and multi-GPU usage. Our systematic comparison positions EKA-EVAL as the first end-to-end, extensible evaluation suite tailored for both global and Indic LLMs, significantly lowering the barrier to multilingual benchmarking. The framework is open-source and publicly available at https://github.com/lingo-iitgn/eka-eval and is part of the ongoing EKA initiative (https://eka.soket.ai), which aims to scale up to over 100 benchmarks and establish a robust, multilingual evaluation ecosystem for LLMs.
大语言模型(LLM)的快速发展使得人们更加需要超越以英语为中心的基准、并满足印度等语言多样化地区需求的评估框架。我们提出EKA-EVAL,这是一个统一且可用于生产环境的评估框架,整合了超过35个基准,其中包括10个印度语言专用数据集,涵盖推理、数学、工具使用、长上下文理解和阅读理解等类别。与现有的印度语言评估工具相比,EKA-EVAL提供了更广泛的基准覆盖,并内置对分布式推理、量化和多GPU使用的支持。我们的系统性比较表明,EKA-EVAL是首个同时面向全球和印度语言LLM的端到端、可扩展评估套件,大大降低了多语言基准评测的门槛。该框架是开源的,可在https://github.com/lingo-iitgn/eka-eval公开获取,并且是正在进行的EKA倡议(https://eka.soket.ai)的一部分,该倡议旨在扩展到100多个基准,并为LLM建立稳健的多语言评估生态系统。
Article 103
Title@2025-07-02 (3): Low-Perplexity LLM-Generated Sequences and Where To Find Them
Title: Low-Perplexity LLM-Generated Sequences and Where To Find Them | Low-Perplexity LLM-generierte Sequenzen und wo sie zu finden sind | 低重复性 LLM 生成序列及其查找地点 2507.01844v1 |
Authors (3): Arthur Wuhrmann, Anastasiia Kucherenko, Andrei Kucharavy
As Large Language Models (LLMs) become increasingly widespread, understanding how specific training data shapes their outputs is crucial for transparency, accountability, privacy, and fairness. To explore how LLMs leverage and replicate their training data, we introduce a systematic approach centered on analyzing low-perplexity sequences - high-probability text spans generated by the model. Our pipeline reliably extracts such long sequences across diverse topics while avoiding degeneration, then traces them back to their sources in the training data. Surprisingly, we find that a substantial portion of these low-perplexity spans cannot be mapped to the corpus. For those that do match, we quantify the distribution of occurrences across source documents, highlighting the scope and nature of verbatim recall and paving a way toward better understanding of how LLMs' training data impacts their behavior.
随着大语言模型(LLMs)日益普及,了解具体培训数据如何塑造其产出对于透明度、问责制、隐私和公平性至关重要。为了探索LLMs如何利用和复制其培训数据,我们引入了一种系统的方法,以分析低重复序列为中心,即模型产生的高概率文本范围。我们的管道可靠地从不同主题中提取了如此长的序列,同时避免退化,然后在培训数据中将其追溯到源头。令人惊讶的是,我们发现这些低难度范围中有很大一部分无法绘制成文体。对于与之匹配的文件,我们量化了源文件之间的事件分布,突出了逐字记录的范围和性质,并为更好地了解LLMs培训数据如何影响其行为铺平了一条路。
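To make the notion of a low-perplexity span in the abstract above concrete, here is a minimal sketch. It assumes per-token natural-log probabilities are already available from some model; the function names, window size, and threshold are illustrative choices, not the authors' pipeline.

```python
import math

def window_perplexity(logprobs):
    """Perplexity of a token window, given natural-log token probabilities."""
    return math.exp(-sum(logprobs) / len(logprobs))

def low_perplexity_spans(logprobs, window, threshold):
    """Return (start, end) index pairs of windows whose perplexity is below threshold."""
    spans = []
    for start in range(len(logprobs) - window + 1):
        chunk = logprobs[start:start + window]
        if window_perplexity(chunk) < threshold:
            spans.append((start, start + window))
    return spans

# Toy log-probabilities (hypothetical): tokens 2..5 are predicted with high confidence.
toy = [-2.3, -1.9, -0.01, -0.02, -0.01, -0.03, -2.5]
spans = low_perplexity_spans(toy, window=4, threshold=1.1)
```

A perplexity close to 1 means the model assigned nearly all probability mass to each emitted token, which is why such spans are natural candidates for verbatim recall.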
Article 104
Title@2025-07-02 (3): Guaranteed Generation from Large Language Models
Title: Guaranteed Generation from Large Language Models | Garantierte Generation aus großen Sprachmodellen | 从大语言模式中担保产生 2410.06716v2 |
Authors (6): Minbeom Kim, Thibaut Thonet, Jos Rozen, Hwaran Lee, Kyomin Jung, Marc Dymetman
As large language models (LLMs) are increasingly used across various applications, there is a growing need to control text generation to satisfy specific constraints or requirements. This raises a crucial question: Is it possible to guarantee strict constraint satisfaction in generated outputs while preserving the distribution of the original model as much as possible? We first define the ideal distribution - the one closest to the original model, which also always satisfies the expressed constraint - as the ultimate goal of guaranteed generation. We then state a fundamental limitation, namely that it is impossible to reach that goal through autoregressive training alone. This motivates the necessity of combining training-time and inference-time methods to enforce such guarantees. Based on this insight, we propose GUARD, a simple yet effective approach that combines an autoregressive proposal distribution with rejection sampling. Through GUARD’s theoretical properties, we show how controlling the KL divergence between a specific proposal and the target ideal distribution simultaneously optimizes inference speed and distributional closeness. To validate these theoretical concepts, we conduct extensive experiments on two text generation settings with hard-to-satisfy constraints: a lexical constraint scenario and a sentiment reversal scenario. These experiments show that GUARD achieves perfect constraint satisfaction while almost preserving the ideal distribution with highly improved inference efficiency. GUARD provides a principled approach to enforcing strict guarantees for LLMs without compromising their generative capabilities.
由于大型语言模型(LLMS)在各种应用中日益被使用,因此越来越需要控制文本生成,以满足具体限制或要求,这提出了一个至关重要的问题:能否保证生成产出的严格约束性满意度,同时尽可能保留原始模型的分布?我们首先将理想分布——最接近原始模型的分布,也总是最接近原始模型的制约——确定为保证生成的最终目标;然后我们指出一个基本限制,即仅通过自动递减性培训是不可能实现这一目标的。这促使有必要将培训时间和推断时间方法结合起来,以实施这种保证。基于这一认识,我们建议GUARD是一种简单而有效的方法,将自动递减性提案的分布与拒绝抽样结合起来。我们通过GUARD的理论特性,将理想分布之间的KL差异确定为保证,同时优化推论速度和分布上的近距离。为了验证这些理论概念,我们广泛试验了两种具有硬至推力制约的文本生成环境:在不妥协性约束性约束下,我们提议GARDD,这是一种简单而有效的方法,将自动递减性建议与拒绝采样结合起来。 我们通过GUAAAA的精确的保证,而能保证在高度递反变制中,这些试验提供了一种最精确的保证。
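The combination of a proposal distribution with rejection sampling described in the GUARD abstract above can be sketched as follows. This is only a toy illustration: the proposal here is a hypothetical random word sampler rather than a language model, and the helper names are assumptions.

```python
import random

def rejection_sample(propose, satisfies, max_tries=1000, rng=None):
    """Draw from `propose` until the constraint holds; return (sample, tries)."""
    rng = rng or random.Random()
    for tries in range(1, max_tries + 1):
        sample = propose(rng)
        if satisfies(sample):
            return sample, tries
    raise RuntimeError("no accepted sample within max_tries")

# Hypothetical toy proposal: a three-word "sentence" from a tiny vocabulary.
VOCAB = ["good", "bad", "movie", "plot", "amazing"]

def toy_propose(rng):
    return [rng.choice(VOCAB) for _ in range(3)]

def contains_amazing(words):
    # Lexical constraint: the output must contain the word "amazing".
    return "amazing" in words

rng = random.Random(0)
sample, tries = rejection_sample(toy_propose, contains_amazing, rng=rng)
```

Every accepted sample satisfies the constraint by construction, and the number of tries falls as the proposal places more mass on constraint-satisfying outputs. That is the intuition behind the abstract's link between proposal quality and inference speed.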
Article 105
Title@2025-07-02 (3): QAEncoder: Towards Aligned Representation Learning in Question Answering Systems
Title: QAEncoder: Towards Aligned Representation Learning in Question Answering Systems | QAEncoder: Auf dem Weg zu einem ausgerichteten Repräsentationslernen in Fragestellungssystemen | QAEncoder:在问题解答系统中实现代表性统一学习 2409.20434v3 |
Authors (9): Zhengren Wang, Qinhan Yu, Shida Wei, Zhiyu Li, Feiyu Xiong, Xiaoxing Wang, Simin Niu, Hao Liang, Wentao Zhang
Modern QA systems entail retrieval-augmented generation (RAG) for accurate and trustworthy responses. However, the inherent gap between user queries and relevant documents hinders precise matching. We introduce QAEncoder, a training-free approach to bridge this gap. Specifically, QAEncoder estimates the expectation of potential queries in the embedding space as a robust surrogate for the document embedding, and attaches document fingerprints to effectively distinguish these embeddings. Extensive experiments across diverse datasets, languages, and embedding models confirmed QAEncoder’s alignment capability, which offers a simple-yet-effective solution with zero additional index storage, retrieval latency, training costs, or catastrophic forgetting and hallucination issues. The repository is publicly available at https://github.com/IAAR-Shanghai/QAEncoder.
现代问答(QA)系统需要检索增强生成(RAG)来提供准确且可信的回答。然而,用户查询与相关文档之间的固有差距妨碍了精确匹配。我们提出QAEncoder,一种无需训练的方法来弥合这一差距。具体而言,QAEncoder在嵌入空间中估计潜在查询的期望,将其作为文档嵌入的稳健替代,并附加文档指纹以有效区分这些嵌入。在不同数据集、语言和嵌入模型上的大量实验证实了QAEncoder的对齐能力,它提供了一个简单而有效的解决方案,不带来任何额外的索引存储、检索延迟、训练成本,也没有灾难性遗忘和幻觉问题。代码仓库可在https://github.com/IAAR-Shanghai/QAEncoder公开获取。
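The core idea in the QAEncoder abstract above, using the expected embedding of potential queries as a surrogate for the document embedding, might look roughly like the sketch below. The 2-D vectors are hypothetical stand-ins for real embeddings, the mean is a simple Monte Carlo estimate of the expectation, and the document-fingerprint step is not covered.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors (a Monte Carlo estimate of the expectation)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical embeddings: the document sits far from the queries it answers.
doc_embedding = [1.0, 0.0]
potential_queries = [[0.0, 1.0], [0.2, 0.8], [0.1, 0.9]]
surrogate = mean_vector(potential_queries)

user_query = [0.1, 1.0]
sim_doc = cosine(user_query, doc_embedding)
sim_surrogate = cosine(user_query, surrogate)
```

In this toy setup the surrogate is much closer to the user query than the raw document embedding. Because the surrogate replaces the stored vector rather than adding one, the index size stays unchanged, in line with the "zero additional index storage" claim.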
Article 106
Title@2025-07-02 (3): Evaluating Structured Output Robustness of Small Language Models for Open Attribute-Value Extraction from Clinical Notes
Title: Evaluating Structured Output Robustness of Small Language Models for Open Attribute-Value Extraction from Clinical Notes | Bewertung der Robustheit von kleinen Sprachmodellen für eine offene Attribut-Wert-Extraktion aus klinischen Anmerkungen | 评价从临床说明中公开属性价值提取的小型语言模式的结构化产出强强度 2507.01810v1 |
Authors (3): Nikita Neveditsin, Pawan Lingras, Vijay Mago
We present a comparative analysis of the parseability of structured outputs generated by small language models for open attribute-value extraction from clinical notes. We evaluate three widely used serialization formats: JSON, YAML, and XML, and find that JSON consistently yields the highest parseability. Structural robustness improves with targeted prompting and larger models, but declines for longer documents and certain note types. Our error analysis identifies recurring format-specific failure patterns. These findings offer practical guidance for selecting serialization formats and designing prompts when deploying language models in privacy-sensitive clinical settings.
我们对用于公开属性值提取的临床说明的小型语言模型产生的结构化产出的可分析性进行了比较分析。我们评估了三种广泛使用的序列化格式:JSON、YAML和XML,发现JSON始终具有最高可分析性。结构稳健性随着有针对性的快速和较大的模型而得到改善,但对于较长的文档和某些备注类型则有所下降。我们的错误分析确定了反复出现的具体格式故障模式。这些发现为在隐私敏感的临床环境中使用语言模型时选择序列化格式和设计提示提供了实用指导。
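A parseability check of the kind the abstract above evaluates can be sketched with the Python standard library. The example covers JSON and XML only, since YAML needs a third-party parser. The sample outputs are hypothetical, and the study's actual criteria may be stricter than "parses without error".

```python
import json
import xml.etree.ElementTree as ET

def is_parseable(text, fmt):
    """Return True if `text` parses as `fmt` ("json" or "xml")."""
    try:
        if fmt == "json":
            json.loads(text)
        elif fmt == "xml":
            ET.fromstring(text)
        else:
            raise ValueError(f"unsupported format: {fmt}")
    except (json.JSONDecodeError, ET.ParseError):
        return False
    return True

def parseability_rate(outputs, fmt):
    """Fraction of model outputs that parse without error."""
    return sum(is_parseable(o, fmt) for o in outputs) / len(outputs)

# Hypothetical model outputs: the second JSON string is truncated,
# and the second XML string has a mismatched closing tag.
json_outputs = ['{"attribute": "dose", "value": "5 mg"}', '{"attribute": "dose", "value": ']
xml_outputs = ["<item><attribute>dose</attribute></item>", "<item><attribute>dose</item>"]
```

Truncation and unbalanced tags are typical of the format-specific failures such a check would surface.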
Article 107
Title@2025-07-02 (3): LoRA Fine-Tuning Without GPUs: A CPU-Efficient Meta-Generation Framework for LLMs
Title: LoRA Fine-Tuning Without GPUs: A CPU-Efficient Meta-Generation Framework for LLMs | LoRA Feintuning ohne GPUs: Ein CPU-effizientes Meta-Generation-Framework für LLMs | LoRA 无GPUs的精细调整:LLMs的CPU-提高功能元元发光框架 2507.01806v1 |
Authors (3): Reza Arabpour, Haitz Sáez de Ocáriz Borde, Anastasis Kratsios
Low-Rank Adapters (LoRAs) have transformed the fine-tuning of Large Language Models (LLMs) by enabling parameter-efficient updates. However, their widespread adoption remains limited by the reliance on GPU-based training. In this work, we propose a theoretically grounded approach to LoRA fine-tuning designed specifically for users with limited computational resources, particularly those restricted to standard laptop CPUs. Our method learns a meta-operator that maps any input dataset, represented as a probability distribution, to a set of LoRA weights by leveraging a large bank of pre-trained adapters for the Mistral-7B-Instruct-v0.2 model. Instead of performing new gradient-based updates, our pipeline constructs adapters via lightweight combinations of existing LoRAs directly on CPU. While the resulting adapters do not match the performance of GPU-trained counterparts, they consistently outperform the base Mistral model on downstream tasks, offering a practical and accessible alternative to traditional GPU-based fine-tuning.
低兰克适应器(LORAs)通过允许具有参数效率的更新,改变了大语言模型的微调,但是,由于依赖基于GPU的培训,这些模型的广泛采用仍然有限。在这项工作中,我们建议对专门为计算资源有限的用户,特别是仅限于标准膝上型膝上型计算机的用户设计的LORA微调采取基于理论的微调方法。我们的方法学了一个元操作器,该元操作器将代表概率分布的任何输入数据集映射成一组LORA重量,利用大批受过训练的适应器来进行Mistral-7B-Instruct-v0.2模型。我们的管道结构不是进行新的基于梯度的更新,而是通过直接在CPU上对现有LRA的轻量组合来建造适应器。虽然所产生的调整器与受GPUP培训的对应方的性能不匹配,但它们始终比下游任务的基本Mistral模型高,为传统的基于GPU的微调制提供了实用和方便的替代方法。
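As a rough picture of what a "lightweight combination of existing LoRAs" in the abstract above could mean, the sketch below forms a weighted sum of adapters from a small bank. The abstract does not specify the combination rule or how the meta-operator produces the coefficients, so the parameter names and the fixed coefficients are purely hypothetical.

```python
def combine_adapters(adapters, coefficients):
    """Weighted sum of adapters that share the same parameter names and shapes."""
    if len(adapters) != len(coefficients):
        raise ValueError("one coefficient per adapter is required")
    combined = {}
    for name in adapters[0]:
        # Group the i-th entry of this parameter across all adapters.
        columns = zip(*(adapter[name] for adapter in adapters))
        combined[name] = [sum(c * w for c, w in zip(coefficients, col)) for col in columns]
    return combined

# Hypothetical bank of two pre-trained adapters with flattened parameters.
bank = [
    {"layer0.lora_A": [1.0, 0.0], "layer0.lora_B": [0.0, 2.0]},
    {"layer0.lora_A": [0.0, 1.0], "layer0.lora_B": [2.0, 0.0]},
]
new_adapter = combine_adapters(bank, [0.75, 0.25])
```

No gradients are computed, so this runs comfortably on a CPU. Note that averaging the low-rank factors separately is not the same as averaging their products, and which of the two the paper uses is not stated in the abstract.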
Article 108
Title@2025-07-02 (3): The Anatomy of Evidence: An Investigation Into Explainable ICD Coding
Title: The Anatomy of Evidence: An Investigation Into Explainable ICD Coding | Die Anatomie der Beweise: Eine Untersuchung zur erklärbaren ICD-Kodierung | 证据解剖学:调查可解释的 ICD 编码 2507.01802v1 |
Authors (5): Katharina Beckh, Elisa Studeny, Sujan Sai Gannamaneni, Dario Antweiler, Stefan Rüping
Automatic medical coding has the potential to ease documentation and billing processes. For this task, transparency plays an important role for medical coders and regulatory bodies, which can be achieved using explainability methods. However, the evaluation of these approaches has been mostly limited to short text and binary settings due to a scarcity of annotated data. Recent efforts by Cheng et al. (2023) have introduced the MDACE dataset, which provides a valuable resource containing code evidence in clinical records. In this work, we conduct an in-depth analysis of the MDACE dataset and perform plausibility evaluation of current explainable medical coding systems from an applied perspective. With this, we contribute to a deeper understanding of automatic medical coding and evidence extraction. Our findings reveal that ground truth evidence aligns with code descriptions to a certain degree. An investigation into state-of-the-art approaches shows a high overlap with ground truth evidence. We propose match measures and highlight success and failure cases. Based on our findings, we provide recommendations for developing and evaluating explainable medical coding systems.
自动医疗编码具有方便文件和计费程序的潜力。对于这项任务,透明度对医疗编码员和监管机构起着重要作用,可以通过解释性方法实现。然而,由于缺少附加说明的数据,对这些方法的评价主要限于短文本和二进制设置。Cheng等人(2023年)最近的努力引入了MDACE数据集,该数据集提供了含有临床记录中代码证据的宝贵资源。在这项工作中,我们对MDACE数据集进行深入分析,并从应用角度对当前可解释的医疗编码系统进行合理性评价。我们以此帮助加深对自动医学编码和证据提取的理解。我们的调查结果显示,地面真相证据与代码描述在某种程度上是一致的。对最新方法的调查显示,与地面真相证据有很大的重叠。我们提出了匹配措施,并强调成功和失败案例。根据我们的调查结果,我们为开发和评价可解释的医疗编码系统提出了建议。
Article 109
Title@2025-07-02 (3): How Do Vision-Language Models Process Conflicting Information Across Modalities?
Title: How Do Vision-Language Models Process Conflicting Information Across Modalities? | Wie verarbeiten Vision-Language-Modelle widersprüchliche Informationen über Modalitäten hinweg? | 愿景-语言模型如何以不同方式处理信息冲突问题? 2507.01790v1 |
Authors (3): Tianze Hua, Tian Yun, Ellie Pavlick
AI models are increasingly required to be multimodal, integrating disparate input streams into a coherent state representation on which subsequent behaviors and actions can be based. This paper seeks to understand how such models behave when input streams present conflicting information. Focusing specifically on vision-language models, we provide inconsistent inputs (e.g., an image of a dog paired with the caption “A photo of a cat”) and ask the model to report the information present in one of the specific modalities (e.g., “What does the caption say / What is in the image?”). We find that models often favor one modality over the other, e.g., reporting the image regardless of what the caption says, but that different models differ in which modality they favor. We find evidence that the behaviorally preferred modality is evident in the internal representational structure of the model, and that specific attention heads can restructure the representations to favor one modality over the other. Moreover, we find modality-agnostic “router heads” which appear to promote answers about the modality requested in the instruction, and which can be manipulated or transferred in order to improve performance across datasets and modalities. Together, the work provides essential steps towards identifying and controlling if and how models detect and resolve conflicting signals within complex multimodal environments.
AI模式日益成为多式模式,将不同的输入流纳入一个连贯的国家代表模式,随后的行为和行动可以以此为基础。本文件试图了解当输入流带来相互矛盾的信息时,这种模式的行为方式如何。 具体侧重于愿景语言模式,我们提供不一致的投入(例如,一条配有“猫的照片”标题的狗的图像),并要求模型报告特定模式之一(例如,“标题表示什么/图像中是什么?” )的信息。我们发现,模式往往偏向于一种模式,例如,报告图像,而不管标题说明什么,但不同的模式不同,它们倾向于哪种模式。我们发现,行为偏好模式在模型的内部代表结构中显而易见,而具体关注负责人可以调整代表结构,赞成一种模式而不是另一种模式。此外,我们发现,模式-无差别的“路径头”似乎有助于回答指令中所要求的模式,而且可以被操纵或转移,以便改进跨数据集和模式的运行方式。我们发现,如果在复杂的情况下,工作提供了必要的步骤,如何在模式和模式中确定和模式内,则如何控制,如何稳定。
Article 110
Title@2025-07-02 (3): Probing Evaluation Awareness of Language Models
Title: Probing Evaluation Awareness of Language Models | Beurteilung des Kenntnisstands von Sprachmodellen | 检验对语文模式的评价意识 2507.01786v1 |
Authors (4): Jord Nguyen, Khiem Hoang, Carlo Leonardo Attubato, Felix Hofstätter
Language models can distinguish between testing and deployment phases – a capability known as evaluation awareness. This has significant safety and policy implications, potentially undermining the reliability of evaluations that are central to AI governance frameworks and voluntary industry commitments. In this paper, we study evaluation awareness in Llama-3.3-70B-Instruct. We show that linear probes can separate real-world evaluation and deployment prompts, suggesting that current models internally represent this distinction. We also find that current safety evaluations are correctly classified by the probes, suggesting that they already appear artificial or inauthentic to models. Our findings underscore the importance of ensuring trustworthy evaluations and understanding deceptive capabilities. More broadly, our work showcases how model internals may be leveraged to support blackbox methods in safety audits, especially for future models more competent at evaluation awareness and deception.
语言模型可以区分测试和部署阶段 – – 一种被称为评估意识的能力,这具有重大的安全和政策影响,有可能损害对AI治理框架和自愿行业承诺至关重要的评价的可靠性。在本文中,我们在Llama-3.3-70B-Instruct中研究评价意识。我们表明线性探测器可以区分真实世界的评价和部署提示,表明目前的模型在内部代表了这种区别。我们还发现,目前的安全评估被探测器正确地分类,表明它们已经看起来是人为的或对模型不真实的。我们的调查结果强调了确保可靠的评估和理解欺骗性能力的重要性。更广泛地说,我们的工作展示了如何利用内部模型来支持安全审计中的黑盒方法,特别是未来模型在评估意识和欺骗方面更能胜任的黑盒方法。
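A linear probe of the kind mentioned in the abstract above can be illustrated with a difference-of-means classifier on activation vectors. This is one simple way to fit a linear probe and is not necessarily the authors' training procedure; the 2-D activations are hypothetical toy data.

```python
def fit_mean_difference_probe(acts_a, acts_b):
    """Fit a linear probe: direction = mean(a) - mean(b), threshold at the midpoint."""
    mean_a = [sum(col) / len(acts_a) for col in zip(*acts_a)]
    mean_b = [sum(col) / len(acts_b) for col in zip(*acts_b)]
    direction = [x - y for x, y in zip(mean_a, mean_b)]
    midpoint = [(x + y) / 2 for x, y in zip(mean_a, mean_b)]
    bias = -sum(d * m for d, m in zip(direction, midpoint))
    return direction, bias

def probe_score(direction, bias, activation):
    """Positive score -> class a ("evaluation"), negative -> class b ("deployment")."""
    return sum(d * x for d, x in zip(direction, activation)) + bias

# Hypothetical activations for evaluation-style and deployment-style prompts.
eval_acts = [[2.0, 0.1], [1.8, -0.2], [2.2, 0.0]]
deploy_acts = [[-1.0, 0.2], [-1.2, -0.1], [-0.8, 0.1]]
direction, bias = fit_mean_difference_probe(eval_acts, deploy_acts)
```

If a single direction like this separates the two prompt types on held-out data, that is evidence the distinction is linearly represented in the model's activations.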
Article 111
Title@2025-07-02 (3): MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining
Title: MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining | MuRating: Ein qualitativ hochwertiger Datenauswahlansatz zur Mehrsprachigen Vorschulung großer Sprachmodelle | 词汇:多语言大语言模式预科培训的高质量数据选择方法 2507.01785v1 |
Authors (12): Zhixun Chen, Ping Guo, Wenhan Han, Yifan Zhang, Binbin Liu, Haobin Lin, Fengze Liu, Yan Zhao, Bingni Zhang, Taifeng Wang, Yin Zheng, Meng Fang
Data quality is a critical driver of large language model performance, yet existing model-based selection methods focus almost exclusively on English. We introduce MuRating, a scalable framework that transfers high-quality English data-quality signals into a single rater for 17 target languages. MuRating aggregates multiple English “raters” via pairwise comparisons to learn unified document-quality scores, then projects these judgments through translation to train a multilingual evaluator on monolingual, cross-lingual, and parallel text pairs. Applied to web data, MuRating selects balanced subsets of English and multilingual content to pretrain a 1.2B-parameter LLaMA model. Compared to strong baselines, including QuRater, AskLLM, and DCLM, our approach boosts average accuracy on both English benchmarks and multilingual evaluations, with especially large gains on knowledge-intensive tasks. We further analyze translation fidelity, selection biases, and underrepresentation of narrative material, outlining directions for future work.
数据质量是大语言模型业绩的关键驱动因素,但现有基于模式的选择方法几乎完全以英语为重点。我们引入了“模调”,这是一个可扩缩的框架,将高质量的英语数据质量信号转换成17种目标语言的单一调率器。通过对称比较,将多重英语“拉子”聚合在一起,学习统一的文件质量分数,然后通过翻译将这些判断投射为对单语、跨语言和平行文本对多语言评价员的培训。应用到网络数据,将选择的英语和多语言内容的平衡子集用于预演1.2 B-参数Lama模型。与强大的基线相比,包括Qurater、AskLLM、DCLM等,我们的方法提高了英语基准和多语言评价的平均准确性,特别是在知识密集型任务上取得的巨大成果。我们进一步分析翻译的忠贞、选择偏差和叙述材料的不足,并概述未来工作的方向。
Article 112
Title@2025-07-02 (3): Data interference: emojis, homoglyphs, and issues of data fidelity in corpora and their results
Title: Data interference: emojis, homoglyphs, and issues of data fidelity in corpora and their results | Dateninterferenzen: Emojis, Homoglyphen und Fragen der Datentreue in Korpora und deren Ergebnisse | 数据干扰:表象、同质词和公司的数据忠诚问题及其结果 2507.01764v1 |
Authors (1): Matteo Di Cristofaro
Tokenisation - “the process of splitting text into atomic parts” (Brezina & Timperley, 2017: 1) - is a crucial step for corpus linguistics, as it provides the basis for any applicable quantitative method (e.g. collocations) while ensuring the reliability of qualitative approaches. This paper examines how discrepancies in tokenisation affect the representation of language data and the validity of analytical findings: investigating the challenges posed by emojis and homoglyphs, the study highlights the necessity of preprocessing these elements to maintain corpus fidelity to the source data. The research presents methods for ensuring that digital texts are accurately represented in corpora, thereby supporting reliable linguistic analysis and guaranteeing the repeatability of linguistic interpretations. The findings emphasise the necessity of a detailed understanding of both linguistic and technical aspects involved in digital textual data to enhance the accuracy of corpus analysis, and have significant implications for both quantitative and qualitative approaches in corpus-based research.
“将文字分为原子部分的过程”(Brezina & Timperley, 2017年:1月1日)是本体语言学的一个关键步骤,因为它为任何适用的定量方法(如合用同一地点)提供了基础,同时确保质量方法的可靠性;本文审查了象征性化的差异如何影响语言数据的表述和分析性结论的有效性:调查模版和同质体带来的挑战,研究报告强调,必须预先处理这些要素,以保持对源数据的真实性;研究报告提出了确保数字文本在公司中得到准确反映的方法,从而支持可靠的语言分析,保证语言解释的可重复性;研究结果强调,必须详细了解数字文本数据所涉及的语言和技术方面,以提高物证分析的准确性,并对以物证为基础的研究的定量和定性方法都具有重大影响。
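The fidelity problem the abstract above raises can be seen directly with Python's standard `unicodedata` module. The sketch below shows a homoglyph that makes two visually identical tokens compare unequal, and an emoji with a skin-tone modifier that occupies two code points. The example strings are illustrative and not taken from the paper.

```python
import unicodedata

def describe_chars(text):
    """Return (character, Unicode name) pairs so look-alike characters become visible."""
    return [(ch, unicodedata.name(ch, "UNKNOWN")) for ch in text]

def non_latin_letters(token):
    """List letters in a token whose Unicode name does not start with 'LATIN'."""
    return [ch for ch in token
            if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN")]

latin = "paypal"
spoofed = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A

same_tokens = latin == spoofed          # False: the two strings differ in one code point
suspects = non_latin_letters(spoofed)   # the Cyrillic look-alike

# An emoji with a skin-tone modifier is two code points, so naive
# character-level tokenisation would split it into two "tokens".
thumbs = "\U0001F44D\U0001F3FD"
code_points = len(thumbs)
```

A frequency list built without such checks would count `latin` and `spoofed` as different types, which is the kind of distortion that preprocessing is meant to prevent.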
Article 113
Title@2025-07-02 (3): Tuning without Peeking: Provable Privacy and Generalization Bounds for LLM Post-Training
Title: Tuning without Peeking: Provable Privacy and Generalization Bounds for LLM Post-Training | Tuning ohne Peeking: Provable Privacy and Generalization Bounds for LLM Post-Training | 无足足迹的注资:LLM培训后可实现的隐私和普遍化的圈子 2507.01752v1 |
Authors (7): Ismail Labiad, Mathurin Videau, Matthieu Kowalski, Marc Schoenauer, Alessandro Leite, Julia Kempe, Olivier Teytaud
Gradient-based optimization is the workhorse of deep learning, offering efficient and scalable training via backpropagation. However, its reliance on large volumes of labeled data raises privacy and security concerns such as susceptibility to data poisoning attacks and the risk of overfitting. In contrast, black box optimization methods, which treat the model as an opaque function, relying solely on function evaluations to guide optimization, offer a promising alternative in scenarios where data access is restricted, adversarial risks are high, or overfitting is a concern. However, black box methods also pose significant challenges, including poor scalability to high-dimensional parameter spaces, as prevalent in large language models (LLMs), and high computational costs due to reliance on numerous model evaluations. This paper introduces BBoxER, an evolutionary black-box method for LLM post-training that induces an information bottleneck via implicit compression of the training data. Leveraging the tractability of information flow, we provide strong theoretical bounds on generalization, differential privacy, susceptibility to data poisoning attacks, and robustness to extraction attacks. BBoxER operates on top of pre-trained LLMs, offering a lightweight and modular enhancement suitable for deployment in restricted or privacy-sensitive environments, in addition to non-vacuous generalization guarantees. In experiments with LLMs, we demonstrate empirically that retrofitting methods are able to learn, showing how a few iterations of BBoxER improve performance and generalize well on a benchmark of reasoning datasets. This positions BBoxER as an attractive add-on on top of gradient-based optimization.
渐进式优化是深层次学习的一匹马,通过反向调整提供高效和可扩展的培训。然而,对大量标签数据的依赖引起了隐私和安全方面的关注,如容易发生数据中毒袭击和过度适应的风险。相比之下,黑盒优化方法将模型视为不透明功能,完全依靠功能评估来引导优化,在数据访问受限、对抗风险高或过度适应的情景中提供了一个有希望的替代方案。然而,黑盒方法也带来了重大挑战,包括大语言模型(LLLMS)中普遍存在的高维参数空间的可缩缩缩性差,以及依赖许多模型评估而导致的具有很高的计算成本。本文介绍了BBBoxER,这是LLMM后培训的进化黑盒方法,通过对培训数据进行隐蔽的压缩来造成信息瓶颈。 利用信息流动的可感动性,我们提供了关于一般信息污染攻击的广度、差异性隐私、易感受数据中毒攻击的易感力和强力的理论界限。BoxER公司在经过培训的LMMS顶端上操作,提供了一种不易变的精度的精度的精度的精度的精度测试方法,为我们在一般测试中展示的精度的精度的精度的精度和模度实验环境中展示的精度的精度的精度,以展示了一种不细度的精度。
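For readers unfamiliar with evolutionary black-box optimization, the sketch below shows a generic (1+1) evolution strategy that uses only function evaluations. The abstract above does not say which evolutionary algorithm BBoxER uses, so this is a textbook stand-in with a hypothetical toy loss, not the paper's method.

```python
import random

def one_plus_one_es(loss, x0, sigma=0.5, budget=200, seed=0):
    """(1+1) evolution strategy: keep a mutated candidate only if its loss is not worse."""
    rng = random.Random(seed)
    best, best_loss = list(x0), loss(x0)
    for _ in range(budget):
        candidate = [x + rng.gauss(0.0, sigma) for x in best]
        candidate_loss = loss(candidate)
        if candidate_loss <= best_loss:
            best, best_loss = candidate, candidate_loss
    return best, best_loss

def toy_loss(params):
    # Stand-in for a black-box evaluation score; the optimizer only sees this number.
    return sum((p - 1.0) ** 2 for p in params)

start = [0.0, 0.0, 0.0]
best, best_loss = one_plus_one_es(toy_loss, start)
```

Each iteration uses the data only through the outcome of one comparison. That gives some intuition for how such a method can act as an information bottleneck, though the formal bounds are the paper's own.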
Article 114
Title@2025-07-02 (3): ECCV 2024 W-CODA: 1st Workshop on Multimodal Perception and Comprehension of Corner Cases in Autonomous Driving
Title: ECCV 2024 W-CODA: 1st Workshop on Multimodal Perception and Comprehension of Corner Cases in Autonomous Driving | ECCV 2024 W-CODA: 1. Workshop zur multimodalen Wahrnehmung und Verständlichkeit von Eckfällen im autonomen Fahren | ECCV 2024 W-CODA:第一次关于自主驾驶时对拐角案例的多模式认识和了解的讲习班 2507.01735v1 |
Authors (14): Kai Chen, Ruiyuan Gao, Lanqing Hong, Hang Xu, Xu Jia, Holger Caesar, Dengxin Dai, Bingbing Liu, Dzmitry Tsishkou, Songcen Xu, Chunjing Xu, Qiang Xu, Huchuan Lu, Dit-Yan Yeung
In this paper, we present details of the 1st W-CODA workshop, held in conjunction with ECCV 2024. W-CODA aims to explore next-generation solutions for autonomous driving corner cases, empowered by state-of-the-art multimodal perception and comprehension techniques. Five speakers from both academia and industry are invited to share their latest progress and opinions. We collect research papers and hold a dual-track challenge, including both corner case scene understanding and generation. As a pioneering effort, we will continuously bridge the gap between frontier autonomous driving techniques and fully intelligent, reliable self-driving agents robust towards corner cases.
在本文中,我们介绍了与ECCV 2024合作举办的第一次W-CODA讲习班的详细情况。W-CODA旨在探索下一代自主驾驶角落案例的解决办法,这种解决办法得到最先进的多式联运观念和理解技术的扶持。5名学术界和工业界的发言者应邀分享最新进展和意见。我们收集研究论文并面临双轨挑战,包括边角案场理解和生成。作为开创性努力,我们将不断弥合边际自主驾驶技术与全智能、可靠的自我驾驶代理人之间的鸿沟。
Article 115
Title@2025-07-02 (3): LLMs for Legal Subsumption in German Employment Contracts
Title: LLMs for Legal Subsumption in German Employment Contracts | LLMs für rechtliche Subsumption in deutschen Arbeitsverträgen | 德国就业合同法律补贴LLM 2507.01734v1 |
Authors (2): Oliver Wardas, Florian Matthes
Legal work, characterized by its text-heavy and resource-intensive nature, presents unique challenges and opportunities for NLP research. While data-driven approaches have advanced the field, their lack of interpretability and trustworthiness limits their applicability in dynamic legal environments. To address these issues, we collaborated with legal experts to extend an existing dataset and explored the use of Large Language Models (LLMs) and in-context learning to evaluate the legality of clauses in German employment contracts. Our work evaluates the ability of different LLMs to classify clauses as “valid,” “unfair,” or “void” under three legal context variants: no legal context, full-text sources of laws and court rulings, and distilled versions of these (referred to as examination guidelines). Results show that full-text sources moderately improve performance, while examination guidelines significantly enhance recall for void clauses and weighted F1-Score, reaching 80%. Despite these advancements, LLMs’ performance when using full-text sources remains substantially below that of human lawyers. We contribute an extended dataset, including examination guidelines, referenced legal sources, and corresponding annotations, alongside our code and all log files. Our findings highlight the potential of LLMs to assist lawyers in contract legality review while also underscoring the limitations of the methods presented.
以案文重和资源密集型性质为特征的法律工作为特征,为国家劳工合同研究提供了独特的挑战和机遇。虽然以数据为驱动的方法已经推进了该领域,但其缺乏可解释性和可信度限制了其在动态法律环境中的适用性。为了解决这些问题,我们与法律专家合作,扩展了现有的数据集,并探索了使用大语言模型(LLMS)和内文学习来评价德国就业合同条款的合法性。我们的工作评价了不同LMS将条款归类为“有效”、“不公平”或“避免”三个法律背景变式的能力:没有法律背景、法律和法院裁决的全文来源以及这些变式的精炼版(称为检查准则)。结果显示,全文来源略有改进了业绩,同时检查准则大大加强了对无效条款的回顾和加权F1-核心的回顾,达到80。尽管取得了这些进步,但LMS在使用全文来源时的表现仍然大大低于人类律师。我们贡献了扩大的数据集,包括审查准则、参考的法律来源和相应的说明,以及同我们的代码和记录文件一样,还协助了法律系统中律师们审查的合法性。
Article 116
Title@2025-07-02 (3): Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models
Title: Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models | Unified Triplet-Level Halluzination Evaluation für große Vision-Sprache Modelle | 大型视觉语言模型统一三维级幻觉评价 2410.23114v3 |
Authors (4): Junjie Wu, Tsz Ting Chung, Kai Chen, Dit-Yan Yeung
Despite the outstanding performance in vision-language reasoning, Large Vision-Language Models (LVLMs) might generate hallucinated contents that do not exist in the given image. Most existing LVLM hallucination benchmarks are constrained to evaluate the object-related hallucinations. However, the potential hallucination on the relations between two objects, i.e., relation hallucination, still lacks investigation. To remedy that, we design a unified framework to measure the object and relation hallucination in LVLMs simultaneously. The core idea of our framework is to evaluate hallucinations via (object, relation, object) triplets extracted from LVLMs’ responses, making it easily generalizable to different vision-language tasks. Based on our framework, we further introduce Tri-HE, a novel Triplet-level Hallucination Evaluation benchmark which can be used to study both object and relation hallucination at the same time. With comprehensive evaluations on Tri-HE, we observe that the relation hallucination issue is even more serious than object hallucination among existing LVLMs, highlighting a previously neglected problem towards reliable LVLMs. Moreover, based on our findings, we design a simple training-free approach that effectively mitigates hallucinations for LVLMs. Our dataset and code for the reproduction of our experiments are available publicly at https://github.com/wujunjie1998/Tri-HE.
尽管在视觉语言推理方面表现出色,大型视觉语言模型(LVLM)可能会产生在特定图像中不存在的幻觉内容。大多数现有的LVLM幻觉基准都不得不评估与目标有关的幻觉。然而,对两个对象之间的关系的潜在幻觉,即关系幻觉,仍然缺乏调查。为了纠正这一点,我们设计了一个统一框架,以同时测量LVLMs中的对象和关系幻觉。我们框架的核心思想是通过LVMs答复中提取的幻觉(对象、关系、对象)三重幻觉,使LVLMS易于将其推广到不同的视觉语言任务中。此外,我们根据我们的框架,进一步引入了Tri-HE,一个全新的三重幻觉评价标准,可以同时用于研究对象和关系幻觉。在对Tri-HE的全面评价中,我们观察到,与LVLMMs之间的关系比目标幻觉问题更为严重,突出了以前被忽视的LVLMS的问题。此外,我们根据我们的研究结果,设计了一个简单的培训/MLVMS/J 有效减少我们现有的数据。
Article 117
Title@2025-07-02 (3): Stereotype Detection as a Catalyst for Enhanced Bias Detection: A Multi-Task Learning Approach
Title: Stereotype Detection as a Catalyst for Enhanced Bias Detection: A Multi-Task Learning Approach | Stereotyp-Erkennung als Katalysator für verbesserte Bias-Erkennung: Ein Multi-Task-Lernansatz | 作为强化比亚斯探测催化剂的陈规定型观念探测:多任务学习方法 2507.01715v1 |
Authors (3): Aditya Tomar, Rudra Murthy, Pushpak Bhattacharyya
Bias and stereotypes in language models can cause harm, especially in sensitive areas like content moderation and decision-making. This paper addresses bias and stereotype detection by exploring how jointly learning these tasks enhances model performance. We introduce StereoBias, a unique dataset labeled for bias and stereotype detection across five categories: religion, gender, socio-economic status, race, profession, and others, enabling a deeper study of their relationship. Our experiments compare encoder-only models and fine-tuned decoder-only models using QLoRA. While encoder-only models perform well, decoder-only models also show competitive results. Crucially, joint training on bias and stereotype detection significantly improves bias detection compared to training them separately. Additional experiments with sentiment analysis confirm that the improvements stem from the connection between bias and stereotypes, not multi-task learning alone. These findings highlight the value of leveraging stereotype information to build fairer and more effective AI systems.
语言模式中的偏见和陈规定型可能会造成损害,特别是在内容温和和决策等敏感领域。本文件通过探讨共同学习这些任务如何提高模范业绩,处理偏见和陈规定型观念的发现。我们引入了Stereo Bias,这是一个独特的数据集,标记为偏见和在宗教、性别、社会经济地位、种族、专业等五类中进行陈规定型的发现,从而能够更深入地研究它们之间的关系。我们的实验比较了只使用QLORA的编码器模型和微调的解密模式。在只使用编码器的模型运行良好的同时,只使用编码器的模型也显示了竞争性效果。关于偏见和陈规定型观念的发现的联合培训与单独培训相比,极大地改善了对偏见的发现。通过情感分析进行的额外实验证实,这些改进来自偏见和陈规定型观念之间的联系,而不是单靠多任务学习。这些实验强调了利用陈规定型观念信息来建立更公平和更有效的人工智能系统的价值。
Article 118
Title@2025-07-02 (3): AdamMeme: Adaptively Probe the Reasoning Capacity of Multimodal Large Language Models on Harmfulness
Title: AdamMeme: Adaptively Probe the Reasoning Capacity of Multimodal Large Language Models on Harmfulness | AdamMeme: Adaptiv die Vernunft von multimodalen großen Sprachmodellen auf die Schädlichkeit untersuchen | AdamMememe:适应性预测关于协调性的多模式大语言模型的理性能力 2507.01702v1 |
Authors (8): Zixin Chen, Hongzhan Lin, Kaixin Li, Ziyang Luo, Zhen Ye, Guang Chen, Zhiyong Huang, Jing Ma
The proliferation of multimodal memes in the social media era demands that multimodal Large Language Models (mLLMs) effectively understand meme harmfulness. Existing benchmarks for assessing mLLMs on harmful meme understanding rely on accuracy-based, model-agnostic evaluations using static datasets. These benchmarks are limited in their ability to provide up-to-date and thorough assessments, as online memes evolve dynamically. To address this, we propose AdamMeme, a flexible, agent-based evaluation framework that adaptively probes the reasoning capabilities of mLLMs in deciphering meme harmfulness. Through multi-agent collaboration, AdamMeme provides comprehensive evaluations by iteratively updating the meme data with challenging samples, thereby exposing specific limitations in how mLLMs interpret harmfulness. Extensive experiments show that our framework systematically reveals the varying performance of different target mLLMs, offering in-depth, fine-grained analyses of model-specific weaknesses. Our code is available at https://github.com/Lbotirx/AdamMeme.
在社交媒体时代,多式联运大语言模型的泛滥要求多式大语言模型(MLLMs)有效理解Meme的危害性;评估有害Meme理解的MLLMs的现有基准依赖于使用静态数据集进行基于准确性、模型和不可知性的评价;这些基准在提供最新和彻底评估的能力方面是有限的,因为在线Memes正在动态地演变。为了解决这个问题,我们提议AdamMeme,这是一个灵活、基于代理人的评价框架,通过多试剂合作,对MLLLMs的推理能力进行适应性地探测,破译Meme的危害性。通过多种试剂合作,AdamMeme提供了全面的评价,以具有挑战性的样本迭代更新Meme数据,从而暴露了MLLMs如何解释有害性的具体限制。广泛的实验表明,我们的框架系统地揭示了不同目标MLLMs的不同性能,提供了对模型特定弱点的深度和精细分析。我们的代码可在https://github.com/Lbotirx/AdamMeme上查阅。
Article 119
Title@2025-07-02 (3): Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling
Title: Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling | Mischen von Supervised und Verstärkung Feintuning mit Präfix-Sampling | 与前缀抽样混合监管和强化精细推荐 2507.01679v1 |
Authors (7): Zeyu Huang, Tianhao Cheng, Zihan Qiu, Zili Wang, Yinghui Xu, Edoardo M. Ponti, Ivan Titov
Existing post-training techniques for large language models are broadly categorized into Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT). Each paradigm presents a distinct trade-off: SFT excels at mimicking demonstration data but can lead to problematic generalization as a form of behavior cloning. Conversely, RFT can significantly enhance a model’s performance but is prone to learn unexpected behaviors, and its performance is highly sensitive to the initial policy. In this paper, we propose a unified view of these methods and introduce Prefix-RFT, a hybrid approach that synergizes learning from both demonstration and exploration. Using mathematical reasoning problems as a testbed, we empirically demonstrate that Prefix-RFT is both simple and effective. It not only surpasses the performance of standalone SFT and RFT but also outperforms parallel mixed-policy RFT methods. A key advantage is its seamless integration into existing open-source frameworks, requiring only minimal modifications to the standard RFT pipeline. Our analysis highlights the complementary nature of SFT and RFT, and validates that Prefix-RFT effectively harmonizes these two learning paradigms. Furthermore, ablation studies confirm the method’s robustness to variations in the quality and quantity of demonstration data. We hope this work offers a new perspective on LLM post-training, suggesting that a unified paradigm that judiciously integrates demonstration and exploration could be a promising direction for future research.
大型语言模型的现有培训后技术大致分为监督性美食(SFT)和强化性美食(RFT)等。每种模式都有不同的权衡:SFT在模拟示范数据方面十分出色,但可能导致作为行为克隆的一种形式出现问题。相反,RFT可以显著提高模型的性能,但容易了解出乎意料的行为,其业绩对最初的政策非常敏感。在本文件中,我们建议统一对这些方法的看法,并引入Prefix-RFT,这是一种混合方法,可以同时从示范和探索中学习。我们用数学推理问题作为测试台,从经验上证明,Prefix-RFT既简单又有效,不仅会导致问题化为行为克隆的一种形式。它不仅超越了独立的SFT和RFT的表现,而且超越了平行的混合政策RFT方法。一个关键优势是,它与现有的开放源框架紧密结合,只需要对标准RFT管道进行最低限度的修改。我们的分析强调了SFT和RFT的互补性,并证实Pref-RFT在测试中都有效地统一了一种稳定性研究方向,从而验证出一种未来质量变化的示范方法,从而验证了这种在质量上展示性模型中可以证实一种可靠的示范性研究的新的质量分析。
Article 120
Title@2025-07-02 (3): On the Fundamental Impossibility of Hallucination Control in Large Language Models
Title: On the Fundamental Impossibility of Hallucination Control in Large Language Models | Über die grundsätzliche Unmöglichkeit der Halluzinationskontrolle in großen Sprachmodellen | 关于大语言模型中幻听控制的基本不可能性 2506.06382v2 |
Authors (1): Michał P. Karpowicz
We prove that perfect hallucination control in large language models is mathematically impossible. No LLM inference mechanism can simultaneously achieve truthful response generation, semantic information conservation, relevant knowledge revelation, and knowledge-constrained optimality. This impossibility is fundamental, arising from the mathematical structure of information aggregation itself rather than engineering limitations. The proof spans three mathematical frameworks: auction theory, proper scoring theory for probabilistic predictions, and log-sum-exp analysis for transformer architectures. In each setting, we demonstrate that information aggregation creates unavoidable violations of conservation principles. The Jensen gap in transformer probability aggregation provides a direct measure of this impossibility. These results reframe hallucination from an engineering bug to an inevitable mathematical feature of distributed intelligence. There are fundamental trade-offs between truthfulness, knowledge utilization, and response completeness, providing principled foundations for managing rather than eliminating hallucination. This work reveals deep connections between neural network inference, philosophy of knowledge and reasoning, and classical results in game theory and information theory, opening new research directions for developing beneficial AI systems within mathematical constraints.
我们证明大型语言模型的完美幻觉控制在数学上是不可能的。 没有LLM 推论机制能够同时实现真实的反应生成、语义信息保护、相关知识披露和知识限制的最佳性。 这种不可能性是根本的,产生于信息集成本身的数学结构而不是工程限制。 证据包含三个数学框架:拍卖理论、概率预测的适当评分理论以及变压器结构的日志和参数分析。 在每种情况下,我们证明信息汇总都不可避免地违反了保护原则。 变压器概率汇总中的Jensen差距提供了这种不可能性的直接衡量。 这些结果是将幻觉从工程错误重新定位为分布式情报的不可避免的数学特征。 在真实性、知识利用和反应完整性之间有着基本的权衡,为管理而不是消除幻觉提供了原则基础。 这项工作揭示了神经网络的推论、知识和推理哲学以及游戏理论和信息理论的经典结果之间的密切联系,为在数学限制范围内开发有益的AI系统开辟了新的研究方向。
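The abstract above refers to a Jensen gap arising in log-sum-exp aggregation. The sketch below is a generic illustration of that quantity, assuming the textbook form of the gap (the weighted log-sum-exp of a set of scores minus their weighted mean); it is not the paper's own definition, and the function names are hypothetical:

```python
import math


def log_sum_exp(values, weights):
    """Numerically stable log(sum_i w_i * exp(v_i))."""
    m = max(values)
    return m + math.log(sum(w * math.exp(v - m) for v, w in zip(values, weights)))


def jensen_gap(values, weights):
    """Weighted log-sum-exp minus weighted mean; non-negative by Jensen's inequality.

    Assumes the weights are non-negative and sum to 1.
    """
    mean = sum(w * v for v, w in zip(values, weights))
    return log_sum_exp(values, weights) - mean


# Identical scores: aggregation loses nothing, so the gap is (numerically) zero.
gap_equal = jensen_gap([1.0, 1.0, 1.0], [1 / 3, 1 / 3, 1 / 3])

# Disagreeing scores: the gap is strictly positive.
gap_spread = jensen_gap([0.0, 2.0], [0.5, 0.5])
```

The gap grows as the aggregated scores disagree more, which is the sense in which it can serve as a measure of how much aggregation departs from simple averaging.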
Article 121
Title@2025-07-02 (3): Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions
Title: Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions | Achtung vor der Umwelt: Multimodale Substanzen sind für Umweltbeeinträchtigungen empfänglich | 注意环境:多式制剂可被环境灾害所接受 2408.02544v2 |
Authors (7): Xinbei Ma, Yiting Wang, Yao Yao, Tongxin Yuan, Aston Zhang, Zhuosheng Zhang, Hai Zhao
This paper investigates the faithfulness of multimodal large language model (MLLM) agents in a graphical user interface (GUI) environment, aiming to address the research question of whether multimodal GUI agents can be distracted by environmental context. A general scenario is proposed where both the user and the agent are benign, and the environment, while not malicious, contains unrelated content. A wide range of MLLMs are evaluated as GUI agents using a simulated dataset, following three working patterns with different levels of perception. Experimental results reveal that even the most powerful models, whether generalist agents or specialist GUI agents, are susceptible to distractions. While recent studies predominantly focus on the helpfulness of agents, our findings first indicate that these agents are prone to environmental distractions. Furthermore, we implement an adversarial environment injection and analyze the approach to improve faithfulness, calling for a collective focus on this important topic.
本文件调查了多式大语言模型(MLLM)代理商在图形用户界面环境中的忠实性,目的是解决多式联运界面代理商能否因环境背景而分散注意力的研究问题,提出了用户和代理商都是良性的,环境虽然不是恶意的,但含有无关的内容的一般设想。许多多式大语言模型代理商都使用模拟数据集作为图形界面代理商进行评价,并遵循不同层次的三种工作模式。实验结果显示,即使是最强大的模型,无论是通才代理商还是专业界面代理商,也很容易受到干扰。虽然最近的研究主要侧重于代理人的用处,但我们首先发现这些代理商容易引起环境的分心。此外,我们实施对抗性环境注入,分析提高忠实性的方法,要求集体关注这一重要议题。
Article 122
Title@2025-07-02 (3): Adapting Language Models to Indonesian Local Languages: An Empirical Study of Language Transferability on Zero-Shot Settings
Title: Adapting Language Models to Indonesian Local Languages: An Empirical Study of Language Transferability on Zero-Shot Settings | Anpassung von Sprachmodellen an indonesische lokale Sprachen: Eine empirische Studie zur Übertragbarkeit von Sprache auf Null-Schuss-Einstellungen | 调整语言模式以适应印度尼西亚当地语言:零热设置的语言可转让性经验研究 2507.01645v1 |
Authors (1): Rifki Afina Putri
In this paper, we investigate the transferability of pre-trained language models to low-resource Indonesian local languages through the task of sentiment analysis. We evaluate both zero-shot performance and adapter-based transfer on ten local languages using models of different types: a monolingual Indonesian BERT, multilingual models such as mBERT and XLM-R, and a modular adapter-based approach called MAD-X. To better understand model behavior, we group the target languages into three categories: seen (included during pre-training), partially seen (not included but linguistically related to seen languages), and unseen (absent and unrelated in pre-training data). Our results reveal clear performance disparities across these groups: multilingual models perform best on seen languages, moderately on partially seen ones, and poorly on unseen languages. We find that MAD-X significantly improves performance, especially for seen and partially seen languages, without requiring labeled data in the target language. Additionally, we conduct a further analysis on tokenization and show that while subword fragmentation and vocabulary overlap with Indonesian correlate weakly with prediction quality, they do not fully explain the observed performance. Instead, the most consistent predictor of transfer success is the model’s prior exposure to the language, either directly or through a related language.
在本文中,我们通过情绪分析任务,调查培训前语言模式向低资源印度尼西亚当地语言的可转让性。我们评估了10种当地语言的零点性能和基于适应器的转让,使用了不同类型模式:单语印度尼西亚语BERT、MBERT和XLM-R等多种模式,以及模块化适应器法,称为MAD-X。为了更好地理解模式行为,我们将目标语言分为三类:看到(在培训前包含),部分被看到(在语言上与所见语言有关),部分被看到(在语言上与所见语言有关),以及看不见(在培训前数据中存在和不相关)。我们的结果表明,这些群体的明显性能差异:多语言模式在所见的语言上表现最好,部分被看到的语言上稍有差异,而隐蔽语言上差。我们发现,MAD-X显著地改进了绩效,特别是外观语言和部分被看到的语言。我们进一步分析了象征性化,并表明,虽然子拼写和词汇与印度尼西亚语言的相关性较弱,但它们并没有充分解释所观察到的绩效。相反,通过先前语言直接预测或成功是成功的模式。
Article 123
Title@2025-07-02 (3): Confidence and Stability of Global and Pairwise Scores in NLP Evaluation
Title: Confidence and Stability of Global and Pairwise Scores in NLP Evaluation | Vertrauen und Stabilität von Global und Pairwise Scores in NLP-Evaluation | 国家劳工规划评价中全球和对等分数和对等分数的可信度和稳定性 2507.01633v1 |
Authors (2): Georgii Levtsov, Dmitry Ustalov
With the advent of highly capable instruction-tuned neural language models, benchmarking in natural language processing (NLP) is increasingly shifting towards pairwise comparison leaderboards, such as LMSYS Arena, from traditional global pointwise scores (e.g., GLUE, BIG-bench, SWE-bench). This paper empirically investigates the strengths and weaknesses of both global scores and pairwise comparisons to aid decision-making in selecting appropriate model evaluation strategies. Through computational experiments on synthetic and real-world datasets using standard global metrics and the popular Bradley-Terry model for pairwise comparisons, we found that while global scores provide more reliable overall rankings, they can underestimate strong models with rare, significant errors or low confidence. Conversely, pairwise comparisons are particularly effective for identifying strong contenders among models with lower global scores, especially where quality metrics are hard to define (e.g., text generation), though they require more comparisons to converge if ties are frequent. Our code and data are available at https://github.com/HSPyroblast/srw-ranking under a permissive license.
随着高度能干、调导神经语言模型的出现,自然语言处理基准(NLP)日益转向双向比较带头板,如LMSYS Arena,由传统的全球分数(例如GLUE、BIG-bench、SWE-bench)产生,本文从经验上调查了全球分数的优缺点,并进行了对等比较,以协助决策选择适当的评价战略模式。通过利用标准的全球指标和流行的布拉德-Tery模型对合成和真实世界数据集进行计算实验,我们发现,虽然全球分数提供了更可靠的总体排名,但它们可能低估了强型模型,而少见、严重错误或信心低。相反,对等比较对于确定全球分数较低的模型的强势竞争者特别有效,特别是在质量指标难以确定(例如文本生成)的情况下,但如果联系频繁,则需要进行更多的比较。我们的代码和数据可在https://github.com/HSPyroblast/srw级别上查阅。
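The abstract above uses the Bradley-Terry model to turn pairwise comparisons into a ranking. As a minimal sketch, assuming a win-count matrix with no ties and the standard minorization-maximization update for the maximum-likelihood strengths (the function name and the example counts are illustrative, not taken from the authors' repository):

```python
def fit_bradley_terry(wins, iterations=200):
    """Estimate Bradley-Terry strengths from wins[i][j] = times model i beat model j.

    Uses the classic MM update p_i <- W_i / sum_j n_ij / (p_i + p_j), then
    renormalizes so the strengths sum to 1. Assumes at least one recorded win.
    """
    n = len(wins)
    strengths = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom else strengths[i])
        norm = sum(updated)
        strengths = [s / norm for s in updated]
    return strengths


# Two hypothetical models: model 0 wins 3 of 4 comparisons.
two_models = fit_bradley_terry([[0, 3], [1, 0]])

# Three hypothetical models in a balanced round-robin of 10 comparisons per pair.
three_models = fit_bradley_terry([[0, 8, 9], [2, 0, 7], [1, 3, 0]])
```

Under the model, the probability that model i beats model j is strengths[i] / (strengths[i] + strengths[j]), so the fitted values give both a ranking and pairwise win probabilities. Ties, which the abstract notes slow convergence, would need an extension that this sketch does not cover.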
Article 124
Title@2025-07-02 (3): Chart Question Answering from Real-World Analytical Narratives
Title: Chart Question Answering from Real-World Analytical Narratives | Diagramm Frage-Antworten von Real-World Analytical Narratives | 从真实世界分析叙述中回答的图表问题 2507.01627v1 |
Authors (5): Maeve Hutchinson, Radu Jianu, Aidan Slingsby, Jo Wood, Pranava Madhyastha
We present a new dataset for chart question answering (CQA) constructed from visualization notebooks. The dataset features real-world, multi-view charts paired with natural language questions grounded in analytical narratives. Unlike prior benchmarks, our data reflects ecologically valid reasoning workflows. Benchmarking state-of-the-art multimodal large language models reveals a significant performance gap, with GPT-4.1 achieving an accuracy of 69.3%, underscoring the challenges posed by this more authentic CQA setting.
我们用可视化笔记本制作的图表问答(CQA)提供了一套新的数据集。 数据集以真实世界、多视图图表和基于分析叙述的自然语言问题为主。 与以往的基准不同,我们的数据反映了生态上有效的推理工作流程。 最先进的多式联运大语言模型基准显示显著的绩效差距,GPT-4.1的精确度达到69.3%,凸显了这种更真实的CQA设置带来的挑战。
Article 125
Title@2025-07-02 (3): Developing ChemDFM as a large language foundation model for chemistry
Title: Developing ChemDFM as a large language foundation model for chemistry | ChemDFM als großes Sprach-Grundmodell für die Chemie entwickeln | 开发化学化学化学化学成像模型,将其作为一个大型语言基础化学模型 2401.14818v6 |
Authors (14): Zihan Zhao, Da Ma, Lu Chen, Liangtai Sun, Zihao Li, Yi Xia, Bo Chen, Hongshen Xu, Zichen Zhu, Su Zhu, Shuai Fan, Guodong Shen, Kai Yu, Xin Chen
Artificial intelligence (AI) has played an increasingly important role in chemical research. However, most models currently used in chemistry are specialist models that require training and tuning for specific tasks. A more generic and efficient solution would be an AI model that could address many tasks and support free-form dialogue in the broad field of chemistry. In its utmost form, such a generalist AI chemist could be referred to as Chemical General Intelligence. Large language models (LLMs) have recently logged tremendous success in the general domain of natural language processing, showing emerging task generalization and free-form dialogue capabilities. However, domain knowledge of chemistry is largely missing when training general-domain LLMs. The lack of such knowledge greatly hinders the performance of generalist LLMs in the field of chemistry. To this end, we develop ChemDFM, a pioneering LLM for chemistry trained on 34B tokens from chemical literature and textbooks, and fine-tuned using 2.7M instructions. As a result, it can understand and reason with chemical knowledge in free-form dialogue. Quantitative evaluations show that ChemDFM significantly surpasses most representative open-source LLMs. It outperforms GPT-4 on a great portion of chemical tasks, despite the substantial size difference. We have open-sourced the inference codes, evaluation datasets, and model weights of ChemDFM on Huggingface (https://huggingface.co/OpenDFM/ChemDFM-v1.0-13B).
然而,目前用于化学的多数模型都是需要培训和调整具体任务的专家模型。一个更通用、更高效的解决方案是AI模型,可以处理许多任务,支持广泛的化学领域的自由形式对话。最起码的形式是,这样一个泛泛的AI化学化学学家可以被称为化学一般情报。大型语言模型(LLLMS)最近在自然语言处理的一般领域取得了巨大成功,显示了正在形成的任务一般化和自由形式对话能力。然而,在培训普通的1.0 LLMS时,化学的域知识基本上缺乏。这种知识的缺乏将极大地妨碍普通的LMS在化学领域的业绩。为此,我们开发了ChemDFM,这是在化学文献和教科书的34B类标语上培训的先驱性LMM,并且使用2.7M的指示加以微调。结果,它在自由形式对话中可以理解和理解化学知识。定量评估表明,ChemDFM大大超过最具代表性的开放源LMS。它超越了一般LMS的开源能力。它超越了GPDM的深度,在化学评估中,在深度的深度数据上比重。
Article 126
Title@2025-07-02 (3): Data Agent: A Holistic Architecture for Orchestrating Data+AI Ecosystems
Title: Data Agent: A Holistic Architecture for Orchestrating Data+AI Ecosystems | Data Agent: Eine ganzheitliche Architektur für die Orchestrierung von Daten+AI-Ökosystemen | 数据代号:一个用于管弦化数据+AI生态系统的综合结构 2507.01599v1 |
Authors (5): Zhaoyan Sun, Jiayi Wang, Xinyang Zhao, Jiachi Wang, Guoliang Li
Traditional Data+AI systems utilize data-driven techniques to optimize performance, but they rely heavily on human experts to orchestrate system pipelines, enabling them to adapt to changes in data, queries, tasks, and environments. For instance, while there are numerous data science tools available, developing a pipeline planning system to coordinate these tools remains challenging. This difficulty arises because existing Data+AI systems have limited capabilities in semantic understanding, reasoning, and planning. Fortunately, we have witnessed the success of large language models (LLMs) in enhancing semantic understanding, reasoning, and planning abilities. It is crucial to incorporate LLM techniques to revolutionize data systems for orchestrating Data+AI applications effectively. To achieve this, we propose the concept of a ‘Data Agent’ - a comprehensive architecture designed to orchestrate Data+AI ecosystems, which focuses on tackling data-related tasks by integrating knowledge comprehension, reasoning, and planning capabilities. We delve into the challenges involved in designing data agents, such as understanding data/queries/environments/tools, orchestrating pipelines/workflows, optimizing and executing pipelines, and fostering pipeline self-reflection. Furthermore, we present examples of data agent systems, including a data science agent, data analytics agents (such as unstructured data analytics agent, semantic structured data analytics agent, data lake analytics agent, and multi-modal data analytics agent), and a database administrator (DBA) agent. We also outline several open challenges associated with designing data agent systems.
传统数据+AI系统利用数据驱动技术优化性能,但是它们严重依赖人类专家来协调系统管道,使其能够适应数据、查询、任务和环境的变化。例如,尽管存在许多数据科学工具,但开发一个管道规划系统以协调这些工具仍然具有挑战性。这一困难的产生是因为现有的数据+AI系统在语义理解、推理和规划方面能力有限。幸运的是,我们看到大型语言模型(LLLMs)成功地提高了语义理解、推理和规划能力。必须采用LLM技术,使数据系统革命化,以有效地协调数据+AI应用。为此,我们提出了“Data Agent”概念,这是一个旨在协调数据+AI生态系统的综合结构,侧重于通过综合知识理解、推理和规划能力处理与数据有关的任务。我们深入探讨了设计数据代理(例如理解数据/查询/环境/工具/工具、管管管管道/工作、优化和执行管道,以及促进管道系统革命化数据系统本身。我们提出了“DataAA”概念化结构数据代理机构数据库的例子,包括数据代理机构的不结构、数据代理机构、多级数据结构、数据代理机构、多级数据系统、数据结构、数据代理机构、多级数据代理机构、数据结构等示例。
Article 127
Title@2025-07-02 (3): T3DM: Test-Time Training-Guided Distribution Shift Modelling for Temporal Knowledge Graph Reasoning
Title: T3DM: Test-Time Training-Guided Distribution Shift Modelling for Temporal Knowledge Graph Reasoning | T3DM: Test-Time Training-Guided Distribution Shift Modellierung für zeitliche Wissensdiagramm-Reasoning | T3DM: 试验时间培训指导分布分布变化模型,用于时间知识图表推理 2507.01597v1 |
Authors (4): Yuehang Si, Zefan Zeng, Jincai Huang, Qing Cheng
Temporal Knowledge Graph (TKG) is an efficient method for describing the dynamic development of facts along a timeline. Most research on TKG reasoning (TKGR) focuses on modelling the repetition of global facts and designing patterns of local historical facts. However, they face two significant challenges: inadequate modeling of the event distribution shift between training and test samples, and reliance on random entity substitution for generating negative samples, which often results in low-quality sampling. To this end, we propose a novel distributional feature modeling approach for training TKGR models, Test-Time Training-guided Distribution shift Modelling (T3DM), to adjust the model based on distribution shift and ensure the global consistency of model reasoning. In addition, we design a negative-sampling strategy to generate higher-quality negative quadruples based on adversarial training. Extensive experiments show that T3DM provides better and more robust results than the state-of-the-art baselines in most cases.
时间知识图(TKG)是描述沿时间线动态发展事实的有效方法。关于TKG推理(TKGR)的研究大多侧重于模拟全球事实的重复和设计当地历史事实的模式。然而,它们面临两大挑战:培训与测试样品之间事件分布变化的模型不完善,以及依靠随机实体替代生成负面样本,这往往导致低质量取样。为此,我们提议采用新的分配特征模型模型方法,用于培训TKGR模型、测试时间培训指导的分布模型(T3DM),以根据分布变化调整模型,确保模型推理的全球一致性。此外,我们设计了负面抽样战略,以产生基于对抗性培训的更高质量负四重。广泛的实验表明,T3DM在多数情况下提供比最先进的基线更好、更可靠的结果。
Article 128
Title@2025-07-02 (3): Emotionally Intelligent Task-oriented Dialogue Systems: Architecture, Representation, and Optimisation
Title: Emotionally Intelligent Task-oriented Dialogue Systems: Architecture, Representation, and Optimisation | Emotional intelligente, aufgabenorientierte Dialogsysteme: Architektur, Repräsentation und Optimierung | 以任务为导向的对话系统:结构、代表性和优化 2507.01594v1 |
Authors (8): Shutong Feng, Hsien-chin Lin, Nurul Lubis, Carel van Niekerk, Michael Heck, Benjamin Ruppik, Renato Vukovic, Milica Gašić
Task-oriented dialogue (ToD) systems are designed to help users achieve specific goals through natural language interaction. While recent advances in large language models (LLMs) have significantly improved linguistic fluency and contextual understanding, building effective and emotionally intelligent ToD systems remains a complex challenge. Effective ToD systems must optimise for task success, emotional understanding and responsiveness, and precise information conveyance, all within inherently noisy and ambiguous conversational environments. In this work, we investigate architectural, representational, optimisational as well as emotional considerations of ToD systems. We set up systems covering these design considerations with a challenging evaluation environment composed of a natural-language user simulator coupled with an imperfect natural language understanding module. We propose LUSTER, an LLM-based Unified System for Task-oriented dialogue with End-to-end Reinforcement learning with both short-term (user sentiment) and long-term (task success) rewards. Our findings demonstrate that combining LLM capability with structured reward modelling leads to more resilient and emotionally responsive ToD systems, offering a practical path forward for next-generation conversational agents.
以任务为导向的对话(ToD)系统旨在帮助用户通过自然语言互动实现具体目标。虽然大型语言模型(LLMs)最近的进展大大提高了语言流畅度和背景理解度,但建设有效和情感智能的 ToD系统仍是一项复杂的挑战。有效的 ToD系统必须优化任务成功、情感理解和反应以及精确的信息传输,这些系统都存在于内在的吵闹和模糊的谈话环境中。在这项工作中,我们调查了ToD系统的建筑、代表性、优化和情感考虑。我们建立了涵盖这些设计考虑的系统,并建立了具有挑战性的评价环境,包括由自然语言用户模拟器和不完善的自然语言理解模块组成的评估环境。我们提出了\ textbf{LUSTER},一个基于 textbf{L}LM-bf{U}基于 textbf{S} 的系统优化。我们与\ textbf{T}T}T}ask-findicent 和情感导向性对话(后端\ textbf{R{R} 和结构性激励学习, 展示了我们更具有适应性的成功的学习能力。
Article 129
Title@2025-07-02 (3): Is External Information Useful for Stance Detection with LLMs?
Title: Is External Information Useful for Stance Detection with LLMs? | Ist externe Informationen nützlich für Stance Detection mit LLMs? | 外部信息是否对利用LLMS探测 Stance有用? 2507.01543v1 |
Authors (2): Quang Minh Nguyen, Taegyoon Kim
In the stance detection task, a text is classified as either favorable, opposing, or neutral towards a target. Prior work suggests that the use of external information, e.g., excerpts from Wikipedia, improves stance detection performance. However, whether or not such information can benefit large language models (LLMs) remains an unanswered question, despite their wide adoption in many reasoning tasks. In this study, we conduct a systematic evaluation on how Wikipedia and web search external information can affect stance detection across eight LLMs and in three datasets with 12 targets. Surprisingly, we find that such information degrades performance in most cases, with macro F1 scores dropping by up to 27.9%. We explain this through experiments showing LLMs’ tendency to align their predictions with the stance and sentiment of the provided information rather than the ground truth stance of the given text. We also find that performance degradation persists with chain-of-thought prompting, while fine-tuning mitigates but does not fully eliminate it. Our findings, in contrast to previous literature on BERT-based systems which suggests that external information enhances performance, highlight the risks of information biases in LLM-based stance classifiers. Code is available at https://github.com/ngqm/acl2025-stance-detection.
先前的工作表明,使用外部信息,例如维基百科的节录,可以提高姿态检测性能。然而,这种信息能否使大型语言模型(LLMs)受益,尽管在许多推理任务中广泛采用,但仍然是一个未解答的问题。在本研究中,我们系统评估维基百科和网络搜索外部信息如何影响对8个LMs和3个数据集的姿态检测性能,有12个目标。令人惊讶的是,我们发现这类信息在大多数情况下会降低绩效,宏观F1得分下降至27.9。我们通过实验来解释这一点:LLMs倾向于使其预测与所提供信息的姿态和情绪相一致,而不是与给定文本的地面真相立场相一致。我们还发现,业绩退化与思考链的迅速性持续在一起,同时微调不力,但并未完全消除。我们的调查结果与以前关于BERT系统的文献相比,表明外部信息会提高性能,突出LM25/deglistrationrs的信息偏差风险。
Article 130
Title@2025-07-02 (3): Efficient Out-of-Scope Detection in Dialogue Systems via Uncertainty-Driven LLM Routing
Title: Efficient Out-of-Scope Detection in Dialogue Systems via Uncertainty-Driven LLM Routing | Effiziente Out-of-Scope-Erkennung in Dialogsystemen durch unsicheres LLM Routing | 通过不确定性驱动LLM路由在对话系统中高效地外探测 2507.01541v1 |
Authors (4): Álvaro Zaera, Diana Nicoleta Popa, Ivan Sekulic, Paolo Rosso
Out-of-scope (OOS) intent detection is a critical challenge in task-oriented dialogue systems (TODS), as it ensures robustness to unseen and ambiguous queries. In this work, we propose a novel but simple modular framework that combines uncertainty modeling with fine-tuned large language models (LLMs) for efficient and accurate OOS detection. The first step applies uncertainty estimation to the output of an in-scope intent detection classifier, which is currently deployed in a real-world TODS handling tens of thousands of user interactions daily. The second step then leverages an emerging LLM-based approach, where a fine-tuned LLM is triggered to make a final decision on instances with high uncertainty. Unlike prior approaches, our method effectively balances computational efficiency and performance, combining traditional approaches with LLMs and yielding state-of-the-art results on key OOS detection benchmarks, including real-world OOS data acquired from a deployed TODS.
在这项工作中,我们提出了一个新颖而简单的模块化框架,将不确定性模型与精细调整的大语言模型(LLMs)相结合,以便高效和准确地检测OS。第一步是将不确定性估计适用于目前部署在现实世界中处理每天成千上万次用户互动的TODS的内切目的识别分类器的产出。第二步是利用基于LLM的新兴方法,在此方法下启动一个经过微调的LLOM,以便对高度不确定的事例作出最后决定。与以前的方法不同,我们的方法有效地平衡了计算效率和性能,将传统方法与LLOMs结合起来,并在主要OOS检测基准(包括从部署的TODS获得的真实世界OS数据)上产生最新的结果。
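The two-step pipeline described in the abstract above (estimate the uncertainty of an in-scope intent classifier, then hand only the uncertain cases to a fine-tuned LLM) can be sketched as follows. This is a minimal illustration under stated assumptions: normalized entropy is used as the uncertainty score, the threshold of 0.5 is arbitrary, and `route`, `llm_judge`, and the intent labels are hypothetical names, not the authors' implementation:

```python
import math
from typing import Callable, Sequence


def normalized_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of a probability vector, scaled to [0, 1]. Needs at least 2 classes."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


def route(
    utterance: str,
    probs: Sequence[float],
    labels: Sequence[str],
    llm_judge: Callable[[str], str],
    threshold: float = 0.5,
) -> str:
    """Accept the classifier's top intent when it is confident; otherwise defer to the LLM."""
    if normalized_entropy(probs) <= threshold:
        best = max(range(len(probs)), key=probs.__getitem__)
        return labels[best]
    # High uncertainty: the (expensive) LLM makes the final call.
    return llm_judge(utterance)


# Hypothetical intents and a stub standing in for the fine-tuned LLM.
intents = ["check_balance", "transfer_money", "block_card"]


def stub_llm(utterance: str) -> str:
    return "out_of_scope"


confident = route("what is my balance", [0.97, 0.02, 0.01], intents, stub_llm)
uncertain = route("tell me a joke", [1 / 3, 1 / 3, 1 / 3], intents, stub_llm)
```

Because the LLM is only called above the threshold, the threshold directly trades inference cost against out-of-scope detection quality, which is the efficiency and performance balance the abstract describes.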
Article 131
Title@2025-07-02 (3): Following the Clues: Experiments on Person Re-ID using Cross-Modal Intelligence
Title: Following the Clues: Experiments on Person Re-ID using Cross-Modal Intelligence | Im Anschluss an die Klues: Experimente zur Person Re-ID mit Cross-Modal Intelligence | 在Clues之后:利用跨模式情报对个人重新识别进行实验 2507.01504v1 |
Authors (6): Robert Aufschläger, Youssef Shoeb, Azarm Nowzad, Michael Heigl, Fabian Bally, Martin Schramm
The collection and release of street-level recordings as Open Data play a vital role in advancing autonomous driving systems and AI research. However, these datasets pose significant privacy risks, particularly for pedestrians, due to the presence of Personally Identifiable Information (PII) that extends beyond biometric traits such as faces. In this paper, we present cRID, a novel cross-modal framework combining Large Vision-Language Models, Graph Attention Networks, and representation learning to detect textual describable clues of PII and enhance person re-identification (Re-ID). Our approach focuses on identifying and leveraging interpretable features, enabling the detection of semantically meaningful PII beyond low-level appearance cues. We conduct a systematic evaluation of PII presence in person image datasets. Our experiments show improved performance in practical cross-dataset Re-ID scenarios, notably from Market-1501 to CUHK03-np (detected), highlighting the framework’s practical utility. Code is available at https://github.com/RAufschlaeger/cRID.
在公开数据中,街头记录收集和发布作为公开数据在推进自主驾驶系统和AI研究方面发挥着至关重要的作用。然而,由于存在超越诸如脸部等生物鉴别特征以外的个人识别信息(PII),这些数据集对隐私构成重大风险,特别是对行人而言。在本文件中,我们介绍了CRID,这是一个全新的跨模式框架,将大型视觉语言模型、图示关注网络和代表学习结合起来,以发现PII的文字破解线索,并加强人的再识别(Re-ID)。我们的方法侧重于识别和利用可解释的特征,从而能够探测出在低层外观提示之外具有内在意义的PII。我们对个人图像数据集中存在的身份进行系统评估。我们的实验显示,在实际交叉数据集再识别情景方面,特别是从市场1501到CUHK03-np(识别)的绩效有所改进,突出了框架的实际效用。代码见https://github.com/RAufschlaeger/cRID。
Article 132
Title@2025-07-02 (3): Evaluating the Effectiveness of Direct Preference Optimization for Personalizing German Automatic Text Simplifications for Persons with Intellectual Disabilities
Title: Evaluating the Effectiveness of Direct Preference Optimization for Personalizing German Automatic Text Simplifications for Persons with Intellectual Disabilities | Bewertung der Wirksamkeit der Direktpräferenzoptimierung zur Personalisierung von deutschen automatischen Textvereinfachungen für Personen mit intellektuellen Behinderungen | 评估直接优惠优化使德国残疾人自动文本简化措施个人化的效果 2507.01479v1 |
Authors (5): Yingqiang Gao, Kaede Johnson, David Froehlich, Luisa Carrer, Sarah Ebling
Automatic text simplification (ATS) aims to enhance language accessibility for various target groups, particularly persons with intellectual disabilities. Recent advancements in generative AI, especially large language models (LLMs), have substantially improved the quality of machine-generated text simplifications, thereby mitigating information barriers for the target group. However, existing LLM-based ATS systems do not incorporate preference feedback on text simplifications during training, resulting in a lack of personalization tailored to the specific needs of target group representatives. In this work, we extend the standard supervised fine-tuning (SFT) approach for adapting LLM-based ATS models by leveraging a computationally efficient LLM alignment technique – direct preference optimization (DPO). Specifically, we post-train LLM-based ATS models using human feedback collected from persons with intellectual disabilities, reflecting their preferences on paired text simplifications generated by mainstream LLMs. Furthermore, we propose a pipeline for developing personalized LLM-based ATS systems, encompassing data collection, model selection, SFT and DPO post-training, and evaluation. Our findings underscore the necessity of active participation of target group persons in designing personalized AI accessibility solutions aligned with human expectations. This work represents a step towards personalizing inclusive AI systems at the target-group level, incorporating insights not only from text simplification experts but also from target group persons themselves.
自动文本简化(ATS)旨在提高各类目标群体(尤其是智力障碍人士)的语言无障碍程度。近期生成式AI(尤其是大型语言模型,LLMs)的进步大幅提高了机器生成的文本简化的质量,从而减轻了目标群体的信息障碍。然而,现有基于LLM的ATS系统在训练期间没有纳入关于文本简化的偏好反馈,导致缺乏针对目标群体代表具体需要的个性化。在这项工作中,我们利用一种计算效率高的LLM对齐技术,即直接偏好优化(DPO),扩展了用于适配基于LLM的ATS模型的标准监督微调(SFT)方法。具体而言,我们利用从智力障碍人士那里收集的人类反馈对基于LLM的ATS模型进行后训练,这些反馈反映了他们对主流LLMs生成的成对文本简化的偏好。此外,我们提出了一个开发个性化的基于LLM的ATS系统的流程,包括数据收集、模型选择、SFT和DPO后训练以及评估。我们的研究结果强调,目标群体人士必须积极参与设计符合人类期望的个性化AI无障碍解决方案。这项工作是在目标群体层面实现包容性AI系统个性化的一步,不仅吸收了文本简化专家的见解,也吸收了目标群体人士自身的见解。
Article 133
Title@2025-07-02 (3): Unifying Global and Near-Context Biasing in a Single Trie Pass
Title: Unifying Global and Near-Context Biasing in a Single Trie Pass | Globale und kontextnahe Einigung in einem einzigen Trie Pass | 统一全球和近距离统一在单三通 2409.13514v2 |
Authors (12): Iuliia Thorbecke, Esaú Villatoro-Tello, Juan Zuluaga-Gomez, Shashi Kumar, Sergio Burdisso, Pradeep Rangappa, Andrés Carofilis, Srikanth Madikeri, Petr Motlicek, Karthik Pandia, Kadri Hacioğlu, Andreas Stolcke
Despite the success of end-to-end automatic speech recognition (ASR) models, challenges persist in recognizing rare, out-of-vocabulary words - including named entities (NE) - and in adapting to new domains using only text data. This work presents a practical approach to address these challenges through a previously unexplored combination of an NE bias list and a word-level n-gram language model (LM). This solution balances simplicity and effectiveness, improving entity recognition while maintaining or even enhancing overall ASR performance. We efficiently integrate this enriched biasing method into a transducer-based ASR system, enabling context adaptation with almost no computational overhead. We present our results on three datasets spanning four languages and compare them to state-of-the-art biasing strategies. We demonstrate that the proposed combination of keyword biasing and n-gram LM improves entity recognition by up to 32% relative and reduces overall WER by up to 12% relative.
尽管端对端自动语音识别模式取得了成功,但在承认稀有、单词外词汇(包括名称实体)和仅使用文本数据适应新领域方面仍然存在挑战。这项工作提出了一种切实可行的办法,通过未探索的NE偏差列表和单级ngram语言模型组合来应对这些挑战。这一解决办法平衡了简单和有效性,提高了实体的识别度,同时保持或甚至提高了ASR的总体性能。我们有效地将这种丰富了的偏差方法纳入基于传输器的ASR系统,使环境适应几乎没有计算间接费用。我们介绍了我们关于涵盖四种语言的三个数据集的结果,并将其与最新偏差战略进行比较。我们证明,关键词偏差和ngram LM的拟议组合使实体的识别率提高了32%的相对比重,将整体WER降低12%的相对值。
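The bias-list idea in this abstract can be illustrated with a prefix tree over phrases. The block below is a minimal sketch of generic trie-based keyword biasing, not the authors' implementation: the names `BiasTrie`, `biased_score`, and `bonus` are hypothetical, the score is a plain float standing in for a hypothesis score from the ASR decoder, and the n-gram LM term and any revocation of bonuses for unfinished phrases are left out.

```python
class BiasTrie:
    """Prefix tree over the word sequences of a bias list (hypothetical helper)."""

    def __init__(self, phrases):
        self.root = {}
        for phrase in phrases:
            node = self.root
            for word in phrase.split():
                # Each node is a dict mapping the next word to a child node.
                node = node.setdefault(word, {})

    def step(self, node, word):
        """Advance from `node` on `word`; restart from the root on a mismatch."""
        if word in node:
            return node[word], True
        if word in self.root:
            return self.root[word], True
        return self.root, False


def biased_score(words, base_score, trie, bonus=1.0):
    """Add `bonus` for every word that starts or extends a bias-list prefix."""
    node, score = trie.root, base_score
    for word in words:
        node, matched = trie.step(node, word)
        if matched:
            score += bonus
    return score
```

Because every hypothesis only carries a pointer to its current trie node, the extra work per decoding step is a single dictionary lookup, which is consistent with the low overhead the abstract reports for a single trie pass.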
Article 134
Title@2025-07-02 (3): BIS Reasoning 1.0: The First Large-Scale Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning
Title: BIS Reasoning 1.0: The First Large-Scale Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning | BIS Reasoning 1.0: Der erste großformatige japanische Benchmark für glaubens-inkonsistente syllogistische Reasoning | BIS Reasoning 1.0:首个用于信念不一致三段论推理的大规模日语基准 2506.06955v3 |
Authors (8): Ha-Thanh Nguyen, Chaoran Liu, Qianying Liu, Hideyuki Tachibana, Su Myat Noe, Yusuke Miyao, Koichi Takeda, Sadao Kurohashi
We present BIS Reasoning 1.0, the first large-scale Japanese dataset of syllogistic reasoning problems explicitly designed to evaluate belief-inconsistent reasoning in large language models (LLMs). Unlike prior datasets such as NeuBAROCO and JFLD, which focus on general or belief-aligned reasoning, BIS Reasoning 1.0 introduces logically valid yet belief-inconsistent syllogisms to uncover reasoning biases in LLMs trained on human-aligned corpora. We benchmark state-of-the-art models - including GPT models, Claude models, and leading Japanese LLMs - revealing significant variance in performance, with GPT-4o achieving 79.54% accuracy. Our analysis identifies critical weaknesses in current LLMs when handling logically valid but belief-conflicting inputs. These findings have important implications for deploying LLMs in high-stakes domains such as law, healthcare, and scientific literature, where truth must override intuitive belief to ensure integrity and safety.
我们提出了国际清算银行(BIS)1.0号理由,这是日本为评估大型语言模型(LLMs)中与信仰不一致的推理(LLMs)明确设计的第一个大规模逻辑推理问题数据集。 与以前侧重于一般推理或信仰一致推理的NeuBAROCO和JFLD等数据集不同,国际清算银行(BIS)1.0号理由引入了逻辑上有效但信仰不一致的立体理论,以揭示关于人与人结盟公司培训的LLLMs的推理偏向。 我们为最新模型(包括GPT模型、Claude模型和主要日本LLMs)做了基准测试,显示其性能差异很大,GPT-4o的精确度达到了79.54%。 我们的分析指出了当前LLMs在处理逻辑上有效但信仰冲突性投入时的关键弱点。 这些发现对在法律、保健和科学文献等高接触领域部署LLMs具有重要影响,在这些领域必须超越直觉信仰以确保廉正和安全。
Article 135
Title@2025-07-02 (3): LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation
Title: LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation | LogitSpec: Beschleunigung der retrieval-basierten spekulativen Dekodierung über die nächste nächste Token-Spekulation | logitspec: 加速检索基于回收的投机代号 2507.01449v1 |
Authors (5): Tianyu Liu, Qitan Lv, Hao Li, Xing Gao, Xiao Sun
Speculative decoding (SD), where a small draft model is employed to propose draft tokens in advance and then the target model validates them in parallel, has emerged as a promising technique for LLM inference acceleration. Many efforts to improve SD eliminate the need for a draft model and generate draft tokens in a retrieval-based manner, which further alleviates the drafting overhead and significantly reduces the difficulty of deployment and application. However, retrieval-based SD relies on a matching paradigm to retrieve the most relevant reference as the draft tokens, and these methods often fail to find matched and accurate draft tokens. To address this challenge, we propose LogitSpec to effectively expand the retrieval range and find the most relevant reference as drafts. LogitSpec is motivated by the observation that the logit of the last token can not only predict the next token, but also speculate the next next token. Specifically, LogitSpec generates draft tokens in two steps: (1) utilizing the last logit to speculate the next next token; (2) retrieving relevant reference for both the next token and the next next token. LogitSpec is training-free and plug-and-play, and can be easily integrated into existing LLM inference frameworks. Extensive experiments on a wide range of text generation benchmarks demonstrate that LogitSpec can achieve up to 2.61× speedup and 3.28 mean accepted tokens per decoding step. Our code is available at https://github.com/smart-lty/LogitSpec.
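推测解码(SD)先用一个小型草稿模型预先提出草稿token,再由目标模型并行验证,已成为LLM推理加速的一项很有前景的技术。许多改进SD的工作致力于消除对草稿模型的需要,以基于检索的方式生成草稿token,从而进一步减轻起草开销,并大幅降低部署和应用的难度。然而,基于检索的SD依靠匹配范式来检索最相关的参考内容作为草稿token,这些方法往往无法找到匹配且准确的草稿token。为应对这一挑战,我们提出LogitSpec,以有效扩大检索范围并找到最相关的参考内容作为草稿。LogitSpec的动机来自这样一个观察:最后一个token的logit不仅可以预测下一个token,还可以推测下下个token。具体而言,LogitSpec分两步生成草稿token:(1)利用最后一个logit推测下下个token;(2)为下一个token和下下个token检索相关参考内容。LogitSpec无需训练、即插即用,可以轻松集成到现有的LLM推理框架中。在多种文本生成基准上的大量实验表明,LogitSpec最高可实现2.61倍的加速,每个解码步骤平均接受3.28个token。我们的代码见https://github.com/smart-lty/LogitSpec。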
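The two steps named in the abstract can be sketched as follows. This is an illustrative reading, not the released LogitSpec code: it assumes that the runner-up entries of the last logit vector serve as guesses for the next next token, and that retrieval is a simple bigram lookup in the existing context. The function names, `k`, and `max_len` are hypothetical.

```python
def top_k_ids(logits, k):
    """Indices of the k largest logits, best first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def retrieve_drafts(context, last_logits, k=3, max_len=4):
    """Return draft token sequences copied from `context`.

    Step 1: take the top-1 entry of `last_logits` as the next token and the
            following k entries as guesses for the next next token (assumption).
    Step 2: search the context for each (next, guess) bigram and copy up to
            `max_len` tokens starting at the match.
    """
    ranked = top_k_ids(last_logits, k + 1)
    next_tok, guesses = ranked[0], ranked[1:]
    drafts = []
    for guess in guesses:
        for i in range(len(context) - 1):
            if context[i] == next_tok and context[i + 1] == guess:
                drafts.append(context[i:i + max_len])
                break  # keep only the first match for this guess
    if not drafts:
        drafts.append([next_tok])  # fall back to the plain next token
    return drafts
```

Matching on a bigram rather than on the next token alone narrows the lookup to continuations that agree with the model's own guess, which is one plausible way to obtain the more accurate drafts the abstract describes. The target model would still verify every draft in parallel.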
Article 136
Title@2025-07-02 (3): DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues
Title: DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues | DICE-BENCH: Bewertung der Tool-Use-Fähigkeiten von großen Sprachmodellen in multi-round, Multi-Party-Dialogen | DICE-BENCH:评估多党对话中大语言模式工具使用能力 2506.22853v2 |
Authors (7): Kyochul Jang, Donghyeon Lee, Kyusik Kim, Dongseok Heo, Taewhoo Lee, Woojeong Kim, Bongwon Suh
Existing function-calling benchmarks focus on single-turn interactions. However, they overlook the complexity of real-world scenarios. To quantify how existing benchmarks address practical applications, we introduce DICE-SCORE, a metric that evaluates the dispersion of tool-related information such as function name and parameter values throughout the dialogue. Analyzing existing benchmarks through DICE-SCORE reveals notably low scores, highlighting the need for more realistic scenarios. To address this gap, we present DICE-BENCH, a framework that constructs practical function-calling datasets by synthesizing conversations through a tool graph that maintains dependencies across rounds and a multi-agent system with distinct personas to enhance dialogue naturalness. The final dataset comprises 1,607 high-DICE-SCORE instances. Our experiments on 19 LLMs with DICE-BENCH show that significant advances are still required before such models can be deployed effectively in real-world settings. Our code and data are all publicly available: https://snuhcc.github.io/DICE-Bench/.
现有的函数调用基准侧重于单轮交互,却忽视了现实世界场景的复杂性。为了量化现有基准对实际应用的覆盖程度,我们引入了DICE-SCORE,这一指标用于评估函数名称和参数值等工具相关信息在整个对话中的分散程度。通过DICE-SCORE对现有基准进行分析后发现,其得分明显偏低,突出表明需要更现实的场景。为了弥补这一差距,我们提出了DICE-BENCH,这是一个构建实用函数调用数据集的框架,它通过一个在各轮之间保持依赖关系的工具图和一个具有不同角色设定的多智能体系统来合成对话,以增强对话的自然性。最终数据集包括1,607个高DICE-SCORE实例。我们用DICE-BENCH对19个LLMs进行的实验显示,在这些模型能够有效部署到现实环境之前,仍然需要取得重大进展。我们的代码和数据都公开提供:https://snuhcc.github.io/DICE-Bench/。
Article 137
Title@2025-07-02 (3): Clinical NLP with Attention-Based Deep Learning for Multi-Disease Prediction
Title: Clinical NLP with Attention-Based Deep Learning for Multi-Disease Prediction | Klinische NLP mit aufmerksamkeitsbasiertem Deep Learning für Multi-Disease-Vorhersage | 以关注为基础深入学习多疾病预测多疾病预测的临床NLP 2507.01437v1 |
Authors (5): Ting Xu, Xiaoxiao Deng, Xiandong Meng, Haifeng Yang, Yan Wu
This paper addresses the challenges posed by the unstructured nature and high-dimensional semantic complexity of electronic health record texts. A deep learning method based on attention mechanisms is proposed to achieve unified modeling for information extraction and multi-label disease prediction. The study is conducted on the MIMIC-IV dataset. A Transformer-based architecture is used to perform representation learning over clinical text. Multi-layer self-attention mechanisms are employed to capture key medical entities and their contextual relationships. A Sigmoid-based multi-label classifier is then applied to predict multiple disease labels. The model incorporates a context-aware semantic alignment mechanism, enhancing its representational capacity in typical medical scenarios such as label co-occurrence and sparse information. To comprehensively evaluate model performance, a series of experiments were conducted, including baseline comparisons, hyperparameter sensitivity analysis, data perturbation studies, and noise injection tests. Results demonstrate that the proposed method consistently outperforms representative existing approaches across multiple performance metrics. The model maintains strong generalization under varying data scales, interference levels, and model depth configurations. The framework developed in this study offers an efficient algorithmic foundation for processing real-world clinical texts and presents practical significance for multi-label medical text modeling tasks.
本文论述电子健康记录文本的无结构性质和高维语义复杂性所构成的挑战; 提议基于关注机制的深层次学习方法,以实现信息提取和多标签疾病预测的统一模型模型; 在MIMIMI-IV数据集上进行研究; 采用基于变压器的架构,对临床文本进行代表性学习; 采用多层次自我注意机制,以捕捉关键医疗实体及其背景关系; 然后,采用基于多标签的多层次类动物分类器,预测多种疾病标签; 模型包含一种符合环境特征的语义调整机制,在典型的医疗假设中,如标签共同出现和信息稀少等,加强其代表性能力; 为全面评估模型性能,进行了一系列实验,包括基线比较、超参数灵敏度分析、数据渗透研究和噪音注入测试; 结果表明,拟议的方法始终超越了多种性能衡量标准的现有代表性方法; 模型在不同的数据尺度、干扰等级和模型深度配置下保持强有力的概括化。 本研究中制定的框架为处理现实世界临床文本提供了有效的实用性定义基础。
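The sigmoid-based multi-label head mentioned in this abstract differs from a softmax classifier in that each label gets an independent probability, so several diseases can be predicted at once. The block below is a minimal sketch of that decision rule only. The label names and the 0.5 threshold are illustrative assumptions, and the Transformer encoder that would produce the logits is omitted.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_labels(logits, label_names, threshold=0.5):
    """Return every label whose independent probability reaches `threshold`."""
    probs = [sigmoid(z) for z in logits]
    return [name for name, p in zip(label_names, probs) if p >= threshold]


# Hypothetical logits for three hypothetical disease labels.
labels = ["hypertension", "diabetes", "sepsis"]
predicted = predict_labels([2.0, -1.0, 0.3], labels)
```

Because the probabilities are not forced to sum to one, co-occurring labels do not compete with each other, which suits the label co-occurrence scenario the abstract highlights.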
Article 138
Title@2025-07-02 (3): VLM2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues
Title: VLM2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues | VLM2-Bench: Ein genauerer Blick darauf, wie gut VLMs explizit mit visuellen Queues verknüpfen | VLM2-Bench:更仔细地审视VLMs隐式关联显式匹配视觉线索的能力 2502.12084v4 |
Authors (5): Jianshu Zhang, Dongyu Yao, Renjie Pi, Paul Pu Liang, Yi R. Fung
Visually linking matching cues is a crucial ability in daily life, such as identifying the same person in multiple photos based on their cues, even without knowing who they are. Despite the extensive knowledge that vision-language models (VLMs) possess, it remains largely unexplored whether they are capable of performing this fundamental task. To address this, we introduce VLM2-Bench, a benchmark designed to assess whether VLMs can Visually Link Matching cues, with 9 subtasks and over 3,000 test cases. Comprehensive evaluation across twelve VLMs, along with further analysis of various language-side and vision-side prompting methods, leads to a total of eight key findings. We identify critical challenges in models’ ability to link visual cues, highlighting a significant performance gap. Based on these insights, we advocate for (i) enhancing core visual capabilities to improve adaptability and reduce reliance on prior knowledge, (ii) establishing clearer principles for integrating language-based reasoning in vision-centric tasks to prevent unnecessary biases, and (iii) shifting vision-text training paradigms toward fostering models’ ability to independently structure and infer relationships among visual cues.
在视觉上关联匹配线索是日常生活中的一项关键能力,例如,根据线索在多张照片中识别同一个人,即使不知道他们是谁。尽管视觉语言模型(VLMs)拥有广泛的知识,但它们是否有能力完成这一基本任务基本上仍未得到探索。为了解决这个问题,我们引入了基准VLM2-Bench,该基准旨在评估VLMs能否在视觉上关联匹配线索,包含9个子任务和3,000多个测试案例。对12个VLMs的综合评估,以及对各种语言侧和视觉侧提示方法的进一步分析,共得出8项关键发现。我们确定了模型关联视觉线索的能力方面的重大挑战,突出了一个显著的性能差距。基于这些认识,我们主张:(一) 提高核心视觉能力,以提高适应能力,减少对先验知识的依赖;(二) 制定更明确的原则,将基于语言的推理纳入以视觉为中心的任务,以防止不必要的偏差;(三) 将视觉文本训练范式转向培养模型独立构建和推断视觉线索之间关系的能力。
Article 139
Title@2025-07-02 (3): Pensieve Grader: An AI-Powered, Ready-to-Use Platform for Effortless Handwritten STEM Grading
Title: Pensieve Grader: An AI-Powered, Ready-to-Use Platform for Effortless Handwritten STEM Grading | Pensieve Grader: Eine KI-Powered, Ready-to-Use Plattform für mühelose handschriftliche STEM-Grading | Pensieve grafer: 一个AI授权的无力手写STEM分级的现用平台 2507.01431v1 |
Authors (4): Yoonseok Yang, Minjune Kim, Marlon Rondinelli, Keren Shao
Grading handwritten, open-ended responses remains a major bottleneck in large university STEM courses. We introduce Pensieve (https://www.pensieve.co), an AI-assisted grading platform that leverages large language models (LLMs) to transcribe and evaluate student work, providing instructors with rubric-aligned scores, transcriptions, and confidence ratings. Unlike prior tools that focus narrowly on specific tasks like transcription or rubric generation, Pensieve supports the entire grading pipeline, from scanned student submissions to final feedback, within a human-in-the-loop interface. Pensieve has been deployed in real-world courses at over 20 institutions and has graded more than 300,000 student responses. We present system details and empirical results across four core STEM disciplines: Computer Science, Mathematics, Physics, and Chemistry. Our findings show that Pensieve reduces grading time by an average of 65%, while maintaining a 95.4% agreement rate with instructor-assigned grades for high-confidence predictions.
手写、不限名额的答复仍然是大型大学STEM课程中的一个主要瓶颈。 我们引入了Pensieve(https://www.pensieve.co),这是一个AI协助的分级平台,它利用大型语言模型(LLMs)进行抄录和评估学生工作,为教员提供与标准一致的评分、抄录和信心评级。与以前狭隘地侧重于诸如笔录或卢布一代等具体任务的工具不同,Pensieve支持从扫描学生提交到最终反馈的整个分级管道,在人类在一流界面内。Pensieve被部署到20多个机构的实际世界课程中,对30多万名学生的答复进行了分级。我们介绍了四个核心STEM学科(计算机科学、数学、物理和化学)的系统细节和经验结果。我们的调查结果显示,Pensieve平均将分级时间减少65%,同时保持95.4%的约定率,与教师指定的高信任预测分数保持95.4%的约定率。
Article 140
Title@2025-07-02 (3): Don’t Say No: Jailbreaking LLM by Suppressing Refusal
Title: Don’t Say No: Jailbreaking LLM by Suppressing Refusal | Sagen Sie nicht Nein: Jailbreaking LLM durch Unterdrückung der Weigerung | 不要说不:通过抑制拒绝来越狱LLM 2404.16369v3 |
Authors (6): Yukai Zhou, Jian Lou, Zhijie Huang, Zhan Qin, Yibei Yang, Wenjie Wang
Ensuring the safety alignment of Large Language Models (LLMs) is critical for generating responses consistent with human values. However, LLMs remain vulnerable to jailbreaking attacks, where carefully crafted prompts manipulate them into producing toxic content. One category of such attacks reformulates the task as an optimization problem, aiming to elicit affirmative responses from the LLM. However, these methods heavily rely on predefined objectionable behaviors, limiting their effectiveness and adaptability to diverse harmful queries. In this study, we first identify why the vanilla target loss is suboptimal and then propose enhancements to the loss objective. We introduce DSN (Don’t Say No) attack, which combines a cosine decay schedule method with refusal suppression to achieve higher success rates. Extensive experiments demonstrate that DSN outperforms baseline attacks and achieves state-of-the-art attack success rates (ASR). DSN also shows strong universality and transferability to unseen datasets and black-box models.
确保大语言模型的安全一致性对于产生符合人类价值观的反应至关重要。然而,大语言模型(LLMs)对于产生符合人类价值观的响应至关重要。但LLMs仍然易受侵入性袭击的伤害,因为经过精心策划的快速动作将它们操纵成有毒内容。这类攻击的一类类型将任务重新描述为一个优化问题,目的是从LLM那里获得肯定性反应。然而,这些方法在很大程度上依赖于预先界定的可反对行为,限制了它们的效力和适应各种有害查询。在本研究中,我们首先确定香草目标损失为何不理想,然后提出增强损失目标的建议。我们引入了DSN(不要说不)攻击,将连带腐蚀计划法与拒绝抑制以达到更高成功率相结合。广泛的实验表明DSN(DSN)超越基线攻击,并达到最先进的攻击成功率。DSN(ASR)。DN(DN)还表明,隐形数据集和黑盒模型具有很强的普遍性和可转移性。
Article 141
Title@2025-07-02 (3): Transferable Modeling Strategies for Low-Resource LLM Tasks: A Prompt and Alignment-Based Approach
Title: Transferable Modeling Strategies for Low-Resource LLM Tasks: A Prompt and Alignment-Based Approach | Übertragbare Modellierungsstrategien für LLM-Aufgaben mit geringem Ressourcenbedarf: Ein prompter und ausgerichteter Ansatz | 可转让的低资源LLM任务可转让示范战略:迅速和统一的方法 2507.00601v2 |
Authors (5): Shuangquan Lyu, Yingnan Deng, Guiran Liu, Zhen Qi, Ruotong Wang
This paper addresses the limited transfer and adaptation capabilities of large language models in low-resource language scenarios. It proposes a unified framework that combines a knowledge transfer module with parameter-efficient fine-tuning strategies. The method introduces knowledge alignment loss and soft prompt tuning to guide the model in effectively absorbing the structural features of target languages or tasks under minimal annotation. This enhances both generalization performance and training stability. The framework includes lightweight adaptation modules to reduce computational costs. During training, it integrates freezing strategies and prompt injection to preserve the model’s original knowledge while enabling quick adaptation to new tasks. The study also conducts stability analysis experiments and synthetic pseudo-data transfer experiments to systematically evaluate the method’s applicability and robustness across different low-resource tasks. Experimental results show that compared with existing multilingual pre-trained models and mainstream transfer methods, the proposed approach achieves higher performance and stability on cross-lingual tasks such as MLQA, XQuAD, and PAWS-X. It demonstrates particularly strong advantages under extremely data-scarce conditions. The proposed method offers strong generality and scalability. It enhances task-specific adaptability while preserving the general capabilities of large language models. This makes it well-suited for complex semantic modeling and multilingual processing tasks.
本文件论述低资源语言情景下大型语言模型有限的转让和适应能力,建议一个统一框架,将知识转让模块与节能微调战略结合起来;该方法引入知识调整损失和软快速调整,以指导该模式有效吸收目标语言的结构特征或最低注释下的任务;这增强了通用性业绩和培训稳定性;该框架包括轻量化适应模块,以减少计算成本;在培训期间,将冻结战略和迅速注入结合起来,以保存模型的原始知识,同时能够迅速适应新任务;该研究还进行稳定分析实验和合成假数据转移实验,以系统评估方法在不同低资源任务中的适用性和稳健性;实验结果表明,与现有的多语言预先培训模式和主流转移方法相比,拟议方法在多语言任务上取得了更高的绩效和稳定性,如MLQA、XQuAD和PAWS-X等;在极差的条件下,该框架显示了特别强大的优势。该拟议方法提供了很强的概括性和可扩展性。该方法在维护大型语言的复杂处理能力的同时,加强了任务特定的可调整性模型。
Article 142
Title@2025-07-02 (3): Text to Band Gap: Pre-trained Language Models as Encoders for Semiconductor Band Gap Prediction
Title: Text to Band Gap: Pre-trained Language Models as Encoders for Semiconductor Band Gap Prediction | Text zu Band Gap: Vortrainierte Sprachmodelle als Encoder für Semiconductor Band Gap Prediction | 文字到带宽差距:作为半导体带宽差距预测的编译者的培训前语言模式 2501.03456v2 |
Authors (4): Ying-Ting Yeh, Janghoon Ock, Shagun Maheshwari, Amir Barati Farimani
We investigate the use of transformer-based language models, RoBERTa, T5, and LLaMA, for predicting the band gaps of semiconductor materials directly from textual representations that encode key material features such as chemical composition, crystal system, space group, number of atoms per unit cell, valence electron count, and other relevant electronic and structural properties. Quantum chemistry simulations such as DFT provide accurate predictions but are computationally intensive, limiting their feasibility for large-scale materials screening. Shallow ML models offer faster alternatives but typically require extensive data preprocessing to convert non-numerical material features into structured numerical inputs, often at the cost of losing critical descriptive information. In contrast, our approach leverages pretrained language models to process textual data directly, eliminating the need for manual feature engineering. We construct material descriptions in two formats: structured strings that combine key features in a consistent template, and natural language narratives generated using the ChatGPT API. For each model, we append a custom regression head and perform task-specific finetuning on a curated dataset of inorganic compounds. Our results show that finetuned language models, particularly the decoder-only LLaMA-3 architecture, can outperform conventional approaches in prediction accuracy and flexibility, achieving an MAE of 0.25 eV and R2 of 0.89, compared to the best shallow ML baseline, which achieved an MAE of 0.32 eV and R2 of 0.84. Notably, LLaMA-3 achieves competitive accuracy with minimal finetuning, suggesting its architecture enables more transferable representations for scientific tasks. This work demonstrates the effectiveness of finetuned language models for scientific property prediction and provides a scalable, language-native framework for materials informatics.
我们研究了使用基于Transformer的语言模型(RoBERTa、T5和LLaMA)直接从文本表示预测半导体材料的带隙,这些文本表示编码了化学成分、晶系、空间群、每个晶胞的原子数、价电子数以及其他相关电子和结构特性等关键材料特征。DFT等量子化学模拟能提供准确的预测,但计算量很大,限制了其用于大规模材料筛选的可行性。浅层ML模型提供了更快的替代方案,但通常需要大量数据预处理,把非数值材料特征转换成结构化的数值输入,往往以丢失关键描述性信息为代价。相比之下,我们的方法利用预训练语言模型直接处理文本数据,无需手工特征工程。我们以两种格式构建材料描述:用一致模板组合关键特征的结构化字符串,以及使用ChatGPT API生成的自然语言叙述。对于每个模型,我们附加一个定制的回归头,并在一个经过整理的无机化合物数据集上进行任务特定的微调。结果表明,经过微调的语言模型,特别是仅解码器的LLaMA-3架构,在预测精度和灵活性方面可以超过传统方法,MAE为0.25 eV,R2为0.89,而最佳浅层ML基线的MAE为0.32 eV,R2为0.84。值得注意的是,LLaMA-3只需极少的微调即可达到有竞争力的精度,这表明其架构能为科学任务提供更具可迁移性的表示。这项工作证明了微调语言模型在科学性质预测方面的有效性,并为材料信息学提供了一个可扩展的、以语言为原生形式的框架。
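The two metrics quoted in this abstract, MAE (in eV) and R2, have standard definitions. The block below is a minimal sketch of those definitions in plain Python. The band-gap values are made-up illustration data, not results from the paper.

```python
def mae(y_true, y_pred):
    """Mean absolute error: the average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical band gaps in eV (illustration only).
true_gaps = [1.0, 2.0, 3.0]
pred_gaps = [1.0, 2.0, 4.0]
errors = (mae(true_gaps, pred_gaps), r2(true_gaps, pred_gaps))
```

MAE keeps the unit of the target, so 0.25 eV can be read directly as a typical band-gap error, while R2 is unitless and measures how much of the variance in the true values the predictions explain.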
Article 143
Title@2025-07-02 (3): Delving into Multilingual Ethical Bias: The MSQAD with Statistical Hypothesis Tests for Large Language Models
Title: Delving into Multilingual Ethical Bias: The MSQAD with Statistical Hypothesis Tests for Large Language Models | Mehrsprachige Ethische Bias: Der MSQAD mit statistischen Hypothesentests für große Sprachmodelle | 跳入多语言伦理比喻:高语言模型统计假设测试的MSQAD 2505.19121v2 |
Authors (3): Seunguk Yu, Juhwan Choi, Youngbin Kim
Despite the recent strides in large language models, studies have underscored the existence of social biases within these systems. In this paper, we delve into the validation and comparison of the ethical biases of LLMs concerning globally discussed and potentially sensitive topics, hypothesizing that these biases may arise from language-specific distinctions. Introducing the Multilingual Sensitive Questions & Answers Dataset (MSQAD), we collected news articles from Human Rights Watch covering 17 topics, and generated socially sensitive questions along with corresponding responses in multiple languages. We scrutinized the biases of these responses across languages and topics, employing two statistical hypothesis tests. The results showed that the null hypotheses were rejected in most cases, indicating biases arising from cross-language differences. It demonstrates that ethical biases in responses are widespread across various languages, and notably, these biases were prevalent even among different LLMs. By making the proposed MSQAD openly available, we aim to facilitate future research endeavors focused on examining cross-language biases in LLMs and their variant models.
尽管在大型语言模式方面最近取得了长足进步,但研究强调在这些体系中存在着社会偏见。在本文件中,我们研究了如何验证和比较LLMS在全球性讨论和潜在敏感议题方面的伦理偏见,假设这些偏见可能来自语言区分;介绍多语言敏感问题和答案数据集(MSQAD),我们从人权观察收集了17个专题的新闻报道,并用多种语言生成了社会敏感问题和相应的答复;我们利用两个统计假设测试,仔细检查了这些答复在语言和专题上的偏见;结果显示,无效假设在大多数情况下遭到拒绝,表明不同语言差异引起的偏见;它表明,在答复中普遍存在各种语言的道德偏见,特别是即使在不同的LMMS中也普遍存在这些偏见。我们公开提供拟议的MSQAD,目的是促进今后研究努力,重点是研究LMS及其变式模式中的跨语言偏见。
Article 144
Title@2025-07-02 (3): Multi-interaction TTS toward professional recording reproduction
Title: Multi-interaction TTS toward professional recording reproduction | Multi-Interaktion TTS für professionelle Aufnahmewiedergabe | 关于专业记录复制的多互动TTS 2507.00808v2 |
Authors (4): Hiroki Kanagawa, Kenichi Fujita, Aya Watanabe, Yusuke Ijima
Voice directors often iteratively refine voice actors’ performances by providing feedback to achieve the desired outcome. While this iterative feedback-based refinement process is important in actual recordings, it has been overlooked in text-to-speech synthesis (TTS). As a result, fine-grained style refinement after the initial synthesis is not possible, even though the synthesized speech often deviates from the user’s intended style. To address this issue, we propose a TTS method with multi-step interaction that allows users to intuitively and rapidly refine synthesized speech. Our approach models the interaction between the TTS model and its user to emulate the relationship between voice actors and voice directors. Experiments show that the proposed model with its corresponding dataset enables iterative style refinements in accordance with users’ directions, thus demonstrating its multi-interaction capability. Sample audios are available: https://ntt-hilab-gensp.github.io/ssw13multiinteractiontts/
语音导演经常通过提供反馈,通过提供反馈,反复完善语音行为体的绩效,以实现预期结果。虽然这种基于反馈的迭代完善过程在实际录音中很重要,但在文本到语音合成(TTS)中却被忽略了。因此,初步合成后微调风格的改进是不可能的,即使合成的演讲往往偏离用户的预期风格。为了解决这一问题,我们提出了一个具有多步骤互动的TTS方法,使用户能够直观和快速完善合成的语音。我们的方法模拟了TTS模型与其用户之间的相互作用,以模仿语音演员和语音导演之间的关系。实验显示,拟议的模型及其相应的数据集能够按照用户的方向进行迭接式的改进,从而展示其多重互动能力。有样本的音频:https://ntt-hilab-gensp.github.io/sw13multinteractts/
Article 145
Title@2025-07-02 (3): olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models
Title: olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models | olmOCR: Entsperren von Tillionen von Token in PDFs mit Vision Language Models | olmOCR:用愿景语言模型在PDF中解锁数万亿托肯 2502.18443v3 |
Authors (9): Jake Poznanski, Aman Rangapur, Jon Borchardt, Jason Dunkelberger, Regan Huff, Daniel Lin, Christopher Wilhelm, Kyle Lo, Luca Soldaini
PDF documents have the potential to provide trillions of novel, high-quality tokens for training language models. However, these documents come in a diversity of types with differing formats and visual layouts that pose a challenge when attempting to extract and faithfully represent the underlying content for language model use. Traditional open source tools often produce lower quality extractions compared to vision language models (VLMs), but reliance on the best VLMs can be prohibitively costly (e.g., over 6,240 USD per million PDF pages for GPT-4o) or infeasible if the PDFs cannot be sent to proprietary APIs. We present olmOCR, an open-source toolkit for processing PDFs into clean, linearized plain text in natural reading order while preserving structured content like sections, tables, lists, equations, and more. Our toolkit runs a fine-tuned 7B vision language model (VLM) trained on olmOCR-mix-0225, a sample of 260,000 pages from over 100,000 crawled PDFs with diverse properties, including graphics, handwritten text and poor quality scans. olmOCR is optimized for large-scale batch processing, able to scale flexibly to different hardware setups and can convert a million PDF pages for only 176 USD. To aid comparison with existing systems, we also introduce olmOCR-Bench, a curated set of 1,400 PDFs capturing many content types that remain challenging even for the best tools and VLMs, including formulas, tables, tiny fonts, old scans, and more. We find olmOCR outperforms even top VLMs including GPT-4o, Gemini Flash 2 and Qwen-2.5-VL. We openly release all components of olmOCR: our fine-tuned VLM model, training code and data, an efficient inference pipeline that supports vLLM and SGLang backends, and benchmark olmOCR-Bench.
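PDF文件有可能为训练语言模型提供数万亿个新的高质量token。然而,这些文件类型多样,格式和视觉布局各不相同,在试图提取并忠实呈现其底层内容以供语言模型使用时构成挑战。与视觉语言模型(VLMs)相比,传统的开源工具往往产生质量较低的提取结果,但依赖最好的VLMs成本可能高得令人望而却步(例如,使用GPT-4o处理每百万PDF页超过6,240美元),而且如果PDF无法发送给专有API,这种做法就不可行。我们提出了olmOCR,这是一个开源工具包,用于将PDF处理成符合自然阅读顺序的干净、线性化的纯文本,同时保留章节、表格、列表、公式等结构化内容。我们的工具包运行一个经过微调的7B视觉语言模型(VLM),该模型在olmOCR-mix-0225上训练,这是从10万多份抓取的PDF中抽取的26万页样本,这些PDF具有多种特性,包括图形、手写文本和低质量扫描件。olmOCR针对大规模批处理进行了优化,能够灵活适应不同的硬件配置,转换一百万PDF页仅需176美元。为了便于与现有系统比较,我们还推出了olmOCR-Bench,这是一组经过整理的1,400份PDF,涵盖了即使对最好的工具和VLMs来说仍具挑战性的多种内容类型,包括公式、表格、小字体、旧扫描件等。我们发现olmOCR甚至优于GPT-4o、Gemini Flash 2和Qwen-2.5-VL等顶级VLMs。我们公开发布olmOCR的所有组件:经过微调的VLM模型、训练代码和数据、支持vLLM和SGLang后端的高效推理流程,以及基准olmOCR-Bench。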
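The cost figures in this abstract can be compared with a few lines of arithmetic. The sketch below uses only the two per-million-page rates stated above. The helper name `cost_usd` is hypothetical, and the GPT-4o figure is a lower bound because the abstract says "over 6,240 USD".

```python
def cost_usd(pages, usd_per_million_pages):
    """Cost of converting `pages` PDF pages at a flat per-million-page rate."""
    return pages * usd_per_million_pages / 1_000_000


OLMOCR_RATE = 176   # USD per million pages, from the abstract
GPT4O_RATE = 6240   # USD per million pages, a lower bound ("over 6,240")

# How many times cheaper olmOCR is at these rates (at least this factor).
ratio = GPT4O_RATE / OLMOCR_RATE
```

At these rates the gap is a factor of roughly 35, which is the scale of saving behind the abstract's claim that relying on the best proprietary VLMs can be prohibitively costly for large collections.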
Article 146
Title@2025-07-02 (3): Direct Quantized Training of Language Models with Stochastic Rounding
Title: Direct Quantized Training of Language Models with Stochastic Rounding | Direkte Quantisierte Schulung von Sprachmodellen mit stochastischer Rundung | 直接量化的语言模式直接量化培训,并进行盘点四舍四入 2412.04787v2 |
Authors (6): Kaiyan Zhao, Tsuguchika Tabaru, Kenichi Kobayashi, Takumi Honda, Masafumi Yamazaki, Yoshimasa Tsuruoka
Although recent quantized Large Language Models (LLMs), such as BitNet, have paved the way for significant reduction in memory usage during deployment with binary or ternary weights, training these models still demands substantial memory footprints. This is partly because high-precision (i.e., unquantized) weights required for straight-through estimation must be maintained throughout the whole training process. To address this, we explore directly updating the quantized low-precision weights without relying on straight-through estimation during backpropagation, aiming to save memory usage during training. Specifically, we employ a stochastic rounding technique to minimize the information loss caused by the use of low-bit weights throughout training. Experimental results on our LLaMA-structured models of various sizes indicate that (1) training with only low-precision weights is feasible even when they are constrained to ternary values; (2) extending the bit width to 8 bits achieves performance on par with BitNet b1.58; (3) our models remain robust to precision scaling and memory reduction, showing minimal performance degradation when moving from FP32 to lower-memory environments (BF16/FP8); and (4) our models also support inference using ternary weights, showcasing their flexibility in deployment.
尽管最近的量化大语言模型(LLM),例如BitNet,为在部署中使用二值或三值权重大幅减少内存用量铺平了道路,但训练这些模型仍需要大量内存,部分原因是在整个训练过程中必须保留直通估计所需的高精度(即未量化)权重。为此,我们探索在反向传播期间不依赖直通估计、直接更新量化的低精度权重,以节省训练期间的内存用量。具体而言,我们采用随机舍入技术,以尽量减少在整个训练过程中使用低比特权重造成的信息损失。我们在不同规模的LLaMA结构模型上的实验结果表明:(1) 即使权重被限制为三值,仅使用低精度权重进行训练也是可行的;(2) 将位宽扩展到8位可达到与BitNet b1.58相当的性能;(3) 我们的模型对精度缩放和内存缩减保持稳健,从FP32转向更低内存的环境(BF16/FP8)时性能下降极小;(4) 我们的模型还支持使用三值权重进行推理,展示了其部署灵活性。
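The stochastic rounding idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' training code: the helper names `stochastic_round` and `sr_update`, the integer ternary grid {-1, 0, 1}, and the plain SGD-style step are all assumptions made for illustration. The point it shows is that rounding up with probability equal to the fractional part keeps the rounded value correct in expectation, so small updates are not systematically lost the way they are with round-to-nearest.

```python
import math
import random

def stochastic_round(x: float, rng: random.Random) -> int:
    """Round x down or up at random so that the expected result equals x."""
    lower = math.floor(x)
    frac = x - lower              # distance to the lower grid point, in [0, 1)
    return lower + (1 if rng.random() < frac else 0)

def sr_update(weight: int, grad: float, lr: float, rng: random.Random,
              lo: int = -1, hi: int = 1) -> int:
    """One hypothetical update step that keeps the weight on the ternary grid."""
    target = weight - lr * grad   # the high-precision value exists only transiently
    return max(lo, min(hi, stochastic_round(target, rng)))

rng = random.Random(0)
samples = [stochastic_round(0.3, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)   # close to 0.3, whereas round(0.3) is always 0
```

In this toy setting only the low-precision weight is stored between steps, which is the memory saving the abstract describes.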
Article 147
Title@2025-07-02 (3): MassTool: A Multi-Task Search-Based Tool Retrieval Framework for Large Language Models
Title: MassTool: A Multi-Task Search-Based Tool Retrieval Framework for Large Language Models | MassTool: Ein Multi-Task Search-Based Tool Retrieval Framework für große Sprachmodelle | MassTool:一个用于大语言模型的多任务搜索工具检索框架 2507.00487v2 |
Authors (8): Jianghao Lin, Xinyuan Wang, Xinyi Dai, Menghui Zhu, Bo Chen, Ruiming Tang, Yong Yu, Weinan Zhang
Tool retrieval is a critical component in enabling large language models (LLMs) to interact effectively with external tools. It aims to precisely filter the massive tools into a small set of candidates for the downstream tool-augmented LLMs. However, most existing approaches primarily focus on optimizing tool representations, often neglecting the importance of precise query comprehension. To address this gap, we introduce MassTool, a multi-task search-based framework designed to enhance both query representation and tool retrieval accuracy. MassTool employs a two-tower architecture: a tool usage detection tower that predicts the need for function calls, and a tool retrieval tower that leverages a query-centric graph convolution network (QC-GCN) for effective query-tool matching. It also incorporates search-based user intent modeling (SUIM) to handle diverse and out-of-distribution queries, alongside an adaptive knowledge transfer (AdaKT) module for efficient multi-task learning. By jointly optimizing tool usage detection loss, list-wise retrieval loss, and contrastive regularization loss, MassTool establishes a robust dual-step sequential decision-making pipeline for precise query understanding. Extensive experiments demonstrate its effectiveness in improving retrieval accuracy. Our code is available at https://github.com/wxydada/MassTool.
工具检索是使大语言模型(LLM)能够与外部工具有效交互的关键组成部分,其目的是从海量工具中精确筛选出一小组候选工具,供下游的工具增强型LLM使用。然而,大多数现有方法主要侧重于优化工具表示,往往忽视了精确理解查询的重要性。为弥补这一不足,我们提出MassTool,这是一个基于搜索的多任务框架,旨在同时提升查询表示和工具检索的准确性。MassTool采用双塔架构:一个工具使用检测塔,用于预测是否需要函数调用;一个工具检索塔,利用以查询为中心的图卷积网络(QC-GCN)实现有效的查询-工具匹配。它还引入基于搜索的用户意图建模(SUIM)来处理多样化和分布外的查询,并配有自适应知识迁移(AdaKT)模块以实现高效的多任务学习。通过联合优化工具使用检测损失、列表式检索损失和对比正则化损失,MassTool建立了一个稳健的两步顺序决策流程,以实现精确的查询理解。大量实验证明了其在提高检索准确性方面的有效性。我们的代码可在 https://github.com/wxydada/MassTool 获取。
Article 148
Title@2025-07-02 (3): Pre-training Large Memory Language Models with Internal and External Knowledge
Title: Pre-training Large Memory Language Models with Internal and External Knowledge | Vorschulung großer Speicher Sprachmodelle mit internem und externem Wissen | 具有内部和外部知识的大型记忆语言模型 2505.15962v2 |
Authors (8): Linxi Zhao, Sofian Zalouk, Christian K. Belardi, Justin Lovelace, Jin Peng Zhou, Kilian Q. Weinberger, Yoav Artzi, Jennifer J. Sun
Neural language models are black-boxes – both linguistic patterns and factual knowledge are distributed across billions of opaque parameters. This entangled encoding makes it difficult to reliably inspect, verify, or update specific facts. We propose a new class of language models, Large Memory Language Models (LMLM) with a pre-training recipe that stores factual knowledge in both internal weights and an external database. Our approach strategically masks externally retrieved factual values from the training loss, thereby teaching the model to perform targeted lookups rather than relying on memorization in model weights. Our experiments demonstrate that LMLMs achieve competitive performance compared to significantly larger, knowledge-dense LLMs on standard benchmarks, while offering the advantages of explicit, editable, and verifiable knowledge bases. This work represents a fundamental shift in how language models interact with and manage factual knowledge.
神经语言模型是黑盒子 – – 语言模式和事实知识分布于数十亿个不透明的参数中。这种纠缠在一起的编码使得难以可靠地检查、核实或更新具体事实。我们提出了一种新的语言模型,即大记忆语言模型(LMLMM),其培训前配方既储存内部重量上的实际知识,又储存外部数据库。我们的方法从战略上掩盖了外部从培训损失中获取的事实价值,从而教育模型进行有针对性的调查,而不是依赖模型重量的记忆。我们的实验表明,LMLMMS取得了与大得多、知识密集的标准化标准标准标准标准标准标准LMS相比的竞争性性能,同时提供了明确、可编辑和可核查的知识基础的优势。这项工作代表了语言模型如何与事实知识互动和管理的基本转变。
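The loss-masking step described in the abstract above can be sketched as follows. This is a minimal illustration, not the LMLM training recipe: the function name `masked_nll`, the per-token probability list, and the boolean mask marking externally retrieved values are assumptions. It shows that tokens filled in from the database contribute nothing to the loss, so the model is not pushed to memorize them.

```python
import math

def masked_nll(token_probs, is_retrieved_value):
    """Average negative log-likelihood over tokens that are NOT masked out.

    token_probs[i]        -- model probability of the i-th target token
    is_retrieved_value[i] -- True if token i was filled in from the external
                             database (assumed to be excluded from the loss)
    """
    kept = [-math.log(p) for p, masked in zip(token_probs, is_retrieved_value)
            if not masked]
    return sum(kept) / len(kept) if kept else 0.0

# Toy sequence: the third token is a looked-up value the model predicts poorly.
probs = [0.9, 0.8, 0.01, 0.9]
mask = [False, False, True, False]
loss_masked = masked_nll(probs, mask)         # ignores the looked-up token
loss_full = masked_nll(probs, [False] * 4)    # penalizes it heavily
```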
Article 149
Title@2025-07-02 (3): KatFishNet: Detecting LLM-Generated Korean Text through Linguistic Feature Analysis
Title: KatFishNet: Detecting LLM-Generated Korean Text through Linguistic Feature Analysis | KatFishNet: LLM-generierter koreanischer Text durch Linguistik-Feature-Analyse erkennen | KatFishNet:通过语言特征分析检测LLM-发光的韩文文本 2503.00032v4 |
Authors (4): Shinwoo Park, Shubin Kim, Do-Kyung Kim, Yo-Sub Han
The rapid advancement of large language models (LLMs) increases the difficulty of distinguishing between human-written and LLM-generated text. Detecting LLM-generated text is crucial for upholding academic integrity, preventing plagiarism, protecting copyrights, and ensuring ethical research practices. Most prior studies on detecting LLM-generated text focus primarily on English text. However, languages with distinct morphological and syntactic characteristics require specialized detection approaches. Their unique structures and usage patterns can hinder the direct application of methods primarily designed for English. Among such languages, we focus on Korean, which has relatively flexible spacing rules, a rich morphological system, and less frequent comma usage compared to English. We introduce KatFish, the first benchmark dataset for detecting LLM-generated Korean text. The dataset consists of text written by humans and generated by four LLMs across three genres. By examining spacing patterns, part-of-speech diversity, and comma usage, we illuminate the linguistic differences between human-written and LLM-generated Korean text. Building on these observations, we propose KatFishNet, a detection method specifically designed for the Korean language. KatFishNet achieves an average of 19.78% higher AUROC compared to the best-performing existing detection method. Our code and data are available at https://github.com/Shinwoo-Park/detecting_llm_generated_korean_text_through_linguistic_analysis.
大语言模型(LLM)的快速发展增加了区分人类撰写文本与LLM生成文本的难度。检测LLM生成的文本对于维护学术诚信、防止抄袭、保护版权和确保合乎伦理的研究实践至关重要。此前关于检测LLM生成文本的研究大多侧重于英语文本。然而,具有独特形态和句法特征的语言需要专门的检测方法,其独特的结构和使用模式可能妨碍主要为英语设计的方法的直接应用。在这类语言中,我们关注韩语:与英语相比,韩语的空格规则相对灵活,形态系统丰富,逗号使用频率较低。我们提出KatFish,这是首个用于检测LLM生成的韩语文本的基准数据集。该数据集由人类撰写的文本和四个LLM生成的文本组成,涵盖三种体裁。通过考察空格模式、词性多样性和逗号使用情况,我们阐明了人类撰写与LLM生成的韩语文本之间的语言差异。基于这些观察,我们提出KatFishNet,一种专为韩语设计的检测方法。与表现最好的现有检测方法相比,KatFishNet的AUROC平均高出19.78%。我们的代码和数据可在 https://github.com/Shinwoo-Park/detecting_llm_generated_korean_text_through_linguistic_analysis 获取。
Article 150
Title@2025-07-02 (3): LEDOM: An Open and Fundamental Reverse Language Model
Title: LEDOM: An Open and Fundamental Reverse Language Model | LEDOM: Ein offenes und grundlegendes Reverse Language Modell | LEDOM: 开放和基本反向语言模式 2507.01335v1 |
Authors (9): Xunjian Yin, Sitao Cheng, Yuxi Xie, Xinyu Hu, Li Lin, Xinyi Wang, Liangming Pan, William Yang Wang, Xiaojun Wan
We introduce LEDOM, the first purely reverse language model, trained autoregressively on 435B tokens with 2B and 7B parameter variants, which processes sequences in reverse temporal order through previous token prediction. For the first time, we present the reverse language model as a potential foundational model across general tasks, accompanied by a set of intriguing examples and insights. Based on LEDOM, we further introduce a novel application: Reverse Reward, where LEDOM-guided reranking of forward language model outputs leads to substantial performance improvements on mathematical reasoning tasks. This approach leverages LEDOM’s unique backward reasoning capability to refine generation quality through posterior evaluation. Our findings suggest that LEDOM exhibits unique characteristics with broad application potential. We will release all models, training code, and pre-training data to facilitate future research.
我们引入了第一种纯反向语言模式LEDOM, 即第一种纯反向语言模式,在435B号标牌上用2B和7B参数变量进行自动递增培训,通过以前的象征性预测处理逆向时间顺序的顺序;我们第一次将反向语言模式作为贯穿一般任务的潜在基本模式,同时提出一套令人感兴趣的实例和见解;根据LEDOM,我们进一步引入了一种新应用:反向向,即LEDOM引导的前方语言模型产出的排序导致数学推理任务业绩的大幅改进;这一方法利用LEDOM独特的后向推理能力,通过后传评估改进生产质量;我们的调查结果表明,LEDOM具有广泛应用潜力的独特特征;我们将发布所有模型、培训代码和培训前数据,以便利今后的研究。
Article 151
Title@2025-07-02 (3): La RoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation
Title: La RoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation | La RoSA: Steigerung der LLM-Effizienz durch schichtweise rotierte Sparse-Aktivierung | La RoSA:通过图层旋转的分散启动提高LLM效率 2507.01299v1 |
Authors (7): Kai Liu, Bowen Xu, Shaoyu Wu, Xin Chen, Hao Zhou, Yongliang Tao, Lulu Hu
Activation sparsity can reduce the computational overhead and memory transfers during the forward pass of Large Language Model (LLM) inference. Existing methods face limitations, either demanding time-consuming recovery training that hinders real-world adoption, or relying on empirical magnitude-based pruning, which causes fluctuating sparsity and unstable inference speed-up. This paper introduces LaRoSA (Layerwise Rotated Sparse Activation), a novel method for activation sparsification designed to improve LLM efficiency without requiring additional training or magnitude-based pruning. We leverage layerwise orthogonal rotations to transform input activations into rotated forms that are more suitable for sparsification. By employing a Top-K selection approach within the rotated activations, we achieve consistent model-level sparsity and reliable wall-clock time speed-up. LaRoSA is effective across various sizes and types of LLMs, demonstrating minimal performance degradation and robust inference acceleration. Specifically, for LLaMA2-7B at 40% sparsity, LaRoSA achieves a mere 0.17 perplexity gap with a consistent 1.30x wall-clock time speed-up, and reduces the accuracy gap in zero-shot tasks compared to the dense model to just 0.54%, while surpassing TEAL by 1.77% and CATS by 17.14%.
激活稀疏性可以减少大语言模型(LLM)推理前向传播过程中的计算开销和内存传输。现有方法存在局限:要么需要耗时的恢复训练,阻碍实际应用;要么依赖基于经验幅值的剪枝,导致稀疏度波动和推理加速不稳定。本文提出LaRoSA(Layerwise Rotated Sparse Activation,逐层旋转稀疏激活),这是一种新的激活稀疏化方法,旨在无需额外训练或基于幅值的剪枝即可提高LLM效率。我们利用逐层正交旋转,将输入激活变换为更适合稀疏化的旋转形式。通过在旋转后的激活中采用Top-K选择方法,我们实现了一致的模型级稀疏度和可靠的实际运行时间加速。LaRoSA在各种规模和类型的LLM上均有效,性能下降极小,推理加速稳健。具体而言,对于40%稀疏度的LLaMA2-7B,LaRoSA的困惑度差距仅为0.17,实际运行时间稳定加速1.30倍,并将零样本任务上与稠密模型的准确率差距缩小到仅0.54%,同时比TEAL高出1.77%,比CATS高出17.14%。
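The rotate-then-Top-K idea from the abstract above can be sketched in a few lines. This is an illustrative toy, not the LaRoSA implementation: the scaled Hadamard matrix `Q` stands in for whatever layerwise rotation the method derives, the weight matrix `W` is made up, and folding the rotation into the weights as `W Q^T` is an assumption. Because `Q` is orthogonal, `(W Q^T)(Q x) = W x`, so the only approximation comes from zeroing entries of the rotated activation, and Top-K fixes the number of kept entries per input.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def top_k_sparsify(v, k):
    """Keep the k largest-magnitude entries of v and zero the rest."""
    keep = set(sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(v)]

# Q: a 4x4 orthogonal matrix (scaled Hadamard), a stand-in for a layerwise rotation.
Q = [[0.5, 0.5, 0.5, 0.5],
     [0.5, -0.5, 0.5, -0.5],
     [0.5, 0.5, -0.5, -0.5],
     [0.5, -0.5, -0.5, 0.5]]
W = [[1.0, 2.0, 0.0, -1.0],
     [0.5, 0.0, 3.0, 1.0]]
W_rot = matmul(W, transpose(Q))   # W Q^T, computed once ahead of time

def rotated_sparse_forward(x, k):
    z = top_k_sparsify(matvec(Q, x), k)   # sparsify in the rotated basis
    return matvec(W_rot, z), z

x = [1.0, -2.0, 0.5, 3.0]
dense = matvec(W, x)
y_full, _ = rotated_sparse_forward(x, 4)       # k = dimension: no entries dropped
y_half, z_half = rotated_sparse_forward(x, 2)  # 50% sparsity, exactly 2 entries kept
```

Unlike a magnitude threshold, the Top-K rule keeps the same number of entries for every input, which is what makes the sparsity level (and hence the speed-up) predictable.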
Article 152
Title@2025-07-02 (3): Frustratingly Simple Retrieval Improves Challenging, Reasoning-Intensive Benchmarks
Title: Frustratingly Simple Retrieval Improves Challenging, Reasoning-Intensive Benchmarks | Frustrierend Einfaches Retrieval verbessert anspruchsvolle, vernünftig-intensive Benchmarks | 令人沮丧的简单检索改进挑战、理由说明和密集基准 2507.01297v1 |
Authors (5): Xinxi Lyu, Michael Duan, Rulin Shao, Pang Wei Koh, Sewon Min
Retrieval-augmented Generation (RAG) has primarily been studied in limited settings, such as factoid question answering; more challenging, reasoning-intensive benchmarks have seen limited success from minimal RAG. In this work, we challenge this prevailing view on established, reasoning-intensive benchmarks: MMLU, MMLU Pro, AGI Eval, GPQA, and MATH. We identify a key missing component in prior work: a usable, web-scale datastore aligned with the breadth of pretraining data. To this end, we introduce CompactDS: a diverse, high-quality, web-scale datastore that achieves high retrieval accuracy and subsecond latency on a single node. The key insights are (1) most web content can be filtered out without sacrificing coverage, and a compact, high-quality subset is sufficient; and (2) combining in-memory approximate nearest neighbor (ANN) retrieval and on-disk exact search balances speed and recall. Using CompactDS, we show that a minimal RAG pipeline achieves consistent accuracy improvements across all benchmarks and model sizes (8B–70B), with relative gains of 10% on MMLU, 33% on MMLU Pro, 14% on GPQA, and 19% on MATH. No single data source suffices alone, highlighting the importance of diversity of sources (web crawls, curated math, academic papers, textbooks). Finally, we show that our carefully designed in-house datastore matches or outperforms web search engines such as Google Search, as well as recently proposed, complex agent-based RAG systems, all while maintaining simplicity, reproducibility, and self-containment. We release CompactDS and our retrieval pipeline, supporting future research exploring retrieval-based AI systems.
检索增强生成(RAG)主要在有限的场景中得到研究,例如事实型问答;在更具挑战性的推理密集型基准上,极简RAG的成效有限。在这项工作中,我们在MMLU、MMLU Pro、AGI Eval、GPQA和MATH等成熟的推理密集型基准上挑战这一主流观点。我们指出了先前工作中缺失的一个关键组成部分:一个与预训练数据广度相匹配的、可用的网络规模数据存储。为此,我们提出CompactDS:一个多样化、高质量的网络规模数据存储,可在单个节点上实现高检索准确率和亚秒级延迟。关键见解是:(1) 大部分网络内容可以在不牺牲覆盖面的情况下被过滤掉,一个紧凑、高质量的子集就已足够;(2) 将内存中的近似最近邻(ANN)检索与磁盘上的精确搜索相结合,可以兼顾速度和召回率。使用CompactDS,我们表明极简RAG流程在所有基准和模型规模(8B–70B)上都取得了一致的准确率提升,相对增益为MMLU 10%、MMLU Pro 33%、GPQA 14%、MATH 19%。没有任何单一数据源足以独立胜任,这凸显了数据源多样性(网络爬取数据、精选数学数据、学术论文、教科书)的重要性。最后,我们表明,我们精心设计的自建数据存储可以媲美或超越Google Search等网络搜索引擎以及近期提出的复杂的基于智能体的RAG系统,同时保持简洁性、可复现性和自包含性。我们发布CompactDS和我们的检索流程,以支持未来对基于检索的AI系统的研究。
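The second insight in the abstract above, a cheap approximate pass followed by exact search over a shortlist, can be sketched as a two-stage lookup. This is a toy under stated assumptions, not the CompactDS pipeline: the 1-bit `sign_code` is only a stand-in for a real ANN index, the function names are hypothetical, and the document vectors are made up.

```python
def sign_code(vec):
    """Coarse 1-bit-per-dimension code, standing in for a real ANN index."""
    return [1 if x >= 0 else -1 for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def two_stage_search(query, vectors, n_candidates, k):
    codes = [sign_code(v) for v in vectors]   # in practice built once and kept in memory
    q_code = sign_code(query)
    # Stage 1: cheap approximate scores over the whole collection.
    shortlist = sorted(range(len(vectors)),
                       key=lambda i: dot(q_code, codes[i]), reverse=True)[:n_candidates]
    # Stage 2: exact inner products, only for the shortlisted ids
    # (the full-precision vectors could live on disk).
    reranked = sorted(shortlist, key=lambda i: dot(query, vectors[i]), reverse=True)
    return reranked[:k]

docs = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [-0.5, 0.4, 0.3],
        [0.7, 0.6, -0.2], [0.1, -0.9, 0.4]]
query = [1.0, 0.5, -0.1]
top = two_stage_search(query, docs, n_candidates=3, k=2)
exact = sorted(range(len(docs)), key=lambda i: dot(query, docs[i]), reverse=True)[:2]
```

The trade-off is controlled by `n_candidates`: a larger shortlist costs more exact comparisons but lowers the chance that the approximate stage drops a true neighbor.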
Article 153
Title@2025-07-02 (3): Rethinking All Evidence: Enhancing Trustworthy Retrieval-Augmented Generation via Conflict-Driven Summarization
Title: Rethinking All Evidence: Enhancing Trustworthy Retrieval-Augmented Generation via Conflict-Driven Summarization | Alle Beweise neu denken: Vertrauenswürdige retrieval-angereicherte Generation durch konfliktgetriebene Zusammenfassung verbessern | 重新思考所有证据:通过冲突驱动的总结,加强可信赖的回溯可信赖的一代人 2507.01281v1 |
Authors (8): Juan Chen, Baolong Bi, Wei Zhang, Jingyan Sui, Xiaofei Zhu, Yuanzhuo Wang, Lingrui Mei, Shenghua Liu
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating their parametric knowledge with external retrieved content. However, knowledge conflicts caused by internal inconsistencies or noisy retrieved content can severely undermine the generation reliability of RAG systems. In this work, we argue that LLMs should rethink all evidence, including both retrieved content and internal knowledge, before generating responses. We propose CARE-RAG (Conflict-Aware and Reliable Evidence for RAG), a novel framework that improves trustworthiness through Conflict-Driven Summarization of all available evidence. CARE-RAG first derives parameter-aware evidence by comparing parameter records to identify diverse internal perspectives. It then refines retrieved evidence to produce context-aware evidence, removing irrelevant or misleading content. To detect and summarize conflicts, we distill a 3B LLaMA3.2 model to perform conflict-driven summarization, enabling reliable synthesis across multiple sources. To further ensure evaluation integrity, we introduce a QA Repair step to correct outdated or ambiguous benchmark answers. Experiments on revised QA datasets with retrieval data show that CARE-RAG consistently outperforms strong RAG baselines, especially in scenarios with noisy or conflicting evidence.
检索增强生成(RAG)通过将大语言模型(LLM)的参数化知识与外部检索内容相结合来增强其能力。然而,由内部不一致或含噪声的检索内容引起的知识冲突会严重削弱RAG系统的生成可靠性。在这项工作中,我们主张LLM在生成回答之前应重新思考所有证据,包括检索内容和内部知识。我们提出CARE-RAG(Conflict-Aware and Reliable Evidence for RAG),这是一个新颖的框架,通过对所有可用证据进行冲突驱动的摘要来提高可信度。CARE-RAG首先通过比较参数记录来识别多样的内部观点,从而得到参数感知证据;然后对检索到的证据进行精炼,去除不相关或误导性的内容,得到上下文感知证据。为了检测和总结冲突,我们蒸馏了一个3B的LLaMA3.2模型来执行冲突驱动的摘要,从而实现跨多个来源的可靠综合。为进一步确保评估的完整性,我们引入了QA修复步骤,以纠正过时或有歧义的基准答案。在带有检索数据的修订版QA数据集上的实验表明,CARE-RAG始终优于强大的RAG基线,尤其是在证据含噪声或相互冲突的场景中。
Article 154
Title@2025-07-02 (3): Evaluating Large Language Models for Multimodal Simulated Ophthalmic Decision-Making in Diabetic Retinopathy and Glaucoma Screening
Title: Evaluating Large Language Models for Multimodal Simulated Ophthalmic Decision-Making in Diabetic Retinopathy and Glaucoma Screening | Bewertung großer Sprachmodelle für multimodale simulierte ophthalmische Entscheidungsfindung in diabetischer Retinopathie und Glaukom-Screening | 评估糖尿病病理病理和青光眼筛查中多式模拟眼部模拟决策的大型语言模型 2507.01278v1 |
Authors (6): Cindy Lie Tabuse, David Restepo, Carolina Gracitelli, Fernando Korn Malerbi, Caio Regatieri, Luis Filipe Nakayama
Large language models (LLMs) can simulate clinical reasoning based on natural language prompts, but their utility in ophthalmology is largely unexplored. This study evaluated GPT-4’s ability to interpret structured textual descriptions of retinal fundus photographs and simulate clinical decisions for diabetic retinopathy (DR) and glaucoma screening, including the impact of adding real or synthetic clinical metadata. We conducted a retrospective diagnostic validation study using 300 annotated fundus images. GPT-4 received structured prompts describing each image, with or without patient metadata. The model was tasked with assigning an ICDR severity score, recommending DR referral, and estimating the cup-to-disc ratio for glaucoma referral. Performance was evaluated using accuracy, macro and weighted F1 scores, and Cohen’s kappa. McNemar’s test and change rate analysis were used to assess the influence of metadata. GPT-4 showed moderate performance for ICDR classification (accuracy 67.5%, macro F1 0.33, weighted F1 0.67, kappa 0.25), driven mainly by correct identification of normal cases. Performance improved in the binary DR referral task (accuracy 82.3%, F1 0.54, kappa 0.44). For glaucoma referral, performance was poor across all settings (accuracy ~78%, F1 <0.04, kappa <0.03). Metadata inclusion did not significantly affect outcomes (McNemar p > 0.05), and predictions remained consistent across conditions. GPT-4 can simulate basic ophthalmic decision-making from structured prompts but lacks precision for complex tasks. While not suitable for clinical use, LLMs may assist in education, documentation, or image annotation workflows in ophthalmology.
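大语言模型(LLM)可以根据自然语言提示模拟临床推理,但其在眼科中的效用在很大程度上尚未得到探索。本研究评估了GPT-4解读视网膜眼底照片的结构化文本描述,并模拟糖尿病视网膜病变(DR)和青光眼筛查临床决策的能力,包括添加真实或合成临床元数据的影响。我们使用300张带标注的眼底图像进行了回顾性诊断验证研究。GPT-4接收描述每张图像的结构化提示,其中包含或不包含患者元数据。模型的任务是给出ICDR严重程度评分、建议是否进行DR转诊,并估计用于青光眼转诊的杯盘比。使用准确率、宏平均和加权F1分数以及Cohen's kappa评估性能,并使用McNemar检验和变化率分析评估元数据的影响。GPT-4在ICDR分类上表现中等(准确率67.5%,宏平均F1 0.33,加权F1 0.67,kappa 0.25),这主要得益于对正常病例的正确识别。在二分类DR转诊任务中性能有所提高(准确率82.3%,F1 0.54,kappa 0.44)。对于青光眼转诊,所有设置下的性能都很差(准确率约78%,F1 <0.04,kappa <0.03)。加入元数据对结果没有显著影响(McNemar p > 0.05),各条件下的预测保持一致。GPT-4可以根据结构化提示模拟基本的眼科决策,但对复杂任务缺乏精度。虽然不适合临床使用,但LLM可以在眼科的教育、文档记录或图像标注工作流程中提供帮助。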
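The glaucoma result above (accuracy near 78% with F1 and kappa near zero) is the pattern these metrics produce when a classifier almost always predicts the majority class. The sketch below reproduces that pattern on made-up labels; the data are not from the study, and the 78/22 split is chosen only to mirror the reported accuracy. The metric definitions are the standard ones.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_class(y_true, y_pred, cls):
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

def cohen_kappa(y_true, y_pred):
    n = len(y_true)
    p_obs = accuracy(y_true, y_pred)
    classes = set(y_true) | set(y_pred)
    # Agreement expected by chance, from the two marginal label distributions.
    p_exp = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in classes)
    return (p_obs - p_exp) / (1 - p_exp) if p_exp < 1 else 0.0

# Made-up data: 78 "no referral" (0) and 22 "refer" (1); the model always predicts 0.
y_true = [0] * 78 + [1] * 22
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)                 # high, driven by the majority class
f1_refer = f1_for_class(y_true, y_pred, 1)     # zero: no referral case is caught
kappa = cohen_kappa(y_true, y_pred)            # zero: no agreement beyond chance
```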
Article 155
Title@2025-07-02 (3): $μ^2$Tokenizer: Differentiable Multi-Scale Multi-Modal Tokenizer for Radiology Report Generation
Title: $μ^2$Tokenizer: Differentiable Multi-Scale Multi-Modal Tokenizer for Radiology Report Generation | $μ^2$Tokenizer: Differentiable Multi-Scale Multi-Modal Tokenizer für Radiologie Report Generation | $μ^2$Tokenizer:用于放射学报告生成的可微分多尺度多模态分词器 2507.00316v2 |
Authors (7): Siyou Li, Pengyao Qin, Huanan Wu, Dong Nie, Arun J. Thirunavukarasu, Juntao Yu, Le Zhang
Automated radiology report generation (RRG) aims to produce detailed textual reports from clinical imaging, such as computed tomography (CT) scans, to improve the accuracy and efficiency of diagnosis and provision of management advice. RRG is complicated by two key challenges: (1) inherent complexity in extracting relevant information from imaging data under resource constraints, and (2) difficulty in objectively evaluating discrepancies between model-generated and expert-written reports. To address these challenges, we propose $\mu^2$LLM, a $\underline{\textbf{mu}}$ltiscale $\underline{\textbf{mu}}$ltimodal large language model for RRG tasks. The novel ${\mu}^2$Tokenizer, as an intermediate layer, integrates multi-modal features from the multiscale visual tokenizer and the text tokenizer, then enhances report generation quality through direct preference optimization (DPO), guided by GREEN-RedLlama. Experimental results on four large CT image-report medical datasets demonstrate that our method outperforms existing approaches, highlighting the potential of our fine-tuned $\mu^2$LLMs on limited data for RRG tasks. At the same time, for prompt engineering, we introduce a five-stage, LLM-driven pipeline that converts routine CT reports into paired visual-question-answer triples and citation-linked reasoning narratives, creating a scalable, high-quality supervisory corpus for explainable multimodal radiology LLMs. All code, datasets, and models will be publicly available in our official repository: https://github.com/Siyou-Li/u2Tokenizer
自动放射学报告生成(RRG)旨在根据临床影像(如计算机断层扫描(CT))生成详细的文本报告,以提高诊断和提供管理建议的准确性和效率。RRG面临两个关键挑战:(1) 在资源受限的情况下从影像数据中提取相关信息所固有的复杂性;(2) 难以客观评估模型生成的报告与专家撰写的报告之间的差异。为应对这些挑战,我们提出$\mu^2$LLM,一个用于RRG任务的多尺度($\underline{\textbf{mu}}$ltiscale)多模态($\underline{\textbf{mu}}$ltimodal)大语言模型。新颖的${\mu}^2$Tokenizer作为中间层,整合来自多尺度视觉分词器和文本分词器的多模态特征,然后在GREEN-RedLlama的指导下通过直接偏好优化(DPO)提升报告生成质量。在四个大型CT图像-报告医学数据集上的实验结果表明,我们的方法优于现有方法,凸显了我们微调后的$\mu^2$LLM在有限数据上完成RRG任务的潜力。同时,在提示工程方面,我们引入了一个由LLM驱动的五阶段流程,将常规CT报告转换为配对的视觉-问题-答案三元组和带引用链接的推理叙述,为可解释的多模态放射学LLM构建了一个可扩展的高质量监督语料库。所有代码、数据集和模型都将在我们的官方代码库中公开:https://github.com/Siyou-Li/u2Tokenizer
Article 156
Title@2025-07-02 (3): Combating Confirmation Bias: A Unified Pseudo-Labeling Framework for Entity Alignment
Title: Combating Confirmation Bias: A Unified Pseudo-Labeling Framework for Entity Alignment | Bekämpfung von Konfirmations-Bias: Ein einheitliches Pseudo-Labeling-Rahmenwerk für die Ausrichtung von Unternehmen | 打击确认的偏见:统一实体统一化框架 2307.02075v4 |
Authors (4): Qijie Ding, Jie Yin, Daokun Zhang, Junbin Gao
Entity alignment (EA) aims at identifying equivalent entity pairs across different knowledge graphs (KGs) that refer to the same real-world identity. To circumvent the shortage of seed alignments provided for training, recent EA models utilize pseudo-labeling strategies to iteratively add unaligned entity pairs predicted with high confidence to the seed alignments for model training. However, the adverse impact of confirmation bias during pseudo-labeling has been largely overlooked, thus hindering entity alignment performance. To systematically combat confirmation bias for pseudo-labeling-based entity alignment, we propose a Unified Pseudo-Labeling framework for Entity Alignment (UPL-EA) that explicitly eliminates pseudo-labeling errors to boost the accuracy of entity alignment. UPL-EA consists of two complementary components: (1) Optimal Transport (OT)-based pseudo-labeling uses discrete OT modeling as an effective means to determine entity correspondences and reduce erroneous matches across two KGs. An effective criterion is derived to infer pseudo-labeled alignments that satisfy one-to-one correspondences; (2) Parallel pseudo-label ensembling refines pseudo-labeled alignments by combining predictions over multiple models independently trained in parallel. The ensembled pseudo-labeled alignments are thereafter used to augment seed alignments to reinforce subsequent model training for alignment inference. The effectiveness of UPL-EA in eliminating pseudo-labeling errors is both theoretically supported and experimentally validated. Our extensive results and in-depth analyses demonstrate the superiority of UPL-EA over 15 competitive baselines and its utility as a general pseudo-labeling framework for entity alignment.
实体对齐(EA)旨在识别不同知识图谱(KG)中指向同一现实世界对象的等价实体对。为了缓解训练所需种子对齐不足的问题,最近的EA模型采用伪标签策略,迭代地将以高置信度预测的未对齐实体对加入种子对齐,用于模型训练。然而,伪标签过程中确认偏差的不利影响在很大程度上被忽视,从而制约了实体对齐的性能。为了系统地应对基于伪标签的实体对齐中的确认偏差,我们提出了一个用于实体对齐的统一伪标签框架(UPL-EA),它显式地消除伪标签错误,以提高实体对齐的准确性。UPL-EA由两个互补的部分组成:(1) 基于最优传输(OT)的伪标签,使用离散OT建模作为确定实体对应关系、减少两个KG之间错误匹配的有效手段,并推导出一个有效的准则来推断满足一对一对应的伪标签对齐;(2) 并行伪标签集成,通过组合多个并行独立训练的模型的预测来精炼伪标签对齐。集成后的伪标签对齐随后用于扩充种子对齐,以强化后续用于对齐推断的模型训练。UPL-EA在消除伪标签错误方面的有效性既有理论支持,也得到了实验验证。我们大量的结果和深入的分析表明,UPL-EA优于15个有竞争力的基线,并可作为实体对齐的通用伪标签框架。
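Why a one-to-one constraint helps against erroneous pseudo-labels can be seen on a tiny example. The sketch below is not the UPL-EA algorithm: it assumes two equal-sized entity sets with uniform mass, in which case discrete optimal transport has an optimal solution that is a permutation, and it finds that permutation by brute force. The similarity matrix and function names are made up for illustration.

```python
from itertools import permutations

def greedy_matches(sim):
    """Each source entity independently picks its most similar target."""
    return [max(range(len(row)), key=lambda j: row[j]) for row in sim]

def one_to_one_matches(sim):
    """Brute-force assignment maximizing total similarity (tiny inputs only)."""
    n = len(sim)
    best = max(permutations(range(n)),
               key=lambda perm: sum(sim[i][perm[i]] for i in range(n)))
    return list(best)

# sim[i][j]: similarity between entity i in the first KG and entity j in the second.
sim = [[0.90, 0.80, 0.10],
       [0.85, 0.20, 0.10],
       [0.10, 0.30, 0.70]]

greedy = greedy_matches(sim)        # sources 0 and 1 both claim target 0
aligned = one_to_one_matches(sim)   # the conflict is resolved globally
```

Independent nearest-neighbor matching lets two source entities claim the same target, so at least one of those pseudo-labels is wrong; the global assignment rules this out by construction. For realistic sizes an assignment or Sinkhorn-style solver would replace the brute-force search.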
Article 157
Title@2025-07-02 (3): GAIus: Combining Genai with Legal Clauses Retrieval for Knowledge-based Assistant
Title: GAIus: Combining Genai with Legal Clauses Retrieval for Knowledge-based Assistant | GAIus: Genai mit rechtlichen Klauseln verbinden Rückzug für wissensbasierte Assistentin | GAIus:将热奈与法律条款相结合,为知识型助理提供法律条款检索服务 2507.01259v1 |
Authors (2): Michał Matak, Jarosław A. Chudziak
In this paper we discuss the capability of large language models to ground their answers and provide proper references when dealing with the legal matters of a non-English- and non-Chinese-speaking country. We discuss the history of legal information retrieval, the difference between case law and statute law, and its impact on legal tasks, and we analyze the latest research in this field. Building on that background, we introduce gAIus, the architecture of a cognitive LLM-based agent whose responses are based on knowledge retrieved from a specific legal act, the Polish Civil Code. We propose a retrieval mechanism that is more explainable and human-friendly, and that achieves better results, than embedding-based approaches. To evaluate our method, we create a special dataset based on single-choice questions from entrance exams for law apprenticeships conducted in Poland. The proposed architecture critically leverages the abilities of the large language models used, improving gpt-3.5-turbo-0125 by 419%, allowing it to beat gpt-4o, and lifting the gpt-4o-mini score from 31% to 86%. At the end of the paper, we outline possible future research directions and potential applications of our findings.
在本文中,我们讨论了大型语言模型在回答和提供适当参考的能力,以处理非英语国家和非库克群岛国家的法律事项时,我们讨论了法律信息检索的历史、判例法与成文法之间的差异、其对法律任务的影响以及分析该领域的最新研究。基于这一背景,我们引入了基于认知LLM代理的架构GAIus,即基于认知LM代理的架构,其反应以从某些法律行为(即波兰民法)中获取的知识为基础。我们建议了一种检索机制,该机制比基于嵌入的方法更能解释、更方便人,并取得更好的结果。为了评估我们的方法,我们根据波兰法律学徒入学考试的单选问题创建了特殊数据集。拟议的架构以关键地利用了使用大语言模型的能力,改进了gpt-3.5-turbo-0125的419%,从而能够击打gpt-4o,将gpt-4o-miny分数从31%提高到86%。我们的文件结尾中,我们展示了未来可能的研究途径和我们研究结果的潜在应用。
Article 158
Title@2025-07-02 (3): Towards Safety Evaluations of Theory of Mind in Large Language Models
Title: Towards Safety Evaluations of Theory of Mind in Large Language Models | Zu Sicherheitsbewertungen der Geistestheorie in großen Sprachmodellen | 争取对大语言模式中思想理论进行安全评价 2506.17352v2 |
Authors (2): Tatsuhiro Aoshima, Mitsuaki Akiyama
As the capabilities of large language models (LLMs) continue to advance, the importance of rigorous safety evaluation is becoming increasingly evident. Recent concerns within the realm of safety assessment have highlighted instances in which LLMs exhibit behaviors that appear to disable oversight mechanisms and respond in a deceptive manner. For example, there have been reports suggesting that, when confronted with information unfavorable to their own persistence during task execution, LLMs may act covertly and even provide false answers to questions intended to verify their behavior. To evaluate the potential risk of such deceptive actions toward developers or users, it is essential to investigate whether these behaviors stem from covert, intentional processes within the model. In this study, we propose that it is necessary to measure the theory of mind capabilities of LLMs. We begin by reviewing existing research on theory of mind and identifying the perspectives and tasks relevant to its application in safety evaluation. Given that theory of mind has been predominantly studied within the context of developmental psychology, we analyze developmental trends across a series of open-weight LLMs. Our results indicate that while LLMs have improved in reading comprehension, their theory of mind capabilities have not shown comparable development. Finally, we present the current state of safety evaluation with respect to LLMs’ theory of mind, and discuss remaining challenges for future work.
随着大型语言模型(LLMS)的能力继续提高,严格的安全评价的重要性正在日益明显地显现出来。最近安全评估领域的关切突出了LLMS展示似乎使监督机制无法发挥作用并以欺骗性方式作出反应的行为的例子。例如,有报告表明,当面临不利于自己在任务执行期间坚持下去的信息时,LLMS可能会暗中行动,甚至对旨在核实其行为的问题提供虚假答案。为了评估这种欺骗性行动对开发商或用户的潜在风险,有必要调查这些行为是否来自模型内隐蔽、有意的过程。在本研究中,我们提议有必要衡量LMS的智力能力理论。我们首先审查关于思想理论的现有研究,并查明与在安全评价中应用该理论有关的观点和任务。鉴于这种思想理论主要是在发展心理学背景下研究,我们分析了一系列开放性LMS的发展趋势。我们的结果表明,LMS在阅读理解方面有所改进,但其思想能力理论没有表现出可比较的发展。我们先验的是,LMS目前的安全理论,我们讨论的是目前的安全状况。
Article 159
Title@2025-07-01 (2): The Medium Is Not the Message: Deconfounding Text Embeddings via Linear Concept Erasure
Title: The Medium Is Not the Message: Deconfounding Text Embeddings via Linear Concept Erasure | Das Medium ist nicht die Botschaft: Deconfounding Text-Embeddings via Linear Concept Erasure | 介质不是信息:通过线性概念时代的沉降文本嵌入 2507.01234v1 |
Authors (6): Yu Fan, Yang Tian, Shauli Ravfogel, Mrinmaya Sachan, Elliott Ash, Alexander Hoyle
Embedding-based similarity metrics between text sequences can be influenced not just by the content dimensions we most care about, but can also be biased by spurious attributes like the text’s source or language. These document confounders cause problems for many applications, but especially those that need to pool texts from different corpora. This paper shows that a debiasing algorithm that removes information about observed confounders from the encoder representations substantially reduces these biases at a minimal computational cost. Document similarity and clustering metrics improve across every embedding variant and task we evaluate – often dramatically. Interestingly, performance on out-of-distribution benchmarks is not impacted, indicating that the embeddings are not otherwise degraded.
文本序列之间基于嵌入的相似度度量不仅会受到我们最关心的内容维度的影响,而且还会受到像文本的来源或语言这样的虚假属性的偏向。这些文件混杂者对许多应用程序造成问题,特别是那些需要从不同的公司汇总文本的应用程序造成问题。本文表明,一种贬损的算法将观测到的混杂者的信息从编码器代表中剔除,大大降低了这些偏差,以最低的计算成本计算。我们所评价的每个嵌入变量和任务中,记录相似度和组合度量都改善了 – – 通常会非常明显。有趣的是,分配之外基准的性能没有受到影响,表明嵌入没有受到其他的退化。
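The effect of removing a confounder direction from embeddings can be shown with a deliberately simplified stand-in. The sketch below is not the erasure algorithm used in the paper: it only projects out the single direction that separates the two group means, and the 2-D embeddings, the `sources` labels, and the function names are made up. After the projection, the two source groups have the same mean, so that direction can no longer drive similarity scores.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def erase_direction(vectors, labels):
    """Project out the direction separating the two label groups' means.

    Assumes binary labels (0/1) and that the group means differ.
    """
    group0 = [v for v, l in zip(vectors, labels) if l == 0]
    group1 = [v for v, l in zip(vectors, labels) if l == 1]
    d = [a - b for a, b in zip(mean(group1), mean(group0))]
    norm_sq = dot(d, d)
    cleaned = [[x - dot(v, d) / norm_sq * di for x, di in zip(v, d)]
               for v in vectors]
    return cleaned, d

# Toy embeddings: coordinate 0 mostly encodes the text's source, coordinate 1 its content.
embeddings = [[1.0, 0.2], [1.2, 0.9], [-1.0, 0.3], [-0.8, 0.8]]
sources = [1, 1, 0, 0]
cleaned, direction = erase_direction(embeddings, sources)
```

In this toy case the content coordinate is left essentially untouched, which mirrors the abstract's observation that the embeddings are not otherwise degraded.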
Article 160
Title@2025-07-01 (2): MEGA: xLSTM with Multihead Exponential Gated Fusion for Precise Aspect-based Sentiment Analysis
Title: MEGA: xLSTM with Multihead Exponential Gated Fusion for Precise Aspect-based Sentiment Analysis | MEGA: xLSTM mit Multihead Exponential Gated Fusion für präzise aspektbasierte Sentimentanalyse | MEGA:xLSTM, 带有多头辐射光度G化聚合, 用于基于频谱的感应分析 2507.01213v1 |
Authors (5): Adamu Lawan, Juhua Pu, Haruna Yunusa, Jawad Muhammad, Muhammad Lawan
Aspect-based Sentiment Analysis (ABSA) is a critical Natural Language Processing (NLP) task that extracts aspects from text and determines their associated sentiments, enabling fine-grained analysis of user opinions. Existing ABSA methods struggle to balance computational efficiency with high performance: deep learning models often lack global context, transformers demand significant computational resources, and Mamba-based approaches face CUDA dependency and diminished local correlations. Recent advancements in Extended Long Short-Term Memory (xLSTM) models, particularly their efficient modeling of long-range dependencies, have significantly advanced the NLP community. However, their potential in ABSA remains untapped. To this end, we propose xLSTM with Multihead Exponential Gated Fusion (MEGA), a novel framework integrating a bi-directional mLSTM architecture with forward and partially flipped backward (PF-mLSTM) streams. The PF-mLSTM enhances localized context modeling by processing the initial sequence segment in reverse with dedicated parameters, preserving critical short-range patterns. We further introduce an mLSTM-based multihead cross exponential gated fusion mechanism (MECGAF) that dynamically combines forward mLSTM outputs as query and key with PF-mLSTM outputs as value, optimizing short-range dependency capture while maintaining global context and efficiency. Experimental results on three benchmark datasets demonstrate that MEGA outperforms state-of-the-art baselines, achieving superior accuracy and efficiency in ABSA tasks.
基于方面的情感分析(ABSA)是一项关键的自然语言处理(NLP)任务,它从文本中提取方面并确定其相关情感,从而实现对用户意见的细粒度分析。现有的ABSA方法难以在计算效率和高性能之间取得平衡:深度学习模型往往缺乏全局上下文,Transformer需要大量计算资源,而基于Mamba的方法则面临CUDA依赖和局部相关性减弱的问题。扩展长短期记忆(xLSTM)模型的最新进展,特别是其对长程依赖的高效建模,显著推动了NLP领域的发展。然而,它们在ABSA中的潜力仍未被发掘。为此,我们提出带有多头指数门控融合的xLSTM(MEGA),这是一个新颖的框架,将双向mLSTM架构与前向流和部分翻转的后向流(PF-mLSTM)相结合。PF-mLSTM使用专用参数反向处理序列的初始片段,从而增强局部上下文建模,保留关键的短程模式。我们进一步引入基于mLSTM的多头交叉指数门控融合机制(MECGAF),它以前向mLSTM的输出作为查询和键、以PF-mLSTM的输出作为值进行动态组合,在保持全局上下文和效率的同时优化短程依赖的捕获。在三个基准数据集上的实验结果表明,MEGA优于最先进的基线,在ABSA任务中实现了更高的准确率和效率。
Article 161
Title@2025-07-01 (2): A Survey on Uncertainty Quantification of Large Language Models: Taxonomy, Open Research Challenges, and Future Directions
Title: A Survey on Uncertainty Quantification of Large Language Models: Taxonomy, Open Research Challenges, and Future Directions | Eine Umfrage zur Unsicherheit Quantifizierung großer Sprachmodelle: Taxonomie, offene Forschungsherausforderungen und zukünftige Richtungen | 关于大语言模型不确定性量化调查:分类学、开放研究挑战和未来方向 2412.05563v2 |
Authors (5): Ola Shorinwa, Zhiting Mei, Justin Lidard, Allen Z. Ren, Anirudha Majumdar
The remarkable performance of large language models (LLMs) in content generation, coding, and common-sense reasoning has spurred widespread integration into many facets of society. However, integration of LLMs raises valid questions on their reliability and trustworthiness, given their propensity to generate hallucinations: plausible, factually-incorrect responses, which are expressed with striking confidence. Previous work has shown that hallucinations and other non-factual responses generated by LLMs can be detected by examining the uncertainty of the LLM in its response to the pertinent prompt, driving significant research efforts devoted to quantifying the uncertainty of LLMs. This survey seeks to provide an extensive review of existing uncertainty quantification methods for LLMs, identifying their salient features, along with their strengths and weaknesses. We present existing methods within a relevant taxonomy, unifying ostensibly disparate methods to aid understanding of the state of the art. Furthermore, we highlight applications of uncertainty quantification methods for LLMs, spanning chatbot and textual applications to embodied artificial intelligence applications in robotics. We conclude with open research challenges in uncertainty quantification of LLMs, seeking to motivate future research.
大型语言模型(LLMs)在内容生成、编码和常识推理方面的出色表现促使其广泛融入社会的许多方面,然而,LLMs的结合引起了关于这些模型的可靠性和可信赖性的合理问题,因为这些模型具有产生幻觉的倾向:有目共睹的、事实错误的反应,这些反应表现令人十分自信;过去的工作表明,通过审查LLMs对相关迅速反应的反应的不确定性,可以发现LLMs产生的幻觉和其他非事实反应;推动大量研究工作,专门量化LMs的不确定性;这项调查力求对LLMs现有的不确定性量化方法进行广泛审查,查明其特征及其优劣之处;我们在相关的分类中提出现有方法,将表面上截然不同的方法统一起来,以帮助了解艺术现状;此外,我们强调LMs对LMs的不确定性量化方法的应用,将聊天室和文字应用混杂在一起,以体现机器人的人工智能应用;我们最后指出,在LLMs的不确定性量化方面存在公开的研究挑战,以激励未来的研究。
Article 162
Title@2025-07-01 (2): Reasoning about Uncertainty: Do Reasoning Models Know When They Don’t Know?
Title: Reasoning about Uncertainty: Do Reasoning Models Know When They Don’t Know? | Vernunft über Ungewissheit: Wissen Vernunftmodelle, wenn sie es nicht wissen? | 关于不确定性的原因:理性模型知道他们不知道什么时候知道吗? 2506.18183v2 |
Authors (6): Zhiting Mei, Christina Zhang, Tenny Yin, Justin Lidard, Ola Shorinwa, Anirudha Majumdar
Reasoning language models have set state-of-the-art (SOTA) records on many challenging benchmarks, enabled by multi-step reasoning induced using reinforcement learning. However, like previous language models, reasoning models are prone to generating confident, plausible responses that are incorrect (hallucinations). Knowing when and how much to trust these models is critical to the safe deployment of reasoning models in real-world applications. To this end, we explore uncertainty quantification of reasoning models in this work. Specifically, we ask three fundamental questions: First, are reasoning models well-calibrated? Second, does deeper reasoning improve model calibration? Finally, inspired by humans’ innate ability to double-check their thought processes to verify the validity of their answers and their confidence, we ask: can reasoning models improve their calibration by explicitly reasoning about their chain-of-thought traces? We introduce introspective uncertainty quantification (UQ) to explore this direction. In extensive evaluations of SOTA reasoning models across a broad range of benchmarks, we find that reasoning models: (i) are typically overconfident, with self-verbalized confidence estimates often greater than 85%, particularly for incorrect responses, (ii) become even more overconfident with deeper reasoning, and (iii) can become better calibrated through introspection (e.g., o3-Mini and DeepSeek R1) but not uniformly (e.g., Claude 3.7 Sonnet becomes more poorly calibrated). Lastly, we conclude with important research directions to design necessary UQ benchmarks and improve the calibration of reasoning models.
推理语言模型借助通过强化学习诱导的多步推理,在许多具有挑战性的基准上创造了最先进(SOTA)的记录。然而,与以往的语言模型一样,推理模型容易生成自信、看似合理但不正确的回答(幻觉)。了解何时以及在多大程度上信任这些模型,对于在现实世界应用中安全部署推理模型至关重要。为此,我们在这项工作中探索推理模型的不确定性量化。具体而言,我们提出三个基本问题:第一,推理模型是否校准良好?第二,更深入的推理是否能改善模型校准?最后,受人类通过复查自身思维过程来验证答案有效性及其信心的天生能力启发,我们问:推理模型能否通过对其思维链轨迹进行显式推理来改善校准?我们引入内省式不确定性量化(UQ)来探索这一方向。在对SOTA推理模型进行的跨多种基准的广泛评估中,我们发现推理模型:(i)通常过度自信,其自我表述的置信度估计往往高于85%,对不正确的回答尤其如此;(ii)随着推理加深变得更加过度自信;(iii)可以通过内省得到更好的校准(例如o3-Mini和DeepSeek R1),但并非一律如此(例如Claude 3.7 Sonnet的校准变得更差)。最后,我们总结了设计必要的UQ基准和改进推理模型校准的重要研究方向。
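The calibration question raised in this abstract can be made concrete with a small sketch. The abstract does not say which calibration metric the authors use; the block below assumes the widely used expected calibration error (ECE) with equal-width bins, and the function name and inputs are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin.

    confidences: self-reported confidences in [0, 1] (assumed input format).
    correct: booleans, True where the model's answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have equal length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes to last bin
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece
```

An overconfident model, as described in finding (i), reports high confidence on answers that are often wrong, which shows up as a large gap between mean confidence and accuracy in the top bins.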
Article 163
Title@2025-07-01 (2): STELLA: Self-Evolving LLM Agent for Biomedical Research
Title: STELLA: Self-Evolving LLM Agent for Biomedical Research | STELLA: Selbstständiger LLM-Agent für biomedizinische Forschung | STELLA: 生物医学研究代理公司 2507.02004v1 |
Authors (4): Ruofan Jin, Zaixi Zhang, Mengdi Wang, Le Cong
The rapid growth of biomedical data, tools, and literature has created a fragmented research landscape that outpaces human expertise. While AI agents offer a solution, they typically rely on static, manually curated toolsets, limiting their ability to adapt and scale. Here, we introduce STELLA, a self-evolving AI agent designed to overcome these limitations. STELLA employs a multi-agent architecture that autonomously improves its own capabilities through two core mechanisms: an evolving Template Library for reasoning strategies and a dynamic Tool Ocean that expands as a Tool Creation Agent automatically discovers and integrates new bioinformatics tools. This allows STELLA to learn from experience. We demonstrate that STELLA achieves state-of-the-art accuracy on a suite of biomedical benchmarks, scoring approximately 26% on Humanity’s Last Exam: Biomedicine, 54% on LAB-Bench: DBQA, and 63% on LAB-Bench: LitQA, outperforming leading models by up to 6 percentage points. More importantly, we show that its performance systematically improves with experience; for instance, its accuracy on the Humanity’s Last Exam benchmark almost doubles with increased trials. STELLA represents a significant advance towards AI Agent systems that can learn and grow, dynamically scaling their expertise to accelerate the pace of biomedical discovery.
生物医学数据、工具和文献的迅速增长造成了研究格局的碎片化,其发展速度超出了人类专业知识所能跟上的范围。尽管AI智能体提供了一种解决方案,但它们通常依赖静态的、人工整理的工具集,限制了其适应和扩展的能力。在这里,我们介绍STELLA,这是一个旨在克服这些限制的自我进化AI智能体。STELLA采用多智能体架构,通过两个核心机制自主提升自身能力:一个不断演进的用于推理策略的模板库(Template Library),以及一个动态的工具海洋(Tool Ocean),后者随着工具创建智能体(Tool Creation Agent)自动发现并整合新的生物信息学工具而不断扩展。这使STELLA能够从经验中学习。我们证明STELLA在一组生物医学基准上达到了最先进的准确率,在Humanity’s Last Exam: Biomedicine上得分约26%,在LAB-Bench: DBQA上得分54%,在LAB-Bench: LitQA上得分63%,比领先模型最多高出6个百分点。更重要的是,我们表明其性能会随着经验的积累而系统性地提升;例如,随着试验次数的增加,其在Humanity’s Last Exam基准上的准确率几乎翻倍。STELLA代表了朝着能够学习和成长的AI智能体系统迈出的重要一步,这类系统可以动态扩展其专业能力,以加快生物医学发现的步伐。
Article 164
Title@2025-07-01 (2): Matching and Linking Entries in Historical Swedish Encyclopedias
Title: Matching and Linking Entries in Historical Swedish Encyclopedias | Passende und verbindende Einträge in historischen schwedischen Enzyklopädien | 瑞典历史百科全书中的匹配和链接条目 2507.01170v1 |
Authors (3): Simon Börjesson, Erik Ersmark, Pierre Nugues
The Nordisk familjebok is a Swedish encyclopedia from the 19th and 20th centuries. It was written by a team of experts and aimed to be an intellectual reference, stressing precision and accuracy. This encyclopedia had four main editions remarkable for their size, ranging from 20 to 38 volumes. As a consequence, the Nordisk familjebok had a considerable influence in universities, schools, the media, and society overall. As new editions were released, the selection of entries and their content evolved, reflecting intellectual changes in Sweden. In this paper, we used digitized versions from Project Runeberg. We first resegmented the raw text into entries and matched pairs of entries between the first and second editions using semantic sentence embeddings. We then extracted the geographical entries from both editions using a transformer-based classifier and linked them to Wikidata. This enabled us to identify geographic trends and possible shifts between the first and second editions, written in 1876-1899 and 1904-1926, respectively. Interpreting the results, we observe a small but significant shift in geographic focus away from Europe and towards North America, Africa, Asia, Australia, and northern Scandinavia from the first to the second edition, confirming the influence of the First World War and the rise of new powers. The code and data are available on GitHub at https://github.com/sibbo/nordisk-familjebok.
《Nordisk familjebok》是一部出版于19世纪和20世纪的瑞典百科全书,由一组专家编写,旨在成为一部知识性参考书,强调精确和准确。这部百科全书有四个主要版本,规模可观,从20卷到38卷不等。因此,《Nordisk familjebok》对大学、学校、媒体和整个社会都有相当大的影响。随着新版本的发行,条目的选择及其内容不断演变,反映了瑞典的知识变迁。在本文中,我们使用了来自Project Runeberg的数字化版本。我们首先将原始文本重新切分为条目,并使用语义句子嵌入对第一版和第二版之间的条目进行配对。随后,我们使用基于Transformer的分类器从两个版本中提取地理条目,并将其链接到Wikidata。这使我们能够识别第一版和第二版(分别写于1876-1899年和1904-1926年)之间的地理趋势和可能的变化。解读结果时,我们观察到从第一版到第二版,地理重点出现了小而显著的转移:远离欧洲,转向北美、非洲、亚洲、澳大利亚和斯堪的纳维亚北部,这印证了第一次世界大战的影响和新兴大国的崛起。代码和数据可在GitHub上获取:https://github.com/sibbo/nordisk-familjebok。
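The entry-matching step described above can be sketched as nearest-neighbour search over sentence embeddings. This is a minimal illustration and not the authors' pipeline: the embeddings are assumed to be precomputed by some sentence encoder, the toy vectors and the 0.8 threshold are invented for the example, and `match_entries` is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_entries(first_edition, second_edition, threshold=0.8):
    """Pair each first-edition entry with its most similar second-edition entry.

    Both arguments map a headword to its (precomputed) embedding vector.
    Pairs whose best similarity falls below `threshold` are left unmatched.
    """
    matches = []
    for head_a, vec_a in first_edition.items():
        best_head, best_sim = None, -1.0
        for head_b, vec_b in second_edition.items():
            sim = cosine(vec_a, vec_b)
            if sim > best_sim:
                best_head, best_sim = head_b, sim
        if best_head is not None and best_sim >= threshold:
            matches.append((head_a, best_head, best_sim))
    return matches


# Toy 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
first = {"Stockholm": [1.0, 0.0, 0.1], "Upsala": [0.0, 1.0, 0.0]}
second = {
    "Stockholm": [0.9, 0.1, 0.1],
    "Uppsala": [0.1, 0.95, 0.0],
    "Kongo": [0.0, 0.0, 1.0],
}
pairs = [(a, b) for a, b, _ in match_entries(first, second)]
```

Matching on embeddings rather than on exact headwords lets entries pair up even when the spelling changed between editions, as in the hypothetical "Upsala"/"Uppsala" pair above.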
Article 165
Title@2025-07-01 (2): Event-based evaluation of abstractive news summarization
Title: Event-based evaluation of abstractive news summarization | Eventbasierte Auswertung der abstrakten News-Zusammenfassung | 以活动为基础对抽象新闻摘要总结的评价 2507.01160v1 |
Authors (4): Huiling You, Samia Touileb, Erik Velldal, Lilja Øvrelid
An abstractive summary of a news article contains its most important information in a condensed version. The evaluation of automatically generated summaries by generative language models relies heavily on human-authored summaries as gold references, by calculating overlapping units or similarity scores. News articles report events, and ideally so should the summaries. In this work, we propose to evaluate the quality of abstractive summaries by calculating overlapping events between generated summaries, reference summaries, and the original news articles. We experiment on a richly annotated Norwegian dataset comprising both events annotations and summaries authored by expert human annotators. Our approach provides more insight into the event information contained in the summaries.
一篇新闻文章的抽象摘要包含其最重要的简缩版信息。通过计算重叠的单位或相似的分数,对基因化语言模型自动生成的摘要进行大量评价,这在很大程度上依赖于作为黄金参考的人类著作摘要。新闻文章报道事件,最好是摘要应该如此。在这项工作中,我们建议通过计算生成的摘要、参考摘要和原始新闻文章之间的重叠事件来评估抽象摘要的质量。我们试验一个由事件说明和专家人类演讲者编写的摘要组成的充满注释的挪威数据集。我们的方法更深入地了解摘要中包含的活动信息。
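One way to picture "calculating overlapping events" is set overlap between the events found in a generated summary and those in a reference. The sketch below assumes each event is reduced to a hashable tuple and that matching is exact; the abstract does not specify the event representation or matching criterion, so both are assumptions for illustration.

```python
def event_overlap(generated_events, reference_events):
    """Precision, recall, and F1 of generated events against reference events.

    Each event is a hashable tuple, e.g. (event_type, main_argument).
    Exact tuple equality is used as the (assumed) matching criterion.
    """
    generated = set(generated_events)
    reference = set(reference_events)
    shared = generated & reference

    precision = len(shared) / len(generated) if generated else 0.0
    recall = len(shared) / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

The same function can be applied with the original article's events as the reference, which gives a view of how much of the source's event content a summary retains.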
Article 166
Title@2025-07-01 (2): Squat: Quant Small Language Models on the Edge
Title: Squat: Quant Small Language Models on the Edge | Squat: Quant kleine Sprachmodelle am Rand | Squat: 边缘上的量化小语言模型 2402.10787v2 |
Authors (12): Xuan Shen, Peiyan Dong, Zhenglun Kong, Yifan Gong, Changdi Yang, Zhaoyang Han, Yanyue Xie, Lei Lu, Cheng Lyu, Chao Wu, Yanzhi Wang, Pu Zhao
A growing trend has emerged in designing high-quality Small Language Models (SLMs) with a few million parameters. This trend is driven by the increasing concerns over cloud costs, privacy, and latency. Considering that full parameter training is feasible for SLMs on mobile devices, Quantization-Aware Training (QAT) is employed to improve efficiency by reducing computational overhead and memory footprint. However, previous QAT works adopt fine-grained quantization methods to compress models with billions of parameters on GPUs, incompatible with current commodity hardware, such as mobile and edge devices, which rely on Single Instruction Multiple Data (SIMD) instructions. Thus, the generalization of these methods to SLMs on mobile devices is limited. In this paper, we propose Squat, an effective QAT framework with deployable quantization for SLMs on mobile devices. Specifically, we propose entropy-guided and distribution-aligned distillation to mitigate the distortion of attention information from quantization. Besides, we employ sub-8-bit token adaptive quantization, assigning varying bit widths to different tokens based on their importance. Furthermore, we develop a SIMD-based Multi-Kernel Mixed-Precision (MKMP) multiplier to support sub-8-bit mixed-precision MAC on mobile devices. Our extensive experiments verify the substantial improvements of our method compared to other QAT methods across various datasets. Furthermore, we achieve an on-device speedup of up to 2.37x compared with FP16 counterparts, signaling a great advancement. Code: https://github.com/shawnricecake/squant
设计仅有数百万参数的高质量小型语言模型(SLM)已成为一种日益增长的趋势,其驱动因素是人们对云成本、隐私和延迟的日益关切。考虑到在移动设备上对SLM进行全参数训练是可行的,可以采用量化感知训练(QAT),通过减少计算开销和内存占用来提高效率。然而,以往的QAT工作采用细粒度量化方法,在GPU上压缩具有数十亿参数的模型,与移动和边缘设备等依赖单指令多数据(SIMD)指令的当前商用硬件不兼容。因此,这些方法向移动设备上的SLM的推广是有限的。在本文中,我们提出Squat,这是一个面向移动设备上SLM、采用可部署量化的有效QAT框架。具体而言,我们提出熵引导和分布对齐的蒸馏,以减轻量化造成的注意力信息失真。此外,我们采用低于8比特的token自适应量化,根据token的重要性为其分配不同的位宽。我们还开发了基于SIMD的多核混合精度(MKMP)乘法器,以支持移动设备上低于8比特的混合精度MAC运算。大量实验验证了我们的方法在多个数据集上相比其他QAT方法的显著改进。此外,与FP16对应模型相比,我们实现了最高2.37倍的设备端加速。代码:https://github.com/shawnricecake/squant
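The idea of assigning different bit widths to tokens by importance can be illustrated with a toy fake-quantization routine. This is a sketch under stated assumptions, not the Squat implementation: the symmetric uniform quantizer, the 6-bit/3-bit split, the `keep_ratio` rule, and the importance scores are all hypothetical choices made for the example.

```python
def quantize_symmetric(values, bits):
    """Round values to a signed uniform grid with `bits` bits, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def token_adaptive_quantize(tokens, importance, high_bits=6, low_bits=3, keep_ratio=0.5):
    """Quantize the most important tokens with more bits than the rest.

    tokens: list of per-token activation vectors.
    importance: one score per token (how it is computed is left open here).
    Returns (quantized_tokens, bits_per_token).
    """
    if len(tokens) != len(importance):
        raise ValueError("need exactly one importance score per token")
    n_high = max(1, int(len(tokens) * keep_ratio)) if tokens else 0
    ranked = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)
    high = set(ranked[:n_high])

    bits_per_token = [high_bits if i in high else low_bits for i in range(len(tokens))]
    quantized = [quantize_symmetric(tok, b) for tok, b in zip(tokens, bits_per_token)]
    return quantized, bits_per_token


def total_abs_error(original, quantized):
    """Sum of absolute element-wise differences between two vectors."""
    return sum(abs(a - b) for a, b in zip(original, quantized))
```

With two identical token vectors and different importance scores, the higher-scored token keeps the finer grid and therefore the smaller rounding error, which is the trade-off the abstract describes.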
Article 167
Title@2025-07-01 (2): Self-reflective Uncertainties: Do LLMs Know Their Internal Answer Distribution?
Title: Self-reflective Uncertainties: Do LLMs Know Their Internal Answer Distribution? | Selbstreflektierende Unsicherheiten: Kennen LLMs ihre interne Antwortverteilung? | 自我反感的不确定性:LLMs知道他们的内部答案分布吗? 2505.20295v2 |
Authors (6): Michael Kirchhof, Luca Füger, Adam Goliński, Eeshan Gunesh Dhekane, Arno Blaas, Sinead Williamson
To reveal when a large language model (LLM) is uncertain about a response, uncertainty quantification commonly produces percentage numbers along with the output. But is this all we can do? We argue that in the output space of LLMs, the space of strings, there exist strings expressive enough to summarize the distribution over output strings the LLM deems possible. We lay a foundation for this new avenue of uncertainty explication and present SelfReflect, a theoretically-motivated metric to assess how faithfully a string summarizes an LLM’s internal answer distribution. We show that SelfReflect is able to discriminate even subtle differences of candidate summary strings and that it aligns with human judgement, outperforming alternative metrics such as LLM judges and embedding comparisons. With SelfReflect, we investigate a number of self-summarization methods and find that even state-of-the-art reasoning models struggle to explicate their internal uncertainty. But we find that faithful summarizations can be generated by sampling and summarizing. To support the development of this universal form of LLM uncertainties, we publish our metric at https://github.com/apple/ml-selfreflect
当一个大语言模型(LLM)对一个响应不确定时,不确定性量化通常会产生百分数和产出。但是,我们只能这样做吗?我们争辩说,在LLM的输出空间、字符空间中,存在着足以概括LLM认为可能的输出字符串分布的字符串。我们为这种新的不确定性解释和提出自我反省的新途径打下了基础,这是一种具有理论动机的衡量标准,用以评估字符串如何忠实地总结LLM的内部答案分布。我们表明,自我反省能够区分候选人摘要字符串的细微差异,而且它符合人类判断,优于LLM法官等其他指标,并比人作比较。我们通过自我反省,调查了一些自我反省方法,发现甚至状态推理模型也试图解释内部不确定性。但我们发现,通过取样和总结,可以产生忠实的对等的比较。为了支持发展LLMM不确定性的普遍形式,我们在 https://github.com/apple/ml-sel-selfreclect 上公布了我们的指标。
Article 168
Title@2025-07-01 (2): Divergent Creativity in Humans and Large Language Models
Title: Divergent Creativity in Humans and Large Language Models | Unterschiedliche Kreativität in Menschen und großen Sprachmodellen | 人类和大语言模式的不同创造性 2405.13012v2 |
Authors (8): Antoine Bellemare-Pepin, François Lespinasse, Philipp Thölke, Yann Harel, Kory Mathewson, Jay A. Olson, Yoshua Bengio, Karim Jerbi
The recent surge of Large Language Models (LLMs) has led to claims that they are approaching a level of creativity akin to human capabilities. This idea has sparked a blend of excitement and apprehension. However, a critical piece that has been missing in this discourse is a systematic evaluation of LLMs’ semantic diversity, particularly in comparison to human divergent thinking. To bridge this gap, we leverage recent advances in computational creativity to analyze semantic divergence in both state-of-the-art LLMs and a substantial dataset of 100,000 humans. We found evidence that LLMs can surpass average human performance on the Divergent Association Task, and approach human creative writing abilities, though they fall short of the typical performance of highly creative humans. Notably, even the top performing LLMs are still largely surpassed by highly creative individuals, underscoring a ceiling that current LLMs still fail to surpass. Our human-machine benchmarking framework addresses the polemic surrounding the imminent replacement of human creative labour by AI, disentangling the quality of the respective creative linguistic outputs using established objective measures. While prompting deeper exploration of the distinctive elements of human inventive thought compared to those of AI systems, we lay out a series of techniques to improve their outputs with respect to semantic diversity, such as prompt design and hyper-parameter tuning.
最近,大语言模型(LLMS)的激增导致人们声称,它们正在接近与人的能力相似的创造力水平。这一想法已经引发了兴奋和恐惧的混合。然而,本次演讲中缺少的关键内容是对LLMS的语义多样性的系统评估,特别是相对于人类的不同思维而言。为了缩小这一差距,我们利用最近在计算创造力方面的进展来分析最先进的LLMS和大量10万人的数据集中的语义差异。我们发现有证据表明,LMS可以超越不同协会任务中人类的平均表现,并接近人类的创造性写作能力,尽管它们不及于具有高度创造性的人的典型表现。值得注意的是,即使最优秀的LMMS仍然被高度创造性的人大大超过,这突出表明了目前LMS仍然无法超过的上限。我们的人类机器基准框架解决了围绕即将用AI取代人类创造性劳动力的两极性关系,用既定的客观措施模糊了各自创造性语言产出的质量。我们一方面推动更深入地探索人类有创意的思想要素,而另一方面却没有达到高度创造性的多样化,另一方面,我们又把设计出一系列技术的升级。
Article 169
Title@2025-07-01 (2): BioPars: A Pretrained Biomedical Large Language Model for Persian Biomedical Text Mining
Title: BioPars: A Pretrained Biomedical Large Language Model for Persian Biomedical Text Mining | BioPars: Ein vorgebildetes biomedizinisches Großsprachmodell für persischen biomedizinischen Textbergbau | BioPars:波斯生物医学材料开采的预先培训的生物医学大语言模型 2506.21567v2 |
Authors (6): Baqer M. Merzah, Tania Taami, Salman Asoudeh, Saeed Mirzaee, Amir reza Hossein pour, Amir Ali Bengari
Large Language Models (LLMs) have recently gained attention in the life sciences due to their capacity to model, extract, and apply complex biological information. Beyond their classical use as chatbots, these systems are increasingly used for complex analysis and problem-solving in specialized fields, including bioinformatics. First, we introduce BIOPARS-BENCH, a dataset from over 10,000 scientific articles, textbooks, and medical websites. We also introduce BioParsQA, a set of 5,231 Persian medical questions and answers, to evaluate the proposed model. This study then introduces BioPars, a simple but accurate measure designed to assess LLMs for three main abilities: acquiring subject-specific knowledge, interpreting and synthesizing such knowledge, and demonstrating proper evidence. Comparing ChatGPT, Llama, and Galactica, our study highlights their ability to remember and retrieve learned knowledge but also reveals shortcomings in addressing higher-level, real-world questions and fine-grained inferences. These findings indicate the need for further fine-tuning to address the capabilities of LLMs in bioinformatics tasks. To our knowledge, BioPars is the first application of an LLM in Persian medical QA, especially for generating long answers. Evaluation of four selected medical QA datasets shows that BioPars has achieved remarkable results compared to the other approaches. The model on BioParsQA achieved a ROUGE-L score of 29.99, which is an improvement over GPT-4 1.0. The model achieved a BERTScore of 90.87 with the MMR method. Its MoverScore (60.43) and BLEURT (50.78) values were also higher than those of the other three models. BioPars is an ongoing project and all resources related to its development will be made available via the following GitHub repository: https://github.com/amirap80/BioPars.
大型语言模型(LLMS)最近因其建模、提取和应用复杂生物信息的能力而在生命科学中受到关注。这些系统除了古典用作聊天室外,还越来越多地用于包括生物信息学在内的专门领域的复杂分析和解决问题。首先,我们引入了BIOPARS-BENCH,这是来自10 000多篇科学文章、教科书和医学网站的数据集。还引入了BiParsQA来评价拟议的模型,其中包括5 231波斯医质和答案。该研究随后引入了BiPars,这是一个简单而准确的措施,用来评估三种主要能力:获得特定主题知识、解释和合成这些知识,并展示适当证据。CompationGOPT、Llama和Galactica,我们的研究突出了它们回忆和检索所学知识的能力,但也揭示了在处理更高层次、真实世界问题和精确的模型错误方面的种种缺点。这些研究结果表明,需要进一步调整LMRPARS的能力,在生物信息模型的数学任务中进行更精确的处理。在生物信息模型中,已经实现的三部数据数据数据在进行中进行。
Article 170
Title@2025-07-01 (2): SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks
Title: SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks | SciArena: Eine offene Bewertungsplattform für Stiftungsmodelle in wissenschaftlichen Literaturaufgaben | SciArena:科学文献任务基础模型公开评价平台 2507.01001v1 |
Authors (18): Yilun Zhao, Kaiyan Zhang, Tiansheng Hu, Sihong Wu, Ronan Le Bras, Taira Anderson, Jonathan Bragg, Joseph Chee Chang, Jesse Dodge, Matt Latzke, Yixin Liu, Charles McGrady, Xiangru Tang, Zihang Wang, Chen Zhao, Hannaneh Hajishirzi, Doug Downey, Arman Cohan
We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature tasks. Unlike traditional benchmarks for scientific literature understanding and synthesis, SciArena engages the research community directly, following the Chatbot Arena evaluation approach of community voting on model comparisons. By leveraging collective intelligence, SciArena offers a community-driven evaluation of model performance on open-ended scientific tasks that demand literature-grounded, long-form responses. The platform currently supports 23 open-source and proprietary foundation models and has collected over 13,000 votes from trusted researchers across diverse scientific domains. We analyze the data collected so far and confirm that the submitted questions are diverse, aligned with real-world literature needs, and that participating researchers demonstrate strong self-consistency and inter-annotator agreement in their evaluations. We discuss the results and insights based on the model ranking leaderboard. To further promote research in building model-based automated evaluation systems for literature tasks, we release SciArena-Eval, a meta-evaluation benchmark based on our collected preference data. The benchmark measures the accuracy of models in judging answer quality by comparing their pairwise assessments with human votes. Our experiments highlight the benchmark’s challenges and emphasize the need for more reliable automated evaluation methods.
我们介绍了科学文献任务基础模型评估的开放和合作平台SciArena。与科学文献理解和综合的传统基准不同,SciArena在社区就模型比较投票的Chatbot Arena评估方法之后,直接与研究界接触。SciArena利用集体智慧,对要求文献基础和长式答复的开放式科学任务模型绩效进行社区驱动的评价。平台目前支持23个开放源和专有基础模型,从不同科学领域的受信任研究人员那里收集了13 000多张选票。我们分析了迄今收集的数据,确认提交的问题多种多样,与现实世界的文献需求相一致,参与的研究人员在其评估中表现出强烈的自我一致性和相互协调的共识。我们根据模型排名头板讨论结果和见解。为了进一步促进建立基于文献任务基于文献基础的自动评价系统的研究,我们发布了SciArena-Eval,这是基于我们收集的优惠数据的一项元评价基准。我们的基准测量了模型在判断回答质量时的准确性,将它们与人类的票数进行比较。我们进行试验时强调基准的挑战和强调比较的可靠方法。
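The SciArena-Eval measurement described above, comparing a model's pairwise assessments with human votes, reduces to an agreement rate in its simplest form. The sketch below assumes each comparison is labelled "A" or "B" for the preferred response; the label format and the function name are assumptions, and the benchmark itself may handle ties or other cases differently.

```python
def judge_accuracy(human_votes, judge_votes):
    """Fraction of pairwise comparisons where the judge agrees with the human vote.

    Both lists hold one label per comparison (assumed here to be "A" or "B").
    """
    if len(human_votes) != len(judge_votes):
        raise ValueError("need one judge vote per human vote")
    if not human_votes:
        raise ValueError("no comparisons to score")
    agreements = sum(1 for h, j in zip(human_votes, judge_votes) if h == j)
    return agreements / len(human_votes)
```

A judge that picks a side at random would score about 0.5 on balanced two-way comparisons, which gives a useful floor when reading such accuracies.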
Article 171
Title@2025-07-01 (2): Capturing Visualization Design Rationale
Title: Capturing Visualization Design Rationale | Capturing Visualization Design Rationale | 模拟可视化设计 2506.16571v2 |
Authors (5): Maeve Hutchinson, Radu Jianu, Aidan Slingsby, Jo Wood, Pranava Madhyastha
Prior natural language datasets for data visualization have focused on tasks such as visualization literacy assessment, insight generation, and visualization generation from natural language instructions. These studies often rely on controlled setups with purpose-built visualizations and artificially constructed questions. As a result, they tend to prioritize the interpretation of visualizations, focusing on decoding visualizations rather than understanding their encoding. In this paper, we present a new dataset and methodology for probing visualization design rationale through natural language. We leverage a unique source of real-world visualizations and natural language narratives: literate visualization notebooks created by students as part of a data visualization course. These notebooks combine visual artifacts with design exposition, in which students make explicit the rationale behind their design decisions. We also use large language models (LLMs) to generate and categorize question-answer-rationale triples from the narratives and articulations in the notebooks. We then carefully validate the triples and curate a dataset that captures and distills the visualization design choices and corresponding rationales of the students.
用于数据可视化的先前自然语言数据集侧重于诸如可视化读写评估、视觉生成和自然语言指令的可视化生成等任务,这些研究往往依赖有控制的设置,带有目的制造的可视化和人工构建的问题。因此,它们倾向于优先考虑可视化的解释,侧重于可视化的解码,而不是理解其编码。在本文中,我们提出了一个新的数据集和方法,用于通过自然语言进行可视化设计的合理原理。我们利用了一个独特的真实世界可视化和自然语言描述来源:学生作为数据可视化课程的一部分创建的识字可视化笔记本。这些笔记本将视觉艺术品与设计展示结合起来,学生在设计决定时清楚地说明其设计理由。我们还使用大型语言模型(LLMs)从笔记本的叙述和表达中生成和分类问题解析三联词。然后我们仔细验证三词,并整理一个数据集,以捕捉和提取可视化学生可视化设计选择和相应原理。
Article 172
Title@2025-07-01 (2): Flow-Modulated Scoring for Semantic-Aware Knowledge Graph Completion
Title: Flow-Modulated Scoring for Semantic-Aware Knowledge Graph Completion | Flow-Modulated Scoring für semantisch-bewusste Wissensgraphenvervollständigung | 用于语义智能知识图补全的流动移动模型拼图 2506.23137v2 |
Authors (4): Siyuan Li, Ruitong Liu, Yan Wen, Te Sun
Effective modeling of multifaceted relations is pivotal for Knowledge Graph Completion (KGC). However, a majority of existing approaches are predicated on static, embedding-based scoring, exhibiting inherent limitations in capturing contextual dependencies and relational dynamics. Addressing this gap, we propose the Flow-Modulated Scoring (FMS) framework. FMS comprises two principal components: (1) a semantic context learning module that encodes context-sensitive entity representations, and (2) a conditional flow-matching module designed to learn the dynamic transformation from a head to a tail embedding, governed by the aforementioned context. The resultant predictive vector field, representing the context-informed relational path, serves to dynamically refine the initial static score of an entity pair. Through this synergy of context-aware static representations and conditioned dynamic information, FMS facilitates a more profound modeling of relational semantics. Comprehensive evaluations on several standard benchmarks demonstrate that our proposed method surpasses prior state-of-the-art results.
对多方面关系进行有效建模对于完成知识图(KGC)至关重要。然而,大多数现有方法基于静态、嵌入式的评分,在捕捉背景依赖性和关系动态方面表现出内在的局限性。我们针对这一差距建议流动模型框架。FMS由两个主要部分组成:(1) 将环境敏感实体的表示形式编码的语义背景学习模块;(2) 有条件的流动匹配模块,旨在学习由头部到尾部的动态嵌入式变化,由上述背景管理。由此产生的预测矢量字段代表了背景知情关系路径,有助于动态地完善一个实体对口的初步静态评分。通过背景认知静态表和有条件动态信息的协同作用,FMS为更深入的关系语义表达模式提供了便利。对若干标准基准的全面评价表明,我们拟议的方法超过了以往的状态结果。
Article 173
Title@2025-07-01 (2): La Leaderboard: A Large Language Model Leaderboard for Spanish Varieties and Languages of Spain and Latin America
Title: La Leaderboard: A Large Language Model Leaderboard for Spanish Varieties and Languages of Spain and Latin America | La Leaderboard: Großes Sprachmodell für spanische Sorten und Sprachen Spaniens und Lateinamerikas | 领头板:西班牙和拉丁美洲西班牙语品种和语言大语言示范板 2507.00999v1 |
Authors (25): María Grandury, Javier Aula-Blasco, Júlia Falcão, Clémentine Fourrier, Miguel González, Gonzalo Martínez, Gonzalo Santamaría, Rodrigo Agerri, Nuria Aldama, Luis Chiruzzo, Javier Conde, Helena Gómez, Marta Guerrero, Guido Ivetta, Natalia López, Flor Miriam Plaza-del-Arco, María Teresa Martín-Valdivia, Helena Montoro, Carmen Muñoz, Pedro Reviriego, Leire Rosado, Alejandro Vaca, María Estrella Vallecillo-Rodríguez, Jorge Vallego, Irune Zubiaga
Leaderboards showcase the current capabilities and limitations of Large Language Models (LLMs). To motivate the development of LLMs that represent the linguistic and cultural diversity of the Spanish-speaking community, we present La Leaderboard, the first open-source leaderboard to evaluate generative LLMs in languages and language varieties of Spain and Latin America. La Leaderboard is a community-driven project that aims to establish an evaluation standard for everyone interested in developing LLMs for the Spanish-speaking community. This initial version combines 66 datasets in Basque, Catalan, Galician, and different Spanish varieties, showcasing the evaluation results of 50 models. To encourage community-driven development of leaderboards in other languages, we explain our methodology, including guidance on selecting the most suitable evaluation setup for each downstream task. In particular, we provide a rationale for using fewer few-shot examples than typically found in the literature, aiming to reduce environmental impact and facilitate access to reproducible results for a broader research community.
领头板展示了大语言模式(LLMS)目前的能力和局限性。为了激励代表西班牙语社区语言和文化多样性的LLMs的发展,我们提出La Leadboard(La Leadboard),这是第一个评价西班牙和拉丁美洲语言和语言品种的基因化LMS的开放源头板。La Leadboard是一个社区驱动的项目,旨在为有兴趣为西班牙语社区开发LLMS的每一个人建立一个评价标准。这个最初版本综合了巴斯克、加泰罗尼亚、加利西亚和不同西班牙品种的66个数据集,展示了50个模型的评价结果。为了鼓励以其他语言在社区驱动下开发领导板,我们解释了我们的方法,包括选择每个下游任务最合适的评价设置的指南。特别是,我们提供了使用比文献中通常少的少量实例的理由,目的是减少环境影响,便利更广泛的研究界获得可推广的成果。
Article 174
Title@2025-07-01 (2): Should We Still Pretrain Encoders with Masked Language Modeling?
Title: Should We Still Pretrain Encoders with Masked Language Modeling? | Sollten wir noch Encoder mit maskierten Sprachmodellen vortrainieren? | 我们是否仍应该为带有隐蔽语言建模的编程者预作准备? 2507.00994v1 |
Authors (8): Hippolyte Gisserot-Boukhlef, Nicolas Boizard, Manuel Faysse, Duarte M. Alves, Emmanuel Malherbe, André F. T. Martins, Céline Hudelot, Pierre Colombo
Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM objective or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 30 models ranging from 210 million to 1 billion parameters, and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability. Building on these findings, we experimentally show that a biphasic training strategy that sequentially applies CLM and then MLM achieves optimal performance under a fixed computational training budget. Moreover, we demonstrate that this strategy becomes more appealing when initializing from readily available pretrained CLM models (from the existing LLM ecosystem), reducing the computational burden needed to train best-in-class encoder models. We release all project artifacts at https://hf.co/MLMvsCLM to foster further research.
虽然编码器前培训历来依赖隐蔽语言模型(MLM),但最近有证据表明,通过Causal语言模型(CLM)培训的解码模型可以有效地重新定位为编译器,常常在文本代表基准方面超过传统的编码器。然而,这些收益是否反映了CLM目标的内在优势,还是源于模型和数据规模等混杂因素,目前还不清楚。在本文中,我们通过一系列大规模、经过仔细控制的大规模、预先培训的模型(MLM)来处理这一问题,培训总共30个模型,范围从2亿至10亿参数不等,并进行15 000多个微调和评估运行。我们发现,虽然与MLM培训通常在文本代表基准方面产生更好的业绩,但CLM培训模型的数据效率更高,显示微调稳定性也有所改善。我们实验性地表明,我们所有按顺序应用CLM和MM的双轨培训战略,在固定的计算培训预算下实现最佳业绩,即从现有的Cral-LM模型从可持续降低成本成本成本。
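The difference between the two objectives, and the biphasic schedule that applies CLM first and MLM second, can be shown on a toy token list. This is an illustrative sketch only: the 0.15 masking rate is a common convention rather than a value stated in the abstract, and `clm_fraction` and all function names are hypothetical.

```python
import random

MASK = "[MASK]"


def clm_examples(tokens):
    """Causal LM: predict each token from the tokens to its left."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def mlm_example(tokens, mask_prob=0.15, rng=None):
    """Masked LM: hide random tokens and predict them from both sides.

    Returns (inputs, labels); labels hold the original token at masked
    positions and None elsewhere.
    """
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(token)
        else:
            inputs.append(token)
            labels.append(None)
    return inputs, labels


def objective_for_step(step, total_steps, clm_fraction=0.5):
    """Biphasic schedule: CLM for the first part of training, MLM afterwards."""
    return "clm" if step < int(total_steps * clm_fraction) else "mlm"
```

Under a fixed budget of `total_steps`, varying `clm_fraction` is one simple way to express how the budget is split between the two phases.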
Article 175
Title@2025-07-01 (2): Discourse Heuristics For Paradoxically Moral Self-Correction
Title: Discourse Heuristics For Paradoxically Moral Self-Correction | Diskurs Heuristik für paradoxerweise sittliche Selbstkorrektion | 反相矛盾道德自我自我修正的超常性理论 2507.00985v1 |
Authors (4): Guangliang Liu, Zimo Qi, Xitong Zhang, Kristen Marie Johnson
Moral self-correction has emerged as a promising approach for aligning the output of Large Language Models (LLMs) with human moral values. However, moral self-correction techniques are subject to two primary paradoxes. First, despite empirical and theoretical evidence to support the effectiveness of self-correction, this LLM capability only operates at a superficial level. Second, while LLMs possess the capability of self-diagnosing immoral aspects of their output, they struggle to identify the cause of this moral inconsistency during their self-correction process. To better understand and address these paradoxes, we analyze the discourse constructions in fine-tuning corpora designed to enhance moral self-correction, uncovering the existence of the heuristics underlying effective constructions. We demonstrate that moral self-correction relies on discourse constructions that reflect heuristic shortcuts, and that the presence of these heuristic shortcuts during self-correction leads to inconsistency when attempting to enhance both self-correction and self-diagnosis capabilities jointly. Based on our findings, we propose a solution to improve moral self-correction by leveraging the heuristics of curated datasets. We also highlight the generalization challenges of this capability, particularly in terms of learning from situated context and model scales.
大型语言模型(LLMs)的输出与人类道德价值相匹配,道德自我纠正是很有希望的方法,但道德自我纠正技术受两个主要悖论的影响。首先,尽管有经验证据和理论证据支持自我纠正的有效性,但这种LLM能力只在表面一级运作。第二,虽然LLMs拥有自己诊断其产出中不道德的方面的能力,但在自我纠正过程中,他们努力找出这种道德不一致的原因。为了更好地理解和解决这些矛盾,我们分析了在微调公司中的谈话结构,目的是加强道德自我纠正,揭示有效构造背后的超强主义的存在。我们证明,道德自我纠正依赖于反映超自然捷径的谈话结构,在自我纠正过程中,这些超自然的捷径的存在导致在试图加强自我纠正和自我诊断能力时出现不一致。根据我们的调查结果,我们提出了一种改进道德自我修正的解决方案,即利用这种超常规模的数据条件的学习能力来改进道德模型的自我修正。我们还提出了一种解决方案,我们从总体数据规模中突出了这种自我修正的挑战。
Article 176
Title@2025-07-01 (2): Enhancing LLM Agent Safety via Causal Influence Prompting
Title: Enhancing LLM Agent Safety via Causal Influence Prompting | Verbesserung der Sicherheit von LLM-Agenten durch ursächlichen Einfluss | 通过原因影响促进增强LLM代理安全 2507.00979v1 |
Authors (5): Dongyoon Hahm, Woogyeol Jin, June Suk Choi, Sungsoo Ahn, Kimin Lee
As autonomous agents powered by large language models (LLMs) continue to demonstrate potential across various assistive tasks, ensuring their safe and reliable behavior is crucial for preventing unintended consequences. In this work, we introduce CIP, a novel technique that leverages causal influence diagrams (CIDs) to identify and mitigate risks arising from agent decision-making. CIDs provide a structured representation of cause-and-effect relationships, enabling agents to anticipate harmful outcomes and make safer decisions. Our approach consists of three key steps: (1) initializing a CID based on task specifications to outline the decision-making process, (2) guiding agent interactions with the environment using the CID, and (3) iteratively refining the CID based on observed behaviors and outcomes. Experimental results demonstrate that our method effectively enhances safety in both code execution and mobile device control tasks.
由于由大型语言模型(LLMs)驱动的自主代理机构继续显示各种辅助性任务的潜力,确保其安全和可靠行为对于防止意外后果至关重要。在这项工作中,我们引入了CIP,这是利用因果影响图(CIDs)确定和减轻因果影响风险的新技术。CID提供了因果关系的结构性代表,使代理机构能够预测有害结果和作出更安全的决定。我们的方法包括三个关键步骤:(1)根据任务规格启动CID,以概述决策过程;(2)指导代理人利用CID与环境的相互作用;(3)根据观察到的行为和结果反复完善CID。实验结果表明,我们的方法有效地加强了代码执行和移动装置控制任务的安全。
Article 177
Title@2025-07-01 (2): Large Language Model Confidence Estimation via Black-Box Access
Title: Large Language Model Confidence Estimation via Black-Box Access | Große Sprachmodell-Konfidenzschätzung über Black-Box-Zugriff | 通过黑箱访问大语言模型信任度估计 2406.04370v4 |
Authors (5): Tejaswini Pedapati, Amit Dhurandhar, Soumya Ghosh, Soham Dan, Prasanna Sattigeri
Estimating uncertainty or confidence in the responses of a model can be significant in evaluating trust not only in the responses, but also in the model as a whole. In this paper, we explore the problem of estimating confidence for responses of large language models (LLMs) with only black-box or query access to them. We propose a simple and extensible framework in which we engineer novel features and train an (interpretable) model (viz. logistic regression) on these features to estimate the confidence. We empirically demonstrate that our simple framework is effective in estimating confidence of Flan-ul2, Llama-13b, Mistral-7b and GPT-4 on four benchmark Q&A tasks as well as of Pegasus-large and BART-large on two benchmark summarization tasks, surpassing baselines by more than 10% (on AUROC) in some cases. Additionally, our interpretable approach provides insight into features that are predictive of confidence, leading to the interesting and useful discovery that our confidence models built for one LLM generalize zero-shot across others on a given dataset.
估计模型回答的不确定性或置信度,不仅对评估回答本身的可信度很重要,对评估整个模型的可信度也很重要。在本文中,我们探讨在仅有黑盒或查询访问权限的情况下,如何估计大型语言模型(LLMs)回答的置信度。我们提出了一个简单且可扩展的框架:设计新的特征,并在这些特征上训练一个(可解释的)模型(即逻辑回归)来估计置信度。实验表明,这一简单框架能够有效估计Flan-ul2、Llama-13b、Mistral-7b和GPT-4在四个基准问答(Q&A)任务上的置信度,以及Pegasus-large和BART-large在两个基准摘要任务上的置信度,在某些情况下超过基线10%以上(按AUROC计)。此外,我们的可解释方法揭示了哪些特征能够预测置信度,并带来一个有趣且有用的发现:在给定数据集上,为一个LLM构建的置信度模型可以零样本泛化到其他LLM。
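The abstract above describes training a logistic regression on engineered features to estimate confidence from black-box access. A minimal sketch of that idea follows, using only the standard library. The two features (agreement across rephrased prompts, agreement across repeated samples) and the toy data are assumptions for illustration; the abstract does not list the paper's actual features.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def confidence(w, b, x):
    """Estimated probability that the response with features x is correct."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical features per response:
# [agreement across rephrased prompts, agreement across repeated samples]
X = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.9], [0.2, 0.3], [0.3, 0.1], [0.1, 0.2]]
y = [1, 1, 1, 0, 0, 0]  # 1 = response judged correct, 0 = incorrect

w, b = train_logistic(X, y)
print(confidence(w, b, [0.9, 0.9]), confidence(w, b, [0.1, 0.1]))
```

Because the model is linear in its features, the learned weights can be read directly as an indication of which features push the confidence estimate up or down, which is the kind of interpretability the abstract refers to.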
Article 178
Title@2025-07-01 (2): MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research
Title: MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research | MLR-Bench: Bewertung von KI-Agenten auf Open-Ended Machine Learning Research | MLR-Bench:评估AI公司在开放式机械学习研究方面的代理机构 2505.19955v2 |
Authors (10): Hui Chen, Miao Xiong, Yujie Lu, Wei Han, Ailin Deng, Yufei He, Jiaying Wu, Yibo Li, Yue Liu, Bryan Hooi
Recent advancements in AI agents have demonstrated their growing potential to drive and support scientific discovery. In this work, we introduce MLR-Bench, a comprehensive benchmark for evaluating AI agents on open-ended machine learning research. MLR-Bench includes three key components: (1) 201 research tasks sourced from NeurIPS, ICLR, and ICML workshops covering diverse ML topics; (2) MLR-Judge, an automated evaluation framework combining LLM-based reviewers with carefully designed review rubrics to assess research quality; and (3) MLR-Agent, a modular agent scaffold capable of completing research tasks through four stages: idea generation, proposal formulation, experimentation, and paper writing. Our framework supports both stepwise assessment across these distinct research stages and end-to-end evaluation of the final research paper. We then use MLR-Bench to evaluate six frontier LLMs and an advanced coding agent, finding that while LLMs are effective at generating coherent ideas and well-structured papers, current coding agents frequently (e.g., in 80% of the cases) produce fabricated or invalidated experimental results, posing a major barrier to scientific reliability. We validate MLR-Judge through human evaluation, showing high agreement with expert reviewers, supporting its potential as a scalable tool for research evaluation. We open-source MLR-Bench to help the community benchmark, diagnose, and improve AI research agents toward trustworthy and transparent scientific discovery.
在这项工作中,我们引入了MLR-Bench,这是评价AI代理商进行开放式机器学习研究的全面基准。 MLR-Bench 包括三个关键组成部分:(1) NeurIPS、ICLR和ICMLL讲习班产生的201项研究任务,涵盖不同的ML专题;(2) MLR-Judge,这是一个自动评价框架,将基于LLM的审评员与精心设计的审查标语结合起来,以评价研究质量;(3) MLR-Agents,一个模块代理商,能够通过四个阶段完成研究任务:想法的产生、提案的制定、实验和纸张的写作。我们的框架支持在这些不同的研究阶段进行逐步评估,以及对最后研究论文进行端到端到端的评价。 然后,我们利用MLR-Bench来评价六个前沿LMM和高级编目代理商,发现虽然LLMMM能够有效地产生一致的想法和结构完善的文件,但当前的编码代理商经常(例如80%的案例中)能够完成研究任务的帮助完成研究任务。我们的框架支持实验性结果评估,通过一个开放的实验室评估工具来显示其潜在的可靠。
Article 179
Title@2025-07-01 (2): Intertextual Parallel Detection in Biblical Hebrew: A Transformer-Based Benchmark
Title: Intertextual Parallel Detection in Biblical Hebrew: A Transformer-Based Benchmark | Intertextuelle Parallelerkennung im biblischen Hebräisch: Ein transformerbasierter Benchmark | 《圣经希伯来文:以变换者为基础的基准》 2506.24117v2 |
Authors (1): David M. Smiley
Identifying parallel passages in biblical Hebrew (BH) is central to biblical scholarship for understanding intertextual relationships. Traditional methods rely on manual comparison, a labor-intensive process prone to human error. This study evaluates the potential of pre-trained transformer-based language models, including E5, AlephBERT, MPNet, and LaBSE, for detecting textual parallels in the Hebrew Bible. Focusing on known parallels between Samuel/Kings and Chronicles, I assessed each model’s capability to generate word embeddings distinguishing parallel from non-parallel passages. Using cosine similarity and Wasserstein Distance measures, I found that E5 and AlephBERT show promise; E5 excels in parallel detection, while AlephBERT demonstrates stronger non-parallel differentiation. These findings indicate that pre-trained models can enhance the efficiency and accuracy of detecting intertextual parallels in ancient texts, suggesting broader applications for ancient language studies.
在圣经希伯莱文(BH)中找出平行的段落是理解文字间关系的圣经奖学金的核心。传统方法依赖于人工比较,这是一个劳动密集型的过程,容易发生人为错误。这项研究评估了预先训练的基于变压器的语言模型的潜力,包括E5、AlephBERT、MPNet和LABSE,以发现希伯来圣经中的文本平行。我以塞缪尔/Kings和纪事之间的已知平行点为重点,我评估了每一种模型生成与非平行通道平行的词嵌入器的能力。我发现,E5和AlephBERT的相似性和瓦塞尔斯坦距离措施显示了承诺;E5在平行探测方面表现优异,而AlephBERT则展示了更强的非平行差异。这些研究结果表明,经过训练的模型可以提高古代文本中发现文字平行点的效率和准确性,为古代语言研究提供了更广泛的应用。
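The abstract above compares passage embeddings with cosine similarity and Wasserstein distance. A minimal standard-library sketch of both measures follows. The toy vectors and score lists are invented for illustration, and the use of the Wasserstein distance to compare the score distribution of parallel pairs against that of non-parallel pairs is an assumption; the abstract does not spell out how the measure was applied.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-size 1-D samples.

    For uniform-weight samples of the same size this equals the mean
    absolute difference between the sorted values.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

# Toy embeddings (hypothetical): two near-identical passages, one unrelated.
samuel = [0.9, 0.1, 0.4]
chronicles = [0.8, 0.2, 0.5]
unrelated = [0.1, 0.9, 0.0]
print(cosine_similarity(samuel, chronicles), cosine_similarity(samuel, unrelated))

# Hypothetical similarity scores for parallel vs. non-parallel pairs.
parallel_scores = [0.9, 0.8, 0.95]
non_parallel_scores = [0.1, 0.2, 0.3]
print(wasserstein_1d(parallel_scores, non_parallel_scores))
```

A larger distance between the two score distributions would indicate that a model separates parallel from non-parallel passages more cleanly.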
Article 180
Title@2025-07-01 (2): The Cognate Data Bottleneck in Language Phylogenetics
Title: The Cognate Data Bottleneck in Language Phylogenetics | Der Cognate Data Bottleneck in der Sprache Phylogenetik | 语言哲学遗传学中的 Cognate 数据瓶颈 2507.00911v1 |
Authors (2): Luise Häuser, Alexandros Stamatakis
To fully exploit the potential of computational phylogenetic methods for cognate data, one needs to leverage specific (complex) models and machine learning-based techniques. However, both approaches require datasets that are substantially larger than the manually collected cognate data currently available. To the best of our knowledge, there exists no feasible approach to automatically generate larger cognate datasets. We substantiate this claim by automatically extracting datasets from BabelNet, a large multilingual encyclopedic dictionary. We demonstrate that phylogenetic inferences on the respective character matrices yield trees that are largely inconsistent with the established gold standard ground truth trees. We also discuss why we consider it unlikely that more suitable character matrices can be extracted from other multilingual resources. Phylogenetic data analysis approaches that require larger datasets can therefore not be applied to cognate data. Thus, it remains an open question how, and whether, these computational approaches can be applied in historical linguistics.
为了充分利用计算遗传数据的植物遗传学方法的潜力,人们需要利用一种机械学习技术的具体(复杂)模型。但是,两种方法都需要的数据集都大大大于手动收集的遗传数据的现有数据。据我们所知,没有可行的方法可以自动生成更大的遗传数据集。我们通过自动从一个大型多语种百科全书词典 BabelNet 中提取数据集来证实这一主张。我们证明,各特性矩阵的植物遗传推论产生与既定金质标准地面真象树基本不符的树。我们还讨论为什么我们认为不可能从其他多语言资源中提取更合适的性格矩阵。因此,要求较大数据集的遗传数据分析方法不能应用于遗传数据数据。因此,如何以及在历史语言中应用这些计算方法仍然是个未决问题。
Article 181
Title@2025-07-01 (2): ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models
Title: ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models | NUR: One-Layer-Intervention Genügend mildert Halluzinationen in großen Vision-Sprachen-Modellen | 仅:在大型视觉语言模型中,单声道干预足以减少幻觉 2507.00898v1 |
Authors (9): Zifu Wan, Ce Zhang, Silong Yong, Martin Q. Ma, Simon Stepputtis, Louis-Philippe Morency, Deva Ramanan, Katia Sycara, Yaqi Xie
Recent Large Vision-Language Models (LVLMs) have introduced a new paradigm for understanding and reasoning about image input through textual responses. Although they have achieved remarkable performance across a range of multi-modal tasks, they face the persistent challenge of hallucination, which introduces practical weaknesses and raises concerns about their reliable deployment in real-world applications. Existing work has explored contrastive decoding approaches to mitigate this issue, where the output of the original LVLM is compared and contrasted with that of a perturbed version. However, these methods require two or more queries that slow down LVLM response generation, making them less suitable for real-time applications. To overcome this limitation, we propose ONLY, a training-free decoding approach that requires only a single query and a one-layer intervention during decoding, enabling efficient real-time deployment. Specifically, we enhance textual outputs by selectively amplifying crucial textual information using a text-to-visual entropy ratio for each token. Extensive experimental results demonstrate that our proposed ONLY consistently outperforms state-of-the-art methods across various benchmarks while requiring minimal implementation effort and computational cost. Code is available at https://github.com/zifuwan/ONLY.
最近的大型视觉语言模型(LVLM)引入了通过文字响应理解和推理图像输入的新模式,尽管这些模型在多种模式任务中取得了显著的成绩,但它们面临长期的幻觉挑战,这带来了实际弱点,并引起人们对在现实应用中可靠部署的担忧;现有工作探索了对比式解码方法,以缓解这一问题,其中原始LVLM的输出与受周遭版本的对比和对比;然而,这些方法需要两个或两个以上的查询,以减缓LVLM的响应生成,使其更不适合实时应用。为克服这一限制,我们仅建议一种无需培训的解码方法,在解码过程中只需要一次查询和一次性干预,从而能够有效地实时部署。具体地说,我们通过有选择地扩大关键文本信息,将每种商品的文本-视听昆虫比率进行比较。广泛的实验结果表明,我们所提议的方法在需要最低限度的执行努力和计算成本的同时,在各种基准中始终超越了LVM-art的状态方法。
Article 182
Title@2025-07-01 (2): MemeCMD: An Automatically Generated Chinese Multi-turn Dialogue Dataset with Contextually Retrieved Memes
Title: MemeCMD: An Automatically Generated Chinese Multi-turn Dialogue Dataset with Contextually Retrieved Memes | MemeCMD: Ein automatisch generierter chinesischer Multiturn Dialogue Datensatz mit kontextuell abgerufenen Memes | MemeCMD: 一个自动生成的中文多方向对话框数据集, 带有上下文检索的Memes 2507.00891v1 |
Authors (3): Yuheng Wang, Xianhe Tang, Pufeng Huang
Memes are widely used in online social interactions, providing vivid, intuitive, and often humorous means to express intentions and emotions. Existing dialogue datasets are predominantly limited to either manually annotated or pure-text conversations, lacking the expressiveness and contextual nuance that multimodal interactions provide. To address these challenges, we introduce MemeCMD, an automatically generated Chinese Multi-turn Dialogue dataset with contextually retrieved memes. Our dataset combines a large-scale, MLLM-annotated meme library with dialogues auto-generated by dual agents across diverse scenarios. We introduce a retrieval framework and adaptive threshold to ensure contextually relevant, naturally spaced meme usage. Experiments demonstrate the effectiveness of our approach in generating contextually appropriate and diverse meme-incorporated dialogues, offering a scalable and privacy-preserving resource for advancing multimodal conversational AI.
在网上社交互动中,Memes被广泛使用,提供了生动、直观和常常幽默的表达意图和情感的手段。现有的对话数据集主要局限于手动附加说明或纯文本对话,缺乏多式联运互动提供的清晰度和背景特点。为了应对这些挑战,我们引入了MemeCCMD,这是一个自动生成的中国多方向对话数据集,带有根据背景检索的MemeCMD。我们的数据集将大型的MLLM-附加说明的Mmeme 图书馆与由多种情景的双重代理商自动生成的对话结合起来。我们引入了检索框架和适应性门槛,以确保在环境上具有相关性的自然间隙使用。实验表明我们的方法在产生适合背景和多样化的混合式对话方面的有效性,为推进多语言的AI提供了可扩展和隐私保护的资源。
Article 183
Title@2025-07-01 (2): Scaling Laws Are Unreliable for Downstream Tasks: A Reality Check
Title: Scaling Laws Are Unreliable for Downstream Tasks: A Reality Check | Skalierungsgesetze sind für Downstream-Aufgaben unzuverlässig: Ein Realitätscheck | 增强法律对下流任务不可靠:一个现实检查 2507.00885v1 |
Authors (3): Nicholas Lourie, Michael Y. Hu, Kyunghyun Cho
Downstream scaling laws aim to predict task performance at larger scales from pretraining losses at smaller scales. Whether this prediction should be possible is unclear: some works demonstrate that task performance follows clear linear scaling trends under transformation, whereas others point out fundamental challenges to downstream scaling laws, such as emergence and inverse scaling. In this work, we conduct a meta-analysis of existing data on downstream scaling laws, finding that close fit to linear scaling laws only occurs in a minority of cases: 39% of the time. Furthermore, seemingly benign changes to the experimental setting can completely change the scaling trend. Our analysis underscores the need to understand the conditions under which scaling laws succeed. To fully model the relationship between pretraining loss and downstream task performance, we must embrace the cases in which scaling behavior deviates from linear trends.
下游缩放法旨在从较小的规模的训练前损失中预测更大的规模任务绩效。 这一预测是否可行尚不清楚:有些作品表明任务绩效遵循了转型过程中明显的线性缩放趋势,而另一些作品则指出下游规模法律面临的基本挑战,例如出现和反向缩放。在这项工作中,我们对下游规模法律的现有数据进行元分析,发现只有在少数情况下才接近线性缩放法律:39%的时间。此外,对实验环境似乎无害的改变可以彻底改变缩放趋势。我们的分析强调,需要了解逐步缩放法律取得成功的条件。要充分模拟培训前损失与下游任务绩效之间的关系,我们必须接受规模行为偏离线性趋势的案例。
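The abstract above asks how often downstream task performance closely fits a linear trend in pretraining loss. A minimal sketch of such a check follows: an ordinary least-squares line plus the coefficient of determination. The toy loss and accuracy values are invented, and using R² as the measure of "close fit" is an assumption; the abstract does not state the paper's actual criterion.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys, slope, intercept):
    """Fraction of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Hypothetical pretraining losses at four model scales.
losses = [3.0, 2.5, 2.0, 1.5]
# Task A improves steadily as loss falls; task B jumps only at the last scale.
acc_linear = [0.2, 0.3, 0.4, 0.5]
acc_emergent = [0.0, 0.0, 0.0, 0.6]

for name, acc in [("linear", acc_linear), ("emergent", acc_emergent)]:
    slope, intercept = linear_fit(losses, acc)
    print(name, round(r_squared(losses, acc, slope, intercept), 3))
```

The second toy task illustrates the kind of emergence-like behaviour the abstract mentions: a line fitted at small scales would badly mispredict the largest model.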
Article 184
Title@2025-07-01 (2): Mathematics Isn’t Culture-Free: Probing Cultural Gaps via Entity and Scenario Perturbations
Title: Mathematics Isn’t Culture-Free: Probing Cultural Gaps via Entity and Scenario Perturbations | Mathematik ist nicht kulturfrei: Probing Cultural Gaps via Entity und Scenario Perturbations | 数学不是没有文化的:通过实体和假想干扰来证明文化差距。 2507.00883v1 |
Authors (5): Aditya Tomar, Nihar Ranjan Sahoo, Ashish Mittal, Rudra Murthy, Pushpak Bhattacharyya
Although mathematics is often considered culturally neutral, the way mathematical problems are presented can carry implicit cultural context. Existing benchmarks like GSM8K are predominantly rooted in Western norms, including names, currencies, and everyday scenarios. In this work, we create culturally adapted variants of the GSM8K test set for five regions (Africa, India, China, Korea, and Japan) using prompt-based transformations followed by manual verification. We evaluate six large language models (LLMs), ranging from 8B to 72B parameters, across five prompting strategies to assess their robustness to cultural variation in math problem presentation. Our findings reveal a consistent performance gap: models perform best on the original US-centric dataset and comparatively worse on culturally adapted versions. However, models with reasoning capabilities are more resilient to these shifts, suggesting that deeper reasoning helps bridge cultural presentation gaps in mathematical tasks.
虽然数学通常被认为是文化中立的,但数学问题的表述方式可以隐含文化背景,现有的基准,如GSM8K(GSM8K)主要根植于西方规范,包括名称、货币和日常情景。在这项工作中,我们利用迅速的转化和人工核查,为非洲、印度、中国、韩国和日本五个区域创建了适应文化的GSM8K测试的变体。我们评估了六种大型语言模型(LLMS),范围从8B到72B参数,横跨五个提示性战略,以评估其对数学问题表述中文化差异的稳健性。我们的调查结果揭示了一贯的绩效差距:模型在最初以美国为中心的数据集上表现最佳,而文化上更差。然而,具有推理能力的模型更能适应这些变化,表明更深入的推理有助于弥合数学任务的文化表述差距。
Article 185
Title@2025-07-01 (2): Benchmarking the Pedagogical Knowledge of Large Language Models
Title: Benchmarking the Pedagogical Knowledge of Large Language Models | Benchmarking der pädagogischen Kenntnisse großer Sprachmodelle | 确定大语言模式教学知识基准 2506.18710v3 |
Authors (10): Maxime Lelièvre, Amy Waldock, Meng Liu, Natalia Valdés Aspillaga, Alasdair Mackintosh, María José Ogando Portela, Jared Lee, Paul Atherton, Robin A. A. Ince, Oliver G. B. Garrod
Benchmarks like Massive Multitask Language Understanding (MMLU) have played a pivotal role in evaluating AI’s knowledge and abilities across diverse domains. However, existing benchmarks predominantly focus on content knowledge, leaving a critical gap in assessing models’ understanding of pedagogy - the method and practice of teaching. This paper introduces The Pedagogy Benchmark, a novel dataset designed to evaluate large language models on their Cross-Domain Pedagogical Knowledge (CDPK) and Special Education Needs and Disability (SEND) pedagogical knowledge. These benchmarks are built on a carefully curated set of questions sourced from professional development exams for teachers, which cover a range of pedagogical subdomains such as teaching strategies and assessment methods. Here we outline the methodology and development of these benchmarks. We report results for 97 models, with accuracies spanning a range from 28% to 89% on the pedagogical knowledge questions. We consider the relationship between cost and accuracy and chart the progression of the Pareto value frontier over time. We provide online leaderboards at https://rebrand.ly/pedagogy which are updated with new models and allow interactive exploration and filtering based on various model properties, such as cost per token and open-vs-closed weights, as well as looking at performance in different subjects. LLMs and generative AI have tremendous potential to influence education and help to address the global learning crisis. Education-focused benchmarks are crucial to measure models’ capacities to understand pedagogical concepts, respond appropriately to learners’ needs, and support effective teaching practices across diverse contexts. They are needed for informing the responsible and evidence-based deployment of LLMs and LLM-based tools in educational settings, and for guiding both development and policy decisions.
大量多任务语言理解(MMLU)等基准在评估大赦国际在不同领域的知识和能力方面发挥了关键作用。然而,现有基准主要侧重于内容知识,在评估模型对教学方法和做法(教学方法和做法)的理解方面留下了重大差距。本文介绍了“教学基准”,这是一套新颖的数据集,旨在评价大语言模式的跨多任务教学知识(CDPK)和特殊教育需要和残疾(SEND)教学知识。这些基准是建立在由教师专业发展考试(包括教学战略和评估方法等一系列教学有效次级内容)精心整理的一组问题之上的。我们在这里概述了这些基准的方法和发展情况。我们报告97个模型的结果,在教学知识问题方面从28%到89%不等。我们考虑了成本和准确性之间的关系,并描绘了基于不同价值前沿的模型。我们在https://rebrand.ly/peagogy 上提供了在线领导板,它们以新的模型更新了有效的教学次级内容支持,并且根据不同成本和成本学习主题,将交互式探索和过滤能力决定。
Article 186
Title@2025-07-01 (2): Verifiable Natural Language to Linear Temporal Logic Translation: A Benchmark Dataset and Evaluation Suite
Title: Verifiable Natural Language to Linear Temporal Logic Translation: A Benchmark Dataset and Evaluation Suite | Überprüfbare natürliche Sprache zur linearen Zeitlogik Übersetzung: Ein Benchmark-Datensatz und Bewertungs-Suite | 线性时时逻辑翻译的可核实自然语言:基准数据集和评价套件 2507.00877v1 |
Authors (5): William H English, Chase Walker, Dominic Simon, Sumit Kumar Jha, Rickard Ewetz
Empirical evaluation of state-of-the-art natural-language (NL) to temporal-logic (TL) translation systems reveals near-perfect performance on existing benchmarks. However, current studies measure only the accuracy of the translation of NL logic into formal TL, ignoring a system’s capacity to ground atomic propositions into new scenarios or environments. This is a critical feature, necessary for the verification of resulting formulas in a concrete state space. Consequently, most NL-to-TL translation frameworks propose their own bespoke dataset in which the correct grounding is known a priori, inflating performance metrics and neglecting the need for extensible, domain-general systems. In this paper, we introduce the Verifiable Linear Temporal Logic Benchmark (VLTL-Bench), a unifying benchmark that measures verification and verifiability of automated NL-to-LTL translation. The dataset consists of three unique state spaces and thousands of diverse natural language specifications and corresponding formal specifications in temporal logic. Moreover, the benchmark contains sample traces to validate the temporal logic expressions. While the benchmark directly supports end-to-end evaluation, we observe that many frameworks decompose the process into i) lifting, ii) grounding, iii) translation, and iv) verification. The benchmark provides ground truths after each of these steps to enable researchers to improve and evaluate different substeps of the overall problem. To encourage methodologically sound advances in verifiable NL-to-LTL translation approaches, we release VLTL-Bench here: https://www.kaggle.com/datasets/dubascudes/vltl bench.
对最先进的自然语言(NL)到时间逻辑(TL)翻译系统的经验性评估显示,现有基准的准确性近乎完美,然而,当前研究只衡量将NL逻辑转换成正式的TL的准确性,忽视了系统将原子理论推入新情景或环境的能力。这是一个关键特征,对于在具体状态空间中核查由此产生的公式是必要的。因此,大多数NL-to-TL翻译框架都提出自己的自定数据集,其中正确的基底值为优先数据,放大性能指标,忽视了对可扩展的地域通用系统的需求。在本文中,我们采用了VLT-Bench(VL-LT)系统逻辑逻辑框架,用于测量自动NL-L-L-LT翻译的核查和可核查性。数据集包括三个独特的州空间和数千种不同的自然语言规格,以及相应的时间逻辑规范。此外,基准含有验证时间逻辑表达的样本。在基准中直接支持端端端端端至端的进展,忽略了对可扩展的域系统系统系统系统系统的需要。我们观察了每个地面校准的校准进程。
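The abstract above mentions sample traces used to validate temporal logic expressions. A minimal sketch of that verification step follows: a recursive evaluator for LTL formulas over a finite trace. The tuple encoding of formulas, the finite-trace semantics (with a strong "next"), and the proposition names `request` and `grant` are assumptions for illustration, not the benchmark's actual format.

```python
def holds(formula, trace, i=0):
    """Check an LTL formula at position i of a non-empty finite trace.

    A trace is a list of sets, each holding the atomic propositions that
    are true at that step. Formulas are nested tuples such as
    ("G", ("implies", ("ap", "p"), ("F", ("ap", "q")))).
    """
    op = formula[0]
    if op == "ap":
        return formula[1] in trace[i]
    if op == "not":
        return not holds(formula[1], trace, i)
    if op == "and":
        return holds(formula[1], trace, i) and holds(formula[2], trace, i)
    if op == "or":
        return holds(formula[1], trace, i) or holds(formula[2], trace, i)
    if op == "implies":
        return (not holds(formula[1], trace, i)) or holds(formula[2], trace, i)
    if op == "X":  # strong next: false at the last position
        return i + 1 < len(trace) and holds(formula[1], trace, i + 1)
    if op == "F":  # eventually
        return any(holds(formula[1], trace, k) for k in range(i, len(trace)))
    if op == "G":  # always
        return all(holds(formula[1], trace, k) for k in range(i, len(trace)))
    if op == "U":  # left operand holds until right operand holds
        for k in range(i, len(trace)):
            if holds(formula[2], trace, k):
                return True
            if not holds(formula[1], trace, k):
                return False
        return False
    raise ValueError(f"unknown operator: {op}")

# "Every request is eventually granted": G(request -> F grant)
spec = ("G", ("implies", ("ap", "request"), ("F", ("ap", "grant"))))
good_trace = [{"request"}, set(), {"grant"}]
bad_trace = [{"request"}, set()]
print(holds(spec, good_trace), holds(spec, bad_trace))
```

Grounding, in the abstract's sense, would correspond to deciding which concrete propositions of a given state space stand in for placeholders like `request` and `grant` before a check of this kind is run.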
Article 187
Title@2025-07-01 (2): TransLaw: Benchmarking Large Language Models in Multi-Agent Simulation of the Collaborative Translation
Title: TransLaw: Benchmarking Large Language Models in Multi-Agent Simulation of the Collaborative Translation | Translaw: Benchmarking von großen Sprachmodellen in der Multi-Agenten-Simulation der Kollaborativen Übersetzung | TransLaw:在多方代理模拟协作翻译时确定大语言模式基准 2507.00875v1 |
Authors (4): Xi Xuan, King-kui Sin, Yufei Zhou, Chunyu Kit
Multi-agent systems empowered by large language models (LLMs) have demonstrated remarkable capabilities in a wide range of downstream applications, including machine translation. However, the potential of LLMs in translating Hong Kong legal judgments remains uncertain due to challenges such as intricate legal terminology, culturally embedded nuances, and strict linguistic structures. In this work, we introduce TransLaw, a novel multi-agent framework implemented for real-world Hong Kong case law translation. It employs three specialized agents, namely, Translator, Annotator, and Proofreader, to collaboratively produce translations for high accuracy in legal meaning, appropriateness in style, and adequate coherence and cohesion in structure. This framework supports customizable LLM configurations and achieves tremendous cost reduction compared to professional human translation services. We evaluated its performance using 13 open-source and commercial LLMs as agents and obtained interesting findings, including that it surpasses GPT-4o in legal semantic accuracy, structural coherence, and stylistic fidelity, yet trails human experts in contextualizing complex terminology and stylistic naturalness. Our platform website is available at CityUHK, and our bilingual judgment corpus used for the evaluation is available at Hugging Face.
由大型语言模型(LLMS)授权的多试剂系统在包括机器翻译在内的一系列下游应用中表现出了非凡的能力,然而,由于复杂的法律术语、文化内含的细微差别和严格的语言结构等挑战,LLMS在翻译香港法律判决方面的潜力仍然不确定,在这项工作中,我们采用了TransLaw,这是为香港真实世界判例法翻译工作实施的新颖的多试剂框架,它雇用了3个专业代理人,即笔译员、说明员和校对员,合作翻译法律意义高度准确、风格适当、结构具有充分一致性和一致性。这个框架支持可定制的LM配置,并实现了与专业人类翻译服务相比的巨大成本削减。我们用13个开放源和商业LLMs作为代理对它的业绩进行了评估,并获得了有趣的发现,包括它在法律语义准确性、结构一致性和文体贴性方面超过了GPT-4,但在复杂术语背景和自然性方面有历史线索的人类专家。我们的平台网站在城市UHK提供,我们用于评估的双语判决书可在Hugging Face上查阅。
Article 188
Title@2025-07-01 (2): Text Production and Comprehension by Human and Artificial Intelligence: Interdisciplinary Workshop Report
Title: Text Production and Comprehension by Human and Artificial Intelligence: Interdisciplinary Workshop Report | Textproduktion und -verständnis durch menschliche und künstliche Intelligenz: Interdisziplinärer Workshop-Bericht | 人文和人工情报的文字制作和理解:跨学科讲习班报告 2506.22698v2 |
Authors (1): Emily Dux Speltz
This report synthesizes the outcomes of a recent interdisciplinary workshop that brought together leading experts in cognitive psychology, language learning, and artificial intelligence (AI)-based natural language processing (NLP). The workshop, funded by the National Science Foundation, aimed to address a critical knowledge gap in our understanding of the relationship between AI language models and human cognitive processes in text comprehension and composition. Through collaborative dialogue across cognitive, linguistic, and technological perspectives, workshop participants examined the underlying processes involved when humans produce and comprehend text, and how AI can both inform our understanding of these processes and augment human capabilities. The workshop revealed emerging patterns in the relationship between large language models (LLMs) and human cognition, with highlights on both the capabilities of LLMs and their limitations in fully replicating human-like language understanding and generation. Key findings include the potential of LLMs to offer insights into human language processing, the increasing alignment between LLM behavior and human language processing when models are fine-tuned with human feedback, and the opportunities and challenges presented by human-AI collaboration in language tasks. By synthesizing these findings, this report aims to guide future research, development, and implementation of LLMs in cognitive psychology, linguistics, and education. It emphasizes the importance of ethical considerations and responsible use of AI technologies while striving to enhance human capabilities in text comprehension and production through effective human-AI collaboration.
本报告综合了最近一次跨学科讲习班的成果,该讲习班汇集了认知心理学、语言学习和人工智能(AI)基础自然语言处理(NLP)方面的领先专家。该讲习班由国家科学基金会资助,旨在解决我们理解AI语言模型和人类认知过程在理解文本和组成方面的关系方面存在的重大知识差距。通过从认知、语言和技术角度进行的合作对话,讲习班参与者审查了人类制作和理解文本时所涉及的基本过程,以及大赦国际如何能够使我们了解这些过程并增强人的能力。该讲习班揭示了大型语言模型(LLLM)与人类融合之间的关系中新出现的模式,重点是LLMM的能力及其在全面复制类似人类语言理解和生成方面的局限性。主要调查结果包括LMMs提供人类语言处理方面深刻见解的潜力、LLM行为和人类语言处理过程在模型与人类反馈进行微调时,以及人类与AI合作在语言任务中带来的机会和挑战。通过综合这些结论,本报告旨在指导未来研究、发展、发展和应用LLMAMA的局限性,同时通过提高人类的道德认识和理解能力,强调在提高人类认知心理学、语言学和理解能力方面,学会的制作和运用其价值。
Article 189
Title@2025-07-01 (2): Stylometry recognizes human and LLM-generated texts in short samples
Title: Stylometry recognizes human and LLM-generated texts in short samples | Stylometrie erkennt menschliche und LLM-generierte Texte in kurzen Proben | tytylometerm在短样本中确认人类和LLM产生的文本 2507.00838v1 |
Authors (4): Karol Przystalski, Jan K. Argasiński, Iwona Grabska-Gradzińska, Jeremi K. Ochab
The paper explores stylometry as a method to distinguish between texts created by Large Language Models (LLMs) and humans, addressing issues of model attribution, intellectual property, and ethical AI use. Stylometry has been used extensively to characterise the style and attribute authorship of texts. By applying it to LLM-generated texts, we identify their emergent writing patterns. The paper involves creating a benchmark dataset based on Wikipedia, with (a) human-written term summaries, (b) texts generated purely by LLMs (GPT-3.5/4, LLaMa 2/3, Orca, and Falcon), (c) processed through multiple text summarisation methods (T5, BART, Gensim, and Sumy), and (d) rephrasing methods (Dipper, T5). The 10-sentence-long texts were classified by tree-based models (decision trees and LightGBM) using human-designed (StyloMetrix) and n-gram-based (our own pipeline) stylometric features that encode lexical, grammatical, syntactic, and punctuation patterns. The cross-validated results reached a performance of up to .87 Matthews correlation coefficient in the multiclass scenario with 7 classes, and accuracy between .79 and 1.0 in binary classification, with the particular example of Wikipedia and GPT-4 reaching up to .98 accuracy on a balanced dataset. Shapley Additive Explanations pinpointed features characteristic of the encyclopaedic text type, individual overused words, as well as a greater grammatical standardisation of LLMs with respect to human-written texts. These results show – crucially, in the context of the increasingly sophisticated LLMs – that it is possible to distinguish machine- from human-generated texts at least for a well-defined text type.
本文探讨将文体测量学(stylometry)作为区分大型语言模型(LLMs)与人类所写文本的方法,涉及模型归属、知识产权和合乎道德的AI使用等问题。文体测量学已被广泛用于刻画文本风格并判定作者归属。通过将其应用于LLM生成的文本,我们识别出其涌现的写作模式。本文基于维基百科构建了一个基准数据集,其中包括:(a) 人工撰写的词条摘要,(b) 完全由LLM(GPT-3.5/4、LLaMa 2/3、Orca和Falcon)生成的文本,(c) 经多种文本摘要方法(T5、BART、Gensim和Sumy)处理的文本,以及 (d) 经改写方法(Dipper、T5)处理的文本。这些长度为10句的文本由基于树的模型(决策树和LightGBM)进行分类,所用的文体特征包括人工设计的特征(StyloMetrix)和基于n-gram的特征(我们自己的流程),这些特征编码了词汇、语法、句法和标点模式。交叉验证结果显示,在7个类别的多分类场景中,马修斯相关系数最高达到.87;在二分类中,准确率介于.79和1.0之间,其中维基百科与GPT-4这一特定例子在平衡数据集上的准确率最高达到.98。Shapley加性解释(Shapley Additive Explanations)指出了百科全书式文本类型的典型特征、个别被过度使用的词语,以及LLM相对于人工撰写文本更高程度的语法规范化。这些结果表明,在LLM日益复杂的背景下尤为关键的是,至少对于定义明确的文本类型,区分机器生成与人工生成的文本是可行的。
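The abstract above reports results as a Matthews correlation coefficient (MCC) in a 7-class setting. A minimal sketch of the multiclass MCC follows, computed from label counts with the standard generalisation of the binary formula. The class names and toy predictions are invented for illustration and are not data from the paper.

```python
import math
from collections import Counter

def mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient.

    Uses s = number of samples, c = number of correct predictions, and the
    per-class counts of true labels (t_k) and predicted labels (p_k):
    MCC = (c*s - sum t_k*p_k) / sqrt((s^2 - sum p_k^2) * (s^2 - sum t_k^2)).
    Returns 0.0 when the denominator is zero (e.g. a constant prediction).
    """
    s = len(y_true)
    c = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    t_counts, p_counts = Counter(y_true), Counter(y_pred)
    labels = set(t_counts) | set(p_counts)
    numerator = c * s - sum(t_counts[k] * p_counts[k] for k in labels)
    spread_true = s * s - sum(t_counts[k] ** 2 for k in labels)
    spread_pred = s * s - sum(p_counts[k] ** 2 for k in labels)
    denominator = math.sqrt(spread_true * spread_pred)
    return numerator / denominator if denominator else 0.0

# Hypothetical labels for six texts from three sources.
y_true = ["human", "human", "gpt4", "gpt4", "llama", "llama"]
y_pred = ["human", "human", "gpt4", "llama", "llama", "llama"]
print(round(mcc(y_true, y_pred), 3))
```

Unlike plain accuracy, MCC stays near zero for a classifier that always predicts one class, which makes it a reasonable summary when several source classes are compared at once.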
Article 190
Title@2025-07-01 (2): ProxAnn: Use-Oriented Evaluations of Topic Models and Document Clustering
Title: ProxAnn: Use-Oriented Evaluations of Topic Models and Document Clustering | ProxAnn: Use-Oriented Assessments of Topic Models and Document Clustering | ProxAnn:专题模型和文件分类组合的使用评价 2507.00828v1 |
Authors (4): Alexander Hoyle, Lorena Calvo-Bartolomé, Jordan Boyd-Graber, Philip Resnik
Topic model and document-clustering evaluations either use automated metrics that align poorly with human preferences or require expert labels that are intractable to scale. We design a scalable human evaluation protocol and a corresponding automated approximation that reflect practitioners’ real-world usage of models. Annotators – or an LLM-based proxy – review text items assigned to a topic or cluster, infer a category for the group, then apply that category to other documents. Using this protocol, we collect extensive crowdworker annotations of outputs from a diverse set of topic models on two datasets. We then use these annotations to validate automated proxies, finding that the best LLM proxies are statistically indistinguishable from a human annotator and can therefore serve as a reasonable substitute in automated evaluations. Package, web interface, and data are at https://github.com/ahoho/proxann
专题模型和文件集群评价要么使用与人类喜好不相符的自动化计量标准,要么要求规模难以扩大的专家标签;我们设计了可扩展的人类评价协议和相应的自动近似法,反映从业人员实际使用模型的情况;说明者或基于LLM的代理机构,审查分配给某个专题或组群的文本项目,推断一个类别,然后将这一类别应用于其他文件。我们利用这一协议,从两个数据集的不同主题模型中收集广泛的人群工作者产出说明。然后,我们利用这些说明来验证自动代理,发现最佳LLM代理在统计上无法从人类标识器中分离出来,因此可以作为自动评价的合理替代物。软件包、网络接口和数据见https://github.com/aho/proxan。
Article 191
Title@2025-07-01 (2): A Study of In-Context-Learning-Based Text-to-SQL Errors
Title: A Study of In-Context-Learning-Based Text-to-SQL Errors | Eine Studie über In-Context-Learning-basierte Text-zu-SQL-Fehler | 文中学习基于文本到SQL错误的研究 2501.09310v2 |
Authors (9): Jiawei Shen, Chengcheng Wan, Ruoyi Qiao, Jiazhen Zou, Hang Xu, Yuchen Shao, Yueling Zhang, Weikai Miao, Geguang Pu
Large language models (LLMs) have been adopted to perform text-to-SQL tasks, utilizing their in-context learning (ICL) capability to translate natural language questions into structured query language (SQL). However, such a technique faces correctness problems and requires efficient repairing solutions. In this paper, we conduct the first comprehensive study of text-to-SQL errors. Our study covers four representative ICL-based techniques, five basic repairing methods, two benchmarks, and two LLM settings. We find that text-to-SQL errors are widespread and summarize 29 error types across 7 categories. We also find that existing repairing attempts yield limited correctness improvement at the cost of high computational overhead and many mis-repairs. Based on the findings, we propose MapleRepair, a novel text-to-SQL error detection and repairing framework. The evaluation demonstrates that MapleRepair outperforms existing solutions by repairing 13.8% more queries with negligible mis-repairs and 67.4% less overhead.
采用大型语言模型(LLMS)来完成文本到SQL的任务,利用它们将自然语言问题转换成结构化查询语言(SQL)的内置学习能力(ICL)来将自然语言问题转换成结构化查询语言(SQL),然而,这种技术面临着正确性问题,需要高效的修复解决方案。在本文件中,我们首次对文本到SQL错误进行了全面研究。我们的研究涵盖了四种有代表性的ICL技术、五种基本修复方法、两个基准和两个LLM设置。我们发现文本到SQL的错误非常普遍,并总结了29种7类错误类型。我们还发现,现有的修复尝试在以许多错误修复的高计算管理成本下,其正确性改进有限。根据研究结果,我们提议了MapleRepair, 一个新的文本到SQL错误探测和修复框架。评估表明,MapleRepair通过修复13.8%的更多查询,用可忽略的错误修复,少67.4%的管理费,比现有的解决办法超出现有的解决办法。
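As a rough illustration of what detecting a text-to-SQL error can look like in practice, the sketch below runs a generated query against a small in-memory SQLite database and assigns a coarse label. This is not MapleRepair and does not reproduce the paper's 29 error types; the labels, the `check_query` helper, and the `singer` table are hypothetical names chosen for the example.

```python
import sqlite3


def check_query(conn, sql):
    """Return a coarse, illustrative label for a generated SQL query."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.OperationalError as exc:
        message = str(exc)
        # SQLite reports unknown identifiers as "no such column/table: ...".
        if "no such column" in message or "no such table" in message:
            return "schema_error"
        return "syntax_error"
    # A query that runs but returns nothing is only a hint of a semantic problem.
    return "empty_result" if not rows else "ok"


def demo_connection():
    """Build a tiny hypothetical schema to run the checks against."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE singer (id INTEGER, name TEXT, age INTEGER)")
    conn.execute("INSERT INTO singer VALUES (1, 'Ann', 30)")
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    for sql in ("SELECT name FROM singer", "SELECT nam FROM singer"):
        print(sql, "->", check_query(conn, sql))
```

Execution-based checks like this only catch queries that fail or return nothing; a query that runs and returns the wrong rows would need a comparison against a reference answer.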
Article 192
Title@2025-07-01 (2): Many LLMs Are More Utilitarian Than One
Title: Many LLMs Are More Utilitarian Than One | Viele LLMs sind utilitaristischer als eines | 多个LLM比单个LLM更功利 2507.00814v1 |
Authors (5): Anita Keshmirian, Razan Baltaji, Babak Hemmatian, Hadi Asghari, Lav R. Varshney
Moral judgment is integral to large language model (LLM) alignment and social reasoning. As multi-agent systems gain prominence, it becomes crucial to understand how LLMs function collectively during collaboration, compared to individual agents. In human moral judgment, group deliberation leads to a utilitarian boost: a tendency to endorse norm violations that maximize benefits for the greatest number of people despite harms. We study whether a similar dynamic emerges in multi-agent LLM systems. We tested six models on well-established sets of moral dilemmas across two conditions: (1) Solo, where models reasoned independently, and (2) Group, where they engaged in multi-turn discussions in pairs or triads. In personal moral dilemmas, where agents must decide to directly harm one individual to maximize the utility for others, all models found moral violations to be more acceptable when part of a group than individually, similar to human experiments. Some models endorsed actions that maximized overall well-being, even if they benefited strangers over familiar individuals. Others became more willing to violate moral norms in groups. However, while human groups show a similar action bias, the mechanism for their utilitarian boost differs from LLMs. Whereas the human shift comes from heightened sensitivity to decision outcomes, LLM groups show either reduced norm sensitivity or enhanced impartiality. This suggests that while the surface behavior of LLM collectives mimics human group reasoning, the underlying drivers differ. We discuss the implications for AI alignment, multi-agent design, and artificial moral reasoning.
道德判断:在人类道德判断中,集体审议导致一种功利主义的推动:倾向于认可通常违反规则的行为,从而在伤害面前为尽可能多的人带来最大利益;我们研究在多语言模型(LLM)体系中是否出现类似的动态;我们测试了六种模式,其道德困境分两个条件:(1) Solo,该模式独立地说明理由,和(2) Group,其中他们以双胞胎或三胞胎形式进行多轮式讨论。在个人道德困境中,代理人必须决定直接伤害一个人,使他人的效用最大化。在个人道德判断中,所有模式都发现道德违规行为更容易被接受,因为一个群体的一部分比个别群体,类似于人类实验。一些模式认可了使总体福利最大化的行动,即使他们比熟悉的个人更了解情况。其他人更愿意在团体中违反道德规范。然而,虽然人类团体表现出类似的行动偏向性偏向性,但其支配性设计机制不同于LMMs。 而人类行为高度的敏感性则表现了集体性。
Article 193
Title@2025-07-01 (2): OM4OV: Leveraging Ontology Matching for Ontology Versioning
Title: OM4OV: Leveraging Ontology Matching for Ontology Versioning | OM4OV: Ontologie für die Ontologie-Versionierung | OM4OV:利用本体学匹配本体学版本的本体学 2409.20302v4 |
Authors (3): Zhangcheng Qiang, Kerry Taylor, Weiqing Wang
Due to the dynamic nature of the Semantic Web, version control is necessary to capture time-varying information, particularly for widely used ontologies. Despite the long-standing recognition of ontology versioning (OV) as a crucial component for efficient ontology management, the growing size of ontologies and accumulating errors caused by manual labour overwhelm current OV approaches. In this paper, we propose a fresh approach to performing OV using existing ontology matching (OM) techniques and systems. We introduce a unified OM4OV pipeline. From an OM perspective, we reconstruct a new task formulation and measurements for OV tasks. Building upon the prior alignment(s) from OM, we propose a pipeline optimisation method called the cross-reference (CR) mechanism to enhance overall OV performance. We experimentally validate the OM4OV pipeline and the cross-reference mechanism in an OV testbed originating from the Ontology Alignment Evaluation Initiative (OAEI) datasets. We also discuss insights into OM used for OV tasks, where some apparent false mappings detected by OV systems are not actually untrue.
由于语义网的动态性质,必须进行版本控制,以捕捉时间变化信息,特别是广泛使用的肿瘤信息。尽管长期以来一直承认本体学版本(OV)是有效本体学管理的一个关键组成部分,但由于人工劳动超负荷目前OV方法导致本体学规模不断扩大和累积错误的积累,但本文件中我们建议采用新的方法,利用现有的本体匹配(OM)技术和系统来进行OV。我们引入了统一的OM4OV管道。我们从OM的角度为OV任务重建了新的任务配方和测量。在OMM先前的对齐基础上,我们提议一种管线优化方法,称为交叉参照(CR)机制,以提高整个OV的性能。我们实验性地验证OM4OVV管道和来自OTolog对齐评价倡议(OAEI)数据集的交叉参照机制。我们还讨论对OVOV任务所用OM任务的洞察到的一些明显假图实际上并不真实。
Article 194
Title@2025-07-01 (2): Generative AI and the future of scientometrics: current topics and future questions
Title: Generative AI and the future of scientometrics: current topics and future questions | Generative KI und die Zukunft der Scientometrics: aktuelle Themen und Zukunftsfragen | 生成的人工智能和科学计量法的未来:当前专题和今后的问题 2507.00783v1 |
Authors (3): Benedetto Lepori, Jens Peter Andersen, Karsten Donnay
The aim of this paper is to review the use of GenAI in scientometrics, and to begin a debate on the broader implications for the field. First, we provide an introduction on GenAI’s generative and probabilistic nature as rooted in distributional linguistics. And we relate this to the debate on the extent to which GenAI might be able to mimic human ‘reasoning’. Second, we leverage this distinction for a critical engagement with recent experiments using GenAI in scientometrics, including topic labelling, the analysis of citation contexts, predictive applications, scholars’ profiling, and research assessment. GenAI shows promise in tasks where language generation dominates, such as labelling, but faces limitations in tasks that require stable semantics, pragmatic reasoning, or structured domain knowledge. However, these results might become quickly outdated. Our recommendation is, therefore, to always strive to systematically compare the performance of different GenAI models for specific tasks. Third, we inquire whether, by generating large amounts of scientific language, GenAI might have a fundamental impact on our field by affecting textual characteristics used to measure science, such as authors, words, and references. We argue that careful empirical work and theoretical reflection will be essential to remain capable of interpreting the evolving patterns of knowledge production.
本文的目的是审查GenAI在科学计量方面的使用情况,并开始就该领域的更广泛影响展开辩论。首先,我们介绍GenAI源于分布语言的基因性和概率性。我们将此与关于GenAI可能在多大程度上模仿人类的“理由”的辩论联系起来。第二,我们利用这一区别,对最近利用GenAI在科学计量方面的实验进行关键接触,包括专题标签、引言背景分析、预测应用、学者特征分析和研究评估。GenAI在语言生成占主导地位的任务中表现出希望,例如标签,但在需要稳定的语义、务实的推理或结构化领域知识的任务中面临限制。然而,这些结果可能会很快过时。因此,我们的建议是始终努力系统地比较不同的GenAI模型在具体任务方面的表现。第三,我们通过大量科学语言,调查GenAI通过影响用于测量科学的文字特征,例如作者、文字和参考文献,是否会对我们的领域产生根本的影响,从而影响我们领域。我们主张,在不断发展的思考过程中,审慎的经验和理论工作将仍然是不断演变。
Article 195
Title@2025-07-01 (2): A Diagrammatic Calculus for a Functional Model of Natural Language Semantics
Title: A Diagrammatic Calculus for a Functional Model of Natural Language Semantics | Ein diagrammatischer Kalkulus für ein funktionelles Modell der natürlichen Sprachsemantik | 自然语言语义学功能模型的图表计算 2507.00782v1 |
Authors (1): Matthieu Pierre Boyer
In this paper, we study a functional programming approach to natural language semantics, allowing us to increase the expressivity of a more traditional denotation style. We will formalize a category based type and effect system, and construct a diagrammatic calculus to model parsing and handling of effects, and use it to efficiently compute the denotations for sentences.
在本文中,我们研究了自然语言语义学的功能性编程方法,从而使我们能够提高更传统的批注风格的表达性。 我们将正式确定基于类别类型和效果的系统,并建立一个图表计算法,以模拟对效果的分解和处理,并用它有效地计算判决的批注。
Article 196
Title@2025-07-01 (2): LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing
Title: LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing | LitBench: Ein Benchmark und Datensatz für eine zuverlässige Bewertung des kreativen Schreibens | 《创意书写:可靠评价基准和数据集》 2507.00769v1 |
Authors (6): Daniel Fein, Sebastian Russo, Violet Xiang, Kabir Jolly, Rafael Rafailov, Nick Haber
Evaluating creative writing generated by large language models (LLMs) remains challenging because open-ended narratives lack ground truths. Without performant automated evaluation methods, off-the-shelf (OTS) language models are employed as zero-shot judges, yet their reliability is unclear in this context. In pursuit of robust evaluation for creative writing, we introduce LitBench, the first standardized benchmark and paired dataset for creative writing verification, comprising a held-out test set of 2,480 debiased, human-labeled story comparisons drawn from Reddit and a 43,827-pair training corpus of human preference labels. Using LitBench, we (i) benchmark zero-shot LLM judges, (ii) train Bradley-Terry and generative reward models, and (iii) conduct an online human study to validate reward model rankings on newly LLM-generated stories. Our benchmark identifies Claude-3.7-Sonnet as the strongest off-the-shelf judge, reaching 73% agreement with human preferences; among trained reward models, Bradley-Terry and generative reward models both attain an accuracy of 78%, outperforming all off-the-shelf judges. An online human study further confirms that our trained reward models consistently align with human preferences in novel LLM-generated stories. We release LitBench and reward models at https://huggingface.co/collections/SAA-Lab/litbench-68267b5da3aafe58f9e43461, providing a vetted resource for reliable, automated evaluation and optimization of creative writing systems.
大型语言模型(LLMS)产生的创造性写作评价仍然具有挑战性,因为开放式叙述缺乏地面真相。如果没有实用的自动评价方法,现成(OTS)语言模型将被用作零弹法官,但在这方面其可靠性尚不明确。为了对创造性写作进行强有力的评价,我们引入了首个标准化基准和配对数据,即LitBench,这是用于创造性写作核查的第一个标准化基准和配对数据集,包括一套由2 480个脱节、人类标签比对的测试集,由Redditit和43 827个人类偏好标签培训教材组成的人类标本。使用LitBench,我们(一)为零弹的LM法官基准,(二)培训Brady Terry和基因化奖赏模型,以及(三)开展在线人类研究,以验证新LM系列文章的奖赏模式。我们的标准确定Claude-37-Sont为最强的离手法官,与人类偏好73%的奖项;在经过培训的奖赏模型中,Brad-Treal-Treal-rageal-rageal-al-LSAL-LSAL5的奖赏模型中进一步获得78%的人类奖赏模型。
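The Bradley-Terry reward models mentioned above rest on a simple pairwise formulation: the probability that story A is preferred over story B is the logistic function of the difference of their scalar scores. The sketch below shows that standard formulation and a pairwise agreement metric; the function names are illustrative and the code does not reproduce the authors' training setup.

```python
import math


def bt_probability(score_a, score_b):
    """Bradley-Terry probability that item A is preferred over item B."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def bt_loss(score_chosen, score_rejected):
    """Negative log-likelihood of the human-preferred item winning."""
    return -math.log(bt_probability(score_chosen, score_rejected))


def pairwise_accuracy(pairs):
    """Fraction of (chosen, rejected) score pairs ranked in the human order."""
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)


if __name__ == "__main__":
    # Hypothetical reward-model scores for four story comparisons.
    scored_pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, 1.5), (0.9, -0.2)]
    print("agreement with human labels:", pairwise_accuracy(scored_pairs))
    print("loss on first pair:", bt_loss(*scored_pairs[0]))
```

Agreement figures such as the 73% and 78% quoted in the abstract are accuracies of this pairwise kind, computed over the held-out comparisons.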
Article 197
Title@2025-07-01 (2): Safe Low Bandwidth SPV: A Formal Treatment of Simplified Payment Verification Protocols and Security Bounds
Title: Safe Low Bandwidth SPV: A Formal Treatment of Simplified Payment Verification Protocols and Security Bounds | Safe Low Bandwidth SPV: Eine formale Behandlung von vereinfachten Zahlungsverifikationsprotokollen und Sicherheitsbunden | 安全低频带宽度SPV:对简化付款核查议定书和安全圈的正式处理 2507.00740v1 |
Authors (1): Craig S Wright
This paper presents a complete formal specification, protocol description, and mathematical proof structure for Simplified Payment Verification (SPV) as originally defined in the Bitcoin whitepaper (Nakamoto, 2008). In stark contrast to the misrepresentations proliferated by popular implementations, we show that SPV is not only secure under bounded adversarial assumptions but strictly optimal for digital cash systems requiring scalable and verifiable transaction inclusion. We reconstruct the SPV protocol from first principles, grounding its verification model in symbolic automata, Merkle membership relations, and chain-of-proof dominance predicates. Through rigorous probabilistic and game-theoretic analysis, we derive the economic bounds within which the protocol operates securely and verify its liveness and safety properties under partial connectivity, hostile relay networks, and adversarial propagation delay. Our specification further introduces low-bandwidth optimisations such as adaptive polling and compressed header synchronisation while preserving correctness. This document serves both as a blueprint for secure SPV implementation and a rebuttal of common misconceptions surrounding non-validating clients.
本文介绍了《Bitcoin 白纸\ cite{nakamoto2008} 最初定义的简化支付核查(SPV)的完整正式规格、协议描述和数学证明结构。与大众执行过程中大量出现的不实陈述形成鲜明对比的是,我们显示,SPV不仅在有约束的对抗假设下安全,而且对于数字现金系统也严格地来说是最佳的,需要包含可缩放和可核查的交易。我们从最初的原则中重建SPV协议,将其核查模式建立在象征性的自动数据、默克尔成员关系和可防控的支配地位前导线上。我们通过严格的概率和游戏理论分析,得出了协议安全运行的经济界限,并在部分连接、敌对的中继网络和对抗性传播延迟下验证了其生活和安全性。我们的规格进一步引入了低带宽选择,如适应性投票和压缩头板同步,同时保持正确性。本文件既是安全实施SPV的蓝图,也是对非验证客户的常见误解的反驳。
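The Merkle membership relation at the heart of SPV can be sketched in a few lines: a client holding only a block header's Merkle root checks that a transaction hash is included by folding it with the sibling hashes along its path. The code below is a minimal sketch assuming Bitcoin-style double SHA-256 and duplication of the last node on odd levels; it omits byte-order conventions and header-chain validation, and the helper names are illustrative rather than taken from the paper.

```python
import hashlib


def dsha256(data: bytes) -> bytes:
    """Double SHA-256, as used for Bitcoin transaction and block hashes."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def _next_level(level):
    """Hash adjacent pairs; the caller guarantees an even number of nodes."""
    return [dsha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def merkle_root(leaves):
    """Root of a non-empty list of leaf hashes."""
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = _next_level(level)
    return level[0]


def merkle_proof(leaves, index):
    """Sibling path for leaves[index] as (sibling_hash, sibling_is_left) pairs."""
    proof = []
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        level = _next_level(level)
        index //= 2
    return proof


def verify_membership(leaf, proof, root):
    """Recompute the root from a leaf and its sibling path."""
    node = leaf
    for sibling, sibling_is_left in proof:
        node = dsha256(sibling + node) if sibling_is_left else dsha256(node + sibling)
    return node == root


if __name__ == "__main__":
    txids = [dsha256(bytes([i])) for i in range(5)]  # stand-in transaction hashes
    root = merkle_root(txids)
    print(verify_membership(txids[3], merkle_proof(txids, 3), root))
```

The proof length grows with the logarithm of the number of transactions, which is the property that makes low-bandwidth verification possible.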
Article 198
Title@2025-07-01 (2): HyperCLOVA X THINK Technical Report
Title: HyperCLOVA X THINK Technical Report | HyperCLOVA X THINK Technischer Bericht | HypercLOVA X 思考技术报告 2506.22403v2 |
Authors (1): NAVER Cloud HyperCLOVA X Team
We introduce HyperCLOVA X THINK, the first reasoning-focused large language model in the HyperCLOVA X family, pre-trained on roughly 6 trillion high-quality Korean and English tokens, augmented with targeted synthetic Korean data. It was implemented as a compute-memory-balanced Peri-LN Transformer scaled with μP, pre-trained through a three-stage curriculum that expands the context window to 128K tokens, and post-trained via supervised fine-tuning with Reinforcement Learning from Verifiable Rewards; it supports both detailed-rationale and concise-answer modes. It delivers competitive performance against similarly sized models on Korea-focused benchmarks such as KMMLU, CSAT, KoBALT-700, HAERAE-1.0, and KoBigBench, while preserving robust bilingual consistency and translation quality. In addition, a vision-augmented variant matches or exceeds GPT-4.1 on the KCSAT STEM benchmark, all of which are achieved with substantially lower training compute than existing models of similar sizes. We also present a pruning and distillation technique that will soon be applied to HyperCLOVA X THINK for an open-source and business-friendly foundation model. Altogether, these capabilities position HyperCLOVA X THINK as a robust foundation for Korean AI innovation and a valuable resource for the global research community.
我们引入了HyperCLOVA X SEG,这是HyperCLOVA X家族中第一个以推理为重点的大型语言模式HyperCLOVA XSED,预先培训了大约6万亿美元高品质的朝鲜语和英语象征物,并配以有针对性的韩国合成数据。它是一个计算和平衡的Peri-LN变异器,其规模以美元为单位,通过三阶段课程进行预先培训,将背景窗口扩大到128KSmarks,通过监督的微调,通过从可验证的奖赏中强化学习来支持详细的理由和简要的回答模式。它还针对以韩国为重点的基准,如KMMMMLU、CSAT、KoBALT-700、HAERAE-1.0和KoBigBench等类似规模的模型,提供了竞争性业绩,同时保持了强大的双语一致性和翻译质量。此外,一个设想的变式匹配或超过GPT-4.1的KTEM基准,所有这些都是通过远远低于现有规模模型的培训实现的详细理由和简明回答模式。我们还展示了以韩国为核心基础的一个开放的智能创新基础,作为韩国企业基础,并将一个基础,一个开放的智能基础,用于韩国创新基础。
Article 199
Title@2025-07-01 (2): AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models
Title: AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models | AudioTrust: Benchmarking der vielfältigen Vertrauenswürdigkeit von Audio Large Language Models | 音频信任:确定音频大语言模式多面信任度基准 2505.16211v2 |
Authors (31): Kai Li, Can Shen, Yile Liu, Jirui Han, Kelong Zheng, Xuechao Zou, Zhe Wang, Xingjian Du, Shun Zhang, Hanjun Luo, Yingbin Jin, Xinxin Xing, Ziyang Ma, Yue Liu, Xiaojun Jia, Yifan Zhang, Junfeng Fang, Kun Wang, Yibo Yan, Haoyang Li, Yiming Li, Xiaobin Zhuang, Yang Liu, Haibo Hu, Zhizheng Wu, Xiaolin Hu, Eng-Siong Chng, XiaoFeng Wang, Wenyuan Xu, Wei Dong, Xinfeng Li
The rapid advancement and expanding applications of Audio Large Language Models (ALLMs) demand a rigorous understanding of their trustworthiness. However, systematic research on evaluating these models, particularly concerning risks unique to the audio modality, remains largely unexplored. Existing evaluation frameworks primarily focus on the text modality or address only a restricted set of safety dimensions, failing to adequately account for the unique characteristics and application scenarios inherent to the audio modality. We introduce AudioTrust, the first multifaceted trustworthiness evaluation framework and benchmark specifically designed for ALLMs. AudioTrust facilitates assessments across six key dimensions: fairness, hallucination, safety, privacy, robustness, and authentication. To comprehensively evaluate these dimensions, AudioTrust is structured around 18 distinct experimental setups. Its core is a meticulously constructed dataset of over 4,420 audio/text samples, drawn from real-world scenarios (e.g., daily conversations, emergency calls, voice assistant interactions), specifically designed to probe the multifaceted trustworthiness of ALLMs. For assessment, the benchmark carefully designs 9 audio-specific evaluation metrics, and we employ a large-scale automated pipeline for objective and scalable scoring of model outputs. Experimental results reveal the trustworthiness boundaries and limitations of current state-of-the-art open-source and closed-source ALLMs when confronted with various high-risk audio scenarios, offering valuable insights for the secure and trustworthy deployment of future audio models. Our platform and benchmark are available at https://github.com/JusperLee/AudioTrust.
音频大语言模型(ALLMs)的快速进步和扩展应用要求严格理解其可信度。然而,关于评价这些模型的系统研究,特别是关于音频模式所特有的风险的系统研究,基本上仍未探索。现有评价框架主要侧重于文本模式或仅处理一组有限的安全层面,未能充分说明音频模式所固有的独特特点和应用情景。我们引入了AudioTrust,这是首个专门为ALLMs设计的多方面可信度评价框架和基准。AudioTrust便利了对以下六个关键方面的评估:公平、幻觉、安全、隐私、稳健和认证。为全面评价这些层面,AudioTrust围绕18个不同的实验设置进行构建。其核心是精心构建的数据集,包括4 420多个音频/文本样本,这些样本来自现实世界情景(例如,日常谈话、紧急电话、语音助理互动),专门用于考察ALLMs的多方面可信度。在评估方面,该基准精心设计了9个音频专用评价指标,我们利用大规模自动化管道对模型输出进行客观和可扩展的评分。实验结果揭示了当前最先进的开源和闭源ALLMs在面对各种高风险音频情景时的可信度边界和局限。平台和基准见 https://github.com/JusperLee/AudioTrust 。
Article 200
Title@2025-07-01 (2): AI Analyst: Framework and Comprehensive Evaluation of Large Language Models for Financial Time Series Report Generation
Title: AI Analyst: Framework and Comprehensive Evaluation of Large Language Models for Financial Time Series Report Generation | KI Analyst: Rahmen und umfassende Bewertung von großen Sprachmodellen für die Erstellung von Finanzzeitreihen | AI分析员:财务时间系列报告编制大语言模式框架和综合评价 2507.00718v1 |
Authors (8): Elizabeth Fons, Elena Kochkina, Rachneet Kaur, Zhen Zeng, Berowne Hlavaty, Charese Smiley, Svitlana Vyetrenko, Manuela Veloso
This paper explores the potential of large language models (LLMs) to generate financial reports from time series data. We propose a framework encompassing prompt engineering, model selection, and evaluation. We introduce an automated highlighting system to categorize information within the generated reports, differentiating between insights derived directly from time series data, stemming from financial reasoning, and those reliant on external knowledge. This approach aids in evaluating the factual grounding and reasoning capabilities of the models. Our experiments, utilizing both data from the real stock market indices and synthetic time series, demonstrate the capability of LLMs to produce coherent and informative financial reports.
本文件探讨了大型语言模型(LLMs)从时间序列数据中产生财务报告的潜力,我们提出了一个包括迅速工程、模式选择和评价的框架,我们采用了一个自动突出显示系统,在生成的报告中对信息进行分类,区分由时间序列数据直接产生的洞察力、来自财务推理的洞察力和依赖外部知识的洞察力,这一方法有助于评价这些模型的事实依据和推理能力,我们利用实际股票市场指数和合成时间序列的数据进行的实验表明LMs有能力编制连贯和翔实的财务报告。
Article 201
Title@2025-07-01 (2): Quasi-symbolic Semantic Geometry over Transformer-based Variational AutoEncoder
Title: Quasi-symbolic Semantic Geometry over Transformer-based Variational AutoEncoder | Quasi-symbolische Semantische Geometrie über Transformer-basierte Variational AutoEncoder | 相对于基于变压器的变异自动编码器的 准正对立线语义学几何测量 2210.06230v3 |
Authors (3): Yingji Zhang, Danilo S. Carvalho, André Freitas
Formal/symbolic semantics can provide canonical, rigid controllability and interpretability to sentence representations due to their localisation or composition property. How can we deliver such a property to current distributional sentence representations to control and interpret the generation of language models (LMs)? In this work, we theoretically frame sentence semantics as the composition of "semantic role - word content" features and propose the formal semantic geometry. To inject such geometry into Transformer-based LMs (i.e., GPT2), we deploy a Transformer-based Variational AutoEncoder with a supervision approach, where the sentence generation can be manipulated and explained over a low-dimensional latent Gaussian space. In addition, we propose a new probing algorithm to guide the movement of sentence vectors over such geometry. Experimental results reveal that the formal semantic geometry can potentially deliver better control and interpretation to sentence generation.
正规/ 符号语义学可以提供语义、 僵硬的可控性和可解释性, 原因是其属性 :\ textit{ localization} 或\ textit{ composite} 。 我们怎样才能将这些属性交付到当前分布式语义中, 以控制和解释语言模型的生成? 在这项工作中, 我们理论上将句语义定义为 \ textit{ semantic 角色 - 单词内容} 的构成, 并提议正式的语义几何学。 将这种几何性输入基于变换器的 LMs( i. e. GPT2) 的LM( ) 。 我们用一种监督方法将基于变换器的自动电算器( Vatriational Encoder ) 引入, 从而可以操纵和解释该句的生成, 超越低度潜值的高斯空间 。 此外, 我们提出一个新的演算法, 来指导句矢量在这种几何学上移动。 。 实验结果显示, 正式的语义几何测量学测量可以为生成提供更好的控制和解释 。 。
Article 202
Title@2025-07-01 (2): Contrasting Cognitive Styles in Vision-Language Models: Holistic Attention in Japanese Versus Analytical Focus in English
Title: Contrasting Cognitive Styles in Vision-Language Models: Holistic Attention in Japanese Versus Analytical Focus in English | Kontrastierende kognitive Stile in Vision-Language-Modellen: Ganzheitliche Aufmerksamkeit im japanischen Vers Analytischen Fokus auf Englisch | 视觉语言模型中相互矛盾的认知模式:日本口述分析重点中的整体关注英语 2507.00700v1 |
Authors (4): Ahmed Sabir, Azinovič Gasper, Mengsay Loem, Rajesh Sharma
Cross-cultural research in perception and cognition has shown that individuals from different cultural backgrounds process visual information in distinct ways. East Asians, for example, tend to adopt a holistic perspective, attending to contextual relationships, whereas Westerners often employ an analytical approach, focusing on individual objects and their attributes. In this study, we investigate whether Vision-Language Models (VLMs) trained predominantly on different languages, specifically Japanese and English, exhibit similar culturally grounded attentional patterns. Using comparative analysis of image descriptions, we examine whether these models reflect differences in holistic versus analytic tendencies. Our findings suggest that VLMs not only internalize the structural properties of language but also reproduce cultural behaviors embedded in the training data, indicating that cultural cognition may implicitly shape model outputs.
认识和认知方面的跨文化研究显示,来自不同文化背景的个人以不同的方式处理视觉信息。 例如,东亚人倾向于采取整体观点,关注背景关系,而西方人则往往采用分析方法,侧重于个别对象及其属性。在本研究中,我们调查主要以不同语言,特别是日语和英语培训的视觉语言模型(VLM)是否具有类似的基于文化的注意力模式。我们通过比较图像描述分析,研究这些模型是否反映了整体趋势与分析趋势的差异。我们的研究结果表明,VLMS不仅将语言的结构特性内部化,而且还复制了培训数据中所包含的文化行为,表明文化认知可能隐含模式输出。
Article 203
Title@2025-07-01 (2): T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
Title: T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT | T2I-R1: Verstärkung der Bildgenerierung mit kollaborativem Semantik- und Token-Level CoT | T2I-R1:与合作语义级和Token 级COT加强图像生成 2505.00703v2 |
Authors (9): Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, Hongsheng Li
Recent advancements in large language models have demonstrated how chain-of-thought (CoT) and reinforcement learning (RL) can improve performance. However, applying such reasoning strategies to the visual generation domain remains largely unexplored. In this paper, we present T2I-R1, a novel reasoning-enhanced text-to-image generation model, powered by RL with a bi-level CoT reasoning process. Specifically, we identify two levels of CoT that can be utilized to enhance different stages of generation: (1) the semantic-level CoT for high-level planning of the prompt and (2) the token-level CoT for low-level pixel processing during patch-by-patch generation. To better coordinate these two levels of CoT, we introduce BiCoT-GRPO with an ensemble of generation rewards, which seamlessly optimizes both generation CoTs within the same training step. By applying our reasoning strategies to the baseline model, Janus-Pro, we achieve superior performance with 13% improvement on T2I-CompBench and 19% improvement on the WISE benchmark, even surpassing the state-of-the-art model FLUX.1. Code is available at: https://github.com/CaraJ7/T2I-R1
大型语言模型的最近进展表明,思维链(CoT)和强化学习(RL)如何能提高绩效。然而,在视觉生成域应用这种推理战略在很大程度上仍未探索。我们在本文件中展示了T2I-R1,这是由RL以双级COT推理过程推动的新颖推理强化文本到图像生成模型。具体地说,我们确定了可用于加强不同生成阶段的两种水平的COT:(1) 用于快速高级规划的语义级 CoT,和(2) 用于在补接一代中低级别的像素处理的象征性COT。为了更好地协调这两个层次的CoT,我们引入了BiCoT-GROPO, 配有共同的生成奖赏, 在同一培训步骤中完美地优化了两种生成的COT。通过对基线模型应用我们的推理战略,Janus-Pro,我们实现了优异性性性,T2I-Combench的13 %的改进,以及WISE基准的19 %的象征性的COT:甚至超越了FSE1/LADMRC1。
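To make the reward side of a GRPO-style update concrete, the sketch below computes group-relative advantages: several images are sampled for the same prompt, each gets a scalar reward, and rewards are normalised within the group. Averaging the reward models' scores into one ensemble reward is an assumption made for illustration; the abstract only says that an ensemble of generation rewards is used, and the function names here are hypothetical.

```python
from statistics import mean, pstdev


def ensemble_reward(scores):
    """Combine several reward-model scores for one sample (assumed: plain mean)."""
    return mean(scores)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical scores from three reward models for four sampled images.
    per_sample_scores = [[0.7, 0.6, 0.8], [0.2, 0.3, 0.1], [0.5, 0.5, 0.5], [0.9, 0.8, 1.0]]
    rewards = [ensemble_reward(s) for s in per_sample_scores]
    print(group_relative_advantages(rewards))
```

Because the advantage is relative to the group, a sample is reinforced only when it beats its siblings for the same prompt, which removes the need for a separate value network.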
Article 204
Title@2025-07-01 (2): Leveraging Large Language Models for Spontaneous Speech-Based Suicide Risk Detection
Title: Leveraging Large Language Models for Spontaneous Speech-Based Suicide Risk Detection | Nutzung von großen Sprachmodellen für spontane sprachbasierte Suizidrisikoerkennung | 利用大型语言模型进行自发语音自杀风险探测 2507.00693v1 |
Authors (4): Yifan Gao, Jiao Fu, Long Guo, Hong Liu
Early identification of suicide risk is crucial for preventing suicidal behaviors. As a result, the identification and study of patterns and markers related to suicide risk have become a key focus of current research. In this paper, we present the results of our work in the 1st SpeechWellness Challenge (SW1), which aims to explore speech as a non-invasive and easily accessible mental health indicator for identifying adolescents at risk of suicide. Our approach leverages a large language model (LLM) as the primary tool for feature extraction, alongside conventional acoustic and semantic features. The proposed method achieves an accuracy of 74% on the test set, ranking first in the SW1 challenge. These findings demonstrate the potential of LLM-based methods for analyzing speech in the context of suicide risk assessment.
早期识别自杀风险对于预防自杀行为至关重要,因此,确定和研究与自杀风险有关的模式和标志已成为当前研究的一个主要重点,在本文件中,我们介绍了我们的工作成果,第一次“言语健康挑战”(SW1),其目的是探讨作为非侵入性和容易获得的心理健康指标的言论,用以识别有自杀风险的青少年。我们的方法利用了大语言模型(LLM)作为特征提取的主要工具,同时利用传统的声学和语义特征。拟议方法在测试集上达到74的准确度,在SW1挑战中排名第一。这些结果显示了基于LLM方法在自杀风险评估中分析言论的潜力。
Article 205
Title@2025-07-01 (2): Iterative Resolution of Prompt Ambiguities Using a Progressive Cutting-Search Approach
Title: Iterative Resolution of Prompt Ambiguities Using a Progressive Cutting-Search Approach | Iterative Auflösung von Prompt-Ambiguitäten mittels eines progressiven Cutting-Search-Ansatzes | 采用逐步切割和搜寻办法迅速解决问题 2505.02952v2 |
Authors (1): Fabrizio Marozzo
Generative AI systems have revolutionized human interaction by enabling natural language-based coding and problem solving. However, the inherent ambiguity of natural language often leads to imprecise instructions, forcing users to iteratively test, correct, and resubmit their prompts. We propose an iterative approach that systematically narrows down these ambiguities through a structured series of clarification questions and alternative solution proposals, illustrated with input/output examples as well. Once every uncertainty is resolved, a final, precise solution is generated. Evaluated on a diverse dataset spanning coding, data analysis, and creative writing, our method demonstrates superior accuracy, competitive resolution times, and higher user satisfaction compared to conventional one-shot solutions, which typically require multiple manual iterations to achieve a correct output.
生成的AI系统通过促成自然语言编码和解决问题,使人类互动发生了革命性的变化。然而,自然语言的内在模糊性往往导致指示不准确,迫使用户反复测试、纠正和重新提出其提示。我们建议一种迭代方法,通过一系列结构化的澄清问题和替代解决方案提案,系统地缩小这些模糊性,并以输入/产出实例加以说明。一旦每一个不确定性得到解决,就会产生一个最终的、准确的解决办法。根据包含编码、数据分析和创造性书写的不同数据集进行评估,我们的方法显示更高的准确性、竞争性解答时间和用户满意度,而常规的一次性解决方案通常需要多手重复才能得出正确的产出。
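One way to picture the narrowing process is as a search over candidate interpretations of a prompt, where each clarification question cuts the remaining set. The sketch below is a loose illustration under that assumption, not the paper's algorithm: candidates are plain dictionaries, questions are yes/no predicates, and the most evenly splitting question is asked first. All names are hypothetical.

```python
def pick_question(candidates, questions):
    """Choose the question whose yes/no split of the candidates is most balanced."""
    def imbalance(name):
        yes = sum(1 for c in candidates if questions[name](c))
        return abs(2 * yes - len(candidates))
    return min(questions, key=imbalance)


def resolve(candidates, questions, answer):
    """Ask questions until one interpretation remains or no questions are left.

    `answer(name)` stands in for the user and returns True or False.
    """
    remaining = list(candidates)
    pending = dict(questions)
    asked = []
    while len(remaining) > 1 and pending:
        name = pick_question(remaining, pending)
        predicate = pending.pop(name)
        reply = answer(name)
        remaining = [c for c in remaining if predicate(c) == reply]
        asked.append(name)
    return remaining, asked


if __name__ == "__main__":
    # Four readings of the ambiguous prompt "sort the list".
    candidates = [
        {"order": "asc", "in_place": True},
        {"order": "asc", "in_place": False},
        {"order": "desc", "in_place": True},
        {"order": "desc", "in_place": False},
    ]
    questions = {
        "ascending?": lambda c: c["order"] == "asc",
        "modify in place?": lambda c: c["in_place"],
    }
    intended = candidates[3]
    print(resolve(candidates, questions, lambda q: questions[q](intended)))
```

With balanced questions the number of rounds grows with the logarithm of the number of interpretations, which is the intuition behind preferring a structured series of questions over repeated one-shot retries.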
Article 206
Title@2025-07-01 (2): Why Multi-Interest Fairness Matters: Hypergraph Contrastive Multi-Interest Learning for Fair Conversational Recommender System
Title: Why Multi-Interest Fairness Matters: Hypergraph Contrastive Multi-Interest Learning for Fair Conversational Recommender System | Warum Multi-Interest Fairness-Materies: Hypergraph Kontrastives Multi-Interest-Lernen für faires Gesprächs-Empfängersystem | 为何多利公平问题:为公平对话建议系统进行高频对抗多利学习 2507.02000v1 |
Authors (6): Yongsen Zheng, Zongxuan Xie, Guohua Wang, Ziyao Liu, Liang Lin, Kwok-Yan Lam
Unfairness is a well-known challenge in Recommender Systems (RSs), often resulting in biased outcomes that disadvantage users or items based on attributes such as gender, race, age, or popularity. Although some approaches have started to improve recommendation fairness in offline or static contexts, unfairness is often exacerbated over time, leading to significant problems like the Matthew effect, filter bubbles, and echo chambers. To address these challenges, we propose a novel framework, Hypergraph Contrastive Multi-Interest Learning for Fair Conversational Recommender System (HyFairCRS), aiming to promote multi-interest diversity fairness in dynamic and interactive Conversational Recommender Systems (CRSs). HyFairCRS first captures a wide range of user interests by establishing diverse hypergraphs through contrastive learning. These interests are then utilized in conversations to generate informative responses and ensure fair item predictions within the dynamic user-system feedback loop. Experiments on two CRS-based datasets show that HyFairCRS achieves a new state-of-the-art performance while effectively alleviating unfairness. Our code is available at https://github.com/zysensmile/HyFairCRS.
在建议系统(RSs)中,不公平是一个众所周知的挑战,往往造成有偏见的结果,使用户或基于性别、种族、年龄或受欢迎性等属性的物品处于不利地位。虽然有些办法已开始在离线或静态环境中改进公平建议,但不公平问题往往会随着时间推移而加剧,导致马修效应、过滤泡沫和回声室等重大问题。为了应对这些挑战,我们提出了一个新颖的框架,即公平建议建议系统(HyFairCRS),目的是在动态和互动的共融建议系统(CRSs)中促进多种利益多样性的公平。HyFairCRS首先通过对比性学习建立不同的超强度来捕捉广泛的用户利益,然后在谈话中利用这些利益来产生信息反应,并确保在动态用户系统反馈循环中作出公平的项目预测。对基于CRS的两套数据集的实验显示,HyFairCRS在有效减轻不公平性的同时实现了新的最先进性能。代码见 https://github.com/zysensmile/HyFairCRS 。
Article 207
Title@2025-07-01 (2): Not Minds, but Signs: Reframing LLMs through Semiotics
Title: Not Minds, but Signs: Reframing LLMs through Semiotics | Nicht Gedanken, sondern Zeichen: LLMs durch Semiotik abwehren | 不是心灵,而是符号:通过非美学重新组合LMS 2505.17080v2 |
Authors (1): Davide Picca
This paper challenges the prevailing tendency to frame Large Language Models (LLMs) as cognitive systems, arguing instead for a semiotic perspective that situates these models within the broader dynamics of sign manipulation and meaning-making. Rather than assuming that LLMs understand language or simulate human thought, we propose that their primary function is to recombine, recontextualize, and circulate linguistic forms based on probabilistic associations. By shifting from a cognitivist to a semiotic framework, we avoid anthropomorphism and gain a more precise understanding of how LLMs participate in cultural processes, not by thinking, but by generating texts that invite interpretation. Through theoretical analysis and practical examples, the paper demonstrates how LLMs function as semiotic agents whose outputs can be treated as interpretive acts, open to contextual negotiation and critical reflection. We explore applications in literature, philosophy, education, and cultural production, emphasizing how LLMs can serve as tools for creativity, dialogue, and critical inquiry. The semiotic paradigm foregrounds the situated, contingent, and socially embedded nature of meaning, offering a more rigorous and ethically aware framework for studying and using LLMs. Ultimately, this approach reframes LLMs as technological participants in an ongoing ecology of signs. They do not possess minds, but they alter how we read, write, and make meaning, compelling us to reconsider the foundations of language, interpretation, and the role of artificial systems in the production of knowledge.
本文挑战了将大语言模型(LLMS)作为认知系统这一普遍趋势,相反,我们主张一种半科学观点,将这些模型置于标志操纵和含义制作这一更广泛的动态中。我们不假定LLMS理解语言或模拟人的思想,而是建议它们的主要功能是在概率协会的基础上重新研究、重新翻版和传播语言形式。我们避免将大语言模型(LLMS)作为认知系统,更准确地理解LLMS如何参与文化进程,而不是通过思考,而是通过产生需要解释的文本。通过理论分析和实际例子,文件表明LMS如何作为半医学剂发挥作用,其产出可被视为解释性行为,可接受背景谈判和批判性反思。我们探讨了在文学、哲学、教育和文化生产方面的应用,强调LMS如何作为创造性、对话和批判性调查的工具。符号学范式突出了意义的情境性、偶然性和社会嵌入性,为研究和使用LLMs提供了更严谨、更具伦理意识的框架。最终,这一方法将LLMs重新定位为持续演变的符号生态中的技术参与者:它们不具备心智,但改变了我们阅读、写作和创造意义的方式,促使我们重新思考语言、解释的基础以及人工系统在知识生产中的作用。
Article 208
Title@2025-07-01 (2): SAFER: Probing Safety in Reward Models with Sparse Autoencoder
Title: SAFER: Probing Safety in Reward Models with Sparse Autoencoder | SAFER: Prüfen von Sicherheit in Prämienmodellen mit Sparse Autoencoder | SAFER: 使用 Sparse Autoencoder 探测奖励模型的安全性 2507.00665v1 |
Authors (6): Sihang Li, Wei Shi, Ziyuan Xie, Tao Liang, Guojun Ma, Xiang Wang
Reinforcement learning from human feedback (RLHF) is a key paradigm for aligning large language models (LLMs) with human values, yet the reward models at its core remain largely opaque. In this work, we present sparse Autoencoder For Enhanced Reward model (SAFER), a novel framework for interpreting and improving reward models through mechanistic analysis. Leveraging Sparse Autoencoders (SAEs), we uncover human-interpretable features in reward model activations, enabling insight into safety-relevant decision-making. We apply SAFER to safety-oriented preference datasets and quantify the salience of individual features by activation differences between chosen and rejected responses. Using these feature-level signals, we design targeted data poisoning and denoising strategies. Experiments show that SAFER can precisely degrade or enhance safety alignment with minimal data modification, without sacrificing general chat performance. Our approach contributes to interpreting, auditing and refining reward models in high-stakes LLM alignment tasks. Our codes are available at https://github.com/xzy-101/SAFER-code. This paper discusses topics related to large language model safety and may include discussions or examples that highlight potential risks or unsafe outcomes.
基于人类反馈的强化学习(RLHF)是使大语言模型(LLM)与人类价值观对齐的关键范式,但其核心的奖励模型在很大程度上仍不透明。在这项工作中,我们提出用于增强奖励模型的稀疏自编码器(SAFER),这是一个通过机制分析来解释和改进奖励模型的新框架。借助稀疏自编码器(SAE),我们在奖励模型的激活中发现了人类可解释的特征,从而能够洞察与安全相关的决策。我们将SAFER应用于面向安全的偏好数据集,并通过被选中与被拒绝回复之间的激活差异来量化单个特征的显著性。利用这些特征级信号,我们设计了有针对性的数据投毒和去噪策略。实验表明,SAFER能够以极少的数据修改精确地削弱或增强安全对齐,同时不牺牲一般聊天性能。我们的方法有助于在高风险的LLM对齐任务中解释、审计和改进奖励模型。代码见 https://github.com/xzy-101/SAFER-code 。本文讨论与大语言模型安全相关的话题,可能包含突出潜在风险或不安全结果的讨论或示例。
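The feature-salience step mentioned in the SAFER abstract (activation differences between chosen and rejected responses) can be illustrated with a minimal Python sketch. The abstract does not give the exact scoring formula, so the mean difference used here, the function names, and the toy activations are all assumptions for illustration and not the authors' implementation.

```python
from statistics import mean

def feature_salience(chosen_acts, rejected_acts):
    """Mean activation difference (chosen - rejected) per SAE feature.

    chosen_acts / rejected_acts: lists of equal-length activation vectors,
    one vector per preference pair (hypothetical inputs for illustration).
    """
    n_features = len(chosen_acts[0])
    return [
        mean(c[j] - r[j] for c, r in zip(chosen_acts, rejected_acts))
        for j in range(n_features)
    ]

def top_features(salience, k):
    """Indices of the k features with the largest absolute salience."""
    order = sorted(range(len(salience)), key=lambda j: abs(salience[j]), reverse=True)
    return order[:k]

# Toy activations for two preference pairs and three features (made up).
chosen = [[0.9, 0.1, 0.0], [0.7, 0.2, 0.1]]
rejected = [[0.1, 0.1, 0.4], [0.3, 0.2, 0.5]]
scores = feature_salience(chosen, rejected)
ranked = top_features(scores, 2)
```

In this toy case feature 0 fires more on chosen responses and feature 2 more on rejected ones, so both rank above the neutral feature 1. Under the abstract's description, preference pairs that strongly activate such features would be the natural candidates for targeted poisoning or denoising.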
Article 209
Title@2025-07-01 (2): Positional Bias in Binary Question Answering: How Uncertainty Shapes Model Preferences
Title: Positional Bias in Binary Question Answering: How Uncertainty Shapes Model Preferences | Positionale Bias in Binärfragebeantwortung: Wie Unsicherheit Formen Modelleinstellungen | 二进制问题解答中的位置偏差: 不确定形状的模型首选项 2506.23743v2 |
Authors (3): Tiziano Labruna, Simone Gallo, Giovanni Da San Martino
Positional bias in binary question answering occurs when a model systematically favors one choice over another based solely on the ordering of presented options. In this study, we quantify and analyze positional bias across five large language models under varying degrees of answer uncertainty. We re-adapted the SQuAD-it dataset by adding an extra incorrect answer option and then created multiple versions with progressively less context and more out-of-context answers, yielding datasets that range from low to high uncertainty. Additionally, we evaluate two naturally higher-uncertainty benchmarks: (1) WebGPT - question pairs with unequal human-assigned quality scores, and (2) Winning Arguments - where models predict the more persuasive argument in Reddit’s r/ChangeMyView exchanges. For each dataset, the order of the “correct” (or higher-quality/persuasive) option is systematically flipped (first placed in position 1, then in position 2) to compute both Preference Fairness and Position Consistency. We observe that positional bias is nearly absent under low-uncertainty conditions, but grows exponentially as it becomes less certain which option is correct.
在二进制答题中,当一个模型系统偏向一个选择而另一个选择时,就会出现二进制定位偏差。在本研究中,我们量化和分析五个大语言模型在不同程度的回答不确定性下的位置偏差。我们重新调整了 SQAD-it 数据集,增加了一个额外的不正确的回答选项,然后创建了多个版本,其上下文逐渐减少,外文回答更多,产生从低度到高度不确定性的数据集。此外,我们评估了两个自然较高的不确定性基准:(1) WebGPT - 问题配对,其质量分数不均匀,和(2) 赢动参数 - 模型预测了Reddit’s r/ ChangeMyMyView 交换中更有说服力的参数。在每一个数据集中,“ 校正” (或更高质量/可控性) 选项的顺序被系统地翻转( 位于第1位,然后处于第2位) , 以计算精准性与高度。我们发现, 定位偏差在低度条件下几乎不存在, 但当选择正确性时会变得强烈。
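The flip protocol described in the abstract above (each item is shown once with the target option in position 1 and once in position 2) can be sketched in Python. The abstract does not define Preference Fairness or Position Consistency formally, so the consistency measure below is one plausible reading, and `first_position_rate` is a simpler stand-in indicator rather than the paper's Preference Fairness metric. All names and data are hypothetical.

```python
def position_consistency(picks_original, picks_flipped):
    """Share of items where the same underlying option wins in both orders.

    Each pick is the *position* (1 or 2) chosen by the model. After flipping,
    the option shown in position 1 moves to position 2, so a consistent model
    picks position p in the original order and 3 - p in the flipped order.
    """
    pairs = list(zip(picks_original, picks_flipped))
    consistent = sum(1 for a, b in pairs if b == 3 - a)
    return consistent / len(pairs)

def first_position_rate(picks_original, picks_flipped):
    """Fraction of all picks that land on position 1 (0.5 means no lean)."""
    all_picks = list(picks_original) + list(picks_flipped)
    return sum(1 for p in all_picks if p == 1) / len(all_picks)

# Toy picks for four items (made up): the model leans toward position 1.
original = [1, 1, 2, 1]
flipped = [2, 1, 1, 1]
consistency = position_consistency(original, flipped)
lean = first_position_rate(original, flipped)
```

Here the model keeps its choice on only half of the items and picks position 1 in three quarters of all presentations, which is the kind of pattern the abstract reports under high uncertainty.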
Article 210
Title@2025-07-01 (2): Fact Recall, Heuristics or Pure Guesswork? Precise Interpretations of Language Models for Fact Completion
Title: Fact Recall, Heuristics or Pure Guesswork? Precise Interpretations of Language Models for Fact Completion | Fact Recall, Heuristik oder reines Guesswork? Präzise Interpretationen von Sprachmodellen für die Fact Completion | 事实回忆、启发式还是纯粹的猜测?面向事实补全的语言模型的精确解释 2410.14405v4 |
Authors (5): Denitsa Saynova, Lovisa Hagström, Moa Johansson, Richard Johansson, Marco Kuhlmann
Language models (LMs) can make a correct prediction based on many possible signals in a prompt, not all corresponding to recall of factual associations. However, current interpretations of LMs fail to take this into account. For example, given the query “Astrid Lindgren was born in” with the corresponding completion “Sweden”, no difference is made between whether the prediction was based on knowing where the author was born or assuming that a person with a Swedish-sounding name was born in Sweden. In this paper, we present a model-specific recipe - PrISM - for constructing datasets with examples of four different prediction scenarios: generic language modeling, guesswork, heuristics recall and exact fact recall. We apply two popular interpretability methods to the scenarios: causal tracing (CT) and information flow analysis. We find that both yield distinct results for each scenario. Results for exact fact recall and generic language modeling scenarios confirm previous conclusions about the importance of mid-range MLP sublayers for fact recall, while results for guesswork and heuristics indicate a critical role of late last token position MLP sublayers. In summary, we contribute resources for a more extensive and granular study of fact completion in LMs, together with analyses that provide a more nuanced understanding of how LMs process fact-related queries.
语言模型(LMS)能够根据许多可能的信号作出准确的预测,这种预测迅速进行,但并不完全与事实协会的回顾相吻合。然而,目前对LMS的解释没有考虑到这一点。例如,由于询问“Astrid Lindgren出生于”瑞典,并相应完成“瑞典”,因此,对于预测是否基于了解作者出生地点或假设瑞典出生的瑞典人具有瑞典声望姓名的人,没有区别。在本文中,我们提出了一个建构数据集的模型 – – PrISM – – 配有四种不同预测情景的例子:通用语言建模、猜测工作、超自然回顾和确切事实回顾。我们用两种流行的可解释方法来描述这些情景:因果追踪(CT)和信息流分析。我们发现,两种预测都会产生不同的结果。准确的回顾结果和通用语言建模情景假设证实了以往关于中波波波亚层对回顾的重要性的结论,而猜想结果和神话显示迟定的MLP子层的临界作用。在摘要中,我们为更广泛和与LMS相关的事实调查的完成过程提供了更广泛的资源。
Article 211
Title@2025-07-01 (2): Efficient Domain-adaptive Continual Pretraining for the Process Industry in the German Language
Title: Efficient Domain-adaptive Continual Pretraining for the Process Industry in the German Language | Effiziente Domain-Adaptive Kontinuierliche Vorschulung für die Prozessindustrie in der deutschen Sprache | 以德语为加工工业提供高效的、适应性强的连续连续培训 2504.19856v3 |
Authors (3): Anastasia Zhukova, Christian E. Matt, Bela Gipp
Domain-adaptive continual pretraining (DAPT) is a state-of-the-art technique that further trains a language model (LM) on its pretraining task, e.g., masked language modeling (MLM), when common domain adaptation via LM fine-tuning is not possible due to a lack of labeled task data. Although popular, MLM requires a significant corpus of domain-related data, which is difficult to obtain for specific domains in languages other than English, such as the process industry in the German language. This paper introduces an efficient approach called ICL-augmented pretraining or ICL-APT that leverages in-context learning (ICL) and k-nearest neighbors (kNN) to augment target data with domain-related and in-domain texts, significantly reducing GPU time while maintaining strong model performance. Our results show that the best configuration of ICL-APT performed better than the state-of-the-art DAPT by 28.7% (7.87 points) and requires almost 4 times less GPU-computing time, providing a cost-effective solution for industries with limited computational capacity. The findings highlight the broader applicability of this framework to other low-resource industries, making NLP-based solutions more accessible and feasible in production environments.
领域自适应持续预训练(DAPT)是一种最先进的技术,当由于缺乏带标签的任务数据而无法通过语言模型微调进行常规领域适应时,它在预训练任务(例如掩码语言建模,MLM)上进一步训练语言模型(LM)。MLM虽然流行,但需要大量领域相关数据,而对于英语以外语言的特定领域(例如德语的流程工业),这类数据很难获得。本文提出一种称为ICL增强预训练(ICL-APT)的高效方法,利用上下文学习(ICL)和k近邻(kNN),用领域相关文本和领域内文本扩充目标数据,在保持较强模型性能的同时显著减少GPU时间。结果表明,ICL-APT的最佳配置比最先进的DAPT高出28.7%(7.87分),所需GPU计算时间减少近4倍,为计算能力有限的行业提供了具有成本效益的解决方案。这些发现凸显了该框架对其他低资源行业的更广泛适用性,使基于NLP的解决方案在生产环境中更易获得、更可行。
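The kNN-based augmentation mentioned in the ICL-APT abstract can be sketched in a few lines of Python. This is only a guess at the general shape of the idea: the embeddings, the cosine ranking, and the way neighbours are prepended to the target text are assumptions for illustration, since the abstract does not describe how retrieved texts are combined with the target data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_augment(target_vec, target_text, corpus, k=2):
    """Prepend the k most similar corpus texts to the target text.

    corpus: list of (embedding, text) pairs (hypothetical data).
    """
    ranked = sorted(corpus, key=lambda item: cosine(target_vec, item[0]), reverse=True)
    neighbours = [text for _, text in ranked[:k]]
    return "\n".join(neighbours + [target_text])

# Toy two-dimensional embeddings (made up for illustration).
corpus = [
    ([1.0, 0.0], "pump maintenance log"),
    ([0.0, 1.0], "holiday schedule"),
    ([0.9, 0.1], "valve inspection note"),
]
augmented = knn_augment([1.0, 0.0], "shift report", corpus, k=2)
```

The augmented sample places the two nearest domain texts before the target text, so a single pretraining example carries more in-domain context than the target text alone.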
Article 212
Title@2025-07-01 (2): Integrating Expert Labels into LLM-based Emission Goal Detection: Example Selection vs Automatic Prompt Design
Title: Integrating Expert Labels into LLM-based Emission Goal Detection: Example Selection vs Automatic Prompt Design | Integration von Experten-Etiketten in LLM-basierte Emissionszielerkennung: Beispielauswahl vs Automatisches Prompt-Design | 将专家标签纳入基于LLM的LLM排放目标探测:选择实例与自动即时设计 2412.06432v2 |
Authors (5): Marco Wrzalik, Adrian Ulges, Anne Uersfeld, Florian Faust, Viola Campos
We address the detection of emission reduction goals in corporate reports, an important task for monitoring companies’ progress in addressing climate change. Specifically, we focus on the issue of integrating expert feedback in the form of labeled example passages into LLM-based pipelines, and compare the two strategies of (1) a dynamic selection of few-shot examples and (2) the automatic optimization of the prompt by the LLM itself. Our findings on a public dataset of 769 climate-related passages from real-world business reports indicate that automatic prompt optimization is the superior approach, while combining both methods provides only limited benefit. Qualitative results indicate that optimized prompts do indeed capture many intricacies of the targeted emission goal extraction task.
在公司报告中发现减排目标,这是监测公司在应对气候变化方面的进展的一项重要任务。 具体地说,我们侧重于将专家反馈纳入以LLM为基础的输油管中,并比较以下两种战略:(1) 生动地挑选几个实例,(2) 自动优化LLM本身的及时性。 我们对现实世界商业报告中769个与气候有关的通道的公开数据集的调查结果表明,自动迅速优化是优异的做法,同时将这两种方法结合起来只能带来有限的好处。 定性结果显示,优化的快速确实能够捕捉到目标排放目标提取任务的许多错综复杂之处。
Article 213
Title@2025-07-01 (2): TUM-MiKaNi at SemEval-2025 Task 3: Towards Multilingual and Knowledge-Aware Non-factual Hallucination Identification
Title: TUM-MiKaNi at SemEval-2025 Task 3: Towards Multilingual and Knowledge-Aware Non-factual Hallucination Identification | TUM-MiKaNi am SemEval-2025 Aufgabe 3: Mehrsprachige und wissensbasierte nicht-faktische Halluzinationsidentifikation | SemEval-2025任务的TUM-MIKANi 任务3:多语种和知识-知识-软件非事实幻觉识别 2507.00579v1 |
Authors (4): Miriam Anschütz, Ekaterina Gikalo, Niklas Herbster, Georg Groh
Hallucinations are one of the major problems of LLMs, hindering their trustworthiness and deployment to wider use cases. However, most of the research on hallucinations focuses on English data, neglecting the multilingual nature of LLMs. This paper describes our submission to the SemEval-2025 Task-3 - Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes. We propose a two-part pipeline that combines retrieval-based fact verification against Wikipedia with a BERT-based system fine-tuned to identify common hallucination patterns. Our system achieves competitive results across all languages, reaching top-10 results in eight languages, including English. Moreover, it supports multiple languages beyond the fourteen covered by the shared task. This multilingual hallucination identifier can help to improve LLM outputs and their usefulness in the future.
幻觉是LLMs的主要问题之一,妨碍了它们的信任性,也妨碍了将其运用到使用范围更广的案例中。然而,关于幻觉的研究大多侧重于英语数据,忽视了LLMs的多语种性质。本文描述了我们提交SemEval-2025任务-3-Mu-SHROOM、多种语言关于幻觉和相关可观测的代代代错的共享任务。我们建议建立一个由两部分组成的管道,将针对维基百科的基于检索的事实核查与基于BERT的系统结合起来,以找出共同的幻觉模式。我们的系统在所有语言之间都取得了竞争结果,以包括英语在内的八种语言达到10大结果。此外,它支持了共同任务覆盖的十四种以上的多种语言。这种多语言的幻觉识别功能可以帮助改进LM产出及其在未来的效用。
Article 214
Title@2025-07-01 (2): DiReCT: Diagnostic Reasoning for Clinical Notes via Large Language Models
Title: DiReCT: Diagnostic Reasoning for Clinical Notes via Large Language Models | DiReCT: Diagnostische Begründung für klinische Anmerkungen über große Sprachmodelle | DiReCT:通过大语言模型诊断临床说明的诊断理由 2408.01933v6 |
Authors (9): Bowen Wang, Jiuyang Chang, Yiming Qian, Guoxin Chen, Junhao Chen, Zhouqiang Jiang, Jiahao Zhang, Yuta Nakashima, Hajime Nagahara
Large language models (LLMs) have recently showcased remarkable capabilities, spanning a wide range of tasks and applications, including those in the medical domain. Models like GPT-4 excel in medical question answering but may face challenges in the lack of interpretability when handling complex tasks in real clinical settings. We thus introduce the diagnostic reasoning dataset for clinical notes (DiReCT), aiming at evaluating the reasoning ability and interpretability of LLMs compared to human doctors. It contains 511 clinical notes, each meticulously annotated by physicians, detailing the diagnostic reasoning process from observations in a clinical note to the final diagnosis. Additionally, a diagnostic knowledge graph is provided to offer essential knowledge for reasoning, which may not be covered in the training data of existing LLMs. Evaluations of leading LLMs on DiReCT bring out a significant gap between their reasoning ability and that of human doctors, highlighting the critical need for models that can reason effectively in real-world clinical scenarios.
大型语言模型(LLMS)最近展示了非凡的能力,涉及广泛的任务和应用,包括医疗领域的任务和应用,GPT-4等模型在医疗问题回答方面表现优异,但在实际临床环境中处理复杂任务时可能面临缺乏解释的挑战,因此我们引入临床说明的诊断推理数据集(DIRECT),目的是评估LLMS与人类医生相比的推理能力和可解释性,其中包括511份临床说明,每份由医生仔细注解,详细说明诊断推理过程,从临床说明中的观察到最终诊断。此外,还提供了诊断知识图表,以提供用于推理的必要知识,而现有的LLMS的培训数据可能没有涵盖这种知识。DIRCT主要LMS的评析揭示了其推理能力与人类医生的推理能力之间的巨大差距,突出了在现实世界临床假设中能够有效解释的模型的关键需要。
Article 215
Title@2025-07-01 (2): Methodological Rigour in Algorithm Application: An Illustration of Topic Modelling Algorithm
Title: Methodological Rigour in Algorithm Application: An Illustration of Topic Modelling Algorithm | Methodologische Rigour in Algorithmen Anwendung: Eine Illustration der Themenmodellierung Algorithmen | 算法应用中的方法严谨性:以主题建模算法为例 2507.00547v1 |
Authors (1): Malmi Amadoru
The rise of advanced computational algorithms has opened new avenues for computationally intensive research approaches to theory development. However, the opacity of these algorithms and lack of transparency and rigour in their application pose methodological challenges, potentially undermining trust in research. The discourse on methodological rigour in this new genre of research is still emerging. Against this backdrop, I attempt to offer guidance on methodological rigour, particularly in the context of topic modelling algorithms. By illustrating the application of the structural topic modelling algorithm and presenting a set of guidelines, I discuss how to ensure rigour in topic modelling studies. Although the guidelines are for the application of topic modelling algorithms, they can be applied to other algorithms with context-specific adjustments. The guidelines are helpful, especially for novice researchers applying topic modelling, and editors and reviewers handling topic modelling manuscripts. I contribute to the literature on topic modelling and join the emerging dialogue on methodological rigour in computationally intensive theory construction research.
先进的计算算法的兴起为计算密集研究理论发展开辟了新的途径,然而,这些算法的不透明及其应用缺乏透明度和严谨性带来了方法挑战,有可能破坏对研究的信任。关于这种新的研究类型中方法严谨的讨论仍在出现。在这种背景下,我试图就方法严谨性提供指导,特别是在专题模型算法方面。我通过说明结构专题模型算法的应用和提出一套准则,讨论了如何确保专题模型研究的严格性。虽然这些准则是用于专题模型算法的应用,但可以适用于有具体背景调整的其他算法。这些准则很有帮助,特别是对应用专题建模的新兴研究人员、以及处理专题模型手稿的编辑和审查者。我为专题建模文献作出了贡献,并加入了关于计算密集的理论建筑研究中方法严谨性的新对话。
Article 216
Title@2025-07-01 (2): An evaluation of LLMs and Google Translate for translation of selected Indian languages via sentiment and semantic analyses
Title: An evaluation of LLMs and Google Translate for translation of selected Indian languages via sentiment and semantic analyses | Eine Auswertung von LLMs und Google Translate zur Übersetzung ausgewählter indischer Sprachen über Sentiment und semantische Analysen | 通过情感与语义分析评估LLM和Google翻译对部分印度语言的翻译 2503.21393v3 |
Authors (3): Rohitash Chandra, Aryan Chaudhari, Yeshwanth Rayavarapu
Large language models (LLMs) have become prominent in language translation, including for low-resource languages. There has been limited study on the assessment of the quality of translations generated by LLMs such as Gemini and GPT, and by Google Translate. This study addresses this limitation by using semantic and sentiment analysis of selected LLMs for Indian languages, including Sanskrit, Telugu and Hindi. We select prominent texts (Bhagavad Gita, Tamas and Maha Prasthanam) that have been well translated by experts and use LLMs to generate their translations into English, and provide a comparison with selected expert (human) translations. Our investigation revealed that while LLMs have made significant progress in translation accuracy, challenges remain in preserving sentiment and semantic integrity, especially in metaphorical and philosophical contexts for texts such as the Bhagavad Gita. The sentiment analysis revealed that GPT models are better at preserving the sentiment polarity for the given texts when compared to human (expert) translation. The results revealed that GPT models are generally better at maintaining the sentiment and semantics when compared to Google Translate. This study could help in the development of accurate and culturally sensitive translation systems for large language models.
大型语言模型(LLMS)在语言翻译方面占有突出地位,包括低资源语言。关于LLMS(包括Gemini、GPT和Google Translate)所产生翻译质量评估的研究有限。这项研究通过对包括Sanskrit、Telugu和Hindi等印度语的选定LLMs(LLMs)进行语义和情绪分析,解决了这一局限性。我们选择了由专家很好地翻译的著名文本(Bhagavad Gita、Tamas和Maha Prasthanam),并使用LMS(LMs)制作英文译文,并与选定的专家(人)翻译进行了比较。我们的调查显示,LLMS在翻译准确性方面取得重大进展,但在保护情绪和语义完整性方面仍然存在挑战,特别是在Bhagavad Gita等语言的隐喻和哲学背景方面。感知性分析表明,GPTS模型比人类(专家)翻译更能保护给的文本的情绪极极性。结果显示,GPTM模型在与Google Translate相比,一般在维持感和语大语言翻译方面会有助于发展准确和文化敏感的翻译系统。
Article 217
Title@2025-07-01 (2): Capsule Network-Based Semantic Intent Modeling for Human-Computer Interaction
Title: Capsule Network-Based Semantic Intent Modeling for Human-Computer Interaction | Capsule Network-based Semantic Intent Modellierung für Mensch-Computer-Interaktion | Capsule 网络基于网络的人类-计算机相互作用的语义内涵建模模型 2507.00540v1 |
Authors (4): Shixiao Wang, Yifan Zhuang, Runsheng Zhang, Zhijun Song
This paper proposes a user semantic intent modeling algorithm based on Capsule Networks to address the problem of insufficient accuracy in intent recognition for human-computer interaction. The method represents semantic features in input text through a vectorized capsule structure. It uses a dynamic routing mechanism to transfer information across multiple capsule layers. This helps capture hierarchical relationships and part-whole structures between semantic entities more effectively. The model uses a convolutional feature extraction module as the low-level encoder. After generating initial semantic capsules, it forms high-level abstract intent representations through an iterative routing process. To further enhance performance, a margin-based mechanism is introduced into the loss function. This improves the model’s ability to distinguish between intent classes. Experiments are conducted using a public natural language understanding dataset. Multiple mainstream models are used for comparison. Results show that the proposed model outperforms traditional methods and other deep learning structures in terms of accuracy, F1-score, and intent detection rate. The study also analyzes the effect of the number of dynamic routing iterations on model performance. A convergence curve of the loss function during training is provided. These results verify the stability and effectiveness of the proposed method in semantic modeling. Overall, this study presents a new structured modeling approach to improve intent recognition under complex semantic conditions.
本文建议基于 Capsule 网络的用户语义意图模型算法, 以解决人类计算机互动的意向识别不够准确的问题。 该方法代表了通过矢量化胶囊结构在输入文本中的语义特征。 它使用动态路由机制将信息传递到多个胶囊层中。 这有助于更有效地捕捉语义实体之间的等级关系和部分整体结构。 模型使用一个相通特征提取模块作为低级编码器。 在生成初始语义胶囊后, 它通过迭代路由进程形成高层次抽象意向表达。 为了进一步提高性能, 在损失函数中引入一个以差值为基础的机制。 这提高了模型区分意图类别的能力。 正在使用公共自然语言理解数据集进行实验。 多种主流模型用于比较。 结果表明, 拟议的模型在准确性、 F1 点和意向检测率方面超越了传统方法和其他深层次学习结构。 研究还分析了动态路由模式运行对模型性运行的影响。 在模型运行过程中, 正在提供一种趋汇曲线曲线曲线, 来验证这一结构损失函数在培训中显示稳定性结构化的方法。
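The dynamic routing and margin-based loss named in the abstract above can be sketched in plain Python. The sketch follows the standard routing-by-agreement procedure and margin loss from the original Capsule Network literature; whether this paper uses exactly these forms is an assumption, and the toy prediction vectors are made up.

```python
import math

def squash(s):
    """Shrink vector s so its length lies in [0, 1) while keeping its direction."""
    sq = sum(x * x for x in s)
    if sq == 0.0:
        return [0.0] * len(s)
    scale = sq / (1.0 + sq) / math.sqrt(sq)
    return [scale * x for x in s]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_routing(u_hat, iterations=3):
    """u_hat[i][j] is the prediction vector from lower capsule i for upper capsule j."""
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits
    v = []
    for _ in range(iterations):
        c = [softmax(row) for row in b]  # coupling coefficients per lower capsule
        v = []
        for j in range(n_out):
            s = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in)) for d in range(dim)]
            v.append(squash(s))
        for i in range(n_in):
            for j in range(n_out):
                # Raise the logit when the prediction agrees with the output capsule.
                b[i][j] += sum(a * w for a, w in zip(u_hat[i][j], v[j]))
    return v

def margin_loss(lengths, target, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Margin loss over capsule lengths; target is the index of the true intent class."""
    loss = 0.0
    for k, length in enumerate(lengths):
        if k == target:
            loss += max(0.0, m_pos - length) ** 2
        else:
            loss += lam * max(0.0, length - m_neg) ** 2
    return loss

# Two lower capsules that both vote strongly for upper capsule 0 (toy values).
u_hat = [[[2.0, 0.0], [0.1, 0.0]], [[2.0, 0.0], [0.0, 0.1]]]
outputs = dynamic_routing(u_hat)
lengths = [math.sqrt(sum(x * x for x in vec)) for vec in outputs]
```

The length of each output capsule acts as the confidence for the corresponding intent class, and the margin loss pushes the true class length above `m_pos` and the others below `m_neg`.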
Article 218
Title@2025-07-01 (2): NIRANTAR: Continual Learning with New Languages and Domains on Real-world Speech Data
Title: NIRANTAR: Continual Learning with New Languages and Domains on Real-world Speech Data | NIRANTAR: Kontinuierliches Lernen mit neuen Sprachen und Domänen auf Real-World Speech Data | NIRANTAR: 关于现实世界语言数据的新语言和新域域的不断学习 2507.00534v1 |
Authors (3): Tahir Javed, Kaushal Bhogale, Mitesh M. Khapra
We introduce Nirantar, a comprehensive framework for evaluating continual learning (CL) in multilingual and multi-domain ASR. Designed to reflect real-world CL challenges, Nirantar leverages data collected incrementally across 22 languages and 208 districts in India through natural episodes. This enables evaluation across Language-Incremental (LIL), Domain-Incremental (DIL), and the novel Language-Incremental Domain-Incremental Learning (LIDIL) scenarios. Unlike prior work that relies on simulated episodes, Nirantar presents dynamic, non-uniform language and domain shifts, making it an ideal testbed for CL research. With 3250 hours of human-transcribed speech, including 1720 hours newly introduced in this work, our framework enables systematic benchmarking of CL methods. We evaluate existing approaches and demonstrate that no single method performs consistently well, underscoring the need for more robust CL strategies.
我们引入了Nirantar,这是在多语言和多领域ASR中评价持续学习(CL)的综合框架。Nirantar为反映现实世界CL挑战而设计,利用印度22种语言和208个地区通过自然事件逐步收集的数据,从而能够对语言-语言(LIL)、语言-语言(DIL)和新颖的语言-语言(CEP)-语言(DIDIL)方法进行评估。与以前依赖模拟事件的工作不同,Nirantar展示了动态的、非统一的语言和域变,使其成为CL研究的理想测试台。3250小时的人文演讲,包括这项工作中新引入的1720小时,我们的框架使得CL方法的系统基准得以进行。我们评估了现有方法,并表明没有任何单一方法能够始终如一地运行,强调需要更强有力的CL战略。
Article 219
Title@2025-07-01 (2): SAGE: Steering Dialog Generation with Future-Aware State-Action Augmentation
Title: SAGE: Steering Dialog Generation with Future-Aware State-Action Augmentation | SAGE: Steuerungsdialog-Generierung mit zukunftssicherer State-Action-Erweiterung | SAGE: 具有未来意识的国家行动增强作用的引导对话生成 2503.03040v2 |
Authors (2): Yizhe Zhang, Navdeep Jaitly
Recent advances in large language models have demonstrated impressive capabilities in task-oriented applications, yet building emotionally intelligent chatbots that can engage in natural, strategic conversations remains a challenge. We present a novel approach called SAGE that uses latent variables to control long-horizon behavior in dialogue generation. At the core of our method is the State-Action Chain (SAC), which augments standard language model fine-tuning by introducing latent variables that encapsulate emotional states and conversational strategies between dialogue turns. During inference, these variables are generated before each response, enabling coarse-grained control over dialogue progression while maintaining natural interaction patterns. We also introduce a self-improvement pipeline that leverages dialogue tree search, LLM-based reward modeling, and targeted fine-tuning to optimize conversational trajectories. Our experimental results show that models trained with this approach demonstrate improved performance in emotional intelligence metrics while maintaining strong capabilities on LLM benchmarks. The discrete nature of our latent variables facilitates search-based strategies and provides a foundation for future applications of reinforcement learning to dialogue systems, where learning can occur at the state level rather than the token level. https://github.com/apple/ml-sage-dialog-gen
大语言模型的最新进展在面向任务的应用中展现了令人印象深刻的能力,但构建能够进行自然的、有策略的对话的高情商聊天机器人仍是一项挑战。我们提出了名为SAGE的新方法,利用潜变量控制对话生成中的长程行为。该方法的核心是状态-动作链(SAC),它通过引入潜变量来增强标准的语言模型微调,这些潜变量概括了对话轮次之间的情绪状态和对话策略。在推理过程中,这些变量在每次回复之前生成,从而在保持自然互动模式的同时实现对对话进程的粗粒度控制。我们还引入了一个自我改进流程,利用对话树搜索、基于LLM的奖励建模和有针对性的微调来优化对话轨迹。实验结果表明,用这种方法训练的模型在情商指标上表现更好,同时在LLM基准上保持了较强的能力。潜变量的离散性质便于采用基于搜索的策略,并为今后将强化学习应用于对话系统奠定了基础,使学习可以在状态层面而不是词元层面进行。 https://github.com/apple/ml-sage-dialog-gen
Article 220
Title@2025-07-01 (2): TeamCMU at Touché: Adversarial Co-Evolution for Advertisement Integration and Detection in Conversational Search
Title: TeamCMU at Touché: Adversarial Co-Evolution for Advertisement Integration and Detection in Conversational Search | TeamCMU bei Touché: Adversariale Co-Evolution für Werbung Integration und Detektion in der Conversational Search | CMU 接触问题小组:在谈话搜索中进行广告融合和探测的反向共同革命 2507.00509v1 |
Authors (4): To Eun Kim, João Coelho, Gbemileke Onilude, Jai Singh
As conversational search engines increasingly adopt generation-based paradigms powered by Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG), the integration of advertisements into generated responses presents both commercial opportunities and challenges for user experience. Unlike traditional search, where advertisements are clearly delineated, generative systems blur the boundary between informational content and promotional material, raising concerns around transparency and trust. In this work, we propose a modular pipeline for advertisement management in RAG-based conversational systems, consisting of an ad-rewriter for seamless ad integration and a robust ad-classifier for detection. We leverage synthetic data to train high-performing classifiers, which are then used to guide two complementary ad-integration strategies: supervised fine-tuning of the ad-rewriter and a best-of-N sampling approach that selects the least detectable ad-integrated response among multiple candidates. Our evaluation focuses on two core questions: the effectiveness of ad classifiers in detecting diverse ad integration strategies, and the training methods that best support coherent, minimally intrusive ad insertion. Experimental results show that our ad-classifier, trained on synthetic advertisement data inspired by marketing strategies and enhanced through curriculum learning, achieves robust detection performance. Additionally, we demonstrate that classifier-guided optimization, through both fine-tuning and best-of-N sampling, significantly improves ad stealth, enabling more seamless integration. These findings contribute an adversarial co-evolution framework for developing more sophisticated ad-aware generative search systems and robust ad classifiers.
由于对话搜索引擎越来越多地采用由大语言模型(LLMs)和检索启动一代(RAG)所推动的基于生成模式,因此将广告纳入生成的响应中既提供了商业机会,也带来了用户经验的挑战。与传统的搜索不同,即广告被明确划定,基因化系统模糊了信息内容和宣传材料之间的界限,引起对透明度和信任的关切。在这项工作中,我们提议在基于RAG的谈话系统中为广告管理提供一个模块化管道,包括一个无缝广告整合的特设汇编和强有力的检测分类。我们利用合成数据培训高性能的分类师,然后用于指导两种互补的整合战略:监管对高级编辑的微调和最佳N抽样方法,在多个候选人中选择最难察觉的对应综合反应。我们的评价侧重于两个核心问题:分类师在发现多种一体化战略方面的有效性,以及最可靠、最难渗透性化的升级化附加培训方法。实验结果表明,我们的高级分类、经过营销战略启发的合成广告数据培训的高级分类师,通过升级、更稳健的升级的模板测试,通过学习课程实现最佳的改进。
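The best-of-N selection described in the abstract above (choose, among several candidate responses, the one the ad classifier finds least detectable) reduces to a minimum over classifier scores. The Python sketch below uses a keyword-counting stand-in for the classifier; `toy_detector`, its word list, and the candidate texts are hypothetical and bear no relation to the trained classifier in the paper.

```python
def best_of_n(candidates, detect_prob):
    """Return the candidate that the detector scores as least likely to contain an ad."""
    return min(candidates, key=detect_prob)

def toy_detector(text):
    """Stand-in ad classifier: share of words that look promotional."""
    promo = {"buy", "discount", "sponsored", "deal"}
    words = text.lower().split()
    return sum(w.strip(".,!") in promo for w in words) / len(words)

# Two made-up candidate responses that mention the same product.
candidates = [
    "Sponsored deal: buy the X200 blender at a discount!",
    "A blender with a sturdy jar, such as the X200, handles frozen fruit well.",
]
selected = best_of_n(candidates, toy_detector)
```

The same scoring function serves both sides of the co-evolution the abstract describes: it guides the rewriter's selection here, and the selected outputs become harder examples for training the next classifier.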
Article 221
Title@2025-07-01 (2): Learning-to-Context Slope: Evaluating In-Context Learning Effectiveness Beyond Performance Illusions
Title: Learning-to-Context Slope: Evaluating In-Context Learning Effectiveness Beyond Performance Illusions | Learning-to-Context Slope: Bewertung von In-Context-Lerneffektivität jenseits von Performance-Illusionen | 学习到文字表达式:评价除了业绩幻觉之外在学习中的效果 2506.23146v2 |
Authors (6): Dingzriui Wang, Xuanliang Zhang, Keyan Xu, Qingfu Zhu, Wanxiang Che, Yang Deng
In-context learning (ICL) has emerged as an effective approach to enhance the performance of large language models (LLMs). However, its effectiveness varies significantly across models and tasks, posing challenges for practitioners to determine when ICL reliably improves performance. Current evaluation approaches, reliant on performance change after applying ICL, suffer from low reliability, poor attribution, and impracticality in data-insufficient scenarios. We propose the Learning-to-Context Slope (LCS), a novel metric that quantifies ICL effectiveness by modeling the slope between learning gain (loss decrease from demonstrations) and contextual relevance (demonstration-input relevance). LCS addresses key limitations of performance-based metrics: (1) it captures continuous loss changes even when outputs are incorrect, improving reliability; (2) its formulation attributes ICL failures to weak contextual alignment (inability to adapt inputs to demonstrations) or strong output calibration (self-verification of correctness); and (3) it minimizes reliance on labeled data via synthetic evaluation. Extensive experiments demonstrate that LCS strongly correlates with performance improvements in labeled settings and reliably reflects true effectiveness in biased or data-scarce scenarios. Further analysis reveals actionable thresholds for LCS and identifies model capabilities critical to ICL success.
长期学习(ICL)是提高大型语言模型绩效的有效方法,但其有效性因模式和任务的不同而差别很大,给实践者带来挑战,让他们确定ICL何时可靠地改进业绩。目前的评价方法,在应用ICL后依赖绩效变化,在数据不足的假设情景中,其可靠性低、归属性差、不切实际性差。我们建议采用 “ 学习到Context Slope(LCS) “ (LCS)这一新指标,通过模拟学习收益(演示损失减少)和背景相关性(模拟-投入相关性)之间的斜坡度来量化ICLI的有效性。 LCS处理基于业绩的指标的关键局限性:(1) 它可以捕捉持续的损失变化,即使产出不正确,也可以提高可靠性;(2) 其拟订将ICLL失败归因于环境差(无法根据演示调整投入)或强有力的产出校准(自我核实正确性);(3) 通过综合评价最大限度地减少对标签数据的依赖。广泛的实验表明,LCS与标签环境中的绩效改进密切相关,可靠地反映I-CS级或CL-CR关键假设情景中的真实有效性。进一步分析显示行动极限。
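The abstract above describes LCS as the slope between learning gain (loss decrease from demonstrations) and contextual relevance. A minimal Python sketch of that idea fits an ordinary least-squares slope through (relevance, gain) points. The exact formulation in the paper is not given in the abstract, so the least-squares fit, the function names, and the toy numbers are assumptions.

```python
def learning_gain(loss_zero_shot, loss_with_demos):
    """Loss decrease obtained by adding demonstrations to the prompt."""
    return loss_zero_shot - loss_with_demos

def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Toy measurements (made up): relevance of demonstrations to each input,
# and the model loss without and with those demonstrations.
relevance = [0.1, 0.4, 0.7, 1.0]
zero_shot_loss = [2.0, 2.0, 2.0, 2.0]
demo_loss = [1.95, 1.8, 1.65, 1.5]
gains = [learning_gain(z, d) for z, d in zip(zero_shot_loss, demo_loss)]
lcs = slope(relevance, gains)
```

A clearly positive slope means that more relevant demonstrations yield larger loss reductions, which is the situation in which ICL is actually helping. Because the measure works on losses rather than on output correctness, it can be computed without labels for final answers.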
Article 222
Title@2025-07-01 (2): ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition
Title: ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition | ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition | 研究场所:通过基于灵感的分解任务,为科学发现中的科学发现中LLMs制定基准 2503.21248v2 |
Authors (10): Yujie Liu, Zonglin Yang, Tong Xie, Jinjie Ni, Ben Gao, Yuqiang Li, Shixiang Tang, Wanli Ouyang, Erik Cambria, Dongzhan Zhou
Large language models (LLMs) have demonstrated potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined due to the lack of a dedicated benchmark. To address this gap, we introduce the first large-scale benchmark for evaluating LLMs with a near-sufficient set of sub-tasks of scientific discovery: inspiration retrieval, hypothesis composition, and hypothesis ranking. We develop an automated framework that extracts critical components - research questions, background surveys, inspirations, and hypotheses - from scientific papers across 12 disciplines, with expert validation confirming its accuracy. To prevent data contamination, we focus exclusively on papers published in 2024, ensuring minimal overlap with LLM pretraining data. Our evaluation reveals that LLMs perform well in retrieving inspirations, an out-of-distribution task, suggesting their ability to surface novel knowledge associations. This positions LLMs as “research hypothesis mines”, capable of facilitating automated scientific discovery by generating innovative hypotheses at scale with minimal human intervention.
大型语言模型(LLMS)在协助科学研究方面表现出了潜力,然而,由于缺乏专门的基准,它们发现高质量研究假设的能力仍未受到审查。为弥补这一差距,我们引入了第一个大型基准,用于评估具有几乎足够科学发现子任务(灵感检索、假设构成和假设排名)的LLMS,我们开发了一个自动框架,从12个学科的科学论文中提取关键组成部分(研究问题、背景调查、灵感和假设),由专家确认其准确性。为了防止数据污染,我们专门关注2024年发表的论文,确保与LLM预培训数据尽可能少有重叠。我们的评估表明LLMS在恢复灵感方面表现良好,这是一项分配外的任务,表明它们有能力形成新的知识协会。LLMs作为“研究假矿”的位置,能够通过在最低限度的人类干预下生成创新假设,促进自动科学发现。
Article 223
Title@2025-07-01 (2): ComRAG: Retrieval-Augmented Generation with Dynamic Vector Stores for Real-time Community Question Answering in Industry
Title: ComRAG: Retrieval-Augmented Generation with Dynamic Vector Stores for Real-time Community Question Answering in Industry | ComRAG: Retrieval-Augmented Generation mit dynamischen Vector Stores für Echtzeit-Community-Frageantworten in der Industrie | ComRAG: 利用动态矢量储存库实时社区工业问题回答实时社区问题的回收-原始一代 2506.21098v2 |
Authors (8): Qinwen Chen, Wenbiao Tao, Zhiwei Zhu, Mingfan Xi, Liangzhong Guo, Yuan Wang, Wei Wang, Yunshi Lan
Community Question Answering (CQA) platforms serve as important community knowledge bases, but effectively leveraging historical interactions and domain knowledge in real time remains a challenge. Existing methods often underutilize external knowledge, fail to incorporate dynamic historical QA context, or lack memory mechanisms suited for industrial deployment. We propose ComRAG, a retrieval-augmented generation framework for real-time industrial CQA that integrates static knowledge with dynamic historical QA pairs via a centroid-based memory mechanism designed for retrieval, generation, and efficient storage. Evaluated on three industrial CQA datasets, ComRAG consistently outperforms all baselines, achieving up to 25.9% improvement in vector similarity, reducing latency by 8.7% to 23.3%, and lowering chunk growth from 20.23% to 2.06% over iterations.
社区问题解答平台(CQA)可被视为社区的重要知识基础,但有效地利用历史互动和实时领域知识仍是一项挑战。现有方法往往没有充分利用外部知识,没有纳入动态的历史QA环境,或缺乏适合工业部署的记忆机制。我们提议ComRAG,即实时工业问题解答平台的检索强化生成框架,通过为检索、生成和高效存储设计的基于机器人的记忆机制,将静态知识与动态的历史质量评估对口结合起来。对三个工业性质量解析数据集进行了评估,ComRAG始终超越了所有基线实现率达到25.9%的矢量相似性改进率,将延迟率从8.7%降至23.3%,并将迭代面积增长从20.23%降至2.06%。
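The abstract does not spell out how the centroid-based memory works, so the following is only a minimal sketch of one plausible reading: each incoming QA embedding is merged into the nearest cluster when it is similar enough, and otherwise opens a new cluster, which keeps storage growth bounded. The class name `CentroidMemory`, the `threshold` value, and the running-mean update are assumptions made for illustration, not details taken from ComRAG.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class CentroidMemory:
    """Hypothetical centroid-based store for historical QA pairs."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold  # assumed merge threshold, not from the paper
        self.centroids = []         # one running-mean vector per cluster
        self.members = []           # QA pairs grouped by cluster

    def _nearest(self, embedding):
        """Return (index, similarity) of the closest centroid, or (None, -inf)."""
        best, best_sim = None, float("-inf")
        for i, centroid in enumerate(self.centroids):
            sim = cosine(embedding, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        return best, best_sim

    def add(self, embedding, qa_pair):
        """Merge into the nearest cluster if similar enough, else open a new one."""
        best, best_sim = self._nearest(embedding)
        if best is not None and best_sim >= self.threshold:
            n = len(self.members[best])
            old = self.centroids[best]
            # Running mean keeps the centroid representative of its members.
            self.centroids[best] = [(c * n + e) / (n + 1) for c, e in zip(old, embedding)]
            self.members[best].append(qa_pair)
            return best
        self.centroids.append(list(embedding))
        self.members.append([qa_pair])
        return len(self.centroids) - 1

    def retrieve(self, embedding):
        """Return the QA pairs stored in the cluster closest to the query."""
        best, _ = self._nearest(embedding)
        return [] if best is None else list(self.members[best])
```

In this sketch the number of stored centroids grows only when a genuinely new kind of question arrives, which is the intuition behind the reduced chunk growth reported above.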
Article 224
Title@2025-07-01 (2): Beat and Downbeat Tracking in Performance MIDI Using an End-to-End Transformer Architecture
Title: Beat and Downbeat Tracking in Performance MIDI Using an End-to-End Transformer Architecture | Beat- und Downbeat-Tracking in Performance-MIDI mit End-to-End-Transformer-Architektur | 利用端对端转换器架构进行实绩跟踪的MIDI 2507.00466v1 |
Authors (2): Sebastian Murgul, Michael Heizmann
Beat tracking in musical performance MIDI is a challenging and important task for notation-level music transcription and rhythmical analysis, yet existing methods primarily focus on audio-based approaches. This paper proposes an end-to-end transformer-based model for beat and downbeat tracking in performance MIDI, leveraging an encoder-decoder architecture for sequence-to-sequence translation of MIDI input to beat annotations. Our approach introduces novel data preprocessing techniques, including dynamic augmentation and optimized tokenization strategies, to improve accuracy and generalizability across different datasets. We conduct extensive experiments using the A-MAPS, ASAP, GuitarSet, and Leduc datasets, comparing our model against state-of-the-art hidden Markov models (HMMs) and deep learning-based beat tracking methods. The results demonstrate that our model outperforms existing symbolic music beat tracking approaches, achieving competitive F1-scores across various musical styles and instruments. Our findings highlight the potential of transformer architectures for symbolic beat tracking and suggest future integration with automatic music transcription systems for enhanced music analysis and score generation.
音乐性能的比喻跟踪 MIDI 是音乐级音乐记录和节奏分析的一个具有挑战性和重要的任务,但现有方法主要侧重于基于音频的方法。 本文提出了一个基于终端到终端的变压器模型,用于在性能MIDI中进行击打和击落跟踪,利用一个编码器解码器结构来将MIDI的输入进行顺序到顺序的翻译以击打注解。 我们的方法引入了新的预处理技术,包括动态增强和优化符号化战略,以提高不同数据集的准确性和通用性。 我们利用A-MAPS、ASAP、GuitarSet和Leduc数据集进行了广泛的实验,将我们的模型与最新隐藏的Markov模型(HMMs)和深层学习的击动跟踪方法进行比较。 结果表明,我们的模型超越了现有的象征性音乐击打跟踪方法,实现了各种音乐风格和乐器的F1核心具有竞争力。 我们的研究结果突出表明了变压器结构在象征性击打跟踪方面的潜力,并建议今后与自动音乐记录系统整合,以加强音乐分析和生成。
Article 225
Title@2025-07-01 (2): Pitfalls of Evaluating Language Models with Open Benchmarks
Title: Pitfalls of Evaluating Language Models with Open Benchmarks | Lücken bei der Bewertung von Sprachmodellen mit offenen Benchmarks | 具有开放基准的评价语言模式的空洞 2507.00460v1 |
Authors (5): Md. Najib Hasan, Mohammad Fakhruddin Babar, Souvika Sarkar, Monowar Hasan, Santu Karmaker
Open Large Language Model (LLM) benchmarks, such as HELM and BIG-bench, offer standardized, transparent protocols that facilitate the fair comparison, reproducibility, and iterative advancement of Language Models (LMs). However, their openness also introduces critical and underexplored pitfalls. This study exposes these weaknesses by systematically constructing "cheating" models (smaller variants of BART, T5, and GPT-2 fine-tuned directly on public test sets) that achieve top rankings on a prominent open, holistic benchmark (HELM) despite poor generalization and limited practical utility. Our findings underscore three key insights: (a) high leaderboard performance on open benchmarks may not always reflect real-world effectiveness; (b) private or dynamic benchmarks must complement open evaluations to safeguard integrity; and (c) a fundamental reevaluation of current benchmarking practices is essential to ensure robust and trustworthy LM assessments.
开放大语言模型(LLM)基准,如HELM和BIG-bench,提供了标准化、透明的协议,便利了语言模型的公平比较、再复制和迭代发展。然而,它们的开放性也带来了关键和未得到充分探索的陷阱。本研究通过系统构建“切换”模型,即BART、T5和GPT-2的较小变体,直接在公共测试组上进行微调,暴露了这些弱点。 尽管普遍化程度不高,实用性也有限,但是在突出的开放、整体基准(HELM)上取得了最高排名。我们的调查结果强调了三个主要见解:开放基准的高领导板业绩不一定反映现实世界的实效;\cb私人或动态基准必须补充公开评估,以维护完整性;以及对当前基准做法进行根本的重新评价对于确保可靠和可靠的LM评估至关重要。
Article 226
Title@2025-07-01 (2): Overcoming Long-Context Limitations of State-Space Models via Context-Dependent Sparse Attention
Title: Overcoming Long-Context Limitations of State-Space Models via Context-Dependent Sparse Attention | Überwinden von Langkontext-Grenzen von State-Space-Modellen über Kontext-Abhängige Sparse-Achtung | 克服国家空间模型通过环境依赖性分散关注而克服国家空间模型的长文限制 2507.00449v1 |
Authors (4): Zhihao Zhan, Jianan Zhao, Zhaocheng Zhu, Jian Tang
Efficient long-context modeling remains a critical challenge for natural language processing (NLP), as the time complexity of the predominant Transformer architecture scales quadratically with the sequence length. While state-space models (SSMs) offer alternative sub-quadratic solutions, they struggle to capture long-range dependencies effectively. In this work, we focus on analyzing and improving the long-context modeling capabilities of SSMs. We show that the widely used synthetic task, associative recall, which requires a model to recall a value associated with a single key without context, insufficiently represents the complexities of real-world long-context modeling. To address this limitation, we extend associative recall to a novel synthetic task, joint recall, which requires a model to recall the value associated with a key given in a specified context. Theoretically, we prove that SSMs do not have the expressiveness to solve multi-query joint recall in sub-quadratic time complexity. To resolve this issue, we propose a solution based on integrating SSMs with Context-Dependent Sparse Attention (CDSA), which has the expressiveness to solve multi-query joint recall with sub-quadratic computation. To bridge the gap between theoretical analysis and real-world applications, we propose locality-sensitive Hashing Attention with sparse Key Selection (HAX), which instantiates the theoretical solution and is further tailored to natural language domains. Extensive experiments on both synthetic and real-world long-context benchmarks show that HAX consistently outperforms SSM baselines and SSMs integrated with context-independent sparse attention (CISA).
高效的长上下文建模仍是自然语言处理(NLP)面临的关键挑战,因为主流Transformer架构的时间复杂度随序列长度呈二次方增长。状态空间模型(SSM)提供了次二次复杂度的替代方案,但难以有效捕捉长程依赖。本文着重分析并改进SSM的长上下文建模能力。我们指出,广泛使用的合成任务“联想回忆”(associative recall)只要求模型在没有上下文的情况下回忆与单个键相关联的值,不足以体现真实长上下文建模的复杂性。为此,我们将联想回忆扩展为一项新的合成任务“联合回忆”(joint recall),要求模型回忆在指定上下文中给定键所对应的值。理论上,我们证明SSM不具备在次二次时间复杂度内解决多查询联合回忆的表达能力。为解决这一问题,我们提出将SSM与上下文相关稀疏注意力(CDSA)相结合的方案,它具备以次二次计算量解决多查询联合回忆的表达能力。为弥合理论分析与实际应用之间的差距,我们提出基于局部敏感哈希并结合稀疏键选择的注意力机制(HAX),它是该理论方案的具体实现,并针对自然语言领域做了进一步适配。在合成与真实长上下文基准上的大量实验表明,HAX始终优于SSM基线以及与上下文无关稀疏注意力(CISA)相结合的SSM。
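To make the difference between the two synthetic tasks concrete, here is a small sketch of how such examples might be generated. The token naming (`c*` for contexts, `k*` for keys, `v*` for values), the prompt layout, and all function names are assumptions for illustration; the paper's actual data format is not given in the abstract.

```python
import random


def make_associative_recall(num_pairs, rng):
    """Key-value pairs followed by one query key; the answer is its value."""
    keys = [f"k{i}" for i in range(num_pairs)]
    table = {k: f"v{rng.randrange(100)}" for k in keys}
    query = rng.choice(keys)
    prompt = [tok for k in keys for tok in (k, table[k])] + [query]
    return prompt, table[query]


def make_joint_recall(num_contexts, num_keys, rng):
    """Every context reuses the same keys with different values.

    The query is a (context, key) pair, so the key alone is ambiguous.
    """
    keys = [f"k{i}" for i in range(num_keys)]
    table, prompt = {}, []
    for c in range(num_contexts):
        ctx = f"c{c}"
        prompt.append(ctx)
        for k in keys:
            value = f"v{rng.randrange(100)}"
            table[(ctx, k)] = value
            prompt += [k, value]
    q_ctx = f"c{rng.randrange(num_contexts)}"
    q_key = rng.choice(keys)
    prompt += [q_ctx, q_key]
    return prompt, table[(q_ctx, q_key)]


def solve_joint_recall(prompt):
    """Reference solver: scan the prompt while tracking the current context."""
    q_ctx, q_key = prompt[-2], prompt[-1]
    body, ctx, i = prompt[:-2], None, 0
    while i < len(body):
        if body[i].startswith("c"):
            ctx = body[i]
            i += 1
        else:
            if ctx == q_ctx and body[i] == q_key:
                return body[i + 1]
            i += 2
    return None
```

Because each key appears once per context, a model that only stores one value per key cannot answer the joint-recall query reliably, which is the extra difficulty the abstract points to.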
Article 227
Title@2025-07-01 (2): Revisiting Epistemic Markers in Confidence Estimation: Can Markers Accurately Reflect Large Language Models’ Uncertainty?
Title: Revisiting Epistemic Markers in Confidence Estimation: Can Markers Accurately Reflect Large Language Models’ Uncertainty? | Epistemische Marker in der Einschätzung von Vertrauen wiedersehen: Können Marker die Ungewissheit großer Sprachmodelle genau widerspiegeln? | 重新审视信心估计中的亮点标记:标记能否准确地反映大语言模型的不确定性? 2505.24778v2 |
Authors (4): Jiayu Liu, Qing Zong, Weiqi Wang, Yangqiu Song
As large language models (LLMs) are increasingly used in high-stakes domains, accurately assessing their confidence is crucial. Humans typically express confidence through epistemic markers (e.g., “fairly confident”) instead of numerical values. However, it remains unclear whether LLMs consistently use these markers to reflect their intrinsic confidence, because the uncertainty associated with each marker is difficult to quantify. To address this gap, we first define marker confidence as the observed accuracy when a model employs an epistemic marker. We evaluate its stability across multiple question-answering datasets in both in-distribution and out-of-distribution settings for open-source and proprietary LLMs. Our results show that while markers generalize well within the same distribution, their confidence is inconsistent in out-of-distribution scenarios. These findings raise significant concerns about the reliability of epistemic markers for confidence estimation, underscoring the need for improved alignment between marker-based confidence and actual model uncertainty. Our code is available at https://github.com/HKUST-KnowComp/MarCon.
由于大型语言模型(LLMs)越来越多地用于高占用域域,准确评估其信任度至关重要。人类通常通过缩写标记(例如“相当自信”)而不是数字值表示信心。然而,由于难以量化与各种标记有关的不确定性,LLMs是否始终使用这些标记来反映其内在信心,这一点仍然不清楚。为了缩小这一差距,我们首先将标记信心定义为在模型使用缩略语标记时观察到的准确性。我们评估了在公开源码和专有LMs的分布和分配外设置中多种解答数据集的稳定性。我们的结果显示,虽然标记在同一分布范围内非常普遍,但其信任度在分配外的假设中是不一致的。这些结论对缩略语标记用于信任度估计的可靠性提出了重大关切,强调在基于标记的信心和实际模型不确定性之间需要改进一致性。我们的代码可在https://github.com/HKust-KinComp/Marcon查阅。
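The definition of marker confidence above (observed accuracy over the answers in which a given marker appears) is simple enough to sketch directly. The record format and the `confidence_shift` helper are assumptions for illustration and are not taken from the MarCon repository.

```python
from collections import defaultdict


def marker_confidence(records):
    """Map each epistemic marker to the accuracy of the answers that used it.

    `records` is an iterable of (marker, is_correct) pairs.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for marker, is_correct in records:
        totals[marker] += 1
        hits[marker] += int(bool(is_correct))
    return {marker: hits[marker] / totals[marker] for marker in totals}


def confidence_shift(in_dist, out_dist):
    """Absolute change in marker confidence for markers seen in both settings."""
    return {m: abs(in_dist[m] - out_dist[m]) for m in in_dist if m in out_dist}
```

A large value from `confidence_shift` for a marker would correspond to the out-of-distribution inconsistency the abstract reports.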
Article 228
Title@2025-07-01 (2): Beyond Sociodemographic Prompting: Using Supervision to Align LLMs with Human Response Distributions
Title: Beyond Sociodemographic Prompting: Using Supervision to Align LLMs with Human Response Distributions | Beyond Sociodemographic Prompting: Mit Supervision LLMs mit menschlichen Response-Distributionen ausrichten | 超越社会人口人口加速:利用监督使LMs与人的反应分布相匹配 2507.00439v1 |
Authors (7): Gauri Kambhatla, Sanjana Gautam, Angela Zhang, Alex Liu, Ravi Srinivasan, Junyi Jessy Li, Matthew Lease
The ability to accurately predict how different population groups would answer subjective questions would have great value. In this work, we show that use of relatively simple supervision can greatly improve language model alignment with diverse population groups, as measured over three datasets spanning various topics. Beyond evaluating average performance, we also report how alignment varies across specific groups. The simplicity and generality of our approach promote easy adoption, while our broad findings provide useful guidance for when to use or not use our approach in practice. By conducting evaluation over many LLMs and prompting strategies, along with open-sourcing our work, we provide a useful benchmark to stimulate future research.
准确预测不同人口群体如何回答主观问题的能力将有很大价值。在这项工作中,我们表明,使用相对简单的监督可以大大改善语言模式与不同人口群体的一致性,如在涉及不同主题的三个数据集中衡量的那样。除了评价平均业绩外,我们还报告不同群体之间的一致程度如何不同。我们的方法简单而笼统,便于采用,而我们的广泛调查结果为实际使用或不使用我们的方法提供了有益的指导。通过对许多LLMS进行评价和推动制定战略,以及开放我们的工作来源,我们提供了一个有用的基准来刺激未来的研究。
Article 229
Title@2025-07-01 (2): Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
Title: Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning | Verbessert Mathe-Reasoning die allgemeinen LLM-Fähigkeiten? Verstehen der Übertragbarkeit von LLM-Reasoning | 数学理由是否提高一般LLM能力? 理解LLM理由的可转让性 2507.00432v1 |
Authors (9): Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, Xiang Yue
Math reasoning has become the poster child of progress in large language models (LLMs), with new models rapidly surpassing human-level performance on benchmarks like MATH and AIME. But as math leaderboards improve week by week, it is worth asking: do these gains reflect broader problem-solving ability or just narrow overfitting? To answer this question, we evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks, including math, scientific QA, agent planning, coding, and standard instruction-following. We surprisingly find that most models that succeed in math fail to transfer their gains to other domains. To rigorously study this phenomenon, we conduct controlled experiments on Qwen3-14B models using math-only data but different tuning methods. We find that reinforcement learning (RL)-tuned models generalize well across domains, while supervised fine-tuning (SFT)-tuned models often forget general capabilities. Latent-space representation and token-space distribution shift analyses reveal that SFT induces substantial representation and output drift, while RL preserves general-domain structure. Our results suggest a need to rethink standard post-training recipes, particularly the reliance on SFT-distilled data for advancing reasoning models.
数学推理已成为大型语言模型(LLMS)进步的招牌子,新的模型在数学和AIME等基准上迅速超越了人类水平的绩效。但是,随着数学领头板每周改进一次,值得问道:这些收益是否反映了更广泛的解决问题能力或只是狭义的超标?为了回答这个问题,我们评估了20多个开放的、经推理调整的模式,这些模式涉及广泛的一系列任务,包括数学、科学质量A、代理规划、编码和标准指令执行。我们惊讶地发现,在数学上取得成功的大多数模型未能将其收益转移到其他领域。为了严格研究这一现象,我们用数学数据但不同的调控方法对Qwent13-14B模型进行了控制实验。我们发现,强化(RL)调控模式广泛覆盖了各个领域,同时监管的微调(SFT)调模型往往忘记了一般能力。 冷点空间代表以及象征性空间分配变化分析表明,SFT诱导大量的代表和输出流,而RL则保留了一般-Dmain结构。我们的结果表明,需要重新思考标准的培训后制方法,尤其是对SFTFT的推理学的依赖。
Article 230
Title@2025-07-01 (2): RadZero: Similarity-Based Cross-Attention for Explainable Vision-Language Alignment in Radiology with Zero-Shot Multi-Task Capability
Title: RadZero: Similarity-Based Cross-Attention for Explainable Vision-Language Alignment in Radiology with Zero-Shot Multi-Task Capability | RadZero: Ähnlichkeitsbasierte Cross-Attention für erklärbare Vision-Sprachenausrichtung in der Radiologie mit Zero-Shot-Multi-Task-Fähigkeit | RadZero:在无热多任务能力的放射学中,对可解释的视觉-语言协调进行基于相似的交叉关注 2504.07416v2 |
Authors (4): Jonggwon Park, Soobum Kim, Byungmu Yoon, Kyoyun Choi
Recent advancements in multi-modal models have significantly improved vision-language (VL) alignment in radiology. However, existing approaches struggle to effectively utilize complex radiology reports for learning and offer limited interpretability through attention probability visualizations. To address these challenges, we introduce RadZero, a novel framework for VL alignment in radiology with zero-shot multi-task capability. A key component of our approach is VL-CABS (Vision-Language Cross-Attention Based on Similarity), which aligns text embeddings with local image features for interpretable, fine-grained VL reasoning. RadZero leverages large language models to extract concise semantic sentences from radiology reports and employs multi-positive contrastive training to effectively capture relationships between images and multiple relevant textual descriptions. It uses a pre-trained vision encoder with additional trainable Transformer layers, allowing efficient high-resolution image processing. By computing similarity between text embeddings and local image patch features, VL-CABS enables zero-shot inference with similarity probability for classification, and pixel-level VL similarity maps for grounding and segmentation. Experimental results on public chest radiograph benchmarks show that RadZero outperforms state-of-the-art methods in zero-shot classification, grounding, and segmentation. Furthermore, VL similarity map analysis highlights the potential of VL-CABS for improving explainability in VL alignment. Additionally, qualitative evaluation demonstrates RadZero’s capability for open-vocabulary semantic segmentation, further validating its effectiveness in medical imaging.
多模态模型的最新进展显著改善了放射学中的视觉-语言(VL)对齐。然而,现有方法难以有效利用复杂的放射学报告进行学习,且仅凭注意力概率可视化提供的可解释性有限。为应对这些挑战,我们提出RadZero,这是一个具备零样本多任务能力的放射学VL对齐新框架。其关键组件是VL-CABS(基于相似度的视觉-语言交叉注意力),它将文本嵌入与局部图像特征对齐,以实现可解释的细粒度VL推理。RadZero利用大语言模型从放射学报告中提取简明的语义句子,并采用多正例对比训练来有效捕捉图像与多条相关文本描述之间的关系。它使用预训练视觉编码器并附加可训练的Transformer层,从而能够高效处理高分辨率图像。通过计算文本嵌入与局部图像块特征之间的相似度,VL-CABS支持以相似度概率进行零样本分类推理,并生成像素级VL相似度图用于定位和分割。在公开胸部X光片基准上的实验结果表明,RadZero在零样本分类、定位和分割方面优于现有最先进方法。此外,VL相似度图分析凸显了VL-CABS在提升VL对齐可解释性方面的潜力。定性评估还展示了RadZero进行开放词汇语义分割的能力,进一步验证了其在医学影像中的有效性。
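The core idea described above, scoring each local image patch against a text embedding and turning the result into both a map and a probability, can be sketched as follows. The abstract does not specify how the per-patch similarities are pooled or calibrated, so the softmax-weighted pooling, the sigmoid, and the `scale` parameter here are assumptions, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_map(text_emb, patch_feats):
    """One similarity score per image patch; reshaped, this is the VL map."""
    return [cosine(text_emb, patch) for patch in patch_feats]


def similarity_probability(text_emb, patch_feats, scale=5.0):
    """Pool patch similarities with attention-style weights, then squash.

    Assumed pooling: softmax over the similarities themselves, so the
    best-matching patches dominate the image-level score.
    """
    sims = similarity_map(text_emb, patch_feats)
    peak = max(sims)
    weights = [math.exp(s - peak) for s in sims]
    total = sum(weights)
    pooled = sum((w / total) * s for w, s in zip(weights, sims))
    return 1.0 / (1.0 + math.exp(-scale * pooled))
```

Under these assumptions the same similarity scores serve two purposes: the raw map supports grounding and segmentation, while the pooled probability supports zero-shot classification.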
Article 231
Title@2025-07-01 (2): Flexible Language Modeling in Continuous Space with Transformer-based Autoregressive Flows
Title: Flexible Language Modeling in Continuous Space with Transformer-based Autoregressive Flows | Flexible Sprachmodellierung im kontinuierlichen Raum mit transformerbasierten autoregressiven Strömungen | 具有以变换器为基础的自动递减流动的连续空间灵活语言建模 2507.00425v1 |
Authors (9): Ruixiang Zhang, Shuangfei Zhai, Jiatao Gu, Yizhe Zhang, Huangjie Zheng, Tianrong Chen, Miguel Angel Bautista, Josh Susskind, Navdeep Jaitly
Autoregressive models have driven remarkable progress in language modeling. Their foundational reliance on discrete tokens, unidirectional context, and single-pass decoding, while central to their success, also inspires the exploration of a design space that could offer new axes of modeling flexibility. In this work, we explore an alternative paradigm, shifting language modeling from a discrete token space to a continuous latent space. We propose a novel framework, TarFlowLM, that employs transformer-based autoregressive normalizing flows to model these continuous representations. This approach unlocks substantial flexibility, enabling the construction of models that can capture global bi-directional context through stacked, alternating-direction autoregressive transformations, support block-wise generation with flexible token patch sizes, and facilitate a hierarchical multi-pass generation process. We further propose new mixture-based coupling transformations designed to capture complex dependencies within the latent space shaped by discrete data, and demonstrate theoretical connections to conventional discrete autoregressive models. Extensive experiments on language modeling benchmarks demonstrate strong likelihood performance and highlight the flexible modeling capabilities inherent in our framework.
自动递减模式在语言建模方面取得了显著的进展。 其基础依赖离散的象征物、单向背景和单向解码,虽然是其成功的核心,但也激发了对设计空间的探索,这种空间可以提供建模灵活性的新轴心。 在这项工作中,我们探索了另一种模式,将语言建模从离散的象征空间转向连续潜伏空间。我们提出了一个新的框架 TarFlowLM , 利用基于变压器的自动递增正常流来模拟这些连续的演示。 这种方法释放了相当大的灵活性,使得能够通过堆叠式、交替式的自动递减转换构建能够捕捉到全球双向环境的模型,支持具有灵活代号的整型生成,并促进一个等级分级的多处生成过程。 我们进一步提出新的混合组合转换,旨在捕捉由离散数据形成的潜在空间内复杂的依赖性,并展示与传统的离散自动递增模型的理论连接。 语言建模基准的广泛实验显示了强大的概率性表现,突出了我们框架中的灵活建模能力。
Article 232
Title@2025-07-01 (2): Generative Representational Learning of Foundation Models for Recommendation
Title: Generative Representational Learning of Foundation Models for Recommendation | Generatives repräsentatives Lernen von Stiftungsmodellen zur Empfehlung | 产生基础基础建议模式的代言人学习 2506.11999v3 |
Authors (7): Zheli Zhou, Chenxu Zhu, Jianghao Lin, Bo Chen, Ruiming Tang, Weinan Zhang, Yong Yu
Developing a single foundation model with the capability to excel across diverse tasks has been a long-standing objective in the field of artificial intelligence. As the wave of general-purpose foundation models sweeps across various domains, their influence has significantly extended to the field of recommendation systems. While recent efforts have explored recommendation foundation models for various generative tasks, they often overlook crucial embedding tasks and struggle with the complexities of multi-task learning, including knowledge sharing & conflict resolution, and convergence speed inconsistencies. To address these limitations, we introduce RecFound, a generative representational learning framework for recommendation foundation models. We construct the first comprehensive dataset for recommendation foundation models covering both generative and embedding tasks across diverse scenarios. Based on this dataset, we propose a novel multi-task training scheme featuring a Task-wise Mixture of Low-rank Experts (TMoLE) to handle knowledge sharing & conflict, a Step-wise Convergence-oriented Sample Scheduler (S2Sched) to address inconsistent convergence, and a Model Merge module to balance the performance across tasks. Experiments demonstrate that RecFound achieves state-of-the-art performance across various recommendation tasks, outperforming existing baselines.
在人造情报领域,开发单一基础模型,有能力完成各种任务,这是长期的目标。随着普通用途基础模型的浪潮遍及各个领域,其影响已显著扩展到建议系统领域。虽然最近的努力探索了各种基因化任务的建议基础模型,但往往忽视了关键的嵌入任务以及与多任务学习的复杂性,包括知识共享和解决冲突以及趋同速度不一致等复杂问题作斗争。为了解决这些限制,我们引入RecFound,一个建议基础模型的基因化代表学习框架。我们为建议基础模型构建了第一个综合数据集,涵盖基因化和跨不同情景嵌入的任务。基于这一数据集,我们提出了一个新的多任务培训计划,以低级专家的任务组合为特点,处理知识共享和冲突,一个分步为主的以聚合为主的样本调度器(S2Sched),解决不一致的趋同问题,以及一个模型合并模块,以平衡各项任务的业绩。实验表明,REFound 实现了不同任务之间的状态业绩。
Article 233
Title@2025-07-01 (2): Pipelined Decoder for Efficient Context-Aware Text Generation
Title: Pipelined Decoder for Efficient Context-Aware Text Generation | Pipelined Decoder für effiziente Textgenerierung im Kontext | 高效生成内容软件的管道解码器 2506.23431v2 |
Authors (6): Zixian Huang, Chenxu Niu, Yu Gu, Gengyang Xiao, Xinwei Huang, Gong Cheng
As the basis of generative AI, an autoregressive model generates each new token conditioned on all previously generated tokens, which brings high quality but also restricts the model to generating tokens one by one, forming a bottleneck that limits generation speed. In this paper, we propose a new decoder architecture that efficiently generates text in parallel for context-aware generation tasks. Our proposed pipelined decoder initiates the generation of multiple subsequences simultaneously, and, at each time-step, it generates a new token for each subsequence to realize parallelism. Experiments on multiple text generation tasks, including question answering, text summarization, and keyphrase generation, show that our pipelined decoder significantly improves the generation speed without a significant loss of generation quality or additional memory consumption.
作为基因变异的AI的基础,自动递减模式要求根据以前产生的所有代币生成新的代币,这种代币质量很高,但也限制模型逐个生成代币,形成一个瓶颈,限制生成速度。在本文中,我们提议一个新的解码器结构,为背景变换任务同时有效生成文本。我们提议的编审解码器同时生成多个子序列,并在每一个时间步骤中为每个子序列生成一个新的代币,以实现平行。关于多文本生成任务的实验,包括问答、文本拼凑和关键词生成,表明我们编译的解码器在不大量损失生产质量或增加记忆消耗的情况下,大大改善了生成速度。
Article 234
Title@2025-07-01 (2): ASTRO: Teaching Language Models to Reason by Reflecting and Backtracking In-Context
Title: ASTRO: Teaching Language Models to Reason by Reflecting and Backtracking In-Context | ASTRO: Sprachmodelle zur Vernunft lehren durch Reflektieren und Zurückverfolgen im Kontext | ASTRO:通过反映和回溯文文体,将语言模式教成理论 2507.00417v1 |
Authors (6): Joongwon Kim, Anirudh Goyal, Liang Tan, Hannaneh Hajishirzi, Srinivasan Iyer, Tianlu Wang
We introduce ASTRO, the “Autoregressive Search-Taught Reasoner”, a framework for training language models to reason like search algorithms, explicitly leveraging self-reflection, backtracking, and exploration in their outputs. Recently, training large language models (LLMs) via reinforcement learning (RL) has led to the advent of reasoning models with greatly enhanced reasoning capabilities. Open-source replications of reasoning models, while successful, build upon models that already exhibit strong reasoning capabilities along with search behavior observed even before RL. As a result, it is yet unclear how to boost the reasoning capabilities of other non-reasoner models including Llama 3. ASTRO teaches such models to internalize structured search behavior through a synthetic dataset derived from Monte Carlo Tree Search (MCTS) over mathematical problem-solving trajectories. By converting search traces into natural language chain-of-thoughts that capture both successes and recoveries from failure, ASTRO bootstraps models with a rich prior for exploration during RL. We finetune our models on these search-derived traces and further improve performance via RL with verifiable rewards. We apply ASTRO to the Llama 3 family of models and achieve absolute performance gains of 16.0% on MATH-500, 26.9% on AMC 2023, and 20.0% on AIME 2024, especially improving upon challenging problems that require iterative correction. Our results demonstrate that search-inspired training offers a principled way to instill robust reasoning capabilities into open LLMs.
我们提出ASTRO(“自回归搜索教学推理器”,Autoregressive Search-Taught Reasoner),这是一个训练语言模型像搜索算法一样进行推理的框架,在模型输出中显式利用自我反思、回溯和探索。近来,通过强化学习(RL)训练大语言模型(LLM)催生了推理能力大幅增强的推理模型。推理模型的开源复现虽然取得了成功,但它们所基于的模型在RL之前就已表现出较强的推理能力和搜索行为。因此,如何提升包括Llama 3在内的其他非推理模型的推理能力尚不清楚。ASTRO通过一个合成数据集教会这类模型内化结构化的搜索行为,该数据集由蒙特卡洛树搜索(MCTS)在数学解题轨迹上生成。通过将搜索轨迹转换为既包含成功也包含从失败中恢复的自然语言思维链,ASTRO为模型在RL阶段的探索提供了丰富的先验。我们在这些由搜索得到的轨迹上微调模型,并通过带可验证奖励的RL进一步提升性能。我们将ASTRO应用于Llama 3系列模型,在MATH-500上取得16.0%、在AMC 2023上取得26.9%、在AIME 2024上取得20.0%的绝对性能提升,在需要迭代修正的难题上改进尤为明显。结果表明,受搜索启发的训练为向开放LLM注入稳健的推理能力提供了一种有原则的途径。
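The step of turning a search trace into a chain of thought that records both failures and recoveries can be illustrated with a toy linearizer. This is a sketch under stated assumptions: the tree is a plain nested dict, the traversal is depth-first rather than MCTS, and the backtracking phrase is invented for illustration; none of it is taken from the ASTRO implementation.

```python
def linearize_search(node, lines):
    """Depth-first walk that writes visited steps and explicit backtracking.

    `node` is a dict with "text", an optional "solved" flag, and an optional
    "children" list. Returns True once a solved node has been reached.
    """
    lines.append(node["text"])
    if node.get("solved"):
        return True
    for child in node.get("children", []):
        if linearize_search(child, lines):
            return True
        # The failed branch stays in the trace, followed by a recovery cue.
        lines.append(f"Wait, that path fails. Going back to: {node['text']}")
    return False
```

For a root "Start" with a dead-end child "Try A" and a solved child "Try B", the resulting lines read as a single narrative that tries A, backtracks to the start, and then succeeds with B.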
Article 235
Title@2025-07-01 (2): Parameter-Efficient Fine-Tuning via Circular Convolution
Title: Parameter-Efficient Fine-Tuning via Circular Convolution | Parameter-Effizient Feintuning über Kreiskonvolution | 通过循环革命提高参数效率 2407.19342v4 |
Authors (7): Aochuan Chen, Jiashun Cheng, Zijing Liu, Ziqi Gao, Fugee Tsung, Yu Li, Jia Li
Low-Rank Adaptation (LoRA) has gained popularity for fine-tuning large foundation models, leveraging low-rank matrices $\mathbf{A}$ and $\mathbf{B}$ to represent weight changes (i.e., $\Delta \mathbf{W} = \mathbf{B} \mathbf{A}$). This method reduces trainable parameters and mitigates heavy memory consumption associated with full delta matrices by sequentially multiplying $\mathbf{A}$ and $\mathbf{B}$ with the activation. Despite its success, the intrinsic low-rank characteristic may limit its performance. Although several variants have been proposed to address this issue, they often overlook the crucial computational and memory efficiency brought by LoRA. In this paper, we propose Circular Convolution Adaptation (C$^3$A), which not only achieves high-rank adaptation with enhanced performance but also excels in both computational power and memory utilization. Extensive experiments demonstrate that C$^3$A consistently outperforms LoRA and its variants across various fine-tuning tasks.
低- 兰克适应 (LORA) 在微调大型基础模型(LORA) 方面已获得欢迎, 利用低级别矩阵 $\ mathbf{A} $ 和 $\ mathbf{B} 美元来代表重量变化( 即 $\ Delta\ mathbf{W} =\ mathbf{B} B} {B} = mathbf{B} = mathbf{A} ) 。 这种方法减少了可训练参数,并减轻了与全三角矩阵相关的大量记忆消耗, 其方法是在激活时按顺序再乘 $\ mathbf{A} $ 和 $\ mathb{B} $ 。 尽管取得了成功, 内在的低等级特征可能会限制其绩效。 尽管为解决这一问题提出了几种变式, 它们往往忽略了LRA 带来的关键的计算和记忆效率 。 在本文中, 我们提出“ 演算适应” (C$3$ A ) , 它不仅在提高性能和记忆利用两方面都取得了很高的适应, , 而且在计算能力和记忆应用上都很优 。
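The contrast drawn above can be made concrete with a small sketch: a LoRA update applies $\mathbf{B}(\mathbf{A}\mathbf{x})$, whose rank is bounded by the inner dimension, while a circular convolution with a length-$d$ kernel is equivalent to multiplying by a $d \times d$ circulant matrix that is generally not low-rank yet has only $d$ parameters. The code uses plain loops for clarity; an FFT-based version would be the efficient route. Function names are hypothetical, and the exact C$^3$A parameterization is not given in the abstract.

```python
def matvec(matrix, x):
    """Dense matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def lora_update(B, A, x):
    """LoRA-style update: multiply by A, then by B, never forming B @ A."""
    return matvec(B, matvec(A, x))


def circular_convolve(w, x):
    """Circular convolution of kernel w with input x (same length d)."""
    d = len(w)
    return [sum(w[(i - j) % d] * x[j] for j in range(d)) for i in range(d)]


def circulant(w):
    """The d x d circulant matrix whose product with x equals w (*) x."""
    d = len(w)
    return [[w[(i - j) % d] for j in range(d)] for i in range(d)]
```

The equivalence `circular_convolve(w, x) == matvec(circulant(w), x)` is what lets a convolution-based adapter express a structured, potentially high-rank weight change with a parameter count comparable to a low-rank one.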
Article 236
Title@2025-07-01 (2): Two-Stage Regularization-Based Structured Pruning for LLMs
Title: Two-Stage Regularization-Based Structured Pruning for LLMs | Zweistufiges Regularisierungs-basierendes strukturiertes Pruning für LLMs | LLMM 双级正规化和结构化 2505.18232v2 |
Authors (9): Mingkuan Feng, Jinyang Wu, Siyuan Liu, Shuai Zhang, Ruihan Jin, Feihu Che, Pengpeng Shao, Zhengqi Wen, Jianhua Tao
The deployment of large language models (LLMs) is largely hindered by their large number of parameters. Structural pruning has emerged as a promising solution. Prior structured pruning methods directly remove unimportant parameters based on certain metrics, which often causes knowledge loss and necessitates extensive retraining. To overcome this, we introduce a novel pruning method TRSP: Two-Stage Regularization-Based Structured Pruning for LLMs. Specifically, we multiply the output of each transformer layer by an initial learnable weight and iteratively learn these weights by adding their $\ell_1$-norm as a regularization term to the loss function, serving as the first-stage regularization. Subsequently, we apply additional regularization to the difference between the output and input of layers with smaller weights, encouraging the shift of knowledge to the preserved layers. This serves as the second-stage regularization. TRSP retains more knowledge and better preserves model performance than direct parameter elimination. Through extensive experimentation we show that TRSP outperforms strong layer-wise structured pruning methods without requiring retraining. As a layer-wise pruning method, it delivers notable end-to-end acceleration, making it a promising solution for efficient LLM deployment.
大型语言模型(LLMS)的部署在很大程度上受到其众多参数的阻碍。结构修剪已经成为一个很有希望的解决办法。以前结构化的修剪方法直接删除基于某些计量的不重要参数,这些参数往往造成知识流失,需要进行广泛的再培训。为了克服这一点,我们引入了一个新的修剪方法TRSP:基于双层的正规化和结构化的LLMS的预留。具体地说,我们通过初始可学习重量乘以每个变压器层的输出量,并反复学习这些重量,在损失函数中添加$@ell_1$-norm作为正规化的术语,作为第一阶段的正规化。随后,我们用额外的正规化方法来改变具有较小重量的层的产出和投入之间的差别,鼓励知识转移到保留层。这可以作为第二阶段的正规化。TRSP保留更多的知识和更好的保存模型性性能,而不是直接消除参数。通过广泛的实验,我们发现TRSP在不需要再培训的情况下,超越了强大的层次结构化的理算方法。作为一种高层次的加速度方法,它提供了一种有希望的顶点的加速。
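The two regularization stages described above can be written down as plain loss terms. This is a minimal sketch: the coefficients `lam` and `mu`, the rule for picking "layers with smaller weights", and the squared-difference penalty are assumptions chosen for illustration, since the abstract does not state them.

```python
def first_stage_loss(task_loss, layer_weights, lam):
    """Task loss plus an l1 penalty on the learnable per-layer output weights."""
    return task_loss + lam * sum(abs(w) for w in layer_weights)


def select_prune_candidates(layer_weights, num_prune):
    """Indices of the layers with the smallest learned weight magnitude."""
    order = sorted(range(len(layer_weights)), key=lambda i: abs(layer_weights[i]))
    return sorted(order[:num_prune])


def second_stage_penalty(layer_inputs, layer_outputs, candidates, mu):
    """Penalize the output-input gap of candidate layers.

    Driving a layer toward the identity map makes removing it nearly
    lossless; a squared difference is assumed here.
    """
    total = 0.0
    for i in candidates:
        total += sum((o - x) ** 2 for o, x in zip(layer_outputs[i], layer_inputs[i]))
    return mu * total
```

Read this way, the first stage identifies which layers matter least, and the second stage pushes those layers toward doing nothing before they are removed, so that their knowledge shifts to the layers that remain.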
Article 237
Title@2025-07-01 (2): Graft: Integrating the Domain Knowledge via Efficient Parameter Synergy for MLLMs
Title: Graft: Integrating the Domain Knowledge via Efficient Parameter Synergy for MLLMs | Graft: Integration des Domainwissens über effiziente Parametersynergie für MLLMs | Graft: 通过MLLM 高效参数协同将域知识整合 2506.23940v2 |
Authors (9): Yang Dai, Jianxiang An, Tianwei Lin, Hongyang He, Hongzhe Huang, Wenqiao Zhang, Zheqi Lv, Siliang Tang, Yueting Zhuang
Multimodal Large Language Models (MLLMs) have achieved success across various domains. However, their applicability tends to degrade when confronted with different types of data inputs, especially for MLLMs that have been fine-tuned for specific tasks. Despite its importance, the study of knowledge sharing among domain-specific MLLMs, such as those trained for mathematics or code, remains largely underexplored. To address the fragmentation of knowledge across domain-specialized MLLMs, we propose a unified parameter integration framework that enables modular composition of expert capabilities. Our method is grounded in a novel Compatibility-Aware Parameter Splicing (CAPS) strategy, which leverages both local functional attribution and global information-theoretic signals to guide selective parameter fusion. By extending this mechanism to the low-rank adaptation layer granularity, we ensure efficient integration with minimal inference overhead. Furthermore, we introduce a domain compatibility scoring mechanism that quantifies inter-expert alignment at the activation level and correlates with downstream task utility. This principled fusion protocol allows the final model to synergize heterogeneous expertise while preserving structural modularity. Extensive evaluations across diverse multimodal benchmarks validate the effectiveness of our framework, offering a scalable path toward compositional, domain-adaptive MLLMs.
多个多式大语言模型(MLLM)在不同领域都取得了成功,然而,在面临不同类型数据投入时,其适用性往往会降低,特别是对于为具体任务作了微调的MLLM(MLLM)而言,其适用性往往会降低。尽管其重要性,但研究特定领域MLLM(MLM)之间的知识共享,例如那些受过数学或代码余量训练的数学或代码余量模型(CLLMS),基本上未得到充分探讨。为了解决不同领域专门MLLM(MLLM)之间知识的分散问题,我们提议了一个统一的参数集成框架,使专家之间在专家能力上形成模块化组合。我们的方法基于一种新的兼容性软件,利用当地功能归属和全球信息理论信号来指导选择性的参数融合。我们通过将这一机制扩大到低层次的适应层适应层颗粒性,确保与最小的推导力管理器的整合。此外,我们引入了一种域兼容性评分机制,使专家之间在激活水平上保持组合,并与下游任务效用相关。这个原则的聚合协议使最后模型得以在保持结构模块组合上相互配合。
Article 238
Title@2025-07-01 (2): BlockDialect: Block-wise Fine-grained Mixed Format Quantization for Energy-Efficient LLM Inference
Title: BlockDialect: Block-wise Fine-grained Mixed Format Quantization for Energy-Efficient LLM Inference | BlockDialekt: Blockweise feinkörnige Mischformat-Quantisierung für energieeffiziente LLM-Inferenz | BlockDiaect: 节能LLM 推论的粗件精细混合格式量化 2501.01144v4 |
Authors (2): Wonsuk Jang, Thierry Tambe
The rapidly increasing size of large language models (LLMs) presents significant challenges in memory usage and computational costs. Quantizing both weights and activations can address these issues, with hardware-supported fine-grained scaling emerging as a promising solution to mitigate outliers. However, existing methods struggle to capture nuanced block data distributions. We propose BlockDialect, a block-wise fine-grained mixed format technique that assigns a per-block optimal number format from a formatbook for better data representation. Additionally, we introduce DialectFP4, a formatbook of FP4 variants (akin to dialects) that adapt to diverse data distributions. To leverage this efficiently, we propose a two-stage approach for online DialectFP4 activation quantization. Importantly, DialectFP4 ensures energy efficiency by selecting representable values as scaled integers compatible with low-precision integer arithmetic. BlockDialect achieves 10.78% (7.48%) accuracy gain on the LLaMA3-8B (LLaMA2-7B) model compared to MXFP4 format with lower bit usage per data, while being only 5.45% (2.69%) below full precision even when quantizing full-path matrix multiplication. Focusing on how to represent over how to scale, our work presents a promising path for energy-efficient LLM inference.
大型语言模型(LLMS) 快速增长的大小在记忆使用和计算成本方面提出了重大挑战。 量化权重和激活都能够解决这些问题, 硬件支持的微微缩缩缩放正在形成一个有希望的缓解离子的解决方案。 但是, 现有的方法很难捕捉细块数据分布。 我们提议了BlockDiacle, 这是一种块式细微的混合格式技术, 从一个格式手册中为更好的数据代表性指定了每个区块的最佳数字格式。 此外, 我们引入了 Dialec FP4 4 格式手册, 一种适应不同数据分布的FP4变体( 类似方言的方言) 。 为了高效地利用这个方法, 我们建议了双阶段的方法, 用于在线的 DialectFP4 激活四倍的四分级化。 重要的是, DialectF4 能够确保能源效率, 选择可代表值为与低精度缩缩缩缩缩缩图相匹配的整整数值。 将LLLAMA3- 8B( LLMA2-7B) 的精度模型(LLMA2-7B) 与MFP4格式相比, 模型的精度模型的精度增长为MXFP-4格式, 将比小比小的精度格式, 显示为5.45- Plexmexmexmexmalmax, 的全缩图图图仅为5-x。
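The idea of assigning each block its own number format from a formatbook can be sketched as a small search: quantize the block with every candidate format and keep the one with the lowest error. The two formats below are made-up level sets for illustration and are not the DialectFP4 variants; the mean-squared-error criterion and the max-based scaling are likewise assumptions.

```python
import math

# Hypothetical formatbook: each entry lists representable magnitudes in [0, 1].
FORMATBOOK = {
    "uniform": [0.0, 0.25, 0.5, 0.75, 1.0],
    "small_dense": [0.0, 0.125, 0.25, 0.5, 1.0],
}


def quantize_block(block, levels):
    """Scale by the block maximum, then snap each magnitude to the nearest level."""
    scale = max(abs(v) for v in block) or 1.0
    quantized = []
    for v in block:
        magnitude = min(levels, key=lambda q: abs(abs(v) / scale - q))
        quantized.append(math.copysign(magnitude * scale, v))
    return quantized


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def choose_format(block, formatbook=FORMATBOOK):
    """Return (format_name, quantized_block) with the lowest error for this block."""
    best = min(formatbook, key=lambda name: mse(block, quantize_block(block, formatbook[name])))
    return best, quantize_block(block, formatbook[best])
```

A block dominated by small values plus one outlier tends to prefer a format with denser levels near zero, while an evenly spread block prefers uniform levels, which is the per-block adaptivity the abstract describes.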
Article 239
Title@2025-07-01 (2): Causal Prompting for Implicit Sentiment Analysis with Large Language Models
Title: Causal Prompting for Implicit Sentiment Analysis with Large Language Models | Causal Prompting für Implizite Sentiment-Analyse mit großen Sprachmodellen | 利用大语言模型进行隐含语言分析的诱导原因 2507.00389v1 |
Authors (10): Jing Ren, Wenhao Zhou, Bowen Li, Mujie Liu, Nguyen Linh Dan Le, Jiade Cen, Liping Chen, Ziqi Xu, Xiwei Xu, Xiaodong Li
Implicit Sentiment Analysis (ISA) aims to infer sentiment that is implied rather than explicitly stated, requiring models to perform deeper reasoning over subtle contextual cues. While recent prompting-based methods using Large Language Models (LLMs) have shown promise in ISA, they often rely on majority voting over chain-of-thought (CoT) reasoning paths without evaluating their causal validity, making them susceptible to internal biases and spurious correlations. To address this challenge, we propose CAPITAL, a causal prompting framework that incorporates front-door adjustment into CoT reasoning. CAPITAL decomposes the overall causal effect into two components: the influence of the input prompt on the reasoning chains, and the impact of those chains on the final output. These components are estimated using encoder-based clustering and the NWGM approximation, with a contrastive learning objective used to better align the encoder’s representation with the LLM’s reasoning space. Experiments on benchmark ISA datasets with three LLMs demonstrate that CAPITAL consistently outperforms strong prompting baselines in both accuracy and robustness, particularly under adversarial conditions. This work offers a principled approach to integrating causal inference into LLM prompting and highlights its benefits for bias-aware sentiment reasoning. The source code and case study are available at: https://github.com/whZ62/CAPITAL.
隐含感知分析(ISA)旨在推断隐含而非明确表达的情绪,要求模型对隐含背景线索进行更深的推理;虽然最近采用大语言模型(LLMs)的促动方法在ISA中显示出希望,但它们往往依赖对思维链(CoT)推理路径的多数表决,而没有评价其因果关系,使其容易受到内部偏见和虚假关联的影响;为了应对这一挑战,我们提议CAPITAL(CAPITAL),这是一个将前门调整纳入COT推理的因果关系导导出框架。 CAPITAL将总体因果关系影响分为两个部分:即投入对推理链的影响,以及这些链对最后产出的影响。这些组成部分是使用基于编码的聚合和NWGM的近似性来估计的,而没有评估其因果关系,而没有评估其因果关系,因此使用对比性学习目标是为了更好地使编码器代表LMM的推理空间。关于ISA数据集基准的实验表明,CITAL/LMM(LM)一贯在准确性和稳健性方面,特别是在敌对性条件下,这种推理学为A-CLM的推理学基础。
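The front-door adjustment that the abstract above refers to can be sketched on small discrete tables. This is the textbook formula P(y | do(x)) = Σ_m P(m | x) Σ_x' P(y | x', m) P(x'), with the reasoning chain playing the role of the mediator m. The probability tables are invented for illustration, and the encoder-based clustering and NWGM approximation used by CAPITAL are not reproduced.

```python
import numpy as np


def front_door(p_x, p_m_given_x, p_y_given_xm, x):
    """Front-door estimate of P(y | do(X = x)) from discrete probability tables.

    p_x:           shape (X,)       marginal P(x')
    p_m_given_x:   shape (X, M)     P(m | x)
    p_y_given_xm:  shape (X, M, Y)  P(y | x', m)
    """
    # inner[m, y] = sum over x' of P(x') * P(y | x', m)
    inner = np.einsum("i,imy->my", p_x, p_y_given_xm)
    # result[y] = sum over m of P(m | x) * inner[m, y]
    return p_m_given_x[x] @ inner


# Toy tables (assumed values): two prompts, two reasoning-chain clusters, two labels.
p_x = np.array([0.5, 0.5])
p_m_given_x = np.array([[0.9, 0.1],
                        [0.2, 0.8]])
# Here the answer depends only on the reasoning cluster, not on the prompt.
p_y_given_m = np.array([[0.7, 0.3],
                        [0.1, 0.9]])
p_y_given_xm = np.stack([p_y_given_m, p_y_given_m])

effect = front_door(p_x, p_m_given_x, p_y_given_xm, x=0)
```

The first factor corresponds to the influence of the prompt on the reasoning chains, and the inner sum to the impact of those chains on the final output, matching the two components named in the abstract.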
Article 240
Title@2025-07-01 (2): DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning
Title: DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning | DALR: Dual Level Alignment Learning für multimodales Sentence Representative Learning | DALR: 双级统一学习促进多式判决代表制学习 2506.21096v2 |
Authors (6): Kang He, Yuzhe Ding, Haining Wang, Fei Li, Chong Teng, Donghong Ji
Previous multimodal sentence representation learning methods have achieved impressive performance. However, most approaches focus on aligning images and text at a coarse level, facing two critical challenges: cross-modal misalignment bias and intra-modal semantic divergence, which significantly degrade sentence representation quality. To address these challenges, we propose DALR (Dual-level Alignment Learning for Multimodal Sentence Representation). For cross-modal alignment, we propose a consistency learning module that softens negative samples and utilizes semantic similarity from an auxiliary task to achieve fine-grained cross-modal alignment. Additionally, we contend that sentence relationships go beyond binary positive-negative labels, exhibiting a more intricate ranking structure. To better capture these relationships and enhance representation quality, we integrate ranking distillation with global intra-modal alignment learning. Comprehensive experiments on semantic textual similarity (STS) and transfer (TR) tasks validate the effectiveness of our approach, consistently demonstrating its superiority over state-of-the-art baselines.
以往的多式联运判决代表制学习方法取得了令人印象深刻的成绩,然而,大多数方法侧重于将图像和文本在粗略水平上加以统一,面临两个关键挑战:跨现代的不匹配偏差和内部的语义差异,这大大削弱了服刑代表制的质量。为了应对这些挑战,我们建议DALR(多模式判决代表制的双级统一学习)。关于跨模式的调整,我们提议了一个一致性学习模块,以软化负面样本,并利用与辅助任务的语义相似性实现细微的跨模式协调。此外,我们主张,判决关系超越二进制正反标签,展示一个更为复杂的排名结构。为了更好地捕捉这些关系,提高代表性质量,我们将排行与全球模式内部协调学习相结合。关于语义相似性(STS)和转让(TR)任务的全面实验证实了我们的方法的有效性,并始终表明它优于最新基线。
Article 241
Title@2025-07-01 (2): Flexora: Flexible Low Rank Adaptation for Large Language Models
Title: Flexora: Flexible Low Rank Adaptation for Large Language Models | Flexora: Flexible Low-Rank-Anpassung für große Sprachmodelle | 灵活度:针对大语言模式的灵活低级别适应 2408.10774v4 |
Authors (4): Chenxing Wei, Yao Shu, Ying Tiffany He, Fei Richard Yu
Large Language Models (LLMs) are driving advancements in artificial intelligence by increasing the scale of model parameters, which has significantly enhanced generalization ability and unlocked new capabilities in practice. However, their performance in specific downstream tasks is usually hindered by their knowledge boundaries on these tasks. Thus, fine-tuning techniques, especially the widely used Low-Rank Adaptation (LoRA) method, have been introduced to expand the boundaries on these tasks, whereas LoRA would underperform on certain tasks owing to its potential overfitting on these tasks. To overcome this overfitting and improve the performance of LoRA, we propose the flexible low rank adaptation (Flexora) method to automatically and flexibly select the most important layers needing to be fine-tuned to achieve the best performance on different downstream tasks. Specifically, Flexora firstly frames this layer selection problem as a well-defined hyperparameter optimization (HPO) problem, then addresses it using the unrolled differentiation (UD) method, and finally selects the most useful layers based on the optimized hyperparameters. Our extensive experiments on many pretrained models and natural language tasks show that Flexora is able to consistently improve over the existing baselines, indicating the effectiveness of our Flexora in practice. We additionally provide insightful theoretical results and many ablation studies to deliver a comprehensive understanding of our Flexora.
大型语言模型(LLMS)正在通过扩大模型参数的规模推动人工智能的进步,模型参数的规模已经大大提高了通用能力,并在实践中释放了新的能力;然而,它们在具体的下游任务中的表现通常受到这些任务的知识界限的阻碍,因此引入了微调技术,特别是广泛使用的低兰克适应(LORA)方法,以扩大这些任务的界限,而LORA则由于可能过度适应这些任务而在某些任务上表现不佳;为了克服这种过度适应和改进LORA的绩效,我们建议采用灵活低级适应(Flexora)方法,以自动和灵活地选择需要调整的最重要层次,以便在不同的下游任务中取得最佳业绩。具体地说,Flexora首先将这一层选择问题作为定义明确的超参数优化(HPO)问题,然后使用无节制的区别(UD)方法加以解决,最后根据最优化的超光度度度计选择最有用的层。我们在许多经过预先训练的模型和自然语言任务上进行的广泛实验,表明Flexorora能够持续地改进我们Flimalalalalalalalalalalalalal的很多的基线。
Article 242
Title@2025-07-01 (2): SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning
Title: SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning | SPIRAL: Selbst-Spiel auf Null-Sum-Spiele Anreize zur Vernunft durch Multi-Agent Multi-Turn Verstärkungs-Lernen | SPIRAL: 在零桑运动会上自玩 2506.24119v2 |
Authors (12): Bo Liu, Leon Guertler, Simon Yu, Zichen Liu, Penghui Qi, Daniel Balcells, Mickel Liu, Cheston Tan, Weiyan Shi, Min Lin, Wee Sun Lee, Natasha Jaques
Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision. Through self-play, SPIRAL generates an infinite curriculum of progressively challenging problems as models must constantly adapt to stronger opponents. To enable this self-play training at scale, we implement a fully online, multi-turn, multi-agent reinforcement learning system for LLMs and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. Using SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly. Training Qwen3-4B-Base on Kuhn Poker alone achieves 8.6% improvement on math and 8.4% on general reasoning, outperforming SFT on 25,000 expert game trajectories. Analysis reveals that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis. Multi-game training (TicTacToe, Kuhn Poker, Simple Negotiation) further enhances performance as each game develops distinct reasoning strengths. Applying SPIRAL to a strong reasoning model (DeepSeek-R1-Distill-Qwen-7B) can still lead to 2.0% average improvement. These results demonstrate that zero-sum games naturally develop transferable reasoning capabilities, highlighting a promising direction for autonomous reasoning development.
强化学习的最新进展表明,语言模式可以通过对可核实的奖励任务进行培训来发展精密的推理,但是,这些方法取决于由人完成的问答配对和具体领域的奖励工程。我们引入了SPIRAL,这是一个自我游戏框架,模型通过玩多转、零和游戏来学习如何不断改进自身版本,从而消除了对人监督的需要。通过自我游戏,SPIRAL生成了一个无限的、具有挑战性的问题的课程,因为模型必须不断适应更强的对手。为了能够进行规模的自我游戏培训,我们为LLMS实施一个完全在线、多方向、多用途强化的多用途强化学习系统,并提出基于角色的利差推理估算(RAE)以稳定多剂培训。我们采用SPIRAL、零和零和游戏的自我游戏来进行自我学习,从而产生广泛的推理能力。仅Kuhn Poker 的Quen3-4B 培训在数学和8.4%的一般推理学上就实现了8.6%的改进,在25 000个专家游戏轨迹上仍然优于SFT模式。分析显示,通过三种认知模式进行有希望的转移:系统解、预期、预期值计算、预期值计算、预期值计算、预测、零和逐式推理算、逐式推理算、逐式推理推算、提高。
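The abstract above names role-conditioned advantage estimation (RAE) without giving its form. The sketch below is one plausible reading, stated as an assumption: each role keeps its own running-average baseline of episode returns, and the advantage is the return minus that role's baseline. The class name, the decay parameter, and the update rule are hypothetical and may differ from the paper's definition.

```python
from collections import defaultdict


class RoleBaseline:
    """Per-role exponential moving average of returns (assumed form of a role-conditioned baseline)."""

    def __init__(self, decay=0.95):
        self.decay = decay
        # One baseline per role; every role starts at 0.0.
        self.baseline = defaultdict(float)

    def advantage(self, role, episode_return):
        """Update the baseline for this role, then return the baseline-subtracted return."""
        updated = self.decay * self.baseline[role] + (1.0 - self.decay) * episode_return
        self.baseline[role] = updated
        return episode_return - updated
```

Keeping separate baselines matters in zero-sum games because the two roles can have different expected returns (for example, a first-mover advantage); a single shared baseline would fold that asymmetry into the advantage signal.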
Article 243
Title@2025-07-01 (2): Gregorian melody, modality, and memory: Segmenting chant with Bayesian nonparametrics
Title: Gregorian melody, modality, and memory: Segmenting chant with Bayesian nonparametrics | Gregorianische Melodie, Modalität und Erinnerung: Segmentierungsgesang mit Bayesischen Nonparametrics | Gregorian 旋律、 模式和记忆: 与巴耶斯非参数分隔的口号 2507.00380v1 |
Authors (2): Vojtěch Lanz, Jan Hajič jr
The idea that Gregorian melodies are constructed from some vocabulary of segments has long been a part of chant scholarship. This so-called “centonisation” theory has received much musicological criticism, but frequent re-use of certain melodic segments has been observed in chant melodies. The intractable number of possible segmentations leaves open the option that some undiscovered segmentation exists that will yet prove the value of centonisation, and recent empirical results have shown that segmentations can outperform music-theoretical features in mode classification. Inspired by the fact that Gregorian chant was memorised, we search for an optimal unsupervised segmentation of chant melody using nested hierarchical Pitman-Yor language models. The segmentation we find achieves state-of-the-art performance in mode classification. Modeling a monk memorising the melodies from one liturgical manuscript, we then find empirical evidence for the link between mode classification and memory efficiency, and observe more formulaic areas at the beginnings and ends of melodies corresponding to the practical role of modality in performance. However, the resulting segmentations themselves indicate that even such a memory-optimal segmentation is not what is understood as centonisation.
Gregorian 旋律是从某部分的词汇中建构的观念长期以来一直是圣歌奖学金的一部分。这种所谓的“中心化”理论已经受到许多音乐学批评,但是在歌唱旋律中观察到了经常重复使用某些旋律段的情况,而可能的分解的棘手数量使得可以选择存在一些尚未发现的分解,但这种分解将证明中间化的价值,而最近的实证结果显示,分解可以超越模式分类中的音乐-理论特征。受这个所谓的“中心化”理论被回忆起来这一事实的启发,我们利用嵌套式等级的Pitman-Yor语言模型来寻找一种最佳的、不受监督的分解。我们所发现的分解在模式分类中达到最新水平的性能。建模一个僧侣,用一个不显眼的手稿来唤醒旋律,然后我们找到关于模式分类和记忆效率之间联系的经验证据,并在模式分类的开头和结尾观察到更多的公式化领域,并观察到与表现方式的实际作用相对应的旋律。但是,导致的分化的记忆分解本身就意味着什么。
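As background for the Pitman-Yor language models mentioned above, the sketch below shows only the standard single-level Pitman-Yor "Chinese restaurant" seating rule: an existing table with count c is chosen with probability (c - d) / (θ + n), and a new table with probability (θ + d·K) / (θ + n). The nested hierarchical model and the segmentation search used in the paper are not reproduced, and the function name is an illustrative choice.

```python
def pitman_yor_seating_probs(table_counts, discount, strength):
    """Seating probabilities for the next customer in a Pitman-Yor Chinese restaurant process.

    table_counts: number of customers at each existing table
    discount:     d, with 0 <= d < 1
    strength:     theta, with theta > -d
    Returns (probabilities for existing tables, probability of a new table).
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    if strength <= -discount:
        raise ValueError("strength must be greater than -discount")
    n = sum(table_counts)
    k = len(table_counts)
    denom = strength + n
    existing = [(count - discount) / denom for count in table_counts]
    new_table = (strength + discount * k) / denom
    return existing, new_table
```

The discount d shifts probability mass from frequently used segments toward new ones, which produces the power-law vocabulary growth that makes the process a common prior for word-like units.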
Article 244
Title@2025-07-01 (2): Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples
Title: Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples | Lehren von Audio-Bewusst Große Sprachmodelle Was nicht hört: Halluzinationen durch synthesierte Negativproben abmildern | 教授听觉大语言模型:通过合成负样本减少幻觉 2505.14518v2 |
Authors (2): Chun-Yi Kuan, Hung-yi Lee
Recent advancements in audio-aware large language models (ALLMs) enable them to process and understand audio inputs. However, these models often hallucinate non-existent sound events, reducing their reliability in real-world applications. To address this, we propose LISTEN (Learning to Identify Sounds Through Extended Negative Samples), a contrastive-like training method that enhances ALLMs’ ability to distinguish between present and absent sounds using synthesized data from the backbone LLM. Unlike prior approaches, our method requires no modification to LLM parameters and efficiently integrates audio representations via a lightweight adapter. Experiments show that LISTEN effectively mitigates hallucinations while maintaining impressive performance on existing audio question and reasoning benchmarks. At the same time, it is more efficient in both data and computation.
最近音频大语言模型的进步使得它们能够处理和理解音频投入,然而,这些模型往往产生幻觉,使不存在的音频事件降低真实世界应用程序的可靠性。为了解决这个问题,我们提议采用Listen(学习通过扩展负数样本识别声音),这是一种对比式的培训方法,它能提高Allem使用主干LLM综合数据区分现在和不存在的声音的能力。与以往的做法不同,我们的方法不需要修改LLM参数,而是通过轻量级适配器有效地整合音频表达。实验显示ListEN有效地减轻幻觉,同时保持现有音频问题和推理基准的令人印象。与此同时,它在数据和计算方面效率更高。
Article 245
Title@2025-07-01 (2): Seeking and Updating with Live Visual Knowledge
Title: Seeking and Updating with Live Visual Knowledge | Suchen und Aktualisieren mit Live Visual Knowledge | 利用实况视觉知识探索和更新 2504.05288v2 |
Authors (9): Mingyang Fu, Yuyang Peng, Dongping Chen, Zetong Zhou, Benlin Liu, Yao Wan, Zhou Zhao, Philip S. Yu, Ranjay Krishna
The visual world around us constantly evolves, from real-time news and social media trends to global infrastructure changes visible through satellite imagery and augmented reality enhancements. However, Multimodal Large Language Models (MLLMs), which automate many tasks, struggle to stay current, limited by the cutoff dates in their fixed training datasets. To quantify this stagnation, we introduce LiveVQA, a first-of-its-kind dataset featuring 107,143 samples and 12 categories of data, specifically designed to support research in both seeking and updating with live visual knowledge. Drawing from recent news articles, video platforms, and academic publications from April 2024 to May 2025, LiveVQA enables evaluation of how models handle the latest visual information beyond their knowledge boundaries and how current methods help to update them. Our comprehensive benchmarking of 17 state-of-the-art MLLMs reveals significant performance gaps on content beyond the knowledge cutoff, while tool-use or agentic visual seeking frameworks yield an average improvement of 327%. Furthermore, we explore parameter-efficient fine-tuning (PEFT) methods to update MLLMs with new visual knowledge. We examine in depth the critical balance between adapter capacity and model capability when updating MLLMs with new visual knowledge. The experimental dataset and source code are publicly available at: https://livevqa.github.io.
我们周围的视觉世界不断演变,从实时新闻和社交媒体趋势,到通过卫星图像和增强现实的增强而可见的全球基础设施变化。然而,多式大语言模型(MLLLM)使许多任务自动化,努力保持时态,但受固定培训数据集中截断日期的限制。为了量化这种停滞,我们引入了LiveVQA,这是首个由107,143个样本和12个类别数据组成的原始数据集,专门用来支持实时视觉知识的搜索和更新方面的研究。从2024年4月、2024年5月和2025年4月至5月的最新新闻文章、视频平台和学术出版物中提取,LiveVQA使得能够评估模型如何处理超出其知识界限的最新视觉信息,以及当前方法如何帮助更新它们。我们对17个最新水平的MLLLMMS的全面基准显示,除了知识关闭之外,在内容上存在巨大的业绩差距,而且工具使用或代理视觉图像框架也大大改进了327 %。此外,我们探索参数高效的微调(PEFT)方法,用新的视觉知识更新MLLMMMMs。我们深潜深入探索如何在调整现有实验能力与可变的模型数据源之间的关键平衡。
Article 246
Title@2025-07-01 (2): SPADE: Structured Prompting Augmentation for Dialogue Enhancement in Machine-Generated Text Detection
Title: SPADE: Structured Prompting Augmentation for Dialogue Enhancement in Machine-Generated Text Detection | SPADE: Strukturierte Prompting Augmentation für Dialog-Verbesserung bei maschinengenerierter Texterkennung | SPADE: 在机器生成的文本探测中促进对话的结构性快速增强 2503.15044v2 |
Authors (4): Haoyi Li, Angela Yifei Yuan, Soyeon Caren Han, Christopher Leckie
The increasing capability of large language models (LLMs) to generate synthetic content has heightened concerns about their misuse, driving the development of Machine-Generated Text (MGT) detection models. However, these detectors face significant challenges due to the lack of high-quality synthetic datasets for training. To address this issue, we propose SPADE, a structured framework for detecting synthetic dialogues using prompt-based positive and negative samples. Our proposed methods yield 14 new dialogue datasets, which we benchmark against eight MGT detection models. The results demonstrate improved generalization performance when utilizing a mixed dataset produced by proposed augmentation frameworks, offering a practical approach to enhancing LLM application security. Considering that real-world agents lack knowledge of future opponent utterances, we simulate online dialogue detection and examine the relationship between chat history length and detection accuracy. Our open-source datasets, code and prompts can be downloaded from https://github.com/AngieYYF/SPADE-customer-service-dialogue.
大型语言模型(LLMS)生成合成内容的能力不断提高,使人们更加担心这些模型的滥用,从而推动开发机器生成文本(MGT)检测模型,然而,由于缺乏高质量的培训合成数据集,这些探测器面临重大挑战。为解决这一问题,我们提议SPADE,这是利用基于即时的正反抽样检测合成对话的结构化框架。我们建议的方法产生14个新的对话数据集,我们参照8个MGT检测模型进行基准。结果显示,在使用拟议增强框架产生的混合数据集时,一般化效果有所改善,为增强LLM应用安全提供了实用方法。考虑到真实世界代理缺乏对未来对手语句的了解,我们模拟在线对话探测并检查聊天历史长度和探测准确度之间的关系。我们的公开源数据集、代码和提示可以从https://github.com/AngieYF/SPADE-Scesserer-dialogue下载。
Article 247
Title@2025-07-01 (2): Question Decomposition for Retrieval-Augmented Generation
Title: Question Decomposition for Retrieval-Augmented Generation | Zersetzung der Fragestellung für retrieval-augmented Generation | 问题 后继子孙分解问题 2507.00355v1 |
Authors (3): Paul J. L. Ammann, Jonas Golde, Alan Akbik
Grounding large language models (LLMs) in verifiable external sources is a well-established strategy for generating reliable answers. Retrieval-augmented generation (RAG) is one such approach, particularly effective for tasks like question answering: it retrieves passages that are semantically related to the question and then conditions the model on this evidence. However, multi-hop questions, such as “Which company among NVIDIA, Apple, and Google made the biggest profit in 2023?,” challenge RAG because relevant facts are often distributed across multiple documents rather than co-occurring in one source, making it difficult for standard RAG to retrieve sufficient information. To address this, we propose a RAG pipeline that incorporates question decomposition: (i) an LLM decomposes the original query into sub-questions, (ii) passages are retrieved for each sub-question, and (iii) the merged candidate pool is reranked to improve the coverage and precision of the retrieved evidence. We show that question decomposition effectively assembles complementary documents, while reranking reduces noise and promotes the most relevant passages before answer generation. Although reranking itself is standard, we show that pairing an off-the-shelf cross-encoder reranker with LLM-driven question decomposition bridges the retrieval gap on multi-hop questions and provides a practical, drop-in enhancement, without any extra training or specialized indexing. We evaluate our approach on MultiHop-RAG and HotpotQA, showing gains in retrieval (MRR@10: +36.7%) and answer accuracy (F1: +11.6%) over standard RAG baselines.
在可核查的外部来源中设置大型语言模型(LLMs ) : 质疑 RAG , 原因是相关事实常常在多个文件中传播,而不是在一个来源中共同出现, 使得标准RAG 难以获取足够的信息。 要解决这个问题,我们建议RAG 管道中包含问题解析内容:(一) LLM 将最初的查询移到子问题中, (二) 为每个子问题检索到段落, (三) 合并的候选人库重新排序,以提高检索到的证据的覆盖面和准确性。我们显示,问题解压缩了补充文件,同时将噪音重新定位,并在答案生成之前促进最相关的解析。尽管在升级的 R+RM 方面,我们通过升级升级升级和升级升级的 RBM , 显示我们升级的升级升级升级的 RBR 标准, 显示我们升级的升级的升级升级升级的 RLM , 在升级的 RB 中, 显示我们不进行升级的升级的升级的升级升级的 RLM 。
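The three pipeline steps listed in the abstract above (decompose, retrieve per sub-question, rerank the merged pool) can be sketched as follows. The three callables are hypothetical placeholders for an LLM decomposer, a passage retriever, and a cross-encoder reranker. Scoring the merged pool against the original question is an assumption, since the abstract does not say which query the reranker uses.

```python
from typing import Callable, List


def decompose_retrieve_rerank(
    question: str,
    decompose: Callable[[str], List[str]],   # hypothetical: LLM that returns sub-questions
    retrieve: Callable[[str], List[str]],    # hypothetical: retriever that returns passages
    score: Callable[[str, str], float],      # hypothetical: reranker score(question, passage)
    top_k: int = 5,
) -> List[str]:
    """Return the top_k passages after decomposition, per-sub-question retrieval, and reranking."""
    # (i) Split the original query into sub-questions.
    sub_questions = decompose(question)

    # (ii) Retrieve passages for each sub-question and merge them, dropping duplicates.
    pool: List[str] = []
    seen = set()
    for sub_question in sub_questions:
        for passage in retrieve(sub_question):
            if passage not in seen:
                seen.add(passage)
                pool.append(passage)

    # (iii) Rerank the merged candidate pool (assumed here: against the original question).
    ranked = sorted(pool, key=lambda passage: score(question, passage), reverse=True)
    return ranked[:top_k]
```

Because the components are passed in as plain callables, the same function works with any retriever or reranker, which mirrors the "drop-in enhancement" framing of the abstract.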
Article 248
Title@2025-07-01 (2): Modeling Data Diversity for Joint Instance and Verbalizer Selection in Cold-Start Scenarios
Title: Modeling Data Diversity for Joint Instance and Verbalizer Selection in Cold-Start Scenarios | Modellierung der Datenvielfalt für die gemeinsame Instanz und die Auswahl der Verbalisatoren in Kaltstart-Szenarien | 在 “ 冷开端 “ 情景下为联合试审和镇温器选择建立数据多样性模型 2507.00330v1 |
Authors (3): Mohna Chakraborty, Adithya Kulkarni, Qi Li
Prompt-based methods leverage the knowledge of pre-trained language models (PLMs) trained with a masked language modeling (MLM) objective; however, these methods are sensitive to template, verbalizer, and few-shot instance selection, particularly in cold-start settings with no labeled data. Existing studies overlook the dependency between instances and verbalizers, where instance-label probabilities depend on verbalizer token proximity in the embedding space. To address this, we propose COLDSELECT, a joint verbalizer and instance selection approach that models data diversity. COLDSELECT maps PLM vocabulary and $h_{[MASK]}$ embeddings into a shared space, applying dimensionality reduction and clustering to ensure efficient and diverse selection. By optimizing for minimal uncertainty and maximal diversity, COLDSELECT captures data relationships effectively. Experiments on eight benchmarks demonstrate COLDSELECT’s superiority in reducing uncertainty and enhancing generalization, outperforming baselines in verbalizer and few-shot instance selection for cold-start scenarios.
快速方法利用受过训练、具有隐蔽语言模型(MLM)目标的预先语言模型(PLM)的知识;然而,这些方法对模板、口头和少发实例选择十分敏感,特别是在没有贴标签的数据的寒冷启动环境中。现有研究忽略了实例和语言器之间的依赖性,在这种情况下,实例标签的概率取决于嵌入空间的口头符号接近度。为了解决这个问题,我们提议采用COLLDELECT(COLDSELECT),一种模拟数据多样性的联合口头和实例选择方法。COLLDELECT(COLL)绘制了PLM词汇和$h[MASK]}$嵌入一个共享空间,应用维度减少和组合,以确保高效和多样的选择。通过优化最小的不确定性和最大多样性,COLDSELECT(COLLECT)捕捉到数据关系。关于八项基准的实验表明COLLDSELECECT(COL)在减少不确定性和增强通用性方面的优势,在口头和低发实例选择冷启动情景中优劣基线。
Article 249
Title@2025-06-30 (1): Failure by Interference: Language Models Make Balanced Parentheses Errors When Faulty Mechanisms Overshadow Sound Ones
Title: Failure by Interference: Language Models Make Balanced Parentheses Errors When Faulty Mechanisms Overshadow Sound Ones | Fehler durch Interferenz: Sprachmodelle machen ausgeglichene Klammern Fehler, wenn fehlerhafte Mechanismen Klangeindrücke überschatten | 被干扰失败:语言模型在错误机制压倒阴影声音一号时造成平衡括号错误 2507.00322v1 |
Authors (4): Daking Rai, Samuel Miller, Kevin Moran, Ziyu Yao
Despite remarkable advances in coding capabilities, language models (LMs) still struggle with simple syntactic tasks such as generating balanced parentheses. In this study, we investigate the underlying mechanisms behind the persistence of these errors across LMs of varying sizes (124M-7B) to both understand and mitigate the errors. Our study reveals that LMs rely on a number of components (attention heads and FF neurons) that independently make their own predictions. While some components reliably promote correct answers across a generalized range of inputs (i.e., implementing “sound mechanisms”), others are less reliable and introduce noise by promoting incorrect tokens (i.e., implementing “faulty mechanisms”). Errors occur when the faulty mechanisms overshadow the sound ones and dominantly affect the predictions. Motivated by this insight, we introduce RASteer, a steering method to systematically identify and increase the contribution of reliable components for improving model performance. RASteer substantially improves performance on balanced parentheses tasks, boosting accuracy of some models from 0% to around 100% without impairing the models’ general coding ability. We further demonstrate its broader applicability in arithmetic reasoning tasks, achieving performance gains of up to around 20%.
尽管在编码能力方面取得了显著进步,语言模型(LMS)仍然与简单的合成任务(例如产生平衡的括号)纠缠不休。在本研究中,我们调查了不同大小(124M-7B)的LM(124M-7B)中这些错误持续存在背后的深层机制,以了解和减轻错误。我们的研究显示,LMS依赖一些独立作出预测的成分(注意头和FF神经元),而有些成分可靠地促进在广泛的投入(即执行“声机制”)中正确回答,而另一些成分则不那么可靠,通过推广不正确的符号(即执行“失灵机制”)。错误机制掩盖声音并主要影响预测,就会发生错误。我们受这一洞察力的驱使,我们引入了RASteer,这是一个指导方法,系统地确定和增加可靠成分对改进模型性能的贡献。RASTEER大大改进了平衡的括号任务的业绩,提高了某些模型的精确度,从0美元提高到约100美元左右,同时不损害模型的总体编码能力。我们进一步展示了它对于20 %的运用。
Article 250
Title@2025-06-30 (1): ETTA: Elucidating the Design Space of Text-to-Audio Models
Title: ETTA: Elucidating the Design Space of Text-to-Audio Models | ETTA: Erklärung des Designraums von Text-zu-Audio-Modellen | ETTA: 说明文本到模拟模型的设计空间 2412.19351v2 |
Authors (6): Sang-gil Lee, Zhifeng Kong, Arushi Goel, Sungwon Kim, Rafael Valle, Bryan Catanzaro
Recent years have seen significant progress in Text-To-Audio (TTA) synthesis, enabling users to enrich their creative workflows with synthetic audio generated from natural language prompts. Despite this progress, the effects of data, model architecture, training objective functions, and sampling strategies on target benchmarks are not well understood. With the purpose of providing a holistic understanding of the design space of TTA models, we set up a large-scale empirical experiment focused on diffusion and flow matching models. Our contributions include: 1) AF-Synthetic, a large dataset of high quality synthetic captions obtained from an audio understanding model; 2) a systematic comparison of different architectural, training, and inference design choices for TTA models; 3) an analysis of sampling methods and their Pareto curves with respect to generation quality and inference speed. We leverage the knowledge obtained from this extensive analysis to propose our best model dubbed Elucidated Text-To-Audio (ETTA). When evaluated on AudioCaps and MusicCaps, ETTA provides improvements over the baselines trained on publicly available data, while being competitive with models trained on proprietary data. Finally, we show ETTA’s improved ability to generate creative audio following complex and imaginative captions – a task that is more challenging than current benchmarks.
近年来,在文本合成软件(TTA)合成方面取得了显著进展,使用户能够利用自然语言提示产生的合成音频来丰富其创造性工作流程。尽管取得了这一进展,数据、模型架构、培训客观功能和抽样战略对目标基准的影响却不甚为人理解。为了全面了解TTA模型的设计空间,我们建立了一个大规模的经验实验,重点是传播和流动匹配模型。我们的贡献包括:(1)AF-合成,这是从听觉模型中获得的高质量合成字幕的庞大数据集;(2)系统地比较TTA模型的不同建筑、培训和推断设计选择;(3)对数据、模型结构、培训、目标功能和指标基准的抽样方法及其关于生成质量和推断速度的Pareto曲线的分析。我们利用从这一广泛分析中获得的知识提出我们最好的模型,称为Elcifed Text-to-Audio(ETTA)。在对音频和音乐能力进行评估时,ETTA提供了对公共数据培训的基线的改进,同时与在产权数据方面受过培训的模型具有竞争力;(3)在生成质量和判断能力方面,我们展示ETTA更具有挑战性的工作基础。
Article 251
Title@2025-06-30 (1): Breaking mBad! Supervised Fine-tuning for Cross-Lingual Detoxification
Title: Breaking mBad! Supervised Fine-tuning for Cross-Lingual Detoxification | Breaking mBad! Supervised Feinabstimmung für Cross-Lingual Entgiftung | 监督跨语言解毒微调 2505.16722v2 |
Authors (5): Himanshu Beniwal, Youngwoo Kim, Maarten Sap, Soham Dan, Thomas Hartvigsen
As large language models (LLMs) become increasingly prevalent in global applications, ensuring that they are toxicity-free across diverse linguistic contexts remains a critical challenge. We explore “Cross-lingual Detoxification”, a cross-lingual paradigm that mitigates toxicity, enabling detoxification capabilities to transfer between high and low-resource languages across different script families. We analyze cross-lingual detoxification’s effectiveness through 392 extensive settings to evaluate toxicity reduction in cross-distribution settings with limited data and investigate how mitigation impacts model performance on non-toxic tasks, revealing trade-offs between safety and knowledge preservation. Our code and dataset are publicly available at https://github.com/himanshubeniwal/Breaking-mBad.
随着大型语言模型(LLMs)在全球应用中日益普遍,确保这些模型在不同语言背景下无毒性仍然是一个重大挑战,我们探索“Cross-语言解毒化”这个跨语言模式可以减轻毒性,使高、低资源语言能够在不同文字家庭之间转移;我们通过392个广泛的环境分析跨语言解毒化的有效性,以有限的数据评价跨分布环境中的毒性减少情况,并调查缓解模式如何影响无毒任务,揭示安全和知识保护之间的权衡。我们的代码和数据集可在https://github.com/himanshhubeniwal/Breaking-mBad上公开查阅。
Article 252
Title@2025-06-30 (1): Open-ended Scientific Discovery via Bayesian Surprise
Title: Open-ended Scientific Discovery via Bayesian Surprise | Offene wissenschaftliche Entdeckung über Bayesian Surprise | 通过贝叶斯惊喜的不限名额科学发现 2507.00310v1 |
Authors (11): Dhruv Agarwal, Bodhisattwa Prasad Majumder, Reece Adamson, Megha Chakravorty, Satvika Reddy Gavireddy, Aditya Parashar, Harshit Surana, Bhavana Dalvi Mishra, Andrew McCallum, Ashish Sabharwal, Peter Clark
The promise of autonomous scientific discovery (ASD) hinges not only on answering questions, but also on knowing which questions to ask. Most recent works in ASD explore the use of large language models (LLMs) in goal-driven settings, relying on human-specified research questions to guide hypothesis generation. However, scientific discovery may be accelerated further by allowing the AI system to drive exploration by its own criteria. The few existing approaches in open-ended ASD select hypotheses based on diversity heuristics or subjective proxies for human interestingness, but the former struggles to meaningfully navigate the typically vast hypothesis space, and the latter suffers from imprecise definitions. This paper presents AutoDS – a method for open-ended ASD that instead drives scientific exploration using Bayesian surprise. Here, we quantify the epistemic shift from the LLM’s prior beliefs about a hypothesis to its posterior beliefs after gathering experimental results. To efficiently explore the space of nested hypotheses, our method employs a Monte Carlo tree search (MCTS) strategy with progressive widening using surprisal as the reward function. We evaluate AutoDS in the setting of data-driven discovery across 21 real-world datasets spanning domains such as biology, economics, finance, and behavioral science. Our results demonstrate that under a fixed budget, AutoDS substantially outperforms competitors by producing 5–29% more discoveries deemed surprising by the LLM. Our human evaluation further finds that two-thirds of AutoDS discoveries are surprising to the domain experts, suggesting this is an important step forward towards building open-ended ASD systems.
自主科学发现(ASD)的希望不仅取决于回答问题,还取决于了解需要问的问题。ASD最近的工作大多探索在目标驱动的环境中使用大型语言模型(LLMs),依靠人类指定的研究问题来引导假设的生成。然而,科学发现可以通过允许AI系统以自己的标准驱动探索而进一步加速。在开放的ASD选择基于多样性超常论或人类兴趣主观代言的假设的少数现有办法,而前一种为有意义地探索典型的大假设空间而奋斗,而后一种则受到不准确定义的困扰。本文展示了AutoDS(AutoDS),这是开放的ASDS(LM)的一个方法,它利用Bayesian的惊喜推动科学探索。在这里,我们量化了与LIS系统先前关于假设的缩影化转变,在收集实验结果之后,将它变为其后数级的假设。为了有效地探索嵌套假体空间,我们的方法是利用假称功能不断扩展的蒙特卡洛树搜索(MCTS)战略。 而后者则受到不精确的定义。我们评估了ADSDSD(ADD)在设计中,在建立数据-BRODSDS(O)中,在21世纪内, 将数据-SDSDSDA(BR)的快速探索的探索中, 将展示了一种更重要地展示中,我们的数据-SDA-SVDA(B-SDSDA)的快速的动作展示了一种重要的前的动作,在现实-SDF)将展示了比。
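The prior-to-posterior belief shift described in the abstract above can be illustrated with the usual definition of Bayesian surprise as the KL divergence from prior to posterior. The sketch assumes the simplest possible belief representation, a single probability that the hypothesis is true; how AutoDS actually elicits and represents the LLM's beliefs is not specified in the abstract, so this form is an assumption.

```python
import math


def bernoulli_kl(p, q, eps=1e-12):
    """KL( Bernoulli(p) || Bernoulli(q) ) in nats, with clipping to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def bayesian_surprise(prior_belief, posterior_belief):
    """Surprise of an experiment: KL divergence of the posterior belief from the prior belief."""
    return bernoulli_kl(posterior_belief, prior_belief)
```

For example, moving from a prior of 0.5 to a posterior of 0.9 yields a larger surprise than moving from 0.5 to 0.6, and an experiment that leaves the belief unchanged yields zero; a search procedure can use this quantity as its reward.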
Article 253
Title@2025-06-30 (1): Natural language processing for African languages
Title: Natural language processing for African languages | Natürliche Sprachverarbeitung für afrikanische Sprachen | 非洲语言的自然语言处理 2507.00297v1 |
Authors (1): David Ifeoluwa Adelani
Recent advances in word embeddings and language models use large-scale, unlabelled data and self-supervised learning to boost NLP performance. Multilingual models, often trained on web-sourced data like Wikipedia, face challenges: few low-resource languages are included, their data is often noisy, and lack of labeled datasets makes it hard to evaluate performance outside high-resource languages like English. In this dissertation, we focus on languages spoken in Sub-Saharan Africa where all the indigenous languages in this region can be regarded as low-resourced in terms of the availability of labelled data for NLP tasks and unlabelled data found on the web. We analyse the noise in the publicly available corpora, and curate a high-quality corpus, demonstrating that the quality of semantic representations learned in word embeddings does not only depend on the amount of data but on the quality of pre-training data. We demonstrate empirically the limitations of word embeddings, and the opportunities the multilingual pre-trained language model (PLM) offers especially for languages unseen during pre-training and low-resource scenarios. We further study how to adapt and specialize multilingual PLMs to unseen African languages using a small amount of monolingual texts. To address the under-representation of the African languages in NLP research, we developed large scale human-annotated labelled datasets for 21 African languages in two impactful NLP tasks: named entity recognition and machine translation. We conduct an extensive empirical evaluation using state-of-the-art methods across supervised, weakly-supervised, and transfer learning settings.
在语言嵌入和语言模型方面最近的进展是使用大规模、无标签的数据和自我监督的学习来提高NLP的绩效。多语言模型,通常在Wikipedia等网络数据上受过培训,面临挑战:很少使用低资源语言,其数据往往噪音,缺乏贴标签的数据集使得难以在英语等高资源语言之外评价业绩。在这份论文中,我们侧重于撒哈拉以南非洲的语言,该地区所有土著语言都可被视为在NLP任务和网上发现的无标签数据提供方面资源不足。我们分析公开提供的公司库中的噪音,并整理一个高质量的资料库,表明在语言嵌入中学习的语义表达质量不仅取决于数据数量,而且取决于培训前数据的质量。我们从经验上展示了语言嵌入的局限性,以及多语言预培训语言模型(PLM)在培训前和低资源情景中尤其为隐蔽语言提供了低资源。我们用高语言翻译的透明化语言翻译提供了低资源。我们分析了在公开的Coralalalalalal-al legalal legalalal legal legrational a labal a labal lives,我们进一步研究如何在21级语言中应用中采用高语言中采用高语言的多语言,并特别的多语言学习方法。我们研究。我们如何在21语言中学习。我们学习的多语言的多语言中学习。
Article 254
Title@2025-06-30 (1): The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements
Title: The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements | Der Automatisierte LLM Speedrunning Benchmark: NanoGPT-Verbesserungen reproduzieren | 自动LLM快速运行基准:复制纳米GPT改进 2506.22419v2 |
Authors (23): Bingchen Zhao, Despoina Magka, Minqi Jiang, Xian Li, Roberta Raileanu, Tatiana Shavrina, Jean-Christophe Gagnon-Audet, Kelvin Niu, Shagun Sodhani, Michael Shvartsman, Andrei Lupu, Alisia Lupidi, Edan Toledo, Karen Hambardzumyan, Martin Josifoski, Thomas Foster, Lucia Cipolina-Kun, Abhishek Charnalia, Derek Dunfield, Alexander H. Miller, Oisin Mac Aodha, Jakob Foerster, Yoram Bachrach
Rapid advancements in large language models (LLMs) have the potential to assist in scientific progress. A critical capability toward this endeavor is the ability to reproduce existing work. To evaluate the ability of AI agents to reproduce results in an active research area, we introduce the Automated LLM Speedrunning Benchmark, leveraging the research community's contributions to the NanoGPT speedrun, a competition to train a GPT-2 model in the shortest time. Each of the 19 speedrun tasks provides the agent with the previous record's training script, optionally paired with one of three hint formats, ranging from pseudocode to paper-like descriptions of the new record's improvements. Records execute quickly by design, and speedrun improvements encompass diverse code-level changes, ranging from high-level algorithmic advancements to hardware-aware optimizations. These features make the benchmark both accessible and realistic for the frontier problem of improving LLM training. We find that recent reasoning LLMs combined with SoTA scaffolds struggle to reimplement already-known innovations in our benchmark, even when given detailed hints. Our benchmark thus provides a simple, non-saturated measure of an LLM's ability to automate scientific reproduction, a necessary (but not sufficient) skill for an autonomous research agent.
大型语言模型(LLMS)的快速进步具有协助科学进步的潜力。实现这项工作的关键能力是复制现有工作的能力。为了评价AI代理商在活跃研究领域复制成果的能力,我们引入了自动LLM速度运行基准,利用研究界在NanoGPT快速赛跑上的贡献,这是在最短的时间内培训GPT-2模型的竞赛。19个快速跑任务中的每一个任务都为代理商提供了以前的记录培训脚本,可选配以三种提示格式之一,从假码到纸样的新记录改进描述。通过设计和速度改进迅速执行的记录包含不同的代码层面的变化,从高级算法进步到硬件智能优化。这些特点使得基准既便于使用,又现实地适用于改进LMM培训的前沿问题。我们发现,最近的推理LMS和SoTA Scafffolds争难使我们基准中已经知道的革新重新实施,即使给出了详细的提示。因此,我们的基准提供了一种简单、不饱和测量LMS能力的尺度,对于自动复制来说是必要的。
Article 255
Title@2025-06-30 (1): Can LLMs Evaluate Complex Attribution in QA? Automatic Benchmarking using Knowledge Graphs
Title: Can LLMs Evaluate Complex Attribution in QA? Automatic Benchmarking using Knowledge Graphs | Kann LLM komplexe Attribution in QA auswerten? Automatisches Benchmarking mit Wissensgraphen | LLM能否评估问答中的复杂归因?利用知识图谱自动构建基准 2401.14640v2 |
Authors (9): Nan Hu, Jiaoyan Chen, Yike Wu, Guilin Qi, Hongru Wang, Sheng Bi, Yongrui Chen, Tongtong Wu, Jeff Z. Pan
Attributed Question Answering (AQA) has attracted wide attention, but there are still several limitations in evaluating the attributions, including a lack of fine-grained attribution categories, reliance on manual annotations, and a failure to compare attributions with only subtle differences. To bridge these gaps, we introduce Complex Attributed Question Answering (CAQA), a large-scale benchmark containing comprehensive attribution categories, automatically generated using Knowledge Graphs (KGs), and complex attribution scenarios. We have conducted extensive experiments to verify the effectiveness of CAQA, including the benchmarking of 25 automatic evaluators, their comparison with human evaluators, and the testing of LLM evaluators fine-tuned by CAQA. These experiments also lead to a series of important findings that can benefit future research on AQA. All code and data are publicly accessible at https://github.com/HuuuNan/CAQA-Benchmark.
归因问答(AQA)已引起广泛关注,但在评估归因方面仍存在若干局限,包括缺乏细粒度的归因类别、依赖人工标注,以及无法比较仅有细微差别的归因。为弥补这些不足,我们提出了复杂归因问答(CAQA),这是一个大规模基准,包含全面的归因类别、利用知识图谱(KG)自动生成的数据以及复杂的归因场景。我们进行了大量实验来验证CAQA的有效性,包括对25个自动评估器进行基准测试、将其与人工评估者进行比较,以及测试经CAQA微调的LLM评估器。这些实验还得出了一系列重要发现,有助于AQA的未来研究。所有代码和数据均可在 https://github.com/HuuuNan/CAQA-Benchmark 公开获取。
Article 256
Title@2025-06-30 (1): From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning
Title: From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning | Von Tokens zu Gedanken: Wie LLMs und Menschen Kompression für Bedeutung traden | 从Tokens到思想:LLM和人类如何用贸易压缩来达到意义 2505.17117v3 |
Authors (4): Chen Shani, Dan Jurafsky, Yann LeCun, Ravid Shwartz-Ziv
Humans organize knowledge into compact categories through semantic compression by mapping diverse instances to abstract representations while preserving meaning (e.g., robin and blue jay are both birds; most birds can fly). These concepts reflect a trade-off between expressive fidelity and representational simplicity. Large Language Models (LLMs) demonstrate remarkable linguistic abilities, yet whether their internal representations strike a human-like trade-off between compression and semantic fidelity is unclear. We introduce a novel information-theoretic framework, drawing from Rate-Distortion Theory and the Information Bottleneck principle, to quantitatively compare these strategies. Analyzing token embeddings from a diverse suite of LLMs against seminal human categorization benchmarks, we uncover key divergences. While LLMs form broad conceptual categories that align with human judgment, they struggle to capture the fine-grained semantic distinctions crucial for human understanding. More fundamentally, LLMs demonstrate a strong bias towards aggressive statistical compression, whereas human conceptual systems appear to prioritize adaptive nuance and contextual richness, even if this results in lower compressional efficiency by our measures. These findings illuminate critical differences between current AI and human cognitive architectures, guiding pathways toward LLMs with more human-aligned conceptual representations.
人类通过将多种情况映射成抽象的表达方式,通过测算各种情况,将知识组织成紧凑类,通过测算将知识组织成精密压缩,并同时保留意义(例如,Robin和蓝雀都是鸟类;大多数鸟类可以飞翔)。这些概念反映了表达真实性和代表简单性之间的权衡。大语言模型(LLMs)显示了非凡的语言能力,然而,它们的内部表述是否在压缩和语义忠诚之间形成像人类一样的权衡取舍,这一点还不清楚。我们引入了一个新颖的信息理论框架,从低调扭曲理论和信息瓶颈原则中提取,以定量比较这些战略。从一套不同的LLMS中分析象征性嵌入与人类基本分类基准之间的平衡,我们发现关键的差异。虽然LLMS形成与人类判断相符合的广泛概念类别,但它们努力捕捉到对于人类理解至关重要的精细的精密的语义区分。 更重要的是,LLMS展示了对侵略性统计压缩的强烈的偏向,而人类概念系统似乎优先考虑适应性和背景丰富性,即使通过我们的措施可以降低压缩效率。
Article 257
Title@2025-06-30 (1): ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling
Title: ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling | EKG-Byte: Ein Tokenizer für die End-to-End Generative Elektrokardiogramm-Sprachenmodellierung | ECG-Byte:用于端到端生成式心电图语言建模的分词器 2412.14373v2 |
Authors (5): William Han, Chaojing Duan, Michael A. Rosenberg, Emerson Liu, Ding Zhao
Large Language Models (LLMs) have demonstrated exceptional versatility across domains, including applications to electrocardiograms (ECGs). A growing body of work focuses on generating text from multi-channeled ECG signals and corresponding textual prompts. Existing approaches often involve a two-stage process: pretraining an ECG-specific encoder with a self-supervised learning (SSL) objective, followed by finetuning an LLM for natural language generation (NLG) using encoder-derived features. However, these methods face two key limitations: inefficiency due to multi-stage training and challenges in interpreting encoder-generated features. To overcome these issues, we propose ECG-Byte, an adapted byte pair encoding (BPE) tokenizer pipeline for autoregressive language modeling of ECGs. ECG-Byte compresses and encodes ECG signals into tokens, enabling direct end-to-end LLM training by combining ECG and text tokens. This approach enhances interpretability, as ECG tokens can be directly mapped back to the original signals. Leveraging ECG-Byte, we achieve competitive NLG performance while training 3 times faster and using just 48% of the data required by traditional two-stage methods.
大型语言模型(LLM)在各个领域展现出卓越的通用性,包括在心电图(ECG)上的应用。越来越多的工作致力于根据多通道ECG信号和相应的文本提示生成文本。现有方法通常包含两个阶段:先以自监督学习(SSL)目标预训练ECG专用编码器,再利用编码器提取的特征对LLM进行微调以完成自然语言生成(NLG)。然而,这些方法面临两个主要局限:多阶段训练带来的低效,以及编码器生成特征难以解释。为克服这些问题,我们提出ECG-Byte,一种经过改造的字节对编码(BPE)分词器流程,用于ECG的自回归语言建模。ECG-Byte将ECG信号压缩并编码为词元,通过组合ECG词元和文本词元实现直接的端到端LLM训练。由于ECG词元可以直接映射回原始信号,这种方法提高了可解释性。借助ECG-Byte,我们取得了有竞争力的NLG性能,同时训练速度提高3倍,且仅需传统两阶段方法所需数据的48%。
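The tokenization idea in the abstract above can be illustrated with a minimal sketch. It assumes a simple uniform quantizer and a plain greedy BPE merge loop; the function names, the number of quantization levels, and the toy signal are hypothetical and are not taken from the ECG-Byte implementation.

```python
from collections import Counter

def quantize(signal, levels=8):
    """Map each sample to an integer symbol in [0, levels) by uniform binning (assumed scheme)."""
    lo, hi = min(signal), max(signal)
    span = (hi - lo) or 1.0
    return [min(int((x - lo) / span * levels), levels - 1) for x in signal]

def apply_merge(tokens, left, right):
    """Replace every adjacent (left, right) pair with one concatenated token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            merged.append(left + right)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def learn_bpe(symbols, num_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent token pair."""
    tokens = [(s,) for s in symbols]  # each token is a tuple of base symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:  # nothing left worth merging
            break
        merges.append((left, right))
        tokens = apply_merge(tokens, left, right)
    return merges, tokens

def decode(tokens):
    """Flatten tokens back to the quantized symbol sequence."""
    return [s for token in tokens for s in token]

# Toy periodic "beat" standing in for one ECG lead (illustrative values only).
beat = [0.0, 0.1, 0.0, 1.0, -0.3, 0.0, 0.2, 0.0]
symbols = quantize(beat * 6)
merges, tokens = learn_bpe(symbols, num_merges=10)
print(len(symbols), "symbols ->", len(tokens), "tokens")
```

Because every token is just a tuple of quantized symbols, `decode` recovers the quantized signal exactly, which is the property the abstract relies on when it says tokens can be mapped back to the original signal.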
Article 258
Title@2025-06-30 (1): Impact of Fine-Tuning Methods on Memorization in Large Language Models
Title: Impact of Fine-Tuning Methods on Memorization in Large Language Models | Auswirkungen von Feintuning-Methoden auf die Erinnerung an große Sprachmodelle | 大语言模型中微调教学方法对记忆化的影响 2507.00258v1 |
Authors (4): Jie Hou, Chuxiong Wu, Lannan Luo, Qiang Zeng
As the capabilities of pre-trained large language models (LLMs) continue to advance, the “pre-train and fine-tune” paradigm has become increasingly mainstream, leading to the development of various fine-tuning methods. However, the privacy risks arising from memorization during fine-tuning have received relatively little attention. To address this gap, we categorize popular fine-tuning approaches and assess their impact on memorization through the lens of membership inference attacks (MIAs). Our results show that, compared to parameter-based fine-tuning, prompt-based fine-tuning achieves competitive performance while exhibiting lower vulnerability to MIAs. Furthermore, prompt-based methods maintain low memorization regardless of model scale. These findings suggest that parameter-based fine-tuning is more prone to leaking private information, whereas prompt-based fine-tuning serves as a more privacy-preserving option.
随着预先培训的大型语言模型(LLMs)的能力继续提高,“预先培训和微调”模式已日益成为主流,导致各种微调方法的开发,然而,微调期间记忆化产生的隐私风险相对较少受到注意。为了解决这一差距,我们从会员推论攻击的角度对流行微调方法进行分类,并评估其对记忆化的影响。我们的结果显示,与基于参数的微调相比,基于迅速的微调取得了竞争性的绩效,同时表现出对MIA的脆弱性较低。此外,基于迅速的方法无论模式规模如何,都保持较低的记忆度。这些结果表明,基于参数的微调更容易泄露私人信息,而基于迅速的微调则更有利于隐私。
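The abstract above does not say which membership inference attack is used, so the following is only a sketch of the simplest common variant, a loss-threshold attack, under the assumption that per-example losses are available for known members and non-members. All names and numbers are illustrative.

```python
def loss_threshold_attack(losses, threshold):
    """Predict 'member' (True) when the per-example loss falls below the threshold."""
    return [loss < threshold for loss in losses]

def attack_auc(member_losses, nonmember_losses):
    """AUC of the loss-based attack: the probability that a random member
    has a lower loss than a random non-member (ties count as 0.5)."""
    wins = 0.0
    for m in member_losses:
        for n in nonmember_losses:
            if m < n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_losses) * len(nonmember_losses))

# Made-up losses: a model that memorizes gives training examples visibly lower loss.
members = [0.10, 0.25, 0.30, 0.15]
nonmembers = [0.90, 0.70, 0.20, 1.10]
print("AUC:", attack_auc(members, nonmembers))
print(loss_threshold_attack(members + nonmembers, threshold=0.5))
```

An AUC near 0.5 means the attacker cannot tell members from non-members, which is the sense in which a fine-tuning method with lower MIA vulnerability memorizes less.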
Article 259
Title@2025-06-30 (1): Llama-Nemotron: Efficient Reasoning Models
Title: Llama-Nemotron: Efficient Reasoning Models | Llama-Nemotron: Effiziente Denkmodelle | Llama-Nemotron:高效推理模型 2505.00949v4 |
Authors (136): Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani, Ido Shahaf, Oren Tropp, Ehud Karpas, Ran Zilberstein, Jiaqi Zeng, Soumye Singhal, Alexander Bukharin, Yian Zhang, Tugrul Konuk, Gerald Shen, Ameya Sunil Mahabaleshwarkar, Bilal Kartal, Yoshi Suhara, Olivier Delalleau, Zijia Chen, Zhilin Wang, David Mosallanezhad, Adi Renduchintala, Haifeng Qian, Dima Rekesh, Fei Jia, Somshubra Majumdar, Vahid Noroozi, Wasi Uddin Ahmad, Sean Narenthiran, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Siddhartha Jain, Igor Gitman, Ivan Moshkov, Wei Du, Shubham Toshniwal, George Armstrong, Branislav Kisacanin, Matvei Novikov, Daria Gitman, Evelina Bakhturina, Prasoon Varshney, Makesh Narsimhan, Jane Polak Scowcroft, John Kamalu, Dan Su, Kezhi Kong, Markus Kliegl, Rabeeh Karimi, Ying Lin, Sanjeev Satheesh, Jupinder Parmar, Pritam Gundecha, Brandon Norick, Joseph Jennings, Shrimai Prabhumoye, Syeda Nahida Akter, Mostofa Patwary, Abhinav Khattar, Deepak Narayanan, Roger Waleffe, Jimmy Zhang, Bor-Yiing Su, Guyue Huang, Terry Kong, Parth Chadha, Sahil Jain, Christine Harvey, Elad Segal, Jining Huang, Sergey Kashirsky, Robert McQueen, Izzy Putterman, George Lam, Arun Venkatesan, Sherry Wu, Vinh Nguyen, Manoj Kilaru, Andrew Wang, Anna Warno, Abhilash Somasamudramath, Sandip Bhaskar, Maka Dong, Nave Assaf, Shahar Mor, Omer Ullman Argov, Scot Junkin, Oleksandr Romanenko, Pedro Larroy, Monika Katariya, Marco Rovinelli, Viji Balas, Nicholas Edelman, Anahita Bhiwandiwalla, Muthu Subramaniam, Smita Ithape, Karthik Ramamoorthy, Yuting Wu, Suguna Varshini Velury, Omri Almog, Joyjit Daw, Denys Fridman, Erick Galinkin, Michael Evans, Shaona Ghosh, Katherine Luna, Leon Derczynski, Nikki Pope, Eileen Long, Seth Schneider, Guillermo Siman, Tomasz Grzegorzek, Pablo Ribalta, Monika Katariya, Chris Alexiuk, Joey Conway, Trisha Saar, Ann Guan, Krzysztof Pawelec, Shyamala Prayaga, Oleksii Kuchaiev, Boris Ginsburg, Oluwatobi Olabiyi, Kari Briski, Jonathan Cohen, Bryan Catanzaro, Jonah Alben, Yonatan Geifman, Eric Chung
We introduce the Llama-Nemotron series of models, an open family of heterogeneous reasoning models that deliver exceptional reasoning capabilities, inference efficiency, and an open license for enterprise use. The family comes in three sizes – Nano (8B), Super (49B), and Ultra (253B) – and performs competitively with state-of-the-art reasoning models such as DeepSeek-R1 while offering superior inference throughput and memory efficiency. In this report, we discuss the training procedure for these models, which entails using neural architecture search from Llama 3 models for accelerated inference, knowledge distillation, and continued pretraining, followed by a reasoning-focused post-training stage consisting of two main parts: supervised fine-tuning and large scale reinforcement learning. Llama-Nemotron models are the first open-source models to support a dynamic reasoning toggle, allowing users to switch between standard chat and reasoning modes during inference. To further support open research and facilitate model development, we provide the following resources: 1. We release the Llama-Nemotron reasoning models – LN-Nano, LN-Super, and LN-Ultra – under the commercially permissive NVIDIA Open Model License Agreement. 2. We release the complete post-training dataset: Llama-Nemotron-Post-Training-Dataset. 3. We also release our training codebases: NeMo, NeMo-Aligner, and Megatron-LM.
我们推出Llama-Nemotron系列模型,这是一个开放的异构推理模型家族,具备卓越的推理能力和推理效率,并提供可供企业使用的开放许可。该系列有三种规模:Nano(8B)、Super(49B)和Ultra(253B),性能可与DeepSeek-R1等最先进的推理模型相媲美,同时提供更高的推理吞吐量和内存效率。在本报告中,我们讨论这些模型的训练流程:先基于Llama 3模型进行神经架构搜索以加速推理,并进行知识蒸馏和继续预训练,随后是以推理为重点的后训练阶段,包括监督微调和大规模强化学习两个主要部分。Llama-Nemotron模型是最早支持动态推理开关的开源模型,允许用户在推理时在标准聊天模式和推理模式之间切换。为进一步支持开放研究并促进模型开发,我们提供以下资源:1. 我们在商业上宽松的NVIDIA开放模型许可协议下发布Llama-Nemotron推理模型:LN-Nano、LN-Super和LN-Ultra。2. 我们发布完整的后训练数据集:Llama-Nemotron-Post-Training-Dataset。3. 我们还发布训练代码库:NeMo、NeMo-Aligner和Megatron-LM。
Article 260
Title@2025-06-30 (1): Developing Lightweight DNN Models With Limited Data For Real-Time Sign Language Recognition
Title: Developing Lightweight DNN Models With Limited Data For Real-Time Sign Language Recognition | Entwicklung leichter DNN-Modelle mit begrenzten Daten für Echtzeit-Sign Language-Erkennung | 开发轻型DNN模型,具有实时手语识别的有限数据 2507.00248v1 |
Authors (2): Nikita Nikitin, Eugene Fomin
We present a novel framework for real-time sign language recognition using lightweight DNNs trained on limited data. Our system addresses key challenges in sign language recognition, including data scarcity, high computational costs, and discrepancies in frame rates between training and inference environments. By encoding sign-language-specific parameters, such as handshape, palm orientation, movement, and location, into vectorized inputs, and leveraging MediaPipe for landmark extraction, we achieve highly separable input data representations. Our DNN architecture, optimized for sub-10 MB deployment, enables accurate classification of 343 signs with less than 10 ms latency on edge devices. The data annotation platform ‘slait data’ facilitates structured labeling and vector extraction. Our model achieved 92% accuracy in isolated sign recognition and has been integrated into the ‘slait ai’ web application, where it demonstrates stable inference.
我们提出了一个使用经过有限数据培训的轻量级 DNN 实时手语识别的新框架。 我们的系统处理手语识别方面的主要挑战,包括数据稀缺、计算成本高以及培训和推断环境之间框架率的差异。 通过将手势、棕榈方向、移动和位置等手语特定参数编码为矢量化投入,以及利用MediaPipe 进行里程碑式提取,我们实现了高度可分隔的输入数据表达。 我们的DNN 架构,为子10MB部署优化,能够准确分类343个信号,边缘设备上含值不到10米。数据注解平台的“lait”数据便于结构化标签和矢量提取。我们的模型在孤立的签名识别中实现了92%的准确度,并被纳入了“lait ai” 网络应用程序,其中显示了稳定的推断。
Article 261
Title@2025-06-30 (1): EfficientXLang: Towards Improving Token Efficiency Through Cross-Lingual Reasoning
Title: EfficientXLang: Towards Improving Token Efficiency Through Cross-Lingual Reasoning | EfficientXLang: Verbesserung der Token-Effizienz durch Cross-Lingual Reasoning | 高效XLang:通过跨语言理由提高当量效率 2507.00246v1 |
Authors (3): Sanchit Ahuja, Praneetha Vaddamanu, Barun Patra
Despite recent advances in Language Reasoning Models (LRMs), most research focuses solely on English, even though many models are pretrained on multilingual data. In this work, we investigate: Is English the most token-efficient language for reasoning? We evaluate three open-source LRMs: DeepSeek R1, Qwen 2.5 and Qwen 3, across four math datasets and seven typologically diverse languages. We find that reasoning in non-English languages not only reduces token usage, but also preserves accuracy. These gains persist even after translating the reasoning traces into English, suggesting genuine shifts in reasoning behavior rather than surface-level linguistic effects. The extent of improvement, however, depends on the model's multilingual strength. Our findings motivate a broader view of reasoning in language models, highlighting the potential of multilingual reasoning and the importance of strong multilingual foundations. The code for our work can be found at https://github.com/microsoft/EfficientXLang.
尽管最近在语言解释模型(LRMs)方面有所进展,但大多数研究都只侧重于英语,尽管许多模型在多语种数据方面已经受过预先培训。在这项工作中,我们调查:英语是否是最有象征价值的推理语言?我们评估了三种开放源码RLMs:DeepSeek R1、Quen 2.5和Quen 3,涉及四个数学数据集和七个类型多样的语言。我们发现,非英语的推理不仅减少了象征性的使用,而且还保留了准确性。这些推理痕迹在将推理痕迹翻译成英语之后仍然存在,表明推理行为的真正转变,而不是表面语言影响。但是,改进的程度取决于多语种模型的力量。我们的调查结果激发了对语言模型推理的更广泛观点,强调了多语种推理的潜力和强大的多语种基础的重要性。我们工作的守则可以找到:https://github.com/microsoft/EffiochXLang。
Article 262
Title@2025-06-30 (1): Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension
Title: Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension | Skalierung der Inferenz-Zeit-Suche mit Vision Value Model für verbesserte visuelle Wahrnehmung | 增强视觉理解的视觉价值模型的增强推论-实时搜索 2412.03704v3 |
Authors (9): Xiyao Wang, Zhengyuan Yang, Linjie Li, Hongjin Lu, Yuancheng Xu, Chung-Ching Lin, Kevin Lin, Furong Huang, Lijuan Wang
Despite significant advancements in vision-language models (VLMs), effective approaches for enhancing response quality by scaling inference-time computation are still lacking. This capability is known to be a core step towards self-improving models in recent large language model studies. In this paper, we present the Vision Value Model (VisVM), which can guide VLM inference-time search to generate responses with better visual comprehension. Specifically, VisVM not only evaluates the generated sentence quality in the current search step, but also anticipates the quality of subsequent sentences that may result from the current step, thus providing a long-term value. In this way, VisVM steers VLMs away from generating sentences prone to hallucinations or insufficient detail, thereby producing higher quality responses. Experimental results demonstrate that VisVM-guided search significantly enhances VLMs’ ability to generate descriptive captions with richer visual details and fewer hallucinations, compared with greedy decoding and search methods with other visual reward signals. Furthermore, we find that self-training the model with the VisVM-guided captions improves the VLM’s performance across a wide range of multimodal benchmarks, indicating the potential for developing self-improving VLMs. Our value model and code are available at https://github.com/si0wang/VisVM.
尽管视觉语言模型(VLM)取得了显著进展,但仍缺乏通过扩展推理时计算来提升响应质量的有效方法。在近期的大型语言模型研究中,这种能力被视为迈向自我改进模型的核心一步。本文提出视觉价值模型(VisVM),它可以引导VLM的推理时搜索,生成具有更好视觉理解的响应。具体而言,VisVM不仅评估当前搜索步骤中生成句子的质量,还预测当前步骤可能带来的后续句子的质量,从而提供长期价值。这样,VisVM引导VLM避免生成容易产生幻觉或细节不足的句子,从而得到更高质量的响应。实验结果表明,与贪婪解码以及采用其他视觉奖励信号的搜索方法相比,VisVM引导的搜索显著增强了VLM生成视觉细节更丰富、幻觉更少的描述性字幕的能力。此外,我们发现用VisVM引导生成的字幕对模型进行自训练,可以提升VLM在多种多模态基准上的表现,显示出开发自我改进VLM的潜力。我们的价值模型和代码可在 https://github.com/si0wang/VisVM 获取。
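The sentence-level search described above can be sketched as a simple loop. This is only an illustration under assumptions: `propose` stands in for sampling candidate next sentences from a VLM and `value` stands in for a learned value model such as VisVM; both are hypothetical callables, and the toy versions below exist only to make the sketch runnable.

```python
def value_guided_search(propose, value, max_steps, num_candidates):
    """Build a response sentence by sentence, keeping the candidate the
    value function scores highest at every step."""
    response = []
    for _ in range(max_steps):
        candidates = propose(response, num_candidates)
        if not candidates:
            break
        best = max(candidates, key=lambda sentence: value(response, sentence))
        response.append(best)
    return response

# Toy stand-ins (assumptions): candidates are labelled strings and the
# "value" simply prefers sentences that claim more visual detail.
def toy_propose(response, k):
    return [f"sentence {len(response)} with {i} details" for i in range(k)]

def toy_value(response, sentence):
    return int(sentence.split()[3])

print(value_guided_search(toy_propose, toy_value, max_steps=3, num_candidates=4))
```

The design point is that the scoring function is called on the partial response plus a candidate, so a value model that estimates long-term quality can reject a sentence that looks fine locally but tends to lead to hallucinated continuations.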
Article 263
Title@2025-06-30 (1): The Algebraic Structure of Morphosyntax
Title: The Algebraic Structure of Morphosyntax | Die algebraische Struktur von Morphosyntax | 形态句法的代数结构 2507.00244v1 |
Authors (2): Isabella Senturia, Matilde Marcolli
Within the context of the mathematical formulation of Merge and the Strong Minimalist Thesis, we present a mathematical model of the morphology-syntax interface. In this setting, morphology has compositional properties responsible for word formation, organized into a magma of morphological trees. However, unlike syntax, we do not have movement within morphology. A coproduct decomposition exists, but it requires extending the set of morphological trees beyond those which are generated solely by the magma, to a larger set of possible morphological inputs to syntactic trees. These participate in the formation of morphosyntactic trees as an algebra over an operad, and a correspondence between algebras over an operad. The process of structure formation for morphosyntactic trees can then be described in terms of this operadic correspondence that pairs syntactic and morphological data and the morphology coproduct. We reinterpret in this setting certain operations of Distributed Morphology as transformations that allow for flexibility in moving the boundary between syntax and morphology within the morphosyntactic objects.
在合并和强最小论理论的数学配方范围内,我们展示了一种形态-合成界面的数学模型。在这种背景下,形态学具有组成特性,对文字形成负责,形成成形成形,形成成形树的岩浆。然而,与形态学不同的是,我们没有在形态学中运动。共同产品分解存在,但需要将形态学树的系列扩展至仅由岩浆生成的形态学树之外,以更大的形态学输入合成树。这些都参与了形态合成树的形成,作为经演算的代数,以及代数树与经演算的对应。因此,形态学树的结构形成过程可以用这种外形对应来描述,即组合合成和形态学数据以及形态学产品。我们在此将分解的形态学的某些功能重新解释,作为在形态学和形态学对象之间移动边界的灵活性。
Article 264
Title@2025-06-30 (1): Linearly Decoding Refused Knowledge in Aligned Language Models
Title: Linearly Decoding Refused Knowledge in Aligned Language Models | Lineare Dekodierung verweigerten Wissens in ausgerichteten Sprachmodellen | 在对齐语言模型中线性解码被拒绝的知识 2507.00239v1 |
Authors (2): Aryan Shrivastava, Ari Holtzman
Most commonly used language models (LMs) are instruction-tuned and aligned using a combination of fine-tuning and reinforcement learning, causing them to refuse user requests deemed harmful by the model. However, jailbreak prompts can often bypass these refusal mechanisms and elicit harmful responses. In this work, we study the extent to which information accessed via jailbreak prompts is decodable using linear probes trained on LM hidden states. We show that a great deal of initially refused information is linearly decodable. For example, across models, the response of a jailbroken LM for the average IQ of a country can be predicted by a linear probe with Pearson correlations exceeding 0.8. Surprisingly, we find that probes trained on base models (which do not refuse) sometimes transfer to their instruction-tuned versions and are capable of revealing information that jailbreaks decode generatively, suggesting that the internal representations of many refused properties persist from base LMs through instruction-tuning. Importantly, we show that this information is not merely “leftover” in instruction-tuned models, but is actively used by them: we find that probe-predicted values correlate with LM-generated pairwise comparisons, indicating that the information decoded by our probes aligns with suppressed generative behavior that may be expressed more subtly in other downstream tasks. Overall, our results suggest that instruction-tuning does not wholly eliminate or even relocate harmful information in representation space; it merely suppresses its direct expression, leaving it both linearly accessible and indirectly influential in downstream behavior.
大多数常用的语言模型(LM)都经过指令微调,并通过微调与强化学习相结合的方式进行对齐,使其拒绝模型认为有害的用户请求。然而,越狱提示往往可以绕过这些拒绝机制并引出有害回应。在这项工作中,我们研究通过越狱提示获取的信息在多大程度上可以用在LM隐藏状态上训练的线性探针解码。我们表明,大量最初被拒绝的信息是线性可解码的。例如,在各个模型中,越狱后的LM对某个国家平均智商的回答可以由线性探针预测,皮尔逊相关系数超过0.8。令人惊讶的是,我们发现在基础模型(不会拒绝)上训练的探针有时可以迁移到其指令微调版本,并能够揭示越狱以生成方式解码出的信息,这表明许多被拒绝属性的内部表征从基础LM一直保留到指令微调之后。重要的是,我们表明这些信息在指令微调模型中不仅仅是“残留”,而是被模型主动使用:我们发现探针预测的值与LM生成的成对比较结果相关,表明探针解码出的信息与被抑制的生成行为一致,而这种行为可能在其他下游任务中以更隐蔽的方式表现出来。总体而言,我们的结果表明,指令微调并未在表征空间中完全消除甚至转移有害信息,而只是抑制了其直接表达,使其既可被线性访问,又能间接影响下游行为。
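The probing setup in the abstract above (fit a linear map from hidden states to a scalar attribute, then report Pearson correlation on held-out examples) can be sketched as follows. The data here is synthetic and the hidden states are random vectors, which is purely an assumption for illustration; real experiments would use activations extracted from a language model. The probe is fitted with plain gradient descent to keep the sketch dependency-free.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fit_linear_probe(features, targets, lr=0.05, steps=500):
    """Least-squares linear probe fitted by batch gradient descent."""
    dim, n = len(features[0]), len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(steps):
        errors = [dot(weights, x) + bias - y for x, y in zip(features, targets)]
        grad_w = [sum(e * x[j] for e, x in zip(errors, features)) / n for j in range(dim)]
        grad_b = sum(errors) / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def demo(seed=0):
    """Synthetic check: the attribute is a noisy linear function of the 'hidden state'."""
    rng = random.Random(seed)
    true_w = [1.0, -2.0, 0.5, 0.0]  # hypothetical direction encoding the attribute
    hidden = [[rng.gauss(0, 1) for _ in true_w] for _ in range(200)]
    attribute = [dot(true_w, h) + rng.gauss(0, 0.1) for h in hidden]
    weights, bias = fit_linear_probe(hidden[:150], attribute[:150])
    predictions = [dot(weights, h) + bias for h in hidden[150:]]
    return pearson(predictions, attribute[150:])

print("held-out Pearson r:", round(demo(), 3))
```

A high held-out correlation only shows that the attribute is linearly readable from the representation; whether the model uses that information is a separate question, which the abstract addresses through the pairwise-comparison analysis.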
Article 265
Title@2025-06-30 (1): Interpretable AI for Time-Series: Multi-Model Heatmap Fusion with Global Attention and NLP-Generated Explanations
Title: Interpretable AI for Time-Series: Multi-Model Heatmap Fusion with Global Attention and NLP-Generated Explanations | Interpretierbare KI für die Time-Serie: Multi-Model Heatmap Fusion mit globaler Aufmerksamkeit und NLP-generierten Erklärungen | 时间序列可解释的 AI:全球关注的多模型热图融合和NLP - 引人注意的解释 2507.00234v1 |
Authors (2): Jiztom Kavalakkatt Francis, Matthew J Darr
In this paper, we present a novel framework for enhancing model interpretability by integrating heatmaps produced separately by ResNet and a restructured 2D Transformer with globally weighted input saliency. We address the critical problem of spatial-temporal misalignment in existing interpretability methods, where convolutional networks fail to capture global context and Transformers lack localized precision, a limitation that impedes actionable insights in safety-critical domains like healthcare and industrial monitoring. Our method merges gradient-weighted activation maps (ResNet) and Transformer attention rollout into a unified visualization, achieving full spatial-temporal alignment while preserving real-time performance. Empirical evaluations on clinical (ECG arrhythmia detection) and industrial (energy consumption prediction) datasets demonstrate significant improvements: the hybrid framework achieves 94.1% accuracy (F1 0.93) on the PhysioNet dataset and reduces regression error to RMSE = 0.28 kWh (R2 = 0.95) on the UCI Energy Appliance dataset, outperforming standalone ResNet, Transformer, and InceptionTime baselines by 3.8-12.4%. An NLP module translates fused heatmaps into domain-specific narratives (e.g., “Elevated ST-segment between 2-4 seconds suggests myocardial ischemia”), validated via BLEU-4 (0.586) and ROUGE-L (0.650) scores. By formalizing interpretability as causal fidelity and spatial-temporal alignment, our approach bridges the gap between technical outputs and stakeholder understanding, offering a scalable solution for transparent, time-aware decision-making.
本文提出一个新的框架,通过将ResNet和重构的2D Transformer分别生成的热图与全局加权的输入显著性相融合来增强模型可解释性。我们针对现有可解释性方法中时空错位这一关键问题:卷积网络无法捕捉全局上下文,而Transformer缺乏局部精度,这一局限妨碍了在医疗和工业监测等安全关键领域获得可付诸行动的洞见。我们的方法将梯度加权激活图(ResNet)与Transformer注意力展开(attention rollout)合并为统一的可视化,在保持实时性能的同时实现完整的时空对齐。在临床(ECG心律失常检测)和工业(能耗预测)数据集上的实证评估显示出显著改进:混合框架在PhysioNet数据集上达到94.1%的准确率(F1 0.93),并在UCI Energy Appliance数据集上将回归误差降至RMSE = 0.28 kWh(R2 = 0.95),比独立的ResNet、Transformer和InceptionTime基线高出3.8-12.4%。NLP模块将融合后的热图转化为特定领域的叙述(例如“2-4秒之间ST段抬高提示心肌缺血”),并通过BLEU-4(0.586)和ROUGE-L(0.650)得分加以验证。通过将可解释性形式化为因果忠实度和时空对齐,我们的方法弥合了技术输出与利益相关者理解之间的差距,为透明且具备时间感知的决策提供了可扩展的解决方案。
Article 266
Title@2025-06-30 (1): A Graph-Based Classical and Quantum Approach to Deterministic L-System Inference
Title: A Graph-Based Classical and Quantum Approach to Deterministic L-System Inference | Ein auf Graphen basierender klassischer und Quantumansatz zur deterministischen L-System-Inferenz | 基于图的确定性L-系统推断的经典与量子方法 2411.19906v3 |
Authors (3): Ali Lotfi, Ian McQuillan, Steven Rayan
L-systems can be used to model and simulate many biological processes, such as plant development. Finding an L-system for a given process is typically done by hand, by experts, in a massively time-consuming process. It would be significant if this could be done automatically from data, such as from sequences of images. In this paper, we are interested in inferring a particular type of L-system, the deterministic context-free L-system (D0L-system), from a sequence of strings. We introduce the characteristic graph of a sequence of strings, which we then utilize to translate our problem (inferring D0L-systems) in polynomial time into the maximum independent set problem (MIS) and the SAT problem. After that, we offer a classical exact algorithm and an approximate quantum algorithm for the problem.
L-系统可以用来模拟和模拟许多生物过程,例如植物开发。为某一过程寻找L-系统通常由专家亲手在耗费大量时间的进程中用手解决。如果能够从数据(例如图像序列)中自动解决,将具有重要意义。在本文件中,我们有兴趣从字符串序列中推断出一种特定的L-系统类型,即确定性环境-L系统(D0L-System)。我们引入了字符串序列的特征图,然后我们用它来将我们的问题(推断D0L-Systems)在多时段转化为最独立的设置问题(MIS)和SAT问题。之后,我们为问题提供了一种典型的精确算法和大概的量算法。
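To make the inference problem above concrete, the sketch below shows the forward direction only: how a D0L-system rewrites every symbol in parallel to produce a sequence of strings, and how a candidate rule set can be checked against an observed sequence. It does not implement the characteristic-graph, MIS, or SAT reductions from the paper; the example system is Lindenmayer's well-known algae model, and the function names are illustrative.

```python
def derive(axiom, rules, steps):
    """Return the strings w_0, w_1, ..., w_steps generated by a D0L-system."""
    sequence = [axiom]
    for _ in range(steps):
        current = sequence[-1]
        # Every symbol is rewritten in parallel; symbols without a rule map to themselves.
        sequence.append("".join(rules.get(symbol, symbol) for symbol in current))
    return sequence

def is_consistent(rules, sequence):
    """Check whether a candidate rule set reproduces an observed string sequence."""
    return derive(sequence[0], rules, len(sequence) - 1) == sequence

# Lindenmayer's algae system: A -> AB, B -> A
algae = {"A": "AB", "B": "A"}
observed = derive("A", algae, 4)
print(observed)  # ['A', 'AB', 'ABA', 'ABAAB', 'ABAABABA']
print(is_consistent(algae, observed), is_consistent({"A": "A", "B": "A"}, observed))
```

Inference is the inverse task: given only `observed`, find a rule set for which `is_consistent` returns true, which is what the reductions in the abstract are designed to solve.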
Article 267
Title@2025-06-30 (1): Towards Style Alignment in Cross-Cultural Translation
Title: Towards Style Alignment in Cross-Cultural Translation | Auf dem Weg zur Stilausrichtung in kulturübergreifender Übersetzung | 实现跨文化翻译的风格一致 2507.00216v1 |
Authors (4): Shreya Havaldar, Adam Stein, Eric Wong, Lyle Ungar
Successful communication depends on the speaker’s intended style (i.e., what the speaker is trying to convey) aligning with the listener’s interpreted style (i.e., what the listener perceives). However, cultural differences often lead to misalignment between the two; for example, politeness is often lost in translation. We characterize the ways that LLMs fail to translate style, biasing translations towards neutrality and performing worse in non-Western languages. We mitigate these failures with RASTA (Retrieval-Augmented STylistic Alignment), a method that leverages learned stylistic concepts to encourage LLM translation to appropriately convey cultural communication norms and align style.
成功交流取决于演讲者的用意风格(即演讲者试图传达什么)与听众的用意一致(即听众的感知)。然而,文化差异往往导致两者之间的不协调;例如,在翻译中往往失去礼貌。我们描述LLMS不翻译风格的方式——偏向于中立翻译,在非西方语言中表现更差。我们用RASTA(RETREVAL-Augiveed Stylistic Stylistolation)(Ratreval-Agiving Stylistic Againation)来缓解这些失败,这一方法利用学过文学概念鼓励LLM翻译适当传达文化交流规范和一致风格。
Article 268
Title@2025-06-30 (1): Two-Stage Reasoning-Infused Learning: Improving Classification with LLM-Generated Reasoning
Title: Two-Stage Reasoning-Infused Learning: Improving Classification with LLM-Generated Reasoning | Zweistufiges Reasoning-infused Learning: Verbesserung der Klassifizierung mit LLM-generierter Reasoning | 双级推理学习:改进以LLM为主的理由分类 2507.00214v1 |
Authors (2): Mads Henrichsen, Rasmus Krebs
Standard classification models often map inputs directly to labels without explicit reasoning, potentially limiting their performance, robustness, and interpretability. This paper introduces a novel two-stage approach to enhance text classification by leveraging Large Language Model (LLM)-generated reasonings. In the first stage, we fine-tune a Llama-3.2-1B-Instruct model (henceforth Llama-R-Gen) on a general-purpose reasoning dataset (syvai/reasoning-gen) to generate textual reasoning (R) given a question and its answer. In the second stage, this generally trained Llama-R-Gen is used offline to create an augmented training dataset for a downstream generative model. This downstream model, based on Llama-3.2-1B-Instruct, takes only the input text (Q) and is trained to output the generated reasoning (R) immediately followed by the predicted emotion (A). We demonstrate this methodology on the dair-ai/emotion dataset for emotion classification. Our experiments show that the generative model trained to output reasoning and the emotion (Classifier Q->RA) achieves a significant improvement of 8.7 percentage points in accuracy (for emotion prediction) compared to a baseline generative model trained solely to output the emotion (Classifier Q->A), highlighting the strong generalization capabilities of the reasoning generation and the benefit of explicit reasoning training. This work underscores the potential of LLM-generated reasonings for creating richer training datasets, thereby improving the performance of diverse downstream NLP tasks and providing explicit explanations.
标准分类模型通常在没有显式推理的情况下将输入直接映射到标签,这可能限制其性能、稳健性和可解释性。本文提出一种新的两阶段方法,利用大型语言模型(LLM)生成的推理来增强文本分类。在第一阶段,我们在通用推理数据集(syvai/reasoning-gen)上微调Llama-3.2-1B-Instruct模型(下称Llama-R-Gen),使其在给定问题及其答案的情况下生成文本推理(R)。在第二阶段,这个经过通用训练的Llama-R-Gen被离线用于为下游生成模型构建增强的训练数据集。该下游模型基于Llama-3.2-1B-Instruct,仅以输入文本(Q)为输入,并被训练为先输出生成的推理(R),紧接着输出预测的情绪(A)。我们在用于情绪分类的dair-ai/emotion数据集上演示了这一方法。实验表明,被训练为同时输出推理和情绪的生成模型(分类器Q->RA)在情绪预测准确率上比仅被训练为输出情绪的基线生成模型(分类器Q->A)显著提高了8.7个百分点,凸显了推理生成的强泛化能力以及显式推理训练的益处。这项工作强调了LLM生成的推理在构建更丰富训练数据集方面的潜力,从而提升多种下游NLP任务的性能并提供显式解释。
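The data flow of the two-stage recipe above can be sketched as a small dataset-building step. Everything concrete here is an assumption made for illustration: `generate_reasoning` is a hypothetical stand-in for the stage-one reasoning model, and the target layout (reasoning, newline, label) is one possible serialization, not the format used by the authors.

```python
def build_augmented_examples(examples, generate_reasoning):
    """Turn (text, label) pairs into Q -> RA training examples.

    generate_reasoning(text, label) plays the role of the offline stage-one
    model that explains why `label` fits `text`.
    """
    augmented = []
    for text, label in examples:
        reasoning = generate_reasoning(text, label)
        augmented.append({"input": text, "target": f"{reasoning}\n{label}"})
    return augmented

def parse_label(generated):
    """Recover the predicted label as the last non-empty line of a Q -> RA output."""
    return generated.strip().splitlines()[-1].strip()

# Stub reasoning generator, used only to make the sketch runnable.
def stub_reasoner(text, label):
    return f"The text '{text}' expresses feelings typical of {label}."

data = [("i can't stop smiling today", "joy"), ("i miss her so much", "sadness")]
for row in build_augmented_examples(data, stub_reasoner):
    print(row["target"], "->", parse_label(row["target"]))
```

Placing the label after the reasoning means the downstream model conditions its prediction on the explanation it has just generated, which is the mechanism the abstract credits for the accuracy gain.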
Article 269
Title@2025-06-30 (1): LineRetriever: Planning-Aware Observation Reduction for Web Agents
Title: LineRetriever: Planning-Aware Observation Reduction for Web Agents | LineRetriever: Planning-Aware-Beobachtungsreduktion für Web-Agenten | 线检索: 网络代理的规划-软件观测减少 2507.00210v1 |
Authors (9): Imene Kerboua, Sahar Omidi Shayegan, Megh Thakkar, Xing Han Lù, Massimo Caccia, Véronique Eglin, Alexandre Aussem, Jérémy Espinas, Alexandre Lacoste
While large language models have demonstrated impressive capabilities in web navigation tasks, the extensive context of web pages, often represented as DOM or Accessibility Tree (AxTree) structures, frequently exceeds model context limits. Current approaches like bottom-up truncation or embedding-based retrieval lose critical information about page state and action history. This is particularly problematic for adaptive planning in web agents, where understanding the current state is essential for determining future actions. We hypothesize that embedding models lack sufficient capacity to capture plan-relevant information, especially when retrieving content that supports future action prediction. This raises a fundamental question: how can retrieval methods be optimized for adaptive planning in web navigation tasks? In response, we introduce LineRetriever, a novel approach that leverages a language model to identify and retrieve observation lines most relevant to future navigation steps. Unlike traditional retrieval methods that focus solely on semantic similarity, LineRetriever explicitly considers the planning horizon, prioritizing elements that contribute to action prediction. Our experiments demonstrate that LineRetriever can reduce the size of the observation at each step for the web agent while maintaining consistent performance within the context limitations.
虽然大型语言模型在网络导航任务中表现出了令人印象深刻的能力,但网页的广泛背景(通常以DOM 或 AxTree (AxTree) 结构为代表)常常超过模型背景限制。目前的做法,例如自下而上的短跑或嵌入式检索,失去了关于页面状态和动作历史的关键信息。这对于网络代理的适应性规划来说特别有问题,因为了解当前状态对于决定未来行动至关重要。我们假设嵌入模型缺乏足够的能力来捕捉与计划有关的信息,特别是在检索支持未来行动预测的内容时。这提出了一个基本问题:如何优化网络导航任务适应规划的检索方法?作为回应,我们引入了\ textit{LineRetriever},这是一种新颖的方法,利用一种语言模型来识别和检索与未来导航步骤最相关的观测线。与仅仅侧重于语系相似性的传统检索方法不同, \ textitit{LineRetri} 明确考虑规划地平线, 优先考虑有助于行动预测的要素。我们的实验表明, 如何优化网络代理在每一步骤上保持持续运行限制。
Article 270
Title@2025-06-30 (1): Language Models Might Not Understand You: Evaluating Theory of Mind via Story Prompting
Title: Language Models Might Not Understand You: Evaluating Theory of Mind via Story Prompting | Sprachmodelle verstehen dich vielleicht nicht: Theorie des Geistes über Story Prompting bewerten | 语言模型可能无法理解你:通过故事提示评估心理理论 2506.19089v2 |
Authors (2): Nathaniel Getachew, Abulhair Saparov
We introduce StorySim, a programmable framework for synthetically generating stories to evaluate the theory of mind (ToM) and world modeling (WM) capabilities of large language models (LLMs). Unlike prior benchmarks that may suffer from contamination in pretraining data, StorySim produces novel, compositional story prompts anchored by a highly controllable Storyboard, enabling precise manipulation of character perspectives and events. We use this framework to design first- and second-order ToM tasks alongside WM tasks that control for the ability to track and model mental states. Our experiments across a suite of state-of-the-art LLMs reveal that most models perform better on WM tasks than on ToM tasks, and that models tend to reason better about humans than about inanimate objects. Additionally, our framework enabled us to find evidence of heuristic behavior such as recency bias and an over-reliance on earlier events in the story. All code for generating data and evaluations is freely available.
我们引入了 $\ textt{ storySim} $\ textt{ storySim} $\ textt{ storySim} $\ textt{ storySim $, 一个合成生成故事的可编程框架, 用于评估大型语言模型( LLM) 的思维理论( ToM) 和世界建模( WM) 能力。 与先前的基准不同, $\ textt{ storySim} 美元产生新的、 编造故事, 由高度可控的 $\ texttt{ storyStorySimboard} 美元支撑, 使得能够精确地操纵字符视角和事件。 我们使用这个框架来设计一和二等的 TOM 任务, 与 WM 任务一起控制跟踪和模拟精神状态的 WM 任务。 我们在一组最先进的LMSM 中进行的实验显示, 多数模型在WM 任务上的表现比 Tom 任务比 Tom 任务要好, 而模型往往比 与无动性对象物体进行更好的推理。 此外, 我们的框架让我们能够找到 诸如 偏差偏差和过分行为的证据行为的证据 证据 。 所有代码 数据和评估 。
Article 271
Title@2025-06-30 (1): RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
Title: RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression | RocketKV: Beschleunigung der Langkontext-LLM-Inferenz über zweistufige KV-Cache-Kompression | RocketKV: 通过两步KV缓存压缩加速长文本LLM推导 2502.14051v2 |
Authors (6): Payman Behnam, Yaosheng Fu, Ritchie Zhao, Po-An Tsai, Zhiding Yu, Alexey Tumanov
Transformer-based Large Language Models rely critically on the KV cache to efficiently handle extended contexts during the decode phase. Yet, the size of the KV cache grows proportionally with the input length, burdening both memory bandwidth and capacity as decoding progresses. To address this challenge, we present RocketKV, a training-free KV cache compression strategy containing two consecutive stages. In the first stage, it performs coarse-grain permanent KV cache eviction on the input sequence tokens. In the second stage, it adopts a hybrid sparse attention method to conduct fine-grain top-k sparse attention, approximating the attention scores by leveraging both head and sequence dimensionality reductions. We show that RocketKV provides a compression ratio of up to 400x, end-to-end speedup of up to 3.7x as well as peak memory reduction of up to 32.6% in the decode phase on an NVIDIA A100 GPU compared to the full KV cache baseline, while achieving negligible accuracy loss on a variety of long-context tasks. We also propose a variant of RocketKV for multi-turn scenarios, which consistently outperforms other existing methods and achieves accuracy nearly on par with an oracle top-k attention scheme.
以变换器为基础的大语言模型非常依赖 KV 缓存,以便在解码阶段有效处理扩展的背景。 然而, KV 缓存的大小随着输入长度的长度而成比例地增加, 使记忆带宽和解码能力随进展而增加。 为了应对这一挑战, 我们提出 RockKV , 这是一种无培训的KV 缓存压缩战略, 包含连续两个阶段。 在第一阶段, 它在输入序列符号上执行粗格永久 KV 缓存迁移。 在第二阶段, 它采用混合的稀疏关注方法, 进行细微的顶层稀疏关注, 通过利用头部和顺序的维度削减来接近关注分数。 我们显示RockKV 提供了高达400美元时的压缩率, 最后到尾部的加速, 包括两个连续的阶段。 在 NVIDIA A100 GPU 的解码阶段, 最多减少32.6 % 的内存量。 与完整的 KV 缓存基线相比, 它在各种长档任务上取得了微不足道的精度损失, 同时提出了一种近乎于最高级的火箭K 方案, 。
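To make the top-k sparse attention idea concrete, the sketch below attends over only the k highest-scoring cached keys for a single query vector. It is closest to the oracle top-k scheme the abstract mentions as a reference point: it selects keys using exact scores, whereas RocketKV approximates the scores with head and sequence dimensionality reductions, which are not shown here. The function name and list-based layout are illustrative assumptions, and k is assumed to be at least 1.

```python
import math


def topk_sparse_attention(query, keys, values, k):
    """Attend only to the k keys with the highest dot-product score.

    query: list[float]; keys and values: lists of equal-length float lists.
    Exact scores pick the top-k here (an oracle-style selection, assumed
    for illustration; it is not the approximation RocketKV uses).
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * x for q, x in zip(query, key)) for key in keys]
    # Indices of the k largest scores; the stable sort breaks ties by position.
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax restricted to the selected positions, shifted for stability.
    peak = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - peak) for i in top]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, i in zip(weights, top):
        for d, v in enumerate(values[i]):
            out[d] += (w / total) * v
    return out, sorted(top)
```

With k equal to the number of cached keys this reduces to ordinary softmax attention, which gives a simple baseline for checking any approximate selection rule.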
Article 272
Title@2025-06-30 (1): Evaluating Deduplication Techniques for Economic Research Paper Titles with a Focus on Semantic Similarity using NLP and LLMs
Title: Evaluating Deduplication Techniques for Economic Research Paper Titles with a Focus on Semantic Similarity using NLP and LLMs | Bewertung von Deduplikationstechniken für Wirtschaftsforschungspapiertitel mit Fokus auf semantische Ähnlichkeit mit NLP und LLM | 利用NLP和LLMs评估经济研究论文标题的应用技术,重点是语义相似性 2410.01141v3 |
Authors (2): Doohee You, S Fraiberger
This study investigates efficient deduplication techniques for a large NLP dataset of economic research paper titles. We explore various pairing methods alongside established distance measures (Levenshtein distance, cosine similarity) and an sBERT model for semantic evaluation. Our findings suggest a potentially low prevalence of duplicates, based on the semantic similarity observed across the different methods. A further evaluation against a human-annotated ground truth set was completed for a more conclusive assessment; its results support the findings from the NLP- and LLM-based distance metrics.
这项研究调查了经济研究论文标题的大型NLP数据集的有效解重复技术,我们探索了各种配对方法以及既定的距离测量(Levenshtein距离,相近性)和用于语义评估的SBERT模型。我们的调查结果表明,根据观察到的不同方法的语义相似性,重复率可能较低。进一步探索带有人文注释的地面真相集的工作已经完成,以便进行更结论性的评估。结果支持了NLP基于LM的距离测量结果。
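The two established distance measures named in the abstract can be sketched in a few lines of standard-library Python. This is an illustration only: the bag-of-words cosine similarity, the helper `is_candidate_duplicate`, and both thresholds are assumptions rather than the study's settings, and the sBERT-based semantic comparison is not shown.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between lowercase bag-of-words count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def is_candidate_duplicate(a: str, b: str,
                           max_edit_ratio: float = 0.1,
                           min_cosine: float = 0.9) -> bool:
    """Flag a title pair when either measure says 'near-identical'.

    Both thresholds are hypothetical defaults chosen for illustration.
    """
    ratio = levenshtein(a.lower(), b.lower()) / max(len(a), len(b), 1)
    return ratio <= max_edit_ratio or cosine_similarity(a, b) >= min_cosine
```

Pairs flagged this way would still need the semantic check and the human-annotated comparison the abstract describes before being treated as true duplicates.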
Article 273
Title@2025-06-30 (1): Prompting as Scientific Inquiry
Title: Prompting as Scientific Inquiry | Als wissenschaftliche Untersuchung prompt | 作为科学调查 2507.00163v1 |
Authors (2): Ari Holtzman, Chenhao Tan
Prompting is the primary method by which we study and control large language models. It is also one of the most powerful: nearly every major capability attributed to LLMs (few-shot learning, chain-of-thought, constitutional AI) was first unlocked through prompting. Yet prompting is rarely treated as science and is frequently frowned upon as alchemy. We argue that this is a category error. If we treat LLMs as a new kind of complex and opaque organism that is trained rather than programmed, then prompting is not a workaround: it is behavioral science. Mechanistic interpretability peers into the neural substrate; prompting probes the model in its native interface: language. We contend that prompting is not inferior, but rather a key component in the science of LLMs.
提示是我们研究和控制大型语言模式的主要方法。它也是最强大的方法之一:几乎每一个主要能力都来自LLMS-few-shot-shot-iness、思维链、宪法的AI-首先通过推动而解开。然而,催化很少被作为科学来对待,常常被冷却为炼金术。我们争论说,这是一个类别错误。如果我们把LLMS视为一种经过培训而不是编程的新型复杂和不透明的有机体,那么,催化并不是一种变通办法:它就是行为科学。机械可解释性同行进入神经基,促使它在其本地界面(语言)中探究模型。我们争论说,催化并不是劣等的,而是LMS科学中的一个关键组成部分。
Article 274
Title@2025-06-30 (1): Table Understanding and (Multimodal) LLMs: A Cross-Domain Case Study on Scientific vs. Non-Scientific Data
Title: Table Understanding and (Multimodal) LLMs: A Cross-Domain Case Study on Scientific vs. Non-Scientific Data | Tabelle Verständnis und (Multimodale) LLMs: Eine Cross-Domain-Fallstudie zu wissenschaftlichen vs. nicht wissenschaftlichen Daten | 理学与非科学数据交叉案例研究 2507.00152v1 |
Authors (8): Ekaterina Borisova, Fabio Barth, Nils Feldhus, Raia Abu Ahmad, Malte Ostendorff, Pedro Ortiz Suarez, Georg Rehm, Sebastian Möller
Tables are among the most widely used tools for representing structured data in research, business, medicine, and education. Although LLMs demonstrate strong performance in downstream tasks, their efficiency in processing tabular data remains underexplored. In this paper, we investigate the effectiveness of both text-based and multimodal LLMs on table understanding tasks through a cross-domain and cross-modality evaluation. Specifically, we compare their performance on tables from scientific vs. non-scientific contexts and examine their robustness on tables represented as images vs. text. Additionally, we conduct an interpretability analysis to measure context usage and input relevance. We also introduce the TableEval benchmark, comprising 3017 tables from scholarly publications, Wikipedia, and financial reports, where each table is provided in five different formats: Image, Dictionary, HTML, XML, and LaTeX. Our findings indicate that while LLMs maintain robustness across table modalities, they face significant challenges when processing scientific tables.
这些表格是代表研究、商业、医学和教育领域结构化数据的最广泛使用的工具之一。虽然LLMS在下游任务中表现良好,但它们处理表单数据的效率仍未得到充分探讨。在本文件中,我们通过跨主题和跨模式评估,调查表格理解任务中的文本和多式LMS的有效性。具体地说,我们比较其在科学和非科学背景表格中的绩效,并检查其作为图像与文本的表格的稳健性。此外,我们进行了可解释性分析,以衡量背景使用情况和投入相关性。我们还采用了表Eval基准,包括3 017个来自学术出版物、维基百科和财务报告的表格,其中每个表格以五种不同格式提供:图像、对立、HTML、XM和LaTeX。我们的调查结果表明,LMS在处理科学表格时,虽然在表格之间保持稳健性,但面临重大挑战。
Article 275
Title@2025-06-30 (1): On the Predictive Power of Representation Dispersion in Language Models
Title: On the Predictive Power of Representation Dispersion in Language Models | Zur vorausschauenden Macht der Repräsentationsdispersion in Sprachmodellen | 语文模式代表性分布的预测力 2506.24106v1 |
Authors (4): Yanhong Li, Ming Li, Karen Livescu, Jiawei Zhou
We show that a language model’s ability to predict text is tightly linked to the breadth of its embedding space: models that spread their contextual representations more widely tend to achieve lower perplexity. Concretely, we find that representation dispersion - the average pairwise cosine distance among hidden vectors - strongly and negatively correlates with perplexity across diverse model families (LLaMA, Qwen, and others) and domains (Wikipedia, news, scientific abstracts). Beyond illustrating this link, we show how dispersion can be leveraged for a range of practical tasks without requiring labeled data. First, measuring dispersion on unlabeled text allows us to predict downstream accuracy in new domains, offering a data-efficient tool for model selection. Next, we find that identifying layers with higher dispersion pinpoints the best representations for retrieval-based methods such as kNN-LM, bypassing exhaustive layer-by-layer searches. Finally, we integrate a simple push-away objective into training, which increases dispersion in both single-domain and cross-domain scenarios and directly improves perplexity in each.
我们发现,语言模型预测文本的能力与其嵌入空间的广度紧密相连:那些更广泛地传播其背景表达方式的模型往往会达到更低的不易理解性。具体地说,我们发现代表分布 — — 隐藏矢量之间平均对对齐的共弦距离 — — 与不同模式家庭(LLAMA、Qwen等)和域(Wikipedia、新闻、科学摘要)的不易理解性有着强烈和负面的联系。我们除了说明这一链接外,还显示如何在不需要标签数据的情况下将分散用于一系列实际任务。首先,测量未标文本的分散性使我们能够预测新域的下游精度,为模式选择提供数据效率高的工具。接下来,我们发现,识别偏差层能够确定基于检索的方法的最佳表达方式,例如 kNNN-LM,绕过无遗层逐层搜索。最后,我们将一个简单的推离目标纳入培训,从而增加单层和跨层情景的分散性,直接改进每个场点的重复性。
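The dispersion statistic defined in the abstract, the average pairwise cosine distance among hidden vectors, can be computed directly. The sketch below takes the definition literally and assumes non-zero vectors given as plain lists; which layer and which tokens the vectors come from is not specified in the abstract and is left to the caller.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def representation_dispersion(vectors):
    """Average cosine distance over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)
```

Because the number of pairs grows quadratically with the number of vectors, a practical use on unlabeled text would likely subsample the vectors first; that choice is an assumption, not something the abstract states.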
Article 276
Title@2025-06-30 (1): Knowing You Don’t Know: Learning When to Continue Search in Multi-round RAG through Self-Practicing
Title: Knowing You Don’t Know: Learning When to Continue Search in Multi-round RAG through Self-Practicing | Wissen, dass Sie nicht wissen: Lernen, wann Sie die Suche in Multi-round RAG durch Selbst-Praktiken fortsetzen | 了解您不知道: 学习何时通过自我实践在多轮RAG中继续搜索 2505.02811v2 |
Authors (4): Diji Yang, Linda Zeng, Jinmeng Rao, Yi Zhang
Retrieval Augmented Generation (RAG) has shown strong capability in enhancing language models’ knowledge and reducing AI generative hallucinations, driving its widespread use. However, complex tasks requiring multi-round retrieval remain challenging, and early attempts tend to be overly optimistic without a good sense of self-skepticism. Current multi-round RAG systems may continue searching even when enough information has already been retrieved, or they may provide incorrect answers without having sufficient information or knowledge. Existing solutions either require large amounts of expensive human-labeled process supervision data or lead to subpar performance. This paper aims to address these limitations by introducing a new framework, SIM-RAG, to explicitly enhance RAG systems’ self-awareness and multi-round retrieval capabilities. To train SIM-RAG, we first let a RAG system self-practice multi-round retrieval, augmenting existing question-answer pairs with intermediate inner monologue reasoning steps to generate synthetic training data. For each pair, the system may explore multiple retrieval paths, which are labeled as successful if they reach the correct answer and unsuccessful otherwise. Using this data, we train a lightweight information sufficiency Critic. At inference time, the Critic evaluates whether the RAG system has retrieved sufficient information at each round, guiding retrieval decisions and improving system-level self-awareness through in-context reinforcement learning. Experiments across multiple prominent RAG benchmarks show that SIM-RAG is an effective multi-round RAG solution. Furthermore, this framework is system-efficient, adding a lightweight component to RAG without requiring modifications to existing LLMs or search engines, and data-efficient, eliminating the need for costly human-annotated mid-step retrieval process supervision data.
重新获取增强新一代公司(RAG)在提高语言模型知识和减少人工基因突变幻觉、推动其广泛使用方面表现出很强的能力。然而,需要多轮检索的复杂任务仍然具有挑战性,早期尝试往往过于乐观,没有良好的自我怀疑感。目前的多轮检索系统即使已经检索到足够的信息,也可能继续搜索,或者在没有足够信息或知识的情况下提供不正确的答案。现有的解决方案要么需要大量昂贵的人类标志性进程监督数据,要么导致低级性能。本文旨在通过引入一个新的框架SIM-RAG来克服这些限制,明确加强RAG系统的自我意识和多轮检索能力。为了培训SIM-RAG,我们首先让RAG系统自我操作的多轮检索系统自行操作,用中间的单词推理推理步骤增加现有的问答配对来生成合成培训数据。对于每对夫妇来说,系统可以探索多重检索路径,如果达到简单答案或不成功的话,这些路径将被标记为成功。使用这一数据,我们训练一个轻度的 RAG-RG 级系统在不断更新的系统内部检索过程中是否有足够的自我评估。
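The inference-time role of the Critic (judge after each round whether enough information has been retrieved, and stop or continue accordingly) can be sketched as a simple loop. Everything here is a hypothetical outline: `retrieve`, `answer`, and `critic` are caller-supplied callables, and the query-refinement step is an assumption, since the abstract does not say how the next query is formed.

```python
from typing import Callable, List, Tuple


def multi_round_rag(
    question: str,
    retrieve: Callable[[str], List[str]],
    answer: Callable[[str, List[str]], str],
    critic: Callable[[str, List[str], str], bool],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Retrieve, draft an answer, and stop as soon as the critic is satisfied.

    Returns the final draft and the number of retrieval rounds used.
    """
    context: List[str] = []
    query = question
    draft = ""
    for round_idx in range(1, max_rounds + 1):
        context.extend(retrieve(query))
        draft = answer(question, context)
        if critic(question, context, draft):
            return draft, round_idx
        # Assumed refinement: fold the current draft into the next query.
        query = f"{question} {draft}"
    return draft, max_rounds
```

Capping the loop with `max_rounds` keeps the system from searching indefinitely when the critic never accepts, which is one of the failure modes the abstract attributes to current multi-round systems.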
Article 277
Title@2025-06-30 (1): SEUF: Is Unlearning One Expert Enough for Mixture-of-Experts LLMs?
Title: SEUF: Is Unlearning One Expert Enough for Mixture-of-Experts LLMs? | SEUF: Reicht es für LLMs für Mixture-of-Experts aus, einen Experten zu lernen? | SEUF:不学习一位专家是否足以使混合专家LLM公司受益? 2411.18797v2 |
Authors (7): Haomin Zhuang, Yihua Zhang, Kehan Guo, Jinghan Jia, Gaowen Liu, Sijia Liu, Xiangliang Zhang
Recent advancements in LLM unlearning have shown remarkable success in removing unwanted data-model influences while preserving the model’s utility for legitimate knowledge. Despite these strides, sparse Mixture-of-Experts (MoE) LLMs, a key subset of the LLM family, have remained unexplored in the context of unlearning. As MoE LLMs are celebrated for their exceptional performance, we ask: How can unlearning be performed effectively and efficiently on MoE LLMs? Our pilot study shows that the dynamic routing nature of MoE LLMs introduces unique challenges, leading to excessive forgetting, uncontrolled knowledge erasure and substantial utility drops when existing unlearning methods are applied. To address this, we propose a novel Selected-Expert Unlearning Framework (SEUF). Through expert attribution, unlearning is concentrated on the most actively engaged experts for the specified knowledge. Concurrently, an anchor loss is applied to the router to stabilize the active state of this targeted expert, ensuring focused and controlled unlearning. SEUF is compatible with various standard unlearning algorithms. Extensive experiments demonstrate that SEUF enhances forget quality by up to 5% and model utility by 35% on MoE LLMs across various benchmarks and LLM architectures (compared to standard unlearning algorithms), while only unlearning 0.06% of the model parameters.
尽管取得了这些进步,但LLM Family-Have仍没有在未学习方面探索。随着MLMS的杰出表现,我们问:MLMS如何在MoE LLMS上有效和高效地学习?我们的实验研究表明,MOELMS的动态路由性质带来了独特的挑战,导致过度遗忘、知识失控消逝和在应用现有不学习方法时大量使用工具。为了解决这个问题,我们提出了一个新的“选择-专家不学习框架”(SEUF)。通过专家的归属,未学习集中在最积极从事特定知识的专家身上。与此同时,对路由器应用了锚值损失,以稳定目标专家的积极状态,确保有重点和控制的不学习。SEUF与各种标准的不学习算法相兼容,在应用现有不学习方法时导致过度遗忘、不受控制的知识消逝和大量工具下降。为了解决这个问题,我们提出了一个新的“选择-专家不学习框架”(SEUF),通过专家归属,未学习集中在最活跃的专家身上。同时,对路由路由器来稳定这个目标专家的积极状态,确保有重点和控制地不学习。SUF与各种标准的不学习方法兼容。广泛的实验表明SEUFSEUF提高质量质量,只有5 %和MLLLM标准,而没有学习标准。
Article 278
Title@2025-06-30 (1): MotionGPT3: Human Motion as a Second Modality
Title: MotionGPT3: Human Motion as a Second Modality | MotionGPT3: Menschliche Bewegung als zweite Modalität | MotionGPT3:人类运动作为第二模式 2506.24086v1 |
Authors (8): Bingfan Zhu, Biao Jiang, Sunyi Wang, Shixiang Tang, Tao Chen, Linjie Luo, Youyi Zheng, Xin Chen
Though recent advances in multimodal models have demonstrated strong capabilities and opportunities in unified understanding and generation, the development of unified motion-language models remains underexplored. To enable such models with high-fidelity human motion, two core challenges must be addressed. The first is the reconstruction gap between the continuous motion modality and discrete representation in an autoregressive manner, and the second is the degradation of language intelligence during unified training. Inspired by the mixture of experts, we propose MotionGPT3, a bimodal motion-language model that treats human motion as a second modality, decoupling motion modeling via separate model parameters and enabling both effective cross-modal interaction and efficient multimodal scaling training. To preserve language intelligence, the text branch retains the original structure and parameters of the pretrained language model, while a new motion branch is integrated via a shared attention mechanism, enabling bidirectional information flow between the two modalities. We first employ a motion Variational Autoencoder (VAE) to encode raw human motion into latent representations. Based on this continuous latent space, the motion branch predicts motion latents directly from intermediate hidden states using a diffusion head, bypassing discrete tokenization. Extensive experiments show that our approach achieves competitive performance on both motion understanding and generation tasks while preserving strong language capabilities, establishing a unified bimodal motion diffusion framework in an autoregressive manner.
尽管多式联运模式最近的进展在统一理解和形成方面表现出了强大的能力和机会,但统一运动语言模式的发展仍未得到充分探讨。为了使这些模式能够具有高度不忠的人类运动,必须应对两个核心挑战。第一是连续运动模式与以自动递进方式的离散代表性之间的重建差距,第二是统一培训期间语言智能的退化。在专家混合的启发下,我们提议采用双向自动计算3,一种双向运动语言模式,将人类运动作为第二种模式,通过不同的模型参数脱钩运动模式,促成有效的跨模式互动和有效的多式联运规模化培训。为维护语言智能,文本处保留预先培训的语言模式的原始结构和参数,同时通过共同关注机制将新的运动分支整合,使两种模式之间的双向信息流动。我们首先采用动态自动自动计算器将人类运动转化为潜在代表。基于这一连续的潜在空间,运动处预测从中间的隐藏国家直接潜伏性动态,同时利用一种强有力的灵活性模型,在一种动态上展示一种稳定的传播方式,同时利用一种稳定的机动性模型,在一种动态上展示一种强大的机动性模式上展示一种强大的自我传播能力。
Article 279
Title@2025-06-30 (1): STACK: Adversarial Attacks on LLM Safeguard Pipelines
Title: STACK: Adversarial Attacks on LLM Safeguard Pipelines | Gegenseitige Angriffe auf LLM Safeguard Pipelines | 对LLM保障管道的反向攻击 2506.24068v1 |
Authors (8): Ian R. McKenzie, Oskar J. Hollinsworth, Tom Tseng, Xander Davies, Stephen Casper, Aaron D. Tucker, Robert Kirk, Adam Gleave
Frontier AI developers are relying on layers of safeguards to protect against catastrophic misuse of AI systems. Anthropic guards their latest Claude 4 Opus model using one such defense pipeline, and other frontier developers including Google DeepMind and OpenAI pledge to soon deploy similar defenses. However, the security of such pipelines is unclear, with limited prior work evaluating or attacking these pipelines. We address this gap by developing and red-teaming an open-source defense pipeline. First, we find that a novel few-shot-prompted input and output classifier outperforms state-of-the-art open-weight safeguard model ShieldGemma across three attacks and two datasets, reducing the attack success rate (ASR) to 0% on the catastrophic misuse dataset ClearHarm. Second, we introduce a STaged AttaCK (STACK) procedure that achieves 71% ASR on ClearHarm in a black-box attack against the few-shot-prompted classifier pipeline. Finally, we also evaluate STACK in a transfer setting, achieving 33% ASR, providing initial evidence that it is feasible to design attacks with no access to the target pipeline. We conclude by suggesting specific mitigations that developers could use to thwart staged attacks.
AI 开发者依靠多层保障措施来防范对AI系统进行灾难性滥用。 人类保护他们最新的Claude 4 Opus模型使用一种这样的防御管道,以及其他边境开发者,包括谷歌DeepMind和OpAI承诺很快部署类似的防御系统。 然而,这些管道的安全尚不清楚,先前的工作评价有限,或对这些管道进行攻击。 我们通过开发和红对开放源码防御管道进行黑箱攻击,来弥补这一缺口。 首先,我们发现有一个新的微小的微小微速制成输入和输出分类在三次袭击和两套数据集中使用了最新的开放重量保障模型SHeldGemma,将灾难性误用数据集Clearharm上的攻击成功率降至0%。 其次,我们引入了AttaCK(STACK)(STACK)(STACK(STACK)(STACT)程序,在对几发式制式的GLEARM输油管进行黑箱攻击时达到71%的ASR。 最后,我们还在转让过程中评价STACK(S), 达到33%, 初步证据证明设计攻击的阶段是可行的,没有进入特定管道。
Article 280
Title@2025-06-30 (1): Logit-Gap Steering: Efficient Short-Suffix Jailbreaks for Aligned Large Language Models
Title: Logit-Gap Steering: Efficient Short-Suffix Jailbreaks for Aligned Large Language Models | Logit-Gap Steering: Effiziente Short-Suffix Jailbreaks für ausgerichtete große Sprachmodelle | Lologit-Gap 指导:通用大语言模型的高效短后休息室 2506.24056v1 |
Authors (2): Tung-Ling Li, Hongliang Liu
We introduce logit-gap steering, a fast jailbreak framework that casts the refusal-affirmation gap of RLHF-aligned language models as a single pass over the vocabulary. A forward-computable score blends gap reduction with lightweight proxies for KL penalty and reward shift, allowing a “sort-sum-stop” sweep to complete in under a second and return a short suffix, with two orders of magnitude fewer model calls than beam or gradient attacks. The same suffix generalises to unseen prompts and scales from 0.5B to 70B checkpoints, lifting one-shot attack success from baseline levels to 80-100% while preserving topical coherence. Beyond efficiency, these suffixes expose sentence-boundary reward cliffs and other alignment artefacts, offering a lightweight probe into how safety tuning reshapes internal representations.
我们引入了日志加普方向(logit-gap 方向),这是一个快速突破框架,它把RLHF-结盟语言模式的拒绝-确认差距作为词汇的单关。 一个可以向前计算分数的分数将减少差距与轻量代数的KL处罚和奖励转移混合起来,允许“sort-sup-stop”扫瞄在第二秒内完成,并返回短小的后缀-两级数量级的示范电话,比光束或梯度袭击少。 同样的后缀概括是从0.5 B至70 B检查站的看不见提示和比例,将一发攻击成功率从基线水平提高到80-100 % ,同时保持时标的一致性。 除了效率外,这些后缀还暴露了受判决约束的悬崖和其他匹配手工艺品,为安全调整内部表现的方式提供了轻量的探测器。
Article 281
Title@2025-06-30 (1): KMI: A Dataset of Korean Motivational Interviewing Dialogues for Psychotherapy
Title: KMI: A Dataset of Korean Motivational Interviewing Dialogues for Psychotherapy | KMI: Ein Datensatz koreanischer Motivationsinterviews für Psychotherapie | KMI:韩国精神疗法动机访谈对话数据集 2502.05651v2 |
Authors (7): Hyunjong Kim, Suyeon Lee, Yeongjae Cho, Eunseo Ryu, Yohan Jo, Suran Seong, Sungzoon Cho
The increasing demand for mental health services has led to the rise of AI-driven mental health chatbots, though challenges related to privacy, data collection, and expertise persist. Motivational Interviewing (MI) is gaining attention as a theoretical basis for boosting expertise in the development of these chatbots. However, existing datasets are showing limitations for training chatbots, leading to a substantial demand for publicly available resources in the field of MI and psychotherapy. These challenges are even more pronounced in non-English languages, where they receive less attention. In this paper, we propose a novel framework that simulates MI sessions enriched with the expertise of professional therapists. We train an MI forecaster model that mimics the behavioral choices of professional therapists and employ Large Language Models (LLMs) to generate utterances through prompt engineering. Then, we present KMI, the first synthetic dataset theoretically grounded in MI, containing 1,000 high-quality Korean Motivational Interviewing dialogues. Through an extensive expert evaluation of the generated dataset and the dialogue model trained on it, we demonstrate the quality, expertise, and practicality of KMI. We also introduce novel metrics derived from MI theory in order to evaluate dialogues from the perspective of MI.
对心理健康服务的需求不断增加,导致AI驱动的心理健康聊天室上升,尽管与隐私、数据收集和专业知识有关的挑战依然存在。积极性访谈(MI)作为增加这些聊天室发展方面专门知识的理论基础日益受到关注。然而,现有的数据集显示,对聊天室培训的局限性,导致对军事室和心理治疗领域公开资源的大量需求。这些挑战在非英语语言中更为突出,它们受到的关注较少。我们在本文件中提议了一个新框架,以模拟利用专业治疗师的专门知识丰富了的MI会议。我们培训了模拟专业治疗师行为选择和使用大语言模型的MI预报器模型,以通过迅速的工程产生出话。然后,我们介绍军事室的第一个综合数据集,其中包含1,000个高质量的韩国动力访谈对话。通过对生成的数据集的广泛专家评价以及为此培训的对话模式,我们展示了KMI质量、专门知识和实用性。我们还从科学研究所的理论角度引入了创新的模型。我们还从科学研究所的模型到从科学研究所的顺序进行新的矩阵评估。
Article 282
Title@2025-06-30 (1): Position: Machine Learning Conferences Should Establish a “Refutations and Critiques” Track
Title: Position: Machine Learning Conferences Should Establish a “Refutations and Critiques” Track | Position: Machine Learning Konferenzen sollten einen “Refutations and Critiques” Track erstellen | 职位:机器学习会议应建立“反驳和批评”轨道 2506.19882v2 |
Authors (15): Rylan Schaeffer, Joshua Kazdan, Yegor Denisov-Blanch, Brando Miranda, Matthias Gerstgrasser, Susan Zhang, Andreas Haupt, Isha Gupta, Elyas Obbad, Jesse Dodge, Jessica Zosa Forde, Koustuv Sinha, Francesco Orabona, Sanmi Koyejo, David Donoho
Science progresses by iteratively advancing and correcting humanity’s understanding of the world. In machine learning (ML) research, rapid advancements have led to an explosion of publications, but have also led to misleading, incorrect, flawed or perhaps even fraudulent studies being accepted and sometimes highlighted at ML conferences due to the fallibility of peer review. While such mistakes are understandable, ML conferences do not offer robust processes to help the field systematically correct when such errors are made. This position paper argues that ML conferences should establish a dedicated “Refutations and Critiques” (R&C) Track. This R&C Track would provide a high-profile, reputable platform to support vital research that critically challenges prior research, thereby fostering a dynamic self-correcting research ecosystem. We discuss key considerations including track design, review principles, potential pitfalls, and provide an illustrative example submission concerning a recent ICLR 2025 Oral. We conclude that ML conferences should create official, reputable mechanisms to help ML research self-correct.
在机器学习(ML)研究中,快速进展导致出版物爆炸,但也导致误导、不正确、有缺陷或甚至是欺诈性研究在ML会议上被接受,有时还由于同行审评的可败性而被强调。虽然这些错误是可以理解的,但ML会议并没有提供强有力的程序来帮助实地系统纠正这些错误。本立场文件认为ML会议应该建立一个专门的“反驳和批评”轨道。这个R&C轨道将提供一个高知名度、有声望的平台,以支持对先前研究提出严峻挑战的重要研究,从而形成一个动态的自我纠正研究生态系统。我们讨论了关键考虑因素,包括跟踪设计、审查原则、潜在的陷阱,并就最近2025年国际劳工研究中心的口头发言提供了实例说明性意见。我们的结论是,ML会议应该建立正式的、有声望的机制来帮助ML研究自我纠正。
Article 283
Title@2025-06-30 (1): Riddle Me This! Stealthy Membership Inference for Retrieval-Augmented Generation
Title: Riddle Me This! Stealthy Membership Inference for Retrieval-Augmented Generation | Befreien Sie mich damit! Stealthy Mitgliedschaft Inferenz für Retrieval-Augmented Generation | 中我这个! 偷盗会员身份的回溯性 被支持的一代人的推论 2502.00306v2 |
Authors (6): Ali Naseh, Yuefeng Peng, Anshuman Suri, Harsh Chaudhari, Alina Oprea, Amir Houmansadr
Retrieval-Augmented Generation (RAG) enables Large Language Models (LLMs) to generate grounded responses by leveraging external knowledge databases without altering model parameters. Although the absence of weight tuning prevents leakage via model parameters, it introduces the risk of inference adversaries exploiting retrieved documents in the model’s context. Existing methods for membership inference and data extraction often rely on jailbreaking or carefully crafted unnatural queries, which can be easily detected or thwarted with query rewriting techniques common in RAG systems. In this work, we present Interrogation Attack (IA), a membership inference technique targeting documents in the RAG datastore. By crafting natural-text queries that are answerable only with the target document’s presence, our approach demonstrates successful inference with just 30 queries while remaining stealthy; straightforward detectors identify adversarial prompts from existing methods up to ~76x more frequently than those generated by our attack. We observe a 2x improvement in TPR@1%FPR over prior inference attacks across diverse RAG configurations, all while costing less than $0.02 per document inference.
重力调整虽然缺乏权重调整防止了通过模型参数渗漏,但引入了在模型背景下利用检索到的文件的推断对手的风险。现有的成员推论和数据提取方法往往依赖于破门而入或精心编造的非自然查询,这些查询可以很容易地被检测到,或者由于在RAG系统中常见的调试重写技术而受到阻碍。在这项工作中,我们介绍了一种成员推论技术,即以RAG数据存储处的文件为目标的成员推论技术。我们的方法通过制作只对目标文件的存在负责的自然文本查询,表明在仅进行30个查询的同时仍保持偷盗,成功地推断出只有30个查询;直接的探测器发现从现有方法到~76x的对抗性提示比我们攻击产生的频率多得多。我们观察到,TPR@1% FPR对各种RAG配置的先前推论攻击有2x改进,而每次文件推论成本均低于0.02美元。
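The TPR@1%FPR figure quoted in the abstract is the true-positive rate of a membership classifier at a threshold whose false-positive rate on non-members stays at or below 1%. A minimal sketch of that metric follows, assuming each document has a scalar attack score where higher means "more likely a member"; the function name and the scoring convention are illustrative assumptions, not the authors' evaluation code.

```python
def tpr_at_fpr(member_scores, nonmember_scores, max_fpr=0.01):
    """Best true-positive rate among thresholds with FPR <= max_fpr.

    A document is predicted 'member' when its score is >= the threshold.
    """
    best_tpr = 0.0
    # Sweep thresholds from strictest to loosest; FPR can only grow.
    thresholds = sorted(set(member_scores) | set(nonmember_scores), reverse=True)
    for t in thresholds:
        fpr = sum(s >= t for s in nonmember_scores) / len(nonmember_scores)
        if fpr > max_fpr:
            break
        tpr = sum(s >= t for s in member_scores) / len(member_scores)
        best_tpr = max(best_tpr, tpr)
    return best_tpr
```

Reporting the metric at a low fixed FPR reflects that a membership inference attack is only useful in practice if it rarely mislabels non-member documents.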
Article 284
Title@2025-06-30 (1): LibVulnWatch: A Deep Assessment Agent System and Leaderboard for Uncovering Hidden Vulnerabilities in Open-Source AI Libraries
Title: LibVulnWatch: A Deep Assessment Agent System and Leaderboard for Uncovering Hidden Vulnerabilities in Open-Source AI Libraries | LibVulnWatch: Ein Deep Assessment Agent System und Leaderboard für die Entdeckung versteckter Schwachstellen in Open-Source-KI-Bibliotheken | LibVuln Watch: 深入评估代理系统和开放源的AI图书馆中发现隐藏的弱点的主导板 2505.08842v2 |
Authors (10): Zekun Wu, Seonglae Cho, Umar Mohammed, Cristian Munoz, Kleyton Costa, Xin Guan, Theo King, Ze Wang, Emre Kazim, Adriano Koshiyama
Open-source AI libraries are foundational to modern AI systems, yet they present significant, underexamined risks spanning security, licensing, maintenance, supply chain integrity, and regulatory compliance. We introduce LibVulnWatch, a system that leverages recent advances in large language models and agentic workflows to perform deep, evidence-based evaluations of these libraries. Built on a graph-based orchestration of specialized agents, the framework extracts, verifies, and quantifies risk using information from repositories, documentation, and vulnerability databases. LibVulnWatch produces reproducible, governance-aligned scores across five critical domains, publishing results to a public leaderboard for ongoing ecosystem monitoring. Applied to 20 widely used libraries, including ML frameworks, LLM inference engines, and agent orchestration tools, our approach covers up to 88% of OpenSSF Scorecard checks while surfacing up to 19 additional risks per library, such as critical RCE vulnerabilities, missing SBOMs, and regulatory gaps. By integrating advanced language technologies with the practical demands of software risk assessment, this work demonstrates a scalable, transparent mechanism for continuous supply chain evaluation and informed library selection.
开放源码的AI图书馆是现代AI系统的基础,但它们在安全、许可、维护、供应链完整性和监管合规方面提出了重大、未得到充分审查的风险,涉及安全、许可、维护、供应链完整性和监管合规。我们引入了LibVuln Watch,这是一个利用大型语言模型和代理工作流程的最新进展对这些图书馆进行深入、循证评价的系统。我们的方法建立在基于图表的专业化代理机构协调、框架摘录、核实和利用来自储存库、文件和脆弱性数据库的信息量化风险的基础上。LibVulnWatch在五个关键领域产生了可复制、治理一致的得分,向一个公共领导板公布成果,用于持续生态系统监测。在20个广泛使用的图书馆,包括ML框架、LLM推导引擎和代理管工具中应用,我们的方法覆盖了88%的OpenSSF记分卡检查,同时浏览每个图书馆的额外风险达19个,如RCE严重脆弱性、缺失SBOMS和监管漏洞。这项工作通过将先进语言技术与软件风险评估的实际需求相结合,展示了一个可扩展、透明的连续供应链评估和知情选择图书馆机制。
Article 285
Title@2025-06-30 (1): Ella: Embodied Social Agents with Lifelong Memory
Title: Ella: Embodied Social Agents with Lifelong Memory | Ella: Verkörperte Sozialagenten mit lebenslangem Gedächtnis | Ella:有终身记忆的社会代理人 2506.24019v1 |
Authors (7): Hongxin Zhang, Zheyuan Zhang, Zeyuan Wang, Zunzhe Zhang, Lixing Fang, Qinhong Zhou, Chuang Gan
We introduce Ella, an embodied social agent capable of lifelong learning within a community in a 3D open world, where agents accumulate experiences and acquire knowledge through everyday visual observations and social interactions. At the core of Ella’s capabilities is a structured, long-term multimodal memory system that stores, updates, and retrieves information effectively. It consists of a name-centric semantic memory for organizing acquired knowledge and a spatiotemporal episodic memory for capturing multimodal experiences. By integrating this lifelong memory system with foundation models, Ella retrieves relevant information for decision-making, plans daily activities, builds social relationships, and evolves autonomously while coexisting with other intelligent beings in the open world. We conduct capability-oriented evaluations in a dynamic 3D open world where 15 agents engage in social activities for days and are assessed with a suite of unseen controlled evaluations. Experimental results show that Ella can influence, lead, and cooperate with other agents well to achieve goals, showcasing its ability to learn effectively through observation and social interaction. Our findings highlight the transformative potential of combining structured memory systems with foundation models for advancing embodied intelligence. More videos can be found at https://umass-embodied-agi.github.io/Ella/.
我们引入了Ella, 这是一种在3D开放世界中社区中终身学习的具有内涵的社会媒介,在3D开放世界中,代理积累经验,通过日常视觉观察和社会互动获取知识;Ella的能力核心是一个结构化的长期多式记忆系统,储存、更新和有效检索信息;它包括一种以名称为中心的语义记忆,用于组织获得的知识,以及用于获取多模式经验的零星时段记忆;通过将这一终身记忆系统与基础模型结合起来,Ella为决策获取相关信息,规划日常活动,建立社会关系,在与开放世界中其他智能人共存的同时自主发展;我们在动态的3D开放世界中进行面向能力的评价,15个代理进行为期数天的社会活动,并经过一套看不见的控制评价。实验结果显示,Ella能够影响、引导和与其他代理合作,从而实现各种目标,展示其通过观察和社会互动有效学习的能力。我们的调查结果强调了将结构化的记忆系统与基础模型相结合的变革潜力。更多视频可以在 https://umass-embodied-agi.github.io/Ella/ 上找到。
Article 286
Title@2025-06-30 (1): EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations
Title: EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations | EXPERT: Eine erklärbare Bildunterschrift Auswertung Metric mit strukturierten Erklärungen | 具有结构性解释的可解释图像说明评价计量 2506.24016v1 |
Authors (5): Hyunjong Kim, Sangyeop Kim, Jongheon Jeong, Yeongjae Cho, Sungzoon Cho
Recent advances in large language models and vision-language models have led to growing interest in explainable evaluation metrics for image captioning. However, these metrics generate explanations without standardized criteria, and the overall quality of the generated explanations remains unverified. In this paper, we propose EXPERT, a reference-free evaluation metric that provides structured explanations based on three fundamental criteria: fluency, relevance, and descriptiveness. By constructing large-scale datasets of high-quality structured explanations, we develop a two-stage evaluation template to effectively supervise a vision-language model for both scoring and explanation generation. EXPERT achieves state-of-the-art results on benchmark datasets while providing significantly higher-quality explanations than existing metrics, as validated through comprehensive human evaluation. Our code and datasets are available at https://github.com/hjkim811/EXPERT.
大型语言模型和视觉语言模型的近期进展使人们对可解释的图像说明评价指标的兴趣日益浓厚,然而,这些指标产生的解释没有标准化标准,所产生解释的总体质量仍未得到核实。在本文件中,我们提议了 EXPERT 这一无参考评价标准,它根据三个基本标准提供结构化解释:流畅、相关性和描述性。我们通过建立高质量结构化解释的大规模数据集,开发了两阶段评价模板,以有效监督评分和解释生成的视觉语言模型。EXPERT 在基准数据集方面实现最新的结果,同时提供比现有指标质量高得多的解释,并通过全面的人类评价加以验证。我们的代码和数据集可在 https://github.com/hjkim811/EXPERT 上查阅。
Article 287
Title@2025-06-30 (1): Large Language Models Don’t Make Sense of Word Problems. A Scoping Review from a Mathematics Education Perspective
Title: Large Language Models Don’t Make Sense of Word Problems. A Scoping Review from a Mathematics Education Perspective | Große Sprachmodelle machen keinen Sinn für Wortprobleme. Ein Scoping Review aus einer mathematischen Bildungsperspektive | 大语言模型不能引起对字问题的看法。从数学教育角度进行范围界定审查。 2506.24006v1 |
Authors (5): Anselm R. Strohmaier, Wim Van Dooren, Kathrin Seßler, Brian Greer, Lieven Verschaffel
The progress of Large Language Models (LLMs) like ChatGPT raises the question of how they can be integrated into education. One hope is that they can support mathematics learning, including word-problem solving. Since LLMs can handle textual input with ease, they appear well-suited for solving mathematical word problems. Yet their real competence, whether they can make sense of the real-world context, and the implications for classrooms remain unclear. We conducted a scoping review from a mathematics-education perspective, including three parts: a technical overview, a systematic review of word problems used in research, and a state-of-the-art empirical evaluation of LLMs on mathematical word problems. First, in the technical overview, we contrast the conceptualization of word problems and their solution processes between LLMs and students. In computer-science research this is typically labeled mathematical reasoning, a term that does not align with usage in mathematics education. Second, our literature review of 213 studies shows that the most popular word-problem corpora are dominated by s-problems, which do not require a consideration of realities of their real-world context. Finally, our evaluation of GPT-3.5-turbo, GPT-4o-mini, GPT-4.1, and o3 on 287 word problems shows that most recent LLMs solve these s-problems with near-perfect accuracy, including a perfect score on 20 problems from PISA. LLMs still showed weaknesses in tackling problems where the real-world context is problematic or non-sensical. In sum, we argue based on all three aspects that LLMs have mastered a superficial solution process but do not make sense of word problems, which potentially limits their value as instructional tools in mathematics classrooms.
像ChatGPT这样的大语言模型(LLMs)的进展提出了如何将其纳入教育的问题。希望之一是它们能够支持数学学习,包括解决文字题。由于LLMs可以轻松地处理文字输入,它们似乎完全适合解决数学文字题。然而,它们的真正能力,它们是否能理解现实世界背景,以及它们对课堂的影响仍然不清楚。我们从数学教育的角度进行了范围界定审查,包括三个部分:技术概览,对研究中使用的文字题的系统审查,以及对LLMs解决数学文字题的最新经验性评估。首先,在技术概览中,我们比较了LLMs和学生对文字题及其解题过程的概念化。在计算机科学研究中,这通常被称为数学推理,这个术语与数学教育的用法不相符。第二,我们对213项研究的文献审查表明,最常用的文字题语料库以s-problems为主,而这类问题不需要考虑其现实情境的实际情况。最后,我们在287道文字题上对GPT-3.5-turbo、GPT-4o-mini、GPT-4.1和o3的评估表明,最新的LLMs能以接近完美的准确率解决这些s-problems,包括在20道PISA题目上取得满分。但在现实情境有问题或不合常理的题目上,LLMs仍表现出弱点。总之,我们基于上述三个方面认为,LLMs掌握了表层的解题过程,但并未真正理解文字题,这可能限制它们作为数学课堂教学工具的价值。
Article 288
Title@2025-06-30 (1): Auto-TA: Towards Scalable Automated Thematic Analysis (TA) via Multi-Agent Large Language Models with Reinforcement Learning
Title: Auto-TA: Towards Scalable Automated Thematic Analysis (TA) via Multi-Agent Large Language Models with Reinforcement Learning | Auto-TA: Auf dem Weg zu einer skalierbaren Automatisierten Thematischen Analyse (TA) über Multi-Agent Large Language Models mit Verstärkungslernen | Auto-TA:通过具有强化学习的多代理大语言模式逐步实现可缩放自动主题分析(TA) 2506.23998v1 |
Authors (7): Seungjun Yi, Joakim Nguyen, Huimin Xu, Terence Lim, Andrew Well, Mia Markey, Ying Ding
Congenital heart disease (CHD) presents complex, lifelong challenges often underrepresented in traditional clinical metrics. While unstructured narratives offer rich insights into patient and caregiver experiences, manual thematic analysis (TA) remains labor-intensive and unscalable. We propose a fully automated large language model (LLM) pipeline that performs end-to-end TA on clinical narratives, which eliminates the need for manual coding or full transcript review. Our system employs a novel multi-agent framework, where specialized LLM agents assume roles to enhance theme quality and alignment with human analysis. To further improve thematic relevance, we optionally integrate reinforcement learning from human feedback (RLHF). This supports scalable, patient-centered analysis of large qualitative datasets and allows LLMs to be fine-tuned for specific clinical contexts.
遗传性心脏病(CHD)提出了复杂的终身挑战,在传统的临床指标中往往代表不足。虽然不结构化的叙述提供了对病人和护理者经验的深刻洞察力,但人工专题分析(TA)仍然是劳动密集型和不可扩展的。我们建议采用完全自动化的大型语言模型(LLM)管道,对临床说明进行端至端的TA,这就不需要人工编码或完整的笔录审查。我们的系统使用一个新型的多试剂框架,在这个框架中,专门的LLM代理承担提高主题质量和与人类分析接轨的作用。为了进一步提高主题相关性,我们可选择性地纳入从人类反馈(RLHF)中学习的强化部分。这支持对大型质量数据集进行可缩放的、以病人为中心的分析,并使LMMS能够根据具体的临床环境进行微调。
Article 289
Title@2025-06-30 (1): TTRL: Test-Time Reinforcement Learning
Title: TTRL: Test-Time Reinforcement Learning | TTRL: Test-Zeit-Verstärkungs-Lernen | TTRL: 试验时间强化学习 2504.16084v3 |
Authors (16): Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, Biqing Qi, Youbang Sun, Zhiyuan Ma, Lifan Yuan, Ning Ding, Bowen Zhou
This paper investigates Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in Large Language Models (LLMs). The core challenge of the problem is reward estimation during inference while not having access to ground-truth information. While this setting appears elusive, we find that common practices in Test-Time Scaling (TTS), such as majority voting, yield surprisingly effective rewards suitable for driving RL training. In this work, we introduce Test-Time Reinforcement Learning (TTRL), a novel method for training LLMs using RL on unlabeled data. TTRL enables self-evolution of LLMs by utilizing the priors in the pre-trained models. Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by approximately 211% on AIME 2024 with only unlabeled test data. Furthermore, although TTRL is only supervised by the maj@n metric, it consistently surpasses the upper limit of the initial model's maj@n and approaches the performance of models trained directly on test data with ground-truth labels. Our experimental findings validate the general effectiveness of TTRL across various tasks and highlight TTRL’s potential for broader tasks and domains. GitHub: https://github.com/PRIME-RL/TTRL
本文调查了关于大语言模型(LLMs)推理任务没有明确标签的数据的强化学习(RL) 。 问题的核心挑战是在无法获取地面真相信息的情况下,在推断期间奖励估计,而没有获得地面真相信息。 虽然这一环境似乎难以捉摸,但我们发现试验时间缩放(TTTS)的常见做法,如多数投票,产生出适合驱动RL培训的令人惊讶的有效奖励。在这项工作中,我们引入了测试时间强化学习(TTRL),这是使用RL的无标记数据培训LLL的新方法。TTRL使LMS在事先培训模型中能够自我演化。我们的实验显示TTRLL始终在提高各种任务和模型的绩效。 值得注意的是,TTRLLL将提高Quwen-2.5-Math-7B的成绩,在AIME 2024 上大约211%的成绩,仅提供未标记的测试数据。 此外,尽管TTRLLLL只是由maj@n 度标准来监督, TTRLLL显示在T的绩效, 持续超过初始模型的上限 maj@ trinaldrm 和我们所直接测试的地面上的各种数据结果。
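The majority-voting reward mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `majority_vote_rewards`, the binary 1.0/0.0 reward values, and the toy answers are assumptions made for illustration. The idea shown is that the most frequent sampled answer serves as a pseudo-label, and each sample is rewarded for agreeing with it.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Return (pseudo_label, rewards) for a list of sampled final answers.

    Hypothetical helper: the most frequent answer is treated as the
    pseudo-label, and each sample gets reward 1.0 if it matches it, else 0.0.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common(1) returns the most frequent answer; on ties, the
    # answer encountered first wins.
    pseudo_label, _ = counts.most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


# Toy example: five sampled answers to one unlabeled question.
samples = ["42", "41", "42", "42", "7"]
label, rewards = majority_vote_rewards(samples)
print(label, rewards)
```

In an actual RL loop these rewards would feed a policy-gradient update; that part is omitted here because the abstract does not specify it.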
Article 290
Title@2025-06-30 (1): Machine Understanding of Scientific Language
Title: Machine Understanding of Scientific Language | Maschinelles Verständnis der wissenschaftlichen Sprache | 科学语言机器理解 2506.23990v1 |
Authors (1): Dustin Wright
Scientific information expresses human understanding of nature. This knowledge is largely disseminated in different forms of text, including scientific papers, news articles, and discourse among people on social media. While important for accelerating our pursuit of knowledge, not all scientific text is faithful to the underlying science. As the volume of this text has burgeoned online in recent years, it has become a problem of societal importance to be able to identify the faithfulness of a given piece of scientific text automatically. This thesis is concerned with the cultivation of datasets, methods, and tools for machine understanding of scientific language, in order to analyze and understand science communication at scale. To arrive at this, I present several contributions in three areas of natural language processing and machine learning: automatic fact checking, learning with limited data, and scientific text processing. These contributions include new methods and resources for identifying check-worthy claims, adversarial claim generation, multi-source domain adaptation, learning from crowd-sourced labels, cite-worthiness detection, zero-shot scientific fact checking, detecting exaggerated scientific claims, and modeling degrees of information change in science communication. Critically, I demonstrate how the research outputs of this thesis are useful for effectively learning from limited amounts of scientific text in order to identify misinformative scientific statements and generate new insights into the science communication process.
人类对自然的认识。这种知识主要以不同形式的文字形式传播,包括科学论文、新闻文章和社交媒体上的人际交流。虽然对于加快我们的知识追求很重要,但并非所有科学文本都忠实于基础科学。由于这一文本的数量在最近几年里在网上涌现,因此成为一个具有社会重要性的问题,以便能够自动识别某一科学文本的忠实性。这一论文涉及如何培养数据集、方法和工具,以便机器理解科学语言,以便进行大规模的分析和理解科学交流。为此,我提出自然语言处理和机器学习的三个领域的若干贡献:自动核对事实、用有限的数据学习和科学文本处理。这些贡献包括用于确定可核对的主张的新方法和资源、对抗性主张的生成、多源域适应、从众包标签中学习、引力检测、零光科学事实检查、探测夸大的科学主张以及科学交流的模型化程度。我准确地指出,该论文的研究产出如何有助于从有限的科学文本中有效地产生科学见解,并有效地从有限的科学见解中找出新的科学见解。
Article 291
Title@2025-06-30 (1): TaP: A Taxonomy-Guided Framework for Automated and Scalable Preference Data Generation
Title: TaP: A Taxonomy-Guided Framework for Automated and Scalable Preference Data Generation | TaP: Ein taxonomy-geführtes Framework für automatisierte und skalierbare Präferenzdatengenerierung | TAP: 自动和可缩放的首选数据生成分类-指导框架 2506.23979v1 |
Authors (11): Renren Jin, Tianhao Shen, Xinwei Wu, Dan Shi, Haoran Sun, Wuwei Huang, Quandong Wang, Wei Liu, Jian Luan, Bin Wang, Deyi Xiong
Conducting supervised fine-tuning and preference fine-tuning on large language models (LLMs) requires high-quality datasets to improve their ability to follow instructions and align with human preferences and values. However, constructing such datasets is resource-intensive, and most available datasets for supervised and preference fine-tuning are in English. To address these challenges, we propose the Taxonomy-Guided Preference Data Generation (TaP) framework, which facilitates automated and scalable construction of preference datasets across various languages. TaP is grounded in a structured taxonomy that allows fine-grained control over dataset composition, thereby ensuring both diversity and comprehensive coverage. We employ TaP-generated datasets to perform supervised and preference fine-tuning on various LLMs. Experimental results demonstrate that LLMs trained on TaP-generated datasets outperform those trained on existing open-source datasets. Remarkably, LLMs trained on TaP-generated datasets surpass the performance of those trained on an open-source dataset that is 180 times larger.
对大型语言模型(LLMs)进行有监督的微调和优惠微调需要高质量的数据集,以提高其遵守指示和与人类偏好和价值一致的能力。然而,建立这类数据集需要大量资源,而大多数可用于有监督和优惠微调的数据集都是用英文制作的。为了应对这些挑战,我们提议采用 Taxonomy-Guided Preference Data Generation(TaP)框架,该框架有助于自动和可缩放地构建各种语言的首选数据集。TaP以结构分类为基础,允许对数据集的组成进行精细的分类控制,从而确保多样性和全面覆盖。我们使用TaP生成的数据集对各种LLMs进行有监督的和优先微调。实验结果表明,在TaP产生的数据集方面受过培训的LLMs比在现有的公开源数据集上受过培训的LLMs表现更好。值得注意的是,在TaP生成的数据集上训练的LLMs,其表现超过了在规模大180倍的公开源数据集上训练的LLMs。
Article 292
Title@2025-06-30 (1): LLM Agents Are the Antidote to Walled Gardens
Title: LLM Agents Are the Antidote to Walled Gardens | LLM-Agenten sind das Gegenmittel zu ummauerten Gärten | LLM 药剂是被围墙隔绝的花园的抗药剂 2506.23978v1 |
Authors (2): Samuele Marro, Philip Torr
While the Internet’s core infrastructure was designed to be open and universal, today’s application layer is dominated by closed, proprietary platforms. Open and interoperable APIs require significant investment, and market leaders have little incentive to enable data exchange that could erode their user lock-in. We argue that LLM-based agents fundamentally disrupt this status quo. Agents can automatically translate between data formats and interact with interfaces designed for humans: this makes interoperability dramatically cheaper and effectively unavoidable. We name this shift universal interoperability: the ability for any two digital services to exchange data seamlessly using AI-mediated adapters. Universal interoperability undermines monopolistic behaviours and promotes data portability. However, it can also lead to new security risks and technical debt. Our position is that the ML community should embrace this development while building the appropriate frameworks to mitigate the downsides. By acting now, we can harness AI to restore user freedom and competitive markets without sacrificing security.
虽然互联网的核心基础设施的设计是开放和普遍的,但今天的应用层却由封闭的、专有的平台主宰。开放和可互操作的API需要大量投资,市场领导人没有多少动力来促成数据交换,从而削弱用户锁定。我们争辩说,基于LLM的代理商从根本上破坏了现状。代理商可以在数据格式之间自动翻译,并与为人类设计的界面互动:这使得互操作性大大降低,而且实际上无法避免。我们称之为这一转变的通用互操作性:任何两个数字服务商利用AI中介的适应器无缝地交换数据的能力。通用互操作性破坏了垄断行为,促进了数据的可移动性。然而,它也可能导致新的安全风险和技术债务。我们的立场是,ML社区应该接受这一发展,同时建立适当的框架来缓解下行。我们现在可以采取行动,利用AI来恢复用户自由和竞争性市场,而同时又不牺牲安全。
Article 293
Title@2025-06-30 (1): Unveiling Decision-Making in LLMs for Text Classification : Extraction of influential and interpretable concepts with Sparse Autoencoders
Title: Unveiling Decision-Making in LLMs for Text Classification : Extraction of influential and interpretable concepts with Sparse Autoencoders | Enthüllen der Entscheidungsfindung in LLMs für Textklassifikation : Extraktion einflussreicher und interpretierbarer Konzepte mit Sparse Autoencodern | 文本分类LLMs的不懈决策:与Sparse Autoenckers分离具有影响力和可解释的概念 2506.23951v1 |
Authors (4): Mathis Le Bail, Jérémie Dentan, Davide Buscaldi, Sonia Vanier
Sparse Autoencoders (SAEs) have been successfully used to probe Large Language Models (LLMs) and extract interpretable concepts from their internal representations. These concepts are linear combinations of neuron activations that correspond to human-interpretable features. In this paper, we investigate the effectiveness of SAE-based explainability approaches for sentence classification, a domain where such methods have not been extensively explored. We present a novel SAE-based architecture tailored for text classification, leveraging a specialized classifier head and incorporating an activation rate sparsity loss. We benchmark this architecture against established methods such as ConceptShap, Independent Component Analysis, and other SAE-based concept extraction techniques. Our evaluation covers two classification benchmarks and four fine-tuned LLMs from the Pythia family. We further enrich our analysis with two novel metrics for measuring the precision of concept-based explanations, using an external sentence encoder. Our empirical results show that our architecture improves both the causality and interpretability of the extracted features.
光学自动编码器(SAE)已被成功用于探测大语言模型(LLMs)并从内部表述中提取可解释的概念,这些概念是神经活化的线性组合,与人类解释的特征相对应。在本文中,我们调查了基于SAE的可解释的量刑分类方法的有效性,这是一个尚未广泛探讨这类方法的领域。我们提出了一个基于SAE的新型结构,专门用于文本分类,利用专门分类器头,并纳入活性反应速率损失。我们用概念Shap、独立组件分析和其他基于SAE的概念提取技术等既定方法来衡量这一结构。我们的评估涵盖了两个分类基准和四个来自Pythia家族的微调LMs。我们进一步用两个新的指标来丰富我们的分析,用一个外部句编码器来衡量基于概念的解释的精确性。我们的经验结果显示,我们的结构改进了提取的特征的因果关系和可解释性。
Article 294
Title@2025-06-30 (1): Leveraging the Potential of Prompt Engineering for Hate Speech Detection in Low-Resource Languages
Title: Leveraging the Potential of Prompt Engineering for Hate Speech Detection in Low-Resource Languages | Nutzung des Potenzials der Prompt-Engineering für die Hate Speech Detection in Low-Resource-Sprachen | 利用迅速工程的潜力,在低资源语言中发现仇恨言论 2506.23930v1 |
Authors (3): Ruhina Tabasshum Prome, Tarikul Islam Tamiti, Anomadarshi Barua
The rapid expansion of social media leads to a marked increase in hate speech, which threatens personal lives and results in numerous hate crimes. Detecting hate speech presents several challenges: diverse dialects, frequent code-mixing, and the prevalence of misspelled words in user-generated content on social media platforms. Recent progress in hate speech detection is typically concentrated on high-resource languages. However, low-resource languages still face significant challenges due to the lack of large-scale, high-quality datasets. This paper investigates how we can overcome this limitation via prompt engineering on large language models (LLMs), focusing on the low-resource Bengali language. We investigate six prompting strategies - zero-shot prompting, refusal suppression, flattering the classifier, multi-shot prompting, role prompting, and finally our innovative metaphor prompting - to detect hate speech effectively in low-resource languages. We pioneer metaphor prompting to circumvent the built-in safety mechanisms of LLMs, which marks a significant departure from existing jailbreaking methods. We investigate all six prompting strategies on the Llama2-7B model and compare the results extensively with three pre-trained word embeddings - GloVe, Word2Vec, and FastText - for three different deep learning models - multilayer perceptron (MLP), convolutional neural network (CNN), and bidirectional gated recurrent unit (BiGRU). To prove the effectiveness of our metaphor prompting in the low-resource Bengali language, we also evaluate it in another low-resource language - Hindi - and two high-resource languages - English and German. The performance of all prompting techniques is evaluated using the F1 score and an environmental impact factor (IF), which measures CO$_2$ emissions, electricity usage, and computational time.
社交媒体的迅速扩张导致仇恨言辞明显增加,这威胁到个人生活,并导致大量仇恨犯罪。 检测仇恨言辞带来了若干挑战:多种方言、频繁的代码混合,以及社交媒体平台用户生成的内容中普遍存在拼错字词的现象。近来在识别仇恨言论方面的进展通常集中在高资源语言上。然而,由于缺少大规模、高质量的数据集,低资源语言仍面临重大挑战。本文调查了我们如何通过快速设计以低资源孟加拉语为重点的大型语言模式(LLLMM2-7B)的经常言词来克服这一限制。我们调查了六种催化战略 — 零点提示、拒绝抑制、赞美分析器、多镜头提示、角色提示,最后我们创新的隐喻,以有效检测低资源语言中的仇恨言论。 然而,LMLMS的内在安全机制大大偏离了现有的破旧方法。我们调查了Llama2-7B型低语言的所有六种激励策略,并且将结果与三种深度的NMLF 快速定位网络 — 快速理解, 快速理解了我们的历史和快速理解的三套语言。
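Since the abstract above evaluates every prompting technique with the F1 score, a small worked computation may help. This is a generic sketch of binary F1 (the harmonic mean of precision and recall). The abstract does not say whether the authors use binary, macro, or weighted F1, so the binary form, the function name `f1_score`, and the toy labels are assumptions.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the class `positive` (hypothetical helper, plain Python)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined),
        # so F1 is reported as 0.0 by convention.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: 1 = hate speech, 0 = not hate speech.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]
score = f1_score(y_true, y_pred)
print(round(score, 3))
```

Here there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and F1 is also 2/3.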
Article 295
Title@2025-06-30 (1): IMPACT: Inflectional Morphology Probes Across Complex Typologies
Title: IMPACT: Inflectional Morphology Probes Across Complex Typologies | IMPACT: Beugungsmorphologie über komplexe Typologien hinweg | IMPACT: 跨越复杂类型 2506.23929v1 |
Authors (5): Mohammed J. Saeed, Tommi Vehvilainen, Evgeny Fedoseev, Sevil Caliskan, Tatiana Vodolazova
Large Language Models (LLMs) have shown significant progress on various multilingual benchmarks and are increasingly used to generate and evaluate text in non-English languages. However, while they may produce fluent outputs, it remains unclear to what extent these models truly grasp the underlying linguistic complexity of those languages, particularly in morphology. To investigate this, we introduce IMPACT, a synthetically generated evaluation framework focused on inflectional morphology, which we publicly release, designed to evaluate LLM performance across five morphologically rich languages: Arabic, Russian, Finnish, Turkish, and Hebrew. IMPACT includes unit-test-style cases covering both shared and language-specific phenomena, from basic verb inflections (e.g., tense, number, gender) to unique features like Arabic’s reverse gender agreement and vowel harmony in Finnish and Turkish. We assess eight multilingual LLMs that, despite strong English performance, struggle with other languages and uncommon morphological patterns, especially when judging ungrammatical examples. We also show that Chain of Thought and Thinking Models can degrade performance. Our work exposes gaps in LLMs’ handling of linguistic complexity, pointing to clear room for improvement. To support further research, we publicly release the IMPACT framework.
大型语言模型(LLMS)在多种多语种基准方面取得了显著进展,并越来越多地用于制作和评价非英语语言文本,然而,虽然这些模型可能产生流畅的产出,但仍然不清楚这些模型在多大程度上真正掌握了这些语言特别是形态学的语言复杂性。为了调查这一点,我们引入了IMPACT(IMACT),这是一个合成产生的评价框架,侧重于自然形态学,我们公开发布该框架,目的是评价在阿拉伯语、俄语、芬兰语、土耳其语和希伯莱语这五种形态丰富语言中的LLM(LM)性能。IMPACT(IMA)包括单位式的测试型案例,既包括共同现象,也包括语言特定现象,从基本的动词偏差(例如紧张、数量、性别)到阿拉伯语反向的性别协议和芬兰语和土耳其语语语调等独特特征。我们评估了8个多语种LMS(LM),尽管英语表现很强,但与其他语言和异常的形态学模式挣扎,特别是在评判非语法学实例时。我们还表明,思维和思维模式的连锁可以降低工作绩效。我们的工作暴露了LMSLMS(LMs)处理语言复杂性框架的空白)。
Article 296
Title@2025-06-30 (1): The Trilemma of Truth in Large Language Models
Title: The Trilemma of Truth in Large Language Models | Das Trilemma der Wahrheit in großen Sprachmodellen | 大语言模型中的真理三边 2506.23921v1 |
Authors (2): Germans Savcisens, Tina Eliassi-Rad
We often attribute human characteristics to large language models (LLMs) and claim that they “know” certain things. LLMs have an internal probabilistic knowledge that represents information retained during training. How can we assess the veracity of this knowledge? We examine two common methods for probing the veracity of LLMs and discover several assumptions that are flawed. To address these flawed assumptions, we introduce sAwMIL (short for Sparse Aware Multiple-Instance Learning), a probing method that utilizes the internal activations of LLMs to separate statements into true, false, and neither. sAwMIL is based on multiple-instance learning and conformal prediction. We evaluate sAwMIL on 5 validity criteria across 16 open-source LLMs, including both default and chat-based variants, as well as on 3 new datasets. Among the insights we provide are: (1) the veracity signal is often concentrated in the third quarter of an LLM’s depth; (2) truth and falsehood signals are not always symmetric; (3) linear probes perform better on chat models than on default models; (4) nonlinear probes may be required to capture veracity signals for some LLMs with reinforcement learning from human feedback or knowledge distillation; and (5) LLMs capture a third type of signal that is distinct from true and false and is neither true nor false. These findings provide a reliable method for verifying what LLMs “know” and how certain they are of their probabilistic internal knowledge.
我们经常将人的特点归结于大型语言模型(LLMS),并声称他们“知道”某些东西。LLMS具有代表培训期间保留的信息的内部概率知识。我们如何评估这种知识的真实性?我们检查两种共同的方法来检验LLMS的真实性,并发现一些有缺陷的假设。为了解决这些有缺陷的假设,我们引入了SawMIL(SawMIL(Sort for Sprassar Enown Convention 多重Instess)),一种检验方法,利用LLMS的内部激活将声明分为真实、虚假和两者都没有。 sawMIL是建立在多重 Internance 学习和符合预测的基础上的。我们如何评估16个开放源LMS的5项有效性标准,包括默认和聊天基变异体,以及3个新的数据集。我们提供的这些洞察是:(1)真实信号通常集中在LM深度的第三季度;(2)真理和假信号并不总是对称;(3)线性探点比一些默认模型更好;(4)非线性LMSMS的准确性反馈,可能要求从真实性测算为真实性测算。
Article 297
Title@2025-06-30 (1): Empirical evidence of Large Language Model’s influence on human spoken communication
Title: Empirical evidence of Large Language Model’s influence on human spoken communication | Empirische Beweise für den Einfluss von Large Language Model auf die menschliche gesprochene Kommunikation | 大语言模式对人口交流的影响的经验证据 2409.01754v2 |
Authors (7): Hiromu Yakura, Ezequiel Lopez-Lopez, Levin Brinkmann, Ignacio Serna, Prateek Gupta, Ivan Soraperra, Iyad Rahwan
From the invention of writing and the printing press, to television and social media, human history is punctuated by major innovations in communication technology, which fundamentally altered how ideas spread and reshaped our culture. Recent chatbots powered by generative artificial intelligence constitute a novel medium that encodes cultural patterns in their neural representations and disseminates them in conversations with hundreds of millions of people. Understanding whether these patterns transmit into human language, and ultimately shape human culture, is a fundamental question. While fully quantifying the causal impact of a chatbot like ChatGPT on human culture is very challenging, a lexicographic shift in human spoken communication may offer an early indicator of such a broad phenomenon. Here, we apply econometric causal inference techniques to 740,249 hours of human discourse from 360,445 YouTube academic talks and 771,591 conversational podcast episodes across multiple disciplines. We detect a measurable and abrupt increase in the use of words preferentially generated by ChatGPT, such as delve, comprehend, boast, swift, and meticulous, after its release. These findings suggest a scenario where machines, originally trained on human data and subsequently exhibiting their own cultural traits, can, in turn, measurably reshape human culture. This marks the beginning of a closed cultural feedback loop in which cultural traits circulate bidirectionally between humans and machines. Our results motivate further research into the evolution of human-machine culture, and raise concerns over the erosion of linguistic and cultural diversity, and the risks of scalable manipulation.
从书写和印刷出版的发明到电视和社交媒体,人类历史都因通信技术的重大创新而支离破碎,这些创新从根本上改变了思想传播和改造我们的文化。最近以基因化人工智能为动力的聊天机器人构成了一种新颖的媒介,将文化模式编码在他们的神经表征中,并在与数以亿计的人的对话中传播。了解这些模式是否传播到人类语言并最终塑造人类文化,是一个根本问题。在充分量化像查特GPT这样的聊天机对人类文化的因果影响非常具有挑战性,但人类口语通信的地理变化可能提供如此广泛现象的早期指标。在这里,我们应用了计量因果推断技术到740 249小时的人类话语,来自360 445 YouTube学术谈话和771 591个对话播客流,跨多个学科。我们发现,这些模式是否传播到人类语言的优雅语言,例如调、理解、吹嘘、快速和细调等,在发布之后,使用这种语言的传动,这些结论表明,机器最初在人类数据上受过训练,并随后在人类文化循环中不断改变人类文化成果之间,从而在人类文化循环中呈现文化结果之间传播。我们的文化标志,我们发现,我们发现可测量和文化循环开始进入了文化循环之间,可以测量,我们的文化结果可以测量,可以进一步传播。
Article 298
Title@2025-06-30 (1): Advancing Multi-Step Mathematical Reasoning in Large Language Models through Multi-Layered Self-Reflection with Auto-Prompting
Title: Advancing Multi-Step Mathematical Reasoning in Large Language Models through Multi-Layered Self-Reflection with Auto-Prompting | Mehrstufige mathematische Reasonierung in großen Sprachmodellen durch mehrschichtige Selbstreflexion mit Auto-Prompting | 通过使用自动促进的多语言自评,在大语言模型中推进多层次多语种数学理由 2506.23888v1 |
Authors (5): André de Souza Loureiro, Jorge Valverde-Rebaza, Julieta Noguez, David Escarcega, Ricardo Marcacini
Recent advancements in Large Language Models (LLMs) have significantly improved their problem-solving capabilities. However, these models still struggle when faced with complex multi-step reasoning tasks. In this paper, we propose the Multi-Layered Self-Reflection with Auto-Prompting (MAPS) framework, a novel approach designed to enhance multi-step mathematical reasoning in LLMs by integrating techniques such as Chain of Thought (CoT), Self-Reflection, and Auto-Prompting. Unlike traditional static prompting methods, MAPS employs an iterative refinement process. Initially, the model generates a solution using CoT prompting. When errors are detected, an adaptive self-reflection mechanism identifies and analyzes them, generating tailored prompts to guide corrections. These dynamically adjusted prompts enable the model to iteratively refine its reasoning. Experiments on four well-established benchmarks across multiple LLMs show that MAPS significantly outperforms standard CoT and achieves competitive results with reasoning-optimized models. In addition, MAPS enables general-purpose LLMs to reach performance levels comparable to specialized reasoning models. While deeper reflection layers improve accuracy, they also increase token usage and costs. To balance this trade-off, MAPS strategically limits reflection depth, ensuring an optimal balance between cost and reasoning performance.
大语言模型(LLMS)最近的进展大大改善了其解决问题的能力,然而,这些模型在面临复杂的多步推理任务时仍然难以解决问题。在本文件中,我们提议采用多步自评和自动促进(MAPS)框架,这是一种新颖的办法,旨在通过整合思维链(CoT)、自我恢复和自动促进等技术,加强LMS的多步数学推理。与传统的静态促动方法不同,MAPS采用迭接的完善程序。最初,该模型利用COT提示来产生一种解决方案。在发现错误时,适应性自我反省机制确定和分析这些解决方案,产生有针对性的导纠正的提示。这些动态调整使模型能够反复完善其推理。对多种LMS的四项既定基准的实验表明,MAPS大大超越了标准COT,并且通过推理优化模型取得了竞争性的结果。此外,MAPS使通用LMS能够达到可与专业推理模型相近的绩效水平。在更深深的思考层次上,提高了战略推理的精确度,也提高了成本。
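The iterative refinement loop described in the MAPS abstract above (initial Chain-of-Thought attempt, error detection, reflection, tailored re-prompt, bounded depth) can be sketched as follows. This is only a control-flow illustration under stated assumptions: `generate`, `check`, and `reflect` are hypothetical callables standing in for model calls, and the toy stubs below are not from the paper.

```python
def solve_with_reflection(problem, generate, check, reflect, max_depth=3):
    """Refine an answer through at most `max_depth` reflection rounds.

    Hypothetical interfaces (assumptions, not the paper's API):
      generate(problem, hint) -> answer   (hint is None on the first attempt)
      check(problem, answer)  -> None if accepted, else an error description
      reflect(problem, answer, error) -> hint text for the next attempt
    Returns (answer, number_of_reflection_rounds_used).
    """
    answer = generate(problem, None)  # initial CoT-style attempt
    for depth in range(max_depth):
        error = check(problem, answer)
        if error is None:
            return answer, depth
        # Turn the detected error into a tailored prompt and retry.
        hint = reflect(problem, answer, error)
        answer = generate(problem, hint)
    # Depth limit reached: return the last attempt without further checks.
    return answer, max_depth


# Toy stubs so the loop can be run without any model.
def toy_generate(problem, hint):
    return "5" if hint is None else "4"


def toy_check(problem, answer):
    return None if answer == "4" else "wrong sum"


def toy_reflect(problem, answer, error):
    return f"Previous answer {answer} was rejected ({error}); recompute."


result = solve_with_reflection("2+2", toy_generate, toy_check, toy_reflect)
print(result)
```

The `max_depth` cap mirrors the trade-off the abstract mentions: more reflection rounds can improve accuracy but increase token usage, so the loop stops after a fixed number of rounds.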
Article 299
Title@2025-06-30 (1): Garbage In, Reasoning Out? Why Benchmark Scores are Unreliable and What to Do About It
Title: Garbage In, Reasoning Out? Why Benchmark Scores are Unreliable and What to Do About It | Warum Benchmark-Scores unzuverlässig sind und was dagegen zu tun ist | 垃圾垃圾, 合理解释? 为什么基准分数不可靠? 如何做呢? 2506.23864v1 |
Authors (4): Seyed Mahed Mousavi, Edoardo Cecchinato, Lucia Hornikova, Giuseppe Riccardi
We conduct a systematic audit of three widely used reasoning benchmarks, SocialIQa, FauxPas-EAI, and ToMi, and uncover pervasive flaws in both benchmark items and evaluation methodology. Using five LLMs (GPT-{3, 3.5, 4, o1}, and LLaMA 3.1) as diagnostic tools, we identify structural, semantic, and pragmatic issues in benchmark design (e.g., duplicated items, ambiguous wording, and implausible answers), as well as scoring procedures that prioritize output form over reasoning process. Through systematic human annotation and re-evaluation on cleaned benchmark subsets, we find that model scores often improve due to erratic surface wording variations and not due to improved reasoning. In fact, further analyses show that model performance is highly sensitive to minor input variations such as context availability and phrasing, revealing that high scores may reflect alignment with format-specific cues rather than consistent inference based on the input. These findings challenge the validity of current benchmark-based claims about reasoning in LLMs, and highlight the need for evaluation protocols that assess reasoning as a process of drawing inference from available information, rather than as static output selection. We release audited data and evaluation tools to support more interpretable and diagnostic assessments of model reasoning.
我们系统地审计了三个广泛使用的推理基准,即SocialIQa、FauxPas-EAI和Tomi,发现基准项目和评价方法中普遍存在的缺陷。我们使用五个LLMs(GPT-{3, 3.5, 4, o1}和LLLaMA 3.1)作为诊断工具,我们发现基准设计中的结构、语义和务实问题(例如,重复项目、模棱两可的措辞和不可信的答案),以及评分程序,这些评分程序将产出形式置于比推理过程的优先地位。我们发现,模型评分往往不是由于表面措辞变化不定,而不是由于推理的改进而得到改善。深入的分析表明,模型的性能非常敏感于小的投入变化,例如背景的可用性和语法,表明高得分可能反映与具体格式提示的一致性,而不是基于投入的推理。这些调查结果质疑目前基于基准的索赔在LMS的推理中的有效性,我们强调评价协议的必要性,即评估推理方案需要评估推理程序,作为从现有信息中推理的推理,而不是经过审计的推理学和推理评估,作为推理工具。
Article 300
Title@2025-06-30 (1): GeometryZero: Improving Geometry Solving for LLM with Group Contrastive Policy Optimization
Title: GeometryZero: Improving Geometry Solving for LLM with Group Contrastive Policy Optimization | GeometrieZero: Verbesserung der Geometrie-Lösung für LLM mit Gruppen-Kontrast-Policy-Optimierung | 几何零:改进与集团反竞争政策优化相结合的LLM的几何解决办法 2506.07160v2 |
Authors (7): Yikun Wang, Yibin Wang, Dianyi Wang, Zimian Peng, Qipeng Guo, Dacheng Tao, Jiaqi Wang
Recent advances in large language models (LLMs) have demonstrated remarkable capabilities across diverse domains, particularly in mathematical reasoning, among which geometry problem solving remains a challenging area where auxiliary construction plays an essential role. Existing approaches either achieve suboptimal performance or rely on massive LLMs (e.g., GPT-4o), incurring massive computational costs. We posit that reinforcement learning with verifiable reward (e.g., GRPO) offers a promising direction for training smaller models that effectively combine auxiliary construction with robust geometric reasoning. However, directly applying GRPO to geometric reasoning presents fundamental limitations due to its dependence on unconditional rewards, which leads to indiscriminate and counterproductive auxiliary constructions. To address these challenges, we propose Group Contrastive Policy Optimization (GCPO), a novel reinforcement learning framework featuring two key innovations: (1) Group Contrastive Masking, which adaptively provides positive or negative reward signals for auxiliary construction based on contextual utility, and (2) a length reward that promotes longer reasoning chains. Building on GCPO, we develop GeometryZero, a family of affordable-size geometric reasoning models that judiciously determine when to employ auxiliary construction. Our extensive empirical evaluation across popular geometric benchmarks (Geometry3K, MathVista) demonstrates that GeometryZero models consistently outperform baselines (e.g., GRPO), achieving an average improvement of 4.29% across all benchmarks.
Article 301
Title@2025-06-30 (1): Use Sparse Autoencoders to Discover Unknown Concepts, Not to Act on Known Concepts
Title: Use Sparse Autoencoders to Discover Unknown Concepts, Not to Act on Known Concepts | Verwenden Sie Sparse Autoencoder, um unbekannte Konzepte zu entdecken, nicht um auf bekannte Konzepte zu handeln | 使用粗略自动编码器发现未知概念, 而不是对已知概念采取行动 2506.23845v1 |
Authors (5): Kenny Peng, Rajiv Movva, Jon Kleinberg, Emma Pierson, Nikhil Garg
While sparse autoencoders (SAEs) have generated significant excitement, a series of negative results have added to skepticism about their usefulness. Here, we establish a conceptual distinction that reconciles competing narratives surrounding SAEs. We argue that while SAEs may be less effective for acting on known concepts, SAEs are powerful tools for discovering unknown concepts. This distinction cleanly separates existing negative and positive results, and suggests several classes of SAE applications. Specifically, we outline use cases for SAEs in (i) ML interpretability, explainability, fairness, auditing, and safety, and (ii) social and health sciences.
Article 302
Title@2025-06-30 (1): Do Thinking Tokens Help or Trap? Towards More Efficient Large Reasoning Model
Title: Do Thinking Tokens Help or Trap? Towards More Efficient Large Reasoning Model | Helfen oder Fallen denkende Token? Auf dem Weg zu einem effizienteren, großen, vernünftigen Modell | 思考 Tok 帮助还是陷阱? 迈向更高效的大理由模型 2506.23840v1 |
Authors (5): Bowen Ding, Yuhan Chen, Futing Wang, Lingfeng Ming, Tao Lin
Large Reasoning Models (LRMs) excel at solving complex problems but face an overthinking dilemma. When handling simple tasks, they often produce verbose responses overloaded with thinking tokens (e.g., wait, however). These tokens trigger unnecessary high-level reasoning behaviors like reflection and backtracking, reducing efficiency. In this work, our pilot study reveals that these thinking-token-induced behaviors are not essential for effective problem-solving and may even hinder correct reasoning within constrained token budgets. We identify this phenomenon as the thinking trap. To mitigate this issue, we propose Dual Policy Preference Optimization (DuP-PO), a novel algorithm featuring: (1) A rollout sampling strategy that guarantees balanced exposure to responses with and without thinking tokens; (2) A fine-grained advantage control technique to dynamically regulate the prediction of target tokens; (3) A policy shaping method ensuring stable gradient contributions from thinking tokens. Experimental results on five popular math reasoning benchmarks show that DuP-PO performs well on popular LRMs, significantly improving their token efficiency during reasoning while achieving performance superior to the base model.
Article 303
Title@2025-06-30 (1): Explainable Sentiment Analysis with DeepSeek-R1: Performance, Efficiency, and Few-Shot Learning
Title: Explainable Sentiment Analysis with DeepSeek-R1: Performance, Efficiency, and Few-Shot Learning | Erklärbare Sentiment-Analyse mit DeepSeek-R1: Leistung, Effizienz und wenig scharfes Lernen | “深搜索-R1:性能、效率和很少热学习”的可解释的感官分析 2503.11655v2 |
Authors (2): Donghao Huang, Zhaoxia Wang
Large language models (LLMs) have transformed sentiment analysis, yet balancing accuracy, efficiency, and explainability remains a critical challenge. This study presents the first comprehensive evaluation of DeepSeek-R1, an open-source reasoning model, against OpenAI’s GPT-4o and GPT-4o-mini. We test the full 671B model and its distilled variants, systematically documenting few-shot learning curves. Our experiments show DeepSeek-R1 achieves a 91.39% F1 score on 5-class sentiment and 99.31% accuracy on binary tasks with just 5 shots, an eightfold improvement in few-shot efficiency over GPT-4o. Architecture-specific distillation effects emerge, where a 32B Qwen2.5-based model outperforms the 70B Llama-based variant by 6.69 percentage points. While its reasoning process reduces throughput, DeepSeek-R1 offers superior explainability via transparent, step-by-step traces, establishing it as a powerful, interpretable open-source alternative.
Article 304
Title@2025-06-30 (1): Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph
Title: Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph | Benchmarking Uncertainty Quantification Methods for Large Language Models mit LM-Polygraph | 与LM-Porgraph 参照大语言模型的不确定性量化方法 2406.15627v4 |
Authors (15): Roman Vashurin, Ekaterina Fadeeva, Artem Vazhentsev, Lyudmila Rvanova, Akim Tsvigun, Daniil Vasilev, Rui Xing, Abdelrahman Boda Sadallah, Kirill Grishchenkov, Sergey Petrakov, Alexander Panchenko, Timothy Baldwin, Preslav Nakov, Maxim Panov, Artem Shelmanov
The rapid proliferation of large language models (LLMs) has stimulated researchers to seek effective and efficient approaches to deal with LLM hallucinations and low-quality outputs. Uncertainty quantification (UQ) is a key element of machine learning applications in dealing with such challenges. However, research to date on UQ for LLMs has been fragmented in terms of techniques and evaluation methodologies. In this work, we address this issue by introducing a novel benchmark that implements a collection of state-of-the-art UQ baselines and offers an environment for controllable and consistent evaluation of novel UQ techniques over various text generation tasks. Our benchmark also supports the assessment of confidence normalization methods in terms of their ability to provide interpretable scores. Using our benchmark, we conduct a large-scale empirical investigation of UQ and normalization techniques across eleven tasks, identifying the most effective approaches. Code: https://github.com/IINemo/lm-polygraph Benchmark: https://huggingface.co/LM-Polygraph
Article 305
Title@2025-06-30 (1): Computational Analysis of Character Development in Holocaust Testimonies
Title: Computational Analysis of Character Development in Holocaust Testimonies | Computational Analyse der Charakterentwicklung in Holocaust-Zeugnissen | 大屠杀证词特征发展计算分析 2412.17063v3 |
Authors (4): Esther Shizgal, Eitan Wagner, Renana Keydar, Omri Abend
This work presents a computational approach to analyze character development along the narrative timeline. The analysis characterizes the inner and outer changes the protagonist undergoes within a narrative, and the interplay between them. We consider transcripts of Holocaust survivor testimonies as a test case, each telling the story of an individual in first-person terms. We focus on the survivor’s religious trajectory, examining the evolution of their disposition toward religious belief and practice along the testimony. Clustering the resulting trajectories in the dataset, we identify common sequences in the data. Our findings highlight multiple common structures of religiosity across the narratives: in terms of belief, most present a constant disposition, while for practice, most present an oscillating structure, serving as valuable material for historical and sociological research. This work demonstrates the potential of natural language processing techniques for analyzing character evolution through thematic trajectories in narratives.
Article 306
Title@2025-06-30 (1): AutoEvoEval: An Automated Framework for Evolving Close-Ended LLM Evaluation Data
Title: AutoEvoEval: An Automated Framework for Evolving Close-Ended LLM Evaluation Data | AutoEvoEval: Ein Automatisiertes Framework für die Evolving Close-Ended LLM-Evaluierungsdaten | AutoEvoEval:发展近端LLM评价数据自动框架 2506.23735v1 |
Authors (2): JiaRu Wu, Mingwei Liu
Large language models (LLMs) have shown remarkable performance on various tasks, but existing evaluation benchmarks are often static and insufficient to fully assess their robustness and generalization in realistic scenarios. Prior work using evolutionary or adversarial data augmentation has improved evaluation diversity but lacks systematic control over perturbation types and multi-step complexity, limiting comprehensive robustness analysis. To address these gaps, we propose AutoEvoEval, an evolution-based evaluation framework for close-ended tasks such as multi-choice question answering. AutoEvoEval introduces 22 interpretable atomic evolution operations and supports multi-round compositions, enabling controlled generation of diverse, challenging, and realistic test samples. We conduct extensive experiments addressing four research questions on a broad set of open- and closed-source LLMs. Our results show that atomic operations cause an average accuracy drop of 7.283%, with structure-disrupting or misleading semantic edits causing the largest declines. Model sensitivities vary significantly for the same perturbation, and combining multiple evolution steps amplifies adversarial effects by up to 52.932%. These findings suggest current benchmarks may overestimate true model generalization and emphasize the need for evolution-aware robustness evaluation. Code and resources are available at: https://github.com/SYSUSELab/AutoEvoEval.
Article 307
Title@2025-06-30 (1): CSC-SQL: Corrective Self-Consistency in Text-to-SQL via Reinforcement Learning
Title: CSC-SQL: Corrective Self-Consistency in Text-to-SQL via Reinforcement Learning | CSC-SQL: Korrektive Selbstkonsistenz im Text-zu-SQL durch Verstärkungslernen | CSC-SQL:通过强化学习在文本到SQL中实现校正的自我统一 2505.13271v2 |
Authors (2): Lei Sheng, Shuai-Shuai Xu
Large language models (LLMs) have demonstrated strong capabilities in translating natural language questions about relational databases into SQL queries. In particular, test-time scaling techniques such as Self-Consistency and Self-Correction can enhance SQL generation accuracy by increasing computational effort during inference. However, these methods have notable limitations: Self-Consistency may select suboptimal outputs despite majority votes, while Self-Correction typically addresses only syntactic errors. To leverage the strengths of both approaches, we propose CSC-SQL, a novel method that integrates Self-Consistency and Self-Correction. CSC-SQL selects the two most frequently occurring outputs from parallel sampling and feeds them into a merge revision model for correction. Additionally, we employ the Group Relative Policy Optimization (GRPO) algorithm to fine-tune both the SQL generation and revision models via reinforcement learning, significantly enhancing output quality. Experimental results confirm the effectiveness and generalizability of CSC-SQL. On the BIRD private test set, our 7B model achieves 71.72% execution accuracy, while the 32B model achieves 73.67%. The code has been open sourced at https://github.com/CycloneBoy/csc_sql.
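The candidate-selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: grouping samples by their execution result, the `top_two_candidates` helper, and the sample data are all assumptions made for the example, and the merge revision model itself is not shown.

```python
from collections import Counter


def top_two_candidates(samples):
    """Return one representative SQL query for each of the two most
    frequent execution results among the sampled (sql, result) pairs."""
    counts = Counter(result for _, result in samples)
    representatives = {}
    for sql, result in samples:
        # Keep the first query seen for each distinct result.
        representatives.setdefault(result, sql)
    return [representatives[result] for result, _ in counts.most_common(2)]


# Hypothetical (sql, execution_result) pairs from parallel sampling.
samples = [
    ("SELECT COUNT(*) FROM t", "42"),
    ("SELECT COUNT(id) FROM t", "42"),
    ("SELECT COUNT(DISTINCT id) FROM t", "40"),
    ("SELECT MAX(id) FROM t", "97"),
    ("SELECT COUNT(DISTINCT id) FROM t WHERE id > 0", "40"),
    ("SELECT COUNT(1) FROM t", "42"),
]
candidates = top_two_candidates(samples)
```

Under these assumptions, the two returned queries would be the inputs handed to a revision model, instead of simply returning the majority answer.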
Article 308
Title@2025-06-30 (1): Towards an Automated Multimodal Approach for Video Summarization: Building a Bridge Between Text, Audio and Facial Cue-Based Summarization
Title: Towards an Automated Multimodal Approach for Video Summarization: Building a Bridge Between Text, Audio and Facial Cue-Based Summarization | Auf dem Weg zu einem automatisierten multimodalen Ansatz für die Videozusammenfassung: Eine Brücke zwischen Text, Audio und Gesichtsqueue-basierter Zusammenfassung bauen | 采用自动多式方式进行视频摘要描述:在文字、音频和基于面轴的缩写之间架建桥梁 2506.23714v1 |
Authors (4): Md Moinul Islam, Sofoklis Kakouros, Janne Heikkilä, Mourad Oussalah
The increasing volume of video content in educational, professional, and social domains necessitates effective summarization techniques that go beyond traditional unimodal approaches. This paper proposes a behaviour-aware multimodal video summarization framework that integrates textual, audio, and visual cues to generate timestamp-aligned summaries. By extracting prosodic features, textual cues and visual indicators, the framework identifies semantically and emotionally important moments. A key contribution is the identification of bonus words, which are terms emphasized across multiple modalities and used to improve the semantic relevance and expressive clarity of the summaries. The approach is evaluated against pseudo-ground truth (pGT) summaries generated using an LLM-based extractive method. Experimental results demonstrate significant improvements over traditional extractive methods, such as the Edmundson method, in both text and video-based evaluation metrics. Text-based metrics show ROUGE-1 increasing from 0.4769 to 0.7929 and BERTScore from 0.9152 to 0.9536, while in video-based evaluation, our proposed framework improves F1-Score by almost 23%. The findings underscore the potential of multimodal integration in producing comprehensive and behaviourally informed video summaries.
Article 309
Title@2025-06-30 (1): Attestable Audits: Verifiable AI Safety Benchmarks Using Trusted Execution Environments
Title: Attestable Audits: Verifiable AI Safety Benchmarks Using Trusted Execution Environments | Testbare Audits: Überprüfbare KI-Sicherheits-Benchmarks unter Verwendung von Trusted Execution Environments | 可检验的审计:使用可信赖的执行环境的可核实的AI安全基准 2506.23706v1 |
Authors (4): Christoph Schnabl, Daniel Hugenroth, Bill Marino, Alastair R. Beresford
Benchmarks are important measures to evaluate safety and compliance of AI models at scale. However, they typically do not offer verifiable results and lack confidentiality for model IP and benchmark datasets. We propose Attestable Audits, which run inside Trusted Execution Environments and enable users to verify interaction with a compliant AI model. Our work protects sensitive data even when model provider and auditor do not trust each other. This addresses verification challenges raised in recent AI governance frameworks. We build a prototype demonstrating feasibility on typical audit benchmarks against Llama-3.1.
Article 310
Title@2025-06-30 (1): Thinking About Thinking: SAGE-nano’s Inverse Reasoning for Self-Aware Language Models
Title: Thinking About Thinking: SAGE-nano’s Inverse Reasoning for Self-Aware Language Models | Denken über das Denken: SAGE-nano’s Inverse Reasoning for Self-Aware Language Models | 思考思考:SAGE-nano 自我意识语言模型的反向理由 2507.00092v1 |
Authors (6): Basab Jha, Firoj Paudel, Ujjwal Puri, Zhang Yuting, Choi Donghyuk, Wang Junhao
Large Language Models (LLMs) have demonstrated remarkable capabilities at solving complex reasoning tasks with Chain-of-Thought (CoT) prompting, but their decision-making processes remain somewhat of a black box. We introduce inverse reasoning, a novel paradigm enabling LLMs to decompose and explain their own reasoning chains post-hoc. Our approach, used in SAGE-nano, a 4-billion-parameter reasoning model, employs a metacognitive structure that reflects back via attention processes to identify major decision points and generate explanations of reasoning choices. While typical CoT approaches are directed towards forward reasoning generation, inverse reasoning provides insight into why specific reasoning chains were selected over others. Through thorough testing of logical reasoning puzzles, math problems and ethical dilemmas from AQUA-RAT, CommonsenseQA, and customized benchmarks, we demonstrate that SAGE-nano is at the cutting edge both on reasoning accuracy (74.6% on AQUA-RAT) and explanation quality (92.1% human preference score) for its task, and offers performance almost on par with models like Claude-3.5 Sonnet or GPT-4o. Our contributions are: (i) the first rigorous framework for LLM self-reflection via inverse reasoning, (ii) a novel meta-learning framework to reverse the attention flow, (iii) comprehensive evaluation frameworks for reasoning transparency, and (iv) evidence that increasing reasoning using inverse reasoning improves interpretability along with reasoning performance. Our work creates new avenues for transparent AI systems and closes significant gaps in AI safety, education, and scientific discovery.
Article 311
Title@2025-06-30 (1): Sparsing Law: Towards Large Language Models with Greater Activation Sparsity
Title: Sparsing Law: Towards Large Language Models with Greater Activation Sparsity | Sparsing Law: Auf dem Weg zu großen Sprachmodellen mit größerer Aktivierungssparsität | 评分法:走向大语言模式,具有更大的激活率平等性 2411.02335v4 |
Authors (10): Yuqi Luo, Chenyang Song, Xu Han, Yingfa Chen, Chaojun Xiao, Xiaojun Meng, Liqun Deng, Jiansheng Wei, Zhiyuan Liu, Maosong Sun
Activation sparsity denotes the existence of substantial weakly-contributed elements within activation outputs that can be eliminated, benefiting many important applications concerned with large language models (LLMs). Although promoting greater activation sparsity within LLMs deserves deep studies, existing works lack comprehensive and quantitative research on the correlation between activation sparsity and potentially influential factors. In this paper, we present a comprehensive study on the quantitative scaling properties and influential factors of the activation sparsity within decoder-only Transformer-based LLMs. Specifically, we propose PPL-$p\%$ sparsity, a precise and performance-aware activation sparsity metric that is applicable to any activation function. Through extensive experiments, we find several important phenomena. Firstly, different activation functions exhibit comparable performance but opposite training-time sparsity trends. The activation ratio (i.e., $1-\mathrm{sparsity\ ratio}$) evolves as a convergent increasing power-law and decreasing logspace power-law with the amount of training data for SiLU-activated and ReLU-activated LLMs, respectively. These demonstrate that ReLU is more efficient as the activation function than SiLU and can leverage more training data to improve activation sparsity. Secondly, the activation ratio linearly increases with the width-depth ratio below a certain bottleneck point, indicating the potential advantage of a deeper architecture at a fixed parameter scale. Finally, at similar width-depth ratios, we surprisingly find that the limit value of activation sparsity varies weakly with the parameter scale, i.e., the activation patterns within LLMs are insensitive to the parameter scale. These empirical laws towards LLMs with greater activation sparsity have important implications for making LLMs more efficient and interpretable.
Article 312
Title@2025-06-30 (1): Efficient Interleaved Speech Modeling through Knowledge Distillation
Title: Efficient Interleaved Speech Modeling through Knowledge Distillation | Effiziente interleaved Speech Modeling durch Wissensdestillation | 通过知识蒸馏建模建立知识蒸馏模式 2506.23670v1 |
Authors (2): Mohammadmahdi Nouriborji, Morteza Rohanian
Current speech language models exceed the size and latency constraints of many deployment environments. We build compact, expressive speech generation models through layer-aligned distillation, matching hidden states, attention maps, and softened logits to compress large multimodal transformers by 3x with minimal loss in performance. We introduce TinyWave, a family of 2B-parameter models for speech-to-speech and interleaved speech-text generation, trained on 50,000 hours of public audio. TinyWave supports (i) speech-only generation using phonetic or expressive tokens and (ii) mixed speech-text continuations. Evaluation on Libri-Light shows TinyWave within 1.4 normalized perplexity points of its teacher. Accuracy on spoken StoryCloze and SALMon reaches 93-97% of the teacher’s performance, outperforming size-matched baselines. These models are optimized for deployment on commodity hardware, enabling applications in real-time conversational agents, assistive technologies, and low-resource environments. We release models, training code, and evaluation scripts to support reproducible research on compact, expressive speech generation.
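Of the three distillation signals named above, the softened-logit term is the easiest to illustrate in isolation. The sketch below assumes a temperature-scaled softmax and a KL divergence from teacher to student, which is the common formulation but is not confirmed by the abstract; the temperature value, function names, and example logits are hypothetical, and the hidden-state and attention-map matching terms are omitted.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_logit_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical logits over a three-token vocabulary.
teacher = [2.0, 1.0, 0.1]
student_mismatched = [0.1, 1.0, 2.0]

loss_matched = soft_logit_kl(teacher, teacher)
loss_mismatched = soft_logit_kl(teacher, student_mismatched)
```

A student that reproduces the teacher's logits incurs zero loss, while a mismatched student incurs a positive penalty that a training loop would minimize alongside the other matching terms.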
Article 313
Title@2025-06-30 (1): L0: Reinforcement Learning to Become General Agents
Title: L0: Reinforcement Learning to Become General Agents | L0: Stärkung des Lernens, Generalagenten zu werden | L0:加强学习成为一般代理 2506.23667v1 |
Authors (10): Junjie Zhang, Jingyi Xi, Zhuoyang Song, Junyu Lu, Yuhua Ke, Ting Sun, Yukun Yang, Jiaxing Zhang, Songxin Zhang, Zejian Xie
Training large language models (LLMs) to act as autonomous agents for multi-turn, long-horizon tasks still poses significant challenges in scalability and training efficiency. To address this, we introduce L-Zero (L0), a scalable, end-to-end training pipeline for general-purpose agents. Featuring a low-cost, extensible, and sandboxed concurrent agent worker pool, L0 lowers the barrier for applying reinforcement learning in complex environments. We also introduce NB-Agent, the agent scaffold within L0, which operates in a “code-as-action” fashion via a Read-Eval-Print-Loop (REPL). We evaluate L0 on factuality question-answering benchmarks. Our experiments demonstrate that a base model can develop robust problem-solving skills using solely Reinforcement Learning with Verifiable Rewards (RLVR). On the Qwen2.5-7B-Instruct model, our method boosts accuracy on SimpleQA from 30% to 80% and on HotpotQA from 22% to 41%. We have open-sourced the entire L0 system, including our L0 series models, the NB-Agent, a complete training pipeline, and the corresponding training recipes at https://github.com/cmriat/l0.
Article 314
Title@2025-06-30 (1): Zero-Shot Contextual Embeddings via Offline Synthetic Corpus Generation
Title: Zero-Shot Contextual Embeddings via Offline Synthetic Corpus Generation | Zero-Shot Kontextuelle Einbettungen über Offline Synthetische Corpus-Generierung | 通过离线合成机体生成零零热背景嵌入 2506.23662v1 |
Authors (2): Philip Lippmann, Jie Yang
Context-aware embedding methods boost retrieval accuracy by conditioning on corpus statistics (e.g., term co-occurrence and topical patterns) extracted from neighboring documents. However, this context-aware approach requires access to the target corpus or requires domain-specific finetuning, posing practical barriers in privacy-sensitive or resource-constrained settings. We present ZEST, a zero-shot contextual adaptation framework that replaces real corpus access with a one-time offline synthesis of a compact proxy. Given only a handful of exemplar documents representative of the general target domain, we use a multi-step hierarchical procedure to generate a synthetic context corpus of several hundred documents that aims to emulate key domain-specific distributions. At inference, the frozen context-aware encoder uses this proxy corpus – without any finetuning or target corpus access – to produce domain-adapted embeddings. Across the MTEB benchmark, ZEST’s zero-shot synthetic context adaptation using only five example documents performs within 0.5% of models leveraging full target corpus access – demonstrating remarkable efficacy without any retraining. ZEST thus provides a practical method for deploying high-performance, adaptable embeddings in constrained environments.
Article 315
Title@2025-06-30 (1): Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization
Title: Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization | Robustes LLM-Unlearning mit MUDMAN: Meta-Unlearning mit Disruptionsmasken und Normalisierung | 与 MUDMAN 一起重新学习: 以干扰蒙蔽和正常化的方式重新学习 2506.12484v3 |
Authors (4): Filip Sondej, Yushi Yang, Mikołaj Kniejski, Marcel Windys
Language models can retain dangerous knowledge and skills even after extensive safety fine-tuning, posing both misuse and misalignment risks. Recent studies show that even specialized unlearning methods can be easily reversed. To address this, we systematically evaluate many existing and novel components of unlearning methods and identify ones crucial for irreversible unlearning. We introduce Disruption Masking, a technique in which we only allow updates to weights where the signs of the unlearning gradient and the retaining gradient are the same. This ensures all updates are non-disruptive. Additionally, we identify the need for normalizing the unlearning gradients, and also confirm the usefulness of meta-learning. We combine these insights into MUDMAN (Meta-Unlearning with Disruption Masking and Normalization) and validate its effectiveness at preventing the recovery of dangerous capabilities. MUDMAN outperforms the prior TAR method by 40%, setting a new state-of-the-art for robust unlearning.
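The sign-agreement rule behind Disruption Masking can be illustrated on flat gradient vectors. This is a minimal sketch under stated assumptions rather than the authors' code: the `disruption_mask` helper and the gradient values are hypothetical, entries with a zero gradient are treated as masked, and normalization and meta-learning are left out.

```python
def disruption_mask(unlearn_grad, retain_grad):
    """Keep an unlearning-gradient entry only where it has the same sign
    as the corresponding retaining-gradient entry; zero it otherwise."""
    return [
        u if u * r > 0 else 0.0
        for u, r in zip(unlearn_grad, retain_grad)
    ]


# Hypothetical per-weight gradients for four parameters.
unlearn_grad = [0.5, -0.2, 0.3, -0.4]
retain_grad = [0.1, 0.3, -0.2, -0.6]

masked = disruption_mask(unlearn_grad, retain_grad)
```

Only the first and last weights would be updated here, because those are the positions where the unlearning step does not push against the retaining objective.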
Article 316
Title@2025-06-30 (1): Evaluating K-Fold Cross Validation for Transformer Based Symbolic Regression Models
Title: Evaluating K-Fold Cross Validation for Transformer Based Symbolic Regression Models | Bewertung der K-Fold Cross-Validierung für Transformer-basierte symbolische Regressionsmodelle | 评估基于变换器的符号回归模型的 K- Fold 交叉验证 2410.21896v2 |
Authors (7): Kaustubh Kislay, Shlok Singh, Soham Joshi, Rohan Dutta, Jay Shim, George Flint, Kevin Zhu
Symbolic Regression remains an NP-Hard problem, with extensive research focusing on AI models for this task. Transformer models have shown promise in Symbolic Regression, but performance suffers with smaller datasets. We propose applying k-fold cross-validation to a transformer-based symbolic regression model trained on a significantly reduced dataset (15,000 data points, down from 500,000). This technique partitions the training data into multiple subsets (folds), iteratively training on some while validating on others. Our aim is to provide an estimate of model generalization and mitigate overfitting issues associated with smaller datasets. Results show that this process improves the model’s output consistency and generalization, with a relative improvement in validation loss of 53.31%, potentially enabling more efficient and accessible symbolic regression in resource-constrained environments.
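The fold partitioning described above can be sketched in a few lines. This is a generic k-fold splitter, not the paper's training code: the `k_fold_indices` helper is hypothetical, folds are contiguous index ranges, and shuffling and the model itself are omitted for brevity.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of k contiguous folds.
    Fold sizes differ by at most one when n_samples is not divisible by k."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


# Ten samples split into three folds; each sample is validated on exactly once.
folds = list(k_fold_indices(10, 3))
```

In a full experiment, a model would be trained on each `train` split and scored on the matching `val` split, and the per-fold validation losses would be averaged to estimate generalization.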
Article 317
Title@2025-06-30 (1): Evaluating the Simulation of Human Personality-Driven Susceptibility to Misinformation with LLMs
Title: Evaluating the Simulation of Human Personality-Driven Susceptibility to Misinformation with LLMs | Bewertung der Simulation menschlicher Persönlichkeits-getriebener Anfälligkeit für Fehlinformationen mit LLMs | 评估模拟人类个性-驱动人对与LLMs的错误信息可视性 2506.23610v1 |
Authors (2): Manuel Pratelli, Marinella Petrocchi
Large language models (LLMs) make it possible to generate synthetic behavioural data at scale, offering an ethical and low-cost alternative to human experiments. Whether such data can faithfully capture psychological differences driven by personality traits, however, remains an open question. We evaluate the capacity of LLM agents, conditioned on Big-Five profiles, to reproduce personality-based variation in susceptibility to misinformation, focusing on news discernment, the ability to judge true headlines as true and false headlines as false. Leveraging published datasets in which human participants with known personality profiles rated headline accuracy, we create matching LLM agents and compare their responses to the original human patterns. Certain trait-misinformation associations, notably those involving Agreeableness and Conscientiousness, are reliably replicated, whereas others diverge, revealing systematic biases in how LLMs internalize and express personality. The results underscore both the promise and the limits of personality-aligned LLMs for behavioral simulation, and offer new insight into modeling cognitive diversity in artificial agents.
Article 318
Title@2025-06-30 (1): KAG-Thinker: Interactive Thinking and Deep Reasoning in LLMs via Knowledge-Augmented Generation
Title: KAG-Thinker: Interactive Thinking and Deep Reasoning in LLMs via Knowledge-Augmented Generation | KAG-Thinker: Interactive Thinking und Deep Reasoning in LLMs über wissensbasierte Generation | KAG- Thinker: 通过知识型一代在LLMs中互动思考和深智 2506.17728v3 |
Authors (19): Dalong Zhang, Jun Xu, Jun Zhou, Lei Liang, Lin Yuan, Ling Zhong, Mengshu Sun, Peilong Zhao, QiWei Wang, Xiaorui Wang, Xinkai Du, YangYang Hou, Yu Ao, ZhaoYang Wang, Zhengke Gui, ZhiYing Yi, Zhongpu Bo, Haofen Wang, Huajun Chen
In this paper, we introduce KAG-Thinker, which upgrades KAG to a multi-turn interactive thinking and deep reasoning framework powered by a dedicated parameter-light large language model (LLM). Our approach constructs a structured thinking process for solving complex problems, enhancing the logical coherence and contextual consistency of the reasoning process in question-answering (Q&A) tasks on domain-specific knowledge bases (KBs) within LLMs. Following the Logical Form guided retrieval and reasoning technology route of KAG, this framework first decomposes complex questions into independently solvable sub-problems (which are also referred to as logical forms) through breadth decomposition. Each such logical form is represented in two equivalent forms, natural language and logical function, and subsequently classified as either a Knowledge Retrieval or Reasoning Analysis task. Dependencies and parameter passing between these tasks are explicitly modeled via logical function interfaces. In the solving process, the Retrieval function performs retrieval tasks: it retrieves one-hop structured and unstructured information for a specified knowledge unit, while the Math and Deduce functions perform reasoning analysis tasks. Secondly, it is worth noting that, in the Knowledge Retrieval sub-problem tasks, LLMs and external knowledge sources are regarded as equivalent KBs. We use the knowledge boundary module to determine the optimal source using self-regulatory mechanisms such as confidence calibration and reflective reasoning, and use the depth solving module to enhance the comprehensiveness of knowledge acquisition…
Article 319
Title@2025-06-30 (1): Semantic-guided Diverse Decoding for Large Language Model
Title: Semantic-guided Diverse Decoding for Large Language Model | Semantisch-geführte Diverse Dekodierung für großes Sprachmodell | 用于大语种的语义制导多种解码模型 2506.23601v1 |
Authors (10): Weijie Shi, Yue Cui, Yaguang Wu, Jingzhi Fang, Shibo Zhang, Mengze Li, Sirui Han, Jia Zhu, Jiajie Xu, Xiaofang Zhou
Diverse decoding of large language models is crucial for applications requiring multiple semantically distinct responses, yet existing methods primarily achieve lexical rather than semantic diversity. This limitation significantly constrains Best-of-N strategies, group-based reinforcement learning, and data synthesis. While temperature sampling and diverse beam search modify token distributions or apply n-gram penalties, they fail to ensure meaningful semantic differentiation. We introduce Semantic-guided Diverse Decoding (SemDiD), which operates directly in embedding space and balances quality with diversity through three complementary mechanisms: orthogonal directional guidance, dynamic inter-group repulsion, and position-debiased probability assessment. SemDiD harmonizes these competing objectives using adaptive gain functions and constraint optimization, ensuring both quality thresholds and maximal semantic differentiation. Experiments show SemDiD consistently outperforms existing methods, improving Best-of-N coverage by 1.4-5.2% across diverse tasks and accelerating RLHF training convergence by 15% while increasing accuracy by up to 2.1%.
Article 320
Title@2025-06-30 (1): FedEx-LoRA: Exact Aggregation for Federated and Efficient Fine-Tuning of Foundation Models
Title: FedEx-LoRA: Exact Aggregation for Federated and Efficient Fine-Tuning of Foundation Models | FedEx-LoRA: Exakte Aggregation für Federated and Efficient Fine-Tuning of Foundation Models | FedEx-LORA:基金会模型的联邦和高效精度 2410.09432v4 |
Authors (3): Raghav Singhal, Kaustubh Ponkshe, Praneeth Vepakomma
Low-Rank Adaptation (LoRA) is a popular technique for efficient fine-tuning of foundation models. However, applying LoRA in federated learning environments, where data is distributed across multiple clients, presents unique challenges. Existing methods rely on traditional federated averaging of LoRA adapters, resulting in inexact updates. To address this, we propose Federated Exact LoRA, or FedEx-LoRA, which adds a residual error term to the pretrained frozen weight matrix. Our approach achieves exact updates with minimal computational and communication overhead, preserving LoRA’s efficiency. We evaluate the method on various models across arithmetic reasoning, commonsense reasoning, natural language understanding and natural language generation tasks, showing consistent performance gains over state-of-the-art methods across multiple settings. Through extensive analysis, we quantify that the deviations in updates from the ideal solution are significant, highlighting the need for exact aggregation. Our method’s simplicity, efficiency, and broad applicability position it as a promising solution for accurate and effective federated fine-tuning of foundation models. Our code is publicly available at https://github.com/RaghavSinghal10/fedex-lora.
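The inexactness described above can be illustrated with a toy calculation: averaging the LoRA factors separately and then multiplying them generally differs from averaging the per-client products. The sketch below is only an illustration under assumptions that are not taken from the paper (two clients, rank-1 adapters, uniform weights, hypothetical helper names); it is not the authors' implementation.

```python
# Toy illustration only: pure-Python matrices, two clients, uniform averaging.
# All helper names and values are hypothetical.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def mean(mats):
    """Element-wise average of equally shaped matrices."""
    n = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[i][j] for m in mats) / n for j in range(cols)] for i in range(rows)]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def sub(X, Y):
    return [[a - b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

# Client adapters: each B_i is 2x1 and each A_i is 1x2 (rank 1).
B = [[[1.0], [0.0]], [[0.0], [1.0]]]
A = [[[1.0, 0.0]], [[0.0, 1.0]]]

ideal = mean([matmul(b, a) for b, a in zip(B, A)])  # average of the products
naive = matmul(mean(B), mean(A))                    # product of the averages
residual = sub(ideal, naive)                        # error left by naive averaging

# Fold the residual into the frozen weights so that the effective update is exact.
W0 = [[0.0, 0.0], [0.0, 0.0]]
W0_corrected = add(W0, residual)
effective = add(W0_corrected, naive)
```

In this toy case `naive` differs from `ideal`, while adding `residual` to the frozen weights makes the effective weights match the ideal average.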
Article 321
Title@2025-06-30 (1): Reachability in symmetric VASS
Title: Reachability in symmetric VASS | Erreichbarkeit in symmetrischer VASS | 对称VASS的可达性 2506.23578v1 |
Authors (2): Łukasz Kamiński, Sławomir Lasota
We investigate the reachability problem in symmetric vector addition systems with states (VASS), where transitions are invariant under a group of permutations of coordinates. One extremal case, the trivial groups, yields general VASS. In another extremal case, the symmetric groups, we show that the reachability problem can be solved in PSPACE, regardless of the dimension of the input VASS (to be contrasted with Ackermannian complexity in general VASS). We also consider other groups, in particular alternating and cyclic ones. Furthermore, motivated by the open status of the reachability problem in data VASS, we estimate the gain in complexity when the group arises as a combination of the trivial and symmetric groups.
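As a rough illustration of the setting, the sketch below models a VASS step and checks whether a transition set is closed under all permutations of coordinates (the symmetric-group case). The encoding of transitions as `(source, delta, target)` triples and the function names are assumptions made for this example, not notation from the paper.

```python
from itertools import permutations

def step(config, transition):
    """Fire a transition if it is enabled; counters must stay non-negative."""
    state, counters = config
    src, delta, dst = transition
    if state != src:
        return None
    new_counters = tuple(c + d for c, d in zip(counters, delta))
    return (dst, new_counters) if all(c >= 0 for c in new_counters) else None

def is_symmetric(transitions):
    """True if permuting the coordinates of any transition yields another transition."""
    ts = set(transitions)
    return all(
        (src, tuple(delta[i] for i in perm), dst) in ts
        for (src, delta, dst) in ts
        for perm in permutations(range(len(delta)))
    )

symmetric_vass = {("p", (1, -1), "q"), ("p", (-1, 1), "q")}
plain_vass = {("p", (1, -1), "q")}
```

Here `symmetric_vass` passes the check because swapping the two counters maps each transition to the other, while `plain_vass` does not.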
Article 322
Title@2025-06-30 (1): MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI
Title: MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI | MMReason: Ein offenes Multi-Modal Multi-Step-Reason-Benchmark für MLLMs in Richtung AGI | MMReason:面向AGI的MLLMs的开放性多模式多模式多步多步理由基准 2506.23563v1 |
Authors (12): Huanjin Yao, Jiaxing Huang, Yawen Qiu, Michael K. Chen, Wenzheng Liu, Wei Zhang, Wenjie Zeng, Xikun Zhang, Jingyi Zhang, Yuxin Song, Wenhao Wu, Dacheng Tao
Reasoning plays a crucial role in advancing Multimodal Large Language Models (MLLMs) toward Artificial General Intelligence. However, existing MLLM benchmarks often fall short in precisely and comprehensively evaluating long-chain reasoning abilities from three key aspects: (1) lack of difficulty and diversity, (2) susceptibility to guessability and memorization, (3) inadequate assessment of intermediate reasoning steps. To fill this gap, we introduce MMReason, a new benchmark designed to precisely and comprehensively evaluate MLLM long-chain reasoning capability with diverse, open-ended, challenging questions. First, we curate challenging questions requiring multi-step reasoning from various fields (i.e., 6 disciplines) and multiple difficulty levels (i.e., from pre-university to university, and from foundational to competition tiers). Second, these questions are reformulated into an open-ended format and filtered using a multi-model voting technique to eliminate shortcut cases related to guessing and memorization, ensuring robust reasoning evaluations. Third, we annotate the questions with detailed step-by-step solutions, and design a reference-based ternary scoring mechanism to reliably assess intermediate reasoning steps. With MMReason, we benchmark popular leading MLLMs and provide an in-depth analysis of their reasoning capabilities. We hope MMReason will serve as a valuable resource for advancing MLLM reasoning research. Code will be available at https://github.com/HJYao00/MMReason.
Article 323
Title@2025-06-30 (1): From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data
Title: From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data | Von der Ausrichtung zur Weiterentwicklung: Bootstrapping Audio-Language Alignment mit synthetischen Daten | 从对齐到推进: 用合成数据推动音频语言对齐 2505.20166v2 |
Authors (2): Chun-Yi Kuan, Hung-yi Lee
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. However, this adaptation process presents two major limitations. First, ALLMs often suffer from catastrophic forgetting, where crucial textual capabilities like instruction-following are lost after training on audio data. In some cases, models may even hallucinate sounds that are not present in the input audio, raising concerns about reliability. Second, achieving cross-modal alignment between audio and language typically relies on large collections of task-specific question-answer pairs for instruction tuning, making it resource-intensive. To address these issues, previous works have leveraged the backbone LLMs to synthesize general-purpose, caption-style alignment data. In this paper, we propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs’ ability to differentiate between present and absent sounds. We further extend our approach to multi-audio scenarios, enabling the model to either explain differences between audio inputs or produce unified captions that describe all inputs, thereby enhancing audio-language alignment. We refer to the entire ALLM training framework as bootstrapping audio-language alignment via synthetic data generation from backbone LLMs (BALSa). Experimental results indicate that our method effectively mitigates audio hallucinations while reliably maintaining strong performance on audio understanding and reasoning benchmarks, as well as instruction-following skills. Moreover, incorporating multi-audio training further enhances the model’s comprehension and reasoning capabilities. Overall, BALSa offers an efficient and scalable approach to developing ALLMs.
Article 324
Title@2025-06-30 (1): FlexRAG: A Flexible and Comprehensive Framework for Retrieval-Augmented Generation
Title: FlexRAG: A Flexible and Comprehensive Framework for Retrieval-Augmented Generation | FlexRAG: Ein flexibler und umfassender Rahmen für die Retrieval-Augmented Generation | FlexRAG: 灵活和综合的回回回一代人框架 2506.12494v2 |
Authors (3): Zhuocheng Zhang, Yang Feng, Min Zhang
Retrieval-Augmented Generation (RAG) plays a pivotal role in modern large language model applications, with numerous existing frameworks offering a wide range of functionalities to facilitate the development of RAG systems. However, we have identified several persistent challenges in these frameworks, including difficulties in algorithm reproduction and sharing, lack of new techniques, and high system overhead. To address these limitations, we introduce FlexRAG, an open-source framework specifically designed for research and prototyping. FlexRAG supports text-based, multimodal, and network-based RAG, providing comprehensive lifecycle support alongside efficient asynchronous processing and persistent caching capabilities. By offering a robust and flexible solution, FlexRAG enables researchers to rapidly develop, deploy, and share advanced RAG systems. Our toolkit and resources are available at https://github.com/ictnlp/FlexRAG.
Article 325
Title@2025-06-30 (1): On Recipe Memorization and Creativity in Large Language Models: Is Your Model a Creative Cook, a Bad Cook, or Merely a Plagiator?
Title: On Recipe Memorization and Creativity in Large Language Models: Is Your Model a Creative Cook, a Bad Cook, or Merely a Plagiator? | Auf Rezept Erinnerung und Kreativität in großen Sprachmodellen: Ist Ihr Modell ein kreativer Koch, ein schlechter Koch oder nur ein Plagiator? | “大语言模型中的食谱记忆和创造性:你的模型是创意烹饪,坏烹饪,还是仅仅一个粉刷器?” 2506.23527v1 |
Authors (2): Jan Kvapil, Martin Fajcik
This work-in-progress investigates the memorization, creativity, and nonsense found in cooking recipes generated by Large Language Models (LLMs). Specifically, we aim (i) to analyze memorization, creativity, and nonsense in LLMs using a small, high-quality set of human judgments and (ii) to evaluate potential approaches to automating such human annotation in order to scale our study to hundreds of recipes. To achieve (i), we conduct a detailed human annotation of 20 preselected recipes generated by an LLM (Mixtral), extracting each recipe’s ingredients and step-by-step actions to assess which elements are memorized (i.e., directly traceable to online sources possibly seen during training) and which arise from genuine creative synthesis or outright nonsense. We find that Mixtral consistently reuses ingredients that can be found in online documents, potentially seen during model training, suggesting strong reliance on memorized content. To achieve aim (ii) and scale our analysis beyond small sample sizes and single LLM validation, we design an "LLM-as-judge" pipeline that automates recipe generation, nonsense detection, parsing of ingredients and recipe steps, and their annotation. For instance, comparing its output against human annotations, the best ingredient extractor and annotator is Llama 3.1+Gemma 2 9B, achieving up to 78% accuracy on ingredient matching. This automated framework enables large-scale quantification of memorization, creativity, and nonsense in generated recipes, providing rigorous evidence of the models’ creative capacities.
Article 326
Title@2025-06-30 (1): NEU-ESC: A Comprehensive Vietnamese dataset for Educational Sentiment analysis and topic Classification toward multitask learning
Title: NEU-ESC: A Comprehensive Vietnamese dataset for Educational Sentiment analysis and topic Classification toward multitask learning | NEU-ESC: Ein umfassender vietnamesischer Datensatz für die Analyse der Lernfähigkeit und das Thema Klassifizierung in Richtung Multitask-Lernen | NEU-ESC:越南综合数据集,用于教育敏感分析和多任务学习的专题分类 2506.23524v1 |
Authors (5): Phan Quoc Hung Mai, Quang Hung Nguyen, Phuong Giang Duong, Hong Hanh Nguyen, Nguyen Tuan Long
In the field of education, understanding students’ opinions through their comments is crucial, especially in the Vietnamese language, where resources remain limited. Existing educational datasets often lack domain relevance and student slang. To address these gaps, we introduce NEU-ESC, a new Vietnamese dataset for Educational Sentiment Classification and Topic Classification, curated from university forums, which offers more samples, richer class diversity, longer texts, and broader vocabulary. In addition, we explore multitask learning using encoder-only language models (BERT), which achieves up to 83.7% and 79.8% accuracy on the sentiment and topic classification tasks, respectively. We also benchmark our dataset and model against other datasets and models, including Large Language Models, and discuss these benchmarks. The dataset is publicly available at: https://huggingface.co/datasets/hung20gg/NEU-ESC.
Article 327
Title@2025-06-30 (1): A Comprehensive Evaluation of Semantic Relation Knowledge of Pretrained Language Models and Humans
Title: A Comprehensive Evaluation of Semantic Relation Knowledge of Pretrained Language Models and Humans | Eine umfassende Bewertung der semantischen Beziehungskenntnisse von vorgebildeten Sprachmodellen und Menschen | 全面评价未受过训练语言模式和人文的语义关系知识 2412.01131v3 |
Authors (4): Zhihan Cao, Hiroaki Yamada, Simone Teufel, Takenobu Tokunaga
Recently, much work has concerned itself with the enigma of what exactly PLMs (pretrained language models) learn about different aspects of language, and how they learn it. One stream of this type of research investigates the knowledge that PLMs have about semantic relations. However, many aspects of semantic relations were left unexplored. Only one relation was considered, namely hypernymy. Furthermore, previous work did not measure humans’ performance on the same task as that solved by the PLMs. This means that at this point in time, there is only an incomplete view of models’ semantic relation knowledge. To address this gap, we introduce a comprehensive evaluation framework covering five relations beyond hypernymy, namely hyponymy, holonymy, meronymy, antonymy, and synonymy. We use six metrics (two newly introduced here) for previously untreated aspects of semantic relation knowledge, namely soundness, completeness, symmetry, asymmetry, prototypicality, and distinguishability, and fairly compare humans and models on the same task. Our extensive experiments involve 16 PLMs, eight masked and eight causal language models. Up to now, only masked language models had been tested, although causal and masked language models treat context differently. Our results reveal a significant knowledge gap between humans and models for almost all semantic relations. Antonymy is the outlier relation where all models perform reasonably well. In general, masked language models perform significantly better than causal language models. Nonetheless, both masked and causal language models are likely to confuse non-antonymy relations with antonymy.
Article 328
Title@2025-06-30 (1): Assessing GPTZero’s Accuracy in Identifying AI vs. Human-Written Essays
Title: Assessing GPTZero’s Accuracy in Identifying AI vs. Human-Written Essays | Beurteilung der Genauigkeit von GPTzero bei der Identifizierung von KI gegen von Menschen geschriebene Essays | 评估GPTZero在识别AI与人类-Written日志中的准确性 2506.23517v1 |
Authors (3): Selin Dik, Osman Erdem, Mehmet Dik
As the use of AI tools by students has become more prevalent, instructors have started using AI detection tools like GPTZero and QuillBot to detect AI-written text. However, the reliability of these detectors remains uncertain. In our study, we focused mostly on the success rate of GPTZero, the most-used AI detector, in identifying AI-generated texts based on different lengths of randomly submitted essays: short (40-100 word count), medium (100-350 word count), and long (350-800 word count). We gathered a data set consisting of twenty-eight AI-generated papers and fifty human-written papers. With this randomized essay data, papers were individually submitted to GPTZero and measured for percentage of AI generation and confidence. The vast majority of the AI-generated papers were detected accurately (with GPTZero rating them 91-100% AI-generated), while results for the human-written essays fluctuated and included a handful of false positives. These findings suggest that although GPTZero is effective at detecting purely AI-generated content, its reliability in distinguishing human-authored texts is limited. Educators should therefore exercise caution when relying solely on AI detection tools.
Article 329
Title@2025-06-30 (1): Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding
Title: Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding | Gumiho: Eine hybride Architektur, um frühe Token in spekulativer Dekodierung zu priorisieren | Gumiho:在投机下限中优先考虑早期物料的混合结构 2503.10135v2 |
Authors (7): Jinze Li, Yixing Xu, Haiduo Huang, Xuanwu Yin, Dong Li, Edith C. H. Ngai, Emad Barsoum
Speculative decoding (SPD) aims to accelerate the auto-regressive token generation process of a target Large Language Model (LLM). Some approaches employ a draft model with multiple heads to predict a sequence of future tokens, where each head handles a token in the sequence. The target LLM verifies the predicted sequence and accepts aligned tokens, enabling efficient multi-token generation. However, existing methods assume that all tokens within a sequence are equally important, employing identical head structures and relying on a single-generation paradigm, either serial or parallel. In contrast, we theoretically demonstrate that initial tokens in the draft sequence are more important than later ones. Building on this insight, we propose Gumiho, a hybrid model combining serial and parallel heads. Specifically, given the critical importance of early tokens, we employ a sophisticated Transformer architecture for the early draft heads in a serial configuration to improve accuracy. For later tokens, we utilize multiple lightweight MLP heads operating in parallel to enhance efficiency. By allocating more advanced model structures and longer running times to the early heads, Gumiho achieves improved overall performance. The experimental results demonstrate that our method outperforms existing approaches, fully validating its effectiveness.
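One intuition for why early draft tokens matter can be sketched with a simplified verification step. The code assumes a greedy exact-match acceptance rule, which is a common simplification and not necessarily the rule used by Gumiho; the function name and token values are hypothetical.

```python
def accepted_prefix(draft_tokens, target_tokens):
    """Keep draft tokens up to, but not including, the first mismatch with the target."""
    accepted = []
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break  # every later draft token is discarded as well
        accepted.append(drafted)
    return accepted

target = ["the", "cat", "sat", "down"]
late_error = ["the", "cat", "sat", "up"]     # wrong only at the last position
early_error = ["a", "cat", "sat", "down"]    # wrong only at the first position
```

Under this rule, `late_error` still yields three accepted tokens, whereas `early_error` yields none even though three of its four tokens are correct.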
Article 330
Title@2025-06-30 (1): LLM Braces: Straightening Out LLM Predictions with Relevant Sub-Updates
Title: LLM Braces: Straightening Out LLM Predictions with Relevant Sub-Updates | LLM-Bremsen: LLM-Vorhersagen mit relevanten Sub-Updates ausgleichen | LLM LLM Braress: 利用相关的子更新实现LLM预测 2503.16334v2 |
Authors (2): Ying Shen, Lifu Huang
Recent findings reveal that much of the knowledge in a Transformer-based Large Language Model (LLM) is encoded in its feed-forward (FFN) layers, where each FFN layer can be interpreted as the summation of sub-updates, each corresponding to a weighted column vector from the FFN’s value parameter matrix that often encodes human-interpretable concepts. In light of this, we hypothesize that model performance and behaviors can be further enhanced and controlled by modulating the contributions of these sub-updates based on their relevance to the input or target output style, and propose LLMBRACES, a novel and efficient method that computes relevance scores associated with value vectors in FFN layers and leverages these scores to dynamically adjust the contribution of sub-updates. By optimizing sub-update contributions, LLMBRACES refines the prediction process, leading to more accurate and reliable outputs, much like a ‘brace’ providing support and stability. Moreover, LLMBRACES can be extended to support conditional control over generation characteristics, such as sentiment, thereby offering fine-grained steering of LLM outputs. Extensive experiments on various LLMs, including Qwen2.5-1.5B, Llama2-7B, and Llama3-8B, demonstrate that LLMBRACES outperforms baseline approaches in both fine-tuning and zero-shot settings while requiring significantly fewer tunable parameters, up to 75% fewer compared to LoRA. Furthermore, LLMBRACES excels in sentiment-controlled generation and toxicity reduction, highlighting its potential for flexible, controlled text generation across applications.
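The view of an FFN layer as a sum of sub-updates can be made concrete with a small sketch. It assumes a ReLU activation, stores each value vector as a row for convenience, and uses hand-picked relevance weights; how LLMBRACES actually computes relevance scores is not described in the abstract, so the `relevance` argument is purely hypothetical.

```python
def ffn_sub_updates(x, keys, values):
    """Return one sub-update per hidden unit: coefficient_i * value_vector_i."""
    coeffs = [max(0.0, sum(k * xi for k, xi in zip(key, x))) for key in keys]  # ReLU(k_i . x)
    return [[m * v for v in value] for m, value in zip(coeffs, values)]

def combine(sub_updates, relevance=None):
    """Sum the sub-updates, optionally rescaling each by a relevance weight."""
    if relevance is None:
        relevance = [1.0] * len(sub_updates)  # plain FFN output
    dim = len(sub_updates[0])
    return [sum(r * u[j] for r, u in zip(relevance, sub_updates)) for j in range(dim)]

x = [1.0, 2.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]

updates = ffn_sub_updates(x, keys, values)
plain_output = combine(updates)
modulated_output = combine(updates, relevance=[1.0, 0.5, 1.0])
```

With unit relevance the sum reproduces the ordinary FFN output; down-weighting the second sub-update halves its contribution while leaving the others unchanged.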
Article 331
Title@2025-06-30 (1): Reinforcement Fine-Tuning Enables MLLMs Learning Novel Tasks Stably
Title: Reinforcement Fine-Tuning Enables MLLMs Learning Novel Tasks Stably | Verstärkte Feinsteuerung ermöglicht MLLMs das Erlernen neuartiger Aufgaben stabil | 强化精细调整启用 MLLMS 学习新创任务 2506.23508v1 |
Authors (13): Zhihao Zhang, Qiaole Dong, Qi Zhang, Jun Zhao, Enyu Zhou, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Yanwei Fu, Tao Ji, Tao Gui, Xuanjing Huang
Post-training algorithms such as Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) are widely used to adapt multimodal large language models to downstream tasks. While effective at task adaptation, their impact on prior knowledge remains unclear. In this paper, we introduce jigsaw puzzles as a novel task absent from existing pretraining corpora and systematically study the behavior of SFT and RFT on an open-source multimodal model, Qwen2.5-VL. Our experiments reveal a sharp trade-off: SFT enables rapid task acquisition but leads to catastrophic forgetting, whereas RFT learns more slowly on novel tasks but maintains prior knowledge. We analyze this phenomenon through the lens of learning dynamics, showing that RFT reinforces correct samples that are naturally aligned with the base model’s probability landscape, mitigating interference with prior knowledge. Moreover, supervised training on correct RFT-simulated rollouts allows SFT to preserve knowledge while rapidly learning new tasks. These findings suggest that data distribution, rather than algorithmic differences, plays a central role in forgetting, and highlight RFT’s potential for stable continual learning in multimodal large language models.
Article 332
Title@2025-06-30 (1): FinEval-KR: A Financial Domain Evaluation Framework for Large Language Models’ Knowledge and Reasoning
Title: FinEval-KR: A Financial Domain Evaluation Framework for Large Language Models’ Knowledge and Reasoning | FinEval-KR: Ein Financial Domain Evaluation Framework für das Wissen und die Vernunft großer Sprachmodelle | FinEval-KR:大语言模式知识和理由说明的财务域评价框架 2506.21591v2 |
Authors (12): Shaoyu Dou, Yutian Shen, Mofan Chen, Zixuan Wang, Jiajie Xu, Qi Guo, Kailai Shao, Chao Chen, Haixiang Hu, Haibo Shi, Min Min, Liwen Zhang
Large Language Models (LLMs) demonstrate significant potential but face challenges in complex financial reasoning tasks requiring both domain knowledge and sophisticated reasoning. Current evaluation benchmarks often fall short because they do not decouple these capability indicators from single-task performance and lack root cause analysis for task failure. To address this, we introduce FinEval-KR, a novel evaluation framework for decoupling and quantifying LLMs’ knowledge and reasoning abilities independently, proposing distinct knowledge score and reasoning score metrics. Inspired by cognitive science, we further propose a cognitive score based on Bloom’s taxonomy to analyze capabilities in reasoning tasks across different cognitive levels. We also release a new open-source Chinese financial reasoning dataset covering 22 subfields to support reproducible research and further advancements in financial reasoning. Our experimental results reveal that LLM reasoning ability and higher-order cognitive ability are the core factors influencing reasoning accuracy. We also specifically find that even top models still face a bottleneck with knowledge application. Furthermore, our analysis shows that specialized financial LLMs generally lag behind the top general large models across multiple metrics.
Article 333
Title@2025-06-30 (1): Thought-Augmented Planning for LLM-Powered Interactive Recommender Agent
Title: Thought-Augmented Planning for LLM-Powered Interactive Recommender Agent | Thought-Augmented Planung für LLM-Powered Interactive Recommender Agent | LLM 授权互动建议代理商的集思广益规划 2506.23485v1 |
Authors (9): Haocheng Yu, Yaxiong Wu, Hao Wang, Wei Guo, Yong Liu, Yawen Li, Yuyang Ye, Junping Du, Enhong Chen
Interactive recommendation is a typical information-seeking task that allows users to interactively express their needs through natural language and obtain personalized recommendations. Large language model-powered (LLM-powered) agents have become a new paradigm in interactive recommendations, effectively capturing users’ real-time needs and enhancing personalized experiences. However, due to limited planning and generalization capabilities, existing formulations of LLM-powered interactive recommender agents struggle to effectively address diverse and complex user intents, such as intuitive, unrefined, or occasionally ambiguous requests. To tackle this challenge, we propose a novel thought-augmented interactive recommender agent system (TAIRA) that addresses complex user intents through distilled thought patterns. Specifically, TAIRA is designed as an LLM-powered multi-agent system featuring a manager agent that orchestrates recommendation tasks by decomposing user needs and planning subtasks, with its planning capacity strengthened through Thought Pattern Distillation (TPD), a thought-augmentation method that extracts high-level thoughts from the agent’s and human experts’ experiences. Moreover, we designed a set of user simulation schemes to generate personalized queries of different difficulties and evaluate the recommendations based on specific datasets. Through comprehensive experiments conducted across multiple datasets, TAIRA exhibits significantly enhanced performance compared to existing methods. Notably, TAIRA shows a greater advantage on more challenging tasks while generalizing effectively on novel tasks, further validating its superiority in managing complex user intents within interactive recommendation systems. The code is publicly available at: https://github.com/Alcein/TAIRA.
Article 334
Title@2025-06-30 (1): CTISum: A New Benchmark Dataset For Cyber Threat Intelligence Summarization
Title: CTISum: A New Benchmark Dataset For Cyber Threat Intelligence Summarization | CTISum: Ein neuer Benchmark-Datensatz für Cyber Threat Intelligence Zusammenfassung | CTISum:网络威胁情报总结的新基准数据集 2408.06576v2 |
Authors (7): Wei Peng, Junmei Ding, Wei Wang, Lei Cui, Wei Cai, Zhiyu Hao, Xiaochun Yun
Cyber Threat Intelligence (CTI) summarization involves generating concise and accurate highlights from web intelligence data, which is critical for providing decision-makers with actionable insights to swiftly detect and respond to cyber threats in the cybersecurity domain. Despite that, the development of efficient techniques for summarizing CTI reports, comprising facts, analytical insights, attack processes, and more, has been hindered by the lack of suitable datasets. To address this gap, we introduce CTISum, a new benchmark dataset designed for the CTI summarization task. Recognizing the significance of understanding attack processes, we also propose a novel fine-grained subtask: attack process summarization, which aims to help defenders assess risks, identify security gaps, and uncover vulnerabilities. Specifically, a multi-stage annotation pipeline is designed to collect and annotate CTI data from diverse web sources, alongside a comprehensive benchmarking of CTISum using extractive, abstractive, and LLM-based summarization methods. Experimental results reveal that current state-of-the-art models face significant challenges when applied to CTISum, highlighting that automatic summarization of CTI reports remains an open research problem. The code and example dataset can be made publicly available at https://github.com/pengwei-iie/CTISum.
Article 335
Title@2025-06-30 (1): Federated Learning-Enabled Hybrid Language Models for Communication-Efficient Token Transmission
Title: Federated Learning-Enabled Hybrid Language Models for Communication-Efficient Token Transmission | Federated Learning-Enabled Hybrid Language Models für kommunikationseffiziente Token-Übertragung | 通信-高效调式传真传播的联邦学习-进进混合语言模式 2507.00082v1 |
Authors (5): Faranaksadat Solat, Joohyung Lee, Mohamed Seif, Dusit Niyato, H. Vincent Poor
Hybrid Language Models (HLMs) combine the low-latency efficiency of Small Language Models (SLMs) on edge devices with the high accuracy of Large Language Models (LLMs) on centralized servers. Unlike traditional end-to-end LLM inference, HLMs reduce latency and communication by invoking LLMs only when local SLM predictions are uncertain, i.e., when token-level confidence is low or entropy is high. However, ambiguous or low-confidence predictions still require frequent offloading to the LLM, leading to significant communication overhead in bandwidth-constrained settings. To address this, we propose FedHLM, a communication-efficient HLM framework that integrates uncertainty-aware inference with Federated Learning (FL). FedHLM’s key innovation lies in collaboratively learning token-level uncertainty thresholds that govern when LLM assistance is needed. Rather than using static or manually tuned thresholds, FedHLM employs FL to optimize these thresholds in a privacy-preserving, distributed manner. Additionally, it leverages embedding-based token representations for Peer-to-Peer (P2P) resolution, enabling clients to reuse tokens inferred by semantically similar peers without engaging the LLM. We further introduce hierarchical model aggregation: edge servers refine local routing policies through client updates, while cross-cluster coordination aligns global decision boundaries. This layered design captures recurring uncertainty patterns, reducing redundant LLM queries. Experiments on large-scale news classification tasks show that FedHLM reduces LLM transmissions by over 95 percent with negligible accuracy loss, making it well-suited for scalable and efficient edge-AI applications.
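The offloading rule described above (defer to the LLM when token-level confidence is low or entropy is high) can be sketched as follows. The threshold values and function names are hypothetical, and the sketch uses fixed thresholds only for illustration; FedHLM learns them through federated learning, which is not shown here.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_offload(probs, conf_threshold, entropy_threshold):
    """Defer to the server-side LLM when the local SLM prediction is uncertain."""
    low_confidence = max(probs) < conf_threshold
    high_entropy = entropy(probs) > entropy_threshold
    return low_confidence or high_entropy

confident = [0.9, 0.05, 0.05]          # SLM is fairly sure of its top token
uncertain = [0.25, 0.25, 0.25, 0.25]   # SLM has no clear preference
```

With illustrative thresholds of 0.6 for confidence and 1.0 nats for entropy, the first distribution is handled locally and the second is sent to the LLM.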
Article 336
Title@2025-06-30 (1): Parenting: Optimizing Knowledge Selection of Retrieval-Augmented Language Models with Parameter Decoupling and Tailored Tuning
Title: Parenting: Optimizing Knowledge Selection of Retrieval-Augmented Language Models with Parameter Decoupling and Tailored Tuning | Parenting: Optimierung der Wissensauswahl von Retrieval-Augmented Language Models mit Parameterentkopplung und maßgeschneidertem Tuning | 亲子关系: 优化使用参数分离和定制调试的检索增强语言模型的知识选择 2410.10360v3 |
Authors (10): Yongxin Xu, Ruizhe Zhang, Xinke Jiang, Yujie Feng, Yuzhen Xiao, Xinyu Ma, Runchuan Zhu, Xu Chu, Junfeng Zhao, Yasha Wang
Retrieval-Augmented Generation (RAG) offers an effective solution to the issues faced by Large Language Models (LLMs) in hallucination generation and knowledge obsolescence by incorporating externally retrieved knowledge. However, existing methods lack effective control mechanisms for integrating internal and external knowledge. Inspired by human cognitive processes, we propose Parenting, a novel framework that decouples, identifies, and purposefully optimizes parameter subspaces related to adherence and robustness. Specifically, Parenting utilizes a key parameter mining method that combines forward and backward propagation signals to localize subspaces representing different capabilities. Then, Parenting employs a type-tailored tuning strategy, applying specific and appropriate optimizations to different subspaces, aiming to achieve a balanced enhancement of both adherence and robustness. Extensive experiments on various datasets and models validate the effectiveness and generalizability of our method.
nan
Article 337
Title@2025-06-30 (1): What to Keep and What to Drop: Adaptive Table Filtering Framework
Title: What to Keep and What to Drop: Adaptive Table Filtering Framework | Was zu halten und was zu fallen: Adaptive Tabelle Filterung Rahmen | 保持和放下什么:适应性表格过滤框架 2506.23463v1 |
Authors (1): Jang Won June
Large language models (LLMs) for table-based reasoning often struggle with large tables due to input length limits. We propose ATF (Adaptive Table Filtering Framework), a modular and question-aware filtering pipeline that prunes uninformative columns and rows using LLM-generated column descriptions, clustering, and sparse-dense alignment scores. ATF integrates seamlessly with existing models (e.g., TAPAS, TAPEX) without retraining. Experiments show that ATF reduces table cells by ~70%, boosting performance on out-of-domain TableQA tasks while causing slight performance drops on Table Fact Verification, where full-table context is more critical. These results highlight ATF’s ability to adaptively balance informativeness and minimalism across tasks.
nan
Article 338
Title@2025-06-30 (1): State and Memory is All You Need for Robust and Reliable AI Agents
Title: State and Memory is All You Need for Robust and Reliable AI Agents | Zustand und Gedächtnis sind alles, was Sie für robuste und zuverlässige KI-Agenten brauchen | 国家记忆是强力和可靠的AI代理所需要的一切 2507.00081v1 |
Authors (15): Matthew Muhoberac, Atharva Parikh, Nirvi Vakharia, Saniya Virani, Aco Radujevic, Savannah Wood, Meghav Verma, Dimitri Metaxotos, Jeyaraman Soundararajan, Thierry Masquelin, Alexander G. Godfrey, Sean Gardner, Dobrila Rudnicki, Sam Michael, Gaurav Chopra
Large language models (LLMs) have enabled powerful advances in natural language understanding and generation. Yet their application to complex, real-world scientific workflows remains limited by challenges in memory, planning, and tool integration. Here, we introduce SciBORG (Scientific Bespoke Artificial Intelligence Agents Optimized for Research Goals), a modular agentic framework that allows LLM-based agents to autonomously plan, reason, and achieve robust and reliable domain-specific task execution. Agents are constructed dynamically from source code documentation and augmented with finite-state automata (FSA) memory, enabling persistent state tracking and context-aware decision-making. This approach eliminates the need for manual prompt engineering and allows for robust, scalable deployment across diverse applications by maintaining context across extended workflows and recovering from tool or execution failures. We validate SciBORG through integration with both physical and virtual hardware, such as microwave synthesizers for executing user-specified reactions, with context-aware decision-making, and demonstrate its use in autonomous multi-step bioassay retrieval from the PubChem database, utilizing multi-step planning, reasoning, and agent-to-agent communication and coordination for the execution of exploratory tasks. Systematic benchmarking shows that SciBORG agents achieve reliable execution, adaptive planning, and interpretable state transitions. Our results show that memory and state awareness are critical enablers of agentic planning and reliability, offering a generalizable foundation for deploying AI agents in complex environments.
nan
Article 339
Title@2025-06-30 (1): Bridge: A Unified Framework to Knowledge Graph Completion via Language Models and Knowledge Representation
Title: Bridge: A Unified Framework to Knowledge Graph Completion via Language Models and Knowledge Representation | Brücke: Ein einheitliches Framework zur Wissensgraphenvervollständigung über Sprachmodelle und Wissensdarstellung | 桥梁:通过语言模式和知识代表性完成知识图的统一框架 2411.06660v3 |
Authors (5): Qiao Qiao, Yuepei Li, Qing Wang, Kang Zhou, Qi Li
Knowledge graph completion (KGC) is the task of inferring missing triples based on existing Knowledge Graphs (KGs). Both structural and semantic information are vital for successful KGC. However, existing methods use either the structural knowledge from KG embeddings or the semantic information from pre-trained language models (PLMs), but not both, leading to suboptimal model performance. Moreover, since PLMs are not trained on KGs, directly using PLMs to encode triples may be inappropriate. To overcome these limitations, we propose a novel framework called Bridge, which jointly encodes structural and semantic information of KGs. Specifically, we strategically encode entities and relations separately by PLMs to better utilize the semantic knowledge of PLMs and enable structured representation learning via a structural learning principle. Furthermore, to bridge the gap between KGs and PLMs, we employ a self-supervised representation learning method called BYOL to fine-tune PLMs with two different views of a triple. Unlike BYOL, which uses augmentation methods to create two semantically similar views of the same image and may thereby alter the semantic information, we strategically separate the triple into two parts to create different views, thus avoiding semantic alteration. Experiments demonstrate that Bridge outperforms the SOTA models on three benchmark datasets.
nan
Article 340
Title@2025-06-30 (1): Mechanistic Interpretability of Emotion Inference in Large Language Models
Title: Mechanistic Interpretability of Emotion Inference in Large Language Models | Mechanistische Interpretation von Emotionsinferenzen in großen Sprachmodellen | 大语言模型情感引因的可解释性 2502.05489v2 |
Authors (6): Ala N. Tak, Amin Banayeeanzade, Anahita Bolourani, Mina Kian, Robin Jia, Jonathan Gratch
Large language models (LLMs) show promising capabilities in predicting human emotions from text. However, the mechanisms through which these models process emotional stimuli remain largely unexplored. Our study addresses this gap by investigating how autoregressive LLMs infer emotions, showing that emotion representations are functionally localized to specific regions in the model. Our evaluation includes diverse model families and sizes and is supported by robustness checks. We then show that the identified representations are psychologically plausible by drawing on cognitive appraisal theory, a well-established psychological framework positing that emotions emerge from evaluations (appraisals) of environmental stimuli. By causally intervening on construed appraisal concepts, we steer the generation and show that the outputs align with theoretical and intuitive expectations. This work highlights a novel way to causally intervene and precisely shape emotional text generation, potentially benefiting safety and alignment in sensitive affective domains.
nan
Article 341
Title@2025-06-29 (7): TuCo: Measuring the Contribution of Fine-Tuning to Individual Responses of LLMs
Title: TuCo: Measuring the Contribution of Fine-Tuning to Individual Responses of LLMs | TuCo: Messung des Beitrags von Feinsteuerung zu individuellen Reaktionen von LLMs | TuCo:衡量微调对LLMM个人对策的贡献 2506.23423v1 |
Authors (3): Felipe Nuti, Tim Franzmeyer, João Henriques
Past work has studied the effects of fine-tuning on large language models’ (LLMs) overall performance on certain tasks. However, a quantitative and systematic method for analyzing its effect on individual outputs is still lacking. Here, we propose a new method for measuring the contribution that fine-tuning makes to individual LLM responses, assuming access to the original pre-trained model. Our method tracks the model’s intermediate hidden states, providing a more fine-grained insight into the effects of fine-tuning than a simple comparison of final outputs from pre-trained and fine-tuned models. We introduce and theoretically analyze an exact decomposition of any fine-tuned LLM into a pre-training component and a fine-tuning component. Empirically, we find that model behavior and performance can be steered by up- or down-scaling the fine-tuning component during the forward pass. Motivated by this finding and our theoretical analysis, we define the Tuning Contribution (TuCo) as the ratio of the magnitudes of the fine-tuning component to the pre-training component. We observe that three prominent adversarial attacks on LLMs circumvent safety measures in a way that reduces TuCo, and that TuCo is consistently lower on prompts where these attacks succeed compared to those where they do not. This suggests that attenuating the effect of fine-tuning on model outputs plays a role in the success of such attacks. In summary, TuCo enables the quantitative study of how fine-tuning influences model behavior and safety, and vice versa.
nan
Article 342
Title@2025-06-29 (7): Datasets for Fairness in Language Models: An In-Depth Survey
Title: Datasets for Fairness in Language Models: An In-Depth Survey | Datensätze für Fairness in Sprachmodellen: Eine In-Depth-Umfrage | 语言模型公平性数据集:内部调查 2506.23411v1 |
Authors (5): Jiale Zhang, Zichong Wang, Avash Palikhe, Zhipeng Yin, Wenbin Zhang
Fairness benchmarks play a central role in shaping how we evaluate language models, yet surprisingly little attention has been given to examining the datasets that these benchmarks rely on. This survey addresses that gap by presenting a broad and careful review of the most widely used fairness datasets in current language model research. We characterize them along several key dimensions, including their origin, scope, content, and intended use, to help researchers better appreciate the assumptions and limitations embedded in these resources. To support more meaningful comparisons and analyses, we introduce a unified evaluation framework that reveals consistent patterns of demographic disparities across datasets and scoring methods. Applying this framework to twenty-four common benchmarks, we highlight the often overlooked biases that can influence conclusions about model fairness and offer practical guidance for selecting, combining, and interpreting these datasets. We also point to opportunities for creating new fairness benchmarks that reflect more diverse social contexts and encourage more thoughtful use of these tools going forward. All code, data, and detailed results are publicly available at https://github.com/vanbanTruong/Fairness-in-Large-Language-Models/tree/main/datasets to promote transparency and reproducibility across the research community.
nan
Article 343
Title@2025-06-29 (7): Automating Adjudication of Cardiovascular Events Using Large Language Models
Title: Automating Adjudication of Cardiovascular Events Using Large Language Models | Automatisieren der Adjudikation von Herz-Kreislauf-Ereignissen mit großen Sprachmodellen | 使用大语言模型自动裁决心血管事件 2503.17222v2 |
Authors (5): Sonish Sivarajkumar, Kimia Ameri, Chuqin Li, Yanshan Wang, Min Jiang
Cardiovascular events, such as heart attacks and strokes, remain a leading cause of mortality globally, necessitating meticulous monitoring and adjudication in clinical trials. This process, traditionally performed manually by clinical experts, is time-consuming, resource-intensive, and prone to inter-reviewer variability, potentially introducing bias and hindering trial progress. This study addresses these critical limitations by presenting a novel framework for automating the adjudication of cardiovascular events in clinical trials using Large Language Models (LLMs). We developed a two-stage approach: first, employing an LLM-based pipeline for event information extraction from unstructured clinical data and second, using an LLM-based adjudication process guided by a Tree of Thoughts approach and clinical endpoint committee (CEC) guidelines. Using cardiovascular event-specific clinical trial data, the framework achieved an F1-score of 0.82 for event extraction and an accuracy of 0.68 for adjudication. Furthermore, we introduce the CLEART score, a novel, automated metric specifically designed for evaluating the quality of AI-generated clinical reasoning in adjudicating cardiovascular events. This approach demonstrates significant potential for substantially reducing adjudication time and costs while maintaining high-quality, consistent, and auditable outcomes in clinical trials. The reduced variability and enhanced standardization also allow for faster identification and mitigation of risks associated with cardiovascular therapies.
nan
Article 344
Title@2025-06-29 (7): Teaching a Language Model to Speak the Language of Tools
Title: Teaching a Language Model to Speak the Language of Tools | Ein Sprachmodell lehren, um die Sprache der Werkzeuge zu sprechen | 教授一种语言模式,讲工具语言 2506.23394v1 |
Authors (1): Simeon Emanuilov
External tool integration through function-calling is essential for practical language model applications, yet most multilingual models lack reliable tool-use capabilities in non-English languages. Even state-of-the-art multilingual models struggle with determining when to use tools and generating the structured outputs required for function calls, often exhibiting language confusion when prompted in lower-resource languages. This work presents a methodology for adapting existing language models to enable robust tool use in any target language, using Bulgarian as a case study. The approach involves continued training of the BgGPT model series (2.6B, 9B, 27B parameters) on a novel bilingual dataset of 10,035 function-calling examples designed to support standardized protocols like MCP (Model Context Protocol). The research introduces TUCAN (Tool-Using Capable Assistant Navigator), which achieves up to 28.75% improvement in function-calling accuracy over base models while preserving core language understanding, as verified on established Bulgarian benchmarks. Beyond accuracy gains, TUCAN models demonstrate production-ready response formatting with clean, parsable function calls, contrasting with the verbose and inconsistent outputs of base models. The models, evaluation framework, and dataset are released to enable replication for other languages. This work demonstrates a practical approach for extending tool-augmented capabilities beyond English-centric systems.
nan
Article 345
Title@2025-06-29 (7): Hierarchical Memory Organization for Wikipedia Generation
Title: Hierarchical Memory Organization for Wikipedia Generation | Hierarchische Speicherorganisation für Wikipedia Generation | 维基百科世代等级记忆组织 2506.23393v1 |
Authors (9): Eugene J. Yu, Dawei Zhu, Yifan Song, Xiangyu Wong, Jiebin Zhang, Wenxuan Shi, Xiaoguang Li, Qun Liu, Sujian Li
Generating Wikipedia articles autonomously is a challenging task requiring the integration of accurate, comprehensive, and well-structured information from diverse sources. This paper introduces the Memory Organization-based Generation (MOG) framework, a novel approach to address these challenges by leveraging a hierarchical memory architecture. MOG extracts fine-grained memory units from web documents, recursively organizes them into a Wikipedia-style hierarchical structure, and uses this structure to guide the generation process. This ensures alignment between memory and the article outline, improving both informativeness and verifiability while minimizing hallucinations. Additionally, a citation module is implemented to enhance traceability by linking every generated sentence to specific memory units. Evaluations on our newly created WikiStart dataset demonstrate that MOG outperforms baseline methods in producing informative and reliable articles, making it particularly robust in real-world scenarios.
nan
Article 346
Title@2025-06-29 (7): Comparative Evaluation of ChatGPT and DeepSeek Across Key NLP Tasks: Strengths, Weaknesses, and Domain-Specific Performance
Title: Comparative Evaluation of ChatGPT and DeepSeek Across Key NLP Tasks: Strengths, Weaknesses, and Domain-Specific Performance | Vergleichende Bewertung von ChatGPT und DeepSeek über zentrale NLP-Aufgaben: Stärken, Schwächen und Domain-spezifische Leistung | 国家劳工政策关键任务:力量、弱点和具体具体绩效 2506.18501v2 |
Authors (2): Wael Etaiwi, Bushra Alhijawi
The increasing use of large language models (LLMs) in natural language processing (NLP) tasks has sparked significant interest in evaluating their effectiveness across diverse applications. While models like ChatGPT and DeepSeek have shown strong results in many NLP domains, a comprehensive evaluation is needed to understand their strengths, weaknesses, and domain-specific abilities. This is critical as these models are applied to various tasks, from sentiment analysis to more nuanced tasks like textual entailment and translation. This study aims to evaluate ChatGPT and DeepSeek across five key NLP tasks: sentiment analysis, topic classification, text summarization, machine translation, and textual entailment. A structured experimental protocol is used to ensure fairness and minimize variability. Both models are tested with identical, neutral prompts and evaluated on two benchmark datasets per task, covering domains like news, reviews, and formal/informal texts. The results show that DeepSeek excels in classification stability and logical reasoning, while ChatGPT performs better in tasks requiring nuanced understanding and flexibility. These findings provide valuable insights for selecting the appropriate LLM based on task requirements.
nan
Article 347
Title@2025-06-29 (7): Perspective Dial: Measuring Perspective of Text and Guiding LLM Outputs
Title: Perspective Dial: Measuring Perspective of Text and Guiding LLM Outputs | Perspective Dial: Perspective of Text and Guiding LLM Outputs messen | 计量文字和引导性LLM产出 2506.23377v1 |
Authors (3): Taejin Kim, Siun-Chuon Mau, Konrad Vesey
Large language models (LLMs) are used in a variety of mission-critical roles. Due to the rapidly developing nature of LLMs, there is a lack of quantifiable understanding of the bias and perspective associated with LLM output. Inspired by this need, this paper considers the broader issue of the perspective or viewpoint of general text and the perspective control of LLM output. Perspective-Dial consists of two main components: (1) a metric space, dubbed Perspective Space, that enables quantitative measurements of different perspectives regarding a topic, and (2) Systematic Prompt Engineering, which utilizes greedy-coordinate descent to control LLM output perspective based on measurement feedback from the Perspective Space. The empirical nature of the approach allows progress to sidestep a principled understanding of perspective or bias, effectively quantifying and adjusting outputs for a variety of topics. Potential applications include detection, tracking, and mitigation of LLM bias; narrative detection, sensemaking, and tracking in public discourse; and debate bots advocating a given perspective.
nan
Article 348
Title@2025-06-29 (7): Emotional RAG LLMs: Reading Comprehension for the Open Internet
Title: Emotional RAG LLMs: Reading Comprehension for the Open Internet | Emotionale RAG LLMs: Leseverständnis für das offene Internet | 情感性RAG LLM: 阅读开放互联网理解 2408.11189v2 |
Authors (5): Benjamin Reichman, Adar Avsian, Kartik Talamadupula, Toshish Jawale, Larry Heck
Queries to large language models (LLMs) can be divided into two parts: the instruction/question and the accompanying context. The context for retrieval-augmented generation (RAG) systems in most benchmarks comes from Wikipedia-like texts written in a neutral and factual tone. However, real-world RAG applications often retrieve internet-based text with diverse tones and linguistic styles, posing challenges for downstream tasks. This paper introduces (a) a dataset that transforms RAG-retrieved passages into emotionally inflected and sarcastic text, (b) an emotion translation model for adapting text to different tones, and (c) a prompt-based method to improve LLMs’ pragmatic interpretation of retrieved text.
nan
Article 349
Title@2025-06-29 (7): You Sound a Little Tense: L2 Tailored Clear TTS Using Durational Vowel Properties
Title: You Sound a Little Tense: L2 Tailored Clear TTS Using Durational Vowel Properties | Sie klingen eine kleine Tense: L2 Maßgeschneiderte klare TTS Verwendung von Durational Vowel Properties | 你听起来有点紧张: L2 使用时空声波属性的 L2 定制的清除 TTS 2506.23367v1 |
Authors (5): Paige Tuttösí, H. Henny Yeung, Yue Wang, Jean-Julien Aucouturier, Angelica Lim
We present the first text-to-speech (TTS) system tailored to second language (L2) speakers. We use duration differences between American English tense (longer) and lax (shorter) vowels to create a “clarity mode” for Matcha-TTS. Our perception studies showed that French-L1, English-L2 listeners made fewer transcription errors (by at least 9.15%) when using our clarity mode, and found it more encouraging and respectful than overall slowed-down speech. Remarkably, listeners were not aware of these effects: despite the decreased word error rate in clarity mode, listeners still believed that slowing all target words was the most intelligible, suggesting that actual intelligibility does not correlate with perceived intelligibility. Additionally, we found that Whisper-ASR did not use the same cues as L2 speakers to differentiate difficult vowels and is not sufficient to assess the intelligibility of TTS systems for these individuals.
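For readers unfamiliar with the word error rate mentioned above, the following sketch computes the standard metric: the word-level edit distance between a reference and a transcription, divided by the reference length. The example sentences are invented for illustration and are not taken from the study; the function assumes a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# Invented tense/lax vowel confusion: "ship" heard as "sheep".
print(word_error_rate("the ship sails", "the sheep sails"))
```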
nan
Article 350
Title@2025-06-29 (7): Density, asymmetry and citation dynamics in scientific literature
Title: Density, asymmetry and citation dynamics in scientific literature | Dichte, Asymmetrie und Zitierdynamik in der wissenschaftlichen Literatur | 科学文献中的密度、不对称和引用动态 2506.23366v1 |
Authors (2): Nathaniel Imel, Zachary Hafen
Scientific behavior is often characterized by a tension between building upon established knowledge and introducing novel ideas. Here, we investigate whether this tension is reflected in the relationship between the similarity of a scientific paper to previous research and its eventual citation rate. To operationalize similarity to previous research, we introduce two complementary metrics to characterize the local geometry of a publication’s semantic neighborhood: (1) density ($\rho$), defined as the ratio between a fixed number of previously published papers and the minimum distance enclosing those papers in a semantic embedding space, and (2) asymmetry ($\alpha$), defined as the average directional difference between a paper and its nearest neighbors. We tested the predictive relationship between these two metrics and a paper’s subsequent citation rate using a Bayesian hierarchical regression approach, surveying roughly 53,000 publications across nine academic disciplines and five different document embeddings. While the individual effects of $\rho$ on citation count are small and variable, incorporating density-based predictors consistently improves out-of-sample prediction when added to baseline models. These results suggest that the density of a paper’s surrounding scientific literature may carry modest but informative signals about its eventual impact. Meanwhile, we find no evidence that publication asymmetry improves model predictions of citation rates. Our work provides a scalable framework for linking document embeddings to scientometric outcomes and highlights new questions regarding the role that semantic similarity plays in shaping the dynamics of scientific reward.
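One possible reading of the two metrics is sketched below. It is an interpretation of the abstract's wording, not the authors' code: density is taken as k divided by the distance to the k-th nearest earlier paper, and asymmetry as the norm of the mean offset vector to those k neighbors. The function name, the Euclidean distance, and the toy data are all assumptions.

```python
import numpy as np

def density_and_asymmetry(paper, previous, k):
    """paper: (d,) embedding; previous: (n, d) embeddings of earlier papers."""
    offsets = previous - paper                 # vectors from the paper to each earlier paper
    dists = np.linalg.norm(offsets, axis=1)    # Euclidean distances (an assumed metric)
    nearest = np.argsort(dists)[:k]            # indices of the k nearest earlier papers
    radius = dists[nearest].max()              # smallest radius enclosing those k papers
    rho = k / radius                           # density: fixed count over enclosing distance
    # Assumed reading of "average directional difference": length of the mean offset.
    alpha = np.linalg.norm(offsets[nearest].mean(axis=0))
    return rho, alpha

paper = np.array([0.0, 0.0])
balanced = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 3.0]])  # neighbors on opposite sides
lopsided = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 5.0]])   # neighbors on one side

print(density_and_asymmetry(paper, balanced, k=2))
print(density_and_asymmetry(paper, lopsided, k=2))
```

In this toy setup, neighbors arranged symmetrically around the paper cancel out and give an asymmetry of zero, while neighbors clustered on one side give a positive value. The sketch divides by the enclosing radius, so it assumes the paper's embedding does not coincide exactly with all of its k nearest neighbors.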
nan
Article 351
Title@2025-06-29 (7): ChipXplore: Natural Language Exploration of Hardware Designs and Libraries
Title: ChipXplore: Natural Language Exploration of Hardware Designs and Libraries | ChipXplore: Natural Language Exploration von Hardware-Designs und Bibliotheken | ChipXplore: 硬件设计和图书馆的自然语言探索 2407.12749v3 |
Authors (3): Manar Abdelatty, Jacob Rosenstein, Sherief Reda
Hardware design workflows rely on Process Design Kits (PDKs) from different fabrication nodes, each containing standard cell libraries optimized for speed, power, or density. Engineers typically navigate between the design and target PDK to make informed decisions, such as selecting gates for area optimization or enhancing the speed of the critical path. However, this process is often manual, time-consuming, and prone to errors. To address this, we present ChipXplore, a multi-agent collaborative framework powered by large language models that enables engineers to query hardware designs and PDKs using natural language. By exploiting the structured nature of PDK and hardware design data, ChipXplore retrieves relevant information through text-to-SQL and text-to-Cypher customized workflows. The framework achieves an execution accuracy of 97.39% in complex natural language queries and improves productivity by making retrieval 5.63x faster while reducing errors by 5.25x in user studies. Compared to generic workflows, ChipXplore’s customized workflow is capable of orchestrating reasoning and planning over multiple databases, improving accuracy by 29.78%. ChipXplore lays the foundation for building autonomous agents capable of tackling diverse physical design tasks that require PDK and hardware design awareness.
nan
Article 352
Title@2025-06-29 (7): Distillation and Refinement of Reasoning in Small Language Models for Document Re-ranking
Title: Distillation and Refinement of Reasoning in Small Language Models for Document Re-ranking | Destillieren und Verfeinern von Vernunft in kleinen Sprachmodellen für die Neurangierung von Dokumenten | 用于文件排序的小型语文模式中理由推理的提炼和精炼 2504.03947v3 |
Authors (2): Chris Samarinas, Hamed Zamani
We present a novel approach for training small language models for reasoning-intensive document ranking that combines knowledge distillation with reinforcement learning optimization. While existing methods often rely on expensive human annotations or large black-box language models, our methodology leverages web data and a teacher LLM to automatically generate high-quality training examples with relevance explanations. By framing document ranking as a reinforcement learning problem and incentivizing explicit reasoning capabilities, we train a compact 3B parameter language model that achieves state-of-the-art performance on the BRIGHT benchmark. Our model ranks third on the leaderboard while using substantially fewer parameters than other approaches, outperforming models that are over 20 times larger. Through extensive experiments, we demonstrate that generating explanations during inference, rather than directly predicting relevance scores, enables more effective reasoning with smaller language models. The self-supervised nature of our method offers a scalable and interpretable solution for modern information retrieval systems.
nan
Article 353
Title@2025-06-29 (7): Potemkin Understanding in Large Language Models
Title: Potemkin Understanding in Large Language Models | Potemkin Verständnis in großen Sprachmodellen | 大语言模型中的波坦金理解 2506.21521v2 |
Authors (4): Marina Mancoridis, Bec Weeks, Keyon Vafa, Sendhil Mullainathan
Large language models (LLMs) are regularly evaluated using benchmark datasets. But what justifies making inferences about an LLM’s capabilities based on its answers to a curated set of questions? This paper first introduces a formal framework to address this question. The key is to note that the benchmarks used to test LLMs – such as AP exams – are also those used to test people. However, this raises an implication: these benchmarks are only valid tests if LLMs misunderstand concepts in ways that mirror human misunderstandings. Otherwise, success on benchmarks only demonstrates potemkin understanding: the illusion of understanding driven by answers irreconcilable with how any human would interpret a concept. We present two procedures for quantifying the existence of potemkins: one using a specially designed benchmark in three domains, the other using a general procedure that provides a lower bound on their prevalence. We find that potemkins are ubiquitous across models, tasks, and domains. We also find that these failures reflect not just incorrect understanding, but deeper internal incoherence in concept representations.
nan
Article 354
Title@2025-06-29 (7): I see what you mean: Co-Speech Gestures for Reference Resolution in Multimodal Dialogue
Title: I see what you mean: Co-Speech Gestures for Reference Resolution in Multimodal Dialogue | Ich verstehe, was Sie meinen: Co-Speech Gestures for Reference Resolution in Multimodal Dialogue | 我理解你的意思:在多模式对话中,用共同语音手势解决参考问题 2503.00071v3 |
Authors (4): Esam Ghaleb, Bulat Khaertdinov, Aslı Özyürek, Raquel Fernández
In face-to-face interaction, we use multiple modalities, including speech and gestures, to communicate information and resolve references to objects. However, how representational co-speech gestures refer to objects remains understudied from a computational perspective. In this work, we address this gap by introducing a multimodal reference resolution task centred on representational gestures, while simultaneously tackling the challenge of learning robust gesture embeddings. We propose a self-supervised pre-training approach to gesture representation learning that grounds body movements in spoken language. Our experiments show that the learned embeddings align with expert annotations and have significant predictive power. Moreover, reference resolution accuracy further improves when (1) using multimodal gesture representations, even when speech is unavailable at inference time, and (2) leveraging dialogue history. Overall, our findings highlight the complementary roles of gesture and speech in reference resolution, offering a step towards more naturalistic models of human-machine interaction.
nan
Article 355
Title@2025-06-29 (7): TigerLLM - A Family of Bangla Large Language Models
Title: TigerLLM - A Family of Bangla Large Language Models | TigerLLM - Eine Familie von Bangla Große Sprachmodelle | TegerLLLM - 孟加拉大语言模式大家庭 2503.10995v3 |
Authors (2): Nishat Raihan, Marcos Zampieri
The development of Large Language Models (LLMs) remains heavily skewed towards English and a few other high-resource languages. This linguistic disparity is particularly evident for Bangla, the 5th most spoken language. A few initiatives have attempted to create open-source Bangla LLMs, but their performance still lags behind that of models for high-resource languages and their reproducibility is limited. To address this gap, we introduce TigerLLM, a family of Bangla LLMs. Our results demonstrate that these models surpass all open-source alternatives and also outperform larger proprietary models like GPT-3.5 across standard benchmarks, establishing TigerLLM as the new baseline for future Bangla language modeling.
nan
Article 356
Title@2025-06-29 (7): WebDancer: Towards Autonomous Information Seeking Agency
Title: WebDancer: Towards Autonomous Information Seeking Agency | WebDancer: Auf dem Weg zu einer autonomen Informationsagentur | WebDancer:走向自主信息搜索机构 2505.22648v2 |
Authors (13): Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Gang Fu, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
Addressing intricate real-world problems necessitates in-depth information seeking and multi-step reasoning. Recent progress in agentic systems, exemplified by Deep Research, underscores the potential for autonomous multi-step research. In this work, we present a cohesive paradigm for building end-to-end information-seeking agents from a data-centric and training-stage perspective. Our approach consists of four key stages: (1) browsing data construction, (2) trajectory sampling, (3) supervised fine-tuning for effective cold start, and (4) reinforcement learning for enhanced generalisation. We instantiate this framework in WebDancer, a web agent based on ReAct. Empirical evaluations on the challenging information-seeking benchmarks GAIA and WebWalkerQA demonstrate the strong performance of WebDancer, achieving considerable results and highlighting the efficacy of our training paradigm. Further analysis of agent training provides valuable insights and actionable, systematic pathways for developing more capable agentic models. The code and demo will be released at https://github.com/Alibaba-NLP/WebAgent.
nan
Article 357
Title@2025-06-29 (7): ATGen: A Framework for Active Text Generation
Title: ATGen: A Framework for Active Text Generation | ATGen: Ein Framework für die aktive Textgenerierung | ATGen: 主动生成文本的框架 2506.23342v1 |
Authors (15): Akim Tsvigun, Daniil Vasilev, Ivan Tsvigun, Ivan Lysenko, Talgat Bektleuov, Aleksandr Medvedev, Uliana Vinogradova, Nikita Severin, Mikhail Mozikov, Andrey Savchenko, Rostislav Grigorev, Ramil Kuleev, Fedor Zhdanov, Artem Shelmanov, Ilya Makarov
Active learning (AL) has demonstrated remarkable potential in reducing the annotation effort required for training machine learning models. However, despite the surging popularity of natural language generation (NLG) tasks in recent years, the application of AL to NLG has been limited. In this paper, we introduce Active Text Generation (ATGen) - a comprehensive framework that bridges AL with text generation tasks, enabling the application of state-of-the-art AL strategies to NLG. Our framework simplifies AL-empowered annotation in NLG tasks using both human annotators and automatic annotation agents based on large language models (LLMs). The framework supports LLMs deployed as services, such as ChatGPT and Claude, or operated on-premises. Furthermore, ATGen provides a unified platform for smooth implementation and benchmarking of novel AL strategies tailored to NLG tasks. Finally, we present evaluation results for state-of-the-art AL strategies across diverse settings and multiple text generation tasks. We show that ATGen reduces both the effort of human annotators and costs associated with API calls to LLM-based annotation agents. The code of the framework is available on GitHub under the MIT license. The video presentation is available at http://atgen-video.nlpresearch.group
nan
Article 358
Title@2025-06-29 (7): Information Loss in LLMs’ Multilingual Translation: The Role of Training Data, Language Proximity, and Language Family
Title: Information Loss in LLMs’ Multilingual Translation: The Role of Training Data, Language Proximity, and Language Family | Informationsverlust in der Mehrsprachigen Übersetzung von LLMs: Die Rolle von Trainingsdaten, Sprachnähe und Sprachfamilie | LLM女士多种语文翻译信息损失:培训数据的作用、语言接近和语言家庭 2506.23340v1 |
Authors (5): Yumeng Lin, Xufeng Duan, David Haslett, Yige Chen, Zhenguang G. Cai
Large language models have achieved impressive progress in multilingual translation, yet they continue to face challenges with certain language pairs, particularly those with limited training data or significant linguistic divergence from English. This study systematically investigates how training data, language proximity, and language family affect information loss in multilingual translation. We evaluate two large language models, GPT-4 and Llama 2, by performing round-trip translations. Translation quality is assessed using BLEU scores and BERT similarity metrics. Our results reveal a robust interaction between training data size and language distance: while abundant training data can mitigate the effects of linguistic divergence, languages structurally closer to English consistently yield higher translation quality in low-resource conditions. Among various distance metrics, orthographic, phylogenetic, syntactic, and geographical distances emerge as strong predictors of translation performance. Language family also exerts an independent influence. These findings contribute to a deeper understanding of the linguistic constraints shaping multilingual translation in large language models, emphasizing that translation quality is shaped not only by data volume but also by structural and typological relationships between languages.
nan
Article 359
Title@2025-06-29 (7): Tracing Intricate Cues in Dialogue: Joint Graph Structure and Sentiment Dynamics for Multimodal Emotion Recognition
Title: Tracing Intricate Cues in Dialogue: Joint Graph Structure and Sentiment Dynamics for Multimodal Emotion Recognition | Intricate Cues im Dialog zu verfolgen: Gemeinsame Graphenstruktur und Stimmungsdynamik für multimodale Emotionserkennung | 对话中的追踪源数:多模式情感认知的联合图表结构和感知动态 2407.21536v2 |
Authors (3): Jiang Li, Xiaoping Wang, Zhigang Zeng
Multimodal emotion recognition in conversation (MERC) has garnered substantial research attention recently. Existing MERC methods face several challenges: (1) they fail to fully harness direct inter-modal cues, possibly leading to less-than-thorough cross-modal modeling; (2) they concurrently extract information from the same and different modalities at each network layer, potentially triggering conflicts from the fusion of multi-source data; (3) they lack the agility required to detect dynamic sentimental changes, perhaps resulting in inaccurate classification of utterances with abrupt sentiment shifts. To address these issues, a novel approach named GraphSmile is proposed for tracking intricate emotional cues in multimodal dialogues. GraphSmile comprises two key components, i.e., GSF and SDP modules. GSF ingeniously leverages graph structures to alternately assimilate inter-modal and intra-modal emotional dependencies layer by layer, adequately capturing cross-modal cues while effectively circumventing fusion conflicts. SDP is an auxiliary task to explicitly delineate the sentiment dynamics between utterances, promoting the model’s ability to distinguish sentimental discrepancies. GraphSmile is effortlessly applied to multimodal sentiment analysis in conversation (MSAC), thus enabling simultaneous execution of MERC and MSAC tasks. Empirical results on multiple benchmarks demonstrate that GraphSmile can handle complex emotional and sentimental patterns, significantly outperforming baseline models.
nan
Article 360
Title@2025-06-29 (7): Automated Vehicles Should be Connected with Natural Language
Title: Automated Vehicles Should be Connected with Natural Language | Automatisierte Fahrzeuge sollten mit natürlicher Sprache verbunden werden | 自动车辆应与自然语言连接 2507.01059v1 |
Authors (6): Xiangbo Gao, Keshu Wu, Hao Zhang, Kexin Tian, Yang Zhou, Zhengzhong Tu
Multi-agent collaborative driving promises improvements in traffic safety and efficiency through collective perception and decision making. However, existing communication media – including raw sensor data, neural network features, and perception results – suffer limitations in bandwidth efficiency, information completeness, and agent interoperability. Moreover, traditional approaches have largely ignored decision-level fusion, neglecting critical dimensions of collaborative driving. In this paper we argue that addressing these challenges requires a transition from purely perception-oriented data exchanges to explicit intent and reasoning communication using natural language. Natural language balances semantic density and communication bandwidth, adapts flexibly to real-time conditions, and bridges heterogeneous agent platforms. By enabling the direct communication of intentions, rationales, and decisions, it transforms collaborative driving from reactive perception-data sharing into proactive coordination, advancing safety, efficiency, and transparency in intelligent transportation systems.
nan
Article 361
Title@2025-06-29 (7): GaussMaster: An LLM-based Database Copilot System
Title: GaussMaster: An LLM-based Database Copilot System | GaußMaster: Ein LLM-basiertes Datenbank-Copilot-System | GaussMaster:以LLM为基础的数据库联合试验系统 2506.23322v1 |
Authors (7): Wei Zhou, Ji Sun, Xuanhe Zhou, Guoliang Li, Luyang Liu, Hao Wu, Tianyuan Wang
In the financial industry, data is the lifeblood of operations, and DBAs shoulder significant responsibilities for SQL tuning, database deployment, diagnosis, and service repair. In recent years, both database vendors and customers have increasingly turned to autonomous database platforms in an effort to alleviate the heavy workload of DBAs. However, existing autonomous database platforms are limited in their capabilities, primarily addressing single-point issues such as NL2SQL, anomaly detection, and SQL tuning. Manual intervention remains a necessity for comprehensive database maintenance. GaussMaster aims to revolutionize this landscape by introducing an LLM-based database copilot system. This innovative solution is designed not only to assist developers in writing efficient SQL queries but also to provide comprehensive care for database services. When database instances exhibit abnormal behavior, GaussMaster is capable of orchestrating the entire maintenance process automatically. It achieves this by analyzing hundreds of metrics and logs, employing a Tree-of-thought approach to identify root causes, and invoking appropriate tools to resolve issues. We have successfully implemented GaussMaster in real-world scenarios, such as the banking industry, where it has achieved zero human intervention for over 34 database maintenance scenarios. In this paper, we present significant improvements in these tasks with code at https://gitcode.com/opengauss/openGauss-GaussMaster.
nan
Article 362
Title@2025-06-29 (7): Creativity in AI: Progresses and Challenges
Title: Creativity in AI: Progresses and Challenges | Kreativität in der KI: Fortschritte und Herausforderungen | 大赦国际的创造性:进展和挑战 2410.17218v5 |
Authors (4): Mete Ismayilzada, Debjit Paul, Antoine Bosselut, Lonneke van der Plas
Creativity is the ability to produce novel, useful, and surprising ideas, and has been widely studied as a crucial aspect of human cognition. Machine creativity, on the other hand, has been a long-standing challenge. With the rise of advanced generative AI, there has been renewed interest and debate regarding AI’s creative capabilities. Therefore, it is imperative to revisit the state of creativity in AI and identify key progress and remaining challenges. In this work, we survey leading works studying the creative capabilities of AI systems, focusing on creative problem-solving, linguistic, artistic, and scientific creativity. Our review suggests that while the latest AI models are largely capable of producing linguistically and artistically creative outputs such as poems, images, and musical pieces, they struggle with tasks that require creative problem-solving, abstract thinking, and compositionality, and their generations suffer from a lack of diversity and originality, long-range incoherence, and hallucinations. We also discuss key questions concerning copyright and authorship issues with generative models. Furthermore, we highlight the need for a comprehensive evaluation of creativity that is process-driven and considers several dimensions of creativity. Finally, we propose future research directions to improve the creativity of AI outputs, drawing inspiration from cognitive science and psychology.
nan
Article 363
Title@2025-06-29 (7): AutoToM: Scaling Model-based Mental Inference via Automated Agent Modeling
Title: AutoToM: Scaling Model-based Mental Inference via Automated Agent Modeling | AutoToM: Skalierung modellbasierter mentaler Schlussfolgerungen über Automatisierte Agentenmodellierung | AutoToM:通过自动代理建模增强基于模型的心理推断 2502.15676v2 |
Authors (5): Zhining Zhang, Chuanyang Jin, Mung Yao Jia, Shunchi Zhang, Tianmin Shu
Theory of Mind (ToM), the ability to understand people’s minds based on their behavior, is key to developing socially intelligent agents. Current approaches to ToM reasoning either rely on prompting Large Language Models (LLMs), which are prone to systematic errors, or use handcrafted, rigid agent models for model-based inference, which are more robust but fail to generalize across domains. In this work, we introduce AutoToM, an automated agent modeling method for scalable, robust, and interpretable mental inference. Given a ToM problem, AutoToM first proposes an initial agent model and then performs automated Bayesian inverse planning based on this model, leveraging an LLM backend. Guided by inference uncertainty, it iteratively refines the model by introducing additional mental variables and/or incorporating more timesteps in the context. Across five diverse benchmarks, AutoToM outperforms existing ToM methods and even large reasoning models. Additionally, we show that AutoToM can produce human-like confidence estimates and enable online mental inference for embodied decision-making.
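The Bayesian inverse planning step mentioned above can be illustrated with a toy example. This is a minimal sketch of generic goal inference by Bayes' rule, not AutoToM's implementation: the function name `goal_posterior`, the goal and action names, and the hand-written likelihood table are all assumptions made for illustration (in the paper, an LLM backend supplies the model-based estimates).

```python
def goal_posterior(prior, action_likelihood, observed_actions):
    """Posterior over an agent's goals given its observed actions.

    prior: dict goal -> P(goal)
    action_likelihood: dict goal -> dict action -> P(action | goal)
    observed_actions: list of actions, assumed independent given the goal
    (a simplifying assumption of this sketch).
    """
    unnormalized = {}
    for goal, p in prior.items():
        for action in observed_actions:
            # Actions missing from the table are treated as impossible.
            p *= action_likelihood[goal].get(action, 0.0)
        unnormalized[goal] = p
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observed actions are impossible under every goal")
    return {goal: p / total for goal, p in unnormalized.items()}


# Hypothetical two-goal example: an agent seen moving left twice.
prior = {"coffee": 0.5, "tea": 0.5}
likelihood = {
    "coffee": {"move_left": 0.75, "move_right": 0.25},
    "tea": {"move_left": 0.25, "move_right": 0.75},
}
posterior = goal_posterior(prior, likelihood, ["move_left", "move_left"])
```

In this toy setting the posterior shifts strongly towards the goal that makes the observed actions more likely; refining the agent model, as the abstract describes, would correspond to adding further latent variables beyond the single goal.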
nan
Article 364
Title@2025-06-29 (7): Ensemble BERT for Medication Event Classification on Electronic Health Records (EHRs)
Title: Ensemble BERT for Medication Event Classification on Electronic Health Records (EHRs) | Ensemble BERT für Medikationsveranstaltungsklassifikation auf elektronischen Gesundheitsakten (EHRs) | 电子健康记录(EHRs)药品事件分类集合BERT 2506.23315v1 |
Authors (3): Shouvon Sarker, Xishuang Dong, Lijun Qian
Identification of key variables such as medications, diseases, and relations from health records and clinical notes has a wide range of applications in the clinical domain. n2c2 2022 provided shared tasks on challenges in natural language processing for clinical data analytics on electronic health records (EHRs), for which it built a comprehensive annotated clinical dataset, the Contextualized Medication Event Dataset (CMED). This study focuses on subtask 2 in Track 1 of this challenge, which is to detect and classify medication events from clinical notes, by building a novel BERT-based ensemble model. The approach started with pretraining BERT models on different types of big data such as Wikipedia and MIMIC. Afterwards, these pretrained BERT models were fine-tuned on CMED training data. The fine-tuned models were then used to classify medication events on CMED testing data, each producing its own predictions. These predictions were integrated into a final prediction using voting strategies. Experimental results demonstrated that BERT-based ensemble models can effectively improve the strict Micro-F score by about 5% and the strict Macro-F score by about 6%.
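The final integration step can be sketched as a plain majority vote. The abstract does not say which voting strategies were used, so the hard-voting rule, the tie-breaking choice, and the function name `majority_vote` below are assumptions for illustration, and the label strings are only example values.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label list by majority vote.

    predictions_per_model: list of lists; each inner list holds one model's
    predicted labels for the same sequence of examples.
    Ties are broken in favor of the earliest-listed model's label
    (an arbitrary choice made for this sketch).
    """
    n_examples = len(predictions_per_model[0])
    if any(len(p) != n_examples for p in predictions_per_model):
        raise ValueError("all models must predict the same number of examples")
    final = []
    for labels in zip(*predictions_per_model):
        counts = Counter(labels)
        top = max(counts.values())
        # First label, in model order, that reaches the top count.
        final.append(next(label for label in labels if counts[label] == top))
    return final


# Three hypothetical fine-tuned models, two medication mentions each.
votes = majority_vote([
    ["Disposition", "NoDisposition"],
    ["Disposition", "Undetermined"],
    ["NoDisposition", "Undetermined"],
])
```

A weighted variant would replace the raw counts with per-model weights, for instance validation scores, while keeping the same structure.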
nan
Article 365
Title@2025-06-29 (7): AI Awareness
Title: AI Awareness | KI-Bewusstsein | AIA 认识 2504.20084v2 |
Authors (4): Xiaojian Li, Haoyuan Shi, Rongwu Xu, Wei Xu
Recent breakthroughs in artificial intelligence (AI) have brought about increasingly capable systems that demonstrate remarkable abilities in reasoning, language understanding, and problem-solving. These advancements have prompted a renewed examination of AI awareness not as a philosophical question of consciousness, but as a measurable, functional capacity. AI awareness is a double-edged sword: it improves general capabilities, e.g., reasoning and safety, while also raising concerns around misalignment and societal risks, demanding careful oversight as AI capabilities grow. In this review, we explore the emerging landscape of AI awareness, which includes metacognition (the ability to represent and reason about its own cognitive state), self-awareness (recognizing its own identity, knowledge, limitations, inter alia), social awareness (modeling the knowledge, intentions, and behaviors of other agents and social norms), and situational awareness (assessing and responding to the context in which it operates). First, we draw on insights from cognitive science, psychology, and computational theory to trace the theoretical foundations of awareness and examine how the four distinct forms of AI awareness manifest in state-of-the-art AI. Next, we systematically analyze current evaluation methods and empirical findings to better understand these manifestations. Building on this, we explore how AI awareness is closely linked to AI capabilities, demonstrating that more aware AI agents tend to exhibit higher levels of intelligent behaviors. Finally, we discuss the risks associated with AI awareness, including key topics in AI safety, alignment, and broader ethical concerns.
nan
Article 366
Title@2025-06-29 (7): Know You First and Be You Better: Modeling Human-Like User Simulators via Implicit Profiles
Title: Know You First and Be You Better: Modeling Human-Like User Simulators via Implicit Profiles | Kennen Sie zuerst und werden Sie besser: Modellierung von Mensch-ähnlichen Benutzer-Simulatoren über Implizite Profile | “先知你,再善待你:通过隐含描述文件模拟人像用户模拟器” 2502.18968v4 |
Authors (6): Kuang Wang, Xianfei Li, Shenghao Yang, Li Zhou, Feng Jiang, Haizhou Li
User simulators are crucial for replicating human interactions with dialogue systems, supporting both collaborative training and automatic evaluation, especially for large language models (LLMs). However, current role-playing methods face challenges such as a lack of utterance-level authenticity and user-level diversity, often hindered by role confusion and dependence on predefined profiles of well-known figures. In contrast, direct simulation focuses solely on text, neglecting implicit user traits like personality and conversation-level consistency. To address these issues, we introduce the User Simulator with Implicit Profiles (USP), a framework that infers implicit user profiles from human-machine interactions to simulate personalized and realistic dialogues. We first develop an LLM-driven extractor with a comprehensive profile schema, then refine the simulation using conditional supervised fine-tuning and reinforcement learning with cycle consistency, optimizing at both the utterance and conversation levels. Finally, a diverse profile sampler captures the distribution of real-world user profiles. Experimental results show that USP outperforms strong baselines in terms of authenticity and diversity while maintaining comparable consistency. Additionally, evaluating LLMs with USP in dynamic multi-turn settings yields results that align well with mainstream benchmarks, demonstrating its effectiveness in real-world applications.
nan
Article 367
Title@2025-06-29 (7): A Context-aware Framework for Translation-mediated Conversations
Title: A Context-aware Framework for Translation-mediated Conversations | Ein Context-aware Framework für translation-mediated conversations | 翻译调解对话的背景意识框架 2412.04205v2 |
Authors (5): José Pombal, Sweta Agrawal, Patrick Fernandes, Emmanouil Zaranis, André F. T. Martins
Automatic translation systems offer a powerful solution to bridge language barriers in scenarios where participants do not share a common language. However, these systems can introduce errors leading to misunderstandings and conversation breakdown. A key issue is that current systems fail to incorporate the rich contextual information necessary to resolve ambiguities and omitted details, resulting in literal, inappropriate, or misaligned translations. In this work, we present a framework to improve large language model-based translation systems by incorporating contextual information in bilingual conversational settings during training and inference. We validate our proposed framework on two task-oriented domains: customer chat and user-assistant interaction. Across both settings, the system produced by our framework, TowerChat, consistently results in better translations than state-of-the-art systems like GPT-4o and TowerInstruct, as measured by multiple automatic translation quality metrics on several language pairs. We also show that the resulting model leverages context in an intended and interpretable way, improving consistency between the conveyed message and the generated translations.
nan
Article 368
Title@2025-06-29 (7): Objective-Free Local Learning and Emergent Language Structure in Thinking Machines
Title: Objective-Free Local Learning and Emergent Language Structure in Thinking Machines | Zielfreies lokales Lernen und neue Sprachstrukturen in denkenden Maschinen | 考虑机器中无目标的地方学习和新兴语言结构 2506.23293v1 |
Authors (1): P. Myles Eugenio
We present a neuro-symbolic framework for generative language modeling based on local, event-driven emergent learning. At its core is a hierarchical Hopfield memory chain acting as a compositional short-term memory and dynamic tokenizer (retokenizer). Rather than relying on predefined tokens or supervision, the model builds structure from scratch, learning symbol sequences as multi-scale representations. It constructs projection tensors that bind co-occurring features into hierarchical tokens, introducing redundancy (i.e., an emergent gauge structure) and enabling compression of local activations into long-range dependencies. Curiously, we find that the retokenizer can filter natural language patterns from noise, generating synthetic languages with coherent internal morphology – quantifiably the same as human language. Language is learned in a local (Hebbian) fashion, where model constraints dictate allowed emergent structure, and new information is retained in alignment with this structure. The absence of a global objective enables a form of plasticity not found in conventional language models, allowing the system to generalize beyond its initial inference class – even without explicit data. We demonstrate that briefly activating a new neuron during inference binds distributed multi-scale token features into a symbolic embedding. These emergent embedding neurons act as long-term memory and support a key-value mechanism for compositional inference and generalization. This architecture provides a methodological foundation for studying how symbolic structure can emerge from local neural learning. It offers a new pathway for building scalable, interpretable neuro-symbolic systems – where tokens, grammar, and reasoning arise as compressed memory traces within a Hopfield hierarchy. This approach advances the development of neuromorphic architectures for generative language models.
nan
Article 369
Title@2025-06-29 (7): Two Spelling Normalization Approaches Based on Large Language Models
Title: Two Spelling Normalization Approaches Based on Large Language Models | Zwei Rechtschreibungs-Normalisierungsansätze basierend auf großen Sprachmodellen | 基于大语言模式的两种拼法正常化办法 2506.23288v1 |
Authors (2): Miguel Domingo, Francisco Casacuberta
The absence of standardized spelling conventions and the organic evolution of human language present an inherent linguistic challenge within historical documents, a longstanding concern for scholars in the humanities. Addressing this issue, spelling normalization endeavors to align a document’s orthography with contemporary standards. In this study, we propose two new approaches based on large language models: one that has been trained without supervised training, and a second one that has been trained for machine translation. Our evaluation spans multiple datasets encompassing diverse languages and historical periods, leading us to the conclusion that while both approaches yielded encouraging results, statistical machine translation still seems to be the most suitable technology for this task.
nan
Article 370
Title@2025-06-29 (7): Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models
Title: Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models | Beispiel dann Identifizieren: Ein allgemeiner Rahmen für die Risikokontrolle und Bewertung in multimodalen großen Sprachmodellen | 确定:多式大语言模式风险管理和评估总框架 2410.08174v3 |
Authors (6): Qingni Wang, Tiantian Geng, Zhiyuan Wang, Teng Wang, Bo Fu, Feng Zheng
Multimodal Large Language Models (MLLMs) exhibit promising advancements across various tasks, yet they still encounter significant trustworthiness issues. Prior studies apply Split Conformal Prediction (SCP) in language modeling to construct prediction sets with statistical guarantees. However, these methods typically rely on internal model logits or are restricted to multiple-choice settings, which hampers their generalizability and adaptability in dynamic, open-ended environments. In this paper, we introduce TRON, a two-step framework for risk control and assessment, applicable to any MLLM that supports sampling in both open-ended and closed-ended scenarios. TRON comprises two main components: (1) a novel conformal score to sample response sets of minimum size, and (2) a nonconformity score to identify high-quality responses based on self-consistency theory, controlling the error rates by two specific risk levels. Furthermore, we investigate semantic redundancy in prediction sets within open-ended contexts for the first time, leading to a promising evaluation metric for MLLMs based on average set size. Our comprehensive experiments across four Video Question-Answering (VideoQA) datasets utilizing eight MLLMs show that TRON achieves desired error rates bounded by two user-specified risk levels. Additionally, deduplicated prediction sets maintain adaptiveness while being more efficient and stable for risk assessment under different risk levels.
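For readers unfamiliar with Split Conformal Prediction, the calibration step it relies on can be sketched as follows. This is the generic textbook procedure, not TRON's two-step method; the function names, the candidate answers, and the scores are assumptions for illustration, and any nonconformity score where lower means more plausible could be plugged in.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Nonconformity threshold from a held-out calibration set.

    calibration_scores: nonconformity score of the true answer for each
    calibration example (lower means the model found it more plausible).
    alpha: target error rate; prediction sets built with the returned
    threshold miss the true answer with probability at most alpha,
    assuming calibration and test examples are exchangeable.
    """
    n = len(calibration_scores)
    if n == 0:
        raise ValueError("calibration set must not be empty")
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: keep every candidate.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score is within the threshold."""
    return [c for c, s in candidate_scores.items() if s <= threshold]


# Hypothetical calibration scores and candidate answers.
threshold = conformal_threshold(
    [0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6], alpha=0.25
)
kept = prediction_set({"cat": 0.3, "dog": 0.85, "fox": 0.8}, threshold)
```

The average size of `kept` over a test set is the kind of set-size statistic the abstract proposes as an evaluation signal: smaller sets at the same error rate indicate a more confident model.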
nan
Article 371
Title@2025-06-29 (7): Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games
Title: Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games | Beschädigt durch Reasoning: Reasoning Sprachmodelle werden Free-Riders in Public Goods Games | 原因:在公共货物运动会中,理性语言模式成为自由骑手 2506.23276v1 |
Authors (6): David Guzman Piedrahita, Yongjin Yang, Mrinmaya Sachan, Giorgia Ramponi, Bernhard Schölkopf, Zhijing Jin
As large language models (LLMs) are increasingly deployed as autonomous agents, understanding their cooperation and social mechanisms is becoming increasingly important. In particular, how LLMs balance self-interest and collective well-being is a critical challenge for ensuring alignment, robustness, and safe deployment. In this paper, we examine the challenge of costly sanctioning in multi-agent LLM systems, where an agent must decide whether to invest its own resources to incentivize cooperation or penalize defection. To study this, we adapt a public goods game with institutional choice from behavioral economics, allowing us to observe how different LLMs navigate social dilemmas over repeated interactions. Our analysis reveals four distinct behavioral patterns among models: some consistently establish and sustain high levels of cooperation, others fluctuate between engagement and disengagement, some gradually decline in cooperative behavior over time, and others rigidly follow fixed strategies regardless of outcomes. Surprisingly, we find that reasoning LLMs, such as the o1 series, struggle significantly with cooperation, whereas some traditional LLMs consistently achieve high levels of cooperation. These findings suggest that the current approach to improving LLMs, which focuses on enhancing their reasoning capabilities, does not necessarily lead to cooperation, providing valuable insights for deploying LLM agents in environments that require sustained collaboration. Our code is available at https://github.com/davidguzmanp/SanctSim
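The incentive structure behind free-riding can be made concrete with the payoff rule of a standard public goods game. This sketch covers only the basic game from behavioral economics; the sanctioning institutions studied in the paper are not modeled, and the function name, endowment, and multiplier values are assumptions for illustration.

```python
def public_goods_payoffs(contributions, endowment, multiplier):
    """Payoff of each agent after one round of a standard public goods game.

    Each agent keeps what it did not contribute and receives an equal share
    of the multiplied common pool. With 1 < multiplier < number of agents,
    contributing nothing maximizes an individual's payoff while full
    contribution maximizes the group's total payoff.
    """
    if any(c < 0 or c > endowment for c in contributions):
        raise ValueError("contributions must lie between 0 and the endowment")
    n = len(contributions)
    share = multiplier * sum(contributions) / n
    return [endowment - c + share for c in contributions]


# Two cooperators and two free-riders, hypothetical parameters.
payoffs = public_goods_payoffs([20, 20, 0, 0], endowment=20, multiplier=2.0)
```

Here the free-riders end the round with twice the payoff of the cooperators, which is the dilemma that costly sanctioning is meant to counteract.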
nan
Article 372
Title@2025-06-29 (7): Agentic Medical Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge
Title: Agentic Medical Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge | Agentisches medizinisches Wissen Grafiken verbessern medizinische Frageantworten: Die Lücke zwischen LLMs und sich entwickelndem medizinischem Wissen überbrücken | 药用知识图加强医疗问题的回答:缩小LLMM与不断发展的医学知识之间的差距 2502.13010v3 |
Authors (5): Mohammad Reza Rezaei, Reza Saadati Fard, Jayson L. Parker, Rahul G. Krishnan, Milad Lankarany
Large Language Models (LLMs) have significantly advanced medical question-answering by leveraging extensive clinical data and medical literature. However, the rapid evolution of medical knowledge and the labor-intensive process of manually updating domain-specific resources pose challenges to the reliability of these systems. To address this, we introduce Agentic Medical Graph-RAG (AMG-RAG), a comprehensive framework that automates the construction and continuous updating of medical knowledge graphs, integrates reasoning, and retrieves current external evidence, such as PubMed and WikiSearch. By dynamically linking new findings and complex medical concepts, AMG-RAG not only improves accuracy but also enhances interpretability in medical queries. Evaluations on the MEDQA and MEDMCQA benchmarks demonstrate the effectiveness of AMG-RAG, achieving an F1 score of 74.1 percent on MEDQA and an accuracy of 66.34 percent on MEDMCQA, outperforming both comparable models and those 10 to 100 times larger. Notably, these improvements are achieved without increasing computational overhead, highlighting the critical role of automated knowledge graph generation and external evidence retrieval in delivering up-to-date, trustworthy medical insights.
nan
Article 373
Title@2025-06-29 (7): RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing
Title: RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing | RAG und RAU: Eine Umfrage zum retrieval-augmentierten Sprachmodell in der natürlichen Sprachverarbeitung | RAG和RAU:关于自然语言处理中检索增强语言模式的调查 2404.19543v2 |
Authors (2): Yucheng Hu, Yuxing Lu
Large Language Models (LLMs) have catalyzed significant advancements in Natural Language Processing (NLP), yet they encounter challenges such as hallucination and the need for domain-specific knowledge. To mitigate these, recent methodologies have integrated information retrieved from external resources with LLMs, substantially enhancing their performance across NLP tasks. This survey paper addresses the absence of a comprehensive overview of Retrieval-Augmented Language Models (RALMs), covering both Retrieval-Augmented Generation (RAG) and Retrieval-Augmented Understanding (RAU), and provides an in-depth examination of their paradigm, evolution, taxonomy, and applications. The paper discusses the essential components of RALMs, including Retrievers, Language Models, and Augmentations, and how their interactions lead to diverse model structures and applications. RALMs demonstrate utility in a spectrum of tasks, from translation and dialogue systems to knowledge-intensive applications. The survey includes several evaluation methods of RALMs, emphasizing the importance of robustness, accuracy, and relevance in their assessment. It also acknowledges the limitations of RALMs, particularly in retrieval quality and computational efficiency, offering directions for future research. In conclusion, this survey aims to offer a structured insight into RALMs, their potential, and the avenues for their future development in NLP. The paper is supplemented with a GitHub repository containing the surveyed works and resources for further study: https://github.com/2471023025/RALM_Survey.
nan
Article 374
Title@2025-06-29 (7): The language of time: a language model perspective on time-series foundation models
Title: The language of time: a language model perspective on time-series foundation models | Die Sprache der Zeit: ein Sprachmodell Perspektive auf Zeitreihen Grundmodelle | 时间语言:时间序列基础模型的语言模式视角 2507.00078v1 |
Authors (5): Yi Xie, Yun Xiong, Zejian Shi, Hao Niu, Zhengfu Liu
With the rise of large language models, the paradigm of training foundation models with massive parameter counts on vast datasets has been adopted in multiple domains to achieve remarkable success. Time series foundation models represent a significant extension of this paradigm, demonstrating exceptional expressive power, generalization, and cross-domain transferability. However, this gives rise to a fundamental paradox: time series data reflect distinct dynamical systems, making cross-domain transfer intuitively implausible, yet this is contradicted by the models’ empirical success. To resolve this paradox, this paper investigates, from both theoretical and experimental perspectives, the representation learning mechanisms and generalization capabilities of patch-based time series foundation models. We argue that such models are not merely applying a new architecture but are fundamentally generalizing the representation paradigm of language models by extending deterministic vector-based representations to latent probabilistic distributional forms. Our theoretical analysis supports this framework by demonstrating that continuous time-series patches can be faithfully quantized into a discrete vocabulary whose key statistical properties are highly consistent with those of natural language. This generalization allows time series models to inherit the robust representation and transfer abilities of large language models, thereby explaining their superior performance in temporal tasks. Ultimately, our work provides a rigorous theoretical cornerstone for understanding, evaluating, and improving the safety and reliability of large-scale time series foundation models.
Article 375
Title@2025-06-29 (7): Generalist Reward Models: Found Inside Large Language Models
Title: Generalist Reward Models: Found Inside Large Language Models | Generalist Reward Models: In großen Sprachmodellen gefunden | 通用奖赏模式:在大语言模式内建立起来 2506.23235v1 |
Authors (9): Yi-Chen Li, Tian Xu, Yang Yu, Xuqin Zhang, Xiong-Hui Chen, Zhongxiang Ling, Ningjing Chao, Lei Yuan, Zhi-Hua Zhou
The alignment of Large Language Models (LLMs) is critically dependent on reward models trained on costly human preference data. While recent work explores bypassing this cost with AI feedback, these methods often lack a rigorous theoretical foundation. In this paper, we discover that a powerful generalist reward model is already latently present within any LLM trained via standard next-token prediction. We prove that this endogenous reward is not a heuristic, but is theoretically equivalent to a reward function learned through offline inverse reinforcement learning. This connection allows us to directly elicit a high-quality reward signal from a base (pre-trained or supervised fine-tuned) model without any further training. Critically, we also prove that subsequent reinforcement learning using this endogenous reward leads to a policy with a provably superior error bound compared to the base model. To the best of our knowledge, this is the first theoretical proof of the effectiveness of reinforcement learning for LLMs. Our experiments validate this theory, demonstrating that our method not only outperforms existing LLM-as-a-judge approaches but can also surpass explicitly trained reward models. These findings suggest that the reward modeling stage can be replaced by a principled method of eliciting the knowledge already captured during pre-training, heralding a more efficient, powerful, and scalable paradigm for the alignment of LLMs as well as multi-modal models.
Article 376
Title@2025-06-29 (7): Masked Gated Linear Unit
Title: Masked Gated Linear Unit | Maskierte gezahnte Lineareinheit | 面罩线条股 2506.23225v1 |
Authors (5): Yukito Tajima, Nakamasa Inoue, Yusuke Sekikawa, Ikuro Sato, Rio Yokota
Gated Linear Units (GLUs) have become essential components in the feed-forward networks of state-of-the-art Large Language Models (LLMs). However, they require twice as many memory reads compared to feed-forward layers without gating, due to the use of separate weight matrices for the gate and value streams. To address this bottleneck, we introduce Masked Gated Linear Units (MGLUs), a novel family of GLUs with an efficient kernel implementation. The core contributions of MGLUs include: (1) the Mixture of Element-wise Gating (MoEG) architecture that learns multiple binary masks, each determining gate or value assignments at the element level on a single shared weight matrix, resulting in reduced memory transfer, and (2) FlashMGLU, a hardware-friendly kernel that yields up to a 19.7$\times$ inference-time speed-up over a naive PyTorch MGLU and is 47% more memory-efficient and 34% faster than standard GLUs, despite added architectural complexity, on an RTX5090 GPU. In LLM experiments, the Swish-activated variant SwiMGLU preserves its memory advantages while matching, or even surpassing, the downstream accuracy of the SwiGLU baseline.
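The element-wise masking idea can be illustrated with a small pure-Python sketch. This is one possible reading of the abstract, not the authors' implementation: it assumes a single binary mask (the abstract mentions multiple), a Swish activation, and hypothetical helper names (`swish`, `matvec`, `masked_glu_hidden`). Each entry of the shared matrix is routed either to the gate stream or to the value stream, so only one matrix has to be read.

```python
import math

def swish(z):
    """Swish / SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(x, w):
    """Multiply row vector x (length d_in) by matrix w (d_in x d_hidden)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def masked_glu_hidden(x, w, mask):
    """Hidden activation from ONE shared matrix w and ONE binary mask.

    Assumption: mask[i][j] == 1 sends w[i][j] to the gate stream,
    mask[i][j] == 0 sends it to the value stream.
    """
    rows, cols = len(w), len(w[0])
    gate_w = [[w[i][j] * mask[i][j] for j in range(cols)] for i in range(rows)]
    value_w = [[w[i][j] * (1 - mask[i][j]) for j in range(cols)] for i in range(rows)]
    gate = matvec(x, gate_w)
    value = matvec(x, value_w)
    # Gated product, as in a standard GLU, but both streams share w.
    return [swish(g) * v for g, v in zip(gate, value)]

if __name__ == "__main__":
    x = [1.0, 2.0]
    w = [[1.0, 0.5], [0.0, 1.0]]
    mask = [[1, 0], [0, 1]]
    print(masked_glu_hidden(x, w, mask))
```

A naive version like this still materializes two masked copies of the matrix; the memory saving described in the abstract comes from a fused kernel that reads the shared matrix once, which this sketch does not attempt to reproduce.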
Article 377
Title@2025-06-29 (7): UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding
Title: UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding | UrbanLLAVA: Ein multimodales Large Language Model für urbane Intelligenz mit räumlicher Vernunft und Verständnis | UrbulalLALLAVA:具有空间合理性和理解性的城市情报多模式大语言模式 2506.23219v1 |
Authors (5): Jie Feng, Shengyuan Wang, Tianhui Liu, Yanxin Xi, Yong Li
Urban research involves a wide range of scenarios and tasks that require the understanding of multi-modal data. Current methods often focus on specific data types and lack a unified framework in the urban field for processing them comprehensively. The recent success of multi-modal large language models (MLLMs) presents a promising opportunity to overcome this limitation. In this paper, we introduce $\textit{UrbanLLaVA}$, a multi-modal large language model designed to process these four types of data simultaneously and achieve strong performance across diverse urban tasks compared with general MLLMs. In $\textit{UrbanLLaVA}$, we first curate a diverse urban instruction dataset encompassing both single-modal and cross-modal urban data, spanning from the location view to the global view of the urban environment. Additionally, we propose a multi-stage training framework that decouples spatial reasoning enhancement from domain knowledge learning, thereby improving the compatibility and downstream performance of $\textit{UrbanLLaVA}$ across diverse urban tasks. Finally, we also extend existing benchmarks for urban research to assess the performance of MLLMs across a wide range of urban tasks. Experimental results from three cities demonstrate that $\textit{UrbanLLaVA}$ outperforms open-source and proprietary MLLMs in both single-modal tasks and complex cross-modal tasks and shows robust generalization abilities across cities. Source code and data are openly accessible to the research community via https://github.com/tsinghua-fib-lab/UrbanLLaVA.
Article 378
Title@2025-06-29 (7): RiverText: A Python Library for Training and Evaluating Incremental Word Embeddings from Text Data Streams
Title: RiverText: A Python Library for Training and Evaluating Incremental Word Embeddings from Text Data Streams | RiverText: Eine Python-Bibliothek für das Training und Evaluieren inkrementaler Word-Einbettungen aus Textdatenströmen | RiverText:一个培训和评价来自文本数据流的递增单词嵌入的Python图书馆 2506.23192v1 |
Authors (2): Gabriel Iturra-Bocaz, Felipe Bravo-Marquez
Word embeddings have become essential components in various information retrieval and natural language processing tasks, such as ranking, document classification, and question answering. However, despite their widespread use, traditional word embedding models present a limitation in their static nature, which hampers their ability to adapt to the constantly evolving language patterns that emerge in sources such as social media and the web (e.g., new hashtags or brand names). To overcome this problem, incremental word embedding algorithms are introduced, capable of dynamically updating word representations in response to new language patterns and processing continuous data streams. This paper presents RiverText, a Python library for training and evaluating incremental word embeddings from text data streams. Our tool is a resource for the information retrieval and natural language processing communities that work with word embeddings in streaming scenarios, such as analyzing social media. The library implements different incremental word embedding techniques, such as Skip-gram, Continuous Bag of Words, and Word Context Matrix, in a standardized framework. In addition, it uses PyTorch as its backend for neural network training. We have implemented a module that adapts existing intrinsic static word embedding evaluation tasks for word similarity and word categorization to a streaming setting. Finally, we compare the implemented methods with different hyperparameter settings and discuss the results. Our open-source library is available at https://github.com/dccuchile/rivertext.
Article 379
Title@2025-06-29 (7): Redefining Evaluation Standards: A Unified Framework for Evaluating the Korean Capabilities of Language Models
Title: Redefining Evaluation Standards: A Unified Framework for Evaluating the Korean Capabilities of Language Models | Neudefinition von Bewertungsstandards: Ein einheitlicher Rahmen für die Bewertung der koreanischen Fähigkeiten von Sprachmodellen | 重新界定评价标准:评价韩国语言模式能力的统一框架 2503.22968v3 |
Authors (9): Hanwool Lee, Dasol Choi, Sooyong Kim, Ilgyun Jung, Sangwon Baek, Guijin Son, Inseon Hwang, Naeun Lee, Seunghyeok Hong
Recent advancements in Korean large language models (LLMs) have driven numerous benchmarks and evaluation methods, yet inconsistent protocols cause up to 10 p.p. performance gaps across institutions. Overcoming these reproducibility gaps does not mean enforcing a one-size-fits-all evaluation. Rather, effective benchmarking requires diverse experimental approaches and a framework robust enough to support them. To this end, we introduce HRET (Haerae Evaluation Toolkit), an open-source, registry-based framework that unifies Korean LLM assessment. HRET integrates major Korean benchmarks, multiple inference backends, and multi-method evaluation, with language consistency enforcement to ensure genuine Korean outputs. Its modular registry design also enables rapid incorporation of new datasets, methods, and backends, ensuring the toolkit adapts to evolving research needs. Beyond standard accuracy metrics, HRET incorporates Korean-focused output analyses, namely morphology-aware Type-Token Ratio (TTR) for evaluating lexical diversity and systematic keyword-omission detection for identifying missing concepts, to provide diagnostic insights into language-specific behaviors. These targeted analyses help researchers pinpoint morphological and semantic shortcomings in model outputs, guiding focused improvements in Korean LLM development.
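The Type-Token Ratio mentioned above is the number of distinct tokens divided by the total number of tokens. A minimal Python sketch follows, assuming plain whitespace tokens; a morphology-aware variant such as the one HRET describes would presumably tokenize with a Korean morphological analyzer first. The function name `type_token_ratio` is hypothetical and is not taken from HRET.

```python
def type_token_ratio(tokens):
    """Lexical diversity: distinct tokens / total tokens (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

if __name__ == "__main__":
    # Assumption: whitespace tokenization stands in for morphological analysis.
    text = "the cat saw the dog"
    print(type_token_ratio(text.split()))
```

A higher ratio indicates more varied vocabulary; because the ratio falls as texts get longer, comparisons are most meaningful between outputs of similar length.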
Article 380
Title@2025-06-29 (7): FinAI-BERT: A Transformer-Based Model for Sentence-Level Detection of AI Disclosures in Financial Reports
Title: FinAI-BERT: A Transformer-Based Model for Sentence-Level Detection of AI Disclosures in Financial Reports | FinAI-BERT: Ein transformerbasiertes Modell für die Sentence-Level-Erkennung von KI-Enthüllungen in Finanzberichten | FinAI-BERT:以判决为基础在判决一级侦查财务报告中AI披露的变换模式 2507.01991v1 |
Authors (1): Muhammad Bilal Zafar
The proliferation of artificial intelligence (AI) in financial services has prompted growing demand for tools that can systematically detect AI-related disclosures in corporate filings. While prior approaches often rely on keyword expansion or document-level classification, they fall short in granularity, interpretability, and robustness. This study introduces FinAI-BERT, a domain-adapted transformer-based language model designed to classify AI-related content at the sentence level within financial texts. The model was fine-tuned on a manually curated and balanced dataset of 1,586 sentences drawn from 669 annual reports of U.S. banks (2015 to 2023). FinAI-BERT achieved near-perfect classification performance (accuracy of 99.37 percent, F1 score of 0.993), outperforming traditional baselines such as Logistic Regression, Naive Bayes, Random Forest, and XGBoost. Interpretability was ensured through SHAP-based token attribution, while bias analysis and robustness checks confirmed the model’s stability across sentence lengths, adversarial inputs, and temporal samples. Theoretically, the study advances financial NLP by operationalizing fine-grained, theme-specific classification using transformer architectures. Practically, it offers a scalable, transparent solution for analysts, regulators, and scholars seeking to monitor the diffusion and framing of AI across financial institutions.
Article 381
Title@2025-06-29 (7): The Effectiveness of LLMs as Annotators: A Comparative Overview and Empirical Analysis of Direct Representation
Title: The Effectiveness of LLMs as Annotators: A Comparative Overview and Empirical Analysis of Direct Representation | Die Wirksamkeit von LLMs als Annotatoren: Eine vergleichende Übersicht und empirische Analyse der direkten Repräsentation | LLMs作为说明人的效力:直接代表的比较概览和经验分析 2405.01299v2 |
Authors (2): Maja Pavlovic, Massimo Poesio
Large Language Models (LLMs) have emerged as powerful support tools across various natural language tasks and a range of application domains. Recent studies focus on exploring their capabilities for data annotation. This paper provides a comparative overview of twelve studies investigating the potential of LLMs in labelling data. While the models demonstrate promising cost and time-saving benefits, there exist considerable limitations, such as representativeness, bias, sensitivity to prompt variations and English language preference. Leveraging insights from these studies, our empirical analysis further examines the alignment between human and GPT-generated opinion distributions across four subjective datasets. In contrast to the studies examining representation, our methodology directly obtains the opinion distribution from GPT. Our analysis thereby supports the minority of studies that are considering diverse perspectives when evaluating data annotation tasks and highlights the need for further research in this direction.
Article 382
Title@2025-06-29 (7): V-SYNTHESIS: Task-Agnostic Synthesis of Consistent and Diverse In-Context Demonstrations from Scratch via V-Entropy
Title: V-SYNTHESIS: Task-Agnostic Synthesis of Consistent and Diverse In-Context Demonstrations from Scratch via V-Entropy | V-SYNTHESIS: Task-Agnostische Synthese von konsistenten und unterschiedlichen In-Context-Demonstrationen von Scratch über V-Entropie | V-SYSIS:关于通过V-Entropy从Scratch到V-Entropy的一致和多样化的文体演示的 任务-不可知综合 2506.23149v1 |
Authors (6): Dingzirui Wang, Xuanliang Zhang, Keyan Xu, Qingfu Zhu, Wanxiang Che, Yang Deng
High labeling cost for in-context learning (ICL) demonstrations motivates using large language models (LLMs) for synthesis to reduce overhead. However, existing synthesis methods are mainly task-specific or rely on pre-existing demonstrations. This paper therefore focuses on synthesizing demonstrations from scratch for arbitrary tasks. A major challenge in synthesizing from scratch is ensuring consistency with the target task, as the lack of labeling guidance could lead to synthesis bias. We first propose a consistency metric called V-Score, which has higher performance and lower computation cost compared with metrics based on n-grams or embedding vectors. Furthermore, we introduce V-Synthesis, which leverages V-Score for proportional sampling to ensure both high consistency and diversity of synthesized demonstrations. Experimental results demonstrate that V-Synthesis yields an average performance improvement of 2.0% compared to existing synthesis methods, confirming the effectiveness of V-Synthesis.
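Proportional sampling, as named above, draws candidates with probability proportional to a score, so high-scoring items are favored while lower-scoring ones still appear occasionally. The sketch below is only an illustration under assumptions: the scores are placeholders rather than V-Score values, the helper name `sample_proportional` is hypothetical, and sampling is done with replacement, which may differ from the paper's procedure.

```python
import random

def sample_proportional(candidates, scores, k, seed=None):
    """Draw k candidates with probability proportional to their scores.

    Sampling is with replacement; scores must be non-negative and not all zero.
    """
    if len(candidates) != len(scores):
        raise ValueError("candidates and scores must have the same length")
    rng = random.Random(seed)
    return rng.choices(candidates, weights=scores, k=k)

if __name__ == "__main__":
    demos = ["demo_a", "demo_b", "demo_c"]
    placeholder_scores = [0.7, 0.2, 0.1]  # made-up values for illustration
    print(sample_proportional(demos, placeholder_scores, k=4, seed=0))
```

Compared with always taking the top-scoring candidates, this keeps some diversity in the selected set at the cost of occasionally picking weaker items.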
Article 383
Title@2025-06-29 (7): Brevity is the soul of sustainability: Characterizing LLM response lengths
Title: Brevity is the soul of sustainability: Characterizing LLM response lengths | Brevity ist die Seele der Nachhaltigkeit: Charakterisierende LLM-Responselängen | 博利是可持续性的灵魂:确定LLM 反应长度 2506.08686v2 |
Authors (7): Soham Poddar, Paramita Koley, Janardan Misra, Sanjay Podder, Navveen Balani, Niloy Ganguly, Saptarshi Ghosh
A significant portion of the energy consumed by Large Language Models (LLMs) arises from their inference processes; hence developing energy-efficient methods for inference is crucial. While several techniques exist for inference optimization, output compression remains relatively unexplored, with only a few preliminary efforts addressing this aspect. In this work, we first benchmark 12 decoder-only LLMs across 5 datasets, revealing that these models often produce responses that are substantially longer than necessary. We then conduct a comprehensive quality assessment of LLM responses, formally defining six information categories present in LLM responses. We show that LLMs often tend to include redundant or additional information besides the minimal answer. To address this issue of long responses by LLMs, we explore several simple and intuitive prompt-engineering strategies. Empirical evaluation shows that appropriate prompts targeting length reduction and controlling information content can achieve significant energy optimization of 25-60% by reducing the response length while preserving the quality of LLM responses.
Article 384
Title@2025-06-29 (7): Benchmarking Deep Search over Heterogeneous Enterprise Data
Title: Benchmarking Deep Search over Heterogeneous Enterprise Data | Benchmarking Deep Search über heterogene Unternehmensdaten | 确定对不同不同企业数据进行深度搜索的基准 2506.23139v1 |
Authors (6): Prafulla Kumar Choubey, Xiangyu Peng, Shilpa Bhagavath, Kung-Hsiang Huang, Caiming Xiong, Chien-Sheng Wu
We present a new benchmark for evaluating Deep Search, a realistic and complex form of retrieval-augmented generation (RAG) that requires source-aware, multi-hop reasoning over diverse, sparse, but related sources. These include documents, meeting transcripts, Slack messages, GitHub, and URLs, which vary in structure and often contain human-to-human interactions. We build it using a synthetic data pipeline that simulates business workflows across product planning, development, and support stages, generating interconnected content with realistic noise and multi-hop questions with guaranteed ground-truth answers. We release our benchmark with both answerable and unanswerable queries, and a retrieval pool of 39,190 enterprise artifacts, enabling fine-grained evaluation of long-context LLM and RAG systems. Our experiments reveal that even the best-performing agentic RAG methods achieve an average performance score of 32.96 on our benchmark. With further analysis, we highlight retrieval as the main bottleneck: existing methods struggle to conduct deep searches and retrieve all necessary evidence. Consequently, they often reason over partial context, leading to significant performance degradation.
Article 385
Title@2025-06-29 (7): LLM-Assisted Question-Answering on Technical Documents Using Structured Data-Aware Retrieval Augmented Generation
Title: LLM-Assisted Question-Answering on Technical Documents Using Structured Data-Aware Retrieval Augmented Generation | LLM-Assisted Question-Answering on Technical Documents Using Structured Data-Aware Retrieval Augmented Generation | 利用结构化数据检索软件增强型一代技术文件的LLM协助问题查询 2506.23136v1 |
Authors (2): Shadman Sobhan, Mohammad Ariful Haque
Large Language Models (LLMs) are capable of natural language understanding and generation. But they face challenges such as hallucination and outdated knowledge. Fine-tuning is one possible solution, but it is resource-intensive and must be repeated with every data update. Retrieval-Augmented Generation (RAG) offers an efficient solution by allowing LLMs to access external knowledge sources. However, traditional RAG pipelines struggle with retrieving information from complex technical documents with structured data such as tables and images. In this work, we propose a RAG pipeline, capable of handling tables and images in documents, for technical documents that support both scanned and searchable formats. Its retrieval process combines vector similarity search with a fine-tuned reranker based on Gemma-2-9b-it. The reranker is trained using RAFT (Retrieval-Augmented Fine-Tuning) on a custom dataset designed to improve context identification for question answering. Our evaluation demonstrates that the proposed pipeline achieves a high faithfulness score of 94% (RAGas) and 96% (DeepEval), and an answer relevancy score of 87% (RAGas) and 93% (DeepEval). Comparative analysis demonstrates that the proposed architecture is superior to general RAG pipelines in terms of table-based questions and handling questions outside context.
Article 386
Title@2025-06-29 (7): Evaluating Rare Disease Diagnostic Performance in Symptom Checkers: A Synthetic Vignette Simulation Approach
Title: Evaluating Rare Disease Diagnostic Performance in Symptom Checkers: A Synthetic Vignette Simulation Approach | Bewertung der Diagnoseleistung bei seltenen Krankheiten bei Symptomen: Ein synthetischer Vignette-Simulationsansatz | 评价症状检查器中的罕见疾病诊断性能: 合成Vignette模拟方法 2506.19750v4 |
Authors (3): Takashi Nishibayashi, Seiji Kanazawa, Kumpei Yamada
Symptom Checkers (SCs) provide medical information tailored to user symptoms. A critical challenge in SC development is preventing unexpected performance degradation for individual diseases, especially rare diseases, when updating algorithms. This risk stems from the lack of practical pre-deployment evaluation methods. For rare diseases, obtaining sufficient evaluation data from user feedback is difficult. To evaluate the impact of algorithm updates on the diagnostic performance for individual rare diseases before deployment, this study proposes and validates a novel Synthetic Vignette Simulation Approach. This approach aims to enable this essential evaluation efficiently and at a low cost. To estimate the impact of algorithm updates, we generated synthetic vignettes from disease-phenotype annotations in the Human Phenotype Ontology (HPO), a publicly available knowledge base for rare diseases curated by experts. Using these vignettes, we simulated SC interviews to predict changes in diagnostic performance. The effectiveness of this approach was validated retrospectively by comparing the predicted changes with actual performance metrics using the R-squared ($R^2$) coefficient. Our experiment, covering eight past algorithm updates for rare diseases, showed that the proposed method accurately predicted performance changes for diseases with phenotype frequency information in HPO (n=5). For these updates, we found a strong correlation for both Recall@8 change ($R^2$ = 0.83, $p$ = 0.031) and Precision@8 change ($R^2$ = 0.78, $p$ = 0.047). Our proposed method enables the pre-deployment evaluation of SC algorithm changes for individual rare diseases. This evaluation is based on a publicly available medical knowledge database created by experts, ensuring transparency and explainability for stakeholders. Additionally, SC developers can efficiently improve diagnostic performance at a low cost.
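The validation step above compares predicted and actual performance changes with an $R^2$ coefficient. A minimal sketch of one common way to compute it, as the squared Pearson correlation between two series, is shown below. This is an assumption about the statistic, since the abstract does not spell out the formula; the function name `r_squared` is hypothetical and the numbers are made up for illustration, not taken from the study.

```python
def r_squared(predicted, actual):
    """Squared Pearson correlation between two equal-length series."""
    if len(predicted) != len(actual) or len(predicted) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    s_pa = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    s_pp = sum((p - mean_p) ** 2 for p in predicted)
    s_aa = sum((a - mean_a) ** 2 for a in actual)
    if s_pp == 0 or s_aa == 0:
        raise ValueError("a constant series has no defined correlation")
    return (s_pa * s_pa) / (s_pp * s_aa)

if __name__ == "__main__":
    # Made-up changes in Recall@8 for five hypothetical algorithm updates.
    predicted_change = [0.02, -0.01, 0.05, 0.00, 0.03]
    actual_change = [0.03, -0.02, 0.04, 0.01, 0.02]
    print(round(r_squared(predicted_change, actual_change), 3))
```

With only five updates, as in the study (n=5), an $R^2$ estimate is sensitive to individual points, which is why the abstract reports the accompanying $p$-values.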
Article 387
Title@2025-06-29 (7): Format-Adapter: Improving Reasoning Capability of LLMs by Adapting Suitable Format
Title: Format-Adapter: Improving Reasoning Capability of LLMs by Adapting Suitable Format | Format-Adapter: Verbesserung der Kapazität von LLMs durch Anpassung des geeigneten Formats | 格式设计师:通过调整适当格式,提高LLMs的理据能力 2506.23133v1 |
Authors (11): Dingzirui Wang, Xuanliang Zhang, Rongyu Cao, Longxu Dou, Xianzhen Luo, Yingwei Ma, Qingfu Zhu, Wanxiang Che, Binhua Li, Fei Huang, Yongbin Li
Generating multiple answers and voting over them is an effective method to mitigate reasoning inconsistencies of large language models (LLMs). Prior works have shown that multiple reasoning formats outperform a single format when generating multiple answers. However, previous works using multiple formats rely on formats labeled by humans, which may not suit every task and incur high labeling costs. To address this issue, we adapt suitable formats to the given tasks by generating and selecting formats. We first propose how to measure the reasoning error when generating multiple answers. Then, we introduce Format-Adapter, which utilizes LLMs to generate and select suitable reasoning formats by minimizing the error measurement we present. We conduct experiments on math and commonsense reasoning tasks, where Format-Adapter achieves a 4.3% performance improvement on average over previous works, demonstrating its effectiveness.
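The generate-then-vote step that this line of work builds on can be sketched in a few lines of Python. The sketch covers only plain majority voting over already-generated answers; it does not implement Format-Adapter's format generation or its error measure, and the name `majority_vote` is hypothetical.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("no answers to vote over")
    # Counter.most_common keeps first-seen order among equal counts.
    return Counter(answers).most_common(1)[0][0]

if __name__ == "__main__":
    # Assumption: these strings stand in for final answers extracted from
    # several reasoning chains, possibly produced in different formats.
    sampled = ["42", "41", "42", "42", "40"]
    print(majority_vote(sampled))
```

Voting only helps when the sampled answers err in different ways, which is one motivation for drawing them from several reasoning formats rather than one.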
Article 388
Title@2025-06-29 (7): Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding
Title: Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding | Time-R1: Nach dem Training Großer Vision-Sprachenmodell für die zeitliche Videoerdung | 时间-R1:培训后用于实时视频定位的大型视觉语言模型 2503.13377v3 |
Authors (17): Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, Xiangnan Fang, Zewen He, Zhenbo Luo, Wenxuan Wang, Junqi Lin, Jian Luan, Qin Jin
Temporal Video Grounding (TVG), the task of locating specific video segments based on language queries, is a core challenge in long-form video understanding. While recent Large Vision-Language Models (LVLMs) have shown early promise in tackling TVG through supervised fine-tuning (SFT), their abilities to generalize remain limited. To address this, we propose a novel post-training framework that enhances the generalization capabilities of LVLMs via reinforcement learning (RL). Specifically, our contributions span three key directions: (1) Time-R1: we introduce a reasoning-guided post-training framework via RL with verifiable reward to enhance the capabilities of LVLMs on the TVG task. (2) TimeRFT: we explore data-efficient post-training strategies on our curated RL-friendly dataset, which trains the model to progressively comprehend difficult samples, leading to better generalization. (3) TVGBench: we carefully construct a small yet comprehensive benchmark for LVLM evaluation, assessing 11 types of queries and featuring balanced distributions across both videos and queries. Extensive experiments demonstrate that Time-R1 achieves state-of-the-art performance across multiple downstream datasets using only 2.5K training data, while improving its general video understanding capabilities.
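Temporal intersection-over-union (IoU) is a standard way to score a predicted video segment against a reference segment in TVG, and it is the kind of quantity that can be checked automatically. The sketch below shows that computation only; the abstract does not state which verifiable reward Time-R1 uses, so treating IoU as the reward is an assumption, and the name `temporal_iou` is hypothetical.

```python
def temporal_iou(pred, gold):
    """IoU of two time intervals given as (start_seconds, end_seconds)."""
    (pred_start, pred_end), (gold_start, gold_end) = pred, gold
    # Overlap length, clamped at zero for disjoint intervals.
    inter = max(0.0, min(pred_end, gold_end) - max(pred_start, gold_start))
    union = (pred_end - pred_start) + (gold_end - gold_start) - inter
    return inter / union if union > 0 else 0.0

if __name__ == "__main__":
    # Predicted segment 0-10 s versus annotated segment 5-15 s.
    print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
```

A score of 1.0 means the two segments coincide exactly and 0.0 means they do not overlap at all.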
Article 389
Title@2025-06-29 (7): Unleashing Embodied Task Planning Ability in LLMs via Reinforcement Learning
Title: Unleashing Embodied Task Planning Ability in LLMs via Reinforcement Learning | Entleashing Embodyd Task Planning Fähigkeit in LLMs durch Verstärkung Learning | 通过强化学习,在LLMs中释放未穿衣任务规划能力 2506.23127v1 |
Authors (6): Zhaoye Fei, Li Ji, Siyin Wang, Junhao Shi, Jingjing Gong, Xipeng Qiu
Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, yet they face significant challenges in embodied task planning scenarios that require continuous environmental understanding and action generation. Existing approaches generate open-loop action scripts based on static knowledge, making it difficult to learn causal relationships between actions and environmental feedback, particularly in partially observable environments. We introduce Embodied Planner-R1, a novel outcome-driven reinforcement learning framework that enables LLMs to develop interactive capabilities through autonomous exploration with minimal supervision. Our framework incorporates three key innovations: (1) pure reinforcement learning with group rollout, requiring no human annotations and incorporating in-environment interaction through parallel exploration; (2) a completion-driven sparse reward; and (3) Interactive Policy Optimization (IPO) for efficient learning from grouped trajectories. Across two challenging text-based embodied planning benchmarks, Embodied Planner-R1 achieves impressive completion rates of 97.78% on ALFWorld and 79.92% on ScienceWorld, surpassing prior methods by a large margin, and suffers only a 3.66% drop in previously unseen environments, evidencing strong generalization.
Article 390
Title@2025-06-29 (7): Beware of Calibration Data for Pruning Large Language Models
Title: Beware of Calibration Data for Pruning Large Language Models | Hüten Sie sich vor Kalibrierdaten für das Pruning von großen Sprachmodellen | 注意为粗略大语言模型提供校准数据 2410.17711v2 |
Authors (8): Yixin Ji, Yang Xiang, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang
As large language models (LLMs) are widely applied across various fields, model compression has become increasingly crucial for reducing costs and improving inference efficiency. Post-training pruning is a promising method that does not require resource-intensive iterative training and only needs a small amount of calibration data to assess the importance of parameters. Recent research has enhanced post-training pruning from different aspects, but few studies systematically explore the effects of calibration data, and it is unclear whether better calibration data construction strategies exist. We fill this gap and surprisingly observe that calibration data is also crucial to post-training pruning, especially for high sparsity. Through controlled experiments on important influence factors of calibration data, including the pruning settings, the amount of data, and its similarity with pre-training data, we observe that a small amount of data is adequate, and that data more similar to the pre-training data can yield better performance. As pre-training data is usually inaccessible for advanced LLMs, we further provide a self-generating calibration data synthesis strategy to construct feasible calibration data. Experimental results on recent strong open-source LLMs (e.g., DCLM and LLaMA-3) show that the proposed strategy can enhance the performance of strong pruning methods (e.g., Wanda, DSnoT, OWL) by a large margin (up to 2.68%). Code is available at https://github.com/Dereck0602/calibration_data.
Article 391
Title@2025-06-29 (7): Decoding Memes: Benchmarking Narrative Role Classification across Multilingual and Multimodal Models
Title: Decoding Memes: Benchmarking Narrative Role Classification across Multilingual and Multimodal Models | Decoding Memes: Benchmarking Narrative Role Classification für multilinguale und multimodale Modelle | 代码模式:多语种和多模式模式的 “ 示范 “ 和 “ 多语种和多模式模式 “ 的 “ 示范作用分类基准 “ 2506.23122v1 |
Authors (2): Shivam Sharma, Tanmoy Chakraborty
This work investigates the challenging task of identifying narrative roles - Hero, Villain, Victim, and Other - in Internet memes, across three diverse test sets spanning English and code-mixed (English-Hindi) languages. Building on an annotated dataset originally skewed toward the ‘Other’ class, we explore a more balanced and linguistically diverse extension, originally introduced as part of the CLEF 2024 shared task. Comprehensive lexical and structural analyses highlight the nuanced, culture-specific, and context-rich language used in real memes, in contrast to synthetically curated hateful content, which exhibits explicit and repetitive lexical markers. To benchmark the role detection task, we evaluate a wide spectrum of models, including fine-tuned multilingual transformers, sentiment and abuse-aware classifiers, instruction-tuned LLMs, and multimodal vision-language models. Performance is assessed under zero-shot settings using precision, recall, and F1 metrics. While larger models like DeBERTa-v3 and Qwen2.5-VL demonstrate notable gains, results reveal consistent challenges in reliably identifying the ‘Victim’ class and generalising across cultural and code-mixed content. We also explore prompt design strategies to guide multimodal models and find that hybrid prompts incorporating structured instructions and role definitions offer marginal yet consistent improvements. Our findings underscore the importance of cultural grounding, prompt engineering, and multimodal reasoning in modelling subtle narrative framings in visual-textual content.
Article 392
Title@2025-06-29 (7): Enough Coin Flips Can Make LLMs Act Bayesian
Title: Enough Coin Flips Can Make LLMs Act Bayesian | Genug Münze Flips kann LLMs Act Bayesian | 足够多的硬币翻翻可以制造长效LLM 贝叶斯女士 2503.04722v2 |
Authors (7): Ritwik Gupta, Rodolfo Corona, Jiaxin Ge, Eric Wang, Dan Klein, Trevor Darrell, David M. Chan
Large language models (LLMs) exhibit the ability to generalize given few-shot examples in their input prompt, an emergent capability known as in-context learning (ICL). We investigate whether LLMs use ICL to perform structured reasoning in ways that are consistent with a Bayesian framework or rely on pattern matching. Using a controlled setting of biased coin flips, we find that: (1) LLMs often possess biased priors, causing initial divergence in zero-shot settings, (2) in-context evidence outweighs explicit bias instructions, (3) LLMs broadly follow Bayesian posterior updates, with deviations primarily due to miscalibrated priors rather than flawed updates, and (4) attention magnitude has negligible effect on Bayesian inference. With sufficient demonstrations of biased coin flips via ICL, LLMs update their priors in a Bayesian manner.
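For reference, the exact Bayesian update for a biased coin has a closed form under a Beta prior: observing heads and tails simply adds to the two prior parameters. The sketch below computes that textbook update so that it can serve as a baseline when judging how closely a model's stated probabilities track the evidence. The Beta prior and the function names are assumptions for illustration; the abstract does not describe the authors' evaluation code.

```python
def beta_posterior(alpha, beta, heads, tails):
    """Beta(alpha, beta) prior + coin-flip counts -> posterior parameters."""
    return alpha + heads, beta + tails

def posterior_mean(alpha, beta):
    """Expected probability of heads under Beta(alpha, beta)."""
    return alpha / (alpha + beta)

if __name__ == "__main__":
    # Uniform prior Beta(1, 1), then 7 heads and 3 tails shown in context.
    a, b = beta_posterior(1, 1, heads=7, tails=3)
    print(a, b, posterior_mean(a, b))
```

As more flips are observed, the counts dominate the prior parameters, which matches the abstract's finding that enough in-context evidence outweighs a miscalibrated prior.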
Article 393
Title@2025-06-29 (7): A Survey of Test-Time Compute: From Intuitive Inference to Deliberate Reasoning
Title: A Survey of Test-Time Compute: From Intuitive Inference to Deliberate Reasoning | Eine Übersicht über die Berechnung der Testzeit: Vom intuitiven Rückschluss zur überlegten Vernunft | 试验时间计算调查:从直觉推理到故意推理 2501.02497v3 |
Authors (9): Yixin Ji, Juntao Li, Yang Xiang, Hai Ye, Kaixin Wu, Kai Yao, Jia Xu, Linjian Mo, Min Zhang
The remarkable performance of the o1 model in complex reasoning demonstrates that test-time compute scaling can further unlock the model’s potential, enabling powerful System-2 thinking. However, there is still a lack of comprehensive surveys for test-time compute scaling. We trace the concept of test-time compute back to System-1 models. In System-1 models, test-time compute addresses distribution shifts and improves robustness and generalization through parameter updating, input modification, representation editing, and output calibration. In System-2 models, it enhances the model’s reasoning ability to solve complex problems through repeated sampling, self-correction, and tree search. We organize this survey according to the trend of System-1 to System-2 thinking, highlighting the key role of test-time compute in the transition from System-1 models to weak System-2 models, and then to strong System-2 models. We also point out advanced topics and future directions.
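Repeated sampling, one of the System-2 techniques listed above, can be illustrated with majority voting over several sampled answers. The sketch below uses a hypothetical `sample_answer` function as a stand-in for a stochastic model call; the 60% per-sample accuracy and the answer strings are assumptions, not figures from the survey.

```python
import random
from collections import Counter

def majority_vote(answers):
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]

def sample_answer(rng, p_correct, correct="42", wrong=("41", "43", "44")):
    """Hypothetical stand-in for one stochastic model call."""
    if rng.random() < p_correct:
        return correct
    return rng.choice(wrong)

def accuracy(rng, n_samples, p_correct, trials=1000):
    """Fraction of trials in which the majority answer is the correct one."""
    hits = 0
    for _ in range(trials):
        answers = [sample_answer(rng, p_correct) for _ in range(n_samples)]
        hits += majority_vote(answers) == "42"
    return hits / trials

rng = random.Random(0)
single = accuracy(rng, n_samples=1, p_correct=0.6)
voted = accuracy(rng, n_samples=15, p_correct=0.6)
print(f"1 sample: {single:.2f}, 15 samples with voting: {voted:.2f}")
```

Voting helps here only because the simulated errors are spread over several wrong answers while the correct answer is the single most likely one; if a model's errors concentrate on one wrong answer, extra samples do not help in the same way.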
Article 394
Title@2025-06-29 (7): MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings
Title: MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings | MoCa: Modality-aware Continual Pre-Training macht bidirektionale multimodale Einbettungen besser | MoCa: 模式 – – 有意识的连续培训前预培训使双向双向多模式嵌入更佳 2506.23115v1 |
Authors (7): Haonan Chen, Hong Liu, Yuping Luo, Liang Wang, Nan Yang, Furu Wei, Zhicheng Dou
Multimodal embedding models, built upon causal Vision Language Models (VLMs), have shown promise in various tasks. However, current approaches face three key limitations: the use of causal attention in VLM backbones is suboptimal for embedding tasks; scalability issues due to reliance on high-quality labeled paired data for contrastive learning; and limited diversity in training objectives and data. To address these issues, we propose MoCa, a two-stage framework for transforming pre-trained VLMs into effective bidirectional multimodal embedding models. The first stage, Modality-aware Continual Pre-training, introduces a joint reconstruction objective that simultaneously denoises interleaved text and image inputs, enhancing bidirectional context-aware reasoning. The second stage, Heterogeneous Contrastive Fine-tuning, leverages diverse, semantically rich multimodal data beyond simple image-caption pairs to enhance generalization and alignment. Our method addresses the stated limitations by introducing bidirectional attention through continual pre-training, scaling effectively with massive unlabeled datasets via joint reconstruction objectives, and utilizing diverse multimodal data for enhanced representation robustness. Experiments demonstrate that MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results, and exhibits strong scalability with both model size and training data on MMEB.
Article 395
Title@2025-06-29 (7): FairI Tales: Evaluation of Fairness in Indian Contexts with a Focus on Bias and Stereotypes
Title: FairI Tales: Evaluation of Fairness in Indian Contexts with a Focus on Bias and Stereotypes | FairI Tales: Bewertung von Fairness in indischen Kontexten mit Fokus auf Bias und Stereotypen | FairI Tales:以偏见和陈规定型观念为重点,评价印度背景下的公平性 2506.23111v1 |
Authors (6): Janki Atul Nawale, Mohammed Safi Ur Rahman Khan, Janani D, Mansi Gupta, Danish Pruthi, Mitesh M. Khapra
Existing studies on fairness are largely Western-focused, making them inadequate for culturally diverse countries such as India. To address this gap, we introduce INDIC-BIAS, a comprehensive India-centric benchmark designed to evaluate fairness of LLMs across 85 identity groups encompassing diverse castes, religions, regions, and tribes. We first consult domain experts to curate over 1,800 socio-cultural topics spanning behaviors and situations, where biases and stereotypes are likely to emerge. Grounded in these topics, we generate and manually validate 20,000 real-world scenario templates to probe LLMs for fairness. We structure these templates into three evaluation tasks: plausibility, judgment, and generation. Our evaluation of 14 popular LLMs on these tasks reveals strong negative biases against marginalized identities, with models frequently reinforcing common stereotypes. Additionally, we find that models struggle to mitigate bias even when explicitly asked to rationalize their decision. Our evaluation provides evidence of both allocative and representational harms that current LLMs could cause towards Indian identities, calling for a more cautious usage in practical applications. We release INDIC-BIAS as an open-source benchmark to advance research on benchmarking and mitigating biases and stereotypes in the Indian context.
Article 396
Title@2025-06-29 (7): From Individuals to Interactions: Benchmarking Gender Bias in Multimodal Large Language Models from the Lens of Social Relationship
Title: From Individuals to Interactions: Benchmarking Gender Bias in Multimodal Large Language Models from the Lens of Social Relationship | Von Individuen zu Interaktionen: Benchmarking von Gender Bias in multimodalen großen Sprachmodellen aus dem Bereich der sozialen Beziehung | 从个人到互动:从社会关系的角度衡量多模式大语言模式中的性别偏见 2506.23101v1 |
Authors (2): Yue Xu, Wenjie Wang
Multimodal large language models (MLLMs) have shown impressive capabilities across tasks involving both visual and textual modalities. However, growing concerns remain about their potential to encode and amplify gender bias, particularly in socially sensitive applications. Existing benchmarks predominantly evaluate bias in isolated scenarios, overlooking how bias may emerge subtly through interpersonal interactions. We fill this gap by going beyond single-entity evaluation and instead focusing on a deeper examination of relational and contextual gender bias in dual-individual interactions. We introduce Genres, a novel benchmark designed to evaluate gender bias in MLLMs through the lens of social relationships in generated narratives. Genres assesses gender bias through a dual-character profile and narrative generation task that captures rich interpersonal dynamics and supports a fine-grained bias evaluation suite across multiple dimensions. Experiments on both open- and closed-source MLLMs reveal persistent, context-sensitive gender biases that are not evident in single-character settings. Our findings underscore the importance of relationship-aware benchmarks for diagnosing subtle, interaction-driven gender bias in MLLMs and provide actionable insights for future bias mitigation.
Article 397
Title@2025-06-29 (7): Learning Dynamics of LLM Finetuning
Title: Learning Dynamics of LLM Finetuning | Dynamisches Lernen der LLM-Feinsteuerung | LLM 微调的学习动态 2407.10490v4 |
Authors (2): Yi Ren, Danica J. Sutherland
Learning dynamics, which describes how the learning of specific training examples influences the model’s predictions on other examples, gives us a powerful tool for understanding the behavior of deep learning systems. We study the learning dynamics of large language models during different types of finetuning, by analyzing the step-wise decomposition of how influence accumulates among different potential responses. Our framework allows a uniform interpretation of many interesting observations about the training of popular algorithms for both instruction tuning and preference tuning. In particular, we propose a hypothetical explanation of why specific types of hallucination are strengthened after finetuning, e.g., the model might use phrases or facts in the response for question B to answer question A, or the model might keep repeating similar simple phrases when generating responses. We also extend our framework and highlight a unique “squeezing effect” to explain a previously observed phenomenon in off-policy direct preference optimization (DPO), where running DPO for too long makes even the desired outputs less likely. This framework also provides insights into where the benefits of on-policy DPO and other variants come from. The analysis not only provides a novel perspective on LLM finetuning but also inspires a simple, effective method to improve alignment performance.
Article 398
Title@2025-06-29 (7): MMInA: Benchmarking Multihop Multimodal Internet Agents
Title: MMInA: Benchmarking Multihop Multimodal Internet Agents | MMINA: Benchmarking Multihop Multimodale Internet-Agenten | MMINA: 确定多速多式互联网代理商的基准 2404.09992v2 |
Authors (4): Shulin Tian, Ziniu Zhang, Liangyu Chen, Ziwei Liu
Autonomous embodied agents live on an Internet of multimedia websites. Can they hop around multimodal websites to complete complex user tasks? Existing benchmarks fail to assess them in a realistic, evolving environment for their embodiment across websites. To answer this question, we present MMInA, a multihop and multimodal benchmark to evaluate the embodied agents for compositional Internet tasks, with several appealing properties: 1) Evolving real-world multimodal websites. Our benchmark uniquely operates on evolving real-world websites, ensuring a high degree of realism and applicability to natural user tasks. Our data includes 1,050 human-written tasks covering various domains such as shopping and travel, with each task requiring the agent to extract multimodal information from web pages as observations autonomously; 2) Multihop web browsing. Our dataset features naturally compositional tasks that require information from or actions on multiple websites to solve, to assess long-range reasoning capabilities on web tasks; 3) Holistic evaluation. We propose a novel protocol for evaluating an agent’s progress in completing multihop tasks. We experiment with both standalone (multimodal) language models and heuristic-based web agents. Extensive experiments demonstrate that while long-chain multihop web tasks are easy for humans, they remain challenging for state-of-the-art web agents. We identify that agents are more likely to fail on the early hops when solving tasks with more hops, which results in lower task success rates. To address this issue, we propose a simple memory augmentation approach that replays past action trajectories for reflection. Our method significantly improves both single-hop and multihop web browsing performance. Our code and data are available at github.com/shulin16/MMInA.
Article 399
Title@2025-06-29 (7): TyphoFormer: Language-Augmented Transformer for Accurate Typhoon Track Forecasting
Title: TyphoFormer: Language-Augmented Transformer for Accurate Typhoon Track Forecasting | TyphoFormer: Sprachgesteigerter Transformer für präzise Typhoon-Track-Prognose | 台风前台风:用于准确预报台风轨道的语文增强变换器 2506.17609v2 |
Authors (6): Lincan Li, Eren Erman Ozguven, Yue Zhao, Guang Wang, Yiqun Xie, Yushun Dong
Accurate typhoon track forecasting is crucial for early warning and disaster response. While Transformer-based models have demonstrated strong performance in modeling the temporal dynamics of dense trajectories of humans and vehicles in smart cities, they usually lack access to broader contextual knowledge that enhances the forecasting reliability of sparse meteorological trajectories, such as typhoon tracks. To address this challenge, we propose TyphoFormer, a novel framework that incorporates natural language descriptions as auxiliary prompts to improve typhoon trajectory forecasting. For each time step, we use a Large Language Model (LLM) to generate concise textual descriptions based on the numerical attributes recorded in the North Atlantic hurricane database. The language descriptions capture high-level meteorological semantics and are embedded as auxiliary special tokens prepended to the numerical time series input. By integrating both textual and sequential information within a unified Transformer encoder, TyphoFormer enables the model to leverage contextual cues that are otherwise inaccessible through numerical features alone. Extensive experiments on the HURDAT2 benchmark show that TyphoFormer consistently outperforms other state-of-the-art baseline methods, particularly under challenging scenarios involving nonlinear path shifts and limited historical observations.
Article 400
Title@2025-06-29 (7): DReSS: Data-driven Regularized Structured Streamlining for Large Language Models
Title: DReSS: Data-driven Regularized Structured Streamlining for Large Language Models | DResS: Datengesteuerte Regularisierte Strukturierte Straffung für große Sprachmodelle | DReSS: 数据驱动的大型语文模式正规化结构精简 2501.17905v3 |
Authors (8): Mingkuan Feng, Jinyang Wu, Shuai Zhang, Pengpeng Shao, Ruihan Jin, Zhengqi Wen, Jianhua Tao, Feihu Che
Large language models (LLMs) have achieved significant progress across various domains, but their increasing scale results in high computational and memory costs. Recent studies have revealed that LLMs exhibit sparsity, providing the potential to reduce model size through pruning techniques. However, existing pruning methods typically follow a prune-then-finetune paradigm. Since the pruned components still contain valuable information, their direct removal often leads to irreversible performance degradation, imposing a substantial computational burden to recover performance during finetuning. In this paper, we propose a novel paradigm that first applies regularization, then prunes, and finally finetunes. Based on this paradigm, we introduce DReSS, a simple and effective Data-driven Regularized Structured Streamlining method for LLMs. By leveraging a small amount of data to regularize the components to be pruned, DReSS explicitly transfers the important information to the remaining parts of the model in advance. Compared to direct pruning, this can reduce the information loss caused by parameter removal, thereby enhancing its language modeling capabilities. Experimental results demonstrate that DReSS significantly outperforms existing pruning methods even under extreme pruning ratios, significantly reducing latency and increasing throughput.
Article 401
Title@2025-06-29 (7): Text2VectorSQL: Bridging Text-to-SQL and Vector Search for Unified Natural Language Queries
Title: Text2VectorSQL: Bridging Text-to-SQL and Vector Search for Unified Natural Language Queries | Text2VectorSQL: Überbrückung Text-zu-SQL und Vektor Suche nach Unified Natural Language Queries | Text2VectorSQL: 连接文本到SQL和矢量搜索统一自然语言查询 2506.23071v1 |
Authors (4): Zhengren Wang, Bozhou Li, Dongwen Yao, Wentao Zhang
While Text-to-SQL enables natural language interaction with structured databases, its effectiveness diminishes with unstructured data or ambiguous queries due to rigid syntax and limited expressiveness. Concurrently, vector search has emerged as a powerful paradigm for semantic retrieval, particularly for unstructured data. However, existing VectorSQL implementations still rely heavily on manual crafting and lack tailored evaluation frameworks, leaving a significant gap between theoretical potential and practical deployment. To bridge these complementary paradigms, we introduce Text2VectorSQL, a novel framework unifying Text-to-SQL and vector search to overcome expressiveness constraints and support more diverse and holistic natural language queries. Specifically, Text2VectorSQL enables semantic filtering, multi-modal matching, and retrieval acceleration. For evaluation, we build vector indexes on appropriate columns, extend user queries with semantic search, and annotate ground truths via an automatic pipeline with expert review. Furthermore, we develop dedicated Text2VectorSQL models with synthetic data, demonstrating significant performance improvements over baseline methods. Our work establishes the foundation for the Text2VectorSQL task, paving the way for more versatile and intuitive database interfaces. The repository will be publicly available at https://github.com/Open-DataFlow/Text2VectorSQL.
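To make the idea of combining a structured SQL predicate with semantic retrieval concrete, here is a minimal sketch using an in-memory SQLite table and a brute-force cosine ranking. It does not reproduce Text2VectorSQL's query syntax or models: the table, the hand-written three-dimensional "embeddings", and the `hybrid_query` helper are all hypothetical.

```python
import json
import math
import sqlite3

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy catalogue; embeddings are hand-written stand-ins for a real encoder.
rows = [
    ("trail running shoes", 89.0, [0.9, 0.1, 0.0]),
    ("leather office shoes", 120.0, [0.2, 0.9, 0.1]),
    ("waterproof hiking boots", 150.0, [0.8, 0.2, 0.3]),
    ("cotton t-shirt", 15.0, [0.0, 0.1, 0.9]),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price REAL, embedding TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(name, price, json.dumps(emb)) for name, price, emb in rows],
)

def hybrid_query(conn, query_vec, max_price, k):
    """Apply the structured filter in SQL, then rank survivors by similarity."""
    candidates = conn.execute(
        "SELECT name, embedding FROM products WHERE price <= ?", (max_price,)
    ).fetchall()
    scored = [(cosine(query_vec, json.loads(emb)), name) for name, emb in candidates]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# "Outdoor footwear under 130": the query vector is also a hand-written stand-in.
result = hybrid_query(conn, [1.0, 0.0, 0.1], max_price=130.0, k=2)
print(result)
```

The hiking boots are semantically close to the query but are excluded by the price predicate, which is the kind of interaction between exact filters and fuzzy matching that a unified query has to express. A production system would use a vector index rather than scanning every candidate.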
Article 402
Title@2025-06-29 (7): Multimodal Medical Code Tokenizer
Title: Multimodal Medical Code Tokenizer | Multimodaler medizinischer Code Tokenizer | 多式联运医疗法典化器 2502.04397v3 |
Authors (8): Xiaorui Su, Shvat Messica, Yepeng Huang, Ruth Johnson, Lukas Fesser, Shanghua Gao, Faryad Sahneh, Marinka Zitnik
Foundation models trained on patient electronic health records (EHRs) require tokenizing medical data into sequences of discrete vocabulary items. Existing tokenizers treat medical codes from EHRs as isolated textual tokens. However, each medical code is defined by its textual description, its position in ontological hierarchies, and its relationships to other codes, such as disease co-occurrences and drug-treatment associations. Medical vocabularies contain more than 600,000 codes with critical information for clinical reasoning. We introduce MedTok, a multimodal medical code tokenizer that uses the text descriptions and relational context of codes. MedTok processes text using a language model encoder and encodes the relational structure with a graph encoder. It then quantizes both modalities into a unified token space, preserving modality-specific and cross-modality information. We integrate MedTok into five EHR models and evaluate it on operational and clinical tasks across in-patient and out-patient datasets, including outcome prediction, diagnosis classification, drug recommendation, and risk stratification. Swapping standard EHR tokenizers with MedTok improves AUPRC across all EHR models, by 4.10% on MIMIC-III, 4.78% on MIMIC-IV, and 11.32% on EHRShot, with the largest gains in drug recommendation. Beyond EHR modeling, we demonstrate the use of the MedTok tokenizer with medical QA systems. Our results demonstrate the potential of MedTok as a unified tokenizer for medical codes, improving tokenization for medical foundation models.
Article 403
Title@2025-06-29 (7): Boosting LLM’s Molecular Structure Elucidation with Knowledge Enhanced Tree Search Reasoning
Title: Boosting LLM’s Molecular Structure Elucidation with Knowledge Enhanced Tree Search Reasoning | Förderung der molekularen Struktur von LLM mit Wissensverstärkung der Baumsuche | 推动LLM的分子结构 2506.23056v1 |
Authors (9): Xiang Zhuang, Bin Wu, Jiyu Cui, Kehua Feng, Xiaotong Li, Huabin Xing, Keyan Ding, Qiang Zhang, Huajun Chen
Molecular structure elucidation involves deducing a molecule’s structure from various types of spectral data, which is crucial in chemical experimental analysis. While large language models (LLMs) have shown remarkable proficiency in analyzing and reasoning through complex tasks, they still encounter substantial challenges in molecular structure elucidation. We identify that these challenges largely stem from LLMs’ limited grasp of specialized chemical knowledge. In this work, we introduce a Knowledge-enhanced reasoning framework for Molecular Structure Elucidation (K-MSE), leveraging Monte Carlo Tree Search for test-time scaling as a plugin. Specifically, we construct an external molecular substructure knowledge base to extend the LLMs’ coverage of the chemical structure space. Furthermore, we design a specialized molecule-spectrum scorer to act as a reward model for the reasoning process, addressing the issue of inaccurate solution evaluation in LLMs. Experimental results show that our approach significantly boosts performance, particularly gaining more than 20% improvement on both GPT-4o-mini and GPT-4o. Our code is available at https://github.com/HICAI-ZJU/K-MSE.
Article 404
Title@2025-06-29 (7): MariNER: A Dataset for Historical Brazilian Portuguese Named Entity Recognition
Title: MariNER: A Dataset for Historical Brazilian Portuguese Named Entity Recognition | MariNER: Ein Datensatz für die historische brasilianische portugiesische Identitätserkennung | Marinner:巴西历史上葡萄牙命名实体识别数据集 2506.23051v1 |
Authors (4): João Lucas Luz Lima Sarcinelli, Marina Lages Gonçalves Teixeira, Jade Bortot de Paiva, Diego Furtado Silva
Named Entity Recognition (NER) is a fundamental Natural Language Processing (NLP) task that aims to identify and classify entity mentions in texts across different categories. While languages such as English possess a large number of high-quality resources for this task, Brazilian Portuguese still lacks gold-standard NER datasets, especially for specific domains. In particular, this paper considers the importance of NER for analyzing historical texts in the context of digital humanities. To address this gap, this work outlines the construction of MariNER: Mapeamento e Anotações de Registros hIstóricos para NER (Mapping and Annotation of Historical Records for NER), the first gold-standard dataset for early 20th-century Brazilian Portuguese, with more than 9,000 manually annotated sentences. We also assess and compare the performance of state-of-the-art NER models on the dataset.
Article 405
Title@2025-06-29 (7): AURA: Agent for Understanding, Reasoning, and Automated Tool Use in Voice-Driven Tasks
Title: AURA: Agent for Understanding, Reasoning, and Automated Tool Use in Voice-Driven Tasks | AURA: Agent für Verständnis, Vernunft und automatisierte Werkzeugnutzung in stimmgesteuerten Aufgaben | AURA: 语音驱动任务中理解、解释和自动工具使用代理 2506.23049v1 |
Authors (5): Leander Melroy Maben, Gayathri Ganesh Lakshmy, Srijith Radhakrishnan, Siddhant Arora, Shinji Watanabe
Despite advances in language and speech technologies, no open-source system enables full speech-to-speech, multi-turn dialogue with integrated tool use and agentic reasoning. We introduce AURA (Agent for Understanding, Reasoning, and Automated Tool Use), the first open-source, speech-native assistant capable of completing complex, goal-driven tasks through dynamic tool invocation and multi-turn conversation. AURA combines open-weight ASR, TTS, and LLMs in a cascaded pipeline and supports tools such as calendar booking, contact lookup, web search, and email. Its modular design allows easy integration of new tools using natural language prompts and action classes. On VoiceBench, AURA scores 92.75% on OpenBookQA (outperforming all open-weight systems and nearing GPT-4o) and 4.39 on AlpacaEval, competitive with other open-weight systems. Human evaluation shows 90% task success on complex, multi-turn speech tasks.
Article 406
Title@2025-06-29 (7): SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions
Title: SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions | SoMi-ToM: Bewertung der multiperspektiven Theorie des Geistes in körpereigenen sozialen Interaktionen | SoMi-ToM:评估社会互动中的多视角思维理论 2506.23046v1 |
Authors (6): Xianzhe Fan, Xuhui Zhou, Chuanyang Jin, Kolby Nottingham, Hao Zhu, Maarten Sap
Humans continuously infer the states, goals, and behaviors of others by perceiving their surroundings in dynamic, real-world social interactions. However, most Theory of Mind (ToM) benchmarks only evaluate static, text-based scenarios, which have a significant gap compared to real interactions. We propose the SoMi-ToM benchmark, designed to evaluate multi-perspective ToM in embodied multi-agent complex social interactions. This benchmark is based on rich multimodal interaction data generated by the interaction environment SoMi, covering diverse crafting goals and social relationships. Our framework supports multi-level evaluation: (1) first-person evaluation provides multimodal (visual, dialogue, action, etc.) input from a first-person perspective during a task for real-time state inference, (2) third-person evaluation provides complete third-person perspective video and text records after a task for goal and behavior inference. This evaluation method allows for a more comprehensive examination of a model’s ToM capabilities from both the subjective immediate experience and the objective global observation. We constructed a challenging dataset containing 35 third-person perspective videos, 363 first-person perspective images, and 1225 expert-annotated multiple-choice questions (three options). On this dataset, we systematically evaluated the performance of human subjects and several state-of-the-art large vision-language models (LVLMs). The results show that LVLMs perform significantly worse than humans on SoMi-ToM: the average accuracy gap between humans and models is 40.1% in first-person evaluation and 26.4% in third-person evaluation. This indicates that future LVLMs need to further improve their ToM capabilities in embodied, complex social interactions.
Article 407
Title@2025-06-29 (7): MetaSynth: Meta-Prompting-Driven Agentic Scaffolds for Diverse Synthetic Data Generation
Title: MetaSynth: Meta-Prompting-Driven Agentic Scaffolds for Diverse Synthetic Data Generation | MetaSynth: Meta-prompting-Driven Agentic Scaffolds für vielfältige synthetische Datengenerierung | MetaSynth: 用于多种合成数据生成的元- 制造- 挥发剂脚架 2504.12563v2 |
Authors (5): Haris Riaz, Sourav Bhabesh, Vinayak Arannil, Miguel Ballesteros, Graham Horwood
Recent smaller language models such as Phi-3.5 and Phi-4 rely on synthetic data generated using larger language models. Questions remain about leveraging synthetic data for other use cases, such as adapting LLMs to specific domains. A key limitation of synthetic data is low diversity, which negatively impacts its downstream applicability for improving other models. To address this, we propose MetaSynth, a method for generating synthetic data that enhances diversity through meta-prompting, where a language model orchestrates multiple “expert” LLM agents to collaboratively generate data. Using only 25 million tokens of synthetic data generated with MetaSynth, we successfully adapt a well-trained LLM (Mistral-7B-v0.3) to two specialized domains (Finance and Biomedicine) without compromising the capabilities of the resulting model in general tasks. In addition, we evaluate the diversity of our synthetic data using seven automated metrics, and find that it approaches the diversity of LLM pre-training corpora. Continually pre-training Mistral-7B-v0.3 with MetaSynth notably outperforms the base LLM, showing improvements of up to 4.08% in Finance and 13.75% in Biomedicine. The same model shows degraded performance when trained on data generated using a template prompt, even when the template includes prior generations and varying in-context exemplars of real data. Our findings suggest that a few million tokens of diverse synthetic data, without mixing in any real data, are sufficient for effective domain adaptation when using MetaSynth.
Article 408
Title@2025-06-29 (7): CHARTOM: A Visual Theory-of-Mind Benchmark for LLMs on Misleading Charts
Title: CHARTOM: A Visual Theory-of-Mind Benchmark for LLMs on Misleading Charts | CHARTOM: Ein visueller Theorie-von-Mind-Benchmark für LLMs auf irreführenden Diagrammen | 错误领导图表LLML女士的视觉理论基准 2408.14419v3 |
Authors (8): Shubham Bharti, Shiyun Cheng, Jihyun Rho, Jianrui Zhang, Mu Cai, Yong Jae Lee, Martina Rau, Xiaojin Zhu
We introduce CHARTOM, a visual theory-of-mind benchmark designed to evaluate multimodal large language models’ capability to understand and reason about misleading data visualizations through charts. CHARTOM consists of carefully designed charts and associated questions that require a language model to not only correctly comprehend the factual content in the chart (the FACT question) but also judge whether the chart will be misleading to human readers (the MIND question), a dual capability with significant societal benefits. We detail the construction of our benchmark, including its calibration on human performance and the estimation of MIND ground truth, called the Human Misleadingness Index. We evaluated several leading LLMs – including GPT, Claude, Gemini, Qwen, Llama, and Llava series models – on the CHARTOM dataset and found that it was challenging for all models on both FACT and MIND questions. This highlights the limitations of current LLMs and presents a significant opportunity for future LLMs to improve on understanding misleading charts.
Article 409
Title@2025-06-28 (6): Multimodal Contrastive Representation Learning in Augmented Biomedical Knowledge Graphs
Title: Multimodal Contrastive Representation Learning in Augmented Biomedical Knowledge Graphs | Multimodales Kontrastives Repräsentationslernen in Augmented Biomedical Knowledge Graphs | 生物医学知识强化图中多模式差异代表性学习 2501.01644v2 |
Authors (4): Tien Dang, Viet Thanh Duy Nguyen, Minh Tuan Le, Truong-Son Hy
Biomedical Knowledge Graphs (BKGs) integrate diverse datasets to elucidate complex relationships within the biomedical field. Effective link prediction on these graphs can uncover valuable connections, such as potential novel drug-disease relations. We introduce a novel multimodal approach that unifies embeddings from specialized Language Models (LMs) with Graph Contrastive Learning (GCL) to enhance intra-entity relationships while employing a Knowledge Graph Embedding (KGE) model to capture inter-entity relationships for effective link prediction. To address limitations in existing BKGs, we present PrimeKG++, an enriched knowledge graph incorporating multimodal data, including biological sequences and textual descriptions for each entity type. By combining semantic and relational information in a unified representation, our approach demonstrates strong generalizability, enabling accurate link predictions even for unseen nodes. Experimental results on PrimeKG++ and the DrugBank drug-target interaction dataset demonstrate the effectiveness and robustness of our method across diverse biomedical datasets. Our source code, pre-trained models, and data are publicly available at https://github.com/HySonLab/BioMedKG
Article 410
Title@2025-06-28 (6): The Limited Impact of Medical Adaptation of Large Language and Vision-Language Models
Title: The Limited Impact of Medical Adaptation of Large Language and Vision-Language Models | Die begrenzten Auswirkungen medizinischer Anpassung von großen Sprach- und Visions-Sprachenmodellen | 大语言和视觉语言模式医学适应的有限影响 2411.08870v3 |
Authors (5): Daniel P. Jeong, Pranav Mani, Saurabh Garg, Zachary C. Lipton, Michael Oberst
Several recent works seek to adapt general-purpose large language models (LLMs) and vision-language models (VLMs) for medical applications through continued pretraining on publicly available biomedical corpora. These works typically claim that such domain-adaptive pretraining improves performance on various downstream medical tasks, such as answering medical exam questions. In this paper, we compare ten “medical” LLMs and two VLMs against their corresponding base models, arriving at a different conclusion: all medical VLMs and nearly all medical LLMs fail to consistently improve over their base models in the zero-/few-shot prompting and supervised fine-tuning regimes for medical question answering (QA). For instance, on clinical-note-based QA tasks in the 3-shot setting, medical LLMs outperform their base models in only 26.7% of cases, reach a (statistical) tie in 16.7% of cases, and perform significantly worse in the remaining 56.7% of cases. Our conclusions are based on (i) comparing each medical model directly against its base model; (ii) optimizing the prompts for each model separately in zero-/few-shot prompting; and (iii) accounting for statistical uncertainty in comparisons. Our findings suggest that state-of-the-art general-domain models may already exhibit strong medical knowledge and reasoning capabilities, and offer recommendations to strengthen the conclusions of future studies.
Article 411
Title@2025-06-28 (6): MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning
Title: MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning | MARBLE: Ein harter Maßstab für multimodale räumliche Vernunft und Planung | 多式联运空间理由和规划的硬基准 2506.22992v1 |
Authors (4): Yulun Jiang, Yekun Chai, Maria Brbić, Michael Moor
The ability to process information from multiple modalities and to reason through it step-by-step remains a critical challenge in advancing artificial intelligence. However, existing reasoning benchmarks focus on text-only reasoning, or employ multimodal questions that can be answered by directly retrieving information from a non-text modality. Thus, complex reasoning remains poorly understood in multimodal domains. Here, we present MARBLE, a challenging multimodal reasoning benchmark that is designed to scrutinize multimodal language models (MLLMs) in their ability to carefully reason step-by-step through complex multimodal problems and environments. MARBLE is composed of two highly challenging tasks, M-Portal and M-Cube, that require the crafting and understanding of multistep plans under spatial, visual, and physical constraints. We find that current MLLMs perform poorly on MARBLE – all 12 advanced models obtain near-random performance on M-Portal and 0% accuracy on M-Cube. Only in simplified subtasks do some models outperform the random baseline, indicating that complex reasoning is still a challenge for existing MLLMs. Moreover, we show that perception remains a bottleneck, where MLLMs occasionally fail to extract information from the visual inputs. By shedding light on the limitations of MLLMs, we hope that MARBLE will spur the development of the next generation of models with the ability to reason and plan across many multimodal reasoning steps.
Article 412
Title@2025-06-28 (6): Time-MQA: Time Series Multi-Task Question Answering with Context Enhancement
Title: Time-MQA: Time Series Multi-Task Question Answering with Context Enhancement | Time-MQA: Zeitreihe Multi-Task-Fragebeantwortung mit Kontextverbesserung | 时间-MQA:时间系列多任务问题,加强背景回答 2503.01875v2 |
Authors (8): Yaxuan Kong, Yiyuan Yang, Yoontae Hwang, Wenjie Du, Stefan Zohren, Zhangyang Wang, Ming Jin, Qingsong Wen
Time series data are foundational in finance, healthcare, and energy domains. However, most existing methods and datasets remain focused on a narrow spectrum of tasks, such as forecasting or anomaly detection. To bridge this gap, we introduce Time Series Multi-Task Question Answering (Time-MQA), a unified framework that enables natural language queries across multiple time series tasks, covering both numerical analytical tasks and open-ended question answering with reasoning. Central to Time-MQA is the TSQA dataset, a large-scale dataset containing $\sim$200k question-answer pairs derived from diverse time series spanning domains such as environment and traffic. This comprehensive resource covers various time series lengths and promotes robust model development. We further demonstrate how continually pre-training large language models (Mistral 7B, Llama-3 8B, and Qwen-2.5 7B) on the TSQA dataset enhances time series reasoning capabilities, moving beyond mere numeric tasks and enabling more advanced and intuitive interactions with temporal data. The complete TSQA dataset, models, user study questionnaires for evaluation, and other related materials have been open-sourced.
Article 413
Title@2025-06-28 (6): A Systematic Study of Compositional Syntactic Transformer Language Models
Title: A Systematic Study of Compositional Syntactic Transformer Language Models | Eine systematische Studie kompositorischer syntaktischer Transformer-Sprachmodelle | 系统研究合成同步转换器语言模型 2506.22978v1 |
Authors (4): Yida Zhao, Hao Xve, Xiang Hu, Kewei Tu
Syntactic language models (SLMs) enhance Transformers by incorporating syntactic biases through the modeling of linearized syntactic parse trees alongside surface sentences. This paper focuses on compositional SLMs that are based on constituency parse trees and contain explicit bottom-up composition of constituent representations. We identify key aspects of design choices in existing compositional SLMs and propose a unified framework encompassing both existing models and novel variants. We conduct a comprehensive empirical evaluation of all the variants in our framework across language modeling, syntactic generalization, summarization, dialogue, and inference efficiency. Based on the experimental results, we make multiple recommendations on the design of compositional SLMs. Our code is released at https://github.com/zhaoyd1/compositional_SLMs.
Article 414
Title@2025-06-28 (6): On the Generalizability of “Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals”
Title: On the Generalizability of “Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals” | Zur Verallgemeinerbarkeit von “Wettbewerb von Mechanismen: Aufspüren, wie Sprachmodelle mit Fakten und Gegenfakten umgehen” | 关于“机制的竞争:追踪语言模式如何处理事实和反事实”的一般性 2506.22977v1 |
Authors (5): Asen Dotsinski, Udit Thakur, Marko Ivanov, Mohammad Hafeez Khan, Maria Heuss
We present a reproduction study of “Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals” (Ortu et al., 2024), which investigates competition of mechanisms in language models between factual recall and counterfactual in-context repetition. Our study successfully reproduces their primary findings regarding the localization of factual and counterfactual information, the dominance of attention blocks in mechanism competition, and the specialization of attention heads in handling competing information. We reproduce their results on both GPT-2 (Radford et al., 2019) and Pythia 6.9B (Biderman et al., 2023). We extend their work in three significant directions. First, we explore the generalizability of these findings to even larger models by replicating the experiments on Llama 3.1 8B (Grattafiori et al., 2024), discovering greatly reduced attention head specialization. Second, we investigate the impact of prompt structure by introducing variations where we avoid repeating the counterfactual statement verbatim or we change the premise word, observing a marked decrease in the logit for the counterfactual token. Finally, we test the validity of the authors’ claims for prompts of specific domains, discovering that certain categories of prompts skew the results by providing the factual prediction token as part of the subject of the sentence. Overall, we find that the attention head ablation proposed in Ortu et al. (2024) is ineffective for domains that are underrepresented in their dataset, and that the effectiveness varies based on model architecture, prompt structure, domain and task.
Article 415
Title@2025-06-28 (6): Interpretable LLM-based Table Question Answering
Title: Interpretable LLM-based Table Question Answering | Interpretierbare LLM-basierte Tabellenfragebeantwortung | 基于表问题的回答 2412.12386v3 |
Authors (6): Giang Nguyen, Ivan Brugere, Shubham Sharma, Sanjay Kariyappa, Anh Totti Nguyen, Freddy Lecue
Interpretability in Table Question Answering (Table QA) is critical, especially in high-stakes domains like finance and healthcare. While recent Table QA approaches based on Large Language Models (LLMs) achieve high accuracy, they often produce ambiguous explanations of how answers are derived. We propose Plan-of-SQLs (POS), a new Table QA method that makes the model’s decision-making process interpretable. POS decomposes a question into a sequence of atomic steps, each directly translated into an executable SQL command on the table, thereby ensuring that every intermediate result is transparent. Through extensive experiments, we show the following. First, POS generates the highest-quality explanations among compared methods, which markedly improves the users’ ability to simulate and verify the model’s decisions. Second, when evaluated on standard Table QA benchmarks (TabFact, WikiTQ, and FeTaQA), POS achieves QA accuracy that is competitive with existing methods, while also offering greater efficiency (requiring up to 25x fewer LLM calls and table database queries) and more robust performance on large tables. Finally, we observe high agreement (up to 90.59% in forward simulation) between LLMs and human users when making decisions based on the same explanations, suggesting that LLMs could serve as an effective proxy for humans in evaluating Table QA explanations.
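The decomposition idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the table `games`, the three-step plan, and the helper `run_plan` are hypothetical, and the plan is written by hand here, whereas in POS an LLM would produce the steps. Each step is a single SQL query whose result is materialised as its own table, so every intermediate result can be inspected.

```python
import sqlite3

def run_plan(conn, steps):
    """Run each atomic SQL step, materialise it as a table, and keep a trace.

    `steps` is a list of (table_name, description, select_sql) tuples.
    Table names come from the hand-written plan below, not from user input.
    """
    trace = []
    for name, description, select_sql in steps:
        conn.execute(f"CREATE TABLE {name} AS {select_sql}")
        rows = conn.execute(f"SELECT * FROM {name}").fetchall()
        trace.append((description, select_sql, rows))
    return trace

# Hypothetical toy table standing in for a Table QA input.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE games (team TEXT, wins INTEGER, points INTEGER)")
conn.executemany(
    "INSERT INTO games VALUES (?, ?, ?)",
    [("Ajax", 3, 40), ("Bears", 1, 55), ("Colts", 4, 38)],
)

# Question: "Among teams with more than 2 wins, which scored the most points?"
steps = [
    ("step1", "Keep teams with more than 2 wins",
     "SELECT * FROM games WHERE wins > 2"),
    ("step2", "Find the highest point total among them",
     "SELECT MAX(points) AS best FROM step1"),
    ("step3", "Return the team with that point total",
     "SELECT team FROM step1 WHERE points = (SELECT best FROM step2)"),
]

trace = run_plan(conn, steps)
for description, sql, rows in trace:
    print(f"{description}\n  {sql}\n  -> {rows}")
```

Printing the trace shows one row set per step, which is the kind of transparent intermediate result the abstract describes.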
Article 416
Title@2025-06-28 (6): MLAN: Language-Based Instruction Tuning Preserves and Transfers Knowledge in Multimodal Language Models
Title: MLAN: Language-Based Instruction Tuning Preserves and Transfers Knowledge in Multimodal Language Models | MLAN: Sprachbasierte Anleitung Tuning bewahrt und überträgt Wissen in multimodalen Sprachmodellen | MLAN: 多种语文模式中基于语文的指导指示图示保留和转让知识 2411.10557v3 |
Authors (11): Jianhong Tu, Zhuohao Ni, Nicholas Crispino, Zihao Yu, Michael Bendersky, Beliz Gunel, Ruoxi Jia, Xin Liu, Lingjuan Lyu, Dawn Song, Chenguang Wang
We present a novel visual instruction tuning strategy to improve the zero-shot task generalization of multimodal large language models by building a firm text-only knowledge base. Existing work lacks sufficient experimentation on the importance of each modality in the instruction tuning stage, often using a majority of vision-language data while keeping text-only data limited and fixing mixtures of modalities. By incorporating diverse text-only data in the visual instruction tuning stage, we vary vision-language data in various controlled experiments to investigate the importance of modality in visual instruction tuning. Our comprehensive evaluation shows that the text-heavy instruction tuning approach is able to perform on par with traditional vision-heavy mixtures on both modalities across 12 general datasets while using as few as half the total training tokens. We find that simply increasing sufficiently diverse text-only data enables transfer of instruction-following ability and domain knowledge across modalities while being more efficient than the vision-language approach.
Article 417
Title@2025-06-28 (6): Truth Neurons
Title: Truth Neurons | Wahrheit Neuronen | 真理中世纪 2505.12182v2 |
Authors (5): Haohang Li, Yupeng Cao, Yangyang Yu, Jordan W. Suchow, Zining Zhu
Despite their remarkable success and deployment across diverse workflows, language models sometimes produce untruthful responses. Our limited understanding of how truthfulness is mechanistically encoded within these models jeopardizes their reliability and safety. In this paper, we propose a method for identifying representations of truthfulness at the neuron level. We show that language models contain truth neurons, which encode truthfulness in a subject-agnostic manner. Experiments conducted across models of varying scales validate the existence of truth neurons, confirming that the encoding of truthfulness at the neuron level is a property shared by many language models. The distribution patterns of truth neurons over layers align with prior findings on the geometry of truthfulness. Selectively suppressing the activations of truth neurons found through the TruthfulQA dataset degrades performance both on TruthfulQA and on other benchmarks, showing that the truthfulness mechanisms are not tied to a specific dataset. Our results offer novel insights into the mechanisms underlying truthfulness in language models and highlight potential directions toward improving their trustworthiness and reliability.
Article 418
Title@2025-06-28 (6): What can large language models do for sustainable food?
Title: What can large language models do for sustainable food? | Was können große Sprachmodelle für nachhaltige Lebensmittel tun? | 大型语言模式对于可持续食物能做些什么? 2503.04734v2 |
Authors (6): Anna T. Thomas, Adam Yee, Andrew Mayne, Maya B. Mathur, Dan Jurafsky, Kristina Gligorić
Food systems are responsible for a third of human-caused greenhouse gas emissions. We investigate what Large Language Models (LLMs) can contribute to reducing the environmental impacts of food production. We define a typology of design and prediction tasks based on the sustainable food literature and collaboration with domain experts, and evaluate six LLMs on four tasks in our typology. For example, for a sustainable protein design task, food science experts estimated that collaboration with an LLM can reduce time spent by 45% on average, compared to 22% for collaboration with another expert human food scientist. However, for a sustainable menu design task, LLMs produce suboptimal solutions when instructed to consider both human satisfaction and climate impacts. We propose a general framework for integrating LLMs with combinatorial optimization to improve reasoning capabilities. Our approach decreases emissions of food choices by 79% in a hypothetical restaurant while maintaining participants’ satisfaction with their set of choices. Our results demonstrate LLMs’ potential, supported by optimization techniques, to accelerate sustainable food development and adoption.
Article 419
Title@2025-06-28 (6): Enabling Precise Topic Alignment in Large Language Models Via Sparse Autoencoders
Title: Enabling Precise Topic Alignment in Large Language Models Via Sparse Autoencoders | Präzise Topic Alignment in großen Sprachmodellen über Sparse Autoencoder aktivieren | 启用大语言模型中的精确主题对齐 2506.12576v2 |
Authors (3): Ananya Joshi, Celia Cintas, Skyler Speakman
Recent work shows that Sparse Autoencoders (SAEs) applied to large language model (LLM) layers have neurons corresponding to interpretable concepts. These SAE neurons can be modified to align generated outputs, but only towards pre-identified topics and with some parameter tuning. Our approach leverages the observational and modification properties of SAEs to enable alignment for any topic. This method 1) scores each SAE neuron by its semantic similarity to an alignment text and 2) uses these scores to modify SAE-layer-level outputs by emphasizing topic-aligned neurons. We assess the alignment capabilities of this approach on diverse public topic datasets including Amazon reviews, Medicine, and Sycophancy, across the currently available open-source LLM and SAE pairs (GPT2 and Gemma) with multiple SAE configurations. Experiments aligning to medical prompts reveal several benefits over fine-tuning, including increased average language acceptability (0.25 vs. 0.5), reduced training time across multiple alignment topics (333.6s vs. 62s), and acceptable inference time for many applications (+0.00092s/token). Our open-source code is available at github.com/IBM/sae-steering.
Article 420
Title@2025-06-28 (6): Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models
Title: Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models | Agent-to-Agent Theorie des Geistes: Testen Gesprächspartner Bewusstsein unter großen Sprachmodellen | 精神感官理论:测试大语言模型间对话者的认识 2506.22957v1 |
Authors (4): Younwoo Choi, Changling Li, Yongjin Yang, Zhijing Jin
As large language models (LLMs) are increasingly integrated into multi-agent and human-AI systems, understanding their awareness of both self-context and conversational partners is essential for ensuring reliable performance and robust safety. While prior work has extensively studied situational awareness, which refers to an LLM’s ability to recognize its operating phase and constraints, it has largely overlooked the complementary capacity to identify and adapt to the identity and characteristics of a dialogue partner. In this paper, we formalize this latter capability as interlocutor awareness and present the first systematic evaluation of its emergence in contemporary LLMs. We examine interlocutor inference across three dimensions (reasoning patterns, linguistic style, and alignment preferences) and show that LLMs reliably identify same-family peers and certain prominent model families, such as GPT and Claude. To demonstrate its practical significance, we develop three case studies in which interlocutor awareness both enhances multi-LLM collaboration through prompt adaptation and introduces new alignment and safety vulnerabilities, including reward-hacking behaviors and increased jailbreak susceptibility. Our findings highlight the dual promise and peril of identity-sensitive behavior in LLMs, underscoring the need for further understanding of interlocutor awareness and new safeguards in multi-agent deployments. Our code is open-sourced at https://github.com/younwoochoi/InterlocutorAwarenessLLM.
Article 421
Title@2025-06-28 (6): HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation
Title: HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation | HalluSegBench: Counterfactual Visual Reasoning for Segmentation Halluzination Evaluation | HalluSegeBench:截肢幻觉评价的反事实视觉理由 2506.21546v2 |
Authors (6): Xinzhuo Li, Adheesh Juvekar, Xingyou Liu, Muntasir Wahed, Kiet A. Nguyen, Ismini Lourentzou
Recent progress in vision-language segmentation has significantly advanced grounded visual understanding. However, these models often exhibit hallucinations by producing segmentation masks for objects not grounded in the image content or by incorrectly labeling irrelevant regions. Existing evaluation protocols for segmentation hallucination primarily focus on label or textual hallucinations without manipulating the visual context, limiting their capacity to diagnose critical failures. In response, we introduce HalluSegBench, the first benchmark specifically designed to evaluate hallucinations in visual grounding through the lens of counterfactual visual reasoning. Our benchmark consists of a novel dataset of 1340 counterfactual instance pairs spanning 281 unique object classes, and a set of newly introduced metrics that quantify hallucination sensitivity under visually coherent scene edits. Experiments on HalluSegBench with state-of-the-art vision-language segmentation models reveal that vision-driven hallucinations are significantly more prevalent than label-driven ones, with models often persisting in false segmentation, highlighting the need for counterfactual reasoning to diagnose grounding fidelity.
Article 422
Title@2025-06-28 (6): SConU: Selective Conformal Uncertainty in Large Language Models
Title: SConU: Selective Conformal Uncertainty in Large Language Models | SConU: Selektive konforme Unsicherheit in großen Sprachmodellen | SCONU:大语言模式中选择性的形式不确定性 2504.14154v2 |
Authors (7): Zhiyuan Wang, Qingni Wang, Yue Zhang, Tianlong Chen, Xiaofeng Zhu, Xiaoshuang Shi, Kaidi Xu
As large language models are increasingly utilized in real-world applications, guarantees of task-specific metrics are essential for their reliable deployment. Previous studies have introduced various criteria of conformal uncertainty grounded in split conformal prediction, which offer user-specified correctness coverage. However, existing frameworks often fail to identify uncertainty data outliers that violate the exchangeability assumption, leading to unbounded miscoverage rates and unactionable prediction sets. In this paper, we propose a novel approach termed Selective Conformal Uncertainty (SConU), which, for the first time, implements significance tests, by developing two conformal p-values that are instrumental in determining whether a given sample deviates from the uncertainty distribution of the calibration set at a specific manageable risk level. Our approach not only facilitates rigorous management of miscoverage rates across both single-domain and interdisciplinary contexts, but also enhances the efficiency of predictions. Furthermore, we comprehensively analyze the components of the conformal procedures, aiming to approximate conditional coverage, particularly in high-stakes question-answering tasks.
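As background for the significance test mentioned in this abstract, the sketch below computes the textbook conformal p-value of a test sample against a calibration set and flags the sample when the p-value falls at or below a chosen risk level. This is the standard construction, not the two specific p-values developed in SConU; the function names, the example scores, and the default risk level of 0.1 are illustrative assumptions.

```python
from typing import Sequence

def conformal_p_value(calibration_scores: Sequence[float], test_score: float) -> float:
    """Standard conformal p-value.

    Scores are nonconformity (uncertainty) scores: larger means more unusual.
    p = (1 + #{calibration scores >= test score}) / (n + 1)
    """
    n = len(calibration_scores)
    at_least_as_extreme = sum(1 for s in calibration_scores if s >= test_score)
    return (1 + at_least_as_extreme) / (n + 1)

def is_outlier(calibration_scores: Sequence[float], test_score: float,
               risk_level: float = 0.1) -> bool:
    """Flag the sample as deviating from the calibration distribution."""
    return conformal_p_value(calibration_scores, test_score) <= risk_level

# Hypothetical uncertainty scores for nine calibration samples.
calibration = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

typical = conformal_p_value(calibration, 0.45)   # five calibration scores are >= 0.45
extreme = conformal_p_value(calibration, 0.95)   # no calibration score is >= 0.95
print(typical, extreme, is_outlier(calibration, 0.95))
```

Under exchangeability of calibration and test samples, a p-value of this form is at most the risk level with probability at most the risk level, which is what makes it usable as a filter for samples that break the assumption.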
Article 423
Title@2025-06-28 (6): MOTOR: Multimodal Optimal Transport via Grounded Retrieval in Medical Visual Question Answering
Title: MOTOR: Multimodal Optimal Transport via Grounded Retrieval in Medical Visual Question Answering | MOTOR: Multimodaler Optimaler Transport über geschliffenes Retrieval in der medizinischen visuellen Fragestellung | 在医疗视觉问题解答中通过定地检索进行多式最佳交通 2506.22900v1 |
Authors (4): Mai A. Shaaban, Tausifa Jan Saleem, Vijay Ram Papineni, Mohammad Yaqub
Medical visual question answering (MedVQA) plays a vital role in clinical decision-making by providing contextually rich answers to image-based queries. Although vision-language models (VLMs) are widely used for this task, they often generate factually incorrect answers. Retrieval-augmented generation addresses this challenge by providing information from external sources, but risks retrieving irrelevant context, which can degrade the reasoning capabilities of VLMs. Re-ranking retrievals, as introduced in existing approaches, enhances retrieval relevance by focusing on query-text alignment. However, these approaches neglect the visual or multimodal context, which is particularly crucial for medical diagnosis. We propose MOTOR, a novel multimodal retrieval and re-ranking approach that leverages grounded captions and optimal transport. It captures the underlying relationships between the query and the retrieved context based on textual and visual information. Consequently, our approach identifies more clinically relevant contexts to augment the VLM input. Empirical analysis and human expert evaluation demonstrate that MOTOR achieves higher accuracy on MedVQA datasets, outperforming state-of-the-art methods by an average of 6.45%. Code is available at https://github.com/BioMedIA-MBZUAI/MOTOR.
Article 424
Title@2025-06-28 (6): From Outcomes to Processes: Guiding PRM Learning from ORM for Inference-Time Alignment
Title: From Outcomes to Processes: Guiding PRM Learning from ORM for Inference-Time Alignment | Von Ergebnissen zu Prozessen: Leitende PRM-Lernen von ORM für die Schlussfolgerungs-Zeit-Ausrichtung | 从结果到过程:指导程序程序管理从ORM学习,以推断-时间协调 2506.12446v2 |
Authors (5): Bin Xie, Bingbing Xu, Yige Yuan, Shengmao Zhu, Huawei Shen
Inference-time alignment methods have gained significant attention for their efficiency and effectiveness in aligning large language models (LLMs) with human preferences. However, existing dominant approaches using reward-guided search (RGS) primarily rely on outcome reward models (ORMs), which suffer from a critical granularity mismatch: ORMs are designed to provide outcome rewards for complete responses, while RGS methods rely on process rewards to guide the policy, leading to inconsistent scoring and suboptimal alignment. To address this challenge, we introduce process reward models (PRMs) into RGS and argue that an ideal PRM should satisfy two objectives: Score Consistency, ensuring coherent evaluation across partial and complete responses, and Preference Consistency, aligning partial sequence assessments with human preferences. Based on these, we propose SP-PRM, a novel dual-consistency framework integrating score consistency-based and preference consistency-based partial evaluation modules without relying on human annotation. Extensive experiments on dialogue, summarization, and reasoning tasks demonstrate that SP-PRM substantially enhances existing RGS methods, achieving a 3.6%-10.3% improvement in GPT-4 evaluation scores across all tasks.
Article 425
Title@2025-06-28 (6): Which Programming Language and What Features at Pre-training Stage Affect Downstream Logical Inference Performance?
Title: Which Programming Language and What Features at Pre-training Stage Affect Downstream Logical Inference Performance? | Welche Programmiersprache und welche Features bei Pre-Training Stage beeinflussen Downstream Logical Inferenz Performance? | 培训前阶段哪些语言和特点影响下游逻辑推论性能? 2410.06735v2 |
Authors (6): Fumiya Uchiyama, Takeshi Kojima, Andrew Gambardella, Qi Cao, Yusuke Iwasawa, Yutaka Matsuo
Recent large language models (LLMs) have demonstrated remarkable generalization abilities in mathematics and logical reasoning tasks. Prior research indicates that LLMs pre-trained with programming language data exhibit high mathematical and reasoning abilities; however, this causal relationship has not been rigorously tested. Our research aims to verify which programming languages and features during pre-training affect logical inference performance. Specifically, we pre-trained decoder-based language models from scratch using datasets from ten programming languages (e.g., Python, C, Java) and three natural language datasets (Wikipedia, Fineweb, C4) under identical conditions. Thereafter, we evaluated the trained models in a few-shot in-context learning setting on two logical reasoning tasks, FLD and bAbi, which do not require commonsense or world knowledge. The results demonstrate that nearly all models trained with programming languages consistently outperform those trained with natural languages, indicating that programming languages contain factors that elicit logical inference performance. In addition, we found that models trained with programming languages exhibit a better ability to follow instructions compared to those trained with natural languages. Further analysis reveals that the depth of Abstract Syntax Trees representing parsed results of programs also affects logical reasoning performance. These findings will offer insights into the essential elements of pre-training for acquiring the foundational abilities of LLMs.
Article 426
Title@2025-06-28 (6): Arabic Dialect Classification using RNNs, Transformers, and Large Language Models: A Comparative Analysis
Title: Arabic Dialect Classification using RNNs, Transformers, and Large Language Models: A Comparative Analysis | Arabische Dialektklassifikation mit RNNs, Transformern und großen Sprachmodellen: Eine vergleichende Analyse | 使用RNN、变换器和大语言模式的阿拉伯语方言分类:比较分析 2506.19753v2 |
Authors (4): Omar A. Essameldin, Ali O. Elbeih, Wael H. Gomaa, Wael F. Elsersy
The Arabic language is among the most popular languages in the world with a huge variety of dialects spoken in 22 countries. In this study, we address the problem of classifying 18 Arabic dialects of the QADI dataset of Arabic tweets. We build and test RNN models, Transformer models, and large language models (LLMs) used via prompt engineering. Among these, MARBERTv2 performed best with 65% accuracy and 64% F1-score. Through the use of state-of-the-art preprocessing techniques and the latest NLP models, this paper identifies the most significant linguistic issues in Arabic dialect identification. The results support applications such as personalized chatbots that respond in users’ dialects, social media monitoring, and greater accessibility for Arabic communities.
Article 427
Title@2025-06-28 (6): PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models
Title: PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models | PRMBench: Ein feinkörniger und anspruchsvoller Benchmark für Prozess-Level-Reward-Modelle | PRMBBench:进程一级奖励模式的精细和质疑基准 2501.03124v5 |
Authors (5): Mingyang Song, Zhaochen Su, Xiaoye Qu, Jiawei Zhou, Yu Cheng
Process-level Reward Models (PRMs) are crucial for complex reasoning and decision-making tasks, where each intermediate step plays an important role in the reasoning process. Since language models are prone to various types of errors during the reasoning process, PRMs are required to possess nuanced capabilities for detecting various implicit error types in real-world scenarios. However, current benchmarks primarily focus on step correctness, failing to evaluate PRMs’ performance systematically. To address this gap, we introduce PRMBench, a process-level benchmark specifically designed to assess the fine-grained error detection capabilities of PRMs. PRMBench comprises 6,216 carefully designed problems and 83,456 step-level labels, evaluating models across multiple dimensions, including simplicity, soundness, and sensitivity. In our experiments on 15 models, spanning both open-source PRMs and closed-source large language models prompted as critic models, we uncover significant weaknesses in current PRMs. These findings underscore the challenges inherent in process-level evaluation and highlight key directions for future research. We hope PRMBench can be a robust bench for advancing research on PRM evaluation and development.
Article 428
Title@2025-06-28 (6): Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval
Title: Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval | Mask-aware Text-to-Image Retrieval: Referenzierung der Expression-Segmentierung trifft modales Retrieval | Mask-aware 文本到图像检索val: 参考表达式分解会遇到交叉模式检索val 2506.22864v1 |
Authors (4): Li-Cheng Shen, Jih-Kang Hsieh, Wei-Hua Li, Chu-Song Chen
Text-to-image retrieval (TIR) aims to find relevant images based on a textual query, but existing approaches are primarily based on whole-image captions and lack interpretability. Meanwhile, referring expression segmentation (RES) enables precise object localization based on natural language descriptions but is computationally expensive when applied across large image collections. To bridge this gap, we introduce Mask-aware TIR (MaTIR), a new task that unifies TIR and RES, requiring both efficient image search and accurate object segmentation. To address this task, we propose a two-stage framework, comprising a first stage for segmentation-aware image retrieval and a second stage for reranking and object grounding with a multimodal large language model (MLLM). In the first stage, we leverage SAM 2 to generate object masks and Alpha-CLIP to extract region-level embeddings offline, enabling effective and scalable online retrieval. In the second stage, the MLLM is used to refine retrieval rankings and generate bounding boxes, which are matched to segmentation masks. We evaluate our approach on COCO and D$^3$ datasets, demonstrating significant improvements in both retrieval accuracy and segmentation quality over previous methods.
Article 429
Title@2025-06-28 (6): MANTA: Cross-Modal Semantic Alignment and Information-Theoretic Optimization for Long-form Multimodal Understanding
Title: MANTA: Cross-Modal Semantic Alignment and Information-Theoretic Optimization for Long-form Multimodal Understanding | MANTA: Cross-Modal Semantic Alignment und informationstheoretische Optimierung für langformiges multimodales Verständnis | MANTA:跨模式的语义一致和信息理论优化,促进长期多式联运理解 2507.00068v1 |
Authors (2): Ziqi Zhong, Daniel Tang
While multi-modal learning has advanced significantly, current approaches often treat modalities separately, creating inconsistencies in representation and reasoning. We introduce MANTA (Multi-modal Abstraction and Normalization via Textual Alignment), a theoretically-grounded framework that unifies visual and auditory inputs into a structured textual space for seamless processing with large language models. MANTA addresses four key challenges: (1) semantic alignment across modalities with information-theoretic optimization, (2) adaptive temporal synchronization for varying information densities, (3) hierarchical content representation for multi-scale understanding, and (4) context-aware retrieval of sparse information from long sequences. We formalize our approach within a rigorous mathematical framework, proving its optimality for context selection under token constraints. Extensive experiments on the challenging task of Long Video Question Answering show that MANTA improves state-of-the-art models by up to 22.6% in overall accuracy, with particularly significant gains (27.3%) on videos exceeding 30 minutes. Additionally, we demonstrate MANTA’s superiority on temporal reasoning tasks (23.8% improvement) and cross-modal understanding (25.1% improvement). Our framework introduces novel density estimation techniques for redundancy minimization while preserving rare signals, establishing new foundations for unifying multimodal representations through structured text.
Article 430
Title@2025-06-28 (6): Mind the Gap: Entity-Preserved Context-Aware ASR Structured Transcriptions
Title: Mind the Gap: Entity-Preserved Context-Aware ASR Structured Transcriptions | Mind the Gap: Entity-Context-Aware ASR Strukturierte Transkriptionen | 牢记差距:实体提供的背景软件ASR结构化分类 2506.22858v1 |
Authors (1): Duygu Altinok
Automatic Speech Recognition (ASR) systems, such as Whisper, achieve high transcription accuracy but struggle with named entities and numerical data, especially when proper formatting is required. These issues increase word error rate (WER) and impair semantic understanding in critical domains like legal, financial, and medical applications. We propose a novel training approach that extends the semantic context of ASR models by adding overlapping context windows during training. By sliding 5-second overlaps on both sides of 30-second chunks, we create a 40-second “effective semantic window,” improving entity recognition and formatting while focusing predictions on the central 30 seconds. To address entities spanning chunk boundaries, we reassign such entities entirely to the right-hand chunk, ensuring proper formatting. Additionally, enriched training data with embedded entity labels enables the model to learn both recognition and type-specific formatting. Evaluated on the Spoken Wikipedia dataset, our method improves performance across semantic tasks, including named entity recognition (NER) and entity formatting. These results highlight the effectiveness of context-aware training in addressing ASR limitations for long-form transcription and complex entity recognition tasks.
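The windowing arithmetic in this abstract can be made concrete with a short sketch. It assumes contiguous 30-second core chunks padded by 5 seconds on each side and clipped at the ends of the recording; the function names and the clipping behaviour are assumptions for illustration, not details taken from the paper.

```python
import math
from typing import List, Tuple

def context_windows(total_s: float, chunk_s: float = 30.0,
                    overlap_s: float = 5.0) -> List[Tuple[float, float, float, float]]:
    """Return (window_start, window_end, core_start, core_end) per chunk.

    The core is the span the model is asked to transcribe; the window adds
    `overlap_s` seconds of context on each side, clipped to the recording.
    """
    windows = []
    core_start = 0.0
    while core_start < total_s:
        core_end = min(core_start + chunk_s, total_s)
        window_start = max(core_start - overlap_s, 0.0)
        window_end = min(core_end + overlap_s, total_s)
        windows.append((window_start, window_end, core_start, core_end))
        core_start = core_end
    return windows

def assign_entity_to_chunk(entity_start_s: float, entity_end_s: float,
                           chunk_s: float = 30.0) -> int:
    """Index of the chunk that owns an entity.

    An entity that crosses a chunk boundary is assigned entirely to the
    right-hand chunk, i.e. the chunk in which it ends.
    """
    first = int(entity_start_s // chunk_s)
    last = math.ceil(entity_end_s / chunk_s) - 1
    return max(first, last)

windows = context_windows(90.0)
for w in windows:
    print(w)
```

For an interior chunk the window spans 5 + 30 + 5 = 40 seconds, matching the "effective semantic window" in the abstract, while the first and last chunks are shorter because there is no audio to pad with.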
Article 431
Title@2025-06-28 (6): Knowledge Augmented Finetuning Matters in both RAG and Agent Based Dialog Systems
Title: Knowledge Augmented Finetuning Matters in both RAG and Agent Based Dialog Systems | Wissen Augmented Finetuning Matters in RAG und Agent Based Dialog Systems | 在区域咨询组和代理人基础对话系统中增加知识的微调问题 2506.22852v1 |
Authors (5): Yucheng Cai, Yuxuan Wu, Yi Huang, Junlan Feng, Zhijian Ou
Large language models (LLMs) have recently been applied to dialog systems. Despite making progress, LLMs are prone to errors in knowledge-intensive scenarios. Recently, approaches based on retrieval augmented generation (RAG) and agents have emerged to improve factual accuracy by enhancing the LLMs with knowledge retrieved from external knowledge bases (KBs). This is mostly implemented by prompting the LLMs with instructions, examples and the retrieved knowledge. However, LLMs may have difficulty using the retrieved knowledge effectively for response generation, because they are not well trained to do such generation for specific domains. To mitigate this problem, we propose to finetune the LLMs in the RAG-based and agent-based systems with domain-specific data, together with domain-specific external knowledge, which is called knowledge augmented finetuning (KAFT). We base our study on the MobileCS2 dataset, a real-life customer service dialog dataset that features intensive knowledge interactions, to systematically compare the prompting and KAFT techniques in the RAG-based and agent-based systems. Experimental results show that KAFT substantially surpasses prompting in both RAG and agent systems, particularly in terms of factual accuracy. To the best of our knowledge, this paper represents the first solid empirical work to investigate the KAFT idea.
Article 432
Title@2025-06-28 (6): Boosting CTC-Based ASR Using LLM-Based Intermediate Loss Regularization
Title: Boosting CTC-Based ASR Using LLM-Based Intermediate Loss Regularization | Steigerung der CTC-basierten ASR-Nutzung durch LLM-basierte Intermediate Loss Regularisierung | 利用基于LLM的中间损失规范化,促进基于反恐委员会的ASR 2506.22846v1 |
Authors (1): Duygu Altinok
End-to-end (E2E) automatic speech recognition (ASR) systems have revolutionized the field by integrating all components into a single neural network, with attention-based encoder-decoder models achieving state-of-the-art performance. However, their autoregressive decoding process limits inference speed, making them unsuitable for real-time applications. In contrast, CTC-based models offer faster, non-autoregressive decoding but struggle to model linguistic dependencies effectively. Addressing this challenge, we propose a novel auxiliary loss framework called Language-Aware Intermediate Loss (LAIL) to enhance CTC-based ASR using the linguistic knowledge of large language models (LLMs). By attaching connector layers to intermediate encoder layers, LAIL maps outputs to the embedding space of an LLM and computes a causal language modeling loss during training. This approach enhances linguistic modeling while preserving the computational efficiency of CTC decoding. Using the Conformer architecture and various LLaMA models, we demonstrate significant improvements in Word Error Rate (WER) on the LibriSpeech, TEDLIUM2, and WSJ corpora, achieving state-of-the-art performance for CTC-based ASR with minimal computational overhead.
Article 433
Title@2025-06-28 (6): Better Aligned with Survey Respondents or Training Data? Unveiling Political Leanings of LLMs on U.S. Supreme Court Cases
Title: Better Aligned with Survey Respondents or Training Data? Unveiling Political Leanings of LLMs on U.S. Supreme Court Cases | Besser ausgerichtet mit Umfragegegnern oder Trainingsdaten? Enthüllung politischer Leanings von LLMs in US Supreme Court Cases | 与美国最高法院案件调查答卷人或培训数据更加一致? 2502.18282v3 |
Authors (6): Shanshan Xu, T. Y. S. S Santosh, Yanai Elazar, Quirin Vogel, Barbara Plank, Matthias Grabmair
Recent works have shown that Large Language Models (LLMs) have a tendency to memorize patterns and biases present in their training data, raising important questions about how such memorized content influences model behavior. One such concern is the emergence of political bias in LLM outputs. In this paper, we investigate the extent to which LLMs’ political leanings reflect memorized patterns from their pretraining corpora. We propose a method to quantitatively evaluate political leanings embedded in the large pretraining corpora. Subsequently, we investigate whether the LLMs’ political leanings are more closely aligned with their pretraining corpora or with surveyed human opinions. As a case study, we focus on probing the political leanings of LLMs in 32 US Supreme Court cases, addressing contentious topics such as abortion and voting rights. Our findings reveal that LLMs strongly reflect the political leanings in their training data, and no strong correlation is observed with their alignment to human opinions as expressed in surveys. These results underscore the importance of responsible curation of training data and of methodologies for auditing memorization in LLMs to ensure human-AI alignment.
Article 434
Title@2025-06-28 (6): Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback
Title: Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback | Margin Matching Preference Optimization: Verbesserte Modellausrichtung mit Granular Feedback | 边际匹配优先优化:用颗粒反馈增强模型协调 2410.03145v2 |
Authors (5): Kyuyoung Kim, Ah Jeong Seo, Hao Liu, Jinwoo Shin, Kimin Lee
Large language models (LLMs) fine-tuned with alignment techniques, such as reinforcement learning from human feedback, have been instrumental in developing some of the most capable AI systems to date. Despite their success, existing methods typically rely on simple binary labels, such as those indicating preferred outputs in pairwise preferences, which fail to capture the subtle differences in relative quality between pairs. To address this limitation, we introduce an approach called Margin Matching Preference Optimization (MMPO), which incorporates relative quality margins into optimization, leading to improved LLM policies and reward models. Specifically, given quality margins in pairwise preferences, we design soft target probabilities based on the Bradley-Terry model, which are then used to train models with the standard cross-entropy objective. Experiments with both human and AI feedback data demonstrate that MMPO consistently outperforms baseline methods, often by a substantial margin, on popular benchmarks including MT-bench and RewardBench. Notably, the 7B model trained with MMPO achieves state-of-the-art performance on RewardBench as of June 2024, outperforming other models of the same scale. Our analysis also shows that MMPO is more robust to overfitting, leading to better-calibrated models.
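The soft-target idea in this abstract can be illustrated with a small sketch. It assumes the Bradley-Terry form in which the target probability that response A beats response B is the sigmoid of their quality margin; the `scale` parameter and the function names are hypothetical, and the paper's exact parameterization may differ.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def soft_target(margin, scale=1.0):
    """Bradley-Terry style soft label: P(A preferred over B) given a quality margin.

    `scale` is an assumed temperature-like knob, not taken from the abstract.
    """
    return sigmoid(scale * margin)


def soft_cross_entropy(score_a, score_b, margin, scale=1.0):
    """Cross-entropy between the soft target and the model's predicted preference.

    `score_a` and `score_b` stand for the model's (implicit or explicit)
    reward for each response.
    """
    p = soft_target(margin, scale)   # target preference probability
    q = sigmoid(score_a - score_b)   # model's preference probability
    eps = 1e-12                      # guards log(0)
    return -(p * math.log(q + eps) + (1.0 - p) * math.log(1.0 - q + eps))
```

With a binary label the loss keeps pushing the score gap wider; with a soft target it is minimized when the predicted gap matches the scaled margin, which is one plausible reading of why the abstract reports less overfitting.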
Article 435
Title@2025-06-28 (6): Selecting and Merging: Towards Adaptable and Scalable Named Entity Recognition with Large Language Models
Title: Selecting and Merging: Towards Adaptable and Scalable Named Entity Recognition with Large Language Models | Auswahl und Zusammenführung: Auf dem Weg zu einer anpassungsfähigen und skalierbaren Namenskanzlei-Erkennung mit großen Sprachmodellen | 选择和合并:努力以大语言模式识别可适应和可缩放命名实体 2506.22813v1 |
Authors (3): Zhuojun Ding, Wei Wei, Chenghao Fan
Supervised fine-tuning (SFT) is widely used to align large language models (LLMs) with information extraction (IE) tasks, such as named entity recognition (NER). However, annotating such fine-grained labels and training domain-specific models is costly. Existing works typically train a unified model across multiple domains, but such approaches lack adaptation and scalability since not all training data benefits target domains and scaling trained models remains challenging. We propose the SaM framework, which dynamically Selects and Merges expert models at inference time. Specifically, for a target domain, we select domain-specific experts pre-trained on existing domains based on (i) domain similarity to the target domain and (ii) performance on sampled instances, respectively. The experts are then merged to create task-specific models optimized for the target domain. By dynamically merging experts beneficial to target domains, we improve generalization across various domains without extra training. Additionally, experts can be added or removed conveniently, leading to great scalability. Extensive experiments on multiple benchmarks demonstrate our framework’s effectiveness, which outperforms the unified model by an average of 10%. We further provide insights into potential improvements, practical experience, and extensions of our framework.
Article 436
Title@2025-06-28 (6): BayesLoRA: Task-Specific Uncertainty in Low-Rank Adapters
Title: BayesLoRA: Task-Specific Uncertainty in Low-Rank Adapters | BayesLoRA: Aufgabenspezifische Unsicherheit in Low-Rank-Adaptern | BayesLOLRA:低兰克适应器中任务具体不确定性 2506.22809v1 |
Authors (1): Cooper Doyle
We propose BayesLoRA, a task-specific uncertainty quantification framework that integrates MC-Dropout into Low-Rank Adapters (LoRA). Unlike general-purpose transformer uncertainty methods, BayesLoRA provides guardrails tailored to downstream workflows, enabling agents to introspect and modulate behavior under uncertainty. We demonstrate mathematically and empirically that LoRA adapters exhibit amplified variance outside fine-tuning distributions, yielding reliable confidence estimates for agentic decision-making.
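As a rough illustration of MC-Dropout applied to a low-rank adapter, the sketch below keeps dropout active at inference, runs several stochastic forward passes, and reports the per-output mean and variance. Where the dropout sits (here, on the rank-r intermediate activation) and all names are assumptions of this sketch; the abstract does not give those details.

```python
import random
import statistics


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, p_drop, rng):
    """One stochastic pass of y = W x + B dropout(A x)."""
    base = matvec(W, x)
    low = matvec(A, x)  # project down to rank r
    keep = 1.0 - p_drop
    # Inverted dropout on the low-rank activation (assumed placement).
    low = [h / keep if rng.random() < keep else 0.0 for h in low]
    delta = matvec(B, low)  # project back up
    return [b + d for b, d in zip(base, delta)]


def mc_dropout_stats(x, W, A, B, p_drop=0.1, passes=50, seed=0):
    """Mean and variance of the output over `passes` stochastic forward passes."""
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must be in [0, 1)")
    rng = random.Random(seed)
    samples = [lora_forward(x, W, A, B, p_drop, rng) for _ in range(passes)]
    columns = list(zip(*samples))
    mean = [statistics.fmean(c) for c in columns]
    var = [statistics.pvariance(c) for c in columns]
    return mean, var
```

An agent could compare the returned variance against a threshold and defer or ask for clarification when it is high; the threshold itself would need to be calibrated per task.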
Article 437
Title@2025-06-28 (6): MedEthicsQA: A Comprehensive Question Answering Benchmark for Medical Ethics Evaluation of LLMs
Title: MedEthicsQA: A Comprehensive Question Answering Benchmark for Medical Ethics Evaluation of LLMs | MedEthicsQA: Eine umfassende Frage-Antwort-Benchmark für medizinische Ethik-Bewertung von LLMs | MedEthicsQA:LLMs医学道德评价的全面回答问题基准 2506.22808v1 |
Authors (8): Jianhui Wei, Zijie Meng, Zikai Xiao, Tianxiang Hu, Yang Feng, Zhijie Zhou, Jian Wu, Zuozhu Liu
While Medical Large Language Models (MedLLMs) have demonstrated remarkable potential in clinical tasks, their ethical safety remains insufficiently explored. This paper introduces MedEthicsQA, a comprehensive benchmark comprising 5,623 multiple-choice questions and 5,351 open-ended questions for evaluation of medical ethics in LLMs. We systematically establish a hierarchical taxonomy integrating global medical ethical standards. The benchmark encompasses widely used medical datasets, authoritative question banks, and scenarios derived from PubMed literature. Rigorous quality control involving multi-stage filtering and multi-faceted expert validation ensures the reliability of the dataset with a low error rate (2.72%). Evaluation shows that state-of-the-art MedLLMs exhibit declined performance in answering medical ethics questions compared to their foundation counterparts, elucidating the deficiencies of medical ethics alignment. The dataset, licensed under CC BY-NC 4.0, is available at https://github.com/JianhuiWei7/MedEthicsQA.
Article 438
Title@2025-06-28 (6): Enhancing the Capability and Robustness of Large Language Models through Reinforcement Learning-Driven Query Refinement
Title: Enhancing the Capability and Robustness of Large Language Models through Reinforcement Learning-Driven Query Refinement | Verbesserung der Fähigkeit und Robustheit von großen Sprachmodellen durch verstärkte Learning-Driven Query Refinement | 通过强化学习-驱动查询改进,加强大语言模式的能力和健全性 2407.01461v3 |
Authors (8): Xiaohua Wang, Zisu Huang, Feiran Zhang, Zhibo Xu, Cenyuan Zhang, Qi Qian, Xiaoqing Zheng, Xuanjing Huang
The capacity of large language models (LLMs) to generate honest, harmless, and helpful responses heavily relies on the quality of user prompts. However, these prompts often tend to be brief and vague, thereby significantly limiting the full potential of LLMs. Moreover, harmful prompts can be meticulously crafted and manipulated by adversaries to jailbreak LLMs, inducing them to produce potentially toxic content. To enhance the capabilities of LLMs while maintaining strong robustness against harmful jailbreak inputs, this study proposes a transferable and pluggable framework that refines user prompts before they are input into LLMs. This strategy improves the quality of the queries, empowering LLMs to generate more truthful, benign and useful responses. Specifically, a lightweight query refinement model is introduced and trained using a specially designed reinforcement learning approach that incorporates multiple objectives to enhance particular capabilities of LLMs. Extensive experiments demonstrate that the refinement model not only improves the quality of responses but also strengthens their robustness against jailbreak attacks. Code is available at: https://github.com/Huangzisu/query-refinement .
Article 439
Title@2025-06-28 (6): Finding the Sweet Spot: Preference Data Construction for Scaling Preference Optimization
Title: Finding the Sweet Spot: Preference Data Construction for Scaling Preference Optimization | Den Sweet Spot finden: Präferenzdatenkonstruktion für Scaling Preference Optimierung | 寻找甜点:扩大优惠优化的优先数据构建 2502.16825v3 |
Authors (7): Yao Xiao, Hai Ye, Linyao Chen, Hwee Tou Ng, Lidong Bing, Xiaoli Li, Roy Ka-wei Lee
Iterative data generation and model retraining are widely used to align large language models (LLMs). It typically involves a policy model to generate on-policy responses and a reward model to guide training data selection. Direct Preference Optimization (DPO) further enhances this process by constructing preference pairs of chosen and rejected responses. In this work, we aim to scale up the number of on-policy samples via repeated random sampling to improve alignment performance. Conventional practice selects the sample with the highest reward as chosen and the lowest as rejected for DPO. However, our experiments reveal that this strategy leads to a decline in performance as the sample size increases. To address this, we investigate preference data construction through the lens of the underlying normal distribution of sample rewards. We categorize the reward space into seven representative points and systematically explore all 21 ($C_7^2$) pairwise combinations. Through evaluations on four models using AlpacaEval 2, we find that selecting the rejected response at reward position $\mu - 2\sigma$, rather than at the minimum reward, is crucial for optimal performance. We finally introduce a scalable preference data construction strategy that consistently enhances model performance as the sample scale increases.
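The pair-selection rule reported here can be sketched as follows. This is an illustration under stated assumptions: the chosen response is taken to be the highest-reward sample (the conventional choice the abstract mentions), the rejected response is the sample whose reward lies closest to μ − 2σ, and the names `build_pair` and `NUM_PAIRS` are hypothetical.

```python
import math
import statistics

# Seven representative reward points give C(7, 2) candidate (chosen, rejected) pairings.
NUM_PAIRS = math.comb(7, 2)


def build_pair(responses, rewards):
    """Return (chosen, rejected) for one prompt from its sampled responses.

    chosen:   highest-reward sample (assumed, following conventional practice)
    rejected: sample whose reward is nearest to mu - 2 * sigma
    """
    if len(responses) != len(rewards) or len(rewards) < 2:
        raise ValueError("need at least two responses with matching rewards")
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    target = mu - 2.0 * sigma
    indices = range(len(rewards))
    chosen_idx = max(indices, key=lambda i: rewards[i])
    # Exclude the chosen sample so the pair is never degenerate.
    candidates = [i for i in indices if i != chosen_idx]
    rejected_idx = min(candidates, key=lambda i: abs(rewards[i] - target))
    return responses[chosen_idx], responses[rejected_idx]
```

With many samples per prompt, the minimum reward tends to be an extreme outlier, so anchoring the rejected response at μ − 2σ picks a poor but less extreme sample instead.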
Article 440
Title@2025-06-28 (6): Kalahi: A handcrafted, grassroots cultural LLM evaluation suite for Filipino
Title: Kalahi: A handcrafted, grassroots cultural LLM evaluation suite for Filipino | Kalahi: Eine handgemachte, basis-kulturelle LLM-Evaluierungssuite für Filipino | Kalahi:为菲律宾人设计的手工、基层文化LLM评价套套 2409.15380v4 |
Authors (7): Jann Railey Montalan, Jian Gang Ngui, Wei Qi Leong, Yosephine Susanto, Hamsawardhini Rengarajan, Alham Fikri Aji, William Chandra Tjhi
Multilingual large language models (LLMs) today may not necessarily provide culturally appropriate and relevant responses to their Filipino users. We introduce Kalahi, a cultural LLM evaluation suite collaboratively created by native Filipino speakers. It is composed of 150 high-quality, handcrafted and nuanced prompts that test LLMs for generations that are relevant to shared Filipino cultural knowledge and values. Strong LLM performance in Kalahi indicates a model’s ability to generate responses similar to what an average Filipino would say or do in a given situation. We conducted experiments on LLMs with multilingual and Filipino language support. Results show that Kalahi, while trivial for Filipinos, is challenging for LLMs, with the best model answering only 46.0% of the questions correctly compared to native Filipino performance of 89.10%. Thus, Kalahi can be used to accurately and reliably evaluate Filipino cultural representation in LLMs.
Article 441
Title@2025-06-28 (6): ContextCache: Context-Aware Semantic Cache for Multi-Turn Queries in Large Language Models
Title: ContextCache: Context-Aware Semantic Cache for Multi-Turn Queries in Large Language Models | ContextCache: Kontext-Bewusst Semantischer Cache für Multi-Turn-Abfragen in großen Sprachmodellen | 上下文缓存: 用于大语言模式多发查询的背景软件语义缓存 2506.22791v1 |
Authors (7): Jianxin Yan, Wangze Ni, Lei Chen, Xuemin Lin, Peng Cheng, Zhan Qin, Kui Ren
Semantic caching significantly reduces computational costs and improves efficiency by storing and reusing large language model (LLM) responses. However, existing systems rely primarily on matching individual queries, lacking awareness of multi-turn dialogue contexts, which leads to incorrect cache hits when similar queries appear in different conversational settings. This demonstration introduces ContextCache, a context-aware semantic caching system for multi-turn dialogues. ContextCache employs a two-stage retrieval architecture that first executes vector-based retrieval on the current query to identify potential matches and then integrates current and historical dialogue representations through self-attention mechanisms for precise contextual matching. Evaluation on real-world conversations shows that ContextCache improves precision and recall compared to existing methods. Additionally, cached responses exhibit approximately 10 times lower latency than direct LLM invocation, enabling significant computational cost reductions for LLM conversational applications.
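A two-stage lookup of this kind might look roughly like the sketch below. It deliberately swaps in mean pooling where the abstract describes self-attention over current and historical turns, so it shows only the control flow, not the paper's model; the entry layout, thresholds, and function names are all assumptions.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_pool(vectors):
    """Stand-in for the self-attention fusion described in the abstract."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def lookup(cache, query_vec, history_vecs, k=3, query_thresh=0.8, context_thresh=0.8):
    """Return a cached response, or None on a miss.

    cache: list of dicts with "query" (embedding), "context" (pooled dialogue
    embedding) and "response" keys (hypothetical layout).
    """
    # Stage 1: vector retrieval on the current query only.
    scored = sorted(cache, key=lambda e: cosine(e["query"], query_vec), reverse=True)
    candidates = [e for e in scored[:k] if cosine(e["query"], query_vec) >= query_thresh]
    # Stage 2: re-check candidates against the fused dialogue context.
    context = mean_pool(history_vecs + [query_vec])
    best, best_score = None, context_thresh
    for entry in candidates:
        score = cosine(entry["context"], context)
        if score >= best_score:
            best, best_score = entry, score
    return best["response"] if best is not None else None
```

Two cache entries with identical query embeddings but different dialogue histories are separated in stage 2, which is the incorrect-hit case the abstract targets.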
Article 442
Title@2025-06-28 (6): PhonemeFake: Redefining Deepfake Realism with Language-Driven Segmental Manipulation and Adaptive Bilevel Detection
Title: PhonemeFake: Redefining Deepfake Realism with Language-Driven Segmental Manipulation and Adaptive Bilevel Detection | PhonemeFake: Deepfake Realism mit sprachgetriebener Segmentmanipulation und adaptiver Bilevel-Erkennung neu definieren | PhonemeFake: 重新定义“深假”现实主义, 使用语言驱动的分部分操纵和适应性双级检测 2506.22783v1 |
Authors (4): Oguzhan Baser, Ahmet Ege Tanriverdi, Sriram Vishwanath, Sandeep P. Chinchali
Deepfake (DF) attacks pose a growing threat as generative models become increasingly advanced. However, our study reveals that existing DF datasets fail to deceive human perception, unlike real DF attacks that influence public discourse. It highlights the need for more realistic DF attack vectors. We introduce PhonemeFake (PF), a DF attack that manipulates critical speech segments using language reasoning, significantly reducing human perception by up to 42% and benchmark accuracies by up to 94%. We release an easy-to-use PF dataset on HuggingFace and an open-source bilevel DF segment detection model that adaptively prioritizes compute on manipulated regions. Our extensive experiments across three known DF datasets reveal that our detection model reduces EER by 91% while achieving up to 90% speed-up, with minimal compute overhead and precise localization beyond existing models as a scalable solution.
Article 443
Title@2025-06-28 (6): Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning
Title: Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning | Lehrmodelle zu verbalisieren Belohnung Hacking in Chain-of-Thought-Reasoning | 教学模型,以思考、推理 2506.22777v1 |
Authors (5): Miles Turpin, Andy Arditi, Marvin Li, Joe Benton, Julian Michael
Language models trained with RL can engage in reward hacking (exploiting unintended strategies for high reward) without revealing this behavior in their chain-of-thought reasoning, making detection difficult and posing risks for high-stakes applications. We propose verbalization fine-tuning (VFT), a pre-RL intervention that trains models to explicitly acknowledge when they are influenced by prompt cues: hints which point to incorrect answers (e.g., “a Stanford professor thinks the answer is A”). To evaluate VFT, we subsequently train models with RL on environments where held-out prompt cues signal which incorrect answers will receive high reward, incentivizing models to reward hack by exploiting cues instead of reasoning correctly. We measure how often models exploit these cues without verbalizing them. After RL, only 6% of the VFT-trained model’s responses consist of undetected reward hacks. In comparison, when we perform RL without VFT, the rate of undetected reward hacks goes up to 88%; with a debiasing baseline intervention, this increases further to 99%. VFT achieves this by substantially increasing how often models verbalize the influence of cues, from 8% to 42% after VFT and up to 94% after RL, while baselines remain low even after RL (10% and 1%). Our results show that teaching models to explicitly verbalize reward hacking behavior before RL significantly improves their detection, offering a practical path toward more transparent and safe AI systems.
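The headline metric can be made concrete with a short sketch. It assumes each response has already been labeled for whether it exploited the cue and whether its chain of thought verbalized the cue; the `Response` record and function name are hypothetical, and how those labels are obtained is outside this example.

```python
from dataclasses import dataclass


@dataclass
class Response:
    exploits_cue: bool    # the answer follows the (incorrect) cue
    verbalizes_cue: bool  # the chain of thought acknowledges the cue's influence


def undetected_hack_rate(responses):
    """Fraction of all responses that exploit a cue without verbalizing it."""
    if not responses:
        return 0.0
    undetected = sum(1 for r in responses if r.exploits_cue and not r.verbalizes_cue)
    return undetected / len(responses)
```

Under this definition a model can still exploit cues often and score low, as long as it says so in its reasoning, which matches the abstract's framing of detection rather than prevention.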
Article 444
Title@2025-06-28 (6): PromptDSI: Prompt-based Rehearsal-free Continual Learning for Document Retrieval
Title: PromptDSI: Prompt-based Rehearsal-free Continual Learning for Document Retrieval | PromptDSI: Prompt-basiert Probefreies Kontinuales Lernen für Dokument-Retrieval | 快速检索:为检索文件而进行基于即时的无排练的持续学习 2406.12593v4 |
Authors (8): Tuan-Luc Huynh, Thuy-Trang Vu, Weiqing Wang, Yinwei Wei, Trung Le, Dragan Gasevic, Yuan-Fang Li, Thanh-Toan Do
Differentiable Search Index (DSI) utilizes pre-trained language models to perform indexing and document retrieval via end-to-end learning without relying on external indexes. However, DSI requires full re-training to index new documents, causing significant computational inefficiencies. Continual learning (CL) offers a solution by enabling the model to incrementally update without full re-training. Existing CL solutions in document retrieval rely on memory buffers or generative models for rehearsal, which is infeasible when accessing previous training data is restricted due to privacy concerns. To this end, we introduce PromptDSI, a prompt-based, rehearsal-free continual learning approach for document retrieval. PromptDSI follows the Prompt-based Continual Learning (PCL) framework, using learnable prompts to efficiently index new documents without accessing previous documents or queries. To improve retrieval latency, we remove the initial forward pass of PCL, which otherwise greatly increases training and inference time, with a negligible trade-off in performance. Additionally, we introduce a novel topic-aware prompt pool that employs neural topic embeddings as fixed keys, eliminating the instability of prompt key optimization while maintaining competitive performance with existing PCL prompt pools. In a challenging rehearsal-free continual learning setup, we demonstrate that PromptDSI variants outperform rehearsal-based baselines, match the strong cache-based baseline in mitigating forgetting, and significantly improve retrieval performance on new corpora.
Article 445
Title@2025-06-28 (6): Decide less, communicate more: On the construct validity of end-to-end fact-checking in medicine
Title: Decide less, communicate more: On the construct validity of end-to-end fact-checking in medicine | Entscheiden Sie weniger, kommunizieren Sie mehr: Auf dem Konstrukt Gültigkeit der End-to-End-Fact-Checking in der Medizin | 决定少决定少决定少决定,交流多交流: 2506.20876v2 |
Authors (9): Sebastian Joseph, Lily Chen, Barry Wei, Michael Mackert, Iain J. Marshall, Paul Pu Liang, Ramez Kouzy, Byron C. Wallace, Junyi Jessy Li
Technological progress has led to concrete advancements in tasks that were regarded as challenging, such as automatic fact-checking. Interest in adopting these systems for public health and medicine has grown due to the high-stakes nature of medical decisions and challenges in critically appraising a vast and diverse medical literature. Evidence-based medicine connects to every individual, and yet the nature of it is highly technical, rendering the medical literacy of the majority of users inadequate to navigate the domain. Such problems with medical communication ripen the ground for end-to-end fact-checking agents that check a claim against current medical literature and return an evidence-backed verdict. And yet, such systems remain largely unused. To understand this, we present the first study examining how clinical experts verify real claims from social media by synthesizing medical evidence. In searching for this upper bound, we reveal fundamental challenges in end-to-end fact-checking when applied to medicine: difficulties connecting claims in the wild to scientific evidence in the form of clinical trials; ambiguities in underspecified claims mixed with mismatched intentions; and inherently subjective veracity labels. We argue that fact-checking should be approached and evaluated as an interactive communication problem, rather than an end-to-end process.
Article 446
Title@2025-06-28 (6): Detecting Sockpuppetry on Wikipedia Using Meta-Learning
Title: Detecting Sockpuppetry on Wikipedia Using Meta-Learning | Sockepuppetry auf Wikipedia erkennen Mit Meta-Learning | 在维基百科上用元学习探测袜子布料 2506.10314v2 |
Authors (2): Luc Raszewski, Christine De Kock
Malicious sockpuppet detection on Wikipedia is critical to preserving access to reliable information on the internet and preventing the spread of disinformation. Prior machine learning approaches rely on stylistic and meta-data features, but do not prioritise adaptability to author-specific behaviours. As a result, they struggle to effectively model the behaviour of specific sockpuppet-groups, especially when text data is limited. To address this, we propose the application of meta-learning, a machine learning technique designed to improve performance in data-scarce settings by training models across multiple tasks. Meta-learning optimises a model for rapid adaptation to the writing style of a new sockpuppet-group. Our results show that meta-learning significantly enhances the precision of predictions compared to pre-trained models, marking an advancement in combating sockpuppetry on open editing platforms. We release a new dataset of sockpuppet investigations to foster future research in both sockpuppetry and meta-learning fields.
Article 447
Title@2025-06-28 (6): Double Entendre: Robust Audio-Based AI-Generated Lyrics Detection via Multi-View Fusion
Title: Double Entendre: Robust Audio-Based AI-Generated Lyrics Detection via Multi-View Fusion | Doppelentendre: Robuste audiobasierte KI-generierte Lyrics-Erkennung über Multi-View Fusion | 双向内容: 强力音频根据 AI 生成的音频通过多视图组合探测 2506.15981v2 |
Authors (4): Markus Frohmann, Gabriel Meseguer-Brocal, Markus Schedl, Elena V. Epure
The rapid advancement of AI-based music generation tools is revolutionizing the music industry but also posing challenges to artists, copyright holders, and providers alike. This necessitates reliable methods for detecting such AI-generated content. However, existing detectors, relying on either audio or lyrics, face key practical limitations: audio-based detectors fail to generalize to new or unseen generators and are vulnerable to audio perturbations; lyrics-based methods require cleanly formatted and accurate lyrics, unavailable in practice. To overcome these limitations, we propose a novel, practically grounded approach: a multimodal, modular late-fusion pipeline that combines automatically transcribed sung lyrics and speech features capturing lyrics-related information within the audio. By relying on lyrical aspects directly from audio, our method enhances robustness, mitigates susceptibility to low-level artifacts, and enables practical applicability. Experiments show that our method, DE-detect, outperforms existing lyrics-based detectors while also being more robust to audio perturbations. Thus, it offers an effective, robust solution for detecting AI-generated music in real-world scenarios. Our code is available at https://github.com/deezer/robust-AI-lyrics-detection.
Article 448
Title@2025-06-28 (6): Jan-nano Technical Report
Title: Jan-nano Technical Report | Jan-nano Technischer Bericht | Jan-nano技术报告 2506.22760v1 |
Authors (2): Alan Dao, Dinh Bach Vu
Most language models face a fundamental tradeoff where powerful capabilities require substantial computational resources. We shatter this constraint with Jan-nano, a 4B parameter language model that redefines efficiency through radical specialization: instead of trying to know everything, it masters the art of finding anything instantly. Fine-tuned from Qwen3-4B using our novel multi-stage RLVR system that completely eliminates reliance on next token prediction training (SFT), Jan-nano achieves 83.2% on SimpleQA benchmark with MCP integration while running on consumer hardware. With 128K context length, Jan-nano proves that intelligence isn’t about scale, it’s about strategy.
Article 449
Title@2025-06-28 (6): AI-Generated Song Detection via Lyrics Transcripts
Title: AI-Generated Song Detection via Lyrics Transcripts | AI-Generated Song Detection via Lyrics Transcripts | AI 创名歌曲通过歌词谱状探测 2506.18488v2 |
Authors (5): Markus Frohmann, Elena V. Epure, Gabriel Meseguer-Brocal, Markus Schedl, Romain Hennequin
The recent rise in capabilities of AI-based music generation tools has created an upheaval in the music industry, necessitating the creation of accurate methods to detect such AI-generated content. This can be done using audio-based detectors; however, it has been shown that they struggle to generalize to unseen generators or when the audio is perturbed. Furthermore, recent work used accurate and cleanly formatted lyrics sourced from a lyrics provider database to detect AI-generated music. However, in practice, such perfect lyrics are not available (only the audio is); this leaves a substantial gap in applicability in real-life use cases. In this work, we instead propose solving this gap by transcribing songs using general automatic speech recognition (ASR) models. We do this using several detectors. The results on diverse, multi-genre, and multi-lingual lyrics show generally strong detection performance across languages and genres, particularly for our best-performing model using Whisper large-v2 and LLM2Vec embeddings. In addition, we show that our method is more robust than state-of-the-art audio-based ones when the audio is perturbed in different ways and when evaluated on different music generators. Our code is available at https://github.com/deezer/robust-AI-lyrics-detection.
Article 450
Title@2025-06-28 (6): ScienceMeter: Tracking Scientific Knowledge Updates in Language Models
Title: ScienceMeter: Tracking Scientific Knowledge Updates in Language Models | ScienceMeter: Nachvollziehen wissenschaftlicher Wissensaktualisierungen in Sprachmodellen | ScienceMeter: 语言模式科学知识最新跟踪 2505.24302v2 |
Authors (4): Yike Wang, Shangbin Feng, Yulia Tsvetkov, Hannaneh Hajishirzi
Large Language Models (LLMs) are increasingly used to support scientific research, but their knowledge of scientific advancements can quickly become outdated. We introduce ScienceMeter, a new framework for evaluating scientific knowledge update methods over scientific knowledge spanning the past, present, and future. ScienceMeter defines three metrics: knowledge preservation, the extent to which models’ understanding of previously learned papers is preserved; knowledge acquisition, how well scientific claims from newly introduced papers are acquired; and knowledge projection, the ability of the updated model to anticipate or generalize to related scientific claims that may emerge in the future. Using ScienceMeter, we examine the scientific knowledge of LLMs on claim judgment and generation tasks across a curated dataset of 15,444 scientific papers and 30,888 scientific claims from ten domains including medicine, biology, materials science, and computer science. We evaluate five representative knowledge update approaches including training- and inference-time methods. With extensive experiments, we find that the best-performing knowledge update methods can preserve only 85.9% of existing knowledge, acquire 71.7% of new knowledge, and project 37.7% of future knowledge. Inference-based methods work for larger models, whereas smaller models require training to achieve comparable performance. Cross-domain analysis reveals that performance on these objectives is correlated. Even when applied to specialized scientific LLMs, existing knowledge update methods fail to achieve these objectives collectively, underscoring that developing robust scientific knowledge update mechanisms is both crucial and challenging.
Article 451
Title@2025-06-28 (6): S^3cMath: Spontaneous Step-level Self-correction Makes Large Language Models Better Mathematical Reasoners
Title: S^3cMath: Spontaneous Step-level Self-correction Makes Large Language Models Better Mathematical Reasoners | S^3cMath: Spontane Step-Level Selbstkorrektur macht große Sprachmodelle besser Mathematische Reasoner | S3cMatth:自发的逐步自我校正使大语言模型更好地解释数学理由 2409.01524v3 |
Authors (8): Yuchen Yan, Jin Jiang, Yang Liu, Yixin Cao, Xin Xu, Mengdi Zhang, Xunliang Cai, Jian Shao
Self-correction is a novel method that can stimulate the potential reasoning abilities of large language models (LLMs). It involves detecting and correcting errors during the inference process when LLMs solve reasoning problems. However, recent works do not regard self-correction as a spontaneous and intrinsic capability of LLMs. Instead, such correction is achieved through post-hoc generation, external knowledge introduction, multi-model collaboration, and similar techniques. In this paper, we propose a series of mathematical LLMs called S$^3$c-Math, which are able to perform Spontaneous Step-level Self-correction for Mathematical reasoning. This capability helps LLMs to recognize whether their ongoing inference tends to contain errors and simultaneously correct these errors to produce a more reliable response. We propose a method that employs a step-level sampling approach to construct step-wise self-correction data for achieving this ability. Additionally, we implement a training strategy that uses the constructed data to equip LLMs with spontaneous step-level self-correction capacities. Our data and methods have been demonstrated to be effective across various foundation LLMs, consistently showing significant progress in evaluations on GSM8K, MATH, and other mathematical benchmarks. To the best of our knowledge, we are the first to introduce the spontaneous step-level self-correction ability of LLMs in mathematical reasoning.
Article 452
Title@2025-06-28 (6): Know Your Mistakes: Towards Preventing Overreliance on Task-Oriented Conversational AI Through Accountability Modeling
Title: Know Your Mistakes: Towards Preventing Overreliance on Task-Oriented Conversational AI Through Accountability Modeling | Kennen Sie Ihre Fehler: Auf dem Weg zu verhindern, dass übermäßige Abhängigkeit auf Task-Oriented Conversational AI durch Accountability Modeling | 了解你的错误:通过建立问责制模式,努力防止过度依赖以任务为导向的对话AI 2501.10316v4 |
Authors (4): Suvodip Dey, Yi-Jyun Sun, Gokhan Tur, Dilek Hakkani-Tur
Recent LLMs have enabled significant advancements for conversational agents. However, they are also well known to hallucinate, producing responses that seem plausible but are factually incorrect. On the other hand, users tend to over-rely on LLM-based AI agents, accepting AI’s suggestion even when it is wrong. Adding positive friction, such as explanations or getting user confirmations, has been proposed as a mitigation in AI-supported decision-making systems. In this paper, we propose an accountability model for LLM-based task-oriented dialogue agents to address user overreliance via friction turns in cases of model uncertainty and errors associated with dialogue state tracking (DST). The accountability model is an augmented LLM with an additional accountability head that functions as a binary classifier to predict the relevant slots of the dialogue state mentioned in the conversation. We perform our experiments with multiple backbone LLMs on two established benchmarks (MultiWOZ and Snips). Our empirical findings demonstrate that the proposed approach not only enables reliable estimation of AI agent errors but also guides the decoder in generating more accurate actions. We observe around 3% absolute improvement in joint goal accuracy (JGA) of DST output by incorporating accountability heads into modern LLMs. Self-correcting the detected errors further increases the JGA from 67.13 to 70.51, achieving state-of-the-art DST performance. Finally, we show that error correction through user confirmations (friction turn) achieves a similar performance gain, highlighting its potential to reduce user overreliance.
Article 453
Title@2025-06-28 (6): LegiGPT: Party Politics and Transport Policy with Large Language Model
Title: LegiGPT: Party Politics and Transport Policy with Large Language Model | LegiGPT: Parteipolitik und Verkehrspolitik mit großem Sprachmodell | 友好社:具有大语言模式的党政治和交通政策 2506.16692v2 |
Authors (2): Hyunsoo Yun, Eun Hak Lee
Given the significant influence of lawmakers’ political ideologies on legislative decision-making, analyzing their impact on transportation-related policymaking is of critical importance. This study introduces a novel framework that integrates a large language model (LLM) with explainable artificial intelligence (XAI) to analyze transportation-related legislative proposals. Legislative bill data from South Korea’s 21st National Assembly were used to identify key factors shaping transportation policymaking. These include political affiliations and sponsor characteristics. The LLM was employed to classify transportation-related bill proposals through a stepwise filtering process based on keywords, sentences, and contextual relevance. XAI techniques were then applied to examine the relationships between political party affiliation and associated attributes. The results revealed that the number and proportion of conservative and progressive sponsors, along with district size and electoral population, were critical determinants shaping legislative outcomes. These findings suggest that both parties contributed to bipartisan legislation through different forms of engagement, such as initiating or supporting proposals. This integrated approach offers a valuable tool for understanding legislative dynamics and guiding future policy development, with broader implications for infrastructure planning and governance.
Article 454
Title@2025-06-28 (6): How to Retrieve Examples in In-context Learning to Improve Conversational Emotion Recognition using Large Language Models?
Title: How to Retrieve Examples in In-context Learning to Improve Conversational Emotion Recognition using Large Language Models? | Wie kann man Beispiele im In-Context-Lernen abrufen, um die Erkennung von Konversationsgefühlen mit großen Sprachmodellen zu verbessern? | 如何利用大语言模式获取学习内文中的实例, 2506.20199v2 |
Authors (3): Mengqi Wang, Tiantian Feng, Shrikanth Narayanan
Large language models (LLMs) have enabled a wide variety of real-world applications in various domains. However, creating a high-performing application with high accuracy remains challenging, particularly for subjective tasks like emotion recognition. Inspired by the SLT 2024 GenSER Challenge, this study investigates approaches to improving conversational emotion recognition (CER) by LLMs. Specifically, we explore how to retrieve high-quality examples in in-context learning (ICL) to enhance CER. We propose various strategies based on random and augmented example retrieval and also analyze the impact of conversational context on CER accuracy. Experiments were conducted on three datasets: IEMOCAP, MELD, and EmoryNLP. The results show that augmented example retrieval consistently outperforms other techniques under investigation across all datasets, highlighting the importance of retrieving coherent targeted examples and enhancing them through paraphrasing.
Article 455
Title@2025-06-28 (6): Improved Supervised Fine-Tuning for Large Language Models to Mitigate Catastrophic Forgetting
Title: Improved Supervised Fine-Tuning for Large Language Models to Mitigate Catastrophic Forgetting | Verbessertes Supervised-Fine-Tuning für große Sprachmodelle, um Katastrophenvergessenheit zu vermeiden | 改进对大语言模型改进监督的微调,以缓解灾难性遗忘 2506.09428v2 |
Authors (2): Fei Ding, Baiqiao Wang
Supervised Fine-Tuning (SFT) is a critical step for enhancing the instruction-following capabilities of Large Language Models (LLMs) and adapting them to specialized domains. However, SFT often leads to a degradation of the model’s general abilities, a phenomenon known as catastrophic forgetting. This problem is exacerbated when third-party practitioners fine-tune open-source models, as the original SFT data is typically not available. To address this challenge, we propose a novel and cost-effective SFT method that effectively mitigates catastrophic forgetting without requiring access to the original SFT data. Our approach first reconstructs the likely instruction distribution of the base model. It then employs a multi-model generation and filtering pipeline to synthesize a high-quality general-purpose dataset. This synthetic dataset is mixed with new, domain-specific data for fine-tuning. Experimental results show that our method not only preserves the model’s capabilities in general domains but also improves task-specific performance, outperforming baselines that use publicly available SFT datasets.
Article 456
Title@2025-06-28 (6): Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs
Title: Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs | Die Hitze aufdrehen: Min-p-Sampling für kreative und kohärente LLM-Ausgaben | 翻开热热:创意和一致的LLM产出的最小抽样 2407.01082v7 |
Authors (6): Minh Nhat Nguyen, Andrew Baker, Clement Neo, Allen Roush, Andreas Kirsch, Ravid Shwartz-Ziv
Large Language Models (LLMs) generate text by sampling the next token from a probability distribution over the vocabulary at each decoding step. Popular sampling methods like top-p (nucleus sampling) often struggle to balance quality and diversity, especially at higher temperatures, which lead to incoherent or repetitive outputs. We propose min-p sampling, a dynamic truncation method that adjusts the sampling threshold based on the model’s confidence by using the top token’s probability as a scaling factor. Our experiments on benchmarks including GPQA, GSM8K, and AlpacaEval Creative Writing show that min-p sampling improves both the quality and diversity of generated text across different model families (Mistral and Llama 3) and model sizes (1B to 123B parameters), especially at higher temperatures. Human evaluations further show a clear preference for min-p sampling, in both text quality and creativity. Min-p sampling has been adopted by popular open-source LLM frameworks, including Hugging Face Transformers, vLLM, and many others, highlighting its considerable impact on improving text generation quality.
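As a rough illustration of the truncation rule described in this abstract, the sketch below keeps only tokens whose probability is at least a fraction of the top token's probability, then renormalizes. The names `min_p_filter`, `sample_min_p`, and `p_base` are chosen here for illustration, and applying temperature before truncation is an assumption; real frameworks may order these steps differently.

```python
import math
import random


def min_p_filter(logits, p_base=0.1, temperature=1.0):
    """Return renormalized probabilities after min-p truncation.

    Assumption: temperature is applied before truncation.
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    # Numerically stable softmax.
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Dynamic threshold: scales with the top token's probability.
    threshold = p_base * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    norm = sum(kept)
    return [p / norm for p in kept]


def sample_min_p(logits, p_base=0.1, temperature=1.0, rng=random):
    """Sample one token index from the min-p truncated distribution."""
    probs = min_p_filter(logits, p_base, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

When the model is confident (one dominant token), the threshold is high and few tokens survive; when the distribution is flat, the threshold drops and more candidates remain.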
Article 457
Title@2025-06-28 (6): The Translation Barrier Hypothesis: Multilingual Generation with Large Language Models Suffers from Implicit Translation Failure
Title: The Translation Barrier Hypothesis: Multilingual Generation with Large Language Models Suffers from Implicit Translation Failure | Die Übersetzungsbarriere Hypothese: Mehrsprachige Generation mit großen Sprachmodellen leidet unter Implizitem Übersetzungsfehler | 《翻译障碍假设:具有大语言模型的多语言一代人因隐含翻译失败而遭受的痛苦》 2506.22724v1 |
Authors (7): Niyati Bafna, Tianjian Li, Kenton Murray, David R. Mortensen, David Yarowsky, Hale Sirin, Daniel Khashabi
Multilingual generation with large language models (LLMs) is often of poor quality for mid- to low-resource languages. Building on insights from interpretability, we demonstrate the existence of an implicit task-solving → translation pipeline for generation, whereby the model first solves the required task in a largely target-language-agnostic manner, and subsequently translates answer concepts into the intended target language. We hypothesize that the failure of the translation stage is an important culprit for the observed low quality of final outputs, and formalize this as the translation barrier hypothesis. We test this hypothesis for a word translation task across 108 language pairs, using logit lens to observe model processing in intermediate layers. We find that a significant portion of overall failures indeed stems from translation failure, or the model’s inability to translate correctly solved intermediate concepts into the target language. This is especially true for low-resource target languages. Our results highlight an important hurdle for end-to-end multilingual generation, and lend guiding insights for future work seeking to improve multilinguality in LLMs.
Article 458
Title@2025-06-28 (6): BEST-Route: Adaptive LLM Routing with Test-Time Optimal Compute
Title: BEST-Route: Adaptive LLM Routing with Test-Time Optimal Compute | BEST-Route: Adaptives LLM Routing mit Test-Time Optimal Compute | 最佳选择:用测试时最佳计算法运行的适应性LMLM 2506.22716v1 |
Authors (10): Dujian Ding, Ankur Mallick, Shaokun Zhang, Chi Wang, Daniel Madrigal, Mirian Del Carmen Hipolito Garcia, Menglin Xia, Laks V. S. Lakshmanan, Qingyun Wu, Victor Rühle
Large language models (LLMs) are powerful tools but are often expensive to deploy at scale. LLM query routing mitigates this by dynamically assigning queries to models of varying cost and quality to obtain a desired trade-off. Prior query routing approaches generate only one response from the selected model. Because a single response from a small (inexpensive) model was often not good enough to beat a response from a large (expensive) model, they end up overusing the large model and missing out on potential cost savings. However, it is well known that for small models, generating multiple responses and selecting the best can enhance quality while remaining cheaper than a single large-model response. We leverage this idea to propose BEST-Route, a novel routing framework that chooses a model and the number of responses to sample from it based on query difficulty and the quality thresholds. Experiments on real-world datasets demonstrate that our method reduces costs by up to 60% with less than 1% performance drop.
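The routing idea in this abstract can be pictured as picking the cheapest (model, number of samples) pair whose predicted quality clears a threshold. The sketch below is only a toy under that reading: `Option`, `route`, and the `predicted_quality` field are hypothetical names, and the quality numbers stand in for whatever difficulty-aware predictor a real router would use. It is not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    model: str                # hypothetical model label
    n_samples: int            # responses to sample for best-of-n selection
    cost: float               # estimated total cost of this option
    predicted_quality: float  # assumed output of a per-query quality predictor


def route(options, fallback, quality_threshold):
    """Return the cheapest option meeting the threshold, else the fallback."""
    feasible = [o for o in options if o.predicted_quality >= quality_threshold]
    if not feasible:
        return fallback
    return min(feasible, key=lambda o: o.cost)
```

With a small model sampled four times predicted to reach the threshold at lower cost than one large-model call, the router would pick the small model; for harder queries where no cheap option qualifies, it falls back to the large model.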
Article 459
Title@2025-06-28 (6): Residual Matrix Transformers: Scaling the Size of the Residual Stream
Title: Residual Matrix Transformers: Scaling the Size of the Residual Stream | Residual Matrix Transformers: Skalierung der Größe des Residual Stream | 残余矩阵变异器:扩大残余流的规模 2506.22696v1 |
Authors (2): Brian Mak, Jeffrey Flanigan
The residual stream acts as a memory bus where transformer layers both store and access features (Elhage et al., 2021). We consider changing the mechanism for retrieving and storing information in the residual stream, and replace the residual stream of the transformer with an outer product memory matrix (Kohonen, 1972; Anderson, 1972). We call this model the Residual Matrix Transformer (RMT). We find that the RMT enjoys a number of attractive properties: 1) the size of the residual stream can be scaled independently of compute and model size, improving performance, 2) the RMT can achieve the same loss as the transformer with 58% fewer FLOPS, 25% fewer parameters, and 41% fewer training tokens, and 3) the RMT outperforms the transformer on downstream evaluations. We theoretically analyze the transformer and the RMT, and show that the RMT allows for more efficient scaling of the residual stream, as well as improved variance propagation properties. Code for this project can be found at https://github.com/bmac3/residual-matrix-transformer.
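For readers unfamiliar with outer product memories, the sketch below shows the classic linear associative memory that the abstract cites: a value is stored by adding the outer product of a key and the value to a matrix, and read back by multiplying the key into the matrix. The function names are illustrative, and this is the textbook mechanism only, not the RMT's actual layer equations.

```python
def write(memory, key, value):
    """Store `value` under `key` by adding outer(key, value) to the memory."""
    return [
        [m + k * v for m, v in zip(row, value)]
        for row, k in zip(memory, key)
    ]


def read(memory, key):
    """Retrieve key^T @ memory; exact when stored keys are orthonormal."""
    n_cols = len(memory[0])
    return [
        sum(k * row[j] for k, row in zip(key, memory))
        for j in range(n_cols)
    ]
```

The memory has `len(key)` rows and `len(value)` columns, so the number of key dimensions (and hence storage capacity) can grow without changing the size of the vectors being stored.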
Article 460
Title@2025-06-28 (6): Reasoner Outperforms: Generative Stance Detection with Rationalization for Social Media
Title: Reasoner Outperforms: Generative Stance Detection with Rationalization for Social Media | Reasoner Outperforms: Generative Stance Detection mit Rationalisierung für Social Media | 理性外向表现:社会媒体合理化的 “ 产生式发现 “ 和 “ 社会媒体合理化 “ 。 2412.10266v2 |
Authors (3): Jiaqing Yuan, Ruijie Xi, Munindar P. Singh
Stance detection is crucial for fostering a human-centric Web by analyzing user-generated content to identify biases and harmful narratives that undermine trust. With the development of Large Language Models (LLMs), existing approaches treat stance detection as a classification problem, providing robust methodologies for modeling complex group interactions and advancing capabilities in natural language tasks. However, these methods often lack interpretability, limiting their ability to offer transparent and understandable justifications for predictions. This study adopts a generative approach, where stance predictions include explicit, interpretable rationales, and integrates them into smaller language models through single-task and multitask learning. We find that incorporating reasoning into stance detection enables the smaller model (FlanT5) to outperform GPT-3.5’s zero-shot performance, achieving an improvement of up to 9.57%. Moreover, our results show that reasoning capabilities enhance multitask learning performance but may reduce effectiveness in single-task settings. Crucially, we demonstrate that faithful rationales improve rationale distillation into SLMs, advancing efforts to build interpretable, trustworthy systems for addressing discrimination, fostering trust, and promoting equitable engagement on social media.
Article 461
Title@2025-06-28 (6): VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs
Title: VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs | VOCABTRIM: Vokabelabgleich für effizientes spekulatives Decodieren in LLMs | VOCABTRIM: 有效投机下限的词汇 2506.22694v1 |
Authors (12): Raghavv Goel, Sudhanshu Agrawal, Mukul Gagrani, Junyoung Park, Yifan Zao, He Zhang, Tian Liu, Yiping Yang, Xin Yuan, Jiuyan Lu, Chris Lott, Mingu Lee
In this paper, we introduce a simple training-free technique to improve the performance of drafter-based speculative decoding (SpD) methods that incorporate a language modeling head (LM head) during the drafting process. Drafter-based speculative decoding leverages one or more smaller language models, a.k.a. drafters or draft models, to sample a draft sequence or tree consisting of multiple tokens, followed by verification by a base LLM, the target model, which accepts a subset as its valid generation. As speculative decoding is usually considered to require a one-to-one mapping between the vocabularies of the target model and the draft model, it has been natural to share the vocabulary between them, or even share the LM head as in EAGLE or Medusa. We first identify that this draft token sampling scheme inherently contains an unnecessary inference overhead in drafting, especially for some target LLMs with very large vocabularies. Then, we propose a simple technique, VocabTrim, to mitigate the drafting overhead and improve the generation speed in memory-bound environments. VocabTrim reconstructs the drafter LM head to contain only a limited set of tokens, selected as those most frequently sampled from the vocabulary of the target model. While limiting the vocabulary in drafting slightly degrades the acceptance rate, it significantly reduces the drafting latency in memory-bound processes, which is often the case on edge devices, resulting in a higher memory-bound speed-up (MBSU). We show that our method can boost the memory-bound speed-up for Llama-3 models on Spec-Bench, specifically by 16% for Llama-3.2-3B-Instruct.
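The head-trimming step described in this abstract can be sketched as follows: keep only the LM-head rows for the most frequently sampled tokens, and map the drafter's choice back to the full-vocabulary id. All names here (`build_trimmed_head`, `draft_next_token`, `token_counts`) are hypothetical, the LM head is a plain list of weight rows, and greedy drafting is an assumption made to keep the example short.

```python
from collections import Counter


def build_trimmed_head(lm_head, token_counts, keep):
    """Keep the LM-head rows of the `keep` most frequently sampled tokens.

    lm_head: list of weight rows, one per full-vocabulary token id.
    token_counts: dict mapping token id -> how often the target sampled it.
    Returns (trimmed_rows, kept_ids), with kept_ids in ascending order.
    """
    kept_ids = sorted(tok for tok, _ in Counter(token_counts).most_common(keep))
    trimmed = [lm_head[i] for i in kept_ids]
    return trimmed, kept_ids


def draft_next_token(hidden, trimmed_head, kept_ids):
    """Greedy draft over the trimmed vocabulary, returned as a full-vocab id."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in trimmed_head]
    best = max(range(len(logits)), key=logits.__getitem__)
    return kept_ids[best]
```

The drafter computes fewer logits per step, which is where the latency saving would come from; a rare token that the full head would have chosen can no longer be drafted, which matches the abstract's note about a slightly lower acceptance rate.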
Article 462
Title@2025-06-28 (6): Scaling Data-Constrained Language Models
Title: Scaling Data-Constrained Language Models | Skalierung von datengebundenen Sprachmodellen | 受数据约束的语言模式 2305.16264v5 |
Authors (9): Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, Colin Raffel
The current trend of scaling language models involves increasing both parameter count and training dataset size. Extrapolating this trend suggests that training dataset size may soon be limited by the amount of text data available on the internet. Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models. We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero. We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters. Finally, we experiment with approaches mitigating data scarcity, including augmenting the training dataset with code data or removing commonly used filters. Models and datasets from our 400 training runs are freely available at https://github.com/huggingface/datablations.
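One way to picture the "decreasing value of repeated tokens" is an effective-data curve that tracks the raw token count for the first few epochs and then saturates. The sketch below uses a saturating exponential for this; the function name, the functional form, and the constant `r_star=15.0` are assumptions made here for illustration, since the abstract does not state the fitted law or its constants.

```python
import math


def effective_tokens(unique_tokens, epochs, r_star=15.0):
    """Illustrative effective data size under repetition.

    Assumed form: the first pass counts fully; each additional repeat
    contributes less, with total extra value capped at r_star * unique_tokens.
    """
    repeats = max(epochs - 1, 0)
    return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))
```

Under these assumed values, four epochs over 100 unique tokens are worth roughly 372 effective tokens (close to the 400 seen), while very large epoch counts approach a ceiling of 1,600.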
Article 463
Title@2025-06-27 (5): Organize the Web: Constructing Domains Enhances Pre-Training Data Curation
Title: Organize the Web: Constructing Domains Enhances Pre-Training Data Curation | Organisation des Webs: Aufbau von Domains verbessert die Vorschulung von Daten-Curation | 组织网络: 构建域域 增强培训前数据曲线 2502.10341v2 |
Authors (6): Alexander Wettig, Kyle Lo, Sewon Min, Hannaneh Hajishirzi, Danqi Chen, Luca Soldaini
Modern language models are trained on large, unstructured datasets consisting of trillions of tokens and obtained by crawling the web. The unstructured nature makes it difficult to reason about their contents and develop systematic approaches to data curation. In this paper, we unpack monolithic web corpora by developing taxonomies of their contents and organizing them into domains. We introduce WebOrganizer, a framework for organizing web pages in terms of both their topic and format. Using these two complementary notions of domains, we automatically annotate pre-training data by distilling annotations from a large language model into efficient classifiers. This allows us to study how data from different domains should be mixed to improve models on downstream tasks, and we show that we can combine insights about effective topics and formats to further boost performance. We demonstrate that our domain mixing also improves existing methods that select data based on quality. Furthermore, we study and compare how quality-based methods will implicitly change the domain mixture. Overall, our work demonstrates that constructing and mixing domains provides a valuable complement to quality-based data curation methods, opening new avenues for effective and insightful pre-training data curation.
Article 464
Title@2025-06-27 (5): PriorDiffusion: Leverage Language Prior in Diffusion Models for Monocular Depth Estimation
Title: PriorDiffusion: Leverage Language Prior in Diffusion Models for Monocular Depth Estimation | PriorDiffusion: Leverage Language Prior in Diffusionsmodellen für monookulare Tiefenschätzung | 先前传播:在单人深度估算扩散模型中先使用语言 2411.16750v3 |
Authors (8): Ziyao Zeng, Jingcheng Ni, Daniel Wang, Patrick Rim, Younjoon Chung, Fengyu Yang, Byung-Woo Hong, Alex Wong
Traditional monocular depth estimation suffers from inherent ambiguity and visual nuisance. We argue that language prior can enhance monocular depth estimation by leveraging the inductive bias learned during the text-to-image pre-training of diffusion models. The ability of these models to generate images that align with text indicates that they have learned the spatial relationships, size, and shape of specified objects, which can be applied to improve depth estimation. Thus, we propose PriorDiffusion, using a pre-trained text-to-image diffusion model that takes both images and corresponding text descriptions to infer affine-invariant depth through a denoising process. We also show that language prior enhances the model’s perception of specific regions of images that users care about and describe. Simultaneously, language prior acts as a constraint to accelerate the convergence of both training and the inference diffusion trajectory. By training on HyperSim and Virtual KITTI, we achieve faster training convergence, fewer inference diffusion steps, and state-of-the-art zero-shot performance across NYUv2, KITTI, ETH3D, and ScanNet. Code will be released upon acceptance.
Article 465
Title@2025-06-27 (5): Assessing the feasibility of Large Language Models for detecting micro-behaviors in team interactions during space missions
Title: Assessing the feasibility of Large Language Models for detecting micro-behaviors in team interactions during space missions | Bewertung der Machbarkeit von Large Language Models zur Erkennung von Mikroverhalten in Teaminteraktionen während Weltraummissionen | 评估大语言模型在空间飞行任务期间在团队互动中探测微型行为力模型的可行性 2506.22679v1 |
Authors (5): Ankush Raut, Projna Paromita, Sydney Begerowski, Suzanne Bell, Theodora Chaspari
We explore the feasibility of large language models (LLMs) in detecting subtle expressions of micro-behaviors in team conversations using transcripts collected during simulated space missions. Specifically, we examine zero-shot classification, fine-tuning, and paraphrase-augmented fine-tuning with encoder-only sequence classification LLMs, as well as few-shot text generation with decoder-only causal language modeling LLMs, to predict the micro-behavior associated with each conversational turn (i.e., dialogue). Our findings indicate that encoder-only LLMs, such as RoBERTa and DistilBERT, struggled to detect underrepresented micro-behaviors, particularly discouraging speech, even with weighted fine-tuning. In contrast, the instruction fine-tuned version of Llama-3.1, a decoder-only LLM, demonstrated superior performance, with the best models achieving macro F1-scores of 44% for 3-way classification and 68% for binary classification. These results have implications for the development of speech technologies aimed at analyzing team communication dynamics and enhancing training interventions in high-stakes environments such as space missions, particularly in scenarios where text is the only accessible data.
Article 466
Title@2025-06-27 (5): Can LLMs Interpret and Leverage Structured Linguistic Representations? A Case Study with AMRs
Title: Can LLMs Interpret and Leverage Structured Linguistic Representations? A Case Study with AMRs | Kann LLMs Dolmetschen und Leverage strukturierte sprachliche Repräsentationen? Eine Fallstudie mit AMRs | LLMs 能够解释和利用结构化语言代表吗? 2504.04745v4 |
Authors (3): Ankush Raut, Xiaofeng Zhu, Maria Leonor Pacheco
This paper evaluates the ability of Large Language Models (LLMs) to leverage contextual information in the form of structured linguistic representations. Specifically, we examine the impact of encoding both short and long contexts using Abstract Meaning Representation (AMR) structures across a diverse set of language tasks. We perform our analysis using 8-bit quantized and instruction-tuned versions of Llama 3.1 (8B), Phi-3, and Mistral 7B. Our results indicate that, for tasks involving short contexts, augmenting the prompt with the AMR of the original language context often degrades the performance of the underlying LLM. However, for tasks that involve long contexts, such as dialogue summarization in the SAMSum dataset, this enhancement improves LLM performance, for example, by increasing the zero-shot cosine similarity score of Llama 3.1 from 66% to 76%. This improvement is more evident in the newer and larger LLMs, but does not extend to the older or smaller ones. In addition, we observe that LLMs can effectively reconstruct the original text from a linearized AMR, achieving a cosine similarity of 81% in the best-case scenario.
Article 467
Title@2025-06-27 (5): VERA: Variational Inference Framework for Jailbreaking Large Language Models
Title: VERA: Variational Inference Framework for Jailbreaking Large Language Models | VERA: Variationaler Bezugsrahmen für Jailbreaking große Sprachmodelle | VERA:破碎大型语言模型变化推断框架 2506.22666v1 |
Authors (5): Anamika Lochab, Lu Yan, Patrick Pynadath, Xiangyu Zhang, Ruqi Zhang
The rise of API-only access to state-of-the-art LLMs highlights the need for effective black-box jailbreak methods to identify model vulnerabilities in real-world settings. Without a principled objective for gradient-based optimization, most existing approaches rely on genetic algorithms, which are limited by their initialization and dependence on manually curated prompt pools. Furthermore, these methods require individual optimization for each prompt, failing to provide a comprehensive characterization of model vulnerabilities. To address this gap, we introduce VERA: Variational infErence fRamework for jAilbreaking. VERA casts black-box jailbreak prompting as a variational inference problem, training a small attacker LLM to approximate the target LLM’s posterior over adversarial prompts. Once trained, the attacker can generate diverse, fluent jailbreak prompts for a target query without re-optimization. Experimental results show that VERA achieves strong performance across a range of target LLMs, highlighting the value of probabilistic inference for adversarial prompt generation.
Article 468
Title@2025-06-27 (5): Demystifying Singular Defects in Large Language Models
Title: Demystifying Singular Defects in Large Language Models | Entmystifizieren von Singularfehlern in großen Sprachmodellen | 解开大语言模型中奇异的奇特缺陷 2502.07004v2 |
Authors (3): Haoqi Wang, Tong Zhang, Mathieu Salzmann
Large transformer models are known to produce high-norm tokens. In vision transformers (ViTs), such tokens have been mathematically modeled through the singular vectors of the linear approximations of layers. However, in large language models (LLMs), the underlying causes of high-norm tokens remain largely unexplored, and their different properties from those of ViTs require a new analysis framework. In this paper, we provide both theoretical insights and empirical validation across a range of recent models, leading to the following observations: i) The layer-wise singular direction predicts the abrupt explosion of token norms in LLMs. ii) The negative eigenvalues of a layer explain its sudden decay. iii) The computational pathways leading to high-norm tokens differ between initial and noninitial tokens. iv) High-norm tokens are triggered by the right leading singular vector of the matrix approximating the corresponding modules. We showcase two practical applications of these findings: the improvement of quantization schemes and the design of LLM signatures. Our findings not only advance the understanding of singular defects in LLMs but also open new avenues for their application. We expect that this work will stimulate further research into the internal mechanisms of LLMs. Code is released at https://github.com/haoqiwang/singular_defect.
Article 469
Title@2025-06-27 (5): Evaluating Hybrid Retrieval Augmented Generation using Dynamic Test Sets: LiveRAG Challenge
Title: Evaluating Hybrid Retrieval Augmented Generation using Dynamic Test Sets: LiveRAG Challenge | Bewertung der Hybrid Retrieval Augmented Generation mit Dynamic Test Sets: LiveRAG Challenge | 使用动态测试组评估混合回收增殖下一代:LiveRAG挑战 2506.22644v1 |
Authors (4): Chase Fensore, Kaustubh Dhole, Joyce C Ho, Eugene Agichtein
We present our submission to the LiveRAG Challenge 2025, which evaluates retrieval-augmented generation (RAG) systems on dynamic test sets using the FineWeb-10BT corpus. Our final hybrid approach combines sparse (BM25) and dense (E5) retrieval methods and then aims to generate relevant and faithful answers with Falcon3-10B-Instruct. Through systematic evaluation on 200 synthetic questions generated with DataMorgana across 64 unique question-user combinations, we demonstrate that neural re-ranking with RankLLaMA improves MAP from 0.523 to 0.797 (52% relative improvement) but introduces prohibitive computational costs (84s vs 1.74s per question). While DSPy-optimized prompting strategies achieved higher semantic similarity (0.771 vs 0.668), their 0% refusal rates raised concerns about over-confidence and generalizability. Our submitted hybrid system without re-ranking achieved 4th place in faithfulness and 11th place in correctness among 25 teams. Analysis across question categories reveals that vocabulary alignment between questions and documents was the strongest predictor of performance on our development set, with document-similar phrasing improving cosine similarity from 0.562 to 0.762.
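The abstract does not say how the sparse and dense result lists are merged, so the sketch below uses reciprocal rank fusion (RRF) purely as a stand-in technique for combining two rankings. The function name and the document ids are hypothetical; `k=60` is the constant commonly used with RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    Ties are broken by document id so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))
```

A document ranked highly by both retrievers rises to the top, while one found by only a single retriever still remains a candidate for the generator.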
nan
Article 470
Title@2025-06-27 (5): Temperature Matters: Enhancing Watermark Robustness Against Paraphrasing Attacks
Title: Temperature Matters: Enhancing Watermark Robustness Against Paraphrasing Attacks | Temperaturfaktoren: Verbesserung der Robustheit des Wasserzeichens gegen paraphrasierende Angriffe | 温度事项:加强水印力,防止袭击 2506.22623v1 |
Authors (5): Badr Youbi Idrissi, Monica Millunzi, Amelia Sorrenti, Lorenzo Baraldi, Daryna Dementieva
In the present-day scenario, Large Language Models (LLMs) are establishing their presence as powerful instruments permeating various sectors of society. While their utility offers valuable support to individuals, there are multiple concerns over potential misuse. Consequently, some academic endeavors have sought to introduce watermarking techniques, characterized by the inclusion of markers within machine-generated text, to facilitate algorithmic identification. This research project is focused on the development of a novel methodology for the detection of synthetic text, with the overarching goal of ensuring the ethical application of LLMs in AI-driven text generation. The investigation commences with replicating findings from a previous baseline study, thereby underscoring its susceptibility to variations in the underlying generation model. Subsequently, we propose an innovative watermarking approach and subject it to rigorous evaluation, employing paraphrased generated text to assess its robustness. Experimental results highlight the robustness of our proposal compared to the \cite{aarson} watermarking method.
nan
Article 471
Title@2025-06-27 (5): RExBench: Can coding agents autonomously implement AI research extensions?
Title: RExBench: Can coding agents autonomously implement AI research extensions? | RExBench: Können Codierer KI-Forschungserweiterungen autonom implementieren? | RExBench:编码代理商能否自主实施AI研究扩展? 2506.22598v1 |
Authors (6): Nicholas Edwards, Yukyung Lee, Yujun Mao, Yulu Qin, Sebastian Schuster, Najoung Kim
Agents based on Large Language Models (LLMs) have shown promise for performing sophisticated software engineering tasks autonomously. In addition, there has been progress towards developing agents that can perform parts of the research pipeline in machine learning and the natural sciences. We argue that research extension and its implementation is a critical capability for such systems, and introduce RExBench to support the evaluation of this capability. RExBench is a benchmark consisting of 12 realistic research experiment implementation tasks that aim to investigate research hypotheses that have not previously been implemented. Each task is set up as an extension to an existing research paper and codebase, accompanied by domain expert-written instructions. RExBench is robust to data contamination, and supports an automatic evaluation infrastructure that executes agent outputs to determine whether the success criteria are met. We use this benchmark to evaluate nine LLM agents implemented using three different frameworks: aider, Claude Code, and OpenHands. We find that all agents evaluated fail to autonomously implement the majority of the extensions. Although the success rate improves with additional human-written hints, the best performance under this setting remains below 40%. This indicates that current agents are still short of being able to handle realistic research extension tasks without substantial human guidance.
nan
Article 472
Title@2025-06-27 (5): What Makes the Preferred Thinking Direction for LLMs in Multiple-choice Questions?
Title: What Makes the Preferred Thinking Direction for LLMs in Multiple-choice Questions? | Was macht die bevorzugte Denkrichtung für LLMs in Multiple-Choice-Fragen? | ” 多种选择问题 “ 中LLMs的首选思维方向是什么? 2502.18435v3 |
Authors (8): Yizhe Zhang, Richard Bai, Zijin Gu, Ruixiang Zhang, Jiatao Gu, Emmanuel Abbe, Samy Bengio, Navdeep Jaitly
Language models usually use left-to-right (L2R) autoregressive factorization. However, L2R factorization may not always be the best inductive bias. Therefore, we investigate whether alternative factorizations of the text distribution could be beneficial in some tasks. We investigate right-to-left (R2L) training as a compelling alternative, focusing on multiple-choice questions (MCQs) as a test bed for knowledge extraction and reasoning. Through extensive experiments across various model sizes (2B-8B parameters) and training datasets, we find that R2L models can significantly outperform L2R models on several MCQ benchmarks, including logical reasoning, commonsense understanding, and truthfulness assessment tasks. Our analysis reveals that this performance difference may be fundamentally linked to multiple factors including calibration, computability, and directional conditional entropy. We analyze the impact of these factors through controlled simulation studies using arithmetic tasks, where the impacting factors can be better disentangled. Our work demonstrates that exploring alternative factorizations of the text distribution can lead to improvements in LLM capabilities and provides theoretical insights into optimal factorization towards approximating human language distribution, and when each reasoning order might be more advantageous. Our code and checkpoints are released at https://github.com/apple/ml-reversal-blessing.
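For reference, the two factorizations contrasted in this abstract can be written out as below. This is a sketch in notation assumed here ($T$ for sequence length, $\theta$ and $\phi$ for the parameters of the L2R and R2L models), not notation taken from the paper.

```latex
% Left-to-right (L2R): each token is conditioned on the tokens before it.
p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_1, \dots, x_{t-1})

% Right-to-left (R2L): each token is conditioned on the tokens after it.
p_\phi(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\phi(x_t \mid x_{t+1}, \dots, x_T)
```

By the chain rule both products can represent the same joint distribution exactly. The two orders differ only in which conditionals a trained model has to approximate, which is where the inductive bias enters.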
nan
Article 473
Title@2025-06-27 (5): Data Quality Issues in Multilingual Speech Datasets: The Need for Sociolinguistic Awareness and Proactive Language Planning
Title: Data Quality Issues in Multilingual Speech Datasets: The Need for Sociolinguistic Awareness and Proactive Language Planning | Datenqualitätsfragen in mehrsprachigen Sprachdatensätzen: Der Bedarf an soziolinguistischer Sensibilisierung und proaktiver Sprachplanung | 多语言语言数据集的数据质量问题:社会语言意识和前瞻性语言规划的必要性 2506.17525v2 |
Authors (6): Mingfei Lau, Qian Chen, Yeming Fang, Tingting Xu, Tongzhou Chen, Pavel Golik
Our quality audit for three widely used public multilingual speech datasets - Mozilla Common Voice 17.0, FLEURS, and Vox Populi - shows that in some languages, these datasets suffer from significant quality issues, which may obfuscate downstream evaluation results while creating an illusion of success. We divide these quality issues into two categories: micro-level and macro-level. We find that macro-level issues are more prevalent in less institutionalized, often under-resourced languages. We provide a case analysis of Taiwanese Southern Min (nan_tw) that highlights the need for proactive language planning (e.g. orthography prescriptions, dialect boundary definition) and enhanced data quality control in the dataset creation process. We conclude by proposing guidelines and recommendations to mitigate these issues in future dataset development, emphasizing the importance of sociolinguistic awareness and language planning principles. Furthermore, we encourage research into how this creation process itself can be leveraged as a tool for community-led language planning and revitalization.
nan
Article 474
Title@2025-06-27 (5): Refining Czech GEC: Insights from a Multi-Experiment Approach
Title: Refining Czech GEC: Insights from a Multi-Experiment Approach | Refining Czech GEC: Einblicke aus einem Multi-Experiment-Ansatz | 完善捷克的GEC:从多种经验方法中得出的看法 2506.22402v1 |
Authors (4): Petr Pechman, Milan Straka, Jana Straková, Jakub Náplava
We present a grammar error correction (GEC) system that achieves state of the art for the Czech language. Our system is based on a neural network translation approach with the Transformer architecture, and its key feature is its real-time synthetic generation pipeline, which dynamically augments sentences with artificial errors by introducing both language-agnostic and Czech-specific errors. We conduct a comprehensive series of experiments, investigating the Czech GEC corpora as bases for synthetic error introduction, several error generation strategies, domain balancing, tokenization granularity, model size, and data scaling during fine-tuning. Additionally, we evaluate the performance of large language models (LLMs) on Czech GEC in both end-user and expert fine-tuning scenarios. Our best-performing model is superior both in performance and computational efficiency. The source code and the trained model links are available on https://github.com/ufal/tsd2025-gec.
nan
Article 475
Title@2025-06-27 (5): Metadata Conditioning Accelerates Language Model Pre-training
Title: Metadata Conditioning Accelerates Language Model Pre-training | Metadatenkonditionierung beschleunigt Sprachmodell Vortraining | 训练前训练模式 2501.01956v3 |
Authors (6): Tianyu Gao, Alexander Wettig, Luxi He, Yihe Dong, Sadhika Malladi, Danqi Chen
The vast diversity of styles, domains, and quality levels present in language model pre-training corpora is essential in developing general model capabilities, but efficiently learning and deploying the correct behaviors exemplified in each of these heterogeneous data sources is challenging. To address this, we propose a new method, termed Metadata Conditioning then Cooldown (MeCo), to incorporate additional learning cues during pre-training. MeCo first provides metadata (e.g., URLs like www.wikipedia.org) alongside the text during training and later uses a cooldown phase with only the standard text, thereby enabling the model to function normally even without metadata. MeCo significantly accelerates pre-training across different model scales (600M to 8B parameters) and training sources (C4, RefinedWeb, and DCLM). For instance, a 1.6B language model trained with MeCo matches the downstream task performance of standard pre-training while using 33% less data. Additionally, MeCo enables us to steer language models by conditioning the inference prompt on either real or fabricated metadata that encodes the desired properties of the output: for example, prepending wikipedia.org to reduce harmful generations or factquizmaster.com (fabricated) to improve common knowledge task performance. We also demonstrate that MeCo is compatible with different types of metadata, such as model-generated topics. MeCo is remarkably simple, adds no computational overhead, and demonstrates promise in producing more capable and steerable language models.
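The two training phases described here can be sketched as a data-formatting step. This is a minimal illustration, not the authors' implementation: the `URL: ` prefix template, the function names, and the `cooldown_start` parameter are all assumptions made for the example.

```python
def format_example(text, url=None, cooldown=False):
    """Return the training string for one document.

    During the conditioning phase the metadata is prepended to the text;
    during cooldown (or when no metadata exists) the text is used as-is.
    The "URL: " template is a hypothetical choice for this sketch.
    """
    if cooldown or not url:
        return text
    return f"URL: {url}\n\n{text}"


def build_stream(docs, cooldown_start):
    """Format (url, text) pairs given in training order.

    Documents at index >= cooldown_start are emitted without metadata.
    """
    return [
        format_example(text, url, cooldown=(i >= cooldown_start))
        for i, (url, text) in enumerate(docs)
    ]


# Hypothetical two-document corpus: the second document falls in the cooldown phase.
docs = [("en.wikipedia.org", "Paris is the capital of France."),
        ("example.com", "Buy one, get one free.")]
stream = build_stream(docs, cooldown_start=1)
```

At inference time the same prefix could be prepended to a prompt to steer the model, or omitted entirely.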
nan
Article 476
Title@2025-06-27 (5): QuickSilver – Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization
Title: QuickSilver – Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization | QuickSilver – Beschleunigung der LLM-Inferenz durch dynamisches Token-Halten, KV-Überspringen, Kontext-Token-Fusion und adaptive Matryoshka-Quantisierung | QuickSilver – – 通过动态声调停止、 KV 跳过、 上下文声调融合和适应性 Matryoshka 量化加速LLLM 推断 2506.22396v1 |
Authors (10): Danush Khanna, Aditya Kumar Guru, Srivarshinee Sridhar, Zidan Ahmed, Rubhav Bahirwani, Meetu Malhotra, Vinija Jain, Aman Chadha, Amitava Das, Kripabandhu Ghosh
Inference accounts for the majority of latency and energy consumption in large language model (LLM) deployments, often exceeding 90% of total cost. While training-time efficiency has seen extensive progress, runtime optimization remains a key bottleneck, particularly under autoregressive decoding. Existing approaches – such as pruning, quantization, early exits, and speculative decoding – often require retraining, architectural changes, or disrupt decoding compatibility. We introduce QuickSilver, a modular, token-level framework that enables semantic adaptivity at inference time without altering model weights or structure. QuickSilver integrates four synergistic mechanisms: (i) Dynamic Token Halting, which halts computation for tokens with converged representations; (ii) KV Cache Skipping, which selectively suppresses memory writes to reduce attention overhead; and (iii) Contextual Token Fusion, which collapses redundant tokens into shared paths to shrink sequence length. Unlike speculative decoding or MoE routing, QuickSilver operates entirely on frozen, dense models and requires no auxiliary networks. Applied to GPT-2 and Llama-2 across WikiText-103 and C4, QuickSilver achieves up to 39.6% FLOP reduction with negligible perplexity degradation (<=0.2).
nan
Article 477
Title@2025-06-27 (5): How to Train Long-Context Language Models (Effectively)
Title: How to Train Long-Context Language Models (Effectively) | Wie man Langkontext-Sprachenmodelle ausbildet (effektiv) | 如何培训长文本语言模型(有效) 2410.02660v3 |
Authors (4): Tianyu Gao, Alexander Wettig, Howard Yen, Danqi Chen
We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information. We first establish a reliable evaluation protocol to guide model development – instead of perplexity or simple needle-in-a-haystack (NIAH) tests, we use a broad set of long-context downstream tasks, and we evaluate models after SFT as this better reveals long-context abilities. Supported by our robust evaluations, we run thorough experiments to decide the data mix for continued pre-training, the instruction tuning dataset, and many other design choices such as position extrapolation. We find that (1) code repositories and books are excellent sources of long data, but it is crucial to combine them with high-quality short-context data; (2) training with a sequence length beyond the evaluation length boosts long-context performance; (3) for SFT, using only short instruction datasets yields strong performance on long-context tasks. Our final model, ProLong-8B, which is initialized from Llama-3 and trained on 40B tokens, demonstrates state-of-the-art long-context performance among similarly sized models at a length of 128K. ProLong outperforms Llama-3.1-8B-Instruct on the majority of long-context tasks despite using only 5% as many tokens during long-context training. Additionally, ProLong can effectively process up to 512K tokens, one of the longest context windows of publicly available LMs.
nan
Article 478
Title@2025-06-27 (5): Can Video Large Multimodal Models Think Like Doubters-or Double-Down: A Study on Defeasible Video Entailment
Title: Can Video Large Multimodal Models Think Like Doubters-or Double-Down: A Study on Defeasible Video Entailment | Kann Video Große multimodale Modelle denken wie Doppel-oder Doppel-Down: Eine Studie über defensible Video Entailment | 大型多模式模型思考像质疑者或双向下:关于失败视频内容的研究 2506.22385v1 |
Authors (4): Yue Zhang, Jilei Sun, Yunhui Guo, Vibhav Gogate
Video Large Multimodal Models (VLMMs) have made impressive strides in understanding video content, but they often struggle with abstract and adaptive reasoning: the ability to revise their interpretations when new information emerges. In reality, conclusions are rarely set in stone; additional context can strengthen or weaken an initial inference. To address this, we introduce Defeasible Video Entailment (DVidE), a new task that challenges models to think like doubters, constantly updating their reasoning based on evolving evidence. In DVidE, given a video premise and a textual hypothesis, models must determine whether a new update strengthens or weakens the hypothesis (classification version) or generate a coherent update that modifies the entailment relationship (generation version). For solving the classification task, we propose the Chain of Counterfactual Thought framework, utilizing counterfactual reasoning, ASR-enhanced video content, and rationale refinement to reduce inference bias. For the generation task, we develop a framework that combines ASR output with a Large Language Model (LLM) to produce coherent, contextually relevant updates aligned with the intended strengthener or weakener goals. Additionally, we introduce a novel benchmark dataset, with strengthener/weakener annotations and an LLM-based evaluation metric specifically designed for assessing generative performance. Experimental results demonstrate significant improvements, highlighting the effectiveness of our proposed method in enhancing the dynamic reasoning capabilities of VLMMs.
nan
Article 479
Title@2025-06-27 (5): Oldies but Goldies: The Potential of Character N-grams for Romanian Texts
Title: Oldies but Goldies: The Potential of Character N-grams for Romanian Texts | Oldies but Goldies: Das Potential des Charakters N-Gramms für rumänische Texte | 旧的但金的:罗马尼亚文本的字符N克潜力 2506.15650v2 |
Authors (3): Dana Lupsa, Sanda-Maria Avram, Radu Lupsa
This study addresses the problem of authorship attribution for Romanian texts using the ROST corpus, a standard benchmark in the field. We systematically evaluate six machine learning techniques: Support Vector Machine (SVM), Logistic Regression (LR), k-Nearest Neighbors (k-NN), Decision Trees (DT), Random Forests (RF), and Artificial Neural Networks (ANN), employing character n-gram features for classification. Among these, the ANN model achieved the highest performance, including perfect classification in four out of fifteen runs when using 5-gram features. These results demonstrate that lightweight, interpretable character n-gram approaches can deliver state-of-the-art accuracy for Romanian authorship attribution, rivaling more complex methods. Our findings highlight the potential of simple stylometric features in resource-constrained or under-studied language settings.
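To make the feature type concrete, here is a minimal sketch of character n-gram extraction. The function name and the sample string are illustrative assumptions; the paper's preprocessing and classifier setup are not reproduced.

```python
from collections import Counter


def char_ngrams(text, n=5):
    """Count every overlapping character n-gram in text.

    Texts shorter than n yield an empty Counter.
    """
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# Small illustration with bigrams; 5-grams would be computed the same way.
bigram_counts = char_ngrams("abcab", n=2)
```

The resulting counts can be turned into a fixed-length vector (for example, over the most frequent n-grams in the training set) and passed to any of the classifiers listed above.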
nan
Article 480
Title@2025-06-27 (5): Probabilistic Optimality for Inference-time Scaling
Title: Probabilistic Optimality for Inference-time Scaling | Probabilistische Optimalität für Inferenz-Zeitskalierung | 推推时间缩放的概率概率优化度 2506.22376v1 |
Authors (5): Youkang Wang, Jian Wang, Rubing Chen, Xiao-Yong Wei, Qing Li
Inference-time scaling has emerged as a powerful technique for enhancing the reasoning performance of Large Language Models (LLMs). However, existing approaches often rely on heuristic strategies for parallel sampling, lacking a principled foundation. To address this gap, we propose a probabilistic framework that formalizes the optimality of inference-time scaling under the assumption that parallel samples are independently and identically distributed (i.i.d.), and where the Best-of-N selection strategy follows a probability distribution that can be estimated. Within this framework, we derive a theoretical lower bound on the required number of samples to achieve a target performance level, providing the first principled guidance for compute-efficient scaling. Leveraging this insight, we develop OptScale, a practical algorithm that dynamically determines the optimal number of sampled responses. OptScale employs a language model-based predictor to estimate probabilistic prior parameters, enabling it to determine the minimal number of samples needed to satisfy predefined performance thresholds and confidence levels. Extensive experiments on mathematical reasoning benchmarks (including MATH-500, GSM8K, AIME, and AMC) demonstrate that OptScale significantly reduces sampling overhead while matching or exceeding state-of-the-art reasoning performance. Our work offers both a theoretical foundation and a practical solution for principled inference-time scaling, addressing a critical gap in the efficient deployment of LLMs for complex reasoning.
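The idea of a minimum sample count can be illustrated with a much simpler model than the paper's. The sketch below assumes each i.i.d. sample is correct with a known probability `p_success` and that selection always picks a correct sample when one exists. Both are simplifying assumptions made here, and this is not the bound derived by the authors. Under them, the chance that N samples contain at least one correct answer is 1 - (1 - p)^N.

```python
def min_samples(p_success, target):
    """Smallest N such that 1 - (1 - p_success) ** N >= target.

    Assumes i.i.d. samples, a known per-sample success probability, and a
    perfect selector (simplifications for illustration only).
    """
    if not (0.0 < p_success < 1.0 and 0.0 < target < 1.0):
        raise ValueError("p_success and target must lie strictly between 0 and 1")
    n = 1
    # Increase N until the probability of at least one success reaches the target.
    while 1.0 - (1.0 - p_success) ** n < target:
        n += 1
    return n


# With a 50% per-sample success rate, a 90% target needs four samples.
needed = min_samples(0.5, 0.9)
```

In this toy model the required N grows quickly as `p_success` falls, which is why an adaptive method that estimates difficulty per question could save samples on easy inputs.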
nan
Article 481
Title@2025-06-27 (5): Towards Fair Rankings: Leveraging LLMs for Gender Bias Detection and Measurement
Title: Towards Fair Rankings: Leveraging LLMs for Gender Bias Detection and Measurement | Auf dem Weg zu fairen Rankings: LLM-Leveraging für Gender-Bias-Erkennung und -Messung | 争取公平评分:利用 “ 性别比重 “ 检测和计量的杠杆作用LMs 2506.22372v1 |
Authors (4): Maryam Mousavian, Zahra Abbasiantaeb, Mohammad Aliannejadi, Fabio Crestani
The presence of social biases in Natural Language Processing (NLP) and Information Retrieval (IR) systems is an ongoing challenge, which underlines the importance of developing robust approaches to identifying and evaluating such biases. In this paper, we aim to address this issue by leveraging Large Language Models (LLMs) to detect and measure gender bias in passage ranking. Existing gender fairness metrics rely on lexical- and frequency-based measures, leading to various limitations, e.g., missing subtle gender disparities. Building on our LLM-based gender bias detection method, we introduce a novel gender fairness metric, named Class-wise Weighted Exposure (CWEx), aiming to address existing limitations. To measure the effectiveness of our proposed metric and study LLMs’ effectiveness in detecting gender bias, we annotate a subset of the MS MARCO Passage Ranking collection and release our new gender bias collection, called MSMGenderBias, to foster future research in this area. Our extensive experimental results on various ranking models show that our proposed metric offers a more detailed evaluation of fairness compared to previous metrics, with improved alignment to human labels (58.77% for Grep-BiasIR, and 18.51% for MSMGenderBias, measured using Cohen’s Kappa agreement), effectively distinguishing gender bias in ranking. By integrating LLM-driven bias detection, an improved fairness metric, and gender bias annotations for an established dataset, this work provides a more robust framework for analyzing and mitigating bias in IR systems.
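The agreement figures above are reported as Cohen's Kappa. As a reminder of what that statistic measures, here is a minimal sketch of the standard two-annotator formula, kappa = (p_o - p_e) / (1 - p_e). The label values and variable names are made up for illustration and do not come from the paper's data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: "b" = biased, "n" = neutral.
human = ["b", "b", "n", "n"]
model = ["b", "n", "n", "n"]
kappa = cohens_kappa(human, model)
```

A kappa of 0 means agreement no better than chance and 1 means perfect agreement, so it is a stricter measure than raw accuracy when one label dominates.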
nan
Article 482
Title@2025-06-27 (5): Robust Detection of Watermarks for Large Language Models Under Human Edits
Title: Robust Detection of Watermarks for Large Language Models Under Human Edits | Robuste Erkennung von Wasserzeichen für große Sprachmodelle unter menschlichen Bearbeitungen | 人类版下大型语言模型水印的强力探测 2411.13868v2 |
Authors (5): Xiang Li, Feng Ruan, Huiyuan Wang, Qi Long, Weijie J. Su
Watermarking has offered an effective approach to distinguishing text generated by large language models (LLMs) from human-written text. However, the pervasive presence of human edits on LLM-generated text dilutes watermark signals, thereby significantly degrading detection performance of existing methods. In this paper, by modeling human edits through mixture model detection, we introduce a new method in the form of a truncated goodness-of-fit test for detecting watermarked text under human edits, which we refer to as Tr-GoF. We prove that the Tr-GoF test achieves optimality in robust detection of the Gumbel-max watermark in a certain asymptotic regime of substantial text modifications and vanishing watermark signals. Importantly, Tr-GoF achieves this optimality adaptively as it does not require precise knowledge of human edit levels or probabilistic specifications of the LLMs, in contrast to the optimal but impractical (Neyman–Pearson) likelihood ratio test. Moreover, we establish that the Tr-GoF test attains the highest detection efficiency rate in a certain regime of moderate text modifications. In stark contrast, we show that sum-based detection rules, as employed by existing methods, fail to achieve optimal robustness in both regimes because the additive nature of their statistics is less resilient to edit-induced noise. Finally, we demonstrate the competitive and sometimes superior empirical performance of the Tr-GoF test on both synthetic data and open-source LLMs in the OPT and LLaMA families.
nan
Article 483
Title@2025-06-27 (5): Why Are Parsing Actions for Understanding Message Hierarchies Not Random?
Title: Why Are Parsing Actions for Understanding Message Hierarchies Not Random? | Warum sind Parsing-Maßnahmen, um Botschaftshierarchien zu verstehen, nicht zufällig? | 为什么为了解信件等级而采取分析行动不是随机的? 2506.22366v1 |
Authors (3): Daichi Kato, Ryo Ueda, Yusuke Miyao
If humans understood language by randomly selecting parsing actions, it might have been necessary to construct a robust symbolic system capable of being interpreted under any hierarchical structure. However, human parsing strategies do not seem to follow such a random pattern. Why is that the case? In fact, a previous study on emergent communication using models with hierarchical biases has reported that agents adopting random parsing strategies (ones that deviate significantly from human language comprehension) can achieve high communication accuracy. In this study, we investigate this issue by making two simple and natural modifications to the experimental setup: (I) we use more complex inputs that have hierarchical structures, such that random parsing makes semantic interpretation more difficult, and (II) we incorporate a surprisal-related term, which is known to influence the order of words and characters in natural language, into the objective function. With these changes, we evaluate whether agents employing random parsing strategies still maintain high communication accuracy.
nan
Article 484
Title@2025-06-27 (5): Optimal Estimation of Watermark Proportions in Hybrid AI-Human Texts
Title: Optimal Estimation of Watermark Proportions in Hybrid AI-Human Texts | Optimale Schätzung von Wasserzeichenanteilen in Hybrid-KI-Humantexten | 对混合的AI-人类文案文中水标记比例的最佳估计 2506.22343v1 |
Authors (6): Xiang Li, Garrett Wen, Weiqing He, Jiayuan Wu, Qi Long, Weijie J. Su
Text watermarks in large language models (LLMs) are an increasingly important tool for detecting synthetic text and distinguishing human-written content from LLM-generated text. While most existing studies focus on determining whether entire texts are watermarked, many real-world scenarios involve mixed-source texts, which blend human-written and watermarked content. In this paper, we address the problem of optimally estimating the watermark proportion in mixed-source texts. We cast this problem as estimating the proportion parameter in a mixture model based on pivotal statistics. First, we show that this parameter is not even identifiable in certain watermarking schemes, let alone consistently estimable. In stark contrast, for watermarking methods that employ continuous pivotal statistics for detection, we demonstrate that the proportion parameter is identifiable under mild conditions. We propose efficient estimators for this class of methods, which include several popular unbiased watermarks as examples, and derive minimax lower bounds for any measurable estimator based on pivotal statistics, showing that our estimators achieve these lower bounds. Through evaluations on both synthetic data and mixed-source text generated by open-source models, we demonstrate that our proposed estimators consistently achieve high estimation accuracy.
nan
Article 485
Title@2025-06-27 (5): Multi-Turn Code Generation Through Single-Step Rewards
Title: Multi-Turn Code Generation Through Single-Step Rewards | Multi-Turn-Code-Generierung durch Single-Step-Rewards | 通过单级奖励生成多发代码 2502.20380v2 |
Authors (6): Arnav Kumar Jain, Gonzalo Gonzalez-Pumariega, Wayne Chen, Alexander M Rush, Wenting Zhao, Sanjiban Choudhury
We address the problem of code generation from multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards. We propose a simple yet scalable approach, μCode, that solves multi-turn code generation using only single-step rewards. Our key insight is that code generation is a one-step recoverable MDP, where the correct code can be recovered from any intermediate code state in a single turn. μCode iteratively trains both a generator to provide code solutions conditioned on multi-turn execution feedback and a verifier to score the newly generated code. Experimental evaluations show that our approach achieves significant improvements over the state-of-the-art baselines. We provide analysis of the design choices of the reward models and policy, and show the efficacy of μCode at utilizing the execution feedback. Our code is available at https://github.com/portal-cornell/muCode.
nan
Article 486
Title@2025-06-27 (5): Evaluating Scoring Bias in LLM-as-a-Judge
Title: Evaluating Scoring Bias in LLM-as-a-Judge | Bewertung von Bias in LLM-as-a-Richter | 以LLM-as-a-Judge方式评价偏见 2506.22316v1 |
Authors (5): Qingquan Li, Shaoyu Dou, Kailai Shao, Chao Chen, Haixiang Hu
The remarkable performance of Large Language Models (LLMs) gives rise to "LLM-as-a-Judge", where LLMs are employed as evaluators for complex tasks. Moreover, it has been widely adopted across fields such as Natural Language Processing (NLP), preference learning, and various specific domains. However, there are various biases within LLM-as-a-Judge, which adversely affect the fairness and reliability of judgments. Current research on evaluating or mitigating bias in LLM-as-a-Judge predominantly focuses on comparison-based evaluations, while systematic investigations into bias in scoring-based evaluations remain limited. Therefore, we define scoring bias in LLM-as-a-Judge as the change in scores that occurs when judge models are subjected to bias-related perturbations, and provide a well-designed framework to comprehensively evaluate scoring bias. We augment existing LLM-as-a-Judge benchmarks through data synthesis to construct our evaluation dataset and design multi-faceted evaluation metrics. Our experimental results demonstrate that the scoring stability of existing judge models is disrupted by scoring biases. Further exploratory experiments and discussions provide valuable insights into the design of scoring prompt templates and the mitigation of scoring biases on aspects such as score rubrics, score IDs, and reference answer selection.
nan
Article 487
Title@2025-06-27 (5): Conceptual Topic Aggregation
Title: Conceptual Topic Aggregation | Begriffliche Aggregation | 专题汇总概念 2506.22309v1 |
Authors (4): Klara M. Gutekunst, Dominik Dürrschnabel, Johannes Hirth, Gerd Stumme
The vast growth of data has rendered traditional manual inspection infeasible, necessitating the adoption of computational methods for efficient data exploration. Topic modeling has emerged as a powerful tool for analyzing large-scale textual datasets, enabling the extraction of latent semantic structures. However, existing methods for topic modeling often struggle to provide interpretable representations that facilitate deeper insights into data structure and content. In this paper, we propose FAT-CAT, an approach based on Formal Concept Analysis (FCA) to enhance meaningful topic aggregation and visualization of discovered topics. Our approach can handle diverse topics and file types – grouped by directories – to construct a concept lattice that offers a structured, hierarchical representation of their topic distribution. In a case study on the ETYNTKE dataset, we evaluate the effectiveness of our approach against other representation methods to demonstrate that FCA-based aggregation provides more meaningful and interpretable insights into dataset composition than existing topic modeling techniques.
nan
Article 488
Title@2025-06-27 (5): Detection of Personal Data in Structured Datasets Using a Large Language Model
Title: Detection of Personal Data in Structured Datasets Using a Large Language Model | Erkennung personenbezogener Daten in strukturierten Datensätzen mittels eines großen Sprachmodells | 利用大语言模式在结构化数据集中探测个人数据 2506.22305v1 |
Authors (3): Albert Agisha Ntwali, Luca Rück, Martin Heckmann
We propose a novel approach for detecting personal data in structured datasets, leveraging GPT-4o, a state-of-the-art Large Language Model. A key innovation of our method is the incorporation of contextual information: in addition to a feature’s name and values, we utilize information from other feature names within the dataset as well as the dataset description. We compare our approach to alternative methods, including Microsoft Presidio and CASSED, evaluating them on multiple datasets: DeSSI, a large synthetic dataset, datasets we collected from Kaggle and OpenML as well as MIMIC-Demo-Ext, a real-world dataset containing patient information from critical care units. Our findings reveal that detection performance varies significantly depending on the dataset used for evaluation. CASSED excels on DeSSI, the dataset on which it was trained. Performance on the medical dataset MIMIC-Demo-Ext is comparable across all models, with our GPT-4o-based approach clearly outperforming the others. Notably, personal data detection in the Kaggle and OpenML datasets appears to benefit from contextual information. This is evidenced by the poor performance of CASSED and Presidio (both of which do not utilize the context of the dataset) compared to the strong results of our GPT-4o-based approach. We conclude that further progress in this field would greatly benefit from the availability of more real-world datasets containing personal information.
nan
Article 489
Title@2025-06-27 (5): All Entities are Not Created Equal: Examining the Long Tail for Ultra-Fine Entity Typing
Title: All Entities are Not Created Equal: Examining the Long Tail for Ultra-Fine Entity Typing | Alle Entities sind nicht gleich: Prüfung des langen Tails für Ultra-Fine Entity Typing | 并非所有实体都平等创建:检查超功能实体打字的长尾 2410.17355v2 |
Authors (4): Advait Deshmukh, Ashwin Umadi, Dananjay Srinivas, Maria Leonor Pacheco
Due to their capacity to acquire world knowledge from large corpora, pre-trained language models (PLMs) are extensively used in ultra-fine entity typing tasks where the space of labels is extremely large. In this work, we explore the limitations of the knowledge acquired by PLMs by proposing a novel heuristic to approximate the pre-training distribution of entities when the pre-training data is unknown. Then, we systematically demonstrate that entity-typing approaches that rely solely on the parametric knowledge of PLMs struggle significantly with entities at the long tail of the pre-training distribution, and that knowledge-infused approaches can account for some of these shortcomings. Our findings suggest that we need to go beyond PLMs to produce solutions that perform well for infrequent entities.
nan
Article 490
Title@2025-06-27 (5): COOCO – Common Objects Out-of-Context – Semantic Violation in Scenes: Investigating Multimodal Context in Referential Communication
Title: COOCO – Common Objects Out-of-Context – Semantic Violation in Scenes: Investigating Multimodal Context in Referential Communication | COOCO – Common Objects Out-of-Context – Semantische Verletzung in Szenen: Untersuchung multimodaler Kontexte in referenzieller Kommunikation | COOCO – – 共同点 – – 文本外的公用物体 – – 现场的语义违反:在公用通信中调查多模式背景 2506.22274v1 |
Authors (4): Filippo Merlo, Ece Takmaz, Wenkai Chen, Albert Gatt
Natural scenes provide us with rich contexts for object recognition and reference. In particular, knowing what type of scene one is looking at generates expectations about which objects will occur, and what their spatial configuration should be. Do Vision-Language Models (VLMs) learn to rely on scene contexts in a similar way, when generating references to objects? To address this question, we introduce the Common Objects Out-of-Context (COOCO) dataset and test to what extent VLMs rely on scene context to refer to objects under different degrees of scene-object congruency, and different perturbations. Our findings show that models leverage scene context adaptively, depending on both the semantic relatedness between object and scene and the level of noise. In particular, models rely more on context under high target-scene congruence or when objects are degraded. Attention analysis reveals that successful object categorisation involves increased focus on the target in mid-level layers, especially under moderate noise, suggesting that VLMs dynamically balance local and contextual information for reference generation. We make our dataset, code and models available at https://github.com/cs-nlp-uu/scenereg.
nan
Article 491
Title@2025-06-27 (5): KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding
Title: KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding | KITAB-Bench: Ein umfassender Multi-Domain-Benchmark für arabisches OCR und Dokumentenverständnis | KITAB-Bench:阿拉伯文OCR和文件理解的综合多领域综合基准 2502.14949v2 |
Authors (10): Ahmed Heakl, Abdullah Sohail, Mukul Ranjan, Rania Hossam, Ghazi Shazan Ahmad, Mohamed El-Geish, Omar Maher, Zhiqiang Shen, Fahad Khan, Salman Khan
With the growing adoption of Retrieval-Augmented Generation (RAG) in document processing, robust text recognition has become increasingly critical for knowledge extraction. While OCR (Optical Character Recognition) for English and other languages benefits from large datasets and well-established benchmarks, Arabic OCR faces unique challenges due to its cursive script, right-to-left text flow, and complex typographic and calligraphic features. We present KITAB-Bench, a comprehensive Arabic OCR benchmark that fills the gaps in current evaluation systems. Our benchmark comprises 8,809 samples across 9 major domains and 36 sub-domains, encompassing diverse document types including handwritten text, structured tables, and specialized coverage of 21 chart types for business intelligence. Our findings show that modern vision-language models (such as GPT-4o, Gemini, and Qwen) outperform traditional OCR approaches (like EasyOCR, PaddleOCR, and Surya) by an average of 60% in Character Error Rate (CER). Furthermore, we highlight significant limitations of current Arabic OCR models, particularly in PDF-to-Markdown conversion, where the best model Gemini-2.0-Flash achieves only 65% accuracy. This underscores the challenges in accurately recognizing Arabic text, including issues with complex fonts, numeral recognition errors, word elongation, and table structure detection. This work establishes a rigorous evaluation framework that can drive improvements in Arabic document analysis methods and bridge the performance gap with English OCR technologies.
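The abstract reports its headline comparison in Character Error Rate (CER). As a minimal sketch, assuming the common definition of CER as character-level edit distance divided by reference length (the benchmark's exact normalisation is not stated here), the metric could be computed as follows. The function names `levenshtein` and `cer` are illustrative and not taken from the KITAB-Bench code.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Assumed definition: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return levenshtein(ref, hyp) / len(ref)
```

Under this definition, an OCR output that drops one character from a four-character Arabic word has a CER of 0.25. Because the measure works on Unicode code points, any normalisation of diacritics or elongation has to be decided before scoring.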
nan
Article 492
Title@2025-06-27 (5): Projected Compression: Trainable Projection for Efficient Transformer Compression
Title: Projected Compression: Trainable Projection for Efficient Transformer Compression | Projektierte Kompression: Trainierbare Projektion für effiziente Transformer-Kompression | 预计压缩:高效变压器压缩培训预测 2506.22255v1 |
Authors (9): Maciej Stefaniak, Michał Krutul, Jan Małaśnicki, Maciej Pióro, Jakub Krajewski, Sebastian Jaszczur, Marek Cygan, Kamil Adamczewski, Jan Ludziejewski
Large language models have steadily increased in size to achieve improved performance; however, this growth has also led to greater inference time and computational demands. Consequently, there is rising interest in model size reduction methods. To address this issue, we propose Projected Compression, a novel model compression technique that reduces model weights by utilizing projection modules. Specifically, we first train additional projection weights while preserving access to all the original model parameters. Subsequently, these projections are merged into a lower-dimensional product matrix, resulting in a reduced-size standard Transformer-based model. Unlike alternative approaches that require additional computational overhead, our method matches the base model’s per-token computation step in FLOPs. Experimental results show that Projected Compression outperforms the comparable hard pruning and retraining approach on higher quality models. Moreover, the performance margin scales well with the number of tokens.
nan
Article 493
Title@2025-06-27 (5): Quantum-Enhanced Attention Mechanism in NLP: A Hybrid Classical-Quantum Approach
Title: Quantum-Enhanced Attention Mechanism in NLP: A Hybrid Classical-Quantum Approach | Quantenverstärkter Aufmerksamkeitsmechanismus in NLP: Hybrid-Klassisch-Quantum-Ansatz | NLP中加强的注意机制:分类-量子混合办法 2501.15630v2 |
Authors (5): S. M. Yousuf Iqbal Tomal, Abdullah Al Shafin, Debojit Bhattacharjee, MD. Khairul Amin, Rafiad Sadat Shahir
Recent advances in quantum computing have opened new pathways for enhancing deep learning architectures, particularly in domains characterized by high-dimensional and context-rich data such as natural language processing (NLP). In this work, we present a hybrid classical-quantum Transformer model that integrates a quantum-enhanced attention mechanism into the standard classical architecture. By embedding token representations into a quantum Hilbert space via parameterized variational circuits and exploiting entanglement-aware kernel similarities, the model captures complex semantic relationships beyond the reach of conventional dot-product attention. We demonstrate the effectiveness of this approach across diverse NLP benchmarks, showing improvements in both efficiency and representational capacity. The results reveal that the quantum attention layer yields globally coherent attention maps and more separable latent features, while requiring comparatively fewer parameters than classical counterparts. These findings highlight the potential of quantum-classical hybrid models to serve as a powerful and resource-efficient alternative to existing attention mechanisms in NLP.
nan
Article 494
Title@2025-06-27 (5): Fine-Tuning MIDI-to-Audio Alignment using a Neural Network on Piano Roll and CQT Representations
Title: Fine-Tuning MIDI-to-Audio Alignment using a Neural Network on Piano Roll and CQT Representations | Feintuning MIDI-to-Audio Alignment mit einem neuralen Netzwerk auf Klavierrolle und CQT-Darstellungen | 利用钢琴卷和CQT代表的神经网络,将MIDI至Audi-Audio对齐 2506.22237v1 |
Authors (4): Sebastian Murgul, Moritz Reiser, Michael Heizmann, Christoph Seibert
In this paper, we present a neural network approach for synchronizing audio recordings of human piano performances with their corresponding loosely aligned MIDI files. The task is addressed using a Convolutional Recurrent Neural Network (CRNN) architecture, which effectively captures spectral and temporal features by processing an unaligned piano roll and a spectrogram as inputs to estimate the aligned piano roll. To train the network, we create a dataset of piano pieces with augmented MIDI files that simulate common human timing errors. The proposed model achieves up to 20% higher alignment accuracy than the industry-standard Dynamic Time Warping (DTW) method across various tolerance windows. Furthermore, integrating DTW with the CRNN yields additional improvements, offering enhanced robustness and consistency. These findings demonstrate the potential of neural networks in advancing state-of-the-art MIDI-to-audio alignment.
nan
Article 495
Title@2025-06-27 (5): Leveraging In-Context Learning for Political Bias Testing of LLMs
Title: Leveraging In-Context Learning for Political Bias Testing of LLMs | Leveraging In-Context Learning for Political Bias Testing of LLMs | 利用知识学习促进LLMs的政治偏见测试 2506.22232v1 |
Authors (4): Patrick Haller, Jannis Vamvas, Rico Sennrich, Lena A. Jäger
A growing body of work has been querying LLMs with political questions to evaluate their potential biases. However, this probing method has limited stability, making comparisons between models unreliable. In this paper, we argue that LLMs need more context. We propose a new probing task, Questionnaire Modeling (QM), that uses human survey data as in-context examples. We show that QM improves the stability of question-based bias evaluation, and demonstrate that it may be used to compare instruction-tuned models to their base versions. Experiments with LLMs of various sizes indicate that instruction tuning can indeed change the direction of bias. Furthermore, we observe a trend that larger models are able to leverage in-context examples more effectively, and generally exhibit smaller bias scores in QM. Data and code are publicly available.
nan
Article 496
Title@2025-06-27 (5): TableLoRA: Low-rank Adaptation on Table Structure Understanding for Large Language Models
Title: TableLoRA: Low-rank Adaptation on Table Structure Understanding for Large Language Models | TableLoRA: Niedrigrank-Anpassung an das Verständnis der Tabellenstruktur für große Sprachmodelle | 表LORA:关于大语言模式表格结构理解的低调适应 2503.04396v2 |
Authors (8): Xinyi He, Yihao Liu, Mengyu Zhou, Yeye He, Haoyu Dong, Shi Han, Zejian Yuan, Dongmei Zhang
Tabular data are crucial in many fields, and their understanding by large language models (LLMs) under a high-parameter-efficiency paradigm is important. However, directly applying parameter-efficient fine-tuning (PEFT) techniques to tabular tasks presents significant challenges, particularly in terms of better table serialization and the representation of two-dimensional structured information within a one-dimensional sequence. To address this, we propose TableLoRA, a module designed to improve LLMs’ understanding of table structure during PEFT. It incorporates special tokens for serializing tables with a special token encoder and uses 2D LoRA to encode low-rank information on cell positions. Experiments on four tabular-related datasets demonstrate that TableLoRA consistently outperforms vanilla LoRA and surpasses various table encoding methods tested in control experiments. These findings reveal that TableLoRA, as a table-specific LoRA, enhances the ability of LLMs to process tabular data effectively, especially in low-parameter settings, demonstrating its potential as a robust solution for handling table-related tasks.
nan
Article 497
Title@2025-06-27 (5): Plant in Cupboard, Orange on Rably, Inat Aphone. Benchmarking Incremental Learning of Situation and Language Model using a Text-Simulated Situated Environment
Title: Plant in Cupboard, Orange on Rably, Inat Aphone. Benchmarking Incremental Learning of Situation and Language Model using a Text-Simulated Situated Environment | Plant in Cupboard, Orange auf Rablely, Inat Aphone. Benchmarking Incremental Lernen von Situation und Sprachmodell mit einer text-simulierten Umgebung | Inat Aphone. 使用文本模拟比照环境对状况和语言模式逐步学习进行基准评估 2502.11733v3 |
Authors (3): Jonathan Jordan, Sherzod Hakimov, David Schlangen
Large Language Models (LLMs) serve not only as chatbots but as key components in agent systems, where their common-sense knowledge significantly impacts performance as language-based planners for situated or embodied action. We assess LLMs’ incremental learning (based on feedback from the environment) and controlled in-context learning abilities using a text-based environment. We introduce a challenging yet interesting set of experiments to test i) how agents can incrementally solve tasks related to everyday objects in typical rooms of a house, where each object is discovered by interacting with the environment, ii) the controlled in-context learning abilities and efficiency of agents, by providing short information about the locations of objects and rooms to check how much faster the task can be solved, and finally iii) how well LLMs infer the meaning of unknown words from environmental feedback, using synthetic pseudo-English words. Results show a substantial performance gap between larger commercial models and open-weight models, but almost all models struggle with the synthetic-word experiments.
nan
Article 498
Title@2025-06-27 (5): Exploring Modularity of Agentic Systems for Drug Discovery
Title: Exploring Modularity of Agentic Systems for Drug Discovery | Erforschung der Modularität von Wirkstoffsystemen für die Drogenentdeckung | 探索药物发现剂系统模式 2506.22189v1 |
Authors (4): Laura van Weesep, Samuel Genheden, Ola Engkvist, Jens Sjölund
Large language models (LLMs) and agentic systems present exciting opportunities to accelerate drug discovery and design. In this study, we critically examine the modularity of LLM-based agentic systems for drug discovery, i.e., whether parts of the agentic system such as the LLM are interchangeable, a topic that has received limited attention in drug discovery applications. We compare the performance of different LLMs and the effectiveness of tool-calling agents versus code-generating agents in this domain. Our case study, comparing performance in orchestrating tools for chemistry and drug discovery using an LLM-as-a-judge score, shows that Claude-3.5-Sonnet, Claude-3.7-Sonnet and GPT-4o outperform alternative language models such as Llama-3.1-8B, Llama-3.1-70B, GPT-3.5-Turbo, and Nova-Micro. Although we confirm that code-generating agents outperform the tool-calling ones on average, we show that this is highly question- and model-dependent. Furthermore, the impact of replacing system prompts is dependent on the specific question asked and the model used, underscoring that – even in this particular domain – one cannot just replace language models without considering prompt re-engineering. Our study highlights the necessity of further research into the modularity of agentic systems to enable the development of stable and scalable solutions for real-world problems.
nan
Article 499
Title@2025-06-27 (5): LLM as GNN: Graph Vocabulary Learning for Text-Attributed Graph Foundation Models
Title: LLM as GNN: Graph Vocabulary Learning for Text-Attributed Graph Foundation Models | LLM als GNN: Graph Vocabulary Learning für text-Attributed Graph Foundation Models | 作为GNN的LLMLM:文字图表基础模型图表词汇学习 2503.03313v2 |
Authors (9): Xi Zhu, Haochen Xue, Ziwei Zhao, Wujiang Xu, Jingyuan Huang, Minghao Guo, Qifan Wang, Kaixiong Zhou, Yongfeng Zhang
Text-Attributed Graphs (TAGs), where each node is associated with text descriptions, are ubiquitous in real-world scenarios. They typically exhibit distinctive structure and domain-specific knowledge, motivating the development of a Graph Foundation Model (GFM) that generalizes across diverse graphs and tasks. Despite large efforts to integrate Large Language Models (LLMs) and Graph Neural Networks (GNNs) for TAGs, existing approaches suffer from decoupled architectures with two-stage alignment, limiting their synergistic potential. Even worse, existing methods assign out-of-vocabulary (OOV) tokens to graph nodes, leading to graph-specific semantics, token explosion, and incompatibility with task-oriented prompt templates, which hinders cross-graph and cross-task transferability. To address these challenges, we propose PromptGFM, a versatile GFM for TAGs grounded in graph vocabulary learning. PromptGFM comprises two key components: (1) Graph Understanding Module, which explicitly prompts LLMs to replicate the finest GNN workflow within the text space, facilitating seamless GNN-LLM integration and elegant graph-text alignment; (2) Graph Inference Module, which establishes a language-based graph vocabulary ensuring expressiveness, transferability, and scalability, enabling readable instructions for LLM fine-tuning. Extensive experiments demonstrate the superiority and transferability of our approach across diverse graphs and tasks. The code is available at https://github.com/agiresearch/PromptGFM.
nan
Article 500
Title@2025-06-27 (5): Refining Salience-Aware Sparse Fine-Tuning Strategies for Language Models
Title: Refining Salience-Aware Sparse Fine-Tuning Strategies for Language Models | Verfeinerung von Salience-Aware Sparse Feintuning-Strategien für Sprachmodelle | 改进语言模型的显著性感知稀疏微调策略 2412.13488v2 |
Authors (6): Xinxin Liu, Aaron Thomas, Cheng Zhang, Jianyi Cheng, Yiren Zhao, Xitong Gao
Parameter-Efficient Fine-Tuning (PEFT) has gained prominence through low-rank adaptation methods like LoRA. In this paper, we focus on sparsity-based PEFT (SPEFT), which introduces trainable sparse adaptations to the weight matrices in the model, offering greater flexibility in selecting fine-tuned parameters compared to low-rank methods. We conduct the first systematic evaluation of salience metrics for SPEFT, inspired by zero-cost NAS proxies, and find that simple gradient-based metrics are reliable, with results on par with the best alternatives, offering both computational efficiency and robust performance. Additionally, we compare static and dynamic masking strategies, finding that static masking, which predetermines non-zero entries before training, delivers efficiency without sacrificing performance, while dynamic masking offers no substantial benefits. Across NLP tasks, a simple gradient-based, static SPEFT consistently outperforms other fine-tuning methods for LLMs, providing a simple yet effective baseline for SPEFT. Our work challenges the notion that complexity is necessary for effective PEFT, while our open-source framework establishes a reproducible benchmark for future research, which is available at https://github.com/0-ml/speft.
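To make the idea of gradient-based salience with static masking concrete, here is a minimal sketch on a flat list of parameters. It assumes the salience score is simply the absolute gradient and that the top-k entries are selected once before training; the paper's actual metrics and selection rule may differ. The helper names `static_topk_mask` and `sparse_update` are hypothetical.

```python
def static_topk_mask(grads, k):
    """Pick the k entries with the largest |gradient| (assumed salience score).

    The mask is computed once, before training, and then kept fixed.
    Ties are broken by the lower index.
    """
    ranked = sorted(range(len(grads)), key=lambda i: (-abs(grads[i]), i))
    keep = set(ranked[:k])
    return [i in keep for i in range(len(grads))]


def sparse_update(weights, grads, mask, lr):
    """One SGD step that only touches the entries selected by the mask."""
    return [w - lr * g if m else w for w, g, m in zip(weights, grads, mask)]


# Illustrative values only, not data from the paper.
grads = [0.1, -0.9, 0.05, 0.4]
mask = static_topk_mask(grads, k=2)
weights = sparse_update([1.0, 1.0, 1.0, 1.0], grads, mask, lr=0.5)
```

With a static mask the set of trainable entries never changes, so the sparse adaptation can be stored as a fixed index list plus values. A dynamic strategy would recompute the mask during training.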
nan
Article 501
Title@2025-06-27 (5): MisinfoTeleGraph: Network-driven Misinformation Detection for German Telegram Messages
Title: MisinfoTeleGraph: Network-driven Misinformation Detection for German Telegram Messages | MisinfoTeleGraph: Netzwerkgesteuerte Fehlinformationserkennung für deutsche Telegrammnachrichten | MisinfoTeleGraph:德国电讯用网络驱动的错误信息探测 2506.22529v1 |
Authors (5): Lu Kalkbrenner, Veronika Solopova, Steffen Zeiler, Robert Nickel, Dorothea Kolossa
Connectivity and message propagation are central, yet often underutilized, sources of information in misinformation detection – especially on poorly moderated platforms such as Telegram, which has become a critical channel for misinformation dissemination, namely in the German electoral context. In this paper, we introduce Misinfo-TeleGraph, the first German-language Telegram-based graph dataset for misinformation detection. It includes over 5 million messages from public channels, enriched with metadata, channel relationships, and both weak and strong labels. These labels are derived via semantic similarity to fact-checks and news articles using M3-embeddings, as well as manual annotation. To establish reproducible baselines, we evaluate both text-only models and graph neural networks (GNNs) that incorporate message forwarding as a network structure. Our results show that GraphSAGE with LSTM aggregation significantly outperforms text-only baselines in terms of Matthews Correlation Coefficient (MCC) and F1-score. We further evaluate the impact of subscribers, view counts, and automatically versus human-created labels on performance, and highlight both the potential and challenges of weak supervision in this domain. This work provides a reproducible benchmark and open dataset for future research on misinformation detection in German-language Telegram networks and other low-moderation social platforms.
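The two reported metrics can be derived from a binary confusion matrix. The sketch below uses the standard definitions of the Matthews Correlation Coefficient and F1-score; returning 0.0 when a denominator is zero is a common convention assumed here, not something stated in the abstract.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from binary confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention for the degenerate case
    return (tp * tn - fp * fn) / denom


def f1(tp: int, fp: int, fn: int) -> float:
    """F1-score: harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    if denom == 0:
        return 0.0  # assumed convention when there are no positives at all
    return 2 * tp / denom
```

MCC takes true negatives into account, which is why it is often preferred over F1 when the classes are imbalanced.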
nan
Article 502
Title@2025-06-27 (5): Training Language Model to Critique for Better Refinement
Title: Training Language Model to Critique for Better Refinement | Training Sprachmodell zu Kritik für eine bessere Verfeinerung | 改进改进工作简化语言培训模式培训语言模式 2506.22157v1 |
Authors (11): Tianshu Yu, Chao Xiang, Mingchuan Yang, Pei Ke, Bosi Wen, Cunxiang Wang, Jiale Cheng, Li Zhang, Xinyu Mu, Chuxiong Sun, Minlie Huang
Large language models (LLMs) have demonstrated remarkable evaluation and critique capabilities, providing insightful feedback and identifying flaws in various tasks. However, limited research has explored which types of critiques are most effective for improving model responses or how to generate such critiques. To address this gap, we introduce Refinement-oriented Critique Optimization (RCO), a novel framework designed to train critic models using refinement signals. RCO uses a feedback loop where critiques, generated by the critic model, guide the actor model in refining its responses. The critique utility (CU) quantifies the effectiveness of these refinements, serving as the reward signal for training the critic model. By focusing on critiques that lead to better refinements, RCO eliminates the need for direct critique preference assessment, ensuring that critiques driving meaningful improvements are rewarded. We evaluate RCO across five tasks, i.e., dialog generation, summarization, question answering, mathematical reasoning, and code generation, and show that it significantly outperforms traditional methods and open-source models in terms of critique quality and refinement outcomes. Our contributions include the introduction of RCO, a novel supervision scheme based on refined response preferences, and comprehensive experimental results that highlight the method’s effectiveness in enhancing LLM critique-refinement loops.
nan
Article 503
Title@2025-06-27 (5): MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot
Title: MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot | MedRAG: Verbesserung der retrieval-augmentierten Generation mit Wissen Graph-Eliciated Reasoning für Healthcare Copilot | Medrag:加强利用知识图图获取保健理由的回收养殖业 2502.04413v2 |
Authors (4): Xuejiao Zhao, Siyan Liu, Su-Yin Yang, Chunyan Miao
Retrieval-augmented generation (RAG) is a well-suited technique for retrieving privacy-sensitive Electronic Health Records (EHR). It can serve as a key module of the healthcare copilot, helping reduce misdiagnosis for healthcare practitioners and patients. However, the diagnostic accuracy and specificity of existing heuristic-based RAG models used in the medical domain are inadequate, particularly for diseases with similar manifestations. This paper proposes MedRAG, a RAG model enhanced by knowledge graph (KG)-elicited reasoning for the medical domain that retrieves diagnosis and treatment recommendations based on manifestations. MedRAG systematically constructs a comprehensive four-tier hierarchical diagnostic KG encompassing critical diagnostic differences of various diseases. These differences are dynamically integrated with similar EHRs retrieved from an EHR database, and reasoned within a large language model. This process enables more accurate and specific decision support, while also proactively providing follow-up questions to enhance personalized medical decision-making. MedRAG is evaluated on both a public dataset DDXPlus and a private chronic pain diagnostic dataset (CPDD) collected from Tan Tock Seng Hospital, and its performance is compared against various existing RAG methods. Experimental results show that, leveraging the information integration and relational abilities of the KG, our MedRAG provides more specific diagnostic insights and outperforms state-of-the-art models in reducing misdiagnosis rates. Our code will be available at https://github.com/SNOWTEAM2023/MedRAG
nan
Article 504
Title@2025-06-27 (5): Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX
Title: Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX | Auge des Urteils: Die Bewertung der russischsprachigen LLMs mit POLLUX | 判断之眼:用POLLUX对讲俄语的LLMs的评价进行分解 2505.24616v3 |
Authors (11): Nikita Martynov, Anastasia Mordasheva, Dmitriy Gorbetskiy, Danil Astafurov, Ulyana Isaeva, Elina Basyrova, Sergey Skachkov, Victoria Berestova, Nikolay Ivanov, Valeriia Zanina, Alena Fenogenova
We introduce POLLUX, a comprehensive open-source benchmark designed to evaluate the generative capabilities of large language models (LLMs) in Russian. Our main contribution is a novel evaluation methodology that enhances the interpretability of LLM assessment. For each task type, we define a set of detailed criteria and develop a scoring protocol where models evaluate responses and provide justifications for their ratings. This enables transparent, criteria-driven evaluation beyond traditional resource-consuming, side-by-side human comparisons. POLLUX includes a detailed, fine-grained taxonomy of 35 task types covering diverse generative domains such as code generation, creative writing, and practical assistant use cases, totaling 2,100 manually crafted and professionally authored prompts. Each task is categorized by difficulty (easy/medium/hard), with experts constructing the dataset entirely from scratch. We also release a family of LLM-as-a-Judge (7B and 32B) evaluators trained for nuanced assessment of generative outputs. This approach provides scalable, interpretable evaluation and annotation tools for model development, effectively replacing costly and less precise human judgments.
nan
Article 505
Title@2025-06-27 (5): SAGE: Spliced-Audio Generated Data for Enhancing Foundational Models in Low-Resource Arabic-English Code-Switched Speech Recognition
Title: SAGE: Spliced-Audio Generated Data for Enhancing Foundational Models in Low-Resource Arabic-English Code-Switched Speech Recognition | SAGE: Spliced-Audio Generated Data for Enhanced Foundational Models in Low-Resource Arabisch-Englisch Code-Switched Speech Recognition | SAGE:用于加强低资源阿拉伯语-英语代码转换语音识别中基础模型的 2506.22143v1 |
Authors (2): Muhammad Umar Farooq, Oscar Saz
This paper investigates the performance of various speech SSL models on dialectal Arabic (DA) and Arabic-English code-switched (CS) speech. To address data scarcity, a modified audio-splicing approach is introduced to generate artificial CS speech data. Fine-tuning an already fine-tuned SSL model with the proposed Spliced-Audio Generated (SAGE) data results in an absolute improvement on Word Error Rate (WER) of 7.8% on Arabic and English CS benchmarks. Additionally, an Experience Replay (ER) inspired approach is proposed to enhance generalisation across DA and CS speech while mitigating catastrophic forgetting. Integrating an out-of-domain 3-gram language model reduces the overall mean WER from 31.7% to 26.6%. Few-shot fine-tuning for code-switching benchmarks further improves WER by 4.9%. A WER of 31.1% on Arabic-English CS benchmarks surpasses large-scale multilingual models, including USM and Whisper-large-v2 (both over ten times larger) by an absolute margin of 5.5% and 8.4%, respectively.
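The abstract quotes its gains as absolute WER differences. As a small worked example, the reported mean WER change from adding the 3-gram language model (31.7% to 26.6%) can be expressed both as an absolute and as a relative reduction. The function name `wer_reduction` is illustrative.

```python
def wer_reduction(before: float, after: float) -> tuple[float, float]:
    """Return (absolute, relative) reduction between two WER values in percent."""
    absolute = before - after       # percentage points
    relative = absolute / before    # fraction of the original error removed
    return absolute, relative


# Figures taken from the abstract: overall mean WER before and after the LM.
absolute, relative = wer_reduction(31.7, 26.6)
```

Here the absolute reduction is about 5.1 percentage points, which corresponds to roughly 16% of the original errors.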
nan
Article 506
Title@2025-06-27 (5): DAPFAM: A Domain-Aware Patent Retrieval Dataset Aggregated at the Family Level
Title: DAPFAM: A Domain-Aware Patent Retrieval Dataset Aggregated at the Family Level | DAPFAM: A Domain-Aware Patent Retrieval Dataset Aggregiert auf Familienebene | DAPFAM: 家庭一级综合域-软件专利检索数据集 2506.22141v1 |
Authors (3): Iliass Ayaou, Denis Cavallucci, Hicham Chibane
In the landscape of publicly available patent retrieval datasets, the need for explicit in-domain and out-of-domain labeling, multi-jurisdiction coverage, balanced query domain representation and manageable sizes that support sub-document-level experiments on moderate computational resources is often overlooked. To address these gaps, we propose DAPFAM, a new open-access, domain-aware patent retrieval dataset constructed at the simple-family level. The dataset contains 1,247 domain-balanced full-text query families and 45,336 full-text target families. The dataset is enriched by clear relevance judgments (forward/backward citations as positive links, random negatives), as well as explicit in-domain or out-of-domain relationships via a novel labelling scheme based on International Patent Classification (IPC) codes, resulting in 49,869 evaluation pairs. The dataset is multi-jurisdictional, requires little to no preprocessing for retrieval evaluation, and remains of a size manageable for entities with limited resources, allowing for sub-document-level retrieval experiments without excessive computational costs. We describe our three-step data-curation pipeline, present comprehensive dataset statistics, and provide baseline experiments using lexical and neural retrieval methods. Our baseline experiments highlight significant challenges in cross-domain patent retrieval. The dataset will be publicly available (for now the access link is this repository: https://osf.io/vbyzd/?view_only=1a40242e0d1941a58aa854af3e50cf6b).
nan
Article 507
Title@2025-06-27 (5): iPrOp: Interactive Prompt Optimization for Large Language Models with a Human in the Loop
Title: iPrOp: Interactive Prompt Optimization for Large Language Models with a Human in the Loop | iPrOp: Interaktive Prompt-Optimierung für große Sprachmodelle mit einem Menschen in der Schleife | iPrOp: 大语言模型与环中人类互动快速优化 2412.12644v2 |
Authors (2): Jiahui Li, Roman Klinger
Prompt engineering has made significant contributions to the era of large language models, yet its effectiveness depends on the skills of a prompt author. This paper introduces iPrOp, a novel interactive prompt optimization approach, to bridge manual prompt engineering and automatic prompt optimization while offering users the flexibility to assess evolving prompts. We aim to provide users with task-specific guidance to enhance human engagement in the optimization process, which is structured through prompt variations, informative instances, predictions generated by large language models along with their corresponding explanations, and relevant performance metrics. This approach empowers users to choose and further refine the prompts based on their individual preferences and needs. It can not only assist non-technical domain experts in generating optimal prompts tailored to their specific tasks or domains, but also enable the study of the intrinsic parameters that influence the performance of prompt optimization. The evaluation shows that our approach has the capability to generate improved prompts, leading to enhanced task performance.
nan
Article 508
Title@2025-06-27 (5): Llama See, Llama Do: A Mechanistic Perspective on Contextual Entrainment and Distraction in LLMs
Title: Llama See, Llama Do: A Mechanistic Perspective on Contextual Entrainment and Distraction in LLMs | Llama See, Llama Do: Eine mechanistische Perspektive auf die kontextabhängige Beanspruchung und Ablenkung in LLMs | Llama See, Llama Do:LLMML中背景教育和遭遇的机械视角 2505.09338v2 |
Authors (5): Jingcheng Niu, Xingdi Yuan, Tong Wang, Hamidreza Saghir, Amir H. Abdi
We observe a novel phenomenon, contextual entrainment, across a wide range of language models (LMs) and prompt settings, providing a new mechanistic perspective on how LMs become distracted by “irrelevant” contextual information in the input prompt. Specifically, LMs assign significantly higher logits (or probabilities) to any tokens that have previously appeared in the context prompt, even for random tokens. This suggests that contextual entrainment is a mechanistic phenomenon, occurring independently of the relevance or semantic relation of the tokens to the question or the rest of the sentence. We find statistically significant evidence that the magnitude of contextual entrainment is influenced by semantic factors. Counterfactual prompts have a greater effect compared to factual ones, suggesting that while contextual entrainment is a mechanistic phenomenon, it is modulated by semantic factors. We hypothesise that there is a circuit of attention heads -- the entrainment heads -- that corresponds to the contextual entrainment phenomenon. Using a novel entrainment head discovery method based on differentiable masking, we identify these heads across various settings. When we “turn off” these heads, i.e., set their outputs to zero, the effect of contextual entrainment is significantly attenuated, causing the model to generate output that capitulates to what it would produce if no distracting context were provided. Our discovery of contextual entrainment, along with our investigation into LM distraction via the entrainment heads, marks a key step towards the mechanistic analysis and mitigation of the distraction problem.
nan
Article 509
Title@2025-06-27 (5): Identifying a Circuit for Verb Conjugation in GPT-2
Title: Identifying a Circuit for Verb Conjugation in GPT-2 | Identifizierung eines Kreises für Verbkonjugation in GPT-2 | 在 GPT-2 中确定 Verb 混和的电路 2506.22105v1 |
Authors (1): David Demitri Africa
I implement a procedure to isolate and interpret the sub-network (or “circuit”) responsible for subject-verb agreement in GPT-2 Small. In this study, the model is given prompts where the subject is either singular (e.g. “Alice”) or plural (e.g. “Alice and Bob”), and the task is to correctly predict the appropriate verb form (“walks” for singular subjects, “walk” for plural subjects). Using a series of techniques, including performance verification, automatic circuit discovery via direct path patching, and direct logit attribution, I isolate a candidate circuit that contributes significantly to the model’s correct verb conjugation. The results suggest that only a small fraction of the network’s component-token pairs is needed to achieve near-model performance on the base task, but substantially more are needed for more complex settings.
nan
Article 510
Title@2025-06-27 (5): English K_Quantization of LLMs Does Not Disproportionately Diminish Multilingual Performance
Title: English K_Quantization of LLMs Does Not Disproportionately Diminish Multilingual Performance | Englische K_Quantisierung von LLMs verringert die mehrsprachige Leistung nicht unverhältnismäßig | LLM的英语K_量化不会不成比例地降低多语言性能 2503.03592v3 |
Authors (2): Karl Audun Borgersen, Morten Goodwin
For consumer usage of locally deployed LLMs, the GGUF format and k_quantization are invaluable tools for maintaining the performance of the original model while reducing it to sizes deployable with consumer-grade hardware. The number of bits dedicated to each weight from the original model is reduced based on how important they are thought to be during model inference. This importance is arrived at through the application of an ‘importance matrix’, a relatively small text document meant to be representative of the LLM’s standard use-cases. In the vast majority of quants available online, this document is primarily written in English. It was therefore an open question whether performance on English language tasks was preserved through the sacrifice of multilingual performance and whether it can be preserved with alternate importance matrices. This article investigates these hypotheses by quantizing Llama3.3 70B on importance matrices written in three languages (English, Norwegian, and Malayalam) and evaluating them on the MixEval dataset in both English and Norwegian. All experiments related to k_quantization yielded non-significant results, indicating that current quantization practices do not disproportionately harm multilingual performance.
nan
Article 511
Title@2025-06-27 (5): Beyond Fixed Length: Bucket Pre-training is All You Need
Title: Beyond Fixed Length: Bucket Pre-training is All You Need | Jenseits der festen Länge: Eimer Vor-Training ist alles, was Sie brauchen | 超过固定长度: 巴克特预训练是你们需要的 2407.07495v2 |
Authors (6): Qing Yang, Qiyao Peng, Hongtao Liu, Kai Liu, Bing Qin, Ting Liu
Large Language Models (LLMs) have demonstrated exceptional performance across various tasks, with the pre-training stage serving as the cornerstone of their capabilities. However, the conventional fixed-length data composition strategy for pre-training presents several practical challenges. When using shorter sequences, documents are often truncated, potentially leading to information loss and affecting the model’s ability to capture long-range dependencies. Conversely, longer sequences require concatenation of multiple documents, which can introduce noise, disrupt natural document boundaries and semantic coherence, and incur substantial computational overhead. To address these challenges, we first establish three quantitative metrics for evaluating data composition quality: padding ratio, truncation ratio, and concatenation ratio. Building upon these metrics, we propose a novel multi-bucket data composition method that transcends the fixed-length paradigm. Our approach adaptively organizes training data to achieve optimal composition quality as measured by the proposed metrics, offering a more flexible and efficient approach for pre-training. We conduct extensive experiments and the results demonstrate that our proposed method significantly enhances both the efficiency and effectiveness of LLM pre-training.
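The three metrics named in the abstract can be illustrated with a small sketch. The abstract does not give formulas, so the definitions below are assumptions chosen for illustration: padding ratio as unused slots over total slots, truncation ratio as dropped tokens over all document tokens, and concatenation ratio as the share of sequences holding more than one document. `PackedSequence`, `composition_metrics`, and `one_doc_per_bucket` are hypothetical names, and the one-document-per-bucket placement is a simplification, not the paper's method.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PackedSequence:
    capacity: int            # sequence length in tokens
    doc_tokens: List[int]    # tokens kept from each document placed in this sequence
    dropped_tokens: int = 0  # tokens cut off by truncation


def composition_metrics(seqs: List[PackedSequence]) -> Dict[str, float]:
    """Assumed definitions of the three ratios (not taken from the paper)."""
    slots = sum(s.capacity for s in seqs)
    kept = sum(sum(s.doc_tokens) for s in seqs)
    dropped = sum(s.dropped_tokens for s in seqs)
    return {
        "padding_ratio": (slots - kept) / slots,
        "truncation_ratio": dropped / (kept + dropped),
        "concatenation_ratio": sum(len(s.doc_tokens) > 1 for s in seqs) / len(seqs),
    }


def one_doc_per_bucket(doc_lengths: List[int], buckets: List[int]) -> List[PackedSequence]:
    """Place each document alone in the smallest bucket that fits it.

    Documents longer than the largest bucket are truncated to that bucket.
    """
    buckets = sorted(buckets)
    seqs = []
    for n in doc_lengths:
        cap = next((b for b in buckets if b >= n), buckets[-1])
        kept = min(n, cap)
        seqs.append(PackedSequence(cap, [kept], n - kept))
    return seqs


docs = [3, 7, 20]  # toy document lengths in tokens
fixed = composition_metrics(one_doc_per_bucket(docs, [8]))
multi = composition_metrics(one_doc_per_bucket(docs, [4, 8, 16, 24]))
```

On this toy input the single fixed length truncates the 20-token document, while the multi-bucket layout keeps every token and wastes fewer slots on padding.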
nan
Article 512
Title@2025-06-27 (5): Involvement drives complexity of language in online debates
Title: Involvement drives complexity of language in online debates | Einbeziehung treibt die Komplexität der Sprache in Online-Debatten an | 在线辩论语言的复杂性驱动参与驱动因素 2506.22098v1 |
Authors (10): Eleonora Amadori, Daniele Cirulli, Edoardo Di Martino, Jacopo Nudo, Maria Sahakyan, Emanuele Sangiorgio, Arnaldo Santoro, Simon Zollo, Alessandro Galeazzi, Niccolò Di Marco
Language is a fundamental aspect of human societies, continuously evolving in response to various stimuli, including societal changes and intercultural interactions. Technological advancements have profoundly transformed communication, with social media emerging as a pivotal force that merges entertainment-driven content with complex social dynamics. As these platforms reshape public discourse, analyzing the linguistic features of user-generated content is essential to understanding their broader societal impact. In this paper, we examine the linguistic complexity of content produced by influential users on Twitter across three globally significant and contested topics: COVID-19, COP26, and the Russia-Ukraine war. By combining multiple measures of textual complexity, we assess how language use varies along four key dimensions: account type, political leaning, content reliability, and sentiment. Our analysis reveals significant differences across all four axes, including variations in language complexity between individuals and organizations, between profiles with sided versus moderate political views, and between those associated with higher versus lower reliability scores. Additionally, profiles producing more negative and offensive content tend to use more complex language, with users sharing similar political stances and reliability levels converging toward a common jargon. Our findings offer new insights into the sociolinguistic dynamics of digital platforms and contribute to a deeper understanding of how language reflects ideological and social structures in online spaces.
nan
Article 513
Title@2025-06-27 (5): Large Language Models in Argument Mining: A Survey
Title: Large Language Models in Argument Mining: A Survey | Große Sprachmodelle im Argumentbergbau: Eine Umfrage | 争议采矿大语言模型:调查 2506.16383v2 |
Authors (5): Hao Li, Viktor Schlegel, Yizheng Sun, Riza Batista-Navarro, Goran Nenadic
Argument Mining (AM), a critical subfield of Natural Language Processing (NLP), focuses on extracting argumentative structures from text. The advent of Large Language Models (LLMs) has profoundly transformed AM, enabling advanced in-context learning, prompt-based generation, and robust cross-domain adaptability. This survey systematically synthesizes recent advancements in LLM-driven AM. We provide a concise review of foundational theories and annotation frameworks, alongside a meticulously curated catalog of datasets. A key contribution is our comprehensive taxonomy of AM subtasks, elucidating how contemporary LLM techniques – such as prompting, chain-of-thought reasoning, and retrieval augmentation – have reconfigured their execution. We further detail current LLM architectures and methodologies, critically assess evaluation practices, and delineate pivotal challenges including long-context reasoning, interpretability, and annotation bottlenecks. Finally, we highlight emerging trends and propose a forward-looking research agenda for LLM-based computational argumentation, aiming to guide researchers in this rapidly evolving domain.
nan
Article 514
Title@2025-06-27 (5): Benchmarking Vision Language Models on German Factual Data
Title: Benchmarking Vision Language Models on German Factual Data | Benchmarking von Vision Language Models auf deutschen Factual Data | 制定德国事实数据愿景语言模型基准 2504.11108v2 |
Authors (2): René Peinl, Vincent Tischler
Similar to LLMs, the development of vision language models is mainly driven by English datasets and models trained in the English and Chinese languages, whereas support for other languages, even those considered high-resource languages such as German, remains significantly weaker. In this work we present an analysis of open-weight VLMs on factual knowledge in the German and English languages. We disentangle the image-related aspects from the textual ones by analyzing accuracy with jury-as-a-judge in both prompt languages and images from German and international contexts. We found that for celebrities and sights, VLMs struggle because they are lacking visual cognition of German image contents. For animals and plants, the tested models can often correctly identify the image contents according to the scientific name or English common name but fail in the German language. Cars and supermarket products were identified equally well in English and German images across both prompt languages.
nan
Article 515
Title@2025-06-27 (5): VLM@school – Evaluation of AI image understanding on German middle school knowledge
Title: VLM@school – Evaluation of AI image understanding on German middle school knowledge | VLM@school – Auswertung des KI-Bildverständnisses über deutsche Mittelschulkenntnisse | VLM@school – – 评价AI关于德国中学知识的图像理解 2506.11604v2 |
Authors (2): René Peinl, Vincent Tischler
This paper introduces a novel benchmark dataset designed to evaluate the capabilities of Vision Language Models (VLMs) on tasks that combine visual reasoning with subject-specific background knowledge in the German language. In contrast to widely used English-language benchmarks that often rely on artificially difficult or decontextualized problems, this dataset draws from real middle school curricula across nine domains including mathematics, history, biology, and religion. The benchmark includes over 2,000 open-ended questions grounded in 486 images, ensuring that models must integrate visual interpretation with factual reasoning rather than rely on superficial textual cues. We evaluate thirteen state-of-the-art open-weight VLMs across multiple dimensions, including domain-specific accuracy and performance on adversarially crafted questions. Our findings reveal that even the strongest models achieve less than 45% overall accuracy, with particularly poor performance in music, mathematics, and adversarial settings. Furthermore, the results indicate significant discrepancies between success on popular benchmarks and real-world multimodal understanding. We conclude that middle school-level tasks offer a meaningful and underutilized avenue for stress-testing VLMs, especially in non-English contexts. The dataset and evaluation protocol serve as a rigorous testbed to better understand and improve the visual and linguistic reasoning capabilities of future AI systems.
nan
Article 516
Title@2025-06-27 (5): Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency
Title: Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency | Jailbreaking Multimodale große Sprachmodelle über Shuffle Inkonsistenz | 通过破碎不连贯的打碎和不连贯的多式多式大语言模型 2501.04931v2 |
Authors (10): Shiji Zhao, Ranjie Duan, Fengxiang Wang, Chi Chen, Caixin Kang, Shouwei Ruan, Jialing Tao, YueFeng Chen, Hui Xue, Xingxing Wei
Multimodal Large Language Models (MLLMs) have achieved impressive performance and have been put into practical use in commercial applications, but they still have potential safety mechanism vulnerabilities. Jailbreak attacks are red teaming methods that aim to bypass safety mechanisms and discover MLLMs’ potential risks. Existing MLLM jailbreak methods often bypass the model’s safety mechanism through complex optimization methods or carefully designed image and text prompts. Despite achieving some progress, they have a low attack success rate on commercial closed-source MLLMs. Unlike previous research, we empirically find that there exists a Shuffle Inconsistency between MLLMs’ comprehension ability and safety ability for shuffled harmful instructions. That is, from the perspective of comprehension ability, MLLMs can understand shuffled harmful text-image instructions well. However, from the perspective of safety ability, their safety mechanisms can be easily bypassed by the shuffled harmful instructions, leading to harmful responses. We then propose a text-image jailbreak attack named SI-Attack. Specifically, to fully utilize the Shuffle Inconsistency and overcome the shuffle randomness, we apply a query-based black-box optimization method to select the most harmful shuffled inputs based on the feedback of the toxic judge model. A series of experiments show that SI-Attack can improve the attack’s performance on three benchmarks. In particular, SI-Attack can markedly improve the attack success rate for commercial MLLMs such as GPT-4o or Claude-3.5-Sonnet.
nan
Article 517
Title@2025-06-27 (5): MDC-R: The Minecraft Dialogue Corpus with Reference
Title: MDC-R: The Minecraft Dialogue Corpus with Reference | MDC-R: Der Minecraft Dialogue Corpus mit Referenz | MDC-R: 采矿对话公司(参考) 2506.22062v1 |
Authors (9): Chris Madge, Maris Camilleri, Paloma Carretero Garcia, Mladen Karan, Juexi Shao, Prashant Jayannavar, Julian Hough, Benjamin Roth, Massimo Poesio
We introduce the Minecraft Dialogue Corpus with Reference (MDC-R). MDC-R is a new language resource that supplements the original Minecraft Dialogue Corpus (MDC) with expert annotations of anaphoric and deictic reference. MDC’s task-orientated, multi-turn, situated dialogue in a dynamic environment has motivated multiple annotation efforts, owing to the interesting linguistic phenomena that this setting gives rise to. We believe it can serve as a valuable resource when annotated with reference, too. Here, we discuss our method of annotation and the resulting corpus, and provide both a quantitative and a qualitative analysis of the data. Furthermore, we carry out a short experiment demonstrating the usefulness of our corpus for referring expression comprehension.
nan
Article 518
Title@2025-06-27 (5): Lost at the Beginning of Reasoning
Title: Lost at the Beginning of Reasoning | Verloren am Anfang der Vernunft | 迷失在理性的开始 2506.22058v1 |
Authors (8): Baohao Liao, Xinyi Chen, Sara Rajaee, Yuhui Xu, Christian Herold, Anders Søgaard, Maarten de Rijke, Christof Monz
Recent advancements in large language models (LLMs) have significantly advanced complex reasoning capabilities, particularly through extended chain-of-thought (CoT) reasoning that incorporates mechanisms such as backtracking, self-reflection and self-correction. Despite these developments, the self-correction abilities of LLMs during long CoT reasoning remain underexplored. Moreover, recent findings on overthinking suggest that such models often engage in unnecessarily redundant reasoning. In this work, we empirically show that the first reasoning step exerts a disproportionately large influence on the final prediction: errors introduced at this stage can substantially degrade subsequent reasoning quality. This phenomenon is consistently observed across two state-of-the-art open-source reasoning model families: DeepSeek-R1 and Qwen3. To address this, we propose an efficient sampling strategy that leverages a reward model to identify and retain high-quality first reasoning steps while discarding suboptimal ones, achieving up to a 70% reduction in inference cost without sacrificing accuracy. Finally, we introduce a new benchmark specifically constructed with deliberately flawed first reasoning steps to systematically evaluate model self-correction capabilities, offering a foundation for future research on robust reasoning in LLMs.
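The sampling strategy described above (score first reasoning steps with a reward model, continue only the best ones) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `keep_best_first_steps` and `token_cost` are hypothetical helpers, the reward model is stood in for by a lookup table of made-up scores, and the token counts are invented to show where the savings come from.

```python
from typing import Callable, List, Optional


def keep_best_first_steps(
    first_steps: List[str],
    reward_fn: Callable[[str], float],
    keep: int,
) -> List[str]:
    """Rank sampled first steps by reward and retain the top `keep`."""
    ranked = sorted(first_steps, key=reward_fn, reverse=True)
    return ranked[:keep]


def token_cost(n_samples: int, first_len: int, rest_len: int,
               keep: Optional[int] = None) -> int:
    """Tokens generated when only `keep` of `n_samples` traces are continued.

    With keep=None every sampled trace is completed (the unpruned baseline).
    """
    continued = n_samples if keep is None else keep
    return n_samples * first_len + continued * rest_len


# Toy stand-in for a reward model: fixed, made-up scores per candidate step.
toy_scores = {
    "Let x be the unknown.": 0.9,
    "Guess the answer is 7.": 0.2,
    "Restate the question.": 0.6,
}
best = keep_best_first_steps(list(toy_scores), toy_scores.get, keep=2)

full = token_cost(n_samples=10, first_len=100, rest_len=900)
pruned = token_cost(n_samples=10, first_len=100, rest_len=900, keep=3)
saving = 1 - pruned / full
```

The saving in this toy setting depends entirely on the assumed lengths and on `keep`; it is not meant to reproduce the reduction reported in the abstract.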
nan
Article 519
Title@2025-06-27 (5): Language in Vivo vs. in Silico: Size Matters but Larger Language Models Still Do Not Comprehend Language on a Par with Humans Due to Impenetrable Semantic Reference
Title: Language in Vivo vs. in Silico: Size Matters but Larger Language Models Still Do Not Comprehend Language on a Par with Humans Due to Impenetrable Semantic Reference | Sprache in Vivo vs. in Silico: Größenelemente aber größere Sprachmodelle verstehen die Sprache noch nicht auf einem Par mit Menschen aufgrund undurchdringlicher semantischer Referenz | Vivo语与Silico语:大小问题,但大语言模型仍然不理解人与人之间的语言,因为不可排除的语义参考 2404.14883v3 |
Authors (3): Vittoria Dentella, Fritz Guenther, Evelina Leivada
Understanding the limits of language is a prerequisite for Large Language Models (LLMs) to act as theories of natural language. LLM performance in some language tasks presents both quantitative and qualitative differences from that of humans; however, it remains to be determined whether such differences are amenable to model size. This work investigates the critical role of model scaling, determining whether increases in size make up for such differences between humans and models. We test three LLMs from different families (Bard, 137 billion parameters; ChatGPT-3.5, 175 billion; ChatGPT-4, 1.5 trillion) on a grammaticality judgment task featuring anaphora, center embedding, comparatives, and negative polarity. N=1,200 judgments are collected and scored for accuracy, stability, and improvements in accuracy upon repeated presentation of a prompt. Results of the best performing LLM, ChatGPT-4, are compared to results of n=80 humans on the same stimuli. We find that humans are overall less accurate than ChatGPT-4 (76% vs. 80% accuracy, respectively), but that this is due to ChatGPT-4 outperforming humans only in one task condition, namely on grammatical sentences. Additionally, ChatGPT-4 wavers more than humans in its answers (12.5% vs. 9.6% likelihood of an oscillating answer, respectively). Thus, while increased model size may lead to better performance, LLMs are still not sensitive to (un)grammaticality in the same way as humans are. It seems possible but unlikely that scaling alone can fix this issue. We interpret these results by comparing language learning in vivo and in silico, identifying three critical differences concerning (i) the type of evidence, (ii) the poverty of the stimulus, and (iii) the occurrence of semantic hallucinations due to impenetrable linguistic reference.
nan
Article 520
Title@2025-06-27 (5): Decoding Machine Translationese in English-Chinese News: LLMs vs. NMTs
Title: Decoding Machine Translationese in English-Chinese News: LLMs vs. NMTs | Decoding Machine Translationese in Englisch-Chinesisch Nachrichten: LLMs vs. NMTs | 《中英新闻:LLMS诉NMTs》 2506.22050v1 |
Authors (2): Delu Kong, Lieve Macken
This study explores Machine Translationese (MTese) – the linguistic peculiarities of machine translation outputs – focusing on the under-researched English-to-Chinese language pair in news texts. We construct a large dataset consisting of 4 sub-corpora and employ a comprehensive five-layer feature set. Then, a chi-square ranking algorithm is applied for feature selection in both classification and clustering tasks. Our findings confirm the presence of MTese in both Neural Machine Translation systems (NMTs) and Large Language Models (LLMs). Original Chinese texts are nearly perfectly distinguishable from both LLM and NMT outputs. Notable linguistic patterns in MT outputs are shorter sentence lengths and increased use of adversative conjunctions. Comparing LLMs and NMTs, we achieve approximately 70% classification accuracy, with LLMs exhibiting greater lexical diversity and NMTs using more brackets. Additionally, translation-specific LLMs show lower lexical diversity but higher usage of causal conjunctions compared to generic LLMs. Lastly, we find no significant differences between LLMs developed by Chinese firms and their foreign counterparts.
nan
Article 521
Title@2025-06-27 (5): ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows
Title: ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows | ScienceBoard: Bewertung multimodaler autonomer Agenzien in realistischen wissenschaftlichen Workflows | 科学理事会:评估现实科学工作流程中的多式联运自治机构 2505.19897v2 |
Authors (21): Qiushi Sun, Zhoumianze Liu, Chang Ma, Zichen Ding, Fangzhi Xu, Zhangyue Yin, Haiteng Zhao, Zhenyu Wu, Kanzhi Cheng, Zhaoyang Liu, Jianing Wang, Qintong Li, Xiangru Tang, Tianbao Xie, Xiachong Feng, Xiang Li, Ben Kao, Wenhai Wang, Biqing Qi, Lingpeng Kong, Zhiyong Wu
Large Language Models (LLMs) have extended their impact beyond Natural Language Processing, substantially fostering the development of interdisciplinary research. Recently, various LLM-based agents have been developed to assist scientific discovery progress across multiple aspects and domains. Among these, computer-using agents, capable of interacting with operating systems as humans do, are paving the way to automated scientific problem-solving and addressing routines in researchers’ workflows. Recognizing the transformative potential of these agents, we introduce ScienceBoard, which encompasses two complementary contributions: (i) a realistic, multi-domain environment featuring dynamic and visually rich scientific workflows with integrated professional software, where agents can autonomously interact via different interfaces to accelerate complex research tasks and experiments; and (ii) a challenging benchmark of 169 high-quality, rigorously validated real-world tasks curated by humans, spanning scientific-discovery workflows in domains such as biochemistry, astronomy, and geoinformatics. Extensive evaluations of agents with state-of-the-art backbones (e.g., GPT-4o, Claude 3.7, UI-TARS) show that, despite some promising results, they still fall short of reliably assisting scientists in complex workflows, achieving only a 15% overall success rate. In-depth analysis further provides valuable insights for addressing current agent limitations and more effective design principles, paving the way to build more capable agents for scientific discovery. Our code, environment, and benchmark are at https://qiushisun.github.io/ScienceBoard-Home/.
nan
Article 522
Title@2025-06-27 (5): Can Peter Pan Survive MT? A Stylometric Study of LLMs, NMTs, and HTs in Children’s Literature Translation
Title: Can Peter Pan Survive MT? A Stylometric Study of LLMs, NMTs, and HTs in Children’s Literature Translation | Kann Peter Pan MT überleben? Eine stylometrische Studie von LLMs, NMTs und HTs in der Kinderliteratur Übersetzung | Peter Pan Pan Survive MT? 儿童文学翻译中LLMS、NMTs和HTs的理学研究 2506.22038v1 |
Authors (2): Delu Kong, Lieve Macken
This study focuses on evaluating the performance of machine translations (MTs) compared to human translations (HTs) in English-to-Chinese children’s literature translation (CLT) from a stylometric perspective. The research constructs a Peter Pan corpus, comprising 21 translations: 7 human translations (HTs), 7 large language model translations (LLMs), and 7 neural machine translation outputs (NMTs). The analysis employs a generic feature set (including lexical, syntactic, readability, and n-gram features) and a creative text translation (CTT-specific) feature set, which captures repetition, rhythm, translatability, and miscellaneous levels, yielding 447 linguistic features in total. Using classification and clustering techniques in machine learning, we conduct a stylometric analysis of these translations. Results reveal that in generic features, HTs and MTs exhibit significant differences in conjunction word distributions and the ratio of 1-word-gram-YiYang, while NMTs and LLMs show significant variation in descriptive words usage and adverb ratios. Regarding CTT-specific features, LLMs outperform NMTs in distribution, aligning more closely with HTs in stylistic characteristics, demonstrating the potential of LLMs in CLT.
nan
Article 523
Title@2025-06-27 (5): Towards Reproducible LLM Evaluation: Quantifying Uncertainty in LLM Benchmark Scores
Title: Towards Reproducible LLM Evaluation: Quantifying Uncertainty in LLM Benchmark Scores | Auf dem Weg zu einer reproduzierbaren LLM-Bewertung: Quantifizierung der Unsicherheit in LLM-Benchmark-Scores | 走向可复制的LLM评价:量化LLM基准分数中的不确定性 2410.03492v2 |
Authors (3): Robert E. Blackwell, Jon Barry, Anthony G. Cohn
Large language models (LLMs) are stochastic, and not all models give deterministic answers, even when setting temperature to zero with a fixed random seed. However, few benchmark studies attempt to quantify uncertainty, partly due to the time and cost of repeated experiments. We use benchmarks designed for testing LLMs’ capacity to reason about cardinal directions to explore the impact of experimental repeats on mean score and prediction interval. We suggest a simple method for cost-effectively quantifying the uncertainty of a benchmark score and make recommendations concerning reproducible LLM evaluation.
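One generic way to summarise repeated benchmark runs is a mean score plus a prediction interval for a single future run. The sketch below is an assumption about what such a calculation could look like, not the method proposed in the paper; `score_summary` is a hypothetical helper, the scores are invented, and the normal quantile is used for simplicity (with few repeats a Student-t quantile would give a wider interval).

```python
import math
import statistics
from typing import List, Tuple


def score_summary(scores: List[float], confidence: float = 0.95) -> Tuple[float, Tuple[float, float]]:
    """Return the mean score and an approximate prediction interval for one new run.

    Uses mean +/- z * s * sqrt(1 + 1/n) with a normal quantile z (an approximation).
    """
    n = len(scores)
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation, needs n >= 2
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sd * math.sqrt(1 + 1 / n)
    return mean, (mean - half_width, mean + half_width)


# Invented accuracy scores from five repeats of the same benchmark.
repeats = [0.70, 0.72, 0.68, 0.71, 0.69]
mean, (low, high) = score_summary(repeats)
```

Adding repeats narrows the uncertainty about the mean, but the interval for a single new run stays bounded below by the run-to-run spread, which is why even a few repeats are informative about reproducibility.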
nan
Article 524
Title@2025-06-27 (5): ACORD: An Expert-Annotated Retrieval Dataset for Legal Contract Drafting
Title: ACORD: An Expert-Annotated Retrieval Dataset for Legal Contract Drafting | ACORD: Ein sachverständiger Datensatz für die Erstellung von Verträgen | ACORD: 法律合同起草专家附加说明的检索数据集 2501.06582v3 |
Authors (8): Steven H. Wang, Maksim Zubkov, Kexin Fan, Sarah Harrell, Yuyang Sun, Wei Chen, Andreas Plesner, Roger Wattenhofer
Information retrieval, specifically contract clause retrieval, is foundational to contract drafting because lawyers rarely draft contracts from scratch; instead, they locate and revise the most relevant precedent. We introduce the Atticus Clause Retrieval Dataset (ACORD), the first retrieval benchmark for contract drafting fully annotated by experts. ACORD focuses on complex contract clauses such as Limitation of Liability, Indemnification, Change of Control, and Most Favored Nation. It includes 114 queries and over 126,000 query-clause pairs, each ranked on a scale from 1 to 5 stars. The task is to find the most relevant precedent clauses to a query. A bi-encoder retriever paired with pointwise LLM re-rankers shows promising results. However, substantial improvements are still needed to effectively manage the complex legal work typically undertaken by lawyers. As the first retrieval benchmark for contract drafting annotated by experts, ACORD can serve as a valuable IR benchmark for the NLP community.
nan
Article 525
Title@2025-06-27 (5): ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference
Title: ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference | ChunkKV: Semantisch-bewahrende KV-Cache-Kompression für effiziente Lang-Kontext-LLM-Inferenz | ChunkKV: 为高效长文本LLM 推断而保存 KV缓存压缩 2502.00299v3 |
Authors (8): Xiang Liu, Zhenheng Tang, Peijie Dong, Zeyu Li, Yue Liu, Bo Li, Xuming Hu, Xiaowen Chu
Large Language Models (LLMs) require significant GPU memory when processing long texts, with the key value (KV) cache consuming up to 70% of total memory during inference. Although existing compression methods reduce memory by evaluating the importance of individual tokens, they overlook critical semantic relationships between tokens, resulting in fragmented context and degraded performance. We introduce ChunkKV, which fundamentally reimagines KV cache compression by treating semantic chunks, rather than isolated tokens, as basic compression units. This approach preserves complete linguistic structures and contextual integrity, ensuring that essential meaning is retained even under aggressive compression. Our innovation includes a novel layer-wise index reuse technique that exploits the higher cross-layer similarity of preserved indices in ChunkKV, reducing computational overhead and improving throughput by 26.5%. Comprehensive evaluations on challenging benchmarks (LongBench, Needle-In-A-HayStack, GSM8K, and JailbreakV) demonstrate that ChunkKV outperforms state-of-the-art methods by up to 8.7% in precision while maintaining the same compression ratio. These results confirm that semantic-aware compression significantly enhances both efficiency and performance for long-context LLM inference, providing a simple yet effective solution to the memory bottleneck problem.
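The difference between token-level and chunk-level selection can be shown with a toy example. This is a minimal sketch under assumptions the abstract does not confirm: chunks are fixed-size windows, a chunk's score is the sum of its token scores, and the per-token importance scores are invented. `select_token_indices` and `select_chunk_indices` are hypothetical names.

```python
from typing import List


def select_token_indices(token_scores: List[float], keep_tokens: int) -> List[int]:
    """Token-level baseline: keep the highest-scoring individual positions."""
    ranked = sorted(range(len(token_scores)), key=lambda i: token_scores[i], reverse=True)
    return sorted(ranked[:keep_tokens])


def select_chunk_indices(token_scores: List[float], chunk_size: int, keep_chunks: int) -> List[int]:
    """Chunk-level selection: keep whole fixed-size chunks ranked by summed score."""
    chunks = [
        list(range(start, min(start + chunk_size, len(token_scores))))
        for start in range(0, len(token_scores), chunk_size)
    ]
    ranked = sorted(chunks, key=lambda idx: sum(token_scores[i] for i in idx), reverse=True)
    return sorted(i for chunk in ranked[:keep_chunks] for i in chunk)


# Invented importance scores for nine cached positions (three chunks of three).
scores = [0.1, 0.9, 0.1, 0.4, 0.4, 0.4, 0.0, 0.0, 0.2]
token_level = select_token_indices(scores, keep_tokens=3)
chunk_level = select_chunk_indices(scores, chunk_size=3, keep_chunks=1)
```

With the same budget of three positions, the token-level baseline keeps scattered indices while the chunk-level variant keeps one contiguous span, which is the kind of fragmentation the abstract argues against.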
nan
Article 526
Title@2025-06-27 (5): Robust and Efficient Autoregressive Speech Synthesis with Dynamic Chunk-wise Prediction Policy
Title: Robust and Efficient Autoregressive Speech Synthesis with Dynamic Chunk-wise Prediction Policy | Robuste und effiziente autoregressive Sprachsynthese mit dynamischer Chunk-weiser Vorhersagepolitik | 强力和高效的自动递减语音合成,带有动态整节预测政策 2506.22023v1 |
Authors (8): Bohan Li, Zhihan Li, Haoran Wang, Hanglei Zhang, Yiwei Guo, Hankun Wang, Xie Chen, Kai Yu
Recently, autoregressive (AR) language models have emerged as a dominant approach in speech synthesis, offering expressive generation and scalable training. However, conventional AR speech synthesis models relying on the next-token prediction paradigm often encounter significant challenges when handling long speech sequences. These models often struggle to construct stable frame-to-frame attention, leading to increased latency and degraded synthesis quality, thereby limiting their feasibility for real-time applications. To address these limitations, we introduce a novel dynamic chunk-wise autoregressive synthesis framework, termed DCAR, designed to enhance both efficiency and intelligibility robustness in AR speech generation. DCAR introduces a chunk-to-frame attention mechanism through training with multi-token prediction, enabling dynamic chunk prediction in variable speech contexts using a lightweight module trained on-policy. DCAR dynamically adjusts the token prediction span, significantly reducing the sequence length dependency while obtaining high synthesis quality. Comprehensive empirical evaluations demonstrate that DCAR substantially outperforms traditional next-token prediction models, achieving up to 72.27% intelligibility improvement and 2.61x inference speedup simultaneously on the test set. Furthermore, we conduct comprehensive analysis to support it as a versatile foundation for next-generation speech synthesis systems.
nan
Article 527
Title@2025-06-27 (5): MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration
Title: MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration | MMBoundary: MLLM-Wissensgrenzen-Bewusstsein durch vernünftige Schritt-Vertrauens-Kalibrierung | MMBoundary:通过合理步骤信任校准提高MLLM知识边界认识 2505.23224v3 |
Authors (6): Zhitao He, Sandeep Polisetty, Zhiyuan Fan, Yuchen Huang, Shujin Wu, Yi R. Fung
In recent years, multimodal large language models (MLLMs) have made significant progress but continue to face inherent challenges in multimodal reasoning, which requires multi-level (e.g., perception, reasoning) and multi-granular (e.g., multi-step reasoning chain) advanced inferencing. Prior work on estimating model confidence tends to focus on the overall response for training and calibration, but fails to assess confidence in each reasoning step, leading to undesirable hallucination snowballing. In this work, we present MMBoundary, a novel framework that advances the knowledge boundary awareness of MLLMs through reasoning step confidence calibration. To achieve this, we propose to incorporate complementary textual and cross-modal self-rewarding signals to estimate confidence at each step of the MLLM reasoning process. In addition to supervised fine-tuning of the MLLM on this set of self-rewarded confidence estimation signals for an initial confidence-expression warm-up, we introduce a reinforcement learning stage with multiple reward functions to further align model knowledge and calibrate confidence at each reasoning step, enhancing reasoning chain self-correction. Empirical results show that MMBoundary significantly outperforms existing methods across diverse domain datasets and metrics, achieving an average of 7.5% reduction in multimodal confidence calibration errors and up to 8.3% improvement in task performance.
nan
Article 528
Title@2025-06-27 (5): Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs
Title: Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs | Kann den Wald für die Bäume nicht sehen: Aufruf von Heuristik und Biase zu Elicit Irrationale Wahlmöglichkeiten von LLMs | 无法看到树的森林: 引用光量和比喻来选择LLMM 的不合理选择 。 2505.02862v3 |
Authors (6): Haoming Yang, Ke Ma, Xiaojun Jia, Yingfei Sun, Qianqian Xu, Qingming Huang
Despite the remarkable performance of Large Language Models (LLMs), they remain vulnerable to jailbreak attacks, which can compromise their safety mechanisms. Existing studies often rely on brute-force optimization or manual design, failing to uncover potential risks in real-world scenarios. To address this, we propose a novel jailbreak attack framework, ICRT, inspired by heuristics and biases in human cognition. Leveraging the simplicity effect, we employ cognitive decomposition to reduce the complexity of malicious prompts. Simultaneously, relevance bias is utilized to reorganize prompts, enhancing semantic alignment and inducing harmful outputs effectively. Furthermore, we introduce a ranking-based harmfulness evaluation metric that surpasses the traditional binary success-or-failure paradigm by employing ranking aggregation methods such as Elo, HodgeRank, and Rank Centrality to comprehensively quantify the harmfulness of generated content. Experimental results show that our approach consistently bypasses mainstream LLMs’ safety mechanisms and generates high-risk content, providing insights into jailbreak attack risks and contributing to stronger defense strategies.
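The ranking-aggregation step named in the abstract can be sketched with a plain Elo update over pairwise judgments. This is a generic illustration, not the paper's metric: the item labels, the K-factor of 32, the base rating of 1000, and the function name `elo_ratings` are assumptions, and the pairwise outcomes below are invented placeholders.

```python
from typing import Dict, List, Tuple


def elo_ratings(items: List[str], comparisons: List[Tuple[str, str]],
                k: float = 32.0, base: float = 1000.0) -> Dict[str, float]:
    """Aggregate (winner, loser) pairs into one rating per item."""
    ratings = {item: base for item in items}
    for winner, loser in comparisons:
        # Standard Elo expected score of the winner before the update.
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        # The winner gains what the loser gives up, so the total is conserved.
        delta = k * (1.0 - expected_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    return ratings


# Hypothetical pairwise judgments: the first item was ranked above the second.
comparisons = [("c", "a"), ("c", "b"), ("b", "a")]
ratings = elo_ratings(["a", "b", "c"], comparisons)
ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

Unlike a binary success-or-failure label, the resulting ratings place every item on a common scale, which is what allows graded comparison.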
nan
Article 529
Title@2025-06-27 (5): Advancing Language Multi-Agent Learning with Credit Re-Assignment for Interactive Environment Generalization
Title: Advancing Language Multi-Agent Learning with Credit Re-Assignment for Interactive Environment Generalization | Advancing Language Multi-Agent Learning mit Kredit-Re-Zuweisung für interaktive Umwelt Verallgemeinerung | 推进多语言多机构学习,通过信用再分配促进互动环境通用化 2502.14496v2 |
Authors (8): Zhitao He, Zijun Liu, Peng Li, Yi R Fung, Ming Yan, Ji Zhang, Fei Huang, Yang Liu
LLM-based agents have made significant advancements in interactive environments, such as mobile operations and web browsing, and in other domains beyond computer use. Current multi-agent systems universally excel in performance compared to single agents, but struggle with generalization across environments due to predefined roles and inadequate strategies for generalizing language agents. The challenge of achieving both strong performance and good generalization has hindered the progress of multi-agent systems for interactive environments. To address these issues, we propose CollabUIAgents, a multi-agent reinforcement learning framework with a novel multi-agent credit re-assignment (CR) strategy, assigning process rewards with LLMs rather than environment-specific rewards and learning with synthesized preference data, in order to foster generalizable, collaborative behaviors among the role-free agents’ policies. Empirical results show that our framework improves both performance and cross-environment generalizability of multi-agent systems. Moreover, our 7B-parameter system achieves results on par with or exceeding those of strong closed-source models and of the LLM that guides the CR. We also provide insights into using granular CR rewards effectively for environment generalization, and into accommodating trained LLMs in multi-agent systems.
nan
Article 530
Title@2025-06-27 (5): OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
Title: OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis | OS-Genese: Automatisieren der GUI Agent Trajectory Construction über Reverse Task Synthesis | OS-主题:通过反向任务合成实现图形界面代理轨迹构造自动化 2412.19723v3 |
Authors (15): Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, Ben Kao, Guohao Li, Junxian He, Yu Qiao, Zhiyong Wu
Graphical User Interface (GUI) agents powered by Vision-Language Models (VLMs) have demonstrated human-like computer control capability. Despite their utility in advancing digital automation, a critical bottleneck persists: collecting high-quality trajectory data for training. Common practices for collecting such data rely on human supervision or synthetic data generation through executing pre-defined tasks, which are either resource-intensive or unable to guarantee data quality. Moreover, these methods suffer from limited data diversity and significant gaps between synthetic data and real-world environments. To address these challenges, we propose OS-Genesis, a novel GUI data synthesis pipeline that reverses the conventional trajectory collection process. Instead of relying on pre-defined tasks, OS-Genesis enables agents first to perceive environments and perform step-wise interactions, then retrospectively derive high-quality tasks to enable trajectory-level exploration. A trajectory reward model is then employed to ensure the quality of the generated trajectories. We demonstrate that training GUI agents with OS-Genesis significantly improves their performance on highly challenging online benchmarks. In-depth analysis further validates OS-Genesis’s efficiency and its superior data quality and diversity compared to existing synthesis methods. Our code, data, and checkpoints are available at https://qiushisun.github.io/OS-Genesis-Home/.
nan
Article 531
Title@2025-06-27 (5): Detecting Knowledge Boundary of Vision Large Language Models by Sampling-Based Inference
Title: Detecting Knowledge Boundary of Vision Large Language Models by Sampling-Based Inference | Erkenntnisgrenzen von Visionsgrößen-Sprachmodellen durch Sampling-basierte Schlussfolgerungen erkennen | 通过基于抽样的推断,检测大语言视觉模型的知识范围 2502.18023v2 |
Authors (8): Zhuo Chen, Xinyu Wang, Yong Jiang, Zhen Zhang, Xinyu Geng, Pengjun Xie, Fei Huang, Kewei Tu
Despite recent advancements, Visual Large Language Models (VLLMs), like text-only Large Language Models (LLMs), have limitations in addressing questions that require real-time information or are knowledge-intensive. Indiscriminately adopting Retrieval Augmented Generation (RAG) techniques is an effective yet expensive way to enable models to answer queries beyond their knowledge scopes. To mitigate the dependence on retrieval and simultaneously maintain, or even improve, the performance benefits provided by retrieval, we propose a method to detect the knowledge boundary of VLLMs, allowing for more efficient use of techniques like RAG. Specifically, we propose a method with two variants that fine-tunes a VLLM on an automatically constructed dataset for boundary identification. Experimental results on various types of Visual Question Answering datasets show that our method successfully depicts a VLLM’s knowledge boundary, based on which we are able to reduce indiscriminate retrieval while maintaining or improving performance. In addition, we show that the knowledge boundary identified by our method for one VLLM can be used as a surrogate boundary for other VLLMs. Code will be released at https://github.com/Chord-Chen-30/VLLM-KnowledgeBoundary
nan
Article 532
Title@2025-06-27 (5): Federated Data-Efficient Instruction Tuning for Large Language Models
Title: Federated Data-Efficient Instruction Tuning for Large Language Models | Federated Data-Efficient Instruction Tuning für große Sprachmodelle | 大语言模式联邦数据效率指示图示 2410.10926v2 |
Authors (4): Zhen Qin, Zhaomin Wu, Bingsheng He, Shuiguang Deng
Instruction tuning is a crucial step in improving the responsiveness of pretrained large language models (LLMs) to human instructions. Federated learning (FL) helps to exploit the use of vast private instruction data from clients, becoming popular for LLM tuning by improving data diversity. Existing federated tuning simply consumes all local data, causing excessive computational overhead and overfitting to local data, while centralized data-efficient solutions are not suitable for FL due to privacy concerns. This work presents FedHDS, a federated data-efficient instruction tuning approach, which tunes LLMs with a representative subset of edge-side data. It reduces the data redundancy at both intra- and inter-client levels without sharing raw data. Experiments with various LLMs, datasets and partitions show that FedHDS improves Rouge-L on unseen tasks by an average of 10.72% over the SOTA full-data federated instruction tuning methods, while using less than 1.5% of the data samples, improving training efficiency by up to tens of times.
nan
Article 533
Title@2025-06-27 (5): EasyDistill: A Comprehensive Toolkit for Effective Knowledge Distillation of Large Language Models
Title: EasyDistill: A Comprehensive Toolkit for Effective Knowledge Distillation of Large Language Models | EasyDistill: Ein umfassendes Toolkit für effektive Wissensdestillation von großen Sprachmodellen | 简易蒸馏:大语言模式有效知识蒸馏综合工具箱 2505.20888v2 |
Authors (5): Chengyu Wang, Junbing Yan, Wenrui Cai, Yuanhao Yue, Jun Huang
In this paper, we present EasyDistill, a comprehensive toolkit designed for effective black-box and white-box knowledge distillation (KD) of large language models (LLMs). Our framework offers versatile functionalities, including data synthesis, supervised fine-tuning, ranking optimization, and reinforcement learning techniques specifically tailored for KD scenarios. The toolkit accommodates KD functionalities for both System 1 (fast, intuitive) and System 2 (slow, analytical) models. With its modular design and user-friendly interface, EasyDistill empowers researchers and industry practitioners to seamlessly experiment with and implement state-of-the-art KD strategies for LLMs. In addition, EasyDistill provides a series of robust distilled models and KD-based industrial solutions developed by us, along with the corresponding open-sourced datasets, catering to a variety of use cases. Furthermore, we describe the seamless integration of EasyDistill into Alibaba Cloud’s Platform for AI (PAI). Overall, the EasyDistill toolkit makes advanced KD techniques for LLMs more accessible and impactful within the NLP community.
nan
Article 534
Title@2025-06-27 (5): Analyzing and Fine-Tuning Whisper Models for Multilingual Pilot Speech Transcription in the Cockpit
Title: Analyzing and Fine-Tuning Whisper Models for Multilingual Pilot Speech Transcription in the Cockpit | Analysieren und Feintuning-Flüsternmodelle für mehrsprachige Pilot-Sprachtranskription im Cockpit | 分析并精精精细调校车舱多语种试验性语音翻译多语种试听模式 2506.21990v1 |
Authors (3): Kartheek Kumar Reddy Nareddy, Sarah Ternus, Julia Niebling
The developments in transformer encoder-decoder architectures have led to significant breakthroughs in machine translation, Automatic Speech Recognition (ASR), and instruction-based chat machines, among other applications. The pre-trained models were trained on vast amounts of generic data over a few epochs (fewer than five in most cases), resulting in their strong generalization capabilities. Nevertheless, the performance of these models suffers when they are applied to niche domains such as transcribing pilot speech in the cockpit, which involves a lot of specific vocabulary and multilingual conversations. This paper investigates and improves the transcription accuracy of cockpit conversations with Whisper models. We have collected around 85 minutes of cockpit simulator recordings and 130 minutes of interview recordings with pilots and manually labeled them. The speakers are middle-aged men speaking both German and English. To improve the accuracy of transcriptions, we propose multiple normalization schemes to refine the transcripts and improve Word Error Rate (WER). We then employ fine-tuning to enhance ASR performance, utilizing parameter-efficient fine-tuning with Low-Rank Adaptation (LoRA). As a result, WER decreased from 68.49% (pretrained Whisper Large model without normalization, the baseline) to 26.26% (fine-tuned Whisper Large model with the proposed normalization scheme).
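How transcript normalization interacts with WER can be shown with a small sketch. The normalization here (lowercasing and stripping punctuation) is a hypothetical stand-in, not one of the schemes proposed in the paper, and the example sentences are invented; WER itself is the standard word-level edit distance divided by the reference length.

```python
import re
from typing import List


def normalize(text: str) -> List[str]:
    """Hypothetical normalization: lowercase, drop punctuation, split on whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / max(len(ref), 1)


# Invented example: one substituted word out of three reference words.
score = wer("Flaps one, please.", "flaps on please")
print(round(score, 3))
```

Without the normalization step, casing and punctuation differences alone would count as word errors, which is one reason a normalization scheme can lower the reported WER independently of any fine-tuning.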
nan
Article 535
Title@2025-06-27 (5): BeamLLM: Vision-Empowered mmWave Beam Prediction with Large Language Models
Title: BeamLLM: Vision-Empowered mmWave Beam Prediction with Large Language Models | BeamLLM: Vision-Empowered mmWave Beam Prediction mit großen Sprachmodellen | BeamLLM: 具有大语言模型的视觉-电子动力毫米 2503.10432v2 |
Authors (5): Can Zheng, Jiguang He, Guofa Cai, Zitong Yu, Chung G. Kang
In this paper, we propose BeamLLM, a vision-aided millimeter-wave (mmWave) beam prediction framework leveraging large language models (LLMs) to address the challenges of high training overhead and latency in mmWave communication systems. By combining computer vision (CV) with LLMs’ cross-modal reasoning capabilities, the framework extracts user equipment (UE) positional features from RGB images and aligns visual-temporal features with LLMs’ semantic space through reprogramming techniques. Evaluated on a realistic vehicle-to-infrastructure (V2I) scenario, the proposed method achieves 61.01% top-1 accuracy and 97.39% top-3 accuracy in standard prediction tasks, significantly outperforming traditional deep learning models. In few-shot prediction scenarios, the performance degradation is limited to 12.56% (top-1) and 5.55% (top-3) from time sample 1 to 10, demonstrating superior prediction capability.
nan
Article 536
Title@2025-06-27 (5): Don’t Trust Generative Agents to Mimic Communication on Social Networks Unless You Benchmarked their Empirical Realism
Title: Don’t Trust Generative Agents to Mimic Communication on Social Networks Unless You Benchmarked their Empirical Realism | Vertrauen Sie Generative Agents nicht auf die Kommunikation über soziale Netzwerke, es sei denn, Sie haben ihren Empirischen Realismus Benchmarking | 不要相信社会网络移动通信的创造者,除非以其经验现实主义为基准。 2506.21974v1 |
Authors (3): Simon Münker, Nils Schwager, Achim Rettinger
The ability of Large Language Models (LLMs) to mimic human behavior triggered a plethora of computational social science research, assuming that empirical studies of humans can be conducted with AI agents instead. Since there have been conflicting research findings on whether and when this hypothesis holds, there is a need to better understand the differences in their experimental designs. We focus on replicating the behavior of social network users with the use of LLMs for the analysis of communication on social networks. First, we provide a formal framework for the simulation of social networks, before focusing on the sub-task of imitating user communication. We empirically test different approaches to imitate user behavior on X in English and German. Our findings suggest that social simulations should be validated by their empirical realism measured in the setting in which the simulation components were fitted. With this paper, we argue for more rigor when applying generative-agent-based modeling for social simulation.
nan
Article 537
Title@2025-06-27 (5): STAIR: Improving Safety Alignment with Introspective Reasoning
Title: STAIR: Improving Safety Alignment with Introspective Reasoning | STAIR: Verbesserung der Sicherheitsausrichtung mit introspektiver Begründung | STAIR: 提高安全一致性,以内反省理由 2502.02384v2 |
Authors (10): Yichi Zhang, Siyuan Zhang, Yao Huang, Zeyu Xia, Zhengwei Fang, Xiao Yang, Ranjie Duan, Dong Yan, Yinpeng Dong, Jun Zhu
Ensuring the safety and harmlessness of Large Language Models (LLMs) has become as critical as their performance in applications. However, existing safety alignment methods typically suffer from safety-performance trade-offs and susceptibility to jailbreak attacks, primarily due to their reliance on direct refusals for malicious queries. In this paper, we propose STAIR, a novel framework that integrates SafeTy Alignment with Introspective Reasoning. We enable LLMs to identify safety risks through step-by-step analysis by self-improving chain-of-thought (CoT) reasoning with safety awareness. STAIR first equips the model with a structured reasoning capability and then advances safety alignment via iterative preference optimization on step-level reasoning data generated using our newly proposed Safety-Informed Monte Carlo Tree Search (SI-MCTS). We further train a process reward model on this data to guide test-time searches for improved responses. Extensive experiments show that STAIR effectively mitigates harmful outputs while better preserving helpfulness, compared to instinctive alignment strategies. With test-time scaling, STAIR achieves a safety performance comparable to Claude-3.5 against popular jailbreak attacks. Relevant resources in this work are available at https://github.com/thu-ml/STAIR.
nan
Article 538
Title@2025-06-27 (5): Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses
Title: Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses | Verbesserung der Strategien des Jailbreaks: Ein hybrider Ansatz, um LLM-Verletzungen auszunutzen und moderne Verteidigungen zu umgehen | 推进破牢战略:利用LLM脆弱性和绕过现代防御的混合办法 2506.21972v1 |
Authors (6): Mohamed Ahmed, Mohamed Abdelmouty, Mingyu Kim, Gunvanth Kandula, Alex Park, James C. Davis
The advancement of Pre-Trained Language Models (PTLMs) and Large Language Models (LLMs) has led to their widespread adoption across diverse applications. Despite their success, these models remain vulnerable to attacks that exploit their inherent weaknesses to bypass safety measures. Two primary inference-phase threats are token-level and prompt-level jailbreaks. Token-level attacks embed adversarial sequences that transfer well to black-box models like GPT but leave detectable patterns and rely on gradient-based token optimization, whereas prompt-level attacks use semantically structured inputs to elicit harmful responses yet depend on iterative feedback that can be unreliable. To address the complementary limitations of these methods, we propose two hybrid approaches that integrate token- and prompt-level techniques to enhance jailbreak effectiveness across diverse PTLMs. GCG + PAIR and the newly explored GCG + WordGame hybrids were evaluated across multiple Vicuna and Llama models. GCG + PAIR consistently raised attack-success rates over its constituent techniques on undefended models; for instance, on Llama-3, its Attack Success Rate (ASR) reached 91.6%, a substantial increase from PAIR’s 58.4% baseline. Meanwhile, GCG + WordGame matched the raw performance of WordGame, maintaining a high ASR of over 80% even under stricter evaluators like Mistral-Sorry-Bench. Crucially, both hybrids retained transferability and reliably pierced advanced defenses such as Gradient Cuff and JBShield, which fully blocked single-mode attacks. These findings expose previously unreported vulnerabilities in current safety stacks, highlight trade-offs between raw success and defensive robustness, and underscore the need for holistic safeguards against adaptive adversaries.
nan
Article 539
Title@2025-06-27 (5): ShifCon: Enhancing Non-Dominant Language Capabilities with a Shift-based Multilingual Contrastive Framework
Title: ShifCon: Enhancing Non-Dominant Language Capabilities with a Shift-based Multilingual Contrastive Framework | ShifCon: Verbesserung nicht-dominanter Sprachfähigkeiten mit einem Shift-basierten mehrsprachigen Kontrastrahmen | Shifcon:利用基于轮班的多语言竞争框架,提高非主要语言能力 2410.19453v6 |
Authors (9): Hengyuan Zhang, Chenming Shang, Sizhe Wang, Dongdong Zhang, Yiyao Yu, Feng Yao, Renliang Sun, Yujiu Yang, Furu Wei
Although fine-tuning Large Language Models (LLMs) with multilingual data can rapidly enhance the multilingual capabilities of LLMs, they still exhibit a performance gap between the dominant language (e.g., English) and non-dominant ones due to the imbalance of training data across languages. To further enhance the performance of non-dominant languages, we propose ShifCon, a Shift-based multilingual Contrastive framework that aligns the internal forward process of other languages toward that of the dominant one. Specifically, it shifts the representations of non-dominant languages into the dominant language subspace, allowing them to access relatively rich information encoded in the model parameters. The enriched representations are then shifted back into their original language subspace before generation. Moreover, we introduce a subspace distance metric to pinpoint the optimal layer area for shifting representations and employ multilingual contrastive learning to further enhance the alignment of representations within this area. Experiments demonstrate that our ShifCon framework significantly enhances the performance of non-dominant languages, particularly for low-resource ones. Further analysis offers extra insights to verify the effectiveness of ShifCon and propel future research.
nan
Article 540
Title@2025-06-27 (5): More Vulnerable than You Think: On the Stability of Tool-Integrated LLM Agents
Title: More Vulnerable than You Think: On the Stability of Tool-Integrated LLM Agents | Schwacher als Sie denken: Zur Stabilität von werkzeugintegrierten LLM-Agenten | 比你想象的更加脆弱:关于工具集成LLM剂稳定问题 2506.21967v1 |
Authors (7): Weimin Xiong, Ke Wang, Yifan Song, Hanchao Liu, Sai Zhou, Wei Peng, Sujian Li
Current evaluations of tool-integrated LLM agents typically focus on end-to-end tool-usage evaluation while neglecting their stability. This limits their real-world applicability, as various internal or external factors can cause agents to crash or behave abnormally. Our research addresses this by investigating whether agents are vulnerable to errors throughout the entire tool invocation process, including reading tool documentation, selecting tools and generating parameters, and processing the tool’s response. Through extensive experiments, we observe that agents are highly susceptible to errors at each stage and agents based on open-source models are more vulnerable than those based on proprietary models. We also find that increasing the model size does not significantly improve tool invocation reasoning and may make agents more vulnerable to attacks resembling normal user instructions. This highlights the importance of evaluating agent stability and offers valuable insights for future LLM development and evaluation.
nan
Article 541
Title@2025-06-27 (5): Using Large Language Models to Suggest Informative Prior Distributions in Bayesian Statistics
Title: Using Large Language Models to Suggest Informative Prior Distributions in Bayesian Statistics | Große Sprachmodelle verwenden, um informative vorherige Distributionen in Bayesian Statistics vorzuschlagen | Bayesian统计中利用大语言模型建议事先知情分配 2506.21964v1 |
Authors (4): Michael A. Riegler, Kristoffer Herland Hellton, Vajira Thambawita, Hugo L. Hammer
Selecting prior distributions in Bayesian statistics is challenging, resource-intensive, and subjective. We analyze the use of large language models (LLMs) to suggest suitable, knowledge-based informative priors. We developed an extensive prompt asking LLMs not only to suggest priors but also to verify and reflect on their choices. We evaluated Claude Opus, Gemini 2.5 Pro, and ChatGPT-4o-mini on two real datasets: heart disease risk and concrete strength. All LLMs correctly identified the direction for all associations (e.g., that heart disease risk is higher for males). The quality of suggested priors was measured by their Kullback-Leibler divergence from the maximum likelihood estimator’s distribution. The LLMs suggested both moderately and weakly informative priors. The moderate priors were often overconfident, resulting in distributions misaligned with the data. In our experiments, Claude and Gemini provided better priors than ChatGPT. For weakly informative priors, a key performance difference emerged: ChatGPT and Gemini defaulted to an “unnecessarily vague” mean of 0, while Claude did not, demonstrating a significant advantage. The ability of LLMs to identify correct associations shows their great potential as an efficient, objective method for developing informative priors. However, the primary challenge remains in calibrating the width of these priors to avoid over- and under-confidence.
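The effect of an overconfident prior on a KL-based quality score can be illustrated with the closed form for two univariate normal distributions. This is a sketch under assumptions: the abstract does not say that the priors or the estimator's distribution are normal, nor which direction of the divergence was used, and all numbers below are invented.

```python
import math


def kl_normal(mu_p: float, sigma_p: float, mu_q: float, sigma_q: float) -> float:
    """KL(P || Q) for P = N(mu_p, sigma_p^2) and Q = N(mu_q, sigma_q^2)."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sigma_q ** 2)
            - 0.5)


# Hypothetical estimator distribution for one coefficient (assumed normal).
mle_mu, mle_sigma = 0.5, 0.2

# Two hypothetical priors centred at 0: one very narrow, one wide.
narrow = kl_normal(mle_mu, mle_sigma, mu_q=0.0, sigma_q=0.05)
wide = kl_normal(mle_mu, mle_sigma, mu_q=0.0, sigma_q=1.0)
print(round(narrow, 2), round(wide, 2))
```

In this toy setting the narrow prior is penalized far more heavily than the wide one for the same offset in the mean, which matches the abstract's observation that calibrating prior width is the main difficulty.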
nan
Article 542
Title@2025-06-27 (5): PapersPlease: A Benchmark for Evaluating Motivational Values of Large Language Models Based on ERG Theory
Title: PapersPlease: A Benchmark for Evaluating Motivational Values of Large Language Models Based on ERG Theory | PapersPlease: Ein Benchmark für die Bewertung von Motivationswerten von großen Sprachmodellen basierend auf der ERG-Theorie | 请文件:根据紧急和紧急和紧急需要理论评价大语言模式动力价值的基准 2506.21961v1 |
Authors (5): Junho Myung, Yeon Su Park, Sunwoo Kim, Shin Yoo, Alice Oh
Evaluating the performance and biases of large language models (LLMs) through role-playing scenarios is becoming increasingly common, as LLMs often exhibit biased behaviors in these contexts. Building on this line of research, we introduce PapersPlease, a benchmark consisting of 3,700 moral dilemmas designed to investigate LLMs’ decision-making in prioritizing various levels of human needs. In our setup, LLMs act as immigration inspectors deciding whether to approve or deny entry based on the short narratives of people. These narratives are constructed using the Existence, Relatedness, and Growth (ERG) theory, which categorizes human needs into three hierarchical levels. Our analysis of six LLMs reveals statistically significant patterns in decision-making, suggesting that LLMs encode implicit preferences. Additionally, our evaluation of the impact of incorporating social identities into the narratives shows varying responsiveness based on both motivational needs and identity cues, with some models exhibiting higher denial rates for marginalized identities. All data is publicly available at https://github.com/yeonsuuuu28/papers-please.
nan
Article 543
Title@2025-06-27 (5): EUR-USD Exchange Rate Forecasting Based on Information Fusion with Large Language Models and Deep Learning Methods
Title: EUR-USD Exchange Rate Forecasting Based on Information Fusion with Large Language Models and Deep Learning Methods | EUR-USD Wechselkursprognose basierend auf Informationsfusion mit großen Sprachmodellen und Deep-Learning-Methoden | 基于与大语言模式和深学习方法信息融合的信息的汇率预测 2408.13214v2 |
Authors (5): Hongcheng Ding, Xuanze Zhao, Ruiting Deng, Shamsul Nahar Abdullah, Deshinta Arrova Dewi
Accurate forecasting of the EUR/USD exchange rate is crucial for investors, businesses, and policymakers. This paper proposes a novel framework, IUS, that integrates unstructured textual data from news and analysis with structured data on exchange rates and financial indicators to enhance exchange rate prediction. The IUS framework employs large language models for sentiment polarity scoring and exchange rate movement classification of texts. These textual features are combined with quantitative features and input into a Causality-Driven Feature Generator. An Optuna-optimized Bi-LSTM model is then used to forecast the EUR/USD exchange rate. Experiments demonstrate that the proposed method outperforms benchmark models, reducing MAE by 10.69% and RMSE by 9.56% compared to the best performing baseline. Results also show the benefits of data fusion, with the combination of unstructured and structured data yielding higher accuracy than structured data alone. Furthermore, feature selection using the top 12 important quantitative features combined with the textual features proves most effective. The proposed IUS framework and Optuna-Bi-LSTM model provide a powerful new approach for exchange rate forecasting through multi-source data integration.
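The reported error reductions are relative changes in MAE and RMSE against a baseline. A minimal sketch of those quantities follows; the function names and the toy numbers are illustrative assumptions and are unrelated to the paper's data.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def relative_reduction(baseline_error: float, model_error: float) -> float:
    """Percentage by which model_error is lower than baseline_error."""
    return 100.0 * (baseline_error - model_error) / baseline_error


# Toy values, chosen only to make the arithmetic easy to follow.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]
print(mae(y_true, y_pred), rmse(y_true, y_pred))
print(relative_reduction(2.0, 1.5))
```

RMSE squares each error before averaging, so it weights large misses more heavily than MAE does; reporting both, as the abstract does, shows whether a gain comes from fewer large errors or from a uniform improvement.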
nan
Article 544
Title@2025-06-27 (5): A Survey of Large Language Models in Psychotherapy: Current Landscape and Future Directions
Title: A Survey of Large Language Models in Psychotherapy: Current Landscape and Future Directions | Eine Umfrage zu großen Sprachmodellen in der Psychotherapie: Aktuelle Landschaft und zukünftige Richtungen | 心理治疗中大语言模式调查:当前景观和未来方向 2502.11095v3 |
Authors (9): Hongbin Na, Yining Hua, Zimu Wang, Tao Shen, Beibei Yu, Lilin Wang, Wei Wang, John Torous, Ling Chen
Mental health is increasingly critical in contemporary healthcare, with psychotherapy demanding dynamic, context-sensitive interactions that traditional NLP methods struggle to capture. Large Language Models (LLMs) offer significant potential for addressing this gap due to their ability to handle extensive context and multi-turn reasoning. This review introduces a conceptual taxonomy dividing psychotherapy into interconnected stages (assessment, diagnosis, and treatment) to systematically examine LLM advancements and challenges. Our comprehensive analysis reveals imbalances in current research, such as a focus on common disorders, linguistic biases, fragmented methods, and limited theoretical integration. We identify critical challenges including capturing dynamic symptom fluctuations, overcoming linguistic and cultural biases, and ensuring diagnostic reliability. Highlighting future directions, we advocate for continuous multi-stage modeling, real-time adaptive systems grounded in psychological theory, and diversified research covering broader mental disorders and therapeutic approaches, aiming toward more holistic and clinically integrated psychotherapy LLM systems.
nan
Article 545
Title@2025-06-27 (5): Dynamic Adaptive Rank Space Exploration for Efficient Sentiment Analysis with Large Language Models
Title: Dynamic Adaptive Rank Space Exploration for Efficient Sentiment Analysis with Large Language Models | Dynamische adaptive Rank Space Exploration für effiziente Sentiment-Analyse mit großen Sprachmodellen | 利用大语言模型进行高效情感分析的空间探索 2410.16589v2 |
Authors (6): Hongcheng Ding, Fuzhen Hu, Ruiting Deng, Xuanze Zhao, Shamsul Nahar Abdullah, Deshinta Arrova Dewi
Sentiment analysis has become increasingly important for assessing public opinion and informing decision-making. Large language models (LLMs) have revolutionized this field by capturing nuanced language patterns. However, adapting LLMs to domain-specific sentiment analysis tasks remains challenging due to computational constraints and the need for optimal fine-tuning. To address these challenges, we propose a novel Dynamic Adaptive Rank Space Exploration (DARSE) framework for efficient and effective sentiment analysis using LLMs. DARSE consists of a coarse-grained greedy algorithm to identify the optimal rank range, a fine-grained exploration algorithm to refine rank selection, and a dynamic rank allocation method to determine the optimal rank combination for each LLM layer. Extensive experiments demonstrate that DARSE significantly improves sentiment analysis accuracy, achieving a 15.1% improvement in MSE and a 4.3% improvement in accuracy compared to previous work. Our framework strikes a balance between computational efficiency and model performance, making it a promising approach for sentiment analysis with LLMs.
nan
Article 546
Title@2025-06-27 (5): Embedding-based Approaches to Hyperpartisan News Detection
Title: Embedding-based Approaches to Hyperpartisan News Detection | Einbetten-basierte Ansätze zu Hyperparteien-Nachrichten-Erkennung | 以嵌入式办法探测超党派新闻 2501.01370v2 |
Authors (2): Karthik Mohan, Pengyu Chen
In this paper, we describe our systems, whose objective is to determine whether a given news article can be considered hyperpartisan. Hyperpartisan news is news that takes an extremely polarized political standpoint with the intention of creating political divide among the public. We attempted several approaches, including n-grams, sentiment analysis, and sentence and document representations using pre-trained ELMo. Our best system, using pre-trained ELMo with a Bidirectional LSTM, achieved an accuracy of 83% through 10-fold cross-validation without much hyperparameter tuning.
nan
Article 547
Title@2025-06-27 (5): PQ-GCN: Enhancing Text Graph Question Classification with Phrase Features
Title: PQ-GCN: Enhancing Text Graph Question Classification with Phrase Features | PQ-GCN: Verbesserung der Textgraphen-Frageklassifikation mit Phrase-Features | PQ-GCN:用词组特征加强文本图问题分类 2409.02481v3 |
Authors (4): Junyoung Lee, Ninad Dixit, Kaustav Chakrabarti, S. Supraja
Effective question classification is crucial for AI-driven educational tools, enabling adaptive learning systems to categorize questions by skill area, difficulty level, and competence. It not only supports educational diagnostics and analytics but also enhances complex downstream tasks like information retrieval and question answering by associating questions with relevant categories. Traditional methods, often based on word embeddings and conventional classifiers, struggle to capture the nuanced relationships in question statements, leading to suboptimal performance. We propose a novel approach leveraging graph convolutional networks, named Phrase Question-Graph Convolutional Network (PQ-GCN). Through PQ-GCN, we evaluate the incorporation of phrase-based features to enhance classification performance on question datasets of various domains and characteristics. The proposed method, augmented with phrase-based features, outperforms baseline graph-based methods in low-resource settings, and performs competitively against language model-based methods with a fraction of their parameter size. Our findings offer a possible solution for more context-aware, parameter-efficient question classification, bridging the gap between graph neural network research and its educational applications.
nan
Article 548
Title@2025-06-27 (5): LRP4RAG: Detecting Hallucinations in Retrieval-Augmented Generation via Layer-wise Relevance Propagation
Title: LRP4RAG: Detecting Hallucinations in Retrieval-Augmented Generation via Layer-wise Relevance Propagation | LRP4RAG: Halluzinationen in der retrieval-angereicherten Generation mittels schichtweiser Relevanzvermehrung erkennen | LRP4RAG:通过多层相关性传导探测回溯性养殖中的幻觉 2408.15533v3 |
Authors (4): Haichuan Hu, Congqing He, Xiaochen Xie, Quanjun Zhang
Retrieval-Augmented Generation (RAG) has become a primary technique for mitigating hallucinations in large language models (LLMs). However, incomplete knowledge extraction and insufficient understanding can still mislead LLMs to produce irrelevant or even contradictory responses, which means hallucinations persist in RAG. In this paper, we propose LRP4RAG, a method based on the Layer-wise Relevance Propagation (LRP) algorithm for detecting hallucinations in RAG. Specifically, we first utilize LRP to compute the relevance between the input and output of the RAG generator. We then apply further extraction and resampling to the relevance matrix. The processed relevance data are input into multiple classifiers to determine whether the output contains hallucinations. To the best of our knowledge, this is the first time that LRP has been used for detecting RAG hallucinations, and extensive experiments demonstrate that LRP4RAG outperforms existing baselines.
nan
Article 549
Title@2025-06-27 (5): Dynamic Adaptive Optimization for Effective Sentiment Analysis Fine-Tuning on Large Language Models
Title: Dynamic Adaptive Optimization for Effective Sentiment Analysis Fine-Tuning on Large Language Models | Dynamische Adaptive Optimierung für effektive Sentimentanalyse Feintuning bei großen Sprachmodellen | 动态优化优化,对大语言模型进行有效的感性分析,对大语言模型进行微调 2408.11856v3 |
Authors (6): Hongcheng Ding, Xuanze Zhao, Ruiting Deng, Shamsul Nahar Abdullah, Deshinta Arrova Dewi, Zixiao Jiang
Sentiment analysis plays a crucial role in various domains, such as business intelligence and financial forecasting. Large language models (LLMs) have become a popular paradigm for sentiment analysis, leveraging multi-task learning to address specific tasks concurrently. However, LLMs fine-tuned for sentiment analysis often underperform due to the inherent challenges in managing diverse task complexities. Moreover, constant-weight approaches in multi-task learning struggle to adapt to variations in data characteristics, further complicating model effectiveness. To address these issues, we propose a novel multi-task learning framework with a dynamic adaptive optimization (DAO) module. This module is designed as a plug-and-play component that can be seamlessly integrated into existing models, providing an effective and flexible solution for multi-task learning. The key component of the DAO module is dynamic adaptive loss, which dynamically adjusts the weights assigned to different tasks based on their relative importance and data characteristics during training. Sentiment analyses on a standard and a customized financial text dataset demonstrate that the proposed framework achieves superior performance. Specifically, this work improves the Mean Squared Error (MSE) and Accuracy (ACC) by 15.58% and 1.24% respectively, compared with previous work.
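The abstract does not give the formula for the dynamic adaptive loss. As a rough illustration of the general idea of re-weighting task losses during training, the sketch below uses Dynamic Weight Average-style weighting (a softmax over each task's recent loss ratio). This is a swapped-in, widely used scheme and not the paper's DAO module; the function name `loss_ratio_weights` and the `temperature` default are assumptions made for the example.

```python
import math
from typing import List, Sequence


def loss_ratio_weights(
    prev_losses: Sequence[float],
    prev_prev_losses: Sequence[float],
    temperature: float = 2.0,  # assumed default; higher values flatten the weights
) -> List[float]:
    """Return one weight per task, summing to the number of tasks.

    A task whose loss fell more slowly (ratio closer to or above 1)
    receives a larger weight in the next step. Hypothetical helper,
    not taken from the paper.
    """
    if len(prev_losses) != len(prev_prev_losses):
        raise ValueError("both loss histories must cover the same tasks")
    if any(loss <= 0 for loss in prev_prev_losses):
        raise ValueError("earlier losses must be positive")

    # Ratio of the most recent loss to the one before it, per task.
    ratios = [p / pp for p, pp in zip(prev_losses, prev_prev_losses)]
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    n_tasks = len(ratios)
    return [n_tasks * e / total for e in exps]


def weighted_total(task_losses: Sequence[float], weights: Sequence[float]) -> float:
    """Combine per-task losses into the single scalar that is optimised."""
    return sum(w * l for w, l in zip(weights, task_losses))
```

A constant-weight baseline corresponds to always passing weights of 1.0; the dynamic variant recomputes the weights from the two most recent loss values of each task.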
nan
Article 550
Title@2025-06-27 (5): ARAG: Agentic Retrieval Augmented Generation for Personalized Recommendation
Title: ARAG: Agentic Retrieval Augmented Generation for Personalized Recommendation | ARAG: Agentische Retrieval Augmented Generation für Personalisierte Empfehlung | ARAG: 个人化推荐的 “ 危险回收增加的一代人 “ 2506.21931v1 |
Authors (10): Reza Yousefi Maragheh, Pratheek Vadla, Priyank Gupta, Kai Zhao, Aysenur Inan, Kehui Yao, Jianpeng Xu, Praveen Kanumala, Jason Cho, Sushant Kumar
Retrieval-Augmented Generation (RAG) has shown promise in enhancing recommendation systems by incorporating external context into large language model prompts. However, existing RAG-based approaches often rely on static retrieval heuristics and fail to capture nuanced user preferences in dynamic recommendation scenarios. In this work, we introduce ARAG, an Agentic Retrieval-Augmented Generation framework for Personalized Recommendation, which integrates a multi-agent collaboration mechanism into the RAG pipeline. To better understand the long-term and session behavior of the user, ARAG leverages four specialized LLM-based agents: a User Understanding Agent that summarizes user preferences from long-term and session contexts, a Natural Language Inference (NLI) Agent that evaluates semantic alignment between candidate items retrieved by RAG and inferred intent, a Context Summary Agent that summarizes the findings of the NLI Agent, and an Item Ranker Agent that generates a ranked list of recommendations based on contextual fit. We evaluate ARAG across three datasets. Experimental results demonstrate that ARAG significantly outperforms standard RAG and recency-based baselines, achieving up to 42.1% improvement in NDCG@5 and 35.5% in Hit@5. We also conduct an ablation study to analyse the effect of the different components of ARAG. Our findings highlight the effectiveness of integrating agentic reasoning into retrieval-augmented recommendation and provide new directions for LLM-based personalization.
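For readers unfamiliar with the two reported metrics, the sketch below computes Hit@k and NDCG@k for a single ranked list. It assumes binary relevance and a ranked list without duplicate items; the abstract does not say which relevance definition or evaluation code the authors used, so the function names and signatures here are illustrative only.

```python
import math
from typing import Sequence, Set


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """1.0 if any relevant item appears in the top k positions, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Normalised discounted cumulative gain with binary relevance.

    Position 0 gets discount 1/log2(2) = 1, position 1 gets 1/log2(3), etc.
    """
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    # Ideal ranking places every relevant item (up to k) at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

Both functions score one user's list; a dataset-level figure such as those quoted above would normally be the mean over all evaluated users.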
nan
Article 551
Title@2025-06-27 (5): HyReC: Exploring Hybrid-based Retriever for Chinese
Title: HyReC: Exploring Hybrid-based Retriever for Chinese | HyReC: Hybrid-basiertes Retriever für Chinesen erforschen | HyReC: 探索以混合方式为中国人寻找 2506.21913v1 |
Authors (5): Zunran Wang, Zheng Shenpeng, Wang Shenglan, Minghui Zhao, Zhonghua Li
Hybrid-based retrieval methods, which unify dense-vector and lexicon-based retrieval, have garnered considerable attention in industry because of the performance gains they deliver. However, despite their promising results, the application of these hybrid paradigms in Chinese retrieval contexts has remained largely underexplored. In this paper, we introduce HyReC, an innovative end-to-end optimization method tailored specifically for hybrid-based retrieval in Chinese. HyReC enhances performance by integrating the semantic union of terms into the representation model. Additionally, it features the Global-Local-Aware Encoder (GLAE) to promote consistent semantic sharing between lexicon-based and dense retrieval while minimizing the interference between them. To further refine alignment, we incorporate a Normalization Module (NM) that fosters mutual benefits between the retrieval approaches. Finally, we evaluate HyReC on the C-MTEB retrieval benchmark to demonstrate its effectiveness.
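To make the notion of hybrid retrieval concrete, the sketch below shows a generic late-fusion baseline: dense and lexical scores are min-max normalised per query and then mixed with a weight `alpha`. This is not HyReC's end-to-end method, GLAE, or its Normalization Module, none of which are specified in the abstract; the function names and the `alpha` parameter are assumptions.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant score list maps to all zeros."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(
    dense: Dict[str, float],
    lexical: Dict[str, float],
    alpha: float = 0.5,  # assumed mixing weight: 1.0 = dense only, 0.0 = lexical only
) -> List[Tuple[str, float]]:
    """Rank documents by a weighted sum of normalised dense and lexical scores.

    A document missing from one retriever contributes 0 for that retriever.
    Ties are broken by document id so the output is deterministic.
    """
    d, l = min_max(dense), min_max(lexical)
    fused = {
        doc: alpha * d.get(doc, 0.0) + (1.0 - alpha) * l.get(doc, 0.0)
        for doc in set(d) | set(l)
    }
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
```

Normalising before mixing matters because dense similarities and lexical scores such as BM25 live on different scales; without it, one retriever tends to dominate the fused ranking.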
nan
Article 552
Title@2025-06-27 (5): AutoMixer: Checkpoint Artifacts as Automatic Data Mixers
Title: AutoMixer: Checkpoint Artifacts as Automatic Data Mixers | AutoMixer: Checkpoint-Artefakte als automatische Datenmischer | 自动混音器: 将检查点异形作为自动数据混音器 2506.21910v1 |
Authors (6): Ernie Chang, Yang Li, Patrick Huber, David Kant, Yangyang Shi, Vikas Chandra
In language model training, it is desirable to equip models with capabilities from various tasks. However, it is not clear how to directly obtain the right data mixtures for these capabilities, as the relationship between data and tasks is difficult to model. In this work, we observe that checkpoint models exhibit emerging capabilities at different points in the training trajectory. Often, the training process saves checkpoints as artifacts that are under-utilized as a source of in-training data signals. We identify these artifact models based on their respective capabilities on the benchmarks and leverage them as data mixers by using their aggregated first-order influence approximation over source data. We demonstrate on eight reasoning benchmarks that the proposed framework shows significant improvements in the pretraining setting, with performance improvements of up to 1.93%. Overall, this shows the potential of checkpoint models to enhance data quality and optimize data mixtures.
nan
Article 553
Title@2025-06-27 (5): Collective Reasoning Among LLMs: A Framework for Answer Validation Without Ground Truth
Title: Collective Reasoning Among LLMs: A Framework for Answer Validation Without Ground Truth | Kollektive Begründung unter LLMs: Ein Rahmen für die Validierung von Antworten ohne Grundwahrheit | LLM女士的集体理由:无事实根据的回答验证框架 2502.20758v2 |
Authors (4): Seyed Pouyan Mousavi Davoudi, Amin Gholami Davodi, Alireza Amiri-Margavi, Mahdi Jafari
We introduce a new approach in which several advanced large language models (specifically GPT-4-0125-preview, Meta-LLAMA-3-70B-Instruct, Claude-3-Opus, and Gemini-1.5-Flash) collaborate to both produce and answer intricate, doctoral-level probability problems without relying on any single “correct” reference. Rather than depending on an established ground truth, our investigation focuses on how agreement among diverse models can signal the reliability of their outputs and, by extension, reflect the overall quality of the generated questions. To measure this inter-model alignment, we apply a suite of statistical evaluations, including chi-square tests, Fleiss’ Kappa coefficients, and confidence interval calculations, thereby capturing both precision in answers and clarity in question phrasing. Our analysis reveals that Claude and Gemini tend to frame questions more coherently and unambiguously, which is evidenced by their tighter confidence intervals and greater concordance with responding agents. In contrast, LLAMA exhibits wider confidence bands and a lower level of agreement, indicating more variability and reduced consistency in its question formulations. These observations support the notion that a multi-model collaborative strategy not only improves answer dependability but also offers an effective, data-driven mechanism for evaluating and refining question quality when no definitive solution exists. Ultimately, this work delivers actionable insights into enhancing AI-guided reasoning processes through coordinated interactions among heterogeneous language models.
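Fleiss' Kappa, one of the agreement statistics named above, can be computed from a table of how many raters chose each answer category per question. The sketch below implements the standard formula; treating each model as a rater and each distinct answer option as a category is an assumption about how such a study could be laid out, not a description of the authors' code.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    that assigned item i to category j. Every item must have the same
    number of raters (at least two)."""
    if not counts:
        raise ValueError("need at least one item")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fell into each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_bar - p_e) / (1.0 - p_e)
```

A value of 1 means the raters always agree, 0 means agreement is no better than chance, and negative values indicate systematic disagreement.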
nan
Article 554
Title@2025-06-27 (5): Round Attention: A Novel Round-Level Attention Mechanism to Accelerate LLM Inference
Title: Round Attention: A Novel Round-Level Attention Mechanism to Accelerate LLM Inference | Runde Aufmerksamkeit: Ein neuartiger Aufmerksamkeitsmechanismus auf runder Ebene, um die LLM-Inferenz zu beschleunigen | 圆桌关注:加速LLM推断的新一轮圆桌关注机制 2502.15294v3 |
Authors (7): Yaohua Tang, Zhicheng Hu, Kun Cheng, Fan Mo, Qiheng Lv, Hua Wang, Zhi Chen
The increasing context window size in large language models (LLMs) has improved their ability to handle complex, long-text tasks. However, as the conversation rounds continue, a large amount of KV cache must be stored in GPU memory, which significantly affects the efficiency and even the availability of model serving systems. This paper analyzes dialogue data from real users at the granularity of rounds and discovers that LLM inference manifests a watershed layer, after which the distribution of round-level attention shows notable similarity. Based on this, we propose Round Attention, a novel round-level attention mechanism that selectively processes the KV cache of the top-k relevant rounds, where k is dynamically determined through the attention matrix in the watershed layer. Theoretical analysis demonstrates that our method reduces memory usage by 54% to 82%, while experimental results confirm that loading the sparse critical-round KV cache maintains answer accuracy without performance degradation.
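The abstract says that k is determined dynamically from the attention matrix in the watershed layer but does not give the rule. The sketch below assumes one plausible rule purely for illustration: keep the fewest rounds whose summed attention reaches a fixed share `mass` of the total. The function name, the `mass` parameter, and the per-round attention input are all hypothetical.

```python
from typing import List, Sequence


def select_rounds(round_attention: Sequence[float], mass: float = 0.9) -> List[int]:
    """Return the indices of the smallest set of rounds whose attention
    adds up to at least `mass` of the total attention.

    round_attention[i] is the (non-negative) attention the current query
    places on dialogue round i, e.g. summed over that round's tokens.
    """
    total = sum(round_attention)
    if total <= 0:
        return []
    # Visit rounds from most to least attended.
    order = sorted(
        range(len(round_attention)),
        key=lambda i: round_attention[i],
        reverse=True,
    )
    chosen: List[int] = []
    covered = 0.0
    for i in order:
        chosen.append(i)
        covered += round_attention[i]
        if covered >= mass * total:
            break
    # Restore chronological order so the kept KV blocks stay in sequence.
    return sorted(chosen)
```

Under this rule k is simply `len(select_rounds(...))`, so it grows when attention is spread across many rounds and shrinks when a few rounds dominate.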
nan
Article 555
Title@2025-06-27 (5): A Dual-Layered Evaluation of Geopolitical and Cultural Bias in LLMs
Title: A Dual-Layered Evaluation of Geopolitical and Cultural Bias in LLMs | Eine Dual-Layer-Bewertung von geopolitischen und kulturellen Bias in LLMs | 对LLM中地缘政治和文化偏见的双重评价 2506.21881v1 |
Authors (2): Sean Kim, Hyuhng Joon Kim
As large language models (LLMs) are increasingly deployed across diverse linguistic and cultural contexts, understanding their behavior in both factual and disputable scenarios is essential, especially when their outputs may shape public opinion or reinforce dominant narratives. In this paper, we define two types of bias in LLMs: model bias (bias stemming from model training) and inference bias (bias induced by the language of the query), through a two-phase evaluation. Phase 1 evaluates LLMs on factual questions where a single verifiable answer exists, assessing whether models maintain consistency across different query languages. Phase 2 expands the scope by probing geopolitically sensitive disputes, where responses may reflect culturally embedded or ideologically aligned perspectives. We construct a manually curated dataset spanning both factual and disputable QA, across four languages and question types. The results show that Phase 1 exhibits query language induced alignment, while Phase 2 reflects an interplay between the model’s training context and query language. This paper offers a structured framework for evaluating LLM behavior across neutral and sensitive topics, providing insights for future LLM deployment and culturally aware evaluation practices in multilingual contexts.
nan
Article 556
Title@2025-06-27 (5): Grammar and Gameplay-aligned RL for Game Description Generation with LLMs
Title: Grammar and Gameplay-aligned RL for Game Description Generation with LLMs | Grammatik und Gameplay-aligned RL für Game Description Generation mit LLMs | 使用 LLM 生成游戏描述生成的语法和游戏游戏比对RLRL 2503.15783v2 |
Authors (2): Tsunehiko Tanaka, Edgar Simo-Serra
Game Description Generation (GDG) is the task of generating a game description written in a Game Description Language (GDL) from natural language text. Previous studies have explored generation methods leveraging the contextual understanding capabilities of Large Language Models (LLMs); however, accurately reproducing the game features of the game descriptions remains a challenge. In this paper, we propose reinforcement learning-based fine-tuning of LLMs for GDG (RLGDG). Our training method simultaneously improves grammatical correctness and fidelity to game concepts by introducing both grammar rewards and concept rewards. Furthermore, we adopt a two-stage training strategy where Reinforcement Learning (RL) is applied following Supervised Fine-Tuning (SFT). Experimental results demonstrate that our proposed method significantly outperforms baseline methods using SFT alone. Our code is available at https://github.com/tsunehiko/rlgdg
nan
Article 557
Title@2025-06-27 (5): Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation
Title: Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation | Haben Vision-Sprache Modelle interne Weltmodelle? Auf dem Weg zu einer Atom-Bewertung | 愿景-语言模型有内部世界模型吗? 2506.21876v1 |
Authors (24): Qiyue Gao, Xinyu Pi, Kevin Liu, Junrong Chen, Ruolan Yang, Xinqi Huang, Xinyu Fang, Lu Sun, Gautham Kishore, Bo Ai, Stone Tao, Mengyang Liu, Jiaxi Yang, Chao-Jung Lai, Chuanyang Jin, Jiannan Xiang, Benhao Huang, Zeming Chen, David Danks, Hao Su, Tianmin Shu, Ziqiao Ma, Lianhui Qin, Zhiting Hu
Internal world models (WMs) enable agents to understand the world’s state and predict transitions, serving as the basis for advanced deliberative reasoning. Recent large Vision-Language Models (VLMs), such as OpenAI o3, GPT-4o and Gemini, exhibit potential as general-purpose WMs. While the latest studies have evaluated and shown limitations in specific capabilities such as visual understanding, a systematic evaluation of VLMs’ fundamental WM abilities remains absent. Drawing on comparative psychology and cognitive science, we propose a two-stage framework that assesses Perception (visual, spatial, temporal, quantitative, and motion) and Prediction (mechanistic simulation, transitive inference, compositional inference) to provide an atomic evaluation of VLMs as WMs. Guided by this framework, we introduce WM-ABench, a large-scale benchmark comprising 23 fine-grained evaluation dimensions across 6 diverse simulated environments with controlled counterfactual simulations. Through 660 experiments on 15 of the latest commercial and open-source VLMs, we find that these models exhibit striking limitations in basic world modeling abilities. For instance, almost all models perform at near-random accuracy when distinguishing motion trajectories. Additionally, they lack disentangled understanding; e.g., some models tend to believe blue objects move faster than green ones. Richer results and analyses reveal significant gaps between VLMs and human-level world modeling.
nan
Article 558
Title@2025-06-27 (5): WildSpeech-Bench: Benchmarking Audio LLMs in Natural Speech Conversation
Title: WildSpeech-Bench: Benchmarking Audio LLMs in Natural Speech Conversation | WildSpeech-Bench: Benchmarking von Audio-LLMs im natürlichen Sprachgespräch | WildSpeech-Bench:为自然演讲对话中的音频LMs设定基准 2506.21875v1 |
Authors (6): Jian Zhang, Linhao Zhang, Bokai Lei, Chuhan Wu, Wei Jia, Xiao Zhou
Recent multi-modal Large Language Models (LLMs) such as GPT-4o have demonstrated strong capabilities in direct speech interaction. However, the lack of specialized and comprehensive benchmarks for end-to-end speech LLM evaluation hinders optimizing the user experience of Audio LLMs in real-world applications. Existing evaluation methods often adapt text-based benchmarks, overlooking speech’s unique characteristics and challenges, including prosody, homophones, stuttering, and differing user expectations. Here, we present a novel approach to thoroughly evaluate LLMs in practical speech conversations. We systematically curate real-world chat data relevant to spoken scenarios, introduce diversity in speaker attributes and acoustic conditions, and augment the dataset with speech-specific phenomena. We further design a query-aware evaluation method that uses customized evaluation checklists and prompts to enhance the accuracy of automatic evaluation. We conduct comprehensive testing and detailed analysis of various mainstream speech models, revealing significant differences in model performance across different speech scenarios. The use of query-aware evaluation further enables a finer-grained assessment under various speech-specific scenarios. Our benchmark can provide valuable insights for speech model development and evaluation.
nan
Article 559
Title@2025-06-27 (5): Time is On My Side: Dynamics of Talk-Time Sharing in Video-chat Conversations
Title: Time is On My Side: Dynamics of Talk-Time Sharing in Video-chat Conversations | Zeit ist auf meiner Seite: Dynamik des Gesprächs-Zeit-Sharing in Video-Chat-Gesprächen | 时间就在我身边:视频聊天中的谈话时间分享动态 2506.20474v2 |
Authors (3): Kaixiang Zhang, Justine Zhang, Cristian Danescu-Niculescu-Mizil
An intrinsic aspect of every conversation is the way talk-time is shared between multiple speakers. Conversations can be balanced, with each speaker claiming a similar amount of talk-time, or imbalanced when one talks disproportionately. Such overall distributions are the consequence of continuous negotiations between the speakers throughout the conversation: who should be talking at every point in time, and for how long? In this work we introduce a computational framework for quantifying both the conversation-level distribution of talk-time between speakers, as well as the lower-level dynamics that lead to it. We derive a typology of talk-time sharing dynamics structured by several intuitive axes of variation. By applying this framework to a large dataset of video-chats between strangers, we confirm that, perhaps unsurprisingly, different conversation-level distributions of talk-time are perceived differently by speakers, with balanced conversations being preferred over imbalanced ones, especially by those who end up talking less. Then we reveal that – even when they lead to the same level of overall balance – different types of talk-time sharing dynamics are perceived differently by the participants, highlighting the relevance of our newly introduced typology. Finally, we discuss how our framework offers new tools to designers of computer-mediated communication platforms, for both human-human and human-AI communication.
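As a concrete picture of the two levels described above, the sketch below computes each speaker's share of talk-time for a whole conversation and for fixed-length windows within it. The turn format `(speaker, start_sec, end_sec)`, the fixed-window view of the lower-level dynamics, and the function names are assumptions; the abstract does not specify how the authors' framework or typology is computed.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Turn = Tuple[str, float, float]  # (speaker, start_sec, end_sec), assumed format


def talk_time_shares(turns: Sequence[Turn]) -> Dict[str, float]:
    """Fraction of total talk-time claimed by each speaker."""
    totals: Dict[str, float] = defaultdict(float)
    for speaker, start, end in turns:
        totals[speaker] += max(0.0, end - start)
    overall = sum(totals.values())
    if overall == 0:
        return {}
    return {speaker: t / overall for speaker, t in totals.items()}


def windowed_shares(turns: Sequence[Turn], window_sec: float) -> List[Dict[str, float]]:
    """Talk-time shares inside consecutive windows of `window_sec` seconds,
    giving a coarse view of how the balance shifts during the conversation."""
    if not turns:
        return []
    end_time = max(end for _, _, end in turns)
    results: List[Dict[str, float]] = []
    t = 0.0
    while t < end_time:
        # Clip every overlapping turn to the current window.
        clipped = [
            (s, max(a, t), min(b, t + window_sec))
            for s, a, b in turns
            if a < t + window_sec and b > t
        ]
        results.append(talk_time_shares(clipped))
        t += window_sec
    return results
```

Two conversations can have the same overall shares while their windowed shares look very different, which is the kind of distinction between conversation-level balance and lower-level dynamics that the abstract draws.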
nan
Article 560
Title@2025-06-27 (5): Bridging Compositional and Distributional Semantics: A Survey on Latent Semantic Geometry via AutoEncoder
Title: Bridging Compositional and Distributional Semantics: A Survey on Latent Semantic Geometry via AutoEncoder | Bridging Compositional and Distributional Semantics: Eine Umfrage zur latenten Semantischen Geometrie über AutoEncoder | 搭桥构成和分布式语义学:通过自动 Encder 进行边端语义几何测量调查 2506.20083v2 |
Authors (3): Yingji Zhang, Danilo S. Carvalho, André Freitas
Integrating compositional and symbolic properties into current distributional semantic spaces can enhance the interpretability, controllability, compositionality, and generalisation capabilities of Transformer-based auto-regressive language models (LMs). In this survey, we offer a novel perspective on latent space geometry through the lens of compositional semantics, a direction we refer to as semantic representation learning. This direction enables a bridge between symbolic and distributional semantics, helping to mitigate the gap between them. We review and compare three mainstream autoencoder architectures, namely Variational AutoEncoder (VAE), Vector Quantised VAE (VQVAE), and Sparse AutoEncoder (SAE), and examine the distinctive latent geometries they induce in relation to semantic structure and interpretability.
nan
Article 561
Title@2025-06-27 (5): RiverEcho: Real-Time Interactive Digital System for Ancient Yellow River Culture
Title: RiverEcho: Real-Time Interactive Digital System for Ancient Yellow River Culture | RiverEcho: Real-Time Interactive Digital System für die alte gelbe Flusskultur | RiverEcho:古黄河文化实时互动数字系统 2506.21865v1 |
Authors (10): Haofeng Wang, Yilin Guo, Zehao Li, Tong Yue, Yizong Wang, Enci Zhang, Rongqun Lin, Feng Gao, Shiqi Wang, Siwei Ma
The Yellow River is China’s mother river and a cradle of human civilization. The ancient Yellow River culture is, moreover, an indispensable part of human art history. To preserve and pass on the ancient Yellow River culture, we designed RiverEcho, a real-time interactive system that responds to voice queries using a large language model and a cultural knowledge dataset, delivering explanations through a talking-head digital human. Specifically, we built a knowledge database focused on the ancient Yellow River culture, including the collection of historical texts and the processing pipeline. Experimental results demonstrate that leveraging Retrieval-Augmented Generation (RAG) on the proposed dataset enhances the response quality of the Large Language Model (LLM), enabling the system to generate more professional and informative responses. Our work not only diversifies the means of promoting Yellow River culture but also provides users with deeper cultural insights.
nan
Article 562
Title@2025-06-27 (5): DeepTalk: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE
Title: DeepTalk: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE | DeepTalk: Auf dem Weg zu nahtloser und intelligenter Sprachinteraktion mit adaptiver Modalität-spezifischer MoE | 深谈:实现与适应型模式具体部的无缝和智能语音互动 2506.21864v1 |
Authors (9): Hang Shao, Heting Gao, Yunhang Shen, Jiawei Chen, Lijiang Li, Zuwei Long, Bo Tong, Ke Li, Xing Sun
Native multimodal large language models (MLLMs) restructure a single large language model (LLM) into a spoken language model (SLM) capable of both speech and text generation. Compared to modular and aligned MLLMs, native MLLMs preserve richer paralinguistic features such as emotion and prosody, and generate speech responses directly within the backbone LLM rather than using a separate speech decoder. This integration also results in lower response latency and smoother interaction. However, native MLLMs suffer from catastrophic forgetting and performance degradation because the available paired speech-text data is insufficient to support the pretraining of MLLMs compared to the vast amount of text data required to pretrain text LLMs. To address this issue, we propose DeepTalk, a framework for adaptive modality expert learning based on a Mixture of Experts (MoE) architecture. DeepTalk first adaptively distinguishes modality experts according to their modality load within the LLM. Each modality expert then undergoes specialized single-modality training, followed by joint multimodal collaborative training. As a result, DeepTalk incurs only a 5.5% performance drop compared to the original LLM, which is significantly lower than the average performance drop of over 20% typically seen in native MLLMs (such as GLM-4-Voice), and is on par with modular MLLMs. Meanwhile, the end-to-end dialogue latency remains within 0.5 seconds, ensuring a seamless and intelligent speech interaction experience. Code and models are released at https://github.com/talkking/DeepTalk.
nan
Article 563
Title@2025-06-27 (5): Derivational Probing: Unveiling the Layer-wise Derivation of Syntactic Structures in Neural Language Models
Title: Derivational Probing: Unveiling the Layer-wise Derivation of Syntactic Structures in Neural Language Models | Derivational Probing: Enthüllen der schichtweisen Ableitung syntaktischer Strukturen in neuralen Sprachmodellen | 派生实验:神经语言模型中同步教学结构图层和图层推算 2506.21861v1 |
Authors (4): Taiga Someya, Ryo Yoshida, Hitomi Yanaka, Yohei Oseki
Recent work has demonstrated that neural language models encode syntactic structures in their internal representations, yet the derivations by which these structures are constructed across layers remain poorly understood. In this paper, we propose Derivational Probing to investigate how micro-syntactic structures (e.g., subject noun phrases) and macro-syntactic structures (e.g., the relationship between the root verbs and their direct dependents) are constructed as word embeddings propagate upward across layers. Our experiments on BERT reveal a clear bottom-up derivation: micro-syntactic structures emerge in lower layers and are gradually integrated into a coherent macro-syntactic structure in higher layers. Furthermore, a targeted evaluation on subject-verb number agreement shows that the timing of constructing macro-syntactic structures is critical for downstream performance, suggesting an optimal timing for integrating global syntactic information.
nan
Article 564
Title@2025-06-27 (5): Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation
Title: Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation | Leveraging Online-Olympiade-Level-Mathematik Probleme für LLMs Training und Kontaminierung-Resistent Evaluation | 利用在线奥林匹克层面的数学问题促进LLM女士的培训和污染 – – 评估 2501.14275v2 |
Authors (6): Sadegh Mahdavi, Muchen Li, Kaiwen Liu, Christos Thrampoulidis, Leonid Sigal, Renjie Liao
Advances in Large Language Models (LLMs) have sparked interest in their ability to solve Olympiad-level math problems. However, the training and evaluation of these models are constrained by the limited size and quality of available datasets, as creating large-scale data for such advanced problems requires extensive effort from human experts. In addition, current benchmarks are prone to contamination, leading to unreliable evaluations. In this paper, we present an automated pipeline that leverages the rich resources of the Art of Problem Solving (AoPS) forum, which predominantly features Olympiad-level problems and community-driven solutions. Using open-source LLMs, we develop a method to extract question-answer pairs from the forum, resulting in AoPS-Instruct, a dataset of more than 600,000 high-quality QA pairs. Our experiments demonstrate that fine-tuning LLMs on AoPS-Instruct improves their reasoning abilities across various benchmarks. Moreover, we build an automatic pipeline that introduces LiveAoPSBench, an evolving evaluation set with timestamps, derived from the latest forum data, providing a contamination-resistant benchmark for assessing LLM performance. Notably, we observe a significant decline in LLM performance over time, suggesting their success on older examples may stem from pre-training exposure rather than true reasoning ability. Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning, offering valuable insights into the capabilities and limitations of LLMs in this domain. Our benchmark and code are available at https://github.com/DSL-Lab/aops
nan
Article 565
Title@2025-06-27 (5): The Consistency Hypothesis in Uncertainty Quantification for Large Language Models
Title: The Consistency Hypothesis in Uncertainty Quantification for Large Language Models | Die Kohärenzhypothese in der Unsicherheitsquantifizierung für große Sprachmodelle | 《大语言模型不确定性量化不确定性的一致假设》 2506.21849v1 |
Authors (8): Quan Xiao, Debarun Bhattacharjya, Balaji Ganesan, Radu Marinescu, Katsiaryna Mirylenka, Nhan H Pham, Michael Glass, Junkyu Lee
Estimating the confidence of large language model (LLM) outputs is essential for real-world applications requiring high user trust. Black-box uncertainty quantification (UQ) methods, relying solely on model API access, have gained popularity due to their practical benefits. In this paper, we examine the implicit assumption behind several UQ methods, which use generation consistency as a proxy for confidence, an idea we formalize as the consistency hypothesis. We introduce three mathematical statements with corresponding statistical tests to capture variations of this hypothesis and metrics to evaluate LLM output conformity across tasks. Our empirical investigation, spanning 8 benchmark datasets and 3 tasks (question answering, text summarization, and text-to-SQL), highlights the prevalence of the hypothesis under different settings. Among the statements, we highlight the ‘Sim-Any’ hypothesis as the most actionable, and demonstrate how it can be leveraged by proposing data-free black-box UQ methods that aggregate similarities between generations for confidence estimation. These approaches can outperform the closest baselines, showcasing the practical value of the empirically observed consistency hypothesis.
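The idea of using generation consistency as a proxy for confidence can be sketched in a few lines: sample several answers to the same prompt, measure how similar each one is to the others, and treat high average similarity as high confidence. The token-level Jaccard similarity and the mean aggregation below are stand-ins chosen for simplicity; the abstract does not specify the similarity measure or the aggregation behind the proposed methods, and the function names are hypothetical.

```python
from typing import List


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two strings (a simple stand-in similarity)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def consistency_confidence(generations: List[str]) -> List[float]:
    """For each sampled generation, return its mean similarity to all the
    other samples. Needs only the sampled texts, so it works with
    API-only (black-box) access to the model."""
    n = len(generations)
    if n < 2:
        raise ValueError("need at least two generations to measure consistency")
    scores: List[float] = []
    for i in range(n):
        sims = [jaccard(generations[i], generations[j]) for j in range(n) if j != i]
        scores.append(sum(sims) / len(sims))
    return scores
```

An answer that most other samples echo receives a score near 1, while an outlier receives a score near 0; a task-appropriate similarity (for example, execution match for text-to-SQL) would replace `jaccard` in practice.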
Article 566
Title@2025-06-27 (5): 3Description: An Intuitive Human-AI Collaborative 3D Modeling Approach
Title: 3Description: An Intuitive Human-AI Collaborative 3D Modeling Approach | 3Beschreibung: Ein intuitiver human-AI-Kollaborativer 3D-Modellierungsansatz | 3 说明:直观的人类-大赦国际合作3D建模方法 2506.21845v1 |
Authors (1): Zhuodi Cai
This paper presents 3Description, an experimental human-AI collaborative approach for intuitive 3D modeling. 3Description aims to address accessibility and usability challenges in traditional 3D modeling by enabling non-professional individuals to co-create 3D models using verbal and gesture descriptions. Through a combination of qualitative research, product analysis, and user testing, 3Description integrates AI technologies such as Natural Language Processing and Computer Vision, powered by OpenAI and MediaPipe. Recognizing the web has wide cross-platform capabilities, 3Description is web-based, allowing users to describe the desired model and subsequently adjust its components using verbal and gestural inputs. In the era of AI and emerging media, 3Description not only contributes to a more inclusive and user-friendly design process, empowering more people to participate in the construction of the future 3D world, but also strives to increase human engagement in co-creation with AI, thereby avoiding undue surrender to technology and preserving human creativity.
Article 567
Title@2025-06-27 (5): MMCR: Benchmarking Cross-Source Reasoning in Scientific Papers
Title: MMCR: Benchmarking Cross-Source Reasoning in Scientific Papers | MMCR: Benchmarking quellübergreifender Begründungen in wissenschaftlichen Arbeiten | MMCR: 科学文件的跨来源理由基准 2503.16856v2 |
Authors (5): Yang Tian, Zheng Lu, Mingqi Gao, Zheng Liu, Bo Zhao
Machine comprehension of scientific papers reflects a high level of Artificial General Intelligence: it requires the ability to reason across fragmented and heterogeneous sources of information, which presents a complex and practically significant challenge. While Vision-Language Models (VLMs) have made remarkable strides in various tasks, particularly those involving reasoning with evidence drawn from a single image or text page, their ability to use cross-source information for reasoning remains an open problem. This work presents MMCR, a high-difficulty benchmark designed to evaluate VLMs’ capacity for reasoning with cross-source information from scientific papers. The benchmark comprises 276 high-quality questions, meticulously annotated by humans across 7 subjects and 10 task types. Experiments with 18 VLMs demonstrate that cross-source reasoning presents a substantial challenge for existing models. Notably, even the top-performing model, GPT-4o, achieved only 48.55% overall accuracy, with only 20% accuracy in multi-table comprehension tasks, while the second-best model, Qwen2.5-VL-72B, reached 39.86% overall accuracy. Furthermore, we investigated the impact of the Chain-of-Thought (CoT) technique on cross-source reasoning and observed a detrimental effect on small models, whereas larger models demonstrated substantially enhanced performance. These results highlight the pressing need to develop VLMs capable of effectively utilizing cross-source information for reasoning.
Article 568
Title@2025-06-27 (5): PARSI: Persian Authorship Recognition via Stylometric Integration
Title: PARSI: Persian Authorship Recognition via Stylometric Integration | PARSI: Persische Anerkennung durch stylometrische Integration | PARSI: 通过星体集成承认波斯语授权 2506.21840v1 |
Authors (3): Kourosh Shahnazari, Mohammadali Keshtparvar, Seyed Moein Ayyoubzadeh
The intricate linguistic, stylistic, and metrical aspects of Persian classical poetry pose a challenge for computational authorship attribution. In this work, we present a versatile framework to determine authorship among 67 prominent poets. We employ a multi-input neural framework consisting of a transformer-based language encoder complemented by features addressing the semantic, stylometric, and metrical dimensions of Persian poetry. Our feature set encompasses 100-dimensional Word2Vec embeddings, seven stylometric measures, and categorical encodings of poetic form and meter. We compiled a vast corpus of 647,653 verses from the Ganjoor digital collection, validating the data through strict preprocessing and author verification while preserving poem-level splitting to prevent overlap. This work employs verse-level classification with majority and weighted voting schemes in evaluation, revealing that weighted voting yields 71% accuracy. We further investigate threshold-based decision filtering, allowing the model to generate highly confident predictions, achieving 97% accuracy at a 0.9 threshold, though at lower coverage. Our work focuses on the integration of deep representational forms with domain-specific features for improved authorship attribution. The results illustrate the potential of our approach for automated classification and its contribution to stylistic analysis, authorship disputes, and general computational literature research. This research will facilitate further research on multilingual author attribution, style shift, and generative modeling of Persian poetry.
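The voting and threshold-filtering steps can be sketched as follows. This is a minimal illustration under assumptions: verse-level predictions are taken to be (poet, confidence) pairs, weighted voting sums the confidences per poet, and the threshold is applied per verse before voting. The abstract does not state at which level the paper applies its threshold, and the poet labels and numbers below are made up for the example.

```python
from collections import Counter, defaultdict


def majority_vote(verse_preds):
    """verse_preds: list of (poet, confidence) pairs, one per verse."""
    counts = Counter(poet for poet, _ in verse_preds)
    return counts.most_common(1)[0][0]


def weighted_vote(verse_preds, threshold=0.0):
    """Sum confidences per poet, ignoring verses below `threshold`.

    Returns None when no verse passes the threshold; such poems stay
    unclassified, which is what lowers coverage.
    """
    totals = defaultdict(float)
    for poet, conf in verse_preds:
        if conf >= threshold:
            totals[poet] += conf
    if not totals:
        return None
    return max(totals, key=totals.get)


# Illustrative verse-level outputs for one poem (not real model outputs).
preds = [("Hafez", 0.55), ("Hafez", 0.50), ("Saadi", 0.95),
         ("Saadi", 0.40), ("Hafez", 0.45)]

by_majority = majority_vote(preds)
by_weight = weighted_vote(preds)
confident_only = weighted_vote(preds, threshold=0.9)
abstained = weighted_vote(preds, threshold=0.99)
```

Raising the threshold trades coverage for accuracy: fewer poems receive a label, but the labels that remain rest on the most confident verses.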
Article 569
Title@2025-06-27 (5): GenEscape: Hierarchical Multi-Agent Generation of Escape Room Puzzles
Title: GenEscape: Hierarchical Multi-Agent Generation of Escape Room Puzzles | GenEscape: Hierarchische Multi-Agenten-Generation von Escape Room Puzzles | GenEscape: 相向室谜题的等级化多代理生成 2506.21839v1 |
Authors (4): Mengyi Shan, Brian Curless, Ira Kemelmacher-Shlizerman, Steve Seitz
We challenge text-to-image models with generating escape room puzzle images that are visually appealing, logically solid, and intellectually stimulating. While base image models struggle with spatial relationships and affordance reasoning, we propose a hierarchical multi-agent framework that decomposes this task into structured stages: functional design, symbolic scene graph reasoning, layout synthesis, and local image editing. Specialized agents collaborate through iterative feedback to ensure the scene is visually coherent and functionally solvable. Experiments show that agent collaboration improves output quality in terms of solvability, shortcut avoidance, and affordance clarity, while maintaining visual quality.
Article 570
Title@2025-06-27 (5): Strengthening False Information Propagation Detection: Leveraging SVM and Sophisticated Text Vectorization Techniques in comparison to BERT
Title: Strengthening False Information Propagation Detection: Leveraging SVM and Sophisticated Text Vectorization Techniques in comparison to BERT | Stärkung der falschen Information Propagation Detection: Leveraging SVM und ausgefeilte Text-Vektorisierungstechniken im Vergleich zu BERT | 加强虚假信息传播探测:利用SVM和高频文本矢量技术与BERT相比 2411.12703v2 |
Authors (3): Ahmed Akib Jawad Karim, Kazi Hafiz Md Asad, Aznur Azam
The rapid spread of misinformation, particularly through online platforms, underscores the urgent need for reliable detection systems. This study explores the utilization of machine learning and natural language processing, specifically Support Vector Machines (SVM) and BERT, to detect fake news. We employ three distinct text vectorization methods for SVM: Term Frequency Inverse Document Frequency (TF-IDF), Word2Vec, and Bag of Words (BoW), evaluating their effectiveness in distinguishing between genuine and fake news. Additionally, we compare these methods against the transformer large language model, BERT. Our comprehensive approach includes detailed preprocessing steps, rigorous model implementation, and thorough evaluation to determine the most effective techniques. The results demonstrate that while BERT achieves superior accuracy with 99.98% and an F1-score of 0.9998, the SVM model with a linear kernel and BoW vectorization also performs exceptionally well, achieving 99.81% accuracy and an F1-score of 0.9980. These findings highlight that, despite BERT’s superior performance, SVM models with BoW and TF-IDF vectorization methods come remarkably close, offering highly competitive performance with the advantage of lower computational requirements.
Article 571
Title@2025-06-27 (5): RLSF: Fine-tuning LLMs via Symbolic Feedback
Title: RLSF: Fine-tuning LLMs via Symbolic Feedback | RLSF: Feinjustierende LLMs über symbolisches Feedback | RLSF:通过符号反馈对LLMs进行微调 2405.16661v3 |
Authors (5): Piyush Jha, Prithwish Jana, Pranavkrishna Suresh, Arnav Arora, Vijay Ganesh
Large Language Models (LLMs) have transformed AI but often struggle with tasks that require domain-specific reasoning and logical alignment. Traditional fine-tuning methods do not leverage the vast amount of symbolic domain-knowledge available to us via symbolic reasoning tools (e.g., provers), and are further limited by sparse rewards and unreliable reward models. We introduce Reinforcement Learning via Symbolic Feedback (RLSF), a novel fine-tuning paradigm where symbolic reasoning tools (e.g., solvers, provers, and algebra systems) provide fine-grained feedback to LLMs. RLSF uses poly-sized certificates (e.g., proofs) generated by symbolic tools to identify and correct errors in model outputs, offering token-level guidance without requiring differentiable reasoning systems. This paradigm bridges the gap between symbolic reasoning and LLM fine-tuning, enabling precise alignment with domain-specific constraints while addressing key limitations of traditional reward signals. Via extensive evaluations, we show that our RLSF-based fine-tuning of LLMs outperforms traditional approaches on five different applications (that have some associated logical or domain constraints), namely, program synthesis from natural language pseudo-code to programming language, three chemistry tasks, and solving the Game of 24. A key takeaway is that fine-tuning via RLSF enables relatively smaller LLMs to significantly outperform closed-source models that are orders of magnitude larger.
Article 572
Title@2025-06-26 (4): Exploring the change in scientific readability following the release of ChatGPT
Title: Exploring the change in scientific readability following the release of ChatGPT | Erforschung der Veränderung der wissenschaftlichen Lesbarkeit nach der Veröffentlichung von ChatGPT | 探讨在ChatGPT发布后科学可读性的变化 2506.21825v1 |
Authors (1): Abdulkareem Alsudais
The rise and growing popularity of accessible large language models have raised questions about their impact on various aspects of life, including how scientists write and publish their research. The primary objective of this paper is to analyze a dataset consisting of all abstracts posted on arXiv.org between 2010 and June 7th, 2024, to assess the evolution of their readability and determine whether significant shifts occurred following the release of ChatGPT in November 2022. Four standard readability formulas are used to calculate individual readability scores for each paper, classifying their level of readability. These scores are then aggregated by year and across the eight primary categories covered by the platform. The results show a steady annual decrease in readability, suggesting that abstracts are likely becoming increasingly complex. Additionally, following the release of ChatGPT, a significant change in readability is observed for 2023 and the analyzed months of 2024. Similar trends are found across categories, with most experiencing a notable change in readability during 2023 and 2024. These findings offer insights into the broader changes in readability and point to the likely influence of AI on scientific writing.
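The abstract does not name the four readability formulas. As one plausible example of the kind of score involved, the sketch below computes the Flesch Reading Ease, 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words), where higher values mean easier text. The choice of formula is an assumption, and the syllable counter is a crude vowel-run heuristic rather than a dictionary-based count.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count runs of vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease score; higher means easier to read."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


simple = "The cat sat on the mat."
dense = ("Heterogeneous representational architectures necessitate "
         "comprehensive interpretability.")
simple_score = flesch_reading_ease(simple)
dense_score = flesch_reading_ease(dense)
```

Averaging such scores over all abstracts in a given year and category would give the kind of aggregated trend the study describes.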
Article 573
Title@2025-06-26 (4): Exploring the Structure of AI-Induced Language Change in Scientific English
Title: Exploring the Structure of AI-Induced Language Change in Scientific English | Erforschung der Struktur des KI-induzierten Sprachwandels im wissenschaftlichen Englisch | 探索AI-引自AI的英语科学语言变化结构 2506.21817v1 |
Authors (3): Riley Galpin, Bryce Anderson, Tom S. Juzek
Scientific English has undergone rapid and unprecedented changes in recent years, with words such as “delve,” “intricate,” and “crucial” showing significant spikes in frequency since around 2022. These changes are widely attributed to the growing influence of Large Language Models like ChatGPT in the discourse surrounding bias and misalignment. However, apart from changes in frequency, the exact structure of these linguistic shifts has remained unclear. The present study addresses this and investigates whether these changes involve the replacement of synonyms by suddenly ‘spiking words,’ for example, “crucial” replacing “essential” and “key,” or whether they reflect broader semantic and pragmatic qualifications. To further investigate structural changes, we include part of speech tagging in our analysis to quantify linguistic shifts over grammatical categories and differentiate between word forms, like “potential” as a noun vs. as an adjective. We systematically analyze synonym groups for widely discussed ‘spiking words’ based on frequency trends in scientific abstracts from PubMed. We find that entire semantic clusters often shift together, with most or all words in a group increasing in usage. This pattern suggests that changes induced by Large Language Models are primarily semantic and pragmatic rather than purely lexical. Notably, the adjective “important” shows a significant decline, which prompted us to systematically analyze decreasing lexical items. Our analysis of “collapsing” words reveals a more complex picture, which is consistent with organic language change and contrasts with the patterns of the abrupt spikes. These insights into the structure of language change contribute to our understanding of how language technology continues to shape human language.
Article 574
Title@2025-06-26 (4): Towards Transparent AI: A Survey on Explainable Large Language Models
Title: Towards Transparent AI: A Survey on Explainable Large Language Models | Auf dem Weg zu transparenter KI: Eine Umfrage zu erklärbaren großen Sprachmodellen | 走向透明AI:关于可解释的大型语言模式的调查 2506.21812v1 |
Authors (4): Avash Palikhe, Zhenyu Yu, Zichong Wang, Wenbin Zhang
Large Language Models (LLMs) have played a pivotal role in advancing Artificial Intelligence (AI). However, despite their achievements, LLMs often struggle to explain their decision-making processes, making them a ‘black box’ and presenting a substantial challenge to explainability. This lack of transparency poses a significant obstacle to the adoption of LLMs in high-stakes domain applications, where interpretability is particularly essential. To overcome these limitations, researchers have developed various explainable artificial intelligence (XAI) methods that provide human-interpretable explanations for LLMs. However, a systematic understanding of these methods remains limited. To address this gap, this survey provides a comprehensive review of explainability techniques by categorizing XAI methods based on the underlying transformer architectures of LLMs: encoder-only, decoder-only, and encoder-decoder models. Then these techniques are examined in terms of their evaluation for assessing explainability, and the survey further explores how these explanations are leveraged in practical applications. Finally, it discusses available resources, ongoing research challenges, and future directions, aiming to guide continued efforts toward developing transparent and responsible LLMs.
Article 575
Title@2025-06-26 (4): A suite of allotaxonometric tools for the comparison of complex systems using rank-turbulence divergence
Title: A suite of allotaxonometric tools for the comparison of complex systems using rank-turbulence divergence | Eine Reihe allotaxonometrischer Werkzeuge für den Vergleich komplexer Systeme mit Rang-Turbulenz-Divergenz | 一套用于比较复杂系统、使用降压扰动差异比较的 Alsotalogon 测量工具套套套 2506.21808v1 |
Authors (9): Jonathan St-Onge, Ashley M. A. Fehr, Carter Ward, Calla G. Beauregard, Michael V. Arnold, Samuel F. Rosenblatt, Benjamin Cooley, Christopher M. Danforth, Peter Sheridan Dodds
Describing and comparing complex systems requires principled, theoretically grounded tools. Built around the phenomenon of type turbulence, allotaxonographs provide map-and-list visual comparisons of pairs of heavy-tailed distributions. Allotaxonographs are designed to accommodate a wide range of instruments including rank- and probability-turbulence divergences, Jensen-Shannon divergence, and generalized entropy divergences. Here, we describe a suite of programmatic tools for rendering allotaxonographs for rank-turbulence divergence in MATLAB, JavaScript, and Python, all of which have different use cases.
Article 576
Title@2025-06-26 (4): CitySim: Modeling Urban Behaviors and City Dynamics with Large-Scale LLM-Driven Agent Simulation
Title: CitySim: Modeling Urban Behaviors and City Dynamics with Large-Scale LLM-Driven Agent Simulation | CitySim: Modellierung städtischer Verhaltensmuster und Stadtdynamik mit großformatiger LLM-Driven Agent Simulation | 城市模拟:利用大型LLM-驱动剂模拟模拟模型模拟城市行为和城市动态 2506.21805v1 |
Authors (2): Nicolas Bougie, Narimasa Watanabe
Modeling human behavior in urban environments is fundamental for social science, behavioral studies, and urban planning. Prior work often relies on rigid, hand-crafted rules, limiting its ability to simulate nuanced intentions, plans, and adaptive behaviors. Addressing these challenges, we envision an urban simulator (CitySim), capitalizing on breakthroughs in human-level intelligence exhibited by large language models. In CitySim, agents generate realistic daily schedules using a recursive value-driven approach that balances mandatory activities, personal habits, and situational factors. To enable long-term, lifelike simulations, we endow agents with beliefs, long-term goals, and spatial memory for navigation. CitySim exhibits closer alignment with real humans than prior work, both at micro and macro levels. Additionally, we conduct insightful experiments by modeling tens of thousands of agents and evaluating their collective behaviors under various real-world scenarios, including estimating crowd density, predicting place popularity, and assessing well-being. Our results highlight CitySim as a scalable, flexible testbed for understanding and forecasting urban phenomena.
Article 577
Title@2025-06-26 (4): Offensive Language Detection on Social Media Using XLNet
Title: Offensive Language Detection on Social Media Using XLNet | Offensive Spracherkennung auf Social Media mit XLNet | 使用XLNet在社交媒体上发现攻击性语言 2506.21795v1 |
Authors (3): Reem Alothman, Hafida Benhidour, Said Kerrache
The widespread use of text-based communication on social media (through chats, comments, and microblogs) has improved user interaction but has also led to an increase in offensive content, including hate speech, racism, and other forms of abuse. Due to the enormous volume of user-generated content, manual moderation is impractical, which creates a need for automated systems that can detect offensive language. Deep learning models, particularly those using transfer learning, have demonstrated significant success in understanding natural language through large-scale pretraining. In this study, we propose an automatic offensive language detection model based on XLNet, a generalized autoregressive pretraining method, and compare its performance with BERT (Bidirectional Encoder Representations from Transformers), which is a widely used baseline in natural language processing (NLP). Both models are evaluated using the Offensive Language Identification Dataset (OLID), a benchmark Twitter dataset that includes hierarchical annotations. Our experimental results show that XLNet outperforms BERT in detecting offensive content and in categorizing the types of offenses, while BERT performs slightly better in identifying the targets of the offenses. Additionally, we find that oversampling and undersampling strategies are effective in addressing class imbalance and improving classification performance. These findings highlight the potential of transfer learning and XLNet-based architectures to create robust systems for detecting offensive language on social media platforms.
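To make the oversampling idea concrete, here is a minimal sketch of random oversampling, which duplicates minority-class examples until every class matches the largest one. The abstract does not say which oversampling variant the authors use, so this is only one common option; the function name and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def random_oversample(texts, labels, seed=0):
    """Duplicate minority-class examples at random until classes are balanced."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)
    target = max(len(items) for items in by_label.values())
    out_texts, out_labels = [], []
    for label, items in by_label.items():
        # Draw extra copies with replacement to reach the majority-class size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Toy imbalanced data: four non-offensive tweets, one offensive tweet.
texts = ["t1", "t2", "t3", "t4", "t5"]
labels = ["NOT", "NOT", "NOT", "NOT", "OFF"]
bal_texts, bal_labels = random_oversample(texts, labels)
balanced_counts = Counter(bal_labels)
```

Resampling of this kind is normally applied to the training split only, so that evaluation still reflects the original class distribution.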
Article 578
Title@2025-06-26 (4): Evaluating List Construction and Temporal Understanding capabilities of Large Language Models
Title: Evaluating List Construction and Temporal Understanding capabilities of Large Language Models | Bewertung der Listenkonstruktion und des zeitlichen Verständnisses von großen Sprachmodellen | 评价大语言模型的建筑和时间理解能力清单 2506.21783v1 |
Authors (4): Alexandru Dumitru, V Venktesh, Adam Jatowt, Avishek Anand
Large Language Models (LLMs) have demonstrated immense advances in a wide range of natural language tasks. However, these models are susceptible to hallucinations and errors, particularly on temporal understanding tasks involving multiple entities in answers. In such tasks, they fail to associate entities with accurate time intervals, generate a complete list of entities in answers, or reason about events associated with specific temporal bounds. Existing works do not extensively evaluate the abilities of the model to perform implicit and explicit temporal understanding in a list answer construction setup. To bridge this gap, we propose the Time referenced List based Question Answering (TLQA) benchmark, which requires structured answers in list format aligned with corresponding time periods. Our TLQA benchmark requires both list construction and temporal understanding simultaneously, which to the best of our knowledge has not been explored in prior benchmarks. We investigate the temporal understanding and list construction capabilities of state-of-the-art generative models on TLQA in closed-book and open-domain settings. Our findings reveal significant shortcomings in current models, particularly their inability to provide complete answers and temporally align facts in a closed-book setup and the need to improve retrieval in an open-domain setup, providing clear future directions for research on TLQA. The benchmark and code are available at https://github.com/elixir-research-group/TLQA.
Article 579
Title@2025-06-26 (4): Are Triggers Needed for Document-Level Event Extraction?
Title: Are Triggers Needed for Document-Level Event Extraction? | Sind Auslöser für die Dokument-Level-Ereignisextraktion erforderlich? | 需要触发文件级活动吗? 2411.08708v2 |
Authors (6): Shaden Shaar, Wayne Chen, Maitreyi Chatterjee, Barry Wang, Wenting Zhao, Claire Cardie
Most existing work on event extraction has focused on sentence-level texts and presumes the identification of a trigger-span – a word or phrase in the input that evokes the occurrence of an event of interest. Event arguments are then extracted with respect to the trigger. Indeed, triggers are treated as integral to, and trigger detection as an essential component of, event extraction. In this paper, we provide the first investigation of the role of triggers for the more difficult and much less studied task of document-level event extraction. We analyze their usefulness in multiple end-to-end and pipelined transformer-based event extraction models for three document-level event extraction datasets, measuring performance using triggers of varying quality (human-annotated, LLM-generated, keyword-based, and random). We find that whether or not systems benefit from explicitly extracting triggers depends both on dataset characteristics (i.e. the typical number of events per document) and task-specific information available during extraction (i.e. natural language event schemas). Perhaps surprisingly, we also observe that the mere existence of triggers in the input, even random ones, is important for prompt-based in-context learning approaches to the task.
Article 580
Title@2025-06-26 (4): (Fact) Check Your Bias
Title: (Fact) Check Your Bias | (Fakt) Prüfen Sie Ihre Bias | (事实) 检查您的比亚 2506.21745v1 |
Authors (2): Eivind Morris Bakke, Nora Winger Heggelund
Automatic fact verification systems increasingly rely on large language models (LLMs). We investigate how parametric knowledge biases in these models affect fact-checking outcomes of the HerO system (baseline for FEVER-25). We examine how the system is affected by: (1) potential bias in Llama 3.1’s parametric knowledge and (2) intentionally injected bias. When prompted directly to perform fact-verification, Llama 3.1 labels nearly half the claims as “Not Enough Evidence”. Using only its parametric knowledge it is able to reach a verdict on the remaining half of the claims. In the second experiment, we prompt the model to generate supporting, refuting, or neutral fact-checking documents. These prompts significantly influence retrieval outcomes, with approximately 50% of retrieved evidence being unique to each perspective. Notably, the model sometimes refuses to generate supporting documents for claims it believes to be false, creating an inherent negative bias. Despite differences in retrieved evidence, final verdict predictions show stability across prompting strategies. The code is available at: https://github.com/eibakke/FEVER-8-Shared-Task
Article 581
Title@2025-06-26 (4): Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought
Title: Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought | Bias-Augmented Consistency Training reduziert biased Reasoning in Chain-of-Thought | 避免和强化的一致培训减少在寻求的连锁努力中造成不利和 不利理由 2403.05518v3 |
Authors (7): James Chua, Edward Rees, Hunar Batra, Samuel R. Bowman, Julian Michael, Ethan Perez, Miles Turpin
Chain-of-thought prompting (CoT) has the potential to improve the explainability of language model reasoning. But CoT can also systematically misrepresent the factors influencing models’ behavior – for example, rationalizing answers in line with a user’s opinion. We first create a new dataset of 9 different biases that affect GPT-3.5-Turbo and Llama-8b models. These consist of spurious-few-shot patterns, post hoc rationalization, and sycophantic settings. Models switch to the answer implied by the bias, without mentioning the effect of the bias in the CoT. To mitigate this biased reasoning problem, we introduce bias-augmented consistency training (BCT), an unsupervised fine-tuning scheme that trains models to give consistent reasoning across prompts with and without biasing features. We construct a suite testing nine forms of biased reasoning on seven question-answering tasks, and find that applying BCT to GPT-3.5-Turbo with one bias reduces the rate of biased reasoning by 86% on held-out tasks. Moreover, this model generalizes to other forms of bias, reducing biased reasoning on held-out biases by an average of 37%. As BCT generalizes to held-out biases and does not require gold labels, this method may hold promise for reducing biased reasoning from as-of-yet unknown biases and on tasks where ground truth reasoning is unavailable.
Article 582
Title@2025-06-26 (4): Identifying Speaker Information in Feed-Forward Layers of Self-Supervised Speech Transformers
Title: Identifying Speaker Information in Feed-Forward Layers of Self-Supervised Speech Transformers | Identifizierung von Sprecherinformationen in Feed-Forward-Schichten von selbstüberwachten Sprachtransformatoren | 识别自我支持的语音变换者向往进进言层中的演讲者信息 2506.21712v1 |
Authors (4): Tzu-Quan Lin, Hsi-Chun Cheng, Hung-yi Lee, Hao Tang
In recent years, the impact of self-supervised speech Transformers has extended to speaker-related applications. However, little research has explored how these models encode speaker information. In this work, we address this gap by identifying neurons in the feed-forward layers that are correlated with speaker information. Specifically, we analyze neurons associated with k-means clusters of self-supervised features and i-vectors. Our analysis reveals that these clusters correspond to broad phonetic and gender classes, making them suitable for identifying neurons that represent speakers. By protecting these neurons during pruning, we can significantly preserve performance on speaker-related tasks, demonstrating their crucial role in encoding speaker information.
Article 583
Title@2025-06-26 (4): End-to-End Long Document Summarization using Gradient Caching
Title: End-to-End Long Document Summarization using Gradient Caching | End-to-End-Langdokumentzusammenfassung mit Gradient Caching | 使用梯度缓存对端到 End 长文档的缩写 2501.01805v2 |
Authors (3): Rohit Saxena, Hao Tang, Frank Keller
Training transformer-based encoder-decoder models for long document summarization poses a significant challenge due to the quadratic memory consumption during training. Several approaches have been proposed to extend the input length at test time, but training with these approaches is still difficult, requiring truncation of input documents and causing a mismatch between training and test conditions. In this work, we propose CachED (Gradient Caching for Encoder-Decoder models), an approach that enables end-to-end training of existing transformer-based encoder-decoder models, using the entire document without truncation. Specifically, we apply non-overlapping sliding windows to input documents, followed by fusion in the decoder. During backpropagation, the gradients are cached at the decoder and are passed through the encoder in chunks by re-computing the hidden vectors, similar to gradient checkpointing. In the experiments on long document summarization, we extend BART to CachED BART, processing more than 500K tokens during training and achieving superior performance without using any additional parameters.
Article 584
Title@2025-06-26 (4): Introducing MAPO: Momentum-Aided Gradient Descent Prompt Optimization
Title: Introducing MAPO: Momentum-Aided Gradient Descent Prompt Optimization | Einführung von MAPO: Momentum-Aided Gradient Descent Prompt Optimization | 介绍MAPO: 动力-援助渐变人后裔快速优化 2410.19499v3 |
Authors (7): Anthony Cui, Pranav Nandyalam, Andrew Rufail, Ethan Cheung, Aiden Lei, Kevin Zhu, Sean O’Brien
Momentum-Aided Prompt Optimization (MAPO) enhances the efficiency and efficacy of prompt optimization for Large Language Models (LLMs). Building on ProTeGi, MAPO uses positive natural language “gradients” and a momentum-based extension to refine prompts effectively. By tracking gradient history, MAPO avoids local minima and oscillations. It also utilizes beam search and an Upper Confidence Bound (UCB) algorithm for balanced candidate expansion and selection. Benchmark testing shows that MAPO converges faster, with fewer API calls and higher F1 scores than ProTeGi, establishing it as a robust and scalable solution for automated prompt engineering in LLMs.
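The abstract names an Upper Confidence Bound algorithm without giving details. As a generic illustration, the sketch below uses the standard UCB1 rule, which picks the candidate maximizing mean reward plus c * sqrt(ln(total evaluations) / evaluations of that candidate). Treating each candidate prompt as a bandit arm, the exploration constant `c`, and all names and numbers here are assumptions, not MAPO's actual implementation.

```python
import math


def ucb_select(stats, c=1.0):
    """Pick the next candidate prompt to evaluate using UCB1.

    stats: dict mapping prompt id -> (total_reward, n_evals).
    """
    total = sum(n for _, n in stats.values())
    best, best_score = None, -math.inf
    for prompt, (reward, n) in stats.items():
        if n == 0:
            return prompt  # evaluate untried candidates first
        # Mean reward plus an exploration bonus that shrinks as n grows.
        score = reward / n + c * math.sqrt(math.log(total) / n)
        if score > best_score:
            best, best_score = prompt, score
    return best


# Hypothetical running statistics for three candidate prompts.
stats = {"A": (8.0, 10), "B": (1.5, 2), "C": (0.0, 0)}
first_pick = ucb_select(stats)

tried = {"A": (8.0, 10), "B": (1.5, 2)}
explore_pick = ucb_select(tried, c=1.0)
greedy_pick = ucb_select(tried, c=0.0)
```

With the exploration bonus, the rarely evaluated prompt B is preferred over A despite its slightly lower mean; with `c=0` the rule reduces to greedy selection of the best mean.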
Article 585
Title@2025-06-26 (4): Multimodal Misinformation Detection Using Early Fusion of Linguistic, Visual, and Social Features
Title: Multimodal Misinformation Detection Using Early Fusion of Linguistic, Visual, and Social Features | Multimodale Fehlinformationserkennung mittels frühzeitiger Fusion sprachlicher, visueller und sozialer Merkmale | 利用语言、视觉和社会特征的早期融合来进行多模式错误信息探测 2507.01984v1 |
Authors (1): Gautam Kishore Shahi
Amid a tidal wave of misinformation flooding social media during elections and crises, extensive research has been conducted on misinformation detection, primarily focusing on text-based or image-based approaches. However, only a few studies have explored multimodal feature combinations, such as integrating text and images for building a classification model to detect misinformation. This study investigates the effectiveness of different multimodal feature combinations, incorporating text, images, and social features using an early fusion approach for the classification model. This study analyzed 1,529 tweets containing both text and images during the COVID-19 pandemic and election periods collected from Twitter (now X). A data enrichment process was applied to extract additional social features, as well as visual features, through techniques such as object detection and optical character recognition (OCR). The results show that combining unsupervised and supervised machine learning models improves classification performance by 15% compared to unimodal models and by 5% compared to bimodal models. Additionally, the study analyzes the propagation patterns of misinformation based on the characteristics of misinformation tweets and the users who disseminate them.
nan
Article 586
Title@2025-06-26 (4): ANUBHUTI: A Comprehensive Corpus For Sentiment Analysis In Bangla Regional Languages
Title: ANUBHUTI: A Comprehensive Corpus For Sentiment Analysis In Bangla Regional Languages | ANUBHUTI: Ein umfassender Corpus für die Sentimentanalyse in Bangla Regionalsprachen | ANUBUHUTI:孟加拉语地区语言中感应分析综合整体体 2506.21686v1 |
Authors (4): Swastika Kundu, Autoshi Ibrahim, Mithila Rahman, Tanvir Ahmed
Sentiment analysis for regional dialects of Bangla remains an underexplored area due to linguistic diversity and limited annotated data. This paper introduces ANUBHUTI, a comprehensive dataset consisting of 2000 sentences manually translated from standard Bangla into four major regional dialects: Mymensingh, Noakhali, Sylhet, and Chittagong. The dataset predominantly features political and religious content, reflecting the contemporary sociopolitical landscape of Bangladesh, alongside neutral texts to maintain balance. Each sentence is annotated using a dual annotation scheme: multiclass thematic labeling categorizes sentences as Political, Religious, or Neutral, and multilabel emotion annotation assigns one or more emotions from Anger, Contempt, Disgust, Enjoyment, Fear, Sadness, and Surprise. Expert native translators conducted the translation and annotation, with quality assurance performed via Cohen's Kappa inter-annotator agreement, achieving strong consistency across dialects. The dataset was further refined through systematic checks for missing data, anomalies, and inconsistencies. ANUBHUTI fills a critical gap in resources for sentiment analysis in low-resource Bangla dialects, enabling more accurate and context-aware natural language processing.
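For readers unfamiliar with the agreement statistic mentioned above, Cohen's kappa compares observed agreement between two annotators with the agreement expected by chance. A minimal sketch for single-label (multiclass) annotations; the function name and the example labels are illustrative and not taken from the dataset:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Hypothetical thematic labels from two annotators.
kappa = cohens_kappa(
    ["Political", "Political", "Religious", "Neutral"],
    ["Political", "Religious", "Religious", "Neutral"],
)
```

Values near 1 indicate strong agreement and values near 0 indicate agreement no better than chance.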
nan
Article 587
Title@2025-06-26 (4): Cohort Retrieval using Dense Passage Retrieval
Title: Cohort Retrieval using Dense Passage Retrieval | Cohort Retrieval mit Dense Passage Retrieval | 使用毒气通过通过访问检索的 Cohort 获取地址 2507.01049v1 |
Authors (1): Pranav Jadhav
Patient cohort retrieval is a pivotal task in medical research and clinical practice, enabling the identification of specific patient groups from extensive electronic health records (EHRs). In this work, we address the challenge of cohort retrieval in the echocardiography domain by applying Dense Passage Retrieval (DPR), a prominent methodology in semantic search. We propose a systematic approach to transform an echocardiographic EHR dataset of unstructured nature into a Query-Passage dataset, framing the problem as a Cohort Retrieval task. Additionally, we design and implement evaluation metrics inspired by real-world clinical scenarios to rigorously test the models across diverse retrieval tasks. Furthermore, we present a custom-trained DPR embedding model that demonstrates superior performance compared to traditional and off-the-shelf SOTA methods. To our knowledge, this is the first work to apply DPR for patient cohort retrieval in the echocardiography domain, establishing a framework that can be adapted to other medical domains.
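DPR-style retrieval scores each passage by the inner product of its embedding with the query embedding and returns the top-ranked passages. A minimal sketch of that ranking step, assuming the embeddings have already been computed (the function name and toy vectors are hypothetical; the paper's encoders are not reproduced here):

```python
def rank_passages(query_vec, passage_vecs):
    """Return passage indices ordered by dot-product similarity, best first."""
    scores = [
        sum(q * p for q, p in zip(query_vec, vec))
        for vec in passage_vecs
    ]
    return sorted(range(len(passage_vecs)), key=lambda i: scores[i], reverse=True)

# Toy 2-dimensional embeddings for one query and three passages.
ranking = rank_passages([1.0, 0.0], [[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]])
```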
nan
Article 588
Title@2025-06-26 (4): Do We Really Need GNNs with Explicit Structural Modeling? MLPs Suffice for Language Model Representations
Title: Do We Really Need GNNs with Explicit Structural Modeling? MLPs Suffice for Language Model Representations | Brauchen wir wirklich GNNs mit expliziter Strukturmodellierung? MLPs Mangel an Sprachmodelldarstellungen | 我们真的需要具有明确结构模型的GNNs吗? 2506.21682v1 |
Authors (7): Li Zhou, Hao Jiang, Junjie Li, Zefeng Zhao, Feng Jiang, Wenyu Chen, Haizhou Li
Explicit structural information has been proven to be encoded by Graph Neural Networks (GNNs), serving as auxiliary knowledge to enhance model capabilities and improve performance in downstream NLP tasks. However, recent studies indicate that GNNs fail to fully utilize structural information, whereas Multi-Layer Perceptrons (MLPs), despite lacking the message-passing mechanisms inherent to GNNs, exhibit a surprising ability in structure-aware tasks. Motivated by these findings, this paper introduces a comprehensive probing framework from an information-theoretic perspective. The framework is designed to systematically assess the role of explicit structural modeling in enhancing language model (LM) representations and to investigate the potential of MLPs as efficient and scalable alternatives to GNNs. We extend traditional probing classifiers by incorporating a control module that allows for selective use of either the full GNN model or its decoupled components, specifically, the message-passing and feature-transformation operations. This modular approach isolates and assesses the individual contributions of these operations, avoiding confounding effects from the complete GNN architecture. Using the Edge Probing Suite, a diagnostic tool for evaluating the linguistic knowledge encoded in LMs, we find that MLPs, when used as feature-transformation modules, consistently improve the linguistic knowledge captured in LM representations across different architectures. They effectively encode both syntactic and semantic patterns. Similarly, GNNs that incorporate feature-transformation operations show beneficial effects. In contrast, models that rely solely on message-passing operations tend to underperform, often leading to negative impacts on probing task performance.
nan
Article 589
Title@2025-06-26 (4): Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs
Title: Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs | Feinkörnige Preference-Optimierung verbessert räumliche Vernunft in VLMs | 优化优化优化优化改进甚低LMs的空间理性 2506.21656v1 |
Authors (9): Yifan Shen, Yuanzhe Liu, Jingyuan Zhu, Xu Cao, Xiaofeng Zhang, Yixiao He, Wenming Ye, James Matthew Rehg, Ismini Lourentzou
Current Vision-Language Models (VLMs) struggle with fine-grained spatial reasoning, particularly when multi-step logic and precise spatial alignment are required. In this work, we introduce SpatialReasoner-R1, a vision-language reasoning model designed to address these limitations. To construct high-quality supervision for spatial reasoning, we design a Multi-Model Monte Carlo Tree Search (M3CTS) method that generates diverse, logically consistent Long Chain-of-Thought (LongCoT) reasoning trajectories. In addition, we propose fine-grained Direct Preference Optimization (fDPO), which introduces segment-specific preference granularity for descriptive grounding and logical reasoning, guided by a spatial reward mechanism that evaluates candidate responses based on visual consistency, spatial grounding, and logical coherence. Experimental results demonstrate that fDPO achieves an average improvement of 4.1% over standard DPO across spatial quality tasks, and a 9.0% gain in spatial quantity tasks. SpatialReasoner-R1, trained with fDPO, sets a new SoTA on SPATIALRGPT-Bench, outperforming the strongest baseline by 9.8% in average accuracy, while maintaining competitive performance on general vision-language tasks.
nan
Article 590
Title@2025-06-26 (4): Data Efficacy for Language Model Training
Title: Data Efficacy for Language Model Training | Dateneffizienz für Sprachmodellschulungen | 语文示范培训的数据效率 2506.21545v1 |
Authors (9): Yalun Dai, Yangyu Huang, Xin Zhang, Wenshan Wu, Chong Li, Wenhui Lu, Shijie Cao, Li Dong, Scarlett Li
Data is fundamental to the training of language models (LM). Recent research has been dedicated to data efficiency, which aims to maximize performance by selecting a minimal or optimal subset of training data. Techniques such as data filtering, sampling, and selection play a crucial role in this area. To complement it, we define Data Efficacy, which focuses on maximizing performance by optimizing the organization of training data and remains relatively underexplored. This work introduces a general paradigm, DELT, for considering data efficacy in LM training, which highlights the significance of training data organization. DELT comprises three components: Data Scoring, Data Selection, and Data Ordering. Among these components, we design Learnability-Quality Scoring (LQS), as a new instance of Data Scoring, which considers both the learnability and quality of each data sample from the gradient consistency perspective. We also devise Folding Ordering (FO), as a novel instance of Data Ordering, which addresses issues such as model forgetting and data distribution bias. Comprehensive experiments validate the data efficacy in LM training, which demonstrates the following: Firstly, various instances of the proposed DELT enhance LM performance to varying degrees without increasing the data scale and model size. Secondly, among these instances, the combination of our proposed LQS for data scoring and Folding for data ordering achieves the most significant improvement. Lastly, data efficacy can be achieved together with data efficiency by applying data selection. Therefore, we believe that data efficacy is a promising foundational area in LM training.
nan
Article 591
Title@2025-06-26 (4): “What’s Up, Doc?”: Analyzing How Users Seek Health Information in Large-Scale Conversational AI Datasets
Title: “What’s Up, Doc?”: Analyzing How Users Seek Health Information in Large-Scale Conversational AI Datasets | “Was ist los, Doc?”: Analysieren, wie Nutzer Gesundheitsinformationen in groß angelegten KI-Datensätzen suchen | “怎么了,医生?” :分析用户如何在大型对话的AI数据集中寻求健康信息。 2506.21532v1 |
Authors (8): Akshay Paruchuri, Maryam Aziz, Rohit Vartak, Ayman Ali, Best Uchehara, Xin Liu, Ishan Chatterjee, Monica Agrawal
People are increasingly seeking healthcare information from large language models (LLMs) via interactive chatbots, yet the nature and inherent risks of these conversations remain largely unexplored. In this paper, we filter large-scale conversational AI datasets to obtain HealthChat-11K, a curated dataset of 11K real-world conversations composed of 25K user messages. We use HealthChat-11K and a clinician-driven taxonomy for how users interact with LLMs when seeking healthcare information in order to systematically study user interactions across 21 distinct health specialties. Our analysis reveals insights into the nature of how and why users seek health information, such as common interactions, instances of incomplete context, affective behaviors, and interactions (e.g., leading questions) that can induce sycophancy, underscoring the need for improvements in the healthcare support capabilities of LLMs deployed as conversational AI. Code and artifacts to retrieve our analyses and combine them into a curated dataset can be found here: https://github.com/yahskapar/HealthChat
nan
Article 592
Title@2025-06-26 (4): OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages
Title: OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages | OpenNER 1.0: Standardisierte Open-Access-Datensätze für die Entity-Erkennung in 50+ Sprachen | OpenNER 1.0:标准化的开放获取实体识别数据集,50+语言 2412.09587v2 |
Authors (5): Chester Palen-Michel, Maxwell Pickering, Maya Kruse, Jonne Sälevä, Constantine Lignos
We present OpenNER 1.0, a standardized collection of openly-available named entity recognition (NER) datasets. OpenNER contains 36 NER corpora that span 52 languages, human-annotated in varying named entity ontologies. We correct annotation format issues, standardize the original datasets into a uniform representation with consistent entity type names across corpora, and provide the collection in a structure that enables research in multilingual and multi-ontology NER. We provide baseline results using three pretrained multilingual language models and two large language models to compare the performance of recent models and facilitate future research in NER. We find that no single model is best in all languages and that significant work remains to obtain high performance from LLMs on the NER task.
nan
Article 593
Title@2025-06-26 (4): Weak-to-Strong GraphRAG: Aligning Weak Retrievers with Large Language Models for Graph-based Retrieval Augmented Generation
Title: Weak-to-Strong GraphRAG: Aligning Weak Retrievers with Large Language Models for Graph-based Retrieval Augmented Generation | Weak-to-Strong GraphRAG: Richten von schwachen Retrievern mit großen Sprachmodellen für graphisch basierte Retrieval Augmented Generation | 弱至强强石图RAG:与基于图的回取增代大语言模型对齐 2506.22518v1 |
Authors (8): Deyu Zou, Yongqiang Chen, Mufei Li, Siqi Miao, Chenxi Liu, Bo Han, James Cheng, Pan Li
Graph-based retrieval-augmented generation (RAG) enables large language models (LLMs) to ground responses with structured external knowledge from up-to-date knowledge graphs (KGs) and reduce hallucinations. However, LLMs often rely on a weak retriever in graph-based RAG: I) Due to the lack of ground truth, the retriever is often trained on weak supervision, which often introduces spurious signals to the LLMs. II) Due to the abstraction of graph data, the retrieved knowledge is often presented in unorganized forms. To mitigate the issue, we present Refined Graph-based RAG (ReG) to align weak retrievers to LLMs for graph-based RAG. Specifically, ReG incorporates LLM feedback to get rid of spurious signals and improve the quality of the supervision. Meanwhile, ReG introduces a structure-aware reorganization module to refactor the retrieval results into logically coherent evidence chains. Experiments on prominent benchmarks demonstrate that ReG significantly and consistently brings improvements across different LLM backbones by up to 10%. The improved supervision quality enables ReG to match the state-of-the-art performance with 5% training data and to transfer to out-of-distribution KGs. Notably, when adopted to reasoning-based LLMs, ReG reduces the reasoning token cost by up to 30% and improves the performance by up to 4%.
nan
Article 594
Title@2025-06-26 (4): skLEP: A Slovak General Language Understanding Benchmark
Title: skLEP: A Slovak General Language Understanding Benchmark | sklep: Ein slowakisches allgemeines Sprachverständnis Benchmark | SkLEP:斯洛伐克一般语言理解基准 2506.21508v1 |
Authors (8): Marek Šuppa, Andrej Ridzik, Daniel Hládek, Tomáš Javůrek, Viktória Ondrejová, Kristína Sásiková, Martin Tamajka, Marián Šimko
In this work, we introduce skLEP, the first comprehensive benchmark specifically designed for evaluating Slovak natural language understanding (NLU) models. We have compiled skLEP to encompass nine diverse tasks that span token-level, sentence-pair, and document-level challenges, thereby offering a thorough assessment of model capabilities. To create this benchmark, we curated new, original datasets tailored for Slovak and meticulously translated established English NLU resources. Within this paper, we also present the first systematic and extensive evaluation of a wide array of Slovak-specific, multilingual, and English pre-trained language models using the skLEP tasks. Finally, we also release the complete benchmark data, an open-source toolkit facilitating both fine-tuning and evaluation of models, and a public leaderboard at https://github.com/slovak-nlp/sklep in the hopes of fostering reproducibility and driving future research in Slovak NLU.
nan
Article 595
Title@2025-06-26 (4): Enhancing User Engagement in Socially-Driven Dialogue through Interactive LLM Alignments
Title: Enhancing User Engagement in Socially-Driven Dialogue through Interactive LLM Alignments | Verbesserung des Nutzerengagements im sozial-gesteuerten Dialog durch interaktive LLM-Alignments | 通过互动LLM调整,加强用户参与社会驱动对话 2506.21497v1 |
Authors (8): Jiashuo Wang, Kaitao Song, Chunpu Xu, Changhe Song, Yang Xiao, Dongsheng Li, Lili Qiu, Wenjie Li
Enhancing user engagement through interactions plays an essential role in socially-driven dialogues. While prior works have optimized models to reason over relevant knowledge or plan a dialogue act flow, the relationship between user engagement and knowledge or dialogue acts is subtle and does not guarantee user engagement in socially-driven dialogues. To this end, we enable interactive LLMs to learn user engagement by leveraging signals from the future development of conversations. Specifically, we adopt a more direct and relevant indicator of user engagement, i.e., the user’s reaction related to dialogue intention after the interaction, as a reward to align interactive LLMs. To achieve this, we develop a user simulator to interact with target interactive LLMs and explore interactions between the user and the interactive LLM system via i×MCTS (Monte Carlo Tree Search for interaction). In this way, we collect a dataset containing pairs of higher and lower-quality experiences using i×MCTS, and align interactive LLMs for high-level user engagement by direct preference optimization (DPO) accordingly. Experiments conducted on two socially-driven dialogue scenarios (emotional support conversations and persuasion for good) demonstrate that our method effectively enhances user engagement in interactive LLMs.
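The alignment step relies on pairs of higher- and lower-quality experiences. As a rough illustration of how such a pair enters training, here is a sketch of the standard DPO loss for a single pair, assuming summed sequence log-probabilities under the policy and a frozen reference model are already available (the function name, argument names, and `beta` value are assumptions; this is the generic DPO objective, not the paper's training code):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair."""
    # How much more the policy prefers the chosen response than the
    # reference model does, measured in log-probability.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # -log(sigmoid(x)), computed in a numerically stable way.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

The loss equals log 2 when the policy and reference agree, and falls as the policy shifts probability toward the higher-quality response.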
nan
Article 596
Title@2025-06-26 (4): Bridging Offline and Online Reinforcement Learning for LLMs
Title: Bridging Offline and Online Reinforcement Learning for LLMs | Überbrückung Offline- und Online-Verstärkungslernen für LLMs | 为LLMMs搭桥离线和在线加强学习 2506.21495v1 |
Authors (12): Jack Lanchantin, Angelica Chen, Janice Lan, Xian Li, Swarnadeep Saha, Tianlu Wang, Jing Xu, Ping Yu, Weizhe Yuan, Jason E Weston, Sainbayar Sukhbaatar, Ilia Kulikov
We investigate the effectiveness of reinforcement learning methods for finetuning large language models when transitioning from offline to semi-online to fully online regimes for both verifiable and non-verifiable tasks. Our experiments cover training on verifiable math as well as non-verifiable instruction following with a set of benchmark evaluations for both. Across these settings, we extensively compare online and semi-online Direct Preference Optimization and Group Reward Policy Optimization objectives, and surprisingly find similar performance and convergence between these variants, which all strongly outperform offline methods. We provide a detailed analysis of the training dynamics and hyperparameter selection strategies to achieve optimal results. Finally, we show that multi-tasking with verifiable and non-verifiable rewards jointly yields improved performance across both task types.
nan
Article 597
Title@2025-06-26 (4): Prompting with Phonemes: Enhancing LLMs’ Multilinguality for Non-Latin Script Languages
Title: Prompting with Phonemes: Enhancing LLMs’ Multilinguality for Non-Latin Script Languages | Mit Phonemes: Mehrsprachigkeit von LLMs für nicht-lateinische Script-Sprachen verbessern | 以电话提示:提高LLMS的非拉丁文拼写语言多重语言质量 2411.02398v3 |
Authors (7): Hoang H Nguyen, Khyati Mahajan, Vikas Yadav, Julian Salazar, Philip S. Yu, Masoud Hashemi, Rishabh Maheshwary
Although multilingual LLMs have achieved remarkable performance across benchmarks, we find they continue to underperform on non-Latin script languages across contemporary LLM families. This discrepancy arises from the fact that LLMs are pretrained with orthographic scripts, which are dominated by Latin characters that obscure their shared phonology with non-Latin scripts. We propose leveraging phonemic transcriptions as complementary signals to induce script-invariant representations. Our study demonstrates that integrating phonemic signals improves performance across both non-Latin and Latin script languages, with a particularly significant impact on closing the performance gap between the two. Through detailed experiments, we show that phonemic and orthographic scripts retrieve distinct examples for in-context learning (ICL). This motivates our proposed Mixed-ICL retrieval strategy, where further aggregation from both leads to our significant performance improvements for both Latin script languages (up to 12.6%) and non-Latin script languages (up to 15.1%) compared to randomized ICL retrieval.
nan
Article 598
Title@2025-06-26 (4): Logios : An open source Greek Polytonic Optical Character Recognition system
Title: Logios : An open source Greek Polytonic Optical Character Recognition system | Logios : Ein offenes griechisches Polytonisches optisches Zeichenerkennungssystem | Logios: 开放源码希腊多元光学特征识别系统 2506.21474v1 |
Authors (2): Perifanos Konstantinos, Goutsos Dionisis
In this paper, we present an Optical Character Recognition (OCR) system specifically designed for the accurate recognition and digitization of Greek polytonic texts. By leveraging the combined strengths of convolutional layers for feature extraction and recurrent layers for sequence learning, our system addresses the unique challenges posed by Greek polytonic scripts. This approach aims to overcome the limitations of traditional OCR methods, offering significant improvements in accuracy and efficiency. We release the underlying model as an open-source library and make our OCR platform available for academic use.
nan
Article 599
Title@2025-06-26 (4): TopK Language Models
Title: TopK Language Models | TopK-Sprachenmodelle | 顶 K 语言模式 2506.21468v1 |
Authors (4): Ryosuke Takahashi, Tatsuro Inaba, Kentaro Inui, Benjamin Heinzerling
Sparse autoencoders (SAEs) have become an important tool for analyzing and interpreting the activation space of transformer-based language models (LMs). However, SAEs suffer from several shortcomings that diminish their utility and internal validity. Since SAEs are trained post-hoc, it is unclear if the failure to discover a particular concept is a failure on the SAE’s side or due to the underlying LM not representing this concept. This problem is exacerbated by training conditions and architecture choices affecting which features an SAE learns. When tracing how LMs learn concepts during training, the lack of feature stability also makes it difficult to compare SAE features across different checkpoints. To address these limitations, we introduce a modification to the transformer architecture that incorporates a TopK activation function at chosen layers, making the model’s hidden states equivalent to the latent features of a TopK SAE. This approach eliminates the need for post-hoc training while providing interpretability comparable to SAEs. The resulting TopK LMs offer a favorable trade-off between model size, computational efficiency, and interpretability. Despite this simple architectural change, TopK LMs maintain their original capabilities while providing robust interpretability benefits. Our experiments demonstrate that the sparse representations learned by TopK LMs enable successful steering through targeted neuron interventions and facilitate detailed analysis of neuron formation processes across checkpoints and layers. These features make TopK LMs stable and reliable tools for understanding how language models learn and represent concepts, which we believe will significantly advance future research on model interpretability and controllability.
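A TopK activation keeps only the k largest entries of a hidden vector and zeroes the rest. A minimal sketch on a plain Python list, assuming selection by raw value (whether the paper ranks by value or by magnitude, and at which layers it applies the function, is not stated in the abstract; the function name is illustrative):

```python
def topk_activation(hidden, k):
    """Keep the k largest entries of `hidden`; set all others to 0.0."""
    if k <= 0:
        return [0.0] * len(hidden)
    if k >= len(hidden):
        return list(hidden)
    # Indices of the k largest values.
    order = sorted(range(len(hidden)), key=lambda i: hidden[i], reverse=True)
    keep = set(order[:k])
    return [h if i in keep else 0.0 for i, h in enumerate(hidden)]

sparse = topk_activation([0.1, 2.0, -1.0, 0.5], k=2)
```

Because at most k units are active per token, each surviving unit can be inspected or intervened on directly, which is the property the abstract ties to interpretability and steering.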
nan
Article 600
Title@2025-06-26 (4): Aligning Spoken Dialogue Models from User Interactions
Title: Aligning Spoken Dialogue Models from User Interactions | Ausrichten von gesprochenen Dialogmodellen aus Benutzerinteraktionen | 校对用户互动中的口语对话框模型 2506.21463v1 |
Authors (4): Anne Wu, Laurent Mazaré, Neil Zeghidour, Alexandre Défossez
We propose a novel preference alignment framework for improving spoken dialogue models on real-time conversations from user interactions. Current preference learning methods primarily focus on text-based language models, and are not directly suited to the complexities of real-time speech interactions, with richer dynamics (e.g. interruption, interjection) and no explicit segmentation between speaker turns. We create a large-scale dataset of more than 150,000 preference pairs from raw multi-turn speech conversations, annotated with AI feedback, to cover preferences over both linguistic content and temporal context variations. We leverage offline alignment methods to finetune a full-duplex autoregressive speech-to-speech model. Extensive experiments demonstrate that feedback on generic conversations can be consistently effective in improving spoken dialogue models to produce more factual, safer and more contextually aligned interactions. We deploy the finetuned model and conduct holistic human evaluations to assess the impact beyond single-turn conversations. Our findings shed light on the importance of a well-calibrated balance among various dynamics, crucial for natural real-time speech dialogue systems.
nan
Article 601
Title@2025-06-26 (4): Spatial Mental Modeling from Limited Views
Title: Spatial Mental Modeling from Limited Views | Räumliche mentale Modellierung aus begrenzten Ansichten | 根据有限观点进行空间精神建模 2506.21458v1 |
Authors (14): Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Manling Li, Jiajun Wu, Li Fei-Fei
Can Vision Language Models (VLMs) imagine the full scene from just a few views, like humans do? Humans form spatial mental models, internal representations of unseen space, to reason about layout, perspective, and motion. Our new MindCube benchmark with 21,154 questions across 3,268 images exposes this critical gap, where existing VLMs exhibit near-random performance. Using MindCube, we systematically evaluate how well VLMs build robust spatial mental models through representing positions (cognitive mapping), orientations (perspective-taking), and dynamics (mental simulation for “what-if” movements). We then explore three approaches to help VLMs approximate spatial mental models, including unseen intermediate views, natural language reasoning chains, and cognitive maps. The significant improvement comes from a synergistic approach, “map-then-reason”, that jointly trains the model to first generate a cognitive map and then reason upon it. By training models to reason over these internal maps, we boosted accuracy from 37.8% to 60.8% (+23.0%). Adding reinforcement learning pushed performance even further to 70.7% (+32.9%). Our key insight is that such scaffolding of spatial mental models, actively constructing and utilizing internal structured spatial representations with flexible reasoning processes, significantly improves understanding of unobservable space.
nan
Article 602
Title@2025-06-26 (4): Text2Cypher Across Languages: Evaluating Foundational Models Beyond English
Title: Text2Cypher Across Languages: Evaluating Foundational Models Beyond English | Text2Cypher Across Sprachen: Bewertung von Grundmodellen jenseits des Englischen | 跨语言文本:评价超越英语的基础模型 2506.21445v1 |
Authors (2): Makbule Gulcin Ozsoy, William Tai
Recent advances in large language models have enabled natural language interfaces that translate user questions into database queries, such as Text2SQL, Text2SPARQL, and Text2Cypher. While these interfaces enhance database accessibility, most research today focuses solely on English, with limited evaluation in other languages. This paper investigates the performance of foundational LLMs on the Text2Cypher task across multiple languages. We create and release a multilingual test set by translating English questions into Spanish and Turkish while preserving the original Cypher queries, enabling fair cross-lingual comparison. We evaluate multiple foundational models using standardized prompts and metrics. Our results show a consistent performance pattern: highest on English, then Spanish, and lowest on Turkish. We attribute this to differences in training data availability and linguistic characteristics. Additionally, we explore the impact of translating task prompts into Spanish and Turkish. Results show little to no change in evaluation metrics, suggesting prompt translation has minor impact. Our findings highlight the need for more inclusive evaluation and development in multilingual query generation. Future work includes schema localization and fine-tuning across diverse languages.
nan
Article 603
Title@2025-06-26 (4): Domain Knowledge-Enhanced LLMs for Fraud and Concept Drift Detection
Title: Domain Knowledge-Enhanced LLMs for Fraud and Concept Drift Detection | Domänenwissen-verbesserte LLMs für Betrug und Konzept-Drift-Erkennung | 防止欺诈和概念漂流探测的有知识增强的有限LMs 2506.21443v1 |
Authors (3): Ali Şenol, Garima Agrawal, Huan Liu
Detecting deceptive conversations on dynamic platforms is increasingly difficult due to evolving language patterns and Concept Drift (CD), i.e., semantic or topical shifts that alter the context or intent of interactions over time. These shifts can obscure malicious intent or mimic normal dialogue, making accurate classification challenging. While Large Language Models (LLMs) show strong performance in natural language tasks, they often struggle with contextual ambiguity and hallucinations in risk-sensitive scenarios. To address these challenges, we present a Domain Knowledge (DK)-Enhanced LLM framework that integrates pretrained LLMs with structured, task-specific insights to perform fraud and concept drift detection. The proposed architecture consists of three main components: (1) a DK-LLM module to detect fake or deceptive conversations; (2) a drift detection unit (OCDD) to determine whether a semantic shift has occurred; and (3) a second DK-LLM module to classify the drift as either benign or fraudulent. We first validate the value of domain knowledge using a fake review dataset and then apply our full framework to SEConvo, a multiturn dialogue dataset that includes various types of fraud and spam attacks. Results show that our system detects fake conversations with high accuracy and effectively classifies the nature of drift. Guided by structured prompts, the LLaMA-based implementation achieves 98% classification accuracy. Comparative studies against zero-shot baselines demonstrate that incorporating domain knowledge and drift awareness significantly improves performance, interpretability, and robustness in high-stakes NLP applications.
nan
Article 604
Title@2025-06-26 (4): Explainability of Large Language Models using SMILE: Statistical Model-agnostic Interpretability with Local Explanations
Title: Explainability of Large Language Models using SMILE: Statistical Model-agnostic Interpretability with Local Explanations | Erklärbarkeit großer Sprachmodelle mit SMILE: Statistische Modell-agnostische Interpretierbarkeit mit lokalen Erklärungen | 使用SMILE解释大语言模型的可解释性:统计模型 – – 与当地解释的可解释性 2505.21657v3 |
Authors (4): Zeinab Dehghani, Mohammed Naveed Akram, Koorosh Aslansefat, Adil Khan
Large language models like GPT, LLAMA, and Claude have become incredibly powerful at generating text, but they are still black boxes, so it is hard to understand how they decide what to say. That lack of transparency can be problematic, especially in fields where trust and accountability matter. To help with this, we introduce SMILE, a new method that explains how these models respond to different parts of a prompt. SMILE is model-agnostic and works by slightly changing the input, measuring how the output changes, and then highlighting which words had the most impact. It then creates simple visual heat maps showing which parts of a prompt matter the most. We tested SMILE on several leading LLMs and used metrics such as accuracy, consistency, stability, and fidelity to show that it gives clear and reliable explanations. By making these models easier to understand, SMILE brings us one step closer to making AI more transparent and trustworthy.
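The perturb-and-measure idea can be illustrated with a simple leave-one-word-out loop. This is a generic sketch, not SMILE's actual procedure: `word_importance` and `toy_score` are hypothetical names, and `toy_score` is a stand-in for whatever scalar a real pipeline would derive from the model's output.

```python
def word_importance(prompt, score_fn):
    """Score each word by how much removing it changes score_fn's output."""
    words = prompt.split()
    base = score_fn(prompt)
    importances = []
    for i, word in enumerate(words):
        # Perturb the input by dropping exactly one word.
        perturbed = " ".join(words[:i] + words[i + 1:])
        importances.append((word, abs(base - score_fn(perturbed))))
    return importances

def toy_score(text):
    """Hypothetical stand-in for a model output score: counts 'cat'."""
    return float(text.lower().split().count("cat"))

result = word_importance("draw a cat", toy_score)
```

The per-word scores are the kind of values a heat map over the prompt could display.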
nan
Article 605
Title@2025-06-26 (4): Scalable Bayesian Low-Rank Adaptation of Large Language Models via Stochastic Variational Subspace Inference
Title: Scalable Bayesian Low-Rank Adaptation of Large Language Models via Stochastic Variational Subspace Inference | Skalierbare Bayesische Low-Rank-Anpassung von großen Sprachmodellen über stochastische Variations-Subraum-Inferenz | 通过Stochastic变异性子空间推断,对大语言模型进行可缩放的Bayesian低Rank 2506.21408v1 |
Authors (5): Colin Samplawski, Adam D. Cobb, Manoj Acharya, Ramneet Kaur, Susmit Jha
Despite their widespread use, large language models (LLMs) are known to hallucinate incorrect information and be poorly calibrated. This makes the uncertainty quantification of these models of critical importance, especially in high-stakes domains, such as autonomy and healthcare. Prior work has made Bayesian deep learning-based approaches to this problem more tractable by performing inference over the low-rank adaptation (LoRA) parameters of a fine-tuned model. While effective, these approaches struggle to scale to larger LLMs because they require additional parameters beyond those of LoRA. In this work we present $\textbf{Scala}$ble $\textbf{B}$ayesian $\textbf{L}$ow-Rank Adaptation via Stochastic Variational Subspace Inference (ScalaBL). We perform Bayesian inference in an $r$-dimensional subspace, for LoRA rank $r$. By repurposing the LoRA parameters as projection matrices, we are able to map samples from this subspace into the full weight space of the LLM. This allows us to learn all the parameters of our approach using stochastic variational inference. Despite the low dimensionality of our subspace, we are able to achieve competitive performance with state-of-the-art approaches while only requiring ${\sim}1000$ additional parameters. Furthermore, it allows us to scale up to the largest Bayesian LLM to date, with four times as many base parameters as prior work.
Article 606
Title@2025-06-26 (4): DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation
Title: DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation | DiffuCoder: Maskierte Difffusionsmodelle für die Codegenerierung verstehen und verbessern | DiffuCoder:理解和改进代代码生成的蒙面传播模式 2506.20639v2 |
Authors (7): Shansan Gong, Ruixiang Zhang, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong, Yizhe Zhang
Diffusion large language models (dLLMs) are compelling alternatives to autoregressive (AR) models because their denoising models operate over the entire sequence. The global planning and iterative refinement features of dLLMs are particularly useful for code generation. However, current training and inference mechanisms for dLLMs in coding are still under-explored. To demystify the decoding behavior of dLLMs and unlock their potential for coding, we systematically investigate their denoising processes and reinforcement learning (RL) methods. We train a 7B dLLM, \textbf{DiffuCoder}, on 130B tokens of code. Using this model as a testbed, we analyze its decoding behavior, revealing how it differs from that of AR models: (1) dLLMs can decide how causal their generation should be without relying on semi-AR decoding, and (2) increasing the sampling temperature diversifies not only token choices but also their generation order. This diversity creates a rich search space for RL rollouts. For RL training, to reduce the variance of token log-likelihood estimates and maintain training efficiency, we propose \textbf{coupled-GRPO}, a novel sampling scheme that constructs complementary mask noise for completions used in training. In our experiments, coupled-GRPO significantly improves DiffuCoder’s performance on code generation benchmarks (+4.4\% on EvalPlus) and reduces reliance on AR bias during decoding. Our work provides deeper insight into the machinery of dLLM generation and offers an effective, diffusion-native RL training framework. https://github.com/apple/ml-diffucoder.
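The phrase "complementary mask noise" can be made concrete with a small sketch. It assumes that a complementary pair means a random mask and its exact complement over the completion positions, so every token is masked in exactly one of the two passes; the function name and the `mask_ratio` parameter are illustrative and not taken from the paper or its repository.

```python
import random
from typing import List, Tuple


def complementary_masks(
    length: int, mask_ratio: float, rng: random.Random
) -> Tuple[List[bool], List[bool]]:
    """Return a random boolean mask over `length` positions and its complement."""
    n_masked = round(length * mask_ratio)
    masked = set(rng.sample(range(length), n_masked))
    first = [i in masked for i in range(length)]
    # The second mask hides exactly the positions the first one leaves visible.
    second = [not m for m in first]
    return first, second


if __name__ == "__main__":
    a, b = complementary_masks(length=10, mask_ratio=0.3, rng=random.Random(0))
    print("mask A:", "".join("M" if m else "." for m in a))
    print("mask B:", "".join("M" if m else "." for m in b))
```

Under this assumption, scoring a completion once under each mask covers every token exactly once, which is one plausible reading of how pairing the masks would reduce the variance of token log-likelihood estimates.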
Article 607
Title@2025-06-26 (4): Hybrid Deep Learning and Signal Processing for Arabic Dialect Recognition in Low-Resource Settings
Title: Hybrid Deep Learning and Signal Processing for Arabic Dialect Recognition in Low-Resource Settings | Hybrides Deep Learning und Signalverarbeitung für die arabische Dialekterkennung in Low-Resource-Einstellungen | 低资源设置中阿拉伯语语音识别的混合深深学习和信号处理 2506.21386v1 |
Authors (2): Ghazal Al-Shwayyat, Omer Nezih Gerek
Arabic dialect recognition presents a significant challenge in speech technology due to the linguistic diversity of Arabic and the scarcity of large annotated datasets, particularly for underrepresented dialects. This research investigates hybrid modeling strategies that integrate classical signal processing techniques with deep learning architectures to address this problem in low-resource scenarios. Two hybrid models were developed and evaluated: (1) Mel-Frequency Cepstral Coefficients (MFCC) combined with a Convolutional Neural Network (CNN), and (2) Discrete Wavelet Transform (DWT) features combined with a Recurrent Neural Network (RNN). The models were trained on a dialect-filtered subset of the Common Voice Arabic dataset, with dialect labels assigned based on speaker metadata. Experimental results demonstrate that the MFCC + CNN architecture achieved superior performance, with an accuracy of 91.2% and strong precision, recall, and F1-scores, significantly outperforming the Wavelet + RNN configuration, which achieved an accuracy of 66.5%. These findings highlight the effectiveness of leveraging spectral features with convolutional models for Arabic dialect recognition, especially when working with limited labeled data. The study also identifies limitations related to dataset size, potential regional overlaps in labeling, and model optimization, providing a roadmap for future research. Recommendations for further improvement include the adoption of larger annotated corpora, integration of self-supervised learning techniques, and exploration of advanced neural architectures such as Transformers. Overall, this research establishes a strong baseline for future developments in Arabic dialect recognition within resource-constrained environments.
Article 608
Title@2025-06-26 (4): Leveraging LLM-Assisted Query Understanding for Live Retrieval-Augmented Generation
Title: Leveraging LLM-Assisted Query Understanding for Live Retrieval-Augmented Generation | Leveraging LLM-Assisted Query Understanding for Live Retrieval-Augmented Generation | 利用LLM协助的对活检索一代人查询了解 2506.21384v1 |
Authors (4): Guanting Dong, Xiaoxi Li, Yuyao Zhang, Mengjie Deng
Real-world live retrieval-augmented generation (RAG) systems face significant challenges when processing user queries that are often noisy, ambiguous, and contain multiple intents. While RAG enhances large language models (LLMs) with external knowledge, current systems typically struggle with such complex inputs, as they are often trained or evaluated on cleaner data. This paper introduces Omni-RAG, a novel framework designed to improve the robustness and effectiveness of RAG systems in live, open-domain settings. Omni-RAG employs LLM-assisted query understanding to preprocess user inputs through three key modules: (1) Deep Query Understanding and Decomposition, which utilizes LLMs with tailored prompts to denoise queries (e.g., correcting spelling errors) and decompose multi-intent queries into structured sub-queries; (2) Intent-Aware Knowledge Retrieval, which performs retrieval for each sub-query from a corpus (i.e., FineWeb using OpenSearch) and aggregates the results; and (3) Reranking and Generation, where a reranker (i.e., BGE) refines document selection before a final response is generated by an LLM (i.e., Falcon-10B) using a chain-of-thought prompt. Omni-RAG aims to bridge the gap between current RAG capabilities and the demands of real-world applications, such as those highlighted by the SIGIR 2025 LiveRAG Challenge, by robustly handling complex and noisy queries.
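The first two modules above (decomposing a multi-intent query, then retrieving per sub-query and aggregating) can be sketched as a toy pipeline. Everything here is a hypothetical stand-in: `decompose` splits on the word "and" instead of prompting an LLM, `retrieve` uses word overlap instead of OpenSearch over FineWeb, and the reranking and generation steps are omitted.

```python
from typing import Dict, List

# Tiny illustrative corpus; the document IDs and texts are made up.
CORPUS: Dict[str, str] = {
    "d1": "falcon is a language model",
    "d2": "opensearch is a search engine",
    "d3": "bananas are yellow",
}


def decompose(query: str) -> List[str]:
    """Hypothetical stand-in for LLM decomposition: split a query on ' and '."""
    parts = [part.strip(" ?") for part in query.lower().split(" and ")]
    return [part for part in parts if part]


def retrieve(sub_query: str, corpus: Dict[str, str], k: int = 1) -> List[str]:
    """Rank documents by word overlap with the sub-query and keep the top k."""
    terms = set(sub_query.split())
    scored = [(len(terms & set(text.split())), doc_id) for doc_id, text in corpus.items()]
    top = sorted(scored, reverse=True)[:k]
    return [doc_id for overlap, doc_id in top if overlap > 0]


def retrieve_for_query(query: str, corpus: Dict[str, str]) -> List[str]:
    """Retrieve per sub-query and merge the results, dropping duplicates in order."""
    merged: List[str] = []
    for sub_query in decompose(query):
        for doc_id in retrieve(sub_query, corpus):
            if doc_id not in merged:
                merged.append(doc_id)
    return merged


if __name__ == "__main__":
    print(retrieve_for_query("What is Falcon and what is OpenSearch?", CORPUS))
```

Retrieving per sub-query keeps a document relevant to only one intent from being crowded out by documents that weakly match the whole query.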
Article 609
Title@2025-06-26 (4): Structuralist Approach to AI Literary Criticism: Leveraging Greimas Semiotic Square for Large Language Models
Title: Structuralist Approach to AI Literary Criticism: Leveraging Greimas Semiotic Square for Large Language Models | Structuralist Approach to AI Literary Criticism: Leveraging Greimas Semiotic Square for Large Language Models | AI 文学批评主义的结构性方法:大语言模型利用Greimas半语言广场 2506.21360v1 |
Authors (4): Fangzhou Dong, Yifan Zeng, Yingpeng Sang, Hong Shen
Large Language Models (LLMs) excel in understanding and generating text but struggle with providing professional literary criticism for works with profound thoughts and complex narratives. This paper proposes GLASS (Greimas Literary Analysis via Semiotic Square), a structured analytical framework based on Greimas Semiotic Square (GSS), to enhance LLMs’ ability to conduct in-depth literary analysis. GLASS facilitates the rapid dissection of narrative structures and deep meanings in narrative works. We propose the first dataset for GSS-based literary criticism, featuring detailed analyses of 48 works. Then we propose quantitative metrics for GSS-based literary criticism using the LLM-as-a-judge paradigm. Our framework’s results, compared with expert criticism across multiple works and LLMs, show high performance. Finally, we applied GLASS to 39 classic works, producing original and high-quality analyses that address existing research gaps. This research provides an AI-based tool for literary research and education, offering insights into the cognitive mechanisms underlying literary engagement.
Article 610
Title@2025-06-26 (4): Latent Prototype Routing: Achieving Near-Perfect Load Balancing in Mixture-of-Experts
Title: Latent Prototype Routing: Achieving Near-Perfect Load Balancing in Mixture-of-Experts | Latent Prototype Routing: Erzielen einer nahezu perfekten Lastabgleichung in Mixture-of-Experts | 原型原型路由:在混合专家中实现近效果负载平衡 2506.21328v1 |
Authors (1): Jiajie Yang
Mixture-of-Experts (MoE) architectures have emerged as a key strategy for scaling large language models (LLMs) efficiently. However, current MoE systems suffer from severe load imbalance, where only a small subset of experts is consistently activated during training and inference, leading to significant underutilization of model capacity and computational resources. In this work, we revisit expert routing through a clustering perspective and propose Latent Prototype Routing (LPR), a novel routing framework that generalizes existing approaches while promoting balanced expert utilization without compromising downstream performance. Extensive experiments across multiple open-source MoE models – including DeepSeek-V3, Qwen3-MoE, and Mixtral – demonstrate that LPR reduces the Gini coefficient of expert load from 0.70 to 0.035 on average and improves the min-max expert load ratio from 1e-6 to 0.70, achieving near-perfect load balancing.
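The two balance metrics quoted above can be computed as follows. This sketch assumes the standard definition of the Gini coefficient over per-expert token counts and takes the min-max ratio to be the smallest load divided by the largest; the abstract does not spell out its exact formulas, and the load vectors are made-up examples.

```python
from typing import Sequence


def gini(loads: Sequence[float]) -> float:
    """Gini coefficient of non-negative loads: 0 is perfectly balanced."""
    xs = sorted(loads)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-weighted formula over values sorted in ascending order.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def min_max_ratio(loads: Sequence[float]) -> float:
    """Smallest expert load divided by the largest: 1 is perfectly balanced."""
    largest = max(loads)
    return min(loads) / largest if largest > 0 else 0.0


if __name__ == "__main__":
    balanced = [250, 250, 250, 250]   # tokens routed to each of 4 experts
    collapsed = [0, 0, 0, 1000]       # one expert receives everything
    print(gini(balanced), min_max_ratio(balanced))
    print(gini(collapsed), min_max_ratio(collapsed))
```

With n experts, the largest possible Gini value is (n - 1) / n, reached when a single expert receives all tokens.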
Article 611
Title@2025-06-26 (4): Exploring Adapter Design Tradeoffs for Low Resource Music Generation
Title: Exploring Adapter Design Tradeoffs for Low Resource Music Generation | Erforschung von Adapter-Design-Tradeoffs für Low Resource Music Generation | 探索用于低资源音乐制作的适应设计取舍 2506.21298v1 |
Authors (3): Atharva Mehta, Shivam Chauhan, Monojit Choudhury
Fine-tuning large-scale music generation models, such as MusicGen and Mustango, is a computationally expensive process, often requiring updates to billions of parameters and, therefore, significant hardware resources. Parameter-Efficient Fine-Tuning (PEFT) techniques, particularly adapter-based methods, have emerged as a promising alternative, enabling adaptation with minimal trainable parameters while preserving model performance. However, the design choices for adapters, including their architecture, placement, and size, are numerous, and it is unclear which of these combinations would produce optimal adapters, and why, for a given low-resource music genre. In this paper, we attempt to answer this question by studying various adapter configurations for two AI music models, MusicGen and Mustango, on two genres: Hindustani Classical and Turkish Makam music. Our findings reveal distinct trade-offs: convolution-based adapters excel in capturing fine-grained local musical details such as ornamentations and short melodic phrases, while transformer-based adapters better preserve long-range dependencies crucial for structured improvisation. Additionally, we analyze computational resource requirements across different adapter scales, demonstrating how mid-sized adapters (40M parameters) achieve an optimal balance between expressivity and quality. Furthermore, we find that Mustango, a diffusion-based model, generates more diverse outputs with better adherence to the description in the input prompt, but is less stable in notes, rhythm alignment, and aesthetics. It is also computationally intensive and requires significantly more time to train. In contrast, autoregressive models like MusicGen offer faster training, are more efficient, and can produce better quality output in comparison, but have slightly higher redundancy in their generations.
Article 612
Title@2025-06-26 (4): Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models
Title: Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models | Erkennung von Verweisen auf Ausdrücke im visuell begründeten Dialog mit autoregressiven Sprachmodellen | 与自动递减语言模型进行视觉基础对话中检测引用表达式 2506.21294v1 |
Authors (2): Bram Willemsen, Gabriel Skantze
In this paper, we explore the use of a text-only, autoregressive language modeling approach for the extraction of referring expressions from visually grounded dialogue. More specifically, the aim is to investigate the extent to which the linguistic context alone can inform the detection of mentions that have a (visually perceivable) referent in the visual context of the conversation. To this end, we adapt a pretrained large language model (LLM) to perform a relatively coarse-grained annotation of mention spans in unfolding conversations by demarcating mention span boundaries in text via next-token prediction. Our findings indicate that even when using a moderately sized LLM, relatively small datasets, and parameter-efficient fine-tuning, a text-only approach can be effective, highlighting the relative importance of the linguistic context for this task. Nevertheless, we argue that the task represents an inherently multimodal problem and discuss limitations fundamental to unimodal approaches.
Article 613
Title@2025-06-26 (4): Small Encoders Can Rival Large Decoders in Detecting Groundedness
Title: Small Encoders Can Rival Large Decoders in Detecting Groundedness | Kleine Encoder können große Decoder bei der Erkennung von Erdlichkeit rivalisieren | 在地面探测中能够使大型分离器在探测地面时发生迭接 2506.21288v1 |
Authors (7): Istabrak Abbes, Gabriele Prato, Quentin Fournier, Fernando Rodriguez, Alaa Boukhary, Adam Elwood, Sarath Chandar
Augmenting large language models (LLMs) with external context significantly improves their performance in natural language processing (NLP) tasks. However, LLMs struggle to answer queries reliably when the provided context lacks information, often resorting to ungrounded speculation or internal knowledge. Groundedness - generating responses strictly supported by the context - is essential for ensuring factual consistency and trustworthiness. This study focuses on detecting whether a given query is grounded in a document provided in context before the costly answer generation by LLMs. Such a detection mechanism can significantly reduce both inference time and resource consumption. We show that lightweight, task-specific encoder models such as RoBERTa and NomicBERT, fine-tuned on curated datasets, can achieve accuracy comparable to state-of-the-art LLMs, such as Llama3 8B and GPT4o, in groundedness detection while reducing inference latency by orders of magnitude. The code is available at: https://github.com/chandarlab/Hallucinate-less
Article 614
Title@2025-06-26 (4): Thinkless: LLM Learns When to Think
Title: Thinkless: LLM Learns When to Think | Denklos: LLM lernt, wann man denkt | 无思想:LLM学习思考时间 2505.13379v2 |
Authors (3): Gongfan Fang, Xinyin Ma, Xinchao Wang
Reasoning Language Models, capable of extended chain-of-thought reasoning, have demonstrated remarkable performance on tasks requiring complex logical inference. However, applying elaborate reasoning for all queries often results in substantial computational inefficiencies, particularly when many problems admit straightforward solutions. This motivates an open question: Can LLMs learn when to think? To answer this, we propose Thinkless, a learnable framework that empowers an LLM to adaptively select between short-form and long-form reasoning, based on both task complexity and the model’s ability. Thinkless is trained under a reinforcement learning paradigm and employs two control tokens,
Article 615
Title@2025-06-26 (4): Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning
Title: Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning | Double-Checker: Bessere Begründung von langsam denkenden LLMs über selbstkritische Feinsteuerung | 双重检查者:通过自批评性微调,加强慢思考低迷LMs的理由 2506.21285v1 |
Authors (14): Xin Xu, Tianhao Chen, Fan Zhang, Wanlong Liu, Pengxiang Li, Ajay Kumar Jaiswal, Yuchen Yan, Jishan Hu, Yang Wang, Hao Chen, Shiwei Liu, Shizhe Diao, Can Yang, Lu Yin
While slow-thinking large language models (LLMs) exhibit reflection-like reasoning, commonly referred to as the “aha moment”, their ability to generate informative critiques and refine prior solutions remains limited. In this paper, we introduce Double-Checker, a principled framework designed to enhance the reasoning capabilities of slow-thinking LLMs by fostering explicit self-critique and iterative refinement of their previous solutions. By fine-tuning on our curated 1,730 self-critical instances, Double-Checker empowers long-CoT LLMs to iteratively critique and refine their outputs during inference until they evaluate their solutions as correct under self-generated critiques. We validate the efficacy of Double-Checker across a comprehensive suite of reasoning benchmarks, demonstrating that iterative self-critique significantly enhances the reasoning capabilities of long-CoT LLMs. Notably, our Double-Checker increases the pass@1 performance on challenging AIME benchmarks from 4.4% to 18.2% compared to the original long-CoT LLMs. These results highlight a promising direction for developing more trustworthy and effective LLMs capable of structured self-critique.
Article 616
Title@2025-06-26 (4): HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context
Title: HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context | HumanOmniV2: Vom Verständnis zur Omni-Modalen Vernunft mit Kontext | HumanOmniV2:从理解到以上下文为根据的全方位模式 2506.21277v1 |
Authors (10): Qize Yang, Shimin Yao, Weixuan Chen, Shenghao Fu, Detao Bai, Jiaxing Zhao, Boyuan Sun, Bowen Yin, Xihan Wei, Jingren Zhou
With the rapid evolution of multimodal large language models, the capacity to deeply understand and interpret human intentions has emerged as a critical capability, which demands detailed and thoughtful reasoning. In recent studies, Reinforcement Learning (RL) has demonstrated potential in enhancing the reasoning capabilities of Large Language Models (LLMs). Nonetheless, the challenges associated with adapting RL to multimodal data and formats remain largely unaddressed. In this paper, we identify two issues in existing multimodal reasoning models: insufficient global context understanding and shortcut problems. Insufficient context understanding can happen when a model misinterprets multimodal context, resulting in incorrect answers. The shortcut problem occurs when the model overlooks crucial clues in multimodal inputs, directly addressing the query without considering the multimodal information. To tackle these issues, we emphasize the necessity for the model to reason with a clear understanding of the global context within multimodal inputs. This global context understanding can effectively prevent the model from overlooking key multimodal cues and ensure a thorough reasoning process. To ensure the accurate interpretation of multimodal context information, we implement a context reward judged by a large language model, alongside format and accuracy rewards. Additionally, to improve complex reasoning capability, we employ the LLM to assess the logical reward, determining whether the reasoning process successfully integrates multimodal information with logical methods. We also introduce a reasoning omni-modal benchmark, IntentBench, aimed at evaluating models in understanding complex human intentions and emotions. Our proposed method demonstrates advanced performance across multiple omni-modal benchmarks compared to other open-source omni-modal models.
Article 617
Title@2025-06-26 (4): Can “consciousness” be observed from large language model (LLM) internal states? Dissecting LLM representations obtained from Theory of Mind test with Integrated Information Theory and Span Representation analysis
Title: Can “consciousness” be observed from large language model (LLM) internal states? Dissecting LLM representations obtained from Theory of Mind test with Integrated Information Theory and Span Representation analysis | Kann “Bewusstsein” von großen Sprachmodellen (LLM) innerhalb von Zuständen beobachtet werden? | 从大型语言模型内部状态观察到“意识”吗?通过综合信息理论和全方位代表分析,将从思维理论测试中获得的LLM表示法解析 2506.22516v1 |
Authors (1): Jingkai Li
Integrated Information Theory (IIT) provides a quantitative framework for explaining the phenomenon of consciousness, positing that conscious systems comprise elements integrated through causal properties. We apply IIT 3.0 and 4.0 – the latest iterations of this framework – to sequences of Large Language Model (LLM) representations, analyzing data derived from existing Theory of Mind (ToM) test results. Our study systematically investigates whether differences in ToM test performance, when present in the LLM representations, can be revealed by IIT estimates, i.e., $\Phi^{\max}$ (IIT 3.0), $\Phi$ (IIT 4.0), Conceptual Information (IIT 3.0), and $\Phi$-structure (IIT 4.0). Furthermore, we compare these metrics with the Span Representations independent of any estimate for consciousness. This additional effort aims to differentiate between potential “consciousness” phenomena and inherent separations within LLM representational space. We conduct comprehensive experiments examining variations across LLM transformer layers and linguistic spans from stimuli. Our results suggest that sequences of contemporary Transformer-based LLM representations lack statistically significant indicators of observed “consciousness” phenomena but exhibit intriguing patterns under $\textit{spatio}$-permutational analyses. The Appendix and code are available as Supplementary Materials at: https://doi.org/10.1016/j.nlp.2025.100163.
Article 618
Title@2025-06-26 (4): Cat and Mouse – Can Fake Text Generation Outpace Detector Systems?
Title: Cat and Mouse – Can Fake Text Generation Outpace Detector Systems? | Katze und Maus – Kann die Textgenerierung ausfallende Detektorsysteme fälschen? | 猫和老鼠 – – 假文本生成能否超越检测器系统? 2506.21274v1 |
Authors (2): Andrea McGlinchey, Peter J Barclay
Large language models can produce convincing “fake text” in domains such as academic writing, product reviews, and political news. Many approaches have been investigated for the detection of artificially generated text. While this may seem to presage an endless “arms race”, we note that newer LLMs use ever more parameters, training data, and energy, while relatively simple classifiers demonstrate a good level of detection accuracy with modest resources. To approach the question of whether the models’ ability to beat the detectors may therefore reach a plateau, we examine the ability of statistical classifiers to identify “fake text” in the style of classical detective fiction. Over a 0.5 version increase, we found that Gemini showed an increased ability to generate deceptive text, while GPT did not. This suggests that reliable detection of fake text may remain feasible even for ever-larger models, though new model architectures may improve their deceptiveness.
Article 619
Title@2025-06-26 (4): A Troublemaker with Contagious Jailbreak Makes Chaos in Honest Towns
Title: A Troublemaker with Contagious Jailbreak Makes Chaos in Honest Towns | Ein Troublemaker mit ansteckenden Jailbreak macht Chaos in ehrlichen Städten | 一个麻烦制造者 与贪婪的监狱破碎 制造混乱 在诚实的城镇 2410.16155v2 |
Authors (6): Tianyi Men, Pengfei Cao, Zhuoran Jin, Yubo Chen, Kang Liu, Jun Zhao
With the development of large language models, they are widely used as agents in various fields. A key component of agents is memory, which stores vital information but is susceptible to jailbreak attacks. Existing research mainly focuses on single-agent attacks and shared memory attacks. However, real-world scenarios often involve independent memory. In this paper, we propose the Troublemaker Makes Chaos in Honest Town (TMCHT) task, a large-scale, multi-agent, multi-topology text-based attack evaluation framework. TMCHT involves one attacker agent attempting to mislead an entire society of agents. We identify two major challenges in multi-agent attacks: (1) Non-complete graph structure, (2) Large-scale systems. We attribute these challenges to a phenomenon we term toxicity disappearing. To address these issues, we propose an Adversarial Replication Contagious Jailbreak (ARCJ) method, which optimizes the retrieval suffix to make poisoned samples more easily retrieved and optimizes the replication suffix to give poisoned samples contagious ability. We demonstrate the superiority of our approach in TMCHT, with 23.51%, 18.95%, and 52.93% improvements in line topology, star topology, and 100-agent settings. We encourage community attention to the security of multi-agent systems.
Article 620
Title@2025-06-26 (4): DiLoCoX: A Low-Communication Large-Scale Training Framework for Decentralized Cluster
Title: DiLoCoX: A Low-Communication Large-Scale Training Framework for Decentralized Cluster | DiLoCoX: Ein kommunikationsarmer groß angelegter Ausbildungsrahmen für dezentralisierte Cluster | DILOCOX:权力下放小组的低通信大范围培训框架 2506.21263v1 |
Authors (9): Ji Qi, WenPeng Zhu, Li Li, Ming Wu, YingJun Wu, Wu He, Xun Gao, Jason Zeng, Michael Heinrich
The distributed training of foundation models, particularly large language models (LLMs), demands a high level of communication. Consequently, it is highly dependent on a centralized cluster with fast and reliable interconnects. Can we conduct training on slow networks and thereby unleash the power of decentralized clusters when dealing with models exceeding 100 billion parameters? In this paper, we propose DiLoCoX, a low-communication large-scale decentralized cluster training framework. It combines Pipeline Parallelism with Dual Optimizer Policy, One-Step-Delay Overlap of Communication and Local Training, and an Adaptive Gradient Compression Scheme. This combination significantly improves the scale of parameters and the speed of model pre-training. We justify the benefits of one-step-delay overlap of communication and local training, as well as the adaptive gradient compression scheme, through a theoretical analysis of convergence. Empirically, we demonstrate that DiLoCoX is capable of pre-training a 107B foundation model over a 1Gbps network. Compared to vanilla AllReduce, DiLoCoX can achieve a 357x speedup in distributed training while maintaining negligible degradation in model convergence. To the best of our knowledge, this is the first decentralized training framework successfully applied to models with over 100 billion parameters.
Article 621
Title@2025-06-26 (4): Simulating Hard Attention Using Soft Attention
Title: Simulating Hard Attention Using Soft Attention | Simulation der harten Aufmerksamkeit mit weicher Aufmerksamkeit | 使用软关注模拟硬关注 2412.09925v2 |
Authors (4): Andy Yang, Lena Strobl, David Chiang, Dana Angluin
We study conditions under which transformers using soft attention can simulate hard attention, that is, effectively focus all attention on a subset of positions. First, we examine several subclasses of languages recognized by hard-attention transformers, which can be defined in variants of linear temporal logic. We demonstrate how soft-attention transformers can compute formulas of these logics using unbounded positional embeddings or temperature scaling. Second, we demonstrate how temperature scaling allows softmax transformers to simulate general hard-attention transformers, using a temperature that depends on the minimum gap between the maximum attention scores and other attention scores.
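The role of temperature can be seen in a small numeric sketch, which is only an illustration and not the paper's construction. Dividing attention scores by a temperature below 1 pushes the softmax toward a one-hot distribution on the top score. The helper `max_weight_lower_bound` uses an elementary bound that holds when there is a unique maximum: with gap `gap` to every other score and `n` positions, the top weight is at least 1 / (1 + (n - 1) * exp(-gap / temperature)).

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax of scores divided by a positive temperature."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    # Subtract the peak before exponentiating for numerical stability.
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def max_weight_lower_bound(gap: float, n: int, temperature: float) -> float:
    """Lower bound on the top softmax weight given the gap to all other scores."""
    return 1.0 / (1.0 + (n - 1) * math.exp(-gap / temperature))


if __name__ == "__main__":
    scores = [2.0, 1.0, 0.5]  # unique maximum, gap of 1.0 to the runner-up
    for temperature in (1.0, 0.25, 0.05):
        top = softmax(scores, temperature)[0]
        bound = max_weight_lower_bound(1.0, len(scores), temperature)
        print(f"T={temperature:<5} top weight={top:.6f} bound={bound:.6f}")
```

The bound makes the dependence explicit: the smaller the gap between the top score and the rest, the lower the temperature needed to reach a given concentration, which matches the abstract's statement that the temperature depends on the minimum gap.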
Article 622
Title@2025-06-26 (4): Agent-RewardBench: Towards a Unified Benchmark for Reward Modeling across Perception, Planning, and Safety in Real-World Multimodal Agents
Title: Agent-RewardBench: Towards a Unified Benchmark for Reward Modeling across Perception, Planning, and Safety in Real-World Multimodal Agents | Agent-RewardBench: Auf dem Weg zu einem einheitlichen Benchmark für Prämienmodellierung über Wahrnehmung, Planung und Sicherheit in multimodalen Real-World-Agenten | Agent-RewardBench:建立一个统一基准,用于在现实世界多式联运代理中建立跨认知、规划和安全概念、规划与安全的奖励模型 2506.21252v1 |
Authors (6): Tianyi Men, Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
As Multimodal Large Language Models (MLLMs) advance, multimodal agents show promise in real-world tasks like web navigation and embodied intelligence. However, due to a lack of external feedback, these agents struggle with self-correction and generalization. A promising approach is to use reward models as external feedback, but it is unclear how to select reward models for agents. Thus, there is an urgent need to build a reward bench targeted at agents. To address these challenges, we propose Agent-RewardBench, a benchmark designed to evaluate reward modeling ability in MLLMs. The benchmark is characterized by three key features: (1) Multiple dimensions and real-world agent scenarios evaluation. It covers perception, planning, and safety with 7 scenarios; (2) Step-level reward evaluation. It allows for the assessment of agent capabilities at the individual steps of a task, providing a more granular view of performance during the planning process; and (3) Appropriate difficulty and high quality. We carefully sample from 10 diverse models, apply difficulty control to keep tasks challenging, and use manual verification to ensure the integrity of the data. Experiments demonstrate that even state-of-the-art multimodal models show limited performance, highlighting the need for specialized training in agent reward modeling. Code is available at github.
Article 623
Title@2025-06-26 (4): Capturing Style in Author and Document Representation
Title: Capturing Style in Author and Document Representation | Stil in der Autor- und Dokumentdarstellung erfassen | 在作者和文件代表中获取样式 2407.13358v2 |
Authors (3): Enzo Terreau, Antoine Gourru, Julien Velcin
A wide range of Deep Natural Language Processing (NLP) models integrate continuous and low-dimensional representations of words and documents. Surprisingly, very few models study representation learning for authors. These representations can be used for many NLP tasks, such as author identification and classification, or in recommendation systems. A strong limitation of existing works is that they do not explicitly capture writing style, making them hardly applicable to literary data. We therefore propose a new architecture based on Variational Information Bottleneck (VIB) that learns embeddings for both authors and documents with a stylistic constraint. Our model fine-tunes a pre-trained document encoder. We stimulate the detection of writing style by adding predefined stylistic features, making the representation axis interpretable with respect to writing style indicators. We evaluate our method on three datasets: a literary corpus extracted from the Gutenberg Project, the Blog Authorship Corpus and IMDb62, for which we show that it matches or outperforms strong/recent baselines in authorship attribution while capturing the authors' stylistic aspects much more accurately.
Article 624
Title@2025-06-26 (4): Enhancing Automatic Term Extraction with Large Language Models via Syntactic Retrieval
Title: Enhancing Automatic Term Extraction with Large Language Models via Syntactic Retrieval | Automatische Termextraktion mit großen Sprachmodellen durch syntactic Retrieval verbessern | 通过同步检索增强使用大语言模型的自动定期抽取功能 2506.21222v1 |
Authors (5): Yongchan Chun, Minhyuk Kim, Dongjun Kim, Chanjun Park, Heuiseok Lim
Automatic Term Extraction (ATE) identifies domain-specific expressions that are crucial for downstream tasks such as machine translation and information retrieval. Although large language models (LLMs) have significantly advanced various NLP tasks, their potential for ATE has scarcely been examined. We propose a retrieval-based prompting strategy that, in the few-shot setting, selects demonstrations according to syntactic rather than semantic similarity. This syntactic retrieval method is domain-agnostic and provides more reliable guidance for capturing term boundaries. We evaluate the approach in both in-domain and cross-domain settings, analyzing how lexical overlap between the query sentence and its retrieved examples affects performance. Experiments on three specialized ATE benchmarks show that syntactic retrieval improves F1-score. These findings highlight the importance of syntactic cues when adapting LLMs to terminology-extraction tasks.
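As a rough illustration of selecting few-shot demonstrations by syntactic rather than semantic similarity, the sketch below ranks a pool of pre-tagged sentences by how closely their part-of-speech sequences match the query's. The similarity measure (`difflib.SequenceMatcher` over POS tags), the function names, and the toy pool are assumptions made for illustration; the abstract does not say how syntactic similarity is actually computed.

```python
from difflib import SequenceMatcher


def syntactic_similarity(tags_a, tags_b):
    """Similarity in [0, 1] between two POS-tag sequences (assumed measure)."""
    return SequenceMatcher(None, tags_a, tags_b).ratio()


def retrieve_demonstrations(query_tags, pool, k=2):
    """Return the k pool entries whose tag sequences best match the query."""
    ranked = sorted(
        pool,
        key=lambda demo: syntactic_similarity(query_tags, demo["tags"]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical, hand-tagged demonstration pool.
pool = [
    {"id": "a", "tags": ["DET", "ADJ", "NOUN", "VERB"]},
    {"id": "b", "tags": ["VERB", "ADV"]},
    {"id": "c", "tags": ["DET", "NOUN", "VERB"]},
]
query_tags = ["DET", "ADJ", "NOUN", "VERB"]
selected = retrieve_demonstrations(query_tags, pool, k=2)
```

Because the ranking only looks at tag sequences, a demonstration from a different domain can still be selected when its sentence structure matches, which is the sense in which such retrieval is domain-agnostic.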
Article 625
Title@2025-06-26 (4): Complexity-aware fine-tuning
Title: Complexity-aware fine-tuning | Komplexitätsbewusste Feinabstimmung | 复杂度认知微调 2506.21220v1 |
Authors (5): Andrey Goncharov, Daniil Vyazhev, Petr Sychev, Edvard Khalafyan, Alexey Zaytsev
General-purpose Large Language Models (LLMs) are frequently fine-tuned through supervised fine-tuning (SFT) to enhance performance in specific domains. Better results can be achieved by distilling the chain-of-thought of a larger model at the cost of numerous expensive calls and a much greater amount of data. We propose a novel blueprint for efficient fine-tuning that uses reasoning only for complex data identified by entropy. Specifically, across two small open models ($\approx 3B$) we split the training data into complexity categories by single-token answer entropy (ROC AUC $0.73$), fine-tune large language models (LLMs) via SFT and distillation, and show that our pipeline significantly outperforms the standard SFT approach ($0.55$ vs $0.43$ average accuracy) and performs comparably to distillation while using $62\%$ less data ($0.55$ average accuracy for both). We publish our code and data to facilitate further research in this direction.
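The entropy-based split can be pictured with a small sketch. It assumes each training example carries the model's probability distribution over candidate single-token answers (the `answer_probs` field), and it uses a two-way split with a hypothetical threshold; the field name, the threshold value, and the number of categories are illustrative assumptions, not details from the abstract.

```python
import math


def answer_entropy(probs):
    """Shannon entropy (in nats) of a distribution over answer tokens."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def split_by_complexity(examples, threshold):
    """Split examples into (easy, hard) by single-token answer entropy."""
    easy, hard = [], []
    for example in examples:
        if answer_entropy(example["answer_probs"]) > threshold:
            hard.append(example)  # uncertain answer: candidate for reasoning data
        else:
            easy.append(example)  # confident answer: plain SFT is assumed enough
    return easy, hard


# Hypothetical examples over four answer options (A-D).
examples = [
    {"id": 1, "answer_probs": [0.97, 0.01, 0.01, 0.01]},
    {"id": 2, "answer_probs": [0.25, 0.25, 0.25, 0.25]},
]
easy, hard = split_by_complexity(examples, threshold=0.7)
```

A uniform distribution over four options has the maximum entropy of ln 4 (about 1.386 nats), so the second example lands in the hard bucket while the near-certain first example stays in the easy one.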
Article 626
Title@2025-06-26 (4): Unveiling Causal Reasoning in Large Language Models: Reality or Mirage?
Title: Unveiling Causal Reasoning in Large Language Models: Reality or Mirage? | Kausale Vernunft in großen Sprachmodellen enthüllen: Realität oder Mirage? | 大语言模型中未解的因果理由:现实还是幻影? 2506.21215v1 |
Authors (8): Haoang Chi, He Li, Wenjing Yang, Feng Liu, Long Lan, Xiaoguang Ren, Tongliang Liu, Bo Han
Causal reasoning capability is critical in advancing large language models (LLMs) toward strong artificial intelligence. While versatile LLMs appear to have demonstrated capabilities in understanding contextual causality and providing responses that obey the laws of causality, it remains unclear whether they perform genuine causal reasoning akin to humans. However, current evidence indicates the contrary. Specifically, LLMs are only capable of performing shallow (level-1) causal reasoning, primarily attributed to the causal knowledge embedded in their parameters, but they lack the capacity for genuine human-like (level-2) causal reasoning. To support this hypothesis, methodologically, we delve into the autoregression mechanism of transformer-based LLMs, revealing that it is not inherently causal. Empirically, we introduce a new causal Q&A benchmark called CausalProbe-2024, whose corpora are fresh and nearly unseen for the studied LLMs. The LLMs exhibit a significant performance drop on CausalProbe-2024 compared to earlier benchmarks, indicating the fact that they primarily engage in level-1 causal reasoning. To bridge the gap towards level-2 causal reasoning, we draw inspiration from the fact that human reasoning is usually facilitated by general knowledge and intended goals. We propose G^2-Reasoner, a method that incorporates general knowledge and goal-oriented prompts into LLMs’ causal reasoning processes. Experiments demonstrate that G^2-Reasoner significantly enhances LLMs’ causal reasoning capability, particularly in fresh and counterfactual contexts. This work sheds light on a new path for LLMs to advance towards genuine causal reasoning, going beyond level-1 and making strides towards level-2.
Article 627
Title@2025-06-26 (4): TAPS: Tool-Augmented Personalisation via Structured Tagging
Title: TAPS: Tool-Augmented Personalisation via Structured Tagging | TAPS: Tool-Augmented Personalisierung durch strukturiertes Tagging | TAPS: 通过结构拖网提高工具的个性化 2506.20409v2 |
Authors (2): Ekaterina Taktasheva, Jeff Dalton
Recent advancements in tool-augmented large language models have enabled them to interact with external tools, enhancing their ability to perform complex user tasks. However, existing approaches overlook the role of personalisation in guiding tool use. This work investigates how user preferences can be effectively integrated into goal-oriented dialogue agents. Through extensive analysis, we identify key weaknesses in the ability of LLMs to personalise tool use. To this end, we introduce TAPS, a novel solution that enhances personalised tool use by leveraging a structured tagging tool and an uncertainty-based tool detector. TAPS significantly improves the ability of LLMs to incorporate user preferences, achieving the new state-of-the-art for open source models on the NLSI task.
Article 628
Title@2025-06-26 (4): LLM-Based Human-Agent Collaboration and Interaction Systems: A Survey
Title: LLM-Based Human-Agent Collaboration and Interaction Systems: A Survey | LLM-basierte human-agente Kooperations- und Interaktionssysteme: Eine Umfrage | 以LLM为基础的人类-机构协作和互动系统:调查 2505.00753v4 |
Authors (15): Henry Peng Zou, Wei-Chieh Huang, Yaozu Wu, Yankai Chen, Chunyu Miao, Hoang Nguyen, Yue Zhou, Weizhi Zhang, Liancheng Fang, Langzhou He, Yangning Li, Dongyuan Li, Renhe Jiang, Xue Liu, Philip S. Yu
Recent advances in large language models (LLMs) have sparked growing interest in building fully autonomous agents. However, fully autonomous LLM-based agents still face significant challenges, including limited reliability due to hallucinations, difficulty in handling complex tasks, and substantial safety and ethical risks, all of which limit their feasibility and trustworthiness in real-world applications. To overcome these limitations, LLM-based human-agent systems (LLM-HAS) incorporate human-provided information, feedback, or control into the agent system to enhance system performance, reliability and safety. These human-agent collaboration systems enable humans and LLM-based agents to collaborate effectively by leveraging their complementary strengths. This paper provides the first comprehensive and structured survey of LLM-HAS. It clarifies fundamental concepts, systematically presents core components shaping these systems, including environment & profiling, human feedback, interaction types, orchestration and communication, explores emerging applications, and discusses unique challenges and opportunities arising from human-AI collaboration. By consolidating current knowledge and offering a structured overview, we aim to foster further research and innovation in this rapidly evolving interdisciplinary field. Paper lists and resources are available at https://github.com/HenryPengZou/Awesome-Human-Agent-Collaboration-Interaction-Systems.
Article 629
Title@2025-06-26 (4): Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks
Title: Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks | Beibehaltung des MTEB: Langfristige Nutzungsfähigkeit und Reproduzierbarkeit von Einbettungs-Benchmarks | 维持MDEB:实现长期使用和可复制嵌入基准 2506.21182v1 |
Authors (5): Isaac Chung, Imene Kerboua, Marton Kardos, Roman Solomatin, Kenneth Enevoldsen
The Massive Text Embedding Benchmark (MTEB) has become a standard evaluation platform for text embedding models. While previous work has established the core benchmark methodology, this paper focuses on the engineering aspects that ensure MTEB’s continued reproducibility and extensibility. We present our approach to maintaining robust continuous integration pipelines that validate dataset integrity, automate test execution, and assess benchmark results’ generalizability. We detail the design choices that collectively enhance reproducibility and usability. Furthermore, we discuss our strategies for handling community contributions and extending the benchmark with new tasks and datasets. These engineering practices have been instrumental in scaling MTEB to become more comprehensive while maintaining quality and, ultimately, relevance to the field. Our experiences offer valuable insights for benchmark maintainers facing similar challenges in ensuring reproducibility and usability in machine learning evaluation frameworks. The MTEB repository is available at: https://github.com/embeddings-benchmark/mteb
Article 630
Title@2025-06-26 (4): Compressed and Smooth Latent Space for Text Diffusion Modeling
Title: Compressed and Smooth Latent Space for Text Diffusion Modeling | Komprimierter und glatter Latent-Raum für Text-Diffusionsmodellierung | 压缩和平滑的文本传播中缓流空间模型模型 2506.21170v1 |
Authors (5): Viacheslav Meshchaninov, Egor Chimbulatov, Alexander Shabalin, Aleksandr Abramov, Dmitry Vetrov
Autoregressive language models dominate modern text generation, yet their sequential nature introduces fundamental limitations: decoding is slow, and maintaining global coherence remains challenging. Diffusion models offer a promising alternative by enabling parallel generation and flexible control; however, their application to text generation is hindered by the high dimensionality of token-level representations. We introduce Cosmos, a novel approach to text generation that operates entirely in a compressed, smooth latent space tailored specifically for diffusion. This space is learned using an autoencoder trained simultaneously for token-level reconstruction and alignment with frozen activations from a pretrained language encoder, providing robust semantic grounding and enabling effective perturbation-based augmentations. Empirically, we demonstrate that text representations can be compressed by $8\times$ while maintaining generation quality comparable to token-level diffusion models. Furthermore, increasing the latent sequence length allows Cosmos to surpass both diffusion-based and autoregressive baselines. We evaluate Cosmos on four diverse generative tasks including story generation, question generation, summarization, and detoxification and compare it with various generative paradigms. Cosmos achieves comparable or superior generation quality while offering more than $2\times$ faster inference.
Article 631
Title@2025-06-26 (4): CVC: A Large-Scale Chinese Value Rule Corpus for Value Alignment of Large Language Models
Title: CVC: A Large-Scale Chinese Value Rule Corpus for Value Alignment of Large Language Models | CVC: Eine groß angelegte chinesische Wertregel Corpus zur Wertausrichtung großer Sprachmodelle | CVC: 大型中文大语言模式价值调整大型中国价值规则公司 2506.01495v4 |
Authors (9): Ping Wu, Guobin Shen, Dongcheng Zhao, Yuwei Wang, Yiting Dong, Yu Shi, Enmeng Lu, Feifei Zhao, Yi Zeng
Ensuring that Large Language Models (LLMs) align with mainstream human values and ethical norms is crucial for the safe and sustainable development of AI. Current value evaluation and alignment are constrained by Western cultural bias and incomplete domestic frameworks reliant on non-native rules; furthermore, the lack of scalable, rule-driven scenario generation methods makes evaluations costly and inadequate across diverse cultural contexts. To address these challenges, we propose a hierarchical value framework grounded in core Chinese values, encompassing three main dimensions, 12 core values, and 50 derived values. Based on this framework, we construct a large-scale Chinese Values Corpus (CVC) containing over 250,000 value rules enhanced and expanded through human annotation. Experimental results show that CVC-guided scenarios outperform direct generation ones in value boundaries and content diversity. In the evaluation across six sensitive themes (e.g., surrogacy, suicide), seven mainstream LLMs preferred CVC-generated options in over 70.5% of cases, while five Chinese human annotators showed an 87.5% alignment with CVC, confirming its universality, cultural relevance, and strong alignment with Chinese values. Additionally, we construct 400,000 rule-based moral dilemma scenarios that objectively capture nuanced distinctions in conflicting value prioritization across 17 LLMs. Our work establishes a culturally-adaptive benchmarking framework for comprehensive value evaluation and alignment, representing Chinese characteristics. All data are available at https://huggingface.co/datasets/Beijing-AISI/CVC, and the code is available at https://github.com/Beijing-AISI/CVC.
Article 632
Title@2025-06-26 (4): Do Large Language Models Advocate for Inferentialism?
Title: Do Large Language Models Advocate for Inferentialism? | Befürworten große Sprachmodelle den Inferentialismus? | 大语言模型是否为推定主义辩护? 2412.14501v2 |
Authors (2): Yuzuki Arai, Sho Tsugawa
The emergence of large language models (LLMs) such as ChatGPT and Claude presents new challenges for philosophy of language, particularly regarding the nature of linguistic meaning and representation. While LLMs have traditionally been understood through distributional semantics, this paper explores Robert Brandom’s inferential semantics as an alternative foundational framework for understanding these systems. We examine how key features of inferential semantics – including its anti-representationalist stance, logical expressivism, and quasi-compositional approach – align with the architectural and functional characteristics of Transformer-based LLMs. Through analysis of the ISA (Inference, Substitution, Anaphora) approach, we demonstrate that LLMs exhibit fundamentally anti-representationalist properties in their processing of language. We further develop a consensus theory of truth appropriate for LLMs, grounded in their interactive and normative dimensions through mechanisms like RLHF. While acknowledging significant tensions between inferentialism’s philosophical commitments and LLMs’ sub-symbolic processing, this paper argues that inferential semantics provides valuable insights into how LLMs generate meaning without reference to external world representations. Our analysis suggests that LLMs may challenge traditional assumptions in philosophy of language, including strict compositionality and semantic externalism, though further empirical investigation is needed to fully substantiate these theoretical claims.
Article 633
Title@2025-06-26 (4): Learning Evaluation Models from Large Language Models for Sequence Generation
Title: Learning Evaluation Models from Large Language Models for Sequence Generation | Learning Evaluation Models aus großen Sprachmodellen für die Sequenzgenerierung | 序列生成大语言模式学习评价模式 2308.04386v3 |
Authors (9): Chenglong Wang, Hang Zhou, Kaiyan Chang, Tongran Liu, Chunliang Zhang, Quan Du, Tong Xiao, Yue Zhang, Jingbo Zhu
Automatic evaluation of sequence generation, traditionally reliant on metrics like BLEU and ROUGE, often fails to capture the semantic accuracy of generated text sequences because these metrics emphasize n-gram overlap. A promising solution to this problem is to develop model-based metrics, such as BLEURT and COMET. However, these approaches are typically hindered by the scarcity of labeled evaluation data, which is necessary to train the evaluation models. In this work, we address this challenge by proposing the Customized Sequence Evaluation Metric (CSEM), a three-stage evaluation model training method that utilizes large language models to generate labeled data for model-based metric development, thereby eliminating the need for human-labeled data. Additionally, we expand the scope of CSEM to support various evaluation types, including single-aspect, multi-aspect, reference-free, and reference-based evaluations, enabling the customization of metrics to suit diverse real-world scenarios. Experimental results on the SummEval benchmark demonstrate that CSEM can effectively train an evaluation model without human-labeled data. Further experiments in reinforcement learning and reranking show that metrics developed through CSEM outperform traditional evaluation metrics, leading to substantial improvements in sequence quality as evaluated by both commonly used metrics and ChatGPT.
Article 634
Title@2025-06-26 (4): Progtuning: Progressive Fine-tuning Framework for Transformer-based Language Models
Title: Progtuning: Progressive Fine-tuning Framework for Transformer-based Language Models | Progtuning: Progressives Fine-Tuning-Framework für transformerbasierte Sprachmodelle | 改进:基于变换器的语文模式逐步微调框架 2506.21119v1 |
Authors (5): Xiaoshuang Ji, Zhendong Zhao, Xiaojun Chen, Xin Zhao, Zeyao Liu
Fine-tuning is a promising technique for leveraging Transformer-based language models in downstream tasks. As model sizes continue to grow, updating all model parameters becomes increasingly costly. Parameter-efficient fine-tuning methods effectively address this issue by selectively updating a small subset of parameters. However, fine-tuning and most existing parameter-efficient fine-tuning methods require updating the same number of parameters as the initial size, ignoring the unequal contribution across Transformer blocks and leading to extremely inefficient allocation of computing resources. In this paper, we propose Progtuning, a novel fine-tuning framework combined with progressive learning for Transformer-based language models. Specifically, Progtuning progressively reduces the number of updated Transformer blocks based on their contribution. Remarkably, Progtuning optimizes resource allocation and reduces the number of updated parameters by approximately 25%, while still maintaining competitive performance. It also exhibits high adaptability with parameter-efficient fine-tuning methods, demonstrating excellent performance across various adaptation scenarios.
Article 635
Title@2025-06-26 (4): Learning to Skip the Middle Layers of Transformers
Title: Learning to Skip the Middle Layers of Transformers | Lernen, die mittleren Schichten der Transformer zu überspringen | 学习跳过变换器的中层 2506.21103v1 |
Authors (2): Tim Lawson, Laurence Aitchison
Conditional computation is a popular strategy to make Transformers more efficient. Existing methods often target individual modules (e.g., mixture-of-experts layers) or skip layers independently of one another. However, interpretability research has demonstrated that the middle layers of Transformers exhibit greater redundancy, and that early layers aggregate information into token positions. Guided by these insights, we propose a novel architecture that dynamically skips a variable number of layers from the middle outward. In particular, a learned gating mechanism determines whether to bypass a symmetric span of central blocks based on the input, and a gated attention mechanism prevents subsequent tokens from attending to skipped token positions. Residual norms are controlled with a ‘sandwich’ or ‘perilayernorm’ scheme and gate sparsity with an adaptive regularization loss. We had aimed to reduce compute requirements for ‘simpler’ tokens and potentially foster an emergent multi-level representational hierarchy but, at the scales investigated, our approach does not achieve improvements in the trade-off between validation cross-entropy and estimated FLOPs compared to dense baselines with fewer layers. We release our code at https://github.com/tim-lawson/skip-middle.
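To make the idea of skipping "from the middle outward" concrete, the sketch below computes which block indices fall inside a symmetric span around the center of the stack and runs a forward pass that bypasses them. It is a simplified illustration only: the skip half-width is passed in directly rather than produced by a learned gate, and the function names and the even-depth requirement are assumptions, not the authors' implementation.

```python
def skipped_layers(num_layers, half_width):
    """Indices of a symmetric span of central blocks to bypass.

    half_width is the number of blocks skipped on each side of the center
    (assumed to come from a gate in the real architecture).
    """
    if num_layers % 2 != 0:
        raise ValueError("this sketch assumes an even number of layers")
    if not 0 <= half_width <= num_layers // 2:
        raise ValueError("half_width must be between 0 and num_layers // 2")
    mid = num_layers // 2
    return list(range(mid - half_width, mid + half_width))


def forward(x, layers, half_width):
    """Apply every block except those in the skipped central span."""
    skip = set(skipped_layers(len(layers), half_width))
    for index, layer in enumerate(layers):
        if index not in skip:
            x = layer(x)
    return x


# Toy "blocks" that record their index so the executed path is visible.
layers = [lambda x, i=i: x + [i] for i in range(6)]
executed = forward([], layers, half_width=1)
```

With six toy blocks and a half-width of one, the two central blocks (indices 2 and 3) are bypassed, so a wider half-width corresponds to spending less compute on that input.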
Article 636
Title@2025-06-26 (4): HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics
Title: HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics | HERMES: zeitlich-zusammenhängendes lang-für-M Verständnis mit Episoden und Semantik | HERMES: 与分数和语义学的理解 2408.17443v4 |
Authors (6): Gueter Josmy Faure, Jia-Fong Yeh, Min-Hung Chen, Hung-Ting Su, Shang-Hong Lai, Winston H. Hsu
Long-form video understanding presents unique challenges that extend beyond traditional short-video analysis approaches, particularly in capturing long-range dependencies, processing redundant information efficiently, and extracting high-level semantic concepts. To address these challenges, we propose a novel approach that more accurately reflects human cognition. This paper introduces HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics, featuring two versatile modules that can enhance existing video-language models or operate as a standalone system. Our Episodic COmpressor (ECO) efficiently aggregates representations from micro to semi-macro levels, reducing computational overhead while preserving temporal dependencies. Our Semantics ReTRiever (SeTR) enriches these representations with semantic information by focusing on broader context, dramatically reducing feature dimensionality while preserving relevant macro-level information. We demonstrate that these modules can be seamlessly integrated into existing SOTA models, consistently improving their performance while reducing inference latency by up to 43% and memory usage by 46%. As a standalone system, HERMES achieves state-of-the-art performance across multiple long-video understanding benchmarks in both zero-shot and fully-supervised settings.
Article 637
Title@2025-06-26 (4): Enhancing LLM Tool Use with High-quality Instruction Data from Knowledge Graph
Title: Enhancing LLM Tool Use with High-quality Instruction Data from Knowledge Graph | Verbesserung der LLM-Tool-Nutzung mit hochwertigen Instruktionsdaten aus Wissensgrafik | 利用来自知识图的高质量教学数据加强LLM工具的使用 2506.21071v1 |
Authors (10): Jingwei Wang, Zai Zhang, Hao Qian, Chunjing Gan, Binbin Hu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou, Bin Shi, Bo Dong
Teaching large language models (LLMs) to use tools is crucial for improving their problem-solving abilities and expanding their applications. However, effectively using tools is challenging because it requires a deep understanding of tool functionalities and user intentions. Previous methods relied mainly on LLMs to generate instruction data, but the quality of these data was often insufficient. In this paper, we propose a new method that uses knowledge graphs to generate high-quality instruction data for LLMs. Knowledge graphs are manually curated datasets rich in semantic information. We begin by extracting various query pathways from a given knowledge graph, which are transformed into a broad spectrum of user queries. We then translate the relationships between entities into actionable tools and parse the pathways of each query into detailed solution steps, thereby creating high-quality instruction data. Our experiments show that fine-tuning on just a small sample of this synthetic data can significantly improve the tool utilization and overall capabilities of LLMs.
Article 638
Title@2025-06-26 (4): MT2-CSD: A New Dataset and Multi-Semantic Knowledge Fusion Method for Conversational Stance Detection
Title: MT2-CSD: A New Dataset and Multi-Semantic Knowledge Fusion Method for Conversational Stance Detection | MT2-CSD: Eine neue Datensatz- und Multi-Semantic Knowledge Fusion Methode zur konversatorischen Stance-Erkennung | MT2-CSD: 用于语音稳定探测的新数据集和多语层知识融合方法 2506.21053v1 |
Authors (7): Fuqiang Niu, Genan Dai, Yisha Lu, Jiayu Liao, Xiang Li, Hu Huang, Bowen Zhang
In the realm of contemporary social media, automatic stance detection is pivotal for opinion mining, as it synthesizes and examines user perspectives on contentious topics to uncover prevailing trends and sentiments. Traditional stance detection research often targets individual instances, thereby limiting its capacity to model multi-party discussions typical in real social media scenarios. This shortcoming largely stems from the scarcity of datasets that authentically capture the dynamics of social media interactions, hindering advancements in conversational stance detection. In this paper, we introduce MT2-CSD, a comprehensive dataset for multi-target, multi-turn conversational stance detection. To the best of our knowledge, MT2-CSD is the largest dataset available for this purpose, comprising 24,457 annotated instances and exhibiting the greatest conversational depth, thereby presenting new challenges for stance detection. To address these challenges, we propose the Large Language model enhanced Conversational Relational Attention Network (LLM-CRAN), which exploits the reasoning capabilities of LLMs to improve conversational understanding. We conduct extensive experiments to evaluate the efficacy of LLM-CRAN on the MT2-CSD dataset. The experimental results indicate that LLM-CRAN significantly outperforms strong baseline models in the task of conversational stance detection.
Article 639
Title@2025-06-26 (4): Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs
Title: Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs | Suche und Verfeinerung während des Denkens: Autonome Retrieval-Augmented Reasoning von LLMs | 思考期间的搜索和记忆:自主检索-强化理据(LLM) 2505.11277v3 |
Authors (8): Yaorui Shi, Sihang Li, Chang Wu, Zhiyuan Liu, Junfeng Fang, Hengxing Cai, An Zhang, Xiang Wang
Large language models have demonstrated impressive reasoning capabilities but are inherently limited by their knowledge reservoir. Retrieval-augmented reasoning mitigates this limitation by allowing LLMs to query external resources, but existing methods often retrieve irrelevant or noisy information, hindering accurate reasoning. In this paper, we propose AutoRefine, a reinforcement learning post-training framework that adopts a new "search-and-refine-during-think" paradigm. AutoRefine introduces explicit knowledge refinement steps between successive search calls, enabling the model to iteratively filter, distill, and organize evidence before generating an answer. Furthermore, we incorporate tailored retrieval-specific rewards alongside answer correctness rewards using group relative policy optimization. Experiments on single-hop and multi-hop QA benchmarks demonstrate that AutoRefine significantly outperforms existing approaches, particularly in complex, multi-hop reasoning scenarios. Detailed analysis shows that AutoRefine issues frequent, higher-quality searches and synthesizes evidence effectively.
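Group relative policy optimization scores each sampled response against the other responses drawn for the same prompt. The sketch below shows that normalization step, together with one possible way of adding a retrieval reward to an answer-correctness reward. The additive combination, the `retrieval_weight` parameter, and the use of the population standard deviation are assumptions for illustration; the abstract does not give the exact reward formula.

```python
from statistics import mean, pstdev


def total_reward(answer_correct, retrieval_score, retrieval_weight=0.5):
    """Combine rewards; the weighting scheme is a hypothetical choice."""
    return float(answer_correct) + retrieval_weight * retrieval_score


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one question.
rewards = [
    total_reward(True, 0.0),
    total_reward(False, 0.0),
    total_reward(True, 0.0),
    total_reward(False, 0.0),
]
advantages = group_relative_advantages(rewards)
```

Responses that beat their group's average receive a positive advantage and the rest a negative one; when every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.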
Article 640
Title@2025-06-26 (4): A Semi-supervised Scalable Unified Framework for E-commerce Query Classification
Title: A Semi-supervised Scalable Unified Framework for E-commerce Query Classification | Ein halbüberwachtes skalierbares Unified Framework für die E-Commerce Query Classification | 半监督的电子商务查询分类可扩展统一框架 2506.21049v1 |
Authors (8): Chunyuan Yuan, Chong Zhang, Zheng Fang, Ming Pang, Xue Jiang, Changping Peng, Zhangang Lin, Ching Law
Query classification, including multiple subtasks such as intent and category prediction, is vital to e-commerce applications. E-commerce queries are usually short and lack context, and the information between labels cannot be used, resulting in insufficient prior information for modeling. Most existing industrial query classification methods rely on users’ posterior click behavior to construct training samples, resulting in a Matthew vicious cycle. Furthermore, the subtasks of query classification lack a unified framework, leading to low efficiency for algorithm optimization. In this paper, we propose a novel Semi-supervised Scalable Unified Framework (SSUF), containing multiple enhanced modules to unify the query classification tasks. The knowledge-enhanced module uses world knowledge to enhance query representations and solve the problem of insufficient query information. The label-enhanced module uses label semantics and semi-supervised signals to reduce the dependence on posterior labels. The structure-enhanced module enhances the label representation based on the complex label relations. Each module is highly pluggable, and input features can be added or removed as needed according to each subtask. We conduct extensive offline and online A/B experiments, and the results show that SSUF significantly outperforms the state-of-the-art models.
Article 641
Title@2025-06-26 (4): MockLLM: A Multi-Agent Behavior Collaboration Framework for Online Job Seeking and Recruiting
Title: MockLLM: A Multi-Agent Behavior Collaboration Framework for Online Job Seeking and Recruiting | MockLLM: Ein Multi-Agent-Behavior-Kooperationsrahmen für Online-Jobsuche und Recruiting | MockLLLM:网上求职和招聘多代理行为协作框架 2405.18113v2 |
Authors (6): Hongda Sun, Hongzhan Lin, Haiyu Yan, Yang Song, Xin Gao, Rui Yan
Online recruitment platforms have reshaped job-seeking and recruiting processes, driving increased demand for applications that enhance person-job matching. Traditional methods generally rely on analyzing textual data from resumes and job descriptions, limiting the dynamic, interactive aspects crucial to effective recruitment. Recent advances in Large Language Models (LLMs) have revealed remarkable potential in simulating adaptive, role-based dialogues, making them well-suited for recruitment scenarios. In this paper, we propose MockLLM, a novel framework to generate and evaluate mock interview interactions. The system consists of two key components: mock interview generation and two-sided evaluation in a handshake protocol. By simulating both interviewer and candidate roles, MockLLM enables consistent and collaborative interactions for real-time and two-sided matching. To further improve the matching quality, MockLLM incorporates reflection memory generation and dynamic strategy modification, refining behaviors based on previous experience. We evaluate MockLLM on real-world data from Boss Zhipin, a major Chinese recruitment platform. The experimental results indicate that MockLLM outperforms existing methods in matching accuracy, scalability, and adaptability across job domains, highlighting its potential to advance candidate assessment and online recruitment.
Article 642
Title@2025-06-26 (4): SceneGenAgent: Precise Industrial Scene Generation with Coding Agent
Title: SceneGenAgent: Precise Industrial Scene Generation with Coding Agent | SceneGenAgent: Präzise industrielle Szenegenerierung mit Coding Agent | SceneGenerAgenti: 精密工业场景与编码剂生成 2410.21909v3 |
Authors (8): Xiao Xia, Dan Zhang, Zibo Liao, Zhenyu Hou, Tianrui Sun, Jing Li, Ling Fu, Yuxiao Dong
The modeling of industrial scenes is essential for simulations in industrial manufacturing. While large language models (LLMs) have shown significant progress in generating general 3D scenes from textual descriptions, generating industrial scenes with LLMs poses a unique challenge due to their demand for precise measurements and positioning, requiring complex planning over spatial arrangement. To address this challenge, we introduce SceneGenAgent, an LLM-based agent for generating industrial scenes through C# code. SceneGenAgent ensures precise layout planning through a structured and calculable format, layout verification, and iterative refinement to meet the quantitative requirements of industrial scenarios. Experiment results demonstrate that LLMs powered by SceneGenAgent exceed their original performance, reaching up to 81.0% success rate in real-world industrial scene generation tasks and effectively meeting most scene generation requirements. To further enhance accessibility, we construct SceneInstruct, a dataset designed for fine-tuning open-source LLMs to integrate into SceneGenAgent. Experiments show that fine-tuning open-source LLMs on SceneInstruct yields significant performance improvements, with Llama3.1-70B approaching the capabilities of GPT-4o. Our code and data are available at https://github.com/THUDM/SceneGenAgent .
Article 643
Title@2025-06-26 (4): Large Language Models Acing Chartered Accountancy
Title: Large Language Models Acing Chartered Accountancy | Große Sprachmodelle Aking Chartered Accountancy | 特许会计会计 2506.21031v1 |
Authors (7): Jatin Gupta, Akhil Sharma, Saransh Singhania, Mohammad Adnan, Sakshi Deo, Ali Imam Abidi, Keshav Gupta
Advanced intelligent systems, particularly Large Language Models (LLMs), are significantly reshaping financial practices through advancements in Natural Language Processing (NLP). However, the extent to which these models effectively capture and apply domain-specific financial knowledge remains uncertain. Addressing a critical gap in the expansive Indian financial context, this paper introduces CA-Ben, a Chartered Accountancy benchmark specifically designed to evaluate the financial, legal, and quantitative reasoning capabilities of LLMs. CA-Ben comprises structured question-answer datasets derived from the rigorous examinations conducted by the Institute of Chartered Accountants of India (ICAI), spanning foundational, intermediate, and advanced CA curriculum stages. Six prominent LLMs, namely GPT-4o, LLAMA 3.3 70B, LLAMA 3.1 405B, MISTRAL Large, Claude 3.5 Sonnet, and Microsoft Phi 4, were evaluated using standardized protocols. Results indicate variations in performance, with Claude 3.5 Sonnet and GPT-4o outperforming others, especially in conceptual and legal reasoning. Notable challenges emerged in numerical computations and legal interpretations. The findings emphasize the strengths and limitations of current LLMs, suggesting future improvements through hybrid reasoning and retrieval-augmented generation methods, particularly for quantitative analysis and accurate legal interpretation.
Article 644
Title@2025-06-26 (4): SAC: A Framework for Measuring and Inducing Personality Traits in LLMs with Dynamic Intensity Control
Title: SAC: A Framework for Measuring and Inducing Personality Traits in LLMs with Dynamic Intensity Control | SAC: Ein Rahmen für die Messung und Induktion von Persönlichkeitseigenschaften in LLMs mit dynamischer Intensitätskontrolle | SAC: 具有动态强度控制的LMLM中测量和诱导个性轨迹的框架 2506.20993v1 |
Authors (5): Adithya Chittem, Aishna Shrivastava, Sai Tarun Pendela, Jagat Sesh Challa, Dhruv Kumar
Large language models (LLMs) have gained significant traction across a wide range of fields in recent years. There is also a growing expectation for them to display human-like personalities during interactions. To meet this expectation, numerous studies have proposed methods for modelling LLM personalities through psychometric evaluations. However, most existing models face two major limitations: they rely on the Big Five (OCEAN) framework, which only provides coarse personality dimensions, and they lack mechanisms for controlling trait intensity. In this paper, we address this gap by extending the Machine Personality Inventory (MPI), which originally used the Big Five model, to incorporate the 16 Personality Factor (16PF) model, allowing expressive control over sixteen distinct traits. We also developed a structured framework known as Specific Attribute Control (SAC) for evaluating and dynamically inducing trait intensity in LLMs. Our method introduces adjective-based semantic anchoring to guide trait intensity expression and leverages behavioural questions across five intensity factors: Frequency, Depth, Threshold, Effort, and Willingness. Through experimentation, we find that modelling intensity as a continuous spectrum yields substantially more consistent and controllable personality expression compared to binary trait toggling. Moreover, we observe that changes in target trait intensity systematically influence closely related traits in psychologically coherent directions, suggesting that LLMs internalize multi-dimensional personality structures rather than treating traits in isolation. Our work opens new pathways for controlled and nuanced human-machine interactions in domains such as healthcare, education, and interviewing processes, bringing us one step closer to truly human-like social machines.
Article 645
Title@2025-06-26 (4): SharpZO: Hybrid Sharpness-Aware Vision Language Model Prompt Tuning via Forward-Only Passes
Title: SharpZO: Hybrid Sharpness-Aware Vision Language Model Prompt Tuning via Forward-Only Passes | SharpZO: Hybrid Sharpness-Aware Vision Sprachmodell Prompt Tuning via Forward-Only Passes | SharpZO: 混合尖锐-敏锐视觉语言模型,通过前向-单行道快速调试 2506.20990v1 |
Authors (6): Yifan Yang, Zhen Zhang, Rupak Vignesh Swaminathan, Jing Liu, Nathan Susanj, Zheng Zhang
Fine-tuning vision language models (VLMs) has achieved remarkable performance across various downstream tasks; yet, it requires access to model gradients through backpropagation (BP), making it unsuitable for memory-constrained, inference-only edge devices. To address this limitation, previous work has explored various BP-free fine-tuning methods. However, these approaches often rely on high-variance evolutionary strategies (ES) or zeroth-order (ZO) optimization, and often fail to achieve satisfactory performance. In this paper, we propose a hybrid Sharpness-aware Zeroth-order optimization (SharpZO) approach, specifically designed to enhance the performance of ZO VLM fine-tuning via a sharpness-aware warm-up training. SharpZO features a two-stage optimization process: a sharpness-aware ES stage that globally explores and smooths the loss landscape to construct a strong initialization, followed by a fine-grained local search via sparse ZO optimization. The entire optimization relies solely on forward passes. Detailed theoretical analysis and extensive experiments on CLIP models demonstrate that SharpZO significantly improves accuracy and convergence speed, achieving up to 7% average gain over state-of-the-art forward-only methods.
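To make "relies solely on forward passes" concrete, here is a minimal sketch of a generic two-point zeroth-order gradient estimator. It is not the SharpZO algorithm (there is no sharpness-aware ES stage and no sparsity); the function names, the smoothing radius `mu`, the sample count, and the toy quadratic loss are all assumptions chosen for illustration.

```python
import random


def zo_gradient(loss, params, rng, mu=1e-3, num_samples=8):
    """Estimate the gradient of `loss` at `params` from loss evaluations only.

    Each sample perturbs the parameters along a random Gaussian direction u
    and uses the central difference (loss(x + mu*u) - loss(x - mu*u)) / (2*mu)
    as the directional derivative along u. No backpropagation is needed.
    """
    dim = len(params)
    grad = [0.0] * dim
    for _ in range(num_samples):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        plus = [p + mu * ui for p, ui in zip(params, u)]
        minus = [p - mu * ui for p, ui in zip(params, u)]
        scale = (loss(plus) - loss(minus)) / (2.0 * mu)
        for i in range(dim):
            grad[i] += scale * u[i] / num_samples
    return grad


def zo_minimize(loss, params, lr=0.05, steps=200, seed=0):
    """Plain gradient descent driven by the zeroth-order estimate above."""
    rng = random.Random(seed)
    params = list(params)
    for _ in range(steps):
        grad = zo_gradient(loss, params, rng)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params


def quadratic(x):
    """Toy stand-in for a model loss; its minimum is at x = (1, 1, 1)."""
    return sum((xi - 1.0) ** 2 for xi in x)


if __name__ == "__main__":
    solution = zo_minimize(quadratic, [0.0, 0.0, 0.0])
    print("final loss:", quadratic(solution))
```

Each update costs `2 * num_samples` loss evaluations, which is why the variance of the estimate and the quality of the initialization matter so much for forward-only methods.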
Article 646
Title@2025-06-26 (4): SACL: Understanding and Combating Textual Bias in Code Retrieval with Semantic-Augmented Reranking and Localization
Title: SACL: Understanding and Combating Textual Bias in Code Retrieval with Semantic-Augmented Reranking and Localization | SACL: Verständnis und Bekämpfung von Textbias im Code Retrieval mit semantisch-angereicherter Reranking und Lokalisierung | SACL: 理解和打击《规则》中与语义-增强的重新排级和本地化相结合的 “ 检索法 “ 中的 “ 理解和打击 “ 理论上的 “ 种族 “ 行为 2506.20081v2 |
Authors (3): Dhruv Gupta, Gayathri Ganesh Lakshmy, Yiqing Xie
Retrieval-Augmented Code Generation (RACG) is a critical technique for enhancing code generation by retrieving relevant information. In this work, we conduct an in-depth analysis of code retrieval by systematically masking specific features while preserving code functionality. Our discoveries include: (1) although trained on code, current retrievers heavily rely on surface-level textual features (e.g., docstrings, identifier names), and (2) they exhibit a strong bias towards well-documented code, even if the documentation is irrelevant. Based on our discoveries, we propose SACL, a framework that enriches textual information and reduces bias by augmenting code or structural knowledge with semantic information. Extensive experiments show that SACL substantially improves code retrieval (e.g., by 12.8% / 9.4% / 7.0% Recall@1 on HumanEval / MBPP / SWE-Bench-Lite), which also leads to better code generation performance (e.g., by 4.88% Pass@1 on HumanEval).
Article 647
Title@2025-06-26 (4): Can Gradient Descent Simulate Prompting?
Title: Can Gradient Descent Simulate Prompting? | Kann Gradient Descent Simulate Prompting? | 梯子源模拟能刺激吗? 2506.20989v1 |
Authors (3): Eric Zhang, Leshem Choshen, Jacob Andreas
There are two primary ways of incorporating new information into a language model (LM): changing its prompt or changing its parameters, e.g. via fine-tuning. Parameter updates incur no long-term storage cost for model changes. However, for many model updates, prompting is significantly more effective: prompted models can generalize robustly from single examples and draw logical inferences that do not occur under standard fine-tuning. Can models be modified so that fine-tuning does emulate prompting? This paper describes a method for meta-training LMs such that gradient updates emulate the effects of conditioning on new information. Our approach uses tools from gradient-based meta-learning but uses an LM’s own prompted predictions as targets, eliminating the need for ground-truth labels. Subsequent gradient descent training recovers some (and occasionally all) of prompted model performance – showing improvement on the "reversal curse" tasks, and answering questions about text passages after a single gradient update. These results suggest that, with appropriate initialization, gradient descent can be surprisingly expressive. Our results suggest new avenues for long-context modeling and offer insight into the generalization capabilities of gradient-based learning.
Article 648
Title@2025-06-26 (4): Comparing Retrieval-Augmentation and Parameter-Efficient Fine-Tuning for Privacy-Preserving Personalization of Large Language Models
Title: Comparing Retrieval-Augmentation and Parameter-Efficient Fine-Tuning for Privacy-Preserving Personalization of Large Language Models | Vergleich von Retrieval-Augmentation und Parameter-Effizient Fine-Tuning für Datenschutz-Erhaltung Personalisierung von großen Sprachmodellen | 比较大语言模型的检索增强和参数有效微量微量美化,促进保护隐私和保持个人特征化 2409.09510v2 |
Authors (2): Alireza Salemi, Hamed Zamani
Despite their substantial impact on various search, recommendation, and question answering tasks, privacy-preserving methods for personalizing large language models (LLMs) have received relatively limited exploration. The primary approach in this area is retrieval-augmented generation (RAG), which generates personalized outputs by enriching the input prompt with information retrieved from the user’s personal data. This paper studies an orthogonal approach to RAG that involves learning user-dependent LLM parameters through parameter-efficient fine-tuning (PEFT). This paper presents the first systematic study of PEFT for LLM personalization and provides extensive comparisons between RAG- and PEFT-based solutions, across a broad set of seven diverse datasets from the LaMP benchmark. Our results demonstrate that, on average, RAG- and PEFT-based personalization methods yield 14.92% and 1.07% improvements over non-personalized LLMs, respectively. When combining RAG with PEFT, we observe a further improvement of 15.98%, highlighting the effectiveness of their integration in enhancing personalized text generation. Additionally, we identify a positive correlation between the amount of user data available and the effectiveness of PEFT. This finding suggests that RAG is particularly beneficial for cold-start users – users with limited personal data – while PEFT performs better when more user-specific data is available.
Article 649
Title@2025-06-26 (4): Towards Text-free Graph Foundation Models: Rethinking Multi-Domain Graph Contrastive Learning
Title: Towards Text-free Graph Foundation Models: Rethinking Multi-Domain Graph Contrastive Learning | Auf dem Weg zu textfreien Graph Foundation-Modellen: Multi-Domain-Graph Kontrastives Lernen neu denken | 走向无文本图表基础模型:重新思考多领域图表对比学习 2506.22510v1 |
Authors (4): Zihao Zhao, Xinlong Zhai, Jinyu Yang, Chuan Shi
Foundation models have achieved great success in natural language processing (NLP) and computer vision (CV). Their success largely stems from the ability to integrate multi-domain knowledge in pre-training and transfer it to target domains. Considering that graph data, especially graphs without textual features, are ubiquitous in real-world applications such as social networks and recommendation systems, some researchers have attempted to extend this paradigm to the graph field, aiming to construct graph foundation models. However, unlike in CV and NLP, there are huge gaps among the semantics and properties of graphs in different domains, while current works still adopt traditional contrastive pre-training strategies designed for the single-domain scenario, which regard contrastive samples from different domains as equivalent. From experimental investigations, we discovered that inherent domain-specific differences prevent these strategies from effectively absorbing knowledge from different domains to generate informative representations. In this paper, we propose a novel multi-domain pre-training and cross-domain transfer framework, namely MDGCL. In the pre-training stage, we design a contrastive learning strategy to substantially recognize and capture domain differences, and introduce domain tokens to encode domain-level global information. In the downstream stage, we introduce a domain attention mechanism to enable fine-grained domain knowledge transfer. Extensive experiments on five benchmark datasets have demonstrated that our method significantly outperforms the state of the art, with a maximum improvement of 19.33% in accuracy and 19.13% in Macro-F1 score.
Article 650
Title@2025-06-26 (4): Reward-Guided Speculative Decoding for Efficient LLM Reasoning
Title: Reward-Guided Speculative Decoding for Efficient LLM Reasoning | Belohnungsgeführte spekulative Dekodierung für effiziente LLM-Reasoning | 高效 LLM 理由说明的受奖励指导的投机性说明 2501.19324v3 |
Authors (8): Baohao Liao, Yuhui Xu, Hanze Dong, Junnan Li, Christof Monz, Silvio Savarese, Doyen Sahoo, Caiming Xiong
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs). RSD synergistically combines a lightweight draft model with a more powerful target model, incorporating a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness. RSD employs a process reward model to evaluate intermediate decoding steps and dynamically decide whether to invoke the target model, optimizing the trade-off between computational cost and output quality. We theoretically demonstrate that a threshold-based mixture strategy achieves an optimal balance between resource utilization and performance. Extensive evaluations on challenging reasoning benchmarks, including Olympiad-level tasks, show that RSD delivers significant efficiency gains against decoding with the target model only (up to 4.4x fewer FLOPs), while achieving significantly better accuracy than parallel decoding on average (up to +3.5). These results highlight RSD as a robust and cost-effective approach for deploying LLMs in resource-intensive scenarios. The code is available at https://github.com/BaohaoLiao/RSD.
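The threshold-based decision described in the abstract can be sketched as a small control loop. This is only an illustration of the idea, not the code from the linked repository: the function name, the callables `draft_step`, `target_step`, and `reward`, the threshold value, and the toy stand-ins below are all hypothetical.

```python
def reward_guided_decode(prompt, draft_step, target_step, reward,
                         threshold=0.7, max_steps=4):
    """Generate reasoning steps, calling the expensive model only when needed.

    At every step the cheap draft model proposes a candidate. A process
    reward model scores it; if the score falls below `threshold`, the
    candidate is discarded and the target model produces the step instead.
    Returns the accepted steps and the number of target-model calls.
    """
    steps = []
    target_calls = 0
    for _ in range(max_steps):
        context = [prompt] + steps
        candidate = draft_step(context)
        if reward(context, candidate) < threshold:
            candidate = target_step(context)
            target_calls += 1
        steps.append(candidate)
    return steps, target_calls


# Hypothetical toy stand-ins for the three models.
def toy_draft(context):
    # Produces a good step on odd context lengths and a weak one otherwise.
    return "draft-good" if len(context) % 2 == 1 else "draft-weak"


def toy_target(context):
    return "target-step"


def toy_reward(context, step):
    return 0.2 if step == "draft-weak" else 0.9


if __name__ == "__main__":
    steps, calls = reward_guided_decode("Q:", toy_draft, toy_target, toy_reward)
    print(steps, calls)
```

Raising `threshold` sends more steps to the target model (higher cost, closer to target-only decoding); lowering it keeps more draft steps, which is the cost-quality trade-off the abstract refers to.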
Article 651
Title@2025-06-26 (4): Learning to Rank for Multiple Retrieval-Augmented Models through Iterative Utility Maximization
Title: Learning to Rank for Multiple Retrieval-Augmented Models through Iterative Utility Maximization | Ranken lernen für mehrere Retrieval-Augmented Modelle durch iterative Utility Maximierung | 通过迭代功用最大化学习多重检索增强型号排名 2410.09942v2 |
Authors (2): Alireza Salemi, Hamed Zamani
This paper investigates the design of a unified search engine to serve multiple retrieval-augmented generation (RAG) agents, each with a distinct task, backbone large language model (LLM), and RAG strategy. We introduce an iterative approach where the search engine generates retrieval results for the RAG agents and gathers feedback on the quality of the retrieved documents during an offline phase. This feedback is then used to iteratively optimize the search engine using an expectation-maximization algorithm, with the goal of maximizing each agent’s utility function. Additionally, we adapt this to an online setting, allowing the search engine to refine its behavior based on real-time feedback from individual agents to better serve results to each of them. Experiments on datasets from the Knowledge-Intensive Language Tasks (KILT) benchmark demonstrate that our approach significantly outperforms baselines on average across 18 RAG models. We demonstrate that our method effectively "personalizes" the retrieval for each RAG agent based on the collected feedback. Finally, we provide a comprehensive ablation study to explore various aspects of our method.
Article 652
Title@2025-06-26 (4): AgentStealth: Reinforcing Large Language Model for Anonymizing User-generated Text
Title: AgentStealth: Reinforcing Large Language Model for Anonymizing User-generated Text | AgentStealth: Verstärkung des Large Language Models zur Anonymisierung von benutzergeneriertem Text | AgentStealth:加强用户生成文本匿名大语言模式 2506.22508v1 |
Authors (5): Chenyang Shao, Tianxing Li, Chenhao Pu, Fengli Xu, Yong Li
In today’s digital world, casual user-generated content often contains subtle cues that may inadvertently expose sensitive personal attributes. Such risks underscore the growing importance of effective text anonymization to safeguard individual privacy. However, existing methods either rely on rigid replacements that damage utility or cloud-based LLMs that are costly and pose privacy risks. To address these issues, we explore the use of locally deployed smaller-scale language models (SLMs) for anonymization. Yet training effective SLMs remains challenging due to limited high-quality supervision. To address the challenge, we propose AgentStealth, a self-reinforcing LLM anonymization framework. First, we introduce an adversarial anonymization workflow enhanced by In-context Contrastive Learning and Adaptive Utility-Aware Control. Second, we perform supervised adaptation of SLMs using high-quality data collected from the workflow, which includes both anonymization and attack signals. Finally, we apply online reinforcement learning where the model leverages its internal adversarial feedback to iteratively improve anonymization performance. Experiments on two datasets show that our method outperforms baselines in both anonymization effectiveness (+12.3%) and utility (+6.8%). Our lightweight design supports direct deployment on edge devices, avoiding cloud reliance and communication-based privacy risks. Our code is open-source at https://github.com/tsinghua-fib-lab/AgentStealth.
Article 653
Title@2025-06-26 (4): Beyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon Simulation
Title: Beyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon Simulation | Jenseits der reaktiven Sicherheit: Risiko-Bewusst LLM-Ausrichtung über Long-Horizon Simulation | 超越反应安全性:通过长休松模拟使风险-警用LLM对齐 2506.20949v1 |
Authors (4): Chenkai Sun, Denghui Zhang, ChengXiang Zhai, Heng Ji
Given the growing influence of language model-based agents on high-stakes societal decisions, from public policy to healthcare, ensuring their beneficial impact requires understanding the far-reaching implications of their suggestions. We propose a proof-of-concept framework that projects how model-generated advice could propagate through societal systems on a macroscopic scale over time, enabling more robust alignment. To assess the long-term safety awareness of language models, we also introduce a dataset of 100 indirect harm scenarios, testing models’ ability to foresee adverse, non-obvious outcomes from seemingly harmless user prompts. Our approach achieves not only over 20% improvement on the new dataset but also an average win rate exceeding 70% against strong baselines on existing safety benchmarks (AdvBench, SafeRLHF, WildGuardMix), suggesting a promising direction for safer agents.
Article 654
Title@2025-06-26 (4): Evaluating Large Language Models for Automated Clinical Abstraction in Pulmonary Embolism Registries: Performance Across Model Sizes, Versions, and Parameters
Title: Evaluating Large Language Models for Automated Clinical Abstraction in Pulmonary Embolism Registries: Performance Across Model Sizes, Versions, and Parameters | Bewertung großer Sprachmodelle für automatisierte klinische Abstraktion in pulmonalen Embolism Registries: Performance Across Modellgrößen, Versionen und Parameter | 评价肺部新陈代谢登记簿自动临床抽象化的大型语言模型:不同模型大小、版本和参数的性能 2503.21004v2 |
Authors (9): Mahmoud Alwakeel, Emory Buck, Jonathan G. Martin, Imran Aslam, Sudarshan Rajagopal, Jian Pei, Mihai V. Podgoreanu, Christopher J. Lindsell, An-Kwok Ian Wong
Pulmonary embolism (PE) registries accelerate practice-improving research but rely on labor-intensive manual abstraction of radiology reports. We examined whether openly available large language models (LLMs) can automate concept extraction from computed tomography PE (CTPE) reports without loss of data quality. Four Llama 3 variants (3.0 8B, 3.1 8B, 3.1 70B, 3.3 70B) and one reviewer model, Phi 4 14B, were tested on 250 dual-annotated CTPE reports from each of MIMIC IV and Duke University. Accuracy, positive predictive value (PPV) and negative predictive value (NPV) versus a human gold standard were measured across model size, temperature and shot count. Mean accuracy rose with scale: 0.83 (3.0 8B), 0.91 (3.1 8B) and 0.96 for both 70B variants; Phi 4 14B reached 0.98. Accuracy differed by less than 0.03 between datasets, indicating external robustness. In dual-model concordance (L3 70B plus Phi 4 14B), PPV for PE presence was at least 0.95 and NPV at least 0.98, while location, thrombus burden, right heart strain and image quality artifacts each achieved PPV of at least 0.90 and NPV of at least 0.95. Fewer than four percent of individual concept annotations were discordant, and full agreement occurred in more than seventy-five percent of reports. Large language models therefore provide a scalable, accurate solution for PE registry abstraction, and a dual-model review workflow can safeguard data quality with minimal human oversight.
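For readers less familiar with the three reported metrics, the sketch below computes accuracy, PPV, and NPV for one binary concept (for example, PE present or absent) against a gold standard. The function name and the six example labels are invented for illustration and are not data from the study.

```python
def binary_metrics(predicted, gold):
    """Return accuracy, PPV and NPV for paired boolean labels.

    PPV = TP / (TP + FP): how often a predicted positive is truly positive.
    NPV = TN / (TN + FN): how often a predicted negative is truly negative.
    A metric is reported as NaN when its denominator is zero.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p and g)
    fp = sum(1 for p, g in pairs if p and not g)
    tn = sum(1 for p, g in pairs if not p and not g)
    fn = sum(1 for p, g in pairs if not p and g)
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else nan,
        "ppv": tp / (tp + fp) if (tp + fp) else nan,
        "npv": tn / (tn + fn) if (tn + fn) else nan,
    }


if __name__ == "__main__":
    # Hypothetical model output versus human annotation for six reports.
    model = [True, True, False, False, True, False]
    human = [True, False, False, False, True, True]
    print(binary_metrics(model, human))
```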
Article 655
Title@2025-06-26 (4): PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks
Title: PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks | PP-DocBee: Multimodales Dokumentenverständnis durch Tricks verbessern | PP-Docbee:通过一袋小把戏改进多式文件理解 2503.04065v3 |
Authors (7): Feng Ni, Kui Huang, Yao Lu, Wenyu Lv, Guanzhong Wang, Zeyu Chen, Yi Liu
With the rapid advancement of digitalization, various document images are being applied more extensively in production and daily life, and there is an increasingly urgent need for fast and accurate parsing of the content in document images. Therefore, this report presents PP-DocBee, a novel multimodal large language model designed for end-to-end document image understanding. First, we develop a data synthesis strategy tailored to document scenarios in which we build a diverse dataset to improve the model generalization. Then, we apply a few training techniques, including dynamic proportional sampling, data preprocessing, and OCR postprocessing strategies. Extensive evaluations demonstrate the superior performance of PP-DocBee, achieving state-of-the-art results on English document understanding benchmarks and even outperforming existing open source and commercial models in Chinese document understanding. The source code and pre-trained models are publicly available at https://github.com/PaddlePaddle/PaddleMIX.
Article 656
Title@2025-06-26 (4): KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model
Title: KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model | KaLM-Embedding-V2: Überlegene Trainingstechniken und Daten inspirieren ein vielseitiges Einbettungsmodell | KaLM-Embedding-V2:高级培训技术和数据预报 2506.20923v1 |
Authors (17): Xinping Zhao, Xinshuo Hu, Zifei Shan, Shouzheng Huang, Yao Zhou, Zetian Sun, Zhenyu Liu, Dongfang Li, Xinyuan Wei, Qian Chen, Youcheng Pan, Yang Xiang, Meishan Zhang, Haofen Wang, Jun Yu, Baotian Hu, Min Zhang
In this paper, we propose KaLM-Embedding-V2, a versatile and compact embedding model, which achieves impressive performance in general-purpose text embedding tasks by leveraging superior training techniques and data. Our key innovations include: (1) To better align the architecture with representation learning, we remove the causal attention mask and adopt a fully bidirectional transformer with simple yet effective mean-pooling to produce fixed-length embeddings; (2) We employ a multi-stage training pipeline: (i) pre-training on large-scale weakly supervised open-source corpora; (ii) fine-tuning on high-quality retrieval and non-retrieval datasets; and (iii) model-soup parameter averaging for robust generalization. Besides, we introduce a focal-style reweighting mechanism that concentrates learning on difficult samples and an online hard-negative mixing strategy to continuously enrich hard negatives without expensive offline mining; (3) We collect over 20 categories of data for pre-training and 100 categories of data for fine-tuning, to boost both the performance and generalization of the embedding model. Extensive evaluations on the Massive Text Embedding Benchmark (MTEB) Chinese and English show that our model significantly outperforms others of comparable size, and competes with 3x, 14x, 18x, and 26x larger embedding models, setting a new standard for a versatile and compact embedding model with less than 1B parameters.
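The mean-pooling step in item (1) can be illustrated in a few lines. This is a generic masked mean-pooling sketch over plain Python lists, not the KaLM-Embedding-V2 implementation; the function name and the toy token vectors are assumptions.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is truthy.

    Padding positions (mask 0) are skipped, so sequences of different
    lengths all map to one fixed-length embedding.
    """
    if len(token_vectors) != len(attention_mask):
        raise ValueError("one mask entry is required per token vector")
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [t / count for t in total]


if __name__ == "__main__":
    # Two real tokens followed by one padding token that must be ignored.
    tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
    mask = [1, 1, 0]
    print(mean_pool(tokens, mask))
```

With a bidirectional encoder every token sees the full sequence, so averaging all non-padding positions is a natural pooling choice; with a causal mask only the last token would have seen the whole input.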
Article 657
Title@2025-06-26 (4): FineWeb2: One Pipeline to Scale Them All – Adapting Pre-Training Data Processing to Every Language
Title: FineWeb2: One Pipeline to Scale Them All – Adapting Pre-Training Data Processing to Every Language | FineWeb2: Eine Pipeline, um sie alle zu skalieren – Anpassung der Vorschulungsdatenverarbeitung an jede Sprache | FineWeb2: 将全部标准缩放的一条管道 – – 将培训前数据处理适应于每种语言 2506.20920v1 |
Authors (10): Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan, Amir Hossein Kargaran, Colin Raffel, Martin Jaggi, Leandro Von Werra, Thomas Wolf
Pre-training state-of-the-art large language models (LLMs) requires vast amounts of clean and diverse text data. While the open development of large high-quality English pre-training datasets has seen substantial recent progress, training performant multilingual LLMs remains a challenge, in large part due to the inherent difficulty of tailoring filtering and deduplication pipelines to a large number of languages. In this work, we introduce a new pre-training dataset curation pipeline based on FineWeb that can be automatically adapted to support any language. We extensively ablate our pipeline design choices on a set of nine diverse languages, guided by a set of meaningful and informative evaluation tasks that were chosen through a novel selection process based on measurable criteria. Ultimately, we show that our pipeline can be used to create non-English corpora that produce more performant models than prior datasets. We additionally introduce a straightforward and principled approach to rebalance datasets that takes into consideration both duplication count and quality, providing an additional performance uplift. Finally, we scale our pipeline to over 1000 languages using almost 100 Common Crawl snapshots to produce FineWeb2, a new 20 terabyte (5 billion document) multilingual dataset which we release along with our pipeline, training, and evaluation codebases.
Article 658
Title@2025-06-26 (4): Optimising Language Models for Downstream Tasks: A Post-Training Perspective
Title: Optimising Language Models for Downstream Tasks: A Post-Training Perspective | Sprachmodelle für Downstream-Aufgaben optimieren: Eine Perspektive nach dem Training | 优化下游任务的语言模式:培训后展望 2506.20917v1 |
Authors (1): Zhengyan Shi
Language models (LMs) have demonstrated remarkable capabilities in NLP, yet adapting them efficiently and robustly to specific tasks remains challenging. As their scale and complexity grow, fine-tuning LMs on labelled data often underutilizes available unlabelled data, leads to overfitting on small task-specific sets, and imposes significant computational costs. These limitations hamper their application to the open-ended landscape of real-world language tasks. This thesis proposes a series of methods to better adapt LMs to downstream applications. First, we explore strategies for extracting task-relevant knowledge from unlabelled data, introducing a novel continued pre-training technique that outperforms state-of-the-art semi-supervised approaches. Next, we present a parameter-efficient fine-tuning method that substantially reduces memory and compute costs while maintaining competitive performance. We also introduce improved supervised fine-tuning methods that enable LMs to better follow instructions, especially when labelled data is scarce, enhancing their performance across a range of NLP tasks, including open-ended generation. Finally, we develop new evaluation methods and benchmarks, such as multi-hop spatial reasoning tasks, to assess LM capabilities and adaptation more comprehensively. Through extensive empirical studies across diverse NLP tasks, our results demonstrate that these approaches substantially improve LM robustness, efficiency, and generalization, making them more adaptable to a broad range of applications. These advances mark a significant step towards more robust and efficient LMs, bringing us closer to the goal of artificial general intelligence.