Instruction Tuning and Alignment (186 papers)
Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#safety alignment #safety preservation #low-rank adaptation #parameter-efficient fine-tuning
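🎯 Motivation: Large language models perform well across many tasks, but their safety alignment is fragile and easily degraded during fine-tuning, so a mechanism that keeps it stable is needed.
❓ Problem: Prevent safety behavior from degrading during fine-tuning, so that the fine-tuned model keeps refusing harmful requests.
🔍 Observations: Even when fine-tuning on benign data or with low-rank adaptation, pre-trained safety behaviors degrade markedly and harmful responses increase.
🛠️ Method: GuardSpace decomposes pre-trained weights into safety-relevant and safety-irrelevant components using covariance-preconditioned singular value decomposition, initializes the low-rank adapters from the safety-irrelevant components while freezing the safety-relevant ones, and uses a null-space projector to keep adapter updates from altering safe outputs on harmful prompts.
📊 Data and experiments: Evaluated with several pre-trained models, including Llama-2-7B-Chat, on multiple downstream tasks. On GSM8K, compared with AsFT, the average harmful score falls from 14.4% to 3.6% and accuracy rises from 26.0% to 28.0%.
⭐ Contributions: A framework that preserves safety alignment throughout fine-tuning while also improving downstream task performance over existing methods.
Abstract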
Large language models (LLMs) have achieved remarkable success in diverse tasks, yet their safety alignment remains fragile during adaptation.
Even when fine-tuning on benign data or with low-rank adaptation, pre-trained safety behaviors are easily degraded, leading to harmful responses in the fine-tuned models.
To address this challenge, we propose GuardSpace, a guardrail framework for preserving safety alignment throughout fine-tuning, composed of two key components: a safety-sensitive subspace and a harmful-resistant null space.
First, we explicitly decompose pre-trained weights into safety-relevant and safety-irrelevant components using covariance-preconditioned singular value decomposition, and initialize low-rank adapters from the safety-irrelevant ones, while freezing safety-relevant components to preserve their associated safety mechanism.
Second, we construct a null space projector that restricts adapter updates from altering safe outputs on harmful prompts, thereby maintaining the original refusal behavior.
Experiments with various pre-trained models on multiple downstream tasks demonstrate that GuardSpace achieves superior performance over existing methods.
Notably, for Llama-2-7B-Chat fine-tuned on GSM8K, GuardSpace outperforms the state-of-the-art method AsFT, reducing the average harmful score from 14.4% to 3.6% while improving the accuracy from 26.0% to 28.0%.
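The null-space idea in the abstract above can be illustrated with a small sketch. It is not GuardSpace's implementation: it uses plain Gram-Schmidt instead of the covariance-preconditioned SVD, and `harmful_acts`, `delta_w`, and the function names are hypothetical. It only shows why projecting an update onto the null space of the harmful-prompt activations leaves the layer's output on those activations unchanged.

```python
# Toy sketch (assumed names, not the paper's code): project a weight update so
# that it cannot change a linear layer's output on "harmful prompt" activations.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis

def project_update(delta_w, harmful_acts):
    """Remove from each row of delta_w its component inside span(harmful_acts).

    Afterwards delta_w @ h == 0 for every harmful activation h, so
    (W + delta_w) @ h == W @ h: the output on those inputs is preserved.
    """
    basis = orthonormal_basis(harmful_acts)
    projected = []
    for row in delta_w:
        r = list(row)
        for b in basis:
            coeff = dot(r, b)
            r = [ri - coeff * bi for ri, bi in zip(r, b)]
        projected.append(r)
    return projected

harmful_acts = [[1.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0, 0.0]]  # hypothetical activations
delta_w = [[0.5, -1.0, 0.25, 2.0], [1.0, 1.0, 1.0, 1.0]]     # hypothetical adapter update
safe_delta = project_update(delta_w, harmful_acts)
```

Components of the update that are orthogonal to the harmful activations (here, the fourth coordinate) pass through untouched, which is what leaves room for learning the downstream task.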
#Trajectory Data Preparation #Federated Learning #Large Language Model #Trajectory Preprocessing
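🎯 Motivation: Trajectory data is often noisy, incomplete, or inconsistent because of sensor errors and transmission failures. Reliable downstream analytics needs effective preprocessing, but existing methods do not work under privacy constraints or data silos and generalize poorly across tasks.
❓ Problem: Provide a unified federated learning framework, FedTDP, for efficient and generalizable trajectory data preparation when data is distributed and privacy-restricted.
🔍 Observations: Existing trajectory data preparation methods assume centralized data access and train a separate model for each task, which conflicts with privacy requirements and limits performance across tasks.
🛠️ Method: Three components: (i) a lightweight Trajectory Privacy AutoEncoder (TPA) with privacy guarantees; (ii) a Trajectory Knowledge Enhancer (TKE) that adapts LLMs to spatio-temporal patterns through prompt design and knowledge distillation; (iii) Federated Parallel Optimization (FPO), which lowers communication cost and speeds up training.
📊 Data and experiments: On 6 real-world datasets and 10 representative preparation tasks, FedTDP outperforms 13 state-of-the-art baselines in accuracy, efficiency, and scalability.
⭐ Contributions: A multi-task federated framework for trajectory data preparation that preserves privacy while remaining efficient and generalizing across diverse tasks.
Abstract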
Trajectory data records the spatio-temporal movements of people and vehicles. However, raw trajectories are often noisy, incomplete, or inconsistent due to sensor errors and transmission failures. To ensure reliable downstream analytics, Trajectory Data Preparation (TDP) has emerged as a critical preprocessing stage, encompassing various tasks such as imputation, map matching, anomaly detection, trajectory recovery, compression, etc. However, existing TDP methods face two major limitations: (i) they assume centralized access to data, which is unrealistic under strict privacy regulations and data silo situations, and (ii) they train task-specific models that lack generalization across diverse or unseen TDP tasks. To this end, we propose FedTDP for Federated Trajectory Data Preparation (F-TDP), where trajectories are vertically partitioned across regions and cannot be directly shared. FedTDP introduces three innovations: (i) lightweight Trajectory Privacy AutoEncoder (TPA) with secret-sharing aggregation, providing formal privacy guarantees; (ii) Trajectory Knowledge Enhancer (TKE) that adapts LLMs to spatio-temporal patterns via trajectory-aware prompts, offsite-tuning, sparse-tuning, and bidirectional knowledge distillation; and (iii) Federated Parallel Optimization (FPO), which reduces communication overhead and accelerates federated training. We conduct experiments on 6 real-world datasets and 10 representative TDP tasks, showing that FedTDP surpasses 13 state-of-the-art baselines in accuracy, efficiency, and scalability, while also generalizing effectively across diverse TDP tasks.
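The abstract mentions secret-sharing aggregation without giving the protocol. As a rough illustration only, the sketch below shows textbook additive secret sharing over integers, which is one common way to sum private values without revealing any single one. The modulus, function names, and the use of `random` are assumptions; a real deployment would use a cryptographically secure source such as the `secrets` module and would encode real-valued embeddings as fixed-point integers.

```python
import random

PRIME = 2**61 - 1  # modulus for share arithmetic (illustrative choice)

def make_shares(value, num_parties, rng):
    """Split an integer into additive shares that sum to value mod PRIME."""
    shares = [rng.randrange(PRIME) for _ in range(num_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def secure_sum(private_values, rng):
    """Each party learns only random-looking shares and the final total."""
    n = len(private_values)
    # Party i splits its value and sends the j-th share to party j.
    all_shares = [make_shares(v, n, rng) for v in private_values]
    # Party j adds the shares it received and publishes only that partial sum.
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)]
    return sum(partial_sums) % PRIME

total = secure_sum([12, 30, 7], random.Random(0))  # hypothetical per-region values
```

Any single share is uniformly distributed, so a party that sees fewer than all shares of a value learns nothing about it; only the aggregate is revealed.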
#LLM #Reasoning #Reinforcement Learning #Supervised Fine-tuning #Math #Code
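TL;DR: We study the synergy between SFT and RL in developing strong reasoning models. Our final 7B model attains top-tier performance among Qwen2.5-based 7B models.
🎯 Motivation: Explore how supervised fine-tuning (SFT) and reinforcement learning (RL) can be combined to improve performance on complex reasoning tasks, particularly math and code.
❓ Problem: Determine whether a stronger SFT model leads to better final performance after large-scale RL, and how to choose the sampling temperature during RL to balance exploration and exploitation.
🔍 Observations: A stronger initial SFT model generally reaches better final performance when RL training is effective. Keeping the temperature-adjusted entropy around 0.3 gives a good balance between exploration and exploitation, and the performance gap between initial SFT models narrows during RL.
🛠️ Method: Scale the SFT training data (more prompts and more generated responses per prompt), then run RL with a sampling temperature chosen to balance exploration and exploitation.
📊 Data and experiments: Trained and evaluated on math and code reasoning data, comparing SFT models of different strength and several RL settings. The final model performs strongly among Qwen2.5-7B-based reasoning models on challenging math and code benchmarks.
⭐ Contributions: AceReason-Nemotron-1.1 7B, built on SFT-RL synergy, sets a new state of the art among Qwen2.5-7B-based reasoning models on math and code benchmarks and demonstrates the effectiveness of the post-training recipe.
Abstract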
In this work, we investigate the synergy between supervised fine-tuning (SFT) and reinforcement learning (RL) in developing strong reasoning models. We begin by curating the SFT training data through two scaling strategies: increasing the number of collected prompts and the number of generated responses per prompt. Both approaches yield notable improvements in reasoning performance, with scaling the number of prompts resulting in more substantial gains. We then explore the following questions regarding the synergy between SFT and RL:
(i) Does a stronger SFT model consistently lead to better final performance after large-scale RL training?
(ii) How can we determine an appropriate sampling temperature during RL training to effectively balance exploration and exploitation for a given SFT initialization?
Our findings suggest that (i) holds true, provided effective RL training is conducted, particularly when the sampling temperature is carefully chosen to maintain the temperature-adjusted entropy around 0.3, a setting that strikes a good balance between exploration and exploitation. Notably, the performance gap between initial SFT models narrows significantly throughout the RL process. Built on a strong SFT foundation and SFT–RL synergy, our AceReason-Nemotron-1.1 7B model significantly outperforms AceReason-Nemotron-1.0 and achieves new state-of-the-art performance among Qwen2.5-7B-based reasoning models on challenging math and code benchmarks, thereby demonstrating the effectiveness of our post-training recipe.
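The abstract does not define "temperature-adjusted entropy" precisely. Assuming it means the entropy of the next-token distribution after dividing the logits by the sampling temperature, a target such as 0.3 can be found by bisection, since that entropy grows monotonically with temperature. The sketch below works on a single hypothetical logit vector; in practice the entropy would presumably be averaged over many tokens and prompts.

```python
import math

def entropy_at_temperature(logits, temperature):
    """Shannon entropy (in nats) of softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # subtract max for numerical stability
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def temperature_for_entropy(logits, target=0.3, lo=0.05, hi=5.0, iters=60):
    """Bisection on temperature; assumes the target lies between the entropies at lo and hi."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if entropy_at_temperature(logits, mid) < target:
            lo = mid  # distribution too peaked: raise the temperature
        else:
            hi = mid  # distribution too flat: lower the temperature
    return 0.5 * (lo + hi)

logits = [4.0, 2.0, 1.0, 0.5, 0.0]  # hypothetical next-token logits
t = temperature_for_entropy(logits)
```

A sharper SFT model (larger logit gaps) needs a higher temperature to reach the same entropy, which matches the abstract's point that the right temperature depends on the SFT initialization.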
#activation steering #behaviour control #alignment #PID control #mechanistic interpretability #language models
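TL;DR: We propose Proportional–Integral–Derivative (PID) Steering, a principled framework that leverages the full PID controller for activation steering in LLMs.
🎯 Motivation: Controlling LLM behavior is essential for safety and reliable deployment, but existing steering methods rest mainly on empirical insight and lack theoretical performance guarantees.
❓ Problem: Popular activation steering methods correspond to proportional (P) controllers only. Without a full control-theoretic framework, they cannot handle accumulated error or rapid activation changes.
🔍 Observations: The proportional term aligns activations with target semantic directions but struggles with accumulated error and rapid changes, which leads to unstable control and overshoot.
🛠️ Method: An activation steering framework based on the full PID controller: the proportional term aligns activations with the target, the integral term corrects accumulated error across layers, and the derivative term mitigates overshoot. The closed-loop design connects steering to classical stability analysis.
📊 Data and experiments: Experiments across multiple LLM families and benchmarks show that PID Steering gives more robust and reliable behavioral control than existing methods.
⭐ Contributions: A control-theoretic approach to activation steering that brings the PID controller to LLM behavior control, with theoretical grounding and performance gains, while remaining lightweight and modular.
Abstract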
Controlling the behaviors of large language models (LLMs) is fundamental to their safety alignment and reliable deployment. However, existing steering methods are primarily driven by empirical insights and lack theoretical performance guarantees. In this work, we develop a control-theoretic foundation for activation steering by showing that popular steering methods correspond to proportional (P) controllers, with the steering vector serving as the feedback signal. Building on this finding, we propose Proportional-Integral-Derivative (PID) Steering, a principled framework that leverages the full PID controller for activation steering in LLMs. The proportional (P) term aligns activations with target semantic directions, the integral (I) term accumulates errors to enforce persistent corrections across layers, and the derivative (D) term mitigates overshoot by counteracting rapid activation changes. This closed-loop design yields interpretable error dynamics and connects activation steering to classical stability guarantees in control theory. Moreover, PID Steering is lightweight, modular, and readily integrates with state-of-the-art steering methods. Extensive experiments across multiple LLM families and benchmarks demonstrate that PID Steering consistently outperforms existing approaches, achieving more robust and reliable behavioral control.
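The roles of the three terms can be seen in a scalar toy model. This is a sketch under strong assumptions, not the paper's method: the "activation" is a single number (think of the projection onto a target direction), each "layer" adds a constant hypothetical `drift` that pulls away from the target, and the gains are arbitrary. It shows the classical result that a P-only controller settles with a steady-state error, while adding the integral term removes it.

```python
class PIDSteering:
    """Scalar PID controller applied layer by layer (illustrative only)."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error):
        """Return the steering strength for the current layer's error."""
        self.integral += error  # I: accumulate error across layers
        derivative = 0.0 if self.prev_error is None else error - self.prev_error
        self.prev_error = error  # D: react to how fast the error changes
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def run_layers(controller, target, num_layers, drift=-0.2):
    """Toy residual stream: each layer adds a constant drift plus the steering term."""
    activation, trace = 0.0, []
    for _ in range(num_layers):
        error = target - activation
        activation += drift + controller.step(error)
        trace.append(activation)
    return trace

p_only = run_layers(PIDSteering(kp=0.5, ki=0.0, kd=0.0), target=1.0, num_layers=40)
pid = run_layers(PIDSteering(kp=0.5, ki=0.1, kd=0.05), target=1.0, num_layers=40)
```

With these numbers the P-only run converges to 0.6 rather than the target 1.0, because the proportional push must balance the drift; the PID run converges to the target.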
#Active Data Selection #Direct Preference Optimization #Human Feedback #LLM Alignment
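TL;DR: We propose an active learning algorithm that uses a theoretically grounded selection criterion while using LLM to parameterize the reward model for efficiently collecting human preference feedback when the latent reward function is non-linear.
🎯 Motivation: Aligning LLMs with human preferences has improved performance on tasks such as question answering, mathematical reasoning, and code generation. Building high-quality preference datasets is costly, however, and existing active data selection methods either lack a strong theoretical foundation or rely on restrictive assumptions.
❓ Problem: Design an active learning algorithm that collects human preference data efficiently when the latent reward function is non-linear, using the LLM itself to parameterize the reward model.
🔍 Observations: Existing methods rely on simple assumptions about the reward function, such as linearity, and do not account for the influence of the LLM being aligned on data selection, which makes data collection inefficient.
🛠️ Method: ActiveDPO builds on a theoretically grounded data selection criterion and parameterizes the reward model directly with the LLM, so the LLM's influence on data selection is modeled explicitly.
📊 Data and experiments: Extensive experiments across various models and real-world preference datasets show that ActiveDPO outperforms existing methods in the efficiency and effectiveness of data collection.
⭐ Contributions: A theoretically grounded active data selection algorithm for non-linear reward functions that improves the effectiveness and efficiency of collecting human preference data for LLM alignment.
Abstract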
The recent success in using human preferences to align large language models (LLMs) has significantly improved their performance in various downstream tasks, such as question answering, mathematical reasoning, and code generation. However, achieving effective LLM alignment depends on high-quality datasets of human preferences. Collecting these datasets requires human preference annotation, which is costly and resource-intensive, necessitating efficient active data selection methods. Existing methods either lack a strong theoretical foundation or depend on restrictive assumptions about the reward function, such as linear latent reward functions. To this end, we propose an algorithm, ActiveDPO, that uses a theoretically grounded data selection criterion for non-linear reward functions while directly leveraging the LLM itself to parameterize the reward model used for active data selection. As a result, ActiveDPO explicitly accounts for the LLM's influence on data selection, unlike methods that select data without considering the LLM that is being aligned, thereby leading to more effective and efficient data collection. Our extensive experiments demonstrate that ActiveDPO outperforms existing methods across various models and real-world preference datasets.
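To make "the LLM itself parameterizes the reward model" concrete, the sketch below computes the standard DPO implicit reward and loss from sequence log-probabilities. The selection rule at the end is only a placeholder heuristic (pick the unlabeled pair whose two responses the current model can least tell apart); it is an assumption for illustration and is not ActiveDPO's theoretically grounded criterion, which the abstract does not spell out. All log-probability values are made up.

```python
import math

def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """DPO implicit reward: beta * log(pi(y|x) / pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)

def dpo_loss(chosen, rejected, beta=0.1):
    """Each argument is a (logp_policy, logp_ref) pair for one response."""
    margin = implicit_reward(*chosen, beta=beta) - implicit_reward(*rejected, beta=beta)
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

def most_uncertain_pair(pool, beta=0.1):
    """Placeholder criterion (assumption): smallest implicit-reward gap."""
    def gap(name):
        resp_a, resp_b = pool[name]
        return abs(implicit_reward(*resp_a, beta=beta) - implicit_reward(*resp_b, beta=beta))
    return min(pool, key=gap)

# Hypothetical unlabeled pool: prompt -> two responses as (logp_policy, logp_ref).
pool = {
    "prompt-1": ((-12.0, -12.5), (-20.0, -14.0)),  # rewards 0.05 vs -0.60
    "prompt-2": ((-9.0, -9.2), (-9.5, -9.6)),      # rewards 0.02 vs 0.01
}
query = most_uncertain_pair(pool)
```

Because the reward is a function of the policy's own log-probabilities, any selection rule built on it changes as the LLM is updated, which is the coupling the abstract emphasizes.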
#SFT
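🎯 Motivation: LLM post-training faces a trade-off between supervised fine-tuning (SFT) and reinforcement learning (RL) in efficiency and generalization, and a more stable and efficient approach is needed.
❓ Problem: Dynamic Fine-Tuning (DFT) improves some reasoning tasks but is unstable and suffers from distributional drift. A method is needed that improves stability while keeping efficiency.
🔍 Observations: Analysis through the reward-weighted regression (RWR) framework shows that DFT yields a tighter RL bound than standard SFT but lacks distributional anchoring, which makes training unstable.
🛠️ Method: Anchored Supervised Fine-Tuning (ASFT) adds lightweight KL regularization to DFT's reweighted objective, ensuring stability while preserving the tightness of the bound.
📊 Data and experiments: On mathematical reasoning, medical knowledge grounding, and code generation, ASFT consistently outperforms SFT and DFT with minimal computational overhead.
⭐ Contributions: A systematic RWR framework for understanding post-training methods, plus ASFT, which delivers substantial gains across several domains and addresses post-training instability by pairing theoretical analysis with practical improvements.
Abstract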
Post-training of large language models involves a fundamental trade-off between supervised fine-tuning (SFT), which efficiently mimics demonstrations but tends to memorize, and reinforcement learning (RL), which achieves better generalization at higher computational cost. Dynamic Fine-Tuning (DFT) recently emerged as a promising middle ground, reweighting SFT objectives with token probabilities and achieving improvements in certain reasoning domains, though it exhibits instability in other tasks. We provide an analysis of DFT through the reward-weighted regression (RWR) framework, revealing that it corresponds to a specific auxiliary distribution choice that yields provably tighter RL bounds than standard SFT. However, our analysis also uncovers a critical limitation: this construction lacks distributional anchoring, leading to progressive drift that undermines training stability. To address this, we propose Anchored Supervised Fine-Tuning (ASFT), which augments DFT’s reweighting with lightweight KL regularization to preserve tightness while ensuring stability. Empirically, ASFT consistently outperforms both SFT and DFT across mathematical reasoning, medical knowledge grounding, and code generation, achieving substantial improvements with minimal computational overhead. Our RWR framework provides a systematic lens for understanding post-training methods and demonstrates that principled theoretical analysis leads to both stronger guarantees and practical gains.
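The three objectives can be compared on a single token. The sketch below only evaluates loss values; in training, the probability weight in the DFT term would be treated as a constant (stop-gradient). The direction of the KL term, the reference distribution, and `kl_weight` are assumptions, since the abstract says only that a lightweight KL regularizer is added.

```python
import math

def sft_loss(p_target):
    """Standard SFT: negative log-likelihood of the target token."""
    return -math.log(p_target)

def dft_loss(p_target):
    """DFT-style: the SFT loss reweighted by the target token's own probability."""
    return -p_target * math.log(p_target)

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def asft_loss(policy_probs, ref_probs, target_index, kl_weight=0.1):
    """DFT term plus a KL anchor to a reference distribution (direction assumed)."""
    anchor = kl_divergence(policy_probs, ref_probs)
    return dft_loss(policy_probs[target_index]) + kl_weight * anchor

policy = [0.7, 0.2, 0.1]     # hypothetical next-token distribution of the model
reference = [0.5, 0.3, 0.2]  # hypothetical distribution of the frozen base model
loss = asft_loss(policy, reference, target_index=0)
```

The KL term is zero when the policy matches the reference and grows as the policy drifts, which is the anchoring effect described in the abstract.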
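#AI slop #slop #constrained generation #delve #patterns #sampling #dpo #preference optimization #fine-tuning #fine tuning #creativity #AI writing #Creative AI
TL;DR: We show several techniques for removing characteristic patterns from LLM generated texts at both the sampler level and at the model weights level.
🎯 Motivation: Repetitive lexical patterns in LLM output ("slop") degrade text quality and make it easy to recognize as AI-generated. The work aims to make generated text more diverse and natural.
❓ Problem: Build a framework that detects and eliminates overused patterns in generated text while preserving writing quality and semantic integrity.
🔍 Observations: Some slop patterns appear over 1,000 times more often in LLM output than in human text, which lowers writing quality and diversity.
🛠️ Method: Three innovations: the Antislop Sampler, which uses backtracking to suppress unwanted strings; an automated pipeline that generates training data; and FTPO, which adjusts the model's logits on individual tokens to remove unwanted patterns.
📊 Data and experiments: In cross-domain evaluations (GSM8K, MMLU, and creative writing tasks), FTPO achieves 90% slop reduction while maintaining or improving performance.
⭐ Contributions: Releases the Antislop framework as open source, including tools, code, and datasets, offering a practical way to make LLM output more creative.
Abstract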
Repetitive lexical patterns in LLM output, termed "slop," degrade writing quality through over-use and make AI-generated text immediately recognizable. We present Antislop, a comprehensive framework providing tools to both detect and eliminate these overused patterns. Our approach combines three innovations: (1) the Antislop Sampler, which uses backtracking to suppress unwanted strings at inference time without destroying vocabulary; (2) an automated pipeline that profiles model-specific slop against human baselines and generates training data; and (3) Final Token Preference Optimization (FTPO), a novel fine-tuning method that operates in logit-space on individual tokens, surgically adjusting logits wherever a banned pattern has appeared in an inference trace. We demonstrate that some slop patterns appear over 1,000 times more frequently in LLM output than in human text. The Antislop Sampler successfully suppresses 8,000+ patterns while maintaining quality, whereas token banning becomes unusable at just 2,000. Most importantly, FTPO achieves 90% slop reduction while maintaining or improving performance in cross-domain evals including GSM8K, MMLU, and creative writing tasks. In contrast, DPO suffers significant degradation in writing quality and lexical diversity despite achieving weaker suppression. We release all code and results datasets under MIT license.
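A backtracking sampler of the kind described above can be sketched with a toy "model". This is an assumption-laden illustration, not the released Antislop Sampler: decoding is greedy over whole-word tokens, `toy_model` is a hard-coded lookup table standing in for a ranked next-token distribution, and the token that started a banned phrase is blocked outright at that position rather than having its probability lowered.

```python
def generate(next_candidates, banned_phrases, max_tokens):
    """Greedy decoding that backtracks whenever a banned phrase is completed.

    next_candidates(prefix) returns tokens ranked best-first.
    """
    banned = [phrase.split() for phrase in banned_phrases]
    tokens = []
    blocked = {}  # position -> tokens ruled out at that position for the current prefix
    while len(tokens) < max_tokens:
        pos = len(tokens)
        options = [t for t in next_candidates(tokens) if t not in blocked.get(pos, set())]
        if not options:
            break  # nothing acceptable left at this position
        tokens.append(options[0])
        for phrase in banned:
            if tokens[-len(phrase):] == phrase:
                start = len(tokens) - len(phrase)
                # Block the token that started the phrase, only at that position.
                blocked.setdefault(start, set()).add(tokens[start])
                # Blocks recorded for later positions depended on the discarded prefix.
                for later in [p for p in blocked if p > start]:
                    del blocked[later]
                tokens = tokens[:start]
                break
    return tokens

def toy_model(prefix):
    """Hypothetical ranked continuations keyed by the prefix so far."""
    table = {
        (): ["a"],
        ("a",): ["rich", "vivid"],
        ("a", "rich"): ["tapestry", "history"],
        ("a", "vivid"): ["scene"],
    }
    return table.get(tuple(prefix), ["."])

output = generate(toy_model, ["rich tapestry"], max_tokens=4)
```

Unlike banning the token "tapestry" everywhere, the block applies only where the phrase began, so the word remains available in other contexts, which is the sense in which the vocabulary is not destroyed.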
#Reinforcement Learning #Large Language Model #Math Reasoning
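TL;DR: AsyPPO efficiently restores the role of critics through lightweight mini-critics and reconstructs the policy learning objective, enhancing the reasoning ability of LLMs while emphasizing the research value of critic-based algorithms.
🎯 Motivation: Reinforcement learning has become a key way to improve LLM reasoning, but conventional value functions are expensive to train and perform poorly under sparse rewards, which limits their efficiency and applicability.
❓ Problem: Current methods avoid explicit critics. This work introduces an efficient, lightweight critic architecture to address value-estimation bias and unstable learning over long reasoning horizons.
🔍 Observations: Existing methods mostly replace the critic with average advantage baselines, which tend toward bias and unproductive exploration under sparse rewards and long reasoning horizons.
🛠️ Method: Asymmetric Proximal Policy Optimization (AsyPPO) deploys a set of lightweight mini-critics, each trained on a disjoint shard of prompts to encourage diverse value estimates, and uses disagreement among the critics to refine the policy update.
📊 Data and experiments: Evaluated on multiple reasoning benchmarks with Qwen3-4b-Base, Qwen3-8b-Base, and Qwen3-14b-Base. Gains over classic PPO exceed 6% on the 4B model and are about 3% on the 8B and 14B models.
⭐ Contributions: A mini-critic architecture that improves LLM reasoning and shows that critic-based algorithms can be efficient and scalable.
Abstract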
Reinforcement learning (RL) has become a central paradigm for post-training large language models (LLMs) to elicit stronger reasoning. Yet, most recent RL for LLMs (RL4LLM) methods avoid explicit critics, replacing them with average advantage baselines. This shift is largely pragmatic: conventional value functions are computationally expensive to train at LLM scale and often fail under sparse rewards and long reasoning horizons. We revisit this bottleneck from an architectural perspective and introduce Asymmetric Proximal Policy Optimization (**AsyPPO**), a simple and scalable framework that restores the critic’s role while remaining efficient in large-model settings. **AsyPPO** employs a set of lightweight *mini-critics*, each trained on disjoint prompt shards. This design encourages diversity while preserving calibration, reducing value-estimation bias. Beyond robust estimation, **AsyPPO** leverages inter-critic uncertainty to refine the policy update: (i) masking advantages in states where critics agree and gradients add little learning signal, and (ii) filtering high-divergence states from entropy regularization, suppressing spurious exploration. Across multiple reasoning benchmarks, **AsyPPO** consistently improves learning stability and performance over strong baselines, e.g., GRPO, achieving performance gains of more than 6% on *Qwen3-4b-Base* and about 3% on *Qwen3-8b-Base* and *Qwen3-14b-Base* over classic PPO. Such results highlight the importance of architectural innovations in critics for scalable, efficient algorithms.
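The two uses of inter-critic uncertainty listed in the abstract can be sketched as per-state masks. The thresholds, the use of the standard deviation as the disagreement measure, and all numbers below are assumptions for illustration; the abstract does not specify them.

```python
from statistics import mean, pstdev

def combine_critics(values_per_critic):
    """values_per_critic[k][t] is mini-critic k's value estimate for state t."""
    per_state = list(zip(*values_per_critic))
    values = [mean(v) for v in per_state]
    disagreement = [pstdev(v) for v in per_state]
    return values, disagreement

def shape_signals(advantages, disagreement, agree_thresh, diverge_thresh):
    """Return (masked advantages, entropy-bonus mask), one entry per state."""
    # (i) where critics agree, the advantage is masked out
    masked_adv = [0.0 if d < agree_thresh else a for a, d in zip(advantages, disagreement)]
    # (ii) where critics diverge strongly, the state is excluded from entropy regularization
    entropy_mask = [0.0 if d > diverge_thresh else 1.0 for d in disagreement]
    return masked_adv, entropy_mask

critics = [[1.0, 0.0, 2.0], [1.0, 1.0, -2.0]]  # two hypothetical mini-critics, three states
values, disagreement = combine_critics(critics)
masked_adv, entropy_mask = shape_signals(
    [0.3, -0.4, 0.9], disagreement, agree_thresh=0.1, diverge_thresh=1.0
)
```

In this toy run the first state is dropped from the policy gradient (critics agree), the third is dropped from the entropy bonus (critics diverge), and the second contributes to both.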
#Large Language Models #Competitive Programming #Test Case Generation #Problem Generation
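TL;DR: We introduce AutoCode, which uses multiple rounds of validation to yield competition-grade problem statements and test cases, far surpassing current state-of-the-art performance.
🎯 Motivation: Writing high-quality competitive programming problems is demanding: authors must set constraints, input distributions, and edge cases precisely, target specific algorithms, and calibrate complexity.
❓ Problem: Can LLMs reliably generate competition-grade programming problems and test cases?
🔍 Observations: Existing methods such as HardTests fall short in consistency with official judgments and in problem quality, and do not meet the needs of high-level contests.
🛠️ Method: AutoCode uses multiple rounds of validation to produce competition-grade problems and test cases, and cross-verifies generated reference and brute-force solutions against test cases to filter out malformed problems.
📊 Data and experiments: On held-out problems, AutoCode test suites approach 99% consistency with official judgments, compared with less than 81% for existing methods such as HardTests.
⭐ Contributions: Shows that LLMs can generate novel problems judged contest-quality by Grandmaster-level competitive programmers, and introduces a validation and filtering mechanism that improves the reliability and novelty of generated problems.
Abstract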
Writing competitive programming problems is exacting. Authors must: set constraints, input distributions, and edge cases that rule out shortcuts; target specific algorithms (e.g., max-flow, dynamic programming, data structures); and calibrate complexity beyond the reach of most competitors. We argue that this makes for an ideal test of general large language model capabilities and study whether they can do this reliably. We introduce AutoCode, which uses multiple rounds of validation to yield competition-grade problem statements and test cases. On held-out problems, AutoCode test suites approach 99% consistency with official judgments, a significant improvement over current state-of-the-art methods like HardTests, which achieve less than 81%. Furthermore, starting with a random seed problem, AutoCode can create novel variants with reference and brute-force solutions. By cross-verifying these generated solutions against test cases, we can further filter out malformed problems. Our system ensures high correctness, as verified by human experts. AutoCode successfully produces novel problems judged by Grandmaster-level (top 0.3%) competitive programmers to be of contest quality.
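Cross-verifying a reference solution against a brute-force solution is a standard practice in problem setting, and a minimal version looks like the sketch below. The problem (maximum subarray sum), the generator ranges, and the function names are chosen for illustration and are not taken from AutoCode.

```python
import random

def reference_solution(nums):
    """Kadane's algorithm: maximum sum of a non-empty contiguous subarray."""
    best = cur = nums[0]
    for x in nums[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

def brute_force_solution(nums):
    """Try every non-empty subarray; slow but easy to trust."""
    n = len(nums)
    return max(sum(nums[i:j]) for i in range(n) for j in range(i + 1, n + 1))

def cross_verify(ref, brute, num_cases=200, seed=0):
    """Return the inputs on which the two solutions disagree (empty list = consistent)."""
    rng = random.Random(seed)
    failures = []
    for _ in range(num_cases):
        nums = [rng.randint(-10, 10) for _ in range(rng.randint(1, 8))]
        if ref(nums) != brute(nums):
            failures.append(nums)
    return failures

def buggy_solution(nums):
    """Deliberately wrong: allows the empty subarray, so all-negative inputs fail."""
    return max(0, reference_solution(nums))

mismatches = cross_verify(reference_solution, brute_force_solution)
suspect = cross_verify(buggy_solution, brute_force_solution)
```

A non-empty mismatch list signals that either a solution or the problem statement is flawed, which is the kind of evidence a filtering step can act on.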
#evaluation #LLM-as-a-judge #metrics #human feedback #open-ended tasks #user-centered evaluation #data-efficient evaluation #automatic metric generation #benchmarking
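TL;DR: We use LLMs to automatically generate and validate task-specific evaluation criteria (metrics) that correlate well with human judgements, and release a library/framework for automatic metric induction.
🎯 Motivation: Evaluating user-facing AI applications on open-ended tasks remains hard, and system optimization is especially difficult when user feedback or behavioral signals are scarce.
❓ Problem: How to automatically generate task-specific evaluation metrics that track human judgment when data is limited, improving the efficiency and reliability of evaluation.
🔍 Observations: Traditional evaluation relies on expensive or scarce human feedback, and existing LLM-as-a-Judge methods leave room to improve correlation with human judgment.
🛠️ Method: AutoMetrics combines retrieval from the curated MetricBank with LLM-generated evaluation criteria, and composes them via regression to maximize correlation with human signal.
📊 Data and experiments: Across 5 diverse tasks, AutoMetrics improves Kendall correlation with human ratings by up to 33.4% over LLM-as-a-Judge while requiring fewer than 100 feedback points.
⭐ Contributions: Releases the AutoMetrics toolkit and MetricBank, a collection of 48 metrics, to support metric generation and accelerate adaptive evaluation of LLM applications.
Abstract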
Evaluating user-facing AI applications remains a central challenge, especially in open-ended domains such as travel planning, clinical note generation, or dialogue. The gold standard is user feedback (e.g., thumbs up/down) or behavioral signals (e.g., retention), but these are often scarce in prototypes and research projects, or too slow to use for system optimization. We present **AutoMetrics**, a framework for synthesizing evaluation metrics under low-data constraints. AutoMetrics combines retrieval from **MetricBank**, a collection of 48 metrics we curate, with automatically generated LLM-as-a-Judge criteria informed by lightweight human feedback. These metrics are composed via regression to maximize correlation with human signal. AutoMetrics takes you from expensive measures to interpretable automatic metrics. Across 5 diverse tasks, AutoMetrics improves Kendall correlation with human ratings by up to 33.4% over LLM-as-a-Judge while requiring fewer than 100 feedback points. We show that AutoMetrics can be used as a proxy reward to equal effect as a verifiable reward. We release the full AutoMetrics toolkit and MetricBank to accelerate adaptive evaluation of LLM applications.
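The evaluation criterion named in the abstract, Kendall correlation with human ratings, is easy to compute directly. The sketch below implements the tau-a variant (no tie correction, an assumption about which variant is meant) and a weighted composition of metric scores. The weights are hypothetical stand-ins for regression coefficients; the regression fit itself is omitted.

```python
def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant pairs - discordant pairs) / total pairs."""
    n = len(xs)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
            score += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant, 0 tie
    return score / (n * (n - 1) / 2)

def compose(metric_scores, weights):
    """Weighted sum of per-example metric scores (weights would come from regression)."""
    return [sum(w * s for w, s in zip(weights, row)) for row in metric_scores]

human = [1, 2, 3, 4]                           # hypothetical human ratings
metric_a = [0.2, 0.1, 0.4, 0.3]                # one metric's scores for the same outputs
metric_b = [0.1, 0.3, 0.2, 0.9]                # a second metric
rows = [list(pair) for pair in zip(metric_a, metric_b)]
combined = compose(rows, weights=[1.0, 1.0])   # hypothetical weights
tau_single = kendall_tau(human, metric_a)
tau_combined = kendall_tau(human, combined)
```

Here each metric alone misorders some pairs, while their sum orders all four outputs the same way the human ratings do, which illustrates why composing several imperfect metrics can raise the correlation.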
#RLVR #LLM Reasoning
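TL;DR: This paper reveals the imbalanced optimization and the Entropy-Clip Rule in off-policy RL for LLMs, and proposes BAlanced Policy Optimization with Adaptive Clipping (BAPO) to stabilize RL optimization.
🎯 Motivation: In off-policy reinforcement learning, training on stale data from past policies improves sample efficiency, but policy entropy declines sharply and optimization becomes unstable or even collapses.
❓ Problem: Stabilize off-policy RL with BAlanced Policy Optimization with Adaptive Clipping (BAPO), which dynamically adjusts clipping bounds to balance positive and negative sample contributions, preserve entropy, and stabilize optimization.
🔍 Observations: Two findings: an optimization imbalance, in which negative-advantage samples dominate the policy gradient and risk gradient explosions; and the Entropy-Clip Rule, under which a fixed clipping mechanism blocks entropy-increasing updates and pushes the policy toward over-exploitation rather than exploration.
🛠️ Method: BAPO is a simple adaptive method that dynamically adjusts the clipping bounds to re-balance positive and negative contributions and preserve policy entropy, giving fast and stable off-policy RL training.
📊 Data and experiments: On AIME 2024 and AIME 2025, the 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B. The 32B model achieves state-of-the-art results among models of the same scale and outperforms proprietary systems such as o3-mini and Gemini-2.5-Flash-Thinking.
⭐ Contributions: Identifies the optimization imbalance and the Entropy-Clip Rule in off-policy RL and proposes BAPO, which trains stably and efficiently and reaches leading results on reasoning benchmarks.
Abstract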
Reinforcement learning (RL) has recently become the core paradigm for aligning and strengthening large language models (LLMs). Yet, applying RL in off-policy settings—where stale data from past policies are used for training—improves sample efficiency, but remains challenging: policy entropy declines sharply, optimization often becomes unstable and may even collapse. Through theoretical and empirical analysis, we identify two key insights: (i) an imbalance in optimization, where negative-advantage samples dominate the policy gradient, suppressing useful behaviors and risking gradient explosions; and (ii) the derived Entropy-Clip Rule, which reveals that the fixed clipping mechanism in PPO-like objectives systematically blocks entropy-increasing updates, thereby driving the policy toward over-exploitation at the expense of exploration. Building on these insights, we propose BAlanced Policy Optimization with Adaptive Clipping (BAPO), a simple yet effective method that dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. Across diverse off-policy scenarios—including sample replay and partial rollout—BAPO achieves fast, stable, and data-efficient training. On AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B, while our 32B BAPO model not only achieves state-of-the-art results among models of the same scale but also outperforms leading proprietary systems like o3-mini and Gemini-2.5-Flash-Thinking.
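To see how clipping bounds control which samples contribute, the sketch below implements a PPO-style clipped objective with separate lower and upper bounds, then widens the upper bound until positive-advantage samples supply a target share of the active signal. The adaptation rule, the share definition, the step size, and all sample values are assumptions for illustration; the abstract states only that BAPO adjusts the bounds dynamically to re-balance positive and negative contributions.

```python
def clipped_surrogate(ratio, advantage, c_low, c_high):
    """PPO-style per-token objective with separate lower and upper clip bounds."""
    clipped = min(max(ratio, c_low), c_high)
    return min(ratio * advantage, clipped * advantage)

def has_gradient(ratio, advantage, c_low, c_high):
    """A token still receives gradient only while clipping is inactive for it."""
    return ratio < c_high if advantage > 0 else ratio > c_low

def positive_share(samples, c_low, c_high):
    """Share of the active |ratio * advantage| mass that comes from positive advantages."""
    pos = neg = 0.0
    for ratio, adv in samples:
        if has_gradient(ratio, adv, c_low, c_high):
            if adv > 0:
                pos += ratio * adv
            else:
                neg += ratio * -adv
    total = pos + neg
    return pos / total if total else 0.0

def adapt_upper_bound(samples, c_low=0.8, c_high=1.2, target=0.5, step=0.05, c_max=2.0):
    """Assumed rule: raise c_high until positive samples reach the target share."""
    while positive_share(samples, c_low, c_high) < target and c_high + step <= c_max:
        c_high += step
    return c_high

# Hypothetical (importance ratio, advantage) pairs from stale rollouts.
samples = [(1.28, 1.0), (1.48, 2.0), (0.9, -1.0), (1.1, -1.0)]
new_c_high = adapt_upper_bound(samples)
```

With the default bound of 1.2 both positive-advantage tokens are clipped and only the negative ones drive the update; after adaptation the bound sits near 1.5 and the positive tokens contribute again.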
#Reinforcement finetuning (RFT) #Large Language Models (LLMs) #Online task selection #Bayesian inference
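TL;DR: We propose BOTS, a Bayesian framework for online task selection in LLM finetuning that integrates explicit and implicit evidence with posterior sampling, achieving efficient and effective training.
🎯 Motivation: Reinforcement finetuning (RFT) is key to aligning LLMs and improving reasoning, but inefficient task selection significantly hurts training.
❓ Problem: Existing task selection methods waste computation, incur high rollout costs, adapt poorly, or use incomplete evidence. A more efficient online task selection framework is needed.
🔍 Observations: Uniform task sampling wastes resources on tasks that are trivial or unsolvable, and selection methods that use incomplete evidence do not respond to changes in the model during training.
🛠️ Method: BOTS is grounded in Bayesian inference with posterior (Thompson) sampling. It tracks task difficulty as the model evolves and estimates the difficulty of unevaluated tasks with a lightweight interpolation-based plug-in, balancing exploration and exploitation.
📊 Data and experiments: Experiments across diverse domains and LLM scales show consistent improvements in data efficiency and performance over baselines and ablations.
⭐ Contributions: An efficient dynamic task selection framework that integrates explicit and implicit evidence with a lightweight difficulty estimator, providing a practical solution for RFT.
Abstract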
Reinforcement finetuning (RFT) is a key technique for aligning Large Language Models (LLMs) with human preferences and enhancing reasoning, yet its effectiveness is highly sensitive to which tasks are explored during training. Uniform task sampling is inefficient, wasting computation on tasks that are either trivial or unsolvable, while existing task selection methods often suffer from high rollout costs, poor adaptivity, or incomplete evidence. We introduce **BOTS**, a unified framework for **B**ayesian **O**nline **T**ask **S**election in LLM reinforcement finetuning. Grounded in Bayesian inference, BOTS adaptively maintains posterior estimates of task difficulty as the model evolves. It jointly incorporates *explicit evidence* from direct evaluations of selected tasks and *implicit evidence* inferred from these evaluations for unselected tasks, with Thompson sampling ensuring a principled balance between exploration and exploitation for task selection. To make implicit evidence practical, we instantiate it with an ultra-light interpolation-based plug-in that estimates difficulties of unevaluated tasks without extra rollouts, adding negligible overhead. Empirically, across diverse domains and LLM scales, BOTS consistently improves data efficiency and performance over baselines and ablations, providing a practical and extensible solution for dynamic task selection in RFT.
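Thompson sampling over task difficulty can be sketched with Beta posteriors on each task's success rate. Several choices here are assumptions rather than details from the abstract: the Beta-Bernoulli model, the `decay` factor for discounting old evidence, and the preference for tasks whose sampled success rate is near 0.5. The interpolation-based implicit evidence is not shown.

```python
import random

class TaskPosterior:
    """Beta posterior over a task's success rate under the current policy."""

    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha, self.beta = alpha, beta

    def update(self, successes, failures, decay=1.0):
        # decay < 1 discounts old evidence as the model changes (an assumption)
        self.alpha = decay * self.alpha + successes
        self.beta = decay * self.beta + failures

    def sample(self, rng):
        return rng.betavariate(self.alpha, self.beta)

def select_tasks(posteriors, batch_size, rng, target=0.5):
    """Thompson sampling: draw one success rate per task, keep those nearest the target."""
    draws = {task: post.sample(rng) for task, post in posteriors.items()}
    return sorted(draws, key=lambda task: abs(draws[task] - target))[:batch_size]

posteriors = {"easy": TaskPosterior(), "medium": TaskPosterior(), "hard": TaskPosterior()}
posteriors["easy"].update(successes=999, failures=0)    # almost always solved
posteriors["medium"].update(successes=499, failures=499)
posteriors["hard"].update(successes=0, failures=999)    # almost never solved
chosen = select_tasks(posteriors, batch_size=1, rng=random.Random(0))
```

Tasks with little evidence have wide posteriors, so their draws occasionally land near the target and they still get explored; well-understood trivial or unsolvable tasks are rarely selected.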
#Large Language Models (LLMs) #Reinforcement Learning from Human Feedback (RLHF) #Mixture-of-Experts (MoE) #Parameter-Efficient Fine-Tuning (PEFT) #Group Relative Policy Optimization (GRPO)
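TL;DR: We introduce RO-GRPO, a method that prevents routing collapse in MoE models during GRPO by transforming internal routing statistics into a reward signal, enabling the simultaneous alignment of a model's behavior and its internal mechanisms.
🎯 Motivation: When advanced RL algorithms such as GRPO are used to fine-tune parameter-efficient architectures such as LoRA-MoE, the objectives are incompatible, causing routing collapse and underused experts and limiting what MoE can deliver.
❓ Problem: RO-GRPO addresses routing collapse when fine-tuning MoE models with GRPO by turning internal routing statistics into a reward signal, optimizing the model's behavior and its internal mechanisms together.
🔍 Observations: Traditional supervised techniques are not compatible with the GRPO objective, and naive combinations leave MoE adapter parameters underutilized and routing collapsed, which holds back performance on complex tasks such as mathematical reasoning.
🛠️ Method: A mechanism-aware framework that converts expert routing statistics collected during training directly into a reward signal, integrating routing supervision into reinforcement fine-tuning without extra training stages.
📊 Data and experiments: Validated on unimodal and multimodal mathematical reasoning tasks, where RO-GRPO improves parameter utilization and task performance.
⭐ Contributions: Provides the first demonstration that a scalar reward in GRPO can be engineered from a model's internal mechanics to guide optimization explicitly, extending alignment from behavior tuning to mechanism alignment.
Abstract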
Parameter-efficient Mixture-of-Experts (MoE) architectures, such as LoRA-MoE, enable strong and generalizable fine-tuning. However, a critical problem arises when fine-tuning these architectures with advanced reinforcement learning algorithms such as Group Relative Policy Optimization (GRPO). Traditional supervised techniques are not naturally compatible with the GRPO objective, and naive combinations fail to effectively address routing collapse and the underutilization of MoE adapter parameters. To resolve this disconnect, we introduce Routing-Optimized Group Relative Policy Optimization (RO-GRPO), a mechanism-aware framework. It turns internal expert routing statistics collected during training into a direct reward signal, seamlessly integrating routing supervision into the reinforcement fine-tuning (RFT) process. This enables effective optimization of parameter utilization and improves performance on both unimodal and multimodal mathematical reasoning tasks, all without extra training stages. Our work provides the first demonstration that a scalar reward in GRPO can be engineered from a model's own internal mechanics to explicitly guide its optimization, extending alignment from mere behavior tuning to holistic mechanism alignment.
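One way to picture "routing statistics as a reward" is to add a routing-balance bonus to each rollout's task reward before computing GRPO's group-relative advantages. The specific statistic used here (normalized entropy of expert usage) and the `weight` are assumptions standing in for whatever RO-GRPO actually uses; only the within-group standardization follows the usual GRPO recipe.

```python
import math
from statistics import mean, pstdev

def routing_balance(expert_counts):
    """Normalized entropy of expert usage: 1.0 = uniform, 0.0 = fully collapsed."""
    if len(expert_counts) < 2:
        return 0.0
    total = sum(expert_counts)
    probs = [c / total for c in expert_counts if c > 0]
    return -sum(p * math.log(p) for p in probs) / math.log(len(expert_counts))

def shaped_rewards(task_rewards, routing_counts, weight=0.2):
    """Task reward plus a routing bonus for each rollout in the group (assumed form)."""
    return [r + weight * routing_balance(c) for r, c in zip(task_rewards, routing_counts)]

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Two hypothetical rollouts that both solve the task but route tokens differently.
rewards = shaped_rewards([1.0, 1.0], [[5, 5, 5, 5], [20, 0, 0, 0]])
advantages = group_relative_advantages(rewards)
```

Without the bonus the two rollouts would tie and yield zero advantage; with it, the rollout that spreads tokens across experts is preferred, so the routing signal reaches the policy through the ordinary scalar reward.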
#RL #calibration #reasoning #uncertainty
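🎯 Motivation: Training language models with RL to generate reasoning chains improves question answering, but reliance on binary reward functions degrades calibration and raises the rate of incorrect outputs, undermining trust in the model.
❓ Problem: RLCR is a training approach that optimizes accuracy and calibration jointly, mitigating the loss of calibration and out-of-domain reliability caused by purely binary rewards.
🔍 Observations: Binary correctness rewards do not penalize guessing or low-confidence outputs, so models trained with them tend to give incorrect or confidently wrong answers ("hallucinations") in other problem domains.
🛠️ Method: RLCR augments the binary correctness reward with a Brier-score term on the model's confidence, so the model is optimized for both accuracy and calibration and outputs a numerical confidence estimate after reasoning.
📊 Data and experiments: Across diverse datasets, RLCR substantially improves calibration on in-domain and out-of-domain evaluations with no loss in accuracy, outperforming ordinary RL and post-hoc confidence classifiers.
⭐ Contributions: Proves that a reward built on a bounded, proper scoring rule yields accurate and well-calibrated predictions; proposes RLCR, which substantially improves calibration; and shows that verbalized confidence can be used at test time, via confidence-weighted scaling, to improve accuracy and calibration.
Abstract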
When language models (LMs) are trained via reinforcement learning (RL) to generate natural language “reasoning chains”, their performance improves on a variety of difficult question answering tasks. Today, almost all successful applications of RL for reasoning use binary reward functions that evaluate the correctness of LM outputs. Because such reward functions do not penalize guessing or low-confidence outputs, they often have the unintended side-effect of degrading calibration and increasing the rate at which LMs generate incorrect responses (or “hallucinate”) in other problem domains. This paper describes RLCR (Reinforcement Learning with Calibration Rewards), an approach to training reasoning models that jointly improves accuracy and calibrated confidence estimation. During RLCR, LMs generate both predictions and numerical confidence estimates after reasoning. They are trained to optimize a reward function that augments a binary correctness score with a Brier score—a scoring rule for confidence estimates that incentivizes calibrated prediction. We first prove that this reward function (or any analogous reward function that uses a bounded, proper scoring rule) yields models whose predictions are both accurate and well-calibrated. We next show that across diverse datasets, RLCR substantially improves calibration with no loss in accuracy, on both in-domain and out-of-domain evaluations—outperforming both ordinary RL training and classifiers trained to assign post-hoc confidence scores. While ordinary RL hurts calibration, RLCR improves it. Finally, we demonstrate that verbalized confidence can be leveraged at test time to improve accuracy and calibration via confidence-weighted scaling methods. Our results show that explicitly optimizing for calibration can produce more generally reliable reasoning models.
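The reward described above can be written in a few lines. The sketch assumes the two terms are combined with equal weight, as correctness minus the squared gap between stated confidence and correctness; the abstract says only that a Brier score augments the binary score. The second function checks numerically why this incentivizes calibration: the expected reward is maximized when the stated confidence equals the true probability of being correct.

```python
def rlcr_reward(correct, confidence):
    """Binary correctness minus the Brier penalty (confidence - correctness)^2."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2

def expected_reward(p_correct, confidence):
    """Expected reward when the answer is right with probability p_correct."""
    return (p_correct * rlcr_reward(True, confidence)
            + (1.0 - p_correct) * rlcr_reward(False, confidence))

# Grid search over stated confidence for a model that is right 70% of the time.
grid = [i / 100 for i in range(101)]
best_confidence = max(grid, key=lambda q: expected_reward(0.7, q))
```

A confidently wrong answer (`rlcr_reward(False, 1.0)`) scores -1, worse than an unsure wrong answer, whereas a plain binary reward would score both as 0.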
#LLM #LLM Agent #Prompt Evolving
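🎯 Motivation: LLM performance depends on carefully engineered prompts, but existing optimization methods are brittle: small, semantically equivalent changes cause performance swings, so prompt robustness needs to improve.
❓ Problem: Identify and reduce performance fluctuation across a prompt's semantic neighborhood (textual sharpness), making prompts more stable and robust there.
🔍 Observations: Automated prompt search is unstable under small meaning-preserving paraphrases, which shows up as large swings in the prompt landscape and undermines reliability in practice.
🛠️ Method: TARE is a gradient-free framework that alternates between adversarial search and robust selection to optimize prompts. ATARE additionally learns anisotropic weights and adapts the radius of the semantic neighborhood to balance exploration and fidelity.
📊 Data and experiments: Evaluated on diverse tasks. Minimizing the textual sharpness gap yields prompts that keep their accuracy under paraphrasing, at practical computational cost.
⭐ Contributions: Gives the first formal treatment of textual sharpness in prompt space together with a robustness criterion, and proposes TARE and its extension ATARE for more robust and efficient prompt search.
Abstract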
The performance of Large Language Models (LLMs) hinges on carefully engineered prompts, yet automated prompt search remains brittle: small, semantically preserving paraphrases often cause large performance swings. Prevailing prompt optimization methods, ranging from heuristic edits and reinforcement learning to evolutionary search, primarily target point-wise accuracy. They seldom enforce paraphrase invariance or search stability, and therefore cannot remedy this brittleness in practice. We identify this brittleness as the **textual sharpness** of the **prompt landscape**. In this work, we provide the first formal treatment of textual sharpness in the discrete, semantic space of prompts, together with an operational robustness criterion over a semantic neighborhood; the design is black-box or API-only, requiring no gradients to update the model's parameters. Then we introduce **TARE** (Textual Sharpness-Aware Evolving), a derivative-free framework that alternates between an inner, sampling-based adversarial search that stresses a prompt with hard paraphrases and an outer, robust selection that prefers candidates whose neighborhoods remain strong. We further propose **ATARE**, which learns anisotropic weights to shape the semantic neighborhood and adapts its radius over time to balance exploration and fidelity. We evaluate our methods on diverse tasks: minimizing the textual sharpness gap leads to prompts that preserve accuracy under paraphrasing, outperforming accuracy-only prompt search while remaining computationally practical. The code is available for anonymous access at https://anonymous.4open.science/r/ATARE_TARE/.
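The inner/outer structure described above amounts to a max-min choice: score each candidate by its worst paraphrase, then keep the candidate whose worst case is best. The sketch below hard-codes the paraphrase neighborhoods and accuracy scores, which in a real setup would come from LLM calls and task evaluation; the prompts, numbers, and the gap-based sharpness measure are all illustrative assumptions, and the sampling and evolution loop is omitted.

```python
def worst_case(prompt, paraphrase, score):
    """Inner step: stress a prompt with its paraphrase neighborhood."""
    return min(score(p) for p in [prompt] + paraphrase(prompt))

def textual_sharpness(prompt, paraphrase, score):
    """Gap between a prompt's own score and its worst paraphrase's score."""
    return score(prompt) - worst_case(prompt, paraphrase, score)

def robust_select(candidates, paraphrase, score):
    """Outer step: prefer the candidate whose neighborhood stays strong (max-min)."""
    return max(candidates, key=lambda p: worst_case(p, paraphrase, score))

# Hypothetical neighborhoods and accuracies.
NEIGHBORS = {
    "Solve step by step.": ["Solve it in steps.", "Work through it stepwise."],
    "Answer:": ["Reply:", "Give the answer:"],
}
SCORES = {
    "Solve step by step.": 0.80, "Solve it in steps.": 0.78, "Work through it stepwise.": 0.76,
    "Answer:": 0.85, "Reply:": 0.40, "Give the answer:": 0.70,
}
candidates = list(NEIGHBORS)
pointwise_best = max(candidates, key=SCORES.get)
robust_best = robust_select(candidates, NEIGHBORS.get, SCORES.get)
```

Accuracy-only search picks "Answer:" (0.85), whose score collapses under one paraphrase; the robust choice gives up a little point-wise accuracy for a neighborhood that stays above 0.76.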
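#LLM Reasoning #Reinforcement Learning #Self-evolving
TL;DR: We propose an online Self-play with Variational Problem Synthesis strategy for RLVR training that iteratively leverages model responses to synthesize variational problems for augmentation.
🎯 Motivation: Existing Reinforcement Learning with Verifiable Rewards (RLVR) methods improve reasoning but reduce policy entropy, which limits generation diversity and Pass@k gains.
❓ Problem: Mitigate entropy collapse during training by augmenting and updating the set of training problems, improving generation diversity and reasoning performance at larger k.
🔍 Observations: Augmenting and updating training problems helps maintain policy entropy, whereas vanilla RLVR training reduces generation diversity.
🛠️ Method: Online Self-play with Variational problem Synthesis (SvS) uses the policy's correct solutions to synthesize variational problems whose reference answers remain identical to the originals, enabling the policy to improve itself.
📊 Data and experiments: Experiments on 12 reasoning benchmarks, including AIME24 and AIME25, with model sizes from 3B to 32B demonstrate generalizability and robustness. Absolute Pass@32 gains reach 18.3% on AIME24 and 22.8% on AIME25.
⭐ Contributions: Systematically analyzes how training problems affect generation diversity in RLVR and proposes SvS, which substantially improves reasoning performance across benchmarks.
Abstract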
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a key paradigm for post-training Large Language Models (LLMs), particularly for complex reasoning tasks. However, vanilla RLVR training has been shown to improve Pass@1 performance at the expense of policy entropy, leading to reduced generation diversity and limiting the Pass@k performance, which typically represents the upper bound of LLM reasoning capability. In this paper, we systematically analyze the policy's generation diversity from the perspective of training problems and find that augmenting and updating training problems helps mitigate entropy collapse during training. Based on these observations, we propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training, which uses the policy's correct solutions to synthesize variational problems while ensuring their reference answers remain identical to the originals. This self-improving strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR, sustaining prolonged improvements and achieving absolute gains of 18.3% and 22.8% in Pass@32 performance on the competition-level AIME24 and AIME25 benchmarks. Experiments on 12 reasoning benchmarks across varying model sizes from 3B to 32B consistently demonstrate the generalizability and robustness of SvS.
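Since the abstract reports results in terms of Pass@k, a short worked example may help. The abstract does not say which estimator was used; the sketch below assumes the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The function name `pass_at_k` and the sample counts are illustrative only.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples of which c are correct (assumed estimator)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 64 samples for one problem, only 2 of them correct.
p1 = pass_at_k(n=64, c=2, k=1)    # chance that a single draw is correct
p32 = pass_at_k(n=64, c=2, k=32)  # chance that at least one of 32 draws is correct
```

The example shows why low-entropy policies can score well on Pass@1 yet plateau on Pass@k: Pass@k rewards having at least one correct sample among many, which depends on generation diversity.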
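Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#large language models #policy optimization #multi-domain reasoning
TL;DR: We propose CGPO, a scalable, curvature-guided RL method that leverages cross-domain gradient interactions to enhance multi-domain reasoning in LLMs, achieving faster and more consistent improvements across diverse tasks.
🎯 Motivation: Multi-domain reinforcement learning for LLMs involves intricate reward surfaces, making it hard to find parameters that perform well across all domains. Domains often conflict, so methods for managing cross-domain capability trade-offs are needed.
❓ Problem addressed: Proposes a scalable curvature-guided policy optimization framework to mitigate multi-domain conflicts and improve the multi-domain reasoning of LLMs.
🔍 Observations: The reward surface has geometric structure, but gradient interactions between domains often make optimization difficult and gains uneven. Existing methods do not fully exploit the alignment of cross-domain gradient inner products.
🛠️ Method: Curvature-Guided Policy Optimization (CGPO) preconditions gradients with curvature information in the spirit of Newton's method. Domains are processed in random order at each update, and each domain's gradient is preconditioned with curvature information from other domains, which promotes implicit gradient alignment.
📊 Data and experiments: Uses a mixed dataset covering math, coding, science, and creative writing, evaluated on seven widely used benchmarks. CGPO outperforms all baselines in speed of reward improvement and in multi-domain capability.
⭐ Contributions: Proposes CGPO, a curvature-guided RL framework that exploits cross-domain interactions through the geometry of the reward surface, improving the multi-domain reasoning and training efficiency of LLMs.
Abstract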
Multi-domain reinforcement learning (RL) for large language models (LLMs) involves highly intricate reward surfaces, posing significant challenges in finding parameters that excel across all domains. Recent empirical studies have further highlighted conflicts among domains, where gains in one capability often come at the expense of another. However, approaches to mitigate such conflicts and enhance multi-domain reasoning remain largely underexplored. To address this challenge, we propose **C**urvature-**G**uided **P**olicy **O**ptimization (**CGPO**), a principled and scalable training framework to advance the multi-domain reasoning of LLMs. Inspired by Newton's method, CGPO exploits the geometric structure in the reward surface, while sidestepping the prohibitive cost of Hessian computation. At each update, CGPO processes domains in random order, preconditioning their gradients with curvature information from other domains to foster richer cross-domain interactions. This mechanism further promotes implicit gradient alignment by maximizing inter-domain inner products in expectation, steering the parameters toward regions that jointly enhance multi-domain performance. Extensive experiments on a mixed dataset covering math, coding, science, and creative writing, evaluated across seven widely-used benchmarks, show that CGPO significantly outperforms all baselines in terms of faster reward improvement and stronger multi-domain capability.
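Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Large Language Models; Reward Models
🎯 Motivation: Reward models trained on human preference data are effective for aligning LLMs with human intent but are vulnerable to reward hacking. Existing mitigation work focuses on in-distribution settings and overlooks the harder out-of-distribution (OOD) case.
❓ Problem addressed: Improving reward models in OOD settings by incorporating fine-grained multi-attribute scores, while overcoming the bottleneck caused by scarce high-quality data for multi-objective reward functions.
🔍 Observations: State-of-the-art methods perform poorly in OOD settings. Multi-objective reward functions help, but their performance is held back by the limited availability of high-quality data.
🛠️ Method: A unified reward modeling framework that jointly trains a Bradley-Terry (BT) single-objective reward function and multi-objective regression-based reward functions on a shared embedding space, with a theoretical connection between the BT loss and the regression objective that highlights their complementary benefits.
📊 Data and experiments: Experiments show that the framework significantly improves both the robustness and the scoring performance of reward models, with a 7B model outperforming a 70B baseline.
⭐ Contributions: 1) A unified reward modeling framework; 2) a theoretical analysis of how BT and multi-objective regression complement each other; 3) results in which a small model outperforms a much larger one, supporting the method's effectiveness.
Abstract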
Reward models trained on human preference data have demonstrated strong effectiveness in aligning Large Language Models (LLMs) with human intent under the framework of Reinforcement Learning from Human Feedback (RLHF). However, RLHF remains vulnerable to reward hacking, where the policy exploits imperfections in the reward function rather than genuinely learning the intended behavior.
Although significant efforts have been made to mitigate reward hacking, they predominantly focus on and evaluate in-distribution scenarios, where the training and testing data for the reward model share the same distribution.
In this paper, we empirically show that state-of-the-art methods struggle in more challenging out-of-distribution (OOD) settings. We further demonstrate that incorporating fine-grained multi-attribute scores helps address this challenge. However, the limited availability of high-quality data often leads to weak performance of multi-objective reward functions, which can negatively impact overall performance and become the bottleneck. To address this issue, we propose a unified reward modeling framework that jointly trains Bradley-Terry (BT) single-objective and multi-objective regression-based reward functions using a shared embedding space. We theoretically establish a connection between the BT loss and the regression objective and highlight their complementary benefits. Specifically, the regression task enhances the single-objective reward function’s ability to mitigate reward hacking in challenging OOD settings, while BT-based training improves the scoring capability of the multi-objective reward function, enabling a 7B model to outperform a 70B baseline.
Extensive experimental results demonstrate that our framework significantly improves both the robustness and the scoring performance of reward models.
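To make the joint objective concrete, here is a minimal sketch of a Bradley-Terry loss and a multi-attribute regression loss computed from two heads over one shared embedding. It is a simplified illustration under stated assumptions: the linear heads, the weighting factor `lam`, the choice to regress only on the chosen response, and all names (`bt_loss`, `joint_loss`, `w_bt`, `w_reg`) are hypothetical and not taken from the paper.

```python
import math
from typing import List, Sequence

def dot(w: Sequence[float], x: Sequence[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))

def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected), computed stably."""
    d = r_chosen - r_rejected
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))

def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def joint_loss(emb_chosen: Sequence[float],
               emb_rejected: Sequence[float],
               attr_targets: Sequence[float],
               w_bt: Sequence[float],
               w_reg: List[Sequence[float]],
               lam: float = 1.0) -> float:
    """BT loss plus lam * regression loss; both heads read the same embedding (assumption)."""
    r_c = dot(w_bt, emb_chosen)
    r_r = dot(w_bt, emb_rejected)
    attr_pred = [dot(w, emb_chosen) for w in w_reg]  # one linear head per attribute
    return bt_loss(r_c, r_r) + lam * mse(attr_pred, attr_targets)

# Toy 2-dimensional embeddings and two attribute heads (e.g. helpfulness, correctness).
loss = joint_loss(
    emb_chosen=[1.0, 0.0], emb_rejected=[0.0, 1.0],
    attr_targets=[1.0, 0.0],
    w_bt=[1.0, -1.0], w_reg=[[1.0, 0.0], [0.0, 1.0]],
)
```

Because both terms back-propagate into the same embedding in a real model, the regression signal can shape the features the BT head relies on, which is the kind of complementarity the abstract describes.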
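Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Chart-to-Code Generation #Reinforcement Learning
🎯 Motivation: Reinforcement learning works well for general reasoning in vision-language models, but it is rarely applied to tasks that require deep understanding of information-rich images and structured output generation. Chart-to-code generation exemplifies this challenge, since it demands complex reasoning over visual charts to produce structured code. Supervised fine-tuning alone is often insufficient, so RL strategies tailored to structured outputs are needed.
❓ Problem addressed: Breaking the performance bottleneck of supervised fine-tuning in chart-to-code generation. The paper systematically studies the SFT plateau and proposes multimodal structured reinforcement learning to address existing methods' weaknesses on complex visual structure and fine-grained code details.
🔍 Observations: Large-scale experiments show that although SFT reaches state-of-the-art performance, simply adding more training data eventually yields diminishing returns, which reveals an inherent plateau for SFT on complex structured-output tasks.
🛠️ Method: Multimodal Structured Reinforcement Learning (MSRL) uses a multi-granularity reward system that integrates textual and visual feedback. At the textual level, rule-based rewards validate fine-grained code details; at the visual level, a model-based reward assesses the structural similarity between the rendered code and the ground-truth chart. Training follows a two-stage curriculum.
📊 Data and experiments: Builds the largest training corpus to date, with 3 million chart-code pairs curated from real-world tables in arXiv papers. MSRL improves high-level metrics by 6.2% on ChartMimic and 9.9% on ReachQA, breaking the SFT plateau.
⭐ Contributions: The first systematic study of the SFT plateau in chart-to-code generation, together with the MSRL framework for breaking it. The paper also builds a large-scale real-world dataset and, through multimodal rewards and curriculum learning, outperforms all existing approaches in the chart domain while achieving results competitive with advanced closed-source models.
Abstract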
While reinforcement learning (RL) has proven highly effective for general reasoning in vision-language models, its application to tasks requiring deep understanding of information-rich images and structured output generation remains underexplored. Chart-to-code generation exemplifies this challenge, demanding complex reasoning over visual charts to produce structured code. Supervised fine-tuning (SFT) alone is often insufficient, highlighting the need for effective RL strategies tailored to structured outputs. In this paper, we systematically investigate the performance plateau of SFT through large-scale experiments and propose Multimodal Structured Reinforcement Learning (MSRL) for chart-to-code generation. We construct the largest training corpus to date, with 3 million chart-code pairs curated from real-world tables in arXiv papers, addressing the limitations of previous synthetic datasets. Despite achieving state-of-the-art performance, our experiments show that simply increasing SFT data eventually leads to diminishing improvements. To break this plateau, MSRL employs a multi-granularity reward system that integrates both textual and visual feedback. At the textual level, rule-based rewards validate fine-grained code details, while at the visual level, a model-based reward assesses the structural similarity between rendered code and ground-truth charts. We implement a two-stage curriculum training strategy, first optimizing the model with textual rewards and then incorporating visual signals for further enhancement. Experimental results demonstrate that MSRL substantially breaks the SFT plateau, improving high-level metrics by 6.2% and 9.9% on ChartMimic and ReachQA benchmarks, respectively. Notably, our method outperforms all existing approaches in the chart domain and achieves competitive results with advanced closed-source models.
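Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#LLM unlearning #circuit discovery #conjunctive normal form #interpretability
TL;DR: We use circuit discovery and CNF solving to design the localization for forget neurons and retain neurons in the LLM unlearning task.
🎯 Motivation: LLM unlearning must remove the influence of undesirable data while leaving unrelated information intact. Existing methods cannot disentangle the neurons responsible for forgetting from those responsible for retaining, which leads to incomplete forgetting or over-forgetting.
❓ Problem addressed: Proposes a new localization framework that separates neurons to be forgotten from neurons to be retained, avoiding interference between forgetting and retention.
🔍 Observations: Existing localization methods treat forget and retain neurons as a single entangled group and apply uniform interventions, so the target knowledge may be incompletely erased or non-target skills may be damaged.
🛠️ Method: Uses circuit discovery to attribute importance to neurons, transforms the forget and retain circuits into conjunctive normal form (CNF), determines each neuron's role from a satisfying assignment of the CNF, and applies targeted fine-tuning strategies to each category.
📊 Data and experiments: Extensive experiments show that, compared with existing localization methods, CLUE strikes a better balance between removing target knowledge and preserving non-target capabilities, achieving higher forget efficacy and retain utility.
⭐ Contributions: Proposes the CLUE framework, which combines circuit discovery with CNF solving to classify and localize LLM neurons precisely, improving the effectiveness and reliability of unlearning.
Abstract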
LLM unlearning aims to eliminate the influence of undesirable data without affecting causally unrelated information.
This process typically involves using a **forget set** to remove target information, alongside a **retain set** to maintain non-target capabilities. While recent localization-based methods demonstrate promise in identifying important nodes (neurons) to be unlearned, they fail to disentangle nodes responsible for forgetting undesirable knowledge or retaining essential skills, often treating them as a single entangled group. As a result, these methods apply uniform interventions, risking catastrophic over-forgetting or incomplete erasure of the target knowledge. To address this, we turn to circuit discovery, a mechanistic interpretability technique, and propose the **C**onflict-guided **L**ocalization for LLM **U**nlearning fram**E**work (**CLUE**). This framework identifies the forget and retain circuit composed of important nodes, and then the circuits are transformed into conjunctive normal forms (CNF). The assignment of each node in the CNF satisfiability solution reveals whether it should be forgotten or retained. We then provide targeted fine-tuning strategies for different categories of nodes. Extensive experiments demonstrate that, compared to existing localization methods, CLUE achieves superior forget efficacy and retain utility through precise neural localization.
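Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Steerable Generation #Large language models #Representation Engineering #Test-time Intervention #Learning Dynamics
TL;DR: We introduce COLD-Steer, an optimization-free, sample-efficient activation steering framework that leverages the in-Context One-step Learning Dynamics of given examples to steer LLM behavior during inference.
🎯 Motivation: Existing inference-time control methods for LLMs trade off sample efficiency against how well they capture the steering signal, so an efficient training-free method for steering behavior at inference time is needed.
❓ Problem addressed: Proposes a framework that requires no retraining and uses a small number of examples to approximate the representational changes that gradient updates would produce, steering model behavior efficiently at inference time.
🔍 Observations: Sample-efficient methods capture steering signals suboptimally, while methods that extract these signals better require hundreds to thousands of examples.
🛠️ Method: Based on one-step in-context learning dynamics, proposes (i) a unit kernel approximation that updates activations directly using gradients with respect to them, and (ii) a finite-difference approximation that needs only two forward passes, with no parameter updates.
📊 Data and experiments: Across a variety of steering tasks and benchmarks, COLD-Steer achieves up to 95% steering effectiveness while using 50 times fewer samples than the best baseline, and shows real-time adaptability on pluralistic alignment tasks.
⭐ Contributions: Proposes COLD-Steer, an efficient training-free steering framework that enables flexible, context-aware model control and opens new possibilities for adapting to diverse user preferences.
Abstract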
Activation steering methods enable inference-time control of large language model (LLM) behavior without retraining, but current approaches face a fundamental trade-off: sample-efficient methods suboptimally capture steering signals from labeled examples, while methods that better extract these signals require hundreds to thousands of examples. We introduce COLD-Steer, a training-free framework that steers LLM activations by approximating the representational changes that would result from gradient descent on in-context examples. Our key insight is that the effect of fine-tuning on a small set of examples can be efficiently approximated at inference time without actual parameter updates. We formalize this through two complementary approaches: (i) a unit kernel approximation method that updates the activations directly using gradients with respect to them, normalized across examples, and (ii) a finite-difference approximation requiring only two forward passes regardless of example count. Experiments across a variety of steering tasks and benchmarks demonstrate that COLD-Steer achieves up to 95\% steering effectiveness while using 50 times fewer samples compared to the best baseline. COLD-Steer enables real-time adaptation to new steering objectives and facilitates accommodating diverse perspectives without extensive demonstration data, which we validate through our experiments on pluralistic alignment tasks. Our framework opens new possibilities for adaptive, context-aware model control that can flexibly address varying loss-driven human preferences through principled approximation of learning dynamics rather than specialized training procedures.
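Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Instruction Fine-tuning #LLMs #Data Filtering #CPQS #Hidden States
🎯 Motivation: Instruction fine-tuning is central to improving LLM performance, but low-quality and redundant data limit its effectiveness. Recent work therefore filters for a small amount of high-quality data to train more efficiently.
❓ Problem addressed: Existing data filtering methods rely on predefined evaluation models or hand-designed metrics, which may not match the needs of the target model and so limit fine-tuning gains.
🔍 Observations: The hidden states of an LLM implicitly reflect data quality and can serve as representative features of how the model perceives the data.
🛠️ Method: Extracts hidden states from the target LLM, builds a data classification model on them, and uses its output, the Contrastive Perception Quality Score (CPQS), to filter fine-tuning data.
📊 Data and experiments: In the general domain, training on under 10% of the Alpaca_GPT4 and DeepSeek-R1 datasets outperforms both models trained on the full datasets and existing selection methods. On downstream tasks, the method delivers an average gain of more than 3.6% across multiple benchmarks.
⭐ Contributions: Proposes using the target LLM's hidden states to perceive data quality, and introduces the efficient CPQS data filtering algorithm, which improves performance in both general and downstream domains.
Abstract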
Instruction fine-tuning is a key technique for enhancing the performance of large language models (LLMs), but low-quality and redundant data often hinder its effectiveness. Recent studies suggest that filtering a small amount of high-quality data for instruction fine-tuning can achieve faster and more efficient training performance. However, existing data filtering approaches predominantly depend on predefined evaluation models or manually designed metrics, without leveraging information from the target LLM itself. This limitation may result in a mismatch between the filtering criteria and the actual requirements of the LLM being fine-tuned, thereby reducing the effectiveness of the fine-tuning process. To address these issues, we propose a novel perspective: the hidden states of LLMs implicitly reflect the quality of the training data. Based on this insight, we propose a novel data filtering method that extracts the hidden states that reflect the target LLM’s perception of the data as representative features, and builds a data classification model upon them, which outputs the Contrastive Perception Quality Score (CPQS) for dataset filtering. Our experiments are conducted in both general and downstream domains.
(1) In the general domain, our experiments show that training on under 10\% of the data from both the Alpaca\_GPT4 and DeepSeek-R1 synthesized reasoning datasets enables our method to outperform models trained on the complete datasets. Moreover, it surpasses the performance of current state-of-the-art data-selection techniques.
(2) In downstream tasks, our approach delivers an average performance gain exceeding 3.6\% over leading data-selection algorithms across multiple benchmarks, including GSM8K, HumanEval, and HumanEval-Plus.
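Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#LLM uncertainty #confidence calibration #verbalized confidence
🎯 Motivation: LLM outputs need trustworthy confidence estimates, but verbalized confidence scores are often miscalibrated and overconfident on low-accuracy instances, which harms user trust and safety.
❓ Problem addressed: Proposes a new method that mitigates the model's suggestibility bias and makes confidence estimates more reliable.
🔍 Observations: Experiments confirm that LLMs are markedly more suggestible when faced with claims they encode little information about, which makes overconfidence more likely on low-accuracy claims.
🛠️ Method: Distractor-Normalized Coherence (DINCO) has the model generate a set of distractors, verbalize its confidence on each independently, and normalize by the total verbalized confidence to compensate for suggestibility bias. It further combines this normalized validator confidence with a consistency-based estimate of generator confidence to improve calibration.
📊 Data and experiments: DINCO provides less saturated, more usable confidence estimates. Even with few samples it outperforms baselines such as self-consistency run with many more samples (DINCO at 10 runs versus self-consistency at 100).
⭐ Contributions: Proposes DINCO, which improves LLM confidence estimation along two dimensions, distractor normalization and consistency; shows that DINCO performs better with fewer resources; and releases code to support community research.
Abstract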
Calibrated confidence estimates are necessary for large language model (LLM) outputs to be trusted by human users. While LLMs can express their confidence in human-interpretable ways, verbalized LLM-generated confidence scores have empirically been found to be miscalibrated, reporting high confidence on instances with low accuracy and thereby harming trust and safety. We hypothesize that this overconfidence often stems from a given LLM’s heightened suggestibility when faced with claims that it encodes little information about; we empirically validate this hypothesis, finding more suggestibility on lower-accuracy claims. Building on this finding, we introduce Distractor-Normalized Coherence (DINCO), which estimates and accounts for an LLM’s suggestibility bias by having the model verbalize its confidence independently across several self-generated distractors (i.e. alternative claims), and normalizes by the total verbalized confidence. To further improve calibration, we leverage generator-validator disagreement, augmenting normalized validator confidence with a consistency-based estimate of generator confidence. Here, we frame the popular approach of self-consistency as leveraging coherence across sampled generations, and normalized verbalized confidence as leveraging coherence across validations on incompatible claims, allowing us to integrate these complementary dimensions of coherence into DINCO. Moreover, our analysis shows that DINCO provides less saturated – and therefore more usable – confidence estimates, and that further sampling alone cannot close the gap between DINCO and baselines, with DINCO at 10 runs outperforming self-consistency at 100. We release our code at https://github.com/victorwang37/dinco.
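The normalization step in this abstract can be shown with a small worked example. This is a sketch of the idea only, not the released DINCO code: it assumes the model has already verbalized a confidence in [0, 1] for the main claim and for each self-generated distractor, and it omits the generator-validator combination described afterward. The function name `normalized_confidence` is hypothetical.

```python
from typing import Sequence

def normalized_confidence(claim_conf: float, distractor_confs: Sequence[float]) -> float:
    """Divide the claim's verbalized confidence by the total over claim and distractors."""
    total = claim_conf + sum(distractor_confs)
    if total == 0.0:
        return 0.0  # no confidence assigned to any candidate
    return claim_conf / total

# A suggestible model endorses mutually incompatible claims at the same time.
suggestible = normalized_confidence(0.9, [0.8, 0.7, 0.6])
# A model that actually knows the answer rejects the distractors.
knowledgeable = normalized_confidence(0.9, [0.05, 0.05])
```

Both cases start from the same raw confidence of 0.9, but the normalized value drops sharply when the model also assigns high confidence to incompatible alternatives, which is the suggestibility signal the abstract describes.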
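Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Image Caption #Reinforcement learning #Large Vision Language Model
TL;DR: We present CapRL, an effective decoupled two-stage training scheme with verifiable caption reward to boost image captioning model.
🎯 Motivation: Image captioning is a fundamental task bridging vision and language and is critical for pre-training large vision-language models. Existing supervised fine-tuning methods depend on expensive annotated data, which limits generality and diversity.
❓ Problem addressed: To overcome the limitations of SFT, the paper applies reinforcement learning with verifiable rewards to open-ended image captioning for the first time. The core challenge is designing an objective reward function for the inherently subjective notion of a "good" caption.
🔍 Observations: SFT pushes models to memorize ground-truth answers, which limits generalization and the ability to produce diverse descriptions. Existing methods lack a scalable, objective evaluation mechanism.
🛠️ Method: CapRL defines caption quality through utility: a good caption should let a non-visual language model answer questions about the image accurately from the caption alone. It uses a decoupled two-stage pipeline in which an LVLM generates a caption and a text-only language model answers multiple-choice questions based on it, with the accuracy serving as the objective reward.
📊 Data and experiments: Pretraining on the CapRL-5M dataset annotated by CapRL-3B yields substantial gains across 12 benchmarks. Within the Prism evaluation framework, CapRL performs comparably to Qwen2.5-VL-72B and exceeds the baseline by 8.4% on average.
⭐ Contributions: The first application of RLVR to the subjective task of image captioning, moving beyond SFT's data dependence and memorization. The utility-based reward provides an objective measure of caption quality and improves the generality and accuracy of generated descriptions.
Abstract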
Image captioning is a fundamental task that bridges the visual and linguistic domains, playing a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable data annotated by humans or proprietary models. This approach often leads to models that memorize specific ground-truth answers, limiting their generality and ability to generate diverse, creative descriptions. To overcome the limitation of SFT, we propose applying the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm to the open-ended task of image captioning. A primary challenge, however, is designing an objective reward function for the inherently subjective nature of what constitutes a "good" caption. We introduce Captioning Reinforcement Learning (CapRL), a novel training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding image. CapRL employs a decoupled two-stage pipeline where an LVLM generates a caption, and the objective reward is derived from the accuracy of a separate, vision-free LLM answering Multiple-Choice Questions based solely on that caption. As the first study to apply RLVR to the subjective image captioning task, we demonstrate that CapRL yields significant gains across multiple settings. Pretraining on the CapRL-5M caption dataset annotated by CapRL-3B results in substantial gains across 12 benchmarks. Moreover, within the Prism Framework for caption quality evaluation, CapRL achieves performance comparable to Qwen2.5-VL-72B, while exceeding the baseline by an average margin of 8.4%. Results validate that our CapRL effectively trains models to produce more general and accurate image descriptions, moving beyond the limitations of traditional SFT-based image captioning models.
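The utility-based reward described in this abstract reduces to an accuracy computation, sketched below under stated assumptions. The `MCQ` type, `caption_reward`, and `keyword_answerer` are hypothetical names; in particular, `keyword_answerer` is a crude keyword matcher standing in for the vision-free LLM so that the example runs without any model.

```python
from typing import Callable, List, NamedTuple

class MCQ(NamedTuple):
    question: str
    options: List[str]
    answer: int  # index of the correct option

Answerer = Callable[[str, MCQ], int]

def caption_reward(caption: str, questions: List[MCQ], answer_fn: Answerer) -> float:
    """Reward = fraction of questions answered correctly from the caption alone."""
    if not questions:
        return 0.0
    correct = sum(1 for q in questions if answer_fn(caption, q) == q.answer)
    return correct / len(questions)

def keyword_answerer(caption: str, q: MCQ) -> int:
    """Toy stand-in for a text-only LLM: pick the first option mentioned in the caption."""
    lowered = caption.lower()
    for i, option in enumerate(q.options):
        if option.lower() in lowered:
            return i
    return -1  # abstain when the caption gives no evidence

questions = [
    MCQ("Which animal is shown?", ["cat", "dog"], 1),
    MCQ("What color is the ball?", ["red", "blue"], 0),
]
detailed = caption_reward("A dog chases a red ball.", questions, keyword_answerer)
vague = caption_reward("An animal plays outside.", questions, keyword_answerer)
```

The detailed caption earns the full reward while the vague one earns none, which illustrates how question-answering accuracy can act as a verifiable proxy for caption quality.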
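Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#large vision-language model #instruction-tuning #EEG #clinical
TL;DR: We present CerebraGloss, the first instruction-tuned LVLM for fine-grained clinical EEG analysis, enabled by a novel automated data generation pipeline and evaluated on our new comprehensive benchmark, CerebraGloss-Bench.
🎯 Motivation: Interpreting clinical electroencephalography (EEG) is labor-intensive and subjective, and existing computational methods are limited to narrow classification tasks rather than holistic interpretation. Applying large vision-language models (LVLMs) to this domain is held back by the lack of datasets pairing EEG visualizations with fine-grained, expert-level annotations.
❓ Problem addressed: The paper introduces CerebraGloss, the first instruction-tuned LVLM for fine-grained clinical EEG analysis, and tackles the key bottleneck of scarce high-quality paired data.
🔍 Observations: Applying existing LVLMs directly to open-ended EEG analysis is difficult: EEG signals are complex, and large-scale pairs of waveform visualizations with expert-level language descriptions are hard to obtain, which limits fine-grained, context-aware interpretation.
🛠️ Method: Two components: first, a novel automated data generation pipeline that uses a bespoke YOLO-based waveform detector to programmatically create large-scale EEG-text instruction data; second, instruction tuning of an LVLM on this data to produce CerebraGloss.
📊 Data and experiments: To evaluate this generative analysis capability, the authors build and release CerebraGloss-Bench, a comprehensive benchmark for open-ended EEG interpretation. CerebraGloss surpasses leading LVLMs, including GPT-5, on this benchmark and sets a new state of the art on the TUSZ seizure detection task.
⭐ Contributions: Introduces CerebraGloss, the first instruction-tuned LVLM for fine-grained, generative clinical EEG interpretation, along with the automated data generation pipeline, a comprehensive benchmark, and open-source models, benchmark, and tools.
Abstract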
Interpreting clinical electroencephalography (EEG) is a laborious, subjective process, and existing computational models are limited to narrow classification tasks rather than holistic interpretation. A key bottleneck for applying powerful Large Vision-Language Models (LVLMs) to this domain is the scarcity of datasets pairing EEG visualizations with fine-grained, expert-level annotations. We address this by introducing CerebraGloss, an instruction-tuned LVLM for nuanced EEG interpretation. We first introduce a novel, automated data generation pipeline, featuring a bespoke YOLO-based waveform detector, to programmatically create a large-scale corpus of EEG-text instruction data. Using this data, we develop CerebraGloss, the first model of its kind capable of unified, generative analysis—performing tasks from detailed waveform description to multi-turn, context-aware dialogue. To evaluate this new capability, we construct and release CerebraGloss-Bench, a comprehensive benchmark for open-ended EEG interpretation. CerebraGloss demonstrates strong performance, surpassing leading LVLMs, including proprietary models like GPT-5, on this benchmark and achieving a new state-of-the-art on the TUSZ seizure detection task. Models, benchmark and tools are available at https://github.com/iewug/CerebraGloss.
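Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Large Language Model #Post-training #Reinforcement Learning #Alignment
🎯 Motivation: Reinforcement fine-tuning often over-optimizes the reward signal and degrades output quality. The root cause is that the reward is poorly specified in the high-reward region, where excellent responses cannot be reliably distinguished from merely great ones.
❓ Problem addressed: Proposes rubric-based reward modeling that can make use of off-policy examples in the high-reward region without inheriting reward bias from their artifacts.
🔍 Observations: Theoretical analysis shows that reward misspecification at the high-reward tail is the key driver of reward over-optimization, and that such tail examples are scarce under the base LLM.
🛠️ Method: Designs rubric-based rewards that leverage off-policy examples while remaining insensitive to their artifacts, and elicits rubrics that capture the high-reward tail by distinguishing among great and diverse responses.
📊 Data and experiments: Experiments show that rubric-based rewards substantially mitigate reward over-optimization and improve LLM post-training.
⭐ Contributions: Proposes a rubric-based reward method that resolves differences in the high-reward tail, and validates its effectiveness for improving alignment and fine-tuning.
Abstract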
Reinforcement fine-tuning (RFT) often suffers from reward over-optimization, where a policy model hacks the reward signals to achieve high scores while producing low-quality outputs. Our theoretical analysis shows that the key lies in reward misspecification at the high-reward tail: the inability to reliably distinguish excellent responses from merely great ones. This motivates us to focus on the high-reward region. However, such tail examples are scarce under the base LLM. While off-policy exemplars (e.g. from stronger models or rewrites) are easier to obtain, naively training on them yields a misspecified reward for the policy we aim to align. To address this, we study rubric-based rewards. By design, rubrics can leverage off-policy examples while remaining insensitive to their artifacts. To elicit rubrics that capture the high-reward tail, we highlight the importance of distinguishing among great and diverse responses, and introduce a workflow to implement this idea. We empirically demonstrate that rubric-based rewards substantially mitigate reward over-optimization and deliver effective LLM post-training improvements.
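Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#in-context learning #supervised fine-tuning #inductive biases #learning dynamics
🎯 Motivation: Prior work reports differences in generalization and inductive biases between in-context learning and supervised fine-tuning in language models, but the mechanisms behind these differences are poorly understood.
❓ Problem addressed: Studies the learning dynamics of in-context learning (ICL) and supervised fine-tuning (SFT) as two distinct learning algorithms, and analyzes how they shape inductive biases and the evolution of internal representations.
🔍 Observations: ICL preserves rich input representations but imposes stronger priors inherited from pretraining, whereas SFT suppresses task-irrelevant features, which may explain its weaker generalization in few-shot regimes.
🛠️ Method: Compares the learning dynamics of ICL and SFT across medium-sized language models, analyzing the evolution of inductive biases and changes in internal representations.
📊 Data and experiments: Experiments on several medium-sized language models assess, under each learning algorithm, how well input representations are preserved, how inductive biases change, and how strongly task-irrelevant features are suppressed.
⭐ Contributions: Reveals a mechanistic distinction between context-driven and weight-driven learning, and clarifies where their differences in generalization and few-shot performance come from.
Abstract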
Pretrained language models can acquire novel tasks either through in-context learning (ICL)---adapting behavior via activations without weight updates---or through supervised fine-tuning (SFT), where parameters are explicitly updated. Prior work has reported differences in their generalization performance and inductive biases, but the origins of these differences remain poorly understood. In this work, we treat ICL and SFT as distinct learning algorithms and directly compare the learning dynamics they induce across medium-sized models, analyzing both the evolution of their inductive biases and the underlying internal representations. We find that ICL preserves rich input representations but imposes stronger priors inherited from pretraining, whereas SFT suppresses task-irrelevant features---potentially explaining its weaker generalization in few-shot regimes. These results highlight a mechanistic distinction between context-driven and weight-driven learning.
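Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#language models #reinforcement learning
🎯 Motivation: RL-based reward mechanisms for language models are effective, but hand-crafted reward or advantage shaping can introduce bias and lead to failure, so a more robust estimation method is needed.
❓ Problem addressed: Avoids the heavy reliance on directional preferences in conventional reward shaping by estimating the effect of metric trends without presuming a direction, in order to improve LLM reasoning.
🔍 Observations: Training metrics such as entropy and response length correlate with reasoning behavior, but hand-set assumptions about which direction is better can introduce bias.
🛠️ Method: CANON regroups sampled responses into two groups by higher or lower value of a target metric, uses inter-group comparison to measure which metric trend contributes to better performance, and identifies the better responses within each group.
📊 Data and experiments: Experiments on math reasoning and high-complexity logic tasks with three LLMs evaluate how conditioning on entropy and response length affects reasoning ability and the performance-cost trade-off.
⭐ Contributions: Proposes a conditional advantage estimation method that needs no directional assumption, improves reasoning ability, and yields a more favorable balance between performance and cost.
Abstract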
Reinforcement Learning with Verifiable Rewards (RLVR) for large language models (LLMs) has achieved remarkable progress in enhancing LLMs’ reasoning capabilities on tasks with clear correctness criteria, such as mathematical reasoning tasks. Several training metrics, such as entropy or response length, have been observed to correlate with different reasoning behaviors in reinforcement learning. Prior approaches incorporate such priors through reward or advantage shaping, which often relies on hand-crafted penalties and preferences (e.g., higher-is-better or lower-is-better). However, without careful hyper-parameter tuning, these directional priors can be overly biased and may lead to failure. To this end, we introduce **C**onditional adv**AN**tage estimati**ON** (**CANON**), amplifying the impact of the target metric without presuming its direction. Specifically, *CANON* regroups the sampled responses into two groups based on the higher or lower value of a target metric, measures which metric trend contributes to better performance through inter-group comparison, and identifies the better response within the same group. In summary, *CANON* based on entropy consistently outperforms prior methods across three LLMs on both math reasoning and high-complexity logic tasks. When applied to response length, *CANON* further improves token efficiency, yielding a more favorable Pareto frontier in the performance–cost trade-off.
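Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#robustness #safeguards
TL;DR: We introduce enhanced Constitutional Classifiers that deliver production-grade jailbreak robustness with dramatically reduced computational costs and refusal rates compared to previous-generation defenses.
🎯 Motivation: Defending against universal jailbreaks calls for efficient, production-grade safeguards deployed alongside large language models.
❓ Problem addressed: Substantially reducing computational cost and refusal rate while strengthening jailbreak robustness, addressing vulnerabilities in previous-generation systems.
🔍 Observations: Previous-generation defenses examine outputs in isolation, which misses vulnerabilities that only appear in the conversational context, and they are computationally expensive.
🛠️ Method: Exchange classifiers that evaluate responses in their full conversational context; a two-stage cascade in which lightweight classifiers screen all traffic and escalate only suspicious exchanges to more expensive classifiers; and efficient linear probe classifiers ensembled with external classifiers.
📊 Data and experiments: The system achieves a 40x reduction in computational cost relative to the baseline exchange classifier while maintaining a 0.05% refusal rate on production traffic. Over 1,700 hours of red-teaming showed strong protection against universal jailbreaks.
⭐ Contributions: Establishes an efficient production-grade defense system that offers practical protection for large language models.
Abstract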
We introduce enhanced Constitutional Classifiers that deliver production-grade jailbreak robustness with dramatically reduced computational costs and refusal rates compared to previous-generation defenses. Our system combines several key insights. First, we develop exchange classifiers that evaluate model responses in their full conversational context, which addresses vulnerabilities in last-generation systems that examine outputs in isolation. Second, we implement a two-stage classifier cascade where lightweight classifiers screen all traffic and escalate only suspicious exchanges to more expensive classifiers. Third, we train efficient linear probe classifiers and ensemble them with external classifiers to simultaneously improve robustness and reduce computational costs. Together, these techniques yield a production-grade system achieving a 40x computational cost reduction compared to our baseline exchange classifier, while maintaining a 0.05% refusal rate on production traffic. Through extensive red-teaming comprising over 1,700 hours, we demonstrate strong protection against universal jailbreaks---no attack on this system successfully elicited responses to all eight target queries comparable in detail to an undefended model. Our work establishes Constitutional Classifiers as practical and efficient safeguards for large language models.
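The two-stage cascade described in this abstract can be sketched as a simple control-flow pattern. Everything concrete here is an assumption for illustration: the thresholds `escalate_at` and `block_at`, the function name `cascade`, and the keyword-based toy scorers, which merely stand in for the lightweight and expensive classifiers.

```python
from typing import Callable, Tuple

Scorer = Callable[[str], float]  # maps an exchange to a harm score in [0, 1]

def cascade(exchange: str, cheap: Scorer, expensive: Scorer,
            escalate_at: float = 0.2, block_at: float = 0.5) -> Tuple[bool, bool]:
    """Return (blocked, escalated). The expensive scorer runs only on escalation."""
    if cheap(exchange) < escalate_at:
        return False, False  # cleared by the lightweight stage alone
    return expensive(exchange) >= block_at, True

# Toy scorers keyed on placeholder tokens; real classifiers would be learned models.
def toy_cheap(exchange: str) -> float:
    return 0.9 if "suspicious" in exchange else 0.05

def toy_expensive(exchange: str) -> float:
    return 0.95 if "harmful" in exchange else 0.1

benign = cascade("hello there", toy_cheap, toy_expensive)
false_alarm = cascade("suspicious but fine", toy_cheap, toy_expensive)
blocked = cascade("suspicious and harmful", toy_cheap, toy_expensive)
```

Under this pattern the average cost per exchange is roughly the cheap cost plus the escalation rate times the expensive cost, so savings depend on how rarely the first stage escalates.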
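Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Multi-Modal Reasoning #Reinforcement Learning from Verifiable Rewards
🎯 Motivation: Reinforcement Learning with Verifiable Rewards (RLVR) is a primary paradigm for improving the reasoning of multi-modal large language models (MLLMs), but the enormous state space and sparse rewards easily cause policy degradation or over-exploitation of suboptimal behaviors, so a controllable exploration strategy is needed.
❓ Problem addressed: Targets inefficient exploration, policy entropy collapse, and distributional mismatch in RLVR training by proposing a hybrid-policy RLVR framework with controllable exploration that balances exploration and exploitation and stabilizes training.
🔍 Observations: In RLVR training, the large state space of MLLMs and sparse rewards tend to trigger entropy collapse, policy degradation, or over-exploitation of suboptimal behaviors, while uncontrolled random sampling makes exploration inefficient.
🛠️ Method: CalibRL uses distribution-aware advantage weighting, which scales updates by group rareness to calibrate the distribution and preserve exploration, and an asymmetric activation function (LeakyReLU) that uses expert knowledge as a calibration baseline. Together these raise policy entropy in a guided manner and clarify the target distribution.
📊 Data and experiments: Extensive experiments on eight benchmarks, covering in-domain and out-of-domain settings, show consistent improvements and validate controllable hybrid-policy RLVR training.
⭐ Contributions: Designs a hybrid-policy RLVR framework with expert-guided controllable exploration. Distribution calibration and the integration of expert knowledge alleviate the mismatch between the policy and expert trajectories, giving a more stable balance between exploration and exploitation.
Abstract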
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a primary learning paradigm for enhancing the reasoning capabilities of multi-modal large language models (MLLMs). However, during RL training, the enormous state space of MLLMs and sparse rewards often lead to entropy collapse, policy degradation, or over-exploitation of suboptimal behaviors. This necessitates an exploration strategy that maintains productive stochasticity while avoiding the drawbacks of uncontrolled random sampling, which yields inefficient exploration. In this paper, we propose CalibRL, a hybrid-policy RLVR framework that supports controllable exploration with expert guidance, enabled by two key mechanisms. First, a distribution-aware advantage weighting scales updates by group rareness to calibrate the distribution, thereby preserving exploration. Meanwhile, the asymmetric activation function (LeakyReLU) leverages the expert knowledge as a calibration baseline to moderate overconfident updates while preserving their corrective direction. CalibRL increases policy entropy in a guided manner and clarifies the target distribution by estimating the on-policy distribution through online sampling. Updates are driven by these informative behaviors, avoiding convergence to erroneous patterns. Importantly, these designs help alleviate the distributional mismatch between the model’s policy and expert trajectories, thereby achieving a more stable balance between exploration and exploitation. Extensive experiments across eight benchmarks, including both in-domain and out-of-domain settings, demonstrate consistent improvements, validating the effectiveness of our controllable hybrid-policy RLVR training. Code is available at https://github.com/zhh6425/CalibRL.
Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Hallucination #Context Learning #Contextual Faithfulness #Knowledge Conflict #Model Interpretability
TL;DR:We propose Copy-Paste, a paradigm embedding contextual fragments for faithfulness, instantiated as CopyPasteLLM—achieving 12.2-24.5% accuracy gains with only 365 samples (1/50th of baseline) by recalibrating parametric knowledge.
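🎯 Motivation: Retrieval-augmented generation lets LLMs produce contextually grounded answers, but contextual faithfulness remains a challenge: models may not consistently trust the provided context, which leads to hallucinations and undermines reliability.
❓ Problem: Building on the observation that a higher degree of copying from the context reduces hallucinations, the work proposes a new generation paradigm to strengthen contextual faithfulness.
🔍 Analysis: On RAGTruth, the copying degree of a response is inversely correlated with context-unfaithful hallucinations; responses that copy more are more likely to stay faithful to the context.
🛠️ Method: Proposes the Copy-Paste generation paradigm, instantiated as CopyPasteLLM via two-stage high-copying preference training. Three prompting methods raise copying degree, and an automated pipeline turns generated responses into high-copying preference data.
📊 Data & Experiments: On FaithEval, ConFiQA, and PubMedQA, CopyPasteLLM achieves the best performance in both counterfactual and original contexts, with 12.2% to 24.5% accuracy gains on FaithEval over the best baseline while using only 365 training samples.
⭐ Contributions: A new paradigm for reducing hallucinations that improves contextual faithfulness by recalibrating the model's reliance on internal parametric knowledge, while sharply reducing the training data required.
Abstract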
While Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to generate contextually grounded responses, contextual faithfulness remains challenging as LLMs may not consistently trust provided context, leading to hallucinations that undermine reliability. We observe an inverse correlation between response copying degree and context-unfaithful hallucinations on RAGTruth, suggesting higher copying degrees reduce hallucinations by fostering genuine contextual belief. We propose Copy-Paste, a generation paradigm that directly embeds contextual fragments to ensure faithfulness, and instantiate it through CopyPasteLLM via two-stage high-copying preference training. We design three prompting methods to enhance copying degree, demonstrating that high-copying responses achieve superior contextual faithfulness and hallucination control. These approaches enable a fully automated pipeline that transforms generated responses into high-copying preference data for training CopyPasteLLM. On FaithEval, ConFiQA and PubMedQA, CopyPasteLLM achieves best performance in both counterfactual and original contexts, remarkably with 12.2\% to 24.5\% accuracy improvements on FaithEval over the best baseline, while requiring only 365 training samples, 1/50th of baseline data. To elucidate CopyPasteLLM's effectiveness, we propose the Context-Parameter Copying Capturing algorithm. Interestingly, this reveals that CopyPasteLLM recalibrates reliance on internal parametric knowledge rather than external knowledge during generation. All code is available at https://github.com/longyongchao/CopyPasteLLM
Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Privacy #Large Language Models #Document Privatization
TL;DR:LLMs anonymize text but remain vulnerable to membership inference. Differential privacy protects but degrades quality. We introduce a token-wise distribution-fusion algorithm for DP-LLM inference while preserving text utility.
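🎯 Motivation: LLMs can leak sensitive information from their context at inference time, which poses a privacy challenge; existing methods either lack provable guarantees or offer a poor privacy/utility trade-off.
❓ Problem: Proposes a Differentially Private Inference (DPI) mechanism that protects privacy during inference while improving the balance between privacy and generated-text quality.
🔍 Analysis: When documents contain sensitive information such as personally identifiable information, existing privacy-preserving methods struggle to resist inference attacks; what is needed is a mechanism that provably bounds the influence of designated tokens on the output.
🛠️ Method: DP-Fusion labels sensitive tokens, runs the LLM without them to obtain a baseline distribution and with them to obtain a second distribution, then blends the two so the output stays within a bounded distance of the baseline, yielding privatized yet high-quality documents.
📊 Data & Experiments: DP-Fusion substantially improves theoretical and empirical privacy, achieves 6× lower perplexity than related DPI methods, and lets $\epsilon$ tune the privacy/quality trade-off.
⭐ Contributions: A token-level differentially private inference algorithm that gives provable privacy guarantees for document privatization and improves text quality and privacy protection together.
Abstract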
Large language models (LLMs) do not preserve privacy at inference-time. The LLM's outputs can inadvertently reveal information about the model's context, which presents a privacy challenge when the LLM is augmented via tools or databases containing sensitive information. Existing privacy-preserving methods at inference-time have significant limitations since they (i) lack provable guarantees or (ii) have a poor utility/privacy trade-off. We propose DP-Fusion, a Differentially Private Inference (DPI) mechanism for LLMs that provably bounds the influence a set of tokens in the context can have on the LLM's output. DP-Fusion works as follows: (1) label a subset of sensitive tokens, (2) infer the LLM without any sensitive tokens to obtain a baseline, (3) infer the LLM with the sensitive tokens, and (4) blend distributions so that the final output remains within a bounded distance of the baseline distribution. While this per-token influence bound also mitigates jailbreak-style prompt injection, we focus on document privatization, where the goal is to paraphrase a document containing sensitive tokens, e.g., personally identifiable information, so that no attacker can reliably infer them from the paraphrased document while preserving high text quality. The privacy/utility trade-off is controlled by $\epsilon$, where $\epsilon=0$ hides sensitive tokens entirely, while higher values trade off privacy for improved text quality. We show that our method creates token-level provably privatized documents with substantially improved theoretical and empirical privacy, achieving $6\times$ lower perplexity than related DPI methods.
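Step (4) of the procedure above, the blending step, can be sketched as follows. This is a minimal illustration under assumptions the abstract does not state: distributions are plain lists of strictly positive probabilities, the "distance" is the largest absolute log-ratio to the baseline, and the mixing weight is found by bisection. The names `max_log_ratio` and `blend_within_bound` are hypothetical, and the paper's actual divergence and search procedure may differ.

```python
import math

def max_log_ratio(p, q):
    """Largest |log(p_i / q_i)| over tokens (an assumed distance measure).
    Both distributions must be strictly positive, as softmax outputs are."""
    return max(abs(math.log(pi / qi)) for pi, qi in zip(p, q))

def blend_within_bound(baseline, sensitive, bound, iters=50):
    """Return (lam, mixed) where mixed = lam * sensitive + (1 - lam) * baseline
    and lam is (approximately) the largest weight keeping mixed within `bound`."""
    def mix(lam):
        return [lam * s + (1 - lam) * b for s, b in zip(sensitive, baseline)]

    # If the sensitive distribution already satisfies the bound, use it as is.
    if max_log_ratio(mix(1.0), baseline) <= bound:
        return 1.0, mix(1.0)

    # The distance grows monotonically with lam, so bisection is valid.
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if max_log_ratio(mix(mid), baseline) <= bound:
            lo = mid
        else:
            hi = mid
    return lo, mix(lo)
```

With `bound=0` the weight collapses to zero and the output equals the baseline, which mirrors the statement that $\epsilon=0$ hides sensitive tokens entirely; larger bounds let more of the sensitive distribution through.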
Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#creativity #creative writing #evaluation #creativity evaluation #machine creativity #n-gram novelty
TL;DR:Study with expert writers cautions against using n-gram novelty for creativity evaluation. Open-source LLMs tend to sound less pragmatic as n-gram novelty increases. Evaluation of close reading skills of frontier and fine-tuned LLMs.
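🎯 Motivation: Examines the limits of n-gram novelty as a measure of textual creativity, given creativity's dual nature: novelty and appropriateness.
❓ Problem: Argues that n-gram novelty alone cannot fully measure creativity, because it does not capture whether text is both original and appropriate.
🔍 Analysis: Expert annotations show that n-gram novelty is positively associated with judged creativity, yet about 91% of top-quartile n-gram-novel expressions are not judged creative; in open-source LLMs, higher n-gram novelty correlates with lower pragmaticality.
🛠️ Method: Expert writers annotate human- and AI-generated text through close reading, rating novelty, pragmaticality, and sensicality. The study relates these ratings to n-gram novelty and tests whether models can identify novel or non-pragmatic expressions.
📊 Data & Experiments: Uses 8,618 expert writer annotations to evaluate zero-shot, few-shot, and fine-tuned models, and examines the creative output of both open-source and frontier closed-source LLMs.
⭐ Contributions: Shows that creativity evaluation must consider both novelty and appropriateness, finds that LLM-as-a-Judge novelty ratings align with expert preferences better than an n-gram-based metric on an out-of-distribution dataset, and offers guidance for future model development and evaluation.
Abstract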
$N$-gram novelty is widely used to evaluate language models' ability to generate text outside of their training data. More recently, it has also been adopted as a metric for measuring textual creativity. However, theoretical work on creativity suggests that this approach may be inadequate, as it does not account for creativity's dual nature: novelty (how original the text is) and appropriateness (how sensical and pragmatic it is). We investigate the relationship between this notion of creativity and $n$-gram novelty through 8,618 expert writer annotations of novelty, pragmaticality, and sensicality via \emph{close reading} of human- and AI-generated text. We find that while $n$-gram novelty is positively associated with expert writer-judged creativity, approximately 91% of top-quartile $n$-gram novel expressions are not judged as creative, cautioning against relying on $n$-gram novelty alone. Furthermore, unlike in human-written text, higher $n$-gram novelty in open-source LLMs correlates with lower pragmaticality. In an exploratory study with frontier closed-source models, we additionally confirm that they are less likely to produce creative expressions than humans. Using our dataset, we test whether zero-shot, few-shot, and finetuned models are able to identify expressions perceived as novel by experts (a positive aspect of writing) or non-pragmatic (a negative aspect). Overall, frontier LLMs exhibit performance much higher than random but leave room for improvement, especially struggling to identify non-pragmatic expressions. We further find that LLM-as-a-Judge novelty ratings align with expert writer preferences in an out-of-distribution dataset, more so than an n-gram based metric.
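For readers unfamiliar with the metric under scrutiny, here is a minimal sketch of one common way to compute $n$-gram novelty: the fraction of a text's $n$-grams that never appear in a reference corpus. Whitespace tokenization, lowercasing, and the function names are assumptions made for illustration; the paper's exact definition may differ.

```python
def ngrams(tokens, n):
    """All contiguous n-token windows, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_novelty(text, corpus, n=3):
    """Fraction of the text's n-grams that do not occur in any corpus document."""
    seen = set()
    for doc in corpus:
        seen.update(ngrams(doc.lower().split(), n))
    grams = ngrams(text.lower().split(), n)
    if not grams:
        return 0.0  # text shorter than n tokens: nothing to score
    return sum(g not in seen for g in grams) / len(grams)
```

The score only counts unseen word sequences. It carries no signal about whether a novel sequence is sensical or pragmatic, which is the gap the expert annotations are designed to expose.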
Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Finetuning #Parameter-Efficient #LLM #Diagonal Block
TL;DR:Fine-tuning only the diagonal blocks of weights yields superior performance
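🎯 Motivation: Adapting LLMs to specific downstream tasks requires fine-tuning, but full-model fine-tuning is expensive, which motivates parameter-efficient fine-tuning (PEFT) methods that cut compute and memory costs.
❓ Problem: Existing PEFT methods still lag behind full-model fine-tuning; this work aims to narrow that gap while retaining efficiency and stability.
🔍 Analysis: Fine-tuning only the diagonal blocks of weight matrices converges stably and performs well, indicating that updating the full weight matrix is unnecessary.
🛠️ Method: DiaBlo fine-tunes only the diagonal blocks of selected weight matrices. This avoids low-rank matrix products and the auxiliary initialization or customized optimization strategies they rely on, improving convergence stability and expressiveness.
📊 Data & Experiments: Experiments on commonsense reasoning, arithmetic reasoning, code generation, and safety alignment show that fine-tuning only diagonal blocks gives strong and consistent performance, with memory efficiency and training speed comparable to LoRA.
⭐ Contributions: Proposes DiaBlo, a diagonal-block PEFT method; provides theoretical guarantees that, under mild low-rank conditions, it is more expressive than LoRA in the linear problem; and validates its efficiency and performance across tasks.
Abstract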
Fine-tuning is a critical step for adapting large language models (LLMs) to domain-specific downstream tasks. To mitigate the substantial computational and memory costs of full-model fine-tuning, Parameter-Efficient Fine-Tuning (PEFT) methods have been proposed to update only a small subset of model parameters. However, performance gaps between PEFT approaches and full-model fine-tuning still exist. In this work, we present *DiaBlo*, a simple yet effective PEFT approach that updates only the diagonal blocks of selected model weight matrices. Unlike Low-Rank Adaptation (LoRA) and its variants, DiaBlo eliminates the need for low-rank matrix products, thereby avoiding the reliance on auxiliary initialization schemes or customized optimization strategies to improve convergence. This design leads to stable and robust convergence while maintaining comparable memory efficiency and training speed to LoRA. Moreover, we provide theoretical guarantees showing that, under mild low-rank conditions, DiaBlo is more expressive than LoRA in the linear problem and converges to a stationary point of the general nonlinear full fine-tuning. Through extensive experiments across a range of tasks—including commonsense reasoning, arithmetic reasoning, code generation, and safety alignment—we show that fine-tuning only diagonal blocks is sufficient for strong and consistent performance. DiaBlo not only achieves competitive accuracy but also preserves high memory efficiency and fast fine-tuning speed. Codes are available at https://github.com/ziyangjoy/DiaBlo.
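To make the idea of a diagonal-block update concrete, the sketch below applies a block-diagonal matrix to a row vector and compares trainable-parameter counts with LoRA. It is an illustration only: the number of blocks, the LoRA rank, and all function names are assumptions, and nothing here reproduces the authors' implementation.

```python
def blockdiag_apply(x, blocks):
    """Compute x @ blockdiag(blocks) for a row vector x, touching only the blocks.
    Each block is a list of rows; off-diagonal entries are implicitly zero."""
    out, start = [], 0
    for block in blocks:
        rows, cols = len(block), len(block[0])
        chunk = x[start:start + rows]
        out.extend(sum(chunk[i] * block[i][j] for i in range(rows))
                   for j in range(cols))
        start += rows
    return out

def diablo_param_count(d_in, d_out, num_blocks):
    """Trainable entries when only num_blocks diagonal blocks are updated."""
    return num_blocks * (d_in // num_blocks) * (d_out // num_blocks)

def lora_param_count(d_in, d_out, rank):
    """Trainable entries of a rank-`rank` LoRA adapter (two thin matrices)."""
    return rank * (d_in + d_out)
```

For a hypothetical 4096 × 4096 layer, 64 diagonal blocks and a rank-32 LoRA adapter both come to 262,144 trainable parameters, so the two can be compared at a matched budget. The adapted layer output would be the frozen layer's output plus `blockdiag_apply(x, blocks)`.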
Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Deficiency Diagnosis #Data Synthesis #LLMs Reasoning
TL;DR:Diagnose the knowledge deficiencies of LLMs and remedy them with a novel approach.
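🎯 Motivation: LLMs generalize impressively but still make reasoning mistakes that hurt reliability and trustworthiness; comprehensively diagnosing and remedying their knowledge deficiencies is a key challenge.
❓ Problem: How to diagnose and remedy LLM knowledge deficiencies, and thereby improve reasoning, in a label-free setting.
🔍 Analysis: Reasoning mistakes often stem from knowledge deficiencies. Limited labeled samples make comprehensive evaluation difficult, and sufficient, effective user feedback is costly to obtain.
🛠️ Method: LaMer uses relative entropy to quantify knowledge deficiencies without labels, adaptively synthesizes augmentation data according to deficiency severity, and remedies deficiencies progressively with a curricular strategy.
📊 Data & Experiments: On seven OOD reasoning benchmarks, LaMer matches baselines with only 40% of the training data and even surpasses methods that rely on labeled data for deficiency diagnosis.
⭐ Contributions: A label-free method for diagnosing and remedying knowledge deficiencies that substantially reduces training-data requirements and offers a diagnostic tool for efficient LLM development.
Abstract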
Large Language Models (LLMs) have demonstrated impressive generalization ability by learning from extensive unlabeled text. However, they still exhibit reasoning mistakes, which can affect their trustworthiness and reliability. Although users can interact with LLMs and provide diverse and comprehensive queries to expose the flaws of LLMs, obtaining sufficient and effective feedback is demanding. Furthermore, comprehensively evaluating LLMs with limited labeled samples is difficult. These make it a challenge to diagnose and remedy the deficiencies in LLMs through rich label-free user queries. To tackle this challenge, and considering that LLMs' reasoning mistakes often stem from knowledge deficiencies, we propose label-free curricular meaningful learning (LaMer), which first employs relative entropy to diagnose and quantify knowledge deficiencies of LLMs in a label-free setting. Then, LaMer adaptively synthesizes augmentation data based on deficiency severity and progressively remedies them with a curricular remedy strategy. Experiments show that LaMer effectively diagnoses and remedies knowledge deficiencies in LLMs, improving various LLMs across seven out-of-distribution (OOD) reasoning benchmarks, achieving comparable results to baselines with only 40% training data. LaMer even surpasses methods that rely on labeled data for deficiency diagnosis. In application, LaMer offers a diagnostic tool for efficient LLM development.
Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Large Language Models #Reasoning Abilities #Supervised Fine-Tuning
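🎯 Motivation: Reasoning in LLMs requires explicit derivations. Vanilla supervised fine-tuning across diverse reasoning datasets can introduce conflicts or degrade performance, so a better approach is needed.
❓ Problem: Proposes a fine-tuning framework that mitigates conflicts among tasks and improves multi-task reasoning while matching or exceeding single-dataset SFT performance.
🔍 Analysis: Comparing parameter variations between reasoning-fine-tuned and base models shows that each reasoning capability relies on exclusive parameters, whereas parameters that overlap across tasks can bring either benefits or conflicts.
🛠️ Method: Updates exclusive and overlapped parameters differentially according to the specific combination of reasoning tasks, avoiding unnecessary conflicts while keeping the benefits.
📊 Data & Experiments: Mix-up and continual SFT experiments on several LLMs (Llama3-8B, Mistral-7B, and Qwen2.5-14B) and diverse reasoning tasks show consistent improvements.
⭐ Contributions: A new SFT strategy that improves multi-task reasoning with fewer conflicts, together with an analysis of how exclusive and overlapped parameters affect reasoning tasks.
Abstract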
Compared to general question answering, the reasoning abilities of large language models (LLMs) require explicit derivations, and supervised fine-tuning (SFT) can strengthen multiple reasoning abilities in LLMs by learning from various datasets. However, neither training on the datasets jointly (mix-up) nor continually maintains the performance of single-dataset SFT: results are sometimes better and sometimes worse, illustrating that vanilla SFT can both facilitate reasoning abilities and introduce conflicts. In this paper, we propose a novel framework to mitigate the conflicts and preserve benefits among different reasoning tasks, and even surpass each task's single dataset SFT performance. We start by exploring the differences between reasoning fine-tuned and base LLMs by analyzing their parameter variations during model inference, and we discover that each reasoning capability has exclusive parameters that benefit it more evidently than others. In contrast, the overlapped parameters of tasks can bring benefits or conflicts. Inspired by the findings, we propose to update the exclusive and overlapped parameters according to specific reasoning task combinations differentially, thereby avoiding unnecessary conflicts while maintaining benefits. Consistent improvements in mix-up and continual SFT experiments demonstrate that the proposed SFT strategy can achieve better performance on various LLMs (Llama3-8B, Mistral-7B, and Qwen2.5-14B) and diverse reasoning tasks with fewer conflicts, showing the superiority and generality of our analysis findings and the proposed approach.
Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#text diffusion model; diffusion large language model; code generation
TL;DR:We introduce DiffuCoder 7B, show that higher temperature diversifies both token choices and generation order to aid RL, and propose coupled-GRPO, a diffusion-native RL method that avoids semi-AR and improves EvalPlus by 4.4%.
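🎯 Motivation: Diffusion LLMs (dLLMs) are promising for code generation because of their global planning and iterative refinement, but their training and inference mechanisms remain under-explored.
❓ Problem: Characterizes how dLLM decoding differs from autoregressive (AR) models and develops an RL method suited to diffusion models to improve code generation.
🔍 Analysis: dLLMs can decide how causal their generation should be without relying on semi-AR decoding. Raising the sampling temperature diversifies both token choices and generation order, which gives RL rollouts a richer search space.
🛠️ Method: Proposes coupled-GRPO, a sampling scheme that constructs complementary mask noise for the completions used in training, reducing the variance of token log-likelihood estimates while keeping training efficient.
📊 Data & Experiments: Trains the 7B DiffuCoder on 130B tokens of code; coupled-GRPO yields +4.4% on the EvalPlus code generation benchmark.
⭐ Contributions: Deeper insight into dLLM generation and a diffusion-native RL framework, coupled-GRPO, that reduces reliance on AR bias during decoding and improves code generation.
Abstract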
Diffusion large language models (dLLMs) are compelling alternatives to autoregressive (AR) models because their denoising models operate over the entire sequence. The global planning and iterative refinement features of dLLMs are particularly useful for code generation. However, current training and inference mechanisms for dLLMs in coding are still under-explored. To demystify the decoding behavior of dLLMs and unlock their potential for coding, we systematically investigate their denoising processes and reinforcement learning (RL) methods. We train a 7B dLLM, DiffuCoder, on 130B tokens of code. Using this model as a testbed, we analyze its decoding behavior, revealing how it differs from that of AR models: (1) dLLMs can decide how causal their generation should be without relying on semi-AR decoding, and (2) increasing the sampling temperature diversifies not only token choices but also their generation order. This diversity creates a rich search space for RL rollouts. For RL training, to reduce the variance of token log-likelihood estimates and maintain training efficiency, we propose coupled-GRPO, a novel sampling scheme that constructs complementary mask noise for completions used in training. In our experiments, coupled-GRPO significantly improves DiffuCoder's performance on code generation benchmarks (+4.4\% on EvalPlus) and reduces reliance on AR bias during decoding. Our work provides deeper insight into the machinery of dLLM generation and offers an effective, diffusion-native RL training framework.
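The phrase "complementary mask noise" can be illustrated with a small sketch: draw one random mask over a completion, then pair it with its complement so that every token position is masked in exactly one of the two. How coupled-GRPO actually samples masks, and the mask ratio used, are not given in the abstract; the function name, the `mask_ratio` parameter, and the uniform sampling below are assumptions.

```python
import random

def complementary_masks(length, mask_ratio=0.5, seed=None):
    """Return two boolean masks over `length` positions (True = masked).
    The second mask is the complement of the first, so each position is
    masked in exactly one of the pair."""
    rng = random.Random(seed)
    k = round(length * mask_ratio)
    masked = set(rng.sample(range(length), k))
    first = [i in masked for i in range(length)]
    second = [not m for m in first]
    return first, second
```

Under this pairing, each token's log-likelihood is estimated once from a context in which it is hidden, rather than depending on whether a single random mask happened to cover it.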
Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Large Language Model #LLM Safety #Over-refusal #Safety Alignment
TL;DR:This paper introduces Discernment via Contrastive Refinement (DCR), a two-stage safety alignment method that uses contrastive learning to reduce over-refusal in LLMs while preserving safety and general abilities.
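🎯 Motivation: Safety-aligned LLMs often over-refuse, which undermines helpfulness and usability in sensitive or nuanced contexts.
❓ Problem: Proposes an alignment method that reduces over-refusal while preserving the model's ability to reject genuinely harmful content and its general capabilities.
🔍 Analysis: Over-refusal is attributed to the ambiguous influence of toxic and seemingly toxic prompts on the model's learning dynamics, which leads to misclassification.
🛠️ Method: A two-stage alignment strategy in which a preceding stage, DCR, uses contrastive refinement to sharpen the model's ability to distinguish truly toxic prompts from superficially toxic ones.
📊 Data & Experiments: Evaluation across diverse benchmarks shows reduced over-refusal with safety benefits preserved and only minimal degradation of general capabilities.
⭐ Contributions: A contrastive-learning framework that reduces over-refusal, offering a more principled and robust direction for safety alignment.
Abstract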
Large language models (LLMs) aligned for safety often suffer from over-refusal—the tendency to reject seemingly toxic or benign prompts by misclassifying them as toxic. This behavior undermines models' helpfulness and restricts usability in sensitive or nuanced contexts. While prior work has proposed mitigation strategies such as data augmentation and activation steering, these approaches often face a trade-off: reducing over-refusal typically degrades the model’s ability to reject genuinely harmful content.
We argue that this issue arises from the ambiguous influence of toxic and seemingly toxic prompts on the model’s learning dynamics. To address it, we introduce a preceding alignment stage, DCR: $\textbf{D}$iscernment via $\textbf{C}$ontrastive $\textbf{R}$efinement. Both theoretically and empirically, we demonstrate that contrastive refinement improves an LLM’s capacity to distinguish truly toxic prompts from superficially toxic ones. Evaluation across diverse benchmarks shows that our method effectively reduces over-refusal while preserving the safety benefits of alignment. Importantly, it achieves this with minimal degradation of general capabilities, offering a more principled and robust direction for safety alignment.
Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Prompt Learning #Adversarial Robustness #Vision-Language Models
TL;DR:We propose a Discrete Latent Feature based Adversarial Training (DEFEAT) method that mitigates the adversarial attacks for VLMs.
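🎯 Motivation: Adversarial fine-tuning makes vision-language models (VLMs) more robust but is computationally expensive. Adversarial prompt tuning is a practical alternative, yet it relies on vulnerable continuous image features.
❓ Problem: Proposes DEFEAT, which mitigates adversarial attacks by reconstructing discrete latent features, reducing the gap between clean and adversarial image representations and improving VLM robustness.
🔍 Analysis: Existing adversarial prompt tuning methods depend on vulnerable continuous image features, which makes representations fragile and hurts stability on adversarial examples.
🛠️ Method: DEFEAT introduces a perturbation discrete shield module that reconstructs discrete latent features, designs a logits fusion strategy, and combines adversarial training with prompt tuning while applying regularization from learnable prompts to hand-crafted prompts.
📊 Data & Experiments: Extensive experiments on 15 datasets validate DEFEAT against existing adversarial prompt tuning methods; the official code is open source.
⭐ Contributions: DEFEAT, an adversarial training framework based on discrete latent features that improves the adversarial robustness of VLMs and points to a direction for efficient, robust prompt tuning.
Abstract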
While adversarial fine-tuning can enhance the robustness of vision-language models (VLMs), such approaches are computationally expensive. Adversarial prompt tuning has emerged as a practical alternative. However, existing methods are limited by their reliance on vulnerable continuous image features. To mitigate the vulnerability in the feature representation, we propose **DEFEAT** (**D**iscrete Lat**E**nt **F**eatur**E** based **A**dversarial **T**raining), a robust prompt tuning framework for VLMs.
Specifically, the DEFEAT method introduces a perturbation discrete shield module that reconstructs discrete latent features and designs a logits fusion strategy, substantially reducing the discrepancy between clean and adversarial image representations.
Moreover, the DEFEAT method integrates prompt tuning with adversarial training while applying regularization from learnable prompts to hand-crafted prompts, further enhancing the adversarial robustness.
Extensive experiments across 15 datasets validate the effectiveness of the proposed DEFEAT method among existing adversarial prompt tuning methods. The official code is available at https://github.com/cheny02/DEFEAT-ICLR2026.
Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Large language models #Knowledge editing
TL;DR:We propose a novel locate-then-edit approach that disentangles knowledge representations for large language model editing to preserve fine-grained irrelevant knowledge.
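🎯 Motivation: Knowledge editing can efficiently update knowledge embedded in LLMs, but existing methods fail to preserve fine-grained irrelevant knowledge: facts that share the subject of the edited knowledge but differ in relation and object.
❓ Problem: Because subject representations encode multiple attributes, target knowledge and fine-grained irrelevant knowledge become entangled, so edits can unintentionally alter non-target facts.
🔍 Analysis: Target and irrelevant knowledge are entangled in the subject representation space, and existing methods cannot guarantee that fine-grained irrelevant knowledge survives an edit.
🛠️ Method: DiKE uses a Knowledge Representation Disentanglement (KRD) module to split the subject representation into target-knowledge-related and -unrelated components, and a Disentanglement-based Knowledge Edit (DKE) module that updates only the related component while explicitly preserving the unrelated one.
📊 Data & Experiments: Builds FINE-KED, a benchmark of fine-grained irrelevant knowledge at different levels of relational similarity to the edited knowledge. Extensive experiments across multiple LLMs show better preservation of fine-grained knowledge with competitive general editing performance.
⭐ Contributions: The DiKE method for more precise knowledge editing in LLMs, the FINE-KED benchmark, and evidence that DiKE substantially improves fine-grained knowledge preservation while remaining competitive on general editing.
Abstract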
Knowledge Editing has emerged as a promising solution for efficiently updating embedded knowledge in large language models (LLMs). While existing approaches demonstrate effectiveness in integrating new knowledge and preserving the original capabilities of LLMs, they fail to maintain fine-grained irrelevant knowledge, namely facts that share the same subject as edited knowledge but differ in relation and object. This challenge arises because subject representations inherently encode multiple attributes, causing the target and fine-grained irrelevant knowledge to become entangled in the representation space, and thus vulnerable to unintended alterations during editing. To address this, we propose DiKE, a novel approach that Disentangles Knowledge representations for LLM Editing. DiKE consists of two key components: a Knowledge Representation Disentanglement (KRD) module that decomposes the subject representation into target-knowledge-related and -unrelated components, and a Disentanglement-based Knowledge Edit (DKE) module that updates only the target-related component while explicitly preserving the unrelated one. We further derive a closed-form, rank-one parameter update based on matrix theory to enable efficient and minimally invasive edits. To rigorously evaluate fine-grained irrelevant knowledge preservation, we construct FINE-KED, a new benchmark comprising fine-grained irrelevant knowledge at different levels of relational similarity to the edited knowledge. Extensive experiments across multiple LLMs demonstrate that DiKE substantially improves fine-grained irrelevant knowledge preservation while maintaining competitive general editing performance.
Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#LLM Calibration #Decision Making #Overconfidence #In-context learning #LLM Agents #LLM self-knowledge #AI Safety
TL;DR:We find that LLMs are overconfident in predicting their success on tasks, but some learn from in-context experience to make more risk-averse decisions about which tasks to attempt.
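🎯 Motivation: Investigates whether LLMs can accurately predict their own success on a task, and whether in-context experience helps them decide better about attempting tasks where failure is costly.
❓ Problem: Addresses LLM overconfidence in task selection and the poor decisions that follow from an inaccurate view of their own capabilities.
🔍 Analysis: All tested models are overconfident, though most predict their success with better-than-random discriminatory power. Newer and larger models generally do not discriminate better, although Claude models do show such a trend, and on multi-step agentic tasks the overconfidence of several frontier models worsens as they progress.
🛠️ Method: Tests how well different models predict their success, including over the course of multi-step tasks, and observes how their decisions change after in-context experiences of failure.
📊 Data & Experiments: Quantitative comparisons across a range of LLMs, including Claude models, on multi-step agentic tasks, evaluating both predictive ability and changes in decision making.
⭐ Contributions: Shows that current LLMs are overconfident and that this degrades decision making, highlighting their limited awareness of their own capabilities and its implications for AI misuse and misalignment risks.
Abstract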
We investigate whether large language models (LLMs) can predict whether they will succeed on a given task and whether their predictions improve as they progress through multi-step tasks. We also investigate whether LLMs can learn from in-context experiences to make better decisions about whether to pursue a task in scenarios where failure is costly. All LLMs we tested are overconfident, but most predict their success with better-than-random discriminatory power. We find that newer and larger LLMs generally do not have greater discriminatory power, though Claude models do show such a trend. On multi-step agentic tasks, the overconfidence of several frontier LLMs worsens as they progress through the tasks, and reasoning LLMs perform comparably to or worse than non-reasoning LLMs. With in-context experiences of failure, some LLMs reduce their overconfidence, leading to significantly improved decision making, while others do not. Interestingly, all LLMs’ decisions are approximately rational given their estimated probabilities of success, yet their overly-optimistic estimates result in poor decision making. These results suggest that current LLM agents are hindered by their lack of awareness of their own capabilities. We discuss the implications of LLMs' awareness of their capabilities for AI misuse and misalignment risks.
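The point that decisions can be rational given an estimate yet still poor can be shown with a short expected-value sketch. The payoff structure (a fixed reward for success and a fixed penalty for failure) and the function name are assumptions chosen for illustration, not the paper's experimental setup.

```python
def should_attempt(p_success, reward, penalty):
    """Attempt the task only if its expected value is positive."""
    expected_value = p_success * reward - (1 - p_success) * penalty
    return expected_value > 0

# Hypothetical numbers: the model believes it succeeds 70% of the time,
# but its true success rate is 40%; reward and penalty are both 1.
estimated_choice = should_attempt(0.7, reward=1.0, penalty=1.0)  # acts on its belief
calibrated_choice = should_attempt(0.4, reward=1.0, penalty=1.0)  # what it should do
```

Here the decision rule is the same in both calls; only the probability estimate differs. An overconfident estimate flips the choice from declining to attempting, even though the decision procedure itself is sound.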
Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Reinforcement Learning #Large Language Models #Generative Models #Post Training #Chain of Thought
TL;DR:We identify the issue of over-dominance of low-probability tokens in RL training for LLMs, and propose two effective methods accordingly which evidently enhance the performance of RL-trained LLMs across various models and datasets.
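🎯 Motivation: RL is central to improving LLM reasoning, but during training low-probability tokens exert a disproportionate influence on model updates, a problem that needs addressing.
❓ Problem: Proposes two methods that attenuate gradients from low-probability tokens so that high-probability tokens are learned effectively and updates are more balanced.
🔍 Analysis: Low-probability tokens dominate model updates because their gradients are large, which suppresses the gradients of high-probability tokens that are essential to performance.
🛠️ Method: Advantage Reweighting and Low-Probability Token Isolation (Lopti); both attenuate gradients from low-probability tokens while emphasizing parameter updates driven by high-probability tokens.
📊 Data & Experiments: Tested on multiple models and datasets; on K&K Logic Puzzle reasoning tasks, GRPO-trained LLMs improve by up to 46.2%.
⭐ Contributions: Identifies the over-dominance of low-probability tokens in RL training, proposes two methods that mitigate it, and substantially improves the performance of RL-trained LLMs.
Abstract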
Reinforcement learning (RL) has become a cornerstone for enhancing the reasoning capabilities of large language models (LLMs), with recent innovations such as Group Relative Policy Optimization (GRPO) demonstrating exceptional effectiveness. In this study, we identify a critical yet underexplored issue in RL training: low-probability tokens disproportionately influence model updates due to their large gradient magnitudes. This dominance hinders the effective learning of high-probability tokens, whose gradients are essential for LLMs' performance but are substantially suppressed. To mitigate this interference, we propose two novel methods: Advantage Reweighting and Low-Probability Token Isolation (Lopti), both of which effectively attenuate gradients from low-probability tokens while emphasizing parameter updates driven by high-probability tokens. Our approaches promote balanced updates across tokens with varying probabilities, thereby enhancing the efficiency of RL training. Experimental results demonstrate that they substantially improve the performance of GRPO-trained LLMs, achieving up to a 46.2% improvement in K&K Logic Puzzle reasoning tasks. Our implementation is available at https://github.com/zhyang2226/AR-Lopti.
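Why low-probability tokens carry large gradients can be checked directly for a softmax policy, where the gradient of $\log \pi(k)$ with respect to the logits is the one-hot vector for $k$ minus the probability vector. The sketch below computes that norm; the `reweight_advantage` helper shows one plausible linear form of advantage reweighting and is an assumption, as are the `alpha` value and all function names. The abstract does not give the authors' formula.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logprob_grad_norm(logits, token):
    """L2 norm of d log pi(token) / d logits, which equals onehot(token) - pi."""
    probs = softmax(logits)
    grad = [(1.0 if i == token else 0.0) - p for i, p in enumerate(probs)]
    return math.sqrt(sum(g * g for g in grad))

def reweight_advantage(advantage, prob, alpha=0.3):
    """Assumed illustrative form: shrink the advantage of low-probability tokens
    by the factor alpha * prob + (1 - alpha), which lies in [1 - alpha, 1]."""
    return (alpha * prob + (1 - alpha)) * advantage
```

With logits `[4.0, 0.0, 0.0]`, the gradient norm for the unlikely token 1 is far larger than for the likely token 0, so an unweighted update is dominated by the unlikely token.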
Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#model collaboration #collaborative inference
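🎯 Motivation: Alignment training improves reasoning and instruction following but can weaken creativity and calibration. The work aims to combine the strengths of aligned and unaligned models through model collaboration.
❓ Problem: Explores how model collaboration can overcome the limits of a single model on complex tasks by letting complementary skills work together.
🔍 Analysis: LM responses interleave skills that favor different models: some segments suit the unaligned base model and others the aligned model, which shows the potential of collaboration.
🛠️ Method: Switch Generation trains a switcher LM that dynamically chooses whether the pretrained or the aligned model generates the next segment of a response, based on the query and context.
📊 Data & Experiments: With 18 datasets and 8 model collaboration baselines, model collaboration outperforms individual models on 16 of 18 tasks, and Switch Generation further outperforms the baselines by 12.9% on average.
⭐ Contributions: Shows that model collaboration discovers compositional skills for problems where individual models struggle, generalizes to unseen models and tasks, and reuses by-products of training pipelines that would otherwise be discarded.
Abstract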
Alignment training has tradeoffs: it helps language models (LMs) gain in reasoning and instruction following but might lose out on skills such as creativity and calibration, at which unaligned base models are better. We aim to make the best of both worlds through model collaboration, where different models in the training pipeline collaborate and complement each other. Since LM responses feature interleaving skills that favor different models, we propose Switch Generation, where pretrained and aligned model versions take turns to "speak" in a response sequence. Specifically, we train a switcher LM by learning from outcomes of choosing different models to generate the next segment across diverse queries and contexts. At inference time, the switcher LM guides different model checkpoints to dynamically generate the next segment where their strengths are most needed. Extensive experiments with 8 model collaboration baselines and 18 datasets show that 1) model collaboration consistently outperforms individual models on 16 out of 18 tasks, and 2) Switch Generation further outperforms baselines by 12.9% on average. Further analysis reveals that Switch Generation discovers compositional skills to solve problems where individual models struggle and generalizes to unseen models and tasks, reusing and repurposing by-products in expensive model training pipelines that are otherwise discarded.
Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Self-Verification #Dual Learning #Preference Optimization #Large Language Model
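🎯 Motivation: Current model optimization depends on costly labels and is limited to certain task types, so a self-verification mechanism that reduces supervision and broadens applicability is needed.
❓ Problem: Addresses RLVR's reliance on costly labels and verifiable tasks, and extends traditional dual learning beyond strictly dual task pairs.
🔍 Analysis: Traditional approaches are limited on non-invertible tasks; using reconstruction quality as a self-supervised reward, DuPO improves performance across diverse tasks.
🛠️ Method: A dual learning-based preference optimization framework that decomposes the primal task's input into known and unknown components, then constructs a dual task that reconstructs the unknown part from the primal output and the known information. Reconstruction quality serves as the self-supervised reward.
📊 Data & Experiments: Experiments cover translation and mathematical reasoning: translation quality rises by 2.1 COMET on average, mathematical reasoning accuracy by 6.4 points on average, and use as an inference-time reranker adds 9.3 points.
⭐ Contributions: A scalable, general, annotation-free optimization paradigm that improves performance on diverse tasks and strengthens LLM self-verification.
Abstract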
We present DuPO, a dual learning-based preference optimization framework that generates annotation-free feedback via generalized duality. DuPO addresses two key limitations: Reinforcement Learning with Verifiable Rewards (RLVR)’s reliance on costly labels and applicability restricted to verifiable tasks, and traditional dual learning’s restriction to strictly dual task pairs (e.g., translation and back-translation). Specifically, DuPO decomposes a primal task’s input into known and unknown components, then constructs its dual task to reconstruct the unknown part using the primal output and known information (e.g., reversing math solutions to recover hidden variables), broadening applicability to non-invertible tasks. The quality of this reconstruction serves as a self-supervised reward to optimize the primal task, synergizing with LLMs’ ability to instantiate both tasks via a single model. Empirically, DuPO achieves substantial gains across diverse tasks: it enhances the average translation quality by 2.1 COMET over 756 directions, boosts the mathematical reasoning accuracy by an average of 6.4 points on four challenge benchmarks, and enhances performance by 9.3 points as an inference-time reranker~(trading computation for accuracy). These results position DuPO as a scalable, general, and annotation-free paradigm for LLM optimization.
Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#large reasoning model #reinforcement learning finetuning
TL;DR:We propose Dynamics-Predictive Sampling (DPS) for RLVR, which infers prompt learning dynamics online to select informative prompts before rollout, accelerating RL finetuning of large reasoning models without the need for rollout-intensive filtering.
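🎯 Motivation: RL finetuning is a key technique for improving LLM reasoning, but its effectiveness depends on data selection, and existing prompt-selection methods are computationally expensive.
❓ Problem: Proposes a way to select informative prompts without extensive LLM rollouts, reducing compute and speeding up finetuning.
🔍 Analysis: Online prompt selection focuses training on partially solved or moderately hard examples and yields more effective updates, but it needs rollouts over large candidate batches, a cost that can outweigh the finetuning itself.
🛠️ Method: Models each prompt's solving progress as a dynamical system with a hidden Markov model, runs online Bayesian inference on historical rollout rewards to estimate the evolving state distribution, and uses the prediction to guide prompt selection.
📊 Data & Experiments: Experiments on mathematics, planning, and visual geometry reasoning tasks show fewer redundant rollouts, faster training, and better reasoning performance.
⭐ Contributions: Introduces a learning-dynamics view of prompts and the Dynamics-Predictive Sampling (DPS) method, improving the efficiency of RL finetuning and the resulting reasoning performance.
Abstract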
Reinforcement learning (RL) finetuning has become a key technique for enhancing the reasoning abilities of large language models (LLMs). However, its effectiveness critically depends on the selection of training data. Recent advances underscore the importance of online prompt selection methods, which typically concentrate training on partially solved or moderately challenging examples under the current policy, thereby yielding more effective model updates. While significantly accelerating RL finetuning in terms of training steps, they also incur substantial computational overhead by requiring extensive LLM rollouts over large candidate batches to identify informative samples, an expense that can outweigh the finetuning process itself. To address this challenge, this work proposes Dynamics-Predictive Sampling (DPS), which online predicts and selects informative prompts by inferring their learning dynamics prior to costly rollouts. Specifically, we introduce a new perspective by modeling each prompt's solving progress during RL finetuning as a dynamical system, where the extent of solving is represented as the state and the transition is characterized by a hidden Markov model. Using historical rollout reward signals, we perform online Bayesian inference to estimate evolving state distributions, and the inference outcome provides a predictive prior for efficient prompt selection without rollout-intensive filtering. Empirical results across diverse reasoning tasks, including mathematics, planning, and visual geometry, demonstrate that DPS substantially reduces redundant rollouts, accelerates the training process, and achieves superior reasoning performance.
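The inference loop described above can be sketched as a standard hidden-Markov forward filter. Everything concrete here is an assumption made for illustration: the three states (unsolved, partially solved, solved), the idea of ranking prompts by the predicted probability of the "partially solved" state, and all function names. The abstract does not specify the state space or the selection rule.

```python
def predict(belief, transition):
    """One-step-ahead state distribution: belief @ transition."""
    n = len(belief)
    return [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]

def update(prior, likelihood):
    """Bayes update: weight the prior by the observation likelihood, then normalize."""
    post = [p * l for p, l in zip(prior, likelihood)]
    total = sum(post)
    return [p / total for p in post]

def select_prompts(beliefs, transition, k, target_state=1):
    """Rank prompts by predicted probability of the target state, before any rollout."""
    scores = {pid: predict(b, transition)[target_state] for pid, b in beliefs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

In use, `select_prompts` would pick a batch from the current beliefs, rollouts would run only on that batch, and `update` would fold the observed rewards back into the selected prompts' beliefs. The saving comes from never rolling out the prompts that were filtered away.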
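Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Model Editing #Massive Editing #Large Language Models
🎯 Motivation: Existing model editing methods degrade in massive editing scenarios, especially under practical evaluation metrics, and their robustness is limited in context-rich settings or when several facts about the same subject are edited at once.
❓ Problem: Improve the reliability and effectiveness of large-scale knowledge editing by addressing embedding alignment.
🔍 Analysis: Misalignment among knowledge items in embedding space undermines editing reliability, particularly when many facts are updated simultaneously.
🛠️ Method: Proposes EAMET, which aligns the spaces of key embeddings and residual embeddings in Transformer models to make edits more consistent and robust.
📊 Data & Experiments: Across six LLMs and three datasets, EAMET reaches about 90% editing efficacy when editing 10k facts and consistently outperforms existing methods.
⭐ Contributions: Identifies embedding misalignment as the cause of failures in massive editing, proposes a robust model editing framework to address it, and validates the framework with extensive experiments.
Abstract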
Model editing techniques are essential for efficiently updating knowledge in large language models (LLMs). However, the effectiveness of existing approaches degrades in massive editing scenarios, particularly when evaluated with practical metrics. Their robustness is also limited in context-rich settings or when editing multiple facts of the same subject simultaneously. We attribute these failures to embedding misalignment among knowledge items, which undermines editing reliability at scale. To address this, we propose EAMET (Embedding Alignment Model Editing in Transformers), which aligns the space of key and residual embeddings. Extensive experiments across six LLMs and three datasets demonstrate that EAMET consistently outperforms existing methods, achieving about 90% editing efficacy when editing 10k facts.
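Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#large language models #reasoning models #reinforcement learning #RLVR #exploration #unlearning
TL;DR: We propose EEPO, which enhances exploration in RLVR by temporarily suppressing sampled trajectories during rollouts, achieving 10-33% improvements across mathematical reasoning benchmarks.
🎯 Motivation: Balancing exploration and exploitation is a central challenge in reinforcement learning with verifiable rewards (RLVR) for LLMs, yet current methods overemphasize exploitation, which leads to entropy collapse and reduced exploratory capacity.
❓ Problem: Existing methods struggle to escape dominant behavioral modes; repeatedly sampling and rewarding those modes forms a self-reinforcing loop that further suppresses exploration and limits performance gains.
🔍 Analysis: Techniques that increase policy stochasticity promote exploration to some extent, but they frequently fail to break the hold of dominant modes, so the explored region stays narrow.
🛠️ Method: Proposes EEPO, a two-stage rollout scheme: the model generates half of the trajectories in the first stage, then takes a lightweight unlearning step that temporarily suppresses those samples, which forces the second stage to explore other regions of the output space.
📊 Data & Experiments: Evaluated on five reasoning benchmarks with Qwen and Llama models; EEPO achieves average relative gains of 10% to 33% over GRPO across configurations.
⭐ Contributions: Addresses insufficient exploration in RLVR with a sample-then-forget mechanism and reports notable improvements in LLM reasoning.
Abstract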
Balancing exploration and exploitation remains a central challenge in reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs). Current RLVR methods often overemphasize exploitation, leading to entropy collapse, diminished exploratory capacity, and ultimately limited performance gains. Although techniques that increase policy stochasticity can promote exploration, they frequently fail to escape dominant behavioral modes. This creates a self-reinforcing loop (repeatedly sampling and rewarding dominant modes) that further erodes exploration. We introduce Exploration-Enhanced Policy Optimization (EEPO), a framework that promotes exploration via two-stage rollouts with adaptive unlearning. In the first stage, the model generates half of the trajectories; it then undergoes a lightweight unlearning step to temporarily suppress these sampled responses, forcing the second stage to explore different regions of the output space. This sample-then-forget mechanism disrupts the self-reinforcing loop and promotes wider exploration during rollouts. Across five reasoning benchmarks, EEPO outperforms GRPO, achieving average relative gains of 24.3% on Qwen2.5-3B, 33.0% on Llama3.2-3B-Instruct, and 10.4% on Qwen3-8B-Base.
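A toy sketch of the sample-then-forget idea, under strong simplifying assumptions: the "policy" is a categorical distribution over a handful of candidate responses, and the unlearning step is stood in for by subtracting a fixed `penalty` from the logits of already-sampled responses on a temporary copy. The paper applies an unlearning update to the model itself; the function names and the penalty rule below are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def two_stage_rollout(logits, n, penalty, rng):
    """Sample n responses in two halves, suppressing first-half samples in between."""
    half = n // 2
    p1 = softmax(logits)
    first = rng.choices(range(len(logits)), weights=p1, k=half)

    # Temporary suppression: the original logits are left untouched.
    suppressed = list(logits)
    for idx in set(first):
        suppressed[idx] -= penalty

    p2 = softmax(suppressed)
    second = rng.choices(range(len(logits)), weights=p2, k=n - half)
    return first, second, p1, p2
```

Because the suppression acts on a copy, the second-stage distribution shifts mass toward responses not yet sampled while the underlying policy is unchanged after the rollout.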
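Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Multi-objective prompt optimization; multi-objective bandits; best feasible arm identification; fixed-budget pure exploration
🎯 Motivation: Prompt engineering has become central to eliciting LLM capabilities, but prior work overlooks the multi-faceted nature of prompt performance, which no single metric can capture.
❓ Problem: Studies multi-objective prompt selection under two practical settings: Pareto prompt set recovery and best feasible prompt identification.
🔍 Analysis: Prompt performance has several dimensions, so selection must trade off objectives against one another while still identifying the right prompts accurately under a limited budget.
🛠️ Method: Casts the problem into the pure-exploration bandit framework, adapts provably efficient multi-objective bandit algorithms, and designs a new method for best feasible arm identification in structured bandits, with theoretical guarantees on the identification error in the linear case.
📊 Data & Experiments: Experiments across multiple LLMs show that the bandit-based approaches improve significantly over baselines.
⭐ Contributions: Establishes a principled framework for multi-objective prompt optimization, proposes efficient algorithms, and validates them empirically.
Abstract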
Prompt engineering has become central to eliciting the capabilities of large language models (LLMs). At its core lies prompt selection - efficiently identifying the most effective prompts. However, most prior investigations overlook a key challenge: the inherently multi-faceted nature of prompt performance, which cannot be captured by a single metric. To fill this gap, we study the multi-objective prompt selection problem under two practical settings: Pareto prompt set recovery and best feasible prompt identification. Casting the problem into the pure-exploration bandits framework, we adapt provably efficient algorithms from multi-objective bandits and further introduce a novel design for best feasible arm identification in structured bandits, with theoretical guarantees on the identification error in the linear case. Extensive experiments across multiple LLMs show that the bandit-based approaches yield significant improvements over baselines, establishing a principled and efficient framework for multi-objective prompt optimization.
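The two selection targets named in the abstract can be made concrete with a small sketch. It assumes the mean score of every prompt on every objective is already known, which is exactly what the bandit algorithms have to estimate from noisy evaluations under a fixed budget; the two-objective layout and the feasibility rule (objective 1 at or above a threshold) are illustrative assumptions.

```python
def dominates(a, b):
    """True if score vector a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_set(scores):
    """Return the names of prompts that no other prompt dominates (higher is better)."""
    return [
        name
        for name, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != name)
    ]


def best_feasible(scores, threshold):
    """Maximize objective 0 among prompts whose objective 1 meets the threshold."""
    feasible = {n: s for n, s in scores.items() if s[1] >= threshold}
    if not feasible:
        return None
    return max(feasible, key=lambda n: feasible[n][0])
```

With noisy evaluations, both functions would be applied to running estimates, and the sampling rule decides which prompt to evaluate next.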
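Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#mechanistic interpretability #uncertainty estimation #LLMs #time series #probing
TL;DR: We demonstrate that LLMs' hidden states contain information about their own numerical predictive distribution, which can be elicited without autoregressive decoding.
🎯 Motivation: Applying LLMs to regression tasks is hampered by autoregressive decoding: obtaining a predictive distribution over continuous values requires repeated sampling, which is costly in compute and inference time.
❓ Problem: Investigates whether statistical properties of an LLM's numerical predictive distribution can be recovered from its internal representations without autoregressive decoding, reducing the reliance on sampling.
🔍 Analysis: The results suggest that LLM hidden states carry information about statistical functionals of the predictive distribution, including numerical uncertainty.
🛠️ Method: Trains a set of regression probes to predict statistical functionals (e.g., mean, median, quantiles) of the LLM's numerical output distribution directly from its internal representations.
📊 Data & Experiments: Experiments on time series and tabular tasks assess how well the probes recover properties of the predictive distribution, and raise new questions about how LLMs encode uncertainty internally.
⭐ Contributions: Provides evidence that LLM embeddings carry informative signals about summary statistics of their predictive distributions, pointing to lightweight alternatives to sampling-based uncertainty estimation.
Abstract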
Large Language Models (LLMs) have recently been successfully applied to regression tasks---such as time series forecasting and tabular prediction---by leveraging their in-context learning abilities. However, their autoregressive decoding process may be ill-suited to continuous-valued outputs, where obtaining predictive distributions over numerical targets requires repeated sampling, leading to high computational cost and inference time. In this work, we investigate whether distributional properties of LLM predictions can be recovered _without_ explicit autoregressive generation. To this end, we study a set of regression probes trained to predict statistical functionals (e.g., mean, median, quantiles) of the LLM’s numerical output distribution directly from its internal representations. Our results suggest that LLM embeddings carry informative signals about summary statistics of their predictive distributions, including the numerical uncertainty. This investigation opens up new questions about how LLMs internally encode uncertainty in numerical tasks, and about the feasibility of lightweight alternatives to sampling-based approaches for uncertainty-aware numerical predictions.
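Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Large Language Models #LLM Steering #Instruction following #Activation engineering
TL;DR: We propose DIRECTER, a dynamic steering method that mitigates oversteering in LLMs with a plausibility-guided decoding loop that rejects implausible outputs and adaptively modulates steering strength.
🎯 Motivation: Even after instruction tuning, LLMs often fail to follow complex user instructions. Activation steering can help, but it risks oversteering, which degrades task accuracy and text quality.
❓ Problem: Proposes DIRECTER, which mitigates oversteering by modulating steering strength dynamically: a plausibility-guided decoding loop adapts the strength so that task accuracy and text quality are not harmed.
🔍 Analysis: Oversteering arises when a method puts excessive emphasis on the instruction, which distorts the model's original output distribution and lowers output plausibility and task performance.
🛠️ Method: DIRECTER uses a lightweight, one-time attention sensitivity analysis to rank layers by influence and modulates steering strength by scaling the KV cache. At each decoding step it compares the steered output distribution with the original and progressively weakens steering when the steered output is deemed implausible.
📊 Data & Experiments: Extensive evaluation on diverse benchmarks shows that DIRECTER improves instruction following, with accuracy gains of up to 6.5% over baselines and no loss in generation quality or task fidelity.
⭐ Contributions: Proposes a dynamic, plausibility-guided activation steering method that mitigates oversteering, requires no extra dataset, is compatible with existing baselines, and can serve as a general control mechanism for LLM steering.
Abstract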
Large Language Models (LLMs), despite advances in instruction tuning, often fail to follow complex user instructions. Activation steering techniques aim to mitigate this by manipulating model internals, but have a potential risk of oversteering, where excessive emphasis on the instruction degrades task accuracy and overall text quality. To address this, we introduce DIRECTER (Dynamic rejection steering), a novel steering method that dynamically modulates steering strength by scaling the KV cache without extra dataset. DIRECTER couples steering with a plausibility-guided decoding loop, which adaptively adjusts steering strength at each step by comparing the steered output distribution to the original. If the steered output is deemed implausible, steering strength is progressively weakened. This strength modulation is guided by a lightweight, one-time attention sensitivity analysis that ranks layers by their influence on model representations. Extensive evaluations show that DIRECTER significantly enhances instruction-following capabilities across diverse benchmarks, improving accuracy by up to 6.5% over baselines without the common trade-offs in generation quality or task fidelity. The proposed dynamic, plausibility-guided control during activation steering further demonstrates its potential as a general mechanism for mitigating oversteering that is compatible with existing baselines.
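The rejection loop described above can be sketched on a toy next-token distribution. Several parts are assumptions made only to keep the example self-contained: steering is modeled as adding `strength * direction` to the logits (the paper scales the KV cache instead), and a steered token counts as plausible when its probability under the original distribution is at least `alpha` times the original top probability. That plausibility rule is a stand-in, since the abstract does not specify the criterion.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def steer_step(logits, direction, strengths, alpha):
    """Try steering strengths from strongest to weakest; keep the first plausible token.

    Returns (token_index, strength_used). Falls back to the unsteered token.
    """
    p_orig = softmax(logits)
    floor = alpha * max(p_orig)  # assumed plausibility threshold

    for s in strengths:
        steered = softmax([x + s * d for x, d in zip(logits, direction)])
        token = max(range(len(steered)), key=steered.__getitem__)
        if p_orig[token] >= floor:
            return token, s

    unsteered = max(range(len(p_orig)), key=p_orig.__getitem__)
    return unsteered, 0.0
```

In the toy case `steer_step([2.0, 1.5, -3.0], [0.0, 1.0, 2.0], [8.0, 4.0, 2.0], 0.1)`, the strongest setting picks a token the original model considers very unlikely, so the loop backs off and settles on a weaker strength that still changes the output.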
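Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Large language model #Reinforcement learning #Model merging
TL;DR: We provide a comprehensive analysis of how reinforcement learning mitigates task conflicts in LLMs.
🎯 Motivation: Model merging is crucial for consolidating multiple specialized models, but little work has examined how the training paradigm (e.g., reinforcement learning versus supervised fine-tuning) affects merging.
❓ Problem: Analyzes and verifies how RL-trained LLMs mitigate task conflicts, in particular how they suffer less performance degradation after merging.
🔍 Analysis: Compared with models trained by conventional supervised fine-tuning, RL-trained models show markedly fewer inter-task conflicts and less overwriting of existing knowledge by parameter updates.
🛠️ Method: Combines theoretical analysis with extensive experiments to show that RL alleviates parameter conflicts by keeping gradient updates small, by the convergence behavior of its optimization objective, and by jointly optimizing positive and negative examples.
📊 Data & Experiments: Comprehensive evaluation on five representative tasks confirms the advantage of RL-trained models in merging.
⭐ Contributions: Systematically shows how RL mitigates task conflicts in model merging and identifies three key mechanisms, giving clear direction for follow-up work.
Abstract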
Model merging plays a crucial role in consolidating multiple specialized models into a single, unified model, especially in the era of large language models (LLMs). Recent research has primarily focused on developing strategies to enhance merging performance with the trained models, while the impact of training paradigms, such as supervised fine-tuning (SFT) and reinforcement learning (RL), on the effectiveness of model merging remains underexplored. In this study, we systematically explore the merging behavior of RL-trained LLMs compared to those trained with traditional SFT. Through comprehensive evaluations across five representative tasks, we find that RL significantly reduces task conflicts and results in less performance degradation after merging, making RL-trained models particularly well-suited for this process.
To unearth the reasons behind the superior suitability of RL for model merging, we conduct extensive empirical experiments and theoretical analyses. Our findings highlight three key factors: (1) On-policy training data in RL keeps gradient updates smaller in magnitude, reducing the risk of overwriting existing knowledge for other tasks in the model. (2) The RL optimization objective, which favors "enough is as good as a feast", progressively reduces the magnitude of parameter updates as the model converges, thereby alleviating inter-task conflicts. (3) Joint optimization of positive and negative examples in RL steers the model towards an unbiased task-specific parameter subspace, ensuring robust performance while further preventing parameter conflicts.
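Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Policy Contraction #Proximal Policy Optimization #Large Language Models
🎯 Motivation: Reinforcement learning from human feedback (RLHF) with proximal policy optimization (PPO) is widely used, but it often yields less diverse outputs, which is attributed to the policy contracting during optimization.
❓ Problem: Proposes an algorithm that optimizes reward while preserving diversity, mitigating the contraction of the output support that PPO causes during language model finetuning.
🔍 Analysis: Defines the Support Retention Ratio (SRR) and combines it with token entropy, KL divergence, and repetition rate to quantify policy contraction and its effect on output diversity.
🛠️ Method: Proposes Contraction-Aware PPO (CaPPO), a minimum-norm multi-gradient update that jointly optimizes reward, entropy, and KL divergence, paired with an entropy controller that steers exploration toward a target token entropy.
📊 Data & Experiments: On HH-RLHF, Summarize-from-Feedback, and UltraFeedback with several LLMs including Qwen2-7B, CaPPO raises win rate by 2 to 4 points over PPO and SRR by 0.2 to 0.3; the gains are robust to decoding sweeps and reward scaling.
⭐ Contributions: By treating reward, diversity, and stability as first-class objectives, CaPPO mitigates policy contraction without sacrificing alignment performance.
Abstract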
Reinforcement learning from human feedback (RLHF) with proximal policy optimization (PPO) is widely used but often yields less diverse outputs than supervised fine-tuning, suggesting an effect in which the policy’s support contracts during on-policy optimization. We formalize this “policy contraction” with the Support Retention Ratio (SRR)—the share of SFT completions that retain non-negligible probability under the RL policy—and additionally track token-entropy, Kullback–Leibler (KL) divergence to the reference, and repetition. We propose Contraction-Aware PPO (CaPPO), a minimum-norm multi-gradient update that co-optimizes reward, entropy, and KL, paired with a controller that steers exploration toward a target token entropy. On HH-RLHF, Summarize-from-Feedback, and UltraFeedback with Qwen2-7B, Qwen2.5-14B, Mistral-7B-Instruct, and Llama-3-8B-Instruct, CaPPO increases win rate by 2 to 4 points over PPO and improves diversity, gaining 0.2 to 0.3 higher SRR. The gains persist under decoding sweeps and are robust to reward scaling and critic variance. Treating reward, diversity, and stability as first-class objectives, CaPPO mitigates contraction without sacrificing alignment performance.
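The Support Retention Ratio can be computed along these lines. The abstract defines SRR only as the share of SFT completions that retain non-negligible probability under the RL policy; the concrete criterion used here (average per-token log-probability at or above `min_avg_logprob`) and its default value are assumptions for illustration.

```python
def support_retention_ratio(seq_logprobs, seq_lengths, min_avg_logprob=-4.0):
    """Fraction of SFT completions that keep non-negligible probability under the RL policy.

    seq_logprobs: total log-probability of each SFT completion under the RL policy.
    seq_lengths:  token count of each completion (used to length-normalize).
    """
    if not seq_logprobs or len(seq_logprobs) != len(seq_lengths):
        raise ValueError("need matching, non-empty log-probability and length lists")

    retained = sum(
        1
        for logprob, length in zip(seq_logprobs, seq_lengths)
        if logprob / length >= min_avg_logprob
    )
    return retained / len(seq_logprobs)
```

A value near 1 means the RL policy still covers most of what the SFT model would produce; a falling value across training checkpoints would indicate contraction.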
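Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#LLM #Metacognition #Evaluations #AI Safety #Self-Awareness #Consciousness #Model Welfare
🎯 Motivation: Public attention to possible LLM self-awareness and even sentience is growing quickly and has major safety and policy implications, yet the science of measuring these properties is still nascent.
❓ Problem: Proposes a new methodology for measuring metacognitive abilities in LLMs that avoids model self-reports and instead tests whether models can strategically use knowledge of their internal states.
🔍 Analysis: Frontier LLMs show certain metacognitive abilities, including assessing and using their confidence that they can answer a question correctly, and anticipating their own answers and using that information appropriately. These abilities are limited in resolution, context-dependent, and qualitatively different from human metacognition.
🛠️ Method: Borrows designs from research on metacognition in nonhuman animals, combining behavioral tests with an analysis of the token probabilities returned by the models to study metacognitive behavior and internal signals.
📊 Data & Experiments: Uses two experimental paradigms to analyze frontier LLMs introduced since early 2024, evaluating confidence assessment and self-prediction on factual and reasoning questions.
⭐ Contributions: Characterizes LLM metacognitive abilities and their limits, finds differences across models of similar capability and context dependence, and suggests that post-training may play a role in developing metacognition, informing future safety research and policy.
Abstract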
The possibility of LLM self-awareness and even sentience is gaining increasing public attention and has major safety and policy implications, but the science of measuring them is still in a nascent state. Here we introduce a novel methodology for quantitatively evaluating metacognitive abilities in LLMs. Taking inspiration from research on metacognition in nonhuman animals, our approach eschews model self-reports and instead tests to what degree models can strategically deploy knowledge of internal states. Using two experimental paradigms, we demonstrate that frontier LLMs introduced since early 2024 show increasingly strong evidence of certain metacognitive abilities, specifically the ability to assess and utilize their own confidence in their ability to answer factual and reasoning questions correctly and the ability to anticipate what answers they would give and utilize that information appropriately. We buttress these behavioral findings with an analysis of the token probabilities returned by the models, which suggests the presence of an upstream internal signal that could provide the basis for metacognition. We further find that these abilities 1) are limited in resolution, 2) emerge in context-dependent manners, and 3) seem to be qualitatively different from those of humans. We also report intriguing differences across models of similar capabilities, suggesting that LLM post-training may have a role in developing metacognitive abilities.
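Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Reinforcement Learning with Verifiable Rewards #Group Relative Policy Optimization #LLM Reasoning
🎯 Motivation: In reinforcement learning with verifiable rewards (RLVR), recent studies find that two seemingly contradictory mechanisms both improve LLM reasoning: spurious rewards, which are unrelated to the ground truth, and entropy minimization, which lowers policy entropy. Why discouraging exploitation and discouraging exploration both help remains unclear.
❓ Problem: Aims to explain why spurious rewards work in RLVR and how policy entropy affects performance. Two questions are central: how policy entropy relates to performance, and whether spurious rewards yield gains through the interplay of clipping bias and model contamination.
🔍 Analysis: Clipping bias under spurious rewards reduces policy entropy and makes outputs more confident and deterministic, while entropy minimization alone is insufficient for improvement. The benefit of spurious rewards can also be explained in non-contaminated settings, so it goes beyond simple memorization.
🛠️ Method: Proposes a reward-misalignment model to explain the gains from spurious rewards, focusing on the interaction between clipping bias and entropy change under spurious rewards.
📊 Data & Experiments: Experiments use the RLVR framework to improve LLM mathematical reasoning. Training with spurious rewards is used to analyze empirically how clipping bias, policy entropy, and final reasoning performance relate.
⭐ Contributions: Clarifies the mechanism by which spurious rewards improve performance in RLVR, in particular the path through clipping bias to lower entropy. The reward-misalignment model offers a theoretical explanation, and the findings provide principles for more effective RLVR training.
Abstract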
This paper examines the exploration–exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), a framework for improving the reasoning of Large Language Models (LLMs). Recent studies suggest that RLVR can elicit strong mathematical reasoning in LLMs through two seemingly paradoxical mechanisms: spurious rewards, which suppress exploitation by rewarding outcomes unrelated to the ground truth, and entropy minimization, which suppresses exploration by pushing the model toward more confident and deterministic outputs. This highlights a puzzling dynamic: both discouraging exploitation and discouraging exploration improve reasoning performance, yet the underlying principles that reconcile these effects remain poorly understood. We focus on two fundamental questions: (i) how policy entropy relates to performance, and (ii) whether spurious rewards yield gains, potentially through the interplay of clipping bias and model contamination. Our results show that clipping bias under spurious rewards reduces policy entropy, leading to more confident and deterministic outputs, while entropy minimization alone is insufficient for improvement. We further propose a reward-misalignment model explaining why spurious rewards can enhance performance beyond contaminated settings. Our findings clarify the mechanisms behind spurious-reward benefits and provide principles for more effective RLVR training.
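Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Personalization #Synthetic Data #Meta-Learning #Preference Optimization
TL;DR: Rapidly personalize to users with few examples by learning from synthetic preference data through meta-learning.
🎯 Motivation: User-facing applications such as virtual assistants and content curation need personalization, which motivates methods for personalizing LLMs.
❓ Problem: Real user preference data is hard to collect at scale, so the work studies how synthetic preference data combined with meta-learning can personalize a model quickly.
🔍 Analysis: Experiments show that transfer from synthetic data to real users requires data that is both highly diverse and coherent and self-consistent.
🛠️ Method: Proposes FSPO, which reframes reward modeling as a meta-learning problem and uses user description rationalization (RAT) to improve reward modeling. A data generation strategy builds over 1M synthetic preference examples from publicly available LLMs.
📊 Data & Experiments: Evaluated on up to 1,500 synthetic users across three domains (movie reviews, education, and open-ended question answering), plus a controlled human study. FSPO reaches an 87% Alpaca Eval win rate with synthetic users and a 70% win rate with real users.
⭐ Contributions: Proposes FSPO, a method based on synthetic preference data and meta-learning that personalizes quickly to real users, and describes how to generate high-quality, transferable synthetic data.
Abstract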
Effective personalization of LLMs is critical for a broad range of user-interfacing applications such as virtual assistants and content curation. Inspired by the strong in-context capabilities of LLMs, we propose few-shot preference optimization (FSPO), an algorithm for LLM personalization that reframes reward modeling as a meta-learning problem. Under FSPO, an LLM learns to quickly infer a personalized reward function for a user via a few labeled preferences. FSPO also utilizes user description rationalization (RAT) to encourage better reward modeling and instruction following, recovering performance with the oracle user description. Since real-world preference data is challenging to collect at scale, we propose careful design choices to construct synthetic preference datasets for personalization, generating over 1M synthetic personalized preferences using publicly available LLMs. To successfully transfer from synthetic data to real users, we find it crucial for the data to exhibit both high diversity and coherent, self-consistent structure. We evaluate FSPO on personalized open-ended generation for up to 1,500 synthetic users across three domains: movie reviews, education, and open-ended question answering. We also run a controlled human study. Overall, FSPO achieves an 87% Alpaca Eval win-rate in generating responses that are personalized to synthetic users and a 70% win-rate with real human users in open-ended question answering.
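Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Vision Language Models #Image Forensics #AIGC Detection
TL;DR: This work fine-tunes a Vision Language Model on human-annotated data to classify AI-generated images and pinpoint where and why it judges them to be generated.
🎯 Motivation: Existing AI-generated image detectors lack interpretability and robustness, which motivates a framework that offers both reliable detection and explainable reasoning.
❓ Problem: Uses human-annotated spatial visual cues to address the black-box nature and weak generalization of conventional detectors, as well as the tendency of multi-modal LLMs to hallucinate.
🔍 Analysis: Accurate existing detectors cannot localize generation artifacts spatially, while multi-modal LLMs that can reason often make factual errors on fine-grained visual tasks.
🛠️ Method: Builds the FakeXplained dataset with bounding boxes and descriptive captions, and fine-tunes multi-modal LLMs with a progressive training pipeline so that the model detects, localizes, and explains in one output.
📊 Data & Experiments: Systematic evaluation on the new dataset shows state-of-the-art results (98.2% detection accuracy, 36.0% localization IoU) and strong out-of-distribution generalization.
⭐ Contributions: Presents a spatially interpretable framework for detecting AI-generated images and releases a detection dataset with visual grounding annotations, supporting research on explainable forensics.
Abstract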
The rapid rise of image generation calls for detection methods that are both interpretable and reliable. Existing approaches, though accurate, act as black boxes and fail to generalize to out-of-distribution data, while multi-modal large language models (MLLMs) provide reasoning ability but often hallucinate.
To address these issues, we construct FakeXplained, a dataset of AI-generated images annotated with bounding boxes and descriptive captions that highlight synthesis artifacts, forming the basis for human-aligned, visually grounded reasoning. Leveraging FakeXplained, we develop FakeXplainer, which fine-tunes MLLMs with a progressive training pipeline, enabling accurate detection, artifact localization, and coherent textual explanations. Extensive experiments show that FakeXplainer not only sets a new state of the art in detection and localization accuracy (98.2% accuracy, 36.0% IoU), but also demonstrates strong robustness and out-of-distribution generalization, uniquely delivering spatially grounded, human-aligned rationales. The code and dataset are available at: https://github.com/Gennadiyev/FakeXplain.
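For readers unfamiliar with the localization metric, intersection over union (IoU) between a predicted and an annotated bounding box can be computed as in the sketch below. The `(x1, y1, x2, y2)` corner format is an assumption, and how the paper aggregates IoU over images or over multiple boxes is not stated in the abstract.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) with x1 <= x2, y1 <= y2."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b

    # Width and height of the overlap region, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0
```

Two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, so their IoU is 1/7, which shows how quickly the score drops for partially misaligned boxes.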
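Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Activation Steering #Large Language Models #Fine-Grained Intervention
TL;DR: Breaking LLM blocks into fine-grained atomic units for intervention: steering less achieves more.
🎯 Motivation: Activation steering is a low-cost way to modify LLM behavior, but existing methods intervene at too coarse a granularity, which makes them inefficient and intrusive.
❓ Problem: Block-level activation interventions cannot separate beneficial, irrelevant, and harmful features, so behavior adjustment is imprecise and inefficient.
🔍 Analysis: Block-level activations are heterogeneous. Different atomic units (AUs) control distinct token distributions in the output, so uniform intervention moves helpful and harmful directions together and lowers control quality.
🛠️ Method: Proposes AUSteer, which decomposes and steers activations at the AU level: it identifies discriminative units by computing activation momenta on contrastive samples and assigns steering strengths adaptively based on the input.
📊 Data & Experiments: Comprehensive experiments on multiple LLMs and tasks show that AUSteer outperforms advanced baselines while steering considerably fewer activations.
⭐ Contributions: Provides a finer-grained activation intervention mechanism that improves both efficiency and effectiveness, offering a new approach to precise control of LLM behavior.
Abstract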
Activation steering has emerged as a cost-effective paradigm for modifying large language model (LLM) behaviors. Existing methods typically intervene at the block level, steering the bundled activations of selected attention heads, feedforward networks, or residual streams. However, we reveal that block-level activations are inherently heterogeneous, entangling beneficial, irrelevant, and harmful features, thereby rendering block-level steering coarse, inefficient, and intrusive. To investigate the root cause, we decompose block activations into fine-grained atomic unit (AU)–level activations, where each AU-level activation corresponds to a single dimension of the block activation, and each AU denotes a slice of the block weight matrix. Steering an AU-level activation is thus equivalent to steering its associated AU. Our theoretical and empirical analysis show that heterogeneity arises because different AUs or dimensions control distinct token distributions in LLM outputs. Hence, block-level steering inevitably moves helpful and harmful token directions together, which reduces efficiency. Restricting intervention to beneficial AUs yields more precise and effective steering. Building on this insight, we propose AUSteer, a simple and efficient method that operates at a finer granularity of the AU level. AUSteer first identifies discriminative AUs globally by computing activation momenta on contrastive samples. It then assigns adaptive steering strengths tailored to diverse inputs and selected AU activations. Comprehensive experiments on multiple LLMs and tasks show that AUSteer consistently surpasses advanced baselines while steering considerably fewer activations, demonstrating that steering less achieves more.
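Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#User Language Models #User Simulation #Interactive Evaluation #Post-Training
TL;DR: We introduce and evaluate user language models - models that are post-trained to simulate users that interact with assistants.
🎯 Motivation: Language models are usually trained to be helpful assistants, whereas real user utterances are imperfect. This mismatch can make evaluations of model performance less realistic.
❓ Problem: Proposes and evaluates User Language Models (User LMs), models built specifically to simulate user behavior, to improve on existing user simulation methods and make interactive evaluation more accurate.
🔍 Analysis: Existing approaches prompt assistant LMs to act as users, yet better assistants turn out to be worse user simulators, which indicates that the two roles have different requirements.
🛠️ Method: Post-trains models to match how users behave in multi-turn conversations, such as phrasing requests in individual ways and refining them on the fly, making simulation more realistic.
📊 Data & Experiments: Simulates coding and math conversations with User LMs to obtain more realistic user behavior, and measures how assistant performance changes in that environment.
⭐ Contributions: Proposes User LMs and validates their effectiveness, achieves more realistic user simulation, and exposes limitations of assistant models in realistic interactive settings.
Abstract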
Conversations with LMs involve two participants: a human user leading the conversation, and an LM assistant responding to the user's request. To satisfy this specific role, LMs are post-trained to be helpful assistants -- optimized to produce exhaustive and well-structured responses, free of ambiguity and grammar errors. User utterances, on the other hand, are rarely perfected, with each user phrasing requests in unique ways, sometimes putting in partial effort at each turn and refining on the fly. To evaluate LM performance in realistic settings, prior work simulated users in multi-turn conversations, often by prompting an LM originally trained to be a helpful assistant to act as a user. However, we show that assistant LMs make for poor user simulators, with the surprising finding that better assistants yield worse simulators. Instead, we introduce purpose-built User Language Models (User LMs) - models post-trained to simulate human users in multi-turn conversations. Through various evaluations, we show how User LMs align better with human behavior and achieve better simulation robustness than existing simulation methods. When leveraging User LMs to simulate coding and math conversations, the performance of a strong assistant (GPT-4o) drops from 74.6% to 57.4%, confirming that more realistic simulation environments lead to assistant struggles as they fail to cope with the nuances of users in multi-turn setups.
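Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#language model #post-training #fluency #low-resource languages #RLAIF
TL;DR: We propose a preference-optimization method for lower-resource languages that results in fluent language models even when aligned by disfluent reward models.
🎯 Motivation: Lower-resource languages lack both data written by native speakers and instruction-tuned generative models, which makes it hard to obtain fluent aligned language models.
❓ Problem: Proposes a post-training method that produces a fluent, preference-aligned language model without any instruction-tuning data in the target language.
🔍 Analysis: Prior work mostly addresses English and Chinese, and lower-resource languages struggle to produce fluent text when the reward model is of low quality.
🛠️ Method: Uses an on-policy preference optimization method and compares it with supervised finetuning on machine-translated data and with multilingual finetuning.
📊 Data & Experiments: Runs a case study on Norwegian Bokmål and evaluates fluency through native-speaker assessments.
⭐ Contributions: Shows that the on-policy aspect is crucial: the method produces fluent text and outperforms the alternatives without relying on hard-to-obtain data.
Abstract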
We propose a post-training method for lower-resource languages that preserves the fluency of language models even when aligned by disfluent reward models. Preference optimization is now a well-researched topic, but previous work has mostly addressed models for English and Chinese. Lower-resource languages lack both datasets written by native speakers and instruction-tuned language models capable of generating fluent synthetic data. To address this, we focus on developing a fluent preference-aligned language model without any instruction-tuning data in the target language. Our approach uses an on-policy training method, which we compare with two common alternatives: supervised finetuning on machine-translated data and multilingual finetuning. We conduct a case study on Norwegian Bokmål and evaluate fluency through native-speaker assessments. The results show that the on-policy aspect is crucial and outperforms the alternatives without relying on any hard-to-obtain data.
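Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Automatic evaluation #LLM-as-judge #multi-task evaluators #step-level evaluation #verifiers
TL;DR: We train two foundational automatic evaluators at large data scales, demonstrating state-of-the-art performance.
🎯 Motivation: Demand for evaluating generative tasks is growing fast, but recent evaluator work concentrates on new methodology such as reinforcement learning and neglects large-scale, data-driven development.
❓ Problem: To meet the need for multi-task evaluation in reasoning domains, proposes training automatic evaluators through data scaling, moving beyond task-specific evaluators.
🔍 Analysis: Models trained on large-scale data (2.5M samples) outperform task-specialized models on several reasoning evaluation tasks and perform strongly in real-world use.
🛠️ Method: Trains 8B and 20B parameter evaluators with iterative rejection-sampling supervised finetuning, covering five evaluation tasks and multiple domains.
📊 Data & Experiments: Builds a large multi-task evaluation dataset and validates on benchmarks and real-world tasks; FARE-20B performs strongly as a reranker on MATH, and FARE improves downstream results as a verifier in RL training.
⭐ Contributions: Introduces the FARE evaluator family, demonstrates that data scaling substantially improves evaluator performance, and sets a new standard for open-source evaluators.
Abstract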
Finetuning specialized generative evaluators has emerged as a popular paradigm to meet the increasing demand for scalable evaluation during both training and test time. However, recent work has largely focused on applying new methodology, such as reinforcement learning (RL), to training evaluators, shying away from large-scale, data-driven development. In this work, we focus on data scaling, curating a set of 2.5M samples spanning five unique evaluation tasks (pairwise, step-level, reference-free and reference-based verification, and single rating) and multiple domains focused on reasoning evaluation. With our data, we train Foundational Automatic Reasoning Evaluators (FARE), a family of 8B and 20B (with 3.6B active) parameter evaluators, with a simple iterative rejection-sampling supervised finetuning (SFT) approach. FARE-8B challenges larger specialized RL-trained evaluators and FARE-20B sets the new standard for open-source evaluators, surpassing specialized 70B+ evaluators. Beyond static benchmarks, we evaluate FARE in real-world tasks: as inference-time rerankers, FARE-20B achieves near-oracle performance on MATH. As verifiers in RL training, FARE improves the downstream RL-trained model performance by up to 14.1% vs. string-matching verifiers. When initialized from FARE, a continually-finetuned FARE-Code outperforms gpt-oss-20B by 65% on evaluating test-case quality.
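Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Reinforcement Learning #Large Language Model #Reasoning #Exploration
TL;DR: We show that once a model has acquired the necessary atomic skills for a task, RL enables the composition of these skills into more complex capabilities when properly incentivized.
🎯 Motivation: Whether reinforcement learning teaches LLMs genuinely new skills or merely activates existing ones is a core point of debate about the role of RL in post-training.
❓ Problem: Tests whether RL lets an LLM acquire new complex skills by composing basic skills it already has, and whether that ability transfers to different tasks.
🔍 Analysis: Experiments show that once a model has the basic skills, RL lets it learn compositions of them for complex function transformations, and the ability generalizes to harder, unseen tasks.
🛠️ Method: Builds a synthetic framework in which a skill is the ability to infer the output of a string transformation function, uses RL to incentivize learning unseen function compositions, and analyzes how reasoning behavior changes.
📊 Data & Experiments: Designs synthetic data for function composition tasks to avoid data contamination and control task complexity precisely, then tests skill transfer and compares against next-token prediction training on the same data.
⭐ Contributions: Provides evidence that RL confers new compositional skills, characterizes how it changes reasoning behavior, and argues for building base models with the necessary basic skills and then using RL to acquire skills that generalize to complex tasks.
Abstract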
Does reinforcement learning (RL) teach large language models (LLMs) genuinely new skills, or does it merely activate existing ones? This question lies at the core of ongoing debates about the role of RL in LLM post-training. On one side, strong empirical results can be achieved with RL alone even without preceding supervised finetuning; on the other, critics argue that RL contributes little beyond reweighting existing reasoning strategies. This work provides concrete evidence that LLMs can acquire genuinely new skills during RL by composing existing ones, mirroring one of the central mechanisms by which humans acquire new cognitive skills (Anderson, 1982). To mitigate data contamination and other confounding factors and to allow precise control over task complexity, we develop a synthetic framework for our investigation. Specifically, we define a skill as the ability to infer the output of a string transformation function $f(x)$ given $x$. Once an LLM has already learned $f$ and $g$ prior to RL, our experiments reveal that RL enables it to learn unseen compositions of them $h(x)=g(f(x))$. Further, this compositional ability generalizes to more difficult problems such as compositions of $>2$ functions unseen during training. Our experiments provide surprising evidence that this compositional ability, acquired on the source task, transfers to a different target task. This transfer occurs even though the model has never trained with RL on any compositional problems in the target task, as long as it has acquired the target task's atomic skills prior to RL on the source task. Our qualitative analysis shows that RL fundamentally changes the reasoning behaviors of the models. In contrast, neither finding is observed in next-token prediction training with the same data. Our systematic experiments provide fresh insights into the learning behaviors of widely-used post-training approaches for LLMs.
They suggest the value of building base models with the necessary basic skills, followed by RL with appropriate incentivization to acquire more advanced skills that generalize better to complex and out-of-domain problems.
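The task setup in the abstract, atomic string transformations and their compositions $h(x)=g(f(x))$ scored by a verifiable reward, can be sketched as follows. The two transformations (`reverse`, `rotate_left`) and the exact-match reward are hypothetical stand-ins; the abstract does not say which functions the paper uses.

```python
from functools import reduce


def reverse(x):
    """Atomic skill f (hypothetical): reverse the string."""
    return x[::-1]


def rotate_left(x):
    """Atomic skill g (hypothetical): move the first character to the end."""
    return x[1:] + x[:1]


def compose(*fns):
    """Build h(x) that applies fns left to right, so compose(f, g)(x) == g(f(x))."""
    return lambda x: reduce(lambda acc, fn: fn(acc), fns, x)


def verifiable_reward(model_answer, fns, x):
    """Binary reward: 1.0 if the answer equals the true composed output, else 0.0."""
    return 1.0 if model_answer == compose(*fns)(x) else 0.0
```

Training on two-function compositions and then evaluating on `compose(reverse, rotate_left, reverse)` mirrors the abstract's test of generalization to compositions of more than two functions.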
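Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Large Language Model #Model Reasoning #Multimodal Model
🎯 Motivation: Reinforcement learning can directly enhance LLM reasoning without heavy reliance on supervised fine-tuning. Existing methods such as GRPO have complex pipelines with computational overhead and bias, so this work looks for a simpler and more efficient RL framework for reasoning.
❓ Problem: Conventional policy gradient methods depend on elaborate loss functions and reference models, which can introduce bias and add compute cost. The paper proposes a simplified scheme that removes the critic model, the reference model, and the KL divergence constraint and optimizes the original RL objective directly, reducing training complexity and improving performance.
🔍 Analysis: Existing RL methods such as GRPO rely on auxiliary mechanisms (e.g., advantage estimation, KL penalties) to keep training stable, but these designs add bias and compute. Experiments show that simplifying these components improves both efficiency and performance on reasoning tasks.
🛠️ Method: Proposes Group Policy Gradient (GPG), a minimalist RL framework that optimizes the original RL objective directly, with no surrogate loss, critic, or reference model. It addresses bias in advantage and gradient estimation and thereby simplifies training.
📊 Data & Experiments: Extensive experiments validate GPG on unimodal and multimodal tasks. As illustrated in Figure 1 of the paper, it outperforms GRPO across benchmarks and reduces compute cost, without relying on auxiliary techniques or extra tuning.
⭐ Contributions: Proposes GPG, a simple and efficient RL baseline for improving model reasoning. By removing complex components and optimizing the original objective directly, it surpasses existing methods in performance and compute efficiency and offers a new baseline for RL research on reasoning.
Abstract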
Reinforcement Learning (RL) can directly enhance the reasoning capabilities of large language models without extensive reliance on Supervised Fine-Tuning (SFT). In this work, we revisit the traditional Policy Gradient (PG) mechanism and propose a minimalist RL approach termed Group Policy Gradient (GPG). Unlike conventional methods, GPG directly optimizes the original RL objective, thus obviating the need for surrogate loss functions. By eliminating the critic and reference models, avoiding KL divergence constraints, and addressing the advantage and gradient estimation bias, our approach significantly simplifies the training process compared to Group Relative Policy Optimization (GRPO). Our approach achieves superior performance without relying on auxiliary techniques or adjustments. As illustrated in Figure 1, extensive experiments demonstrate that our method not only reduces computational costs but also consistently outperforms GRPO across various unimodal and multimodal tasks.
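For reference, the "original RL objective" and its policy gradient can be written in standard textbook notation. The symbols below ($\mathcal{D}$, $R$, $b$) are not taken from the listing, and the exact estimator and normalization used by GPG are defined in the paper, not here.

```latex
J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)} \big[ R(x, y) \big],
\qquad
\nabla_\theta J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
  \big[ \nabla_\theta \log \pi_\theta(y \mid x) \, \big( R(x, y) - b(x) \big) \big]
```

Here $b(x)$ is any baseline that does not depend on the sampled response $y$. Nothing in this expression requires a learned critic, a reference model, or a KL term, which is the simplification the abstract describes.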
#Large Language Models #Reinforcement Learning with Verifiable Rewards #Generalization #Causal Reasoning
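🎯 Motivation: Studies whether RLVR improves the generalization of large language models on complex causal reasoning tasks, and under what conditions it does so.
❓ Problem: Analyzes how RLVR generalizes on probabilistic inference over causal graphical models and compares it with supervised fine-tuning.
🔍 Analysis: RLVR improves reasoning more than supervised fine-tuning for some model sizes and training conditions, but the gain depends strongly on the model's initial reasoning competence.
🛠️ Method: Builds a dataset of causal graphs and queries at several difficulty levels, fine-tunes language models of different sizes with RLVR and SFT, and compares generalization across training conditions.
📊 Data and experiments: The dataset covers associational, interventional, and counterfactual queries; experiments span models from 3B to 32B and systematically vary the training method and data mix.
⭐ Contributions: Characterizes how RLVR improves causal reasoning, shows that the benefit depends on initial competence, and offers guidance for improving generalization on complex reasoning tasks.
Abstract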
Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for post-training large language models (LLMs) on complex reasoning tasks. Yet, the conditions under which RLVR yields robust generalization remain underexplored. This paper provides an empirical study of RLVR generalization in the setting of probabilistic inference over causal graphical models.
This setting offers two natural axes along which to examine generalization: (i) the level of the probabilistic query---associational, interventional, or counterfactual---and (ii) the structural complexity of the query, measured by the size of its relevant subgraph. We construct a dataset of causal graphs and queries spanning these difficulty axes and fine-tune Qwen-2.5-Instruct models using RLVR or supervised fine-tuning (SFT). We vary both the model scale (3B-32B) and the query level included in training. We find that RLVR yields stronger within-level and across-level generalization than SFT, but only for specific combinations of model size and training query level. Further analysis shows that RLVR's effectiveness depends on the model's initial reasoning competence.
With sufficient initial competence, RLVR improves an LLM's marginalization strategy and reduces errors in intermediate probability calculations, producing substantial accuracy gains, particularly on more complex queries. These results show that RLVR can improve specific causal reasoning subskills, with its benefits emerging only when the model has sufficient initial competence. Our code and data are available at https://github.com/zhichul/rlcausal.
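The difference between an associational and an interventional query, and the marginalization step the abstract refers to, can be illustrated on a three-node graph. This is a minimal sketch: the graph (Z -> X, Z -> Y, X -> Y) and all probabilities are made-up numbers, not data from the paper.

```python
# Binary confounded graph: Z -> X, Z -> Y, X -> Y (all numbers are illustrative).
p_z = {0: 0.5, 1: 0.5}                      # P(Z=z)
p_x1_given_z = {0: 0.2, 1: 0.8}             # P(X=1 | Z=z)
p_y1_given_xz = {                           # P(Y=1 | X=x, Z=z), keyed by (x, z)
    (0, 0): 0.1, (0, 1): 0.5,
    (1, 0): 0.4, (1, 1): 0.9,
}

def p_x_given_z(x: int, z: int) -> float:
    return p_x1_given_z[z] if x == 1 else 1.0 - p_x1_given_z[z]

def associational(x: int) -> float:
    """P(Y=1 | X=x): marginalize Z under the posterior P(Z | X=x)."""
    joint = {z: p_z[z] * p_x_given_z(x, z) for z in p_z}
    norm = sum(joint.values())
    return sum(joint[z] / norm * p_y1_given_xz[(x, z)] for z in p_z)

def interventional(x: int) -> float:
    """P(Y=1 | do(X=x)): back-door adjustment, marginalize Z under the prior P(Z)."""
    return sum(p_z[z] * p_y1_given_xz[(x, z)] for z in p_z)
```

With these numbers the two queries disagree (0.80 versus 0.65 for `x = 1`), because conditioning on X shifts the distribution over the confounder Z while intervening on X does not.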
#LLMs #SFT #Post-train
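🎯 Motivation: RL is a promising post-training paradigm for LLMs, but its effectiveness depends heavily on the base model. It therefore matters how a lightweight supervised fine-tuning (SFT) phase can prepare a better starting point for RL.
❓ Problem: SFT suffers from distributional forgetting: the model drifts away from the base distribution, which hurts subsequent RL. A new way to choose the best starting point for RL is needed.
🔍 Analysis: The SFT checkpoint with the best evaluation score does not maximize RL performance, and distribution shift sets in before conventional overfitting. Diversity metrics such as entropy and self-BLEU are better early-stopping criteria than standard performance metrics.
🛠️ Method: Proposes Adaptive Early-Stop Loss (AESL), which dynamically regulates the cold-start process to balance acquiring new patterns against preserving the base distribution, with control at both the token and subsequence levels.
📊 Data and experiments: On mathematical reasoning benchmarks, diversity-based early stopping outperforms performance-based early stopping, and AESL further improves preparation for RL.
⭐ Contributions: Identifies distributional forgetting in SFT and proposes diversity-based early stopping; introduces the lightweight AESL method for better cold starts; releases code.
Abstract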
Reinforcement learning (RL) has emerged as a powerful post-training paradigm for large language models (LLMs), yet its effectiveness varies significantly across base models. While incorporating a pre-RL supervised fine-tuning (SFT) phase can enhance RL training, key questions remain: how long should the SFT cold-start phase last, and is the SFT objective truly aligned with the requirements for effective RL preparation?
In our analysis of cold-start dynamics, we uncover a key limitation: the SFT checkpoint with the highest evaluation performance often fails to maximize RL potential due to distributional forgetting, a phenomenon where the model drifts excessively away from the base model’s distribution even before traditional overfitting occurs. We identify diversity metrics, such as the entropy and self-BLEU, as more reliable early-stopping criteria than the standard performance-based checkpoint selection. Our findings show that SFT checkpoints with peak diversity consistently lead to superior post-RL results. Building on these insights, we introduce Adaptive Early-Stop Loss (AESL), a lightweight and dynamic cold-start method that balances the acquisition of new patterns with the preservation of the base model's distribution. AESL operates at both the token and subsequence levels, providing finer-grained control over the cold-start process. Experimental results on mathematical reasoning benchmarks demonstrate that diversity-based early stopping surpasses traditional performance-based SFT, while AESL further enhances RL preparation. By steering LLMs toward better initialization points for RL, AESL consistently achieves superior final performance compared to existing SFT and cold-start strategies. The code is publicly available at https://github.com/LXXXXR/AESL.
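Diversity-based checkpoint selection can be sketched as follows, using mean next-token entropy as the diversity score. This is an illustration only: the function names and the dictionary-of-steps interface are assumptions, self-BLEU is omitted, and the sketch is not the AESL loss itself.

```python
import math
from typing import Dict, List

def mean_token_entropy(dists: List[List[float]]) -> float:
    """Average Shannon entropy (in nats) over a list of next-token distributions."""
    def entropy(p: List[float]) -> float:
        return -sum(q * math.log(q) for q in p if q > 0.0)
    return sum(entropy(p) for p in dists) / len(dists)

def pick_checkpoint(diversity_by_step: Dict[int, float]) -> int:
    """Return the SFT step whose diversity score is highest (hypothetical helper)."""
    return max(diversity_by_step, key=diversity_by_step.get)

# Example: diversity peaks at step 200 and then declines.
scores = {100: 1.2, 200: 1.5, 300: 0.9}
best_step = pick_checkpoint(scores)
```

Note that self-BLEU runs in the opposite direction (lower means more diverse), so it would need to be negated before being used with `pick_checkpoint`.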
#Large Language Models #Reinforcement Learning #LLM Post-Training #Off-Policy RL #GRPO
TL;DR:We present a native off-policy interpretation for group-relative REINFORCE, and its broad implications.
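🎯 Motivation: Off-policy reinforcement learning (RL) for LLMs is attracting interest because of practical constraints, the complexity of LLM-RL systems, and the need for methodological innovation. Yet classic REINFORCE and modern variants such as GRPO are usually regarded as on-policy algorithms that tolerate only limited off-policyness, so their off-policy properties are poorly understood.
❓ Problem: Shows that GRPO-style algorithms admit a native off-policy interpretation, derives principles for adapting them to off-policy settings, and dispels common misconceptions about the roles of importance sampling and clipping.
🔍 Analysis: A first-principles derivation shows that group-relative REINFORCE (including GRPO), which uses the within-group mean reward as the baseline for advantage calculation, has an off-policy interpretation without assuming a specific training data distribution. This gives off-policy use a theoretical basis.
🛠️ Method: States two general principles, regularizing policy updates and actively shaping the data distribution, and reinterprets Online Policy Mirror Descent and Asymmetric REINFORCE as regularized forms of the REINFORCE loss.
📊 Data and experiments: Extensive empirical studies validate the resulting insights; code is available at https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/rec_gsm8k.
⭐ Contributions: Reinterprets group-relative REINFORCE as natively off-policy, provides principles and a unifying view of existing algorithms, opens directions for principled algorithm design in off-policy RL for LLMs, and gives a theoretical justification for seemingly heuristic data-weighting strategies.
Abstract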
Off-policy reinforcement learning (RL) for large language models (LLMs) is attracting growing interest, driven by practical constraints in real-world applications, the complexity of LLM-RL infrastructure, and the need for further innovations of RL methodologies. While classic REINFORCE and its modern variants like Group Relative Policy Optimization (GRPO) are typically regarded as on-policy algorithms with limited tolerance of off-policyness, we present in this work a first-principles derivation for group-relative REINFORCE — a REINFORCE variant that uses the within-group mean reward as the baseline for advantage calculation — without assuming a specific training data distribution, showing that it admits a native off-policy interpretation. This perspective yields two general principles for adapting REINFORCE to truly off-policy settings: regularizing policy updates, and actively shaping the data distribution. Our analysis demystifies some myths about the roles of importance sampling and clipping in GRPO, unifies and reinterprets two recent algorithms — Online Policy Mirror Descent and Asymmetric REINFORCE — as regularized forms of the REINFORCE loss, and offers theoretical justification for seemingly heuristic data-weighting strategies. Our findings lead to actionable insights that are validated with extensive empirical studies, and open up new opportunities for principled algorithm design in off-policy RL for LLMs. Source code for this work is available at https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/rec_gsm8k.
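The group-relative baseline described in the abstract, where each response's advantage is its reward minus the within-group mean, can be sketched in a few lines. This is a simplified illustration with plain floats; in real training the log-probabilities would be differentiable tensors, and the regularization and data-shaping terms the paper discusses are not modeled.

```python
from typing import List

def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Advantage of each response = its reward minus the group's mean reward."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def reinforce_loss(logprobs: List[float], rewards: List[float]) -> float:
    """Surrogate loss -(1/K) * sum_i A_i * log pi(y_i | x) for one prompt's group."""
    adv = group_relative_advantages(rewards)
    return -sum(a * lp for a, lp in zip(adv, logprobs)) / len(rewards)

# Four sampled responses to one prompt, two of them rewarded.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

One consequence visible in the sketch: when every response in a group receives the same reward, all advantages are zero and the group contributes no gradient.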
#Large Language Models #Hallucination Detection
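🎯 Motivation: Hallucinations reduce the reliability of large language models in high-stakes domains such as healthcare, law, and science; a systematic approach that covers hallucinations from different sources is needed.
❓ Problem: Existing methods handle only one type of hallucination, data-driven or reasoning-driven, and rely on task-specific heuristics, so they adapt poorly to complex and varied scenarios.
🔍 Analysis: Introduces a unified theoretical framework, the Hallucination Risk Bound, which formally decomposes hallucination risk into a data-driven component tied to training-time mismatches and a reasoning-driven component tied to inference-time instabilities.
🛠️ Method: Proposes **HalluGuard**, a score based on the neural tangent kernel (NTK) that uses the NTK's induced geometry and captured representations to jointly detect data-driven and reasoning-driven hallucinations.
📊 Data and experiments: Across 10 diverse benchmarks, 11 baselines, and 9 popular LLM backbones, **HalluGuard** achieves state-of-the-art performance in detecting several forms of hallucination.
⭐ Contributions: Provides a unified theoretical framework for hallucination risk, designs the **HalluGuard** detection method, and open-sources the implementation.
Abstract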
The reliability of Large Language Models (LLMs) in high-stakes domains such as healthcare, law, and scientific discovery is often compromised by hallucinations. These failures typically stem from two sources: *data-driven hallucinations* and *reasoning-driven hallucinations*. However, existing detection methods usually address only one source and rely on task-specific heuristics, limiting their generalization to complex scenarios. To overcome these limitations, we introduce the *Hallucination Risk Bound*, a unified theoretical framework that formally decomposes hallucination risk into data-driven and reasoning-driven components, linked respectively to training-time mismatches and inference-time instabilities. This provides a principled foundation for analyzing how hallucinations emerge and evolve. Building on this foundation, we introduce **HalluGuard**, an NTK-based score that leverages the induced geometry and captured representations of the NTK to jointly identify data-driven and reasoning-driven hallucinations. We evaluate **HalluGuard** on 10 diverse benchmarks, 11 competitive baselines, and 9 popular LLM backbones, consistently achieving state-of-the-art performance in detecting diverse forms of LLM hallucinations. We open-source our proposed **HalluGuard** model at https://github.com/Susan571/HalluGuard-ICLR2026.
#LLMs #hallucination #abstention
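🎯 Motivation: Large language models (LLMs) hallucinate: they may produce wrong answers when they lack the knowledge or capability. A way to make them more reliable is needed.
❓ Problem: Uses capability-aligned fine-tuning so that an LLM abstains, partially or fully, when it is not confident, reducing hallucination and improving the correctness of what it does generate.
🔍 Analysis: A generated answer may contain incorrect fragments. Splitting a response into factual fragments and checking each against ground truth delineates what the model can and cannot reliably produce.
🛠️ Method: Proposes HALT, which generates capability-aligned post-training data by removing unreliable fragments or replacing them with "Unsure from Here", with a tunable threshold that trades off response completeness against correctness.
📊 Data and experiments: Four open-source models are fine-tuned for biography writing, mathematics, coding, and medicine under three threshold settings; HALT raises the mean correctness of response fragments and the F1 score.
⭐ Contributions: With HALT tuned for highest correctness, a Llama3-70B model's correctness rises from 51% to 87% while retaining 53% of response completeness, an improvement over the baselines and a step toward more reliable LLMs.
Abstract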
Large Language Models (LLMs) currently respond to every prompt. However, they can produce incorrect answers when they lack knowledge or capability -- a problem known as hallucination. We instead propose post-training an LLM to generate content only when confident in its correctness and to otherwise (partially) abstain. Specifically, our method, HALT, produces capability-aligned post-training data that encodes what the model can and cannot reliably generate. We generate this data by splitting responses of the pretrained LLM into factual fragments (atomic statements or reasoning steps), and use ground truth information to identify incorrect fragments. We achieve capability-aligned finetuning responses by either removing incorrect fragments or replacing them with "Unsure from Here" -- according to a tunable threshold that allows practitioners to trade off response completeness and mean correctness of the response's fragments. We finetune four open-source models for biography writing, mathematics, coding, and medicine with HALT for three different trade-off thresholds. HALT effectively trades off response completeness for correctness, increasing the mean correctness of response fragments by 15% on average, while resulting in a 4% improvement in the F1 score (mean of completeness and correctness of the response) compared to the relevant baselines. By tuning HALT for highest correctness, we train a single reliable Llama3-70B model with correctness increased from 51% to 87% across all four domains while maintaining 53% of the response completeness achieved with standard finetuning.
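One way to picture the construction of a capability-aligned target is sketched below. It assumes fragments arrive already labeled as correct or incorrect, and it implements only one plausible variant: truncating at the first incorrect fragment and appending the abstention marker. The tunable threshold from the abstract is not modeled, and the function name is hypothetical.

```python
from typing import List, Tuple

UNSURE = "Unsure from Here"

def halt_target(fragments: List[Tuple[str, bool]]) -> List[str]:
    """Keep fragments up to the first incorrect one, then abstain.

    Each fragment is (text, is_correct); correctness labels are assumed to
    come from a ground-truth check that is outside this sketch.
    """
    out: List[str] = []
    for text, is_correct in fragments:
        if not is_correct:
            out.append(UNSURE)
            break
        out.append(text)
    return out

# A three-step solution whose second step is wrong.
target = halt_target([("step 1", True), ("step 2", False), ("step 3", True)])
```

Fine-tuning on targets of this shape is what would teach the model to stop and abstain at the point where its own generations tend to go wrong.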
#LLM Fine-tuning (SFT, DPO, SimPO) #Data selection #Holdout loss #In-context learning #Gradient reweighting
TL;DR:We present a holdout-loss-based data selection framework that leverages in-context learning for efficient computation.
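🎯 Motivation: Fine-tuning large pretrained language models is a common way to align them with human preferences, but noisy or off-target data weakens the supervision. A systematic, efficient data-selection method is needed.
❓ Problem: Existing data-selection methods rely on heuristics or expensive retraining and are not resource-efficient.
🔍 Analysis: Small, high-quality datasets often match much larger ones in performance, while noisy data noticeably hurts the efficiency of gradient updates and the quality of alignment.
🛠️ Method: Proposes a holdout-loss-based framework that uses in-context learning for data selection and gradient reweighting, with no reference model and no additional fine-tuning.
📊 Data and experiments: Experiments across SFT, DPO, and SimPO and across several models and datasets show consistent alignment gains at minimal overhead; sensitivity and limitations are also analyzed.
⭐ Contributions: Defines a holdout-loss criterion and the ICA score for dynamic per-example weighting, achieving resource-efficient training improvements, and notes limitations in rapidly drifting on-policy settings as a direction for future work.
Abstract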
Fine-tuning large pretrained language models is a common approach for aligning them with human preferences, but noisy or off-target examples can dilute supervision. While small, well-chosen datasets often match the performance of much larger ones, systematic and efficient ways to identify high-value training data remain underexplored. Many current methods rely on heuristics or expensive retraining. We present a principled, resource-efficient framework for data selection and reweighting. At its core is an In-Context Approximation (ICA) that estimates the holdout loss a model would incur after training on a candidate example by conditioning on a small, curated holdout set in context. ICA requires no reference model and no additional finetuning.
We define the resulting estimate as the ICA score, and derive per-example weights that dynamically reweight gradient updates as model parameters evolve. Across SFT, DPO, and SimPO, and over diverse backbones and datasets, ICA-based reweighting consistently improves model alignment with minimal overhead. We analyze sensitivity to score update frequency and the number of in-context holdout examples. We also discuss limitations in rapidly drifting on-policy settings, highlighting directions for future work.
#Large Language Models #Unsupervised Reward #Reinforcement Learning #Reasoning
TL;DR:We revisit unsupervised RLVR through intrinsic and external rewards, unifying existing methods, analyzing their impact on confidence and failure modes, and discussing their potential applications.
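🎯 Motivation: Aims to move past the supervision bottleneck by training large language models with label-free rewards, and to clarify the potential and limits of intrinsic signals, which remain poorly understood.
❓ Problem: Examines how intrinsic and external rewards can be used in unsupervised RL, how misalignment between a model's initial confidence and its correctness leads to failure, and how far these methods can scale.
🔍 Analysis: Under intrinsic rewards, performance follows a rise-then-fall pattern; the timing of the collapse is set by the model's prior rather than by engineering choices. External rewards show preliminary signs of escaping the confidence-correctness ceiling.
🛠️ Method: Classifies reward mechanisms as intrinsic or external within a unified theoretical framework, defines the Model Collapse Step to measure how the model prior affects training, and explores external verification-based rewards.
📊 Data and experiments: Systematic experiments on several datasets confirm the limits of intrinsic rewards and the early promise of external reward methods; code is released.
⭐ Contributions: Establishes the theoretical boundaries and performance pattern of intrinsic rewards, defines a new indicator to guide RLVR training decisions, and provides a unified framework and experimental analysis to inform scalable alternatives.
Abstract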
Unsupervised reinforcement learning with verifiable rewards (URLVR) offers a pathway to scale LLM training beyond the supervision bottleneck by deriving rewards without ground truth labels.
Recent works leverage model intrinsic signals, showing promising early gains, yet their potential and limitations remain unclear.
In this work, we revisit URLVR and provide a comprehensive analysis spanning taxonomy, theory and extensive experiments.
We first classify URLVR methods into intrinsic versus external based on reward sources, then establish a unified theoretical framework revealing that all intrinsic methods converge toward sharpening the model's initial distribution.
This sharpening mechanism succeeds when initial confidence aligns with correctness but fails catastrophically when misaligned.
Through systematic experiments, we show intrinsic rewards consistently follow a rise-then-fall pattern across methods, with collapse timing determined by model prior rather than engineering choices.
Despite these scaling limits, we find intrinsic rewards remain valuable in test-time training on small datasets, and propose Model Collapse Step to measure model prior, serving as a practical indicator for RL trainability.
Finally, we explore external reward methods that ground verification in computational asymmetries, showing preliminary evidence they may escape the confidence-correctness ceiling.
Our findings chart boundaries for intrinsic URLVR while motivating paths toward scalable alternatives.
Code is available at https://github.com/PRIME-RL/TTRL.
#Hybrid rewards for reinforcement learning
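🎯 Motivation: In current RL practice, sparse verifier-based rewards are all-or-nothing: partially correct or alternative answers receive too little credit, which limits what the model can learn.
❓ Problem: Proposes a hybrid reward framework that combines sparse verifier signals with continuous reward-model scores, keeping the verifier's correctness guarantee while letting the reward model distinguish quality more finely.
🔍 Analysis: Sparse rewards are stable but miss fine-grained differences in task quality; a hybrid design balances the reliability of sparse signals with the nuance of continuous ones.
🛠️ Method: Introduces HERO, which uses stratified normalization to bound reward-model scores within verifier-defined groups and variance-aware weighting to emphasize hard prompts where dense signals matter most.
📊 Data and experiments: On several mathematical reasoning benchmarks, HERO outperforms reward-model-only and verifier-only baselines, with gains on both verifiable and hard-to-verify tasks.
⭐ Contributions: Proposes the HERO hybrid reward optimization framework, demonstrates its stability and gains on complex reasoning, and shows that a hybrid design can offset the respective limitations of sparse and continuous rewards.
Abstract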
Post-training for reasoning in large language models has increasingly relied on verifiable rewards: deterministic checkers that provide $0$–$1$ correctness signals. While reliable, such binary feedback is brittle—many tasks admit partially correct or alternative answers that verifiers under-credit, and the resulting all-or-nothing supervision limits learning. Reward models offer richer, continuous feedback, which can serve as a complementary supervisory signal to verifiers. We introduce HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates sparse verifier signals with dense reward model scores in a structured way. HERO employs stratified normalization to bound reward-model scores within verifier-defined groups, preserving correctness while refining quality distinctions, and variance-aware weighting to emphasize challenging prompts where dense signals matter most. Across diverse mathematical reasoning benchmarks, HERO consistently outperforms reward model-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks. Our results show that hybrid reward design retains the stability of verifiers while leveraging the nuance of reward models to advance reasoning.
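The idea of bounding reward-model scores within verifier-defined groups can be sketched as a per-group min-max rescaling. The specific bands used here (correct responses mapped into `[1 - eps, 1 + eps]`, incorrect ones into `[-eps, +eps]`), the parameter `eps`, and the function name are assumptions for illustration, not the paper's formula; variance-aware weighting is not shown.

```python
from typing import List

def hero_rewards(verifier: List[int], rm_scores: List[float],
                 eps: float = 0.1) -> List[float]:
    """Min-max rescale reward-model scores inside each verifier group.

    Responses with verifier == 1 land in [1 - eps, 1 + eps]; responses with
    verifier == 0 land in [-eps, +eps]. Assumes 0 <= eps < 0.5 so the two
    bands cannot overlap and correctness ordering is preserved.
    """
    out = [0.0] * len(rm_scores)
    for label in (0, 1):
        idx = [i for i, v in enumerate(verifier) if v == label]
        if not idx:
            continue
        lo = min(rm_scores[i] for i in idx)
        hi = max(rm_scores[i] for i in idx)
        for i in idx:
            # Position of this score within its group, in [0, 1].
            unit = 0.5 if hi == lo else (rm_scores[i] - lo) / (hi - lo)
            out[i] = label + eps * (2.0 * unit - 1.0)
    return out

# The reward model rates an incorrect response (0.9) above a correct one (0.2).
rewards = hero_rewards([1, 1, 0, 0], [0.2, 0.8, 0.9, 0.1])
```

Even though the reward model prefers one of the incorrect responses, every verifier-approved response still ends up with a higher final reward, while responses within each group remain ranked by reward-model score.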
#In Context Learning #ICL #Supervised Fine Tuning #SFT #Adaptation #Self-Distillation #Distillation
TL;DR:Use activations produced during ICL to align SFT models' functional behavior with ICL. This results in better accuracy and calibration in SFT models.
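🎯 Motivation: Asks whether ICL's internal computations can be used to improve SFT models.
❓ Problem: How to use ICL's activation patterns to improve the accuracy and calibration of SFT models.
🔍 Analysis: ICL and SFT produce distinct activation patterns, indicating that they achieve adaptation through different mechanisms.
🛠️ Method: Proposes ICL Activation Alignment, a self-distillation technique that tries to replicate ICL's activation patterns in SFT models.
📊 Data and experiments: Experiments on 12 popular benchmarks and two model families show that running activation alignment before SFT significantly improves model performance.
⭐ Contributions: Provides a practical method for improving SFT models and offers insight into the internal mechanics of model adaptation.
Abstract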
Supervised Fine-Tuning (SFT) is used to specialize model behavior by training weights to produce intended target responses for queries. In contrast, In-Context Learning (ICL) adapts models during inference with instructions or demonstrations in the prompt.
ICL can offer better generalizability and more calibrated responses than SFT in data-scarce settings, at the cost of more inference compute. In this work, we ask the question: *Can ICL's internal computations be used to improve the qualities of SFT?* We first show that ICL and SFT produce distinct activation patterns, indicating that the two methods achieve adaptation through different functional mechanisms. Motivated by this observation and to use ICL's rich functionality, we introduce ICL Activation Alignment, a self-distillation technique which aims to replicate ICL's activation patterns in SFT models and incentivizes ICL-like internal reasoning. Performing ICL Activation Alignment as a priming step before SFT significantly improves the accuracy and calibration of model outputs, as shown by our extensive empirical results on 12 popular benchmarks and two model families. This finding is not only practically useful, but also offers a conceptual window into the inner mechanics of model adaptation.
#Data Selection #Large Language Models
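🎯 Motivation: Supervised fine-tuning depends on selecting effective training data, but gradient-based influence methods are too expensive to apply to large language models (LLMs).
❓ Problem: Off-the-shelf small proxy models have unclear learning dynamics, cannot be resized flexibly, and cannot be aligned with the target model's influence estimates.
🔍 Analysis: Proxies derived directly from the target model and aligned on influence information are more effective than off-the-shelf small models.
🛠️ Method: Proposes IProX, a two-stage framework that first applies low-rank compression to preserve the target model's influence information and then aligns gradients and logits, producing proxies whose compute cost can be controlled flexibly.
📊 Data and experiments: Across several LLM families and downstream tasks, IProX outperforms off-the-shelf proxies and baseline methods in both performance and compute cost.
⭐ Contributions: Provides an influence-preserving proxy method that makes gradient-based data selection for LLMs more scalable and efficient.
Abstract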
Supervised fine-tuning (SFT) relies critically on selecting training data that most benefits the model's downstream performance. Gradient-based data selection methods such as TracIn and Influence Functions leverage influence to identify useful samples, but their computational cost scales poorly, making them impractical for multi-billion-parameter large language models (LLMs). A common alternative is to use off-the-shelf smaller models as proxies, but they remain suboptimal since their learning dynamics are unclear, their sizes cannot be flexibly adjusted, and they cannot be further aligned with the target model in terms of gradient-based influence estimation. To address these challenges, we introduce IProX, a two-stage framework that derives influence-preserving proxies directly from the target model. It first applies a low-rank compression stage to preserve influence information of the target model, and then an aligning stage to align both model gradients and logits, thereby constructing proxies that flexibly control computational cost while retaining the target model’s influence. Experimental results across diverse LLM families and evaluation tasks show that IProX consistently outperforms off-the-shelf proxies and baseline methods. On Qwen3-4B, a 1.5B proxy constructed with IProX achieves stronger performance than the larger 1.7B off-the-shelf proxy. Notably, on Llama3.2, IProX achieves better performance than baselines while reducing computational cost by more than half relative to the full 3B model. These results show that IProX provides effective influence-preserving proxies, making gradient-based data selection more scalable for LLMs.
#Diffusion Large Language Models #Reinforcement Learning #Inpainting #Group Relative Policy Optimization
TL;DR:IGPO, an RL method for diffusion LLMs that uses inpainting to inject partial reasoning hints when stuck with all-wrong responses, achieving SoTA results on math benchmarks for masked diffusion LLMs
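🎯 Motivation: Diffusion large language models (dLLMs) have distinctive generation capabilities such as inpainting; the work uses them to design more effective RL algorithms that cope with sparse rewards and wasted samples.
❓ Problem: Addresses inefficient exploration in conventional RL training by using inpainting to guide exploration, improving sample efficiency and gradient quality and raising dLLM performance on math tasks.
🔍 Analysis: Inpainting lets a dLLM receive partial, correct reasoning hints. Unlike supplying full solutions, this keeps the model's self-generated reasoning in play while steering it, avoiding full dependence on external answers.
🛠️ Method: Proposes IGPO, which inserts partial ground-truth reasoning traces during online sampling to guide exploration, combined with Group Relative Policy Optimization (GRPO) and entropy-based filtering.
📊 Data and experiments: Evaluated on four mathematical benchmarks, GSM8K, Math500, AMC, and Minerva, using synthetically rewritten concise reasoning traces and other training techniques, with substantial gains.
⭐ Contributions: Develops the IGPO framework, which combines inpainting with RL optimization, substantially improves dLLMs on math tasks with state-of-the-art results on several benchmarks, and broadens the use of dLLM generation capabilities.
Abstract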
Masked diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive LLMs, offering competitive performance while supporting unique generation capabilities such as inpainting. We explore how inpainting can inform RL algorithm design for dLLMs. Aligning LLMs with reinforcement learning faces an exploration challenge: sparse reward signals and sample waste when models fail to discover correct solutions. While this inefficiency affects LLMs broadly, dLLMs offer a distinctive opportunity—their inpainting ability can guide exploration. We introduce IGPO (Inpainting Guided Policy Optimization), an RL framework that strategically inserts partial ground-truth reasoning traces during online sampling. Unlike providing full solutions, inpainting steers exploration toward promising trajectory spaces while preserving self-generated reasoning, bridging supervised fine-tuning and reinforcement learning.
We apply IGPO to group-based optimization methods such as GRPO, where exploration failures cause zero advantages and gradients. IGPO restores meaningful gradients while improving sample efficiency. We also propose supervised fine-tuning on synthetically rewritten concise traces that better align with dLLM generation patterns. With additional techniques including entropy-based filtering, our training recipe yields substantial gains across four mathematical benchmarks—GSM8K, Math500, AMC and Minerva—achieving new state-of-the-art results for full-attention masked dLLMs.
#LLM-as-a-Judge #Reasoning #Reinforcement Learning
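🎯 Motivation: Progress in AI is limited by the quality of evaluation. Strong LLM-as-a-Judge models are a core solution, and their chain-of-thought reasoning needs to be optimized to improve judgments.
❓ Problem: There is no unified method for optimizing the reasoning process in LLM judging tasks, particularly one that provides verifiable rewards and mitigates positional bias.
🔍 Analysis: Existing models fall short in depth of reasoning and consistency of judgment, so evaluation results deviate from true quality.
🛠️ Method: Proposes J1, an RL framework that converts all judgment tasks into a unified format with verifiable rewards, directly optimizes evaluation quality while mitigating positional bias, and trains judges at several scales.
📊 Data and experiments: Judges are trained at 8B, 32B, and 70B; J1-Qwen-32B outperforms o1-mini, o3, and the much larger 671B DeepSeek-R1 on some benchmarks while training only on synthetic data.
⭐ Contributions: Unifies judgment tasks across pointwise, pairwise, and multitask settings, improves judges' chain-of-thought reasoning, and shows that the trained judges develop systematic evaluation strategies.
Abstract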
The progress of AI is bottlenecked by the quality of evaluation, making powerful LLM-as-a-Judge models a core solution. The efficacy of these judges depends on their chain-of-thought reasoning, creating a critical need for methods that can effectively optimize this reasoning process. In this work, we introduce J1, a reinforcement learning framework for teaching LLM judges to think before making decisions. Our core contribution lies in converting all judgment tasks for nonverifiable and verifiable prompts into a unified format with verifiable rewards, enabling direct optimization of evaluation quality while mitigating positional bias. We then use RL to train thinking-judges at scales of 8B, 32B, and 70B and show that they obtain state-of-the-art performance across multiple benchmarks. In particular, J1-Qwen-32B, our multitasked pointwise and pairwise judge also outperforms o1-mini, o3, and a much larger 671B DeepSeek-R1 on some benchmarks, while only training on synthetic data. Through comprehensive ablations of pairwise, pointwise, and multitask J1 variants, we demonstrate the effectiveness of our approach across seed prompts, reward strategies, and training recipes. Qualitative analysis reveals that J1 develops systematic evaluation strategies, including dynamic criteria generation, reference answer creation, iterative self-correction of initial assessments, and feedback generation for low-quality responses.
#Language Models #Preference Optimization #RLHF
TL;DR:We present RAPPO, an order-aware preference optimization framework that achieves tighter generalization guarantees and outperforms DPO baselines on multiple LLM tasks.
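🎯 Motivation: When DPO is used to align large language models with human preferences, its results depend heavily on training-sample quality, and it performs poorly when the reference policy is misaligned with human preferences.
❓ Problem: Proposes an optimization framework that filters out the hardest, most ambiguous samples, addressing the noisy gradient signal caused by reference-policy misalignment and improving generalization.
🔍 Analysis: Theory shows that selecting samples by quality improves generalization; high-quality samples help align the model reliably with preferences.
🛠️ Method: Designs RAPPO, a lightweight modification of the DPO loss that filters out low-quality samples and can be added to existing DPO-type algorithms with a few lines of code.
📊 Data and experiments: Experiments on several alignment tasks and benchmarks, including PKU-SafeRLHF, show gains over DPO and other recent baselines on helpfulness and safety metrics.
⭐ Contributions: Proposes RAPPO with theoretical support, improves preference-alignment performance, extends the applicability of DPO, and is simple to integrate.
Abstract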
Direct Preference Optimization (DPO) has emerged as a powerful framework for aligning large language models (LLMs) with human preferences via pairwise comparisons. However, its performance is highly sensitive to the quality of training samples: when the reference policy is poorly aligned with human preferences, ambiguous pairs can dominate the gradient signal and degrade generalization. To address this, we propose RAPPO ($\textbf{R}$eliable $\textbf{A}$lignment for $\textbf{P}$reference $\textbf{P}$olicy $\textbf{O}$ptimization), a simple sample-aware modification of the DPO loss that mitigates reference-policy misalignment by filtering out the hardest, most ambiguous samples. We theoretically show that RAPPO yields improved generalization guarantees. RAPPO is lightweight and requires only a few lines of code to be integrated into any existing DPO-type algorithm. Surprisingly, with this simple modification, our simulations across a broad suite of alignment tasks and benchmarks show consistent gains over DPO and recent state-of-the-art baselines. On the PKU-SafeRLHF benchmark, RAPPO attains helpfulness $0.693$ ($+34.8\%$ over DPO) and harmlessness $0.357$ ($-21.0\%$ vs DPO).
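The general shape of a sample-filtering modification to the DPO loss can be sketched as below. The per-pair loss is the standard DPO form; the filtering rule (dropping a fixed fraction of the highest-loss pairs) and the `drop_frac` parameter are assumptions for illustration, since the listing does not specify how RAPPO ranks or selects samples.

```python
import math
from typing import List

def dpo_losses(margins: List[float], beta: float = 0.1) -> List[float]:
    """Per-pair DPO loss -log(sigmoid(beta * m)).

    m is the policy-vs-reference log-ratio gap between the chosen and the
    rejected response; -log(sigmoid(z)) is computed as log(1 + exp(-z)).
    """
    return [math.log1p(math.exp(-beta * m)) for m in margins]

def filtered_dpo_loss(margins: List[float], beta: float = 0.1,
                      drop_frac: float = 0.25) -> float:
    """Mean DPO loss after dropping the hardest pairs (illustrative rule only)."""
    losses = sorted(dpo_losses(margins, beta))
    keep = len(losses) - int(drop_frac * len(losses))
    kept = losses[:keep]
    return sum(kept) / len(kept)

margins = [0.0, 10.0, -10.0, 5.0]
full = filtered_dpo_loss(margins, beta=1.0, drop_frac=0.0)
filtered = filtered_dpo_loss(margins, beta=1.0, drop_frac=0.25)
```

In the example, the pair with margin -10 dominates the unfiltered mean; removing it leaves a loss driven by the remaining, less extreme pairs.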
#Knowledge Editing #Machine Unlearning #Knowledge Graph
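🎯 Motivation: The knowledge-updating mechanism of large language models is little studied; existing evaluations are fragmented and small-scale, so the regularities of editing and unlearning cannot be understood systematically.
❓ Problem: Proposes a unified framework for studying how knowledge editing and unlearning affect model knowledge as training data scale and strategy vary.
🔍 Analysis: Language models do not behave like humans when modifying knowledge at different levels, and there is a consistency-capacity trade-off.
🛠️ Method: Casts knowledge editing and unlearning as one constrained optimization problem and designs an automatic dataset generator that supports interventions across multiple graph levels and data scales.
📊 Data and experiments: Multi-dimensional experiments on the generated structured datasets evaluate knowledge propagation, plasticity scaling, consistency, and robustness.
⭐ Contributions: Provides a systematic framework, reveals the complexity of knowledge updating in language models and its key trade-offs, and offers guidance for designing reliable, scalable strategies.
Abstract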
Knowledge editing and machine unlearning are two popular approaches for large language models (LLMs) to stay up-to-date. However, the knowledge updating mechanism of LLMs remains largely unexplored due to insufficient, isolated, and small-scale evaluation. For instance, are LLMs similar to humans in modifying certain knowledge? How do editing and unlearning differ as training data increases? This paper proposes KnowledgeSmith, a unified framework to systematically understand the updating mechanism of LLMs. We first cast editing and unlearning as instances of one constrained optimization problem. Then, we propose an automatic dataset generator that provides structured interventions across multiple graph levels and data scales, enabling controlled studies of how different modification strategies propagate through model knowledge. Extensive experiments demonstrate nuanced insights over knowledge propagation, plasticity scaling, consistency, and robustness. For instance, our results show that LLMs do not update different levels of knowledge the way humans do, and that there exists a consistency-capacity trade-off. We hope our findings can inform the design of more reliable and scalable strategies. Code: https://github.com/AIFrontierLab/KnowledgeSmith
#Reinforcement Learning #Fine-tuning #Agents #Decision-Making #Exploration #Analysis
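🎯 Motivation: Large language models (LLMs) applied to decision-making show sub-optimal exploration and other behavioral shortcomings; studying these is needed to improve their decision-making.
❓ Problem: Identifies and addresses three main failure modes of LLMs in decision-making: greediness, frequency bias, and the knowing-doing gap.
🔍 Analysis: Sub-optimal decisions arise because greedy strategies limit what the model tries, frequency bias hampers exploration, and the model struggles to turn knowledge into effective action.
🛠️ Method: Fine-tunes LLMs with reinforcement learning (RL) on self-generated chain-of-thought (CoT) rationales, and studies classic exploration mechanisms alongside LLM-specific approaches.
📊 Data and experiments: Experiments on multi-armed bandits, contextual bandits, and Tic-tac-toe show that RL fine-tuning increases exploration and narrows the knowing-doing gap.
⭐ Contributions: Systematically studies and mitigates key shortcomings in LLM decision-making, and proposes combining RL with LLM-specific techniques to improve decisions in complex tasks.
Abstract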
The success of LLMs has sparked interest in various agentic applications. A key hypothesis is that LLMs, leveraging common sense and Chain-of-Thought (CoT) reasoning, can effectively explore and efficiently solve complex domains. However, LLM agents have been found to suffer from sub-optimal exploration and the knowing-doing gap, the inability to effectively act on knowledge present in the model. In this work, we systematically study why LLMs perform sub-optimally in decision-making scenarios. In particular, we closely examine three prevalent failure modes: greediness, frequency bias, and the knowing-doing gap. We propose mitigation of these shortcomings by fine-tuning via Reinforcement Learning (RL) on self-generated CoT rationales. Our experiments across multi-armed bandits, contextual bandits, and Tic-tac-toe demonstrate that RL fine-tuning enhances the decision-making abilities of LLMs by increasing exploration and narrowing the knowing-doing gap. Finally, we study both classic exploration mechanisms, such as $\epsilon$-greedy, and LLM-specific approaches, such as self-correction and self-consistency, to enable more effective fine-tuning of LLMs for decision-making.
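The $\epsilon$-greedy mechanism mentioned in the abstract, and the greediness failure mode it counteracts, can be shown with a classic tabular bandit. This sketch is the textbook algorithm on a Bernoulli bandit, not the paper's LLM setup; the arm means and step counts are illustrative.

```python
import random
from typing import List

def epsilon_greedy(means: List[float], steps: int, eps: float,
                   seed: int = 0) -> List[int]:
    """Run epsilon-greedy on a Bernoulli bandit; return pull counts per arm."""
    rng = random.Random(seed)
    counts = [0] * len(means)
    values = [0.0] * len(means)          # running mean reward per arm
    for _ in range(steps):
        if rng.random() < eps:
            arm = rng.randrange(len(means))                        # explore
        else:
            arm = max(range(len(means)), key=values.__getitem__)   # exploit
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]        # incremental mean
    return counts

# Pure greedy (eps = 0): ties go to the first arm, whose estimate never drops
# below the untried arm's initial 0.0, so the better arm is never tried.
greedy_counts = epsilon_greedy([0.1, 0.9], steps=200, eps=0.0)
explore_counts = epsilon_greedy([0.1, 0.9], steps=200, eps=0.1)
```

With `eps = 0.0` the agent locks onto the first arm it tries even though the second arm pays nine times as often, which is a compact picture of the greediness failure mode; any `eps > 0` gives the better arm a chance to be discovered.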
#Reinforcement Learning #LLM Reasoning #Self-Rewarding
TL;DR:We propose LaSeR, a highly efficient and effective algorithm for jointly optimizing the reasoning and self-rewarding capabilities of LLMs.
🎯 研究动机近年来,大语言模型(LLMs)推理能力的增强依赖于可验证奖励强化学习方法(RLVR),但测试时缺乏验证信号,影响性能。
❓ 解决问题现有方法需要通过两步生成解答和验证,导致推理成本翻倍,效率低下。本文提出LaSeR算法以解决此问题。
🔍 现象分析理论上证明RL自验证训练的闭式解可以简化为解答末尾标记(last-token)的自奖励分数,此分数通过模型的下一标记概率分布计算,可精确衡量推理奖励。
🛠️ 主要方法提出LaSeR算法,在RLVR损失中添加均方误差(MSE)损失,确保末尾标记自奖励分数与基于验证器的推理奖励对齐,从而联合优化推理与自奖励能力,仅需额外计算一个标记的推理成本。
📊 数据与实验实验表明,该方法在多项数据集上显著提升模型推理性能,并赋予其优异的自奖励能力,进一步提高推理阶段的扩展性。
⭐ 主要贡献提出了一种高效低成本的联合优化算法,将推理和自奖励紧密结合,极大提升了LLMs的推理效率与性能。
查看完整摘要 (Abstract)
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a core paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). To address the lack of verification signals at test time after RLVR, prior studies incorporate the training of model's self-verification capabilities into the standard RLVR process, thereby unifying reasoning and verification capabilities within a single LLM. However, previous practice requires the LLM to sequentially generate solutions and self-verifications using two separate prompt templates, which doubles the inference cost per sample and significantly reduces efficiency. In this work, we theoretically reveal that the closed-form solution to the RL objective of self-verification training can be approximately reduced to a remarkably simple form: the true reasoning reward of a solution is equal to its last-token self-rewarding score, which is computed as the difference between the policy model's next-token log-probability assigned to any pre-specified token at the solution's last token and a pre-calculated constant, scaled by the KL coefficient. Based on this insight, we propose LaSeR (Reinforcement Learning with Last-Token Self-Rewarding), an algorithm that simply augments the original RLVR loss with a Mean Squared Error (MSE) loss that aligns the last-token self-rewarding scores with the verifier-based reasoning rewards, and jointly optimizes the reasoning and self-rewarding capabilities of LLMs. The optimized self-rewarding scores serve as auxiliary reward signals in both training and testing to enhance model performance. Notably, our algorithm derives these scores from the predicted next-token probability distribution of the last solution token immediately after solution generation, thereby incurring only the minimal extra cost of at most one additional token inference. 
Experimental results show that our method not only improves the reasoning performance of the model but also equips it with remarkable self-rewarding capability, thereby further boosting its inference-time scaling performance.
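The score described in the abstract (log-probability of a pre-specified token at the last solution position, minus a pre-calculated constant, scaled by the KL coefficient) and the MSE alignment term can be written out as plain arithmetic. This is a sketch under assumptions: the function names, the numeric values, and the omission of any loss weighting are illustrative and not taken from the paper.

```python
def last_token_self_reward(logprob_special, ref_constant, kl_coef):
    """Self-rewarding score as worded in the abstract.

    logprob_special: policy log-probability of the pre-specified token,
                     read at the last token of the solution (hypothetical input).
    ref_constant:    the pre-calculated constant (hypothetical value).
    kl_coef:         the KL coefficient used as the scale.
    """
    return kl_coef * (logprob_special - ref_constant)


def mse_alignment_loss(self_rewards, verifier_rewards):
    """Mean squared error between self-rewarding scores and verifier rewards."""
    if len(self_rewards) != len(verifier_rewards):
        raise ValueError("score and reward lists must have the same length")
    total = sum((s - r) ** 2 for s, r in zip(self_rewards, verifier_rewards))
    return total / len(self_rewards)


if __name__ == "__main__":
    # Two sampled solutions: the verifier marks the first correct (1.0), the second wrong (0.0).
    scores = [
        last_token_self_reward(-2.0, -12.0, 0.1),
        last_token_self_reward(-9.0, -12.0, 0.1),
    ]
    print(scores, mse_alignment_loss(scores, [1.0, 0.0]))
```

Because the score is read from a distribution the model already computes right after the solution ends, no second prompt or separate verification pass is needed, which is where the "at most one additional token" cost in the abstract comes from.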
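Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Large Language Model #Adversarial Training #Machine Generated Text Detection
TL;DR:This paper proposes an adversarial training approach to enhance the robustness of machine-generated text detection in zero-shot languages against adversarial attacks.
🎯 Motivation: Machine-generated text (MGT) detection is critical for safeguarding the integrity of online content and preventing the spread of misleading information, but existing detectors perform poorly on zero-shot languages and are vulnerable to adversarial attacks.
❓ Problem: Provide an adversarial training framework that makes MGT detectors more robust and attack-resistant in zero-shot languages.
🔍 Analysis: Existing detectors are accurate in monolingual settings, but their accuracy drops sharply across languages and adversarial attacks succeed at a high rate.
🛠️ Method: The TASTE framework pairs an attacker that generates adversarial examples by querying translation dictionaries with a detector that gains robustness by learning language-agnostic features, and adds a Language-Agnostic Adversarial Loss (LAAL).
📊 Data and experiments: Across 9 languages and 8 attack types, compared with 8 state-of-the-art detectors, the average F1 score improves by 0.064 and the average Attack Success Rate falls by 3.8%.
⭐ Contributions: A multilingual detection framework combining a translation-dictionary attack with a language-agnostic adversarial loss, improving robustness and generalization in zero-shot languages.
Abstract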
Machine-generated text (MGT) detection is critical for safeguarding online content integrity and preventing the spread of misleading information.
Although existing detectors achieve high accuracy in monolingual settings, they exhibit severe performance degradation on zero-shot languages and are vulnerable to adversarial attacks.
To tackle these challenges, we propose a robust adversarial training framework named **T**ranslation-based **A**ttacker **S**trengthens Mul**T**ilingual Def**E**nder (TASTE).
TASTE comprises two core components: an attacker that performs code-switching by querying translation dictionaries to generate adversarial examples, and a detector trained to resist these attacks while generalizing to unseen languages.
We further introduce a novel Language-Agnostic Adversarial Loss (LAAL), which encourages the detector to learn language-invariant feature representations and thus enhances zero-shot detection performance and robustness against unseen attacks.
Additionally, the attacker and detector are synchronously updated, enabling continuous improvement of defensive capabilities.
Experimental results on 9 languages and 8 attack types show that TASTE surpasses 8 SOTA detectors, improving the average F1 score by **0.064** and reducing the average Attack Success Rate (ASR) by **3.8\%**.
Our framework offers a promising approach for building robust, multilingual MGT detectors with strong generalization to real-world adversarial scenarios.
Our code is available at https://github.com/Liyuuuu111/MGT-Eval, and our datasets and pretrained checkpoint are available at https://drive.google.com/file/d/1w1hbdiZMS_JzPntVMWM3qrTQ4KxJf-t6.
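The code-switching step of the attacker (swapping words for dictionary translations) can be sketched as follows. This only illustrates the substitution idea: the function `code_switch`, the toy English-to-Spanish dictionary, and the fixed replacement `rate` are assumptions, and the attacker in the abstract is trained jointly with the detector rather than sampling at a fixed rate.

```python
import random


def code_switch(text, dictionary, rate=0.5, seed=0):
    """Replace dictionary words with their translations with probability `rate`.

    `dictionary` maps lower-cased source words to target-language words.
    Words without an entry are left unchanged.
    """
    rng = random.Random(seed)
    switched = []
    for word in text.split():
        translation = dictionary.get(word.lower())
        if translation is not None and rng.random() < rate:
            switched.append(translation)
        else:
            switched.append(word)
    return " ".join(switched)


if __name__ == "__main__":
    toy_dictionary = {"cat": "gato", "sleeps": "duerme"}  # hypothetical entries
    print(code_switch("the cat sleeps", toy_dictionary, rate=1.0))
```

A detector that relies on surface features of one language is likely to be disturbed by such mixed-language inputs, which is the motivation the abstract gives for learning language-invariant representations.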
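Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Reward Modeling #Large Language Models #RLHF
🎯 Motivation: Reward models are crucial for aligning large language models with human values and intentions, but existing generative and discriminative approaches each have shortcomings.
❓ Problem: Overcome generative models' reliance on costly supervision and discriminative models' lack of a probabilistic interpretation, giving reward modeling a more effective, data-efficient solution.
🔍 Analysis: Existing methods fall short in accuracy and data efficiency, and their quality scores lack an intuitive link between a probability distribution and absolute quality.
🛠️ Method: A new reward modeling paradigm, the Probabilistic Reward Model, realized as an Ordinal Probabilistic Reward Model that discretizes the score range, together with the data-efficient Region Flooding Tuning strategy.
📊 Data and experiments: On several reward model benchmarks, accuracy improves by 2.9% to 7.4%, indicating advantages in both performance and data efficiency.
⭐ Contributions: A fully probabilistic reward scheme and an efficient tuning strategy; experiments indicate the method captures both relative rankings and absolute quality levels.
Abstract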
Reward models are crucial for aligning large language models (LLMs) with human values and intentions.
Existing approaches follow either Generative (GRMs) or Discriminative (DRMs) paradigms, yet both suffer from limitations:
GRMs typically demand costly point-wise supervision, while DRMs produce uncalibrated relative scores that lack probabilistic interpretation.
To address these challenges, we introduce a novel reward modeling paradigm: Probabilistic Reward Model (PRM).
Instead of modeling reward as a deterministic scalar, our approach treats it as a random variable, learning a full probability distribution for the quality of each response.
To make this paradigm practical, we present its closed-form, discrete realization: the **Ordinal Probabilistic Reward Model** (OPRM), which discretizes the quality score into a finite set of ordinal ratings.
Building on OPRM, we propose a data-efficient training strategy called **Region Flooding Tuning** (RgFT).
It enables rewards to better reflect absolute text quality by incorporating quality-level annotations, which guide the model to concentrate the probability mass within corresponding rating sub-regions.
Experiments on various reward model benchmarks show that our method improves accuracy by **2.9% to 7.4%** compared to prior reward models, demonstrating strong performance and data efficiency.
Analysis of the score distribution provides evidence that our method captures not only relative rankings but also absolute quality.
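To make the idea of a reward distribution over ordinal ratings concrete, the sketch below derives two quantities from such a distribution: an expected rating (an absolute quality signal) and the probability that one response outranks another (a relative signal). The function names, the 1-to-K rating scale, the independence assumption, and the half-credit treatment of ties are illustrative choices, not details given in the abstract.

```python
def expected_rating(probs):
    """Expected rating for a distribution over ordinal ratings 1..K."""
    return sum((k + 1) * p for k, p in enumerate(probs))


def prob_a_preferred(probs_a, probs_b):
    """P(rating_a > rating_b) + 0.5 * P(tie), assuming the two ratings are independent."""
    win = 0.0
    for i, pa in enumerate(probs_a):
        for j, pb in enumerate(probs_b):
            if i > j:
                win += pa * pb
            elif i == j:
                win += 0.5 * pa * pb
    return win


if __name__ == "__main__":
    response_a = [0.05, 0.15, 0.80]  # hypothetical: mass concentrated on the top rating
    response_b = [0.60, 0.30, 0.10]  # hypothetical: mass concentrated on the bottom rating
    print(expected_rating(response_a), expected_rating(response_b))
    print(prob_a_preferred(response_a, response_b))
```

A single scalar score can rank these two responses, but only the distribution also shows how confident the model is and where on the rating scale each response sits.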
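Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Large Language Models; Reasoning; Reinforcement Learning; Supervised Fine-Tuning
TL;DR:ReLIFT, a method combining reinforcement learning and supervised fine-tuning, enhances large language model reasoning by addressing limitations of RL through interleaved training, improving performance across benchmarks with minimal data.
🎯 Motivation: Current reinforcement learning methods cannot push a large language model past its inherent capability boundary, which limits gains in complex reasoning.
❓ Problem: Combine reinforcement learning with supervised fine-tuning so that each covers the other's weaknesses, improving reasoning on problems beyond the model's current capabilities.
🔍 Analysis: Reinforcement learning improves performance on problems within the model's existing capabilities, while supervised fine-tuning is better at problems the model cannot yet solve; the two are complementary.
🛠️ Method: ReLIFT uses reinforcement learning for general training and interleaves it with online supervised fine-tuning targeted at hard questions, addressing model weaknesses as they emerge.
📊 Data and experiments: On six benchmarks (five math reasoning tasks and one out-of-distribution task), ReLIFT beats prior methods by an average of 6.7 points while reducing training time.
⭐ Contributions: An efficient training paradigm that alternates reinforcement learning and supervised fine-tuning, improving LLM reasoning at low resource cost.
Abstract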
Recent advances in large language model (LLM) reasoning have shown that reasoning ability can emerge through reinforcement learning (RL). However, despite these successes, RL in its current form remains insufficient to induce capabilities that exceed the limitations of the base model, as it is primarily optimized based on existing knowledge of the model. To address this limitation, we employ supervised fine-tuning (SFT) to learn what RL cannot, which enables the incorporation of new knowledge and reasoning patterns by leveraging high-quality demonstration data. We analyze the training dynamics of RL and SFT for LLM reasoning and find that RL excels at improving performance on questions within the model's original capabilities, while SFT is more effective at enabling progress on questions beyond the current scope of the model. Motivated by the complementary strengths of RL and SFT, we introduce **ReLIFT** (**Re**inforcement **L**earning **I**nterleaved with Online **F**ine-**T**uning), a novel training strategy. ReLIFT employs RL for general training, but interleaves it with targeted SFT on challenging questions for which high-quality solutions are collected online. By alternating between RL and SFT, ReLIFT addresses model weaknesses as they emerge. Empirically, ReLIFT outperforms previous RLVR methods by an average of +6.7 points across a suite of six benchmarks (five math reasoning and one out-of-distribution). More importantly, ReLIFT surpasses baselines such as individual RL, individual SFT, and various hybrid approaches while reducing the required training time. These results provide compelling evidence that ReLIFT is a powerful and resource-efficient paradigm for developing capable reasoning models. The code is available at https://github.com/TheRoadQaQ/ReLIFT.
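Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Code generation #Reinforcement Learning
🎯 Motivation: Unit testing is a core part of software development and enables systematic evaluation of code, but comprehensive unit tests are hard to write, so automated generation of high-quality tests is needed.
❓ Problem: Methods for training LLMs to generate high-quality unit tests remain underexplored.
🔍 Analysis: In comparative experiments, a unit test generator trained with UTRL produces higher-quality tests than the same model trained with supervised fine-tuning, and than frontier models.
🛠️ Method: UTRL is a reinforcement learning framework that iteratively trains a unit test generator and a code generator adversarially: the test generator learns to write tests that expose faults in generated code, and the code generator learns to write code that passes those tests.
📊 Data and experiments: With Qwen3-4B, UTRL outperforms the same model trained on ground-truth unit tests and surpasses GPT-4.1 and GPT-4o in unit test quality.
⭐ Contributions: The UTRL framework, which improves unit test generation through adversarial reinforcement learning and makes LLMs more useful for code evaluation.
Abstract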
Unit testing is a core practice in programming, enabling systematic evaluation of programs produced by human developers or large language models (LLMs). Given the challenges in writing comprehensive unit tests, LLMs have been employed to automate unit test generation, yet methods for training LLMs to produce high-quality unit tests remain underexplored. In this work, we propose UTRL, a novel reinforcement learning (RL) framework that trains an LLM to generate high-quality unit tests given a programming instruction. Our key idea is to iteratively train two LLMs, the unit test generator and the code generator, in an adversarial manner via RL: (1) the unit test generator is trained to maximize a discrimination reward, encouraging it to produce tests that reveal faults in the code generator’s solutions; and (2) the code generator is trained to maximize a code reward, encouraging it to produce solutions that pass the unit tests generated by the unit test generator. In our experiments, we demonstrate that unit tests generated by Qwen3-4B trained via UTRL are of higher quality than unit tests generated by the same model trained via supervised fine-tuning on ground-truth unit tests, yielding code evaluations that more closely align with those induced by the ground-truth tests. Moreover, Qwen3-4B trained with UTRL outperforms frontier models like GPT-4.1 and GPT-4o in generating high-quality unit tests, highlighting the effectiveness of UTRL in training LLMs for unit test generation.
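Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#reinforcement learning #large language model
🎯 Motivation: Large language models excel with reinforcement learning, but fully unlocking that potential requires an effective mid-training stage that learns a strong policy prior and speeds up learning from online interaction.
❓ Problem: Explain theoretically how mid-training improves RL convergence, and therefore post-training policy quality, by pruning the action space and shortening the planning horizon.
🔍 Analysis: Temporal abstractions simultaneously compress the size of the action set and shorten the decision horizon, which lowers regret in post-training and makes policy optimization more efficient.
🛠️ Method: RA3, a scalable mid-training algorithm that optimizes a temporal variational bound by iteratively discovering temporally consistent latent structures via RL and then fine-tuning on the bootstrapped data.
📊 Data and experiments: Validated on code generation, including HumanEval and MBPP, where RA3 improves over the base model; on further benchmarks such as HumanEval+ and LiveCodeBench it converges faster and reaches higher performance.
⭐ Contributions: A theoretical account of how mid-training shapes RL policy optimization; the RA3 algorithm; an explicit characterization of how temporal abstraction affects the action set and decision horizon; and improved code generation performance.
Abstract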
Large language models excel with reinforcement learning (RL), but fully unlocking this potential requires a mid-training stage. Intuitively, an effective mid-training stage should both learn a strong policy prior and enable fast learning through online interactions. We formalize this intuition by presenting the first theoretical result on how mid-training shapes post-training: it acquires strong policy priors by efficiently pruning the action space and accelerates RL convergence by shortening the effective planning horizon. Moreover, we prove that temporal abstractions simultaneously compress the size of the action set and reduce the decision horizon, thereby improving regret minimization after training. Building on these insights, we introduce Reasoning as Action Abstractions (RA3), a scalable mid-training algorithm. Specifically, we derive a temporal variational bound and optimize it by iteratively discovering temporally-consistent latent structures via RL, then fine-tuning on the bootstrapped data. Experiments on code generation tasks demonstrate the effectiveness of our approach. Across multiple base models, RA3 improves the average performance on HumanEval and MBPP by 8 and 4 points over the base model and the next-token prediction baseline. Furthermore, RA3 achieves faster convergence and higher asymptotic performance in RLVR on HumanEval+, MBPP+, LiveCodeBench, and Codeforces.
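Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#pluralistic preference alignment #RL finetuning of LLMs #pluralistic reward modeling
TL;DR:PLUS enables personalized reward prediction and response generation for pluralistic LLM alignment. It improves reward modeling accuracy by 11–77%, and responses conditioned on PLUS summaries achieve a 72% win rate compared to default responses.
🎯 Motivation: As LLMs are used more widely in everyday settings, personalizing responses to different users' preferences and goals becomes essential. Existing RLHF methods assume all users share the same preferences and cannot serve diverse needs.
❓ Problem: Introduce the PLUS framework to address RLHF's inability to model individual user preferences, capturing diverse preferences accurately and generating more personalized responses.
🔍 Analysis: Existing methods use a single reward model for all users, which limits personalized prediction. Experiments show that PLUS improves reward modeling accuracy and the quality of user-specific responses.
🛠️ Method: A user-summarization model learns online, via RL, to write text summaries of each user's preferences, characteristics, and past conversations; these summaries condition the reward model for personalized reward prediction and response generation.
📊 Data and experiments: Compared with existing reward modeling methods, PLUS improves reward model accuracy by 11% to 77%; in zero-shot personalization with GPT-4, PLUS-conditioned responses win 72% of comparisons against default responses.
⭐ Contributions: A personalized reward modeling framework built on user preference summaries that improves reward modeling and response generation, remains robust for new users and new topics, learns from flexible user contexts, and increases transparency and user control.
Abstract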
As everyday use cases of large language model (LLM) AI assistants have expanded, it is becoming increasingly important to personalize responses to align to different users' preferences and goals. While reinforcement learning from human feedback (RLHF) is effective at improving LLMs to be generally more helpful and fluent, it does not account for variability across users, as it models the entire user population with a single reward model, meaning it assumes that everyone's preferences are the same.
We present a novel framework, **P**reference **L**earning **U**sing **S**ummarization (**PLUS**), that uses reinforcement learning (RL) to learn to produce text-based summaries of each user's preferences, characteristics, and past conversations. These summaries condition the reward model, enabling it to make personalized predictions about the types of responses valued by each user. Both the user-summarization model and reward model are trained simultaneously, creating an online co-adaptation loop. We show that in contrast to the standard Bradley–Terry model, summaries produced by PLUS capture diverse aspects of user preferences, achieving an 11–77\% improvement in reward model accuracy. Key strengths of PLUS are: (1) robust performance with new users and conversation topics, achieving a 25\% improvement over the best personalized reward model technique used for RLHF; (2) zero-shot personalization with state-of-the-art proprietary models like GPT-4 (e.g., PLUS-summary-conditioned responses achieved a 72\% win rate compared to 28\% for default GPT-4o); (3) learning from flexible user contexts beyond preference labels; and (4) interpretable representation of users, enabling greater transparency and user control in pluralistic LLM alignment.
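Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#multilingual #reward modeling #rl #llm-as-judge #human evaluation
TL;DR:Massively multilingual preference evaluation, reward modeling, and post-training to improve LLMs' language proficiency
🎯 Motivation: Ensuring native-like quality of LLM responses across many languages remains difficult, and new mechanisms are needed to evaluate and improve it.
❓ Problem: Provide a framework for evaluating and optimizing how native-like model responses are across languages, in order to improve multilingual proficiency.
🔍 Analysis: Zero-shot LLM judges improve with pairwise evaluation and structured annotation rubrics, yet they still trail human annotators, and model scores diverge from human judgment.
🛠️ Method: Fine-tune judges with reinforcement learning, reward shaping, and multi-task learning, then use the RL-trained judges as generative reward models to improve multilingual proficiency.
📊 Data and experiments: A dataset of 6,423 human-annotated preference pairs covering 47 language varieties, with experiments on multilingual preference alignment and scalable evaluation.
⭐ Contributions: The MENLO framework and dataset, which improve the native-like quality of multilingual models and are released as open resources for multilingual LLM evaluation and optimization.
Abstract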
Ensuring native-like quality of large language model (LLM) responses across many languages is challenging. To address this, we introduce MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms. Using MENLO, we create a dataset of 6,423 human-annotated prompt–response preference pairs covering four quality dimensions with high inter-annotator agreement in 47 language varieties. Our evaluation reveals that zero-shot LLM judges benefit significantly from pairwise evaluation and our structured annotation rubrics, yet they still underperform human annotators on our dataset. We demonstrate substantial improvements through fine-tuning with reinforcement learning, reward shaping, and multi-task learning approaches. Additionally, we show that RL-trained judges can serve as generative reward models to enhance LLMs' multilingual proficiency, though discrepancies with human judgment remain. Our findings suggest promising directions for scalable multilingual evaluation and preference alignment. We release our dataset and evaluation framework to support further research in multilingual LLM evaluation (https://huggingface.co/datasets/facebook/menlo).
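Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Best-of-N #test-time scaling #synthetic data generation #inference #multilingual #diversity #ensembling
TL;DR:We introduce Fusion-of-$N$ that synthesizes outputs from $N$ samples rather than selecting one amongst them (Best-of-$N$).
🎯 Motivation: Improving generation quality in modern LLMs is usually treated as a selection problem, picking the best of several samples. That discards diverse and potentially useful information, so a different way of using the samples is needed.
❓ Problem: Provide a strategy that fuses information from multiple samples, overcoming the limits of selecting a single one.
🔍 Analysis: Best-of-N keeps only one sample as the final output and loses the diversity in the pool, whereas fusing the samples can use the helpful parts of each and raise quality.
🛠️ Method: Fusion-of-N uses a general LLM judge to merge the most informative elements of each sample into one final answer; it applies to both test-time scaling and synthetic data generation.
📊 Data and experiments: Extensive experiments across 11 languages, 3 benchmarks, and several model scales show consistent gains and robustness over Best-of-N.
⭐ Contributions: A shift from a single measure of quality to integrating diverse candidates, with improvements in both test-time scaling and synthetic data generation.
Abstract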
Obtaining high-quality generations in modern LLMs has largely been framed as a selection problem: identifying a single winning generation from a diverse pool of $N$ samples, the Best-of-$N$ (BoN).
Yet, this approach is inherently zero-sum, discarding diverse and potentially useful information from the pool. Instead, we explore a collaborative setup, where all candidates can potentially contribute to the final winning generation. To this end, we propose Fusion-of-$N$ (FusioN): a method that uses a general LLM judge to synthesize the most informative elements of each sample into a single final answer.
We compare FusioN to BoN in two settings: (i) test-time scaling, where we sample and aggregate from a single model at test time, and (ii) synthetic data generation, where we fuse samples from a pool of diverse teachers to improve a student model. We extensively benchmark both setups across 11 languages, 3 diverse benchmarks and varying model scales. Across the board, FusioN consistently outperforms BoN, showing versatility and robustness both in test-time scaling and in downstream gains from synthetic data generation. We also perform extensive analysis on FusioN, where it shows surprising strengths and robustness under challenging settings.
These results show that we should shift how we think about evaluating and utilizing LLM generations from a monolithic measure of quality, to embracing their polylithic nature. This shift allows us to integrate diverse strengths, unlock latent potential, and achieve improvements that were previously inaccessible through selection alone.
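Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Mixup #Augmentation #MLLM #Image Classification #Visual Alignment
🎯 Motivation: Current alignment methods for multi-modal large language models trade efficiency against generalization. Supervised fine-tuning depends on human annotation and generalizes poorly across tasks, while reinforcement learning is computationally expensive and unstable.
❓ Problem: MergeMix is a unified augmentation paradigm that aims to balance scalability, efficiency, and alignment generalization, bridging supervised fine-tuning and reinforcement learning through Token Merge and Mixup.
🔍 Analysis: Vision-language alignment relies on supervised fine-tuning or reinforcement learning, which are limited by data requirements and computational stability respectively, making it hard to balance efficiency and generalization.
🛠️ Method: Generate contextually aligned mixed images and labels from clustered attention maps, build preference pairs from raw images and MergeMix-generated images, and optimize the soft preference margin with a mixed SimPO loss.
📊 Data and experiments: Extensive experiments show leading classification accuracy as an augmentation method, along with better generalization and alignment of multi-modal LLMs and gains in training efficiency and stability.
⭐ Contributions: A Token Merge based Mixup augmentation paradigm that offers an efficient, unified learning framework for vision and multi-modal understanding, improving alignment generalization while keeping training efficient and stable.
Abstract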
Vision-language alignment in multi-modal large language models (MLLMs) relies on supervised fine-tuning (SFT) or reinforcement learning (RL).
When aligning MLLMs in the post-training stage, SFT is a stable choice but requires human annotations and lacks task generalization, while RL searches for better answers from reward signals but suffers from computational overhead and instability.
To achieve balance among scalability, efficiency, and alignment generalization, we propose MergeMix, a unified paradigm that bridges SFT and RL with an efficient Token Merge based Mixup augmentation. As for the Mixup policy, we generate contextually aligned mixed images with the corresponding labels according to the merged attention maps with cluster regions. Then, we enhance the preference-driven paradigm for MLLMs by building preference pairs with raw images and MergeMix-generated ones and optimizing the soft preference margin with the mixed SimPO loss.
Extensive experiments demonstrate that MergeMix not only achieves dominant classification accuracy as an augmentation method but also improves generalization abilities and alignment of MLLMs, providing a new learning paradigm for preference alignment with training efficiency and stability.
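Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Causal learning #Meta-learner #Large Language Model #query routing
🎯 Motivation: In language tasks that require extensive human-model interaction, LLM inference is costly, so a model router is needed to balance response quality against inference cost.
❓ Problem: Design a router training framework that combines gold-standard data with preference data, correcting the bias in preference data and the imbalance between the two sources to improve routing accuracy and efficiency.
🔍 Analysis: Gold-standard data are accurate but costly and hard to scale; preference data are cheap and scalable but biased, and that bias corresponds to the conditional average treatment effect (CATE) from causal inference.
🛠️ Method: An integrative, causal-inference-based router training framework that corrects preference-data bias and balances the two data sources, improving router robustness and efficiency.
📊 Data and experiments: Numerical experiments show more accurate routing and a better cost-quality trade-off; reproducible code is released.
⭐ Contributions: A causal router training framework that improves the accuracy and efficiency of LLM routing, offering a tool for cost-effective, high-quality language processing at scale.
Abstract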
In language tasks requiring extensive human-model interaction, the inference cost of large language models (LLMs) can be substantial. To reduce expenses while preserving the quality of the responses, an LLM router selects among candidate models to balance between the expected response quality and the inference cost. A central challenge in router training is the accuracy and accessibility of reliable supervision. Gold-standard data, obtained from domain experts or benchmark labels, provide accurate quality evaluations of LLM responses but are costly and difficult to scale. In contrast, preference-based data, collected via crowdsourcing or LLM-as-a-judge systems, are cheaper and more scalable, yet often biased in reflecting the true quality of responses. We cast the problem of LLM router training with combined gold-standard and preference-based data into a causal inference framework by viewing the response evaluation mechanism as the treatment assignment. This perspective further reveals that the bias in preference-based data corresponds to the well-known causal estimand: the conditional average treatment effect (CATE). Based on this new perspective, we develop an integrative causal router training framework that corrects preference-data bias, addresses imbalances between two data sources, and improves routing robustness and efficiency. Numerical experiments demonstrate that our approach delivers more accurate routing and improves the trade-off between cost and quality. Illustrative code to reproduce our main experiment is available at https://github.com/yichistat/Meta-router.
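The routing decision itself, balancing expected quality against inference cost, can be sketched as a simple utility maximization. This is only the selection step under assumed inputs: the function `route`, the model names, the quality predictions, and the linear `quality - trade_off * cost` utility are hypothetical. Estimating the quality predictions from gold-standard and debiased preference data is the part the paper addresses and is not shown here.

```python
def route(predicted_quality, cost, trade_off):
    """Pick the candidate model with the best quality-minus-weighted-cost utility.

    predicted_quality: dict of model name -> predicted response quality for this query.
    cost:              dict of model name -> inference cost for this query.
    trade_off:         how much quality one unit of cost is worth (>= 0).
    """
    return max(
        predicted_quality,
        key=lambda model: predicted_quality[model] - trade_off * cost[model],
    )


if __name__ == "__main__":
    quality = {"small": 0.70, "large": 0.90}  # hypothetical router predictions
    cost = {"small": 1.0, "large": 10.0}      # hypothetical relative costs
    print(route(quality, cost, trade_off=0.01))
    print(route(quality, cost, trade_off=0.05))
```

Any systematic bias in the quality predictions moves this decision boundary, which is why correcting the bias in preference-based labels matters for the resulting cost-quality trade-off.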
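Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#LLM #Fine-Tuning #Continual fine-tuning
🎯 Motivation: Large language models are increasingly used where tasks arrive continuously, but existing parameter-efficient fine-tuning methods either grow memory linearly with the number of tasks or adapt too shallowly, leaving catastrophic forgetting unresolved.
❓ Problem: Address the memory limits and catastrophic forgetting of a frozen language model facing an unbounded task stream, with a method that adapts on the fly in constant memory.
🔍 Analysis: Existing methods either let parameters balloon as tasks accumulate or forget because they lack deep updates, so they cannot sustain knowledge transfer and retention across tasks.
🛠️ Method: Meta-UCF encodes each task as a lightweight layer-normalized embedding and feeds it to a single hypernetwork that generates LoRA updates; a meta-contrastive objective combined with an orthogonality objective pushes task embeddings toward near-orthogonal directions to retain past knowledge while adapting quickly.
📊 Data and experiments: On four benchmark streams (Std-CL 5, Seq-GLUE 7, Long-CL 15, and TRACE-8), with the parameters of a single adapter, average accuracy rises by up to 2.2 percentage points and forgetting falls by 13%.
⭐ Contributions: Decouples task adaptation from memory growth, giving a scalable, low-resource approach to continual learning for long-lived language modeling applications.
Abstract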
Large language models are increasingly deployed in settings where new tasks arrive continuously, yet existing parameter-efficient finetuning (PEFT) methods either bloat linearly with the task horizon or sacrifice deep adaptation, leaving catastrophic forgetting unresolved. We aim to achieve memory-constant, on-the-fly adaptation for a frozen LLM facing an unbounded stream of tasks. To this end we propose Meta-Unified Contrastive Finetuning (Meta-UCF), which encodes each task into a lightweight layer-normalised mean embedding and feeds it to a single hypernetwork that instantly generates rank-r LoRA updates for every transformer layer; a meta-contrastive objective coupled with an orthogonality objective further steers task embeddings into near-orthogonal directions, preserving past knowledge without inner-loop gradients. On four benchmark streams (Std-CL 5, Seq-GLUE 7, Long-CL 15 and TRACE-8), Meta-UCF raises average accuracy by up to 2.2 pp and cuts forgetting by 13% relative to the strongest LoRA baseline, while using the parameters of a single adapter. By decoupling continual learning from parameter growth, Meta-UCF provides a practical path toward scalable, low-resource lifelong language modelling.
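Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Vision–Language–Action models #Efficient Robot Reasoning #Generalization
🎯 Motivation: Existing Vision-Language-Action models fall short in task adaptability and compute efficiency, are far from general-purpose, and generalize poorly to unseen tasks.
❓ Problem: MetaVLA provides a unified post-training framework for more efficient and broader task alignment, reducing task-specific fine-tuning and compute cost.
🔍 Analysis: Multi-task fine-tuning is inefficient and adapts poorly, models struggle to adjust quickly in complex task settings, and conventional methods are unstable on long-horizon tasks.
🛠️ Method: A context-aware meta co-training mechanism that combines structurally diverse auxiliary tasks with a lightweight meta-learning component to improve generalization and rapid adaptation.
📊 Data and experiments: On the LIBERO benchmark, MetaVLA with six auxiliary tasks improves long-horizon performance and cuts training steps and GPU time by 69% and 76% respectively.
⭐ Contributions: A unified, low-resource post-training framework that lowers compute cost and moves toward general-purpose robot models.
Abstract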
Vision–Language–Action (VLA) models show promise in embodied reasoning, yet remain far from true generalists—they often require task-specific fine-tuning, incur high compute costs, and generalize poorly to unseen tasks. We propose MetaVLA, a unified, backbone-agnostic post-training framework for efficient and scalable alignment. MetaVLA introduces Context-Aware Meta Co-Training, which consolidates diverse target tasks into a single fine-tuning stage while leveraging structurally diverse auxiliary tasks to improve in-domain generalization. Unlike naive multi-task SFT, MetaVLA integrates a lightweight meta-learning mechanism—derived from Attentive Neural Processes—to enable rapid adaptation from diverse contexts with minimal architectural change or inference overhead. On the LIBERO benchmark, MetaVLA with six auxiliary tasks outperforms OpenVLA by up to 8.0\% on long-horizon tasks, reduces training steps from 240K to 75K, and cuts GPU time by 76%. These results show that scalable, low-resource post-training is achievable—paving the way toward general-purpose embodied agents. Code will be available.
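Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#LLM #Reinforcement Learning
TL;DR:Native Reasoning Training (NRT) cultivates reasoning in base LLMs using only Q&A pairs. It rewards self-generated logic for correct answers, achieving SOTA results for verifier-free methods without needing costly, human-written reasoning examples.
🎯 Motivation: Mainstream reasoning training for large language models depends on high-quality human-annotated data and external verifiers, which is costly, can embed bias, and cannot handle unverifiable tasks.
❓ Problem: Provide a framework that needs no expert-written reasoning demonstrations and cultivates complex reasoning from question-answer pairs alone, covering tasks that existing methods cannot verify.
🔍 Analysis: Prior methods suffer from failure modes such as policy collapse during training and do not handle uncertainty in the reasoning process systematically, which limits performance on complex tasks.
🛠️ Method: Native Reasoning Training treats the reasoning process as a latent variable, rewards reasoning paths that lead to the correct answer under a unified training objective, and designs new reward aggregation functions.
📊 Data and experiments: Evaluated on the Llama and Mistral model families, with clear gains in complex reasoning domains and greater robustness than standard SFT and prior verifier-free methods.
⭐ Contributions: A verifier-free reasoning training framework that addresses intrinsic problems such as policy collapse and scales across a broad range of tasks, offering a path toward general reasoning systems.
Abstract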
The dominant paradigm for training large reasoning models—combining Supervised Fine-Tuning (SFT) with Reinforcement Learning with Verifiable Rewards (RLVR)—is fundamentally constrained by its reliance on high-quality, human-annotated reasoning data and external verifiers.
This dependency incurs significant data-collection costs, risks embedding human cognitive biases, and confines the reinforcement learning stage to objectively assessable domains like mathematics and coding, leaving a vast landscape of unverifiable tasks unaddressed.
To overcome these limitations, we introduce Native Reasoning Training (NRT), a novel framework that cultivates complex reasoning by having the model generate its own reasoning traces using only standard question-answer pairs, thereby obviating the need for expert-written demonstrations.
NRT reframes the training problem by treating the reasoning process as a latent variable.
It employs a unified training objective that models reasoning as an optimization problem, intrinsically rewarding paths that increase the model's likelihood of producing the ground-truth answer.
This unified perspective allows us to analyze intrinsic failure modes of prior methods, such as policy collapse, and systematically design more robust reward aggregation functions, creating a self-correcting feedback loop where the model learns to *think* in ways that resolve its own uncertainty.
Empirical evaluation on Llama and Mistral model families demonstrates that NRT achieves state-of-the-art performance among verifier-free methods, significantly outperforming standard SFT baselines and prior verifier-free RL methods.
Our approach yields particularly strong performance gains in complex reasoning domains and exhibits high robustness to policy collapse, offering a general, scalable path toward building more powerful and broadly applicable reasoning systems.
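Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#peft #large language models #fine tuning
TL;DR:We identify a smoothness mismatch in LLM fine-tuning and propose a solution to mitigate it.
🎯 Motivation: Existing LLM fine-tuning trades efficiency against performance: full fine-tuning is computationally expensive, while PEFT struggles to learn new knowledge and underperforms.
❓ Problem: Provide a hybrid fine-tuning method that combines full fine-tuning with PEFT, balancing efficiency and performance and addressing the smoothness mismatch between the two.
🔍 Analysis: The optimization landscape in fine-tuning is heterogeneous, with different parameter groups behaving differently, so an efficient and stable optimization scheme needs theoretical support.
🛠️ Method: A hybrid algorithm combining zeroth-order and first-order optimization, with a convergence analysis built on a hybrid smoothness condition, for joint training of the LLM and PEFT modules.
📊 Data and experiments: Experiments across multiple downstream tasks and model architectures show consistent performance improvements.
⭐ Contributions: A hybrid fine-tuning paradigm supported by both theory and experiments that improves fine-tuning of large-scale language models.
Abstract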
Fine-tuning Large Language Models (LLMs) typically involves either full fine-tuning, which updates all model parameters, or Parameter-Efficient Fine-Tuning (PEFT), which adjusts a small subset of parameters. However, both approaches have inherent limitations: full fine-tuning is computationally expensive, while PEFT often struggles to learn new knowledge and exhibits suboptimal performance. To overcome these issues, we propose a novel *hybrid fine-tuning* approach that jointly updates both LLMs and PEFT modules using a combination of zeroth-order and first-order optimization methods. To analyze our new algorithm, we develop a theoretical framework centered on the concept of a *hybrid smoothness condition*, which accounts for the heterogeneous nature of the optimization landscape in joint LLM and PEFT training. We derive a rigorous convergence analysis of a reshuffling-type SGD algorithm under multiple learning rates and demonstrate its effectiveness through extensive empirical studies across various downstream tasks and model architectures. On the practical side, our results demonstrate consistent performance improvement, making the approach a viable solution for large-scale language model fine-tuning.
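The idea of mixing zeroth-order and first-order updates with separate learning rates can be illustrated on a toy problem. This sketch makes several assumptions that are not in the abstract: the "base" block is updated with a standard two-point zeroth-order gradient estimate, the "adapter" block with its exact gradient, the loss is a simple quadratic, and the names and hyperparameters are hypothetical. It does not reproduce the paper's reshuffling-type SGD.

```python
import random


def loss(base, adapter):
    """Toy quadratic loss, minimized at base = [1, 1] and adapter = [-2]."""
    return sum((b - 1.0) ** 2 for b in base) + sum((a + 2.0) ** 2 for a in adapter)


def adapter_grad(adapter):
    """Exact (first-order) gradient of the toy loss with respect to the adapter block."""
    return [2.0 * (a + 2.0) for a in adapter]


def zo_grad(base, adapter, rng, mu=1e-3):
    """Two-point zeroth-order estimate of the gradient with respect to the base block."""
    u = [rng.gauss(0.0, 1.0) for _ in base]
    plus = loss([b + mu * ui for b, ui in zip(base, u)], adapter)
    minus = loss([b - mu * ui for b, ui in zip(base, u)], adapter)
    scale = (plus - minus) / (2.0 * mu)  # directional derivative along u
    return [scale * ui for ui in u]


def hybrid_train(steps=500, lr_base=0.05, lr_adapter=0.1, seed=0):
    """Jointly update both blocks, each with its own estimator and learning rate."""
    rng = random.Random(seed)
    base, adapter = [0.0, 0.0], [0.0]
    for _ in range(steps):
        g_base = zo_grad(base, adapter, rng)  # forward evaluations only
        g_adapter = adapter_grad(adapter)     # needs a gradient
        base = [b - lr_base * g for b, g in zip(base, g_base)]
        adapter = [a - lr_adapter * g for a, g in zip(adapter, g_adapter)]
    return base, adapter


if __name__ == "__main__":
    print(hybrid_train())
```

The zeroth-order estimate needs only two loss evaluations and no backward pass through the large block, at the price of a noisier update. That difference is one reason the two blocks may call for different learning rates.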
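Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#LLM-as-a-Judge #Hypothesis testing #Finite-sample guarantees #Type I/II errors
TL;DR:A statistically grounded framework is proposed for evaluating LLMs as imperfect judges, offering finite-sample guarantees by modeling judge reliability through hypothesis testing.
🎯 Motivation: Large language models (LLMs) need reliable performance certification, but current evaluation methods cope poorly with noise and bias in the judge.
❓ Problem: A statistically rigorous hypothesis testing framework for imperfect judges that prevents statistical guarantees from breaking down and controls finite-sample error rates.
🔍 Analysis: Modeling the judge through its True Positive Rate (TPR) and False Positive Rate (FPR) shows that evaluation power depends on judge quality, dataset size, and the certification level.
🛠️ Method: Estimate the judge's reliability parameters on a small human-labelled calibration set, derive a corrected critical threshold, and apply it to a large judge-labelled dataset while preserving finite-sample validity.
📊 Data and experiments: Experiments on the Jigsaw Comment, Hate Speech, and SafeRLHF datasets confirm the theoretical predictions and quantify the gap between practical methods and an ideal Oracle judge.
⭐ Contributions: The first systematic treatment of the imperfect-judge setting, with interpretable diagnostics that clarify the key factors and trade-offs in statistical evaluation with LLM judges.
Abstract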
Reliable certification of Large Language Models (LLMs)—verifying that failure rates are below a safety threshold—is critical yet challenging. While "LLM-as-a-Judge" offers scalability, judge imperfections, noise, and bias can invalidate statistical guarantees.
We introduce a "Noisy but Valid" hypothesis testing framework to address this. By leveraging a small human-labelled calibration set to estimate the judge's True Positive and False Positive Rates (TPR/FPR), we derive a variance-corrected critical threshold applied to a large judge-labelled dataset. Crucially, our framework theoretically guarantees finite-sample Type-I error control (validity) despite calibration uncertainty. This distinguishes our work from Prediction-Powered Inference (PPI), positioning our method as a diagnostic tool that explicitly models judge behavior rather than a black-box estimator.
Our contributions include: (1) Theoretical Guarantees: We derive the exact conditions under which noisy testing yields higher statistical power than direct evaluation; (2) Empirical Validation: Experiments on Jigsaw Comment, Hate Speech and SafeRLHF confirm our theory; (3) The Oracle Gap: We reveal a significant performance gap between practical methods and the theoretical "Oracle" (perfectly known judge parameters), quantifying the cost of estimation.
Specifically, we provide the first systematic treatment of the imperfect-judge setting, yielding interpretable diagnostics of judge reliability and clarifying how evaluation power depends on judge quality, dataset size, and certification levels. Together, these results sharpen understanding of statistical evaluation with LLM judges, and highlight trade-offs among competing inferential tools.
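To make the role of TPR and FPR concrete, the sketch below shows the standard misclassification correction for a noisy binary judge. It assumes the judge flags a true failure with probability TPR and a non-failure with probability FPR, so the observed flag rate is q = TPR·p + FPR·(1 − p). This is only the point-estimate step; it is not the paper's variance-corrected critical threshold, and all function names are hypothetical.

```python
def estimate_rates(human_labels, judge_labels):
    """Estimate (TPR, FPR) from a human-labelled calibration set (1 = failure)."""
    pos = [j for h, j in zip(human_labels, judge_labels) if h == 1]
    neg = [j for h, j in zip(human_labels, judge_labels) if h == 0]
    if not pos or not neg:
        raise ValueError("calibration set needs both failures and non-failures")
    return sum(pos) / len(pos), sum(neg) / len(neg)

def observed_flag_rate(p, tpr, fpr):
    """Rate at which the judge flags failures when the true failure rate is p."""
    return tpr * p + fpr * (1.0 - p)

def corrected_failure_rate(q_hat, tpr_hat, fpr_hat):
    """Invert q = TPR*p + FPR*(1-p) and clip the result to [0, 1]."""
    if tpr_hat <= fpr_hat:
        raise ValueError("judge must be better than chance (TPR > FPR)")
    p_hat = (q_hat - fpr_hat) / (tpr_hat - fpr_hat)
    return min(1.0, max(0.0, p_hat))
```

The inversion divides by TPR − FPR, so a judge that is barely better than chance inflates the variance of the corrected estimate, which is one way to see why evaluation power depends on judge quality.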
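Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#data extraction #data efficient #instruction tuning
🎯 Motivation: Instruction tuning depends on high-quality data, but data synthesized by LLMs lacks diversity and introduces bias. To mitigate this, the authors extract instruction data from knowledge-rich web corpora instead.
❓ Problem: Extracting instruction data from web corpora faces two challenges: extracting every QA pair is prohibitively expensive, and not every extracted pair helps downstream tasks; some can even hurt model performance.
🔍 Analysis: The straightforward approach retrieves domain-relevant documents and extracts all of their QA pairs, which incurs a large computational cost and introduces low-quality or irrelevant data that degrade instruction tuning.
🛠️ Method: Proposes EQUAL, an iterative data extraction framework that clusters documents using contrastive-learning embeddings and applies a multi-armed bandit strategy to identify the clusters that yield high-quality QA pairs, interleaving document selection with data extraction.
📊 Data & Experiments: Experiments on AutoMathText, KnowledgePile, and StackOverflow across 13 downstream tasks and several mainstream models validate the method's effectiveness and efficiency.
⭐ Contributions: EQUAL reduces computational cost by 5-10× while improving accuracy by 2.5% on LLaMA-3.1-8B and other models, offering an efficient and scalable way to extract high-quality instruction data.
Abstract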
Instruction tuning improves LLM performance but depends on high-quality training data. Recently, LLMs have been used to synthesize data, enhancing training with seeds like question-answer (QA) pairs. However, this synthesis often results in instruction examples similar to the seeds, which lack diversity and bias real applications. Thus, we propose to extract instruction tuning data from web corpora, which contain rich knowledge. The most straightforward strategy is to quickly retrieve domain-specific documents from the corpus and then extract all QA pairs from these documents for tuning LLMs, which has two main limitations: (1) extracting all QA pairs using LLMs is prohibitively expensive; and (2) the extracted pairs are not all beneficial for the downstream applications, and incorporating all of them for tuning may even hurt model performance. To overcome these limitations, we introduce $\texttt{EQUAL}$, an $\textbf{E}$ffective and scalable data extraction framework that iteratively interleaves document selection with the extraction of high-$\textbf{QUAL}$ity QA pairs to optimize instruction tuning. $\texttt{EQUAL}$ first clusters the document set based on embeddings generated by contrastive learning. Then it leverages a multi-armed bandit based strategy to quickly identify document clusters from which high-quality QA pairs can be extracted for training. This iterative framework significantly reduces computational costs while improving model performance. Experiments on AutoMathText, KnowledgePile and StackOverflow across 13 downstream tasks demonstrate that $\texttt{EQUAL}$ reduces computational costs by 5–10$\times$ while improving accuracy by 2.5\% on LLaMA-3.1-8B, Qwen2.5-7B and Mistral-7B. Code and data are available at https://anonymous.4open.science/r/EQUAL-DD20.
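The bandit step can be sketched with the classic UCB1 rule, treating each document cluster as an arm. The abstract does not say which bandit algorithm EQUAL uses, so UCB1 is an assumption here, and `reward_fn` is a hypothetical stand-in for "sample a document from this cluster, extract its QA pairs, and score their quality in [0, 1]".

```python
import math

def ucb1_select(counts, means, t, c=2.0):
    """Pick the cluster with the highest upper confidence bound.

    Clusters that have never been sampled are tried first.
    """
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [m + math.sqrt(c * math.log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)

def run_bandit(reward_fn, num_clusters, rounds):
    """Repeatedly select a cluster, observe a quality reward, update its running mean."""
    counts = [0] * num_clusters
    means = [0.0] * num_clusters
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, means, t)
        reward = reward_fn(arm)  # hypothetical: quality of QA pairs from this cluster
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
    return counts, means
```

Because low-reward clusters are sampled only often enough to rule them out, most of the extraction budget goes to the productive clusters, which matches the cost-saving argument in the abstract.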
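Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#LLM alignment #Representation Engineering #Activation Steering #ODE-based Framework #Barrier Functions
TL;DR:We propose a unified ODE-based framework for activation steering and introduce ODESteer, a method derived from this framework that significantly outperforms current SOTA activation steering methods.
🎯 Motivation: Current activation steering methods lack a unified theoretical framework and rely on simple one-step steering, which cannot capture complex patterns in activation distributions and limits alignment quality.
❓ Problem: Proposes a theoretical framework based on ordinary differential equations (ODEs) in which finding a steering direction becomes equivalent to designing a barrier function from control theory, enabling multi-step and adaptive steering.
🔍 Analysis: Conventional activation addition can be viewed as a first-order approximation to the solution of an ODE; one-step methods fall short in complex settings, so the steering direction itself needs to be improved.
🛠️ Method: Proposes ODESteer, which defines the barrier function as the log-density ratio between positive and negative activations and uses it to construct an ODE for multi-step, adaptive steering.
📊 Data & Experiments: On TruthfulQA, UltraFeedback, and RealToxicityPrompts, ODESteer improves over SOTA methods by 5.7%, 2.5%, and 2.4%, respectively.
⭐ Contributions: Unifies the theoretical foundations of activation steering through an ODE-based framework, and proposes the practical ODESteer method, which achieves leading results on several alignment benchmarks.
Abstract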
Activation steering, or representation engineering, offers a lightweight approach to align large language models (LLMs) by manipulating their internal activations at inference time. However, current methods suffer from two key limitations: \textit{(i)} the lack of a unified theoretical framework for guiding the design of steering directions, and \textit{(ii)} an over-reliance on \textit{one-step steering} that fails to capture complex patterns of activation distributions. In this work, we propose a unified ordinary differential equations (ODEs)-based \textit{theoretical} framework for activation steering in LLM alignment. We show that conventional activation addition can be interpreted as a first-order approximation to the solution of an ODE. Based on this ODE perspective, identifying a steering direction becomes equivalent to designing a \textit{barrier function} from control theory. Derived from this framework, we introduce ODESteer, a kind of ODE-based steering guided by barrier functions, which shows \textit{empirical} advancement in LLM alignment. ODESteer identifies steering directions by defining the barrier function as the log-density ratio between positive and negative activations, and employs it to construct an ODE for \textit{multi-step and adaptive} steering. Compared to state-of-the-art activation steering methods, ODESteer achieves consistent empirical improvements on diverse LLM alignment benchmarks, including a notable $5.7\%$ improvement on TruthfulQA, $2.5\%$ on UltraFeedback, and $2.4\%$ on RealToxicityPrompts. Our work establishes a principled new view of activation steering in LLM alignment by unifying its theoretical foundations via ODEs, and validating it empirically through the proposed ODESteer method.
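The ODE view can be sketched as follows. The abstract does not specify how the densities are modeled or how the ODE is integrated, so this example assumes diagonal Gaussian densities for the positive and negative activations and a forward Euler solver; both are illustrative choices, not the paper's. With `steps=1` the update collapses to a single additive shift, which mirrors the claim that activation addition is a first-order approximation.

```python
import numpy as np

def log_density_ratio(x, mu_pos, var_pos, mu_neg, var_neg):
    """h(x) = log N(x; mu_pos, var_pos) - log N(x; mu_neg, var_neg), diagonal Gaussians."""
    log_pos = -0.5 * np.sum((x - mu_pos) ** 2 / var_pos + np.log(var_pos))
    log_neg = -0.5 * np.sum((x - mu_neg) ** 2 / var_neg + np.log(var_neg))
    return float(log_pos - log_neg)

def grad_log_density_ratio(x, mu_pos, var_pos, mu_neg, var_neg):
    """Gradient of h; it depends on x, so the steering direction adapts along the path."""
    return -(x - mu_pos) / var_pos + (x - mu_neg) / var_neg

def ode_steer(x, mu_pos, var_pos, mu_neg, var_neg, total_time=1.0, steps=10):
    """Integrate dx/dt = grad h(x) with forward Euler for `steps` steps."""
    dt = total_time / steps
    x = np.array(x, dtype=float)
    for _ in range(steps):
        x = x + dt * grad_log_density_ratio(x, mu_pos, var_pos, mu_neg, var_neg)
    return x
```

If both Gaussians shared the same variance, the gradient would be a constant vector proportional to the difference of means, and multi-step steering would coincide with one-step addition; the multi-step version only matters when the two distributions differ in shape.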
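Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Large Language Models #Reinforcement Learning #Interpretability #Low Rank
TL;DR:We discover a naturally emerging low-rank phenomenon in reinforcement learning and leverage it to design a method that accelerates model training.
🎯 Motivation: The reasoning ability of LLMs has improved markedly in recent years, but the parameter dynamics during reinforcement learning (RL) training remain poorly understood.
❓ Problem: Identify and explain the fundamental properties of parameter updates during RL training, and use that understanding to speed up training.
🔍 Analysis: Identifies and validates two low-rank phenomena in RL training: Rank-1 Dominance, which accounts for nearly all of the performance gain, and Rank-1 Linear Dynamics, which shows that the dominant subspace evolves linearly over training.
🛠️ Method: Builds on these phenomena to develop AlphaRL, an acceleration framework that linearly extrapolates from early-training parameters to predict the final parameter update.
📊 Data & Experiments: Extensive experiments across 13 LLMs and 10 RL algorithms validate the generality of the two properties and the effectiveness of AlphaRL for accelerating training.
⭐ Contributions: First to reveal these low-rank phenomena in RL training, and proposes a plug-in acceleration framework that needs no extra modules or hyperparameter tuning, achieving up to 2.5× speedup while retaining more than 96% of reasoning performance.
Abstract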
Recent advances in the reasoning capabilities of large language models (LLMs) are largely driven by reinforcement learning (RL), yet the underlying parameter dynamics during RL training remain poorly understood. This work identifies two fundamental properties of RL-induced parameter updates in LLMs: (1) Rank-1 Dominance, where the top singular subspace of the parameter update matrix nearly fully determines reasoning improvements, recovering over 99\% of performance gains; and (2) Rank-1 Linear Dynamics, where this dominant subspace evolves linearly throughout training, enabling accurate prediction from early checkpoints. Extensive experiments across 13 LLMs and 10 algorithms validate the generalizability of these properties. More importantly, based on these findings, we propose AlphaRL, a plug-in acceleration framework that extrapolates the final parameter update using a short early training window, achieving up to 2.5× speedup while retaining > 96\% of reasoning performance without extra modules or hyperparameter tuning. This positions our finding as a versatile and practical tool for large-scale RL, opening a path toward a principled, interpretable, and efficient training paradigm for LLMs.
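The two properties suggest a simple computation, sketched here under strong assumptions. `rank1_approx` keeps only the top singular component of an update matrix, and `extrapolate_update` scales that component linearly with the training step. The proportional-scaling rule is a deliberately crude reading of "evolves linearly"; the abstract does not describe AlphaRL's actual extrapolation procedure, and both function names are hypothetical.

```python
import numpy as np

def rank1_approx(delta_w):
    """Best rank-1 approximation of an update matrix (top singular triplet)."""
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    return s[0] * np.outer(u[:, 0], vt[0])

def extrapolate_update(delta_early, step_early, step_final):
    """Predict the final update by scaling the early rank-1 component.

    Assumes the dominant component grows proportionally to the training step.
    """
    return rank1_approx(delta_early) * (step_final / step_early)
```

In use, `delta_early` would be the difference between an early checkpoint and the initial weights for one weight matrix, and the predicted update would be added back to the initial weights.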
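Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Large Language Models #LLM-as-a-Judge #Distribution Shift #Generalization #Evaluation Robustness
TL;DR:We study the generalization and robustness of LLM-as-judge as new LLMs emerge, addressing practical questions about judge shelf life, including future proofing, backward compatibility, and question generalization.
🎯 Motivation: As new LLMs keep emerging, the demand for general and robust LLM judges grows. Extending the shelf life of these judges requires addressing future-proofing and backward compatibility.
❓ Problem: Studies how fine-tuned LLM judges perform on responses from newer and older generator models, and how well they generalize to unseen questions.
🔍 Analysis: Most models struggle with responses from future generators, while responses from older models are relatively easy to judge. All models degrade to some extent on unseen questions, which shows that they do not fully generalize.
🛠️ Method: Under a unified framework, the study varies train and test distributions across two reasoning datasets, three SFT- and DPO-based fine-tuning algorithms, and three backbone models.
📊 Data & Experiments: Experiments on the two reasoning datasets train the different backbone models with the SFT- and DPO-based methods and measure how well they adapt under different train and test conditions.
⭐ Contributions: Shows that continual learning adapts in a more balanced way to shifts in the response distribution, shows that current judges generalize poorly to unseen questions, and offers practical insights for developing and deploying judge models.
Abstract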
The LLM-as-a-judge paradigm is widely used in both evaluating free-text model responses and reward modeling for model alignment and fine-tuning. Recently, fine-tuning judges with judge-specific data has emerged as an often preferred choice over directly prompting frontier models as judges, as the former achieves better performance with smaller model sizes while being more robust to common biases. However, the standard evaluation ignores several practical concerns of fine-tuned judges regarding their real-world deployment. In this paper, we identify and formalize three aspects that affect the *shelf life* of these judges: *future-proofing* and *backward-compatibility* (how well judges fine-tuned on responses by today's generator models perform on responses by future models or past models), as well as *question generalization* (how well judges generalize to unseen questions at test time). We study these three aspects under a unified framework with varying train and test distributions in two reasoning datasets, three SFT- and DPO-based fine-tuning algorithms, and three different backbone models. Experiments suggest that future-proofing is challenging for most models, while backward-compatibility is relatively easy, with DPO-trained models consistently *improving* performance. We further find that continual learning provides a more balanced adaptation to shifts between older and newer response distributions than training solely on stronger or weaker responses. Moreover, all models exhibit some degree of performance degradation when moving from questions seen during training to unseen ones, showing that current judges do not fully generalize to unseen questions. These findings provide insights into practical considerations for developing and deploying judge models in the face of ever-changing generators.
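Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Large Language Model #Post-Training #Reinforcement Learning #Supervise Fine-Tuning
🎯 Motivation: Post-training of LLMs needs to combine supervised fine-tuning (SFT) with reinforcement learning (RL), but existing approaches tend to disrupt established response patterns or overfit to expert data.
❓ Problem: Explores a unified view of SFT and RL and uses dynamic weighting to reduce the risks of pattern disruption and overfitting to expert data when the two are combined.
🔍 Analysis: Analyzes the influence of off-policy expert data at both holistic and granular levels, and finds that off-policy imitation must be balanced against on-policy exploration.
🛠️ Method: Proposes CHORD, which integrates SFT as a dynamically weighted auxiliary objective within RL, using a global coefficient and a token-wise weighting function to harmonize off-policy imitation and on-policy exploration in a controllable way.
📊 Data & Experiments: Extensive experiments across various practical tasks show that CHORD learns stably and efficiently and outperforms existing baselines.
⭐ Contributions: Harmonizes off-policy expert data with on-policy exploration through a dynamic weighting mechanism, validates the resulting gains, and commits to releasing the source code.
Abstract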
Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are two prominent post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs). Existing approaches that integrate SFT and RL often face the risk of disrupting established response patterns and inducing overfitting to expert data. To address this, we present a novel investigation into the unified view of SFT and RL through an off-policy versus on-policy lens. We propose CHORD, a framework for Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting, which reframes SFT not as a separate stage but as a dynamically weighted auxiliary objective within the on-policy RL process. Based on an analysis of off-policy expert data's influence at both holistic and granular levels, we incorporate a dual-control mechanism in CHORD. Specifically, the framework first employs a global coefficient to holistically guide the transition from off-policy imitation to on-policy exploration, and then applies a token-wise weighting function that enables granular learning from the expert, which promotes on-policy exploration and mitigates disruption from off-policy data. We conduct extensive experiments across various practical tasks, providing empirical evidence that CHORD achieves a stable and efficient learning process. By effectively harmonizing off-policy expert data with on-policy exploration, CHORD demonstrates significant improvements over baselines. We will release the source code to inspire further research.
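Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Black-Box Prompt Optimization #Online Learning #Generative AI
🎯 Motivation: The performance of generative AI depends heavily on its input prompts, but most prompt optimization research focuses on offline settings and overlooks output randomness and the challenges of online learning.
❓ Problem: Targets the randomness and high variance that arise in online black-box prompt optimization for generative AI, and seeks an online learning method that suppresses noise.
🔍 Analysis: In real-world applications, prompt optimization faces noisy outputs and a non-convex objective, and existing work has not studied these issues in depth for the online setting.
🛠️ Method: Proposes Adaptive Online Zeroth-order Prompt Tuning (AOZPT), which combines zeroth-order optimization with online learning and introduces an uncertainty-scale-adjustment mechanism to mitigate noise and high variance.
📊 Data & Experiments: Extensive generative experiments show that AOZPT outperforms existing black-box prompt tuning methods, particularly in stability in online scenarios.
⭐ Contributions: Provides a regret analysis showing that sublinear regret is achievable, and proposes an online prompt optimization framework that remains effective under noise.
Abstract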
Generative AI excels in various tasks through advanced language modeling techniques, with its performance heavily influenced by input prompts. This has driven significant research into prompt optimization, particularly in commercial generative AI platforms, where prompt optimization is treated as a black-box optimization problem. Most existing research on black-box prompt optimization primarily focuses on offline learning and overlooks the randomness in outputs. However, in real-world applications, black-box prompt optimization typically operates in an online learning setting, which remains largely unexplored, especially given the noisy outputs. To address these challenges, we propose an \textbf{A}daptive \textbf{O}nline \textbf{Z}eroth-order \textbf{P}rompt \textbf{T}uning (AOZPT) approach which integrates zeroth-order optimization with online learning in the non-convex setting. Specifically, we developed an uncertainty-scale-adjustment mechanism to mitigate the noise inherent in generative AI and the high variance associated with zeroth-order estimates. We conducted a comprehensive regret analysis of the AOZPT approach, and the results indicate that sublinear regret convergence is achievable. Extensive generative experiments demonstrate that AOZPT outperforms existing black-box prompt tuning methods, particularly in terms of stability in online scenarios.
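Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#LLMs #data synthetic #instruction tuning
🎯 Motivation: High-quality SFT data in specific vertical domains is extremely scarce: expert annotation is costly and privacy constraints are strict. Handcrafted rubrics rarely transfer across domains, and the heuristic loop used to tune them is unreliable.
❓ Problem: Proposes a synthetic data framework that assesses synthetic data quality through its causal impact on the target model and uses this feedback to optimize rubric generation.
🔍 Analysis: Synthetic and real samples may be close in embedding space while their influence on learning differs substantially, which exposes the limits of current proxy metrics.
🛠️ Method: Inspired by classic influence functions, uses gradient information to estimate each synthetic sample's contribution to the target model's objective on specific tasks, and uses this feedback to adapt rubrics automatically through a rubric-specialized model.
📊 Data & Experiments: Experiments cover domains such as humanities and social sciences (HSS) and health, with target models and generators from families such as Qwen and Llama, and show consistent gains without task-specific tuning.
⭐ Contributions: Establishes a framework that optimizes rubrics automatically from target-model causal feedback, with robust gains across domains, models, and generators, and improved engineering portability.
Abstract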
Large language models (LLMs) achieve strong downstream performance largely due to abundant supervised fine-tuning (SFT) data that imparts problem-solving capabilities. However, as applications expand, high-quality SFT data in knowledge-intensive verticals (e.g., humanities and social sciences, medicine, law, finance) is exceedingly scarce: expert curation is costly, privacy constraints are strict, and label consistency is hard to guarantee. Recent work turns to synthetic data, typically prompting a teacher model over domain documents and filtering with handcrafted rubrics. Yet, rubric design is expert-dependent and rarely transfers across domains; moreover, prevalent heuristic optimization follows a brittle loop (write rubric $\rightarrow$ synthesize $\rightarrow$ train $\rightarrow$ inspect $\rightarrow$ guess tweaks) that lacks reliable, quantitative feedback about a rubric's true contribution to downstream performance.
We argue for assessing synthetic data quality through its causal impact on the target model, using this feedback to guide data generation. Inspired by classic influence functions, we repurpose an optimizer-aware estimator that uses gradient information to quantify each synthetic sample's contribution to the objective of a given target model on specific tasks. Our analysis reveals a gap: although synthetic and real samples may be close in embedding space, their influence on learning can differ substantially. Building on this insight, we propose an optimization-based synthetic data framework that adapts rubrics with target-model feedback. Instead of manually engineering domain rubrics, we supply lightweight guiding text and delegate rubric generation to a rubric-specialized model conditioned on the task; crucially, rubric (and data) selection is supervised by estimated downstream impact rather than proxy formality. Empirically, the framework yields consistent gains across domains (HSS and health), target models (e.g., Qwen and Llama families), and data generators, demonstrating broad generalization and engineering portability without task-specific tuning.
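Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#LLMs #MLLMs #Hallucination #DPO
🎯 Motivation: Current DPO-based methods for hallucination in multimodal large models have two limitations: they do not specifically target the perceptual bottleneck in attended regions, and they lack visual robustness against image degradation. In addition, existing preference data is often vision-agnostic and off-policy, which limits how effectively it guides model learning.
❓ Problem: Proposes the P^2-DPO training paradigm, in which the model generates and learns from its own preference pairs to directly address the perceptual bottleneck and robustness. This avoids the inherent drawbacks of vision-agnostic and off-policy data and targets both perception in attended regions and robustness to degraded images.
🔍 Analysis: In multimodal large models, hallucination often stems from insufficient perception of key image regions or poor adaptation to degraded image quality. Existing DPO methods rely on human-corrected preference data, which rarely captures fine-grained perceptual differences and, being collected offline, aligns poorly with the current model's visual-to-text generation process.
🛠️ Method: The method has two parts: an on-policy preference pair construction mechanism that generates training data targeting Focus-and-Enhance perception and Visual Robustness, and a Calibration Loss that precisely aligns visual signals with the causal generation of text so that perceptual information is integrated accurately into language generation.
📊 Data & Experiments: With a comparable amount of training data and cost, P^2-DPO outperforms baselines that rely on costly human feedback on multiple benchmarks. Evaluations on Attention Region Fidelity (ARF) and image degradation scenarios validate its effectiveness in easing the perceptual bottleneck and improving visual robustness.
⭐ Contributions: Proposes Perceptual Processing Direct Preference Optimization (P^2-DPO), in which the model generates on-policy preference pairs to directly optimize the visual bottleneck, together with a Calibration Loss that strengthens vision-text alignment. The method improves perception accuracy in attended regions and robustness to degraded inputs at low cost.
Abstract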
Hallucination has recently garnered significant research attention in Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) aims to learn directly from the corrected preferences provided by humans, thereby addressing the hallucination issue. Despite its success, this paradigm has yet to specifically target the perceptual bottleneck in attended regions or address insufficient Visual Robustness against image degradation. Furthermore, existing preference pairs are often vision-agnostic and their inherently off-policy nature limits their effectiveness in guiding model learning. To address these challenges, we propose Perceptual Processing Direct Preference Optimization (P$^2$-DPO), a novel training paradigm in which the model generates and learns from its own preference pairs, thereby directly addressing the identified visual bottlenecks while inherently avoiding the issues of vision-agnostic and off-policy data. It introduces: (1) an on-policy preference pairs construction method targeting Focus-and-Enhance perception and Visual Robustness, and (2) a well-designed Calibration Loss to precisely align visual signals with the causal generation of text. Experimental results demonstrate that with a comparable amount of training data and cost, P$^2$-DPO outperforms strong baselines that rely on costly human feedback on benchmarks. Furthermore, evaluations on Attention Region Fidelity (ARF) and image degradation scenarios validate the effectiveness of P$^2$-DPO in addressing perceptual bottleneck in attended regions and improving Visual Robustness against degraded inputs.
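Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Personalization #Large Language Models
🎯 Motivation: Personalization has become a pivotal research area for intelligent systems. LLMs excel at general knowledge tasks but fall short in adapting to individual user needs, so balancing personalization quality against efficiency matters.
❓ Problem: Existing ways of steering model behavior, such as tune-free methods and parameter fine-tuning methods, struggle to balance effectiveness and efficiency, and the mechanisms underlying personalized preferences remain underexplored.
🔍 Analysis: User-specific information lies in a low-rank subspace of the representation space, and the shifts consist of a collective shift shared across all users plus a personalized shift unique to each user.
🛠️ Method: Proposes PerFit, a two-stage fine-tuning approach built on these shift patterns. It intervenes directly in the hidden representation space to steer model outputs precisely while greatly reducing parameter overhead.
📊 Data & Experiments: Experiments on six datasets show that PerFit delivers strong performance while cutting the number of parameters by an average of 92.3% compared to the state-of-the-art method.
⭐ Contributions: Reveals regular patterns of personalized information in the representation space, proposes the efficient and precise PerFit method with much lower parameter overhead, and sheds light on the mechanisms of personalized preferences.
Abstract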
Personalization has become a pivotal field of study in contemporary intelligent systems. While large language models (LLMs) excel at general knowledge tasks, they often struggle with personalization, i.e., adapting their outputs to individual user expectations. Existing approaches that steer LLM behavior to meet users’ implicit preferences and behavior patterns, primarily relying on tune-free methods (e.g., RAG, PAG) or parameter fine-tuning methods (e.g., LoRA), face challenges in effectively balancing effectiveness and efficiency. Moreover, the mechanisms underlying personalized preferences remain underexplored. To address these challenges, we first uncover key patterns of user-specific information embedded in the representation space. Specifically, we find that (1) personalized information lies within a low-rank subspace represented by vectors, and (2) these vectors demonstrate both a collective shift shared across users and a personalized shift unique to each individual user. Building on these insights, we introduce PerFit, a novel two-stage solution that directly fine-tunes interventions in the hidden representation space by addressing both collective and user-specific shifts, thereby achieving precise steering of LLMs with minimal parameter overhead. Experimental results demonstrate that PerFit delivers strong performance across six datasets while cutting the number of parameters by an average of 92.3% compared to the state-of-the-art method.
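Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Multimodal Large Language Models #Multimodal Reasoning #Reinforcement Learning
TL;DR:We observe that standard RLVR fails to enhance MLLM perception. Therefore, we propose a novel visual perception reward to improve MLLM perception in RLVR, effectively boosting performance on several multimodal benchmarks with limited data.
🎯 Motivation: Existing applications of Reinforcement Learning with Verifiable Rewards (RLVR) to Multimodal Large Language Models (MLLMs) focus on improving reasoning and overlook multimodal perception, which is a core prerequisite for it.
❓ Problem: Addresses the failure of RLVR to enhance the multimodal perception of MLLMs, which is a bottleneck for further gains in multimodal reasoning.
🔍 Analysis: McNemar's test shows that existing RLVR methods do not effectively improve the multimodal perception of MLLMs, which limits further improvement in complex reasoning.
🛠️ Method: Proposes Perception-R1, which introduces a novel visual perception reward. It uses textual visual annotations collected from chain-of-thought trajectories as references, and a judging LLM assesses the consistency between the MLLM's responses and these annotations to assign the reward.
📊 Data & Experiments: Extensive experiments on several multimodal math and general benchmarks show that Perception-R1 achieves superior performance on all benchmarks using only 1,442 training samples.
⭐ Contributions: Proposes a visual perception reward aimed specifically at enhancing MLLM perception; by jointly incentivizing perception and reasoning, it substantially improves results on multiple benchmarks with limited data.
Abstract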
Enhancing the multimodal reasoning capabilities of Multimodal Large Language Models (MLLMs) is a challenging task that has attracted increasing attention in the community. Recently, several studies have applied Reinforcement Learning with Verifiable Rewards (RLVR) to the multimodal domain in order to enhance the reasoning abilities of MLLMs. However, these works largely overlook the enhancement of multimodal perception capabilities in MLLMs, which serve as a core prerequisite and foundational component of complex multimodal reasoning. Through McNemar's test, we find that existing RLVR methods fail to effectively enhance the multimodal perception capabilities of MLLMs, thereby limiting their further improvement in multimodal reasoning. To address this limitation, we propose Perception-R1, which introduces a novel visual perception reward that explicitly encourages MLLMs to perceive the visual content accurately, thereby effectively incentivizing both their multimodal perception and reasoning capabilities. Specifically, we first collect textual visual annotations from the CoT trajectories of multimodal problems, which serve as visual references for reward assignment. During RLVR training, we employ a judging LLM to assess the consistency between the visual annotations and the responses generated by the MLLM, and assign the visual perception reward based on these consistency judgments. Extensive experiments on several multimodal math and general benchmarks demonstrate the effectiveness and robustness of our Perception-R1, which achieves superior performance on all benchmarks using only 1,442 training samples. Our code and dataset will be available at https://github.com/tongxiao2002/Perception-R1.
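For readers unfamiliar with McNemar's test, the sketch below shows the standard chi-square form without continuity correction, applied to paired per-example correctness before and after training. How the paper defines a "perception error" and which variant of the test it uses are not stated in the abstract, so the inputs and the function name here are illustrative.

```python
import math

def mcnemar_test(before_correct, after_correct):
    """McNemar's chi-square test on paired binary outcomes.

    b counts examples correct before but wrong after; c counts the reverse.
    Returns (statistic, p_value) using a chi-square with 1 degree of freedom.
    """
    b = sum(1 for x, y in zip(before_correct, after_correct) if x and not y)
    c = sum(1 for x, y in zip(before_correct, after_correct) if not x and y)
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs, so no evidence of change
    stat = (b - c) ** 2 / (b + c)
    # Survival function of chi-square(1): P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

Only the discordant pairs enter the statistic, so a model that fixes some examples while breaking a similar number of others shows no significant change even if its outputs differ widely.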
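Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#forecasting #evaluation #criticism #leakage #standards #LLMs #prediction #future #benchmarks
TL;DR:We find conceptual issues and real temporal leakage errors in existing LLM forecasting evaluations, and argue this is a problem.
🎯 Motivation: LLMs are increasingly applied to forecasting tasks, but existing evaluations suffer from conceptual issues and temporal leakage, which undermines the credibility of their results.
❓ Problem: Examines the problems in current LLM forecasting evaluations, clarifies how evaluation flaws affect performance claims, and argues that evaluation methodology needs to improve.
🔍 Analysis: Identifies two broad categories of issues: evaluation results are hard to trust because of temporal leakage, and evaluation performance is hard to extrapolate to real-world forecasting.
🛠️ Method: Through systematic analysis and concrete examples drawn from prior work, shows where evaluation methods are flawed and how the flaws affect judgments of forecasting ability.
📊 Data & Experiments: Analyzes cases from existing studies, focusing on temporal leakage and on how reported results can mislead about forecasting performance.
⭐ Contributions: Clarifies the unique challenges of evaluating LLM forecasters, stresses why they matter, and calls for more rigorous evaluation methodologies to assess LLM forecasting ability reliably.
Abstract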
Large language models (LLMs) have recently been applied to forecasting tasks, with some works claiming these systems match or exceed human performance. In this paper, we argue that, as a community, we should be careful about such conclusions as evaluating LLM forecasters presents unique challenges. We identify two broad categories of issues: (1) difficulty in trusting evaluation results due to many forms of temporal leakage, and (2) difficulty in extrapolating from evaluation performance to real-world forecasting. Through systematic analysis and concrete examples from prior work, we demonstrate how evaluation flaws can raise concerns about current and future performance claims. We argue that more rigorous evaluation methodologies are needed to confidently assess the forecasting abilities of LLMs.
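Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Large language models #Diversity #Reinforcement learning #Post-training
🎯 Motivation: Reinforcement learning works well for post-training LLMs but often reduces output diversity, which limits the model's ability to generate varied responses.
❓ Problem: Proposes a method that optimizes LLMs jointly for output quality and semantic diversity, overcoming the limitation that existing methods either operate only at inference time or target only lexical differences.
🔍 Analysis: Existing methods for improving diversity often sacrifice response quality, or achieve only surface-level lexical variety without semantic richness.
🛠️ Method: Proposes DQO, a training method based on determinantal point processes (DPPs). It samples and embeds a group of responses per prompt and uses the determinant of a kernel-based similarity matrix to measure the semantic diversity among them.
📊 Data & Experiments: Experiments on instruction-following, summarization, story generation, and reasoning tasks show that the method improves semantic diversity while maintaining output quality.
⭐ Contributions: Proposes an optimization method that combines quality and semantic diversity and substantially increases the output diversity of post-trained LLMs.
Abstract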
Reinforcement learning has emerged as a popular method for post-training large language models (LLMs). While improving the model's performance on downstream tasks, it often reduces the model's output diversity, leading to narrow, canonical responses. Existing methods to enhance diversity are limited, either by operating at inference time or by focusing on lexical differences. We propose a novel training method named DQO (Diversity Quality Optimization) based on determinantal point processes (DPPs) to jointly optimize LLMs for quality and semantic diversity. Our approach samples and embeds a group of responses for each prompt, then uses the determinant of a kernel-based similarity matrix to measure diversity as the volume spanned by the embeddings of these responses. Experiments across instruction-following, summarization, story generation, and reasoning tasks demonstrate that our method substantially improves semantic diversity without sacrificing model quality.
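The determinant-based diversity measure can be sketched in a few lines. This example assumes a cosine-similarity kernel and adds a small ridge term `eps` for numerical stability; the kernel choice, the ridge, and the function name are assumptions, and how DQO combines this term with the quality reward is not shown because the abstract does not specify it.

```python
import numpy as np

def dpp_diversity(embeddings, eps=1e-6):
    """Log-determinant of the cosine-similarity kernel of a group of responses.

    Higher values mean the embeddings span a larger volume (more diverse).
    """
    e = np.asarray(embeddings, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)  # unit-normalize each row
    kernel = e @ e.T + eps * np.eye(len(e))           # Gram matrix plus small ridge
    sign, logdet = np.linalg.slogdet(kernel)
    return float(logdet) if sign > 0 else float("-inf")
```

For example, two orthogonal embeddings give a log-determinant close to 0, while two identical embeddings give a strongly negative value because the Gram matrix is nearly singular. This is the "volume spanned by the embeddings" interpretation from the abstract.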
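Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Interactive Personalization #Test-time Reasoning #Information Seeking #Preference Alignment #Proactive Question Asking
TL;DR:We introduce the task of Personalized Reasoning, in which LLMs need to reason about missing user preferences, strategically elicit preference values, then adapt their reasoning processes and responses accordingly.
🎯 Motivation: Current LLM development treats task-solving and preference alignment as separate challenges, but in human-facing applications solving the task correctly is not enough to meet individual needs. Cold-start conditions and privacy constraints make this harder.
❓ Problem: Introduces the personalized reasoning task, in which the model proactively identifies what it does not know about the user's preferences, strategically elicits them through questions, and adapts its reasoning and responses accordingly.
🔍 Analysis: Evaluation shows that 29.0% of naive personalization attempts produce worse preference alignment than generic responses, while generic responses also fail to serve individual needs. Personalized reasoning therefore requires dedicated development rather than emerging naturally.
🛠️ Method: Designs PREFDISCO, an evaluation methodology that uses psychologically grounded personas with sparse, context-dependent preferences to turn static benchmarks into interactive personalization tasks, and defines PREFALIGN, a fine-grained rubric-based metric for preference alignment.
📊 Data & Experiments: Evaluates 21 frontier models across 10 tasks, testing how well they adapt to diverse user preferences in simulated scenarios.
⭐ Contributions: Provides a task definition, an evaluation benchmark, and initial findings for personalized reasoning, laying a foundation for systems that adapt to individual users in education, healthcare, and technical domains where personalization is critical.
Abstract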
Current large language model (LLM) development treats task-solving and preference-alignment as separate challenges, optimizing first for objective correctness, then for alignment to aggregated human preferences. This paradigm fails in human-facing applications where solving a problem correctly is insufficient if the response mismatches the user’s needs. This challenge intensifies in just-in-time scenarios where no prior user interaction history exists due to cold-start conditions or privacy constraints. LLMs need to proactively identify what they don’t know about the user, strategically elicit preference values through questioning, then adapt their reasoning processes and responses accordingly—a complicated chain of cognitive processes which we term personalized reasoning. We introduce PREFDISCO, an evaluation methodology that transforms static benchmarks into interactive personalization tasks using psychologically-grounded personas with sparse, context-dependent preferences, and define PREFALIGN as a fine-grained rubric-based metric for measuring preference alignment. PREFDISCO builds scenarios where identical questions require different reasoning chains depending on user context, as optimal explanation approaches vary by individual expertise and preferences while maintaining factual accuracy. Evaluation of 21 frontier models across 10 tasks reveals 29.0% of naive personalization attempts produce worse preference alignment than generic responses, yet generic responses also fail to serve individual user needs. These findings suggest personalized reasoning requires dedicated development rather than emerging naturally. PREFDISCO provides a foundation for developing systems that can adapt to individual users in education, healthcare, and technical domains where personalization is critical.
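Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#LLM #RLHF #Value Model
TL;DR:DVPO pretrains and freezes a global value model from preference data, turning RLHF into stable policy-only optimization and matching or beating SOTA on major benchmarks.
🎯 Motivation: In reinforcement learning, value estimation is central to policy optimization. In the standard RLHF pipeline, however, training a reward model and then learning a value function online is informationally equivalent to pretraining a value model directly, so online critic learning adds redundancy and instability.
❓ Problem: Pretrains the value model directly to remove the instability caused by redundant critic learning in RLHF, yielding a simpler and more stable policy optimization process.
🔍 Analysis: A pretrained value model provides stable, fine-grained credit assignment and avoids the critic drift and trajectory sampling of online learning.
🛠️ Method: Proposes Decoupled Value Policy Optimization (DVPO), which pretrains a Global Value Model (GVM) offline and freezes it as a universal critic, so that only the policy is optimized.
📊 Data & Experiments: On MT-Bench, Alpaca-Eval, and Arena-Hard, DVPO matches or surpasses state-of-the-art RLHF methods.
⭐ Contributions: Reframes RLHF as policy-only optimization guided by a single pretrained value model, proposes the DVPO framework, and achieves stable training with performance that matches or exceeds existing methods.
Abstract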
In this paper, we explore how directly pretraining a value model simplifies and stabilizes reinforcement learning from human feedback (RLHF).
In reinforcement learning, value estimation is the key to policy optimization, distinct from reward supervision.
The value function predicts the \emph{return-to-go} of a partial answer, that is, how promising the partial answer is if it were continued to completion.
In RLHF, however, the standard pipeline first pretrains a reward model and then learns a value function online, even though no new reward signals are available once preference data is collected.
This makes critic learning redundant, as the process of training a reward model and then deriving a value model is informationally equivalent to directly pretraining a value model.
Importantly, this requires no additional supervision, and our value model is trained on exactly the same data used for reward modeling.
Building on this insight, we introduce \emph{Decoupled Value Policy Optimization} (DVPO), a framework that pretrains a \emph{Global Value Model} (GVM) offline and freezes it as a universal critic for policy learning.
The GVM provides stable, fine-grained credit assignment without critic drift or trajectory sampling.
Experiments across MT-Bench, Alpaca-Eval, and Arena-Hard demonstrate that DVPO matches or surpasses state-of-the-art RLHF methods.
These results highlight that RLHF can be reframed as policy-only optimization guided by a single pretrained value model. The implementation code for our method is available at https://github.com/microsoft/DKI_LLM/tree/main/dvpo.
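Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#masked diffusion models #diffusion language models #reinforcement learning #GRPO #dLLMs
🎯 Motivation: Reinforcement learning works well for autoregressive models, but adapting it to diffusion large language models (dLLMs) faces a fundamental challenge: how to estimate likelihoods for a non-autoregressive generation process.
❓ Problem: Proposes ELBO-based Sequence-level Policy Optimization (ESPO), a sequence-level method built on a variational lower bound, to address the lack of a conditional-probability factorization in dLLM sequence generation.
🔍 Analysis: Token-level RL objectives such as GRPO cannot be applied directly to dLLMs, which generate by iterative denoising; a sequence-level view of optimization is needed.
🛠️ Method: Treats the generation of an entire sequence as a single action, uses the variational lower bound (ELBO) as a sequence-level likelihood proxy, and adds per-token normalization of importance ratios and robust KL-divergence estimation to keep large-scale training stable.
📊 Data and experiments: Extensive experiments on mathematical reasoning, code generation, and planning tasks show gains of 20-40 points on the Countdown task and consistent improvements on math and coding benchmarks.
⭐ Contributions: Establishes a sequence-level optimization framework for dLLMs, opens a direction for RL in non-autoregressive models, and validates it empirically on several tasks.
Abstract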
Reinforcement Learning (RL) has proven highly effective for autoregressive language models, but adapting these methods to diffusion large language models (dLLMs) presents fundamental challenges. The core difficulty lies in likelihood approximation: while autoregressive models naturally provide token-level conditional probabilities essential for token-level RL objectives (e.g., GRPO), dLLMs generate sequences through iterative non-autoregressive denoising steps that lack this factorization. To address this fundamental mismatch, we propose ELBO-based Sequence-level Policy Optimization (ESPO), a principled RL framework that treats entire sequence generation as a single action and uses the ELBO as a tractable sequence-level likelihood proxy. Our method incorporates per-token normalization of importance ratios and robust KL-divergence estimation to ensure stable large-scale training. Extensive experiments on mathematical reasoning, coding, and planning tasks demonstrate that ESPO significantly outperforms token-level baselines, achieving dramatic improvements of 20-40 points on the Countdown task, while maintaining consistent gains on math and coding benchmarks. Our approach establishes sequence-level optimization as a principled and empirically effective paradigm for RL in dLLMs.
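One way to picture the "per-token normalization of importance ratios" is the minimal sketch below. It assumes the sequence-level ratio is formed from ELBO estimates under the new and old policies and is normalized by sequence length in log space. The function name `sequence_ratio` and this exact form are illustrative assumptions, not taken from the paper.

```python
import math

def sequence_ratio(elbo_new: float, elbo_old: float, num_tokens: int) -> float:
    """Length-normalized sequence-level importance ratio (illustrative assumption).

    elbo_new / elbo_old are ELBO estimates of the log-likelihood of the same
    sequence under the current and the old policy.
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    # Dividing the log-ratio by the length keeps long sequences from
    # producing exponentially large or small ratios.
    return math.exp((elbo_new - elbo_old) / num_tokens)

# A 5-nat ELBO gap on a 50-token sequence:
unnormalized = math.exp(-95.0 - (-100.0))       # grows with the raw gap
normalized = sequence_ratio(-95.0, -100.0, 50)  # stays close to 1
```

The normalized ratio stays in a range where PPO-style clipping thresholds remain meaningful, which is one plausible reason for normalizing per token.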
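Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Large Language Models #Abstraction #Procedural Knowledge
TL;DR: LLMs can learn to execute procedures that are described symbolically in their training data, but only with specific finetuning curricula.
🎯 Motivation: LLMs are mostly trained to acquire behaviours from demonstrations or experience, yet much of their training data consists of declarative instructions with no direct demonstration of execution. How to extract reusable procedural knowledge from such declarative data is an important question.
❓ Problem: Proposes a new training regime that lets LLMs learn procedural knowledge from declarative instructions in their training data, compensating for their weakness in executing procedures.
🔍 Analysis: LLMs are weak at learning the mapping from instructions to behaviour during training; without a suitable fine-tuning regime, declarative instructions alone do not embed executable behaviour well.
🛠️ Method: Proposes Programming by Backprop (PBB), a training regime that explicitly separates learning how instructions map to behaviour from internalising new instructions, and designs two PBB curricula for efficient training of procedural knowledge.
📊 Data and experiments: Controlled experiments in two domains (algorithmic execution from Python source code and text generation from context-free grammars) show that the PBB curricula clearly outperform training on a homogeneous data mixture.
⭐ Contributions: Shows that declarative instructions can be embedded into model weights through PBB to yield procedural knowledge, and discusses the implications for data efficiency and model safety.
Abstract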
Large language models (LLMs) are typically trained to acquire behaviours from demonstrations or experience, yet much of their training data is declarative: instructions, rules, and descriptions that specify behaviours without showing how to execute them. We introduce **Programming by Backprop (PBB)**: a training regime that enables LLMs to acquire *procedural* knowledge (i.e., reusable behaviours) from *declarative* instructions encountered during training. With PBB, instructions in training data provide an opportunity to "program" specific behaviours into model weights. The core principle underpinning PBB is the separation of learning how instructions map to behaviour from internalising new instructions. We devise two distinct PBB curricula that leverage this principle. Through controlled experiments across two domains (algorithmic execution from Python source code and text generation from context-free grammars), we demonstrate the benefit of these curricula over training on a homogeneous data mixture. Crucially, PBB is highly sample efficient, with *a single instruction substituting for up to 100 execution examples*. Though execution of instructions in training data remains less reliable than when instructions are given in-context, our results demonstrate that procedural knowledge can be noisily `programmed' into LLMs through PBB, with important implications for data curation and safety.
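Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Instruction Induction #Prompt Generation #Prompt Optimization #Reinforcement Learning #Task Adaptation #Large Language Models
🎯 Motivation: Adapting LLMs to new tasks through in-context learning incurs high inference cost, which motivates a more efficient task-adaptation method.
❓ Problem: Uses instruction induction to reduce training examples to a compact but descriptive prompt while keeping performance comparable to using the full training set.
🔍 Analysis: In-context learning is effective, but inference overhead grows substantially with context length, which makes adaptation to new tasks inefficient.
🛠️ Method: Proposes Prompt-MII, an RL-based framework that meta-learns an instruction induction model able to generate compact instruction prompts for arbitrary new tasks.
📊 Data and experiments: Trained on more than 3,000 classification datasets from the HuggingFace hub and evaluated on 90 unseen tasks; the generated prompts use 3-13x fewer tokens and improve performance by 4-9 F1 points.
⭐ Contributions: Proposes Prompt-MII, an efficient instruction induction method that matches in-context learning on task adaptation while substantially lowering inference cost.
Abstract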
A popular method to adapt large language models (LLMs) to new tasks is in-context learning (ICL), which is effective but incurs high inference costs as context length grows. In this paper we propose a method to perform instruction induction, where we take training examples and reduce them to a compact but descriptive prompt that can achieve performance comparable to ICL over the full training set. Specifically, we propose Prompt-MII, a reinforcement learning (RL) based framework to meta-learn an instruction induction model that can generate compact instructions on the fly for an arbitrary new dataset. We train on over 3,000 diverse classification datasets from the HuggingFace hub, and evaluate on 90 unseen tasks. Prompt-MII improves downstream model quality by 4-9 F1 points (10-20\% relative), matching ICL performance while requiring 3-13x fewer tokens.
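Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Off-policy RL; LLM; Reasoning
🎯 Motivation: Training LLM reasoning conventionally relies on costly on-policy RL, which requires fresh samples at every update and severely limits efficiency and scalability. To improve efficiency, research has turned to asynchronous RL systems that tolerate stale data.
❓ Problem: Stale data in asynchronous RL causes performance degradation or collapse. The paper studies whether stale data can be exploited effectively to reach, stably, performance comparable to on-policy training.
🔍 Analysis: The study uncovers a "prosperity-before-collapse" phenomenon: stale data, if exploited properly, can be as informative as on-policy data. The key is to suppress extreme outliers in the importance weights so that updates stay stable and informative.
🛠️ Method: Proposes M2PO (Second-Moment Trust Policy Optimization), which constrains the second moment of the importance weights to suppress extreme outliers and sharply reduces the fraction of clipped tokens under high staleness, giving stable optimization.
📊 Data and experiments: Extensive evaluation across six model scales (1.7B to 32B), eight math reasoning benchmarks, and one coding benchmark. M2PO trains stably with data stale by at least 256 model updates and matches on-policy performance.
⭐ Contributions: Provides the first systematic analysis and validation of how stale data can be exploited in off-policy RL. Proposes M2PO, which markedly improves training stability and efficiency on stale data and offers a new route to efficient large-scale LLM training.
Abstract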
Reinforcement learning has been central to recent advances in large language model reasoning, but most algorithms rely on on-policy training that demands fresh rollouts at every update, limiting efficiency and scalability. Asynchronous RL systems alleviate this by decoupling rollout generation from training, yet their effectiveness hinges on tolerating large staleness in rollout data, a setting where existing methods either degrade in performance or collapse. We revisit this challenge and uncover a \emph{prosperity-before-collapse} phenomenon: stale data can be as informative as on-policy data if exploited properly. Building on this insight, we introduce \textbf{M2PO} (Second-Moment Trust Policy Optimization), which constrains the second moment of importance weights to suppress only extreme outliers while preserving informative updates. Notably, M2PO sharply reduces the fraction of clipped tokens under high staleness (from 1.22\% to 0.06\% over training), precisely masking high-variance tokens while maintaining stable optimization. Extensive evaluation across six model scales (from 1.7B to 32B), eight math reasoning benchmarks, and one coding benchmark shows that M2PO delivers stable off-policy training even with data stale by \underline{\emph{at least 256 model updates}} and matches on-policy performance. Our code is available at https://github.com/Infini-AI-Lab/M2PO/.
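The idea of constraining a second moment rather than clipping every token can be sketched as follows. This is a minimal illustration under stated assumptions: the second moment is taken over per-token log importance ratios, and tokens are masked in order of largest magnitude until the batch second moment falls below a threshold `tau`. The function name, the use of log-ratios, and the greedy masking order are assumptions for illustration; the released M2PO code may differ.

```python
def second_moment_mask(log_ratios, tau):
    """Return a keep-mask such that mean(log_ratio**2) over kept tokens <= tau.

    Illustrative sketch: tokens with the largest |log_ratio| are dropped first,
    so only extreme outliers are masked and all other tokens still contribute.
    """
    n = len(log_ratios)
    keep = [True] * n
    total = sum(x * x for x in log_ratios)  # running sum of squares over kept tokens
    count = n
    for i in sorted(range(n), key=lambda j: abs(log_ratios[j]), reverse=True):
        if count == 0 or total / count <= tau:
            break  # constraint satisfied, stop masking
        keep[i] = False
        total -= log_ratios[i] ** 2
        count -= 1
    return keep

# One stale outlier among otherwise well-behaved tokens:
mask = second_moment_mask([0.01, -0.02, 0.03, 1.5], tau=0.04)
```

In this toy batch only the outlier is masked, in contrast with per-token clipping, which would apply the same fixed bound to every token regardless of the batch statistics.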
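Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#SFT #generalization #language models #vision language models
TL;DR: PSFT is a trust-region–inspired fine-tuning objective that views SFT as a policy gradient method with constant advantages, constraining policy drift to stabilize training and improve generalization.
🎯 Motivation: Supervised fine-tuning of foundation models often hurts generalization, with prior capabilities deteriorating after tuning on specific tasks. Inspired by trust-region policy optimization and proximal policy optimization in RL, the authors bring the trust-region idea into SFT to improve generalization.
❓ Problem: Targets the training instability and loss of generalization caused by policy drift during SFT, and proposes a trust-region constraint that stabilizes optimization and prevents capability degradation after fine-tuning.
🔍 Analysis: Standard SFT can be viewed as a policy gradient method whose advantage is a positive constant, which over-optimizes the specific task at the expense of original capabilities. Prolonged training tends to cause entropy collapse, which harms later optimization stages.
🛠️ Method: Proposes Proximal SFT (PSFT), which adds a trust-region constraint to the SFT objective to control policy drift. It treats SFT as a special case of policy gradient with constant positive advantages and stabilizes training by constraining policy change.
📊 Data and experiments: Experiments cover three domains: mathematical reasoning, human values, and multimodal tasks. PSFT matches standard SFT in-domain, generalizes better out-of-domain, and stays stable under prolonged training without entropy collapse.
⭐ Contributions: First to systematically bring the trust-region idea into the SFT framework, establishing a theoretical link between SFT and policy gradient methods. PSFT keeps tuning competitive while markedly improving generalization and provides a better foundation for subsequent training stages.
Abstract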
Supervised fine-tuning (SFT) of foundation models often leads to poor generalization, where prior capabilities deteriorate after tuning on specific tasks. Inspired by trust-region policy optimization (TRPO) and proximal policy optimization (PPO) in reinforcement learning (RL), we propose Proximal SFT (PSFT), a fine-tuning objective that incorporates the benefits of trust-region, effectively constraining policy drift during SFT while maintaining competitive tuning. By viewing SFT as a special case of policy gradient methods with constant positive advantages, we derive PSFT that stabilizes optimization and leads to generalization, while leaving room for further optimization in subsequent post-training stages. Experiments across mathematical, human-value, and multimodal domains show that PSFT matches standard SFT in-domain, outperforms it in out-of-domain generalization, remains stable under prolonged training without causing entropy collapse, and provides a stronger foundation for the subsequent optimization.
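Viewing SFT as policy gradient with a constant positive advantage suggests a PPO-style clipped surrogate. The sketch below is a minimal per-token version under that reading: the advantage is fixed to 1 and the clip range `eps` defaults to 0.2. Both the function name and the default are illustrative assumptions, not the paper's exact objective.

```python
import math

def psft_token_loss(logp_new: float, logp_old: float, eps: float = 0.2) -> float:
    """PPO-clip surrogate loss for one target token with advantage fixed to 1.

    logp_new / logp_old are log-probabilities of the ground-truth token under
    the current policy and the frozen old policy (illustrative sketch).
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    advantage = 1.0  # SFT viewed as policy gradient with constant positive advantage
    # Pessimistic (min) surrogate, negated so that lower is better.
    return -min(ratio * advantage, clipped * advantage)

# Once the ratio exceeds 1 + eps the loss stops improving, so the gradient
# vanishes and the policy is not pushed further from the old policy.
capped = psft_token_loss(math.log(2.0), 0.0)  # ratio 2.0 is capped at 1.2
```

With a positive advantage the surrogate reduces to `min(ratio, 1 + eps)`, so only upward drift is capped, which matches the trust-region intuition described above.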
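Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Large Language Models #Large Reasoning Models
TL;DR: This paper analyzes existing preference optimization methods and proposes a computationally efficient method to mitigate the problem of overly lengthy outputs for Large Reasoning Models.
🎯 Motivation: Large reasoning models handle complex tasks well through long chain-of-thought, but overly lengthy outputs raise computational cost and can lead to overthinking; methods that balance reasoning effectiveness and efficiency are needed.
❓ Problem: Existing methods tend to compromise reasoning quality or demand extensive resources. This work aims to reduce generation length with limited tuning while preserving reasoning performance.
🔍 Analysis: Analyzes generation path distributions, filters generated trajectories through difficulty estimation, and studies the convergence characteristics of different preference optimization objectives under a unified Bradley-Terry loss framework.
🛠️ Method: Proposes Length Controlled Preference Optimization (LCPO), which directly balances the implicit reward related to the NLL loss and learns a length preference with limited data and training.
📊 Data and experiments: Extensive experiments on multiple benchmarks show that the method cuts the average output length by over 50% while maintaining reasoning performance.
⭐ Contributions: Offers a computationally efficient solution and shows its potential for guiding large reasoning models toward efficient reasoning.
Abstract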
Recent advances in Large Reasoning Models (LRMs) have demonstrated strong performance on complex tasks through long Chain-of-Thought (CoT) reasoning. However, their lengthy outputs increase computational costs and may lead to overthinking, raising challenges in balancing reasoning effectiveness and efficiency. Current solutions often compromise reasoning quality or require extensive resources. In this paper, we investigate how to reduce the generation length of LRMs with limited tuning. We analyze generation path distributions and filter generated trajectories through difficulty estimation. Subsequently, we analyze the convergence characteristics of various preference optimization objectives under a unified Bradley-Terry loss based framework. Based on the analysis, we propose Length Controlled Preference Optimization (LCPO) that directly balances the implicit reward related to NLL loss. LCPO can effectively learn length preference with limited data and training. Extensive experiments demonstrate that our method significantly reduces the average output length of LRMs by over 50\% across multiple benchmarks while maintaining the reasoning performance. Our work highlights the potential for computationally efficient approaches in guiding LRMs toward efficient reasoning.
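For readers unfamiliar with the Bradley-Terry loss that the analysis is framed around, here is a minimal sketch of the generic pairwise form, `-log(sigmoid(r_chosen - r_rejected))`. It is the textbook loss, not LCPO's objective, and the reward values in the example are made up for illustration.

```python
import math

def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Generic Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals softplus(-m); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical implicit rewards for a short (chosen) and a long (rejected) response:
loss = bradley_terry_loss(reward_chosen=1.3, reward_rejected=0.4)
```

Preference optimization objectives of this family differ mainly in how the two rewards are defined, which is what makes a unified analysis under one loss possible.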
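Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#rubrics #reinforcement-learning #open-ended qa #large language model #generation
TL;DR: We turn web-mined, question-specific rubrics into verifiable rewards to RL-train LLMs for open-ended QA, boosting alignment and performance.
🎯 Motivation: Open-ended QA lacks reliable evaluation and verifiable reward signals. Existing approaches rely on human feedback or LLM-as-judge, which are costly, prone to reward hacking, and insufficiently discriminative or interpretable.
❓ Problem: Proposes a schema for generating question-specific rubrics that are sensitive to both content and style, evaluating the factual soundness and writing quality of answers, and uses them to guide RL training of LLMs.
🔍 Analysis: Current evaluation methods for open-ended QA do not reliably reflect output quality; human-feedback and self-generated evaluation both have clear limitations.
🛠️ Method: Automatically mines question-specific rubrics from online sources, uses them as reward signals, and applies the GRPO algorithm to guide the model toward the correct generation path.
📊 Data and experiments: Extensive experiments on an open-ended QA benchmark show a total improvement of +17.0 points, supporting the effectiveness of rubric-guided RL for open-ended QA.
⭐ Contributions: Introduces QuRL, a framework that combines automatically mined rubrics with RL, markedly improving alignment and performance on open-ended QA and offering a new perspective on optimization.
Abstract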
Reinforcement Learning from Verifiable Rewards (RLVR) has significantly improved the performance of large language models (LLMs) on tasks with gold ground truth, such as code generation and mathematical reasoning. However, its application to open-ended question answering (QA) remains challenging, primarily due to the absence of reliable evaluation and verifiable reward signals. This difficulty is further compounded by the limitations of existing evaluation paradigms. Previous approaches typically rely on human feedback or LLM-as-judge strategies, which are costly, prone to reward hacking, and often fail to provide sufficiently discriminative or interpretable evaluation signals. To address these limitations, we introduce a schema for generating case-wise rubrics that are question-specific, content-based and stylistically sensitive, thereby evaluating both factual soundness and writing quality. Building on this schema, we propose QuRL (Open-Ended QA with Rubric-guided Reinforcement Learning), a framework that automatically mines rubrics for each question from easily accessible online sources and leverages them as reward signals. With these rubrics, QuRL employs the GRPO (Group Relative Policy Optimization) algorithm to guide the model in exploring the correct generation path. Extensive experiments show that our framework achieves a significant total improvement of +17.0 points on the evaluation benchmark, demonstrating the effectiveness of rubric-guided reinforcement learning for open-ended QA.
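The way rubric scores can feed GRPO is easiest to see in the group-relative advantage step. The sketch below assumes each sampled answer has already been scored against the question's rubrics (here a hypothetical fraction of rubric items satisfied) and normalizes the scores within the group. The use of the population standard deviation and the `eps` constant are common implementation choices, not details confirmed by the paper.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same question."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; some implementations differ
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rubric scores for four sampled answers to one question:
rubric_scores = [0.25, 0.50, 0.50, 1.00]
advantages = group_relative_advantages(rubric_scores)
```

Answers that satisfy more rubric items than their group average receive positive advantages and are reinforced; no separate value model is needed.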
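Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Post-Training #Large Reasoning Models #Large Language Models #Performance Prediction #Reinforcement Learning with Verifiable Rewards
TL;DR: We show extensive examples where high SFT scores do not transfer to improved RL performance in reasoning post-training; we propose generalization loss on held-out SFT examples and pass@large k as viable proxies for predicting post-RL performance.
🎯 Motivation: Post-training for reasoning LLMs is usually split into supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). The paper questions whether high SFT scores predict performance gains after RL.
❓ Problem: Finds that high SFT scores can be biased toward simpler or more homogeneous data and do not reliably predict subsequent RL outcomes; proposes alternative metrics that predict RL performance better.
🔍 Analysis: Experiments show that models with high SFT scores can perform worse after RL. For example, training only on short examples can raise SFT scores yet lower post-RL performance, reflecting the complex effect of the data distribution.
🛠️ Method: Proposes generalization loss on held-out examples and Pass@large k as more reliable predictors of the RL outcome, and evaluates their statistical performance.
📊 Data and experiments: Hundreds of models of up to 12B parameters, drawn from Llama3, Mistral-Nemo, and others, trained with multiple SFT/RL datasets and evaluated on 7 math benchmarks with extensive repetitions, using over 1M GPU hours.
⭐ Contributions: The proposed metrics markedly improve the precision of RL performance prediction and the associated correlation statistics; the paper exposes the misleading relationship between SFT scores and RL performance and open-sources an evaluation tool for math reasoning.
Abstract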
In post-training for reasoning Large Language Models (LLMs), the current state of practice trains LLMs in two independent stages: Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR, shortened as "RL" below). In this work, we challenge whether high SFT scores translate to improved performance after RL. We provide extensive counter-examples where this is not true. We find high SFT scores can be biased toward simpler or more homogeneous data and are not reliably predictive of subsequent RL gains or scaled-up post-training effectiveness. In some cases, RL training on models with improved SFT performance could lead to a substantially worse outcome compared to RL on the base model without SFT. We study alternative metrics and identify generalization loss on held-out reasoning examples and Pass@large k performance to provide strong proxies for the RL outcome. We trained hundreds of models of up to 12B parameters with SFT and RLVR via GRPO and ran extensive evaluations on 7 math benchmarks with up to 256 repetitions, spending $>$1M GPU hours. Experiments include models from Llama3, Mistral-Nemo, Qwen3 and multiple state-of-the-art SFT/RL datasets. Compared to directly predicting from pre-RL performance, prediction based on generalization loss and Pass@large k achieves substantially higher precision, improving $R^2$ coefficient and Spearman's rank correlation coefficient by up to 0.5 (2x). This provides strong utility for broad use cases. For example, in most experiments, we find SFT training on unique examples for one epoch underperforms training on half as many examples for two epochs, either after SFT or SFT-then-RL. With the same SFT budget, training only on short examples may lead to better SFT performance, though it often leads to a worse outcome after RL compared to training on examples with varying lengths. This work develops an enhanced evaluation tool for math reasoning tasks and is open-sourced.
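Pass@k with a large k is usually computed from n sampled answers per problem, c of which are correct, with the standard unbiased estimator `1 - C(n-c, k) / C(n, k)`. The sketch below implements that estimator; whether the paper computes Pass@large k in exactly this way is an assumption.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct ones."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 256 repetitions per problem and 3 correct answers:
p1 = pass_at_k(256, 3, 1)    # close to plain accuracy
p64 = pass_at_k(256, 3, 64)  # much higher: rewards rare but present solutions
```

A model can have low pass@1 yet high pass@k at large k, which is a signal that correct solutions are within reach of sampling and can be amplified by RL.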
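Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Large Language Models #Code Interpreter #Reinforcement Learning #Curriculum Learning #Symbolic Reasoning #Tool Use
TL;DR: We present R1-Code-Interpreter, an LLM trained with supervised and multi-stage RL to integrate code execution with reasoning, achieving strong performance across 144 diverse tasks and outperforming GPT-4o with Code Interpreter.
🎯 Motivation: There is no effective recipe for training LLMs to use a Code Interpreter across diverse tasks, which limits extending their reasoning to a broad range of domains.
❓ Problem: Develops a general-purpose Code Interpreter training framework that uses supervised learning and multi-stage RL to cope with task heterogeneity and the scarcity of effective samples.
🔍 Analysis: Step-by-step reasoning with code generation can substantially improve problem solving, and the quality of training samples and the optimization strategy strongly affect the results.
🛠️ Method: Proposes a multi-stage curriculum learning strategy that partitions training samples by their improvement potential and moves gradually from high-potential to low-potential samples.
📊 Data and experiments: Curates 144 diverse tasks and runs experiments on Qwen-2.5 models, measuring accuracy; the final model substantially improves performance on the 37 test tasks.
⭐ Contributions: Develops R1-Code-Interpreter, which integrates code execution with reasoning and uses RL to improve performance, outperforming GPT-4o with and without Code Interpreter and exhibiting self-checking behavior; datasets and resources are publicly available.
Abstract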
Practical guidance on training Large Language Models (LLMs) to leverage Code Interpreter across diverse tasks remains lacking. We present R1-Code-Interpreter, an extension of a text-only LLM trained via multi-turn supervised fine-tuning (SFT) and reinforcement learning (RL) to autonomously generate multiple code queries during step-by-step reasoning. Unlike prior RL + tool-use efforts focused on narrow domains such as math or retrieval, we curate 144 diverse reasoning and planning tasks and show that training a general-purpose Code Interpreter across them presents significant challenges due to task heterogeneity and scarcity of effective samples. To address this, we introduce a multi-stage curriculum learning approach that partitions training samples by measured improvement potential. The RL training prioritizes samples with higher potential and gradually shifts to lower-potential ones, increasing the average RL gains from merely +3.4\% to +9.3\% across Qwen-2.5 models (3/7/14B). Our final model, R1-CI-14B, improves average accuracy on the 37 test tasks from 44.1\% to 72.4\%, outperforming text-only GPT-4o (58.6\%) and GPT-4o with Code Interpreter (70.9\%). Notably, R1-CI-14B also exhibits emergent self-checking behavior through code generation. Datasets, Codes, and Models are available at https://github.com/yongchao98/R1-Code-Interpreter and https://huggingface.co/yongchao98.
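Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Large Reasoning Model #Instruction Following #Model Merging #Null-Space
🎯 Motivation: Large reasoning models excel at long-chain reasoning but fall clearly short in following instructions about output format, constraints, and specific requirements; a method to close this gap is needed.
❓ Problem: Integrates an instruction-tuned model into a large reasoning model to resolve the tension between reasoning ability and instruction following while preserving the reasoning model's structured reasoning mechanisms.
🔍 Analysis: In parameter space, the task vectors of the two model types have nearly orthogonal principal subspaces across key modules, which suggests that lightweight merging is possible; however, the mismatch in output format makes naive merging fragile.
🛠️ Method: Proposes RAIN-Merging, which preserves the reasoning structure by projecting onto the null space of forward features at special thinking tokens to reduce interference, and uses a small instruction calibration set to scale module weights and amplify instruction-relevant components.
📊 Data and experiments: Validated on four instruction-following benchmarks and nine reasoning and general-capability benchmarks; the method substantially improves instruction adherence while maintaining reasoning quality, with consistent gains across model scales and architectures.
⭐ Contributions: Proposes RAIN-Merging, a gradient-free merging method that balances instruction following and reasoning performance in large reasoning models, adding a new tool for model merging.
Abstract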
Large reasoning models (LRMs) excel at a long chain of reasoning but often fail to faithfully follow instructions regarding output format, constraints, or specific requirements. We investigate whether this gap can be closed by integrating an instruction-tuned model (ITM) into an LRM. Analyzing their differences in parameter space, namely task vectors, we find that their principal subspaces are nearly orthogonal across key modules, suggesting a lightweight merging with minimal interference. However, we also demonstrate that naïve merges are fragile because they overlook the output format mismatch between LRMs (with explicit *thinking* and *response* segments) and ITMs (answers-only). We introduce **RAIN-Merging** (Reasoning-Aware Instruction-attention guided Null-space projection Merging), a gradient-free method that integrates instruction following while preserving thinking format and reasoning performance. First, with a small reasoning calibration set, we project the ITM task vector onto the null space of forward features at thinking special tokens, which preserves the LRM's structured reasoning mechanisms. Second, using a small instruction calibration set, we estimate instruction attention to derive module-specific scaling that amplifies instruction-relevant components and suppresses leakage. Across four instruction-following benchmarks and nine reasoning & general capability benchmarks, RAIN-Merging substantially improves instruction adherence while maintaining reasoning quality. The gains are consistent across model scales and architectures, translating to improved performance in agent settings.
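The null-space projection step can be illustrated on a toy linear layer. The sketch below assumes a single calibration feature vector `x` and removes from a task-vector matrix `delta_w` the component that would change the layer's output on `x`, so that `projected @ x = 0`. The paper works with features collected at thinking special tokens over a calibration set; reducing this to one vector, and the helper names, are simplifications for illustration.

```python
def matvec(matrix, vector):
    """Plain-Python matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def project_to_null_space(delta_w, x):
    """Project each row of delta_w onto the subspace orthogonal to x.

    After projection, adding delta_w to the layer weights leaves the layer
    output on input x unchanged (illustrative single-feature case).
    """
    norm_sq = sum(v * v for v in x)
    if norm_sq == 0:
        return [row[:] for row in delta_w]
    projected = []
    for row in delta_w:
        coeff = sum(w * v for w, v in zip(row, x)) / norm_sq
        projected.append([w - coeff * v for w, v in zip(row, x)])
    return projected

delta_w = [[1.0, 2.0], [3.0, 4.0]]   # hypothetical ITM task vector for one layer
thinking_feature = [1.0, 0.0]        # hypothetical feature at a thinking token
safe_delta = project_to_null_space(delta_w, thinking_feature)
```

Inputs orthogonal to the protected feature still see the full update, which is how instruction-following changes can be merged while the protected behavior is left intact.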
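Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Large Language Models #Reinforcement Learning #Generalization #Learnability
🎯 Motivation: Explores whether LLMs can acquire and transfer new algorithmic reasoning abilities through reinforcement learning, rather than relying only on skills encoded during pre-training and fine-tuning.
❓ Problem: Introduces the DELTA benchmark to evaluate whether LLMs can learn to solve problems that pretrained models cannot solve and transfer those skills, and studies the limits of learnability and transferability.
🔍 Analysis: RL-trained models undergo an abrupt grokking-style transition after a long period of low reward; they transfer well within problem families but remain weak on transformative cases.
🛠️ Method: Uses staged warm-up, experience replay, curriculum training, and verification-in-the-loop to help models solve problems that pretrained models cannot.
📊 Data and experiments: Built on DELTA, with synthetic problem generators that produce fully out-of-distribution problems; evaluates transfer along exploratory, compositional, and transformative axes.
⭐ Contributions: Introduces DELTA as a controlled testbed for probing the limits of RL in algorithmic reasoning and identifies key factors in acquiring and transferring new skills.
Abstract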
It remains an open question whether LLMs can acquire or generalize genuinely new reasoning strategies, beyond the sharpened skills encoded in their parameters during pre-training or post-training.
To address this debate, we introduce DELTA (Distributional Evaluation of Learnability and Transferability in Algorithmic Coding), a controlled benchmark of synthetic coding problem families designed to probe two fundamental aspects. The first is learnability: can LLMs, through reinforcement learning (RL), solve problem families where pretrained models fail even with a large number of attempts (pass@K=0)? The second is transferability: if learnability happens, can such skills transfer systematically to out-of-distribution (OOD) test sets? Unlike prior public coding datasets, DELTA isolates reasoning skills through templated problem generators and introduces fully OOD problem families that demand novel strategies rather than tool invocation or memorized patterns.
Our experiments reveal a striking grokking phase transition: after an extended period with near-zero reward, RL-trained models abruptly climb to near-perfect accuracy.
To enable learnability on previously unsolvable problem families, we explore key training ingredients such as staged warm-up with dense rewards, experience replay, curriculum training, and verification-in-the-loop.
Beyond learnability, we use DELTA to evaluate transferability or generalization along exploratory, compositional, and transformative axes, as well as cross-family transfer. Results show solid gains within families and for recomposed skills, but persistent weaknesses in transformative cases. DELTA thus offers a clean testbed for probing the limits of RL-driven reasoning and for understanding how models can move beyond existing priors to acquire new algorithmic skills.
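Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Reinforcement Learning #Large Language Models #Catastrophic Forgetting
🎯 Motivation: Reinforcement learning and supervised fine-tuning reach similar performance on a new task, but RL forgets far less prior knowledge; the cause and mechanism call for investigation.
❓ Problem: Analyzes why on-policy RL preserves the original model's knowledge and capabilities better than supervised fine-tuning, and reveals how forgetting relates to distributional shift.
🔍 Analysis: Experiments show that the degree of distributional shift, measured by KL divergence, is the key factor in forgetting; RL favors solutions with small distributional shift, whereas supervised fine-tuning can produce large shifts.
🛠️ Method: Combines theoretical analysis and experiments on on-policy RL updates to show that they tend toward KL-minimal solutions, which reduces forgetting.
📊 Data and experiments: Experiments with large language models and robotic foundation models measure forgetting and distributional shift for RL and supervised fine-tuning across tasks.
⭐ Contributions: Proposes "RL's Razor", a principle describing RL's implicit bias, to explain how RL retains the original model's capabilities when solving new tasks, supported by theory and experiments.
Abstract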
Comparison of fine-tuning models with reinforcement learning (RL) and supervised fine-tuning (SFT) reveals that, despite similar performance at a new task, RL preserves prior knowledge and capabilities significantly better. We find that the degree of forgetting is determined by the distributional shift, measured as the KL-divergence between the fine-tuned and base policy evaluated on the new task. Our analysis reveals that on-policy RL is implicitly biased towards KL-minimal solutions among the many that solve the new task, whereas SFT can converge to distributions arbitrarily far from the base model. We validate these findings through experiments with large language models and robotic foundation models and further provide theoretical justification for why on-policy RL updates lead to a smaller KL change. We term this principle $\textit{RL’s Razor}$: among all ways to solve a new task, RL prefers those closest in KL to the original model.
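The KL measurement behind this principle can be made concrete with a toy discrete example. The sketch below computes `KL(p || q)` for two categorical distributions and compares two hypothetical fine-tuned policies against a base policy. The probabilities are invented for illustration, and the direction of the divergence (fine-tuned relative to base) is an assumption; the abstract only says the KL is taken between the two policies on the new task.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions given as probability lists.

    Terms with p_i == 0 contribute nothing; q_i must be positive wherever p_i is.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

base = [0.50, 0.30, 0.20]        # hypothetical base-policy distribution
rl_like = [0.60, 0.25, 0.15]     # solves the task with a small shift
sft_like = [0.98, 0.01, 0.01]    # solves the task with a large shift

shift_rl = kl_divergence(rl_like, base)
shift_sft = kl_divergence(sft_like, base)
```

Under the paper's finding, the policy with the smaller shift would be expected to forget less, even if both reach similar accuracy on the new task.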
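Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#large language model #reinforcement learning #dynamic critics #language model post-training #open-ended generation
TL;DR: RLAC trains a dynamic critic jointly with the generator (RL policy) in an adversarial two-player game, enabling rubric verification for free-form generation tasks without exhaustively enumerating rubrics or manually engineering robust reward models.
🎯 Motivation: Open-ended generation must satisfy diverse and often implicit evaluation criteria, while verifying rubric-based rewards is prohibitively costly and yields incomplete assessments of output quality.
❓ Problem: Proposes dynamic verification to address the difficulty of enumerating rubrics exhaustively and the limited robustness of reward models, enabling efficient post-training.
🔍 Analysis: Static evaluation cannot adapt to the diversity and variability of generation tasks; the best way to combine rewards is usually context-specific, which makes verification inefficient.
🛠️ Method: RLAC jointly trains the generator and a dynamic critic as a two-player game; the critic, a large language model, dynamically identifies likely failure modes, which an external validator then confirms.
📊 Data and experiments: Experiments show that RLAC improves factual accuracy in text generation and correctness in code generation, outperforming exhaustive verification and reward-model methods.
⭐ Contributions: Proposes the dynamic critic mechanism, improving generation quality and verification efficiency, and shows RLAC's potential for scaling RL post-training to free-form generation tasks.
Abstract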
Open-ended generation tasks require outputs to satisfy diverse and often implicit task-specific evaluation rubrics. The sheer number of relevant rubrics leads to prohibitively high verification costs and incomplete assessments of a response, making reinforcement learning (RL) post-training with rubric-based rewards difficult to scale. This problem is exacerbated by the fact that often the best way to combine these rubrics into one single reward is also highly prompt-specific. We propose Reinforcement Learning with Adversarial Critic (RLAC), a post-training approach that addresses these challenges via dynamic rubric verification. Our approach employs a large language model (LLM) as a critic that dynamically identifies only the most likely failure modes (e.g., a factual error or unhandled edge case), which are then verified by an external validator to optimize both generator and critic jointly. By training both the generator and the critic, this game enhances the critic's error detection and the generator's output quality while reducing required verifications. Our experiments demonstrate that RLAC improves factual accuracy in text generation and correctness in code generation, while also outperforming exhaustive verification and reward model methods. We show that dynamic critics are more effective than fixed critics, showcasing the potential of RLAC for scaling RL post-training to free-form generation tasks.
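Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#reward modeling #model alignment #inference-time control #customization #LLM post-training
TL;DR: We present an approach to bridge RL with Human Feedback and Verifiable Rewards. Our method achieves #1 on JudgeBench leaderboard and exceeds or matches DeepSeek R1 and o3-mini on Arena Hard V2, WildBench and MT Bench at <5% of their inference cost.
🎯 Motivation: RLHF lacks interpretability and is prone to reward hacking, while RLVR is limited to correctness-based verification and cannot capture output quality more broadly.
❓ Problem: Proposes RLBFF, which combines the flexibility of human preferences with the precision of rule-based verification to improve reward modeling and overcome the respective limits of RLHF and RLVR.
🔍 Analysis: RLHF depends on human judgments that lack explicit criteria, which invites inconsistency; RLVR is verifiable but confined to correctness and overlooks broader quality dimensions.
🛠️ Method: Extracts principles that can be answered in a binary fashion from natural language feedback, uses them to ground reward model training framed as an entailment task, and lets users specify principles of interest at inference time.
📊 Data and experiments: The trained reward models reach 86.2% on RM-Bench and 81.4% on JudgeBench, and the resulting aligned model matches or exceeds o3-mini and DeepSeek R1 on MT-Bench, WildBench, and Arena Hard v2 at under 5% of the inference cost.
⭐ Contributions: Proposes RLBFF, which combines human feedback with verifiable rewards, shows advantages in accuracy, flexibility, and inference cost, and releases an open alignment recipe and dataset.
Abstract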
Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are the main RL paradigms used in LLM post-training, each offering distinct advantages. However, RLHF struggles with interpretability and reward hacking because it relies on human judgments that usually lack explicit criteria, whereas RLVR is limited in scope by its focus on correctness-based verifiers. We propose Reinforcement Learning with Binary Flexible Feedback (RLBFF), which combines the versatility of human-driven preferences with the precision of rule-based verification, enabling reward models to capture nuanced aspects of response quality beyond mere correctness. RLBFF extracts principles that can be answered in a binary fashion (e.g. accuracy of information: yes, or code readability: no) from natural language feedback. Such principles can then be used to ground Reward Model training as an entailment task (response satisfies or does not satisfy an arbitrary principle). We show that Reward Models trained in this manner can outperform Bradley-Terry models when matched for data and achieve top performance on RM-Bench (86.2\%) and JudgeBench (81.4\%, \#1 on leaderboard as of September 24, 2025). Additionally, users can specify principles of interest at inference time to customize the focus of our reward models, in contrast to Bradley-Terry models. Finally, we present a fully open source recipe (including data) to align Qwen3-32B using RLBFF and our Reward Model, to match or exceed the performance of o3-mini and DeepSeek R1 on general alignment benchmarks of MT-Bench, WildBench, and Arena Hard v2 (at $<5$\% of the inference cost).
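Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Reward Model #Reasoning #Reinforcement Learning
TL;DR: Reward model with thinking improves the reward accuracy.
🎯 Motivation: Reward models are key to aligning large language models with human preferences through RL. Current models lack deep thinking and interpretable reasoning before emitting a reward signal, which limits their performance and accuracy.
❓ Problem: Explores integrating reasoning into reward models to make reward signals more interpretable and accurate, and validates the approach on several benchmarks.
🔍 Analysis: Long chain-of-thought markedly improves performance on complex tasks, and a reward model's performance and interpretability depend on the quality of its reasoning.
🛠️ Method: Proposes Reasoning Reward Models (ReasRMs), which treat reward modeling as a reasoning task. A chain-of-rubrics (CoR) mechanism generates sample-level rubrics, and training has two stages: distillation of high-quality reasoning chains and RL with verifiable rewards.
📊 Data and experiments: Evaluated on three reward model benchmarks; on average, RM-R1 outperforms leading open-weight and proprietary reward models by up to 4.9%. The experiments include detailed analyses of the key ingredients of successful training.
⭐ Contributions: First to integrate reasoning into reward modeling; proposes an efficient reasoning-oriented training pipeline that improves interpretability and performance; demonstrates the advantages and potential of reasoning reward models for RL.
Abstract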
Reward modeling is essential for aligning large language models with human preferences through reinforcement learning. To provide accurate reward signals, a reward model (RM) should stimulate deep thinking and conduct interpretable reasoning before assigning a score or a judgment. Inspired by recent advances of long chain-of-thought on reasoning-intensive tasks, we hypothesize and validate that integrating reasoning into reward modeling significantly enhances RM's interpretability and performance. We introduce a new class of generative reward models, Reasoning Reward Models (ReasRMs), which formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. RM-R1 features a chain-of-rubrics (CoR) mechanism -- self-generating sample-level chat rubrics or math/code solutions, and evaluating candidate responses against them. The training of RM-R1 consists of two key stages: (1) distillation of high-quality reasoning chains and (2) reinforcement learning with verifiable rewards. Empirically, our models achieve superior performance across three reward model benchmarks on average, outperforming much larger open-weight models (e.g., INF-ORM-Llama3.1-70B) and proprietary ones (e.g., GPT-4o) by up to 4.9%. Beyond final performance, we perform thorough analyses to understand the key ingredients of successful ReasRM training.
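Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Conversational recommender system #Reinforcement learning with Verifiable Reward
🎯 Motivation: LLMs let users express preferences and receive recommendations through conversation, but on recommendation tasks pretrained LLMs generate invalid items, violate output formats, and degrade in ranking quality.
❓ Problem: To address out-of-catalog items, format violations, and ranking degradation, proposes an efficient training framework and RL method tailored to the task.
🔍 Analysis: Pretrained LLMs output out-of-catalog items and their ranking quality drops sharply toward the end of the list; conventional behavioral cloning and sequence-level optimization handle these issues poorly.
🛠️ Method: Proposes ConvRec-R1, a two-stage pipeline: first build a behavioral-cloning dataset to warm-start RL training, then apply Rank-GRPO, which redefines reward assignment and uses a rank-level importance ratio based on the geometric mean of token probabilities.
📊 Data and experiments: On the Reddit-v2 and Redial datasets, ConvRec-R1 converges faster than baselines and achieves higher Recall and NDCG.
⭐ Contributions: First to propose Rank-GRPO, which treats each rank as the optimization unit; designs ConvRec-R1 for efficient training of LLM-based conversational recommender systems; releases code and datasets for follow-up research.
Abstract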
Large language models (LLMs) are reshaping the recommender system paradigm by enabling users to express preferences and receive recommendations through conversations. Yet, aligning LLMs to the recommendation task remains challenging: pretrained LLMs often generate out-of-catalog items, violate required output formats, and their ranking quality degrades sharply toward the end of the generated list. To this end, we propose ConvRec-R1, a two-stage framework for end-to-end training of LLM-based conversational recommender systems. In Stage 1, we construct a behavioral-cloning dataset with a Remap-Reflect-Adjust pipeline, which produces high-quality, catalog-grounded demonstrations from powerful blackbox LLMs to warm-start the RL training. In Stage 2, we propose Rank-GRPO, a principled extension of group relative policy optimization (GRPO) tailored to tasks with rank-style outputs. Rank-GRPO treats each rank in the recommendation list as the unit instead of token (too fine-grained) or sequence (too coarse), redefining rewards to remove non-causal credit assignment and introducing a rank-level importance ratio based on the geometric mean of rank-wise token probabilities to stabilize policy updates. Experiments on the Reddit-v2 and Redial datasets show that ConvRec-R1 converges faster and achieves higher Recall and NDCG than GRPO-style baselines. Code and datasets are released at https://github.com/yaochenzhu/Rank-GRPO.
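The rank-level importance ratio described in the abstract can be illustrated with a small sketch. It assumes that each ranked item spans several tokens and that per-token log-probabilities are available under both the current and the old policy; the function name and the sample numbers are hypothetical, and this is not the authors' implementation.

```python
import math

def rank_level_ratio(new_logps, old_logps):
    """Ratio of geometric-mean token probabilities for one ranked item.

    new_logps / old_logps: per-token log-probs of the item's tokens under
    the current policy and the old (behavior) policy. The geometric mean of
    token probabilities equals exp(mean(log p)), so the ratio is
    exp(mean(new) - mean(old)), independent of the item's token length.
    """
    if len(new_logps) != len(old_logps) or not new_logps:
        raise ValueError("need equal-length, non-empty log-prob lists")
    mean_diff = sum(n - o for n, o in zip(new_logps, old_logps)) / len(new_logps)
    return math.exp(mean_diff)

# Hypothetical recommendation list with two ranked items (2 and 3 tokens).
ranks_new = [[-0.5, -0.7], [-1.0, -1.2, -0.9]]
ranks_old = [[-0.6, -0.8], [-1.0, -1.0, -1.0]]
ratios = [rank_level_ratio(n, o) for n, o in zip(ranks_new, ranks_old)]
```

Averaging log-probabilities before exponentiating keeps long item titles from producing extreme ratios, which is one plausible reading of why the geometric mean stabilizes updates.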
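Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#General Reasoning #Reinforcement Learning #Large Language Models
TL;DR:A verifier-free RL method to improve general reasoning for LLMs.
🎯 Motivation: Existing RL methods depend on verifiable answers, which limits LLMs' general reasoning in complex real-world domains.
❓ Problem: A verifier-free RL method that removes the dependence on a strong verifier and the compute and memory it consumes.
🔍 Observations: Verifier-based methods are susceptible to reward hacking and add memory and compute overhead during training.
🛠️ Method: VeriFree bypasses the verifier and directly optimizes the probability of generating the reference answer.
📊 Data and Experiments: Evaluated on MMLU-Pro, GPQA, SuperGPQA, and math-related benchmarks; VeriFree matches or surpasses verifier-based methods while requiring less compute.
⭐ Contributions: A verifier-free RL approach for general reasoning that makes training more efficient and extends the reach of LLM reasoning to a wider range of domains.
Abstract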
The recent paradigm shift towards training large language models (LLMs) using DeepSeek-R1-Zero-style reinforcement learning (RL) on verifiable rewards has led to impressive advancements in code and mathematical reasoning. However, this methodology is limited to tasks where rule-based answer verification is possible and does not naturally extend to real-world domains such as chemistry, healthcare, engineering, law, biology, business, and economics. Current practical workarounds use an additional LLM as a model-based verifier; however, this introduces issues such as reliance on a strong verifier LLM, susceptibility to reward hacking, and the practical burden of maintaining the verifier model in memory during training. To address this and extend DeepSeek-R1-Zero-style training to general reasoning domains, we propose a verifier-free method (VeriFree) that bypasses answer verification and instead directly maximizes the probability of generating the reference answer, derived in a principled way from the RL objective. We compare VeriFree with verifier-based methods and demonstrate that, in addition to its significant practical benefits and reduced compute requirements, VeriFree matches and even surpasses verifier-based methods on extensive evaluations across MMLU-Pro, GPQA, SuperGPQA, and math-related benchmarks.
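A minimal sketch of the idea of scoring a sampled reasoning trace by the probability the policy assigns to the reference answer, with no verifier in the loop. It assumes the per-token log-probabilities of the reference answer (conditioned on the question and the sampled reasoning) have already been computed; the function names and numbers are hypothetical, and the exact estimator used by VeriFree is not given in the abstract.

```python
import math

def reference_answer_reward(ref_token_logps):
    """Probability of the whole reference answer given question + reasoning.

    ref_token_logps: log-probs of each reference-answer token under the
    policy, conditioned on the question and one sampled reasoning trace.
    """
    return math.exp(sum(ref_token_logps))

def estimate_objective(traces):
    """Monte Carlo estimate over sampled reasoning traces (an assumption:
    the objective is approximated by a plain average over samples)."""
    rewards = [reference_answer_reward(t) for t in traces]
    return sum(rewards) / len(rewards)

# Two hypothetical reasoning traces; the first makes the reference likelier.
traces = [
    [math.log(0.9), math.log(0.8)],
    [math.log(0.5), math.log(0.5)],
]
objective = estimate_objective(traces)
```

Because the score is a probability under the policy itself, no second model has to be kept in memory, which matches the practical benefit the abstract highlights.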
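Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Large Language Models ; Retrieval-Augmented Generation ; Reinforcement Learning
TL;DR:We propose Knowledgeable-R1, a reinforcement-learning framework that explicitly trains large language models to use parametric knowledge (PK) to resist contextual interference while still exploiting external context when it is reliably helpful.
🎯 Motivation: Retrieval-augmented generation (RAG) performs well on knowledge-intensive tasks but is easily derailed by wrong, irrelevant, or conflicting retrieved text; the model's own parametric knowledge should be used to resist such interference.
❓ Problem: An RL framework that explicitly trains LLMs to use parametric knowledge (PK) to stay accurate under contextual interference while still exploiting retrieved context when it is reliably helpful.
🔍 Observations: Wrong or conflicting retrieved text leads models to rely on inaccurate evidence and cascade errors, which is most visible in knowledge-conflict scenarios.
🛠️ Method: Knowledgeable-R1 uses a joint sampling scheme that generates paired responses with and without retrieval and learns both local and global advantages; an asymmetric advantage transformation amplifies exploration toward parametric knowledge.
📊 Data and Experiments: Experiments cover knowledge-conflict and general RAG scenarios; the model outperforms SOTA baselines by +22.89% in counterfactual scenarios and shows no degradation when the retrieved context is fully accurate.
⭐ Contributions: An RL framework that resists contextual interference by combining parametric knowledge with retrieved information, improving robustness and reasoning accuracy under knowledge conflict.
Abstract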
Retrieval-augmented generation (RAG) improves performance on knowledge-intensive tasks but can be derailed by wrong, irrelevant, or conflicting retrieved text, causing models to rely on inaccurate evidence and cascade errors. We propose Knowledgeable-R1, a reinforcement-learning framework that explicitly trains large language models to use parametric knowledge (PK) to resist contextual interference while still exploiting external context when it is reliably helpful. Knowledgeable-R1 introduces a joint sampling scheme that generates paired responses with and without retrieval, and learns both local advantages (within each decoding regime) and global advantages under the same input to quantify when to ignore misleading context versus adopt it. We employ an asymmetric advantage transformation that amplifies exploratory behaviors toward parametric knowledge. Experiments show that Knowledgeable-R1 significantly improves robustness and reasoning accuracy in knowledge conflict scenarios and general RAG scenarios, outperforming SOTA baselines by +22.89% in counterfactual scenarios, and without degradation when the retrieved context is fully accurate.
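Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Large Vision Language Models #Multi-Image Safety #Dataset #Safety Fine-Tuning
TL;DR:We propose a novel safety fine-tuning pipeline, Multi-Image Reasoning Safety (MIRage), which significantly enhances the model’s ability to handle challenging safety-related tasks without compromising its general performance.
🎯 Motivation: Deploying large vision-language models (VLMs) in safety-critical domains is challenging; existing safety fine-tuning methods fall short on hard cases or upset the balance between helpfulness and harmlessness. This work targets the missing safety visual reasoning ability.
❓ Problem: Addressing the limits of existing methods in visual perception and reasoning by combining multi-image inputs with fine-grained safety chain-of-thought (CoT) labels to strengthen reasoning on safety-related tasks.
🔍 Observations: The evaluation reveals a safety reasoning gap: current methods lack safety visual reasoning ability, so they struggle with challenging multi-image safety scenarios and can disturb the balance with general capabilities.
🛠️ Method: Proposes the multi-image safety fine-tuning pipeline MIRage and builds the Multi-Image Safety (MIS) dataset, which pairs multi-image inputs with safety CoT labels as fine-grained reasoning logic.
📊 Data and Experiments: MIS is designed for multi-image safety scenarios and includes training and test splits. Fine-tuning InternVL2.5-8B on MIS reduces the Attack Success Rate on multiple safety benchmarks by a large margin and raises average accuracy by 0.83% across five general benchmarks.
⭐ Contributions: Proposes and validates the MIRage safety fine-tuning approach, which strengthens safety through multi-image reasoning; builds the MIS dataset for fine-grained safety reasoning; improves safety without sacrificing general performance.
Abstract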
Large Vision-Language Models (VLMs) have achieved remarkable performance across a wide range of tasks. However, their deployment in safety-critical domains poses significant challenges. Existing safety fine-tuning methods, which focus on textual or multimodal content, fall short in addressing challenging cases or disrupt the balance between helpfulness and harmlessness. Our evaluation highlights a safety reasoning gap: these methods lack safety visual reasoning ability, leading to such bottlenecks. To address this limitation and enhance both visual perception and reasoning in safety-critical contexts, we propose a novel dataset that integrates multi-image inputs with safety Chain-of-Thought (CoT) labels as fine-grained reasoning logic to improve model performance. Specifically, we introduce the Multi-Image Safety (MIS) dataset, an instruction-following dataset tailored for multi-image safety scenarios, consisting of training and test splits. Our experiments demonstrate that fine-tuning InternVL2.5-8B with MIS significantly outperforms both powerful open-source models and API-based models in challenging multi-image tasks requiring safety-related visual reasoning. This approach not only delivers exceptional safety performance but also preserves general capabilities without any trade-offs. Specifically, fine-tuning with MIS increases average accuracy by 0.83% across five general benchmarks and reduces the Attack Success Rate (ASR) on multiple safety benchmarks by a large margin.
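Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#continual learning #vision-language models #catastrophic forgetting
🎯 Motivation: Vision-language models deployed in non-stationary settings face continual-learning challenges; under sequential adaptation, existing methods tend to preserve primitive recognition while losing compositional structure.
❓ Problem: How a continual vision-language system can keep structurally dependable behaviour and zero-shot performance with tight rehearsal budgets and no task IDs.
🔍 Observations: Under sequential adaptation, compositional structure is easily lost while primitive recognition survives, which makes model behaviour structurally unreliable, especially when alignment sensitivity inflates.
🛠️ Method: Compo-ReAlign has three components: a reversible composer that maps primitive embeddings to compositions, a multi-positive InfoNCE that jointly aligns textual and composed views, and a spectral trust region that clips updates when sensitivity is high.
📊 Data and Experiments: Evaluated on compositional DIL and multi-domain MTIL retrieval, setting a new state of the art; it improves over the strongest prior method by +2.4 R@1 and reduces forgetting by 40%.
⭐ Contributions: A compact, reversible alignment head with geometry-aware training, offering a structure-first recipe for compositionally robust continual learning that reduces compositional forgetting while safeguarding zero-shot performance.
Abstract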
Vision-language (VL) models are increasingly deployed in non-stationary settings, yet under sequential adaptation they often preserve primitive recognition while losing compositional structure, especially with tight rehearsal budgets and no task IDs. We address this gap by asking how a continual VL system can maintain structurally dependable behaviour while safeguarding zero-shot performance. We introduce Compo-ReAlign, a structure-first recipe built around three components: a reversible composer that maps primitive embeddings to compositions by design, a multi-positive InfoNCE that jointly aligns textual and composed views of the same target, and a spectral trust region that clips updates when alignment sensitivity inflates. Across compositional DIL and multi-domain MTIL retrieval, Compo-ReAlign sets a new state of the art, improves over the strongest prior by +2.4 R@1, and reduces forgetting by 40%. We provide a compact, reversible alignment head with geometry-aware training for compositionally robust VL continual learning.
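To make the multi-positive InfoNCE component concrete, here is a minimal sketch of one common variant, in which the probability mass of all positives is summed inside the logarithm. The abstract does not specify the exact form used by Compo-ReAlign, so the variant, the function name, the temperature value, and the similarity scores are all assumptions.

```python
import math

def multi_positive_info_nce(similarities, positive_idx, temperature=0.1):
    """Multi-positive InfoNCE for a single anchor.

    similarities: similarity scores between the anchor and every candidate.
    positive_idx: indices of candidates that count as positives (for
    example, the textual view and the composed view of the same target).
    Returns -log(sum of softmax mass on the positives).
    """
    if not positive_idx:
        raise ValueError("at least one positive is required")
    logits = [s / temperature for s in similarities]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    pos_mass = sum(exps[i] for i in positive_idx)
    return -math.log(pos_mass / sum(exps))

# Hypothetical anchor with two positives (indices 0, 1) and two negatives.
sims = [0.9, 0.8, 0.1, 0.0]
loss_aligned = multi_positive_info_nce(sims, [0, 1])
loss_misaligned = multi_positive_info_nce(sims, [2, 3])
```

The loss is small when the positives already dominate the similarity scores and grows when negatives score higher than the positives.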
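Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Diffusion Language Models #Reinforcement Learning
🎯 Motivation: Diffusion models show promise on language tasks, but their post-training methods remain largely unexplored; the inference trajectory of a diffusion language model should be aligned with its post-training objective.
❓ Problem: An RL framework that exploits inference-trajectory information to improve diffusion language models on complex reasoning tasks and to help them scale.
🔍 Observations: The inference trajectory of a diffusion language model is not fully consistent with its post-training objective, which limits reasoning ability and stability on hard tasks.
🛠️ Method: TraceRL incorporates inference trajectories into training, introduces a diffusion-based value model, and applies to both full-attention and block-attention diffusion language models.
📊 Data and Experiments: Evaluated on complex math reasoning (MATH500) and coding (LiveCodeBench-V2), where the resulting models achieve higher accuracy than the compared models; the approach also scales block diffusion models to larger block sizes.
⭐ Contributions: The TraceRL framework and the TraDo series of state-of-the-art diffusion language models; accuracy gains on complex tasks; open-source code; and the first 8B-scale long-CoT diffusion language model.
Abstract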
The extension of diffusion models to language tasks has shown promising results, but their post-training methods remain largely unexplored. We highlight the importance of aligning a diffusion language model’s preference-inference trajectory with its post-training objective. To this end, we propose TraceRL, a trajectory-aware reinforcement learning framework for DLMs that incorporates information from inference trajectories into post-training and is applicable to both full-attention and block-attention diffusion models. We also introduce a diffusion-based value model that enhances training stability and naturally accommodates process rewards. We demonstrate TraceRL’s superiority in enhancing a model’s reasoning ability on complex math and coding tasks, as well as its applicability in scaling block diffusion models to larger block sizes. Employing TraceRL, we derive a series of state-of-the-art diffusion language models, namely TraDo. Although smaller than Qwen2.5-7B-Instruct, TraDo-4B-Instruct consistently outperforms it on complex math reasoning tasks. TraDo-8B-Instruct achieves 4.5% higher accuracy on MATH500 than Qwen2.5-7B-Instruct and 6.6% higher accuracy on LiveCodeBench-V2 than Llama3.1-8B-Instruct. Through curriculum learning, we also develop the first 8B-scale long-CoT diffusion language model. We open-source our code at https://github.com/Gen-Verse/dLLM-RL.
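Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Multi-armed bandits #Preference optimization #Reward model #Online DPO
TL;DR:A hybrid reward model routing framework that improves alignment in LLMs.
🎯 Motivation: RLHF/RLAIF is the standard way to align LLMs, but most pipelines rely on a single reward model (RM), which limits alignment quality and risks overfitting. Dynamically selecting among RMs can exploit their complementary strengths.
❓ Problem: Existing RM routing methods suffer from cold-start and insufficient exploration, so they cannot fully exploit the candidate RMs.
🔍 Observations: Dynamic RM selection can improve alignment quality while keeping the number of RM calls low, but existing methods handle neither the cold-start phase nor online exploration well.
🛠️ Method: A hybrid routing framework that combines offline learning of RM reliability with online Bayesian selection. Offline, a multi-task router estimates per-RM reliability; online, Bayesian Thompson sampling selects an RM per query and updates its posterior with online rewards to track the shifting policy distribution.
📊 Data and Experiments: Experiments on instruction-following and reasoning benchmarks, including AlpacaEval-2 and Arena-Hard, show that the framework outperforms individual RMs, RM ensembling, and existing routing methods.
⭐ Contributions: A hybrid routing framework that dynamically selects reward models, addresses cold-start and insufficient exploration, and improves alignment across multiple benchmarks.
Abstract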
Reinforcement learning from human or AI feedback (RLHF/RLAIF) has become the standard paradigm for aligning large language models (LLMs). However, most pipelines rely on a single reward model (RM), limiting alignment quality and risking overfitting. Recent work explores RM routing—dynamically selecting an RM from a candidate pool to exploit complementary strengths while maintaining \(O(1)\) RM calls—but existing methods suffer from cold-start and insufficient exploration. We propose {\name}, a hybrid routing framework that combines offline RM strengths learning with online Bayesian selection. In the offline stage, a multi-task router is trained on preference data to estimate per-RM reliability. In the online stage, a Bayesian Thompson sampling router performs per-query RM selection, initializing RM-specific weight vectors with offline embeddings as Gaussian priors and adaptively updating their posteriors with online rewards to adapt to the evolving policy distribution. Extensive experiments on instruction-following (AlpacaEval-2, Arena-Hard, MT-Bench) and reasoning (GSM8K, MMLU) benchmarks show that {\name} consistently outperforms individual RMs, RM ensembling, and existing routing methods.
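The online stage can be illustrated with a deliberately simplified Thompson sampling router. This sketch keeps one scalar Gaussian posterior per reward model instead of the query-dependent weight vectors described in the abstract, and it treats the offline reliability estimates as prior means; the class name, variances, and numbers are hypothetical.

```python
import random

class GaussianThompsonRouter:
    """Thompson sampling over reward models with Gaussian posteriors.

    Simplification (an assumption, not the paper's model): each RM has a
    single scalar reliability with a Gaussian prior, rather than a
    per-query linear model.
    """

    def __init__(self, prior_means, prior_var=1.0, noise_var=1.0, seed=0):
        self.rng = random.Random(seed)
        self.mean = list(prior_means)  # offline estimates act as priors
        self.var = [prior_var] * len(prior_means)
        self.noise_var = noise_var

    def select(self):
        # Sample one reliability per RM and route to the best sample.
        samples = [self.rng.gauss(m, v ** 0.5)
                   for m, v in zip(self.mean, self.var)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        # Conjugate Gaussian update with known observation noise.
        post_prec = 1.0 / self.var[arm] + 1.0 / self.noise_var
        self.mean[arm] = (self.mean[arm] / self.var[arm]
                          + reward / self.noise_var) / post_prec
        self.var[arm] = 1.0 / post_prec

router = GaussianThompsonRouter(prior_means=[0.0, 0.0])
router.update(0, 1.0)      # observe an online reward for RM 0
chosen = router.select()   # exactly one RM is queried per request
```

Starting from informative prior means rather than a flat prior is what would address cold-start in this simplified picture, while posterior sampling supplies the exploration.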
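Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#reward models #value alignment #finetuning #preference learning #large language models #RLHF #AI safety #bias #pretraining
TL;DR:Reward models are not a blank slate - they inherit significant value biases from their base models that persist even through extensive preference training.
🎯 Motivation: Reward models are used to align LLMs with human values, but the value biases they inherit from their base models are understudied, even though they matter for AI safety and value alignment.
❓ Problem: Whether and how reward models inherit value biases from their pretrained language models, and how those biases affect the outcome of preference learning and fine-tuning.
🔍 Observations: Reward model behaviour depends strongly on the base model. Different base models lean toward different psychological value dimensions (such as "agency" and "communion"), and the lean persists even when the preference data and fine-tuning process are identical.
🛠️ Method: Analyzes ten open-weight reward models with validated psycholinguistic corpora, studies how the base model shapes reward model behaviour, and derives usable implicit reward scores to quantify value bias.
📊 Data and Experiments: Uses validated psycholinguistic corpora and trains reward models with ablations over preference data source and quantity, showing that the effect is repeatable and durable.
⭐ Contributions: Shows that reward models inherit value biases from their base models, underscores the importance of value alignment at the pretraining stage, and reminds developers that choosing a base model is a matter of values as well as performance.
Abstract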
Reward models (RMs) are central to aligning large language models (LLMs) with human values but have received less attention than pretrained and post-trained LLMs themselves. Because RMs are initialized from LLMs, they inherit representations that shape their behavior, but the nature and extent of this influence remain understudied. In a comprehensive study of 10 leading open-weight RMs using validated psycholinguistic corpora, we show that RMs exhibit significant differences along multiple dimensions of human value as a function of their base model. Using the "Big Two" psychological axes, we show a robust preference of Llama RMs for "agency" and a corresponding robust preference of Gemma RMs for "communion." This phenomenon holds even when the preference data and finetuning process are identical, and we trace it back to the logits of the respective instruction-tuned and pretrained models. These log-probability differences themselves can be formulated as an implicit RM; we derive usable implicit reward scores and show that they exhibit the very same agency/communion difference. We run experiments training RMs with ablations for preference data source and quantity, which demonstrate that this effect is not only repeatable but surprisingly durable. Despite RMs being designed to represent human preferences, our evidence shows that their outputs are influenced by the pretrained LLMs on which they are based. This work underscores the importance of safety and alignment efforts at the pretraining stage, and makes clear that open-source developers' choice of base model is as much a consideration of values as of performance.
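Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Confidence Calibration #Uncertainty Estimation #Large Language Models
TL;DR:We propose Rewarding Doubt, an RL-based approach that models confidence estimation as a betting game, optimizing LLMs for calibrated confidence in factual answers.
🎯 Motivation: Safe and trustworthy use of LLMs requires that they express accurate confidence in their answers.
❓ Problem: A method that lets LLMs produce calibrated confidence directly, as part of answer generation.
🔍 Observations: Prior approaches usually decouple confidence estimation from response generation, so stated confidence and actual accuracy diverge.
🛠️ Method: An RL approach that models confidence estimation as a betting game and optimizes a reward based on the logarithmic scoring rule, penalizing both over- and under-confidence.
📊 Data and Experiments: Trained models show substantially improved calibration and generalize to unseen tasks without further fine-tuning.
⭐ Contributions: Integrates confidence calibration seamlessly into the LLM's generative process and yields well-calibrated confidence expressions.
Abstract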
A safe and trustworthy use of Large Language Models (LLMs) requires an accurate expression of confidence in their answers. We propose a novel Reinforcement Learning approach that makes it possible to directly fine-tune LLMs to express calibrated confidence estimates alongside their answers to factual questions. Our method optimizes a reward based on the logarithmic scoring rule, explicitly penalizing both over- and under-confidence. This encourages the model to align its confidence estimates with the actual predictive accuracy. The optimal policy under our reward design would result in perfectly calibrated confidence expressions. Unlike prior approaches that decouple confidence estimation from response generation, our method integrates confidence calibration seamlessly into the generative process of the LLM. Empirically, we demonstrate that models trained with our approach exhibit substantially improved calibration and generalize to unseen tasks without further fine-tuning, suggesting the emergence of general confidence awareness. Our code is available at https://github.com/pasta99/RewardingDoubt.
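The claim that a logarithmic scoring rule rewards calibrated confidence can be checked with a few lines. This is a minimal sketch of the scoring rule itself; the clipping constant `eps` and the function names are assumptions, and any scaling or shaping the authors apply on top is not shown.

```python
import math

def log_score_reward(confidence, is_correct, eps=1e-6):
    """Logarithmic scoring rule: log(p) if the answer is correct,
    log(1 - p) otherwise. Clipping avoids log(0) (eps is an assumption)."""
    p = min(max(confidence, eps), 1.0 - eps)
    return math.log(p) if is_correct else math.log(1.0 - p)

def expected_reward(confidence, accuracy):
    """Expected reward when the answer is correct with probability `accuracy`."""
    return (accuracy * log_score_reward(confidence, True)
            + (1.0 - accuracy) * log_score_reward(confidence, False))

# With a true accuracy of 0.7, stating 0.7 beats over- and under-confidence.
calibrated = expected_reward(0.7, 0.7)
overconfident = expected_reward(0.9, 0.7)
underconfident = expected_reward(0.5, 0.7)
```

The expected reward q·log p + (1 − q)·log(1 − p) is maximized at p = q, which is why the optimal policy under this reward states its true accuracy.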
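Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#RLVR #Large Language Model #Reinforcement Learning #Pass@k Optimization
🎯 Motivation: Current RL methods for LLMs face an exploration dilemma: the initial policy is sharply peaked, so single-solution accuracy improves while multi-solution performance is suppressed and new reasoning strategies are rarely discovered.
❓ Problem: A risk-sensitive RL framework that improves exploration on complex reasoning tasks and raises multi-solution performance (pass@k).
🔍 Observations: Standard RL mainly sharpens what the pretrained model can already do, so the model gets stuck in local optima and both solution diversity and the discovery of new strategies suffer.
🛠️ Method: Risk-Sensitive GRPO (RS-GRPO) adopts a risk-seeking objective that interpolates between mean and maximum rewards and emphasizes hard prompts to drive exploration out of local optima.
📊 Data and Experiments: Across six mathematical reasoning benchmarks and five LLMs, RS-GRPO consistently improves pass@k while maintaining or improving pass@1.
⭐ Contributions: A risk-sensitive RL framework that addresses the exploration dilemma, and the RS-GRPO algorithm, which improves solution diversity and overall reasoning performance.
Abstract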
Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for enhancing Large Language Models (LLMs) on complex reasoning tasks. Yet current methods face an exploration dilemma: standard RL struggles to escape the local optima of pre-trained LLMs’ sharply peaked initial policies, boosting single-solution accuracy (pass@1) but suppressing solution diversity and multi-solution performance (pass@k). As a result, RLVR often distills existing capabilities rather than discovering new reasoning strategies. We address this with a Risk-Sensitive Reinforcement Learning framework. By adopting a risk-seeking objective that interpolates between mean and maximum rewards, we derive a novel Risk-Sensitive GRPO (RS-GRPO) algorithm that emphasizes hard prompts to drive exploration. Across six mathematical reasoning benchmarks and five LLMs, RS-GRPO consistently improves pass@k performance while enhancing or maintaining pass@1.
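One standard objective that interpolates between the mean and the maximum reward is the exponential-utility (entropic) form (1/β)·log E[exp(β·r)] with β > 0. The sketch below evaluates it on a group of sampled rewards; whether RS-GRPO uses exactly this form is an assumption, and the function name and reward values are hypothetical.

```python
import math

def risk_seeking_value(rewards, beta):
    """(1/beta) * log(mean(exp(beta * r))) for beta > 0.

    As beta approaches 0 this tends to the mean reward; as beta grows it
    tends to the maximum reward. beta == 0 is handled as the mean.
    """
    if beta < 0:
        raise ValueError("this sketch covers the risk-seeking case beta >= 0")
    if beta == 0:
        return sum(rewards) / len(rewards)
    scaled = [beta * r for r in rewards]
    m = max(scaled)  # log-sum-exp trick for numerical stability
    log_mean_exp = m + math.log(
        sum(math.exp(s - m) for s in scaled) / len(scaled))
    return log_mean_exp / beta

# A hard prompt: only one of four sampled solutions is correct.
group = [0.0, 0.0, 0.0, 1.0]
near_mean = risk_seeking_value(group, 1e-3)
near_max = risk_seeking_value(group, 50.0)
```

On a prompt where only one sample succeeds, a larger β values the group closer to its best sample, which is one way to see how a risk-seeking objective puts more weight on hard prompts.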
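Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#LLM Unlearning; Adversarial Robustness; AI Safety
TL;DR:We introduce PoRT, a robust unlearning framework that cleans prompts, jointly judges the question-answer pair, and triggers self-correction for safer outputs.
🎯 Motivation: LLM unlearning is vital for removing sensitive knowledge and ensuring compliance and safety, particularly where rapid deployment without parameter changes is required.
❓ Problem: Existing pre-filtering methods are not robust to adversarial attacks, which can leak sensitive information and restore accuracy on hazardous knowledge; a more robust unlearning framework is needed.
🔍 Observations: In the worst case, simple prefix attacks raise leakage of fictitious-entity knowledge by up to 1,150 times, and composite question attacks raise accuracy on hazardous knowledge from the 24.9% random-guess baseline to as high as 67.0%.
🛠️ Method: PoRT has three modules: data cleaning (a dynamic few-shot prompt that produces a cleaned query and an initial response), post-judgment (joint evaluation of the cleaned query and its response to detect non-compliant outputs), and multi-round thinking (self-correction triggered for low-confidence outputs).
📊 Data and Experiments: Extensive experiments on adversarial benchmarks show that PoRT is more robust and unlearns more effectively than existing methods without compromising general utility.
⭐ Contributions: PoRT, a robust LLM unlearning framework that improves safety for sensitive-knowledge removal under adversarial conditions, with code released publicly.
Abstract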
The unlearning capability of LLMs is vital for ensuring compliance and safety, especially when removing sensitive knowledge from deployed models. Pre-filtering methods, enabling rapid deployment without parameter changes, are a prominent unlearning approach. However, they exhibit significant robustness deficiencies against adversarial attacks: in the worst case, simple prefix attacks can induce up to a 1,150-fold surge in information leakage for fictitious entity knowledge, while composite question attacks can cause accuracy on hazardous knowledge to rebound from the 24.9% random-guess baseline to as high as 67.0%. To address this, we propose a new unlearning framework via post judgment and multi-round thinking (PoRT), which consists of three key modules. First, a data cleaning module compiles a dynamic few-shot prompt that instructs the LLM to simultaneously generate both a cleaned version of the user's query and a corresponding initial response, supported by an extensible demonstration library for adaptive defense. Second, unlike existing pre-filtering methods that typically judge based solely on prompts, our post-judgment module jointly evaluates cleaned prompts and their corresponding responses to better detect non-compliant outputs. Finally, a selective multi-round thinking process is employed to trigger LLM's self-correction for low-confidence outputs, enhancing reliability and result quality. Extensive experiments on benchmarks demonstrate PoRT's superior robustness against adversarial attacks and strong unlearning effectiveness without compromising general model utility. Code is available at https://github.com/ChnIRuI/PoRT_LLM_Unlearning
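Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Large Language Models #Inference-time Alignment #Robustness
TL;DR:We propose a multi-objective decoding algorithm that generates robust responses without knowing the exact weights over the objectives.
🎯 Motivation: At inference time, LLMs struggle to satisfy several objectives at once (such as instruction-following, helpfulness, and safety), so alignment is not robust.
❓ Problem: A robust multi-objective decoding algorithm that produces good outputs even when the weights over the objectives are not specified.
🔍 Observations: Conventional Controlled Decoding methods perform poorly when the objective weights are uncertain or change, and do not deliver consistent worst-case performance.
🛠️ Method: Formulates robust decoding as an adversarial maximin game, solves a convex optimization problem for the worst-case reward weights, and derives the optimal sampling policy analytically.
📊 Data and Experiments: Tested on several popular alignment datasets with up to 10 objectives, reporting worst-case rewards and win rates against baselines.
⭐ Contributions: The RMOD algorithm and its efficient implementation, which improve the robustness of multi-objective alignment with minimal computational overhead.
Abstract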
We introduce Robust Multi-Objective Decoding (RMOD), a novel inference-time algorithm that robustly aligns Large Language Models (LLMs) to multiple human objectives (e.g., instruction-following, helpfulness, safety) by maximizing the worst-case rewards. RMOD formulates the robust decoding problem as a maximin two-player game between adversarially computed reward weights and the sampling policy, solvable through a Nash equilibrium. We demonstrate that this game reduces to a convex optimization problem to identify the worst-case reward weights, with the optimal sampling policy analytically derived. For practical applications, we propose an efficient algorithm of RMOD tailored for contemporary LLMs, introducing minimal computational overhead compared to standard non-robust Controlled Decoding methods. Experimental results across a range of popular alignment datasets with up to 10 objectives show the effectiveness of RMOD and its distilled version, consistently outperforming baselines in worst-case rewards and win rates.
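A toy illustration of the maximin idea: when the combined reward is linear in the weights, the adversary's best choice over the weight simplex puts all mass on the weakest objective, so the worst-case value is simply the minimum per-objective reward. The sketch below applies this to a best-of-N choice among finished candidates; it omits the sampling policy and regularization of the actual RMOD algorithm, and the names and scores are hypothetical.

```python
def worst_case_reward(objective_rewards):
    """Minimum over the weight simplex of sum_i w_i * r_i.

    A linear function on the simplex attains its minimum at a vertex,
    so this equals the smallest per-objective reward.
    """
    return min(objective_rewards)

def pick_robust(candidates):
    """Return the candidate whose worst-case weighted reward is largest."""
    return max(candidates, key=lambda name: worst_case_reward(candidates[name]))

# Hypothetical per-objective rewards: (helpfulness, safety).
candidates = {
    "response_a": [0.9, 0.1],  # very helpful but unsafe
    "response_b": [0.5, 0.6],  # balanced
}
best = pick_robust(candidates)
```

A fixed-weight decoder with equal weights would score the two responses 0.5 and 0.55, a narrow margin, whereas the worst-case criterion separates them clearly (0.1 versus 0.5).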
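Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Reward Modeling #Reward Hacking #Alignment #Post training LLMs #RLHF
TL;DR:We introduce CROME, a novel causality-inspired technique for training reward models for LLM post-training, which achieves significantly improved reward model robustness and reduced reward hacking.
🎯 Motivation: Reward models are central to aligning LLMs via human feedback, but they often suffer from reward hacking: they overfit to superficial features and miss the true causal drivers of quality.
❓ Problem: A causally motivated method that addresses the poor robustness and reward hacking caused by confusing causal and non-causal attributes.
🔍 Observations: Standard training objectives struggle to separate causal drivers from spurious correlations in the data, which makes reward model judgments of response quality unreliable.
🛠️ Method: CROME uses two kinds of targeted augmentation: causal augmentations increase sensitivity to causal attributes, and neutral augmentations reduce reliance on spurious attributes, yielding more robust training.
📊 Data and Experiments: Evaluated on RewardBench, AlpacaEval 2.0, WildGuardTest, and GSM8k. On RewardBench, CROME improves average accuracy by up to 5.3%, with gains of up to 7.1% in reasoning and 12.4% in safety over standard baselines.
⭐ Contributions: Proposes and validates CROME, a causally motivated reward modeling framework that improves reward model robustness and mitigates reward hacking.
Abstract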
Reward models (RMs) are fundamental to aligning Large Language Models (LLMs) via human feedback, yet they often suffer from reward hacking. They tend to latch on to superficial or spurious attributes, such as response length or formatting, mistaking these cues learned from correlations in training data for the true causal drivers of quality (e.g., factuality, relevance). This occurs because standard training objectives struggle to disentangle these factors, leading to brittle RMs and misaligned policies. We introduce CROME (Causally Robust Reward Modeling), a novel framework inspired by an explicit causal model designed to mitigate reward hacking. CROME queries an oracle LLM for rubrics that are (or the oracle deems to be) causally relevant to answering a specific prompt. Then, it employs the following synthetic targeted augmentations during training: (1) Causal Augmentations, which are pairs that differ along specific causal attributes (subset of the Oracle identified rubrics), to enforce sensitivity along each causal attribute individually, and (2) Neutral Augmentations, which are tie-label pairs varying primarily in spurious attributes, to enforce invariance along spurious attributes. Notably, our neutral augmentations are produced without any knowledge of unknown spurious factors, via question swapping and response interventions only along causal rubrics. We show that the CROME augmentation strategy using rubrics from popular LLM APIs significantly outperforms standard baselines on RewardBench, improving average accuracy by up to 5.3% and achieving gains of up to 7.1% and 12.4% in reasoning and safety. The robustness of CROME is further testified by significant gains in DPO-aligned policies and Best-of-N alignment across various benchmarks, including AlpacaEval 2.0, RewardBench, safety-focused WildGuardTest, and the reasoning-specific GSM8k.
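Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#generalization #continual learning #fine-tuning #memorization
TL;DR:We propose a "memorize-then-generalize" framework where LLMs first memorize facts with meaningless tokens and later generalize through meaningful prompts.
🎯 Motivation: Examines whether LLMs can generalize from rote memorization, challenging the common view that memorization hinders generalization.
❓ Problem: How a model can memorize bare facts and then move from memorization to generalization through prompts.
🔍 Observations: Models can reinterpret rote-memorized data through semantically meaningful prompts, forming structured, semantically aligned latent representations.
🛠️ Method: A "memorize-then-generalize" framework: the model first rote memorizes facts with a semantically meaningless key token, then is fine-tuned on a small set of semantically meaningful prompts to generalize.
📊 Data and Experiments: Experiments over 8 LLMs examine the framework's effectiveness for knowledge injection and its potential risks.
⭐ Contributions: Shows that models can generalize over rote-memorized data, offering a new perspective on efficient knowledge injection and on the risk of repurposing memorized data.
Abstract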
Rote learning is a memorization technique based on repetition. Many researchers argue that rote learning hinders generalization because it encourages verbatim memorization rather than deeper understanding. This concern extends even to factual knowledge, which inevitably requires a certain degree of memorization.
In this work, we challenge this view and demonstrate that large language models (LLMs) can, in fact, generalize over rote memorized data. We introduce a two-phase “memorize-then-generalize” framework, where the model first rote memorizes factual subject-object associations using a synthetic semantically meaningless key token and then learns to generalize by fine-tuning on a small set of semantically meaningful prompts. Extensive experiments over 8 LLMs show that the models can reinterpret rote memorized data through the semantically meaningful prompts, as evidenced by the emergence of structured, semantically aligned latent representations between the key token and the semantically meaningful prompts.
This surprising finding opens the door to both effective and efficient knowledge injection as well as possible risks of repurposing the memorized data for malicious usage.
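Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#representation learning for language #datasets and benchmarks #reward modeling #reinforcement learning #natural language processing #large language models #reasoning #alignment
TL;DR:An on-policy RL framework that uses rubric-guided rewards for training LLMs on real-world reasoning tasks.
🎯 Motivation: Current RL methods struggle with real-world reasoning tasks that require multi-criteria judgment, because conventional reward signals do not capture that complexity.
❓ Problem: Using instance-specific rubrics as the feedback signal to extend RL beyond verifiable domains.
🔍 Observations: Rubrics used as structured reward signals adapt better and align better than direct scoring, and they reduce performance variance across judge model scales.
🛠️ Method: Rubrics as Rewards (RaR), an on-policy training method that evaluates multiple strategies for aggregating rubric feedback into rewards across task domains.
📊 Data and Experiments: On HealthBench (medical) and GPQA-Diamond (science), the best RaR variant achieves relative improvements of up to 31% and 7%, respectively, over the baselines.
⭐ Contributions: Shows that rubrics can serve as effective reward signals for RL on tasks with complex evaluation, and that they improve alignment and stability across judge models.
Abstract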
Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for complex reasoning tasks with clear correctness signals such as math and coding. However, extending it to real-world reasoning tasks is challenging, as evaluation depends on nuanced, multi-criteria judgments rather than binary correctness. Instance-specific rubrics have recently been used in evaluation benchmarks to capture such judgments, but their potential as reward signals for on-policy post-training remains underexplored. We introduce Rubrics as Rewards (RaR), an on-policy reinforcement learning method that extends RLVR beyond verifiable domains by using rubric-based feedback. Across both medical and science domains, we evaluate multiple strategies for aggregating rubric feedback into rewards. The best RaR variant achieves relative improvements of up to 31% on HealthBench and 7% on GPQA-Diamond over popular LLM-as-judge baselines that rely on direct Likert-based rewards. These results demonstrate that RaR-trained policies adapt well to diverse evaluation formats, performing strongly on both rubric-based and multiple-choice tasks. Moreover, we find that using rubrics as structured reward signals yields better alignment for smaller judges and reduces performance variance across judge scales.
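One simple way to aggregate rubric feedback into a scalar reward is a normalized weighted sum of per-item judgments. The sketch below assumes each rubric item carries a hypothetical importance weight and a binary verdict from a judge model; the abstract says several aggregation strategies were compared, and this is only one illustrative possibility.

```python
def aggregate_rubric_reward(items):
    """Weighted fraction of rubric items the response satisfies.

    items: list of (weight, satisfied) pairs, where `weight` is a positive
    importance score and `satisfied` is the judge's binary verdict.
    Returns a reward in [0, 1].
    """
    total = sum(weight for weight, _ in items)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive number")
    earned = sum(weight for weight, satisfied in items if satisfied)
    return earned / total

# Hypothetical rubric for a medical answer: (weight, judge verdict).
rubric = [
    (3, True),   # names the correct first-line treatment
    (2, False),  # mentions the key contraindication
    (1, True),   # advises follow-up
]
reward = aggregate_rubric_reward(rubric)
```

Normalizing by the total weight keeps rewards comparable across prompts whose rubrics have different numbers of items.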
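Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Post-training #Transferability #Sparse Autoencoder #Large Language Models
🎯 Motivation: Pretrained LLMs perform well across many tasks, but how the model shifts introduced by post-training affect performance across domains is poorly understood.
❓ Problem: A new metric that forecasts the cross-domain transferability of post-training, reducing uncertainty before training is run.
🔍 Observations: Shifts introduced by post-training can substantially change performance in different domains, and how those shifts relate to each domain remains poorly understood.
🛠️ Method: The SAE-based Transferability Score (STS) uses sparse autoencoders (SAEs) to identify shifted dimensions and computes their correlation with downstream domains, estimating transferability before fine-tuning.
📊 Data and Experiments: Extensive experiments across multiple models and domains; STS predictions reach Pearson correlation coefficients above 0.7 with actual performance changes.
⭐ Contributions: STS provides an interpretable tool for forecasting post-training transferability and guiding post-training strategies.
Abstract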
In recent years, pre-trained large language models have achieved remarkable success across diverse tasks. Besides the pivotal role of self-supervised pre-training, their effectiveness in downstream applications also depends critically on the post-training process, which adapts models to task-specific data and objectives. However, this process inevitably introduces model shifts that can influence performance in different domains, and how such shifts transfer remains poorly understood. To open up the black box, we propose the SAE-based Transferability Score (STS), a new metric that leverages sparse autoencoders (SAEs) to forecast post-training transferability. Taking supervised fine-tuning as an example, STS identifies shifted dimensions in SAE representations and calculates their correlations with downstream domains, enabling reliable estimation of transferability before fine-tuning. Extensive experiments across multiple models and domains show that STS accurately predicts the transferability of supervised fine-tuning, achieving Pearson correlation coefficients above 0.7 with actual performance changes. Beyond this, we take an initial step toward extending STS to reinforcement learning. We believe that STS can serve as an interpretable tool for guiding post-training strategies in LLMs. Code is available at https://github.com/PKU-ML/STS.
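Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Large Language Models; Supervised Finetuning; Domain-specific SFT; Continual Learning
🎯 Motivation: Supervised fine-tuning (SFT) on domain-specific datasets is widely believed to degrade an LLM's general capabilities, but this trade-off has not been examined or quantified carefully.
❓ Problem: How to limit the damage SFT does to general capabilities while keeping target-domain performance, with practical remedies.
🔍 Observations: A smaller learning rate substantially mitigates the loss of general capability while preserving comparable target-domain performance. A theoretical analysis explains this and motivates a new method.
🛠️ Method: Token-Adaptive Loss Reweighting (TALR), evaluated alongside other strategies (L2 regularization, LoRA, model averaging, FLOW) for balancing domain adaptation against general capability.
📊 Data and Experiments: No method completely eliminates the trade-off, but TALR consistently outperforms the baselines in balancing domain-specific gains and general capabilities.
⭐ Contributions: Revisits the SFT trade-off with theoretical and empirical analysis, proposes TALR, and distills practical guidelines: use a small learning rate, and adopt TALR when a stronger balance is needed.
Abstract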
Supervised Fine-Tuning (SFT) on domain-specific datasets is a common approach to adapt Large Language Models (LLMs) to specialized tasks but is often believed to degrade their general capabilities. In this work, we revisit this trade-off and present both empirical and theoretical insights. First, we show that SFT does not always hurt: using a smaller learning rate can substantially mitigate general performance degradation while preserving comparable target-domain performance. We then provide a theoretical analysis that explains these phenomena and further motivates a new method, Token-Adaptive Loss Reweighting (TALR). Building on this, and recognizing that smaller learning rates alone do not fully eliminate general-performance degradation in all cases, we evaluate a range of strategies for reducing general capability loss, including L2 regularization, LoRA, model averaging, FLOW, and our proposed TALR. Experimental results demonstrate that while no method completely eliminates the trade-off, TALR consistently outperforms these baselines in balancing domain-specific gains and general capabilities. Finally, we distill our findings into practical guidelines for adapting LLMs to new domains: (i) using a small learning rate to achieve a favorable trade-off, and (ii) when a stronger balance is further desired, adopt TALR as an effective strategy.
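Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Synthetic Data #Prompt Optimization
🎯 Motivation: Prompt quality is critical to LLM performance, but existing methods usually assume a fixed data distribution and offer limited support for iterative prompt improvement.
❓ Problem: Move beyond fixed, static datasets with a closed-loop framework in which prompts improve themselves through feedback from synthetic data.
🔍 Analysis: Existing prompt optimization methods adapt poorly to shifts in the input distribution and to weaknesses in the current prompt, which limits achievable gains.
🛠️ Method: Proposes SIPDO, which couples a synthetic data generator with a prompt optimizer: the generator produces examples that expose prompt weaknesses, and the optimizer refines the prompt in response, without external supervision or new tasks.
📊 Data and experiments: On question answering and reasoning benchmarks, SIPDO outperforms standard prompt tuning methods, highlighting the value of integrating data synthesis into prompt learning.
⭐ Contributions: Demonstrates the effectiveness of closed-loop synthetic-data feedback for prompt optimization and provides an automatic, iterative prompt optimization framework that needs no external supervision.
Abstract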
Prompt quality plays a critical role in the performance of large language models (LLMs), motivating a growing body of work on prompt optimization. Most existing methods optimize prompts over a fixed dataset, assuming static input distributions and offering limited support for iterative improvement. We introduce SIPDO (Self-Improving Prompts through Data-Augmented Optimization), a closed-loop framework for prompt learning that integrates synthetic data generation into the optimization process. SIPDO couples a synthetic data generator with a prompt optimizer, where the generator produces new examples that reveal current prompt weaknesses and the optimizer incrementally refines the prompt in response. This feedback-driven loop enables systematic improvement of prompt performance without assuming access to external supervision or new tasks. Experiments across question answering and reasoning benchmarks show that SIPDO outperforms standard prompt tuning methods, highlighting the value of integrating data synthesis into prompt learning workflows.
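Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Vision Language Models #Reinforcement Learning #Reasoning #Cold-Start #Preference Optimization #Direct Preference Optimization (DPO) #Self-Distillation
TL;DR:Our Self-distilled Preference-based Cold Start (SPECS) method better prepares Vision Language Models for reinforcement learning, significantly boosting their reasoning performance.
🎯 Motivation: SFT-based cold starts entangle the reasoning paradigm with the task solution and output format, which can cause instruction-style overfitting, weaken out-of-distribution generalization, and hurt downstream reinforcement learning.
❓ Problem: Proposes SPECS, a self-distilled preference-based cold-start framework that decouples surface form from deep content in multimodal learning and replaces SFT with preference learning for the cold start.
🔍 Analysis: Preference-based training generalizes better than SFT in the cold-start stage; a Generalization Factor (GF) coefficient is introduced to quantify this and to guide the method design.
🛠️ Method: Generates introspective preference pairs via self-distillation, with no larger teacher model or manual annotation; uses preference learning to focus on transferable surface-form criteria; then hands off to RLVR for deep reasoning.
📊 Data and experiments: Evaluated on multiple multimodal benchmarks, improving MEGA-Bench by 4.1% and MathVista by 12.2%, while reducing in-distribution "stuckness", stabilizing exploration, and raising the performance ceiling.
⭐ Contributions: Proposes the decoupled cold-start framework SPECS, combining preference optimization with self-distillation to better prepare vision language models for reinforcement learning.
Abstract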
Reinforcement learning with verifiable rewards (RLVR) has recently catalyzed a wave of "MLLM-r1" approaches that bring RL to vision language models. Most representative paradigms begin with a cold start, typically employing supervised fine-tuning (SFT), to initialize the policy before RL. However, an SFT-based cold start learns a reasoning paradigm that is intertwined with the task solution and output format, which may induce instruction-style overfitting, weaken out-of-distribution generalization, and ultimately affect downstream RL. We revisit the cold start from two views, its training method and its data construction, and introduce the Generalization Factor (GF) coefficient to quantify the generalization capability under different methods. Our empirical study finds that preference-based training methods (e.g., DPO) generalize better than SFT-based methods in the cold start. Motivated by this, we propose $\textbf{SPECS}$, a $\textbf{S}$elf-distilled, $\textbf{P}$r$\textbf{e}$ference-based $\textbf{C}$old $\textbf{S}$tart framework that decouples multimodal learning: it (1) generates introspective preference data pairs via self-distillation, avoiding reliance on larger teachers or manual annotation; (2) performs preference-based training that focuses on shallow, transferable surface-form criteria (format, structure, style) rather than memorizing content; and (3) hands off to RLVR for deep reasoning results. Experimental results across multiple multimodal benchmarks show that our decoupled learning framework yields consistent performance gains over strong baselines, improving MEGA-Bench by 4.1\% and MathVista by 12.2\%. Additional experiments indicate that SPECS contributes to reducing in-distribution "stuckness," improving exploration, stabilizing training, and raising the performance ceiling.
Project Page: https://kwen-chen.github.io/SPECS-VL/
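The abstract names DPO as an example of preference-based training. As a reminder of what that objective looks like, here is a minimal sketch of the standard per-pair DPO loss; the function name, argument names, and the default `beta` are illustrative choices, and nothing here is taken from the SPECS code.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the policy and under a frozen
    reference model. The loss is -log(sigmoid(margin)).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) == log(1 + exp(-m))
    return math.log1p(math.exp(-margin))

# The loss falls as the policy favors the chosen response more than the reference does.
print(dpo_loss(-0.5, -2.0, -1.0, -1.0), dpo_loss(-2.0, -0.5, -1.0, -1.0))
```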
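Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Reinforcement Learning #Diffusion Language Models #Policy Gradient
TL;DR:We propose SPG, an RL algorithm for dLLMs that addresses the challenge of log-likelihood estimation by leveraging both an upper and a lower bound on the true log-likelihood. Extensive experiments showcase the effectiveness of SPG.
🎯 Motivation: Diffusion large language models (dLLMs) are an efficient alternative to autoregressive models thanks to parallel decoding, but aligning them with preferences or task rewards via reinforcement learning is hard because their log-likelihood is intractable.
❓ Problem: Existing methods rely on one-sided surrogates such as the evidence lower bound (ELBO), which can bias the policy gradient. The paper proposes SPG, a policy gradient algorithm that uses both an upper and a lower bound on the log-likelihood.
🔍 Analysis: The bias of one-sided likelihood estimates limits how well the model can be optimized for human preferences and task rewards.
🛠️ Method: SPG sandwiches the true log-likelihood between an upper and a lower bound to reduce policy gradient bias.
📊 Data and experiments: On GSM8K, MATH500, Countdown, and Sudoku, SPG improves accuracy over baselines by 3.6%, 2.6%, 18.4%, and 27.0%, respectively.
⭐ Contributions: Proposes and validates SPG, offering a new direction for policy gradient optimization in diffusion language models.
Abstract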
Diffusion large language models (dLLMs) are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, aligning dLLMs with human preferences or task-specific rewards via reinforcement learning (RL) is challenging because their intractable log-likelihood precludes the direct application of standard policy gradient methods. While prior work uses surrogates like the evidence lower bound (ELBO), these one-sided approximations can introduce significant policy gradient bias. To address this, we propose the Sandwiched Policy Gradient (SPG) that leverages both an upper and a lower bound of the true log-likelihood. Experiments show that SPG significantly outperforms baselines based on ELBO or one-step estimation. Specifically, SPG improves the accuracy over state-of-the-art RL methods for dLLMs by 3.6% on GSM8K, 2.6% on MATH500, 18.4% on Countdown, and 27.0% on Sudoku.
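Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Data selection; Submodular; Log-determinant Fisher information; Instruction tuning
TL;DR:We prove gradient conflicts accelerate marginal log-det FIM decay through ε-analysis, and introduce SPICE, an adaptive conflict-penalized greedy selector that matches full-data results with 10% data.
🎯 Motivation: Information-based data selection is attractive for instruction tuning, but its practical effectiveness can be limited by gradient conflicts. The paper aims to understand and address how gradient conflicts speed up the decay of marginal information gains.
❓ Problem: Gradient conflicts (misalignment between per-sample gradients) accelerate the decay of marginal gains in the log-determinant of the Fisher information matrix (log-det FIM), which hinders efficient information extraction. The paper proposes a conflict-aware selector that maximizes information under a budget while reducing conflicts.
🔍 Analysis: An ε-decomposition quantifies how far gradient conflicts push the objective from ideal submodularity. The more conflict, the weaker the approximation guarantee; as conflicts diminish, the data-dependent approximation factor tightens.
🛠️ Method: Proposes SPICE, a conflict-penalized submodular information selector that adaptively penalizes poorly aligned samples during greedy selection. It supports early stopping and proxy models for efficiency.
📊 Data and experiments: Experiments on 8 benchmarks with LLaMA2-7B and Qwen2-7B. Using only 10% of the data, SPICE selects subsets with higher log-det information than the original criteria and matches or exceeds 6 methods, including full-data tuning, at substantially lower training cost.
⭐ Contributions: Establishes the theoretical link between gradient conflicts and marginal information decay via a conflict-aware ε-decomposition, and designs the efficient SPICE selector that reaches full-data performance with 10% of the data.
Abstract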
Information-based data selection for instruction tuning is compelling: maximizing the log-determinant of the Fisher information yields a monotone submodular objective, enabling greedy algorithms to achieve a $(1-1/e)$ approximation under a cardinality budget. In practice, however, we find that alleviating gradient conflicts (misalignment between per-sample gradients) is a key factor in slowing the decay of marginal log-determinant information gains, thereby preventing significant loss of information. We formalize this via an $\varepsilon$-decomposition that quantifies the deviation from ideal submodularity as a function of conflict statistics, yielding data-dependent approximation factors that tighten as conflicts diminish. Guided by this analysis, we propose SPICE, a conflict-aware selector that maximizes information while penalizing misalignment, and that supports early stopping and proxy models for efficiency. Empirically, SPICE selects subsets with higher log-determinant information than the original criteria, and these informational gains translate into performance improvements: across 8 benchmarks with LLaMA2-7B and Qwen2-7B, SPICE uses only 10% of the data, yet matches or exceeds 6 methods including full-data tuning, at substantially lower training cost.
Code is available at https://github.com/Chang-pw/SPICE#.
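To make the idea of conflict-penalized greedy log-det selection concrete, here is a toy sketch on 2-D gradients. It is not the SPICE implementation: the conflict term (negative cosine similarity with the running sum of selected gradients), the `penalty` and `ridge` values, and all function names are assumptions chosen for illustration. The marginal gain uses the matrix determinant lemma, log det(F + g gᵀ) − log det(F) = log(1 + gᵀ F⁻¹ g).

```python
import math

def inv2(m):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def logdet_gain(f, g):
    """Marginal log-det gain of adding g g^T to F (matrix determinant lemma)."""
    fi = inv2(f)
    q = sum(g[i] * fi[i][j] * g[j] for i in range(2) for j in range(2))
    return math.log1p(q)

def select(grads, budget, penalty=0.5, ridge=1e-2):
    """Greedy selection: information gain minus an (assumed) conflict penalty."""
    f = [[ridge, 0.0], [0.0, ridge]]   # regularized Fisher estimate
    total = [0.0, 0.0]                 # running sum of selected gradients
    chosen = []
    for _ in range(budget):
        def score(i):
            g = grads[i]
            norm = math.hypot(*g) * math.hypot(*total)
            cos = (g[0] * total[0] + g[1] * total[1]) / norm if norm > 0 else 0.0
            return logdet_gain(f, g) - penalty * max(0.0, -cos)
        best = max((i for i in range(len(grads)) if i not in chosen), key=score)
        chosen.append(best)
        g = grads[best]
        for a in range(2):
            total[a] += g[a]
            for b in range(2):
                f[a][b] += g[a] * g[b]
    return chosen

# Sample 1 directly opposes sample 0; sample 2 is orthogonal to it.
picked = select([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]], budget=2)
print(picked)
```

In this toy case the orthogonal sample is preferred over the opposing one: it contributes a new direction to the Fisher estimate and incurs no conflict penalty.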
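Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#prompting #system prompt #prompt optimization
TL;DR:System prompt optimization is powerful, highly generalizable, and complementary to existing methods
🎯 Motivation: LLM performance depends on prompt design. Existing work mostly optimizes prompts for specific tasks and pays much less attention to system prompt optimization.
❓ Problem: How to improve performance across a broad range of tasks by optimizing a general-purpose system prompt.
🔍 Analysis: A single optimized system prompt performs on par with task prompts optimized for each individual task, and combining system-level and task-level optimization yields further gains.
🛠️ Method: Proposes SPRIG, an edit-based genetic algorithm that iteratively builds system prompts from prespecified components to maximize model performance.
📊 Data and experiments: Evaluates system prompts on a collection of 47 task types and studies how optimized prompts transfer across model families, parameter sizes, and languages.
⭐ Contributions: Shows the role of system-level instructions in maximizing LLM potential, demonstrating their generality and their complementarity with task-level optimization.
Abstract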
Large Language Models (LLMs) have shown impressive capabilities in many scenarios, but their performance depends, in part, on the choice of prompt. Past research has focused on optimizing prompts specific to a task. However, much less attention has been given to optimizing the general instructions included in a prompt, known as a system prompt. To address this gap, we propose SPRIG, an edit-based genetic algorithm that iteratively constructs prompts from prespecified components to maximize the model's performance in general scenarios. We evaluate the performance of system prompts on a collection of 47 different types of tasks to ensure generalizability. Our study finds that a single optimized system prompt performs on par with task prompts optimized for each individual task. Moreover, combining system and task-level optimizations leads to further improvement, which showcases their complementary nature. Experiments also reveal that the optimized system prompts generalize effectively across model families, parameter sizes, and languages. This study provides insights into the role of system-level instructions in maximizing LLM potential.
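Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#LLM Reasoning #Reinforcement Learning #Supervised Reinforcement Fine-Tuning
TL;DR:We propose SRFT, an entropy-aware single-stage framework that unifies SFT and RL for LLM reasoning.
🎯 Motivation: LLMs have made strong progress on reasoning tasks, but how best to integrate supervised fine-tuning (SFT) and reinforcement learning (RL) remains a key challenge.
❓ Problem: Analyze learning dynamics and token distributions from an entropy perspective, and unify the global shifts of SFT with the local optimization of RL.
🔍 Analysis: SFT induces coarse-grained, global shifts in the policy distribution, whereas RL performs fine-grained, selective optimization; entropy is a critical indicator of training efficacy.
🛠️ Method: Proposes SRFT, a single-stage framework that combines SFT and RL through entropy-aware weighting and optimizes directly on demonstrations and self-exploration rollouts.
📊 Data and experiments: Outperforms zero-RL baselines by an average of 9.0% on five mathematical reasoning benchmarks and by 10.9% on three out-of-distribution benchmarks, and maintains a more stable policy entropy by leveraging demonstration data.
⭐ Contributions: Proposes SRFT, an entropy-aware single-stage method for unifying SFT and RL, offering an effective route to optimizing models for reasoning.
Abstract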
Large language models (LLMs) have achieved remarkable progress in reasoning tasks, yet optimally integrating Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) remains a fundamental challenge. Through a comprehensive analysis of token distributions, learning dynamics, and integration mechanisms from an entropy-based perspective, we reveal key differences between these paradigms: SFT induces coarse-grained, global shifts to policy distributions, while RL performs fine-grained, selective optimizations.
Our analysis further establishes entropy as a critical indicator of training efficacy.
Building on these observations, we introduce **S**upervised **R**einforcement **F**ine-**T**uning (**SRFT**), a single-stage framework that unifies both fine-tuning paradigms through entropy-aware weighting mechanisms.
SRFT simultaneously applies SFT and RL to directly optimize LLMs using demonstrations and self-exploration rollouts rather than through two-stage sequential methods.
Extensive experiments show that SRFT outperforms zero-RL baselines by **9.0%** on five mathematical reasoning benchmarks and by **10.9%** on three out-of-distribution benchmarks.
Moreover, by leveraging demonstration data, SRFT maintains a more stable policy entropy, facilitating sustained policy improvement.
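Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#metacognition #out-of-distribution generalization #dataset reusability #skill-level training
TL;DR:We introduce a new fine-tuning strategy, STAT, that selects or synthesizes training data targeted at models' missing skills.
🎯 Motivation: Language models often saturate under vanilla fine-tuning, showing little improvement when trained on data similar to what they have already seen.
❓ Problem: Proposes STAT, a fine-tuning strategy in which a teacher model analyzes the student's missing skills and then reweights or synthesizes training data to close the skill gaps.
🔍 Analysis: Vanilla fine-tuning yields limited gains on datasets such as MATH, whereas skill-targeted training markedly improves skill application and out-of-distribution performance.
🛠️ Method: A stronger teacher LLM lists the skills a task requires, then either reweights existing training examples according to the student's missing skills (STAT-Sel) or synthesizes new examples involving those skills (STAT-Syn).
📊 Data and experiments: On Llama and Qwen models, accuracy on MATH improves by up to 7.5%, and out-of-distribution benchmarks such as AIME24/25 and AMC23 improve by 4.6% on average.
⭐ Contributions: Proposes the STAT training framework, which improves fine-tuning where skills are missing and is complementary to RL methods such as GRPO.
Abstract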
Language models often show little to no improvement (i.e., “saturation”) when trained via vanilla supervised fine-tuning (SFT) on data similar to what they saw in their training set (e.g., MATH). We introduce a new fine-tuning strategy, STAT, to train such a student model by using the metacognition ability of a stronger large language model (LLM) as the teacher. The teacher uses the task dataset to create a list of skills needed for the task, and then labels each data point with its required skills (Didolkar et al., 2024). By monitoring the student’s answers, the teacher creates a Missing-Skill-Profile for the student, tracking how often they failed to apply each skill in their responses. We use this idea to build a modified training set in one of two ways. In STAT-Sel, the teacher uses an existing set of training examples but adaptively reweights them according to the Missing-Skill-Profile. In STAT-Syn, the teacher synthesizes additional examples involving missing skills. Across extensive experiments on Llama and Qwen models, our methods yield improvements of up to 7.5% on MATH, whereas SFT provides only limited gains. Furthermore, STAT enhances performance on out-of-distribution benchmarks (e.g., AIME24/25, AMC23, etc.) by an average of 4.6%. Crucially, we find that STAT is complementary to RL via GRPO (Shao et al., 2024): after the model is improved using STAT to address skill gaps, GRPO continues to add further gains. We conclude that skill-targeted adaptive training should broadly improve current training pipelines.
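The Missing-Skill-Profile and the STAT-Sel reweighting step can be pictured with a small sketch. The data layout, the additive weighting rule, and the function names below are assumptions for illustration; the abstract does not specify the exact reweighting formula.

```python
from collections import Counter

def missing_skill_profile(records):
    """Count how often the student failed to apply each skill.

    records: list of (required_skills, failed_skills) pairs, one per student answer,
    as a teacher model might label them (hypothetical format).
    """
    profile = Counter()
    for _required, failed in records:
        profile.update(failed)
    return profile

def reweight(examples, profile, base=1.0):
    """Assumed STAT-Sel-style rule: upweight examples that exercise frequently missed skills.

    examples: dict mapping example id -> list of required skills.
    Returns sampling weights that sum to 1.
    """
    raw = {ex: base + sum(profile[s] for s in skills) for ex, skills in examples.items()}
    z = sum(raw.values())
    return {ex: w / z for ex, w in raw.items()}

records = [
    (["algebra", "fractions"], ["fractions"]),
    (["geometry"], []),
    (["fractions"], ["fractions"]),
]
profile = missing_skill_profile(records)
weights = reweight({"a": ["algebra"], "b": ["fractions"]}, profile)
print(profile, weights)
```

STAT-Syn would instead pass the most frequently missed skills to the teacher as a request for new examples, which is not sketched here.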
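Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Continual Learning #Parameter-Efficient Fine-Tuning #Full Fine-Tuning #Catastrophic Forgetting #Singular Value Decomposition #Geometric Constraints #Orthogonal Subspaces #Low-Rank Subspaces #Constrained Optimization
TL;DR:We propose a constrained fine-tuning method for continual learning in LLMs using SVD and effective rank to guide updates in subspaces spanned by low singular vectors, significantly reducing catastrophic forgetting and outperforming SOTA methods.
🎯 Motivation: LLMs are prone to catastrophic forgetting in continual learning, and existing methods either limit model expressivity or add task-specific parameters, which hurts scalability.
❓ Problem: Design a parameter-efficient continual learning method that avoids catastrophic forgetting while keeping a fixed parameter count.
🔍 Analysis: Current methods show low knowledge retention, reliance on task-specific parameters, and loss of general capabilities during continual learning.
🛠️ Method: Proposes Orthogonal Subspace Fine-Tuning (OSFT), which uses adaptive singular value decomposition (SVD) to identify and preserve critical, high-rank parameter subspaces that encode prior knowledge, and constrains updates for new tasks to be orthogonal to those subspaces.
📊 Data and experiments: Evaluated on standard continual learning benchmarks with T5-Large, LLaMA-2 7B, and Mistral-7B; OSFT achieves a better trade-off between learnability and knowledge retention than SOTA methods.
⭐ Contributions: Provides a theoretically grounded and practical continual learning method that reduces forgetting to near-negligible levels, improves average accuracy by up to 7% over recent baselines, and preserves general linguistic and safety capabilities.
Abstract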
Continual learning in large language models (LLMs) is prone to catastrophic forgetting, where adapting to new tasks significantly degrades performance on previously learned ones. Existing parameter-efficient methods often limit model expressivity or introduce new parameters per task, creating scalability issues. To address these limitations, we introduce **Orthogonal Subspace Fine-Tuning (OSFT)**, a novel parameter-efficient approach for continual learning. OSFT leverages adaptive singular value decomposition (SVD) to dynamically identify and preserve critical, high-rank parameter subspaces that encode prior knowledge. All updates for new tasks are constrained to be strictly orthogonal to these preserved subspaces, which minimizes interference while maintaining a fixed parameter count and avoiding the need to store task-specific gradients. We extensively evaluate OSFT on standard continual learning benchmarks using both encoder-decoder (T5-Large) and decoder-only (LLaMA-2 7B, Mistral-7B) models across diverse tasks. Empirically, our method achieves a state-of-the-art trade-off between learnability and knowledge retention, dominating the Pareto frontier, with **up to 7\% higher** average accuracy than recent baselines like O-LoRA, and **reduces forgetting to near-negligible levels**. It notably maintains the model's general linguistic capabilities, instruction-following, and safety throughout the learning process. OSFT provides a practical, theoretically grounded, and scalable solution that effectively balances model plasticity and knowledge retention for continual learning in LLMs. Code is available at https://github.com/Red-Hat-AI-Innovation-Team/mini_trainer.
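The core constraint, keeping updates orthogonal to a preserved subspace, can be shown on plain vectors. This is a simplified sketch, not the OSFT code: the real method works on weight matrices and derives the preserved directions from an adaptive SVD, whereas here the orthonormal `basis` is simply given and `project_out` is a hypothetical helper.

```python
def project_out(update, basis):
    """Remove from `update` its components along each orthonormal vector in `basis`."""
    out = list(update)
    for u in basis:
        coeff = sum(o * ui for o, ui in zip(out, u))
        out = [o - coeff * ui for o, ui in zip(out, u)]
    return out

# Preserve the first coordinate direction; only the remaining directions may change.
preserved = [[1.0, 0.0, 0.0]]
raw_update = [3.0, 4.0, 5.0]
safe_update = project_out(raw_update, preserved)
print(safe_update)
```

After projection, the update has zero inner product with every preserved direction, so a step along it leaves the preserved subspace untouched to first order.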
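Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#LLM #uncertainty quantification #subjective uncertainty #benchmark #language model
TL;DR:We measure whether LLMs can output a string that summarizes their distribution of strings P_\theta(A|q) that they would output in response to a question.
🎯 Motivation: LLMs usually express uncertainty by attaching a percentage or a hedging word, but cannot transparently describe their internal distribution over answers. The key question is how to make an LLM reflect its internal belief distribution more faithfully.
❓ Problem: Proposes SelfReflect, an information-theoretic measure of how well an LLM-generated summary matches the model's internal answer distribution, to test whether models can truthfully convey their uncertainty.
🔍 Analysis: Modern LLMs are broadly unable to reveal their uncertainty, whether through reasoning, chain of thought, or explicit fine-tuning. However, when multiple sampled outputs are fed back into the context, models can produce faithful uncertainty summaries.
🛠️ Method: Develops the SelfReflect metric, which quantifies the distance between a summary string and a distribution over answers; interventional and human studies confirm that it detects even slight deviations.
📊 Data and experiments: Validates SelfReflect through interventional experiments and human evaluation, and releases code so that researchers can test arbitrary LLMs.
⭐ Contributions: Introduces a new metric for how well LLMs reflect their internal distribution, exposes the limits of current models, and provides a simple, effective remedy together with tooling.
Abstract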
The common approach to communicating a large language model's (LLM) uncertainty is to add a percentage number or a hedging word to its response. But is this all we can do? Instead of generating a single answer and then hedging it, an LLM that is fully transparent to the user needs to be able to reflect on its internal belief distribution and output a summary of all options it deems possible, and how likely they are. To test whether LLMs possess this capability, we develop the SelfReflect metric, an information-theoretic distance between a given summary and a distribution over answers. In interventional and human studies, we find that SelfReflect indicates even slight deviations, yielding a fine measure of faithfulness between a summary string and an LLM's actual internal distribution over answers. With SelfReflect, we make a resounding negative observation: modern LLMs are, across the board, incapable of revealing what they are uncertain about, whether through reasoning, chains of thought, or explicit finetuning. However, we do find that LLMs are able to generate faithful summaries of their uncertainties if we help them by sampling multiple outputs and feeding them back into the context. This simple approach sheds light on a universal way of communicating LLM uncertainties whose future development the SelfReflect score enables. To support the development of this universal form of LLM uncertainties, we publish the code that implements our metric for arbitrary LLMs under https://github.com/apple/ml-selfreflect .
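Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Single-stream Policy Optimization #Large Language Models #Reinforcement Learning
🎯 Motivation: Revisit policy-gradient optimization for LLMs and ask whether a single-stream approach can fix key problems of existing methods.
❓ Problem: Group-based methods lose learning signal and suffer from synchronization barriers; the goal is to improve stability and scalability.
🔍 Analysis: Degenerate groups occur frequently and erase the learning signal, and synchronization barriers limit scaling in long-horizon or tool-integrated settings.
🛠️ Method: Proposes Single-stream Policy Optimization (SPO), which replaces per-group baselines with a persistent, KL-adaptive value tracker and normalizes advantages globally across the batch, giving a low-variance learning signal for every sample.
📊 Data and experiments: With Qwen3-8B on five hard math benchmarks, SPO improves the average maj@32 by 3.4 percentage points over GRPO, with substantial absolute gains on the more challenging datasets.
⭐ Contributions: Introduces a principled optimization method that avoids added algorithmic complexity while improving LLM reasoning and compute efficiency, challenging the prevailing trend in RL algorithm design.
Abstract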
We revisit policy-gradient optimization for Large Language Models (LLMs) from a single-stream perspective. Prevailing group-based methods like GRPO reduce variance with on-the-fly baselines but suffer from critical flaws: frequent degenerate groups erase learning signals, and synchronization barriers hinder scalability. We introduce Single-stream Policy Optimization (SPO), which eliminates these issues by design. SPO replaces per-group baselines with a persistent, KL-adaptive value tracker and normalizes advantages globally across the batch, providing a stable, low-variance learning signal for every sample. Being group-free, SPO enables higher throughput and scales effectively in long-horizon or tool-integrated settings where generation times vary. Furthermore, the persistent value tracker naturally enables an adaptive curriculum via prioritized sampling. Experiments using Qwen3-8B show that SPO converges more smoothly and attains higher accuracy than GRPO, while eliminating computation wasted on degenerate groups. Ablation studies confirm that SPO's gains stem from its principled approach to baseline estimation and advantage normalization, offering a more robust and efficient path for LLM reasoning. Across five hard math benchmarks with Qwen3-8B, SPO improves the average maj@32 by $+3.4$ percentage points (pp) over GRPO, driven by substantial absolute gains on challenging datasets, including $+7.3\ \mathrm{pp}$ on BRUMO 25, $+4.4\ \mathrm{pp}$ on AIME 25, and $+3.3\ \mathrm{pp}$ on HMMT 25. It also achieves consistent relative gains in pass@$k$ across the evaluated $k$ values. SPO's success challenges the prevailing trend of adding incidental complexity to RL algorithms, highlighting a path where fundamental principles, not architectural workarounds, drive the next wave of progress in LLM reasoning.
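A rough sketch of the two ingredients named above, a persistent per-prompt baseline and batch-wide advantage normalization, is given below. It simplifies heavily: the tracker uses a fixed update rate rather than the KL-adaptive rule the abstract describes, and the class, method names, and default values are hypothetical.

```python
import math

class ValueTracker:
    """Persistent per-prompt baseline (fixed-rate moving average; a simplification)."""

    def __init__(self, init=0.5, rate=0.1):
        self.values = {}
        self.init, self.rate = init, rate

    def baseline(self, prompt_id):
        return self.values.get(prompt_id, self.init)

    def update(self, prompt_id, reward):
        v = self.baseline(prompt_id)
        self.values[prompt_id] = v + self.rate * (reward - v)

def batch_advantages(tracker, batch):
    """batch: list of (prompt_id, reward), one sample per prompt (no groups)."""
    raw = [r - tracker.baseline(p) for p, r in batch]
    mean = sum(raw) / len(raw)
    std = math.sqrt(sum((a - mean) ** 2 for a in raw) / len(raw))
    adv = [(a - mean) / (std + 1e-8) for a in raw]   # normalize across the whole batch
    for p, r in batch:
        tracker.update(p, r)                          # baseline persists across steps
    return adv

tracker = ValueTracker()
adv = batch_advantages(tracker, [("q1", 1.0), ("q2", 0.0), ("q3", 1.0)])
print(adv, tracker.baseline("q1"))
```

Because each sample is compared with its own tracked baseline, a prompt whose samples would all receive the same reward in a group can still yield a non-zero advantage, which is the degenerate-group issue the abstract refers to.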
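Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Large Language Models #Reasoning #Reinforcement Learning
TL;DR:We present SkillFactory, a pipeline for priming Language Models with cognitive reasoning skills that enhances reinforcement learning and improves downstream performance.
🎯 Motivation: Language models need cognitive skills such as verification and backtracking to reason well and to benefit from reinforcement learning, but skills that the base model does not exhibit require a new way to be learned.
❓ Problem: How to have a model acquire basic cognitive skills in a supervised fine-tuning stage before RL, so that it can solve harder tasks and be more robust.
🔍 Analysis: Models initialized with SkillFactory before RL generalize better to harder task variants despite lower starting performance, and regress less on out-of-domain tasks.
🛠️ Method: Proposes SkillFactory, which rearranges the model's own samples into training data in the format of cognitive skills and uses them for supervised fine-tuning, with no distillation from a stronger model.
📊 Data and experiments: Compares base models and SkillFactory-initialized models after RL, examining robustness across domains and tasks and whether the skills are actually used.
⭐ Contributions: Proposes a low-cost, distillation-free way to learn cognitive skills that improves post-RL generalization and task performance while reducing regression on out-of-domain tasks.
Abstract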
Reasoning models leveraging long chains of thought employ various cognitive skills, such as verification of their answers, backtracking, retrying by an alternate method, and more. Previous work has shown that when a base language model exhibits these skills, training that model further with reinforcement learning (RL) can learn to leverage them. How can we get models to leverage skills that aren't exhibited by base models? Our work, SkillFactory, is a method for fine-tuning models to roughly learn these skills during a supervised fine-tuning (SFT) stage prior to RL. Our approach does not rely on distillation from a stronger model, but instead uses samples from the model itself, rearranged to provide training data in the format of those skills. These "silver" SFT traces may be imperfect, but are nevertheless effective for priming a model to acquire skills during RL. Our evaluation shows that (1) starting from SkillFactory SFT initialization helps a model to generalize to harder variants of a task post-RL, despite lower performance pre-RL; (2) cognitive skills are indeed used by the model; (3) RLed SkillFactory models are more robust to regression on out-of-domain tasks than RLed base models. Our work suggests that inductive biases learned prior to RL help models learn robust cognitive skill use.
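Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#hindsight learning #agentic LLM #LLM #post training #RL
TL;DR:We propose a sample-efficient post-training method for LLM agents that turns their trajectories into successful demonstrations the agents use to learn and improve.
🎯 Motivation: Obtaining supervision is a bottleneck for LLM agents in partially observable, long-horizon tasks, and existing methods overlook the unintended yet successful goals contained in agent trajectories.
❓ Problem: How to use the successful goals implicit in an agent's existing trajectories as a supervision signal to improve learning.
🔍 Analysis: In long-horizon tasks with diverse goals, conventional supervision cannot cover every scenario, while the agent's past trajectories often hold untapped learning value.
🛠️ Method: Proposes Hindsight Supervised Learning (HSL): an auxiliary LLM reviews each trajectory and relabels it with the natural-language goals actually achieved; the trajectory-goal pairs are used for additional fine-tuning, with irrelevant-action masking and sample reweighting to improve the relabeled data.
📊 Data and experiments: Experiments in environments including ALFWorld show that HSL is markedly more sample-efficient than baselines, surpassing baselines trained on the full dataset while using only one quarter of the ground-truth demonstrations, with larger gains on long-horizon tasks with diverse goals.
⭐ Contributions: Proposes HSL, an efficient post-training method that mines implicit supervision from agent trajectories, and shows that it is compatible with existing pipelines and sample-efficient.
Abstract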
Large language model agents operate in partially observable, long-horizon settings where obtaining supervision remains a major bottleneck. We address this by utilizing a source of supervision overlooked in existing post-training methods: unintended yet successful goals embedded within agent rollouts. Specifically, we introduce Hindsight Supervised Learning (HSL), where an auxiliary LLM reviews each completed trajectory and relabels it with all of the natural-language goals the agent actually achieved. HSL then pairs the trajectory with its relabeled goals and uses these pairs for additional fine-tuning. To mitigate suboptimality in the relabeled data, we propose two learning techniques for HSL, irrelevant-action masking and sample reweighting. Our experiments show that HSL is flexible and compatible with existing post-training pipelines. It improves both SFT and DPO, with larger gains on long-horizon tasks with more diverse goal spaces. Moreover, HSL is sample-efficient: on ALFWorld, it surpasses baselines trained on the full dataset while using only one quarter of the ground-truth demonstrations.
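Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Reinforcement Finetuning #Large Language Model #Reasoning
🎯 Motivation: Existing reinforcement fine-tuning methods are mostly *on-policy* and cannot reuse historical data efficiently, which limits how efficiently LLM reasoning can be improved.
❓ Problem: Proposes a way to bring *off-policy* data into *on-policy* reinforcement fine-tuning, improving training efficiency and reasoning ability while lowering training cost.
🔍 Analysis: Off-policy data reduces the number of rollouts needed for gradient updates, but severe off-policyness can destabilize convergence and cause a collapse of the model's self-reflection.
🛠️ Method: Proposes the ReMix framework, consisting of mix-policy proximal policy gradient, a KL-Convex policy constraint, and policy reincarnation, to move from early efficiency to steady convergence.
📊 Data and experiments: On multiple math reasoning benchmarks, ReMix reaches SOTA-level performance on 1.5B and 7B models at a far lower training cost than recent models.
⭐ Contributions: Improves the efficiency of reinforcement fine-tuning and substantially cuts training cost by incorporating *off-policy* data, and releases code and models.
Abstract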
Reinforcement Learning (RL) has demonstrated its potential to improve the reasoning ability of Large Language Models (LLMs), yet most existing Reinforcement Finetuning (RFT) methods are inherently *on-policy* RL, failing to reuse historical data and thus preventing efficient scaling. In this work, we explore the potential of *off-policy* RL to leverage historical data for rollout-efficient RFT. Specifically, we propose **Re**incarnating **Mix**-policy Proximal Policy Gradient (**ReMix**), which enables on-policy RFT methods to leverage off-policy data. ReMix consists of three major components: (1) Mix-policy proximal policy gradient with an increased Update-To-Data (UTD) ratio that utilizes the data from both current and past policies for efficient training; (2) KL-Convex policy constraint that combines the KL constraints on the base and precedent model to balance stability and flexibility; (3) Policy reincarnation that replaces the base model with the mix-policy RFT model midway through training and restarts on-policy training, to achieve a seamless transition from early efficiency to steady convergence. In our experiments, we train a series of ReMix models based on PPO and GRPO from 1.5B and 7B base models. On five math reasoning benchmarks (i.e., AIME'24, AMC'23, Minerva, OlympiadBench, and MATH500), ReMix achieves an average Pass@1 accuracy of **52.10%** (with **0.079M rollouts**) and **64.39%** (with **0.011M rollouts**) on 1.5B and 7B models, respectively. Compared with 15 recent advanced models, ReMix shows SOTA-level performance with an over **30x to 450x reduction in training cost in terms of rollout data volume**, demonstrating superior training efficiency. Additionally, our multifaceted analysis reveals insightful findings, including the implicit preference for shorter responses of off-policy RFT and the collapse mode of self-reflection under severe off-policyness. The code and the trained models are available at https://anitaleungxx.github.io/ReMix/ .
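One plausible reading of the KL-Convex constraint in component (2) is a convex combination of two KL terms, one against the base model and one against the precedent model. The sketch below illustrates that reading on small discrete distributions; the mixing weight `lam`, the KL direction, and the function names are assumptions, since the abstract does not give the exact form.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_convex_penalty(policy, base, precedent, lam=0.5):
    """Assumed form: lam * KL(policy || base) + (1 - lam) * KL(policy || precedent)."""
    return lam * kl(policy, base) + (1 - lam) * kl(policy, precedent)

policy = [0.7, 0.3]
base = [0.5, 0.5]        # original base model
precedent = [0.6, 0.4]   # previous policy iterate
print(kl_convex_penalty(policy, base, precedent, lam=0.5))
```

Setting `lam` closer to 1 anchors the policy to the base model (more stability), while a smaller `lam` lets it follow the precedent model (more flexibility).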
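Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Large Language Models #Prompting #Diversity
TL;DR:We propose String Seed of Thought (SSoT), a simple prompting method that uses a random string as a seed to enable LLMs to accurately follow probabilistic instructions and enhance the response diversity.
🎯 Motivation: Current LLMs perform poorly on non-deterministic tasks: they struggle to follow probabilistic instructions and to keep their outputs diverse, which limits practical applications.
❓ Problem: Proposes a simple prompting method that introduces a random string as a seed to address poor probabilistic instruction following and low response diversity.
🔍 Analysis: LLMs tend to produce a single deterministic answer, which distorts probabilistic tasks and collapses answer diversity, especially where distributional fidelity and varied outputs are required.
🛠️ Method: String Seed of Thought (SSoT) instructs the LLM to first output a random string to generate entropy, then to extract randomness by manipulating that string to derive the final answer, preserving distributional fidelity and diversity.
📊 Data and experiments: SSoT improves probabilistic instruction following on closed-set tasks, and experiments on NoveltyBench show that it also raises response diversity on open-ended tasks.
⭐ Contributions: Proposes SSoT to improve LLM performance on non-deterministic tasks, ensuring distributional fidelity and greater diversity.
Abstract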
We introduce _String Seed of Thought (SSoT)_, a novel prompting method for LLMs that improves _Probabilistic Instruction Following (PIF)_. We define PIF as a task requiring an LLM to select its answer from a predefined set of options, each associated with a specific probability, such that the empirical distribution of the generated answers aligns with the target distribution when prompted multiple times. While LLMs excel at tasks with single, deterministic answers, they often fail at PIF, exhibiting biases problematic for applications requiring non-deterministic behaviors, such as human-behavior simulation, content diversification, and multiplayer games.
It also harms the diversity of generated responses, a crucial factor in test-time scaling, by causing the outputs to collapse into a limited set of answers. To address this, we propose SSoT, a simple prompting method that instructs an LLM to first output a random string to generate sufficient entropy. SSoT also instructs the LLM to extract randomness by manipulating this string to derive a final answer, thereby preserving diversity while adhering to specific constraints. We demonstrate that SSoT significantly improves the PIF performance of LLMs, approaching the ideal performance of a pseudo-random number generator. Notably, our experiments on NoveltyBench show SSoT's benefits extend beyond closed-set tasks to open-ended tasks by enhancing response diversity.
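The "extract randomness by manipulating the string" step can be imitated outside the model to see why it works. In the sketch below, ordinary code plays the role the LLM is prompted to perform: it folds a random string into a number in [0, 1) and maps it onto the target distribution. The rolling-hash rule and the function name are illustrative assumptions, not the prompt used in the paper.

```python
import random
import string

def pick_from_seed(seed_string, options):
    """Map a string to one option; options is a list of (answer, probability)."""
    h = 0
    for c in seed_string:
        h = (h * 31 + ord(c)) % 10_000      # simple rolling hash (illustrative)
    u = h / 10_000                           # pseudo-uniform value in [0, 1)
    cumulative = 0.0
    for answer, prob in options:
        cumulative += prob
        if u < cumulative:
            return answer
    return options[-1][0]                    # guard against rounding at the top end

options = [("heads", 0.3), ("tails", 0.7)]
rng = random.Random(0)
seeds = ["".join(rng.choice(string.ascii_lowercase) for _ in range(16)) for _ in range(2000)]
freq_heads = sum(pick_from_seed(s, options) == "heads" for s in seeds) / len(seeds)
print(f"empirical P(heads) = {freq_heads:.3f}")
```

As long as the seed strings are varied, the empirical frequency tracks the target probability, which is the behavior probabilistic instruction following asks for.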
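Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Controllable Stylized and Truthful Generation #Representation Editing
🎯 Motivation: Generating stylized LLM responses via representation editing is a promising route to fine-grained output control, but it often degrades truthfulness, so style and truthfulness must be balanced.
❓ Problem: Proposes a mechanism that prevents the loss of truthfulness during stylized generation, preserving both stylistic fidelity and answer correctness.
🔍 Analysis: Injecting style signals reveals a latent coupling between style directions and truth directions in certain key attention heads, which is the root cause of stylization-induced truthfulness collapse.
🛠️ Method: Separates style-relevant and truth-relevant subspaces via orthogonal deflation, then designs adaptive, token-level steering vectors within each subspace for independent fine-grained control of generation.
📊 Data and experiments: Validated on multiple styles and languages; the method reduces stylization-induced truthfulness collapse and outperforms existing inference-time intervention methods.
⭐ Contributions: Proposes StyliTruth, which addresses the conflict between style and truthfulness through representation separation and improves the balance between style adherence and truthfulness.
Abstract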
Generating stylized large language model (LLM) responses via representation editing is a promising way for fine-grained output control. However, there exists an inherent trade-off: imposing a distinctive style often degrades truthfulness. Existing representation editing methods, by naively injecting style signals, overlook this collateral impact and frequently contaminate the model’s core truthfulness representations, resulting in reduced answer correctness. We term this phenomenon stylization-induced truthfulness collapse. We attribute this issue to latent coupling between style and truth directions in certain key attention heads, and propose \textbf{StyliTruth}, a mechanism that preserves stylization while keeping truthfulness intact. StyliTruth separates the style-relevant and truth-relevant subspaces in the model’s representation space via an orthogonal deflation process. This decomposition enables independent control of style and truth in their own subspaces, minimizing interference. By designing adaptive, token-level steering vectors within each subspace, we dynamically and precisely control the generation process to maintain both stylistic fidelity and truthfulness. We validate our method on multiple styles and languages. Extensive experiments and analyses show that StyliTruth significantly reduces stylization-induced truthfulness collapse and outperforms existing inference-time intervention methods in balancing style adherence with truthfulness.
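Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#dynamic decoding #instruction-based Control #truly end-to-end
TL;DR:We introduce AutoDeco, which dynamically generates sampling parameters and improves LLMs' performance with almost no added latency. Crucially, it enables the model to understand natural language commands and actively steer its own decoding parameters.
🎯 Motivation: So-called "end-to-end" language models still rely on hand-tuned decoding parameters, which limits generation quality and user experience. A truly dynamic, automatic decoding approach is needed.
❓ Problem: Remove manual tuning of decoding hyperparameters by turning decoding into a parametric process that the model controls itself.
🔍 Analysis: Compared with static decoding, model-controlled decoding parameters improve generation quality, and the model also learns to interpret natural language instructions and adjust its parameters on the fly.
🛠️ Method: Proposes AutoDeco, which adds lightweight prediction heads to the Transformer so that the model predicts context-specific decoding parameters (temperature and top-p) at every generation step within a single forward pass.
📊 Data and experiments: In extensive experiments on eight benchmarks, AutoDeco outperforms common static decoding strategies and comes close to an oracle baseline tuned on the test set.
⭐ Contributions: Presents a decoding method that needs no manual tuning and shows that the model can control its decoding from natural language instructions, opening a direction for steerable, interactive decoding.
Abstract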
The "end-to-end" label for LLMs is a misnomer. In practice, they depend on a non-differentiable decoding process that requires laborious hand-tuning of hyperparameters like temperature and top-p. This paper introduces AutoDeco, a novel architecture that enables truly "end-to-end" generation by learning to control its own decoding strategy. We augment the standard transformer with lightweight heads that, at each step, dynamically predict context-specific temperature and top-p values alongside the next-token logits. This approach transforms decoding into a parametric, token-level process, allowing the model to self-regulate its sampling strategy within a single forward pass.
Through extensive experiments on eight benchmarks, we demonstrate that AutoDeco not only significantly outperforms common decoding strategies but also achieves performance comparable to an oracle-tuned baseline derived from "hacking the test set", a practical upper bound for any static method. In addition, we demonstrate an emergent capability for instruction-based decoding control: the model learns to interpret natural language commands (e.g., "generate with low randomness") and adjusts its predicted temperature and top-p on a token-by-token basis, which may open a new paradigm for steerable and interactive LLM decoding.
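To make the token-level decoding step concrete, the sketch below applies a per-step temperature and top-p to a vector of logits. It assumes the two values have already been produced by some prediction head; the logits, the predicted values, and all function names are hypothetical, and the abstract does not describe how AutoDeco's heads are trained.

```python
import math
import random

def softmax(logits, temperature):
    """Temperature-scaled softmax, computed stably."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered

def sample_step(logits, temperature, top_p, rng):
    """Sample one token using the decoding parameters predicted for this step."""
    probs = top_p_filter(softmax(logits, temperature), top_p)
    token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token, probs

rng = random.Random(0)
# Hypothetical per-step outputs: (logits, predicted temperature, predicted top-p).
steps = [
    ([2.0, 1.0, 0.0], 0.1, 0.5),  # the head asks for near-greedy decoding
    ([2.0, 1.0, 0.0], 1.5, 1.0),  # the head asks for more exploration
]
for logits, temperature, top_p in steps:
    token, probs = sample_step(logits, temperature, top_p, rng)
    print(token, [round(p, 3) for p in probs])
```

With the first setting only the top token survives the nucleus filter, while the second keeps the full distribution, so the same logits yield very different sampling behavior depending on the predicted parameters.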
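Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#RL from verifiable rewards #Finetuning LLMs #Trust Regions
TL;DR: Replacing PPO's clipping objective with more principled trust regions improves RL from verifiable rewards.
🎯 Motivation: RL with PPO-style clipped objectives has become the standard way to fine-tune LLMs on rewards, but clipping is only a crude approximation of a KL-constrained trust region and often leads to unstable training and suboptimal performance. Advantage estimation and normalization have been studied extensively, while alternatives to the clipping mechanism itself have received little attention.
❓ Problem: Addresses the instability and suboptimal performance caused by using the PPO clipping objective as a proxy for a KL trust region, by directly replacing the heuristic clipping with a more principled discrete differentiable trust region projection.
🔍 Analysis: The clip objective was introduced as an approximation of KL-based trust regions, but the approximation is coarse, frequently causes unstable updates, and caps final performance. Existing work has improved advantage estimation and related components without addressing this underlying approximation.
🛠️ Method: Proposes TROLL, which replaces PPO's clip objective with a discrete differentiable trust region projection that enforces principled token-level KL constraints. The projection operates on a sparse subset of the model's most important token logits to balance computational cost and projection effectiveness.
📊 Data and experiments: Systematic experiments on mathematical reasoning and code generation across several model families and advantage-estimation methods. TROLL consistently outperforms PPO-style clipping in training speed, stability, and final success rate.
⭐ Contributions: Presents TROLL as a drop-in replacement for PPO-like clipping during training that leaves the model's inference behavior unchanged. More principled trust region constraints improve the efficiency and performance of RL fine-tuning for LLMs across tasks and settings.
Abstract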
Reinforcement Learning (RL) with PPO-like clip objectives has become the standard choice for reward-based fine-tuning of large language models (LLMs).
Although recent work has explored improved estimators of advantages and normalization, the clipping mechanism itself has remained untouched.
Originally introduced as a proxy for principled KL-based trust regions, clipping is a crude approximation that often causes unstable updates and suboptimal performance.
We replace the clip objective with a novel discrete differentiable trust region projection, which provides principled token-level KL constraints.
The projection operates on a sparse subset of the model’s most important token logits to balance computational cost and projection effectiveness.
Our approach, Trust Region Optimization for Large Language Models (TROLL), serves as a direct replacement for PPO-like clipping during training and does not alter the model’s inference behavior.
Across mathematical reasoning and code generation tasks, model families, as well as advantage-estimation methods, TROLL consistently outperforms PPO-like clipping in terms of training speed, stability, and final success rates.
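A token-level KL trust region can be pictured with a small stand-in projection: if the new next-token distribution has drifted more than a bound from the old one, pull it back along the logit-space interpolation path until the KL constraint holds. This is not TROLL's projection (the abstract gives no formula, and TROLL works on a sparse subset of logits and is differentiable); the bisection scheme, the bound `eps`, and the toy logits are assumptions chosen only to show what "enforcing a KL constraint per token" means.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def project_to_trust_region(old_logits, new_logits, eps, iters=50):
    """Return logits on the path old -> new whose KL from the old policy is at most eps."""
    p_old = softmax(old_logits)

    def mix(alpha):
        return [(1.0 - alpha) * o + alpha * n for o, n in zip(old_logits, new_logits)]

    if kl(p_old, softmax(new_logits)) <= eps:
        return new_logits  # already inside the trust region
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # KL from the old policy grows monotonically along this path, so bisection applies.
        if kl(p_old, softmax(mix(mid))) <= eps:
            lo = mid
        else:
            hi = mid
    return mix(lo)

old = [0.0, 0.0, 0.0]
new = [5.0, 0.0, -5.0]  # an aggressive update for one token position
eps = 0.05
projected = project_to_trust_region(old, new, eps)
print(kl(softmax(old), softmax(new)), kl(softmax(old), softmax(projected)))
```

Unlike ratio clipping, which only zeroes gradients once a probability ratio leaves a fixed interval, a projection of this kind bounds the divergence of the whole per-token distribution.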
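Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Large Language Models #Supervised Fine-Tuning #Output Diversity
🎯 Motivation: Supervised fine-tuning of LLMs typically uses cross-entropy loss, which pushes the distribution toward a single one-hot target and ignores alternative continuations. This limits output diversity and hinders sampling-based exploration in generative tasks.
❓ Problem: Proposes the TS$^2$ framework: train with an improved Sparsemax$+$ objective to preserve diversity across plausible targets, and decode with softmax at test time to obtain calibrated probabilities in which plausible near-ties survive.
🔍 Analysis: Sparsemax and its loss mask the gradients of probabilities outside the support set, so evaluating with softmax leaves excessive probability mass on irrelevant tail classes.
🛠️ Method: Designs Sparsemax$+$, which augments the sparsemax loss during training with a suppression term that penalizes out-of-support probability mass; at test time, softmax decoding yields non-degenerate probabilities.
📊 Data and experiments: Fine-tunes Llama-3.1-8B and Qwen-2.5-7B and evaluates on chat, code, and open-domain benchmarks; TS$^2$ consistently improves accuracy and output diversity.
⭐ Contributions: Provides a simple, drop-in framework for fine-tuning LLMs that are both more accurate and more creative, and releases the code.
Abstract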
Large Language Models typically rely on Supervised Fine-Tuning (SFT) with Cross-Entropy (CE) loss to specialize in downstream tasks. However, CE forces the distribution toward one-hot targets and ignores alternative continuations, thereby limiting output diversity, a key drawback for generative applications that rely on sampling-based exploration.
In this paper, we propose "Training with Sparsemax$+$, Testing with Softmax (TS$^2$)". Intuitively, sparsemax and its tailored loss mask the gradients of probabilities outside the support set, leaving excessive probability mass on irrelevant tail classes when evaluating with softmax. To address this issue, we propose an improved variant, Sparsemax$+$, for training, which augments the sparsemax loss with a suppression term that penalizes the out-of-support probabilities. At testing, we decode with softmax, yielding calibrated, non-degenerate probabilities where plausible near-ties survive.
We fine-tuned Llama-3.1-8B and Qwen-2.5-7B with TS$^2$, achieving consistent improvements in accuracy and output diversity across chat, code, and open-domain benchmarks. Together, these results demonstrate that TS$^2$ provides a practical, drop-in solution for fine-tuning LLMs that are both more accurate and more creative.
The code is available at https://github.com/xzy-bit/TS-2-ICLR-2026.
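The mismatch between sparsemax training and softmax decoding is easy to see on a toy logit vector. The sketch below implements the standard sparsemax projection and measures how much softmax mass falls outside the sparsemax support; it does not implement the Sparsemax$+$ loss itself, whose exact form is not given in the abstract, and the logits are made up.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def sparsemax(z):
    """Euclidean projection of the logits onto the probability simplex."""
    cumsum, tau = 0.0, 0.0
    for k, zk in enumerate(sorted(z, reverse=True), start=1):
        cumsum += zk
        if 1.0 + k * zk > cumsum:      # zk is still inside the support
            tau = (cumsum - 1.0) / k   # threshold for a support of size k
    return [max(zi - tau, 0.0) for zi in z]

logits = [1.0, 0.5, 0.0]
sparse = sparsemax(logits)
dense = softmax(logits)

# Softmax mass on tokens that sparsemax assigns exactly zero probability.
tail_mass = sum(d for s, d in zip(sparse, dense) if s == 0.0)
print(sparse, [round(d, 3) for d in dense], round(tail_mass, 3))
```

Sparsemax zeroes the third token, yet softmax on the same logits still gives it noticeable probability. That leftover tail mass is the quantity a suppression term of the kind described above would push down.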
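Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Large Language Models #Reinforcement Learning #Adaptive Sampling Temperature #Meta-Optimization #GRPO
🎯 Motivation: Temperature controls the exploration-exploitation trade-off in LLM text generation, but static or heuristic temperature schedules cannot adapt to the changing demands of RL training, which limits policy improvement.
❓ Problem: Addresses the inability of fixed or heuristic temperature schedules to adapt to training dynamics by proposing a learnable temperature-control framework that strengthens exploration and improves policy performance.
🔍 Analysis: High temperatures yield diverse but noisy outputs; low temperatures yield focused outputs but risk premature convergence. Conventional methods cannot balance the two dynamically, which hurts RL training.
🛠️ Method: Proposes TAMPO, which treats temperature control as a learnable meta-policy in a hierarchical two-loop process. The inner loop updates the LLM policy with trajectories sampled at the selected temperature; the outer loop updates the distribution over candidate temperatures by rewarding those that make high-advantage trajectories more likely, enabling online adaptation.
📊 Data and experiments: Experiments on five mathematical reasoning benchmarks, comparing against baselines that use fixed or heuristic temperatures.
⭐ Contributions: Establishes temperature as a learnable meta-policy and proposes the TAMPO framework for LLMs, offering a new approach to adaptive exploration in RL.
Abstract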
Temperature is a crucial hyperparameter in large language models (LLMs), controlling the trade-off between exploration and exploitation during text generation. High temperatures encourage diverse but noisy outputs, while low temperatures produce focused outputs but may cause premature convergence. Yet static or heuristic temperature schedules fail to adapt to the dynamic demands of reinforcement learning (RL) throughout training, often limiting policy improvement. We propose Temperature Adaptive Meta Policy Optimization (TAMPO), a new framework that recasts temperature control as a learnable meta-policy. TAMPO operates through a hierarchical two-loop process. In the inner loop, the LLM policy is updated (e.g., using GRPO) with trajectories sampled at the temperature selected by the meta-policy.
In the outer loop, the meta-policy updates the distribution over candidate temperatures by rewarding those that maximize the likelihood of high-advantage trajectories. This trajectory-guided, reward-driven mechanism enables online adaptation without additional rollouts, directly aligning exploration with policy improvement. On five mathematical reasoning benchmarks, TAMPO outperforms baselines using fixed or heuristic temperatures, establishing temperature as an effective learnable meta-policy for adaptive exploration in LLM reinforcement learning.
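The outer loop can be sketched as a preference update over a small set of candidate temperatures. Everything specific here is an assumption made for illustration: the candidate set, the learning rate, the rule "add advantage-weighted log-likelihood to the meta-logits", and the toy logits are not taken from the paper, which only states that temperatures making high-advantage trajectories more likely are rewarded.

```python
import math

def log_likelihood(step_logits, tokens, temperature):
    """Log-probability of a trajectory if it had been sampled at this temperature."""
    total = 0.0
    for logits, tok in zip(step_logits, tokens):
        scaled = [x / temperature for x in logits]
        m = max(scaled)
        log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
        total += scaled[tok] - log_z
    return total

def update_meta_logits(meta_logits, temps, trajectories, lr=0.5):
    """Shift preference toward temperatures under which high-advantage trajectories are likely."""
    scores = [0.0] * len(temps)
    for step_logits, tokens, advantage in trajectories:
        if advantage <= 0.0:
            continue  # only high-advantage trajectories guide the update
        for i, temperature in enumerate(temps):
            scores[i] += advantage * log_likelihood(step_logits, tokens, temperature)
    mean = sum(scores) / len(scores)
    return [m + lr * (s - mean) for m, s in zip(meta_logits, scores)]

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

temps = [0.5, 1.0, 1.5]          # hypothetical candidate temperatures
meta_logits = [0.0, 0.0, 0.0]    # start from a uniform meta-policy
# Already-collected rollouts: (per-step logits, sampled tokens, advantage).
trajectories = [
    ([[2.0, 1.0, 0.0], [0.0, 3.0, 1.0]], [0, 1], 1.0),   # good, mostly greedy tokens
    ([[2.0, 1.0, 0.0], [0.0, 3.0, 1.0]], [2, 0], -1.0),  # poor trajectory, ignored
]
meta_logits = update_meta_logits(meta_logits, temps, trajectories)
meta_probs = softmax(meta_logits)
print([round(p, 3) for p in meta_probs])
```

The update reuses trajectories that the inner loop has already sampled, which mirrors the abstract's point that adaptation needs no additional rollouts.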
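Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Test-time preference alignment #Large Language Models #Machine translation
TL;DR: Test-time preference alignment, Large Language Models, Machine translation
🎯 Motivation: Aligning LLMs with human preferences through fine-tuning is expensive, which motivates lightweight test-time alternatives.
❓ Problem: Frames test-time preference alignment as sequential decision making and tackles the two challenges this view exposes: the curse of horizon and the curse of dimensionality.
🔍 Analysis: With token-level actions, as in guided decoding, alignment suffers from the curse of horizon; with response-level actions, as in traditional iterative refinement, the curse of dimensionality emerges.
🛠️ Method: Drawing on Model Predictive Control (MPC) from control theory, proposes Textual Model Predictive Control (TMPC), which uses hindsight subgoal identification and subgoal-conditioned re-generation to improve outputs stably at inference time.
📊 Data and experiments: Evaluated on three tasks with distinct segmentation properties (discourse-level translation, long-form response generation, and program synthesis); TMPC consistently improves performance, indicating its generality.
⭐ Contributions: Proposes the TMPC framework for test-time preference alignment, uses principles inspired by hierarchical reinforcement learning to overcome challenges inherent to text generation, and validates its effectiveness and broad applicability across tasks.
Abstract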
Aligning Large Language Models (LLMs) with human preferences through finetuning is resource-intensive, motivating lightweight alternatives at test time. We address test-time alignment through the lens of sequential decision making, a perspective that reveals two fundamental challenges. When actions are defined at the token level, as in guided decoding, alignment suffers from the curse of horizon. Conversely, when actions are at the response level, as in traditional iterative refinement, the curse of dimensionality emerges. To resolve this trade-off, we draw inspiration from Model Predictive Control (MPC) in control theory to propose Textual Model Predictive Control (TMPC), a novel predictive planning framework adapted for aligning LLMs at inference time. A key limitation of standard MPC is its reliance on predefined, hard segment boundaries, which are often absent in text generation. TMPC overcomes this by introducing two principles inspired by hierarchical reinforcement learning: (1) Hindsight Subgoal Identification, where TMPC analyzes generation subgoals to retrospectively identify high-reward intermediate outputs as subgoals. This allows the framework to discover meaningful, task-specific planning steps (e.g., a sentence in machine translation or a bug fix in code generation). (2) Subgoal-Conditioned Re-Generation, where these identified subgoals are used to guide subsequent planning iterations. By conditioning on these proven, high-quality subgoals, TMPC ensures stable improvement by building upon previously validated successes. TMPC is evaluated on three tasks with distinct segmentation properties: discourse-level translation, long-form response generation, and program synthesis. The results demonstrate that TMPC consistently improves performance, highlighting its generality. Project page: https://rl-bandits-lab.github.io/TMPC/.
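Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#alignment #bayesian #inverse reinforcement learning #uncertainty #diagnostics
TL;DR: We develop an auditing framework that reframes reward inference in LLMs as a comprehensive verification process via Bayesian IRL.
🎯 Motivation: The objectives that LLMs implicitly optimize are highly opaque, which makes alignment and auditing very difficult; a reliable framework for verifying objectives is needed to build trust.
❓ Problem: Existing IRL methods do not handle the uncertainty and non-identifiability of the task: they produce a single or overconfident reward estimate and lack a systematic means of verification.
🔍 Analysis: Inferring an LLM's objective involves ambiguity in the reward distribution and shifts under out-of-distribution inputs; existing methods struggle to provide robust diagnostics or reliability checks.
🛠️ Method: Proposes a Bayesian IRL auditing framework that updates the reward distribution over successive rounds of evidence, provides uncertainty-aware diagnostics, and validates the policy-level utility of the inferred objective.
📊 Data and experiments: Validated in detoxification and helpfulness-preference settings, yielding a well-calibrated objective; the refined reward, used directly in RLHF, achieves training dynamics and toxicity reductions comparable to the ground-truth alignment process.
⭐ Contributions: Provides a systematic auditing toolkit for verifying what LLMs are actually optimizing, improving the trustworthiness and accountability of AI alignment and giving safety teams and regulators a practical tool.
Abstract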
The objectives that Large Language Models (LLMs) implicitly optimize remain dangerously opaque, making trustworthy alignment and auditing a grand challenge. While Inverse Reinforcement Learning (IRL) can infer reward functions from behaviour, existing approaches either produce a single, overconfident reward estimate or fail to address the fundamental ambiguity of the task (non-identifiability). This paper introduces a principled auditing framework that re-frames reward inference from a simple estimation task to a comprehensive process for verification. Our framework leverages Bayesian IRL to not only recover a distribution over objectives but to enable three critical audit capabilities: (i) Quantifying and systematically reducing non-identifiability by demonstrating posterior contraction over sequential rounds of evidence; (ii) Providing actionable, uncertainty-aware diagnostics that expose spurious shortcuts and identify out-of-distribution prompts where the inferred objective cannot be trusted; and (iii) Validating policy-level utility by showing that the refined, low-uncertainty reward can be used directly in RLHF to achieve training dynamics and toxicity reductions comparable to the ground-truth alignment process. Empirically, our framework successfully audits a detoxified LLM and generalizes beyond detoxification to a helpfulness preference setting, yielding a well-calibrated and interpretable objective that strengthens alignment guarantees. Overall, this work provides a practical toolkit for auditors, safety teams, and regulators to verify what LLMs are truly trying to achieve, moving us toward more trustworthy and accountable AI.
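Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Scaling #LLMs #Reasoning
TL;DR: We study compute scaling properties of RL methods on LLMs
🎯 Motivation: RL has become central to training LLMs, but the field lacks predictive compute-scaling methodologies comparable to those for pre-training.
❓ Problem: Proposes a systematic framework for analyzing and predicting how RL scales with compute in LLMs.
🔍 Analysis: Not all recipes reach the same asymptotic performance. Details such as loss aggregation, normalization, and curriculum mainly affect compute efficiency rather than final performance, and stable recipes follow predictable scaling trajectories.
🛠️ Method: Fits sigmoidal compute-performance curves, analyzes how common design choices affect performance and efficiency, and proposes ScaleRL, a scalable best-practice recipe.
📊 Data and experiments: Based on more than 400,000 GPU-hours of experiments, including a single RL run scaled up to 100,000 GPU-hours.
⭐ Contributions: Provides a scientific framework for analyzing RL scaling and a practical best-practice recipe, bringing RL training closer to the predictability of pre-training.
Abstract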
Reinforcement learning (RL) has become central to training large language models (LLMs), yet the field lacks predictive scaling methodologies comparable to those established for pre-training.
Despite rapidly rising compute budgets, there is no principled understanding of how to evaluate algorithmic improvements for scaling RL compute.
We present the first large-scale systematic study, amounting to more than 400,000 GPU-hours, that defines a principled framework for analyzing and predicting RL scaling in LLMs.
We fit sigmoidal compute-performance curves for RL training and ablate a wide range of common design choices to analyze their effects on asymptotic performance and compute efficiency. We observe:
(1) Not all recipes yield similar asymptotic performance,
(2) Details such as loss aggregation, normalization, curriculum, and off-policy algorithm primarily modulate compute efficiency without materially shifting the asymptote, and
(3) Stable, scalable recipes follow predictable scaling trajectories, enabling extrapolation from smaller-scale runs.
Combining these insights, we propose a _best-practice_ recipe, ScaleRL, and demonstrate its effectiveness by successfully scaling and predicting validation performance on a single RL run scaled up to 100,000 GPU-hours.
Our work provides both a _scientific framework_ for analyzing scaling in RL and a practical recipe that brings RL training closer to the predictability long achieved in pre-training.
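A sigmoidal compute-performance curve separates an asymptote from an efficiency term, which is what lets small runs be extrapolated. The sketch below uses one common saturating form, R(C) = R0 + (A - R0) / (1 + (C_mid / C)^B); this parameterization, the parameter names, and the noise-free synthetic data are assumptions for illustration, since the abstract does not state the exact functional form. With the floor R0 and asymptote A treated as known, the exponent B and midpoint C_mid follow from a linear fit in log space.

```python
import math

def sigmoid_curve(compute, r0, a, b, c_mid):
    """Performance as a saturating function of compute: floor r0, asymptote a."""
    return r0 + (a - r0) / (1.0 + (c_mid / compute) ** b)

def fit_b_and_cmid(computes, rewards, r0, a):
    """Least-squares fit of b and c_mid after linearizing the curve in log-compute."""
    xs = [math.log(c) for c in computes]
    # log((a - r0) / (R - r0) - 1) = b * log(c_mid) - b * log(C)
    ys = [math.log((a - r0) / (r - r0) - 1.0) for r in rewards]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    b = -slope
    return b, math.exp(intercept / b)

# Synthetic small-scale "runs" generated from known parameters (not real results).
true_r0, true_a, true_b, true_c_mid = 0.2, 0.8, 1.5, 1000.0
computes = [100.0, 300.0, 700.0, 2000.0]
rewards = [sigmoid_curve(c, true_r0, true_a, true_b, true_c_mid) for c in computes]

b_hat, c_mid_hat = fit_b_and_cmid(computes, rewards, true_r0, true_a)
# Extrapolate to a compute budget far beyond the fitted range.
predicted = sigmoid_curve(100000.0, true_r0, true_a, b_hat, c_mid_hat)
print(round(b_hat, 3), round(c_mid_hat, 1), round(predicted, 4))
```

In this form a design choice that only changes B or C_mid shifts how quickly the curve rises, while a change in A moves the ceiling, matching the distinction the abstract draws between compute efficiency and asymptotic performance.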
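Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Large Language Model #Reinforcement Learning with Verifiable Reward #f divergence
TL;DR: We propose the DPH-RL framework, which uses f-divergence as a proactive 'rehearsal mechanism' to solve the solution diversity collapse problem that arises from fine-tuning LLMs with reinforcement learning.
🎯 Motivation: When LLMs are optimized with RL, single-attempt accuracy improves while multi-attempt performance (Pass@k) declines, accompanied by forgetting of prior knowledge; this diversity collapse needs to be addressed.
❓ Problem: Revisits the standard reverse KL divergence and explores alternative f-divergences as a knowledge-retention mechanism that preserves solution diversity and mitigates forgetting.
🔍 Analysis: The reverse KL narrows the policy and so erodes diversity; dropping the divergence term altogether gives no protection for existing skills either, and the policy concentrates on a few solutions.
🛠️ Method: Proposes DPH-RL, which uses mass-covering f-divergences (such as forward KL and JS divergence) as a "rehearsal mechanism" that continuously references the initial policy and forces the model to maintain broad solution coverage.
📊 Data and experiments: On math and SQL generation tasks, DPH-RL improves in-domain Pass@1 and Pass@k, mitigates forgetting on out-of-domain tasks, and is more training-efficient.
⭐ Contributions: Presents a systematic framework for applying f-divergences that highlights an overlooked axis for improving RLVR, showing that choosing an appropriate divergence helps build more general and diverse reasoning models.
Abstract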
A central paradox in fine-tuning Large Language Models (LLMs) with Reinforcement Learning with Verifiable Reward (RLVR) is the frequent degradation of multi-attempt performance (Pass@k) despite improvements in single-attempt accuracy (Pass@1). This is often accompanied by catastrophic forgetting, where models lose previously acquired skills. Despite numerous proposed methods, the community's focus on the standard reverse KL-divergence has led to a surprising oversight: the potential of alternative f-divergences as a proactive solution has been largely unexamined. We argue that standard RLVR objectives—both those using the mode-seeking reverse KL-divergence and those forgoing a divergence term entirely—lack a crucial mechanism for knowledge retention. The reverse-KL actively accelerates this decay by narrowing the policy, while its absence provides no safeguard against the model drifting from its diverse knowledge base. We propose a fundamental shift in perspective: using the divergence term itself as the solution. Our framework, Diversity-Preserving Hybrid RL (DPH-RL), leverages mass-covering f-divergences (like forward-KL and JS-divergence) to function as a 'rehearsal mechanism'. By continuously referencing the initial policy, this approach forces the model to maintain broad solution coverage. Math and SQL generation experiments show that DPH-RL both improves in-domain Pass@1 and Pass@k scores and effectively prevents catastrophic forgetting on out-of-domain tasks. Additionally, DPH-RL is more training-efficient because it computes f-divergence using generator functions, requiring only sampling from the initial policy and no online reference model. Our work highlights a crucial, overlooked axis for improving RLVR, demonstrating that the proper selection of a divergence measure is a powerful tool for building more general and diverse reasoning models.
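The contrast between mode-seeking and mass-covering divergences shows up even on a three-outcome toy policy. The sketch below writes each divergence in generator form, D_f(P || Q) = sum_x Q(x) f(P(x) / Q(x)), and compares how strongly reverse KL, forward KL, and JS penalize a policy that has collapsed onto one solution style. The two distributions are invented for illustration, and this is not the DPH-RL training objective.

```python
import math

def f_divergence(p, q, f):
    """Generator form D_f(P || Q) = sum_x Q(x) * f(P(x) / Q(x)), for strictly positive P, Q."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def kl_generator(t):
    return t * math.log(t)

def js_generator(t):
    return 0.5 * (t * math.log(t) - (t + 1.0) * math.log((t + 1.0) / 2.0))

def kl(p, q):
    """Direct definition, used to cross-check the generator form."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

initial = [0.4, 0.3, 0.3]       # hypothetical initial policy over three solution styles
collapsed = [0.98, 0.01, 0.01]  # hypothetical policy after RL has narrowed onto one style

# Reverse KL, the usual RL regularizer: KL(current || initial).
reverse_kl = f_divergence(collapsed, initial, kl_generator)
# Forward KL: KL(initial || current), large when the current policy drops initial modes.
forward_kl = f_divergence(initial, collapsed, kl_generator)
js = f_divergence(collapsed, initial, js_generator)

print(round(reverse_kl, 4), round(forward_kl, 4), round(js, 4))
```

The forward KL is roughly twice the reverse KL here because it weights the log-ratio by the initial policy, so abandoning styles the initial policy used is costly. That asymmetry is the sense in which a mass-covering term keeps pulling the policy back toward broad coverage.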
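Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Large Language Models #Research Automation #Scientific Discovery
🎯 Motivation: LLMs have shown promise in accelerating the research pipeline, but the research ideas they generate must do more than appear novel: they should also lead to better research once executed.
❓ Problem: Tests whether LLM-generated research ideas, once executed, match or exceed the outcomes of ideas written by human experts.
🔍 Analysis: Although LLM-generated ideas were initially judged more novel than expert ideas, their scores on all metrics dropped significantly more after execution, showing that the initial advantage does not hold up.
🛠️ Method: Recruits 43 expert researchers to execute randomly assigned ideas, written either by experts or by an LLM, and compares blind review scores before and after execution.
📊 Data and experiments: Each expert spent over 100 hours implementing an idea and wrote a 4-page short paper documenting the experiments; all projects were reviewed blindly by expert NLP researchers on novelty, excitement, effectiveness, and overall quality.
⭐ Contributions: Reveals a substantial gap between LLM idea generation and actual execution outcomes, informing how AI-generated research ideas should be evaluated and developed.
Abstract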
Large Language Models (LLMs) have shown promise in accelerating the scientific research pipeline. A key capability for this process is the ability to generate novel research ideas, and prior studies have found settings in which LLM-generated research ideas were judged as more novel than human-expert ideas. However, a good idea should not simply appear to be novel; it should also result in better research after being executed. To test whether AI-generated ideas lead to better research outcomes, we conduct an execution study by recruiting 43 expert researchers to execute randomly assigned ideas, either written by experts or generated by an LLM. Each expert spent over 100 hours implementing the idea and wrote a 4-page short paper to document the experiments. All the executed projects are then reviewed blindly by expert NLP researchers. Comparing the review scores of the same ideas before and after execution, the scores of the LLM-generated ideas decrease significantly more than expert-written ideas on all evaluation metrics (novelty, excitement, effectiveness, and overall; p < 0.05), closing the gap between LLM and human ideas observed at the ideation stage. When comparing the aggregated review scores from the execution study, we even observe that for many metrics there is a flip in rankings where human ideas score higher than LLM ideas. This ideation-execution gap highlights the limitations of current LLMs in generating truly effective research ideas and the challenge of evaluating research ideas in the absence of execution outcomes.
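Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Training Dynamics #Self-Improvement
TL;DR: This paper presents a physics-inspired model where the solver-verifier gap drives self-improvement, yielding an exponential capability convergence that accords with empirical observations on various LLMs and datasets.
🎯 Motivation: Self-improvement methods aim to boost LLM performance without external data, but how performance evolves during the process is underexplored. Understanding it would allow self-improvement to be analyzed and optimized more systematically.
❓ Problem: Introduces the "solver-verifier gap" as a theoretical framework to explain what drives performance gains during self-improvement and what limits them.
🔍 Analysis: Posits that gains stem from the gap between the model's solver and verifier capabilities, and shows that this gap drives exponential capability convergence, consistent with empirical observations across LLMs and datasets.
🛠️ Method: Uses physics-inspired theoretical modeling to describe the full trajectory of self-improvement training, and fits model parameters to experimental data to quantify the capability limit.
📊 Data and experiments: Validates the framework on various LLMs and datasets, and extends the analysis to how external data affects self-improvement dynamics.
⭐ Contributions: Quantifies the training dynamics of LLM self-improvement through a theoretical framework, identifies the solver-verifier gap as the core driver of improvement, and offers insight into the use of external data.
Abstract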
Self-improvement is a significant technique for large language models (LLMs), aiming to enhance LLM performance without relying on external data. Despite its significance, how LLM performance generally evolves during the self-improvement process remains underexplored. In this paper, we theoretically model the training dynamics of self-improvement via the concept of the solver-verifier gap. This is inspired by the conjecture that the performance enhancement of self-improvement stems from the gap between an LLM's solver capability and verifier capability. Based on the theoretical framework, we further show how to model the entire training trajectory. This framework allows quantifying the capability limit of self-improvement by fitting the theoretical model to the experimental results. We validate the effectiveness of the theoretical framework on various LLMs and datasets. Beyond self-improvement, we extend our analysis to investigate how external data influences these dynamics within the framework. Notably, we find that under limited external data regimes, such external data can be utilized at any stage without significantly affecting final performance, which accords with the empirical observations.
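Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Large Language Models #Knowledge Transfer #PEFT
TL;DR: We propose a new framework, TiTok, which enables effective LoRA transplantation through token-level knowledge transfer.
🎯 Motivation: Fine-tuning LLMs is costly. Parameter-efficient fine-tuning methods such as LoRA reduce that cost, but the adapted parameters depend on the base model and are hard to transfer across models.
❓ Problem: Proposes TiTok, which transfers knowledge using token-level contrastive information between a model with and without LoRA on the task, addressing the difficulty of transferring adapter parameters.
🔍 Analysis: Transfer via knowledge distillation depends heavily on the training data, and existing workarounds that generate synthetic data add complexity because they require training an additional model.
🛠️ Method: Uses a token-wise contrastive excess to highlight task-relevant tokens and selectively filter synthetic data, without additional models or overhead.
📊 Data and experiments: Evaluated on three benchmarks across multiple transfer settings; TiTok achieves average gains of 4-10% over baselines.
⭐ Contributions: Proposes the TiTok framework, which transfers knowledge at the token level through contrastive information and improves LoRA transfer without adding complexity.
Abstract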
Large Language Models (LLMs) are widely applied in real world scenarios, yet fine-tuning them comes with significant computational and storage costs. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA mitigate these costs; however, the adapted parameters are dependent on the base model and cannot be transferred across different backbones. One way to address this issue is through knowledge distillation, but its effectiveness inherently depends on training data. Recent work such as TransLoRA avoids this by generating synthetic data; nevertheless, this adds complexity since it requires training an additional discriminator model. In this paper, we propose TiTok, a new framework that enables effective LoRA Transplantation through Token-level knowledge transfer. Specifically, TiTok captures task-relevant information through a token-wise contrastive excess between a source model with and without LoRA. This excess highlights informative tokens and enables selective filtering of synthetic data, all without additional models or overhead. Through experiments on three benchmarks across multiple transfer settings, we demonstrate that TiTok is consistently effective, achieving average performance gains of +4–10% compared to baselines overall.
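One way to read "token-wise contrastive excess" is as the per-token gap in log-probability between the source model with the adapter and the same model without it. The sketch below computes that gap and keeps the highest-scoring tokens; the log-probabilities are made up, and the top-k selection rule and function names are assumptions rather than the paper's procedure.

```python
def contrastive_excess(logp_with_lora, logp_without_lora):
    """Per-token log-probability gain attributable to the adapter."""
    return [a - b for a, b in zip(logp_with_lora, logp_without_lora)]

def select_tokens(excess, top_k):
    """Boolean mask keeping the top_k tokens with the largest excess."""
    ranked = sorted(range(len(excess)), key=lambda i: excess[i], reverse=True)
    keep = set(ranked[:top_k])
    return [i in keep for i in range(len(excess))]

tokens = ["The", "answer", "is", "42", "."]
# Hypothetical per-token log-probabilities from the source model.
logp_with_lora = [-0.9, -0.4, -0.3, -0.2, -0.1]
logp_without_lora = [-1.0, -1.6, -0.4, -2.7, -0.1]

excess = contrastive_excess(logp_with_lora, logp_without_lora)
mask = select_tokens(excess, top_k=2)
selected = [tok for tok, keep in zip(tokens, mask) if keep]
print([round(e, 2) for e in excess], selected)
```

Tokens that the adapter barely changes, such as function words and punctuation, get an excess near zero, so a mask like this concentrates the transfer signal on the tokens where the adapter actually carries task knowledge.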
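Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Natural Language Processing #AI/NLP for Science #Large Language Models #Vision Language Models #Reinforcement Learning #Code Generation #Representation Learning
🎯 Motivation: Demand is growing in the sciences for high-quality figures generated from text descriptions, and TikZ code is a common representation. Existing Text-to-TikZ datasets are too small and noisy, causing mismatches between text and rendered figures.
❓ Problem: To address small, noisy datasets and methods that rely solely on supervised fine-tuning (SFT), builds a large, high-quality dataset and adds reinforcement learning to align outputs with rendered semantics.
🔍 Analysis: SFT does not expose the model to the rendered semantics of the figure, which often leads to looping, irrelevant content, and incorrect spatial relations.
🛠️ Method: Two-stage pipeline: SFT on the DaTikZ-V4 dataset, then RL with an image encoder trained via inverse graphics providing semantic reward signals. Trains a family of small open-source Qwen models (3B and 8B).
📊 Data and experiments: Builds DaTikZ-V4, more than four times larger than DaTikZ-V3 and substantially higher in quality, enriched with LLM-generated figure descriptions. Human evaluation with over 1,000 judgments on a 5-point scale shows the models outperform their base models and GPT-4o, and match GPT-5 in the image-based evaluation.
⭐ Contributions: Proposes the TikZilla model family, combining a high-quality dataset with RL so that small models reach the performance of much larger models on Text-to-TikZ. Code, data, and models will be released.
Abstract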
Large language models (LLMs) are increasingly used to assist scientists across diverse workflows. A key challenge is generating high-quality figures from textual descriptions, often represented as TikZ programs that can be rendered as scientific images. Prior research has proposed a variety of datasets and modeling approaches for this task. However, existing datasets for Text-to-TikZ are too small and noisy to capture the complexity of TikZ, causing mismatches between text and rendered figures. Moreover, prior approaches rely solely on supervised fine-tuning (SFT), which does not expose the model to the rendered semantics of the figure, often resulting in errors such as looping, irrelevant content, and incorrect spatial relations. To address these issues, we construct DaTikZ-V4, a dataset more than four times larger and substantially higher in quality than DaTikZ-V3, enriched with LLM-generated figure descriptions. Using this dataset, we train TikZilla, a family of small open-source Qwen models (3B and 8B) with a two-stage pipeline of SFT followed by reinforcement learning (RL). For RL, we leverage an image encoder trained via inverse graphics to provide semantically faithful reward signals. Extensive human evaluations with over 1,000 judgments show that TikZilla improves by 1.5-2 points over its base models on a 5-point scale, surpasses GPT-4o by 0.5 points, and matches GPT-5 in the image-based evaluation, while operating at much smaller model sizes. Code, data, and models will be made available.
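Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Procedural Memory #Memory tokens #Continual adaptation #Large language models
🎯 Motivation: LLMs are usually controlled via prompts, which must be re-processed for every query and are hard to reuse modularly. This work explores a more efficient way to control tasks and store procedures.
❓ Problem: Proposes a framework that stores task procedures modularly in trained memory units, reducing context overhead and supporting continual adaptation to new tasks.
🔍 Analysis: By avoiding repeated prompt processing and storing reusable procedures directly, the approach addresses the generation-control and memory-efficiency limitations of existing methods.
🛠️ Method: Designs TokMem, which compiles each task procedure into a single memory token that serves as a generation control signal, while keeping the backbone model frozen.
📊 Data and experiments: Evaluated on 1,000 Super-Natural Instructions tasks (atomic recall) and on multi-step function calling (compositional recall).
⭐ Contributions: TokMem outperforms retrieval-augmented prompting and matches or exceeds parameter-efficient fine-tuning with fewer trainable parameters, while allowing procedures to be added continually.
Abstract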
Large language models are typically controlled via prompts, which must be repeatedly re-processed for every new query and are difficult to reuse modularly. We introduce TokMem, a procedural memory framework that compiles each reusable task procedure into a single trainable memory token. Each token serves as both a procedure index and a generation control signal that steers generation, enabling targeted behaviors with constant-size overhead. TokMem keeps the backbone LLM frozen and stores procedural knowledge entirely in these dedicated units, so new procedures can be added continually without interfering with existing ones. We evaluate TokMem on two settings: atomic recall over 1,000 Super-Natural Instructions tasks and compositional recall on multi-step function-calling. Our results show that TokMem consistently outperforms retrieval-augmented prompting while avoiding repeated context overhead. Moreover, it matches or exceeds parameter-efficient fine-tuning with substantially fewer trainable parameters.
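Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#LLMs #RLHF #DPO #Human Preference Alignment #Token-Importance #Triplet Loss
TL;DR: We propose Token-Importance Guided Direct Preference Optimization (TI-DPO) to better align LLMs with human preferences by using a hybrid weighting mechanism to identify key tokens and a triplet loss to guide the optimization process.
🎯 Motivation: Aligning LLMs with human preferences is key to safe and effective AI interaction, but existing methods are sensitive to data noise and overlook differences in the importance of individual tokens.
❓ Problem: Existing token-level importance estimates cannot adequately handle noise or provide fine-grained semantic control, so preference guidance is insufficient.
🔍 Analysis: Current methods mostly rely on probability prediction or simple weighting schemes to estimate token importance, which cannot achieve accuracy and robustness together or give fine-grained semantic control during optimization.
🛠️ Method: Proposes TI-DPO, which combines gradient attribution with a Gaussian prior in a hybrid weighting mechanism and uses a triplet loss to pull model outputs toward preferred responses and away from non-preferred ones.
📊 Data and experiments: Experiments show TI-DPO achieves higher accuracy and generative diversity than DPO and other RLHF methods, with greater stability and computational efficiency.
⭐ Contributions: Through the hybrid weighting mechanism and triplet loss, achieves robust token-importance estimation and semantic control, improving on existing methods in both alignment quality and efficiency.
Abstract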
Aligning Large Language Models (LLMs) with human preferences is crucial for safe and effective AI interactions. While popular methods like Direct Preference Optimization (DPO) have simplified alignment, they remain sensitive to data noise and overlook the differential importance of individual tokens. Existing token-level approaches often rely on probability prediction or simplistic weighting schemes to obtain token importance, which still cannot fully address these issues. To solve this problem, we propose the Token-Importance Guided Direct Preference Optimization (TI-DPO), a framework that achieves fine-grained semantic control through two synergistic innovations.
First, we propose a novel hybrid weighting mechanism that combines gradient attribution with a Gaussian prior, ensuring both the accuracy and robustness of token importance scores. Second, we employ a triplet loss to provide structured guidance for the optimization, explicitly guiding model outputs to approach preferred responses and diverge from non-preferred ones. Experimental results show that TI-DPO achieves higher accuracy and stronger generative diversity, providing more stable and computationally efficient solutions compared with DPO and other RLHF methods.
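How token importance can enter a DPO-style objective is sketched below. The loss is the standard DPO form, -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)); weighting each token's policy-to-reference log-ratio by an importance score is an assumption about where such weights would plug in. The weights are given by hand rather than derived from gradient attribution or a Gaussian prior, and the triplet term is not covered.

```python
import math

def weighted_log_ratio(policy_logps, ref_logps, weights):
    """Importance-weighted sum of per-token log(pi_theta / pi_ref)."""
    return sum(w * (p - r) for p, r, w in zip(policy_logps, ref_logps, weights))

def dpo_loss(chosen_ratio, rejected_ratio, beta=0.1):
    """-log sigmoid(beta * (chosen - rejected)), written with log1p for stability."""
    margin = beta * (chosen_ratio - rejected_ratio)
    return math.log1p(math.exp(-margin))

# Hypothetical per-token log-probabilities for one preference pair.
chosen_policy, chosen_ref = [-0.5, -0.2, -1.0], [-0.6, -0.9, -1.1]
rejected_policy, rejected_ref = [-0.7, -1.5, -0.4], [-0.6, -0.8, -0.5]

uniform = [1.0, 1.0, 1.0]
focused = [0.2, 2.6, 0.2]  # hypothetical importance scores emphasizing the second token

for name, weights in [("uniform", uniform), ("focused", focused)]:
    chosen = weighted_log_ratio(chosen_policy, chosen_ref, weights)
    rejected = weighted_log_ratio(rejected_policy, rejected_ref, weights)
    print(name, round(dpo_loss(chosen, rejected), 4))
```

With uniform weights this reduces to ordinary sequence-level DPO; concentrating weight on the token where the two responses differ most makes the same pair contribute a larger margin.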
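Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#reasoning model #tool-integrated reasoning #self-evolved training #information entropy
🎯 Motivation: LLMs performing tool-integrated reasoning (TIR) are often inefficient and unstable; a framework that improves TIR is needed.
❓ Problem: Addresses too few tool calls, too many tool calls, and overthinking after receiving tool results, to improve reasoning efficiency and accuracy.
🔍 Analysis: Information-entropy analysis shows that tool call results markedly change the entropy of subsequent reasoning, and that the overall entropy of the reasoning chain varies with the number of tool calls.
🛠️ Method: Proposes Tool-Light, consisting of dataset construction and multi-stage fine-tuning. Dataset construction uses continuous self-evolved sampling that combines vanilla and entropy-guided sampling, with strict criteria for selecting positive-negative pairs; training has two stages, SFT followed by self-evolved DPO.
📊 Data and experiments: Tests on 10 datasets show that Tool-Light significantly improves efficiency and accuracy on TIR tasks.
⭐ Contributions: Offers an information-entropy perspective on how tool calls affect the reasoning process, and proposes the Tool-Light framework to advance tool-integrated reasoning.
Abstract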
Tool-Integrated Reasoning (TIR) enables large language models (LLMs) to enhance their internal reasoning ability by integrating external tools. However, models with TIR often exhibit suboptimal behaviors, including insufficient tool calls, excessive tool calls, and overthinking after receiving tool call results. How to empower LLMs to perform TIR efficiently and accurately, while stabilizing the reasoning process, remains an open challenge.
In this paper, we first analyze the impact of tool calls on model reasoning from the perspective of information entropy. We find that when tool call results are provided, the information entropy of subsequent reasoning content will show a clear trend of change, and the overall information entropy of the reasoning chain will vary depending on the number of tool calls. Based on these observations, we propose Tool-Light, a framework designed to encourage LLMs to perform TIR efficiently and accurately. Our framework consists of dataset construction and multi-stage fine-tuning. For dataset construction, we use the trained model for continuous self-evolved sampling, integrating two methods: vanilla sampling and entropy-guided sampling. At the same time, during the sampling process, we design strict criteria for selecting positive-negative pairs. For the training process, we introduce a two-stage method, which includes Supervised Fine-Tuning (SFT) and Self-Evolved Direct Preference Optimization (DPO).
Test results on 10 datasets reveal the effectiveness of Tool-Light, significantly improving the efficiency and accuracy of the model in completing TIR tasks.
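The entropy analysis rests on a simple quantity: the Shannon entropy of the next-token distribution at each generation step. The sketch below computes it per step and averages over the tokens before and after a tool result. The distributions are invented for illustration (the abstract reports that entropy changes after a tool result but not by how much or in which direction), and the function names are hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(step_distributions):
    """Average per-token entropy over a span of the reasoning chain."""
    return sum(token_entropy(p) for p in step_distributions) / len(step_distributions)

# Hypothetical next-token distributions over a 4-token vocabulary.
before_tool_result = [
    [0.40, 0.30, 0.20, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]
after_tool_result = [
    [0.90, 0.05, 0.03, 0.02],
    [0.97, 0.01, 0.01, 0.01],
]

h_before = mean_entropy(before_tool_result)
h_after = mean_entropy(after_tool_result)
print(round(h_before, 3), round(h_after, 3))
```

A per-position entropy trace of this kind is also what an entropy-guided sampler could use to decide where in a reasoning chain to branch.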
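Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#LLM #scalable oversight #weak supervision #agentic systems
🎯 Motivation: As AI systems approach and surpass expert human performance, high-quality human supervision for evaluation and training becomes hard to obtain, especially for tasks that require deep knowledge of multiple domains.
❓ Problem: Because human experts have deep knowledge of only one narrow area and cannot fully evaluate superhuman tasks, proposes a scalable oversight framework that uses weak signals and reduces reliance on full ground-truth labels.
🔍 Analysis: Based on their domain expertise, experts can still provide a weak signal, such as indicating that an option is incorrect, and this signal helps assess the correctness of advanced AI systems.
🛠️ Method: Derives an unbiased estimator of top-1 accuracy from complementary labels, introduces two estimators that combine scarce ordinary labels with abundant complementary labels, quantifies how many complementary labels are needed, and gives finite-sample deviation guarantees.
📊 Data and experiments: Evaluates LLM outputs without ground-truth labels, and shows that an agentic AI system can be trained with complementary labels to improve itself.
⭐ Contributions: Proposes a supervision framework based on complementary labels, with theoretical guarantees and practical tools, and demonstrates its effectiveness for evaluating and training advanced AI systems.
Abstract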
As artificial intelligence (AI) systems approach and surpass expert human performance across a broad range of tasks, obtaining high-quality human supervision for evaluation and training becomes increasingly challenging.
Our focus is on tasks that require deep knowledge and skills of multiple domains, where this bottleneck is severe.
Unfortunately, even the best human experts are knowledgeable only in a single narrow area, and will not be able to evaluate the correctness of advanced AI systems on such superhuman tasks.
However, based on their narrow expertise, humans may provide a weak signal, i.e., a *complementary label* indicating an option that is incorrect. For example, a cardiologist could state that "this is not related to any cardiovascular disease," even if they cannot identify the true disease.
Based on this weak signal, we propose a scalable oversight framework that enables us to evaluate frontier AI systems without the need to prepare the ground truth.
We derive an *unbiased* estimator of top-1 accuracy from complementary labels and quantify how many complementary labels are needed to match the variance of ordinary labels. We further introduce two estimators to combine scarce ordinary labels with abundant complementary labels.
We provide finite-sample deviation guarantees for both complementary-only and the mixed estimators.
Empirically, we show that we can evaluate the output of large language models without the ground truth, if we have complementary labels.
We further show that we can train an AI system with such weak signals: we show how we can design an agentic AI system automatically that can improve itself with this partitioned human supervision. Our code is available at https://github.com/R-Yin-217/Towards-Scalable-Oversight-via-Partitioned-Human-Supervision.
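To make the idea of estimating accuracy from complementary labels concrete, here is a minimal simulation sketch. It assumes each complementary label is drawn uniformly from the K - 1 incorrect options; under that assumption a prediction matches the complementary label with probability (1 - acc) / (K - 1), so acc = 1 - (K - 1) * P(match). The function name and the simulation setup are hypothetical, and the paper's estimator may differ in its assumptions.

```python
import random

def accuracy_from_complementary(preds, comp_labels, num_options):
    """Estimate top-1 accuracy from complementary labels.

    Assumes each complementary label is drawn uniformly from the
    num_options - 1 incorrect options (an assumption of this sketch).
    """
    hits = sum(p == c for p, c in zip(preds, comp_labels))
    return 1.0 - (num_options - 1) * hits / len(preds)

# Simulated data: a model with 70% accuracy on 4-option questions.
rng = random.Random(0)
K, n, true_acc = 4, 20000, 0.7
preds, comps, n_correct = [], [], 0
for _ in range(n):
    truth = rng.randrange(K)
    wrong = [k for k in range(K) if k != truth]
    pred = truth if rng.random() < true_acc else rng.choice(wrong)
    preds.append(pred)
    comps.append(rng.choice(wrong))  # expert names one incorrect option
    n_correct += pred == truth

estimate = accuracy_from_complementary(preds, comps, K)
empirical = n_correct / n
print(f"complementary-label estimate: {estimate:.3f}, true: {empirical:.3f}")
```

The estimate never looks at `truth`; the ground-truth count is kept only to check the estimator in the simulation.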
#Large Language Models #Strategic Behavior #Information Design
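🎯 Motivation: LLMs show persuasive capabilities comparable to those of humans, but the effectiveness of persuasion varies widely across domains, and a systematic evaluation framework is lacking.
❓ Problem addressed: Grounded in Bayesian persuasion theory, provide a scalable and principled framework for evaluating and improving the persuasive capabilities of LLMs.
🔍 Analysis: Frontier models consistently achieve high persuasion gains in the experiments, and the strategies they adopt align with theoretical predictions.
🛠️ Method: Repurposes human-human persuasion datasets as evaluation and training environments, and uses reinforcement learning to optimize LLMs for strategic persuasion.
📊 Data and experiments: Small and frontier LLMs are trained and evaluated in environments built from human persuasion datasets.
⭐ Contributions: A theory-grounded evaluation framework, evidence of strategic persuasion ability in LLMs, and significantly higher persuasion gains for small models through reinforcement learning.
Abstract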
Large language models (LLMs) have demonstrated strong persuasive capabilities comparable to those of humans, offering promising benefits while raising societal concerns. However, systematically evaluating the persuasive capabilities of LLMs is inherently challenging, as the effectiveness of persuasion among humans varies significantly across different domains. In this paper, we take a theory-driven approach to provide a scalable and principled framework for studying the persuasive capabilities of LLMs. Grounded in Bayesian persuasion theory, we repurpose human-human persuasion datasets to construct environments for evaluating and training LLMs as strategic persuaders. Our results reveal that frontier models can consistently achieve high persuasion gains and exhibit sophisticated persuasion strategies that align with theoretical characterizations. Building on this, we use reinforcement learning to train LLMs for strategic persuasion in our environments. Our results also demonstrate that even small LLMs can obtain significantly higher persuasion gains through reinforcement learning.
#Large language model alignment #preference data #influence function
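TL;DR: We assess preference data quality through our newly proposed truncated influence function (TIF), and then propose a set of candidate scoring functions that are positively correlated with TIF to select valuable preference data.
🎯 Motivation: LLM alignment relies on learning from human preferences, so the quality of preference data is critical. Existing work mostly pre-processes data with external reward models, which improves overall results but does not examine whether individual data points are actually beneficial.
❓ Problem addressed: Proposes a truncated influence function (TIF) that measures the influence of individual data points on validation data and mitigates the over-scoring of traditional measures. The goal is preference data selection adapted to specific models.
🔍 Analysis: Preference data quality is model-dependent: data that benefits one model may harm another. Simple scoring functions correlate positively with TIF but exhibit errors.
🛠️ Method: Introduces two scoring functions that are computationally simpler than TIF and positively correlated with it, then combines them to offset their different error sources, yielding a simple and effective data selection rule.
📊 Data and experiments: Experiments across diverse alignment benchmarks and multiple LLM families show that better alignment performance can be achieved with less data, supporting the generality of the method.
⭐ Contributions: TIF as a measure of preference data quality, together with simpler scoring functions; an efficient data selection rule; and evidence that more precise data selection improves alignment.
Abstract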
Large language model (LLM) alignment is typically achieved through learning from human preference comparisons, making the quality of preference data critical to its success. Existing studies often pre-process raw training datasets to identify valuable preference pairs using external reward models or off-the-shelf LLMs, achieving improved overall performance but rarely examining whether each individual selected data point is genuinely beneficial. We assess data quality through individual influence on validation data using our newly proposed truncated influence function (TIF), which mitigates the over-scoring present in traditional measures and reveals that preference data quality is inherently a property of the model. In other words, a data pair that benefits one model may harm another. This calls for preference data selection approaches that adapt to specific models. To this end, we introduce two candidate scoring functions (SFs) that are computationally simpler than TIF and positively correlated with it. They are also model dependent and can serve as potential indicators of individual data quality for preference data selection. Furthermore, we observe that these SFs inherently exhibit errors when compared to TIF. We therefore combine them to offset their diverse error sources, resulting in a simple yet effective data selection rule that enables the models to achieve a more precise selection of valuable preference data. We conduct experiments across diverse alignment benchmarks and various LLM families, with results demonstrating that better alignment performance can be achieved using less data, showing the generality of our findings and new methods. Our code is publicly available at https://github.com/tmlr-group/TIF_LossDiff-IRM.
#Evaluation #Large language model
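🎯 Motivation: Existing language model benchmarks yield contradictory model rankings, which hampers model selection and comparison and adds confusion to a growing ecosystem of competing models.
❓ Problem addressed: Measure model potential by giving every model identical benchmark-specific fine-tuning before evaluation, resolving the ranking contradictions of direct evaluation.
🔍 Analysis: Traditional rankings show little external validity under direct evaluation. With train-before-test, rankings become highly consistent across benchmarks, and the link between perplexity and downstream task performance is restored.
🛠️ Method: Proposes the train-before-test framework, which fine-tunes each model identically on each benchmark before comparing potential, and validates it through extensive experiments.
📊 Data and experiments: Experiments cover 24 benchmarks and 61 models, evaluating post-fine-tuning potential and showing that potential rankings are dominated by a single latent factor.
⭐ Contributions: A train-before-test perspective on model evaluation that markedly improves ranking consistency and external validity, offers insight into achievable performance after adaptation, and shows that the model-score matrix is essentially rank one.
Abstract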
Existing language model benchmarks provide contradictory model rankings, even for benchmarks that aim to capture similar skills. This dilemma of conflicting rankings hampers model selection, clouds model comparisons, and adds confusion to a growing ecosystem of competing models. In this paper, we take a different perspective on model comparison: instead of relying on out-of-the-box performance via direct evaluation, we compare model potential by providing each model with identical benchmark-specific fine-tuning before evaluation. We call this approach train-before-test. Our primary contribution is a comprehensive empirical evaluation of model potential across 24 benchmarks and 61 models. First, we demonstrate that model potential rankings obtained through train-before-test exhibit remarkable consistency across all benchmarks. Whereas traditional rankings demonstrate little external validity under direct evaluation, they enjoy a significant degree of external validity when applying train-before-test: model potential rankings transfer gracefully from one benchmark to another. Second, train-before-test restores the connection between perplexity and downstream task performance, lost under direct evaluation. Remarkably, even pre-finetuning perplexity of a base model predicts post-finetuning downstream performance, suggesting that ranking consistency reflects inherent model potential rather than fine-tuning artifacts. Finally, train-before-test reduces the model-score matrix to essentially rank one, indicating that model potential is dominated by one latent factor, uncovered by train-before-test. While direct evaluation remains useful for assessing deployment-ready performance, train-before-test provides a complementary lens for understanding achievable performance of models after adaptation.
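Ranking consistency across benchmarks can be quantified with a rank-correlation measure. The sketch below uses Kendall's tau (the simple tau-a variant without tie correction) on hypothetical scores for four models on two benchmarks; the numbers are invented for illustration, and the paper may use a different agreement metric.

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same models."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical scores of four models on two benchmarks.
direct_a, direct_b = [0.60, 0.55, 0.40, 0.35], [0.30, 0.50, 0.45, 0.20]
tuned_a, tuned_b = [0.80, 0.75, 0.70, 0.60], [0.70, 0.66, 0.61, 0.50]

tau_direct = kendall_tau(direct_a, direct_b)
tau_tuned = kendall_tau(tuned_a, tuned_b)
print(f"direct evaluation tau: {tau_direct:.2f}, train-before-test tau: {tau_tuned:.2f}")
```

A tau of 1 means the two benchmarks order the models identically, which is the kind of agreement the abstract reports after identical fine-tuning.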
#LLM #Continuous Normalizing Flow #Diffusion Model #RLAIF #Explainable AI
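🎯 Motivation: As humans increasingly share environments with diverse agents, explaining agent policies in natural language is vital for reliable collaboration. Existing methods do not adequately capture human feedback when generating explanations, so a better generation mechanism is needed to make explanations more predictive and logically sound.
❓ Problem addressed: How to train LLMs that generate high-quality policy explanations while matching the human reward distribution and lowering cognitive load. Existing RLAIF and RLHF methods are limited in explanation quality and reward fidelity.
🔍 Analysis: Human judgments about explanations are pluralistic and probabilistic, which continuous normalizing flows (CNFs) can capture. Proxy rewards from LLMs do not fully capture this diversity, so explanations may drift from true human preferences.
🛠️ Method: Proposes a framework that generates distributional rewards with CNFs and optimizes explanation-generating LLMs through reinforcement learning from AI feedback. A specialized CNF architecture attends to linguistic cues in the decision context and explanations to improve reward generation.
📊 Data and experiments: Human and LLM evaluators assess the generated explanations for predictive accuracy, logical soundness, and actionability. The method outperforms training with proxy LLM rewards and existing RLHF and RLAIF baselines.
⭐ Contributions: A CNF-based reward generation framework that improves the predictive value, logical soundness, and cognitive efficiency of LLM explanations, and a specialized CNF architecture offering a general-purpose approach to natural language policy explanation.
Abstract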
As humans increasingly share environments with diverse agents powered by RL, LLMs, and beyond, the ability to explain agent policies in natural language is vital for reliable coexistence. We introduce a general-purpose framework that trains explanation-generating LLMs via reinforcement learning from AI feedback, with distributional rewards generated by generative continuous normalizing flows (CNFs). CNFs capture the pluralistic and probabilistic nature of human judgments about explanations. Moreover, under mild assumptions, CNFs provably bound deviations from true human reward distributions when trained on noisy proxy rewards from LLMs. We design a specialized CNF architecture that selectively attends to linguistic cues in decision context and explanations when generating rewards. Human and LLM evaluators find that our method delivers explanations that enable more accurate predictions of true agent decisions, exhibit greater logical soundness and actionability, and impose lower cognitive load than explanations trained with proxy LLM rewards or state-of-the-art RLHF and RLAIF baselines.
#LLM-as-a-Judge #LLM Evaluation #Large Language Models
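🎯 Motivation: Using LLMs as automated evaluators (LLM-as-a-Judge) has exposed major inconsistencies in scoring and in preference transitivity, which undermine the reliability of evaluation results.
❓ Problem addressed: Identify and mitigate two core inconsistencies, score-comparison inconsistency and pairwise transitivity inconsistency, which arise from information loss and ambiguous tie judgments.
🔍 Analysis: Defines and traces the sources of score-comparison inconsistency (a lower-rated response wins a pairwise comparison against a higher-rated one) and pairwise transitivity inconsistency (such as circular preference chains and equivalence contradictions).
🛠️ Method: Proposes the TrustJudge framework, which uses distribution-sensitive scoring to preserve information entropy and likelihood-aware aggregation to resolve inconsistencies, while achieving more precise scoring.
📊 Data and experiments: With Llama-3.1-70B-Instruct as judge, TrustJudge reduces score-comparison inconsistency from 23.32% to 14.89% and pairwise transitivity inconsistency from 15.22% to 4.40%, while maintaining higher evaluation accuracy.
⭐ Contributions: The first systematic analysis of inconsistency in LLM-as-a-Judge frameworks, and the TrustJudge framework, which offers theoretical insights and practical solutions that improve the reliability and consistency of LLM evaluation.
Abstract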
The adoption of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistencies: (1) *Score-Comparison Inconsistency*, where lower-rated responses outperform higher-scored ones in pairwise comparisons, and (2) *Pairwise Transitivity Inconsistency*, manifested through circular preference chains ($A > B > C > A$) and equivalence contradictions ($A = B = C \neq A$). We argue that these issues come from information loss in discrete rating systems and ambiguous tie judgments during pairwise evaluation. We propose **TrustJudge**, a probabilistic framework that addresses these limitations through two key innovations: (1) *distribution-sensitive scoring* that computes continuous expectations from discrete rating probabilities, preserving information entropy for more precise scoring, and (2) *likelihood-aware aggregation* that resolves transitivity violations using bidirectional preference probabilities or perplexity. We also formalize the theoretical limitations of current LLM-as-a-judge frameworks and demonstrate how TrustJudge’s components overcome them. When evaluated with Llama-3.1-70B-Instruct as judge using our dataset, TrustJudge reduces Score-Comparison inconsistency by 8.43 percentage points (from 23.32% to 14.89%) and Pairwise Transitivity inconsistency by 10.82 percentage points (from 15.22% to 4.40%), while maintaining higher evaluation accuracy. Our work provides the first systematic analysis of evaluation framework inconsistencies in LLM-as-a-judge paradigms, offering both theoretical insights and practical solutions for reliable automated assessment. The framework demonstrates consistent improvements across various model architectures and scales, enabling more trustworthy LLM evaluation without requiring additional training or human annotations.
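Distribution-sensitive scoring, as described in the abstract, replaces a single discrete rating with an expectation over rating probabilities. The sketch below shows why this preserves information: two responses with the same most-likely rating get different expected scores. The probability tables are hypothetical, and how TrustJudge obtains them from the judge is not specified in the abstract.

```python
def modal_score(rating_probs):
    """Discrete score: the single most probable rating."""
    return max(rating_probs, key=rating_probs.get)

def expected_score(rating_probs):
    """Continuous score: expectation over the rating distribution."""
    return sum(rating * prob for rating, prob in rating_probs.items())

# Hypothetical judge probabilities over a 1-5 rating scale.
response_a = {1: 0.00, 2: 0.05, 3: 0.15, 4: 0.45, 5: 0.35}
response_b = {1: 0.05, 2: 0.20, 3: 0.30, 4: 0.45, 5: 0.00}

for name, probs in [("A", response_a), ("B", response_b)]:
    print(name, "modal:", modal_score(probs), "expected:", round(expected_score(probs), 2))
```

Both responses receive a discrete rating of 4, so a pairwise comparison favoring A would look like a tie on scores; the expected scores separate them.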
#Verification #Co-Alignment #Preference-Aligned LLM Annotations #Reference-Free Metric
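🎯 Motivation: LLMs are expected to be culturally customisable and personally aligned, but existing methods cannot meet diverse user needs because of high annotation costs or the constraints of their pretraining distributions.
❓ Problem addressed: How to align model outputs with diverse, subjective user preferences and improve alignment performance without relying on large-scale annotation.
🔍 Analysis: Current methods struggle to align outputs over unlabelled corpora, especially when the model is overconfident or insufficiently specialised for the task.
🛠️ Method: Proposes the training-free Heterogeneous-Consistency Co-Alignment (HCC) framework, in which a knowledge-rich LLM and a task-specialised lightweight model cooperate. Outputs are verified using consistency and inconsistency signals (the CAI Ratio), and inconsistent samples are recalibrated to user preferences with a non-parametric, embedding-based scheme.
📊 Data and experiments: Across eight NLU datasets and both open- and closed-source LLMs, HCC consistently improves annotation alignment and, in several tasks, enables Llama-3-8B to surpass GPT-3.5/4o-mini.
⭐ Contributions: A reference-free way to scale preference-aligned annotation, the CAI Ratio as a signal that correlates strongly with accuracy, and a remedy for the annotation dependence and subjective-preference acquisition problems of earlier methods.
Abstract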
Large Language Models (LLMs) are increasingly expected to be culturally customisable and personally aligned for natural language understanding (NLU). However, existing methods, from supervised fine-tuning (SFT) to personalised RLHF and prompting, either require costly large-scale annotations or remain constrained by their pretraining distributions. Moreover, acquiring annotations that reflect subjective, diverse, and evolving user preferences is both expensive and labour-intensive. To address these limitations, we propose ***H**eterogeneous-**C**onsistency **C**o-Alignment* (HCC), a training-free annotation paradigm that leverages two heterogeneous models: a knowledge-rich yet potentially overconfident LLM and a task-specialised lightweight model guided by a small user preference set. Together, they verify and co-align misaligned outputs over unlabelled corpora.
For verification, HCC introduces the reference-free ***C**onsistent*-***A**nd*-***I**nconsistent* (**CAI**) Ratio, an uncertainty signal derived from inter-model agreements (consistent samples) and disagreements (inconsistent samples) to determine whether refinement is necessary. For co-alignment, HCC employs a non-parametric, embedding-based preference assignment scheme to recalibrate inconsistent samples according to user preferences.
Across eight NLU datasets and both open- and closed-source LLMs, HCC consistently improves annotation alignment and, in several tasks, enables *Llama-3-8B* to surpass *GPT-3.5/4o-mini* after co-alignment correction. Moreover, CAI strongly correlates with accuracy and tracks pre- and post-alignment gains, offering a reference-free signal for scaling preference-aligned annotation without ground-truth supervision.
#machine learning #vision-language models #deep learning #reinforceme
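TL;DR: We present a multi-reward, multi-loss-objective reinforcement learning training method to improve visual understanding and reduce hallucination.
🎯 Motivation: Vision-language models commonly suffer from visual hallucinations and language shortcuts. The root cause is that post-training methods supervise only final outputs and give no explicit guidance for intermediate visual reasoning, so models over-rely on language priors.
❓ Problem addressed: To counter the sparse signal for visual reasoning, the paper proposes a self-rewarding reinforcement learning method that needs no external visual supervision and aims to strengthen visual understanding and reduce hallucination.
🔍 Analysis: Existing methods depend on human annotations or supervision from external models, which is costly and adds latency, while training purely on output matching lets the model ignore the visual input and distorts its reasoning.
🛠️ Method: A three-stage self-rewarding reinforcement learning method decomposes reasoning into visual and language parts: the model first produces a self-contained visual description, then both parts are jointly optimized with a multi-reward loss. A decoupled reward-advantage framework computes rewards at a finer granularity.
📊 Data and experiments: Experiments on diverse vision-language tasks show improved visual reasoning, fewer hallucinations, and less reliance on language shortcuts, with no extra GPU overhead beyond standard training.
⭐ Contributions: A self-rewarding training framework for vision-language models that strengthens visual reasoning without external supervision through reasoning decomposition and multi-reward optimization. Its decoupled reward mechanism avoids entangling heterogeneous signals and improves efficiency and effectiveness.
Abstract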
Vision-Language Models (VLMs) often suffer from visual hallucinations -- generating things that are not consistent with visual inputs -- and language shortcuts, where they skip the visual part and just rely on text priors. These issues arise because most post-training methods for VLMs rely on simple verifiable answer matching and supervise only final outputs, leaving intermediate visual reasoning without explicit guidance. As a result, VLMs receive sparse visual signals and often learn to prioritize language-based reasoning over visual perception. To mitigate this, some existing methods add visual supervision using human annotations or distilled labels from external large models. However, human annotations are labor-intensive and costly, and external signals can introduce high latency cost.
In this paper, we introduce Vision-SR1, a three-stage self-rewarding reinforcement learning method that improves visual reasoning without relying on external visual supervision. Vision-SR1 decomposes VLM reasoning into two components: visual reasoning and language reasoning, where the model is first prompted to produce self-contained visual descriptions sufficient to answer the question without referring back to the input image, before jointly optimizing both visual and language reasoning through our multi-reward loss objective. To validate this self-containment, the same VLM model is re-prompted to perform language reasoning using only the generated visual reasoning as input to compute visual reward. The final reward is computed through a decoupled reward-advantage framework, where visual reward and language reasoning reward each have their advantages, log probabilities, and KL divergence calculated separately. This decoupling enables more fine-grained reward computation by preventing the entanglement of heterogeneous reward signals. Our experiments show that Vision-SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts across diverse vision-language tasks, while being more efficient than methods that rely on external visual reward models, which require additional GPUs to host. In contrast, Vision-SR1 introduces no extra GPU overhead beyond that of standard training.
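The benefit of computing advantages separately for each reward can be seen in a small numeric sketch. It assumes group-normalized advantages (subtract the group mean, divide by the group standard deviation), which is an assumption of this example; the abstract does not state the normalization, and summing the two advantages at the end is done here only to compare against the entangled variant.

```python
import statistics

def group_advantages(rewards):
    """Normalize rewards within a group of sampled responses (assumed scheme)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Hypothetical rewards for four sampled responses; the scales differ on purpose.
visual_rewards = [1.0, 0.0, 1.0, 0.0]
language_rewards = [10.0, 10.0, 0.0, 0.0]

# Decoupled: normalize each reward stream on its own, then combine.
decoupled = [v + l for v, l in zip(group_advantages(visual_rewards),
                                   group_advantages(language_rewards))]
# Entangled: add raw rewards first, then normalize once.
entangled = group_advantages([v + l for v, l in zip(visual_rewards, language_rewards)])

print("decoupled:", [round(a, 2) for a in decoupled])
print("entangled:", [round(a, 2) for a in entangled])
```

In the entangled variant the larger-scale language reward dominates, so a response with a wrong visual description but a right answer still gets a clearly positive advantage; with separate normalization the two signals carry equal weight.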
#Compositionality #Visual instruction tuning #Complexity
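🎯 Motivation: Visual instruction tuning (VIT) datasets have grown rapidly in scale, but the informativeness of individual training samples has been overlooked. The work studies how sample complexity affects informative data curation, with the aim of improving data efficiency.
❓ Problem addressed: Proposes the COMPACT data recipe, which combines multiple atomic visual capabilities in a single training sample and thereby greatly reduces the amount of training data required. The focus is on raising the information density and complexity of each sample.
🔍 Analysis: Existing dataset selection methods show that a small fraction of informative samples suffices for efficient fine-tuning, but they do not consider sample complexity systematically. Efficient fine-tuning needs both data quality and complexity.
🛠️ Method: COMPACT synthesizes rich text questions for each image, combining several atomic visual capabilities in one training sample so that sample complexity can be scaled.
📊 Data and experiments: On the LLaVA-665K VIT dataset, COMPACT cuts the data budget by 90% while reaching 100.2% of full-data performance, and outperforms full-data training on complex benchmarks such as MM-Vet and MMStar.
⭐ Contributions: A scalable synthetic data generation recipe, COMPACT, that markedly improves data efficiency for vision-language tasks and performs well on complex benchmarks.
Abstract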
Visual instruction tuning (VIT) datasets have grown rapidly in scale, yet the informativeness of individual training samples has largely been overlooked. Recent dataset selection methods have shown that a small fraction of such datasets enriched with informative samples can lead to efficient finetuning of Multimodal Large Language Models. In this work, we explore the impact of sample complexity on informative data curation and introduce COMPACT (COMPositional Atomic-to-complex Visual Compositional Tuning), a visual compositional tuning data recipe that scales training sample complexity by combining multiple atomic visual capabilities in a single training example. Concretely, we synthesize rich and informative text questions for each image, allowing us to significantly reduce the number of training examples required for effective visual instruction tuning. COMPACT demonstrates superior data efficiency compared to existing data reduction methods. When applied to the LLaVA-665K VIT dataset, COMPACT reduces the data budget by 90% while still achieving 100.2% of the full VIT performance (compared to only 97.5% by the state-of-the-art method) across eight multimodal benchmarks. Further, training on the COMPACT data outperforms training on the full-scale VIT data on particularly complex benchmarks such as MM-Vet (+8.6%) and MMStar (+2.9%). COMPACT offers a scalable and efficient synthetic data generation recipe to improve performance on vision-language tasks.
#LLM data pipeline #Reinforcement learning
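🎯 Motivation: LLMs have succeeded through imitation learning, but this creates a gap between training and generation that limits reasoning. Reinforcement learning is a more data-efficient remedy, yet its use is held back by insufficient data scale.
❓ Problem addressed: Existing RL datasets are far smaller and less diverse than pre-training text corpora and cannot support scaling.
🔍 Analysis: RL data is orders of magnitude smaller than pre-training data, which indicates that the key bottleneck is the lack of an efficient way to generate large-scale, diverse data.
🛠️ Method: Proposes the Webscale-RL pipeline, which systematically converts large-scale documents into diverse, verifiable question-answer pairs, and builds a dataset of 1.2 million examples across more than 9 domains.
📊 Data and experiments: A model trained on the Webscale-RL dataset significantly outperforms continual pre-training and data refinement baselines, matching continual pre-training performance with up to 100× fewer tokens.
⭐ Contributions: A viable path toward scaling RL to pre-training levels, enabling more capable and efficient language models.
Abstract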
Large Language Models (LLMs) have achieved remarkable success through imitation learning on vast text corpora, but this paradigm creates a training-generation gap and limits robust reasoning. Reinforcement learning (RL) offers a more data-efficient solution capable of bridging this gap, yet its application has been constrained by a critical data bottleneck: existing RL datasets are orders of magnitude smaller and less diverse than web-scale pre-training corpora. To address this, we introduce the **`Webscale-RL` pipeline**, a scalable data engine that systematically converts large-scale pre-training documents into millions of diverse, verifiable question-answer pairs for RL. Using this pipeline, we construct the **`Webscale-RL` dataset**, containing 1.2 million examples across more than 9 domains. Our experiments show that the model trained on this dataset significantly outperforms continual pretraining and strong data refinement baselines across a suite of benchmarks. Notably, RL training with our dataset proves substantially more efficient, achieving the performance of continual pre-training with up to 100× fewer tokens. Our work presents a viable path toward scaling RL to pre-training levels, enabling more capable and efficient language models.
#Large Language Model #Exploration
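🎯 Motivation: LLMs explore suboptimally in sequential decision-making, and both their performance on the classic multi-armed bandit task and their generalization need improvement.
❓ Problem addressed: Examine how supervised fine-tuning (SFT) and reinforcement learning (RL) shape LLM exploration strategies, and how well the resulting agents generalize to longer horizons and different bandit families.
🔍 Analysis: Behavioral analysis shows that the gains stem mainly from more sophisticated but greedier exploitation. RL/SFT-trained models tend to abandon exploration prematurely, which leads to early catastrophic failure.
🛠️ Method: Two training paradigms are used: SFT on expert trajectories, and RL with several tailored reward signals, including a strategic reward that reduces variance and an algorithmic reward that enables imitation.
📊 Data and experiments: The trained models outperform pre-trained models on 6× longer horizons and across bandit families, reaching performance comparable to UCB and Thompson Sampling. Behavioral and generalization analyses are also reported.
⭐ Contributions: Clarifies when each training paradigm is preferable, and advocates tailored reward design and evaluation beyond average regret to promote more robust exploratory behavior in LLMs.
Abstract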
While Large Language Models (LLMs) hold promise to become autonomous agents, they often explore suboptimally in sequential decision-making. Recent work has sought to enhance this capability via supervised fine-tuning (SFT) or reinforcement learning (RL), improving regret on the classic multi-armed bandit task. However, it remains unclear how these learning methods shape exploration strategies and how well they generalize. We investigate both paradigms by training LLMs with SFT on expert trajectories and RL with a range of tailored reward signals including a strategic, regret-shaped reward to reduce variance, and an algorithmic reward that enables oracle imitation. The resulting agents outperform pre-trained models and achieve performance comparable to Upper Confidence Bound (UCB) and Thompson Sampling, with robust generalization to 6× longer horizons and across bandit families. Behavioral analysis reveals that gains often stem from more sophisticated but greedier exploitation: RL/SFT agents are more prone to early catastrophic failure than pre-trained models, prematurely abandoning exploration. Furthermore, agents trained to imitate UCB learn to outperform their teacher by adopting more exploitative variants. Our findings clarify when each training paradigm is preferable and advocate tailored reward design and evaluation beyond average regret to promote robust exploratory behavior.
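For readers unfamiliar with the UCB baseline mentioned above, the following is a minimal sketch of the textbook UCB1 algorithm on a Bernoulli bandit, tracking cumulative pseudo-regret. The arm means, horizon, and seed are arbitrary choices for illustration; this is the classical algorithm, not the LLM agents or exact baseline configuration studied in the paper.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on a Bernoulli bandit; return pull counts and pseudo-regret."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts, sums = [0] * k, [0.0] * k
    best_mean, regret = max(arm_means), 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull every arm once to initialize
        else:
            # empirical mean plus an exploration bonus that shrinks with pulls
            arm = max(range(k), key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best_mean - arm_means[arm]
    return counts, regret

counts, regret = ucb1([0.2, 0.5, 0.8], horizon=2000)
print("pulls per arm:", counts, "pseudo-regret:", round(regret, 1))
```

The exploration bonus is what keeps UCB1 from abandoning an arm too early, which is the failure mode the abstract attributes to the greedier fine-tuned agents.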
#Large language model #Preference alignment
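TL;DR: We propose Confidence-Weighted Preference Optimization (CW-PO) for preference alignment, which effectively leverages weak LLMs as annotators.
🎯 Motivation: Preference alignment is a key step in adapting LLMs to human values, but conventional approaches are costly and depend on human annotation or large API-based models.
❓ Problem addressed: Examine whether a weak LLM's high-confidence samples can replace human annotation, lowering cost while improving performance.
🔍 Analysis: Selecting only the weak LLM's highly confident samples yields substantially better performance than using full human annotations.
🛠️ Method: Proposes the CW-PO framework, which re-weights training samples by the weak LLM's confidence and applies to multiple preference optimization objectives.
📊 Data and experiments: A model aligned with CW-PO using only 20% of the human annotations outperforms a model trained with 100% of the annotations under standard DPO.
⭐ Contributions: Shows that weak LLMs paired with confidence weighting can sharply reduce the cost of preference alignment while outperforming methods trained on fully human-labeled data.
Abstract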
Preference alignment is an essential step in adapting large language models (LLMs) to human values, but existing approaches typically depend on costly human annotations or large-scale API-based models. We explore whether a weak LLM can instead act as an effective annotator. We surprisingly find that selecting only a subset of a weak LLM's highly confident samples leads to substantially better performance than using full human annotations. Building on this insight, we propose **C**onfidence-**W**eighted **P**reference **O**ptimization (CW-PO), a general framework that re-weights training samples by a weak LLM’s confidence and can be applied across different preference optimization objectives. Notably, the model aligned by CW-PO with just 20% of human annotations outperforms the model trained with 100% of annotations under standard DPO. These results suggest that weak LLMs, when paired with confidence weighting, can dramatically reduce the cost of preference alignment while even outperforming methods trained on fully human-labeled data.
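A rough sketch of confidence-based re-weighting applied to the standard DPO objective is shown below. The DPO loss itself is the well-known form; the weighting function `confidence_weight` is a hypothetical choice made for this example (zero at chance level, one at full certainty), since the abstract does not give CW-PO's exact weighting.

```python
import math

def dpo_loss(policy_margin, ref_margin, beta=0.1):
    """Standard DPO loss for one pair.

    Each margin is log p(chosen) - log p(rejected) under the given model.
    """
    z = beta * (policy_margin - ref_margin)
    return math.log1p(math.exp(-z))  # equals -log(sigmoid(z))

def confidence_weight(p_preferred):
    """Hypothetical weight from the weak annotator's preference probability."""
    return abs(2.0 * p_preferred - 1.0)

def weighted_loss(pairs, beta=0.1):
    """Confidence-weighted average of per-pair DPO losses.

    pairs: (policy_margin, ref_margin, weak_annotator_probability) tuples.
    """
    total = sum(confidence_weight(p) * dpo_loss(pm, rm, beta) for pm, rm, p in pairs)
    return total / sum(confidence_weight(p) for _, _, p in pairs)

# Hypothetical pairs: the second label is a coin flip and gets zero weight.
pairs = [(2.0, 0.0, 0.95), (-1.0, 0.0, 0.50)]
print(round(weighted_loss(pairs), 4))
```

Because the weight multiplies a per-pair loss, the same wrapper could be placed around other preference objectives, consistent with the abstract's claim that the framework is objective-agnostic.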
#Reasoning Distillation
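🎯 Motivation: Reasoning distillation has drawn attention as a low-cost way to improve student models, but the origins of a distilled model's capabilities remain under-analyzed, particularly whether its behavior stays consistent at test time.
❓ Problem addressed: Determine whether a distilled model keeps behaving like the teacher in test-time contexts or regresses to the student's original output patterns, which bears on how well reasoning distillation generalizes.
🔍 Analysis: A cross-model Reasoning Distillation Provenance Tracing framework traces the origin of each action the distilled model generates, showing that the model can produce teacher-originated actions and that these help explain its performance.
🛠️ Method: Proposes a teacher-guided data selection method that directly compares teacher-student divergences on the training data, replacing heuristic selection with a more principled criterion.
📊 Data and experiments: The method is validated across several teacher-student combinations, with teachers including Deepseek-R1-671B and QwQ-32B and students including Qwen2.5-7B-Instruct.
⭐ Contributions: A provenance tracing framework that reveals where distilled performance comes from, a teacher-guided data selection method that improves distillation, and new experimental and methodological support for reasoning distillation research.
Abstract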
Reasoning distillation, a cost-effective approach for enhancing student model performance, has attracted increasing attention. It typically leverages a large teacher model to generate reasoning paths, which are then used to fine-tune a student model so that it mimics the teacher's behavior in training contexts.
However, previous approaches have lacked a detailed analysis of the origins of the distilled model's capabilities. It remains unclear whether the student can maintain consistent behaviors with the teacher in novel test-time contexts, or whether it regresses to its original output patterns, raising concerns about the generalization of distillation models.
To analyse this question, we introduce a cross-model Reasoning Distillation Provenance Tracing framework. For each action (e.g., a sentence) produced by the distilled model, we obtain the predictive probabilities assigned by the teacher, the original student, and the distilled model under the same context. By comparing these probabilities, we classify each action into four categories: (i) teacher-originated actions, (ii) student-originated actions, (iii) pre-existing actions in both models not enhanced by distillation, and (iv) pre-existing actions boosted through distillation. By systematically disentangling the provenance of each action, we experimentally demonstrate that, in test-time contexts, the distilled model can indeed generate teacher-originated actions, which correlate with and plausibly explain the observed performance of the distilled model. Building on this analysis, we further propose a teacher-guided data selection method. Unlike prior approaches that rely on heuristics (e.g., selecting data most aligned with the student's original distribution), our method directly compares teacher–student divergences on the training data, providing a principled selection criterion. We validate the effectiveness of our approach across multiple representative teacher models (Deepseek-R1-671B, QwQ-32B, GPT-OSS-120B) and diverse student models (Qwen2.5-7B-Instruct, Qwen4-4B-Base, Qwen3-8B-Base, Qwen3-4B-Instruct-2507). The results highlight the utility of our provenance-tracing framework and underscore its promise for reasoning distillation. We hope to share Reasoning Distillation Provenance Tracing, along with our insights into reasoning distillation, with the community.
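The four-way classification can be pictured with a toy decision rule. Everything specific below is hypothetical: the single probability threshold `high`, the comparison used to decide whether an action was boosted, and the extra "unattributed" bucket for actions neither base model favors are assumptions of this sketch, since the abstract does not give the actual criteria.

```python
def classify_action(p_teacher, p_student, p_distilled, high=0.5):
    """Assign a provenance label from three probabilities of the same action.

    p_teacher, p_student, p_distilled: probability of the action under the
    teacher, the original student, and the distilled model in the same context.
    """
    teacher_likes = p_teacher >= high
    student_likes = p_student >= high
    if teacher_likes and not student_likes:
        return "teacher-originated"
    if student_likes and not teacher_likes:
        return "student-originated"
    if teacher_likes and student_likes:
        # hypothetical rule: boosted if distillation raised the probability
        if p_distilled > p_student:
            return "pre-existing, boosted"
        return "pre-existing, not enhanced"
    return "unattributed"  # bucket added by this sketch only

print(classify_action(0.9, 0.1, 0.8))  # teacher-originated
print(classify_action(0.8, 0.7, 0.9))  # pre-existing, boosted
```

Counting labels over all actions in a test-time trace then gives a simple profile of where the distilled model's behavior comes from.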
#Direct Preference Optimization #Reinforcement Learning #Reinforcement learning with human feedback
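TL;DR: DPO is not sound by design and can fail due to misspecification; we fix it with careful analysis.
🎯 Motivation: Direct Preference Optimization (DPO) fine-tunes models on preference data using supervised learning, but viewed as a statistical estimation problem it has a design flaw that can produce incorrect outcomes and fragile behavior.
❓ Problem addressed: Address the misspecification that arises in DPO from the limits of the policy class, and its instability with respect to the preference data distribution.
🔍 Analysis: When the reward function that generates the preferences cannot be represented by the policy class, DPO exhibits failure modes such as preference order reversal, worsening of policy reward, and high sensitivity to the data distribution.
🛠️ Method: Proposes AuxDPO, which introduces auxiliary variables into the loss function, guided by a geometric analysis, to move toward the RLHF solution and mitigate DPO's design flaw.
📊 Data and experiments: Experiments on didactic bandit settings and LLM alignment tasks demonstrate the superior performance of AuxDPO.
⭐ Contributions: Exposes a theoretical flaw in DPO, proposes AuxDPO as an RLHF-informed remedy, and verifies its practical gains experimentally.
Abstract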
Direct alignment algorithms such as Direct Preference Optimization (DPO) fine-tune models based on preference data, using only supervised learning instead of two-stage reinforcement learning from human feedback (RLHF). We show that DPO encodes a statistical estimation problem over reward functions induced by a parametric policy class. When the true reward function that generates preferences cannot be realized via the policy class, DPO becomes misspecified, resulting in failure modes such as preference order reversal, worsening of policy reward, and high sensitivity to the input preference data distribution. On the other hand, we study the local behavior of two-stage RLHF for a parametric class and relate it to a natural gradient step in policy space. Our fine-grained geometric characterization allows us to propose AuxDPO, which introduces additional auxiliary variables in the DPO loss function to help move towards the RLHF solution in a principled manner and mitigate the misspecification in DPO. We empirically demonstrate the superior performance of AuxDPO on didactic bandit settings as well as LLM alignment tasks.
#Reward Models #Language Models #Generalization #Distribution Shifts
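🎯 Motivation: Implicit reward models (IM-RMs) can be defined from any language model without architectural changes, yet they generally generalize worse than explicit reward models (EX-RMs), especially under distribution shift. The paper investigates the root cause of this gap.
❓ Problem addressed: IM-RMs and EX-RMs are nearly identical but perform differently. The paper studies the implicit biases that affect generalization, tests the resulting theory, and points to ways IM-RMs could be improved.
🔍 Analysis: IM-RMs rely too heavily on superficial token-level cues, which hurts performance both under distribution shift and in-distribution. The paper also gives evidence against alternative hypotheses, such as the claim that IM-RMs struggle on tasks where generation is harder than verification.
🛠️ Method: Compares IM-RMs and EX-RMs through theory and experiments, focusing on how each computes the reward; an EX-RM applies a dedicated linear head over the language model's hidden representations.
📊 Data and experiments: Evaluates the generalization of IM-RMs and EX-RMs on several task datasets, under distribution shift and in-distribution, measures the influence of superficial cues, and tests several hypotheses.
⭐ Contributions: Identifies over-reliance on token-level cues as the root cause of limited IM-RM generalization, with theoretical and experimental support showing how design choices affect generalization behavior, offering a new perspective for improving reward models.
Abstract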
Reward models are key to language model post-training and inference pipelines. Conveniently, recent work showed that every language model defines an implicit reward model (IM-RM), without requiring any architectural changes. However, such IM-RMs tend to generalize worse, especially out-of-distribution, compared to explicit reward models (EX-RMs) that apply a dedicated linear head over the hidden representations of a language model. The existence of a generalization gap is puzzling, as EX-RMs and IM-RMs are nearly identical. They can be trained using the same data, loss function, and language model, and differ only in how the reward is computed. Toward a fundamental understanding of the implicit biases underlying different reward model types, we investigate the root cause of this gap. Our main finding, backed by theory and experiments, is that IM-RMs rely more heavily on superficial token-level cues. Consequently, they often generalize worse than EX-RMs under token-level distribution shifts, as well as in-distribution. Furthermore, we provide evidence against alternative hypotheses for the generalization gap. Most notably, we challenge the claim that IM-RMs struggle in tasks where generation is harder than verification because they can operate both as a verifier and a generator. Overall, our results highlight that seemingly minor design choices can substantially impact the generalization behavior of reward models.
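The two reward types differ only in how the reward is read out, which a few lines make concrete. This sketch assumes the DPO-style definition of the implicit reward, beta times the log-probability ratio between the model and a reference model, and a plain dot product for the explicit linear head. The numbers are hypothetical, and the abstract itself does not spell out either formula.

```python
def implicit_reward(token_logps, ref_token_logps, beta=0.1):
    """IM-RM: scaled log-likelihood ratio of the response (DPO-style, assumed)."""
    return beta * (sum(token_logps) - sum(ref_token_logps))

def explicit_reward(hidden_state, head_weights):
    """EX-RM: a linear head applied to the final hidden representation."""
    return sum(h * w for h, w in zip(hidden_state, head_weights))

# Hypothetical per-token log-probabilities of one response.
policy_logps = [-0.5, -1.0, -0.2]
reference_logps = [-0.7, -1.4, -0.6]
# Hypothetical hidden state and learned head weights.
hidden, head = [0.5, -1.0, 2.0], [0.2, 0.1, 0.3]

print("implicit reward:", round(implicit_reward(policy_logps, reference_logps), 3))
print("explicit reward:", round(explicit_reward(hidden, head), 3))
```

The implicit reward is a sum over individual token probabilities, which gives some intuition for why token-level surface changes can move it more than they move a reward read from a hidden representation.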
#Multi-Modal Adapter #Personalized Federated Fine-Tuning #Few-Shot Learning of Vision Language Models
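🎯 Motivation: Vision-language models generalize well in zero- and few-shot settings, but adapting them efficiently and personally to decentralized, heterogeneous data remains hard. Existing federated prompt tuning methods often sacrifice generalization for personalization and do poorly on unseen classes or domains.
❓ Problem addressed: Design a federated learning framework for personalized fine-tuning of vision-language models on decentralized, heterogeneous data that keeps personalization without harming global generalization, overcoming the degraded generalization of existing methods on unseen classes or domains.
🔍 Analysis: Federated prompt tuning faces a trade-off between personalization and generalization: overly strong personalization degrades performance on new data. This stems from the lack of an effective mechanism for aligning cross-modal features during local client optimization.
🛠️ Method: Proposes pFedMMA, the first framework to use multi-modal adapters for personalized federated fine-tuning. Each adapter has modality-specific up- and down-projection layers and a globally shared projection; the shared projection is trained collaboratively to align cross-modal features, and only the shared component is exchanged for communication efficiency.
📊 Data and experiments: Validated on 11 datasets, including domain-shift and label-shift scenarios. pFedMMA achieves the best trade-off between personalization and generalization and outperforms existing federated prompt tuning methods.
⭐ Contributions: The first use of multi-modal adapters in personalized federated learning, a scheme that balances personalization and global generalization by collaboratively optimizing a shared projection, and a communication-efficient federated training mechanism.
Abstract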
Vision-Language Models (VLMs) like CLIP have demonstrated remarkable generalization in zero- and few-shot settings, but adapting them efficiently to decentralized, heterogeneous data remains a challenge. While prompt tuning has emerged as a popular parameter-efficient approach in personalized federated learning, existing methods often sacrifice generalization in favor of personalization, struggling particularly on unseen classes or domains. In this work, we propose pFedMMA, the first personalized federated learning framework that leverages multi-modal adapters for vision-language tasks. Each adapter contains modality-specific up- and down-projection layers alongside a globally shared projection that aligns cross-modal features. Our optimization strategy allows clients to locally adapt to personalized data distributions while collaboratively training the shared projection to improve global generalization. This design is also communication-efficient, as only the shared component is exchanged during communication rounds. Through extensive experiments across eleven datasets, including domain- and label-shift scenarios, we show that pFedMMA achieves state-of-the-art trade-offs between personalization and generalization, outperforming recent federated prompt tuning methods.
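The communication pattern described above, where only the shared projection leaves each client, can be sketched as follows. Plain element-wise averaging (FedAvg-style) and the dictionary layout with `down`, `up`, and `shared` keys are assumptions of this example; the abstract states only that the shared component is exchanged.

```python
def average_shared(client_states):
    """Average the shared projection matrices element-wise across clients."""
    shared = [state["shared"] for state in client_states]
    rows, cols = len(shared[0]), len(shared[0][0])
    return [[sum(m[i][j] for m in shared) / len(shared) for j in range(cols)]
            for i in range(rows)]

def communication_round(client_states):
    """Aggregate the shared projection; leave modality-specific layers local."""
    global_shared = average_shared(client_states)
    for state in client_states:
        state["shared"] = [row[:] for row in global_shared]
    return client_states

# Two hypothetical clients with tiny 2x2 matrices.
clients = [
    {"down": [[1.0, 0.0], [0.0, 1.0]], "up": [[0.5, 0.5], [0.5, 0.5]],
     "shared": [[1.0, 2.0], [3.0, 4.0]]},
    {"down": [[0.0, 1.0], [1.0, 0.0]], "up": [[0.1, 0.9], [0.9, 0.1]],
     "shared": [[3.0, 2.0], [1.0, 0.0]]},
]
clients = communication_round(clients)
print(clients[0]["shared"])
```

After the round both clients hold the same shared projection while their `down` and `up` layers remain personalized, which is the split between generalization and personalization that the abstract describes.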
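Foundation/Frontier Models (incl. LLMs)
Instruction Tuning and Alignment
#Large Language Models #Supervised Fine-tuning #Data Selection
🎯 Motivation: Data quality is a key factor in supervised fine-tuning of large language models. Existing token-level data selection methods depend on an additional reference model and use only loss information.
❓ Problem: Remove the dependence on a separately trained reference model, and better preserve semantically important tokens that loss-based metrics do not favor.
🔍 Analysis: Existing methods select tokens mainly by loss-based metrics, which can overlook semantically important tokens and hurt optimization.
🛠️ Method: ssToken uses history models to compute a per-token loss difference against the current model as a self-modulated selection signal, and adds an attention-based semantic importance metric that supplies complementary information for filtering.
📊 Data and experiments: Extensive experiments across model families and scales show that self-modulated selection and semantic-aware selection are each effective on their own, and that combining them yields further gains.
⭐ Contributions: A self-modulated, semantic-aware token selection method that needs no extra reference model, improves performance while maintaining training efficiency, and offers a new approach to token-level data selection.
Full abstract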
Data quality plays a critical role in enhancing supervised fine-tuning (SFT) for large language models (LLMs), and token-level data selection has emerged as a promising direction for its fine-grained nature. Despite their strong empirical performance, existing token-level selection methods share two key limitations: (1) requiring training or accessing an additional reference model, and (2) relying solely on loss information for token selection, which cannot well preserve semantically important tokens that are not favored by loss-based metrics. To address these challenges, we propose **ssToken**, a **S**elf-modulated and **S**emantic-aware **Token** Selection approach. ssToken leverages readily accessible history models to compute the per-token loss difference with the current model, which serves as a self-modulated signal that enables the model to adaptively select tokens along its optimization trajectory, rather than relying on excess loss from an offline-trained reference model as in prior works. We further introduce a semantic-aware, attention-based token importance estimation metric, orthogonal to loss-based selection and providing complementary semantic information for more effective filtering. Extensive experiments across different model families and scales demonstrate that both self-modulated selection and semantic-aware selection alone outperform full-data fine-tuning, while their integration—ssToken—achieves synergistic gains and further surpasses prior token-level selection methods, delivering performance improvements while maintaining training efficiency.
Source code is available at https://github.com/jianke0604/ssToken.
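A minimal sketch of how the two signals could be combined, under stated assumptions: the loss difference is taken as history-model loss minus current-model loss, both signals are min-max normalized, and they are mixed with a weight `gamma` before keeping the top fraction of tokens. None of these specifics come from the abstract; the function name and parameters are hypothetical, and the linked repository is the authoritative reference.

```python
from typing import List, Sequence


def select_tokens(
    loss_history: Sequence[float],
    loss_current: Sequence[float],
    attn_score: Sequence[float],
    keep_ratio: float = 0.5,
    gamma: float = 0.5,
) -> List[int]:
    """Return sorted indices of tokens kept for the SFT loss."""

    def minmax(xs: Sequence[float]) -> List[float]:
        lo, hi = min(xs), max(xs)
        return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]

    # Self-modulated signal (assumed sign): loss drop relative to a history model.
    diff = minmax([h - c for h, c in zip(loss_history, loss_current)])
    # Semantic signal: attention-based importance, assumed precomputed per token.
    attn = minmax(attn_score)
    score = [gamma * d + (1.0 - gamma) * a for d, a in zip(diff, attn)]

    k = max(1, round(keep_ratio * len(score)))
    ranked = sorted(range(len(score)), key=lambda i: score[i], reverse=True)
    return sorted(ranked[:k])
```

In training, the returned indices would define a mask over the per-token cross-entropy loss; tokens outside the mask contribute no gradient.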
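Reasoning and Chain-of-Thought (169 papers)
Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#LLM Reasoning #Test-Time Training #Textual Optimization
TL;DR: We propose $\nabla$-Reasoner, an iterative decoding approach with policy refinement by test-time gradient descent on latent textual representations to improve LLM reasoning.
🎯 Motivation: Scaling inference-time compute has unlocked the reasoning capabilities of large language models, but existing methods are inefficient and suboptimal, so a better way to refine the reasoning policy is needed.
❓ Problem: Existing inference-time methods rely on discrete search algorithms or trial-and-error prompting, which are suboptimal; the aim is to improve the LLM's online policy more efficiently.
🔍 Analysis: On mathematical reasoning, current methods are limited in both accuracy and efficiency: discrete search requires many model calls and does not fully exploit the model's potential.
🛠️ Method: $\nabla$-Reasoner performs differentiable textual optimization, using gradient signals to refine the policy at inference time, and adds rejection sampling and an acceleration design for robustness and speed.
📊 Data and experiments: On a challenging mathematical reasoning benchmark, $\nabla$-Reasoner improves accuracy by over 20% and reduces the number of model calls by roughly 10-40% compared with strong baselines.
⭐ Contributions: A shift from zeroth-order search to first-order optimization at test time, offering a cost-effective way to improve LLM reasoning, with a theoretical link to KL-regularized reinforcement learning.
Full abstract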
Scaling inference-time compute for Large Language Models (LLMs) has unlocked unprecedented reasoning capabilities.
However, existing inference-time scaling methods typically rely on inefficient and suboptimal discrete search algorithms or trial-and-error prompting to improve the online policy. In this paper, we propose $\nabla$-Reasoner, an iterative generation framework that integrates differentiable optimization over token logits into the decoding loop to refine the policy on the fly. Our core component, Differentiable Textual Optimization (DTO), leverages gradient signals from both the LLM’s likelihood and a reward model to refine textual representations. $\nabla$-Reasoner further incorporates rejection sampling and an acceleration design to robustify and speed up decoding. Theoretically, we show that performing inference-time gradient descent in the sample space to maximize reward is dual to aligning an LLM policy via KL-regularized reinforcement learning. Empirically, $\nabla$-Reasoner achieves over 20% accuracy improvement on a challenging mathematical reasoning benchmark, while reducing the number of model calls by approximately 10-40% compared to strong baselines. Overall, our work introduces a paradigm shift from zeroth-order search to first-order optimization at test time, offering a cost-effective path to amplify LLM reasoning.
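Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Reasoning #Logic #Artificial Intelligence #Large Language Models #Abduction
🎯 Motivation: Large language models show some formal reasoning ability but often break down on problems that require complex proof planning. Logic solvers are efficient but cannot handle missing commonsense relations.
❓ Problem: Use feedback from a logic solver to iteratively augment a logic problem with commonsense relations supplied by the LLM, improving logical reasoning.
🔍 Analysis: Logic solvers excel at pure logical reasoning but are limited when commonsense facts are missing; LLMs can supply such facts, but the accuracy and cost of what they generate must be controlled.
🛠️ Method: A search procedure over candidate commonsense assumptions that adds them to the logic problem, maximizing the chance of finding useful facts while keeping cost tractable, with the solver and the LLM working together.
📊 Data and experiments: Evaluated on several pure-logic reasoning datasets from which some commonsense information was removed; the method consistently achieves considerable improvements over existing techniques.
⭐ Contributions: A balanced neuro-symbolic method that combines a language model with a logic solver and improves logical reasoning when commonsense information is missing.
Full abstract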
Although Large Language Models (LLMs) have demonstrated impressive formal reasoning abilities, they often break down when problems require complex proof planning. One promising approach for improving LLM reasoning abilities involves translating problems into formal logic and using a logic solver. Although off-the-shelf logic solvers are in principle substantially more efficient than LLMs at logical reasoning, they assume that all relevant facts are provided in a question and are unable to deal with missing commonsense relations. In this work, we propose a novel method that uses feedback from the logic solver to augment a logic problem with commonsense relations provided by the LLM, in an iterative manner. This involves a search procedure through potential commonsense assumptions to maximize the chance of finding useful facts while keeping cost tractable. On a collection of pure-logical reasoning datasets, from which some commonsense information has been removed, our method consistently achieves considerable improvements over existing techniques, demonstrating the value in balancing neural and symbolic elements when working in human contexts.
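Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#large language models #multi-hop question answering #information-theoretic analysis #multi-call reasoning framework
TL;DR: We derive a Fano-style accuracy bound for single-pass LLM in multi-hop QA, revealing an Accuracy Cliff, analyze MHQA’s vulnerability, and validate the theory with a controlled benchmark and the InfoQA framework.
🎯 Motivation: Multi-hop question answering requires integrating dispersed, interdependent evidence, which is hard for single-pass language models, whose accuracy drops when capacity is limited.
❓ Problem: Using information-theoretic analysis, derive a Fano-style accuracy upper bound for single-pass reasoning and explain why accuracy collapses once task complexity exceeds model capacity.
🔍 Analysis: The single-pass paradigm is vulnerable to capacity overflow: the model cannot reliably integrate task-relevant evidence beyond its per-pass output capacity, and accuracy collapses.
🛠️ Method: InfoQA combines capacity-aware task decomposition with active pruning of prior reasoning traces to keep the information load within the single-pass limit, and uses a dependency-explicit workflow for precise control over the reasoning path.
📊 Data and experiments: A stringent, noise-rich benchmark is constructed for validation. Model behavior matches the predicted capacity curves, and InfoQA delivers consistent performance improvements.
⭐ Contributions: A validated theoretical performance bound for single-pass reasoning; a multi-call framework that improves the accuracy and robustness of multi-hop QA; groundwork for further LLM multi-step reasoning methods.
Full abstract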
Multi-Hop Question Answering (MHQA) requires integrating dispersed, interdependent evidence through sequential reasoning under noise. This task is challenging for LLMs as they have a finite per-pass output capacity, beyond which the integration of task-relevant evidence proves unreliable. Consequently, the single-pass reasoning paradigm is inherently vulnerable to this capacity overflow. To formalize this bottleneck, our analysis establishes a Fano-style accuracy upper bound, defining a theoretical performance ceiling for single-pass LLMs. This bound reveals that accuracy inevitably collapses once task complexity exceeds model capacity, providing general principles for capacity-aware representation and structuring of MHQA in LLMs. Building on these principles, we introduce a proof-of-concept multi-call framework for MHQA, InfoQA. It ensures high per-step accuracy by combining capacity-aware task decomposition with active pruning of prior reasoning traces, keeping the information load within the single-pass limit. It further achieves robustness by a dependency-explicit workflow that enables precise control over the reasoning path. We construct a stringent and noise-rich benchmark to validate our theory and framework. Experimental results show that model behavior aligns with our predicted capacity curves while InfoQA achieves consistent performance improvements. We hope our work inspires more LLM multi-step reasoning methods. Code (InfoQA): https://anonymous.4open.science/r/InfoQA-55D1
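For orientation, the classical Fano inequality gives the general shape that a "Fano-style" accuracy ceiling takes. The block below is the textbook statement, not the paper's exact bound. The symbols are assumptions for illustration: $A$ is the correct answer, uniform over $M$ candidates; $Y$ is the model's single-pass output; $I(A;Y)$ is the mutual information the output carries about the answer.

```latex
% Classical Fano inequality for A uniform over M candidates:
P_e \;\ge\; 1 - \frac{I(A;Y) + \log 2}{\log M}
% Equivalent accuracy ceiling:
\mathrm{Acc} \;=\; 1 - P_e \;\le\; \frac{I(A;Y) + \log 2}{\log M}
```

If the per-pass output can carry only a bounded amount of information, $I(A;Y)$ is capped while $\log M$ grows with task complexity, so the ceiling falls. This is consistent with the collapse in accuracy that the abstract describes once complexity exceeds capacity.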
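Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large Language Models #reasoning #efficient reasoning
🎯 Motivation: Long chain-of-thought (CoT) reasoning markedly improves performance on complex reasoning tasks, but its high compute and memory cost limits efficiency and practicality.
❓ Problem: Model the LLM's reasoning process as a state-transition process to reduce the computational complexity of long CoT reasoning and improve efficiency.
🔍 Analysis: Compressing CoT sequences improves efficiency but conflicts with test-time scaling and so limits reasoning capacity. Noisy, redundant reasoning steps also cause over-thinking.
🛠️ Method: A state-transition framework based on linear attention. A reasoning state records historical reasoning information and each step updates it, reducing attention cost from quadratic to linear; a state-based reasoning strategy mitigates over-thinking caused by noisy steps.
📊 Data and experiments: Extensive experiments across multiple datasets and model sizes show improved reasoning efficiency and reasoning performance.
⭐ Contributions: A state-transition reasoning framework that improves both efficiency and performance on complex tasks, using linear attention and a state-based strategy to reduce computational complexity and the effect of noisy steps.
Full abstract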
While Long Chain-of-Thought (CoT) reasoning significantly improves the performance of Large Language Models (LLMs) on complex reasoning tasks, the substantial computational and memory costs of generating long CoT sequences limit their efficiency and practicality.
Existing studies usually enhance the reasoning efficiency of LLMs by compressing CoT sequences.
However, this approach conflicts with test‑time scaling, limiting the reasoning capacity of LLMs.
In this paper, we propose an efficient reasoning framework that models the reasoning process of LLMs as a state‑transition process.
Specifically, we first apply a linear attention mechanism to estimate the LLM’s reasoning state, which records the historical reasoning information from previous reasoning steps.
Then, based on the query prompt and the reasoning state, the LLM can efficiently perform the current reasoning step and update the state.
With the linear attention, each token in the current reasoning step can directly retrieve relevant historical reasoning information from the reasoning state, without explicitly attending to tokens in previous reasoning steps.
In this way, the computational complexity of attention is reduced from quadratic to linear, significantly improving the reasoning efficiency of LLMs.
In addition, we propose a state-based reasoning strategy to mitigate the over-thinking issue caused by noisy reasoning steps.
Extensive experiments across multiple datasets and model sizes demonstrate that our framework not only improves the reasoning efficiency of LLMs but also enhances their reasoning performance.
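The idea that a token can "retrieve relevant historical reasoning information from the reasoning state, without explicitly attending to tokens in previous reasoning steps" matches the standard recurrent form of linear attention. The sketch below shows that generic recurrence, not the paper's specific architecture; the feature map `phi` (ELU + 1) and all names are assumptions for illustration.

```python
import numpy as np


def phi(x: np.ndarray) -> np.ndarray:
    """Positive feature map elu(x) + 1, a common choice for linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(x))


def linear_attention_step(state, norm, q, k, v):
    """One token of causal linear attention.

    state: (d_k, d_v) running sum of outer(phi(k_s), v_s) over past tokens
    norm:  (d_k,)     running sum of phi(k_s)
    Per-token cost is independent of how many tokens came before.
    """
    fk = phi(k)
    state = state + np.outer(fk, v)  # fold the new key/value into the state
    norm = norm + fk
    fq = phi(q)
    out = fq @ state / (fq @ norm + 1e-9)  # read from the state, not from past tokens
    return state, norm, out
```

Because the state has fixed size, processing T tokens costs O(T) rather than the O(T^2) of softmax attention over an explicit history, which is the complexity reduction the abstract refers to.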
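Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large Language Models #Prompting #In-Context Learning #Tool-augmented Reasoning #Text-rich Graphs
TL;DR: A comprehensive study of LLMs for node classification, providing a principled understanding of their capabilities in processing graph information that practitioners can apply in real-world tasks
🎯 Motivation: Large language models (LLMs) are increasingly used for text-rich graph learning, where node classification matters for high-impact domains such as fraud detection and recommendation systems. Yet the field lacks a systematic understanding of how well LLMs process graph data.
❓ Problem: Study how LLMs perform under different interaction modes (prompting, tool use, code generation) across diverse graph settings, and how much each mode relies on each input type (structure, features, labels).
🔍 Analysis: Controlled experiments show that code generation performs best overall, especially on long-text or high-degree graphs; all interaction strategies remain effective on heterophilic graphs, challenging the assumption that LLM-based methods collapse under low homophily; and code generation can flexibly shift its reliance to the most informative input type.
🛠️ Method: A large-scale controlled evaluation comparing LLM-graph interaction modes, with feature truncation, edge deletion, and label removal used to quantify each mode's reliance on each input type.
📊 Data and experiments: Datasets span citation, web-link, e-commerce, and social networks, with systematic tests across homophilic vs. heterophilic regimes, short vs. long text features, and different LLM sizes.
⭐ Contributions: A systematic account of the strengths and limitations of LLMs on graph reasoning; evidence that code generation is the strongest mode for large, complex graphs; clear design principles for future methods.
Full abstract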
Large language models (LLMs) are increasingly leveraged for text-rich graph machine learning tasks, with node classification standing out due to its high-impact application domains such as fraud detection and recommendation systems.
Yet, despite a surge of interest, the field lacks a principled understanding of the capabilities of LLMs in processing graph data.
In this work, we conduct a large-scale, controlled evaluation across the key axes of variability: the LLM-graph interaction mode, comparing prompting, tool-use, and code generation; dataset domains, spanning citation, web-link, e-commerce, and social networks; homophilic vs. heterophilic regimes; short- vs. long-text features; LLM sizes and reasoning capabilities. We further analyze dependencies by independently truncating features, deleting edges, and removing labels to quantify reliance on input types.
Our findings provide actionable guidance for both research and practice. (1) Code generation mode achieves the strongest overall performance, with especially large gains on long-text or high-degree graphs where prompting quickly exceeds the token budget. (2) All interaction strategies remain effective on heterophilic graphs, challenging the assumption that LLM-based methods collapse under low homophily. (3) Code generation mode is able to flexibly shift its reliance to the most informative input type, whether that be structure, features, or labels.
Together, these results establish a clear picture of the strengths and limitations of current LLM–graph interaction modes and point to design principles for future methods.
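Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#LLM Reasoning #Efficiency #Self-Consistency
🎯 Motivation: How the thinking budget is allocated at inference time is central to LLM performance, but the relationship among model capability, query complexity, and budget allocation is poorly understood.
❓ Problem: Use self-consistency to identify how much thinking a query needs, and optimize the trade-off between performance and compute.
🔍 Analysis: Lower self-consistency indicates that a query needs more intermediate reasoning to be answered correctly, so dynamic budget allocation can improve inference.
🛠️ Method: $\texttt{Sonata}$ trains an adapter that predicts self-consistency from hidden representations during the query prefilling stage and allocates the thinking budget accordingly; it is compatible with existing CoT compression methods.
📊 Data and experiments: Evaluated on several models (the Qwen3 series, GPT-OSS-120B) and benchmarks (including AIME25 and GSM8K), where $\texttt{Sonata}$ improves efficiency and accuracy.
⭐ Contributions: A general, lightweight budget allocation framework that cuts thinking tokens by 20%-60% at the same accuracy, or improves accuracy by up to 2% at the same token cost.
Full abstract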
Recent advances in large language models (LLMs) test-time computing have introduced the capability to perform intermediate chain-of-thought (CoT) reasoning (thinking) before generating answers. While increasing the thinking budget yields smooth performance improvements at inference time, the relationship between LLM capability, query complexity, and optimal budget allocation remains poorly understood for achieving compute-optimal inference. To address this challenge, we utilize $\textit{self-consistency}$, the agreement among multiple reasoning paths, as a proxy for thinking necessity. We first identify that lower self-consistency indicates when queries require extended thinking to reach correct answers. Building on this insight, we introduce $\texttt{Sonata}$ (Self-Consistency-Guided Adapter for Thinking Allocation), a lightweight approach that adaptively allocates thinking budgets to optimize the performance-efficiency tradeoff. $\texttt{Sonata}$ includes an adapter trained offline on a calibration dataset to predict self-consistency directly from the last layer hidden representations during the query prefilling stage. This prediction then guides on-the-fly budget allocation before thinking. The adapter is general, transferable across diverse tasks once trained, and introduces less than 1‰ computational overhead during inference. Notably, $\texttt{Sonata}$ is compatible with existing CoT compression methods, enabling further efficiency gains when managing thinking budgets across queries. Extensive experiments on multiple models (Qwen3-8B, Qwen3-32B, GPT-OSS-120B, Qwen3-235B-A22B) and benchmarks (AIME25, GSM8K, MATH500, GPQA, LiveCodeBench) demonstrate that $\texttt{Sonata}$ achieves $20\%$ to $60\%$ reduction in thinking tokens while maintaining the same accuracy, or up to $2\%$ improvement in accuracy with the same token cost.
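Two pieces of this pipeline are simple enough to sketch: the self-consistency score (the quantity the adapter is trained to predict) and a mapping from that score to a thinking budget. The linear mapping and the token limits below are assumptions for illustration, not the paper's allocation rule; in the actual method the score comes from an adapter over hidden states rather than from sampled answers.

```python
from collections import Counter
from typing import Sequence


def self_consistency(answers: Sequence[str]) -> float:
    """Fraction of sampled answers that agree with the most common answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


def thinking_budget(sc: float, min_tokens: int = 512, max_tokens: int = 8192) -> int:
    """Map (predicted) self-consistency to a token budget.

    Assumed linear rule: low agreement gets a long budget, high agreement a short one.
    """
    sc = min(max(sc, 0.0), 1.0)  # clamp to [0, 1]
    return int(round(max_tokens - sc * (max_tokens - min_tokens)))
```

For example, four samples with three in agreement give a score of 0.75, which this rule maps to a budget between the two limits, closer to the minimum.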
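Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#large language model #reasoning model #robustness #chain of thought
TL;DR: Reasoning LLMs mostly recover from disruptions using doubting mechanisms, but paraphrasing hinders this and recovery raises reasoning cost.
🎯 Motivation: Reasoning LLMs (RLLMs) generate step-by-step chains of thought (CoTs), which improves performance on complex tasks and makes reasoning transparent, but how robust these traces are to mid-stream disruptions is unclear.
❓ Problem: Introduce a controlled evaluation framework that perturbs a model's own CoT at fixed timesteps to study robustness and recovery.
🔍 Analysis: RLLMs recover from most perturbations, with robustness depending on model size, the timing of the intervention, and its type. Expressions of doubt are the key recovery mechanism, while paraphrasing suppresses doubt and reduces performance.
🛠️ Method: Seven interventions (benign, neutral, and adversarial) are applied at fixed timesteps to the model's generated CoT, and recovery and efficiency are evaluated across math, science, and logic tasks.
📊 Data and experiments: Open-weight RLLMs of several sizes are tested on math, science, and logic tasks, with analysis of recovery, changes in CoT length, and performance degradation.
⭐ Contributions: Characterizes how CoTs recover from diverse perturbations and at what cost; identifies doubt as a central recovery mechanism; highlights the robustness-efficiency trade-off for future training methods.
Full abstract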
Reasoning LLMs (RLLMs) generate step-by-step chains of thought (CoTs) before giving an answer, which improves performance on complex tasks and makes reasoning transparent. But how robust are these reasoning traces to disruptions that occur within them? To address this question, we introduce a controlled evaluation framework that perturbs a model’s own CoT at fixed timesteps. We design seven interventions (benign, neutral, and adversarial) and apply them to multiple open-weight RLLMs across MATH, SCIENCE, and LOGIC tasks. Our results show that RLLMs are generally robust, reliably recovering from diverse perturbations, with robustness improving with model size and degrading when interventions occur early. However, robustness is not style-invariant: paraphrasing suppresses doubt-like expressions and reduces performance, while other interventions trigger doubt and support recovery. Recovery also carries a cost: neutral and adversarial noise can inflate CoT length by more than 200%, whereas paraphrasing shortens traces but harms accuracy. These findings provide new evidence on how RLLMs maintain reasoning integrity, identify doubt as a central recovery mechanism, and highlight trade-offs between robustness and efficiency that future training methods should address.
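Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Instruction Following #Dynamic Attention #Large Language Models
🎯 Motivation: Large language models (LLMs) follow instructions well but often struggle with compositional instructions that contain multiple interleaved yet logically independent sub-tasks.
❓ Problem: Attention interference among sub-tasks caused by structural entanglement in compositional instructions, which compromises output fidelity.
🔍 Analysis: Sub-tasks are organized in mutually exclusive structures such as branching, chaining, or paralleling. Dormant sub-tasks can still attract attention during generation, which causes interference and degrades task completion.
🛠️ Method: ATA, a structure-aware dynamic attention mechanism that identifies the active sub-task during generation and suppresses attention to inactive ones, within a single forward pass and without parameter updates.
📊 Data and experiments: Extensive experiments show that ATA consistently improves instruction following across compositional structures and generalizes well.
⭐ Contributions: A structure-aware dynamic attention mechanism that mitigates attention interference among compositional sub-tasks and improves adherence to complex instructions without any parameter updates.
Full abstract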
Large language models (LLMs) have exhibited strong instruction-following capabilities; however, they often struggle with compositional instructions involving multiple interleaved yet logically independent sub-tasks. These sub-tasks are typically organized in mutually exclusive structures, such as branching, chaining, or paralleling, where only one sub-task should be active at each generation step, while the others remain dormant. Despite their inactivity, dormant sub-tasks can inadvertently attract the model's attention due to structural entanglement within the input context or intermediate representations, leading to interference that compromises output fidelity. To address this challenge, we propose ATA, a structure-aware dynamic attention mechanism grounded in compositional structures, which dynamically identifies the active sub-task during generation while suppressing attention to inactive ones. By precisely steering the model’s focus, ATA mitigates interference and explicitly enhances model adherence to the active sub-task. Importantly, ATA operates within a single forward pass without requiring parameter updates. Extensive experiments show that ATA consistently enhances LLMs' instruction-following ability across various compositional structures, effectively mitigating attention distraction and demonstrating a strong generalization ability.
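Suppressing attention to inactive sub-tasks can be illustrated with a single attention row: positions belonging to dormant sub-tasks receive a negative bias before the softmax. This is a generic attention-steering sketch, assuming the inactive positions are already known; how ATA identifies the active sub-task and which layers or heads it modifies is not stated in the abstract. The function name and `penalty` parameter are hypothetical.

```python
import math
from typing import List, Sequence, Set


def suppressed_attention(
    scores: Sequence[float], inactive: Set[int], penalty: float = 5.0
) -> List[float]:
    """Softmax over attention logits with a penalty on inactive positions.

    scores:   raw attention logits for one query over the context positions
    inactive: indices of tokens that belong to dormant sub-tasks
    """
    adjusted = [s - penalty if i in inactive else s for i, s in enumerate(scores)]
    m = max(adjusted)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in adjusted]
    z = sum(exps)
    return [e / z for e in exps]
```

Since only the logits are adjusted at inference time, no model weights change, in line with the abstract's statement that the method needs no parameter updates.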
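Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Autoformalization #Retrieval-augmented Generation
🎯 Motivation: Interactive theorem provers require labor-intensive manual formalization. Autoformalization is promising but faces model hallucination and a semantic gap.
❓ Problem: Address symbol misuse in generated code and ambiguous or missing premises in natural language descriptions with CRAMF, a concept-driven retrieval-augmented framework.
🔍 Analysis: The polymorphic nature of mathematical concepts, the high precision required, and the lack of structured knowledge bases make retrieval-augmented generation non-trivial in the formalization setting.
🛠️ Method: A knowledge base built from Mathlib4 that indexes over 26,000 formal definitions, combined with contextual query augmentation and a dual-channel hybrid retrieval strategy with reranking to retrieve concept definitions accurately.
📊 Data and experiments: On miniF2F, ProofNet, and the newly proposed AdvancedMath benchmark, CRAMF yields an average relative improvement of 29.9% in translation accuracy.
⭐ Contributions: The CRAMF framework, which brings retrieval-augmented generation to autoformalization, handles conceptual polymorphism, and improves translation accuracy.
Full abstract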
Interactive theorem provers (ITPs) require manual formalization, which is labor-intensive and demands expert knowledge. While automated formalization offers a potential solution, it faces two major challenges: model hallucination (e.g., undefined predicates, symbol misuse, and version incompatibility) and the semantic gap caused by ambiguous or missing premises in natural language descriptions. To address these issues, we propose CRAMF, a Concept-driven Retrieval-Augmented Mathematical Formalization framework. CRAMF enhances LLM-based autoformalization by retrieving formal definitions of core mathematical concepts, providing contextual grounding during code generation. However, applying retrieval-augmented generation (RAG) in this setting is non-trivial due to the lack of structured knowledge bases, the polymorphic nature of mathematical concepts, and the high precision required in formal retrieval. We introduce a framework for automatically constructing a concept-definition knowledge base from Mathlib4, the standard mathematical library for the Lean 4 theorem prover, indexing over 26,000 formal definitions and 1,000+ core mathematical concepts. To address conceptual polymorphism, we propose contextual query augmentation with domain- and application-level signals. In addition, we design a dual-channel hybrid retrieval strategy with reranking to ensure accurate and relevant definition retrieval. Experiments on miniF2F, ProofNet, and our newly proposed AdvancedMath benchmark show that CRAMF can be seamlessly integrated into LLM-based autoformalizers, yielding consistent improvements in translation accuracy—achieving up to 62.1% and an average of 29.9% relative improvement.
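Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#LLM #test-time compute #majority voting #LLM ensemble
🎯 Motivation: Study best-of-N selection with majority voting for large language models (LLMs) at test time, analyze the limit of infinitely many samples, and make it computable under a finite budget.
❓ Problem: Majority voting in the limit requires an infinite test-time budget; the work also optimizes inference with mixtures of different models.
🔍 Analysis: Best-of-∞ performs impressively in the limit but is impractical because of its infinite budget; weighted mixtures of models can outperform any single model.
🛠️ Method: An adaptive generation scheme that chooses N based on answer agreement, and an optimal ensemble weighting computed efficiently as a mixed-integer linear program.
📊 Data and experiments: Extensive experiments on multiple benchmark datasets confirm the effectiveness of adaptive generation and weighted ensembles.
⭐ Contributions: An adaptive generation method driven by answer agreement, an extension of the framework to weighted ensembles of multiple LLMs, and experimental evidence of gains in performance and compute efficiency.
Full abstract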
We study best-of-$N$ for large language models (LLMs) where the selection is based on majority voting. In particular, we analyze the limit $N \to \infty$, which we denote as best-of-$\infty$. While this approach achieves impressive performance in the limit, it requires an infinite test-time budget. To address this, we propose an adaptive generation scheme that selects $N$ based on answer agreement, thereby efficiently allocating inference-time computation. Beyond adaptivity, we extend the framework to weighted ensembles of multiple LLMs, showing that such mixtures can outperform any individual model. The optimal ensemble weighting is formulated and efficiently computed as a mixed-integer linear program. Extensive experiments demonstrate the effectiveness of our approach. Our code is available at https://github.com/jkomiyama/BoInf-code-publish/.
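Choosing $N$ adaptively from answer agreement can be sketched with a simple stopping rule: keep sampling until the leading answer is ahead of the runner-up by a fixed margin, or a cap is reached. The margin rule is a stand-in chosen for brevity, not the paper's stopping criterion (the linked repository has the actual one); the names `sample`, `min_n`, `max_n`, and `margin` are hypothetical.

```python
from collections import Counter
from typing import Callable, Tuple


def adaptive_majority_vote(
    sample: Callable[[], str],
    min_n: int = 3,
    max_n: int = 32,
    margin: int = 3,
) -> Tuple[str, int]:
    """Sample answers until the leader is `margin` votes ahead, then return it.

    Returns the majority answer and the number of samples actually drawn.
    """
    counts: Counter = Counter()
    n = 0
    for n in range(1, max_n + 1):
        counts[sample()] += 1
        top = counts.most_common(2)
        runner_up = top[1][1] if len(top) > 1 else 0
        if n >= min_n and top[0][1] - runner_up >= margin:
            break  # agreement is strong enough; stop spending compute
    return counts.most_common(1)[0][0], n
```

Easy queries, where samples agree quickly, stop after a few generations, while contested queries use more of the budget.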
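Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#large language model #reinforcement learning
🎯 Motivation: LLMs trained with reinforcement learning show reasoning ability and reflective behaviors, but a conventional Markovian policy depends on history only through the state, so it has no incentive for reflective exploration or for enriching identical states with additional context.
❓ Problem: Conventional RL lacks an account of reflective exploration; the work recasts it in a Bayesian RL framework that admits uncertainty-adaptive policies and induces reflective behavior.
🔍 Analysis: In conventional RL, exploration only serves trial-and-error learning during training. It does not drive reflective reasoning at test time and gives no incentive for information-gathering actions.
🛠️ Method: BARL, an algorithm based on Bayesian RL in which belief updates lead the model to gather information and switch strategies, guiding when and how it should reflectively explore.
📊 Data and experiments: On synthetic and mathematical reasoning tasks, BARL outperforms conventional RL approaches in test-time performance and token efficiency.
⭐ Contributions: A Bayesian RL formulation of reflective exploration for LLM reasoning; the BARL algorithm; publicly released code to support reproduction.
Full abstract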
Large Language Models (LLMs) trained via Reinforcement Learning (RL) have exhibited strong reasoning capabilities and emergent reflective behaviors, such as rethinking and error correction, as a form of in-context exploration. However, the Markovian policy obtained from conventional RL training does not give rise to reflective exploration behaviors since the policy depends on the history only through the state and therefore has no incentive to enrich identical states with additional context. Instead, RL exploration is only useful during training to learn the optimal policy in a trial-and-error manner. Therefore, it remains unclear whether reflective reasoning will emerge during RL, or why it is beneficial. To remedy this, we recast reflective exploration within a Bayesian RL framework, which optimizes the expected return under a posterior distribution over Markov decision processes induced by the training data. This Bayesian formulation admits uncertainty-adaptive policies that, through belief updates, naturally incentivize information-gathering actions and induce self-reflection behaviors. Our resulting algorithm, BARL, instructs the LLM to stitch and switch strategies based on the observed outcomes, offering principled guidance on when and how the model should reflectively explore. Empirical results on both synthetic and mathematical reasoning tasks demonstrate that BARL outperforms conventional RL approaches, achieving superior test-time performance and token efficiency. Our code is available at https://github.com/shenao-zhang/BARL.
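Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large language models #Reasoning #Exploration
🎯 Motivation: Current Reinforcement Learning with Verifiable Rewards (RLVR) methods explore poorly, leading to premature convergence and entropy collapse. They also tend to produce overconfident policies that may be wrong.
❓ Problem: Introduce a Curiosity-Driven Exploration (CDE) framework that guides exploration, improves reasoning, and mitigates premature convergence and poor calibration.
🔍 Analysis: Under current RLVR, insufficient exploration biases generation toward a single answer, so outputs lack diversity and confidence does not track correctness.
🛠️ Method: Curiosity signals from both sides: for the actor, perplexity over its generated response; for the critic, the variance of value estimates from a multi-head architecture. Both serve as exploration bonuses within RLVR, promoting diversity among correct responses and penalizing overconfident errors.
📊 Data and experiments: On AIME benchmarks with GRPO/PPO, CDE improves over standard RLVR by about +3 points.
⭐ Contributions: (1) A curiosity-driven exploration mechanism based on perplexity and value variance; (2) theoretical analysis showing that the actor bonus penalizes overconfident errors and connecting the critic bonus to count-based exploration; (3) empirical gains for RL-trained LLMs.
Full abstract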
Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful paradigm for enhancing the reasoning ability of Large Language Models (LLMs). Yet current RLVR methods often explore poorly, leading to premature convergence and entropy collapse. Moreover, they tend to produce poorly calibrated policies that remain confident in their generations regardless of correctness. To address this challenge, we introduce **Curiosity-Driven Exploration (CDE)**, a framework that leverages the model's intrinsic sense of curiosity to guide exploration. We formalize curiosity with signals from both the actor and the critic: for the actor, we use perplexity over its generated response, and for the critic, we use the variance of value estimates from a multi-head critic architecture. Both signals serve as an exploration bonus within the RLVR framework to guide the model. Our theoretical analysis shows that the actor-wise bonus inherently penalizes overconfident errors and promotes diversity among correct responses; moreover, we connect the critic-wise bonus to the well-established count-based exploration bonus in RL. Empirically, our method achieves an approximate **+3** point improvement over standard RLVR using GRPO/PPO on AIME benchmarks.
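The two curiosity signals are easy to state in code. The sketch below computes response perplexity from token log-probabilities and the variance across critic heads, then adds both to the task reward at the response level. That additive, response-level combination and the coefficients `alpha` and `beta` are assumptions for illustration; the paper may apply the bonuses differently (for example per token, or with clipping).

```python
import math
from statistics import pvariance
from typing import Sequence


def perplexity(token_logps: Sequence[float]) -> float:
    """Actor signal: exp of the negative mean log-probability of the response."""
    return math.exp(-sum(token_logps) / len(token_logps))


def critic_disagreement(head_values: Sequence[float]) -> float:
    """Critic signal: population variance of value estimates across heads."""
    return pvariance(head_values)


def shaped_reward(
    task_reward: float,
    token_logps: Sequence[float],
    head_values: Sequence[float],
    alpha: float = 0.1,
    beta: float = 0.1,
) -> float:
    """Task reward plus both exploration bonuses (assumed additive form)."""
    return (
        task_reward
        + alpha * perplexity(token_logps)
        + beta * critic_disagreement(head_values)
    )
```

A response the policy finds surprising (high perplexity) or a state the critic heads disagree on (high variance) earns a larger bonus, which pushes the policy away from collapsing onto a single confident answer.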
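Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#large language models #mathematical reasoning #conceptual understanding #fine-tuning #robustness
🎯 Motivation: Large language models can solve hard math exercises yet fall short when genuine conceptual understanding is required. Existing RL pipelines reinforce final-answer correctness but give little fine-grained guidance on applying concepts.
❓ Problem: Close the gap between knowing a definition and applying it in mathematical reasoning, strengthening conceptual understanding and application.
🔍 Analysis: A sanity probe shows that models can restate definitions but fail concept-linked quizzes, which quantifies the conceptual reasoning gap.
🛠️ Method: The CORE framework synthesizes concept-aligned quizzes, injects brief concept snippets during rollouts, and reinforces conceptual reasoning by replacing trajectories after group failures.
📊 Data and experiments: Built on a high-quality, low-contamination textbook resource; across several models, CORE delivers consistent gains over baselines on in-domain and out-of-domain math benchmarks.
⭐ Contributions: An algorithm- and verifier-agnostic method for fine-grained conceptual supervision that unifies concept-aligned quizzes and concept-injected rollouts, using RL to bridge problem-solving competence and conceptual understanding.
Full abstract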
Large language models (LLMs) often solve challenging math exercises yet fail to apply the concept right when the problem requires genuine understanding. Popular Reinforcement Learning with Verifiable Rewards (RLVR) pipelines reinforce final answers but provide little fine-grained conceptual signal, so models improve at pattern reuse rather than conceptual applications. We introduce $\textit{CORE}$ (Concept-Oriented REinforcement), an RL training framework that turns explicit concepts into a controllable supervision signal. Starting from a high-quality, low-contamination textbook resource that links verifiable exercises to concise concept descriptions, we run a sanity probe showing LLMs can restate definitions but fail concept-linked quizzes, quantifying the conceptual reasoning gap. $\textit{CORE}$ then (i) synthesizes concept-aligned quizzes, (ii) injects brief concept snippets during rollouts to elicit concept-primed trajectories, and (iii) reinforces conceptual reasoning via trajectory replacement after group failures, a lightweight forward-KL constraint that aligns unguided with concept-primed policies, or standard GRPO directly on concept-aligned quizzes. Across several models, $\textit{CORE}$ delivers consistent gains over vanilla and SFT baselines on both in-domain concept--exercise suites and diverse out-of-domain math benchmarks. $\textit{CORE}$ unifies direct training on concept-aligned quizzes and concept-injected rollouts under outcome regularization. It provides fine-grained conceptual supervision that bridges problem-solving competence and genuine conceptual reasoning, while remaining algorithm- and verifier-agnostic.
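Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Test-time Scaling #Model Calibration #Efficient inference #Language Modeling #Scaling
TL;DR: We propose Self-Calibration, a new unsupervised framework that helps a model calibrate its confidence and uses that confidence for efficient test-time scaling.
🎯 Motivation: More test-time compute improves LLM response quality, but fixed sampling strategies can waste compute on easy questions and under-explore hard ones.
❓ Problem: LLMs are overconfident and give unreliable confidence estimates; a self-calibration method is proposed to make test-time confidence reliable and enable efficient test-time scaling.
🔍 Analysis: Fixed-sample methods such as Best-of-N and Self-Consistency with majority voting do not adapt to question difficulty, wasting computation on simple questions and exploring too little on hard ones.
🛠️ Method: Self-Calibration distills Self-Consistency-derived confidence into the model itself, and Calibrated Test-Time Scaling (CaTS) uses that confidence to adapt the sampling strategy.
📊 Data and experiments: Experiments on three LLMs across nine datasets. On MathQA, confidence-based early stopping improves accuracy from 73.7 to 83.6 with a sample budget of 16 responses.
⭐ Contributions: CaTS, a confidence-based test-time scaling framework that combines self-calibration with adaptive sampling to improve efficiency and reliability over fixed-sample baselines.
Full abstract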
Increasing test-time computation is a straightforward approach to enhancing the quality of responses in Large Language Models (LLMs). While Best-of-N sampling and Self-Consistency with majority voting are simple and effective, they require a fixed number of sampling responses for each query, regardless of its complexity. This could result in wasted computation for simpler questions and insufficient exploration for more challenging ones. In this work, we argue that model confidence of responses can be used for improving the efficiency of test-time scaling. Unfortunately, LLMs are known to be overconfident and provide unreliable confidence estimation. To address this limitation, we introduce Self-Calibration by distilling Self-Consistency-derived confidence into the model itself. This enables reliable confidence estimation at test time with one forward pass. We then design Calibrated Test-Time Scaling (CaTS), adapting common repeated sampling methods, such as self-consistency and Best-of-N to handle queries of various difficulty. We also show that CaTS-SC is provably better than vanilla self-consistency. Experiments on three LLMs across nine datasets demonstrate the effectiveness of our approach. Specifically, applying confidence-based Early Stopping (CaTS-ES) to Best-of-N improves MathQA accuracy from 73.7 to 83.6 with a sample budget of 16 responses, demonstrating the effectiveness of the confidence-based sampling strategy at inference time.
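Confidence-based early stopping on top of Best-of-N can be sketched as a loop that stops as soon as one response's calibrated confidence clears a threshold and otherwise falls back to the most confident response within the budget. The threshold value, the fallback rule, and the `generate` callback (assumed to return an answer together with its confidence) are illustrative assumptions, not the paper's exact CaTS-ES procedure.

```python
from typing import Callable, Tuple


def early_stopping_best_of_n(
    generate: Callable[[], Tuple[str, float]],
    threshold: float = 0.9,
    budget: int = 16,
) -> Tuple[str, float, int]:
    """Sample until a response is confident enough or the budget is spent.

    Returns (answer, confidence, number of samples used).
    """
    best_answer, best_conf = "", float("-inf")
    used = 0
    for used in range(1, budget + 1):
        answer, conf = generate()
        if conf > best_conf:
            best_answer, best_conf = answer, conf
        if conf >= threshold:
            break  # confident enough: skip the remaining samples
    return best_answer, best_conf, used
```

The saving comes from easy queries, where the first or second response already clears the threshold; hard queries still use up to the full budget.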
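Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Latent Reasoning #Recurrent Depth #RWKV-Product #State-Guided Sparse Attention
🎯 Motivation: Large language models are constrained by the fixed-depth Transformer architecture and struggle with complex reasoning. Approaches such as Chain of Thought depend on natural language generation, whose cost grows rapidly with sequence length.
❓ Problem: Move reasoning into latent computational space and balance reasoning depth against compute to improve efficiency and performance on complex reasoning tasks.
🔍 Analysis: Existing systems are limited on deep reasoning tasks, with notable bottlenecks in long-range modeling and computational efficiency.
🛠️ Method: ChainGPT performs deep computation within each layer through multi-substep state updates and state-guided sparse attention, refines latent states across layers with a recurrent-depth approach, and uses adaptive training and stopping strategies.
📊 Data and experiments: ChainGPT shows consistent improvements over comparable models on several challenging reasoning tasks while remaining computationally efficient.
⭐ Contributions: ChainGPT unifies reasoning ability and efficiency, comes with a theoretical argument that it can simulate general computation, and offers a foundation for next-generation language models.
Full abstract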
Large language models, constrained by the fixed-depth Transformer architecture, struggle to solve complex reasoning tasks in an end-to-end manner. Existing approaches, such as Chain of Thought, improve reasoning depth to some extent but rely heavily on natural language generation, with computational costs increasing rapidly as the length of the generated sequence grows. To address these limitations, we propose ChainGPT, a dual-reasoning model that shifts reasoning into latent computational space. Within each layer, ChainGPT employs multi-substep state updates combined with state-guided sparse attention, enabling deep local computation and efficient long-range modeling without quadratic costs. Across layers, a recurrent-depth approach iteratively refines latent states, supported by adaptive training and stopping strategies that balance reasoning depth against computational budget. Theoretically, we show that ChainGPT can, in principle, simulate general computation, and empirically it delivers consistent improvements over comparable models, including on reasoning tasks that remain challenging for existing systems. By unifying efficiency and reasoning ability, ChainGPT provides a principled foundation for next-generation language models.
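Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large language model reasoning #self-supervised RL
TL;DR:We propose Co-rewarding, a novel self-supervised RL framework that improves training stability for large language model reasoning.
🎯 Motivation: Reinforcement learning with verifiable rewards (RLVR) effectively improves the reasoning ability of large language models, but its reliance on human annotation limits scaling to complex tasks. Recent label-free self-rewarding methods are promising but suffer from training collapse, which limits performance gains.
❓ Problem: Self-rewarding methods are prone to reward hacking and unstable training because their supervision signal comes from a single view. The paper proposes a reasoning framework for LLMs that trains stably and overcomes the single-view self-consistent illusion.
🔍 Analysis: Training collapse in existing self-rewarding methods is attributed to single-view signals that readily lead to reward hacking, pushing the model toward trivial, ineffective reasoning solutions.
🛠️ Method: Proposes Co-rewarding, a self-supervised RL framework that obtains complementary supervision either from contrastive agreement across semantically analogous questions (Co-rewarding-I) or from self-distillation with pseudo labels from a reference teacher (Co-rewarding-II), and also explores combining the two.
📊 Data and experiments: On multiple mathematical reasoning benchmarks, Co-rewarding trains markedly more stably and outperforms other self-rewarding methods by 3.31% on average, reaching +7.49% on Llama-3.2-3B-Instruct; in several cases it surpasses RLVR trained with human labels.
⭐ Contributions: A self-supervised RL framework with strong training stability that improves LLM reasoning, can partly replace human annotation, outperforms existing methods, and comes with released code.
Full abstract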
While reinforcement learning with verifiable rewards (RLVR) is effective at improving the reasoning ability of large language models (LLMs), its reliance on human-annotated labels creates a scaling dilemma, especially for complex tasks. Recent self-rewarding methods investigate a label-free alternative to unlock the reasoning capabilities of LLMs, yet they frequently suffer from training collapse, as a single-view supervision signal easily forms a self-consistent illusion that leads to reward hacking. Inspired by the success of self-supervised learning, we propose \textit{Co-rewarding}, a novel self-supervised RL framework that improves training stability by seeking complementary supervision from other views. Specifically, we instantiate Co-rewarding in two ways: (1) \textit{Co-rewarding-I} is a data-side instantiation that derives reward signals from contrastive agreement across semantically analogous questions; and (2) \textit{Co-rewarding-II} is a model-side instantiation that maintains a slowly-updated reference teacher with pseudo labels to realize self-distillation. Intuitively, such instantiations introduce different levels of discrepancy that make it harder for training to collapse onto trivial reasoning solutions. We also explore their orthogonally combined version to further boost the performance. Empirically, Co-rewarding exhibits stable training across various setups, and outperforms other self-rewarding baselines by $+3.31\%$ on average on multiple mathematical reasoning benchmarks, and by $+7.49\%$ on Llama-3.2-3B-Instruct. Notably, Co-rewarding reaches or even surpasses RLVR with ground-truth (GT) labels in several cases, such as a Pass@$1$ of $94.01\%$ on GSM8K with Qwen3-8B-Base, remarkably higher than with GT labels. Our code is released at https://github.com/tmlr-group/Co-rewarding.
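Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Long CoT Distillation #Scientific Reasoning #Evolutionary Algorithm
TL;DR:We propose CoT-Evo, an evolutionary distillation framework that constructs diverse CoTs from multiple LLMs and iteratively refines them into high-quality CoTs for scientific reasoning.
🎯 Motivation: Existing methods for distilling chains of thought (CoT) from large language models (LLMs) perform poorly in scientific reasoning, where high complexity and specialized knowledge lead models to produce low-quality reasoning data.
❓ Problem: Distilling directly from low-quality LLM outputs limits the performance of small student models. The paper proposes a framework for generating high-quality CoTs for scientific reasoning.
🔍 Analysis: On scientific tasks, existing LLMs often produce incorrect or superficial reasoning and cannot adequately handle complex specialized domains; conventional distillation does little to improve CoT quality.
🛠️ Method: Proposes CoT-Evo, which generates diverse reasoning trajectories from multiple LLMs, enriches them with domain knowledge, and iteratively improves them with an evolutionary strategy of novelty-driven selection, reflective recombination, and mutation.
📊 Data and experiments: A compact model fine-tuned on the evolved CoT dataset achieves state-of-the-art performance on scientific reasoning benchmarks.
⭐ Contributions: A new approach that synthesizes high-fidelity scientific reasoning data by integrating diverse LLM outputs with evolutionary refinement, supplying high-quality data for scientific reasoning and improving model performance.
Full abstract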
While chain-of-thought (CoT) distillation from advanced large language models (LLMs) has proven effective in general reasoning tasks, it struggles in scientific domains where even advanced models often produce incorrect or superficial reasoning due to high complexity and specialized knowledge requirements. Directly distilling from such flawed outputs results in low-quality training data and limits the performance of smaller student models. To overcome this, we propose CoT-Evo, an evolutionary CoT distillation framework. It begins by constructing a diverse pool of reasoning trajectories from multiple LLM thinkers, enriches them with automatically retrieved domain knowledge, and iteratively refines the trajectories using novelty-driven selection, reflective recombination and mutation. The refinement is guided by a fitness function that evaluates answer correctness, coherence, and effective knowledge utilization. This results in a high-quality CoT dataset tailored for scientific reasoning. We employ this evolved dataset to fine-tune a compact model, which achieves state-of-the-art performance on scientific reasoning benchmarks. Our work establishes a scalable approach to synthesizing high-fidelity scientific reasoning data from diverse and fallible LLMs.
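Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Process Reward Models
TL;DR:We shift the training of Process Reward Models from verifying domain-specific correctness to modeling domain-agnostic contextual coherence, achieving state-of-the-art multi-domain generalization.
🎯 Motivation: Existing process reward models (PRMs) excel in mathematics but generalize poorly across domains, mainly because domain-specific data is scarce and their learning patterns depend on in-domain knowledge.
❓ Problem: Address the limited multi-domain generalization of PRMs by shifting the learning objective from verifying domain-specific correctness to modeling domain-agnostic contextual coherence.
🔍 Analysis: Conventional PRMs focus on mathematical reasoning and yield limited gains in other domains, which points to a strong dependence on in-domain data and under-use of contextual logic.
🛠️ Method: Proposes a novel data annotation and training framework built on contextual coherence between chain-of-thought steps, resulting in ContextPRM, which generalizes better across domains.
📊 Data and experiments: Across nine non-mathematical domains in MMLU-Pro, ContextPRM improves accuracy by 6.5% over the majority voting baseline, well above the 2.2% from VersaPRM and the 0.5% from other mathematics-focused models.
⭐ Contributions: The first use of contextual coherence to improve the cross-domain ability of PRMs, with consistent performance across mathematical and non-mathematical domains and state-of-the-art multi-domain test-time scaling.
Full abstract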
Process reward models (PRMs) have demonstrated significant efficacy in enhancing the mathematical reasoning capabilities of large language models (LLMs) by leveraging test-time scaling (TTS). However, while most PRMs exhibit substantial gains in mathematical domains, the scarcity of domain-specific training data and knowledge-based learning patterns limits their generalization ability when faced with other domains. To address this limitation, we shift the learning objective from verifying domain-specific knowledge to modeling domain-agnostic logical flow. Centering on \textit{contextual coherence} between chain-of-thought (CoT) steps, our approach is realized through a novel data annotation and training framework, which enhances the model's generalization capabilities across diverse domains. For instance, our resulting model, \textbf{ContextPRM}, achieves a notable 6.5\% average accuracy improvement over the majority voting baseline via weighted majority voting across nine non-mathematical domains in MMLU-Pro, including law, history, and philosophy, significantly surpassing the 2.2\% improvement from VersaPRM and 0.5\% gains from other mathematics-focused PRMs, demonstrating consistent performance across both mathematical and non-mathematical domains.
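The weighted majority voting used for test-time scaling above can be sketched in a few lines. This is a hedged illustration: the abstract does not say how step-level PRM scores are aggregated into one score per solution, so the example simply assumes each candidate already carries a single scalar score.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def weighted_majority_vote(candidates: List[Tuple[str, float]]) -> str:
    """Pick the final answer whose candidates have the largest total score.

    Each candidate is (final_answer, prm_score). The score is assumed to be
    one scalar per sampled solution, e.g. an aggregate of step-level scores.
    Ties go to the answer that appeared first.
    """
    totals: Dict[str, float] = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    return max(totals, key=totals.get)


# Plain majority voting would pick "A" (two votes), but the reward model
# trusts the single "B" solution more than both "A" solutions combined.
print(weighted_majority_vote([("A", 0.2), ("A", 0.3), ("B", 0.9)]))
```

Setting every score to the same constant recovers ordinary majority voting, which is the baseline the 6.5% improvement is measured against.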
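Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#chain-of-thought #latent space reasoning #parallel exploration #transformers #policy optimization #multi token sampling
TL;DR:We establish theoretical benefits of chain-of-thought with continuous tokens and introduce new supervision and policy optimization strategies.
🎯 Motivation: Language models generate chains of thought by discrete sampling. Despite its success, chain-of-thought with continuously-valued tokens (CoT2) is more expressive and suits logical reasoning tasks that require search.
❓ Problem: Provide new theoretical guarantees and algorithms for CoT2, covering parallel tracking of traces, diverse exploration, and inference efficiency.
🔍 Analysis: Shows that CoT2 can track multiple discrete traces in parallel and that the achievable parallelism and inference efficiency depend on the embedding dimension; experiments show that continuous supervision can outperform alternatives and that policy optimization improves performance further.
🛠️ Method: Gives a CoT2-based one-layer transformer construction, trains with continuous supervision by matching target distributions, and proposes a multi-token sampling strategy that controls the level of parallelism and enables policy optimization.
📊 Data and experiments: Uses logical reasoning tasks such as the combinatorial "subset sum problem"; experiments confirm that parallelism is governed by the embedding dimension and that the continuous supervision and policy optimization strategies are effective.
⭐ Contributions: 1. Establishes the theoretical benefits of CoT2 and quantifies its parallelism; 2. designs a CoT2-based transformer construction and supervision strategy; 3. introduces new sampling strategies and policy optimization to improve performance.
Full abstract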
Modern language models generate chain-of-thought traces by autoregressively sampling tokens from a finite vocabulary. While this discrete sampling has achieved remarkable success, conducting chain-of-thought with continuously-valued tokens (CoT2) offers a richer and more expressive alternative. Our work provides new theoretical guarantees and algorithms for CoT2, motivated by logical reasoning tasks that inherently require search capabilities. Theoretically, we establish how CoT2 facilitates the model to track multiple discrete traces in parallel; and quantify the level of achievable parallelism and its benefits for inference efficiency. We also provide a CoT2-based one-layer transformer construction that solves the combinatorial ``subset sum problem'' given a sufficient embedding dimension. These insights arise from a novel and effective supervision strategy where we match the language model outputs to the empirical token distributions of a set of target traces. Complementing this, we introduce sampling strategies that unlock policy optimization methods for CoT2. Our primary strategy samples and composes $K$ discrete tokens at each decoding step to control the level of parallelism.
Experiments confirm that (i) the optimal level of parallelism is governed by the embedding dimension, (ii) our continuous supervision strategy can outperform alternative methods, and (iii) policy optimization with CoT2 indeed improves the performance of the model beyond its initial discrete or continuous supervision.
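A rough sketch of the "sample and compose $K$ discrete tokens" step may help. The abstract does not state how the sampled tokens are composed, so averaging their embeddings is an assumption made here, and all names are hypothetical.

```python
import random
from typing import List, Sequence


def compose_continuous_token(
    probs: Sequence[float],
    embeddings: Sequence[Sequence[float]],
    k: int,
    rng: random.Random,
) -> List[float]:
    """Sample k token ids from probs and average their embeddings.

    With k = 1 this reduces to ordinary discrete sampling; larger k moves
    the result toward the expected embedding under probs, so the single
    continuous token can carry several candidate traces at once.
    """
    ids = rng.choices(range(len(probs)), weights=probs, k=k)
    dim = len(embeddings[0])
    return [sum(embeddings[i][d] for i in ids) / k for d in range(dim)]


# Toy vocabulary of three tokens with 2-dimensional embeddings.
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
token = compose_continuous_token([0.5, 0.5, 0.0], emb, k=4, rng=random.Random(0))
```

In this reading, $K$ acts as the knob for the level of parallelism: the more tokens are mixed into one step, the more traces the embedding has to represent, which is consistent with the finding that the optimal level is governed by the embedding dimension.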
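Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large Language Models #Reinforcement Learning #Natural Language Reasoning
🎯 Motivation: When large language models are trained with reinforcement learning, rewards are sparse and exploration is limited, which traps models in repetitive, inefficient reasoning patterns. The paper aims to improve the exploration strategy, with more diverse and better-performing reasoning as the key goal.
❓ Problem: Insufficient exploration caused by sparse outcome-based rewards in RL, particularly the tendency of models to get stuck in local optima on complex reasoning tasks.
🔍 Analysis: On language reasoning tasks, existing RL methods tend to make models generate the same inefficient reasoning paths repeatedly, making it hard to break out of fixed patterns and explore better solutions.
🛠️ Method: Proposes MERCI, which uses a lightweight Coin Flipping Network to estimate pseudo counts and the uncertainty of reasoning trajectories, converts them into an intrinsic reward that encourages exploration, and combines it with the existing task reward to optimize the policy.
📊 Data and experiments: Experiments on complex reasoning benchmarks integrate MERCI into advanced RL frameworks such as GRPO; the method improves reasoning quality and diversity and outperforms strong baselines.
⭐ Contributions: An intrinsic-reward exploration mechanism that markedly improves LLM reasoning performance, with evidence that exploration-targeted intrinsic motivation helps the policy escape inefficient local routines and discover better solutions.
Full abstract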
Reinforcement Learning (RL) has become a compelling way to strengthen the multi-step reasoning ability of Large Language Models (LLMs). However, prevalent RL paradigms still lean on sparse outcome-based rewards and limited exploration, which often drives LLMs toward repetitive and suboptimal reasoning patterns. In this paper, we study the central question of how to design exploration for LLM reasoning and introduce MERCI (Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards), a novel RL algorithm that augments policy optimization with a principled intrinsic reward. Building on the idea of count-based exploration, MERCI leverages a lightweight Coin Flipping Network (CFN) to estimate the pseudo count and, from it, the epistemic uncertainty over reasoning trajectories, and converts them into an intrinsic reward that values novelty while preserving the learning signal from task rewards. We integrate MERCI into advanced RL frameworks such as Group Relative Policy Optimization (GRPO). Experiments on complex reasoning benchmarks demonstrate that MERCI encourages richer and more varied chains of thought, significantly improves performance over strong baselines, and helps the policy escape local routines to discover better solutions. These results indicate that our targeted intrinsic motivation can make exploration reliable for language model reasoning.
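The count-based idea behind the intrinsic reward can be illustrated with the classic bonus $\beta / \sqrt{N}$, where $N$ is a (pseudo) visit count. This sketch assumes the pseudo count is already available; in MERCI it would come from the Coin Flipping Network, which is not reproduced here, and the coefficient `beta` and the additive combination are assumptions.

```python
import math


def intrinsic_bonus(pseudo_count: float, beta: float = 0.1) -> float:
    """Classic count-based exploration bonus: large for rarely seen states."""
    # Clamp to avoid dividing by zero for a never-visited trajectory.
    return beta / math.sqrt(max(pseudo_count, 1e-8))


def total_reward(task_reward: float, pseudo_count: float, beta: float = 0.1) -> float:
    """Task reward plus novelty bonus (additive combination is assumed)."""
    return task_reward + intrinsic_bonus(pseudo_count, beta)


# A failed but novel trajectory still receives some learning signal,
# while a failed and frequently repeated one receives almost none.
novel = total_reward(0.0, pseudo_count=1.0)
repeated = total_reward(0.0, pseudo_count=100.0)
```

Because the bonus shrinks as a reasoning pattern is revisited, repeated routines lose their extra reward over time and the task reward dominates again.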
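Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Retrieval-Augmented Generation #Large Language Models #Counterfactual Reasoning
🎯 Motivation: Retrieval-Augmented Generation (RAG) has advanced knowledge-intensive tasks but cannot distinguish causally decisive evidence from correlated yet misleading information, a weakness termed the Correlation Trap.
❓ Problem: Design a new framework capable of causal reasoning that overcomes the systematic failures existing systems suffer because of the Correlation Trap.
🔍 Analysis: Existing RAG systems mishandle large amounts of highly correlated but misleading information and lack the ability to single out causally decisive evidence, which makes their results unreliable.
🛠️ Method: Proposes Counterfactual RAG (CF-RAG), which systematically generates and evaluates counterfactual queries to identify causally relevant distinctions and introduces a parallel arbitration mechanism to reconcile conflicting evidence.
📊 Data and experiments: On several challenging benchmarks, CF-RAG substantially outperforms standard RAG models in robustness and performance while maintaining comparable efficiency.
⭐ Contributions: Proposes the CF-RAG framework, addresses the causal reasoning failures that the Correlation Trap causes in RAG, and achieves new state-of-the-art performance.
Full abstract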
While Retrieval-Augmented Generation (RAG) has advanced knowledge-intensive tasks, we identify a fundamental vulnerability: the Correlation Trap. Existing systems cannot distinguish causally decisive evidence from overwhelmingly correlated yet misleading information, leading to systematic failures. We introduce Counterfactual RAG (CF-RAG), a new framework that operationalizes causal reasoning to overcome this limitation. CF-RAG systematically generates and evaluates counterfactual queries to identify causally relevant distinctions, and employs a parallel arbitration mechanism to reconcile conflicting evidence without interference. On challenging benchmarks, CF-RAG substantially improves robustness against the Correlation Trap, achieving state-of-the-art performance while maintaining comparable efficiency to standard RAG models.
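Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#LLM #Reinforcement Learning #Post Training
🎯 Motivation: Language models post-trained with reinforcement learning show limited performance on difficult tasks because rewards are too sparse. Gradually increasing task difficulty helps models acquire reasoning ability progressively.
❓ Problem: RLVR alone is less effective on hard tasks; the paper proposes scheduling tasks from easy to hard to improve the reasoning ability of language models.
🔍 Analysis: Experiments show that emphasizing easy tasks alone tends to cause overfitting, whereas appropriately fading them out improves generalization to hard problems.
🛠️ Method: Proposes E2H Reasoner, an easy-to-hard task scheduling strategy, supported by theoretical analysis showing that task decomposition and staged learning reduce sample requirements.
📊 Data and experiments: Experiments across diverse datasets and models confirm that E2H Reasoner broadly improves reasoning ability.
⭐ Contributions: Combines curriculum learning with reinforcement learning, proves convergence and sample-complexity advantages, and empirically verifies substantial gains in language model reasoning.
Full abstract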
We aim to improve the reasoning capabilities of language models via reinforcement learning with verifiable rewards (RLVR). Recent RLVR post-trained models like DeepSeek-R1 have demonstrated reasoning abilities on mathematical and coding tasks. However, prior studies suggest that using RLVR alone to improve reasoning on inherently difficult tasks is less effective due to sparse rewards. Here, we draw inspiration from curriculum learning and propose to schedule tasks from easy to hard (E2H), allowing LLMs to build reasoning skills gradually. Our method is termed E2H Reasoner. Empirically, we observe that, although easy tasks are important initially, fading them out through appropriate scheduling is essential in preventing overfitting. Theoretically, we establish convergence guarantees for E2H Reasoner within an approximate policy iteration framework. We derive finite-sample complexity bounds and show that when tasks are appropriately decomposed and conditioned, learning through curriculum stages requires fewer total samples than direct learning. Experiments across diverse datasets and models demonstrate that E2H Reasoner substantially enhances LLM reasoning. Code is available at https://github.com/divelab/E2H-Reasoning.
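Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large language model #Reasoning #Testing time scaling
🎯 Motivation: Large reasoning models rely on reflection tokens during reasoning for complex problem solving, but how to optimize the frequency and placement of these tokens has not been studied in depth.
❓ Problem: Improve test-time performance by regulating the frequency and placement of reflection tokens, avoiding the degradation caused by using too many or too few.
🔍 Analysis: Experiments show that both over-use and under-use of reflection tokens degrade performance, a trade-off analogous to learning rate scheduling in optimization.
🛠️ Method: Proposes CyclicReflex, a cyclical reflection token scheduling method based on a bidirectional, position-dependent triangular waveform, which dynamically adjusts reflection token usage without additional training or compute cost.
📊 Data and experiments: Experiments on MATH500, AIME2024/2025, AMC2023, GPQA Diamond, and LiveCodeBench cover model sizes from 1.5B to 14B and consistently outperform baseline methods.
⭐ Contributions: Frames reflection token scheduling as a resource allocation problem, designs the efficient decoding strategy CyclicReflex, and validates its gains through extensive experiments.
Full abstract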
Large reasoning models (LRMs), such as OpenAI’s o1 and DeepSeek-R1, harness test-time scaling to perform multi-step reasoning for complex problem-solving. This reasoning process, executed before producing final answers, is often guided by special juncture tokens that prompt self-evaluative reflection. These transition markers and reflective cues are referred to as “reflection tokens” (e.g., “wait”, “but”, “alternatively”). In this work, we treat reflection tokens as a “resource” and introduce the problem of resource allocation, aimed at improving the test-time compute performance of LRMs by adaptively regulating the frequency and placement of reflection tokens. Through empirical analysis, we show that both excessive and insufficient use of reflection tokens, referred to as over-reflection and under-reflection, can degrade model performance. To better understand this trade-off, we draw an analogy between reflection token usage and learning rate scheduling in optimization. Building on this insight, we propose cyclical reflection token scheduling (termed CyclicReflex), a training-free decoding strategy that dynamically modulates reflection token logits with a bidirectional, position-dependent triangular waveform, incurring no additional computation cost. Experiments on MATH500, AIME2024/2025, AMC2023, GPQA Diamond and LiveCodeBench demonstrate that CyclicReflex consistently improves performance across model sizes (1.5B–14B), outperforming standard decoding and recent approaches such as TIP (thought switching penalty) and S1.
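The logit modulation described above can be sketched as a position-dependent bias added to reflection tokens only. The wave's phase, period, and amplitude below are assumptions chosen for illustration; the abstract states only that the waveform is triangular, bidirectional (it both raises and lowers the logits), and position-dependent.

```python
from typing import Dict, Set


def triangular_wave(position: int, period: int) -> float:
    """Triangle wave in [-1, 1]: -1 at the start of a cycle, +1 at mid-cycle."""
    phase = (position % period) / period
    return 1.0 - 4.0 * abs(phase - 0.5)


def adjust_logits(
    logits: Dict[str, float],
    position: int,
    reflection_tokens: Set[str],
    amplitude: float = 3.0,
    period: int = 8,
) -> Dict[str, float]:
    """Shift only the reflection-token logits by the current wave value."""
    bias = amplitude * triangular_wave(position, period)
    return {
        tok: (val + bias if tok in reflection_tokens else val)
        for tok, val in logits.items()
    }


# Mid-cycle the word "wait" is promoted; at the cycle start it is suppressed.
step = adjust_logits({"wait": 1.0, "the": 2.0}, position=4, reflection_tokens={"wait"})
```

The bias is a closed-form function of the decoding position, so applying it costs one addition per reflection token, which fits the "no additional computation cost" claim.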
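Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#LLMs #mathematical reasoning #directed acyclic graphs
TL;DR:We propose a new pipeline that models CoT on directed acyclic graphs (DAGs), introduce the concept of logical closeness, and precisely evaluate the mathematical reasoning ability of LLMs via the proposed DAG-MATH format.
🎯 Motivation: Large language models perform strongly on mathematical problems, but whether this success stems from search, rote procedures, or rule-consistent reasoning remains unclear, which calls for a closer analysis of their reasoning process.
❓ Problem: Model chain-of-thought (CoT) reasoning with directed acyclic graphs (DAGs) and provide an evaluation framework that precisely quantifies how well a model's output follows the logical trajectory.
🔍 Analysis: On standard math datasets, different language models exhibit significantly different reasoning trajectories; even when final-answer accuracy is similar, there are significant gaps in reasoning fidelity (rule-consistent derivation).
🛠️ Method: Builds a DAG-based framework for logical trajectories and introduces the logical closeness metric to measure how rule-consistent a model's reasoning is; designs the DAG-MATH format and a benchmark that guide models to produce well-structured CoT output.
📊 Data and experiments: Experiments cover several standard mathematical reasoning datasets and statistically compare reasoning fidelity and answer accuracy across multiple language models.
⭐ Contributions: A new evaluation framework and metric that combine CoT with DAGs and serve as a diagnostic tool for model reasoning, together with a publicly available DAG-MATH format and benchmark.
Full abstract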
Large Language Models (LLMs) demonstrate strong performance on mathematical problems when prompted with Chain-of-Thought (CoT), yet it remains unclear whether this success stems from search, rote procedures, or rule-consistent reasoning. To address this, we propose modeling CoT as a rule-based stochastic process over directed acyclic graphs (DAGs), where nodes represent intermediate derivation states and edges encode rule applications. Within this framework, we introduce \textbf{logical closeness}, a metric that quantifies how well a model’s CoT trajectory (i.e., the LLM's final output) adheres to the DAG structure, providing evaluation beyond classical PASS@$k$ metrics. Building on this, we introduce the \emph{DAG-MATH} CoT format and construct a benchmark that guides LLMs to generate CoT trajectories in this format, thereby enabling the evaluation of their reasoning ability under our framework. Across standard math reasoning datasets, our analysis uncovers statistically significant differences in reasoning fidelity among representative LLM families, even when PASS@$k$ is comparable, highlighting gaps between final-answer accuracy and rule-consistent derivation. Our framework provides a balance between free-form CoT and formal proof systems, offering actionable diagnostics for LLM reasoning evaluation. Our benchmark and code are available at https://github.com/YuanheZ/DAG-MATH.
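Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#GRPO #Advantage Vanishing #Reward Sparsity #Multimodal LLM #Difficulty-Adaptive
TL;DR:DIVA-GRPO dynamically adjusts problem difficulty and generates tailored variants to stabilize reward signals in GRPO, mitigating reward sparsity and advantage vanishing, improving both training efficiency and reasoning performance in multimodal LLMs.
🎯 Motivation: GRPO is widely used for reinforcement learning of multimodal large language models but faces two core problems, reward sparsity and advantage vanishing. Existing methods do not address the underlying issue that insufficient within-group reward variance yields unclear optimization signals.
❓ Problem: Proposes DIVA-GRPO, which stabilizes reward signals in GRPO training by dynamically adjusting the difficulty distribution of problems and generating difficulty-adapted variants, with the aim of mitigating reward sparsity and advantage vanishing and improving training efficiency and reasoning performance.
🔍 Analysis: Reward sparsity stems from the lack of positive feedback on difficult problems; advantage vanishing arises when within-group rewards are too consistent because a problem is too hard or too easy. Existing approaches (sample enhancement, selective sample utilization, and indirect reward design) suffer from poor control of the difficulty distribution, low data utilization, or biased optimization, respectively.
🛠️ Method: Dynamically assesses problem difficulty and, from a global perspective, samples difficulty-adapted variants for each problem. Advantages are computed within both local (single problem) and global (problem plus its variants) groups, with difficulty weighting and normalized scaling.
📊 Data and experiments: Extensive experiments on six reasoning benchmarks show that DIVA-GRPO outperforms existing methods in both training efficiency and reasoning performance.
⭐ Contributions: A difficulty-adaptive variant augmentation advantage framework that improves the reward signal distribution in GRPO at its root. Through dynamic difficulty adjustment and global-local advantage computation, it mitigates reward sparsity and advantage vanishing, reduces data waste, and improves training stability.
Full abstract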
Reinforcement learning (RL) with group relative policy optimization (GRPO) has become a widely adopted approach for enhancing the reasoning capabilities of multimodal large language models (MLLMs). While GRPO enables long-chain reasoning without a traditional critic model, it often suffers from sparse rewards, arising from the scarcity of positive feedback on difficult problems, and from advantage vanishing, which occurs when group-level rewards exhibit high consistency for problems that are too easy or too hard. Existing solutions fall into three categories: sample enhancement and expansion, which may aggravate advantage vanishing due to poor control of the difficulty distribution; selective sample utilization, which fails to fully leverage the value of all data; and indirect reward design, which may introduce biased optimization directions due to misalignment between reasoning and the final outcome. However, these approaches overlook a fundamental question: for a given problem, how can we ensure that the within-group reward distribution of responses exhibits enough variance to yield clear optimization signals for each response? To address these issues, we propose DIVA-GRPO, a difficulty-adaptive variant augmentation advantage method that dynamically adjusts the difficulty distribution of variants for each problem from a global perspective. Our method dynamically assesses problem difficulty, samples variants with appropriate difficulty levels, and computes advantages within both local and global (a problem and its variants) groups using difficulty-weighted and normalized scaling. This design alleviates reward sparsity and advantage vanishing, minimizes data waste, and improves training stability. Extensive experiments on six reasoning benchmarks demonstrate that DIVA-GRPO outperforms existing approaches in both training efficiency and reasoning performance.
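Advantage vanishing is easy to see numerically. The sketch below uses the common group-relative form, reward minus group mean divided by group standard deviation; the use of the population standard deviation and the small `eps` are assumptions, and the difficulty weighting of DIVA-GRPO itself is not reproduced.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize each reward against the other responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Problem that is too easy (or too hard): every response gets the same
# reward, so every advantage is zero and the prompt gives no gradient.
too_easy = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# Problem of suitable difficulty: mixed outcomes give clear +/- signals.
well_matched = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

This is why the method tries to keep the within-group reward distribution spread out: only groups with nonzero variance contribute an optimization signal.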
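Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large Reasoning Model #Efficient Reasoning
TL;DR:We propose a novel method which integrates an optimized positive data distribution under a KL regularization into a discriminative objective to encourage efficient reasoning with minimal effect on performance.
🎯 Motivation: Recent large reasoning models perform well but often overthink simple questions, which increases compute cost and response latency. Existing methods reward shorter reasoning but significantly degrade performance.
❓ Problem: Proposes DRPO, a framework that redesigns the reward mechanism to cut redundant reasoning without hurting performance.
🔍 Analysis: Existing methods penalize correct but long rollouts, which can make their advantages negative and thereby actively discourage valid reasoning.
🛠️ Method: DRPO decouples the reward signal of positive samples from negative samples, optimizes the positive-sample distribution under KL regularization, and integrates it into a discriminative objective to improve reasoning efficiency.
📊 Data and experiments: On mathematical reasoning tasks including GSM8k, the method greatly improves reasoning efficiency: with a 1.5B model it cuts reasoning length by 77% with only a 1.1% performance loss, well ahead of other methods.
⭐ Contributions: Proposes the DRPO framework, which balances reasoning efficiency against performance, and provides a general distribution-optimization formulation that extends to other preference rewards.
Full abstract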
Recent large reasoning models (LRMs) driven by reinforcement learning algorithms (e.g., GRPO) have achieved remarkable performance on challenging reasoning tasks. However, these models suffer from overthinking, generating unnecessarily long and redundant reasoning even for simple questions, which substantially increases computational cost and response latency. While existing methods incorporate length rewards into GRPO to promote concise reasoning, they incur significant performance degradation. We identify the root cause: when rewards for correct but long rollouts are penalized, GRPO's group-relative advantage function can assign them negative advantages, actively discouraging valid reasoning. To overcome this, we propose Decoupled Reward Policy Optimization (DRPO), a novel framework that decouples the length-based learning signal of correct rollouts from incorrect ones. DRPO ensures that reward signals for correct rollouts are normalized solely within the positive group, shielding them from interference by negative samples. DRPO's objective is grounded in integrating an optimized positive data distribution, which maximizes length-based rewards under a KL regularization, into a discriminative objective. We derive a closed-form solution for this distribution, enabling efficient computation of the objective and its gradients using only on-policy data and importance weighting. Of independent interest, this formulation is general and can incorporate other preference rewards of positive data beyond length. Experiments on mathematical reasoning tasks demonstrate DRPO's significant superiority over six efficient reasoning baselines. Notably, with a 1.5B model, our method achieves 77\% length reduction with only 1.1\% performance loss on simple questions such as those in the GSM8k dataset, while the follow-up baseline sacrifices 4.3\% for 68\% length reduction.
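A small numeric sketch makes the root cause concrete and contrasts it with normalizing inside the positive group. The reward values are made up for illustration. The second function uses the standard closed form of KL-regularized reward maximization, $q(o) \propto \pi(o)\exp(r(o)/\lambda)$, which on samples drawn from $\pi$ becomes softmax weights; treating this as the shape of DRPO's closed-form solution is an assumption, and the paper's exact objective may differ.

```python
import math
from statistics import mean, pstdev
from typing import List


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantage over all rollouts, correct and incorrect."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def positive_group_weights(length_rewards: List[float], lam: float = 1.0) -> List[float]:
    """Softmax weights over correct rollouts only (assumed closed form)."""
    top = max(length_rewards)  # subtract the max for numerical stability
    unnorm = [math.exp((r - top) / lam) for r in length_rewards]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Rollouts: correct-short, correct-medium, correct-long, incorrect.
# Combined reward = correctness minus a hypothetical length penalty.
adv = grpo_advantages([1.0, 0.9, 0.2, 0.0])
# adv[2] is negative: the correct but long rollout is pushed down.

# Length rewards of the three correct rollouts, normalized among themselves.
weights = positive_group_weights([0.0, -0.1, -0.8])
# Every weight is positive: long correct rollouts are down-weighted
# relative to short ones, but never treated as negative examples.
```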
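Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#retrieval-augmented generation #adaptive retrieve
🎯 Motivation: Large language models reason well, but limits on the timeliness, accuracy, and comprehensiveness of their parametric knowledge cause severe factual hallucinations in practice. Combining retrieval-augmented generation with reasoning remains challenging.
❓ Problem: Existing methods decompose tasks ineffectively and retrieve redundantly, which introduces noise and degrades response quality, so a framework for reasonable, adaptive retrieval is needed.
🔍 Analysis: Conventional retrieval-augmented generation methods often cannot decide dynamically when external knowledge is needed, which wastes resources and lowers answer quality.
🛠️ Method: Proposes DeepRAG, which models retrieval-augmented reasoning as a Markov Decision Process: it decomposes queries step by step and decides at each step whether to retrieve external knowledge or rely on parametric reasoning.
📊 Data and experiments: Experiments show that DeepRAG improves retrieval efficiency and raises answer accuracy by 25.41%.
⭐ Contributions: A new framework that combines retrieval and reasoning effectively, substantially improves retrieval efficiency and answer accuracy, and demonstrates the potential of dynamic interaction between parametric reasoning and knowledge retrieval.
Full abstract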
Large Language Models (LLMs) have shown remarkable reasoning capabilities, while their practical applications are limited by severe factual hallucinations due to limitations in the timeliness, accuracy, and comprehensiveness of their parametric knowledge. Meanwhile, enhancing retrieval-augmented generation (RAG) with reasoning remains challenging due to ineffective task decomposition and redundant retrieval, which can introduce noise and degrade response quality. In this paper, we propose DeepRAG, a framework that models retrieval-augmented reasoning as a Markov Decision Process (MDP), enabling reasonable and adaptive retrieval. By iteratively decomposing queries, DeepRAG dynamically determines whether to retrieve external knowledge or rely on parametric reasoning at each step. Experiments show that DeepRAG improves retrieval efficiency and boosts answer accuracy by 25.41%, demonstrating its effectiveness in enhancing retrieval-augmented reasoning.
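Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#MCTS #RLVR
🎯 Motivation: Reinforcement learning with verifiable rewards (RLVR) is widely used for reasoning in language models, but a training efficiency bottleneck limits further gains, which motivates better exploration.
❓ Problem: Exploration in existing methods is sparse, so models miss critical reasoning paths and training gains diminish as more computation is spent.
🔍 Analysis: Current RLVR methods rely on limited rollouts, which gives insufficient coverage of the solution space and coarse credit assignment across reasoning steps, reducing training returns.
🛠️ Method: Proposes DeepSearch, which integrates Monte Carlo Tree Search (MCTS) into training and improves exploration efficiency with a global frontier selection strategy, entropy-based path selection, and adaptive replay buffer training with solution caching.
📊 Data and experiments: On mathematical reasoning benchmarks, DeepSearch reaches an average accuracy of 62.95% while using 5.7x fewer GPU hours than extended training approaches.
⭐ Contributions: Shows that systematic search improves model reasoning, addresses the exploration bottleneck of RLVR during training, and sets a new reference point for accuracy and compute efficiency.
Full abstract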
Although Reinforcement Learning with Verifiable Rewards (RLVR) has become an essential component for developing advanced reasoning skills in language models, contemporary studies have documented training plateaus after thousands of optimization steps, i.e., notable decreases in performance gains despite increased computational investment. This limitation stems from the sparse exploration patterns inherent in current RLVR practices, where models rely on limited rollouts that often miss critical reasoning paths and fail to provide systematic coverage of the solution space. We present DeepSearch, a framework that integrates Monte Carlo Tree Search (MCTS) directly into RLVR training. In contrast to existing methods that rely on tree search only at inference, DeepSearch embeds structured search into the training loop, enabling systematic exploration and fine-grained credit assignment across reasoning steps. Through training-time exploration, DeepSearch addresses the fundamental bottleneck of insufficient exploration, which leads to diminishing performance gains over prolonged training. Our contributions include: (1) a global frontier selection strategy that prioritizes promising nodes across the search tree, (2) selection with entropy-based guidance that identifies confident paths for supervision, and (3) adaptive replay buffer training with solution caching for efficiency. Experiments on mathematical reasoning benchmarks show that DeepSearch achieves an average accuracy of 62.95% and establishes a new state-of-the-art reasoning model, while using 5.7x fewer GPU hours than extended training approaches. These results highlight the importance of strategic exploration over brute-force scaling and demonstrate the promise of algorithmic innovation for advancing RLVR methodologies. DeepSearch establishes a new direction for scaling reasoning capabilities through systematic search rather than prolonged computation.
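Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#LLM reasoning #Reinforcement learning with verifiable rewards #efficient exploration #diversity
TL;DR:We promote global sentence-level diversity to incentivize deep exploration for versatile LLM reasoning.
🎯 Motivation: Because of vast state spaces and sparse rewards in reasoning tasks, existing large language model (LLM) methods explore inefficiently and perform poorly.
❓ Problem: Proposes a framework that promotes global sequence-level diversity to incentivize deep exploration and strengthen versatile reasoning.
🔍 Analysis: An empirical study finds a strong positive correlation between global diversity and reasoning capacity, which motivates the exploration design.
🛠️ Method: Introduces global diversity as an intrinsic reward, adds a potential-based reward shaping mechanism, and uses simple heuristics to mitigate reward hacking.
📊 Data and experiments: Experiments cover in-domain and out-of-domain tasks; the method outperforms competitive RLVR baselines with various exploration strategies on both Pass@1 and Pass@k.
⭐ Contributions: A diversity-incentivized framework that markedly improves the depth and efficiency of exploration in reasoning tasks and offers a new direction for combining reinforcement learning with large language models.
Full abstract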
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a crucial paradigm for incentivizing reasoning capabilities in Large Language Models (LLMs). Due to vast state-action spaces and reward sparsity in reasoning tasks, existing methods often struggle with deficient exploration and poor sample efficiency. In this paper, we propose **DIVER** (**D**iversity-**I**ncentivized Exploration for **V**ersatil**E** **R**easoning), an innovative framework that highlights the pivotal role of global sequence-level diversity to incentivize deep exploration for versatile reasoning. We first conduct a primary empirical study to reveal a strong positive correlation between global diversity and reasoning capacity. Building on this insight, we introduce global diversity incentives as an intrinsic reward to promote deep exploration in a semantically structured space. Incorporating the intrinsic reward, we develop a potential-based reward shaping mechanism to preserve optimal policy invariance and design simple heuristics to mitigate possible reward hacking. Experimental results show that DIVER outperforms competitive RLVR baselines with various exploration strategies on both in-domain and out-of-domain tasks, excelling in both Pass@1 and Pass@k evaluations. Our code is available at https://github.com/NJU-RL/DIVER.
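Potential-based reward shaping, which the abstract uses to preserve optimal policy invariance, adds $\gamma\,\Phi(s_{t+1}) - \Phi(s_t)$ to each reward. The sketch below shows why this leaves the ranking of trajectories unchanged: with $\gamma = 1$ the shaping terms telescope, so the total changes only by $\Phi(s_T) - \Phi(s_0)$. The potential values are placeholders; how DIVER derives $\Phi$ from diversity is not specified in the abstract.

```python
from typing import List


def shaped_rewards(
    rewards: List[float], potentials: List[float], gamma: float = 1.0
) -> List[float]:
    """Apply r_t + gamma * Phi(s_{t+1}) - Phi(s_t) along one trajectory.

    potentials must hold one value per state, i.e. len(rewards) + 1 entries.
    """
    if len(potentials) != len(rewards) + 1:
        raise ValueError("need one potential per state (len(rewards) + 1)")
    return [
        r + gamma * potentials[t + 1] - potentials[t]
        for t, r in enumerate(rewards)
    ]


# Sparse task reward: only the final step is rewarded.
rewards = [0.0, 0.0, 1.0]
# Placeholder potentials for the four states visited (hypothetical values).
potentials = [0.5, 0.25, 0.75, 0.0]
dense = shaped_rewards(rewards, potentials)
# dense gives per-step feedback, yet its sum differs from the task return
# only by Phi(s_T) - Phi(s_0), a term the policy cannot influence when all
# trajectories start from the same state and Phi is zero at terminal states.
```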
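Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large Language Models #Evaluation Metrics #Bayesian Methods #Uncertainty Quantification #Model Ranking #Reasoning #Statistical Significance
TL;DR:We propose a Bayesian evaluation framework that replaces Pass@k with stable, uncertainty-aware metrics for reliable and compute efficient LLM evaluation.
🎯 Motivation: Pass@$k$ is widely used in LLM reasoning evaluation, but it is unstable and can yield misleading model rankings, especially with limited samples or constrained compute.
❓ Problem: Proposes a Bayesian evaluation framework that replaces Pass@$k$ and average accuracy with posterior success probabilities and credible intervals, addressing ranking stability and compute efficiency.
🔍 Analysis: Pass@$k$ and average accuracy do not quantify model uncertainty, so results fluctuate at small sample sizes and lack support for statistical significance.
🛠️ Method: Models categorical outcomes with a Dirichlet prior, ranks models by posterior mean and uncertainty, and supports weighted rubrics and the use of prior evidence.
📊 Data and experiments: In simulations and on AIME'24/'25, HMMT, and BrUMO, the framework converges faster and gives more stable rankings with fewer samples, and it can tell statistically meaningful differences from noise.
⭐ Contributions: A unified framework for binary and non-binary evaluation that replaces Pass@$k$ with a Bayesian posterior, makes evaluation uncertainty explicit, and markedly improves compute efficiency.
Full abstract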
Pass@$k$ is widely used to report performance for LLM reasoning, but it often yields unstable, misleading rankings, especially when the number of trials (samples) is limited and compute is constrained. We present a principled Bayesian evaluation framework that replaces Pass@$k$ and average accuracy over $N$ trials (avg@$N$) with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and a transparent decision rule for differences. Evaluation outcomes are modeled as categorical (not just 0/1) with a Dirichlet prior, giving closed-form expressions for the posterior mean and uncertainty of any weighted rubric and enabling the use of prior evidence when appropriate. Theoretically, under a uniform prior, the Bayesian posterior mean is order-equivalent to average accuracy (Pass@$1$), explaining its empirical robustness while adding principled uncertainty. Empirically, in simulations with known ground-truth success rates and on AIME'24/'25, HMMT, and BrUMO, the Bayesian/avg procedure achieves faster convergence and greater rank stability than Pass@$k$ and recent variants, enabling reliable comparisons at far smaller sample counts. The framework clarifies when observed gaps are statistically meaningful (non-overlapping credible intervals) versus noise, and it naturally extends to graded, rubric-based evaluations. Together, these results recommend replacing Pass@$k$ for LLM evaluation and ranking with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit. Source code is available at https://github.com/mohsenhariri/scorio.
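The closed-form posterior mentioned above follows from Dirichlet-multinomial conjugacy. With prior $\alpha$, observed counts $n$, and $A = \sum_k (\alpha_k + n_k)$, the posterior is $\mathrm{Dirichlet}(\alpha + n)$, and for rubric weights $w$ the score $\sum_k w_k p_k$ has mean $\sum_k w_k m_k$ with $m_k = (\alpha_k + n_k)/A$ and variance $\big(\sum_k w_k^2 m_k - (\sum_k w_k m_k)^2\big)/(A+1)$. The sketch below computes both; the function name and the uniform default prior are choices made for this example rather than the API of the released package.

```python
from typing import List, Optional, Sequence, Tuple


def posterior_score(
    counts: Sequence[float],
    weights: Sequence[float],
    prior: Optional[Sequence[float]] = None,
) -> Tuple[float, float]:
    """Posterior mean and variance of a weighted rubric score.

    counts[k] is the number of trials that ended in outcome category k,
    weights[k] is the score assigned to that category, and prior holds the
    Dirichlet concentration parameters (uniform, all ones, if omitted).
    """
    if prior is None:
        prior = [1.0] * len(counts)
    post: List[float] = [a + n for a, n in zip(prior, counts)]
    total = sum(post)
    probs = [a / total for a in post]
    mean = sum(w * p for w, p in zip(weights, probs))
    second_moment = sum(w * w * p for w, p in zip(weights, probs))
    variance = (second_moment - mean * mean) / (total + 1.0)
    return mean, variance


# Binary case: categories are (fail, pass), 7 passes out of 10 trials.
# With a uniform prior the posterior is Beta(8, 4), whose mean is 8/12.
mean_score, var_score = posterior_score(counts=[3, 7], weights=[0.0, 1.0])
```

With more than two categories and non-binary weights, the same function covers graded, rubric-based evaluation, which is the unification the abstract describes.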
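Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#LRM #reasoning #finetuning
TL;DR:LRMs often repeat the question before thinking. We formalize this Echo of Prompt via a probabilistic cost, the Echo Likelihood Gap, and show it refocuses attention. ED-SFT and Echoic Prompting exploit it for consistent gains on math reasoning.
🎯 Motivation: Large reasoning models (LRMs) often repeat the question in tasks such as mathematical reasoning and code generation. This behavior has been overlooked but is of theoretical interest. The paper studies it formally and proposes methods that use it to improve performance.
❓ Problem: Current approaches add generic thinking tokens or prompt the model to re-read the question, but they neither explain nor exploit the model's spontaneous repetition. The paper analyzes this phenomenon and harnesses it to improve reasoning.
🔍 Analysis: The model's spontaneous restatement of the question is termed the "Echo of Prompt" (EOP) and is formalized through the Echo Likelihood Gap, which links early repetition to downstream accuracy. Attention studies indicate that the phenomenon refocuses attention in middle layers.
🛠️ Method: Proposes Echo-Distilled SFT (supervised fine-tuning that instills an "echo-then-reason" pattern) and Echoic Prompting (training-free re-grounding of the model mid-trace), both informed by the theoretical analysis of EOP.
📊 Data and experiments: Evaluated on mathematical reasoning datasets including GSM8K and MathQA under identical compute budgets and decoding settings, both methods yield consistent gains over baselines.
⭐ Contributions: The first theoretical formalization of the Echo of Prompt and its probabilistic cost, two new methods that exploit it, a layer-wise attention analysis of the underlying mechanism, and consistent gains on several reasoning benchmarks.
Full abstract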
Test-time compute allocation in large reasoning models (LRMs) is widely used and has applications in mathematical problem solving, code synthesis, and planning. Recent work has addressed this problem by scaling self-consistency and parallel thinking, adding generic thinking tokens, and prompting models to re-read the question before answering. Unfortunately, these approaches either inject task-agnostic tokens or mandate heuristics that do not explain---and often ignore---the \emph{spontaneous} repetition that many LRMs exhibit at the head of their internal chains. In contrast, we analyze and harness the model's tendency to restate the question, which we term the \emph{Echo of Prompt (EOP)}, as a front-loaded, compute-shaping mechanism. We formalize its probabilistic cost by casting echo removal as rejection-based conditioning and defining the \emph{Echo Likelihood Gap} $\Delta\mathcal{L}$ as a computable proxy. This provides the missing theoretical link connecting early repetition to likelihood gains and downstream accuracy. However, it does not by itself specify how to exploit EOP. Consequently, we develop \emph{Echo-Distilled SFT (ED-SFT)} to instill an ``echo-then-reason'' pattern through supervised finetuning, and \emph{Echoic Prompting (EP)} to re-ground the model mid-trace without training. While promising, quantifying benefits beyond verbosity is non-trivial. Therefore, we conduct length- and suffix-controlled likelihood analyses together with layer-wise attention studies, showing that EOP increases answer-to-answer-prefix attention in middle layers, consistent with an \emph{attention refocusing} mechanism. We evaluate on GSM8K, MathQA, Hendrycks-MATH, AIME24, and MATH-500 under identical decoding settings and compute budgets, and find consistent gains over baselines.
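Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large Reasoning Models #Efficient Reasoning
🎯 Motivation: Large reasoning models reason well but tend to overthink or underthink, which wastes compute and hurts accuracy and limits their use in resource-constrained settings.
❓ Problem: Existing methods that mitigate overthinking can induce underthinking and reduce accuracy. The paper proposes a framework that balances reasoning efficiency and accuracy.
🔍 Analysis: Overthinking shows up as high confidence variance, and underthinking as consistent overconfidence. Both degrade output quality and efficiency, so the reasoning mode needs to be adjusted dynamically.
🛠️ Method: ReBalance, a training-free framework in which a dynamic control function modulates the strength and direction of a steering vector based on real-time confidence, pruning redundant reasoning and promoting exploration.
📊 Data and experiments: Experiments on four models ranging from 0.5B to 32B and nine benchmarks covering math reasoning, question answering, and coding show reduced output redundancy and improved accuracy.
⭐ Contributions: A general, training-free, plug-and-play strategy for improving the efficiency and accuracy of large reasoning models; code and models are to be released publicly.
Abstract: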
Large Reasoning Models (LRMs) have shown remarkable reasoning capabilities, yet they often suffer from overthinking, expending redundant computational steps on simple problems, or underthinking, failing to explore sufficient reasoning paths despite inherent capabilities. These issues lead to inefficiencies and potential inaccuracies, limiting practical deployment in resource-constrained settings. Existing methods to mitigate overthinking, such as suppressing reflective keywords or adjusting reasoning length, may inadvertently induce underthinking, compromising accuracy. Therefore, we propose \textsc{ReBalance}, a training-free framework that achieves efficient reasoning with balanced thinking. \textsc{ReBalance} leverages confidence as a continuous indicator of reasoning dynamics, identifying overthinking through high confidence variance and underthinking via consistent overconfidence. By aggregating hidden states from a small-scale dataset into reasoning mode prototypes, we compute a steering vector to guide LRMs’ reasoning trajectories. A dynamic control function modulates this vector’s strength and direction based on real-time confidence, pruning redundancy during overthinking, and promoting exploration during underthinking. Extensive experiments conducted on four models ranging from 0.5B to 32B, and across nine benchmarks in math reasoning, general question answering, and coding tasks demonstrate that \textsc{ReBalance} effectively reduces output redundancy while improving accuracy, offering a general, training-free, and plug-and-play strategy for efficient and robust LRM deployment. Code and models will be made publicly available.
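A rough sketch of how a confidence-driven control function of this kind might look. Everything concrete here is an assumption for illustration: the thresholds, the sign convention (the steering vector is taken to point toward more exploration), and the function names are hypothetical and are not taken from the ReBalance code.

```python
import statistics


def steering_coefficient(confidences, var_thresh=0.02, conf_thresh=0.9,
                         strength=1.0):
    """Map recent step confidences to a signed steering strength.

    High variance is read as overthinking  -> negative sign (prune).
    Steady overconfidence is underthinking -> positive sign (explore).
    Thresholds and signs are illustrative assumptions.
    """
    variance = statistics.pvariance(confidences)
    mean_conf = statistics.fmean(confidences)
    if variance > var_thresh:
        return -strength
    if mean_conf > conf_thresh:
        return strength
    return 0.0


def apply_steering(hidden, steering_vector, coeff):
    """Shift a hidden state along the steering vector: h' = h + coeff * v."""
    return [h + coeff * v for h, v in zip(hidden, steering_vector)]


# Example: fluctuating confidence triggers the pruning direction.
coeff = steering_coefficient([0.2, 0.9, 0.3, 0.95])
steered = apply_steering([1.0, 2.0], [0.5, -0.5], coeff)
```

A continuous control function (for example, one that scales the coefficient with the size of the variance) would be closer to the "continuous indicator" wording in the abstract; the step function above is only the simplest form of the idea.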
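Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#hierarchical reasoning #LLM #reinforcement learning
🎯 Motivation: Reinforcement learning markedly improves the complex reasoning ability of large language models, but the mechanisms behind this success remain unclear.
❓ Problem: Existing algorithms apply optimization pressure to all tokens indiscriminately, which dilutes the learning signal and limits progress; a more targeted credit assignment is needed.
🔍 Analysis: Phenomena such as "aha moments", "length-scaling", and entropy dynamics are hallmarks of an emergent reasoning hierarchy, similar to the separation of high-level strategic planning from low-level procedural execution in human cognition.
🛠️ Method: Hierarchy-Aware Credit Assignment (HICRA), an algorithm that concentrates optimization on high-impact planning tokens to improve learning efficiency.
📊 Data and experiments: Extensive experiments show that HICRA significantly outperforms strong baselines and shed light on how reasoning improves through strategic exploration.
⭐ Contributions: Insight into the emergence of a reasoning hierarchy, and HICRA, which addresses a core inefficiency of existing RL algorithms.
Abstract: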
Reinforcement Learning (RL) has proven highly effective at enhancing the complex reasoning abilities of Large Language Models (LLMs), yet the underlying mechanisms driving this success remain largely opaque. Our analysis reveals that puzzling phenomena like ``aha moments'', ``length-scaling'' and entropy dynamics are not disparate occurrences but hallmarks of an emergent reasoning hierarchy, akin to the separation of high-level strategic planning from low-level procedural execution in human cognition. We uncover a compelling two-phase dynamic: initially, a model is constrained by procedural correctness and must improve its low-level skills. The learning bottleneck then decisively shifts, with performance gains being driven by the exploration and mastery of high-level strategic planning. This insight exposes a core inefficiency in prevailing RL algorithms like GRPO, which apply optimization pressure agnostically and dilute the learning signal across all tokens. To address this, we propose Hierarchy-Aware Credit Assignment (HICRA), an algorithm that concentrates optimization efforts on high-impact planning tokens. Our extensive experiments validate that HICRA significantly outperforms strong baselines, and offer deep insights into how reasoning advances through the lens of strategic exploration.
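One way to picture "concentrating optimization on planning tokens" is to reshape per-token advantages with a planning mask. The sketch below is an assumption-laden illustration: the additive form `a + alpha * |a|`, the value of `alpha`, and the way planning tokens are flagged are all hypothetical choices for this example, not details given in the abstract.

```python
def hierarchy_aware_advantages(advantages, is_planning, alpha=0.2):
    """Boost the learning signal on tokens flagged as planning tokens.

    advantages  : per-token advantages (e.g. from a GRPO-style estimator)
    is_planning : per-token booleans, True for planning tokens
                  (how they are detected is outside this sketch)
    alpha       : extra weight on planning tokens (illustrative value)
    """
    shaped = []
    for adv, planning in zip(advantages, is_planning):
        if planning:
            # Amplify positive advantages, soften negative ones.
            shaped.append(adv + alpha * abs(adv))
        else:
            shaped.append(adv)
    return shaped


# Example: the first two tokens belong to a planning phrase.
shaped = hierarchy_aware_advantages([1.0, -1.0, 0.5], [True, True, False])
```

Execution tokens keep their original advantage, so the relative weight of the planning tokens in the policy gradient rises without changing the reward itself.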
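Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Chain-of-Thought reasoning #Direct Preference Optimization #Process supervision #Twisted Sequential Monte Carlo #Large language models
🎯 Motivation: Inference-time scaling improves reasoning by extending the chain of thought (CoT), but existing approaches rely on a single policy whose implicit plan tends to drift in long-horizon reasoning, especially for small models with limited capacity.
❓ Problem: To address plan drift in long-horizon reasoning, the paper proposes a multi-level reasoning framework that reformulates long-CoT generation as a two-level stochastic process.
🔍 Analysis: Outcome-reward reinforcement learning gives sparse and delayed feedback, which makes credit assignment over long trajectories difficult and hurts training stability and accuracy.
🛠️ Method: A high-level planner generates structured step descriptors, and a low-level executor produces the detailed reasoning conditioned on them. Training uses iterative Step-DPO, which relies on Twisted Sequential Monte Carlo (TSMC) for stepwise preference optimization.
📊 Data and experiments: Experiments on math, science, and logical reasoning benchmarks under a reduced data budget show that MLR is robust and outperforms baselines across multiple base models and tasks.
⭐ Contributions: A multi-level reasoning framework that mitigates CoT drift, improves stability and accuracy on complex reasoning tasks, and degrades more slowly on long-horizon reasoning.
Abstract: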
Inference-time scaling enhances a model’s reasoning by extending its chain-of-thought (CoT). However, existing approaches typically rely on a single policy trained with outcome-reward reinforcement learning (RL), which often suffers from long-horizon plan failures where the implicit plan drifts from valid strategies, especially for small LMs with limited capacity. To address this, we propose Multi-Level Reasoning (MLR), which reformulates long-CoT generation as a two-level stochastic process. A high-level planner generates structured step descriptors specifying both the reasoning mode and the semantic subgoal. The low-level executor then produces detailed reasoning conditioned on these descriptors, forming an alternating plan--execute loop. To maintain scalability, we adopt a minimal design where the base model serves as the low-level policy and a lightweight LoRA module implements the high-level policy. For training, we observe that outcome-reward RL provides sparse and delayed feedback for long trajectories (e.g., several thousand tokens), hindering credit assignment. We therefore introduce iterative Step-DPO, a process-level preference optimization scheme that leverages Twisted Sequential Monte Carlo (TSMC) to provide scalable stepwise supervision. This yields more effective training, improved stability, and higher accuracy. Extensive experiments on challenging math, science, and logical reasoning benchmarks show that, under the same reduced data budget (10% SFT and 5% preference relative to the DeepSeek-R1 distillation setup), MLR outperforms both SFT-based distillation and strong RL/preference-optimization baselines across multiple base models and tasks. Moreover, MLR exhibits slower performance degradation on long-horizon reasoning, demonstrating stronger robustness under extended CoT generation.
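Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Reinforcement Learning #Large Reasoning Model #Reinforcement Learning with Verifiable Rewards
TL;DR: We present a systematic study of what makes reasoning experiences valuable in zero RLVR and propose a framework that leverages these insights to exploit high-value experiences for efficient RLVR.
🎯 Motivation: Reinforcement learning from verifiable rewards (RLVR) improves the reasoning ability of large language models, but standard on-policy training discards rollout experiences after a single update, which is inefficient and unstable.
❓ Problem: The paper investigates what makes a reasoning experience valuable and proposes a framework that makes better use of experience to improve the efficiency and stability of RLVR.
🔍 Analysis: Rollout correctness and entropy are identified as effective indicators of experience value, and these characteristics shape the learning dynamics.
🛠️ Method: ExGRPO, a framework that organizes and prioritizes valuable experiences and uses a mixed-policy objective to balance exploration with experience exploitation.
📊 Data and experiments: On five backbone models with 1.5B to 8B parameters, ExGRPO gains on average +3.5 points on mathematical benchmarks and +7.6 points on general reasoning benchmarks over on-policy RLVR, and stabilizes training on both stronger and weaker models.
⭐ Contributions: The first systematic analysis of what makes reasoning experiences valuable, and an experience-management-driven RLVR framework for efficient and scalable reasoning training.
Abstract: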
Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. However, standard on-policy training discards rollout experiences after a single update, leading to computational inefficiency and instability. While prior work on RL has highlighted the benefits of reusing past experience, the role of experience characteristics in shaping learning dynamics of large reasoning models remains underexplored. In this paper, we are the first to investigate what makes a reasoning experience valuable and identify rollout correctness and entropy as effective indicators of experience value. Based on these insights, we propose ExGRPO (Experiential Group Relative Policy Optimization), a framework that organizes and prioritizes valuable experiences, and employs a mixed-policy objective to balance exploration with experience exploitation. Experiments on five backbone models (1.5B-8B parameters) show that ExGRPO consistently improves reasoning performance on mathematical/general benchmarks, with an average gain of +3.5/7.6 points over on-policy RLVR. Moreover, ExGRPO stabilizes training on both stronger and weaker models where on-policy methods fail. These results highlight principled experience management as a key ingredient for efficient and scalable RLVR.
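As an illustration of using correctness and entropy to prioritize experience, the sketch below keeps, for each question, the correct rollout with the lowest mean token entropy. The preference for low entropy, the `Rollout` record, and `pick_replay` are assumptions made for this example; the abstract states only that correctness and entropy are useful indicators, not how ExGRPO combines them.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    question_id: str
    token_entropies: list  # per-token entropy under the current policy
    correct: bool          # verifiable reward: did the answer check out?


def mean_entropy(rollout):
    return sum(rollout.token_entropies) / len(rollout.token_entropies)


def pick_replay(rollouts):
    """For each question, keep the correct rollout with the lowest mean entropy.

    Incorrect rollouts are dropped; questions with no correct rollout
    do not appear in the result.
    """
    best = {}
    for r in rollouts:
        if not r.correct:
            continue
        current = best.get(r.question_id)
        if current is None or mean_entropy(r) < mean_entropy(current):
            best[r.question_id] = r
    return best


buffer = [
    Rollout("q1", [0.5, 0.7], True),
    Rollout("q1", [0.1, 0.2], True),
    Rollout("q1", [0.0, 0.0], False),
    Rollout("q2", [1.0], False),
]
replay = pick_replay(buffer)
```

In a full training loop the selected rollouts would be mixed with fresh on-policy samples, which is the role the abstract assigns to the mixed-policy objective.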
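Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Counterfactual Reasoning #Large Language Models #Reinforcement Learning #Generalization
TL;DR: We introduce executable counterfactuals, a novel framework that operationalizes causal reasoning through code and math problems, and highlight the promise of RL for improving LLMs' counterfactual reasoning.
🎯 Motivation: Counterfactual reasoning is a hallmark of intelligence and is essential for advancing LLMs' causal understanding and their use in high-stakes domains such as scientific research and healthcare. Existing evaluations tend to skip the key abduction step, which leads to over-estimated performance.
❓ Problem: Existing evaluations of counterfactual reasoning skip abduction and effectively reduce to interventional reasoning. The paper develops a new framework that requires all three steps and uses it to improve models' counterfactual reasoning.
🔍 Analysis: Moving from interventional to full counterfactual reasoning, the accuracy of state-of-the-art models such as o4-mini and Claude-4-Sonnet drops by 25-40%. Supervised fine-tuning improves in-distribution performance but lowers accuracy on out-of-distribution tasks.
🛠️ Method: The executable counterfactuals framework operationalizes the three steps of counterfactual reasoning through code and math problems and supports scalable synthetic data generation. Reinforcement learning (RL) induces the core cognitive behaviors and improves generalization to out-of-distribution tasks.
📊 Data and experiments: The training set consists of counterfactual code problems with if-conditions. Out-of-distribution tests use other code structures (for example, while-loops) and counterfactual math word problems. RL improves accuracy on code by 1.5x-2x over the base model and also yields substantial gains on counterfactual math problems.
⭐ Contributions: A framework that covers all three steps of counterfactual reasoning, evidence for the effectiveness of RL combined with scalable data generation, and substantial accuracy gains on code and math problems.
Abstract: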
Counterfactual reasoning, a hallmark of intelligence, consists of three steps: inferring latent variables from observations (abduction), constructing alternative situations (interventions), and predicting the outcomes of the alternatives (prediction). This skill is essential for advancing LLMs' causal understanding and expanding their applications in high-stakes domains such as scientific research and healthcare. However, existing efforts in assessing LLMs' counterfactual reasoning capabilities tend to skip the abduction step, effectively reducing to interventional reasoning and leading to over-estimated LLM performance. To address this, we introduce executable counterfactuals, a novel framework that operationalizes causal reasoning through code and math problems. Our framework explicitly requires all three steps of counterfactual reasoning and enables scalable synthetic data creation with varying difficulty, creating a new frontier for evaluating and improving LLMs' reasoning. Our results reveal a substantial drop in accuracy (25-40%) from interventional to counterfactual reasoning for state-of-the-art models such as o4-mini and Claude-4-Sonnet. To address this gap, we construct a training set comprising counterfactual code problems with if-conditions and test on out-of-distribution code structures (e.g., while-loops); we also test whether a model trained on code would generalize to counterfactual math word problems. While Supervised Finetuning (SFT) on stronger models' reasoning traces improves in-distribution performance of Qwen models, it leads to a decrease in accuracy on out-of-distribution tasks such as counterfactual math problems. In contrast, reinforcement learning (RL) induces the core cognitive behaviors and generalizes to new distributions, yielding substantial accuracy gains over the base model on both code (improvement of 1.5x-2x) and counterfactual math problems. Analysis of the reasoning traces further reinforces these findings and highlights the promise of RL with scalable data generation for improving LLMs' counterfactual reasoning.
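The three steps named in the abstract can be made concrete with a toy executable counterfactual. This is an illustrative sketch only: `program`, the hidden variable, and the candidate range are invented for the example and are not drawn from the paper's benchmark.

```python
def program(x, hidden):
    """A small program with an if-condition and one latent (hidden) input."""
    if x > 3:
        return x + hidden
    return x * hidden


def abduct(x_obs, y_obs, candidates):
    """Abduction: find hidden values consistent with the observed run."""
    return [h for h in candidates if program(x_obs, h) == y_obs]


def counterfactual(x_obs, y_obs, x_new, candidates):
    """Intervention + prediction: rerun with x_new, keeping the abducted hidden value."""
    consistent = abduct(x_obs, y_obs, candidates)
    return {program(x_new, h) for h in consistent}


# Observed: program(5, hidden) returned 7, so hidden must be 2.
# Question: what would the output have been if x had been 2 instead?
answers = counterfactual(x_obs=5, y_obs=7, x_new=2, candidates=range(10))
```

A purely interventional question would hand the solver `hidden` directly; the counterfactual version forces it to recover `hidden` from the observed output first, which is the abduction step that the abstract says existing evaluations skip.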
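Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#large language models #reasoning potential #long chain of thought #reasoning pattern #challenging mathematical reasoning
TL;DR: We propose CoTP to select data rich in high-value reasoning patterns to greatly expand the reasoning potential of the 85A6B MoE foundational model, thus achieving a 9.58% improvement on AIME 2025&2024 and raising the upper bounds of RL by 7.81%.
🎯 Motivation: Recent progress in mathematical reasoning models has been driven by reinforcement learning, but long chain-of-thought data is used without targeting, and it is unclear which data improves reasoning most effectively.
❓ Problem: The paper defines reasoning potential as the inverse of the number of independent attempts required to answer a question correctly, and expands it with data enriched with high-value reasoning patterns.
🔍 Analysis: Long chain-of-thought data in mid-training substantially improves reasoning depth, but indiscriminate data selection limits how much reasoning ability improves.
🛠️ Method: Atomic reasoning patterns with commonality and inductive capability are abstracted from long CoT sequences to build a core reference set, and a dual-granularity algorithm selects high-value reasoning data that aligns with the core set.
📊 Data and experiments: Training the 85A6B MoE model on 10B tokens of high-value reasoning data improves performance on AIME 2024 and 2025 by 9.58% and raises the upper bound of downstream reinforcement learning performance by 7.81%.
⭐ Contributions: The first definition of reasoning potential, and a data selection method based on reasoning patterns for challenging mathematical reasoning.
Abstract: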
Recent progress in large reasoning models for challenging mathematical reasoning has been driven by reinforcement learning (RL).
Incorporating long chain-of-thought (CoT) data during mid-training has also been shown to substantially improve reasoning depth.
However, current approaches often utilize CoT data indiscriminately, leaving open the critical question of which data types most effectively enhance model reasoning capabilities.
In this paper, we define the foundation model's reasoning potential for the first time as the inverse of the number of independent attempts required to correctly answer the question, which is strongly correlated with the final model performance.
We then propose utilizing diverse data enriched with high-value reasoning patterns to expand the reasoning potential.
Specifically, we abstract atomic reasoning patterns from CoT sequences, characterized by commonality and inductive capabilities, and use them to construct a core reference set enriched with valuable reasoning patterns.
Furthermore, we propose a dual-granularity algorithm involving chains of reasoning patterns and token entropy, efficiently selecting high-value CoT data (CoTP) from the data pool that aligns with the core set, thereby training models to master reasoning effectively.
Only 10B-token CoTP data enables the 85A6B Mixture-of-Experts (MoE) model to improve by **9.58\%** on the challenging AIME 2024 and 2025, and to raise the upper bound of downstream RL performance by **7.81\%**.
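The definition of reasoning potential as the inverse of the number of independent attempts needed can be written down directly. The sketch below is one possible empirical estimator; scoring a question with no correct attempt as 0.0 and averaging over questions are assumptions of this example, not details from the abstract.

```python
def reasoning_potential(attempt_outcomes):
    """Inverse of the number of attempts needed to reach the first correct answer.

    attempt_outcomes: booleans for independent attempts, in sampling order.
    Returns 0.0 when no attempt is correct (a convention of this sketch).
    """
    for index, correct in enumerate(attempt_outcomes):
        if correct:
            return 1.0 / (index + 1)
    return 0.0


def mean_reasoning_potential(per_question_outcomes):
    """Average the per-question estimates over a benchmark."""
    scores = [reasoning_potential(o) for o in per_question_outcomes]
    return sum(scores) / len(scores)


# Example: solved on the 3rd try, solved on the 1st try, never solved.
benchmark = [[False, False, True], [True], [False, False]]
score = mean_reasoning_potential(benchmark)
```

If attempts are independent with per-attempt success probability p, the expected number of attempts until the first success is 1/p, so this quantity is closely tied to the per-attempt success rate.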
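Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Knowledge Graph #Large Language Models #Knowledge-enhanced reasoning #reinforcement learning
🎯 Motivation: The reasoning of large language models (LLMs) often suffers from hallucinations and missing facts; grounding answers in verifiable knowledge sources such as knowledge graphs is a promising remedy.
❓ Problem: Existing methods are limited by predefined rules or fixed demonstration paths, so their reasoning generalizes poorly to out-of-distribution knowledge graph problems.
🔍 Analysis: Conventional knowledge-graph-enhanced methods confine the reasoning patterns of LLMs, fail to expand the reasoning space, and overlook the possibility of exploring new paths.
🛠️ Method: The Explore-on-Graph framework uses reinforcement learning, rewarding the correctness of the final answer, to encourage the model to explore diverse paths on its own, and uses path information as an additional reward signal to refine exploration.
📊 Data and experiments: Extensive experiments on five knowledge graph question answering benchmarks show state-of-the-art performance, outperforming both open-source and closed-source LLMs.
⭐ Contributions: A new framework that encourages LLMs to explore knowledge graphs autonomously, improving out-of-distribution reasoning and question answering performance.
Abstract: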
The reasoning process of Large Language Models (LLMs) is often plagued by hallucinations and missing facts in question-answering tasks.
A promising solution is to ground LLMs' answers in verifiable knowledge sources, such as Knowledge Graphs (KGs). Prevailing KG-enhanced methods typically constrain LLM reasoning either by enforcing rules during generation or by imitating paths from a fixed set of demonstrations. However, they naturally confine the reasoning patterns of LLMs within the scope of prior experience or fine-tuning data, limiting their generalizability to out-of-distribution graph reasoning problems.
To tackle this problem, in this paper, we propose Explore-on-Graph (EoG), a novel framework that encourages LLMs to autonomously explore a more diverse reasoning space on KGs.
To incentivize exploration and discovery of novel reasoning paths, we propose to introduce reinforcement learning during training, whose reward is the correctness of the reasoning paths' final answers.
To enhance the efficiency and meaningfulness of the exploration, we propose to incorporate path information as additional reward signals to refine the exploration process and reduce futile efforts.
Extensive experiments on five KGQA benchmark datasets demonstrate that, to the best of our knowledge, our method achieves state-of-the-art performance, outperforming not only open-source but also closed-source LLMs.
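Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Efficient Reasoning #Attention Outlier #Reasoning
TL;DR: We propose FROST, an attention-aware method that identifies and removes reasoning outliers to prune uncritical reasoning paths, producing shorter and more reliable reasoning trajectories without sacrificing reasoning capacity.
🎯 Motivation: Existing reasoning methods are inefficient and are affected by redundant reasoning paths, so a new mechanism is needed to make reasoning more efficient.
❓ Problem: The paper proposes a method that uses attention weights to identify and remove reasoning outliers, shortening reasoning paths and improving reliability.
🔍 Analysis: Reasoning outliers introduce redundant and irrelevant paths into the reasoning process and reduce accuracy and efficiency.
🛠️ Method: An attention-driven mechanism that detects and removes reasoning outliers at the sentence level while keeping the critical reasoning paths.
📊 Data and experiments: Validated on four benchmarks with Phi-4-Reasoning and GPT-oss-20B, outperforming state-of-the-art methods such as TALE on both models.
⭐ Contributions: FROST reduces token usage by 69.68% on average and improves accuracy by 26.70% over the base model, and substantially improves attention outlier metrics.
Abstract: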
We propose **FROST**, an attention-aware method for efficient reasoning. Unlike traditional approaches, FROST leverages attention weights to prune uncritical reasoning paths, yielding shorter and more reliable reasoning trajectories. Methodologically, we introduce the concept of *reasoning outliers* and design an attention-based mechanism to remove them. Theoretically, FROST preserves and enhances the model’s reasoning capacity while eliminating outliers at the sentence level. Empirically, we validate FROST on four benchmarks using two strong reasoning models (Phi-4-Reasoning and GPT-oss-20B), outperforming state-of-the-art methods such as TALE and ThinkLess. Notably, FROST achieves an average **69.68%** reduction in token usage and a **26.70%** improvement in accuracy over the base model. Furthermore, in evaluations of attention outlier metrics, FROST reduces the maximum infinity norm $\|\mathbf{x}\|_{\infty}$ by **15.97%** and the average kurtosis by **91.09%** compared to the base model.
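The two outlier metrics quoted above are standard statistics and are easy to compute. The sketch below uses the Pearson (non-excess) definition of kurtosis; which variant the paper reports, and over which activations it is measured, is not stated in the abstract, so treat those choices as assumptions.

```python
def infinity_norm(x):
    """Largest absolute entry of the vector."""
    return max(abs(v) for v in x)


def kurtosis(x):
    """Pearson kurtosis: fourth central moment over squared variance.

    Heavy-tailed vectors (a few extreme entries) give large values.
    """
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m4 / (m2 * m2)


# A flat vector versus one dominated by a single outlier.
flat = [1.0, -1.0, 1.0, -1.0]
spiky = [0.0, 0.0, 0.0, 0.0, 10.0]
flat_stats = (infinity_norm(flat), kurtosis(flat))
spiky_stats = (infinity_norm(spiky), kurtosis(spiky))
```

Both numbers rise when a single entry dominates, which is why they serve as outlier indicators: lowering them means the values are spread more evenly.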
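Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Multimodal Large Language Model #Fine-Grained Visual Recognition #Reinforcement Learning
🎯 Motivation: Multimodal large language models (MLLMs) perform well on coarse-grained visual tasks but fall short on fine-grained visual recognition and need large amounts of annotated data. The work improves the training framework to strengthen recognition of fine-grained categories.
❓ Problem: Fine-grained visual recognition suffers from costly annotation, overfitting to seen sub-categories, and poor generalization to unseen sub-categories.
🔍 Analysis: General-purpose MLLMs are markedly weaker on fine-grained recognition than contrastive models built for discriminative tasks (such as CLIP) and adapt poorly to new sub-categories.
🛠️ Method: The Fine-R1 framework has two training stages: (1) Chain-of-Thought Supervised Fine-tuning on a high-quality fine-grained recognition reasoning dataset, and (2) Triplet Augmented Policy Optimization, which uses intra-class and inter-class augmentation to improve robustness and discriminative ability.
📊 Data and experiments: With only 4-shot training, Fine-R1 outperforms existing MLLMs and CLIP models on both seen and unseen fine-grained sub-categories. The code is open source.
⭐ Contributions: A new training framework that substantially improves MLLMs on fine-grained visual recognition, especially when data is scarce, and offers a practical route for knowledge-intensive domains.
Abstract: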
Any entity in the visual world can be hierarchically grouped based on shared characteristics and mapped to fine-grained sub-categories. While Multi-modal Large Language Models (MLLMs) achieve strong performance on coarse-grained visual tasks, they often struggle with Fine-Grained Visual Recognition (FGVR). Adapting general-purpose MLLMs to FGVR typically requires large amounts of annotated data, which is costly to obtain, leaving a substantial performance gap compared to contrastive CLIP models dedicated to discriminative tasks. Moreover, MLLMs tend to overfit to seen sub-categories and generalize poorly to unseen ones. To address these challenges, we propose Fine-R1, an MLLM tailored for FGVR through an R1-style training framework: (1) Chain-of-Thought Supervised Fine-tuning, where we construct a high-quality FGVR CoT dataset with rationales of "visual analysis, candidate sub-categories, comparison, and prediction", transitioning the model into a strong open-world classifier; and (2) Triplet Augmented Policy Optimization, where Intra-class Augmentation mixes trajectories from anchor and positive images within the same category to improve robustness to intra-class variance, while Inter-class Augmentation maximizes the response distinction conditioned on images across sub-categories to enhance discriminative ability. With only 4-shot training, Fine-R1 outperforms existing general MLLMs, reasoning MLLMs, and even contrastive CLIP models in identifying both seen and unseen sub-categories, showing promise in working in knowledge-intensive domains where gathering expert annotations for all sub-categories is arduous. Code is available at [https://github.com/PKU-ICST-MIPL/FineR1_ICLR2026](https://github.com/PKU-ICST-MIPL/FineR1_ICLR2026).
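Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large Language Models #Contextual Reasoning #Collaborative Inference
🎯 Motivation: Large language models (LLMs) excel at contextual reasoning, but their high computational cost limits use in resource-constrained settings. Small language models (SLMs) are computationally efficient but struggle with complex contexts because of limited parameter capacity and forgetting.
❓ Problem: Existing methods for enhancing SLMs depend on additional training and have inherent limitations. The paper proposes a training-free mechanism that improves SLM reasoning in complex contexts.
🔍 Analysis: SLMs underperform on contextual understanding mainly because of their limited parameter capacity and their difficulty in processing long or complex contexts, which lowers overall performance.
🛠️ Method: Navigation, a training-free framework that distills an LLM's contextual processing expertise into generalizable templates for SLMs and applies a three-stage process (Generation, Utilization, Update) to improve how SLMs handle information in complex contexts.
📊 Data and experiments: Across several contextual reasoning tasks, accuracy improves by 10.7% on average with a template count of no more than 2.1% of the dataset size, enabling models such as Qwen2.5-3B-Instruct to outperform GPT-3.5-Turbo.
⭐ Contributions: A training-free way to improve the contextual reasoning of SLMs, and a scalable database of navigation templates that improves SLM performance at low cost.
Abstract: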
Large language models (LLMs), such as OpenAI o1 and DeepSeek-R1, excel in contextual reasoning by leveraging extensive world knowledge and deep contextual understanding. However, their high computational costs limit deployment in resource-constrained settings. Conversely, small language models (SLMs) are more computationally efficient but often struggle with contextual reasoning due to limited parameter capacity and challenges like catastrophic forgetting. Existing enhancement methods for SLMs—such as knowledge distillation and data synthesis—still depend on additional training and face inherent limitations. To address this, we propose Navigation, a novel training-free framework that improves SLMs’ contextual reasoning by distilling LLM-derived contextual processing expertise into generalizable navigation templates. These templates, stored in a scalable Navigation database, guide SLMs through a three-stage process—Generation, Utilization, and Update—to locate and process critical information within complex contexts. Experiments demonstrate that our approach yields an average 10.7\% accuracy gain with a template count equivalent to no more than 2.1\% of the dataset size, enabling models such as Qwen2.5-3B-Instruct and Llama-3.2-3B-Instruct to outperform GPT-3.5-Turbo on diverse contextual reasoning tasks.
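Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#LVLMs #Video Reasoning #Reinforcement Learning
TL;DR: FrameThinker is a framework that equips LVLMs with iterative frame selection for long video reasoning, achieving state-of-the-art accuracy with fewer frames.
🎯 Motivation: Existing LVLMs rely on uniform frame sampling and static textual reasoning, which makes long video understanding inefficient and ill-suited to visually intensive tasks.
❓ Problem: The FrameThinker framework gives LVLMs the ability to select key frames iteratively, with the aim of processing fewer frames while improving reasoning accuracy.
🔍 Analysis: Long video reasoning requires perceiving visual content dynamically and making sequential decisions, but existing methods lack such an action space (for example, frame selection) and effective rewards to guide it.
🛠️ Method: Two-phase training: Supervised Fine-Tuning (SFT) first instills basic action capabilities, then Reinforcement Learning (RL) optimizes the decision policy, with a detailed study of the reward design for each action.
📊 Data and experiments: Validated on reasoning and long-video understanding benchmarks including Video-Holmes and LongVideo-Reason. The 7B model reaches 76.1% accuracy on LongVideo-Reason using an average of only 20.6 frames, surpassing existing methods.
⭐ Contributions: Introduces the concept of "thinking with long videos"; the iterative frame selection framework sharply reduces the number of processed frames while improving accuracy, and the 7B model achieves state-of-the-art performance with over 20x fewer frames.
Abstract: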
While Large Vision-Language Models (LVLMs) have achieved substantial progress in video understanding, their application to long video reasoning is hindered by uniform frame sampling and static textual reasoning, which are inefficient and struggle to handle visually intensive video tasks.
To overcome these challenges, in this paper, we introduce the concept of thinking with long videos and propose a novel framework FrameThinker. Within this framework, LVLMs are able to iteratively interrogate video content.
Developing such video reasoning capabilities in LVLMs presents notable challenges, particularly in adapting the model to new video actions (e.g. select frame), and designing reward functions to guide LVLMs to adopt the newly introduced action.
To solve these challenges, we propose a two-phase training strategy, first employing Supervised Fine-Tuning (SFT) to instill fundamental action capabilities, followed by Reinforcement Learning (RL) to optimize a strategic decision-making policy.
Notably, in this RL phase, we conduct an in-depth and comprehensive exploration of the reward design for each action and format reward.
Extensive experiments on reasoning benchmarks such as Video-Holmes and LongVideo-Reason, and on long-video understanding benchmarks such as LongVideoBench, MLVU, VideoMME, and LVBench, demonstrate that FrameThinker achieves a significant average improvement of +10.4\% over baselines while drastically reducing the number of processed frames.
Most notably, our 7B model, FrameThinker, establishes a new state-of-the-art on LongVideo-Reason, achieving 76.1\% accuracy using an average of only 20.6 frames. This not only outperforms the competitive LongVILA-R1 (72.0\%) but does so with over 20x fewer frames (vs. 512), demonstrating unparalleled efficiency and effectiveness.
Our code is available at https://github.com/lcqysl/FrameThinker
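Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Language Models #Autoregressive Image Generation #Chain-of-Thought
🎯 Motivation: Chain-of-Thought (CoT) combined with Reinforcement Learning (RL) improves text-to-image generation, but the interaction between CoT's exploration and RL's optimization, and its relation to uncertainty, is unclear.
❓ Problem: The paper clarifies how the expansion of the exploration space by CoT and its contraction by RL affect generation quality and stability, and addresses the uncertainty caused by high entropy.
🔍 Analysis: CoT expands the generative space, while RL contracts it toward high-reward regions. The mean and variance of image-token entropy are strongly negatively correlated with the final reward. The entropy of the textual CoT directly affects image quality, with lower-entropy text leading to better images.
🛠️ Method: Entropy-Guided Group Relative Policy Optimization (EG-GRPO) allocates the optimization budget by entropy level: low-entropy tokens are excluded from reward-driven updates, and high-entropy tokens receive an entropy bonus that encourages structured exploration.
📊 Data and experiments: Experiments on standard text-to-image benchmarks show that the method outperforms the current state of the art.
⭐ Contributions: A systematic entropy-based analysis, and a policy optimization method that builds entropy guidance into training for more stable, higher-quality image generation.
Abstract: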
Combining Chain-of-Thought (CoT) with Reinforcement Learning (RL) improves text-to-image (T2I) generation, yet the underlying interaction between CoT's exploration and RL's optimization remains unclear. We present a systematic entropy-based analysis that yields three key insights: (1) CoT expands the generative exploration space, while RL contracts it toward high-reward regions; (2) final reward is strongly negatively correlated with both the mean and variance of image-token entropy, highlighting the need to reduce uncertainty and instability; and (3) the entropy of the textual CoT directly governs downstream image quality, with lower-entropy CoTs leading to better generations. Motivated by these findings, we propose Entropy-Guided Group Relative Policy Optimization (EG-GRPO), a fine-tuning strategy that reallocates optimization budget by uncertainty: low-entropy tokens are excluded from reward-driven updates to preserve stability, while high-entropy tokens receive an entropy bonus that encourages structured exploration without collapse. Experiments on standard T2I benchmarks demonstrate that EG-GRPO achieves state-of-the-art performance.
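The entropy-based reallocation described above can be illustrated as a per-token rule on advantages. This is a sketch under stated assumptions: the two thresholds, the additive form of the bonus, and the function name are hypothetical, and the real EG-GRPO objective may differ in all three.

```python
def entropy_guided_advantages(advantages, entropies,
                              low=0.5, high=2.0, bonus=0.1):
    """Reallocate the update signal by token-level entropy.

    - entropy below `low`  : token is masked out of the reward-driven update
    - entropy above `high` : token receives an extra entropy bonus
    - otherwise            : advantage is left unchanged
    Thresholds and bonus weight are illustrative values.
    """
    shaped = []
    for adv, ent in zip(advantages, entropies):
        if ent < low:
            shaped.append(0.0)          # keep confident tokens stable
        elif ent > high:
            shaped.append(adv + bonus * ent)  # encourage exploration here
        else:
            shaped.append(adv)
    return shaped


# Example: one confident, one ordinary, and one uncertain token.
shaped = entropy_guided_advantages([1.0, 1.0, 1.0], [0.1, 1.0, 3.0])
```

Masking the low-entropy tokens is what the abstract describes as preserving stability, while the bonus on high-entropy tokens corresponds to encouraging exploration without collapse.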
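Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Lean4 #Reinforcement Learning #LLM
TL;DR: This paper proposes GAR, a comprehensive RL training framework for Lean4 prover training that enables more efficient and effective RL training by adversarial training of both statement composer and theorem prover.
🎯 Motivation: Solving math problems in verifiable formal languages such as Lean has had a significant impact on the research community, but existing reinforcement learning methods are inefficient and struggle with complex problems.
❓ Problem: Conventional methods rely on fixed problem sets, which limits training efficiency and the ability to tackle harder problems. The paper proposes a new framework that eases these limitations.
🔍 Analysis: Existing models are hard to train and leave limited room for improvement; a dynamic adversarial mechanism can improve the ability to prove advanced theorems.
🛠️ Method: The GAR framework jointly trains a problem composer and a theorem prover with generative adversarial reinforcement learning, and introduces an implicit curriculum learning mechanism that adjusts task difficulty.
📊 Data and experiments: On the MiniF2F-Test and ProofNet-Test benchmarks, GAR training improves pass@32; the training code is open source.
⭐ Contributions: A general RL training paradigm in which problem generation and problem solving co-evolve, advancing methods for formal theorem proving.
Abstract: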
Solving math problems through verifiable languages such as Lean has significantly impacted both the mathematics and computer science communities. Current state-of-the-art models are often trained with expensive online Reinforcement Learning (RL) or expert iteration. However, these approaches rely on fixed problem sets, which causes inefficient training and limits the model's ability to tackle complex problems. To overcome these limitations, we propose **GAR**: *Generative Adversarial Reinforcement learning*, a comprehensive RL training framework that jointly trains the problem composer and solver in an adversarial loop. **GAR** introduces an implicit curriculum learning mechanism, which aligns task difficulty with the prover's evolving capability. It thereby improves the training efficiency and enables stronger performance in proving advanced theorems. Experiments show that with **GAR** training, Goedel-Prover-V2-8B and DeepSeek-Prover-V2-7B achieve an average relative improvement in pass@32 of **4.20%** on the MiniF2F-Test benchmark, while DeepSeek-Prover-V2's pass@32 on ProofNet-Test increases from 22.58% to **25.81%**. Beyond formal proving, **GAR** establishes a general RL paradigm for co-evolution of problem generation and solving under verifiable environments. The training code for this paper is open-sourced at [https://github.com/RickySkywalker/GAR-Official](https://github.com/RickySkywalker/GAR-Official)
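Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#large language model #inference #scaling #reasoning #reinforcement learning #post-training #attention #tensors
TL;DR: To generalize and enhance parallel inference scaling for LLMs, we introduce Bridge, an architectural addition to LLMs that allows parallel generations for the same input to share information with each other throughout the decoding process.
🎯 Motivation: To improve the quality of LLM inference, the work explores how multiple responses generated in parallel can share information, since independent generation leaves that information unused.
❓ Problem: The paper removes the independence constraint in parallel response generation so that each response can use information from the others, improving quality and consistency.
🔍 Analysis: Parallel inference in existing LLMs treats each generated response as an independent unit and does not exploit information shared across generations, which wastes compute and leads to uneven quality.
🛠️ Method: Bridge, an architectural addition that treats the hidden states of a batch as a holistic tensor and uses a small number of extra parameters to connect parallel generations and share information among them.
📊 Data and experiments: In experiments with reinforcement learning with verifiable rewards, Bridge improves the relative mean accuracy gains by up to 39% and markedly improves the consistency of correct responses. Trained once, it scales to any generation width.
⭐ Contributions: A general mode of parallel scaling that is compatible with any post-generation aggregation technique and improves the quality of parallel generations with only a small number (2.8%-5.1%) of new parameters.
Abstract: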
Parallel LLM inference scaling involves sampling a set of $N>1$ responses for a single input prompt. However, these $N$ parallel responses tend to be generated independently from each other, partitioning compute resources and leaving potentially useful information in one generation untapped by others. This is in contrast to response length scaling where past computation is used in all future steps. For higher quality responses and response sets, we propose Bridge to generate interdependent responses in parallel by rethinking batched LLM hidden states as holistic tensors rather than independent slices. With only a small amount (2.8\%-5.1\%) of new parameters, Bridge improves the relative mean accuracy gains from reinforcement learning with verifiable rewards by up to 39\% and boosts consistency of correct responses. Trained once, Bridge scales to any generation width, all with greater performance than independent generations, unlocking a more general mode of parallel scaling that effectively leverages information between sequences, compatible with any post-generation aggregation technique.
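Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large language models #Reinforcement learning #Adversarial training
🎯 Motivation: Current large language models excel at mathematical reasoning but still commit process errors such as incorrect calculations, brittle logic, and superficially plausible but invalid steps. The work aims to improve the quality of model reasoning.
❓ Problem: The paper strengthens training to reduce process errors and improve the logical consistency of reasoning chains, raising sample efficiency and reasoning performance.
🔍 Analysis: Models often produce incomplete logic or incorrect derivations, and sparse rewards cannot point to the specific faulty steps, which degrades the quality of reasoning chains.
🛠️ Method: The Generative Adversarial Reasoner framework jointly trains a reasoner and an LLM-based discriminator through adversarial reinforcement learning, using slice-level evaluation and dense rewards to improve the reasoning process.
📊 Data and experiments: On several mathematical benchmarks the method delivers consistent gains; on AIME24 it improves on strong baselines by 7.3 to 10.0 points, and it supports flexible reward shaping for multiple objectives.
⭐ Contributions: A new adversarial reasoning framework that improves LLM reasoning quality; a dense reward mechanism that improves credit assignment and sample efficiency; and a modular discriminator that enables multi-objective reward shaping and flexible training.
Abstract: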
Large language models (LLMs) with explicit reasoning capabilities excel at mathematical reasoning yet still commit process errors, such as incorrect calculations, brittle logic, and superficially plausible but invalid steps. In this paper, we introduce Generative Adversarial Reasoner, an on-policy joint training framework designed to enhance reasoning by co-evolving an LLM reasoner and an LLM-based discriminator through adversarial reinforcement learning. A compute-efficient review schedule partitions each reasoning chain into logically complete slices of comparable length, and the discriminator evaluates each slice’s soundness with concise, structured justifications. Learning couples complementary signals: the LLM reasoner is rewarded for logically consistent steps that yield correct answers, while the discriminator earns rewards for correctly detecting errors or distinguishing traces in the reasoning process. This produces dense, well-calibrated, on-policy step-level rewards that supplement sparse exact-match signals, improving credit assignment, increasing sample efficiency, and enhancing overall reasoning quality of LLMs. Across various mathematical benchmarks, the method delivers consistent gains over strong baselines with standard RL post-training. Specifically, on AIME24, we improve DeepSeek-R1-Distill-Qwen-7B from 54.0 to 61.3 (+7.3) and DeepSeek-R1-Distill-Llama-8B from 43.7 to 53.7 (+10.0). The modular discriminator also enables flexible reward shaping for objectives such as teacher distillation, preference alignment, and mathematical proof-based reasoning.
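Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Theorem Proving #Reasoning
🎯 Motivation: Automated theorem proving, which generates proofs of formally stated math problems that pass automated verification, is a core challenge at the intersection of mathematics and AI. Existing approaches are limited in performance and efficiency.
❓ Problem: Introduces Goedel-Prover-V2, a family of two language models that improves automated theorem proving while remaining far smaller and cheaper to run than prior systems.
🔍 Analysis: Improvement plateaus when training only on human-written questions; combining self-correction with scaffolded data synthesis helps move past this bottleneck.
🛠️ Method: Adds three innovations on top of expert iteration and reinforcement learning: scaffolded data synthesis that generates questions of increasing difficulty, self-correction using Lean compiler feedback, and checkpoint averaging to balance accuracy and diversity.
📊 Data & Experiments: Evaluated on MiniF2F and PutnamBench. The 8B model reaches 84.6% pass@32 on MiniF2F; the 32B model reaches 88.1% (90.4% with self-correction) and solves 86 PutnamBench problems at pass@184.
⭐ Contributions: Sets a new state of the art in open-source automated theorem proving with much smaller models, and releases the models, code, and data to the community.
Abstract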
Automated theorem proving (ATP) --- the task of generating a proof that passes automated proof verification given a math question in formal language --- is a critical challenge at the intersection of mathematics and Artificial Intelligence (AI). We introduce Goedel-Prover-V2, a family of two language models that establish a new state-of-the-art (SOTA) in open-source ATP, using the Lean proof assistant. In addition to standard expert iteration and reinforcement learning, our approach incorporates three key innovations: (1) During training when improvement plateaus on human questions, the prover does scaffolded data synthesis to generate synthetic questions of increasing difficulty for its own training; (2) The prover is trained to self-correct using Lean compiler feedback; (3) Improved test-time exploration through checkpoint averaging to balance accuracy and diversity.
Our small model, Goedel-Prover-V2-8B, reaches 84.6\% pass@32 on MiniF2F and outperforms DeepSeek-Prover-V2-671B despite being $80\times$ smaller. Our flagship model, Goedel-Prover-V2-32B, achieves 88.1\% on MiniF2F at pass@32 in standard mode and 90.4\% in self-correction mode, outperforming prior SOTA by a large margin. Additionally, our flagship model solves 86 problems on PutnamBench at pass@184, securing first place among open-source models and surpassing DeepSeek-Prover-V2-671B's record of 47 problems by pass@1024 with about $20\times$ smaller model size and significantly lower compute budget. Our models, code, and data are released at \url{https://github.com/Goedel-LM/Goedel-Prover-V2}.
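The results above are reported as pass@k. The abstract does not say how the metric was computed, so the following is only a sketch of the widely used unbiased estimator, which draws $n \ge k$ samples per problem, counts the $c$ that verify, and computes $1 - \binom{n-c}{k} / \binom{n}{k}$. The function name and the example numbers are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled attempts with c successes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset of the samples contains at least one success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 64 proof attempts were sampled and 3 of them verified.
print(pass_at_k(n=64, c=3, k=32))
```

A benchmark-level score is then the mean of this quantity over all problems.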
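Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Mathematical Reasoning #Group Relative Policy Optimization #Question Reformulation
TL;DR:We propose a MathForge framework to improve mathematical reasoning by targeting harder questions from both algorithmic and data perspectives, including Difficulty-Aware Group Policy Optimization (DGPO) and Multi-Aspect Question Reformulation (MQR).
🎯 Motivation: Existing reinforcement learning methods for mathematical reasoning pay too little attention to hard questions, which limits how far model capability can improve. Better algorithms and data handling can help models cope with more difficult problems.
❓ Problem: On the algorithmic side, existing methods under-optimize hard questions because of an implicit imbalance; on the data side, augmentation does not systematically raise the difficulty of the questions themselves.
🔍 Analysis: The widely used Group Relative Policy Optimization (GRPO) produces smaller policy updates for harder questions, creating an implicit imbalance. Conventional data augmentation mainly rephrases questions and does not increase their intrinsic complexity.
🛠️ Method: Proposes MathForge, which combines Difficulty-Aware Group Policy Optimization (DGPO) and Multi-Aspect Question Reformulation (MQR). DGPO uses difficulty-balanced group advantage estimation and difficulty-aware question-level weighting; MQR reformulates questions along multiple aspects to increase difficulty while keeping the original gold answer.
📊 Data & Experiments: Experiments on multiple mathematical reasoning tasks show that MathForge significantly outperforms existing methods. The code and augmented data are open-sourced.
⭐ Contributions: Systematically identifies the under-emphasis on hard questions in RL for mathematical reasoning and addresses it with MathForge, which pairs an algorithmic fix with data augmentation.
Abstract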
Reinforcement Learning with Verifiable Rewards (RLVR) offers a robust mechanism for enhancing mathematical reasoning in large models.
However, we identify a systematic lack of emphasis on more challenging questions in existing methods from both algorithmic and data perspectives, despite their importance for refining underdeveloped capabilities.
Algorithmically, widely used Group Relative Policy Optimization (GRPO) suffers from an implicit imbalance where the magnitude of policy updates is lower for harder questions.
Data-wise, augmentation approaches primarily rephrase questions to enhance diversity without systematically increasing intrinsic difficulty.
To address these issues, we propose a two-dual MathForge framework to improve mathematical reasoning by targeting harder questions from both perspectives, which comprises a Difficulty-Aware Group Policy Optimization (DGPO) algorithm and a Multi-Aspect Question Reformulation (MQR) strategy.
Specifically, DGPO first rectifies the implicit imbalance in GRPO via difficulty-balanced group advantage estimation, and further prioritizes harder questions by difficulty-aware question-level weighting.
Meanwhile, MQR reformulates questions across multiple aspects to increase difficulty while maintaining the original gold answer.
Overall, MathForge forms a synergistic loop: MQR expands the data frontier, and DGPO effectively learns from the augmented data.
Extensive experiments show that MathForge significantly outperforms existing methods on various mathematical reasoning tasks.
The code and augmented data are all available at https://github.com/AMAP-ML/MathForge.
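To make the imbalance claim concrete, the sketch below computes standard GRPO group-relative advantages (rewards standardized within one question's group of samples) for binary verifiable rewards and sums their absolute values as a simple proxy for update magnitude. The proxy, the group size of 8, and the function names are assumptions for illustration; the abstract does not specify how DGPO measures or corrects the imbalance.

```python
import numpy as np


def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages: standardize rewards within one question's group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def total_update_magnitude(num_correct: int, group_size: int = 8) -> float:
    """Sum of |advantage| over a group with binary (0/1) verifiable rewards."""
    rewards = np.zeros(group_size)
    rewards[:num_correct] = 1.0
    return float(np.abs(grpo_advantages(rewards)).sum())


# Compare a hard question (1 of 8 correct) with medium and easy ones.
for num_correct in (1, 4, 7):
    print(num_correct, round(total_update_magnitude(num_correct), 3))
```

Under this proxy the total signal is largest when about half of the samples are correct and shrinks as accuracy moves toward 0, so a question the model rarely solves contributes less than one it solves half of the time.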
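Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Reinforcement Learning #Large Language Models #Mathematical Reasoning
🎯 Motivation: LLMs still struggle on tasks that require long-horizon reasoning and precise execution, with a clear bottleneck in complex mathematical reasoning.
❓ Problem: Address the loss of learning signal and the exploration stagnation caused by sparse rewards, in order to improve reasoning on complex tasks.
🔍 Analysis: Sparse rewards erase the contribution of almost-correct attempts (the near-miss problem), which keeps the model from moving past its current ability and discovering better solutions.
🛠️ Method: Proposes HiPO (Hint-guided Policy Optimization), which captures an occasional successful trajectory in a training batch and reuses its initial correct steps as an on-policy hint, turning a rare success into a dense contrastive learning signal.
📊 Data & Experiments: On five challenging mathematical reasoning benchmarks, HiPO improves the average avg@32 by 5.0 percentage points over the GRPO baseline, including +10.3 pp on CMIMC 2025, +4.9 pp on BRUMO 2025, +4.6 pp on AIME 2024, and +3.1 pp on AIME 2025.
⭐ Contributions: Introduces a new exploration paradigm that turns rare successes into reusable guidance, offering a more efficient and scalable way to strengthen complex reasoning under RLVR.
Abstract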
Reinforcement Learning from Verifiable Rewards (RLVR) is a promising method for enhancing the complex problem-solving abilities of large language models (LLMs). This is particularly evident in domains requiring long-horizon reasoning and precise execution, such as solving complex mathematical problems where solutions hinge on a fragile sequence of tool-based actions. However, current approaches are often crippled by two interconnected issues: the near-miss problem, where sparse rewards nullify the learning signal for almost-correct attempts, and the resulting exploration stagnation, which prevents the model from discovering better solutions. To address these challenges, we introduce HiPO (Hint-guided Policy Optimization), a novel RLVR framework that enables the agent to learn from its own rare successes. Our core insight is to capture an occasional successful trajectory within a training batch and repurpose its initial correct steps as an on-policy “hint”. This process transforms a single, stochastically-found success into a dense contrastive learning signal, effectively allowing the model to teach itself how to overcome the near-miss problem and break exploration stagnation. On a challenging suite of five mathematical reasoning benchmarks, HiPO improves the average avg@32 by +5.0 percentage points (pp) over the strong GRPO baseline. This improvement is driven by substantial absolute point gains on challenging datasets, including +10.3 pp on CMIMC 2025, +4.9 pp on BRUMO 2025, +4.6 pp on AIME 2024, and +3.1 pp on AIME 2025. Furthermore, HiPO demonstrates a new exploration paradigm, repurposing rare successes into reusable guidance to significantly accelerate skill acquisition for complex tasks, establishing a more efficient and scalable path for models to autonomously master intricate reasoning.
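Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large Language Model #Post Training
🎯 Motivation: LLMs are powerful but overly sensitive to slight changes in the input context, which hurts reliability. Conventional metrics such as accuracy and perplexity do not measure the local robustness of a prediction.
❓ Problem: Introduces a new metric, the Token Constraint Bound ($\delta_{\mathrm{TCB}}$), which quantifies how much internal-state perturbation the model can tolerate before its dominant next-token prediction changes significantly.
🔍 Analysis: Normalized output probabilities can obscure how resilient the model's internal state is to perturbations, so critical prediction instabilities go undetected.
🛠️ Method: Derives $\delta_{\mathrm{TCB}}$ from the geometry of the output embedding space and uses it to assess the stability of the model's predictive commitment; the metric correlates with effective prompt engineering.
📊 Data & Experiments: Experiments show that $\delta_{\mathrm{TCB}}$ uncovers prediction instabilities that perplexity misses during in-context learning and text generation.
⭐ Contributions: Provides a principled, complementary tool for analyzing, and potentially improving, the contextual stability of LLM predictions.
Abstract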
Large Language Models (LLMs) exhibit impressive capabilities yet suffer from sensitivity to slight input context variations, hampering reliability. Conventional metrics like accuracy and perplexity fail to assess local prediction robustness, as normalized output probabilities can obscure the underlying resilience of an LLM's internal state to perturbations. We introduce the Token Constraint Bound ($\delta_{\mathrm{TCB}}$), a novel metric that quantifies the maximum internal state perturbation an LLM can withstand before its dominant next-token prediction significantly changes. Intrinsically linked to output embedding space geometry, $\delta_{\mathrm{TCB}}$ provides insights into the stability of the model's internal predictive commitment. Our experiments show $\delta_{\mathrm{TCB}}$ correlates with effective prompt engineering and uncovers critical prediction instabilities missed by perplexity during in-context learning and text generation. $\delta_{\mathrm{TCB}}$ offers a principled, complementary approach to analyze and potentially improve the contextual stability of LLM predictions.
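Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#transformers #in-context learning #interpretability #Markov chain
TL;DR:We show that transformers can and do approximate Bayesian model averaging to infer varying causal dependencies in-context, with information-theoretic guarantees.
🎯 Motivation: Existing theory of in-context learning mostly assumes fixed dependency structures, whereas real sequences have flexible, context-dependent relationships. The work asks whether transformers can learn the causal structure between sequence elements directly from context.
❓ Problem: Proposes a framework based on Markov chains with randomly sampled causal dependencies, in which a transformer must infer which tokens depend on which predecessors in order to predict accurately.
🔍 Analysis: Transformers learn causal structure in-context and reason over structural uncertainty; their attention patterns directly reflect the inferred causal dependencies.
🛠️ Method: Proves that a two-layer transformer with relative positional embeddings can implement Bayesian Model Averaging (BMA), and designs a dedicated task to evaluate trained models.
📊 Data & Experiments: Experiments on the randomized causal-dependency task and on continuous dynamical systems, combined with parameter-level analysis, confirm that trained transformers approximate BMA.
⭐ Contributions: A theoretical construction showing transformers can infer causal structure via BMA; information-theoretic guarantees on recovering causal dependencies; an extension to continuous dynamical systems that reveals differences in representational requirements.
Abstract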
Transformers have demonstrated remarkable in-context learning abilities, adapting to new tasks from just a few examples without parameter updates. However, theoretical understanding of this phenomenon typically assumes fixed dependency structures, while real-world sequences exhibit flexible, context-dependent relationships. We address this gap by investigating whether transformers can learn causal structures -- the underlying dependencies between sequence elements -- directly from in-context examples. We propose a novel framework using Markov chains with randomly sampled causal dependencies, where transformers must infer which tokens depend on which predecessors to make accurate predictions. Our key contributions are threefold: (1) We prove that a two-layer transformer with relative positional embeddings can implement Bayesian Model Averaging (BMA), the optimal statistical algorithm for causal structure inference; (2) Through extensive experiments and parameter-level analysis, we demonstrate that transformers trained on this task approximate BMA, with attention patterns directly reflecting the inferred causal structures; (3) We provide information-theoretic guarantees showing how transformers recover causal dependencies and extend our analysis to continuous dynamical systems, revealing fundamental differences in representational requirements. Our findings bridge the gap between empirical observations of in-context learning and theoretical understanding, showing that transformers can perform sophisticated statistical inference over structural uncertainty.
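Bayesian Model Averaging over dependency structures can be illustrated in a simplified setting. The sketch below assumes each hypothesis is a single global lag $k$ (token $x_t$ depends only on $x_{t-k}$), a uniform prior over lags, and a symmetric Dirichlet prior on each row of the transition matrix. These choices, the function name, and the toy sequence are assumptions made for illustration; the paper's actual task construction may differ.

```python
import math

import numpy as np


def bma_next_token(seq, vocab_size, max_lag, alpha=1.0):
    """Return (posterior over lags, model-averaged next-token distribution)."""
    seq = np.asarray(seq)
    log_evidence = np.zeros(max_lag)
    predictive = np.zeros((max_lag, vocab_size))
    for lag in range(1, max_lag + 1):
        counts = np.zeros((vocab_size, vocab_size))
        # Score the same positions under every hypothesis so evidences are comparable.
        for t in range(max_lag, len(seq)):
            counts[seq[t - lag], seq[t]] += 1
        # Dirichlet-multinomial marginal likelihood, one term per parent token.
        for row in counts:
            log_evidence[lag - 1] += (
                math.lgamma(vocab_size * alpha)
                - math.lgamma(vocab_size * alpha + row.sum())
                + sum(math.lgamma(alpha + c) - math.lgamma(alpha) for c in row)
            )
        # Posterior-mean transition row for the parent of the next position.
        parent = seq[len(seq) - lag]
        predictive[lag - 1] = (counts[parent] + alpha) / (
            counts[parent].sum() + vocab_size * alpha
        )
    posterior = np.exp(log_evidence - log_evidence.max())
    posterior /= posterior.sum()
    return posterior, posterior @ predictive


# Toy sequence 0,0,1,1,2,2,0,0,... in which each token is determined by lag 2.
toy = [(t // 2) % 3 for t in range(30)]
lag_posterior, next_token_probs = bma_next_token(toy, vocab_size=3, max_lag=3)
print(lag_posterior, next_token_probs)
```

The prediction is a posterior-weighted mixture over hypotheses instead of a commitment to the single most likely structure, which is the behavior the abstract says trained transformers approximate.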
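Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#LLM #reasoning #intervention #SFT #RL
🎯 Motivation: Outcome-reward reinforcement learning rewards or penalizes only the final answer, so correct intermediate steps may be discouraged and spurious steps reinforced. Credit is not assigned effectively.
❓ Problem: Proposes a method for fine-grained credit assignment over individual reasoning steps, so that correct steps are not penalized in failed traces and wrong steps are not reinforced in successful ones.
🔍 Analysis: Standard RL cannot localize which step of a reasoning trace caused the failure, which limits what training can achieve.
🛠️ Method: Introduces Intervention Training (InT): using reference solutions, the model identifies the first error in its own reasoning trace and proposes a single-step correction. Supervised fine-tuning on the rollout up to the error plus the intervention is followed by RL.
📊 Data & Experiments: Uses the reference solutions available in mathematical reasoning datasets and evaluates on IMO-AnswerBench, where accuracy improves by nearly 14% over a 4B-parameter base model and exceeds larger open-source models such as gpt-oss-20b.
⭐ Contributions: An effective intervention-training method that performs fine-grained credit assignment on reasoning traces, yields a better initialization for RL training, and substantially improves a mid-sized model on mathematical reasoning.
Abstract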
Outcome-reward reinforcement learning (RL) has proven effective at improving the reasoning capabilities of large language models (LLMs). However, standard RL assigns credit only at the level of the final answer, penalizing entire reasoning traces when the outcome is incorrect and uniformly reinforcing all steps when it is correct. As a result, correct intermediate steps may be discouraged in failed traces, while spurious steps may be reinforced in successful ones. We refer to this failure mode as the problem of credit assignment. While a natural remedy is to train a process reward model, accurately optimizing such models to identify corrective reasoning steps remains challenging. We introduce Intervention Training (InT), a training paradigm in which the model performs fine-grained credit assignment on its own reasoning traces by proposing short, targeted corrections that steer trajectories toward higher reward. Using reference solutions commonly available in mathematical reasoning datasets and exploiting the fact that verifying a model-generated solution is easier than generating a correct one from scratch, the model identifies the first error in its reasoning and proposes a single-step intervention to redirect the trajectory toward the correct solution. We then apply supervised fine-tuning (SFT) to the on-policy rollout up to the point of error concatenated with the intervention, localizing error to the specific step that caused failure. We show that the resulting model serves as a far better initialization for RL training. After running InT and subsequent fine-tuning with RL, we improve accuracy by nearly 14% over a 4B-parameter base model on IMO-AnswerBench, outperforming larger open-source models such as gpt-oss-20b.
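Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large Language Model #Reinforcement Learning #Reasoning
🎯 Motivation: Current LLMs fall short of human-like complex reasoning, and new methods are needed to strengthen their ability to learn to reason.
❓ Problem: Proposes a reinforcement-learning-driven fine-tuning framework that introduces functional tokens so the model can build and apply diverse chain-of-thought reasoning paths.
🔍 Analysis: Prompt-driven reasoning methods depend on predefined structures and adapt poorly to complex tasks.
🛠️ Method: A two-phase framework: supervised fine-tuning on self-generated, functional-token-annotated data obtained by prompt-driven tree search provides initial reasoning capability; online reinforcement learning then explores diverse reasoning pathways through functional token sampling for self-improvement.
📊 Data & Experiments: Extensive experiments on mathematical benchmarks and other general domains demonstrate the method's reasoning and generalization advantages; performance improves further with additional search rollouts at test time.
⭐ Contributions: Develops RFTT, a fine-tuning framework that strengthens LLM reasoning, improves mathematical reasoning and cross-domain performance, and releases code and dataset.
Abstract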
In this work, we propose ***R**einforced **F**unctional **T**oken **T**uning* (RFTT), a novel reinforced fine-tuning framework that empowers Large Language Models (LLMs) with learn-to-reason capabilities. Unlike prior prompt-driven reasoning efforts, RFTT embeds a rich set of learnable functional tokens (*e.g.*, \<analyze\>, \<verify\>, \<refine\>) directly into the model vocabulary, enabling chain-of-thought construction with diverse human-like reasoning behaviors. Specifically, RFTT comprises two phases: (1) supervised fine-tuning performs prompt-driven tree search to obtain self-generated training data annotated with functional tokens, which warms up the model to learn these tokens for initial reasoning capability; and (2) online reinforcement learning further allows the model to explore diverse reasoning pathways through functional token sampling without relying on prompts, thereby facilitating effective self-improvement for functional reasoning. Extensive experiments demonstrate the superiority of the proposed RFTT on mathematical benchmarks and highlight its strong generalization capability to other general domains. Moreover, the performance of RFTT exhibits consistent gains with increased test-time computation through additional search rollouts. Our code and dataset are available at https://github.com/sastpg/RFTT.
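Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#LLMs #in-context learning
TL;DR:A large-scale evaluation to empirically characterise in-context learning as a learning paradigm, ablating out common drawbacks of LLM evaluation, and finding results contradicting or aligning with conventional wisdom
🎯 Motivation: Examines whether in-context learning (ICL) in LLMs is genuinely learning, and in particular questions claims that models can solve unseen tasks without further training.
❓ Problem: Through a mathematical definition and empirical analysis, evaluates how well ICL learns and generalizes to unseen tasks, accounting for memorization, pretraining, distributional shifts, and prompting style and phrasing.
🔍 Analysis: ICL relies on prior knowledge and the exemplars in the prompt and does not explicitly encode observations. It deduces patterns from regularities in the prompt, which leads to distributional sensitivity and limited generalization to unseen tasks.
🛠️ Method: Large-scale experiments that ablate or account for memorization, pretraining effects, and distribution shifts, and that compare prompting styles such as chain-of-thought.
📊 Data & Experiments: Experiments across many tasks and data distributions analyze how the number of exemplars, prompting style, model, and the input's linguistic features affect accuracy.
⭐ Contributions: Questions the robustness of ICL as a general-purpose learning mechanism, shows its dependence on regularities in the prompt and its limited generalization, and offers reference points for future research.
Abstract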
In-context learning (ICL) allows some autoregressive models to solve tasks via next-token prediction and without needing further training. This has led to claims about these models' ability to solve (learn) unseen tasks with only a few shots (exemplars) in the prompt. However, deduction does not always imply learning, as ICL does not explicitly encode a given observation. Instead, the models rely on their prior knowledge and the exemplars given, if any. We argue that, mathematically, ICL fits the definition of learning; however, its full characterisation requires empirical work. We then carry out a large-scale analysis of ICL ablating out or accounting for memorisation, pretraining, distributional shifts, and prompting style and phrasing. We find that, empirically, ICL is limited in its ability to learn and generalise to unseen tasks. Namely, in the limit where exemplars become more numerous, accuracy is insensitive to exemplar distribution, model, prompt style, and the input's linguistic features. Instead, it deduces patterns from regularities in the prompt, which leads to distributional sensitivity, especially in prompting styles such as chain-of-thought. Given the varied accuracies on formally similar tasks, we conclude that autoregression's _ad-hoc_ encoding is not a robust mechanism for learning, and that this suggests limited all-purpose generalisability.
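Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large Language Models #Reasoning #Benchmark #Linguistic reasoning #Permutation
TL;DR:An inductive reasoning benchmark about natural languages designed to minimise the ability to solve with knowledge or memory
🎯 Motivation: Frontier LLMs do well on reasoning problems but often solve them through knowledge or memorization instead of actual reasoning, which inflates reported performance.
❓ Problem: Design a reasoning benchmark that rules out knowledge- and memory-based shortcuts so that reasoning ability can be measured accurately.
🔍 Analysis: Models perform better on the original questions because they exploit shortcuts; after obfuscation, performance drops markedly, exposing weaker reasoning ability.
🛠️ Method: Applies expert-designed obfuscations to Linguistics Olympiad problems, preserving the solution logic while reducing the chance that a problem can be solved through knowledge or memorization.
📊 Data & Experiments: The benchmark contains 1,203 questions and 6,995 sub-questions. Comparing models before and after obfuscation, the best reasoning models drop from around 0.59 to 0.48.
⭐ Contributions: Develops LINGOLY-TOO, a benchmark that disentangles reasoning from knowledge and gives a clearer measure of a language model's true reasoning capability.
Abstract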
Frontier language models demonstrate increasing ability at solving reasoning problems, but their performance is often inflated by circumventing reasoning and instead relying on their expanding knowledge and memorisation capacity. We introduce LINGOLY-TOO, a challenging reasoning benchmark of 1,203 questions and a total of 6,995 sub-questions that counters these shortcuts by applying expert-designed obfuscations to Linguistics Olympiad problems. These obfuscations preserve the underlying solution logic while reducing the likelihood that problems are solvable via knowledge or memorisation. Our experiments show that models exploit shortcuts on the original questions, as performance markedly drops upon obfuscation. Even the best reasoning models remain highly sensitive, with scores dropping from around 0.59 on original problems to 0.48 after obfuscation. LINGOLY-TOO disentangles reasoning from knowledge, offering a clearer measure of true reasoning capabilities.
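Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Causal Inference #Large Language Models #Reasoning #Narratives
TL;DR:In this paper, we examine the failure modes of LLMs for causal reasoning on narratives and the unreliable shortcuts LLMs take to make causal inferences.
🎯 Motivation: Identifying causal relationships is essential for autonomous decision-making and for adapting to novel scenarios, but it requires combining world knowledge with logical reasoning. Existing models struggle to balance the two.
❓ Problem: Investigates how well LLMs reason causally over narratives and where they fall back on unreliable shortcuts.
🔍 Analysis: Models often infer causality from event order, or recall memorized world knowledge without attending to context, showing a reliance on superficial heuristics.
🛠️ Method: Shows that simple reformulations of the task elicit more robust causal reasoning, and evaluates across causal structures ranging from linear chains to graphs with colliders and forks.
📊 Data & Experiments: Controlled experiments on synthetic, semi-synthetic, and real-world data analyze model behavior across different causal structures.
⭐ Contributions: Uncovers systematic patterns in how LLMs perform causal reasoning and lays groundwork for methods that better align LLM behavior with principled causal inference.
Abstract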
The ability to robustly identify causal relationships is essential for autonomous decision-making and adaptation to novel scenarios. However, accurately inferring causal structure requires integrating both world knowledge and abstract logical reasoning. In this work, we investigate the interaction between these two capabilities through the representative task of causal reasoning over narratives. Through controlled synthetic, semi-synthetic and real-world experiments, we find that state-of-the-art large language models (LLMs) often rely on superficial heuristics—for example, inferring causality from event order or recalling memorized world knowledge without attending to context. Furthermore, we show that simple reformulations of the task can elicit more robust reasoning behavior. Our evaluation spans a range of causal structures, from linear chains to complex graphs involving colliders and forks. These findings uncover systematic patterns in how LLMs perform causal reasoning and lay the groundwork for developing methods that better align LLM behavior with principled causal inference.
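Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large language model #latent chain-of-thought #reasoning
TL;DR:We show that Soft Thinking is reduced to single-path reasoning and propose randomness-based strategies, with Gumbel-Softmax proving most effective for enhancing reasoning performance.
🎯 Motivation: Current LLMs reason by generating discrete tokens, which may limit their ability to express abstract concepts. Reasoning in a continuous concept space has therefore become a research focus.
❓ Problem: Reveal how LLMs actually behave under Soft Thinking, and identify and overcome the limitation of single-path reasoning.
🔍 Analysis: A systematic analysis of internal behavior shows that under Soft Thinking, LLMs rely predominantly on the highest-probability token in the soft input to predict the next step, creating a greedy feedback loop that suppresses alternative paths.
🛠️ Method: Proposes Stochastic Soft Thinking, which introduces randomness to break the greedy tendency; the Gumbel-Softmax trick performs best.
📊 Data & Experiments: Evaluated on eight reasoning benchmarks, where the stochastic strategies improve reasoning performance over vanilla Soft Thinking.
⭐ Contributions: Shows that LLMs behave as single-threaded reasoners under Soft Thinking; proposes a randomness-based remedy that unlocks the potential of Soft Thinking; demonstrates superior performance across multiple reasoning tasks.
Abstract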
Human cognition naturally engages with abstract and fluid concepts, whereas existing reasoning models often rely on generating discrete tokens, potentially constraining their expressive capabilities. Recent advancements aim to address this limitation by enabling large language models (LLMs) to generate soft, abstract tokens, thus facilitating reasoning within a continuous concept space. In this paper, we investigate the $\textit{Soft Thinking}$ capabilities of various LLMs through a systematic analysis of their internal behavior using a suite of probing techniques. Contrary to the prevailing belief that Soft Thinking supports parallel exploration of diverse reasoning paths, our findings reveal that $\textbf{LLMs behave as single-threaded reasoners}$—they predominantly rely on the token with the highest probability in the soft input to predict the next step. This behavior induces a greedy feedback loop that suppresses alternative reasoning paths and undermines the benefits of transmitting richer information via Soft Tokens. To address this $\textit{Greedy Pitfall}$, we propose $\textbf{Stochastic Soft Thinking}$, which introduces stochasticity to break free from the greedy tendency. Our experiments demonstrate that incorporating $\textit{randomness}$—particularly with the $\textbf{Gumbel-Softmax trick}$—can alleviate the limitations of vanilla approaches and unleash the potential of Soft Thinking, resulting in superior performance across eight reasoning benchmarks.
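The Gumbel-Softmax trick named above can be sketched in a few lines: Gumbel noise is added to the logits before a temperature-scaled softmax, so the resulting soft weights vary from draw to draw instead of always peaking on the top token. How the paper plugs this into Soft Thinking is not described in the abstract; the `soft_token` helper, the temperature, and the toy embedding table are hypothetical.

```python
import numpy as np


def gumbel_softmax(logits: np.ndarray, tau: float, rng: np.random.Generator) -> np.ndarray:
    """Sample a relaxed (soft) one-hot vector via the Gumbel-Softmax trick."""
    uniform = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    gumbel_noise = -np.log(-np.log(uniform))
    scores = (logits + gumbel_noise) / tau
    scores = scores - scores.max()  # subtract the max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()


def soft_token(logits, embedding_table, tau, rng):
    """Hypothetical soft token: embeddings mixed with stochastic weights."""
    return gumbel_softmax(logits, tau, rng) @ embedding_table


rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.1])
embedding_table = np.eye(3)  # toy 3-token vocabulary with 3-dimensional embeddings
print(soft_token(logits, embedding_table, tau=0.5, rng=rng))
```

Lower values of `tau` push each sample closer to a one-hot vector, while the noise still lets tokens other than the most probable one dominate on some draws.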
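Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large Language Models #Reasoning #Diffusion Models #Latent Reasoning
TL;DR:LaDiR is a novel latent reasoning framework that encodes latent “thought tokens” with a VAE and predicts them via latent diffusion models, enabling adaptive test-time compute, parallel diverse generation, and better interpretability.
🎯 Motivation: LLMs show their reasoning ability through chain-of-thought generation, but autoregressive decoding limits their ability to revisit and holistically refine earlier tokens and to explore diverse solutions efficiently.
❓ Problem: Proposes LaDiR, which combines continuous latent representations with the iterative refinement of latent diffusion models to address the limitations of autoregressive reasoning, including inefficient exploration and low solution diversity.
🔍 Analysis: Standard autoregressive sampling often produces repetitive solutions and explores the space of reasoning trajectories poorly.
🛠️ Method: Uses a Variational Autoencoder to build a structured latent reasoning space that encodes reasoning steps into compact blocks of thought tokens, then trains a latent diffusion model with a blockwise bidirectional attention mask to denoise each block, with explicit diversity guidance at inference.
📊 Data & Experiments: Evaluated on mathematical reasoning and planning benchmarks, where LaDiR consistently improves accuracy, diversity, and interpretability over existing autoregressive, diffusion-based, and latent reasoning methods.
⭐ Contributions: A reasoning framework that unifies continuous latent representation with diffusion-based refinement, offering adaptive test-time compute, more diverse reasoning, and better interpretability.
Abstract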
Large Language Models (LLMs) demonstrate their reasoning ability through chain-of-thought (CoT) generation. However, an LLM's autoregressive decoding may limit the ability to revisit and refine earlier tokens in a holistic manner, which can also lead to inefficient exploration for diverse solutions. In this paper, we propose LaDiR (Latent Diffusion Reasoner), a novel reasoning framework that unifies the expressiveness of continuous latent representation with the iterative refinement capabilities of latent diffusion models for an existing LLM. We first construct a structured latent reasoning space using a Variational Autoencoder (VAE) that encodes text reasoning steps into blocks of thought tokens, preserving semantic information and interpretability while offering compact but expressive representations. Subsequently, we utilize a latent diffusion model that learns to denoise a block of latent thought tokens with a blockwise bidirectional attention mask, enabling longer horizon and iterative refinement with adaptive test-time compute. This design, combined with explicit diversity guidance during diffusion inference, enables the generation of multiple diverse reasoning trajectories that explore distinct regions of the latent space, rather than producing repetitive solutions as often occurs in standard autoregressive sampling. We conduct evaluations on a suite of mathematical reasoning and planning benchmarks. Empirical results show that LaDiR consistently improves accuracy, diversity, and interpretability over existing autoregressive, diffusion-based, and latent reasoning methods, revealing a new paradigm for text reasoning with latent diffusion.
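Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large Language Models #Reasoning #Visualization
TL;DR:We introduce a visualization tool for users to inspect the reasoning paths of chain-of-thought and its derivatives on any multi-choice dataset
🎯 Motivation: Step-by-step reasoning is increasingly central to LLM applications, yet reasoning behavior remains poorly understood, which hampers research, development, and safety.
❓ Problem: To close this gap, proposes a visualization tool for inspecting the reasoning paths of chain-of-thought and its derivatives on multi-choice datasets.
🔍 Analysis: Qualitative and quantitative analysis with the tool distinguishes strong from weak models, correct from incorrect answers, and different reasoning tasks, and it exposes undesirable patterns such as low consistency and high uncertainty.
🛠️ Method: Introduces landscape of thoughts (LoT), which represents the textual states of a reasoning trajectory as numerical features measuring their distances to the answer choices, then visualizes them in two-dimensional plots with t-SNE.
📊 Data & Experiments: Adapts LoT into a lightweight verifier that predicts whether a trajectory is correct; empirically, the verifier improves reasoning accuracy and the test-time scaling effect.
⭐ Contributions: The first landscape visualization tool for LLM reasoning behavior, which reveals problematic reasoning patterns and, through the verifier, improves reasoning performance. The code is open-sourced.
Abstract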
Numerous applications of large language models (LLMs) rely on their ability to perform step-by-step reasoning. However, the reasoning behavior of LLMs remains poorly understood, posing challenges to research, development, and safety. To address this gap, we introduce landscape of thoughts (LoT), the first landscape visualization tool to inspect the reasoning trajectories with certain reasoning methods on any multi-choice dataset. We represent the textual states in a trajectory as numerical features that quantify the states' distances to the answer choices. These features are then visualized in two-dimensional plots using t-SNE. Qualitative and quantitative analysis with the landscape of thoughts effectively distinguishes between strong and weak models, correct and incorrect answers, as well as different reasoning tasks. It also uncovers undesirable reasoning patterns, such as low consistency and high uncertainty. Additionally, users can adapt LoT to a model that predicts the property they observe. We showcase this advantage by adapting LoT to a lightweight verifier that evaluates the correctness of trajectories. Empirically, this verifier boosts the reasoning accuracy and the test-time scaling effect. The code is publicly available at: https://github.com/tmlr-group/landscape-of-thoughts.
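Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Latent representation learning #scaling test-time compute
TL;DR:We demonstrate that latent thoughts of LLMs contain rich reward signals, and scaling test-time thinking with supervision can be directly performed in the latent space.
🎯 Motivation: LLMs solve problems well by generating chains of thought in natural language, but this is computationally costly and prone to overthinking. Reasoning in latent space is a newer direction for improving efficiency and reliability.
❓ Problem: Latent thoughts lack interpretability and are hard to supervise, so the correctness and reliability of latent reasoning are difficult to guarantee.
🔍 Analysis: Latent thoughts that lead to correct answers differ markedly from those that lead to incorrect ones, and a latent classifier can predict answer correctness directly from them.
🛠️ Method: Proposes Latent Thinking Optimization (LTO), a probabilistic algorithm that uses the latent classifier as a Latent Reward Model (LRM) to optimize the latent thinking process.
📊 Data & Experiments: Extensive experiments across diverse reasoning tasks show that the LRM detects incorrect latent thinking patterns effectively, that LTO significantly improves latent thinking, and that the LRM generalizes across domains.
⭐ Contributions: Shows that latent thoughts carry rich reward signals; proposes LTO for supervision and optimization directly in latent space; demonstrates that the approach is efficient, domain-agnostic, and applicable to general LLMs.
Abstract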
Large Language Models (LLMs) excel at problem solving by generating chain of thoughts in natural language, but such verbal thinking is computationally costly and prone to overthinking. A recent work instead proposes a latent thinking architecture, Huginn-3.5B, which represents intermediate reasoning steps as a sequence of latent representations. However, latent thoughts lack interpretability and are difficult to supervise, raising concerns about the correctness and reliability of the model's latent thinking processes. In this paper, we provide a systematic study of how Huginn-3.5B thinks in the latent space and how external supervision signals can improve its latent thinking processes. We show that latent thoughts leading to correct versus incorrect answers exhibit highly distinguishable patterns, and that a latent classifier can reliably predict answer correctness directly from latent thoughts. Leveraging these insights, we propose Latent Thinking Optimization (LTO), a probabilistic algorithm that employs the latent classifier as a Latent Reward Model (LRM) to optimize the latent thinking processes. Extensive experiments across diverse reasoning tasks demonstrate that LRM is highly effective in detecting incorrect latent thinking patterns, and LTO can significantly improve the latent thinking processes. Furthermore, we show that LRM can generalize across diverse domains, and LTO can be seamlessly applied to general LLMs to improve their thinking processes. In contrast to verbal thinking, our method demonstrates that reward modeling and scaling test-time thinking with supervision can be performed directly in the latent space, highlighting its potential as a general, efficient, and domain-agnostic approach to improving the thinking processes of LLMs.
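Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large Language Models #Efficient Reasoning
🎯 Motivation: LLMs perform well on complex reasoning tasks, but high computational cost limits practical use. The authors attribute this to the tight coupling between high-level cognitive planning and step-by-step text generation.
❓ Problem: Proposes a collaborative framework that uses latent guidance to decouple planning from execution, reducing reasoning cost and making small models more capable.
🔍 Analysis: Small models on their own do not exhibit strong reasoning, and existing approaches typically need a large model to produce the full reasoning chain.
🛠️ Method: A division of labor: a large model acts as an Implicit Thinker that compresses its solution strategy into compact latent guidance vectors, and a small model acts as an Explicit Executor that generates the reasoning chain from them. A dual-loss objective grounded in information theory, including a reconstruction loss, improves the quality of the latent guidance.
📊 Data & Experiments: Experiments on 8 diverse reasoning benchmarks with small models from 0.5B to 8B show substantially better reasoning than strong baselines.
⭐ Contributions: A theoretically grounded paradigm for giving small models large-model thinking, improving the performance-cost trade-off for complex reasoning and raising small-model accuracy by up to 13.9%.
Abstract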
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks, but their high computational costs limit their widespread practical application. We argue that this inefficiency arises from the tight coupling of high-level cognitive planning (devising the solution strategy) and low-level linguistic realization (generating step-by-step text). To address this challenge, we propose a novel collaborative framework that decouples these two processes through Latent Guidance. Our approach implements a division of labor: a large model acts as an Implicit Thinker, performing high-level cognitive planning and compressing its solution strategy into a set of compact latent guidance vectors. A small, efficient model then serves as an Explicit Executor, which receives this latent guidance to generate a concise and effective reasoning chain. This process is enabled by a dual-loss training objective, grounded in information-theoretic principles, where a reconstruction loss explicitly compels the latent guidance to become a high-fidelity representation of the full reasoning chain. Extensive experiments on 8 diverse reasoning benchmarks demonstrate that our method substantially enhances the reasoning capabilities of small models across various scales (from 0.5B to 8B), allowing them to outperform strong baselines and exhibit superior generalization. Notably, our framework boosts small model accuracy by up to 13.9% over its standalone baseline. Our work introduces a new, theoretically-grounded paradigm for empowering small models with large-model thinking, substantially improving the performance-cost trade-off for complex reasoning.
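Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Reinforcement learning; Large Language Model; Active Learning; Reasoning
TL;DR: Learn More with Less: Uncertainty Consistency Guided Query Selection for RLVR
🎯 Motivation: Reinforcement Learning with Verifiable Reward (RLVR) has improved the mathematical reasoning of LLMs, but it needs large query budgets, which makes annotation costly. The work asks whether fewer, more informative queries can reach similar or better performance.
❓ Problem: Classic active-learning sampling strategies ignore whether subjective uncertainty is consistent with objective uncertainty, so they fail to beat random selection in this setting. The work uses uncertainty consistency to guide query selection.
🔍 Analysis: Classic active-learning methods select only by subjective uncertainty and ignore objective uncertainty; this poor consistency is the main cause of their limited performance. In the online setting, limited sampling and shifting output distributions make consistency harder to estimate.
🛠️ Method: Proposes an uncertainty consistency metric. Offline, it is measured with the Point-Biserial Correlation Coefficient (PBC); online, a variant computed from normalized advantage and subjective uncertainty is used and is proven to be strictly negatively correlated with the offline PBC.
📊 Data & Experiments: Training on only 30% of the data, the method matches full-dataset performance on reasoning tasks and outperforms random sampling and classic active-learning baselines.
⭐ Contributions: Reduces the training cost of RLVR, introduces the uncertainty consistency metric for sample selection, and provides theoretical and empirical support for combining active learning with RLVR.
Full abstract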
Large Language Models (LLMs) have recently improved mathematical reasoning through Reinforcement Learning with Verifiable Reward (RLVR). However, existing RLVR algorithms require large query budgets, making annotation costly. We investigate whether fewer but more informative queries can yield similar or superior performance, introducing active learning (AL) into RLVR. We identify that classic AL sampling strategies fail to outperform random selection in this setting, due to ignoring \textbf{objective uncertainty} when only selecting by subjective uncertainty. This work proposes an \textbf{uncertainty consistency} metric to evaluate how well subjective uncertainty aligns with objective uncertainty. In the offline setting, this alignment is measured using the Point-Biserial Correlation Coefficient (PBC). For online training, because of limited sampling and dynamically shifting output distributions, PBC estimation is difficult. Therefore, we introduce a new online variant, computed from normalized advantage and subjective uncertainty. Theoretically, we prove that the online variant is strictly negatively correlated with offline PBC and supports better sample selection. Experiments show our method consistently outperforms random and classic AL baselines, achieving full-dataset performance while training on only 30\% of the data, effectively reducing the cost of RLVR for reasoning tasks. The code is available at https://github.com/yihao-123/uncertainty-consistency.
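As a rough illustration of the offline metric named in the abstract, the point-biserial correlation between a continuous score and a binary label can be computed as below. The function name, the toy numbers, and the choice of "answer was wrong" as the binary variable are assumptions made for this sketch, not details taken from the paper.

```python
import math

def point_biserial(scores, labels):
    """Point-biserial correlation between continuous scores and 0/1 labels."""
    n = len(scores)
    group1 = [s for s, y in zip(scores, labels) if y == 1]
    group0 = [s for s, y in zip(scores, labels) if y == 0]
    if not group1 or not group0:
        raise ValueError("both label groups must be non-empty")
    mean1 = sum(group1) / len(group1)
    mean0 = sum(group0) / len(group0)
    mean_all = sum(scores) / n
    # Population standard deviation of all scores.
    std = math.sqrt(sum((s - mean_all) ** 2 for s in scores) / n)
    if std == 0:
        raise ValueError("scores have zero variance")
    p = len(group1) / n  # fraction of samples with label 1
    return (mean1 - mean0) / std * math.sqrt(p * (1 - p))

# Hypothetical toy data: one uncertainty score per query, and whether
# the sampled answer was wrong (1) or right (0).
uncertainty = [0.9, 0.8, 0.2, 0.1]
is_wrong = [1, 1, 0, 0]
r_pb = point_biserial(uncertainty, is_wrong)
```

A value near 1 in this toy case means the uncertainty score ranks wrong answers above right ones; the point-biserial coefficient is the Pearson correlation specialized to one binary variable.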
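Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large Language Models #Large Reasoning Models #Efficient Reasoning #Reinforcement Learning
🎯 Motivation: Large reasoning models solve complex problems by generating long reasoning traces learned through reinforcement learning, but the redundancy in these outputs limits efficiency.
❓ Problem: Proposes reinforcement learning with length-based reward shaping to improve the trade-off between reasoning performance and output efficiency.
🔍 Analysis: Reasoning behavior evolves during training, so the reward specification should adapt as well; and lengthy reasoning should be penalized more on easy queries.
🛠️ Method: Proposes LASER, which uses a step function of a target length as the reward, and extends it to LASER-D with dynamic, difficulty-aware rewards.
📊 Data & Experiments: Experiments on five open-weight models from 1.5B to 32B show that LASER-D improves reasoning performance while reducing token usage; for example, it achieves a 5.3 improvement on AIME2024 while using 64% fewer tokens.
⭐ Contributions: A unified framework that formulates efficient-reasoning methods as length-based reward shaping; dynamic and difficulty-aware reward mechanisms; and released models, code, and data.
Full abstract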
Large Reasoning Models (LRMs) have shown remarkable capabilities in solving complex problems through reinforcement learning (RL), particularly by generating long reasoning traces. However, these extended outputs often exhibit substantial redundancy, which limits the efficiency of LRMs. In this paper, we investigate RL-based approaches to promote reasoning efficiency. Specifically, we first present a unified framework that formulates various efficient reasoning methods through the lens of length-based reward shaping. Building on this perspective, we propose a novel **L**ength-b**A**sed **S**t**E**p **R**eward shaping method (LASER), which employs a step function as the reward based on target length. LASER surpasses previous methods, achieving a superior trade-off between performance and efficiency. Next, we further extend LASER based on two key intuitions: (1) The reasoning behavior of the model evolves dynamically during training, necessitating reward specifications that are also adaptive and dynamic; (2) Rather than uniformly encouraging shorter or longer chains of thought (CoT), we posit that length-based reward shaping should be difficulty-aware, i.e., it should penalize lengthy CoTs more for easy queries. This approach is expected to facilitate a combination of fast and slow thinking, leading to a better overall tradeoff. The resulting method is termed LASER-D (**D**ynamic and **D**ifficulty-aware). Experiments on five open-weight models from 1.5B to 32B demonstrate that our approach significantly enhances both reasoning performance and response length efficiency. For instance, LASER-D achieves a **5.3** improvement on AIME2024 while reducing token usage by **64**%. Further analysis reveals that our RL-based compression produces more concise reasoning patterns with less redundant "self-reflections". All resources (Models, Code, Data) are available at https://github.com/hkust-nlp/Laser.
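The idea of a step-function length reward can be sketched in a few lines. This is only an illustration under stated assumptions: the bonus size, the rule that only correct responses earn the bonus, and the pass-rate threshold used to pick a target length are all hypothetical and are not taken from the LASER or LASER-D definitions.

```python
def step_length_reward(is_correct, num_tokens, target_length, bonus=0.5):
    """Correctness reward plus a step-function bonus for staying within target_length.

    Assumption for this sketch: the bonus is granted only to correct responses.
    """
    correctness = 1.0 if is_correct else 0.0
    within_budget = 1.0 if num_tokens <= target_length else 0.0
    return correctness + bonus * correctness * within_budget

def difficulty_aware_target(easy_target, hard_target, pass_rate, threshold=0.5):
    """Use a shorter target length for queries the model already solves often.

    pass_rate is a hypothetical per-query success rate measured during training.
    """
    return easy_target if pass_rate >= threshold else hard_target

# An easy query (high pass rate) gets the tighter length budget.
target = difficulty_aware_target(easy_target=1000, hard_target=4000, pass_rate=0.9)
reward = step_length_reward(is_correct=True, num_tokens=800, target_length=target)
```

Because the bonus is a step rather than a slope, a correct answer is not pushed to be ever shorter once it fits inside the budget.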
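Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large Language Models, Complex Reasoning
🎯 Motivation: Chain-of-Thought (CoT) markedly improves LLM reasoning accuracy on complex tasks, but existing methods let early errors propagate and lack a structured framework for pruning redundant reasoning and identifying critical reasoning features.
❓ Problem: Addresses the amplification of early errors and the absence of global coordination in reasoning chains, and aims to build stable, interpretable reasoning paths that improve robustness and accuracy.
🔍 Analysis: Because existing CoT methods generate autoregressively, step by step, their reasoning chains are unstable, and they have no effective mechanism for removing redundancy or extracting key reasoning features.
🛠️ Method: Proposes Global Hypothesis Structure via Topological Data Analysis (GHS-TDA), which builds a semantically enriched global hypothesis graph to integrate and coordinate multiple candidate reasoning paths, and applies persistent homology to extract stable multi-scale structure.
📊 Data & Experiments: Evaluated on multiple reasoning benchmarks, where it consistently outperforms strong baselines in accuracy and robustness.
⭐ Contributions: A reasoning refinement framework operating over a global hypothesis space that improves error mitigation and makes reasoning more interpretable and stable.
Full abstract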
Chain-of-Thought (CoT) has been shown to significantly improve the reasoning accuracy of large language models (LLMs) on complex tasks. However, due to the autoregressive, step-by-step generation paradigm, existing CoT methods suffer from two fundamental limitations. First, the reasoning process is highly susceptible to early-stage errors, which tend to propagate and amplify without a global coordination and correction mechanism, thereby distorting the overall reasoning chain. Second, current CoT methods lack structured analytical frameworks for pruning redundant reasoning and identifying critical reasoning features, resulting in instability and reduced interpretability. To address these issues, we propose Global Hypothesis Structure via Topological Data Analysis (GHS-TDA), which constructs a semantically enriched global hypothesis graph that integrates and coordinates multiple candidate reasoning paths, thereby supporting global consistency refinement and error mitigation. GHS-TDA applies persistent homology-based topological data analysis to capture stable multi-scale structures, remove redundancy and inconsistencies, and extract a more reliable reasoning skeleton. By jointly leveraging reasoning diversity and topological stability, GHS-TDA achieves self-adaptive convergence, produces high-confidence and interpretable reasoning paths, and consistently outperforms strong baselines in terms of both accuracy and robustness across multiple reasoning benchmarks.
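Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#multi-hop reasoning #large language models #reinforcement learning #synthetic data
TL;DR: RL fine-tuning LLMs on synthetic data improves real-world multi-hop reasoning by teaching knowledge composition skills
🎯 Motivation: Multi-hop reasoning is an important LLM capability, but RL fine-tuning depends on abundant high-quality verifiable data, which is expensive and hard to obtain.
❓ Problem: Proposes RL fine-tuning on rule-generated synthetic data to reduce reliance on human annotation and other costly data sources.
🔍 Analysis: Even though the synthetic data contains only fictional knowledge, fine-tuned models improve significantly on real-world datasets, especially on harder questions that require composing knowledge.
🛠️ Method: Generates synthetic multi-hop data by rule and uses it for RL fine-tuning, so that the model learns a generalizable knowledge-composition skill.
📊 Data & Experiments: Evaluates on popular real-world question-answering benchmarks and stratifies results by question difficulty.
⭐ Contributions: Shows that rule-generated synthetic data is a free and scalable resource for improving LLM reasoning, particularly knowledge composition.
Full abstract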
Reinforcement Learning (RL) has been shown to significantly boost reasoning capabilities of large language models (LLMs) in math, coding, and multi-hop reasoning tasks.
However, RL fine-tuning requires abundant high-quality verifiable data, often sourced from human annotations, generated from frontier LLMs, or scored by LLM-based verifiers.
All three have considerable limitations: human-annotated datasets are small and expensive to curate, LLM-generated data is hallucination-prone and costly, and LLM-based verifiers are inaccurate and slow.
In this work, we investigate a cheaper alternative: RL fine-tuning on _rule-generated synthetic data_ for multi-hop reasoning tasks.
We discover that LLMs fine-tuned on synthetic data perform significantly better on popular real-world question-answering benchmarks, despite the synthetic data containing only fictional knowledge.
On stratifying performance by question difficulty, we find that synthetic data teaches LLMs to _compose knowledge_---a fundamental and generalizable reasoning skill.
Our work highlights rule-generated synthetic reasoning data as a free and scalable resource to improve LLM reasoning capabilities.
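To make "rule-generated synthetic data containing only fictional knowledge" concrete, here is a minimal sketch of a generator for two-hop questions with a rule-based checker. The relation names, the entity naming scheme, and the exact-match reward are hypothetical choices for illustration; the paper's actual generation rules are not described in the abstract.

```python
import random

def make_two_hop_examples(num_entities=6, seed=0):
    """Generate fictional facts and 2-hop questions with rule-checkable answers."""
    rng = random.Random(seed)
    people = [f"Person_{i}" for i in range(num_entities)]
    cities = [f"City_{i}" for i in range(num_entities)]
    # Two fictional relations: person -> mentor, person -> birthplace.
    mentor_of = {p: rng.choice([q for q in people if q != p]) for p in people}
    born_in = {p: rng.choice(cities) for p in people}
    examples = []
    for p in people:
        mentor = mentor_of[p]
        examples.append({
            "facts": [f"The mentor of {p} is {mentor}.",
                      f"{mentor} was born in {born_in[mentor]}."],
            "question": f"In which city was the mentor of {p} born?",
            "answer": born_in[mentor],  # requires composing both facts
        })
    return examples

def verify(example, prediction):
    """Rule-based reward: exact match against the generated answer."""
    return 1.0 if prediction.strip() == example["answer"] else 0.0

examples = make_two_hop_examples()
```

Since the answer is produced by the same rules as the question, the reward needs no human label and no model-based verifier.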
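Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#reinforcement learning #reasoning #blackwell optimality #post training
🎯 Motivation: Large reasoning models often consume excessive tokens, inflating computational cost and latency. The work questions the common assumption that longer responses improve accuracy.
❓ Problem: Penalizes reasoning length within a discounted reinforcement learning framework to encourage concise yet accurate reasoning.
🔍 Analysis: Longer token chains do not always yield better results; as in a stochastic shortest path problem, shorter successful trajectories should be preferred.
🛠️ Method: Models reasoning as a discounted RL problem, where the discount can be read as a small per-token cost, and analyzes Blackwell optimality in restricted policy classes.
📊 Data & Experiments: Experiments confirm the theoretical results: the approach shortens chains of thought while preserving accuracy.
⭐ Contributions: An efficient-reasoning approach that combines discounted RL with Blackwell optimality and balances accuracy and efficiency in both theory and practice.
Full abstract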
Large reasoning models (LRMs) often consume excessive tokens, inflating computational cost and latency. More broadly, in goal reaching sequential decision problems we often want to reach the goal quickly, and LRM reasoning can be viewed through this lens. We challenge the assumption that longer responses improve accuracy. By penalizing reasoning tokens using a discounted reinforcement learning setup (interpretable as a small token cost) and analyzing Blackwell optimality in restricted policy classes, we encourage concise yet accurate reasoning, analogous to preferring shorter successful trajectories in a stochastic shortest path problem. Experiments confirm our theoretical results that this approach shortens chains of thought while preserving accuracy.
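The remark that discounting is "interpretable as a small token cost" can be checked numerically. The sketch below assumes a reward of 1 paid only at the final token of a correct response and a discount factor close to 1; both the reward scheme and the value of `gamma` are assumptions for illustration.

```python
def discounted_return(is_correct, num_tokens, gamma=0.999):
    """Return of a trajectory that earns reward 1 at its final token if correct."""
    return (gamma ** num_tokens) if is_correct else 0.0

short_return = discounted_return(True, 200)
long_return = discounted_return(True, 2000)

# For gamma close to 1 and short responses, gamma**T is approximately
# 1 - (1 - gamma) * T, i.e. each token costs about (1 - gamma).
approx = 1 - (1 - 0.999) * 10
exact = discounted_return(True, 10)
```

Two correct responses therefore differ in return only through their length, so the shorter one is preferred, while an incorrect response earns nothing regardless of length.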
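Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#large language model #reinforcement learning #llm reasoning #structured in-context environment
🎯 Motivation: LLMs have improved their reasoning through environment-based reinforcement learning, but existing mathematical and coding environments are hard to scale, and game-based environments teach skills too specialized to generalize.
❓ Problem: Existing reasoning environments fall short on scalability, generalizable reasoning, and verifiability, which limits how far LLM reasoning can be improved and transferred.
🔍 Analysis: Heavy reliance on expert annotation and over-specialized skills hold back cross-domain reasoning.
🛠️ Method: Proposes the SIE framework, which automatically constructs reasoning environments from large-scale structured data; the rich compositional patterns support generalizable reasoning, and explicit schemas and reasoning chains enable rule-based verification.
📊 Data & Experiments: SIE substantially improves in-domain structured reasoning and transfers to out-of-domain mathematical and logical reasoning tasks. In information-limited partial SIEs, models infer the missing information by exploring the environment, which further improves robustness and generalization.
⭐ Contributions: The SIE framework, which provides scalable, generalizable, and verifiable reasoning environments, with demonstrated gains in structured reasoning, cross-domain transfer, and robustness.
Full abstract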
Large language models (LLMs) have achieved significant advancements in reasoning capabilities through reinforcement learning (RL) via environmental exploration. As the intrinsic properties of the environment determine the abilities that LLMs can learn, the environment plays an important role in the RL finetuning process. An ideal LLM reasoning environment should possess three core characteristics: scalability, generalizable reasoning, and verifiability. However, existing mathematical and coding environments are difficult to scale due to heavy reliance on expert annotation, while the skills learned in game-based environments are too specialized to generalize. To bridge this gap, we introduce the \textbf{S}tructured \textbf{I}n-context \textbf{E}nvironment (SIE) framework. SIE achieves scalability by automatically constructing reasoning environments from large-scale structured data, where the rich compositional patterns naturally support generalizable reasoning. Moreover, the explicit schemas and reasoning chains in structured data provide a foundation for rule-based verifiability. Experimental results show that the SIE framework not only achieves substantial improvements in in-domain structured reasoning, but also enables the learned compositional reasoning skills to generalize effectively to out-of-domain mathematical and logical reasoning tasks. We further explored learning in information-limited partial SIEs and found that LLMs can infer the missing information through exploring the environment, leading to robust reasoning improvements and generalization performance. Our code is available at \url{https://github.com/PursuitYP/SIE_ICLR}.
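Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large Language Models #Hybrid Reasoning #Reinforcement Learning
🎯 Motivation: LLMs perform well on complex reasoning, but conventional CoT reasoning in the discrete token space is costly in compute and memory; latent reasoning in the embedding space is more efficient but often sacrifices clarity and performance.
❓ Problem: How to let a model switch dynamically between explicit and latent reasoning during inference, improving efficiency while maintaining performance.
🔍 Analysis: Combining explicit and latent reasoning can balance clarity and efficiency, but existing methods lack a mechanism for choosing between them dynamically.
🛠️ Method: Proposes HyRea, trained in two stages: a supervised cold-start phase that introduces latent reasoning by replacing low-entropy CoT steps with embeddings, followed by reinforcement learning with GRPO on task-specific rewards to tune the reasoning strategy.
📊 Data & Experiments: On mathematical reasoning benchmarks, HyRea significantly reduces token usage while maintaining or improving accuracy.
⭐ Contributions: A unified framework that combines explicit and latent reasoning, offering an efficient and scalable solution for multi-step reasoning.
Full abstract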
Large Language Models (LLMs) have shown strong performance in complex reasoning tasks, especially when guided by Chain-of-Thought (CoT) prompting. However, conventional CoT reasoning in the discrete token space suffers from high computational and memory costs due to verbose intermediate steps. Recent work has explored latent reasoning in the embedding space to improve efficiency, but often at the cost of clarity and performance. In this work, we propose $\underline{Hy}$brid $\underline{Rea}$soning ($\texttt{HyRea}$), a unified framework that enables LLMs to dynamically switch between explicit (token-based) and latent (embedding-based) reasoning during inference. To train the model to make these decisions effectively, we introduce a two-stage training pipeline: (1) a supervised cold-start phase that introduces latent reasoning by replacing low-entropy CoT steps with embeddings, and (2) a reinforcement learning phase using Group Relative Policy Optimization (GRPO) to fine-tune the model’s reasoning strategy based on task-specific rewards.
Experiments on mathematical reasoning benchmarks show that \texttt{HyRea} achieves significant reductions in token usage while maintaining or improving accuracy, offering an effective and scalable solution for efficient multi-step reasoning in LLMs.
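Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Logical Reasoning #Self-evolving Training #Large Language Models #Parallel Scaling #Test time scaling
🎯 Motivation: Most LLM-based approaches train with a single reasoning modality, typically natural language, which limits synergy among modalities and constrains logical reasoning.
❓ Problem: Introduces a framework that reasons across several modalities to close this gap and improve logical reasoning.
🔍 Analysis: Different reasoning modalities have complementary strengths; truth-table reasoning helps overcome key bottlenecks in natural language inference.
🛠️ Method: Proposes Mixture-of-Thought (MoT), with two phases, self-evolving MoT training and MoT inference, that jointly use natural language, code, and truth-table reasoning.
📊 Data & Experiments: On logical reasoning benchmarks including FOLIO and ProofWriter, MoT improves average accuracy by up to 11.7 percentage points over single-modality baselines and is particularly effective on harder problems.
⭐ Contributions: A framework for reasoning across modalities that demonstrates the value of modality synergy and significantly improves LLM logical reasoning.
Full abstract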
Human beings naturally utilize multiple reasoning modalities to learn and solve logical problems, i.e., different representational formats such as natural language, code, and symbolic logic. In contrast, most existing LLM-based approaches operate with a single reasoning modality during training, typically natural language. Although some methods explored modality selection or augmentation at inference time, the training process remains modality-blind, limiting synergy among modalities. To fill in this gap, we propose Mixture-of-Thought (MoT), a framework that enables LLMs to reason across three complementary modalities: natural language, code, and a newly introduced symbolic modality, truth-table, which systematically enumerates logical cases and partially mitigates key failure modes in natural language reasoning. MoT adopts a two-phase design: (1) **self-evolving MoT training**, which jointly learns from filtered, self-generated rationales across modalities; and (2) **MoT inference**, which fully leverages the synergy of three modalities to produce better predictions. Experiments on logical reasoning benchmarks including FOLIO and ProofWriter demonstrate that our MoT framework consistently and significantly outperforms strong LLM baselines with single-modality chain-of-thought approaches, achieving up to **+11.7pp** average accuracy gain. Further analyses show that our MoT framework benefits both training and inference stages; that it is particularly effective on harder logical reasoning problems; and that different modalities contribute complementary strengths, with truth-table reasoning helping to overcome key bottlenecks in natural language inference.
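The phrase "systematically enumerates logical cases" describes ordinary truth-table reasoning, which can be sketched as follows. The function name, the encoding of formulas as Python callables over an assignment dictionary, and the example propositions are assumptions for illustration; MoT itself has the LLM write such tables in text.

```python
from itertools import product

def entails(premises, conclusion, variables):
    """True if every assignment satisfying all premises also satisfies the conclusion."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # found a counterexample row in the truth table
    return True

variables = ["rain", "wet"]
rain_implies_wet = lambda e: (not e["rain"]) or e["wet"]

# Modus ponens: from "rain implies wet" and "rain", conclude "wet".
valid = entails([rain_implies_wet, lambda e: e["rain"]],
                lambda e: e["wet"], variables)

# Affirming the consequent: "rain implies wet" and "wet" do not entail "rain".
invalid = entails([rain_implies_wet, lambda e: e["wet"]],
                  lambda e: e["rain"], variables)
```

Exhaustive enumeration grows as 2^n in the number of variables, so it suits the small propositional problems where natural-language reasoning tends to skip cases.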
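Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#RL #Reasoning #LLM
🎯 Motivation: Reinforcement Learning with Verifiable Rewards (RLVR) is effective for training LLMs on complex reasoning, but its reliance on costly, domain-specific supervision limits scalability.
❓ Problem: How LLMs can learn from intrinsic feedback when labeled data and external rewards are unavailable.
🔍 Analysis: Intrinsic signals such as the model's own confidence can serve as an effective learning signal and support generalization to out-of-domain tasks such as code generation, with performance close to methods that use external rewards.
🛠️ Method: Proposes Intuitor, a reinforcement learning method that uses the model's own confidence (self-certainty) as the sole reward signal, replacing external rewards in Group Relative Policy Optimization (GRPO).
📊 Data & Experiments: Intuitor matches GRPO on mathematical benchmarks and generalizes better to out-of-domain tasks, without requiring gold solutions or test cases.
⭐ Contributions: Shows that intrinsic model signals can drive effective learning, offers a scalable alternative to RLVR for settings where verifiable rewards are unavailable, and releases code.
Full abstract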
Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence—termed self-certainty—as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving better generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at https://github.com/sunblaze-ucb/Intuitor
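Replacing external rewards in GRPO means that the per-response score fed into the group-relative advantage comes from the model itself. The sketch below shows only that normalization step; the score values are hypothetical stand-ins for self-certainty (the abstract does not define how it is computed), and the use of the population standard deviation with a small epsilon is an assumption.

```python
import math

def group_relative_advantages(scores, eps=1e-6):
    """Normalize per-response scores within one group sampled for the same prompt."""
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    return [(s - mean) / (std + eps) for s in scores]

# Hypothetical self-certainty scores for four responses to one prompt.
self_certainty = [0.2, 0.4, 0.6, 0.8]
advantages = group_relative_advantages(self_certainty)
```

Responses scored above the group mean receive positive advantages and are reinforced; no gold answer or test case enters the computation.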
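Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Contrastive Decoding #Multilingual Language Models #Inference-Time Knowledge Integration #Token-Level Confidence Gating #LLM
TL;DR: We propose LoRA-Gated Contrastive Decoding (LGCD), a training-free decoding method that mitigates catastrophic forgetting in language-adapted LLMs by dynamically incorporating knowledge from the original pretrained model during inference.
🎯 Motivation: LLMs adapted to a specific language through continual pretraining or instruction tuning often suffer catastrophic forgetting, which leads to factual errors. The issue is pronounced in multilingual settings, where adaptation may overwrite general world knowledge with language-specific patterns.
❓ Problem: Proposes LGCD, a training-free inference-time decoding framework that reduces factual errors in language-adapted LLMs by dynamically incorporating knowledge from the original pretrained model.
🔍 Analysis: Language adaptation can cause a model to forget general factual knowledge learned in pretraining, producing hallucinations or inaccuracies in multilingual generation; the key question is how to reuse the overwritten pretrained knowledge at inference time.
🛠️ Method: LGCD has three steps: extract factual representations from FFN layers via LoRA-based decomposition to approximate pretrained knowledge, gate decoding decisions dynamically on token-level confidence, and apply contrastive decoding with Top-K masking to revise uncertain predictions against the approximated knowledge.
📊 Data & Experiments: Extensive experiments on multiple-choice and long-form QA tasks across nine languages show that LGCD mitigates hallucinations and improves factual accuracy, with no additional training and no access to the original pretraining data.
⭐ Contributions: A training-free inference-time decoding method that reintroduces pretrained knowledge through gated contrastive decoding, improving factual generation in language-adapted multilingual models and offering an inference-side remedy for catastrophic forgetting.
Full abstract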
Large language models (LLMs) adapted to specific languages through continual pretraining or instruction tuning often suffer from catastrophic forgetting, which can lead to factual inaccuracies. This issue is particularly pronounced in multilingual settings, where adaptation may override general world knowledge with language-specific patterns. We propose LoRA-Gated Contrastive Decoding (LGCD), a training-free inference-time decoding framework that improves factuality in language-adapted LLMs by leveraging knowledge from the original pretrained model. LGCD operates by (1) extracting factual representations from Feed-Forward Network (FFN) layers via LoRA-based decomposition, approximating pretrained knowledge, (2) dynamically gating decoding based on token-level confidence, and (3) applying contrastive decoding with Top-K masking to revise uncertain predictions by referencing the approximated representation of pretrained knowledge. LGCD requires no additional training or access to the original pretraining data. Extensive experiments with LGCD on multilingual multiple-choice and long-form QA tasks across nine languages demonstrate its strong effectiveness in mitigating hallucinations and enhancing factual accuracy in language-adapted models. These results further indicate that pretrained knowledge can be strategically reintroduced during decoding to promote factual multilingual generation.
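Steps (2) and (3) of the abstract, confidence gating followed by contrastive decoding under a Top-K mask, can be sketched for a single decoding step. Everything specific here is an assumption: the threshold `tau`, the mask size `k`, the weight `alpha`, and above all the scoring rule, which is one plausible way to favor tokens the reference distribution prefers and is not the formula used by LGCD. Step (1), building the reference distribution from LoRA-decomposed FFN layers, is not shown.

```python
import math

def gated_contrastive_step(adapted_logprobs, reference_logprobs,
                           tau=0.6, k=3, alpha=1.0):
    """Pick the next token id from two aligned lists of log-probabilities."""
    token_ids = range(len(adapted_logprobs))
    best = max(token_ids, key=lambda i: adapted_logprobs[i])
    # Gate: keep the adapted model's choice when it is confident enough.
    if math.exp(adapted_logprobs[best]) >= tau:
        return best
    # Top-K mask: only the adapted model's k most likely tokens stay eligible.
    top_k = sorted(token_ids, key=lambda i: adapted_logprobs[i], reverse=True)[:k]

    def score(i):
        # Hypothetical contrastive score: shift toward the reference distribution.
        return adapted_logprobs[i] + alpha * (reference_logprobs[i] - adapted_logprobs[i])

    return max(top_k, key=score)

confident = [math.log(p) for p in (0.9, 0.05, 0.05)]
uncertain = [math.log(p) for p in (0.4, 0.35, 0.2, 0.05)]
reference = [math.log(p) for p in (0.1, 0.7, 0.1, 0.1)]

kept = gated_contrastive_step(confident, [math.log(1 / 3)] * 3)
revised = gated_contrastive_step(uncertain, reference)
```

In the second call the adapted model is unsure, so the reference distribution moves the choice from token 0 to token 1; the Top-K mask prevents tokens that the adapted model considers very unlikely from being selected.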
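Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#LLM #reasoning #process reward model #reinforcement learning
🎯 Motivation: Existing Process Reward Models (PRMs) either fail to capture dependencies between reasoning steps or struggle to align process rewards with the final outcome, which leads to ambiguous credit assignment and suboptimal performance.
❓ Problem: Proposes Conditional Reward Modeling (CRM), which conditions each step's reward on the preceding steps and links it explicitly to the final outcome, resolving the credit assignment ambiguity.
🔍 Analysis: Existing models treat steps in isolation, and their reward signals do not respect the temporal causality of sequential reasoning, which leaves downstream models vulnerable to reward hacking.
🛠️ Method: CRM models the reasoning trajectory with conditional probability rules, so that rewards capture causal relationships among steps and support more reliable cross-sample comparison.
📊 Data & Experiments: Experiments across Best-of-N sampling, beam search, and reinforcement learning show that CRM consistently outperforms existing reward models.
⭐ Contributions: A principled conditional reward framework that improves LLM reasoning, is more robust to reward hacking, and delivers stable downstream improvements.
Full abstract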
Process Reward Models (PRMs) have emerged as a promising approach to enhance the reasoning capabilities of large language models (LLMs) by guiding their step-by-step reasoning toward a final answer.
However, existing PRMs either treat each reasoning step in isolation, failing to capture inter-step dependencies, or struggle to align process rewards with the final outcome.
Consequently, the reward signal fails to respect temporal causality in sequential reasoning and faces ambiguous credit assignment.
These limitations make downstream models vulnerable to reward hacking and lead to suboptimal performance.
In this work, we propose Conditional Reward Modeling (CRM) that frames LLM reasoning as a temporal process leading to a correct answer.
The reward of each reasoning step is not only conditioned on the preceding steps but also explicitly linked to the final outcome of the reasoning trajectory. By enforcing conditional probability rules, our design captures the causal relationships among reasoning steps, with the link to the outcome allowing precise attribution of each intermediate step, thereby resolving credit assignment ambiguity.
Further, through this consistent probabilistic modeling, the rewards produced by CRM enable more reliable cross-sample comparison.
Experiments across Best-of-N sampling, beam search and reinforcement learning demonstrate that CRM consistently outperforms existing reward models, offering a principled framework for enhancing LLM reasoning.
In particular, CRM is more robust to reward hacking and delivers stable downstream improvements without relying on verifiable rewards derived from ground truth.
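Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Multilingual #Reasoning #Long Chain-of-Thought
TL;DR: We analyze scaling trends, pretraining, and inference to understand long CoT reasoning transfer to non-English languages, demonstrating that post-training techniques overcome key limitations.
🎯 Motivation: Studies how the long chain-of-thought abilities of reasoning models transfer to non-English languages, which remains poorly understood.
❓ Problem: Analyzes the limiting factors at four stages (scaling, pretraining, post-training, and inference) in order to improve long CoT reasoning beyond English.
🔍 Analysis: Scaling model size improves multilingual task performance in En-CoT, but Target-CoT lags behind, especially on tasks needing long multi-step CoTs; inference efficiency also differs across languages.
🛠️ Method: Compares multilingual pretraining strategies and post-training on synthetic data, contrasting reasoning in English with reasoning in the target language.
📊 Data & Experiments: Evaluates En-CoT and Target-CoT across nine non-English target languages; fine-tuning on traces automatically translated from gold English traces outperforms fine-tuning on target-language traces distilled from large reasoning models.
⭐ Contributions: Identifies performance gaps and bottlenecks in multilingual long CoT reasoning, shows which post-training data curation works better, and releases models, datasets, and code.
Full abstract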
While large reasoning models have shown remarkable ability to generate long chains-of-thought (CoTs) in English, we still lack understanding of how these long-form reasoning abilities transfer to the vast majority of the world’s languages. In this work, we systematically investigate four key stages of model development (scaling, pretraining, post-training, and inference) to understand how long CoT capabilities extend beyond English. We compare two reasoning settings across nine non-English target languages: En-CoT, where models process target-language inputs, but reason in English; and Target-CoT, where models both process inputs and generate long CoTs in the target language. We find that scaling reasoning model size improves multilingual task performance in En-CoT, but Target-CoT performance lags behind. This gap widens for tasks requiring long, multi-step CoTs such as mathematical reasoning. Shifting to pretraining, we find that adding a specialized reasoning stage enhances En-CoT performance but degrades Target-CoT, whereas broad multilingual pretraining improves both modes simultaneously. Given the scarcity of high-quality reasoning traces in languages other than English, we explore synthetic data curation approaches for post-training. We demonstrate that fine-tuning on reasoning traces automatically translated from gold English traces outperforms fine-tuning on target-language traces distilled from large reasoning models. Finally, we report disparities in inference efficiency between languages and uncover language-specific failure modes in CoTs. We release models, datasets, and code to foster further research.
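Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#RLVR #GRPO #rollout #LLM #reasoning
🎯 Motivation: RLVR algorithms for improving LLM reasoning are limited by the low diversity of trajectories sampled during group rollouts.
❓ Problem: Proposes a rollout strategy that increases trajectory diversity, addressing the homogeneity caused by token-level stochastic sampling.
🔍 Analysis: Under token-level sampling, local variations tend to collapse into near-identical reasoning paths, which weakens the reward signal available for policy updates.
🛠️ Method: Introduces Lookahead Tree-Based Rollouts (LATR), which branches at high-uncertainty generation steps, runs lookahead simulation for each new branch, and prunes branches that stay similar during simulation.
📊 Data & Experiments: With both GRPO and DAPO across different reasoning tasks, LATR accelerates policy learning by 131% on average and improves final pass@1 by 4.2% compared with stochastic sampling.
⭐ Contributions: LATR improves trajectory diversity and policy-learning efficiency for RLVR, and the code and data are released.
Full abstract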
Reinforcement Learning with Verifiable Rewards (RLVR), particularly with algorithms like Group Relative Policy Optimization (GRPO), has proven highly effective in enhancing the reasoning capabilities of large language models. However, a critical bottleneck in current pipelines lies in the limited diversity of sampled trajectories during group rollouts. Homogeneous trajectories and their associated rewards would diminish the return signals for policy updates, thereby hindering effective policy learning. This lack of diversity stems primarily from token-level stochastic sampling, where local variations are likely to collapse into near-identical reasoning paths. To address this limitation, we propose Lookahead Tree-Based Rollouts (LATR), a novel rollout strategy designed to explicitly promote trajectory-level diversity by enforcing branching into different candidate tokens likely to yield distinct continuations. Specifically, LATR iteratively operates in three stages: (1) branching at high-uncertainty generation steps, (2) performing lookahead simulation for each new branch, and (3) pruning branches that exhibit prolonged similarity during simulation. Compared with Stochastic Sampling, LATR accelerates policy learning by 131% on average and improves final pass@1 performance by 4.2% on both GRPO and Dynamic sAmpling Policy Optimization (DAPO) algorithms across different reasoning tasks. Our code and data are available at https://github.com/starreeze/latr.
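Stage (1), branching at high-uncertainty generation steps, can be illustrated with a simple rule on the next-token distribution. Using Shannon entropy as the uncertainty measure, the threshold value, and the branch count are assumptions for this sketch; the lookahead simulation and similarity-based pruning of stages (2) and (3) are not shown.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def branch_candidates(probs, entropy_threshold=1.0, max_branches=2):
    """Return token ids to continue from; a single id means no branching here."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if token_entropy(probs) < entropy_threshold:
        return ranked[:1]  # confident step: follow the top token only
    return ranked[:max_branches]  # uncertain step: fork into distinct tokens

confident_step = branch_candidates([0.97, 0.01, 0.01, 0.01])
uncertain_step = branch_candidates([0.35, 0.30, 0.20, 0.15])
```

Forcing the branches onto different tokens at an uncertain step is what distinguishes this from sampling the same distribution several times, where most samples would share the top token.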
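Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#MLLM #Reasoning
🎯 Motivation: Multimodal large language models (MLLMs) do well on reasoning tasks such as mathematics and logic, but their capacity for long-chain reflective reasoning, which complex real-world problems require, is largely unexplored and has not been systematically evaluated.
❓ Problem: Builds a dedicated multimodal benchmark to quantify MLLM deficits in long-chain reflective reasoning and develops a training method to improve this ability.
🔍 Analysis: On the new MM-HELIX benchmark, existing MLLMs perform markedly worse on tasks that require iterative thinking and backtracking.
🛠️ Method: Proposes Adaptive Hybrid Policy Optimization (AHPO), which dynamically unifies offline supervision and online optimization to handle sparse rewards and catastrophic forgetting, and uses the Step-Elicited Response Generation pipeline to build the instruction-tuning dataset MM-HELIX-100K.
📊 Data & Experiments: MM-HELIX contains 1,260 samples across 42 synthetic tasks, and MM-HELIX-100K contains 100k reflective reasoning traces. On Qwen2.5-VL-7B, the method improves accuracy by 18.6% on MM-HELIX and by 5.7% on average on general mathematics and logic tasks.
⭐ Contributions: A multimodal benchmark and dataset focused on long-chain reflective reasoning; the AHPO training strategy; and evidence that reflective reasoning can be learned and generalized in MLLMs.
Full abstract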
While current Multimodal Large Language Models (MLLMs) have demonstrated proficiency in reasoning tasks such as mathematics and logic, their capacity for long-chain reflective reasoning, a prerequisite for solving complex real-world problems, remains largely underexplored. In this work, we first conduct an extensive empirical investigation to evaluate this capability. Leveraging a carefully designed data synthesis engine, we construct MM-HELIX, a multimodal benchmark consisting of 1,260 samples across 42 challenging synthetic tasks that require iterative thinking and backtracking. Empirical results on this benchmark reveal that existing MLLMs exhibit significant performance deficits in long-chain reflective reasoning. To address this limitation, we generate post-training data and further explore learning paradigms for exploiting such data. We first develop the Step-Elicited Response Generation pipeline to create MM-HELIX-100K, a large-scale dataset of 100k high-quality, reflective reasoning traces for the instruction-tuning stage. Given that standard Reinforcement Learning fails on complex tasks due to sparse reward signals and catastrophic forgetting after Supervised Fine-Tuning, we propose Adaptive Hybrid Policy Optimization (AHPO), a novel training strategy that dynamically unifies offline supervision and online optimization into a single stage. This strategy enables the model to learn from expert data when rewards are sparse and conduct independent exploration once proficient. When applied to the Qwen2.5-VL-7B baseline, our method achieves a +18.6\% accuracy improvement on the MM-HELIX benchmark and demonstrates strong generalization with a +5.7\% average performance gain on general mathematics and logic tasks. Our work demonstrates that reflective reasoning in MLLMs can be effectively learned and generalized, paving the way for developing more capable MLLMs.
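Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Efficient LLM #CoT compression
🎯 Motivation: LLMs excel at Chain-of-Thought (CoT) reasoning, but verbose reasoning raises inference cost and lowers efficiency.
❓ Problem: Proposes a CoT compression framework based on step entropy to identify and remove redundant reasoning steps.
🔍 Analysis: Theory and experiments indicate that low-entropy steps are highly redundant: 80% of low-entropy intermediate steps can be pruned without significantly hurting final-answer accuracy, whereas random or high-entropy pruning severely impairs reasoning.
🛠️ Method: A two-stage training strategy combining Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO), so that the model learns to generate compressed CoTs containing [SKIP] tokens at inference time.
📊 Data & Experiments: Validated on mathematical reasoning benchmarks with DeepSeek-R1-7B and 14B, and Qwen3-8B, showing higher inference efficiency with accuracy preserved.
⭐ Contributions: An entropy-based CoT compression framework, a two-stage training strategy for efficient reasoning, and new insight into the structure of LLM reasoning.
Full abstract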
Large Language Models (LLMs) using Chain-of-Thought (CoT) prompting excel at complex reasoning but generate verbose thought processes with considerable redundancy, leading to increased inference costs and reduced efficiency. We introduce a novel CoT compression framework based on step entropy, a metric that quantifies the informational contribution of individual reasoning steps to identify redundancy. Through theoretical analysis and extensive empirical validation on mathematical reasoning benchmarks, we demonstrate that steps with low entropy are indeed highly redundant. Our experiments reveal that an astonishing 80% of low-entropy intermediate steps can be pruned without significant degradation in the final answer accuracy across DeepSeek-R1-7B, 14B and Qwen3-8B. This finding sharply contrasts with random or high-entropy pruning, which severely impairs reasoning performance. Building on this, we propose a novel two-stage training strategy combining Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) reinforcement learning. This approach enables LLMs to autonomously learn to generate compressed CoTs during inference by strategically incorporating [SKIP] tokens. Our method significantly enhances LLM inference efficiency while rigorously preserving accuracy, offering profound implications for practical LLM deployment and a deeper understanding of reasoning structures.
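A minimal sketch of entropy-based pruning follows. It assumes that a step's entropy is the sum of the entropies of its token distributions and that a chosen fraction of the lowest-entropy steps is replaced by a `[SKIP]` marker; the abstract does not spell out either definition, so both are hypothetical, as are the function names.

```python
import math

def step_entropy(token_distributions):
    """Sum of token-level Shannon entropies (in nats) over one reasoning step."""
    return sum(-sum(p * math.log(p) for p in dist if p > 0)
               for dist in token_distributions)

def compress_cot(steps, step_entropies, prune_ratio=0.8, skip_token="[SKIP]"):
    """Replace the lowest-entropy fraction of steps with skip_token, keeping order."""
    num_pruned = int(len(steps) * prune_ratio)
    by_entropy = sorted(range(len(steps)), key=lambda i: step_entropies[i])
    pruned = set(by_entropy[:num_pruned])
    return [skip_token if i in pruned else s for i, s in enumerate(steps)]

steps = ["restate the problem", "set up the equation", "copy the equation", "solve for x"]
entropies = [0.1, 2.0, 0.2, 1.5]  # hypothetical per-step entropies
compressed = compress_cot(steps, entropies, prune_ratio=0.5)
```

Sequences compressed this way could serve as supervised targets, which is one way to read the SFT stage described in the abstract.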
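Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#LLM Reasoning #Mathematical Reasoning #Data Augmentation
🎯 Motivation: Mathematical reasoning is a critical frontier for LLMs, and the quality of reasoning steps in training data constrains model performance.
❓ Problem: Existing step-expansion methods need more powerful external models or incur substantial compute; an efficient, scalable alternative is needed.
🔍 Analysis: More detailed intermediate steps improve model performance, but current methods fall short on cost and resource dependence.
🛠️ Method: Proposes MathFimer, inspired by the "Fill-in-the-middle" task from code reasoning: solution chains are decomposed into prefix-suffix pairs, and a model is trained to reconstruct the missing intermediate steps.
📊 Data & Experiments: Trains MathFimer-7B on the curated NuminaMath-FIM dataset, uses it to expand existing mathematical reasoning datasets such as MathInstruct and MetaMathQA, and verifies gains on benchmarks such as GSM8K and MATH.
⭐ Contributions: A practical, scalable way to improve mathematical reasoning without powerful external models or expensive inference procedures.
Full abstract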
Mathematical reasoning represents a critical frontier in advancing large language models (LLMs). While step-by-step approaches have emerged as the dominant paradigm for mathematical problem-solving in LLMs, the quality of reasoning steps in training data fundamentally constrains the performance of the models.
Recent studies have demonstrated that more detailed intermediate steps can enhance model performance, yet existing methods for step expansion either require more powerful external models or incur substantial computational costs.
In this paper, we introduce MathFimer, a novel framework for mathematical reasoning step expansion inspired by the "Fill-in-the-middle" task from code reasoning.
By decomposing solution chains into prefix-suffix pairs and training models to reconstruct missing intermediate steps, we develop a specialized model, MathFimer-7B, on our carefully curated NuminaMath-FIM dataset.
We then apply these models to enhance existing mathematical reasoning datasets by inserting detailed intermediate steps into their solution chains, creating MathFimer-expanded versions.
Through comprehensive experiments on multiple mathematical reasoning datasets, including MathInstruct, MetaMathQA, and others, we demonstrate that models trained on MathFimer-expanded data consistently outperform their counterparts trained on original data across various benchmarks such as GSM8K and MATH.
Our approach offers a practical, scalable solution for enhancing mathematical reasoning capabilities in LLMs without relying on powerful external models or expensive inference procedures.
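The prefix-suffix decomposition described above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify how NuminaMath-FIM examples are formatted, so the dictionary layout and the function name `make_fim_examples` are hypothetical.

```python
def make_fim_examples(steps):
    """Build fill-in-the-middle training examples from one solution chain.

    For every interior step, the steps before it form the prefix, the steps
    after it form the suffix, and the held-out step is the target "middle".
    """
    examples = []
    for i in range(1, len(steps) - 1):
        examples.append({
            "prefix": steps[:i],
            "suffix": steps[i + 1:],
            "middle": steps[i],
        })
    return examples
```

A model trained on such triples could then be asked to propose a missing step between two adjacent steps of an existing solution, which is the expansion use case the abstract describes.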
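Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#visual reasoning #adaptive reasoning #multimodal large language models
TL;DR: We introduce a mixture-of-visual-thoughts paradigm that unifies different visual reasoning modes within a model and guides it to adaptively select the appropriate mode based on context, achieving consistent gains across various scenarios.
🎯 Motivation: Existing visual reasoning methods mostly focus on specific reasoning modes. They improve particular domains but struggle to develop general reasoning ability, so the work aims to build a general visual reasoning model that adapts to different scenarios.
❓ Problem: To address the limits of current models in general visual reasoning, the paper proposes the MoVT paradigm, which unifies multiple reasoning modes and selects among them adaptively based on context to improve cross-scenario performance.
🔍 Observations: Visual reasoning models usually rely on a single or fixed mode, which leads to weak generalization on diverse tasks and limits their practical scope and robustness.
🛠️ Method: Proposes AdaVaR, a two-stage learning framework: a supervised cold-start stage unifies and learns multiple reasoning modes, and reinforcement learning with a carefully designed AdaGRPO algorithm induces the mode-selection ability.
📊 Data and experiments: Extensive experiments across multiple scenarios show that AdaVaR guides the model to learn and differentiate multiple modes and to select among them based on context, yielding consistent gains.
⭐ Contributions: Introduces the MoVT adaptive reasoning paradigm and the AdaVaR learning framework with the AdaGRPO algorithm, offering an effective route to general visual reasoning models with improved cross-scenario performance.
Abstract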
Current visual reasoning methods mainly focus on exploring specific reasoning modes. Although improvements can be achieved in particular domains, they struggle to develop general reasoning capabilities. Inspired by this, we propose a novel adaptive reasoning paradigm, $\underline{\text{M}}$ixture-$\underline{\text{o}}$f-$\underline{\text{V}}$isual-$\underline{\text{T}}$houghts (**MoVT**), which unifies different reasoning modes within a single model and guides it to select the appropriate mode based on context. To achieve this, we introduce **AdaVaR**, a two-stage $\underline{\text{Ada}}$ptive $\underline{\text{V}}$isu$\underline{\text{a}}$l $\underline{\text{R}}$easoning learning framework: different modes are unified and learned during the supervised cold-start stage, and the mode selection capability is induced via an RL process with a carefully designed AdaGRPO algorithm. Extensive experiments show that AdaVaR effectively guides the model to learn and differentiate multiple modes and perform context-adaptive mode selection, achieving consistent improvement across various scenarios, highlighting MoVT as an effective solution for building general visual reasoning models.
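Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#test-time compute #reasoning #diversity
🎯 Motivation: Parallel sampling at test time is limited by diversity collapse: models concentrate on a few modes and repeat the same mistakes, so the allocation of sampling compute needs to be improved.
❓ Problem: Proposes a framework that explicitly allocates sampling compute across reasoning modes to counter diversity collapse and improve test-time efficiency and performance.
🔍 Observations: Standard training underuses the diversity in the data, which causes over-concentration on a few modes and caps sampling efficiency and reasoning performance.
🛠️ Method: Introduces the mode-conditioning (ModC) framework, which allocates sampling compute using specialist models or mode-specific prefixes, and uses gradient clustering to work without predefined mode labels.
📊 Data and experiments: Validated on graph-search tasks and math reasoning benchmarks with models from 0.5B to 7B; fine-tuning on OpenThoughts improves sampling efficiency, and gains are also reported on datasets such as NuminaMath.
⭐ Contributions: Offers a simple, effective way to unlock the benefits of data diversity; mode-conditioning improves test-time sampling efficiency and diversity use, and also improves results after reinforcement learning.
Abstract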
Parallel sampling is essential to test-time scaling and reinforcement learning (RL), but its effectiveness is sharply limited by diversity collapse, where models concentrate on a few modes and repeated samples produce the same mistakes. We propose the mode-conditioning (ModC) framework, which explicitly allocates sampling compute across reasoning modes using either specialist models or mode-specific prefixes. With predefined mode labels, ModC consistently improves test-time scaling (Pass@k) across controlled graph-search tasks and math reasoning benchmarks, spanning model families and sizes from 0.5B to 7B. On OpenThoughts, fine-tuning Qwen2.5-7B with ModC achieves a 4× efficiency gain over standard training while also improving the maximum attainable Pass@k. We further show that gradient clustering enables ModC without predefined mode labels, yielding up to 10% gains on datasets such as NuminaMath. Finally, we show that ModC improves Pass@k after RL training and can further boost the Pass@k gains of diversity-inducing RL methods. These results demonstrate that standard training underutilizes the diversity in data, and that ModC provides a simple, effective remedy for unlocking the full benefits of diversity in parallel sampling.
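Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Vision-Language Models #Multimodal Reasoning #Reinforcement Learning
🎯 Motivation: Large language models (LLMs) show strong reasoning ability through reinforcement learning, and recent work extends reasoning to vision-language models (VLMs). The side effects of reasoning in VLMs are not well understood, especially the balance between perception and logic.
❓ Problem: The paper reveals the dual nature of multimodal reasoning: reasoning strengthens logical ability but can damage visual grounding. The goal is to address visual forgetting so that the model keeps relying on visual information during complex reasoning.
🔍 Observations: Overly long reasoning chains cause the model to gradually ignore the visual input and fail on basic visual questions, a phenomenon called "visual forgetting". It weakens multimodal perception, especially on tasks that need visual grounding.
🛠️ Method: Proposes Vision-Anchored Policy Optimization (VAPO), which explicitly steers reasoning trajectories toward the visual input within a reinforcement learning framework. The method is used to train VAPO-Thinker-7B.
📊 Data and experiments: Evaluated on multiple benchmarks; the method strengthens the model's use of visual information and reaches new state-of-the-art results on various visual tasks.
⭐ Contributions: Systematically reveals the dual effect of reasoning and the visual forgetting phenomenon in multimodal reasoning; proposes VAPO to address it; validates the method experimentally, supporting the use of VLMs on complex reasoning tasks.
Abstract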
Reasoning has emerged as a pivotal capability in Large Language Models (LLMs). Through Reinforcement Learning (RL), typically Group Relative Policy Optimization (GRPO), these models are able to solve complex tasks such as mathematics and code generation. Building on these advances, recent research has sought to extend reasoning to Vision-Language Models (VLMs), yielding promising results across diverse visual tasks. Despite this progress, our study uncovers the dual nature of multimodal reasoning: while it substantially enhances logical inference and facilitates performance on challenging problems, longer reasoning length may gradually impair perceptual grounding, leading to recognition failures on otherwise basic visual questions. Through further analysis, we attribute this phenomenon to visual forgetting, wherein prolonged reasoning length causes models to disregard visual input. To address this, we propose Vision-Anchored Policy Optimization (VAPO), a simple yet effective method that explicitly steers the reasoning process toward visually grounded trajectories. Our resulting model, VAPO-Thinker-7B, significantly strengthens the model's reliance on visual information and achieves new state-of-the-art results on various benchmarks.
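Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#multi-agent debate #memory selection #robustness
🎯 Motivation: LLMs perform well on reasoning tasks, and multi-agent debate frameworks refine reasoning over several debate rounds, but they remain affected by erroneous memories and need better robustness to errors.
❓ Problem: Multi-agent debate depends on the quality of historical memories, and erroneous memories reduce reasoning performance, so a mechanism is needed to filter and refine memory content.
🔍 Observations: Theoretical analysis shows that performance is closely tied to the quality of memories produced in the previous debate round, so erroneous memories threaten the reliability of the debate framework.
🛠️ Method: Proposes multi-agent debate with memory masking (MAD-M$^2$), which masks erroneous memories at the start of each debate round and keeps only useful information in the context.
📊 Data and experiments: Extensive experiments and analysis on mainstream mathematical and logical reasoning benchmarks verify that the framework filters erroneous memories and improves reasoning performance.
⭐ Contributions: Proposes the MAD-M$^2$ framework, which improves the robustness of multi-agent debate; gives a theoretical insight into how memory quality affects performance; shows experimentally that the method outperforms MAD on reasoning tasks.
Abstract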
Large language models (LLMs) have recently demonstrated impressive capabilities in reasoning tasks. Currently, mainstream LLM reasoning frameworks predominantly focus on scaling up inference-time sampling to enhance performance. In particular, among all LLM reasoning frameworks, *multi-agent debate* (MAD), which employs multiple LLMs as agents to perform reasoning in the way of multi-round debate, has emerged as a powerful reasoning paradigm since it allows agents to access previous memories to alleviate fallacious content and refine their reasoning iteratively in each debate round. However, although MAD significantly improves the reasoning capabilities of LLMs, in this paper, we observe that there remain erroneous memories, and LLM agents are vulnerable to these erroneous memories. To explore this phenomenon, we provide a theoretical insight that the performance of MAD is highly dependent on the quality of memories derived from the previous debate, indicating that the existence of erroneous memories poses a threat to the performance of MAD. To address this problem, we introduce a simple yet effective multi-agent debate framework, *multi-agent debate with memory masking* (MAD-M$^2$), to improve the robustness of MAD by allowing LLM agents to mask erroneous memories from the previous debate round at the beginning of each debate round. In this way, MAD-M$^2$ can polish the contextual information before each debate round by preserving informative and meaningful memories while discarding the erroneous memories. Extensive experiments and analyses on mainstream mathematical and logical reasoning benchmarks demonstrate that MAD-M$^2$ can identify the erroneous memories and achieve better performance in reasoning than MAD.
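Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Language models #in-context learning #reasoning #interpretability #decoding
TL;DR: If a shallow auxiliary prediction head struggles to approximate the full next-token prediction, we can infer that the model is doing complex in-context computation.
🎯 Motivation: The in-context computational effort of language models is hard to quantify. Metrics such as next-token loss do not capture reasoning complexity, and compressibility-based metrics can be unstable and invasive.
❓ Problem: Proposes a non-invasive metric, Multiple Token Divergence (MTD), to measure the in-context computational effort of language models and to address the limits of prior methods on complex reasoning tasks.
🔍 Observations: Complex in-context computation widens the gap between the model's output distribution and that of a shallow prediction head, so in-context reasoning complexity correlates positively with MTD.
🛠️ Method: Defines MTD as the KL divergence between the model's full output distribution and that of an auxiliary prediction head, computed without additional training, and proposes a decoding method, Divergence Steering, to control the computational character of generated text.
📊 Data and experiments: On mathematical reasoning benchmarks, MTD correlates positively with problem difficulty and distinguishes complex from simple tasks better than prior methods; lower MTD is associated with more accurate reasoning.
⭐ Contributions: Provides MTD as a lightweight tool for analyzing and steering the in-context reasoning dynamics of language models, improving interpretability and control on complex reasoning tasks.
Abstract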
Measuring the in-context computational effort of language models is a key challenge, as metrics like next-token loss fail to capture reasoning complexity. Prior methods based on latent state compressibility can be invasive and unstable. We propose Multiple Token Divergence (MTD), a simple measure of computational effort defined as the KL divergence between a model's full output distribution and that of a shallow, auxiliary prediction head. MTD can be computed directly from pre-trained models with multiple prediction heads, requiring no additional training. Building on this, we introduce Divergence Steering, a novel decoding method to control the computational character of generated text. We empirically show that MTD is more effective than prior methods at distinguishing complex tasks from simple ones. On mathematical reasoning benchmarks, MTD correlates positively with problem difficulty. Lower MTD is associated with more accurate reasoning. MTD provides a practical, lightweight tool for analyzing and steering the computational dynamics of language models.
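As a rough illustration of the quantity defined above, the KL divergence between two next-token distributions can be computed as below. The direction KL(full || auxiliary) and the averaging over positions are assumptions made for this sketch, since the abstract does not state them, and the function names are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def multiple_token_divergence(full_dists, aux_dists):
    """Mean per-position KL between the full model head and a shallow auxiliary head.

    Assumes full_dists[t] and aux_dists[t] are probability vectors for the
    same target position t.
    """
    kls = [kl_divergence(p, q) for p, q in zip(full_dists, aux_dists)]
    return sum(kls) / len(kls)
```

Under this reading, a value near zero means the shallow head already predicts what the full model predicts, while a large value suggests the deeper layers contributed computation that the shallow head could not reproduce.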
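Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#large language models #reinforcement learning with verifiable rewards #llm reasoning
TL;DR: Zero-variance prompts are a valuable source of learning signals for RLVR to improve LLM reasoning.
🎯 Motivation: Current Reinforcement Learning with Verifiable Rewards (RLVR) methods learn only from prompts whose responses differ in correctness and ignore zero-variance prompts, where all responses receive the same reward, leaving potential learning signal unused.
❓ Problem: Explores how to extract meaningful learning signals from zero-variance prompts to improve the reasoning ability of large language models (LLMs).
🔍 Observations: Although responses to zero-variance prompts do not differ in reward, fine-grained feedback can still provide useful guidance, which is an underused source of optimization signal.
🛠️ Method: Proposes Reinforcement Learning with Zero-Variance Prompts (RL-ZVP), which rewards correctness and penalizes errors directly and modulates the feedback with token-level characteristics.
📊 Data and experiments: On six math reasoning benchmarks, RL-ZVP improves over GRPO by up to 8.61 points in accuracy and 7.77 points in pass rate.
⭐ Contributions: Shows the learning value of zero-variance prompts and introduces the RL-ZVP algorithm, which improves LLM reasoning performance over methods that discard such prompts.
Abstract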
Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful framework for improving the reasoning abilities of Large Language Models (LLMs). However, current methods such as GRPO rely only on problems where the model responses to the same input differ in correctness, while ignoring those where all responses receive the same reward, so-called zero-variance prompts. In this work, we argue that such prompts are not useless but can, in fact, provide meaningful feedback for policy optimization. To this end, we introduce Reinforcement Learning with Zero-Variance Prompts (RL-ZVP), a novel algorithm that extracts learning signals from zero-variance prompts. RL-ZVP directly rewards correctness and penalizes errors even without contrasting responses, modulating feedback with token-level characteristics to preserve informative, nuanced signals. Across six math reasoning benchmarks, RL-ZVP achieves significant improvements of up to 8.61 points in accuracy and 7.77 points in pass rate over GRPO, while consistently outperforming other baselines that filter out zero-variance prompts. These results highlight the untapped potential of learning from zero-variance prompts in RLVR.
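The reason zero-variance prompts contribute nothing under GRPO can be seen from the group-normalized advantage that GRPO-style methods commonly use. The sketch below shows only that baseline behaviour, not RL-ZVP itself, whose token-level formulation is not given in the abstract; the function name and the `eps` stabilizer are assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward by the group's mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# All rollouts correct: every advantage is zero, so the prompt yields no gradient.
zero_variance = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# Mixed correctness: advantages are non-zero and carry a learning signal.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

The same collapse happens when every rollout is wrong, which is why a method that wants to learn from such prompts has to build its signal from something other than the within-group reward contrast.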
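Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Nudging LLM #LLM Reasoning #GRPO
🎯 Motivation: Online reinforcement learning algorithms such as GRPO cannot learn from problems that are "unsolvable" for the model. They improve only on problems the model can already solve, so the upper limit of its reasoning ability does not move.
❓ Problem: Uses self-generated hints to reduce problem difficulty so that the model can learn from hard problems that otherwise yield no reward, with the aim of raising the upper limit of LLM reasoning.
🔍 Observations: When a hard sample has a 0% pass rate, no gradient signal is produced. Standard GRPO training raises the success probability on solvable samples but does not change the hardest problems the model can solve.
🛠️ Method: Proposes NuRL, which lowers problem difficulty with abstract hints generated by the model. During online RL training, pre-generated hints are injected for hard problems and rollouts are resampled, which yields valid trajectories and training signal while avoiding distributional shift.
📊 Data and experiments: Experiments on six diverse benchmarks and three models show consistent gains over GRPO, and indicate that hints work best when they are abstract and applied after GRPO has converged.
⭐ Contributions: Proposes NuRL, a method that raises the upper limit of LLM reasoning and improves pass rates on hard problems, and studies what makes hints effective and when to use them.
Abstract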
Current online reinforcement learning (RL) algorithms like GRPO share a key limitation in LLM reasoning: they cannot learn from problems that are "unsolvable" to the model. In other words, they can only improve performance on problems where the model is capable of exploring the correct answer. If a problem is too difficult -- such that even hundreds of attempts never produce a correct solution -- the model cannot learn from it. Consequently, the model's "upper limit" remains unchanged after RL training, even though the likelihood of solving easier, solvable problems may increase. These hard, unsolvable samples -- though potentially rich in learning signal -- cannot contribute to training, as no rollouts yield rewards and thus no gradients are produced. To unlock learning from these hard samples, we propose NuRL, a "nudging" method that aims to push the upper bound of LLM reasoning using self-generated hints, i.e., abstract cues that help reduce the problem difficulty for the model. Given a question and its gold answer, the model generates a Chain-of-Thought (CoT) and then produces a hint containing the core knowledge needed to solve the problem. During online RL training, we generate G rollouts from the base policy and use the pass rate to decide whether the hint should be injected. For hard samples with a 0% pass rate, we inject the offline-generated hint and regenerate a new batch of trajectories. This yields two benefits: (1) the hint boosts pass rates (from 0% to non-zero), thereby introducing training signals for previously unsolvable samples, and (2) the hints are self-generated (conditioned on the gold answer), which avoids distributional shift and does not rely on external models. Compared to standard GRPO, NuRL achieves consistent improvements across six diverse benchmarks and three models, while remaining complementary to test-time scaling. Notably, NuRL can raise the model's upper limit, whereas GRPO leaves pass@1024 unchanged from the base model.
Furthermore, we present a systematic study of what makes an effective hint and when hints are most useful. Interestingly, the best hints are abstract and high-level -- as revealing gold answers actually hurts performance -- and are most beneficial when applied only when necessary and after GRPO has converged.
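The hint-injection rule described in the abstract (sample G rollouts, and resample with the hint only when the pass rate is 0%) can be sketched as a small control-flow function. Everything here is a hypothetical illustration: `sample_fn`, `reward_fn`, the prompt template, and the function name are assumptions, not the authors' implementation.

```python
def collect_rollouts(question, hint, sample_fn, reward_fn, group_size=8):
    """Sample a rollout group and retry with the hint if every rollout fails.

    sample_fn(prompt) returns one model response; reward_fn(response)
    returns 1.0 for a correct answer and 0.0 otherwise.
    """
    rollouts = [sample_fn(question) for _ in range(group_size)]
    rewards = [reward_fn(r) for r in rollouts]
    used_hint = False
    # A 0% pass rate means no reward signal, so inject the offline-generated hint.
    if sum(rewards) == 0 and hint is not None:
        prompt = f"{question}\nHint: {hint}"
        rollouts = [sample_fn(prompt) for _ in range(group_size)]
        rewards = [reward_fn(r) for r in rollouts]
        used_hint = True
    return rollouts, rewards, used_hint
```

Gating the hint on the pass rate keeps unhinted training on every problem the model can already solve, which matches the abstract's finding that hints help most when applied only when necessary.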
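Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Chain of Thoughts #Spatial Reasoning
🎯 Motivation: Spatial reasoning is central to auditory perception, but current audio large language models rely on unstructured binaural cues and single-step inference, which limits the accuracy and interpretability of direction and distance estimation.
❓ Problem: Aims to overcome the reliance on coarse categorical labels and the lack of geometric supervision in current methods, improving spatial reasoning performance and resolution.
🔍 Observations: Existing models such as BAT are limited by coarse-grained labels in binaural spatial question answering and struggle to estimate direction and distance with high resolution and robustness.
🛠️ Method: Proposes the geometry-aware SAGE encoder, which aligns binaural acoustic features with 3D spatial structure, and OWL, a model that combines SAGE with a multi-step spatial chain-of-thought for more precise direction and distance estimation.
📊 Data and experiments: Builds the BiDepth dataset of over one million QA pairs combining binaural audio with panoramic depth images and room impulse responses, and evaluates on two benchmark datasets, improving direction estimation and QA accuracy.
⭐ Contributions: Designs a geometry-aware audio encoder and the spatial chain-of-thought model OWL; releases the large-scale BiDepth dataset; reduces direction error and improves reasoning accuracy on spatial QA tasks.
Abstract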
Spatial reasoning is fundamental to auditory perception, yet current audio large language models (ALLMs) largely rely on unstructured binaural cues and single-step inference. This limits both perceptual accuracy in direction and distance estimation and the capacity for interpretable reasoning. Recent work such as BAT demonstrates spatial QA with binaural audio, but its reliance on coarse categorical labels (left, right, up, down) and the absence of explicit geometric supervision constrain resolution and robustness. We introduce the $\textbf{Spatial-Acoustic Geometry Encoder (SAGE)}$, a geometry-aware audio encoder that aligns binaural acoustic features with 3D spatial structure using panoramic depth images and room-impulse responses at training time, while requiring only audio at inference. Building on this representation, we present $\textbf{OWL}$, an ALLM that integrates $\textbf{SAGE}$ with a spatially grounded chain-of-thought to rationalize over direction-of-arrivals (DoA) and distance estimates. Through curriculum learning from perceptual QA to multi-step reasoning, $\textbf{OWL}$ supports o’clock-level azimuth and DoA estimation. To enable large-scale training and evaluation, we construct and release $\textbf{BiDepth}$, a dataset of over one million QA pairs combining binaural audio with panoramic depth images and room impulse responses across both in-room and out-of-room scenarios. Across two benchmark datasets, our new $\textbf{BiDepth}$ and the public SpatialSoundQA, $\textbf{OWL}$ reduces mean DoA error by $\textbf{11}^{\circ}$ through $\textbf{SAGE}$ and improves spatial reasoning QA accuracy by up to $\textbf{25}$% over BAT. Our dataset and code are available at: https://github.com/BASHLab/OWL
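Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#large language models; language model reasoning; multi-model collaboration; off-trajectory reasoning
TL;DR: We propose a twin-test framework to study LLM off-trajectory reasoning.
🎯 Motivation: Reasoning language models verbalize their thinking, but it is unclear whether, in multi-model collaboration, they can assess and build on other models' partial reasoning traces to improve inference efficiency and exploration.
❓ Problem: Asks whether standard solo-reasoning training pipelines produce the desired off-trajectory behaviors, in particular recovering from distracting information and building on correct reasoning supplied by a collaborator.
🔍 Observations: Stronger models are often more fragile under misleading information, and no tested model effectively uses a collaborator's guiding steps on problems beyond its own ability, with solve rates under 9.2% for math.
🛠️ Method: Proposes twin tests, Recoverability and Guidability, which evaluate off-trajectory reasoning at the two extremes, and runs control studies on the post-training effects of the distillation teacher, the use of RL, and the data selection strategy.
📊 Data and experiments: Evaluates 15 open-weight models (1.5B to 32B) and designs controlled experiments to isolate how different post-training factors affect off-trajectory behavior.
⭐ Contributions: Introduces a framework for evaluating multi-model collaboration under shared reasoning, reveals the limits of existing reasoning models in that setting, and gives practical guidance for training natively strong reasoning collaborators.
Abstract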
Reasoning LLMs are trained to verbalize their thinking process, yielding strong gains on reasoning tasks. This transparency also opens a promising direction: multiple reasoners should directly collaborate on each other's thinking on a shared trajectory, yielding better inference efficiency and exploration. A key prerequisite, however, is their ability to assess the usefulness of, and build on, other models' partial thinking traces -- we call this *off-trajectory reasoning*. Our paper investigates a critical question: can standard *solo-reasoning* training pipelines yield desired *off-trajectory* behaviors? To this end, we propose twin tests that capture the two extremes of the spectrum: **Recoverability**, which tests whether LLMs can backtrack from "distractions" induced by misleading reasoning traces, and **Guidability**, which tests their ability to build upon correct reasoning from stronger collaborators. Our study evaluates 15 open-weight LLMs (1.5B–32B) and reveals a counterintuitive finding -- "stronger" LLMs on benchmarks are often more fragile under distraction. Moreover, all models tested fail to effectively leverage guiding steps from collaborators on problems beyond their inherent capabilities, with solve rates remaining under 9.2% for math. Finally, we conduct control studies to isolate the effects of three factors in post-training on these behaviors: the choice of distillation teacher, the use of RL, and data selection strategy. Our results provide actionable insights for training natively strong reasoning collaborators; e.g., we find that sub-optimal recoverability behaviors of teacher models are transferred to distilled students even if the distilled data trajectories are correct. Taken together, this work introduces a framework for evaluating multi-model collaborations under shared reasoning, while revealing limitations of off-the-shelf reasoning LLMs.
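Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#code-induced reasoning #systematic perturbations #large language models #data-centric evaluation
TL;DR: We investigate which aspects of code benefit LLM reasoning through systematic perturbations and large-scale finetuning, evaluating across natural language, math, and code tasks.
🎯 Motivation: Code data is known to improve the reasoning ability of large language models, but which properties of code are responsible remains unclear.
❓ Problem: Investigates how the structural and semantic properties of code affect LLM reasoning and how much each contributes across task types.
🔍 Observations: LLMs are more sensitive to structural perturbations, especially on math and code tasks; perturbed code remains competitive when its surface-level regularities are preserved.
🛠️ Method: Designs a data-centric framework, builds parallel datasets across 10 programming languages, introduces structural and semantic perturbations, and analyzes performance changes through large-scale fine-tuning and task evaluation.
📊 Data and experiments: Covers LLMs from 5 model families and 8 scales in 3,331 experiments, evaluated on natural language, math, and code tasks.
⭐ Contributions: Systematically analyzes how different properties of code affect LLM reasoning, informing the design of training data for better reasoning.
Abstract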
Code data has been shown to enhance the reasoning capabilities of large language models (LLMs), but it remains unclear which aspects of code are most responsible. We investigate this question with a systematic, data-centric framework. We construct parallel instruction datasets across ten programming languages and introduce controlled perturbations that selectively disrupt structural and semantic properties of code. We then fine-tune LLMs from five model families and eight scales on each variant and evaluate their performance on natural language, math, and code tasks. Across 3,331 experiments, our results show that LLMs are more vulnerable to structural perturbations than semantic ones, particularly on math and code tasks. Appropriate abstractions like pseudocode and flowcharts can be as effective as code, while encoding the same information with fewer tokens without adhering to original syntax can often retain or even improve performance. Notably, even corrupted code with misleading signals remains competitive when surface-level regularities persist. Finally, syntactic styles also shape task-specific gains, with Python favoring natural language reasoning and lower-level languages such as Java and Rust favoring math. Through our systematic framework, we provide a fine-grained analysis of how different aspects of code influence reasoning and inform the design of training data for enhancing LLM reasoning capabilities.
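Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#LLM #Reasoning #Structural Causal Models
🎯 Motivation: Language models can solve complex reasoning tasks by imitating human thinking, yet they still fail on tasks that are simple for humans, which calls for examining the gap between human language and the modeling of thought.
❓ Problem: Analyzes why language models fail to understand implicit expressions, studies the role of language expression biases in reasoning, and proposes a remedy.
🔍 Observations: Structural causal models show that expression biases in language lead models to overlook low-frequency implicit expressions, causing missed information and reasoning failures.
🛠️ Method: Proposes a prompt-level intervention that instructs the model to expand and attend to all available expressions, reducing information loss caused by implicit expressions.
📊 Data and experiments: Constructs realistic datasets containing implicit expressions and validates the method on 11 tasks and 4 representative language models, with additional gains on general reasoning tasks.
⭐ Contributions: Explains a cause of reasoning failures in language models, proposes a prompt-level intervention motivated by the causal analysis, improves multi-task reasoning performance, and releases the code.
Abstract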
Large Language Models (LLMs) demonstrate remarkable capabilities in solving complicated reasoning tasks by imitating the human thinking process from human languages. However, even the most capable LLMs can still fail in tasks that are simple for humans. To understand the gap, we construct structural causal models of next-token predictors in human languages. As language is primarily a tool for humans to share knowledge instead of thinking, modeling human thinking from languages can integrate language expression biases into LLMs. More specifically, we show that LLMs can fail to understand implicit expressions -- expression patterns that occur less frequently during training. Consequently, LLMs can easily overlook critical information when biased by implicit expressions. We verify our theoretical claims with carefully constructed realistic datasets containing implicit expressions. Furthermore, we also propose a prompt-level intervention to instruct LLMs to carefully expand and focus on all the expressions available. The empirical success of the prompt-level intervention across 11 tasks and 4 representative LLMs, along with the improvements over general reasoning tasks, reaffirms our findings. Our code is publicly available at the project website: https://causalcoat.github.io/lot
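Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large Language Model #Natural Language Processing #Self-Correction #Agent #Guided Generation #Post-Hoc Refinement
TL;DR: Once-More is a training-free framework that prevents LLM errors from compounding by using token-level perplexity and verifier feedback to enable inference-time self-correction via logit redistribution.
🎯 Motivation: LLMs accumulate errors during long text generation: early mistakes can cause drift, faulty reasoning, or repetition. Existing solutions are limited by compute cost or generalization, so a new self-correction mechanism is needed.
❓ Problem: Proposes a framework that reduces error propagation through continuous intervention during generation, improving inference-time self-correction and generation quality.
🔍 Observations: Current methods either depend on collecting data for supervised training and generalize poorly across domains, or refine post hoc and must wait for large portions of text before giving feedback, so they cannot intervene in real time and the same mistakes can reappear.
🛠️ Method: Proposes the Once-More framework, which combines token-level perplexity with verifier feedback and uses a logit redistribution mechanism to steer the generation path continuously during generation.
📊 Data and experiments: Evaluated on multiple benchmarks, where Once-More outperforms other existing self-correction methods.
⭐ Contributions: Presents, to the authors' knowledge, the first post-hoc method that combines token perplexity and external verifier feedback for continuous guided self-correction, and reports state-of-the-art results among self-correction methods.
Abstract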
Large Language Models (LLMs) often experience compounding errors during long text generation. Early mistakes can propagate and lead to drift, faulty reasoning, or repetition. While scaling up models improves capabilities, it requires substantial computational resources, and the resulting self-correction behaviour remains unpredictable at inference time. Self-correction is a promising technique for addressing this issue. However, existing approaches have limitations. Supervised training methods can build self-correcting behaviours into models, but require training data collection and lack cross-domain generalizability. Current post-hoc iterative refinement methods operate only at inference time, but must wait for substantial portions of the draft to be generated before providing feedback. This feedback does not guarantee effective guidance, and the same mistake patterns can still reappear. In this paper, we introduce Once-More, a model-agnostic post-hoc self-correction framework that intervenes during generation. Once-More leverages token-level perplexity and feedback from verifiers to provide continuous guided steering of the generation path through a logit redistribution mechanism. This approach essentially helps accumulate "more correct" steps throughout the generation process. Evaluation on multiple benchmarks demonstrates that Once-More achieves state-of-the-art results compared to other self-correction methods. To our knowledge, Once-More is the first post-hoc method to leverage token perplexity and external feedback to perform continuous guided self-correction.
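Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Reasoning #Data #LLM
TL;DR: Data pipeline analysis for training reasoning models
🎯 Motivation: Reasoning models have progressed rapidly in math, code, and science, but many questions about training recipes remain open, largely because state-of-the-art models rely on proprietary datasets.
❓ Problem: Builds open datasets so that research on training reasoning models no longer depends on proprietary data, while improving model performance.
🔍 Observations: Existing models often rely on undisclosed proprietary data, which limits transparency and reproducibility and calls for a systematic open-source methodology and data analysis.
🛠️ Method: Develops the OpenThoughts data generation pipeline, runs 1,000+ controlled experiments on each pipeline step to improve data quality, and scales the pipeline to produce a larger open dataset for training strong reasoning models.
📊 Data and experiments: Builds the OpenThoughts3 dataset with 1.2M examples, using QwQ-32B as the teacher. The resulting OpenThinker3-7B outperforms DeepSeek-R1-Distill-Qwen-7B on AIME 2025, LiveCodeBench, and GPQA Diamond.
⭐ Contributions: Creates open datasets and models, including OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks, and releases all datasets and models.
Abstract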
Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about the best training recipes for reasoning since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. Our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improve our dataset further by systematically investigating each step of our data generation pipeline with 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as teacher yields our OpenThinker3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond, improvements of 15.3, 17.2, and 20.5 percentage points compared to the DeepSeek-R1-Distill-Qwen-7B. All of our datasets and models are available on openthoughts.ai.
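Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#test-time scaling #process reward models
TL;DR: Our work demonstrates that investing in a more intelligent aggregation strategy can be a more convincing path to performance gains than simply scaling test-time computation.
🎯 Motivation: In existing test-time scaling methods, the verification signals of process reward models (PRMs) are not fully exploited, which makes results unstable. How to combine LLM and PRM signals well is the key question.
❓ Problem: Develops a theoretical framework that combines LLM and PRM signals through weighted aggregation to improve the efficiency and performance of test-time scaling.
🔍 Observations: Simple majority voting sometimes beats standard PRM-based selection, which exposes weaknesses in how PRM signals are used. Theory shows the optimal aggregation needs weights that capture the interplay between the models, and these weights are often substantially negative.
🛠️ Method: Builds a framework for estimating optimal weights based on the interaction between the models, and designs efficient pre-computation methods to calibrate the weighting functions.
📊 Data and experiments: Experiments over combinations of 5 LLMs and 7 PRMs show that the calibrated method surpasses vanilla weighted majority voting while using only about 21.3% of the computation.
⭐ Contributions: Provides a theory and algorithmic framework for optimally weighted fusion of LLM and PRM signals, improving test-time scaling efficiency and pointing to aggregation strategy as a research direction.
Abstract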
Process reward models (PRMs) are a cornerstone of test-time scaling (TTS), designed to verify and select the best responses from large language models (LLMs). However, this promise is challenged by recent benchmarks where simple majority voting, which ignores PRM signals, occasionally outperforms standard PRM-based selection. This raises a critical question: How can we effectively utilize verification signals from PRMs for TTS? To address this, we start by developing a theoretical framework for optimally combining signals from both the LLM and the PRM. Our framework reveals that the optimal strategy is a weighted aggregation of responses, a strategy whose effectiveness hinges on estimating weights that capture the complex interplay between the models.
Based on our theoretical results, we empirically show that these optimal weighting functions differ significantly across LLM-PRM pairs and, notably, often assign substantial negative weights.
Motivated by these insights, we propose efficient pre-computation methods to calibrate these weighting functions.
Extensive experiments across 5 LLMs and 7 PRMs demonstrate that our calibration method significantly boosts the TTS efficiency, surpassing the performance of vanilla weighted majority voting while using only $\sim 21.3\%$ of the computation.
Ultimately, our work demonstrates that investing in a more intelligent aggregation strategy can be a more convincing path to performance gains than simply scaling test-time computation.
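A weighted aggregation of sampled responses, as discussed in this abstract, can be sketched in a few lines. The weighting function below (`score - 0.5`) is a made-up placeholder chosen only to show how negative weights can change the outcome; the paper's calibrated weighting functions are not given in the abstract, and `weighted_vote` is a hypothetical name.

```python
from collections import defaultdict

def weighted_vote(answers, prm_scores, weight_fn):
    """Pick the answer with the largest total weight across sampled responses.

    weight_fn maps a PRM score to a (possibly negative) vote weight.
    """
    totals = defaultdict(float)
    for answer, score in zip(answers, prm_scores):
        totals[answer] += weight_fn(score)
    return max(totals, key=totals.get)

answers = ["A", "A", "B"]
prm_scores = [0.2, 0.3, 0.9]

# Plain majority voting ignores the PRM: every response has weight 1.
majority = weighted_vote(answers, prm_scores, lambda s: 1.0)

# Placeholder weighting: low-scored responses vote against their own answer.
calibrated = weighted_vote(answers, prm_scores, lambda s: s - 0.5)
```

In this toy case majority voting returns "A", while the placeholder weighting returns "B" because the two low-scored "A" responses contribute negative weight.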
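Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#reasoning llms #overthinking #underthinking #evaluation #benchmark
TL;DR: We propose OptimalThinkingBench, which jointly evaluates overthinking and underthinking in LLMs. We propose a unified metric to track progress and extensively evaluate existing models and efficiency methods, finding none achieve optimal thinking.
🎯 Motivation: Thinking LLMs spend more compute to solve complex tasks and overthink simple ones, while non-thinking models are cheaper but underthink hard problems, which leaves the choice of model to the user.
❓ Problem: Proposes a unified benchmark, OptimalThinkingBench, that evaluates both overthinking and underthinking and encourages optimally-thinking models that balance performance and efficiency.
🔍 Observations: Thinking models often generate excessive content on simple tasks without improving performance; large non-thinking models often fall short of much smaller thinking models on complex reasoning tasks.
🛠️ Method: Designs two sub-benchmarks: OverthinkingBench, which covers simple general queries and simple math problems, and UnderthinkingBench, which covers challenging reasoning tasks and hard math problems, together with thinking-adjusted accuracy metrics.
📊 Data and experiments: Extensive experiments on 33 models show that no model thinks optimally on the benchmark, and that methods that improve one sub-benchmark often do so at the expense of the other.
⭐ Contributions: Jointly evaluates overthinking and underthinking in thinking and non-thinking models, proposes unified metrics and a benchmark, and points toward more balanced language models.
Abstract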
Thinking LLMs solve complex tasks at the expense of increased compute and overthinking on simpler problems, while non-thinking LLMs are faster and cheaper but underthink on harder reasoning problems. This has led to the development of separate thinking and non-thinking LLM variants, leaving the onus of selecting the optimal model for each query on the end user. In this work, we introduce OptimalThinkingBench, a unified benchmark that jointly evaluates overthinking and underthinking in LLMs and also encourages the development of optimally-thinking models that balance performance and efficiency. Our benchmark comprises two sub-benchmarks: OverthinkingBench, featuring simple general queries in 72 domains along with simple math problems, and UnderthinkingBench, containing 11 challenging reasoning tasks along with tough math problems. Using novel thinking-adjusted accuracy metrics, we perform an extensive evaluation of 33 different thinking and non-thinking models and show that no model is able to optimally think on our benchmark. Thinking models often overthink for hundreds of tokens on the simplest user queries without improving performance. In contrast, large non-thinking models "underthink", often falling short of much smaller thinking models. We further explore several methods to encourage optimal thinking, but find that these approaches often improve on one sub-benchmark at the expense of the other, highlighting the need for better unified and optimal models in the future.
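Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#efficient reasoning; curriculum sampling with decoupled reward
TL;DR: We theoretically unveil the underlying limitations of length rewards and propose D$^2$yOR to achieve supreme efficiency without performance degradation.
🎯 Motivation: Large reasoning models "overthink", producing excessively long reasoning paths with no performance benefit, which hampers practical use; existing length penalties degrade performance because trajectory-level rewards are misaligned with token-level optimization.
❓ Problem: Cut redundant reasoning tokens while preserving reasoning performance, by addressing two fundamental flaws in current length rewards.
🔍 Observations: Current length rewards wrongly penalize essential exploratory tokens and inadvertently reward partial redundancy, upsetting the balance between efficiency and performance.
🛠️ Method: The DECS framework, which combines a decoupled token-level reward that distinguishes and penalizes redundant tokens with a curriculum batch scheduling strategy.
📊 Data & Experiments: Across seven benchmarks, DECS reduces reasoning tokens by over 50% while maintaining or even improving performance.
⭐ Contributions: A theoretical account of the flaws in length rewards, plus the DECS framework, which substantially improves reasoning efficiency without compromising performance.
View full abstract (Abstract)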
While large reasoning models trained with critic-free reinforcement learning and verifiable rewards (RLVR) represent the state-of-the-art, their practical utility is hampered by ``overthinking'', a critical issue where models generate excessively long reasoning paths without any performance benefit. Existing solutions that penalize length often fail, inducing performance degradation due to a fundamental misalignment between trajectory-level rewards and token-level optimization. In this work, we introduce a novel framework, DECS, built on our theoretical discovery of two previously unaddressed flaws in current length rewards: (1) the erroneous penalization of essential exploratory tokens and (2) the inadvertent rewarding of partial redundancy. Our framework's innovations include (i) a first-of-its-kind decoupled token-level reward mechanism that surgically distinguishes and penalizes redundant tokens, and (ii) a novel curriculum batch scheduling strategy to master the efficiency-efficacy equilibrium. Experimental results show DECS can achieve a dramatic reduction in reasoning tokens by over 50\% across seven benchmarks while simultaneously maintaining or even improving performance. It demonstrates conclusively that substantial gains in reasoning efficiency can be achieved without compromising a model's underlying reasoning power. Code is available at \url{https://github.com/pixas/DECS}.
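Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#RLVR #reasoning
🎯 Motivation: RLVR training for complex reasoning relies on costly on-policy rollout generation, whose cost grows rapidly with rollout length and model size and eventually becomes the training bottleneck.
❓ Problem: Reduce the redundant computation and rollout cost of RLVR training to improve its scalability and efficiency.
🔍 Observations: Independent rollouts for the same query often share similar early steps, indicating substantial redundancy.
🛠️ Method: Pros reuses promising prefixes of historical rollouts, appending them to the original queries to form Augmented Queries that serve as regular training inputs; a hierarchical Bayesian model estimates their pass rates and prioritizes those with the highest reward uncertainty.
📊 Data & Experiments: Experiments across diverse settings show that Pros improves training efficiency and reaches higher accuracy than strong baselines.
⭐ Contributions: Pros, a prefix-reuse paradigm that cuts the rollout computation of RLVR training and offers a practical path toward scalable, compute-efficient RLVR.
View full abstract (Abstract)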
Large reasoning models (LRMs) trained with *Reinforcement Learning with Verifiable Rewards* (RLVR) have achieved remarkable progress on complex reasoning tasks. However, RLVR heavily relies on on-policy rollout generation, whose cost grows rapidly with rollout length and model size, eventually becoming the training bottleneck. Our empirical analysis reveals that independent rollouts for the same query often share similar early steps, indicating substantial redundancy. To address this, we propose **Pros** (**P**refix **R**euse for **O**n-policy **S**ampling), a paradigm that reuses promising prefixes of historical rollouts in RLVR training. **Pros** appends these self-generated partial rollouts to the original queries to form *Augmented Queries*, which are then used as regular training inputs in subsequent iterations, thereby reducing redundant computation. To select training batch from augmented queries, **Pros** adopts a hierarchical Bayesian model to estimate their pass rates and prioritize those with the highest reward uncertainty. Experiments across diverse settings show that **Pros** consistently improves training efficiency and achieves higher accuracy than strong baselines. These results highlight **Pros** as a practical path toward scalable and compute-efficient RLVR.
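The batch-selection idea in the abstract above can be illustrated with a deliberately simplified sketch. It assumes a flat Beta-Bernoulli posterior per augmented query rather than the hierarchical Bayesian model the abstract describes, and it uses the Bernoulli variance p(1 - p) at the posterior mean as the measure of reward uncertainty. The names `AugmentedQuery`, `posterior_pass_rate`, `reward_uncertainty`, and `select_batch` are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class AugmentedQuery:
    """Original query plus a reused rollout prefix (hypothetical container)."""
    query: str
    prefix: str
    passes: int = 0   # rollouts from this prefix that earned reward 1
    trials: int = 0   # total rollouts sampled from this prefix

    def text(self) -> str:
        # The prefix is appended to the query and used as a regular input.
        return self.query + self.prefix


def posterior_pass_rate(q: AugmentedQuery, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Posterior mean pass rate under a Beta(alpha, beta) prior (assumption)."""
    return (q.passes + alpha) / (q.trials + alpha + beta)


def reward_uncertainty(q: AugmentedQuery) -> float:
    """Variance of a binary reward at the estimated pass rate: p * (1 - p)."""
    p = posterior_pass_rate(q)
    return p * (1.0 - p)


def select_batch(pool: list[AugmentedQuery], k: int) -> list[AugmentedQuery]:
    """Keep the k augmented queries whose reward is most uncertain."""
    return sorted(pool, key=reward_uncertainty, reverse=True)[:k]


pool = [
    AugmentedQuery("Q1: ", "step 1 ...", passes=4, trials=8),  # about half pass
    AugmentedQuery("Q2: ", "step 1 ...", passes=8, trials=8),  # almost always pass
    AugmentedQuery("Q3: ", "step 1 ...", passes=0, trials=8),  # almost never pass
]
chosen = select_batch(pool, k=1)
```

Under these assumptions, queries whose estimated pass rate sits near 0.5 are preferred, since a batch of always-solved or never-solved inputs carries little learning signal.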
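Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Reasoning Paradigms #Parallel Thinking #RL #LLM
🎯 Motivation: Parallel thinking offers a new way to strengthen LLM reasoning by exploring multiple reasoning paths concurrently, but activating it through training remains challenging.
❓ Problem: Existing methods rely on SFT over synthetic data, which encourages teacher-forced learning rather than exploration and generalization; the authors use an RL framework to train parallel thinking on complex reasoning tasks.
🔍 Observations: The model uses parallel thinking for exploration early in training and for multi-perspective verification later; parallel thinking is also validated as a mid-training exploration scaffold.
🛠️ Method: Parallel-R1 uses a progressive curriculum: SFT on prompt-generated trajectories from easier tasks instills parallel thinking, then RL explores and generalizes the skill on harder problems.
📊 Data & Experiments: On math benchmarks including MATH, AMC23, and AIME, the method improves accuracy by 8.4% over a sequential-thinking model trained directly with RL on difficult tasks, and the mid-training scaffold yields a 42.9% improvement over the sequential RL baseline.
⭐ Contributions: Parallel-R1, presented as the first RL framework that instills parallel thinking for complex real-world reasoning tasks, and the mid-training exploration scaffold, which unlocks a higher performance ceiling after RL.
View full abstract (Abstract)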
Parallel thinking has emerged as a novel approach for enhancing the reasoning capabilities of large language models (LLMs) by exploring multiple reasoning paths concurrently. However, activating such capabilities through training remains challenging. Existing methods mainly rely on supervised fine-tuning (SFT) over synthetic data, which encourages teacher-forced learning rather than exploration and generalization. To address this issue, we propose **Parallel-R1**, the first reinforcement learning (RL) framework that instills parallel thinking for complex real-world reasoning tasks. Our framework employs a progressive curriculum that addresses the cold-start problem in training parallel thinking with RL. We first use SFT on prompt-generated trajectories from easier tasks to instill the parallel thinking behavior, then transition to RL to explore and generalize this skill on harder problems. Experiments on various math benchmarks, including MATH, AMC23, and AIME, show that Parallel-R1 successfully elicits parallel thinking, leading to 8.4% accuracy improvements over the sequential thinking model trained directly on difficult tasks with RL. Further analysis reveals a distinct shift in the model's thinking patterns: in the early stage, it utilizes parallel thinking as an exploration strategy, while in the later stage, it employs this ability for multi-perspective verification.
Most significantly, we validate parallel thinking as a **mid-training exploration scaffold**, where this intermediate phase unlocks a higher performance ceiling after RL, yielding a **42.9%** improvement over the sequential RL baseline.
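Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large Language Model #Test-Time Compute #Reasoning #Effectiveness #Efficiency
🎯 Motivation: LLMs do well on complex reasoning tasks, but their inference is computationally inefficient; overthinking wastes compute on simple queries.
❓ Problem: Improve inference efficiency while avoiding the underthinking that fixed token budgets cause on harder problems.
🔍 Observations: Empirical analysis attributes the inefficiency to unclear problem-solving strategies; the theoretical model BAM formalizes reasoning as a sequence of sub-questions with varying uncertainty.
🛠️ Method: Plan-and-Budget, a test-time framework that decomposes complex queries into sub-questions and adaptively allocates token budgets based on estimated complexity.
📊 Data & Experiments: Across a range of tasks and models, it achieves up to 70% accuracy gains, 39% token reduction, and a 193.8% improvement in E3.
⭐ Contributions: A model-agnostic framework that needs no retraining and brings a smaller model (DS-Qwen-32B) to the efficiency of a larger one (DS-LLaMA-70B); code is publicly available.
View full abstract (Abstract)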
Large Language Models (LLMs) have achieved remarkable success in complex reasoning tasks, but their inference remains computationally inefficient. We observe a common failure mode in many prevalent LLMs, overthinking, where models generate verbose and tangential reasoning traces even for simple queries. Recent work has tried to mitigate this by enforcing fixed token budgets; however, this can lead to underthinking, especially on harder problems. Through empirical analysis, we identify that this inefficiency often stems from unclear problem-solving strategies. To formalize this, we develop a theoretical model, BAM (Budget Allocation Model), which models reasoning as a sequence of sub-questions with varying uncertainty, and introduce the E3 metric to capture the trade-off between correctness and computation efficiency. Building on theoretical results from BAM, we propose Plan-and-Budget, a model-agnostic, test-time framework that decomposes complex queries into sub-questions and allocates token budgets based on estimated complexity using adaptive scheduling. Plan-and-Budget improves reasoning efficiency across a range of tasks and models, achieving up to 70% accuracy gains, 39% token reduction, and 193.8% improvement in E3. Notably, it improves the efficiency of a smaller model (DS-Qwen-32B) to match the efficiency of a larger model (DS-LLaMA-70B), demonstrating Plan-and-Budget’s ability to close performance gaps without retraining. Our code is available at https://github.com/junhongmit/P-and-B.
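As a rough illustration of allocating a token budget across sub-questions by estimated complexity, the sketch below splits a total budget proportionally and hands out rounding leftovers by largest remainder. The proportional rule and the function name `allocate_budget` are assumptions made for this example; the abstract does not specify the paper's actual scheduling strategy, and the complexity scores are taken as given.

```python
def allocate_budget(complexities: list[float], total_tokens: int) -> list[int]:
    """Split total_tokens across sub-questions in proportion to complexity.

    Illustrative only: a plain proportional split, not the paper's scheduler.
    """
    if not complexities:
        return []
    if any(c < 0 for c in complexities):
        raise ValueError("complexity scores must be non-negative")

    total = sum(complexities)
    # With no signal at all, fall back to an even split.
    weights = complexities if total > 0 else [1.0] * len(complexities)
    total = sum(weights)

    raw = [total_tokens * w / total for w in weights]
    budgets = [int(r) for r in raw]  # round down first

    # Give the tokens lost to rounding to the largest fractional remainders.
    leftover = total_tokens - sum(budgets)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - budgets[i], reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


# Three sub-questions; the middle one is estimated to be twice as hard.
plan = allocate_budget([1.0, 2.0, 1.0], total_tokens=1000)
```

The largest-remainder step keeps the per-step budgets summing exactly to the total, so no tokens are silently dropped by integer rounding.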
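Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Knowledge Graphs #Large Language Models #Question Answering
🎯 Motivation: KG-augmented LLM reasoning methods show notable limitations on multi-hop reasoning and complex logical queries, particularly search space truncation and entity error amplification.
❓ Problem: Stop correct candidates from being pruned prematurely and reduce over-reliance on incorrect retrieved entities.
🔍 Observations: Current methods generate linear entity-relation reasoning paths, which truncates the search space; the retrieve-and-answer paradigm amplifies the impact of incorrect entities.
🛠️ Method: PARoG uses SPARQL queries from KG data as references and decomposes them into structured step-by-step plans, trains LLMs to construct such plans, and then reasons in a plan-answer-refine loop that mitigates knowledge conflicts between the LLM and the KG.
📊 Data & Experiments: On multiple KG reasoning benchmarks, PARoG outperforms existing approaches, especially on multi-hop and logically complex queries.
⭐ Contributions: A framework combining structured planning with refinement against KG evidence, improving the logical consistency and accuracy of KG-augmented LLM reasoning.
View full abstract (Abstract)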
Incorporating knowledge graphs (KGs) into large language model (LLM) reasoning has shown promise in alleviating hallucinations and factual errors. Although existing paradigms of KG-augmented LLMs have achieved encouraging results, they still exhibit notable limitations when handling multi-hop reasoning and complex logical queries: (1) search space truncation bias: current methods generate linear entity-relation reasoning paths, which can prune correct candidates prematurely during iterative exploration; and (2) entity error amplification: existing methods typically follow the retrieve-and-answer paradigm which causes LLMs to over-rely on retrieved evidence, exacerbating the impact of incorrect entities during reasoning. To alleviate the existing challenges, we propose Plan-Answer-Refine-on-Graph (PARoG), a novel framework for LLM reasoning on knowledge graphs. First, PARoG leverages SPARQL queries from KG data as references, decomposing them into structured step-by-step plans. We further train LLMs to construct such structured plans, which improves the logical consistency of reasoning, ensures uniform step granularity, and facilitates effective execution on the graph. Second, during reasoning over KGs, PARoG adopts a plan-answer-refine paradigm: the model first attempts to answer each sub-query independently, and then refines its prediction by integrating evidence retrieved from the KG. This process mitigates knowledge conflicts between LLM and KG, substantially reducing hallucinations. Experimental results on multiple KG reasoning benchmarks demonstrate that PARoG significantly outperforms state-of-the-art approaches, achieving especially superior accuracy on multi-hop and logically complex queries.
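Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#ai for math #proof simplification
TL;DR: We train a model to simplify AI-generated Lean proofs.
🎯 Motivation: Formal proofs produced by neural theorem provers are so long that they are hard for humans to understand and of limited use for mathematical insight, making proof simplification a critical bottleneck.
❓ Problem: Train a language model to simplify long Lean proofs without additional human supervision, despite scarce training data.
🔍 Observations: Existing methods, mainly agentic scaffolding with off-the-shelf LLMs, struggle with the extremely long proofs generated by RL-trained provers.
🛠️ Method: ProofOptimizer is trained via expert iteration and reinforcement learning, using Lean to verify simplifications and provide training signal; at inference time it runs in an iterative proof-shortening workflow.
📊 Data & Experiments: Proof length is reduced by 87% on miniF2F, 57% on PutnamBench, and 50% on Seed-Prover's IMO 2025 proofs.
⭐ Contributions: A proof-simplification model that needs no additional human supervision; the simplified proofs check faster in Lean and improve downstream prover performance when reused as training data for supervised finetuning.
View full abstract (Abstract)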
Neural theorem proving has advanced rapidly in the past year, reaching IMO gold-medalist capabilities and producing formal proofs that span thousands of lines. Although such proofs are mechanically verified by formal systems like Lean, their excessive length renders them difficult for humans to comprehend and limits their usefulness for mathematical insight. Proof simplification is therefore a critical bottleneck. Yet, training data for this task is scarce, and existing methods—mainly agentic scaffolding with off-the-shelf LLMs—struggle with the extremely long proofs generated by RL-trained provers. We introduce ProofOptimizer, the first language model trained to simplify Lean proofs without requiring additional human supervision. ProofOptimizer is trained via expert iteration and reinforcement learning, using Lean to verify simplifications and provide training signal. At inference time, it operates within an iterative proof-shortening workflow, progressively reducing proof length. Experiments show that ProofOptimizer substantially compresses proofs generated by state-of-the-art RL-trained provers on standard benchmarks, reducing proof length by 87% on miniF2F, 57% on PutnamBench, and 50% on Seed-Prover's IMO 2025 proofs. Beyond conciseness, the simplified proofs check faster in Lean and further improve downstream prover performance when reused as training data for supervised finetuning.
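Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#in-context learning #self-reflection #policy optimization #FTRL #bandits #large language models #reasoning
TL;DR: We provide a provable and practical in-context policy optimization for test-time scaling.
🎯 Motivation: Study test-time scaling, where a model improves its answer through multi-round self-reflection at inference.
❓ Problem: Optimize a model's responses in context, without modifying its parameters, to improve mathematical reasoning.
🔍 Observations: Theory shows that, with sufficient pretraining under a Fisher-weighted logit-matching objective, a single-layer linear self-attention model can provably imitate a policy-optimization algorithm for linear bandits.
🛠️ Method: Minimum-Entropy ICPO (ME-ICPO) iteratively refines its response in context using the response and a self-assessed reward, selecting the responses and rewards with minimum entropy so that self-assessed rewards are made robust via majority voting.
📊 Data & Experiments: On standard mathematical reasoning tasks, ME-ICPO attains competitive performance while keeping inference costs affordable.
⭐ Contributions: A principled understanding of self-reflection in LLMs and a practical method for test-time scaling in mathematical reasoning.
View full abstract (Abstract)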
We study test-time scaling, where a model improves its answer through multi-round self-reflection at inference. We introduce In-Context Policy Optimization (ICPO), in which an agent optimizes its response in context using self-assessed or externally observed rewards without modifying its parameters.
To explain this ICPO process, we theoretically show that with sufficient pretraining under a novel Fisher-weighted logit-matching objective, a single-layer linear self-attention model can provably imitate a policy-optimization algorithm for linear bandits. Building on this theory, we propose Minimum-Entropy ICPO (ME-ICPO), a practical algorithm that iteratively uses its response and self-assessed reward to refine its response in context at inference time.
By selecting the responses and their rewards with minimum entropy, ME-ICPO ensures the robustness of the self-assessed rewards via majority voting.
Across standard mathematical reasoning tasks, ME-ICPO attains competitive, top-tier performance while keeping inference costs affordable compared with other inference-time algorithms. Overall, ICPO provides a principled understanding of self-reflection in LLMs and yields practical benefits for test-time scaling for mathematical reasoning.
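Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Multilingual #Math #Reasoning
🎯 Motivation: Work on long chain-of-thought reasoning and distillation focuses mostly on English, and little is known about language-specific reasoning.
❓ Problem: Improve reasoning in a non-English target language by mixing English and the target language in the reasoning trace while minimizing translation artifacts.
🔍 Observations: Ablations show Language-Mixed CoT is more effective than monolingual CoT, indicating that reasoning patterns can be engineered to improve non-English performance.
🛠️ Method: Language-Mixed CoT, a reasoning schema that switches between English and the target language, using English as an anchor for reasoning.
📊 Data & Experiments: Yi-Sang-HQ, a Korean dataset with 5.79M native-Korean prompts (web Q&A, exams, STEM, code), 3.7M long reasoning traces generated from Qwen3-32B, and a targeted 260k high-yield subset; nine models (4B–35B) across six families are trained and evaluated.
⭐ Contributions: The Language-Mixed CoT schema and a Korean case study; the data-curation pipeline, evaluation system, datasets, and models are released to advance research on language-specific reasoning.
View full abstract (Abstract)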
Recent frontier models employ long-chain-of-thought reasoning to explore solution spaces in context and achieve stronger performance. While many works study distillation to build smaller yet capable models, most focus on English and little is known about language-specific reasoning. To bridge this gap, we first introduce **Language-Mixed CoT**, a reasoning schema that switches between English and a target language, using English as an anchor to excel in reasoning while minimizing translation artifacts. As a Korean case study, we curate **Yi-Sang-HQ**: 5.79M native-Korean prompts from web Q&A, exams, STEM, and code; 3.7M long reasoning traces generated from Qwen3-32B; and a targeted 260k high-yield subset. We train nine models (4B–35B) across six families (Qwen2.5, Llama-3.1, Gemma-3, etc). Our best model, **KO-REAson-35B**, achieves state-of-the-art performance, with the highest overall average score ($64.0_{\pm2.5}$), ranking first on 5/9 benchmarks and second on the remainder. Smaller and mid-sized models also benefit substantially, with an average improvement of $+18.6$ points across the evaluated nine benchmarks. Ablations show **Language-Mixed CoT** is more effective than monolingual CoT, indicating that reasoning patterns can be engineered to improve non-English performance. We release our data-curation pipeline, evaluation system, datasets, and models to advance research on language-specific reasoning.
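Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large Language Model; Reinforcement Learning
TL;DR: By introducing partial solutions to make hard problems easier during reinforcement learning, this method significantly boosts the mathematical reasoning capabilities of language models.
🎯 Motivation: Recent studies question whether RL can incentivize reasoning capacity beyond the base model, so RL needs to be adapted to solve harder reasoning problems.
❓ Problem: Reduce problem difficulty by introducing partial solutions so that RL can train on hard reasoning tasks more effectively.
🔍 Observations: Standard RL struggles to make progress on hard math reasoning problems.
🛠️ Method: QuestA applies question augmentation during RL training, adding partial solutions to provide more informative learning signals.
📊 Data & Experiments: Evaluated on AIME24, AIME25, and HMMT25 with 1.5B-parameter models, improving both pass@1 and pass@k and surpassing the strong open-source models it builds on.
⭐ Contributions: A simple, effective way to strengthen reasoning that sets new state-of-the-art results on several math benchmarks; code, data, and model are released.
View full abstract (Abstract)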
Reinforcement learning (RL) has emerged as a central paradigm for training large language models (LLMs) in reasoning tasks. Yet recent studies question RL’s ability to incentivize reasoning capacity beyond the base model. This raises a key challenge: how can RL be adapted to solve harder reasoning problems more effectively?
To address this challenge, we propose a simple yet effective strategy via Question Augmentation: introduce partial solutions during training to reduce problem difficulty and provide more informative learning signals.
Our method, QuestA, when applied during RL training on math reasoning tasks, not only improves pass@1 but also pass@k—particularly on problems where standard RL struggles to make progress.
This enables continual improvement over strong open-source models such as DeepScaleR and OpenMath Nemotron, further enhancing their reasoning capabilities. We achieve new state-of-the-art results on math benchmarks using 1.5B-parameter models: 72.50\% (+10.73\%) on AIME24, 62.29\% (+12.79\%) on AIME25, and 41.67\% (+10.11\%) on HMMT25. Code, data and model are available at https://anonymous.4open.science/r/questa932.
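Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large Reasoning Models #Long Horizon Reasoning
TL;DR: A scalable, controllable, and low-cost paradigm for enhancing and evaluating the long-horizon reasoning capabilities of LRMs.
🎯 Motivation: Reasoning models have improved markedly through long CoT, but existing benchmarks focus on immediate, single-horizon tasks and fail to measure ability in complex, long-horizon scenarios.
❓ Problem: Design the R-HORIZON method and benchmark to systematically evaluate and improve the long-horizon reasoning of LRMs.
🔍 Observations: Even the most advanced LRMs degrade significantly on long-horizon tasks; they have limited effective reasoning length and struggle to allocate thinking budget across multiple problems.
🛠️ Method: R-HORIZON stimulates long-horizon reasoning through query composition, and the composed data is used for reinforcement learning with verified rewards (RLVR).
📊 Data & Experiments: A long-horizon benchmark of complex multi-step tasks with interdependent problems; RLVR on R-HORIZON data improves multi-horizon performance and also standard reasoning accuracy (+7.5 on AIME2024).
⭐ Contributions: R-HORIZON, a scalable, controllable, and low-cost paradigm for evaluating and enhancing the long-horizon reasoning of LRMs.
View full abstract (Abstract)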
Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek-R1) have led to remarkable improvements through long Chain-of-Thought (CoT). However, existing benchmarks mainly focus on immediate, single-horizon tasks, failing to adequately evaluate models’ ability to understand and respond to complex, long-horizon scenarios. To address this incomplete evaluation of Large Reasoning Models (LRMs), we propose R-HORIZON, a method designed to stimulate long-horizon reasoning behaviors in LRMs through query composition. Based on R-HORIZON, we construct a long-horizon reasoning benchmark, comprising complex multi-step reasoning tasks with interdependent problems that span long reasoning horizons. Through comprehensive evaluation of LRMs using the R-HORIZON benchmark, we find that even the most advanced LRMs suffer significant performance degradation. Our analysis reveals that LRMs exhibit limited effective reasoning length and struggle to allocate thinking budget across multiple problems appropriately. Recognizing these limitations, we use R-HORIZON to construct long-horizon reasoning data for reinforcement learning with verified rewards (RLVR). Compared to training with single-horizon data, RLVR with R-HORIZON not only substantially improves performance on the multi-horizon reasoning tasks, but also promotes accuracy on standard reasoning tasks (+7.5 on AIME2024). These results position R-HORIZON as a scalable, controllable, and low-cost paradigm for enhancing and evaluating the long-horizon reasoning capabilities of LRMs.
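Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#large language model #reinforcement learning #self-evolving #reasoning
TL;DR: We propose R-Zero, a data-free method to improve LLM reasoning ability.
🎯 Motivation: Existing methods for training self-evolving LLMs rely heavily on human-curated tasks and labels, a fundamental bottleneck to advancing AI beyond human intelligence.
❓ Problem: Provide a framework that needs no external data, removing the dependence on human-curated training data.
🔍 Observations: Capability gains are limited by human-constructed tasks and labels, because models cannot generate a targeted curriculum for themselves.
🛠️ Method: R-Zero initializes a Challenger and a Solver from a single base LLM; the two are optimized separately and co-evolve, with the Challenger rewarded for proposing tasks near the edge of the Solver's capability and the Solver rewarded for solving them.
📊 Data & Experiments: With no pre-existing tasks or labels, R-Zero improves several backbone LLMs, e.g. Qwen3-4B-Base by +6.49 on math-reasoning benchmarks and +7.54 on general-domain reasoning benchmarks.
⭐ Contributions: A fully autonomous self-evolving method that generates its own training data from scratch and substantially improves LLM reasoning.
View full abstract (Abstract)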
Self-evolving Large Language Models (LLMs) offer a scalable path toward super-intelligence by autonomously generating, refining, and learning from their own experiences. However, existing methods for training such models still rely heavily on vast human-curated tasks and labels, typically via fine-tuning or reinforcement learning, which poses a fundamental bottleneck to advancing AI systems toward capabilities beyond human intelligence. To overcome this limitation, we introduce R-Zero, a fully autonomous framework that generates its own training data from scratch. Starting from a single base LLM, R-Zero initializes two independent models with distinct roles, a Challenger and a Solver. These models are optimized separately and co-evolve through interaction: the Challenger is rewarded for proposing tasks near the edge of the Solver capability, and the Solver is rewarded for solving increasingly challenging tasks posed by the Challenger. This process yields a targeted, self-improving curriculum without any pre-existing tasks and labels. Empirically, R-Zero substantially improves reasoning capability across different backbone LLMs, e.g., boosting the Qwen3-4B-Base by +6.49 on math-reasoning benchmarks and +7.54 on general-domain reasoning benchmarks.
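Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#routing #adaptive reasoning #item response theory #reasoning models #large language models
🎯 Motivation: Deploying reasoning language models involves a performance-cost trade-off across model size and reasoning budget: larger models and higher budgets perform better but cost more and add latency.
❓ Problem: Provide a lightweight, interpretable, and scalable routing framework that matches model configurations to query difficulty.
🔍 Observations: Harder queries need model-budget pairs with higher ability, while easier queries can be served by cheaper configurations.
🛠️ Method: RADAR, inspired by psychometrics, learns an item response model from the responses of models at different budgets, with interpretable parameters for query difficulty and model-budget ability, and routes queries accordingly.
📊 Data & Experiments: On 8 widely used challenging reasoning benchmarks, RADAR outperforms state-of-the-art routing methods and generalizes to out-of-distribution queries.
⭐ Contributions: An item-response-theory-based router for the performance-cost trade-off that is effective, generalizes across queries, and can integrate additional models efficiently by evaluating them on a small, dynamically selected set of queries.
View full abstract (Abstract)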
Reasoning language models have demonstrated remarkable performance on many challenging tasks in math, science, and coding. Choosing the right reasoning model for practical deployment involves a performance-cost trade-off at two key levels: model size and reasoning budget, where larger models and higher reasoning budgets lead to better performance but incur greater cost and latency. In this work, we tackle this tradeoff from the angle of model configuration routing for different queries, and present RADAR (Reasoning–Ability and Difficulty-Aware Routing), a lightweight, interpretable, and scalable routing framework. Inspired by psychometrics, RADAR learns an item response model from model responses with different budgets to different queries, with interpretable parameters including query difficulties and model-budget abilities. RADAR then routes queries with higher difficulty to model-budget pairs with higher ability, and vice versa. We conduct extensive experiments on 8 widely used challenging reasoning benchmarks, demonstrating the superior performance of RADAR compared to state-of-the-art model routing methods. RADAR also exhibits query generalization capabilities, achieving strong performance on out-of-distribution queries on all benchmarks. RADAR is also scalable and can efficiently integrate additional models by dynamically selecting a small set of evaluation queries to estimate their abilities.
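To make the item-response idea concrete, here is a minimal sketch assuming the simplest one-parameter (Rasch) form, in which the probability of a correct answer is a logistic function of ability minus difficulty. The abstract does not state which item response model RADAR fits, so the 1PL form, the cheapest-above-target routing rule, and the names `Config`, `p_correct`, and `route` are all assumptions made for illustration.

```python
import math
from typing import NamedTuple


class Config(NamedTuple):
    """A model-budget pair with a fitted ability and a cost (hypothetical)."""
    name: str
    ability: float
    cost: float


def p_correct(ability: float, difficulty: float) -> float:
    """Rasch (1PL) item response function: sigmoid(ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def route(difficulty: float, configs: list[Config], target: float = 0.7) -> Config:
    """Pick the cheapest config predicted to reach the target accuracy.

    If no config reaches the target, fall back to the most able one.
    """
    good_enough = [c for c in configs if p_correct(c.ability, difficulty) >= target]
    if good_enough:
        return min(good_enough, key=lambda c: c.cost)
    return max(configs, key=lambda c: c.ability)


configs = [
    Config("small-low-budget", ability=0.0, cost=1.0),
    Config("small-high-budget", ability=1.5, cost=3.0),
    Config("large-high-budget", ability=3.0, cost=10.0),
]
easy = route(-2.0, configs)    # easy query
medium = route(0.5, configs)   # moderately hard query
hard = route(5.0, configs)     # harder than any config can reliably solve
```

With abilities and difficulties on one shared scale, the routing decision reduces to comparing two interpretable numbers, which is the property the abstract emphasizes.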
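Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#RAG #Large Language Models #Question Answering #Knowledge Graphs #Graph LLM
🎯 Motivation: LLMs perform well on knowledge-intensive tasks but often struggle with multi-step reasoning because retrieved context is unstructured; interpretability studies highlight the importance of structured intermediate reasoning.
❓ Problem: RAG methods lack explicit organization among retrieved passages, leading to brittle reasoning pathways; dynamically built structured knowledge is meant to improve accuracy and robustness.
🔍 Observations: Feeding unstructured retrieved passages directly to the model limits its reasoning ability, and conventional methods are error-prone on complex questions.
🛠️ Method: RAS (Retrieval-And-Structuring) builds question-specific knowledge graphs through iterative retrieval, interleaving targeted retrieval planning with incremental graph construction so the model can reason over knowledge structures tailored to each query.
📊 Data & Experiments: On seven knowledge-intensive benchmarks, RAS achieves gains of up to 8.7% with proprietary LLMs and 7.0% with open-source LLMs.
⭐ Contributions: Dynamic, question-specific knowledge structuring that improves reasoning accuracy and robustness in language model generation.
View full abstract (Abstract)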
Large language models (LLMs) have achieved impressive performance on knowledge-intensive tasks, yet they often struggle with multi-step reasoning due to the unstructured nature of retrieved context. While retrieval-augmented generation (RAG) methods provide external information, the lack of explicit organization among retrieved passages limits their effectiveness, leading to brittle reasoning pathways. Recent interpretability studies highlighting the importance of structured intermediate reasoning further align with this perspective.
We propose Retrieval-And-Structuring (RAS), a framework that dynamically constructs question-specific knowledge graphs through iterative retrieval and structured knowledge building. RAS interleaves targeted retrieval planning with incremental graph construction, enabling models to assemble and reason over evolving knowledge structures tailored to each query. On seven knowledge-intensive benchmarks, RAS consistently outperforms strong baselines, achieving up to 8.7\% and 7.0\% gains with proprietary and open-source LLMs, respectively. Our results demonstrate that dynamic, question-specific knowledge structuring offers a robust path to improving reasoning accuracy and robustness in language model generation.
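Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#LLM #self-training #RL #unsupervised learning #self-penalization
🎯 Motivation: RL with human-annotated data has boosted reasoning, but the gains are costly in labeled data and falter on harder tasks; experience-driven learning from unlabeled data is a natural next step.
❓ Problem: Turn the absence of gold labels into a useful learning signal without overcommitting to spurious majority votes.
🔍 Observations: Majority-vote approaches can commit to wrong majority answers; RESTRAIN instead uses signals from the model's entire answer distribution to identify overconfident rollouts and low-consistency examples.
🛠️ Method: RESTRAIN, a self-penalizing RL framework that penalizes overconfident rollouts and low-consistency examples while preserving promising reasoning chains, and integrates into policy optimization methods such as GRPO.
📊 Data & Experiments: Using only unlabeled data with Qwen3-4B-Base and OctoThinker Hybrid-8B-Base, it boosts Pass@1 by up to +140.7% on AIME25, +36.2% on MMLU STEM, and +19.6% on GPQA-Diamond.
⭐ Contributions: The RESTRAIN framework, which enables continual self-improvement without human labels and offers a scalable path toward stronger reasoning without gold labels.
View full abstract (Abstract)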
Reinforcement learning with human-annotated data has boosted chain-of-thought reasoning in large reasoning models, but these gains come at high costs in labeled data while faltering on harder tasks. A natural next step is experience-driven learning, where models improve without curated labels by adapting to unlabeled data. We introduce REinforcement learning with Self-resTRAINt training (RESTRAIN), a self-penalizing RL framework that converts the absence of gold labels into a useful learning signal. Instead of overcommitting to spurious majority votes, RESTRAIN exploits signals from the model’s entire answer distribution: penalizing overconfident rollouts and low-consistency examples while preserving promising reasoning chains. This self-penalization mechanism integrates seamlessly into policy optimization methods such as GRPO, enabling continual self-improvement without supervision. On challenging reasoning benchmarks, RESTRAIN delivers large gains using only unlabeled data. With Qwen3-4B-Base and OctoThinker Hybrid-8B-Base, it boosts Pass@1 by up to +140.7% on AIME25, +36.2% on MMLU STEM, and +19.6% on GPQA-Diamond, nearly matching gold-label training while using no gold labels. These results demonstrate that RESTRAIN establishes a scalable path toward stronger reasoning without gold labels.
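Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#reasoning #reinforcement learning #curriculum learning #sequence modeling
TL;DR: Using a dynamic per-sample curriculum for RL training of reasoning models on math datasets.
🎯 Motivation: Learning in the combinatorially large output space of sequence generation is hard: expert demonstrations scale poorly with sequence length, and RL struggles with sparse rewards. The partial-supervision regime between the two is underexplored.
❓ Problem: Ask whether some classes of sequence learning problems become efficiently learnable with partial supervision, including tasks with long sequences of latent dependencies.
🔍 Observations: SFT and RL can both fail to generalize on tasks with long latent dependencies, whereas dynamically adjusting the length of the revealed target prefix eases the sparse-reward and long-dependency difficulties.
🛠️ Method: AdaBack, a per-sample curriculum learning algorithm that reveals a partial prefix of the target output and adjusts its length dynamically based on the model's past reward signal, so the model incrementally learns to complete reasoning chains.
📊 Data & Experiments: Experiments on a synthetic task with latent parity constraints and on three math reasoning benchmarks (DeepScaleR, MATH, GSM8k) show that AdaBack solves problems that RL alone cannot.
⭐ Contributions: A curriculum learning algorithm between SFT and RL that learns efficiently on complex reasoning tasks and lets models acquire new reasoning capabilities.
View full abstract (Abstract)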
Learning in the combinatorially large output space of sequence generation problems is challenging as providing expert demonstrations scales poorly with sequence length, and RL struggles with sparse rewards. Between dense demonstrations in supervised training and no demonstrations in reinforcement learning lies an underexplored regime: partial supervision. We ask whether some classes of sequence learning problems become efficiently learnable by exploiting this gap. We address this by introducing adaptive backtracking (AdaBack), a per-sample curriculum learning algorithm that reveals a partial prefix of the target output. The supervision length is adjusted dynamically for each sample based on the model’s past reward signal, allowing it to incrementally learn to complete reasoning chains by conditioning on correct partial solutions. We investigate this intermediate regime between SFT and RL and argue that per-sample curriculum learning is more than a trade-off between efficiency and generality—it can succeed in tasks with long sequences of latent dependencies where SFT and RL both fail to generalize. Using a synthetic task with latent parity constraints, we show that AdaBack reliably solves problems that are otherwise intractable. On three mathematical reasoning benchmarks, DeepScaleR, MATH, and GSM8k, we find that AdaBack enables models to solve problems that RL alone cannot, acquiring new reasoning capabilities through incremental exposure to partial solutions.
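The per-sample curriculum described in the abstract can be pictured with a small sketch. It assumes each sample keeps its own supervision ratio in [0, 1], that the ratio shrinks by a fixed step after a rewarded attempt and grows after a failed one, and that the revealed prefix is the first `ratio` fraction of the target tokens. The update rule, step size, threshold, and the names `reveal_prefix` and `update_ratio` are hypothetical; the abstract only says that the length is adjusted from the model's past reward signal.

```python
def reveal_prefix(target_tokens: list[str], ratio: float) -> list[str]:
    """Return the leading part of the target that is shown to the model."""
    n = round(ratio * len(target_tokens))
    return target_tokens[:n]


def update_ratio(ratio: float, reward: float, step: float = 0.1,
                 threshold: float = 0.5) -> float:
    """Adjust one sample's supervision ratio from its latest reward.

    Success: reveal less next time. Failure: reveal more. (Assumed rule.)
    """
    if reward >= threshold:
        ratio -= step
    else:
        ratio += step
    return min(1.0, max(0.0, ratio))  # keep the ratio inside [0, 1]


# One sample's curriculum over a few attempts with binary rewards.
target = list("abcdefghij")
ratio = 0.5
history = []
for reward in [1.0, 1.0, 0.0, 1.0]:
    history.append(len(reveal_prefix(target, ratio)))
    ratio = update_ratio(ratio, reward)
```

A ratio of 1.0 corresponds to showing the full demonstration and 0.0 to plain RL from the prompt, so each sample moves between the two regimes at its own pace.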
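Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Reasoning abstractions; LLM; RL; Structured exploration; Reasoning
TL;DR: A two-agent training framework for generating and applying reasoning abstractions to solve complex problems.
🎯 Motivation: Reasoning requires going beyond pattern matching or memorized solutions to identify and implement algorithmic procedures for hard problems, which existing models do only to a limited extent.
❓ Problem: Address the depth-first, brute-force nature of learned reasoning traces by introducing reasoning abstractions that improve algorithmic reasoning and generalization.
🔍 Observations: RL post-training on long chains of thought has not yet uncovered this algorithmic behavior and makes little reuse of primitives, intermediate results, or procedures.
🛠️ Method: RLAD jointly trains an abstraction generator and an abstraction-conditioned solution generator in a two-player RL paradigm, decoupling the learning signals of abstraction proposal and solution generation and enabling structured exploration.
📊 Data & Experiments: At large inference-time budgets, spending test-time compute on generating abstractions helps performance more than generating more solutions.
⭐ Contributions: Introduces reasoning abstractions and a two-player RL framework, and shows that abstraction-guided exploration improves generalization to harder problems.
View full abstract (Abstract)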
Reasoning requires going beyond pattern matching or memorization of solutions to identify and implement algorithmic procedures that can be used to deduce answers to hard problems. Doing so requires reusing primitives, intermediate results, or procedures across multiple problems. While RL post-training on long chains of thought ultimately aims to uncover this kind of algorithmic behavior, the depth-first and brute-force nature of reasoning traces learned by these models suggests that this is far from a fulfilled promise. To address more effective reasoning, we introduce reasoning abstractions: concise natural language descriptions of procedural and factual knowledge that guide the model toward learning successful reasoning. We train models to be capable of proposing several useful abstractions given a problem, followed by RL training that incentivizes building a solution while using the information provided by these abstractions. This results in a two-player RL training paradigm, abbreviated as RLAD, that jointly trains an abstraction generator and an abstraction-conditioned solution generator. This setup effectively enables structured exploration, decouples learning signals of abstraction proposal and solution generation, and improves generalization to harder problems. We also show that spending more test-time compute into generating abstractions is more beneficial for performance than generating more solutions at large inference-time budgets, illustrating the role of abstractions in guiding global exploration.
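Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large Language Models (LLMs) #Reinforcement Learning #RLVR #Math Reasoning #Diversity
TL;DR: A minimalist yet highly effective RL method that maintains both quality and diversity for LLM reasoning.
🎯 Motivation: RLVR methods improve LLM reasoning but often suffer from training instability and diversity collapse, requiring complex heuristics and careful tuning; the paper exploits the simpler structure of math reasoning to remove unnecessary complexity.
❓ Problem: Existing RLVR methods are complex, and generalized policy iteration is prone to diversity collapse; the goal is a simpler RL algorithm that maintains both reasoning quality and diversity.
🔍 Observations: Math reasoning can be formalized as a specialized finite-horizon Markov Decision Process with deterministic state transitions, tree-structured dynamics, and binary terminal rewards. This structure is simpler than general-purpose control settings, so many sophisticated techniques in existing methods may be unnecessary.
🛠️ Method: ROVER evaluates the Q-function of a fixed uniformly random policy and samples actions from a softmax over these Q-values, bypassing the generalized policy iteration loop while preserving exploration of multiple valid pathways.
📊 Data & Experiments: Across multiple base models and standard math reasoning benchmarks, ROVER improves quality (+8.2 on pass@1, +16.8 on pass@256) and diversity (+20.5%) over strong, more complicated existing methods.
⭐ Contributions: A proof that the optimal action can be recovered from the Q-function of a uniformly random policy, and ROVER, a minimalist algorithm that reduces the complexity of RLVR while delivering strong and diverse LLM reasoning.
View full abstract (Abstract)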
RL with Verifiable Rewards (RLVR) has emerged as a promising paradigm for improving the reasoning abilities of large language models (LLMs). Current methods rely primarily on policy optimization frameworks like PPO and GRPO, which follow generalized policy iteration that alternates between evaluating the current policy's value and improving the policy based on evaluation. While effective, they often suffer from training instability and diversity collapse, requiring complex heuristic tricks and careful tuning. We observe that standard RLVR in math reasoning can be formalized as a specialized finite-horizon Markov Decision Process with deterministic state transitions, tree-structured dynamics, and binary terminal rewards. Though large in scale, the underlying structure is simpler than general-purpose control settings for which popular RL algorithms (e.g., PPO) were developed, suggesting that several sophisticated techniques in existing methods may be reduced or even omitted. Based on this insight, we prove a surprising result: the optimal action can be recovered from the Q-function of a fixed uniformly random policy, thereby bypassing the generalized policy iteration loop and its associated heuristics. We introduce Random Policy Valuation for Diverse Reasoning (ROVER) to translate this principle into a practical and scalable algorithm for LLM math reasoning, a minimalist yet highly effective RL method that samples actions from a softmax over these uniform-policy Q-values. ROVER preserves diversity throughout training, allowing sustained exploration of multiple valid pathways.
Across multiple base models and standard math reasoning benchmarks, ROVER demonstrates superior performance in both quality (+8.2 on pass@1, +16.8 on pass@256) and diversity (+20.5%), despite its radical simplification compared to strong, complicated existing methods.
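The idea of acting on the Q-values of a uniformly random policy can be illustrated on a toy tree-structured problem with binary terminal rewards. The sketch below is only an illustration under stated assumptions: the tree `TREE`, the helper names, and the exact tabular Q computation are hypothetical and stand in for the learned Q-estimates an LLM-scale method would need; it is not the authors' implementation.

```python
import math
import random

# Hypothetical toy problem: internal nodes map an action (a "token") to a child;
# leaves hold the binary terminal reward (1 = correct final answer).
TREE = {
    "a": {"x": 0, "y": 0},
    "b": {"x": {"p": 0, "q": 1}, "y": 0},
    "c": {"x": 1, "y": 1},
}


def uniform_value(node):
    """Expected terminal reward if every remaining action is chosen uniformly at random."""
    if not isinstance(node, dict):
        return float(node)
    return sum(uniform_value(child) for child in node.values()) / len(node)


def uniform_q(node):
    """Q-values of the uniform random policy for each action available at `node`."""
    return {action: uniform_value(child) for action, child in node.items()}


def softmax_sample(q_values, temperature, rng):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    actions = list(q_values)
    top = max(q_values.values())
    weights = [math.exp((q_values[a] - top) / temperature) for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def rollout(tree, temperature, rng):
    """Walk from the root to a leaf, sampling each action from the softmax over Q-values."""
    node, path = tree, []
    while isinstance(node, dict):
        action = softmax_sample(uniform_q(node), temperature, rng)
        path.append(action)
        node = node[action]
    return path, node
```

In this toy tree, any action with a positive uniform-policy Q-value leads to a subtree that contains at least one correct leaf, so a near-greedy rollout (low temperature) reaches a correct answer, while a higher temperature spreads samples over several successful branches.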
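Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#LLM Reasoning Distillation #Large Reasoning Model #Reasoning Scaffolding #Semantic Signals
TL;DR: We introduce Reasoning Scaffolding, a new reasoning distillation framework that transfers reasoning patterns, not just text, from large to small language models, resulting in stronger small reasoning models.
🎯 Motivation: Existing methods for distilling reasoning from large language models rely mainly on imitating textual behavior. They do not abstract or transfer the underlying structure of reasoning, which leaves small models short on logical robustness.
❓ Problem addressed: Move beyond behavioral cloning by transferring algorithmic structure directly, rather than surface text patterns, to improve small models' reasoning ability and consistency.
🔍 Observations: Existing methods focus on surface imitation and neglect the algorithmic structure and logical flow of reasoning, so small models struggle to reason rigorously.
🛠️ Method: Reasoning Scaffolding reframes reasoning as a structured generation process. It abstracts the teacher model's thought process into discrete semantic signals and trains the student with a multi-task objective to predict the next signal and generate the corresponding step.
📊 Data and experiments: On a suite of challenging reasoning benchmarks, the method significantly outperforms state-of-the-art distillation methods in both accuracy and logical consistency.
⭐ Main contributions: A new reasoning distillation framework that transfers algorithmic structure directly to small models, offering a path toward small models with stronger reasoning ability.
Abstract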
The prevailing approach to distilling reasoning from Large Language Models (LLMs)—behavioral cloning from textual rationales—is fundamentally limited. It teaches Small Language Models (SLMs) to mimic surface-level patterns rather than the underlying algorithmic structure of thought, resulting in a critical lack of logical robustness. We argue that instead of cloning text, distillation should transfer this algorithmic structure directly. We introduce Reasoning Scaffolding, a framework that reframes reasoning as a structured generation process. Our method first abstracts the teacher's thought process into a sequence of discrete, interpretable semantic signals (e.g., Contrast, Addition) that act as a scaffold. The student model is then trained via a multi-task objective to both (1) predict the next semantic signal, anticipating the reasoning flow, and (2) generate the corresponding step, conditioned on that signal. This multi-task scheme acts as a powerful regularizer, compelling the student to internalize the computational patterns of coherent reasoning. On a suite of challenging reasoning benchmarks, our method significantly outperforms state-of-the-art distillation in both accuracy and logical consistency, providing a path towards creating smaller models that are genuine reasoners, not just fluent mimics.
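Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#LLMs #reasoning #MCMC #sampling #inference-time compute
TL;DR: We find a training-free sampling algorithm that achieves reasoning boosts on base models comparable to those obtained by RL techniques.
🎯 Motivation: Frontier reasoning models mostly rely on post-training LLMs with reinforcement learning, yet it remains unclear which capabilities post-training introduces and whether base models hold untapped potential. The paper asks whether sampling at inference time can replace training to elicit base-model reasoning.
❓ Problem addressed: Design a sampling algorithm that needs no additional training or dataset and elicits reasoning performance from base models comparable to or better than RL post-training, while avoiding the diversity collapse common in RL post-training.
🔍 Observations: Base models already have latent reasoning ability. Part of the gain attributed to RL can be obtained by changing how samples are drawn at inference time, without extra training or a verifier.
🛠️ Method: Inspired by MCMC, an iterative sampling algorithm uses the base model's own likelihoods to sample from a sharpened distribution, improving single-shot reasoning while preserving sample diversity.
📊 Data and experiments: Experiments on several base models, with tasks including MATH500, HumanEval, and GPQA, show that the method nearly matches and sometimes exceeds RL post-training on many single-shot tasks.
⭐ Main contributions: A training-free sampling algorithm that improves base-model reasoning; evidence that latent base-model ability can be elicited through inference-time sampling; no training, curated datasets, or verifier required, which suggests broad applicability.
Abstract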
Frontier reasoning models have exhibited incredible capabilities across a wide array of disciplines, driven by posttraining large language models (LLMs) with reinforcement learning (RL). However, despite the widespread success of this paradigm, much of the literature has been devoted to disentangling truly novel behaviors that emerge during RL but are not present in the base models. In our work, we approach this question from a different angle, instead asking whether comparable reasoning capabilities can be elicited from base models at inference time by pure sampling, without any additional training. Inspired by Markov chain Monte Carlo (MCMC) techniques for sampling from sharpened distributions, we propose a simple iterative sampling algorithm leveraging the base models' own likelihoods. Over different base models, we show that our algorithm offers substantial boosts in reasoning that nearly match and even outperform those from RL on a wide variety of single-shot tasks, including MATH500, HumanEval, and GPQA. Moreover, our sampler avoids the collapse in diversity over multiple samples that is characteristic of RL-posttraining. Crucially, our method does not require training, curated datasets, or a verifier, suggesting broad applicability beyond easily verifiable domains.
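Sampling from a sharpened distribution using only a model's own likelihoods can be sketched with a textbook independence Metropolis-Hastings sampler. The example below is a simplification under stated assumptions, not the paper's algorithm: `BASE` is a hypothetical three-outcome stand-in for base-model sequence likelihoods, the target is assumed to be the power distribution proportional to p(x)^alpha, and proposals are drawn from the base distribution itself, which makes the acceptance ratio (p(x') / p(x))^(alpha - 1).

```python
import random

# Hypothetical stand-in for base-model likelihoods over three complete answers.
BASE = {"A": 0.5, "B": 0.3, "C": 0.2}


def sharpened_sample(base, alpha, steps, rng):
    """Draw one sample approximately from p(x)^alpha via independence Metropolis-Hastings."""
    items, probs = list(base), list(base.values())
    current = rng.choices(items, weights=probs, k=1)[0]
    for _ in range(steps):
        # Propose from the base distribution; no training and no verifier involved.
        proposal = rng.choices(items, weights=probs, k=1)[0]
        accept = min(1.0, (base[proposal] / base[current]) ** (alpha - 1.0))
        if rng.random() < accept:
            current = proposal
    return current


def power_target(base, alpha):
    """Exact sharpened distribution, available here only because the toy space is tiny."""
    z = sum(p ** alpha for p in base.values())
    return {k: p ** alpha / z for k, p in base.items()}
```

With `alpha = 4`, the exact target puts about 0.87 of its mass on `"A"`, compared with 0.5 under the base distribution, and the sampler's empirical frequencies approach that value as the number of steps grows.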
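Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large Language Model #Large Language Model Reasoning #Reinforcement Learning with Verifiable Rewards
TL;DR: We propose RePro to rectify LLM thought and thus enhance LLM reasoning performance.
🎯 Motivation: Long chain-of-thought (CoT) gives LLMs a reasoning advantage, but suboptimal behaviors such as overthinking and excessively long reasoning chains hurt performance.
❓ Problem addressed: Analyze LLM reasoning through an optimization lens and propose a method that improves reasoning and reduces suboptimal behaviors.
🔍 Observations: CoT is framed as a gradient descent procedure in which each reasoning step is an update toward solving the problem, which exposes shortcomings in how current models reason.
🛠️ Method: RePro defines a surrogate objective function and uses a dual scoring mechanism to assess the intensity and stability of the reasoning process. The scores are integrated into reinforcement learning with verifiable rewards to optimize the LLM.
📊 Data and experiments: Extensive experiments on benchmarks spanning mathematics, science, and coding, across multiple reinforcement learning algorithms and diverse LLMs.
⭐ Main contributions: A new process-level reward method, RePro, that significantly improves LLM reasoning performance and mitigates suboptimal reasoning behaviors, offering a new angle on CoT-based optimization.
Abstract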
Recent advancements in large language models (LLMs) have been driven by their emergent reasoning capabilities, particularly through long chain-of-thought (CoT) prompting, which enables thorough exploration and deliberation. Despite these advances, long-CoT LLMs often exhibit suboptimal reasoning behaviors, such as overthinking and excessively protracted reasoning chains, which can impair performance. In this paper, we analyze reasoning processes through an optimization lens, framing CoT as a gradient descent procedure where each reasoning step constitutes an update toward problem resolution. Building on this perspective, we introduce RePro (**Re**ctifying **Pro**cess-level Reward), a novel approach to refine LLM reasoning during post-training. RePro defines a surrogate objective function to assess the optimization process underlying CoT, utilizing a dual scoring mechanism to quantify its intensity and stability. These scores are aggregated into a composite process-level reward, seamlessly integrated into reinforcement learning with verifiable rewards (RLVR) pipelines to optimize LLMs. Extensive experiments across multiple reinforcement learning algorithms and diverse LLMs, evaluated on benchmarks spanning mathematics, science, and coding, demonstrate that RePro consistently enhances reasoning performance and mitigates suboptimal reasoning behaviors.
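Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#large language models #reasoning #reinforcement learning
TL;DR: We show theoretically and empirically that reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base LLMs
🎯 Motivation: Recent advances in long chain-of-thought (CoT) reasoning have drawn attention to the potential of reinforcement learning with verifiable rewards (RLVR) for improving LLM reasoning.
❓ Problem addressed: Determine whether RLVR truly enhances reasoning ability or merely improves sampling efficiency, and analyze how far RLVR extends the reasoning boundary on math and coding tasks.
🔍 Observations: Revisited Pass@K experiments show that RLVR incentivizes correct reasoning early in the process and improves LLM reasoning quality, an effect that is most visible under the new CoT-Pass@K metric.
🛠️ Method: A theoretical framework explains RLVR's incentive mechanism, showing how rewards based solely on answer correctness can still push the model toward sound intermediate reasoning steps.
📊 Data and experiments: Extensive evaluation on math and coding tasks with Pass@K and CoT-Pass@K, combined with experiments on reasoning continuity, confirms a substantial improvement in reasoning quality under RLVR.
⭐ Main contributions: Reveals RLVR's incentive mechanism and its potential to enhance LLM reasoning, and proposes a new analysis method and metric that serve as a reference for model training and evaluation.
Abstract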
Recent advancements in long chain-of-thought (CoT) reasoning, particularly through the Group Relative Policy Optimization algorithm used by DeepSeek-R1, have led to significant interest in the potential of Reinforcement Learning with Verifiable Rewards (RLVR) for Large Language Models (LLMs). While RLVR promises to improve reasoning by allowing models to learn from free exploration, there remains debate over whether it truly enhances reasoning abilities or simply boosts sampling efficiency. This paper demonstrates the profound impact that RLVR has on the reasoning capabilities of LLMs. We revisit Pass@K experiments and show that RLVR can extend the reasoning boundary for both mathematical and coding tasks. This is supported by our introduction of a novel evaluation metric, CoT-Pass@K, which captures reasoning success by accounting for both the final answer and intermediate reasoning steps. Furthermore, we present a theoretical framework explaining RLVR's incentive mechanism, demonstrating how it can encourage correct reasoning even when rewards are based solely on answer correctness. Our analysis of RLVR's training dynamics reveals that it incentivizes correct reasoning early in the process, with substantial improvements in reasoning quality confirmed through extensive evaluations. These findings provide strong evidence of RLVR's potential to enhance LLM reasoning, offering valuable insights into its mechanisms and performance improvements.
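The difference between Pass@K and a metric that also checks the reasoning can be made concrete with the standard unbiased pass@k estimator, 1 - C(n - c, k) / C(n, k), where c of n samples count as correct. The sketch below assumes, as an illustration only, that CoT-Pass@K applies the same estimator but counts a sample as correct only when both its final answer and its chain of thought are judged correct; the function names and the example data are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased estimate of the probability that at least one of k samples is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def cot_pass_at_k(samples, k):
    """samples: list of (answer_correct, cot_correct) booleans for one problem."""
    n = len(samples)
    c = sum(1 for answer_ok, cot_ok in samples if answer_ok and cot_ok)
    return pass_at_k(n, c, k)


# Hypothetical problem: 4 of 8 samples reach the right answer, only 1 with sound reasoning.
SAMPLES = [(True, True)] + [(True, False)] * 3 + [(False, False)] * 4
ANSWER_ONLY = pass_at_k(len(SAMPLES), sum(a for a, _ in SAMPLES), 1)
WITH_REASONING = cot_pass_at_k(SAMPLES, 1)
```

Here `ANSWER_ONLY` is 0.5 while `WITH_REASONING` is 0.125, which shows how answers reached through flawed reasoning inflate the plain Pass@K figure.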
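Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#automated proof evaluation; LLM-as-a-judge; LLM-generated math proofs; rubric-guided grading; prompt optimization; expert-annotated proof dataset; evaluator reliability; reward modeling
TL;DR: LLMs lack reliable proof evaluators. We introduce ProofBench and a 0–7 methodology; our ProofGrader (marking schemes + ensembling) hits RMSE 1.093 vs experts and lifts best-of-8 to 4.05/7, closing >90% of the gap to a human oracle.
🎯 Motivation: Current large language models (LLMs) lack a reliable, fine-grained evaluation tool for verifying and generating natural-language math proofs.
❓ Problem addressed: Propose a systematic methodology for scoring LLM-generated math proofs on a 0-7 scale and develop a reliable evaluator that fills the gap left by existing verification methods.
🔍 Observations: Experiments reveal how evaluator performance varies along key design axes: backbone model, input context, grading instructions, and evaluation workflow.
🛠️ Method: ProofGrader combines a backbone model with strong reasoning ability, rich context built from reference solutions and marking schemes, and a simple ensembling method to improve grading accuracy.
📊 Data and experiments: ProofBench, the first expert-annotated dataset of this kind, covers 145 hard problems and 435 LLM-generated solutions. ProofGrader reaches a mean absolute error of 0.926, clearly better than the baselines; in best-of-16 selection it scores 4.14/7, compared with 2.48 for the baseline.
⭐ Main contributions: Develops the ProofBench dataset and the ProofGrader evaluator, substantially narrowing the gap between automated verification of LLM math proofs and human experts, and providing a practical tool for downstream proof generation.
Abstract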
Recent advances in large language models (LLMs) for mathematical reasoning have largely focused on tasks with easily verifiable final answers, while generating and verifying natural language math proofs remains an open challenge. We identify the absence of a reliable, fine-grained evaluator for LLM-generated math proofs as a critical gap.
To address this, we propose a systematic methodology for developing and validating evaluators that assign fine-grained scores on a 0-7 scale to model-generated math proofs.
To enable this study, we introduce ProofBench, the first expert-annotated dataset of fine-grained proof ratings, spanning 145 problems from six major math competitions (USAMO, IMO, Putnam, etc.) and 435 LLM-generated solutions from Gemini-2.5-Pro, o3, and DeepSeek-R1.
Using ProofBench as a testbed, we systematically explore the evaluator design space across key axes: the backbone model, input context, instructions and evaluation workflow.
Our analysis delivers ProofGrader, an evaluator that combines a strong reasoning backbone LM, rich context from reference solutions and marking schemes, and a simple ensembling method; it achieves a low Mean Absolute Error (MAE) of 0.926 against expert scores, significantly outperforming naive baselines.
Finally, we demonstrate its practical utility in a best-of-$n$ selection task: at $n=16$, ProofGrader achieves an average score of 4.14/7, closing 78\% of the gap between a naive binary evaluator (2.48) and the human oracle (4.62), highlighting its potential to advance downstream proof generation.
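The quantities reported above (MAE against expert scores, best-of-n selection, and the share of the gap closed between a baseline and the human oracle) can be reproduced with a few lines of arithmetic. The sketch below rests on assumptions: the abstract does not specify the ensembling rule, so a plain mean over repeated grader runs is used as a stand-in, and all function names and example scores are hypothetical.

```python
def mean_absolute_error(predicted, expert):
    """Average absolute difference between evaluator scores and expert scores (0-7 scale)."""
    return sum(abs(p - e) for p, e in zip(predicted, expert)) / len(expert)


def ensemble_score(run_scores):
    """Assumed ensembling rule: average the scores from repeated grader runs."""
    return sum(run_scores) / len(run_scores)


def best_of_n(scores_per_candidate):
    """Return the index of the candidate proof with the highest ensembled score."""
    ensembled = [ensemble_score(runs) for runs in scores_per_candidate]
    return max(range(len(ensembled)), key=ensembled.__getitem__)


def gap_closed(method, baseline, oracle):
    """Fraction of the baseline-to-oracle gap recovered by the method."""
    return (method - baseline) / (oracle - baseline)


# Figures quoted in the abstract for n = 16.
CLOSED = gap_closed(method=4.14, baseline=2.48, oracle=4.62)
```

Plugging in the abstract's numbers gives (4.14 - 2.48) / (4.62 - 2.48) = 1.66 / 2.14, roughly 0.78, which matches the reported 78% of the gap.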
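Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#large language model #efficient reasoning
🎯 Motivation: LLMs solve complex tasks by generating step-by-step reasoning trajectories, but these trajectories are long and make inference expensive, an inefficiency that is most pronounced on simple tasks.
❓ Problem addressed: Existing methods shorten reasoning trajectories but still depend on costly decoding and do not resolve the efficiency problem at its root. The work asks whether latent representations can replace full reasoning trajectories.
🔍 Observations: Experiments show that LLMs can produce accurate answers from fragmentary reasoning paths alone, without complete token-by-token trajectories.
🛠️ Method: Latent Reasoning Tuning (LRT) uses a lightweight reasoning network to produce latent reasoning vectors in a single forward pass, replacing autoregressively generated reasoning text. The vectors condition the LLM to produce the final answer.
📊 Data and experiments: On mathematical and out-of-domain benchmarks, LRT consistently outperforms existing efficient reasoning methods and surpasses the state-of-the-art Qwen3 hybrid reasoning framework.
⭐ Main contributions: Significantly improves LLM inference efficiency by compressing the reasoning process into latent form; proposes a novel lightweight reasoning framework; converts explicit reasoning into latent representations and releases the code to the community.
Abstract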
Large Language Models (LLMs) have achieved impressive performance on complex tasks by generating human-like, step-by-step rationales, referred to as a reasoning trajectory, before arriving at final answers. However, the length of these reasoning trajectories often far exceeds that of the final answers, which incurs substantial inference costs even for relatively simple tasks. Advanced methods typically attempt to compress reasoning trajectory length through post-training, but they remain decoding-intensive and fail to inherently mitigate the efficiency challenge. In this work, we challenge the necessity of generating full reasoning trajectories and empirically demonstrate that LLMs can generate accurate answers using only fragmental reasoning paths, without relying on complete token-by-token sequences. To this end, we propose a novel Latent Reasoning Tuning (LRT) framework, which empowers LLMs to perform reasoning using implicit, compact, learnable representations instead of explicit textual trajectories. Technically, LRT replaces the costly autoregressive generation of reasoning steps with a single forward pass through a lightweight reasoning network, which generates latent vectors that encapsulate the necessary reasoning logic and condition the LLM to produce the final answer. Experiments on mathematical and out-of-domain benchmarks demonstrate that our LRT consistently outperforms relevant efficient reasoning methods. Moreover, by transforming explicit reasoning into latent reasoning, our approach surpasses the state-of-the-art Qwen3 hybrid reasoning framework. Code is available at https://github.com/MobiusDai/LRT.
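Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Efficient Reasoning #Large Reasoning Models #Retrieval Augmented Language Models
TL;DR: Retrieval-of-Thought (RoT) improves LLM reasoning efficiency by reusing prior reasoning steps as dynamic templates, cutting tokens, cost, and latency while preserving accuracy.
🎯 Motivation: Large reasoning models gain accuracy by generating long reasoning traces, but this increases inference latency and cost, so inference efficiency needs to improve.
❓ Problem addressed: Proposes a reasoning paradigm based on dynamic template reuse: reusing prior reasoning steps cuts redundant computation and resource consumption at inference time.
🔍 Observations: Reusing semantically relevant prior steps during inference reduces redundant generated tokens and eases latency and cost without sacrificing accuracy.
🛠️ Method: Retrieval-of-Thought (RoT) organizes reasoning steps into a thought graph with sequential and semantic edges, then uses retrieval and reward-guided traversal to build a problem-specific template that guides generation.
📊 Data and experiments: Experiments on multiple reasoning benchmarks with different models measure accuracy, token usage, latency, and memory overhead. RoT substantially reduces output tokens, inference latency, and overall cost.
⭐ Main contributions: By building dynamic templates through retrieval, RoT reduces output tokens by up to 40%, latency by 82%, and cost by 59% while maintaining accuracy, establishing a new paradigm for efficient reasoning with large reasoning models.
Abstract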
Large reasoning models improve accuracy by producing long reasoning traces, but this inflates latency and cost, motivating inference-time efficiency. We propose Retrieval-of-Thought (RoT), which reuses prior reasoning as composable "thought" steps to guide new problems. RoT organizes steps into a thought graph with sequential and semantic edges to enable fast retrieval and flexible recombination. At inference, RoT retrieves query-relevant nodes and applies reward-guided traversal to assemble a problem-specific template that guides generation. This dynamic template reuse reduces redundant exploration and, therefore, reduces output tokens while preserving accuracy. We evaluate RoT on reasoning benchmarks with multiple models, measuring accuracy, token usage, latency, and memory overhead. Findings show small prompt growth but substantial efficiency gains, with RoT reducing output tokens by up to 40%, inference latency by 82%, and cost by 59% while maintaining accuracy. RoT establishes a scalable paradigm for efficient LRM reasoning via dynamic template construction through retrieval.
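Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#reasoning #open-ended generation #synthetic data
🎯 Motivation: Deep reasoning methods have advanced in verifiable domains but struggle with open-ended generation, where both reinforcement learning and instruction distillation have limitations.
❓ Problem addressed: Tackle unclear reward signals, prohibitive cost, and the ceiling imposed by teacher-model capability, and provide an effective reasoning mechanism for open-ended generation tasks.
🔍 Observations: Existing reasoning techniques cannot efficiently handle open-ended tasks that require deep reasoning, which leads to lower generation quality and wasted resources.
🛠️ Method: The REverse-Engineered Reasoning (REER) framework works backwards from known good solutions to discover the latent deep reasoning process that could have produced them, using a gradient-free procedure.
📊 Data and experiments: Curates and open-sources DeepWriting-20K, a dataset of 20,000 deep reasoning trajectories, and trains DeepWriter-8B, which surpasses strong open-source baselines and at times outperforms leading proprietary models.
⭐ Main contributions: Departs from the conventional reasoning paradigm with a reverse-engineered approach to reasoning, and provides a large-scale open-source dataset and a strong model that bring deep reasoning to open-ended generation.
Abstract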
While the "deep reasoning" paradigm has spurred significant advances in verifiable domains like mathematics, its application to open-ended, creative generation remains a critical challenge. The two dominant methods for instilling reasoning, reinforcement learning (RL) and instruction distillation, falter in this area: RL struggles with the absence of clear reward signals and high-quality reward models, while distillation is prohibitively expensive and capped by the teacher model's capabilities. To overcome these limitations, we introduce REverse-Engineered Reasoning (REER), a new paradigm that fundamentally shifts the approach. Instead of building a reasoning process "forwards" through trial-and-error or imitation, REER works "backwards" from known good solutions to computationally discover the latent, step-by-step deep reasoning process that could have produced them. Using this scalable, gradient-free approach, we curate and open-source DeepWriting-20K, a large-scale dataset of 20,000 deep reasoning trajectories for open-ended tasks. Our model, DeepWriter-8B, trained on this data, not only surpasses strong open-source baselines but also achieves performance competitive with, and at times superior to, leading proprietary models like GPT-4o and Claude 3.5.
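Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Multimodal Reasoning #Multimodal Reinforcement Learning
TL;DR: ReVisual-R1 with PAD and the GRAMMAR dataset sets new SOTA across major multimodal reasoning benchmarks.
🎯 Motivation: Inspired by DeepSeek-R1's strong reasoning on complex textual tasks, existing work applies reinforcement learning directly to multimodal large language models (MLLMs) but struggles to activate complex reasoning. The paper examines the whole training pipeline to find out how to improve MLLM reasoning effectively.
❓ Problem addressed: Addresses gradient stagnation in multimodal RL, which degrades training stability and performance, and finds that a cold start on carefully selected text data alone can already surpass many multimodal models. It also explores staged training to balance perceptual grounding and cognitive reasoning.
🔍 Observations: Three key phenomena are identified: (1) effective cold-start initialization is critical for enhancing MLLM reasoning; (2) standard GRPO suffers from gradient stagnation in multimodal RL; (3) text-only RL training after the multimodal RL phase further improves multimodal reasoning.
🛠️ Method: ReVisual-R1 is trained in stages: an optimized cold-start initialization on carefully selected text data, then multimodal RL modified to address gradient stagnation, and finally text-only RL to strengthen reasoning.
📊 Data and experiments: Builds the GRAMMAR dataset and evaluates on challenging benchmarks including MathVerse, MathVision, WeMath, LogicVista, DynaMath, AIME2024, and AIME2025. The method reaches a new state of the art among open-source 7B MLLMs.
⭐ Main contributions: Proposes a staged training approach that balances perceptual grounding and cognitive reasoning development, and addresses gradient stagnation in multimodal RL. With an optimized cold start and staged reinforcement learning, ReVisual-R1 sets new SOTA results on several multimodal reasoning benchmarks.
Abstract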
Inspired by the remarkable reasoning capabilities of DeepSeek-R1 in complex textual tasks, many works attempt to incentivize similar capabilities in Multimodal Large Language Models (MLLMs) by directly applying reinforcement learning (RL). However, they still struggle to activate complex reasoning.
In this paper, rather than examining multimodal RL in isolation, we delve into current training pipelines and identify three crucial phenomena: 1) Effective cold start initialization is critical for enhancing MLLM reasoning. Intriguingly, we find that initializing with carefully selected text data alone can lead to performance surpassing many recent multimodal reasoning models, even before multimodal RL. 2) Standard GRPO applied to multimodal RL suffers from gradient stagnation, which degrades training stability and performance. 3) Subsequent text-only RL training, following the multimodal RL phase, further enhances multimodal reasoning.
This staged training approach effectively balances perceptual grounding and cognitive reasoning development.
By incorporating the above insights and addressing multimodal RL issues, we introduce ReVisual-R1, which achieves a new state of the art among open-source 7B MLLMs on challenging benchmarks including MathVerse, MathVision, WeMath, LogicVista, DynaMath, AIME2024, and AIME2025.
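Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Test-Time Scaling #Inference-Time Improvement #LLMs #RL
TL;DR: LLMs can learn from scalar reward signals at inference time to improve beyond their training data, outperforming existing test-time scaling methods across reasoning, creative, and scientific tasks.
🎯 Motivation: Reinforcement learning solves sequential decision-making problems, but its potential connection to LLM inference has not been studied in depth. The authors hypothesize that language models may be able to optimize themselves at inference time.
❓ Problem addressed: Investigate whether LLMs can improve themselves at inference time from scalar reward signals, and propose a prompting framework suited to that goal.
🔍 Observations: Under multi-round prompting, LLMs use reward signals at inference time to improve task performance, showing behavior analogous to reinforcement learning.
🛠️ Method: A multi-round prompting framework (ICRL prompting) in which prior responses and their rewards accumulate in the context, guiding the model to improve itself at inference time.
📊 Data and experiments: Evaluated on Game of 24, creative writing, scientific reasoning, and Olympiad-level math competitions, with significant improvements over existing test-time methods.
⭐ Main contributions: Reveals RL-like behavior in language models at inference time and proposes a new test-time scaling method, offering a new paradigm for improving reasoning and optimizing tasks.
Abstract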
Reinforcement learning (RL) is a framework for solving sequential decision-making problems. In this work, we demonstrate that, surprisingly, RL emerges during the inference time of large language models (LLMs), a phenomenon we term in-context RL (ICRL). To reveal this capability, we introduce a simple multi-round prompting framework, which we call ICRL prompting, for inference-time self-improvement. The goal of ICRL prompting is to guide LLMs to perform reinforcement learning during inference for self-improvement on a given task. After each response, the model receives numerical scalar feedback, denoted as a reward. In the next round, we prompt the LLM again together with a context that concatenates all prior responses and their associated rewards. We consistently observe that response quality improves as the context grows. In other words, the LLM can optimize scalar reward signals during inference, exhibiting behavior analogous to reinforcement learning. We evaluate ICRL prompting on Game of 24, creative writing, ScienceWorld, and Olympiad-level math competitions (AIME and HMMT), demonstrating significant improvements over baselines such as Self-Refine and Reflexion. Notably, even when the reward signals are generated by the same LLM, ICRL prompting still improves performance, highlighting a promising new paradigm for test-time scaling.
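The multi-round loop described in the abstract (respond, receive a scalar reward, then re-prompt with all prior responses and rewards in context) can be sketched as follows. This is a minimal sketch under assumptions: `icrl_prompting`, the prompt wording, and the `llm` and `reward_fn` callables are hypothetical, and `toy_llm` and `toy_reward` are stubs that only exercise the control flow; no real model API is involved.

```python
def icrl_prompting(task, llm, reward_fn, rounds):
    """Run several prompting rounds, feeding back all prior responses and rewards."""
    history = []  # (response, reward) pairs from earlier rounds
    for _ in range(rounds):
        context = "".join(
            f"Attempt: {response}\nReward: {reward}\n" for response, reward in history
        )
        prompt = f"{task}\n{context}Give a new response that earns a higher reward."
        response = llm(prompt)
        history.append((response, reward_fn(response)))
    best = max(history, key=lambda pair: pair[1])
    return best, history


def toy_llm(prompt):
    """Stub model: its answer depends only on how many attempts are in the context."""
    return str(prompt.count("Attempt:"))


def toy_reward(response):
    """Stub reward: closeness to a hidden target value of 3."""
    return -abs(int(response) - 3)


BEST, HISTORY = icrl_prompting("Guess the hidden number.", toy_llm, toy_reward, rounds=5)
```

With these stubs the responses are "0" through "4" and `BEST` is `("3", 0)`. In a real setting `llm` would wrap a model call and `reward_fn` a task-specific scorer, which the abstract notes can even be the same LLM.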
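Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#logical reasoning #rule-based reasoning #reinforcement learning #language models
TL;DR: RuleReasoner, a reinforcement learning method for reasoning models, beats frontier models at rule-based reasoning (ID/OOD) using effective curriculum learning and is more efficient.
🎯 Motivation: Rule-based reasoning is one of the fundamental problems of reasoning, but existing methods still face serious challenges from the variety of rule formats, types, and complexity.
❓ Problem addressed: To cope with this variety, the paper proposes an RL-based dynamic sampling method that improves the training process and model performance.
🔍 Observations: Rule-based reasoning models currently suffer from imbalanced task distributions and from inefficient, manually designed static training mixtures, which limits their performance in real applications.
🛠️ Method: A domain-aware dynamic sampling method updates domain weights from historical rewards, balancing domains and enabling active learning schedules without static mix-training engineered by humans.
📊 Data and experiments: Evaluated on eight ID tasks and three OOD benchmarks, the model significantly outperforms frontier models in rule-based reasoning accuracy while being more computationally efficient.
⭐ Main contributions: A domain-aware dynamic sampling technique that significantly improves rule-based reasoning on both in-distribution and out-of-distribution tasks and improves computational efficiency, suggesting a new direction for RL-driven reasoning research.
Abstract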
Rule-based reasoning is acknowledged as one of the fundamental problems of reasoning. While recent studies show that large reasoning models (LRMs) have remarkable reasoning capabilities enhanced by reinforcement learning (RL), real applications still face severe challenges due to variations in rule formats, types, and complexity. To mitigate this issue, we introduce RuleReasoner, an effective method for rule-based reasoning via a wide collection of curated tasks and a novel domain-aware dynamic sampling approach in RL. Specifically, RuleReasoner resamples each training batch by updating the domain weights based on historical rewards. This facilitates domain balance and active learning schedules for RL, obviating static mix-training engineered by humans. Evaluations of in-distribution (ID) and out-of-distribution (OOD) benchmarks reveal that RuleReasoner outperforms frontier LRMs by a significant margin ($\Delta$4.1% on eight ID tasks and $\Delta$10.4% on three OOD benchmarks over OpenAI-o1). Notably, our approach also exhibits higher computational efficiency compared to prior methods.
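Updating domain weights from historical rewards and resampling each batch accordingly can be sketched with a small sampler class. The abstract does not give the update rule, so everything specific here is an assumption made for illustration: the class `DomainSampler`, the exponential moving average of rewards, and the softmax over `1 - average reward` that shifts sampling toward domains where the model currently earns less reward.

```python
import math
import random


class DomainSampler:
    """Hypothetical domain-aware sampler driven by a running average of rewards."""

    def __init__(self, domains, momentum=0.8, temperature=0.5):
        self.momentum = momentum
        self.temperature = temperature
        # Starting every domain at 0 treats all domains as equally hard at first.
        self.avg_reward = {d: 0.0 for d in domains}

    def update(self, domain, reward):
        """Fold a new reward (assumed to lie in [0, 1]) into the domain's running average."""
        old = self.avg_reward[domain]
        self.avg_reward[domain] = self.momentum * old + (1.0 - self.momentum) * reward

    def weights(self):
        """Softmax over (1 - average reward): lower-reward domains get more weight."""
        scores = {
            d: math.exp((1.0 - r) / self.temperature) for d, r in self.avg_reward.items()
        }
        total = sum(scores.values())
        return {d: s / total for d, s in scores.items()}

    def sample_batch(self, batch_size, rng):
        """Draw the domains for the next training batch according to current weights."""
        w = self.weights()
        return rng.choices(list(w), weights=list(w.values()), k=batch_size)
```

After a run of high rewards on one domain and low rewards on another, `weights()` assigns the low-reward domain the larger share of the next batch, so the mixture adapts during training instead of being fixed in advance.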
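Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#LLM-as-a-judge; Large Language Model
TL;DR: We introduce SCI-VerifyBench, a cross-disciplinary benchmark, and SCI-Verifier, a reasoning-augmented verifier, to provide systematic evaluation and reliable solutions for answer verification in scientific domains.
🎯 Motivation: As large language models (LLMs) are applied more widely to scientific reasoning, complex answer formats and diverse equivalent expressions make answer verification a critical and difficult task.
❓ Problem addressed: Existing verification methods lack systematic evaluation standards and full disciplinary coverage, and they depend on cumbersome rule design or prompt engineering, which limits their effectiveness in complex reasoning scenarios and their cross-disciplinary generalization.
🔍 Observations: Answer verification in scientific domains is insufficiently evaluated and incompletely covered across disciplines, and model-based methods lack logical reasoning and equivalence judgment capabilities.
🛠️ Method: Proposes the cross-disciplinary benchmark SCI-VerifyBench and SCI-Verifier, a unified reasoning-augmented verification model whose verification ability in scientific domains is improved through post-training.
📊 Data and experiments: SCI-VerifyBench covers mathematics, physics, biology, chemistry, and general scientific QA. It is built from real LLM responses and enhanced with domain-specific equivalence transformations; model-based and expert annotations ensure diversity and quality.
⭐ Main contributions: Establishes a framework for systematic evaluation and practical solutions: SCI-VerifyBench provides a high-quality benchmark, and SCI-Verifier strengthens logical reasoning and equivalence judgment in scientific verification, improving the reliability and applicability of LLMs in scientific domains.
Abstract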
As large language models (LLMs) are increasingly applied to scientific reasoning, the complexity of answer formats and the diversity of equivalent expressions make answer verification a critical yet challenging task. Existing verification studies in scientific domains suffer from two major limitations: (a) the absence of systematic evaluation standards and insufficient disciplinary coverage, which hinders their comprehensive assessment; and (b) heavy reliance on cumbersome rule design or prompt engineering, which reduces their effectiveness in complex reasoning scenarios or limits their cross-disciplinary generalization. To address these challenges, we propose solutions at both the data and model levels. On the data side, we construct **SCI-VerifyBench**, a cross-disciplinary benchmark covering mathematics, physics, biology, chemistry, and general scientific QA. The benchmark is built from real LLM responses and enhanced with domain-specific equivalence transformations that generate challenging and realistic data. Model-based and expert annotations ensure both quality and diversity, enabling rigorous evaluation of verification ability. On the model side, we emphasize the importance of reasoning for verification and introduce **SCI-Verifier**, a unified reasoning-augmented verifier for scientific domains. Through post-training, SCI-Verifier demonstrates strong logical reasoning and equivalence judgment capabilities while maintaining concise and stable outputs. Together, SCI-VerifyBench and SCI-Verifier provide a principled framework for scientific verification, offering both systematic evaluation and practical pathways to enhance the reliability and applicability of LLMs in scientific domains.
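Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#large language models #Test-time reinforcement learning #test-time adaptation #self-play #pseudo labeling #infomax
TL;DR: Through test-time self-play between a solver and a reframer, our method, Self-Harmony, uses an InfoMax-derived harmonic mean to score and select pseudo-labels based on their joint frequency across original and reframed questions.
🎯 Motivation: Test-time reinforcement learning (TTRL) needs reliable learning signals, but existing methods tend to collapse to spurious yet popular wrong answers, so a mechanism that yields more stable results is needed.
❓ Problem addressed: Without human supervision, improve the model's stability and accuracy at test time under reframed questions, and avoid the trap of view-dependent spurious answers.
🔍 Observations: Approaches such as majority voting tend to favor spurious answers, whereas a correct answer should remain consistent between the original question and its reframing.
🛠️ Method: Self-Harmony uses a single model both as a Solver that produces answers and as a Reframer that rephrases the input, then selects pseudo-labels with an InfoMax-derived harmonic mean of answer frequencies across the original and reframed questions.
📊 Data and experiments: Tested on multiple reasoning benchmarks, Self-Harmony ranks first in 28 of 30 settings and shows zero training failures.
⭐ Main contributions: A test-time adaptation method that needs no human supervision and achieves state-of-the-art accuracy and robustness, a step forward in the stability and reliability of TTRL.
Abstract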
Test-time reinforcement learning (TTRL) offers a label-free paradigm for adapting models using only synthetic signals at inference, but its success hinges on constructing reliable learning signals. Standard approaches such as majority voting often collapse to spurious yet popular answers.
We introduce Self-Harmony, a framework built on a simple intuition: the correct answer should remain stable across both an original question and its paraphrase. Self-Harmony operationalizes this by employing a single model in two complementary roles: a Solver to produce answers and a Reframer to rephrase the input. Based on this, we further propose a pseudo-label method: instead of majority voting, it aggregates answer frequencies across these original and reframed views using the harmonic mean. This is a process that naturally selects for solutions stable under reframing, thereby avoiding the common trap of favoring view-dependent, spurious answers.
Crucially, this requires no human supervision or auxiliary models. Across diverse reasoning benchmarks, Self-Harmony achieves state-of-the-art results in the label-free test-time setting, ranking first in 28 of 30 settings across multiple methods. Beyond accuracy, it demonstrates unprecedented robustness, with zero training failures in all experiments, underscoring its stability and reliability.
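Aggregating answer frequencies across the original and reframed views with a harmonic mean can be sketched in a few lines. The example below is an illustrative sketch: the function name `harmonic_pseudo_label` and the sampled answers are hypothetical, and the score is assumed to be the plain harmonic mean of an answer's relative frequency in each view.

```python
from collections import Counter


def harmonic_pseudo_label(original_answers, reframed_answers):
    """Pick the answer whose frequencies in both views have the highest harmonic mean."""
    freq_orig = Counter(original_answers)
    freq_ref = Counter(reframed_answers)
    scores = {}
    for answer in set(freq_orig) | set(freq_ref):
        p = freq_orig[answer] / len(original_answers)
        q = freq_ref[answer] / len(reframed_answers)
        # An answer missing from either view scores zero: it is not stable under reframing.
        scores[answer] = 0.0 if p == 0 or q == 0 else 2 * p * q / (p + q)
    return max(scores, key=scores.get), scores


# Hypothetical samples: "12" is popular only for the original wording.
ORIGINAL = ["12"] * 5 + ["15"] * 3
REFRAMED = ["15"] * 4 + ["9"] * 3 + ["12"] * 1
LABEL, SCORES = harmonic_pseudo_label(ORIGINAL, REFRAMED)
MAJORITY = Counter(ORIGINAL).most_common(1)[0][0]
```

Majority voting on the original question picks "12", while the harmonic mean selects "15" (score 3/7, about 0.43, against about 0.21 for "12"), because the harmonic mean is dominated by the smaller of the two frequencies and so penalizes answers that are popular in only one view.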
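Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Chain-of-Thought #large language model #math reasoning
🎯 Motivation: Implicit chain-of-thought (CoT) methods make LLM reasoning more token-efficient, but they lag behind explicit CoT in performance and become unstable as the computational budget grows.
❓ Problem addressed: Resolve the training instability that appears when implicit CoT is scaled to more reasoning steps, which stems from insufficient semantic diversity in the latent representations.
🔍 Observations: The instability arises from insufficient step-level supervision, which makes latent representations homogeneous and strips them of semantic richness.
🛠️ Method: SIM-CoT is a training module that adds step-level supervision through an auxiliary decoder, improving the stability and diversity of latent representations. The decoder is removed at inference, so efficiency is preserved.
📊 Data and experiments: Applied to Coconut on GPT-2 and to CODI on LLaMA-3.1 8B, SIM-CoT yields clear gains and approaches explicit CoT performance on the larger model.
⭐ Main contributions: A training module for implicit CoT that adds no inference overhead and improves accuracy, out-of-domain robustness, and interpretability, while retaining the token efficiency of implicit reasoning.
Abstract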
Implicit Chain-of-Thought (CoT) methods offer a token-efficient alternative to explicit CoT reasoning in Large Language Models (LLMs), but a persistent performance gap has limited their adoption.
We identify a core latent instability issue when scaling the computational budget of implicit CoT: as the number of reasoning tokens increases, training often becomes unstable and collapses.
Our analysis shows that this instability arises from latent representations becoming homogeneous and losing semantic diversity, caused by insufficient step-level supervision in current implicit CoT methods.
To address this, we propose SIM-CoT, a plug-and-play training module that introduces step-level supervision to stabilize and enrich the latent reasoning space.
SIM-CoT employs an auxiliary decoder during training to align each implicit token with its corresponding explicit reasoning step, ensuring latent states capture distinct and meaningful information.
The auxiliary decoder is removed at inference, preserving the efficiency of implicit CoT with no added overhead.
It also provides interpretability by projecting each latent token onto an explicit reasoning vocabulary, enabling per-step visualization and diagnosis.
SIM-CoT significantly improves both in-domain accuracy and out-of-domain stability of implicit CoT methods, boosting Coconut by +8.2% on GPT-2 and CODI by +3.0% on LLaMA-3.1 8B.
It further surpasses the explicit CoT baseline on GPT-2 by 2.1% with 2.3× greater token efficiency, while closing the performance gap on larger models like LLaMA-3.1 8B.
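Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#large language model #small language models
🎯 Motivation: Small language models (SLMs) are efficient and often excel at specific tasks, but their individual accuracy is limited, so it is worth exploring how to make them cooperate to raise overall performance.
❓ Problem addressed: Existing orchestration methods mainly target frontier LLMs and perform poorly on SLMs, so an orchestration method designed for SLMs is needed.
🔍 Observations: With effective combination and optimization, several cooperating SLMs can exceed individual models, and even some frontier LLMs, in accuracy and efficiency.
🛠️ Method: Proposes the SLM-MUX multi-model architecture, together with a model selection search and a test-time scaling strategy, to orchestrate multiple SLMs efficiently.
📊 Data and experiments: Experiments cover MATH, GPQA, GSM8K, and other tasks, where SLM-MUX improves on existing orchestration methods by up to 13.4% and even outperforms some large models. Additional experiments show that the approach extends to open-ended generation tasks (e.g., HumanEval) and to other model classes.
⭐ Main contributions: A new method, SLM-MUX, for coordinating SLMs efficiently, which significantly improves accuracy on several tasks and extends the theory and practice of SLM orchestration and optimization.
Abstract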
With the rapid development of language models, the number of small language models (SLMs) has grown significantly. Although they do not achieve state-of-the-art accuracy, they are more efficient and often excel at specific tasks. This raises a natural question: can multiple SLMs be orchestrated into a system where each contributes effectively, achieving higher accuracy than any individual model? Existing orchestration methods have primarily targeted frontier models (e.g., GPT-4) and perform suboptimally when applied to SLMs. To address this gap, we propose a three-stage approach for orchestrating SLMs. First, we introduce SLM-MUX, a multi-model architecture that effectively coordinates multiple SLMs. Building on this, we develop two optimization strategies: (i) a model selection search that identifies the most complementary SLMs from a given pool, and (ii) test-time scaling tailored to SLM-MUX. Our approach delivers strong results: Compared to existing orchestration methods, our approach achieves up to 13.4% improvement on MATH, 8.8% on GPQA, and 7.0% on GSM8K. With just two SLMs, SLM-MUX outperforms Qwen 2.5 72B on GPQA and GSM8K, and matches its performance on MATH. We further provide theoretical analyses to substantiate the advantages of our method. Additional experiments show that the core principle of SLM-MUX extends to open-ended generation tasks (e.g., HumanEval) and benefits other model classes, including frontier LLMs and domain-specific fine-tuned SLMs. In summary, we demonstrate that SLMs can be effectively orchestrated into more accurate and efficient systems through the proposed approach.
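Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Reinforcement Learning #Self-Play #Large Language Models #Reasoning #Multi-Agent Reinforcement Learning
TL;DR: Self-play on multiple zero-sum language games teaches LLMs transferable reasoning skills that improve mathematical and general reasoning benchmarks by up to 10%, without requiring any domain-specific training data.
🎯 Motivation: Current reinforcement learning approaches depend on human-curated problem-answer pairs and domain-specific reward engineering, which limits how broadly they can improve the reasoning of language models.
❓ Problem: Proposes a method that needs no human supervision: language models are trained through self-play on zero-sum language games to acquire transferable reasoning skills.
🔍 Observations: Different games develop distinct and complementary cognitive patterns. These patterns transfer to mathematical and general reasoning benchmarks and outperform supervised fine-tuning on expert game trajectories.
🛠️ Method: Builds a fully online, multi-turn, multi-agent reinforcement learning system for LLMs, combining a self-play framework with role-conditioned advantage estimation (RAE) to stabilize training against continuously changing opponents.
📊 Data and experiments: Trains on several games (TicTacToe, Kuhn Poker, Simple Negotiation) and reports gains of up to 10% across 8 reasoning benchmarks, consistently across models from the Qwen and Llama families.
⭐ Contributions: Shows that zero-sum games can develop general reasoning ability, proposes a reinforcement learning method that needs no domain-specific data, and points to a direction for autonomous reasoning development.
Full abstract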
Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, generating an automatic curriculum of stronger opponents, and eliminating the need for human supervision. To enable this self-play training at scale, we implement a fully online, multi-turn, multi-agent reinforcement learning system for LLMs and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. SPIRAL produces reasoning capabilities that transfer broadly, improving performance by up to 10% across a suite of 8 reasoning benchmarks on 4 different models spanning Qwen and Llama model families, outperforming supervised fine-tuning on 25,000 expert game trajectories. Multi-game training (TicTacToe, Kuhn Poker, Simple Negotiation) yields the strongest results, with improvements observed across both base and instruction-tuned models. Analysis of chain-of-thought traces reveals that games develop distinct cognitive patterns that transfer to improve reasoning performance, with different games developing complementary strengths. Even models which have already been trained on reasoning tasks using RLVR, like DeepSeek-R1-Distill-Qwen-7B, still benefit from our approach. These results demonstrate that zero-sum games naturally develop transferable reasoning capabilities across diverse model architectures and training stages, highlighting a promising direction for autonomous reasoning development.
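Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large Language Model #Reinforcement Learning with Verifiable Reward
🎯 Motivation: Reinforcement Learning with Verifiable Reward (RLVR) lets LLMs solve complicated logical problems through policy optimization, but existing methods require the entire dataset to be annotated and spread computation uniformly over all samples, which is inefficient.
❓ Problem: How to identify, without access to answers, the small subset of a large training set (the "lottery-winning samples") that is critical for the model's reasoning.
🔍 Observations: Articulates the lottery sample hypothesis: a large training set contains a small subset that, when trained on alone, yields reasoning performance comparable to the full dataset.
🛠️ Method: Proposes an unsupervised framework, Complementary Conformal Selection (CONST), which scores samples by two complementary components, procedural volatility and outcome volatility, and then uses conformal prediction to obtain a prediction set whose cardinality decides which samples to select for annotation.
📊 Data and experiments: Extensive experiments on various LLMs and datasets show that CONST effectively finds the critical samples.
⭐ Contributions: Enables unsupervised discovery of critical training samples, introduces the CONST framework with a volatility-based measure of sample importance, and provides a theoretical analysis showing that CONST can approximate the optimal policy.
Full abstract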
Reinforcement Learning with Verifiable Reward (RLVR) has equipped large language models (LLMs) with the capability of reasoning over complicated logical problems through policy optimization. However, conventional methods require complete annotation of the entire dataset and allocate computation uniformly over all samples. We articulate the lottery sample hypothesis in policy optimization of LLMs: a large training set contains a small subset that, when trained alone, yields performance comparable to that of the full dataset. This paper therefore explores the following question: How can we identify these lottery-winning samples from the original dataset without access to answers? Unlike prior efforts that analyze the effect of different samples in the training set with complete annotation, this paper focuses on the unsupervised discovery of critical instances for LLM reasoning and proposes a novel framework termed Complementary Conformal Selection (CONST). Specifically, CONST evaluates the importance of samples by considering two complementary components: procedural volatility and outcome volatility. Procedural volatility measures the potential variations during the LLM’s reasoning process, while outcome volatility captures inconsistencies in the final answer. Subsequently, conformal prediction is used to obtain a prediction set whose cardinality serves as the criterion for selecting the lottery-winning samples for annotation. We also provide a theoretical analysis, showing that CONST can effectively approximate the optimal policy. Extensive experiments on various LLMs across different datasets demonstrate the effectiveness of CONST.
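Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Sampler #model uncertainty #LLM reasoning #min-p #calibration #chain-of-thought #self-consistency
TL;DR: LLM sampling should be reduced at high uncertainty tokens
🎯 Motivation: In complex reasoning tasks, LLMs need both diversity and accuracy, but existing rules for sampling under high uncertainty contradict each other and fail to balance the two.
❓ Problem: Proposes calibrating the sampling rule by estimated correctness rather than by confidence alone, in order to improve sample quality in reasoning tasks.
🔍 Observations: One line of work increases exploration at highly uncertain steps, while another improves reliability by rejecting low-confidence samples. The two conflict because they conflate different sources of uncertainty.
🛠️ Method: Proposes correctness-calibrated sampling strategies: Greedy-Threshold makes sampling greedy at very low confidence steps, while Calibrated-TopK and Calibrated-ε set the truncation threshold from estimated rank-wise correctness.
📊 Data and experiments: Evaluated on math and general reasoning benchmarks, where the proposed strategies improve over existing heuristics.
⭐ Contributions: Challenges prevailing heuristics for decoding under uncertainty, proposes correctness-centered calibration of sampling, and shows consistent gains across benchmarks.
Full abstract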
Large Language Models (LLMs) are increasingly applied to complex tasks that require extended reasoning. In such settings, models often benefit from diverse chains-of-thought to arrive at multiple candidate solutions. This requires two competing objectives: to inject enough stochasticity to explore multiple reasoning chains, and to ensure sufficient accuracy and quality in each path. Existing works pursue the first objective by increasing exploration at highly uncertain steps with higher temperature or larger candidate token sets, while others improve reliability by rejecting samples with low confidence post generation, implying that low confidence correlates with low answer quality. These two lines of thought are in conflict, as they conflate different sources of uncertainty. To resolve this, we argue that the decoding rule should be calibrated by *correctness*, not confidence alone. We should sample from tokens with higher estimated correctness, and reduce sampling where expected correctness is low. We propose simple strategies that achieve this goal: **Greedy-Threshold** makes sampling greedy at very low confidence steps. **Calibrated-TopK** and **Calibrated-ε** set truncation threshold based on estimated rank-wise correctness. Together, our findings challenge prevailing heuristics about decoding under uncertainty, showing consistent gains across math and general reasoning benchmarks.
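The Greedy-Threshold idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: confidence is taken to be the top next-token probability, and the function name `greedy_threshold_sample` and the default `threshold=0.3` are hypothetical, not values from the paper.

```python
import random

def greedy_threshold_sample(probs, threshold=0.3, rng=random):
    """Pick a token index from a next-token distribution.

    Assumption: "confidence" is the largest probability in `probs`.
    Below `threshold` the step is treated as low-confidence and the
    choice is greedy; otherwise we sample from the distribution.
    """
    top_index = max(range(len(probs)), key=lambda i: probs[i])
    if probs[top_index] < threshold:
        return top_index  # low confidence: do not explore
    # confident enough: ordinary stochastic sampling
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With a flat distribution such as `[0.25, 0.2, 0.2, 0.2, 0.15]` the function returns the argmax deterministically, whereas a peaked distribution is still sampled.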
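Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#LLM Reasoning #Reinforcement Learning from Verifier Rewards #Mathematical Reasoning
TL;DR: Scaf-GRPO boosts LLM reasoning in RLVR, using hierarchical hints to guide on-policy GRPO on difficult problems.
🎯 Motivation: Reinforcement learning methods for improving LLM reasoning run into a "learning cliff" that blocks progress on problems far beyond the model's current ability.
❓ Problem: Introduces a progressive hint-based intervention to address the stalled learning caused by persistent zero-reward signals on hard problems.
🔍 Observations: When every attempt at a hard problem earns zero reward, the GRPO advantage calculation collapses to zero, so these problems contribute no gradient and learning stalls.
🛠️ Method: Designs the Scaf-GRPO framework, which first diagnoses learning stagnation and then injects tiered in-prompt hints, ranging from abstract concepts to concrete steps, so that the model can still construct a valid solution by itself.
📊 Data and experiments: On the AIME24 mathematics benchmark, Scaf-GRPO raises the pass@1 score of Qwen2.5-Math-7B by a relative 44.3% over a vanilla GRPO baseline.
⭐ Contributions: Proposes a progressive training framework that improves an LLM's ability to solve hard reasoning problems independently and extends reinforcement learning to problems previously out of reach.
Full abstract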
Reinforcement learning from verifiable rewards has emerged as a powerful technique for enhancing the complex reasoning abilities of Large Language Models (LLMs). However, these methods are fundamentally constrained by the "learning cliff" phenomenon: when faced with problems far beyond their current capabilities, models consistently fail, yielding a persistent zero-reward signal. In policy optimization algorithms like GRPO, this collapses the advantage calculation to zero, rendering these difficult problems invisible to the learning gradient and stalling progress. To overcome this, we introduce Scaf-GRPO (Scaffolded Group Relative Policy Optimization), a progressive training framework that strategically provides minimal guidance only when a model's independent learning has plateaued. The framework first diagnoses learning stagnation and then intervenes by injecting tiered in-prompt hints, ranging from abstract concepts to concrete steps, enabling the model to construct a valid solution by itself. Extensive experiments on challenging mathematics benchmarks demonstrate Scaf-GRPO's effectiveness, boosting the pass@1 score of the Qwen2.5-Math-7B model on the AIME24 benchmark by a relative 44.3% over a vanilla GRPO baseline. This result demonstrates that our framework provides a robust and effective methodology for unlocking a model's ability to solve problems previously beyond its reach, a critical step towards extending the frontier of autonomous reasoning in LLMs.
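The advantage collapse mentioned above is easy to see numerically. The sketch below uses the usual group-normalized form of the GRPO advantage (reward minus group mean, divided by group standard deviation); the small `eps` term and the function name are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# A hard problem where every rollout fails: all advantages are zero,
# so the problem contributes nothing to the policy gradient.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# As soon as one rollout succeeds (for example after a hint),
# the group carries a learning signal again.
one_success = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
```

Here `all_failed` is a list of zeros, while `one_success` has a positive advantage for the successful rollout and negative advantages for the rest.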
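Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Test time scaling #Large language models #Length control
TL;DR: A novel framework for scalable chain of thoughts that explicitly separates reasoning into two phases (thinking and solution) with independently allocated budgets.
🎯 Motivation: Large reasoning models have made strong progress on complex tasks by generating extended chains of thought, but their output length is hard to control, which limits deployment under strict resource constraints.
❓ Problem: Controlling reasoning when the inference-time budget is limited, so that outputs remain complete and reliable, while keeping training cost low.
🔍 Observations: Uncontrolled output length can exceed resource limits and produce unreliable results, especially in real deployments where budgets on tokens, latency, or compute are strict.
🛠️ Method: Proposes Elastic Reasoning, which splits reasoning into a thinking phase and a solution phase with independently allocated budgets, and introduces a lightweight budget-constrained rollout strategy (integrated into GRPO) so that the model learns to reason adaptively when thinking is cut short.
📊 Data and experiments: Evaluated on mathematical (AIME, MATH500) and programming (LiveCodeBench, Codeforces) benchmarks; the method is robust under strict budgets and has significantly lower training cost than baselines.
⭐ Contributions: Provides a controllable and efficient approach to extended chain-of-thought reasoning that performs well in both constrained and unconstrained settings; code is provided in the supplementary material.
Full abstract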
Large reasoning models (LRMs) have achieved remarkable progress on complex tasks by generating extended chains of thought (CoT). However, their uncontrolled output lengths pose significant challenges for real-world deployment, where inference-time budgets on tokens, latency, or compute are strictly constrained. We propose Elastic Reasoning, a novel framework for scalable chain of thoughts that explicitly separates reasoning into two phases—thinking and solution—with independently allocated budgets. At test time, Elastic Reasoning prioritizes the completeness of solution segments, significantly improving reliability under tight resource constraints. To train models that are robust to truncated thinking, we introduce a lightweight budget-constrained rollout strategy, integrated into GRPO, which teaches the model to reason adaptively when the thinking process is cut short and generalizes effectively to unseen budget constraints without additional training. Empirical results on mathematical (AIME, MATH500) and programming (LiveCodeBench, Codeforces) benchmarks demonstrate that Elastic Reasoning performs robustly under strict budget constraints, while incurring significantly lower training cost than baseline methods. Remarkably, our approach also produces more concise and efficient reasoning even in unconstrained settings. Elastic Reasoning offers a principled and practical solution to the pressing challenge of controllable reasoning at scale. Code is available in the supplementary material.
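A minimal sketch of the two-budget idea at inference time follows. It assumes a generic `generate(context, max_tokens, stop)` callable and a `</think>` delimiter; both are placeholders for illustration rather than the authors' interface, and `toy_generate` merely stands in for a model call.

```python
def elastic_generate(generate, prompt, think_budget, solution_budget,
                     end_think="</think>"):
    """Spend separate token budgets on thinking and on the solution."""
    thinking = generate(prompt, max_tokens=think_budget, stop=end_think)
    # Close the thinking phase explicitly, even if it was cut short,
    # so the solution phase always gets its own budget.
    context = prompt + thinking + [end_think]
    solution = generate(context, max_tokens=solution_budget, stop=None)
    return thinking, solution

def toy_generate(context, max_tokens, stop):
    # Hypothetical stand-in: "thinks" for 10 tokens, "answers" in 3.
    wanted = ["step"] * 10 if stop is not None else ["ans"] * 3
    return wanted[:max_tokens]

thinking, solution = elastic_generate(toy_generate, ["question"], 4, 8)
```

In this toy run the thinking is truncated to 4 tokens while the 3-token solution is still produced in full, which is the behavior the separate budgets are meant to guarantee.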
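Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#large language model #RLVR #Data Scheduling
🎯 Motivation: Existing RLVR data scheduling methods overlook the structure of each query's reasoning tree, which limits further gains in data efficiency and accuracy.
❓ Problem: Proposes a new metric, the Reasoning Score (r-score), which measures a query's learning difficulty from the structure of its reasoning tree, and designs a data scheduling algorithm based on it.
🔍 Observations: Conventional path-based metrics do not exploit reasoning tree structure, even though that structure is closely tied to how well the model learns from a query.
🛠️ Method: Builds the Reasoning Tree Schedule (Re-Schedule) on the r-score: a curriculum that trains first on structurally simple (high r-score) queries and then progresses to complex (low r-score) ones.
📊 Data and experiments: On six math reasoning benchmarks, Re-Schedule improves average accuracy by up to 3.2%.
⭐ Contributions: Shows that reasoning tree structure is a key factor for RLVR data scheduling and proposes a more effective scheduling algorithm.
Full abstract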
Using Reinforcement Learning with Verifiable Rewards (RLVR) to optimize Large Language Models (LLMs) can be conceptualized as progressively editing a query's 'Reasoning Tree'. This process involves exploring nodes (tokens) and dynamically modifying the model's policy at each node. When combined with data scheduling, this process yields further gains in data efficiency and accuracy.
However, existing RLVR data scheduling methods typically rely on path-based metrics to rank queries, overlooking the reasoning tree structures of these queries.
In this paper, we introduce a novel metric, namely **Reasoning Score (r-score)**, which measures the query's learning difficulty based on the structure of its reasoning tree.
Based on the r-score, we propose the **Reasoning Tree Schedule (Re-Schedule)**, a scheduling algorithm that constructs a curriculum progressing from structurally simple (high r-score) to complex (low r-score) queries.
Experiments on six math-reasoning benchmarks show that Re-Schedule significantly improves average accuracy, achieving gains of up to 3.2%.
These strong results validate our approach and demonstrate that a structural understanding of the reasoning tree provides a more powerful and principled foundation for RLVR data scheduling.
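Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#long CoTs #selective learning #integrated gradient #segment attributions
🎯 Motivation: Large reasoning models achieve strong performance by generating long chains of thought, but redundant content reduces efficiency, and after supervised fine-tuning models may further imitate verbose, uninformative patterns.
❓ Problem: Identify which parts of a long chain of thought actually contribute to the final answer, and learn selectively from those parts to improve accuracy and output efficiency.
🔍 Observations: Most of a reasoning trace is repetitive or truncated content and only a small fraction meaningfully contributes to answer prediction; learning from the uninformative parts can degrade performance.
🛠️ Method: Uses integrated gradient attribution to measure each token's influence on the final answer, aggregates it into two segment-level metrics (attribution strength and direction consistency), and applies selective SFT only to segments with high attribution strength but moderate consistency, masking the loss elsewhere.
📊 Data and experiments: Experiments across multiple models and datasets show that the selective learning framework improves accuracy and output efficiency.
⭐ Contributions: Proposes a segment-level attribution and selective learning framework that addresses redundancy in long chains of thought and enables more effective learning from long reasoning traces.
Full abstract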
Large Reasoning Models (LRMs) achieve strong reasoning performance by generating long chains of thought (CoTs), yet only a small fraction of these traces meaningfully contributes to answer prediction, while the majority contains repetitive or truncated content. Such output redundancy is further propagated after supervised finetuning (SFT), as models learn to imitate verbose but uninformative patterns, which can degrade performance. To address this, we incorporate integrated gradient attribution to quantify each token's influence on final answers and aggregate them into two segment-level metrics: (1) *attribution strength* measures the overall attribution magnitude; and (2) *direction consistency* captures whether tokens' attributions within a segment are uniformly positive or negative (high consistency), or a mixture of both (moderate consistency). Based on these two metrics, we propose a segment-level selective learning framework to identify important segments with high attribution strength but moderate consistency that indicate reflective rather than shallow reasoning. The framework then applies selective SFT on these important segments while masking loss for unimportant ones. Experiments across multiple models and datasets show that our approach improves accuracy and output efficiency, enabling more effective learning from long reasoning traces.
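To make the two segment-level metrics concrete, here is one plausible way to compute them from per-token attributions. The abstract does not give the formulas, so the definitions below (sum of magnitudes, and net signed mass over total mass) and both thresholds are assumptions chosen for illustration.

```python
def attribution_strength(attrs):
    """Overall attribution magnitude of a segment (assumed: sum of |a|)."""
    return sum(abs(a) for a in attrs)

def direction_consistency(attrs):
    """1.0 when all attributions share a sign, near 0.0 when they cancel."""
    total = sum(abs(a) for a in attrs)
    return abs(sum(attrs)) / total if total else 0.0

def loss_mask(segments, min_strength, max_consistency):
    """Keep (1) segments with high strength but only moderate consistency."""
    return [
        1 if attribution_strength(s) >= min_strength
        and direction_consistency(s) <= max_consistency else 0
        for s in segments
    ]

segments = [[0.5, -0.4, 0.6],   # strong, mixed signs -> kept
            [0.01, 0.01],       # weak -> masked
            [1.0, 1.0]]         # strong but uniformly positive -> masked
mask = loss_mask(segments, min_strength=1.0, max_consistency=0.8)
```

The resulting mask would then multiply the per-segment SFT loss, so only the selected segments contribute gradients.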
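Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#LLM reasoning #Reinforcement Learning
TL;DR: We propose Slow-Fast Policy Optimization (SFPO), a reposition-before-update method that stabilizes on-policy optimization for LLM reasoning.
🎯 Motivation: Reinforcement learning (RL) has become central to improving LLM reasoning, but existing on-policy algorithms are unstable in early training because of noisy gradients and inefficient exploration.
❓ Problem: Methods such as GRPO suffer when rollouts are of low quality, so a new policy optimization mechanism is needed to improve training stability and efficiency.
🔍 Observations: In early training, low-quality rollouts produce noisy gradients and unstable updates, which hurts exploration and convergence.
🛠️ Method: Proposes Slow-Fast Policy Optimization (SFPO), which splits each iteration into three stages: a short fast trajectory of inner steps on the same batch, a reposition step that controls off-policy drift, and a final slow correction. The objective and rollout process are unchanged, so SFPO is plug-compatible with existing policy-gradient pipelines.
📊 Data and experiments: On math reasoning benchmarks, SFPO outperforms GRPO by up to 2.80 points, and matches GRPO's best accuracy with up to 4.93× fewer rollouts and a 4.19× reduction in wall-clock time.
⭐ Contributions: Introduces SFPO to address instability in RL for LLM reasoning, improving training efficiency and performance while remaining compatible with existing pipelines.
Full abstract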
Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often suffer in early training: noisy gradients from low-quality rollouts lead to unstable updates and inefficient exploration. We introduce Slow-Fast Policy Optimization (SFPO), a simple yet efficient mechanism to address the above limitations via decomposing each iteration into three stages: a short fast trajectory of inner steps on the same batch, a reposition step to control off-policy drift, and a final slow correction. This reposition-before-update design preserves the objective and rollout process unchanged, making SFPO plug-compatible with existing policy-gradient pipelines. Extensive experiments demonstrate that SFPO consistently improves stability, reduces number of rollouts, and accelerates convergence of reasoning RL training. Specifically, it outperforms GRPO by up to 2.80 points on math reasoning benchmarks. It also achieves up to 4.93$\times$ fewer rollouts and a 4.19$\times$ reduction in wall-clock time to match GRPO’s best accuracy.
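Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Generative Process Reward Model #Math Reasoning #Large Reasoning Model
TL;DR: We identify key pitfalls for GenPRMs (such as over-reliance on reasoning, exploration suppression, and reward hacking) and propose TP-GRPO, an RL framework with intrinsic-signal evaluation, thought-level granularity, and ability-adaptive rewards.
🎯 Motivation: Large reasoning models perform strongly on mathematical reasoning, but conventional outcome-only rewards give sparse feedback and lead to inefficient optimization, so better reward mechanisms are needed.
❓ Problem: Current generative process reward models (GenPRMs) rely too heavily on their own reasoning ability, suppress exploration, and are vulnerable to reward hacking; reliable evaluation and reward methods are needed to overcome these weaknesses.
🔍 Observations: An analysis of GenPRMs shows that correctness judgment depends heavily on the GenPRM's reasoning ability, and that reward assignment suppresses exploration and is open to reward hacking.
🛠️ Method: Proposes an intrinsic-signal-driven evaluation mechanism, combined with thought-level rewarding granularity and a difficulty-aware reward formulation that balances exploration and exploitation and mitigates reward hacking; together these form the TP-GRPO algorithm.
📊 Data and experiments: Experiments on large reasoning models with 1.5B and 7B parameters show that TP-GRPO achieves larger improvements while using significantly fewer training samples.
⭐ Contributions: Designs an intrinsic-signal evaluation mechanism, thought-level rewards, and a difficulty-adaptive reward scheme that make process-reward-based RL training more efficient and more robust.
Full abstract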
Large reasoning models (LRMs) have shown strong performance in complex mathematical reasoning when optimized via reinforcement learning (RL). However, conventional outcome-only reward provides sparse feedback, leading to inefficient optimization. In this work, we investigate whether generative process reward models (GenPRMs) can accelerate RL training of LRMs by improving the utilization of reasoning trajectories. We first analyze critical limitations in existing GenPRMs, including their heavy reliance on reasoning ability during correctness judgment, and suppression of exploration as well as vulnerability to reward hacking during reward assignment. To address these limitations, we propose a novel **intrinsic-signal-driven evaluation** mechanism, which judges reasoning steps using semantic cues from the solution, thus mitigating extensive dependence on GenPRM. Furthermore, we (i) adopt **thought-level rewarding granularity** to alleviate over-dense step rewards, and (ii) design a **difficulty-aware reward formulation** that dynamically balances exploration and exploitation and keeps the optimization target of key tokens to mitigate reward hacking. We integrate these innovations into the process reward-based GRPO, resulting in the proposed **TP-GRPO** algorithm. Experiments on LRMs with 1.5B and 7B parameters show that TP-GRPO achieves higher improvements while using significantly fewer training samples, and more analyses further confirm the effectiveness of our proposed process evaluation mechanism.
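Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#reinforcement learning #large language models #math reasoning #latent reasoning #soft thinking #continuous tokens #reasoning
TL;DR: We present the first scalable method to learn continuous CoTs via RL, matching discrete tokens at pass@1 and outperforming them at pass@32.
🎯 Motivation: Continuous tokens have attracted attention on the intuition that a continuous mixture of discrete tokens can simulate a superposition of several reasoning paths, but practical use has been limited by training difficulty and high computational cost.
❓ Problem: Overcome the training inefficiency of continuous CoTs with a scalable learning method that does not rely on distillation from discrete CoTs.
🔍 Observations: Theory shows that continuous tokens are more expressive and can solve specific problems more efficiently; experiments show greater CoT diversity and better preservation of the base model's predictions than discrete-token training.
🛠️ Method: Uses reinforcement learning with "soft" tokens (mixtures of tokens) and adds noise to the input embedding for exploration, with minimal computational overhead, which makes continuous CoTs of hundreds of tokens learnable.
📊 Data and experiments: On math reasoning benchmarks with Llama and Qwen models up to 8B, continuous CoT training matches discrete-token CoTs at pass@$1$ and surpasses them at pass@$32$.
⭐ Contributions: Presents the first scalable method for learning continuous CoTs, shows that the best-performing setup is to train with continuous CoT tokens and then use discrete tokens at inference, and finds that the method better preserves the base model's predictions on out-of-domain tasks.
Full abstract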
The use of continuous instead of discrete tokens during the Chain-of-Thought (CoT) phase of reasoning LLMs has garnered attention recently, based on the intuition that a continuous mixture of discrete tokens could simulate a superposition of several reasoning paths simultaneously. Theoretical results have formally proven that continuous tokens have much greater expressivity and can solve specific problems more efficiently. However, practical use of continuous tokens has been limited by strong training difficulties: previous works either just use continuous tokens at inference time on a pre-trained discrete-token model, or must distill the continuous CoT from ground-truth discrete CoTs and face computational costs that limit the CoT to very few tokens.
This is the first work introducing a scalable method to learn continuous CoTs via reinforcement learning (RL), without distilling from reference discrete CoTs. We use "soft" tokens: mixtures of tokens together with noise on the input embedding to provide RL exploration. Computational overhead is minimal, enabling us to learn continuous CoTs with hundreds of tokens. On math reasoning benchmarks with Llama and Qwen models up to 8B, training with continuous CoTs matches discrete-token CoTs for pass@$1$ and surpasses them for pass@$32$, showing greater CoT diversity. In systematic comparisons, the best-performing scenario is to train with continuous CoT tokens then use discrete tokens for inference, meaning the "soft" models can be deployed in a standard way. Finally, we show continuous CoT RL training better preserves the predictions of the base model on out-of-domain tasks, thus providing a softer touch to the base model.
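The "soft" token described above can be pictured as a probability-weighted average of token embeddings with Gaussian noise added to the result. The sketch below uses plain Python lists; the function name, the noise scale `noise_std`, and the choice of isotropic Gaussian noise are illustrative assumptions rather than the paper's exact recipe.

```python
import random

def soft_token_embedding(probs, embedding_table, noise_std=0.0, rng=random):
    """Mix token embeddings by `probs`, then perturb the mixture.

    probs:           next-token probabilities, one per vocabulary entry
    embedding_table: one embedding vector (list of floats) per entry
    noise_std:       std of the exploration noise (hypothetical parameter)
    """
    dim = len(embedding_table[0])
    mixed = [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]
    return [x + rng.gauss(0.0, noise_std) for x in mixed]

# Two-token vocabulary with one-hot embeddings: an even mixture lands
# halfway between the two discrete tokens.
mixture = soft_token_embedding([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])
```

A discrete token is the special case where `probs` is one-hot and `noise_std` is zero, which is consistent with the abstract's point that a model trained this way can still be run with ordinary discrete tokens at inference.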
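Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Reinforcement Learning #Policy Gradients #Large Language Models
🎯 Motivation: Reinforcement learning is highly effective for LLM reasoning, but the optimization stability of policy gradients in this setting is understudied, which leads to more training samples and higher computational cost.
❓ Problem: Formalizes policy gradient optimization with explicit consideration of second-order geometry, in order to address instability, improve sample efficiency, and enable scalable post-training.
🔍 Observations: Existing implementations fall back on conservative hyperparameters to stay stable, which costs samples and compute; under aggressive learning regimes, baseline updates become unstable and fail catastrophically.
🛠️ Method: Proposes Curvature-Aware Policy Optimization (CAPO), a framework that tracks and uses curvature information during policy updates and applies data selection to mask out samples that contribute to unstable updates.
📊 Data and experiments: On standard math reasoning benchmarks, CAPO keeps updates stable under aggressive learning regimes and achieves up to 30× better sample efficiency than standard GRPO while rejecting fewer than 8% of tokens.
⭐ Contributions: Presents CAPO with monotonic improvement guarantees under realistic assumptions, substantially improves RL sample efficiency, and demonstrates better stability and scalability on LLM reasoning tasks.
Full abstract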
Reinforcement Learning, particularly through policy gradient methods, has played a central role in enabling reasoning capabilities of Large Language Models. However, the optimization stability of policy gradients in this setting remains understudied. As a result, existing implementations often resort to conservative hyperparameter choices to ensure stability, which requires more training samples and increases computational costs. Hence, developing models for reliably tracking the underlying optimization dynamics and leveraging them into training enables more sample-efficient regimes and further unleashes scalable post-training. We address this gap by formalizing the stochastic optimization problem of policy gradients with explicit consideration of second-order geometry. We propose a tractable computational framework that tracks and leverages curvature information during policy updates. We further employ this framework to design interventions in the optimization process through data selection. The resultant algorithm, Curvature-Aware Policy Optimization (CAPO), identifies samples that contribute to unstable updates and masks them out. Theoretically, we establish monotonic improvement guarantees under realistic assumptions. On standard math reasoning benchmarks, we empirically show that CAPO ensures stable updates under aggressive learning regimes where baselines catastrophically fail. With minimal intervention (rejecting fewer than 8% of tokens), CAPO achieves up to 30$\times$ improvement in sample efficiency over standard GRPO for LLM reasoning.
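Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Test-time scaling #bandit learning #large language models #pure exploration
🎯 Motivation: Scaling test-time compute is an effective strategy, but existing methods allocate compute uniformly and ignore differences in query difficulty, which wastes resources.
❓ Problem: Formulates test-time compute allocation as a bandit learning problem, so that compute is assigned dynamically according to query difficulty.
🔍 Observations: The algorithms maintain accuracy on easy queries, give more compute to challenging ones, and among challenging queries prioritize solvable instances, reducing excessive compute on unsolvable queries.
🛠️ Method: Designs adaptive algorithms that estimate query difficulty on the fly and allocate compute accordingly, supported by both a theoretical efficiency analysis and experiments.
📊 Data and experiments: On MATH-500, AIME25, and LiveCodeBench, performance improves by up to 11.10%, 10.82%, and 11.23%, respectively.
⭐ Contributions: Proves that the algorithms are more compute-efficient than uniform allocation and shows clear improvements on several benchmarks.
Full abstract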
Scaling test-time compute has emerged as an effective strategy for improving the performance of large language models. However, existing methods typically allocate compute uniformly across all queries, overlooking variation in query difficulty. To address this inefficiency, we formulate test-time compute allocation as a novel bandit learning problem and propose adaptive algorithms that estimate query difficulty on the fly and allocate compute accordingly. Compared to uniform allocation, our algorithms allocate more compute to challenging queries while maintaining accuracy on easier ones. Among challenging queries, our algorithms further learn to prioritize solvable instances, effectively reducing excessive computing on unsolvable queries. We theoretically prove that our algorithms achieve better compute efficiency than uniform allocation and empirically validate their effectiveness on math and code benchmarks. Specifically, our algorithms achieve up to an 11.10% performance improvement (15.04% relative) on the MATH-500 dataset, up to 10.82% (14.44% relative) on the AIME25 dataset, and up to an 11.23% performance improvement (15.29% relative) on the LiveCodeBench dataset.
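Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#LLMs #Reasoning #Streaming
TL;DR: We propose StreamingThinker, a framework that enables LLMs to think while reading.
🎯 Motivation: Current LLM reasoning starts only after the entire input has been received, which adds latency and weakens attention to earlier information in dynamic scenarios. Inspired by how humans think while reading, the paper proposes a new reasoning paradigm.
❓ Problem: Designs a streaming thinking mechanism that lets an LLM reason as it reads the input, improving efficiency and performance in dynamic scenarios.
🔍 Observations: In batch reasoning, the order in which information arrives is decoupled from when reasoning happens, which causes long waiting times and weakened attention to earlier input.
🛠️ Method: The StreamingThinker framework combines streaming CoT generation, streaming-constraint training, and streaming parallel inference. It uses streaming reasoning units with quality control, order-preserving attention masks and position encoding, and parallel KV caches that decouple input encoding from reasoning generation.
📊 Data and experiments: Evaluated on the Qwen3 model family across math reasoning, logical reasoning, and context-based QA reasoning tasks, confirming that streaming thinking preserves performance while improving efficiency.
⭐ Contributions: Proposes the streaming thinking paradigm and the StreamingThinker framework, which keeps performance comparable to batch thinking while cutting token waiting before reasoning starts by 80% and time-level latency for the final answer by more than 60%; code is publicly available.
Full abstract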
Large language models (LLMs) have demonstrated remarkable capabilities in chain of thought (CoT) reasoning. However, the current LLM reasoning paradigm initiates thinking only after the entire input is available, which introduces unnecessary latency and weakens attention to earlier information in dynamic scenarios. Inspired by human cognition of thinking while reading, we first design a **streaming thinking** paradigm for LLMs, where reasoning unfolds in the order of input and further adjusts its depth once reading is complete.
We instantiate this paradigm with *StreamingThinker*, a framework that enables LLMs to think while reading through the integration of streaming CoT generation, streaming-constraint training, and streaming parallel inference.
Specifically, StreamingThinker employs streaming reasoning units with quality control for CoT generation, enforces order-preserving reasoning through streaming attention masks and position encoding, and leverages parallel KV caches that decouple input encoding from reasoning generation, thereby ensuring alignment and enabling true concurrency. We evaluate StreamingThinker on the Qwen3 model family across math reasoning, logical reasoning, and context-based QA reasoning tasks.
Experimental results show that StreamingThinker preserves performance comparable to batch thinking, while yielding an 80% reduction in token waiting before the onset of reasoning and a more than 60% reduction in time-level latency for producing the final answer, demonstrating the effectiveness of the streaming paradigm for LLM reasoning.
Code is publicly available at [this repository](https://github.com/EIT-NLP/StreamingLLM/tree/main/StreamingThinker).
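Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Reinforcement learning #Reasoning #Large Language Model #Agent
🎯 Motivation: LLMs often struggle with multi-step reasoning. For small open-source models, RLVR suffers from sparse reward signals and SFT from overfitting, so a more effective training framework is needed.
❓ Problem: RLVR fails when rewards are sparse and SFT overfits demonstrations; the paper proposes a framework that combines supervision with reinforcement learning to encourage flexible reasoning and more efficient learning.
🔍 Observations: RLVR breaks down when correct solutions are rarely sampled even after many attempts, and SFT tends toward rigid token-by-token imitation of long demonstrations, which limits the model on hard problems.
🛠️ Method: Proposes Supervised Reinforcement Learning (SRL), which reformulates problem solving as generating a sequence of logical actions. The model produces an internal reasoning monologue before each action and receives step-wise rewards based on the similarity between its actions and expert actions.
📊 Data and experiments: Validated on reasoning benchmarks and agentic software engineering tasks; initializing training with SRL and then refining with RLVR gives the strongest overall performance.
⭐ Contributions: Establishes SRL as a versatile training framework that lets small models learn problems previously unlearnable with SFT or RLVR, on both reasoning and software engineering tasks.
Full abstract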
Large Language Models (LLMs) often struggle with problems that require multi-step reasoning. For small-scale open-source models, Reinforcement Learning with Verifiable Rewards (RLVR) fails when correct solutions are rarely sampled even after many attempts, while Supervised Fine-Tuning (SFT) tends to overfit long demonstrations through rigid token-by-token imitation. To address this gap, we propose Supervised Reinforcement Learning (SRL), a framework that reformulates problem solving as generating a sequence of logical "actions". SRL trains the model to generate an internal reasoning monologue before committing to each action. It provides smoother rewards based on the similarity between the model's actions and expert actions extracted from the SFT dataset in a step-wise manner. This supervision offers richer learning signals even when all rollouts are incorrect, while encouraging flexible reasoning guided by expert demonstrations. As a result, SRL enables small models to learn challenging problems previously unlearnable by SFT or RLVR. Moreover, initializing training with SRL before refining with RLVR yields the strongest overall performance. Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.
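A step-wise similarity reward of the kind described above could look like the sketch below. The abstract does not say which similarity measure is used, so Python's `difflib.SequenceMatcher` ratio is an assumption, and `step_reward` is a hypothetical helper name.

```python
from difflib import SequenceMatcher

def step_reward(model_action: str, expert_action: str) -> float:
    """Similarity in [0, 1] between one model action and the expert action.

    Assumption: character-level sequence matching stands in for whatever
    similarity the actual method uses.
    """
    return SequenceMatcher(None, model_action, expert_action).ratio()

exact = step_reward("x = 2", "x = 2")    # identical action
close = step_reward("x = 2", "x = 3")    # nearly right, still rewarded
```

Unlike a binary verifiable reward, a partially correct action still earns a graded score, which is why such a signal remains informative even when every rollout ends with a wrong final answer.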
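Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#LLM #Reasoning
TL;DR: SwiReasoning is a training-free method for Pareto-superior reasoning LLMs that dynamically switches between explicit and latent thinking, with a switch count control to suppress overthinking.
🎯 Motivation: Explicit chain-of-thought reasoning is limited by the boundaries of natural language, whereas continuous reasoning in latent space carries richer information per step and is more token-efficient. Latent reasoning, however, still has accuracy and efficiency problems.
❓ Problem: Purely latent reasoning diffuses probability mass and introduces noise, which impedes convergence, and overthinking persists even in latent space. The paper proposes a framework to address both issues.
🔍 Observations: With latent reasoning alone, the search distribution fails to concentrate on a single high-confidence solution, and overthinking wastes tokens and lowers efficiency.
🛠️ Method: Proposes SwiReasoning, which dynamically switches between explicit and latent reasoning, guided by block-wise confidence estimated from entropy trends in next-token distributions, and caps the number of thinking-block switches to curb overthinking.
📊 Data and experiments: On mathematics, STEM, coding, and general benchmarks, average accuracy improves by 1.8%–3.1% across reasoning LLMs of different families and scales; under constrained budgets, average token efficiency improves by 57%–79%.
⭐ Contributions: Presents a training-free dynamic reasoning method that improves both accuracy and token efficiency of reasoning LLMs and lays groundwork for hybrid explicit/latent reasoning.
Full abstract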
Recent work shows that, beyond discrete reasoning through explicit chain-of-thought steps, which are limited by the boundaries of natural languages, large language models (LLMs) can also reason continuously in latent space, allowing richer information per step and thereby improving token efficiency. Despite this promise, latent reasoning still faces two challenges, especially in training-free settings: 1) purely latent reasoning broadens the search distribution by maintaining multiple implicit paths, which diffuses probability mass, introduces noise, and impedes convergence to a single high-confidence solution, thereby hurting accuracy; and 2) overthinking persists even without explicit text, wasting tokens and degrading efficiency. To address these issues, we introduce SwiReasoning, a training-free framework for LLM reasoning which features two key innovations: 1) SwiReasoning dynamically switches between explicit and latent reasoning, guided by block-wise confidence estimated from entropy trends in next-token distributions, to balance exploration and exploitation and promote timely convergence. 2) By limiting the maximum number of thinking-block switches, SwiReasoning curbs overthinking and improves token efficiency across varying problem difficulties. On widely used mathematics, STEM, coding, and general benchmarks, SwiReasoning consistently improves average accuracy by 1.8%–3.1% across reasoning LLMs of different model families and scales. Furthermore, under constrained budgets, SwiReasoning improves average token efficiency by 57%-79%, with larger gains as budgets tighten.
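The switching logic can be illustrated with a toy controller. The abstract only says that switching follows entropy trends and that the number of switches is capped; the specific rule below (falling entropy moves to explicit reasoning, rising entropy moves back to latent), the starting mode, and the function names are assumptions made for this sketch.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def plan_modes(step_distributions, max_switches=2):
    """Return the reasoning mode chosen at each step."""
    mode, switches, prev, modes = "latent", 0, None, []
    for probs in step_distributions:
        h = entropy(probs)
        if prev is not None and switches < max_switches:
            # Assumed rule: rising confidence (lower entropy) -> explicit,
            # falling confidence (higher entropy) -> latent.
            wanted = "explicit" if h < prev else "latent"
            if wanted != mode:
                mode, switches = wanted, switches + 1
        modes.append(mode)
        prev = h
    return modes

flat, peaked = [0.5, 0.5], [0.9, 0.1]
modes = plan_modes([flat, peaked, flat, peaked], max_switches=2)
```

In this run the controller switches twice and then stays put once the cap is reached, which is the mechanism the abstract credits with suppressing overthinking.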
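Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large Language Models #Model routing #Reinforcement Learning #Partially Observable MDPs (POMDPs)
TL;DR: We introduce TRIM, a routing approach for multi-step reasoning that selectively routes only critical steps to larger LLMs based on uncertainty and budget.
🎯 Motivation: Multi-step reasoning is vulnerable to cascading failures, where one incorrect step breaks the whole solution. Existing routing methods assign entire queries to one model and treat all steps as equally important, wasting expensive compute on routine steps.
❓ Problem: Design an efficient routing method that sends only the critical reasoning steps to a stronger, larger model, preventing cascading errors while improving inference efficiency.
🔍 Observations: With step-level intervention, the strong model handles only high-uncertainty steps, which cuts inference cost substantially while maintaining accuracy.
🛠️ Method: Proposes TRIM, which uses process reward models to identify erroneous steps and routes based on step-level uncertainty and budget constraints. Policies range from a simple threshold rule to more expressive ones that reason about long-horizon accuracy-cost trade-offs.
📊 Data and experiments: On MATH-500, the simplest thresholding strategy is 5x more cost-efficient than prior routing methods, and more advanced policies match the strong model's performance with 80% fewer expensive-model tokens. On harder benchmarks such as AIME, cost efficiency improves by up to 6x, and the methods generalize across math reasoning tasks.
⭐ Contributions: Introduces TRIM, a step-level routing framework that addresses cascading errors and inefficiency in multi-step reasoning and significantly reduces inference cost.
Full abstract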
Multi-step reasoning tasks like mathematical problem solving are vulnerable to cascading failures, where a single incorrect step leads to complete solution breakdown. Current LLM routing methods assign entire queries to one model, treating all reasoning steps as equal. We propose TRIM (Targeted routing in multi-step reasoning tasks), which routes only critical steps, those likely to derail the solution, to larger models while letting smaller models handle routine continuations. Our key insight is that targeted step-level interventions can fundamentally transform inference efficiency by confining expensive calls to precisely those steps where stronger models prevent cascading errors. TRIM operates at the step level: it uses process reward models to identify erroneous steps and makes routing decisions based on step-level uncertainty and budget constraints. We develop several routing strategies within TRIM, ranging from a simple threshold-based policy to more expressive policies that reason about long-horizon accuracy-cost trade-offs and uncertainty in step-level correctness estimates. On MATH-500, even the simplest thresholding strategy surpasses prior routing methods with 5x higher cost efficiency, while more advanced policies match the strong, expensive model's performance using 80% fewer expensive model tokens. On harder benchmarks such as AIME, TRIM achieves up to 6x higher cost efficiency. All methods generalize effectively across math reasoning tasks, demonstrating that step-level difficulty represents fundamental characteristics of reasoning.
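The simplest strategy mentioned above, a threshold-based policy under a budget, might look roughly like the sketch below. The function name, the score convention (higher means the small model's draft step is more likely correct), and the budget counted in steps are assumptions for illustration, not details taken from the paper.

```python
def route_steps(prm_scores, threshold, budget):
    """Choose which model writes each reasoning step.

    prm_scores: hypothetical process-reward-model scores in [0, 1] for the
        small model's draft of each step (higher = more likely correct).
    threshold:  drafts scoring below this value are treated as risky.
    budget:     maximum number of steps that may go to the large model.
    """
    choices = []
    remaining = budget
    for score in prm_scores:
        if score < threshold and remaining > 0:
            choices.append("large")  # regenerate this step with the strong model
            remaining -= 1
        else:
            choices.append("small")  # keep the cheap model's step
    return choices
```

For scores `[0.9, 0.3, 0.8, 0.2, 0.1]` with `threshold=0.5` and `budget=2`, only the second and fourth steps are escalated; the fifth is also risky but the budget is already spent, which is the kind of case the more expressive long-horizon policies are meant to handle better.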
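Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Process Reward Model #Tabular Reasoning #Tool Integration #Test-time Scaling
TL;DR: A tool-augmented process reward model that improves tabular reasoning at test time.
🎯 Motivation: Process reward models (PRMs) work well for test-time scaling of large reasoning models, but their ability to supervise tabular reasoning remains underexplored.
❓ Problem: Address the bottleneck of PRMs on table-specific operations such as sub-table retrieval and schema interaction, and strengthen tabular reasoning at test time.
🔍 Observations: Existing PRMs are built mainly for text-only reasoning steps; they are not adapted to table-specific operations or tool-based verification, which limits performance.
🛠️ Method: TaTToo, a table-grounded PRM framework that reasons explicitly over tabular reasoning steps and integrates tool-based verification for reward supervision. Training has two stages: cold-start supervised fine-tuning, then reinforcement learning with tool-grounded reward shaping.
📊 Data and experiments: A data curation pipeline builds over 60k high-quality step-level annotations. Evaluation covers 5 tabular reasoning benchmarks spanning numerical reasoning, fact-checking, and data analysis, where TaTToo outperforms strong baselines.
⭐ Contributions: TaTToo improves downstream policy LRMs by 30.9% at inference, surpasses the 72B Qwen-2.5-Math-PRM baseline with only 8B parameters, and generalizes across diverse test-time scaling strategies.
Abstract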
Process Reward Models (PRMs) have recently emerged as a powerful framework for enhancing the reasoning capabilities of large reasoning models (LRMs), particularly in the context of test-time scaling (TTS). However, their potential for supervising LRMs on tabular reasoning domains remains underexplored.
Through detailed empirical analyses, we identify that existing PRMs, though widely adopted for supervising text-only reasoning steps, struggle with table-specific operations such as sub-table retrieval and schema interaction, leading to critical performance bottlenecks. To address this limitation, we propose TaTToo, a novel table-grounded PRM framework that (i) reasons explicitly over tabular reasoning steps and (ii) integrates tool-based verification to provide precise reward supervision. Concretely, we first design a scalable data curation pipeline that constructs over 60k high-quality step-level annotations by integrating table verification rationales with tool-based executions. Building on the collected data, we train TaTToo with a dual-stage paradigm: cold-start supervised fine-tuning to capture tool-use reasoning patterns, followed by reinforcement learning with tool-grounded reward shaping to align our model with table-based verification.
We provide a comprehensive evaluation of the policy improvement induced by our newly designed PRM.
Across 5 challenging tabular reasoning benchmarks covering numerical reasoning, fact-checking, and data analysis, TaTToo improves downstream policy LRMs by 30.9% at inference, surpasses strong PRM baselines such as Qwen-2.5-Math-PRM-72B with only 8B parameters, and demonstrates strong generalizability across diverse TTS strategies.
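Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#LLM Reasoning
🎯 Motivation: Improve the quality of reasoning-trajectory selection while enabling test-time scaling through controllable thinking length.
❓ Problem: Reduce the high compute and annotation cost of relying on large reward models and process-level annotation.
🔍 Observations: Process-level annotation can be dropped: high-quality reasoning-trajectory selection can be learned directly from outcome rewards in a self-supervised way.
🛠️ Method: A Reflective Generative Model (RGM) with a unified interface in which the policy and the process reward model share a backbone, plus a self-supervised process reward model (SPRM). Trajectory scoring adds only 50M parameters.
📊 Data and experiments: On AIME24 and HMMT25, the 32B model outperforms OpenAI o3-mini (84.2 vs. 79.6 and 53.1 vs. 53.0).
⭐ Contributions: A new reflective generative form and a self-supervised process reward model; with only 50M extra parameters, RGM outperforms policy models paired with 72B reward models.
Abstract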
We introduce a new Reflective Generative Model (RGM), which obtains OpenAI o3-mini's performance via a novel Reflective Generative Form. This form focuses on high-quality reasoning trajectory selection and contains two novelties: 1) A unified interface for policy and process reward model: we share the backbone network and use task-specific heads for reasoning trajectory prediction and scoring, respectively, introducing only 50M extra parameters for trajectory scoring. 2) Eliminating the reliance on process-level annotation: we provide a self-supervised process reward model (SPRM), which can directly learn high-quality reasoning trajectory selection from the outcome reward. Equipped with the reflective generative form, RGM is naturally suitable for test-time scaling based on the controllable thinking length. Experiments show that our RGM, equipped with only 50M additional parameters in SPRM, outperforms policy models with 72B extra reward models, thereby enabling a 32B model to outperform OpenAI o3-mini on AIME24 (84.2 vs. 79.6) and HMMT25 (53.1 vs. 53.0).
Code is available at https://github.com/MetaStone-AI/XBai-o4.
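Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Test-time verification #Coverage #Approximate verifier #ROC
🎯 Motivation: Test-time scaling with verification shows promise for improving LLMs, but the role of the verifier and its imperfections is underexplored. A unified framework that quantifies the geometry of the interplay among coverage, region of convergence, and the sampling algorithm is missing.
❓ Problem: Characterize the interaction among the generator's coverage, the verifier's region of convergence (ROC), and the sampling algorithm's sub-optimality within one theoretical framework.
🔍 Observations: The sub-optimality-coverage curve has three regimes: a transport regime, where sub-optimality increases with coverage; a policy improvement regime, where sub-optimality may decrease with coverage, depending on the verifier's ROC; and a saturation regime, where sub-optimality plateaus.
🛠️ Method: Frame verifiable test-time scaling as a transport problem, and analyze how the computational complexities of sequential and batched sampling algorithms shape the trade-offs.
📊 Data and experiments: Experiments with Qwen, Llama, and Gemma models corroborate the theoretical findings.
⭐ Contributions: A unified framework for the verifier's role and the effect of its imperfections, the three sub-optimality regimes, and an analysis of how sampling algorithms affect the trade-offs.
Abstract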
While test-time scaling with verification has shown promise in improving the performance of large language models (LLMs), the role of the verifier and its imperfections remain underexplored. The effect of verification manifests through interactions of three quantities: (i) the generator’s *coverage*, (ii) the verifier’s *region of convergence* (ROC), and (iii) the sampling algorithm’s *sub-optimality*. Though recent studies capture subsets of these factors, a unified framework quantifying the geometry of their interplay is missing. We frame verifiable test-time scaling as a transport problem. This characterizes the interaction of coverage, ROC, and sub-optimality, and uncovers that the sub-optimality-coverage curve exhibits three regimes: a *transport regime*, where sub-optimality increases with coverage; a *policy improvement regime*, where sub-optimality may decrease with coverage, depending on the verifier’s ROC; and a *saturation regime*, where sub-optimality plateaus, unaffected by coverage. We further propose and analyze two classes of sampling algorithms, *sequential* and *batched*, and examine how their computational complexities shape these trade-offs. Empirical results with `Qwen`, `Llama`, and `Gemma` models corroborate our theoretical findings.
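Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large Language Model #Large Reasoning Model #Overthinking
🎯 Motivation: Reasoning models often overthink, spending compute on redundant reasoning steps. The paper studies internal bias elicited by the input question as a key trigger of this behavior.
❓ Problem: Identify how question-induced internal bias drives overthinking, and evaluate ways to mitigate it.
🔍 Observations: On reading a problem, the model immediately forms a preliminary guess without systematic reasoning. When that guess conflicts with its later reasoning, the model reflects excessively and wastes computation.
🛠️ Method: Two counterfactual interventions test causality: removing the input question, which reduces redundant reasoning, and manually injecting bias, which shifts overthinking accordingly. Interpretability experiments then examine the model's attention.
📊 Data and experiments: The association between internal bias and overthinking is validated across multiple models and diverse reasoning tasks. Several mitigation methods are evaluated, but the influence of internal bias persists under all conditions.
⭐ Contributions: Identifies internal bias as a key trigger of overthinking and suggests that excessive attention to the input question is a key mechanism through which it influences subsequent reasoning.
Abstract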
Reasoning models often exhibit overthinking, characterized by redundant reasoning steps. We identify *internal bias* elicited by the input question as a key trigger of such behavior. Upon encountering a problem, the model immediately forms a preliminary guess about the answer, which we term an internal bias since it may not be explicitly generated, and it arises without systematic reasoning. When this guess conflicts with its subsequent reasoning, the model tends to engage in excessive reflection, resulting in wasted computation. We validate the association between internal bias and overthinking across multiple models and diverse reasoning tasks. To demonstrate the causal relationship more rigorously, we conduct two counterfactual interventions, showing that removing the input question after the model reduces the redundant reasoning across various complex reasoning tasks, and manually injecting bias affects overthinking accordingly. Further interpretability experiments suggest that excessive attention to the input question serves as a key mechanism through which internal bias influences subsequent reasoning trajectories. Finally, we evaluated several methods aimed at mitigating overthinking, yet the influence of internal bias persisted under all conditions.
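Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Length Generalization #Large Language Models #Turing Machine #Chain-of-Thought #Computable Reasoning #Synthetic Dataset
TL;DR: Train LLMs to imitate Turing machines for universal and effective length generalization on a challenging synthetic dataset of 18 tasks across 8 algorithmic classes.
🎯 Motivation: Length generalization is a core challenge for Transformer-based LLMs on long-sequence problems. Existing methods are mostly task-specific and hard to generalize.
❓ Problem: Provide a general solution by imitating how a Turing machine solves computable problems, improving generalization on long-sequence reasoning tasks.
🔍 Observations: Key Turing-machine concepts, such as read-and-write behavior and memory access, rather than human-like thinking styles, are what matter for length generalization.
🛠️ Method: Turing Machine Imitation Learning (TAIL) synthesizes chain-of-thought data that imitates a Turing machine's execution, linearly expanding the reasoning into atomic states and adding an explicit memory fetch mechanism to ease dynamic, long-range data access.
📊 Data and experiments: A synthetic dataset covering 8 classes of algorithms and 18 tasks is built to test universality and reliability. With TAIL, Qwen2.5-7B surpasses previous methods and DeepSeek-R1.
⭐ Contributions: The TAIL framework, which significantly improves LLM length generalization, shows the value of Turing-machine concepts for this purpose, and points to learning LLM reasoning from synthetic data as a research direction.
Abstract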
Length generalization, the ability to solve problems of longer sequences than those observed during training, poses a core challenge of Transformer-based large language models (LLMs).
Although existing studies have predominantly focused on data-driven approaches for particular arithmetic operations or symbolic manipulation tasks, these approaches tend to be task-specific with limited performance on individual tasks.
To pursue a more general solution, this paper focuses on a broader case of reasoning problems that are *computable*, *i.e.*, problems that algorithms can solve, thus can be solved by the Turing machine, which operates over inputs of unbounded length.
From this perspective, this paper proposes **T**uring m**A**chine **I**mitation **L**earning (**TAIL**) to improve the length generalization ability of LLMs.
TAIL uses computer programs to directly synthesize chain-of-thought (CoT) data that imitate the execution process of a Turing machine, *linearly* expanding the reasoning steps into *atomic* states to alleviate shortcut pattern learning and adding an explicit *memory* fetch mechanism to reduce the difficulties of dynamic and long-range data access.
To validate the universality and reliability of TAIL, we construct a challenging synthetic dataset covering 8 classes of algorithms and 18 tasks.
Without bells and whistles, TAIL significantly improves the length generalization ability as well as the performance of Qwen2.5-7B in individual tasks using only synthetic data, surpassing previous methods and DeepSeek-R1.
The experimental results reveal that the key concepts in the Turing machine, instead of the human-like thinking styles, are indispensable for TAIL for length generalization, through which the model exhibits read-and-write behaviors consistent with the properties of the Turing machine in their attention layers. This work provides a promising direction for future research in the learning of LLM reasoning from synthetic data.
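To make the idea of program-synthesized, Turing-machine-style CoT more concrete, here is a toy sketch for one very simple task (parity of a bit string). The trace format, the function name, and the choice of task are illustrative assumptions; the paper's actual data format is not given in this abstract.

```python
def parity_trace(bits):
    """Synthesize a step-by-step trace for the parity of a bit string.

    Each input symbol produces two atomic lines: an explicit memory fetch
    that restates the symbol under the head, and a state transition.
    The trace therefore grows linearly with the input length.
    """
    state = "even"
    lines = []
    for head, symbol in enumerate(bits):
        # Explicit memory fetch: copy the needed symbol next to where it is used.
        lines.append(f"fetch tape[{head}] = {symbol}")
        if symbol == "1":
            new_state = "odd" if state == "even" else "even"
        else:
            new_state = state
        # Atomic transition: one read, one state update, one head move.
        lines.append(f"state {state} + read {symbol} -> state {new_state}; move right")
        state = new_state
    lines.append(f"halt: answer = {state}")
    return lines
```

Calling `parity_trace("1011")` yields nine lines (two per symbol plus the halt line) and ends with `halt: answer = odd`. Because every step depends only on the current state and the fetched symbol, the same trace pattern applies unchanged to longer inputs.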
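Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#LLM Reasoning #RL for LLMs #Reasoning Models #Scalable Reasoning #Test-Time Scaling
TL;DR: Delethink enables reasoning LLMs to scale linearly in compute with constant memory by chunking traces instead of carrying quadratic context. It delivers up to 40% faster speed and 70% lower memory with no performance loss.
🎯 Motivation: Compute for reasoning LLMs grows quadratically with context length, which makes reinforcement learning with verifiable rewards (RLVR) and test-time scaling prohibitively expensive.
❓ Problem: Replace the quadratic cost of one long reasoning trace with linear compute and constant memory.
🔍 Observations: Prior approaches (pruning, summarization, multi-stage training) shorten reasoning traces but remain bound to quadratic cost.
🛠️ Method: Delethink reasons in a sequence of chunks. Each chunk conditions only on a fixed number of prior tokens, which act as a Markovian state, and the rest is deleted. This preserves continuity while compute scales linearly.
📊 Data and experiments: Delethink applies directly to off-the-shelf reasoning models from 1.5B to 30B parameters with no loss in performance. Trained on the DeepScaleR dataset, a 1.5B model matches a standard long chain-of-thought (LongCoT) approach while reasoning 40% faster with 70% less memory.
⭐ Contributions: The Markovian Thinking Paradigm decouples reasoning length from context length, improving efficiency and opening a path to reasoning LLMs with linear compute and constant memory.
Abstract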
Reasoning LLMs suffer from quadratic compute growth as their context length increases, making reinforcement learning with verifiable rewards (RLVR) and test-time scaling prohibitively expensive. Prior work has tried to lighten the computational burden by shortening reasoning traces through pruning, summarization, or multi-stage training, but these methods remain bound to quadratic costs. We introduce Delethink, a thinking algorithm that realizes the Markovian Thinking Paradigm. Instead of producing one long monolithic reasoning trace, Delethink thinks in a sequence of chunks, the Delethink trace. Each chunk continues reasoning by referring only to a fixed number of prior tokens, which functions as a Markovian state sufficient for progressing reasoning, while deleting the rest. This preserves continuity without carrying the quadratic baggage. As a result, compute scales linearly and peak memory remains constant. In experiments, we show that Delethink can be applied directly to off-the-shelf reasoning models ranging from 1.5B to 30B parameters, with no loss in performance. Extended reasoning becomes possible under fixed memory and linear compute, while enabling efficient RL training on new tasks. On the DeepScaleR dataset, Delethink trains R1-Distill-Qwen-1.5B to the same benchmark performance as a standard long chain-of-thought (LongCoT) approach, where both models generate up to 24k thinking tokens. The difference is efficiency. Delethink reasons 40% faster with 70% less memory footprint. By decoupling reasoning length from context length, the Markovian Thinking paradigm opens the door to next-generation reasoning LLMs that can scale to millions of tokens with linear compute and constant memory.
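The quadratic-versus-linear claim can be checked with a back-of-the-envelope cost model. The sketch below counts attended token pairs only; it ignores the prompt, assumes the carried tokens are re-encoded at the start of each chunk, and uses hypothetical chunk and carry sizes, so it illustrates the scaling argument rather than the paper's measurements.

```python
def longcot_cost(n):
    """Token pairs attended when n tokens are generated in one growing context."""
    return n * (n + 1) // 2

def chunked_cost(n, chunk_new, carry):
    """Token pairs attended when generating n tokens chunk by chunk.

    Every chunk after the first starts from `carry` re-encoded tokens and
    then generates up to `chunk_new` new tokens, so the context never
    exceeds carry + chunk_new tokens.
    """
    total = 0
    remaining = n
    first = True
    while remaining > 0:
        new = min(chunk_new, remaining)
        start = 0 if first else carry
        # Positions 1..start+new each attend to themselves and all earlier ones.
        total += sum(range(1, start + new + 1))
        remaining -= new
        first = False
    return total
```

Under this model every additional chunk adds the same constant cost, so the total grows linearly in the number of generated tokens, while `longcot_cost` grows quadratically. For 24,000 generated tokens with hypothetical settings `chunk_new=4000` and `carry=2000`, the chunked count is roughly a third of the single-context count.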
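Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Reasoning #Chain-of-thought #Mathematical reasoning
🎯 Motivation: Chain-of-thought (CoT) prompting is the de facto standard for eliciting reasoning from LLMs, but the mechanisms behind its success remain unclear.
❓ Problem: Analyze CoT traces on mathematics problems, quantify how much each part contributes to a correct final answer, and examine whether that contribution transfers across models.
🔍 Observations: The potential is often strongly non-monotonic, shows sharp but sometimes hard-to-interpret spikes, and reveals lucky guesses where the model reaches the correct answer without relevant justification.
🛠️ Method: Introduce the potential, which measures how much a given part of the CoT raises the likelihood of a correct completion, and test transferability by giving a weaker model a partial CoT from a stronger one.
📊 Data and experiments: Experiments use competition-level mathematics questions. As little as 20% of a partial CoT can unlock problems that the weaker model previously could not solve.
⭐ Contributions: Characterizes how the potential evolves along CoT traces, quantifies each part's contribution to the final answer, and shows that a large part of the mechanics underpinning CoT is transferable across models.
Abstract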
Chain-of-thought (CoT) prompting is a de-facto standard technique to elicit reasoning-like responses from large language models (LLMs), allowing them to spell out individual steps before giving a final answer. While the resemblance to human-like reasoning is undeniable, the driving forces underpinning the success of CoT reasoning still remain largely unclear. In this work, we perform an in-depth analysis of CoT traces originating from competition-level mathematics questions, with the aim of better understanding how, and which parts of CoT actually contribute to the final answer. To this end, we introduce the notion of a *potential*, quantifying how much a given part of CoT increases the likelihood of a correct completion. Upon examination of reasoning traces through the lens of the potential, we identify surprising patterns including (1) its often strong non-monotonicity (due to reasoning tangents), (2) very sharp but sometimes tough to interpret spikes (reasoning insights and jumps) as well as (3) at times lucky guesses, where the model arrives at the correct answer without providing any relevant justifications before. While some of the behaviours of the potential are readily interpretable and align with human intuition (such as insights and tangents), others remain difficult to understand from a human perspective. To further quantify the reliance of LLMs on reasoning insights, we investigate the notion of CoT transferability, where we measure the potential of a weaker model under the partial CoT from another, stronger model. Indeed aligning with our previous results, we find that as little as 20% of partial CoT can "unlock" the performance of the weaker model on problems that were previously unsolvable for it, highlighting that a large part of the mechanics underpinning CoT are transferable.
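One plausible way to estimate such a potential is by Monte Carlo sampling: fix a CoT prefix, sample several completions, and record the fraction that end in the correct answer. The sketch below assumes this estimator; `sample_answer` is a hypothetical callable standing in for a model call, and the exact definition used in the paper may differ.

```python
def potential(prefix_steps, sample_answer, correct_answer, num_samples=16):
    """Fraction of sampled completions that reach the correct answer.

    sample_answer: hypothetical callable that takes the list of CoT steps
        written so far and returns the final answer of one sampled completion.
    """
    hits = sum(sample_answer(prefix_steps) == correct_answer for _ in range(num_samples))
    return hits / num_samples

def potential_curve(cot_steps, sample_answer, correct_answer, num_samples=16):
    """Potential after 0, 1, ..., len(cot_steps) steps of the CoT."""
    return [
        potential(cot_steps[:k], sample_answer, correct_answer, num_samples)
        for k in range(len(cot_steps) + 1)
    ]
```

A sharp jump between two consecutive entries of the curve corresponds to what the abstract calls a reasoning insight, and a drop corresponds to a tangent. Transferability can be probed with the same function by passing a stronger model's steps as `cot_steps` and a weaker model as `sample_answer`.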
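Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large Language Model #Reasoning
🎯 Motivation: Majority voting works for close-ended question answering but is not directly applicable to open-ended reasoning such as code generation and web-based deep research, so a more general approach is needed.
❓ Problem: Provide a decoding strategy for open-ended tasks, where majority voting is ill-defined, that still produces a single coherent output.
🔍 Observations: A "majority" over complete solutions is ill-defined for open-ended tasks, yet aggregating multiple parallel reasoning traces still improves performance.
🛠️ Method: THINKMERGE runs K parallel reasoning traces and averages their next-token logits at synchronization points to produce one output. It remains compatible with standard decoding techniques such as Top-p/Top-k.
📊 Data and experiments: THINKMERGE matches or surpasses majority voting on the close-ended AIME and GPQA benchmarks and improves pass@1 on LiveCodeBench (hard) for DeepCoder-14B-Preview and Qwen3-8B.
⭐ Contributions: A training-free, plug-and-play decoding strategy that improves open-ended reasoning, with gains in code generation and for web-based deep-research agents.
Abstract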
Majority voting has proven effective for close-ended question answering by aggregating parallel reasoning traces. However, it is not directly applicable to open-ended reasoning, such as code generation and web-based deep research, where a “majority” over complete solutions is ill-defined. We introduce THINKMERGE, a training-free, plug-and-play decoding strategy that runs K parallel reasoning traces and averages their next-token logits at synchronization points to produce a single coherent output. THINKMERGE integrates seamlessly with vLLM/SGLang and remains compatible with standard decoding techniques such as Top-p/Top-k. Empirically, it matches or surpasses majority voting on AIME and GPQA, while delivering consistent gains on open-ended coding tasks: on LiveCodeBench (hard), pass@1 improves by +8.28% for DeepCoder-14B-Preview and +7.58% for Qwen3-8B. Beyond code, we further show that THINKMERGE improves web-based deep-research agents (e.g., WebSailor-7B/32B) across GAIA, BrowseComp-en/zh, and XbenchDeepSearch. These results demonstrate that parallel test-time scaling can benefit open-ended reasoning without relying on voting over complete outputs.
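The core merge step, averaging next-token logits from K traces at a synchronization point, can be sketched as follows. Greedy selection over the averaged logits is an assumption made to keep the example deterministic; the abstract notes that sampling schemes such as Top-p/Top-k can be applied instead, and how synchronization points are chosen is not shown here.

```python
def merge_next_token(logits_per_trace):
    """Average next-token logits across K traces and pick the top token.

    logits_per_trace: list of K lists, one per reasoning trace, each holding
        one logit per vocabulary entry for the next output position.
    Returns (token_id, averaged_logits).
    """
    k = len(logits_per_trace)
    vocab_size = len(logits_per_trace[0])
    averaged = [
        sum(trace[v] for trace in logits_per_trace) / k
        for v in range(vocab_size)
    ]
    token_id = max(range(vocab_size), key=averaged.__getitem__)
    return token_id, averaged
```

Because the merge happens token by token, the K traces jointly produce one coherent output, so no vote over complete solutions is needed.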
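Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#LLM #Personalization #Reasoning
🎯 Motivation: Current LLMs mostly optimize for population-level preferences and overlook individual users. They struggle to reason over implicit preferences, especially in long-form generation.
❓ Problem: Overcome the limited reasoning of existing methods in personalized long-form generation with a framework that interleaves reasoning and generation.
🔍 Observations: "Think-then-generate" methods rely on static one-shot reasoning that must capture all relevant information for the full response, which makes learning difficult and limits adaptability to evolving content.
🛠️ Method: FlyThinker uses a separate reasoning model that generates latent token-level reasoning in parallel and fuses it into the generation model to guide the response dynamically. The reasoning model depends only on previous responses, not on its own prior outputs, which preserves training parallelism.
📊 Data and experiments: Extensive experiments on real-world benchmarks show better personalized generation while keeping training and inference efficient.
⭐ Contributions: A "think-while-generating" framework for personalized long-form generation that improves generation quality while retaining training and inference efficiency.
Abstract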
Preference alignment has enabled large language models (LLMs) to better reflect human expectations, but current methods mostly optimize for population-level preferences, overlooking individual users. Personalization is essential, yet early approaches—such as prompt customization or fine-tuning—struggle to reason over implicit preferences, limiting real-world effectiveness. Recent “think-then-generate” methods address this by reasoning before response generation. However, they face challenges in long-form generation: their static one-shot reasoning must capture all relevant information for the full response generation, making learning difficult and limiting adaptability to evolving content. To address this issue, we propose **FlyThinker**, an efficient “think-while-generating” framework for personalized long-form generation. FlyThinker employs a separate reasoning model that generates latent token-level reasoning in parallel, which is fused into the generation model to dynamically guide response generation. This design enables reasoning and generation to run concurrently, ensuring inference efficiency. In addition, the reasoning model is designed to depend only on previous responses rather than its own prior outputs, which preserves training parallelism across different positions—allowing all reasoning tokens for training data to be produced in a single forward pass like standard LLM training, ensuring training efficiency. Extensive experiments on real-world benchmarks demonstrate that FlyThinker achieves better personalized generation while keeping training and inference efficiency.
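Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#large language model #reasoning #latent reasoning #chain of thought
🎯 Motivation: LLMs have shifted from explicit chain-of-thought to more efficient latent reasoning, but latent reasoning is brittle on challenging, out-of-distribution tasks and needs to be made more robust.
❓ Problem: Improve latent reasoning on hard and out-of-distribution tasks at test time, without updating model parameters and without expensive text generation during optimization.
🔍 Observations: Latent reasoning represents intermediate thoughts as vectors rather than text and tends to fail exactly where robust reasoning matters most.
🛠️ Method: Latent Thought Policy Optimization (LTPO), a parameter-free framework that treats intermediate latent thought vectors as dynamic parameters optimized per problem instance. It uses an online policy gradient method guided by an intrinsic, confidence-based reward computed from the frozen LLM's own output distributions, with no external supervision.
📊 Data and experiments: On five reasoning benchmarks, LTPO matches or surpasses strong baselines on standard tasks. On the highly challenging AIME benchmarks, where existing latent reasoning baselines collapse to near-zero accuracy, it delivers substantial improvements.
⭐ Contributions: A framework that strengthens reasoning robustness without any model parameter updates and shows a distinct advantage on complex reasoning tasks.
Abstract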
Recent advancements in Large Language Models (LLMs) have shifted from explicit Chain-of-Thought (CoT) reasoning to more efficient latent reasoning, where intermediate thoughts are represented as vectors rather than text. However, latent reasoning can be brittle on challenging, out-of-distribution tasks where robust reasoning is most critical. To overcome these limitations, we introduce Latent Thought Policy Optimization (LTPO), a parameter-free framework that enhances LLM reasoning entirely at test time, without requiring model parameter updates. LTPO treats intermediate latent "thought" vectors as dynamic parameters that are actively optimized for each problem instance. It employs an online policy gradient method guided by an intrinsic, confidence-based reward signal computed directly from the frozen LLM's own output distributions, eliminating the need for external supervision or expensive text generation during optimization. Extensive experiments on five reasoning benchmarks show that LTPO not only matches or surpasses strong baselines on standard tasks but also demonstrates remarkable robustness where others fail. Most notably, on highly challenging AIME benchmarks where existing latent reasoning baselines collapse to near-zero accuracy, LTPO delivers substantial improvements, showcasing a unique capability for complex reasoning.
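Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large Language Models #Reasoning #Reinforcement Learning with Verifiable Rewards #Long Chain-of-Thought
TL;DR: We propose Thinking-Free Policy Initialization, a stage prior to RL that can accelerate RL convergence to a higher performance ceiling and naturally yield reasoning-efficient models
🎯 Motivation: Reinforcement learning with verifiable reward (RLVR) needs extremely long contexts during training, which is computationally expensive, and multi-stage training only partially mitigates the cost.
❓ Problem: Provide a stage before RL, TFPI, that accelerates RL convergence, raises the performance ceiling, and lowers compute requirements.
🔍 Observations: Starting multi-stage training with overly short contexts often causes irreversible performance degradation and fails to reduce overall training compute significantly.
🛠️ Method: Thinking-Free Policy Initialization (TFPI) applies a ThinkFree operation that discards the thinking content by directly appending </think>, which reduces token usage. Training on ThinkFree-adapted inputs improves performance and lowers token consumption, even in the original slow-thinking mode.
📊 Data and experiments: Across benchmarks including AIME24 and LiveCodeBench, a 4B model trained with TFPI alone reaches 89.0% and 65.5% accuracy using less than 4K H20 hours.
⭐ Contributions: A simple yet effective method that accelerates RL convergence, achieves a higher performance ceiling, and yields more token-efficient reasoning models without specialized rewards or complex training designs.
Abstract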
Reinforcement Learning with Verifiable Reward (RLVR) effectively solves complex tasks but demands extremely long context lengths during training, leading to substantial computational costs. While multi-stage training can partially mitigate this, starting with overly short contexts often causes irreversible performance degradation, ultimately failing to reduce overall training compute significantly. In this paper, we introduce **T**hinking-**F**ree **P**olicy **I**nitialization (**TFPI**), a simple yet effective adaptation to RLVR that bridges long Chain-of-Thought (CoT) distillation and standard RLVR. TFPI employs a simple *ThinkFree* operation, explicitly discarding the thinking content via a direct *</think>* append, to reduce token usage during inference. Training with *ThinkFree*-adapted inputs improves performance and lowers token consumption, even in the original slow-thinking mode. Extensive experiments across various benchmarks have shown that TFPI accelerates RL convergence, achieves a higher performance ceiling, and yields more token-efficient reasoning models without specialized rewards or complex training designs. With TFPI only, we train a 4B model to reach 89.0% accuracy on AIME24 and 65.5% on LiveCodeBench using less than 4K H20 hours.
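Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Reasoning #Vision-Language Models #Contrasting
🎯 Motivation: LLMs show strong reasoning, and self-improving techniques can refine reasoning paths to boost language tasks. Extending these language-based techniques to vision language models (VLMs) is hard because visual hallucinations in reasoning paths cannot be effectively verified or rectified.
❓ Problem: Make self-improvement reliable for VLMs despite visual hallucinations in reasoning paths, using visual contrast to mitigate them.
🔍 Observations: Given a contrastive VQA pair (two visually similar images with synonymous questions), VLMs identify relevant visual cues more precisely than when given a single VQA sample.
🛠️ Method: Visual Contrastive Self-Taught Reasoner (VC-STaR), a self-improving framework that uses visual contrast to reduce hallucinations in model-generated rationales. Contrastive pairs are curated by multi-modal similarity and used to generate rationales, producing a new visual reasoning dataset, VisCoR-55K.
📊 Data and experiments: A diverse suite of VQA datasets is collected and turned into contrastive pairs to build VisCoR-55K. Extensive experiments show that VC-STaR outperforms existing self-improving approaches and also surpasses models fine-tuned on state-of-the-art visual reasoning datasets.
⭐ Contributions: The VC-STaR framework, which mitigates hallucination through visual contrast; the VisCoR-55K dataset, with code, data, and trained models to be released upon acceptance; and improved reasoning for various VLMs.
Abstract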
Reasoning has emerged as a key capability of large language models. In linguistic tasks, this capability can be enhanced by self-improving techniques that refine reasoning paths for subsequent fine-tuning. However, extending these language-based self-improving approaches to vision language models (VLMs) presents a unique challenge: visual hallucinations in reasoning paths cannot be effectively verified or rectified. Our solution starts with a key observation about visual contrast: when presented with a contrastive VQA pair, i.e., two visually similar images with synonymous questions, VLMs identify relevant visual cues more precisely compared with when given a single VQA sample. Motivated by this observation, we propose Visual Contrastive Self-Taught Reasoner (VC-STaR), a novel self-improving framework that leverages visual contrast to mitigate hallucinations in model-generated rationales. We collect a diverse suite of VQA datasets, curate contrastive pairs according to multi-modal similarity, and generate rationales using VC-STaR. Consequently, we obtain a new visual reasoning dataset, VisCoR-55K, which is then used to boost the reasoning capability of various VLMs through supervised finetuning. Extensive experiments show that VC-STaR not only outperforms existing self-improving approaches but also surpasses models finetuned on the SoTA visual reasoning datasets, demonstrating that the inherent contrastive ability of VLMs can bootstrap their own visual reasoning. The code, dataset and trained models will be released upon acceptance.
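Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Diffusion Language Models #Semantic Entropy #Self-Consistency #Reinforcement Learning
TL;DR: We find that diffusion language models hide useful answers mid-generation and introduce simple voting and reinforcement learning methods that exploit the temporal dynamics to boost accuracy.
🎯 Motivation: Diffusion large language models (dLLMs) generate text by iterative denoising, yet current decoding strategies discard rich intermediate predictions in favor of the final output.
❓ Problem: Correct answers that appear mid-generation are overwritten by later denoising steps; the goal is to exploit temporal consistency to recover them.
🔍 Observations: Temporal oscillation: correct answers often emerge in intermediate steps but are overwritten in later denoising steps.
🛠️ Method: Two complementary methods. Temporal Self-Consistency Voting is a training-free decoding strategy that aggregates predictions across denoising steps and selects the most consistent output. Temporal Consistency Reinforcement is a post-training method that uses Temporal Semantic Entropy (TSE) as a reward signal to encourage stable generations.
📊 Data and experiments: Benchmarks include Countdown, GSM8K, MATH500, and SVAMP. The negative TSE reward alone gives an average improvement of 24.7% on Countdown; combined with the accuracy reward, absolute gains are 2.0% on GSM8K, 4.3% on MATH500, 6.6% on SVAMP, and 25.3% on Countdown.
⭐ Contributions: Reveals the untapped potential of temporal dynamics in dLLMs and offers two simple tools, temporal voting and temporal consistency reinforcement, to harness them.
Abstract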
Diffusion large language models (dLLMs) generate text through iterative denoising, yet current decoding strategies discard rich intermediate predictions in favor of the final output. Our work here reveals a critical phenomenon, temporal oscillation, where correct answers often emerge in the middle process, but are overwritten in later denoising steps. To address this issue, we introduce two complementary methods that exploit temporal consistency: 1) Temporal Self-Consistency Voting, a training-free, test-time decoding strategy that aggregates predictions across denoising steps to select the most consistent output; and 2) a post-training method termed Temporal Consistency Reinforcement, which uses Temporal Semantic Entropy (TSE), a measure of semantic stability across intermediate predictions, as a reward signal to encourage stable generations. Empirical results across multiple benchmarks demonstrate the effectiveness of our approach. Using the negative TSE reward alone, we observe a remarkable average improvement of 24.7% on the Countdown dataset over an existing dLLM. Combined with the accuracy reward, we achieve absolute gains of 2.0% on GSM8K, 4.3% on MATH500, 6.6% on SVAMP, and 25.3% on Countdown, respectively. Our findings underscore the untapped potential of temporal dynamics in dLLMs and offer two simple yet effective tools to harness them.
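A minimal sketch of the two quantities described above follows. It assumes unweighted voting over the answers decoded at each denoising step and treats exact string equality as semantic equivalence; the paper's actual weighting scheme and semantic clustering are not specified in this abstract, so both choices are simplifications.

```python
import math
from collections import Counter

def temporal_vote(step_answers):
    """Return the answer that appears most often across denoising steps.

    step_answers: answers decoded from intermediate predictions, in step order.
    Ties are resolved in favor of the answer that appeared first.
    """
    return Counter(step_answers).most_common(1)[0][0]

def temporal_entropy(step_answers):
    """Entropy (in nats) of the empirical answer distribution over steps.

    Low values mean the answer is stable across denoising steps.
    """
    counts = Counter(step_answers)
    total = len(step_answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

For intermediate answers `["12", "15", "15", "15", "12", "9"]`, standard decoding would return the final `"9"`, while the vote returns `"15"`. A reward of the form `-temporal_entropy(...)` would favor generations whose answer stops oscillating, which is the intuition behind using a negative TSE reward.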
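Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Reasoning models #efficient reasoning #LoRA #RLVR
TL;DR: Our work reveals the surprising effectiveness of efficient RL reasoning via LoRA, and provides hypotheses, supported by experiments, about why it works so well.
🎯 Motivation: Ask how cost-effectively strong reasoning abilities can be achieved in language models through reinforcement learning.
❓ Problem: Build tiny reasoning models that, with minimal resources, are competitive with and sometimes surpass state-of-the-art RL reasoning models built on the same base model.
🔍 Observations: The authors hypothesize that LoRA rapidly adapts the model to the structural format of reasoning rewarded by RL while largely preserving the base model's underlying knowledge.
🛠️ Method: Apply low-rank adaptation (LoRA) during reinforcement learning to a small 1.5B-parameter base model.
📊 Data and experiments: Experiments cover multiple open-source reasoning datasets and various ablation settings, all starting from a single fixed set of hyperparameters. The best model reaches 43.33% zero-shot Pass@1 on AIME24.
⭐ Contributions: The Tina family of tiny reasoning models, which achieves a reasoning performance increase of more than 20% at a cost of only $9 USD (an estimated 260x reduction), with all code, training logs, model weights, and checkpoints open-sourced.
Abstract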
How cost-effectively can strong reasoning abilities be achieved in language models? Driven by this question, we present Tina, a family of tiny reasoning models achieved with high cost-efficiency. Tina shows that substantial reasoning performance can be developed using only minimal resources, by applying low-rank adaptation (LoRA) during reinforcement learning (RL), to an already tiny 1.5B parameter base model. This minimalist approach produces models that are competitive with, and sometimes surpass, SOTA RL reasoning models built upon the same base model. Crucially, this is achieved at a tiny fraction of the computational cost employed by existing models. In fact, the best Tina model achieves a >20% reasoning performance increase and 43.33% zero-shot Pass@1 accuracy on AIME24, at only $9 USD cost (i.e., an estimated 260x reduction). Our work reveals the surprising effectiveness of efficient RL reasoning via LoRA. We validate this across multiple open-source reasoning datasets and various ablation settings starting with a single, fixed set of hyperparameters. Furthermore, we explore the hypothesis that this effectiveness and efficiency stem from LoRA rapidly adapting the model to the structural format of reasoning rewarded by RL, while largely preserving the base model's underlying knowledge. In service of accessibility and open research, we fully open-source all code, training logs, model weights, and checkpoints.
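Part of why LoRA keeps training cheap is simple arithmetic: a rank-r update to a d_out × d_in weight trains r·(d_out + d_in) parameters instead of d_out·d_in. The sketch below shows that count and the standard LoRA forward pass y = Wx + (alpha / r)·B(Ax) in plain Python; the layer sizes, rank, and alpha in the example are hypothetical and not Tina's actual configuration.

```python
def full_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    """Trainable parameters for LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)

def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x) with the frozen weight W."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]
```

For a hypothetical 2048 × 2048 projection with rank 16, LoRA trains 65,536 parameters instead of 4,194,304, a 64x reduction for that layer. Only `A` and `B` receive gradients during RL, and `W` stays frozen, which is consistent with the hypothesis that the base model's knowledge is largely preserved.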
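Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large language model #Math Reasoning #Reinforcement Learning
🎯 Motivation: Reinforcement learning with verifiable rewards has advanced the reasoning of LLMs, but how to explicitly steer training toward exploration or exploitation remains an open problem.
❓ Problem: Introduce Token Hidden Reward (THR), a token-level metric that quantifies each token's influence on the likelihood of correct responses, and use it to steer training between exploration and exploitation.
🔍 Observations: Training dynamics are dominated by a small subset of tokens with high absolute THR. Tokens with positive THR strengthen confidence in correct outputs, favoring exploitation; tokens with negative THR preserve probability mass for alternative outputs, enabling exploration.
🛠️ Method: A THR-guided reweighting algorithm that modulates GRPO's learning signals. Amplifying positive-THR tokens and weakening negative-THR tokens biases training toward exploitation; the reverse biases it toward exploration.
📊 Data and experiments: On diverse math reasoning benchmarks, amplifying positive THR improves greedy-decoding accuracy, while the reverse strategy yields consistent gains in Pass@K accuracy. The algorithm also integrates with other RL objectives such as GSPO and generalizes across architectures including Llama.
⭐ Contributions: Establishes THR as a fine-grained mechanism for dynamically controlling exploration and exploitation in RL-tuned LLMs, providing a new tool for targeted fine-tuning in reasoning-intensive applications.
Abstract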
Reinforcement learning with verifiable rewards has significantly advanced the reasoning capabilities of large language models, yet how to explicitly steer training toward exploration or exploitation remains an open problem. We introduce Token Hidden Reward (THR), a token-level metric that quantifies each token’s influence on the likelihood of correct responses under Group Relative Policy Optimization (GRPO). We find that training dynamics are dominated by a small subset of tokens with high absolute THR values. Most interestingly, tokens with positive THR strengthen confidence in correct outputs, thus favoring exploitation, while tokens with negative THR preserve probability mass for alternative outputs, enabling exploration. This insight suggests a natural intervention: a THR-guided reweighting algorithm that modulates GRPO’s learning signals to explicitly bias training toward exploitation or exploration. We validate the efficacy of this algorithm on diverse math reasoning benchmarks. By amplifying tokens with positive THR value and weakening negative ones, our algorithm improves greedy-decoding accuracy, favoring exploitation. The reverse strategy yields consistent gains in Pass@K accuracy, favoring exploration. We further demonstrate that our algorithm integrates seamlessly with other RL objectives such as GSPO and generalizes across architectures including Llama. These findings establish THR as a principled and fine-grained mechanism for dynamically controlling exploration and exploitation in RL-tuned LLMs, providing new tools for targeted fine-tuning in reasoning-intensive applications.
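The reweighting idea described in this abstract can be illustrated with a minimal sketch. It assumes the per-token THR values have already been computed (the abstract does not give the formula), and the function name, the `alpha` strength parameter, and the `1 ± alpha` scaling rule are hypothetical choices for illustration, not the paper's exact algorithm.

```python
def reweight_token_signals(advantages, thr_values, alpha=0.5, mode="exploit"):
    """Scale per-token learning signals by the sign of their THR value.

    advantages: per-token learning signals (e.g. GRPO advantages).
    thr_values: per-token THR scores, assumed to be precomputed.
    alpha:      hypothetical reweighting strength in [0, 1).
    mode:       "exploit" amplifies positive-THR tokens and weakens
                negative-THR tokens; "explore" does the reverse.
    """
    if mode not in ("exploit", "explore"):
        raise ValueError("mode must be 'exploit' or 'explore'")
    up, down = 1.0 + alpha, 1.0 - alpha
    reweighted = []
    for advantage, thr in zip(advantages, thr_values):
        if thr > 0:
            scale = up if mode == "exploit" else down
        elif thr < 0:
            scale = down if mode == "exploit" else up
        else:
            scale = 1.0  # tokens with zero THR are left unchanged
        reweighted.append(advantage * scale)
    return reweighted
```

Switching `mode` flips which sign of THR is amplified, which mirrors the abstract's claim that one strategy favors greedy-decoding accuracy and the reverse favors Pass@K.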
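Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#large language model #reasoning #chain of thoughts
TL;DR:We propose Set Supervised Fine-Tuning (SSFT), which treats parallel reasoning as a set prediction problem and incorporates a set-based global loss into SFT using bipartite matching between global forking tokens and diverse reasoning traces.
🎯 Motivation: Scaling parallel test-time compute requires reasoning paths that are both diverse and accurate, but on hard problems common diversity strategies such as temperature scaling worsen the trade-off between diversity and accuracy.
❓ Problem: Treats parallel reasoning as a set prediction problem and adds a set-based global loss to supervised fine-tuning to prevent mode collapse across reasoning traces.
🔍 Observations: Naive fine-tuning on multiple reasoning traces collapses the distinct reasoning modes, whereas the proposed method preserves them and produces emergent global forking tokens.
🛠️ Method: Set Supervised Fine-Tuning (SSFT) uses bipartite matching to assign global forking tokens to distinct reasoning traces; Global Forking Policy Optimization (GFPO) then uses these tokens to incentivize complex reasoning.
📊 Data & Experiments: On math reasoning and execution-based code generation benchmarks, the resulting models consistently outperform their SFT counterparts trained with GRPO.
⭐ Contributions: A fine-tuning framework for parallel reasoning that preserves diverse reasoning modes and improves performance on complex reasoning and code-execution tasks.
Abstract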
Although LLMs have demonstrated improved performance by scaling parallel test-time compute, doing so relies on generating reasoning paths that are both diverse and accurate. For challenging problems, the forking tokens that trigger diverse yet correct reasoning modes are typically deep in the sampling tree. Consequently, common strategies to encourage diversity, such as temperature scaling, encounter a worsened trade-off between diversity and accuracy. Motivated by this challenge, we treat parallel reasoning as a set-of-next-token-prediction problem and incorporate a set-based global loss into Supervised Fine-Tuning (SFT) using bipartite matching between global forking tokens and unique reasoning traces. We observe that whereas naive fine-tuning with multiple reasoning traces collapses these unique reasoning modes, our proposed method, Set Supervised Fine-Tuning (SSFT), preserves these modes and produces emergent global forking tokens. Global Forking Policy Optimization (GFPO) leverages these maximally steerable tokens to incentivize complex reasoning, and the resulting models consistently outperform their SFT counterparts with GRPO on both math reasoning and execution-based code generation benchmarks.
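The bipartite matching step mentioned in this abstract can be sketched as follows. The sketch assumes a square cost matrix in which `cost[i][j]` is the training loss of trace `j` when conditioned on forking token `i`; that cost definition and the function name are assumptions for illustration. Brute force over permutations keeps the example dependency-free; for larger sets one would normally use the Hungarian algorithm (for example `scipy.optimize.linear_sum_assignment`).

```python
from itertools import permutations


def match_forking_tokens(cost):
    """Find the minimum-cost one-to-one assignment of tokens to traces.

    cost[i][j] is the (assumed) loss of reasoning trace j under forking
    token i. Returns (assignment, total) where assignment[i] is the trace
    index matched to token i.
    """
    n = len(cost)
    best_assignment, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_assignment, best_total = list(perm), total
    return best_assignment, best_total
```

The set-based loss would then sum only the matched token-trace pairs, so each forking token is trained toward one distinct reasoning mode instead of all traces at once.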
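Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large Language Models Reasoning; Reinforcement Learning; Reasoning
🎯 Motivation: RL for LLM reasoning is advancing rapidly, but standardized guidelines and a unified understanding of the underlying mechanisms are still missing.
❓ Problem: Inconsistent experimental settings, training data, and model initialization have led to conflicting conclusions about RL techniques, leaving practitioners unsure which ones to choose.
🔍 Observations: Fine-grained experiments in a unified open-source framework clarify the internal mechanisms, applicable scenarios, and core principles of each technique.
🛠️ Method: Systematically reproduces widely adopted RL techniques and evaluates each in isolation, across datasets of varying difficulty, model sizes, and architectures.
📊 Data & Experiments: Experiments span datasets of varying difficulty and models of different sizes and architectures; a minimalist combination of two techniques consistently improves performance.
⭐ Contributions: Provides clear guidelines for selecting RL techniques for LLM reasoning, and shows that a minimalist combination of two techniques unlocks the learning capability of critic-free policies with a vanilla PPO loss, surpassing GRPO and DAPO.
Abstract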
Reinforcement learning (RL) for LLM reasoning has rapidly emerged as a prominent research area, marked by a significant surge in related studies on both algorithmic innovations and practical applications. Despite this progress, several critical challenges remain, including the absence of standardized guidelines for applying RL techniques and a fragmented understanding of their underlying mechanisms. In addition, inconsistent experimental settings, variations in training data, and differences in model initialization have led to conflicting conclusions, obscuring the key characteristics of these techniques and creating confusion among practitioners when selecting appropriate techniques. This paper systematically reviews widely adopted RL techniques through rigorous reproductions and isolated evaluations within a unified open-source framework. We analyze the internal mechanisms, applicable scenarios, and core principles of each technique through fine-grained experiments, including datasets of varying difficulty, model sizes, and architectures. Based on these insights, we present clear guidelines for selecting RL techniques tailored to specific setups and provide a reliable roadmap for practitioners navigating RL for LLMs. Finally, we show that a minimalist combination of two techniques can unlock the learning capability of critic-free policies with a vanilla PPO loss. The results demonstrate that our simple combination consistently improves performance, surpassing strategies such as GRPO and DAPO.
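Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#LLM; Reasoning; Thinking compression; Test-time scaling; Overthinking; Underthinking
TL;DR:A verifier-based, training-free, efficient framework to trim reasoning and enhance test-time scaling, explicitly tailored for production-level deployment.
🎯 Motivation: Large reasoning models improve accuracy on complex tasks through extended chain-of-thought, but redundant reasoning inflates test-time cost. Production deployment needs a more efficient way to trim reasoning.
❓ Problem: Addresses overthinking and underthinking by trimming redundant chains of thought, improving inference efficiency while preserving accuracy.
🔍 Observations: Redundant chains of thought show clear, structured overthinking and underthinking patterns and are a major efficiency bottleneck for test-time scaling.
🛠️ Method: TrimR uses a lightweight, pretrained, instruction-tuned verifier to detect and truncate redundant intermediate thoughts, with no fine-tuning of either the reasoning model or the verifier.
📊 Data & Experiments: On MATH500, AIME24, AIME25, and GPQA, reasoning runtime under large-batch workloads improves by up to 70% with negligible impact on accuracy.
⭐ Contributions: A training-free, efficient reasoning-compression framework that substantially reduces inference time and supports large-scale industrial deployment of LRMs.
Abstract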
Large Reasoning Models (LRMs) demonstrate exceptional capability in tackling complex mathematical, logical, and coding tasks by leveraging extended Chain-of-Thought (CoT) reasoning. Test-time scaling methods, such as prolonging CoT with explicit token-level exploration, can push LRMs’ accuracy boundaries, but they incur significant decoding overhead. A key source of inefficiency is that LRMs often generate redundant thinking CoTs, which demonstrate clear structured overthinking and underthinking patterns. Inspired by human cognitive reasoning processes and numerical optimization theories, we propose TrimR, a verifier-based, training-free, efficient framework to trim reasoning and enhance test-time scaling, explicitly tailored for production-level deployment. Our method employs a lightweight, pretrained, instruction-tuned verifier to detect and truncate redundant intermediate thoughts of LRMs without any LRM or verifier fine-tuning. We present both the core algorithm and an asynchronous online system engineered for high-throughput industrial applications. Empirical evaluations on Ascend NPUs and vLLM show that our framework delivers substantial gains in inference efficiency under large-batch workloads. In particular, on the four benchmarks MATH500, AIME24, AIME25, and GPQA, the reasoning runtime of QwQ-32B, DeepSeek-R1-Distill-Qwen-32B, and Pangu-R-38B is improved by up to 70% with negligible impact on accuracy.
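One way to picture verifier-based truncation is the sketch below. The stopping rule (stop once the intermediate answer has stayed the same for `patience` consecutive checks) is an assumption chosen for illustration; the abstract only states that a verifier detects and truncates redundant thoughts. `extract_answer` and `answers_agree` stand in for calls to the lightweight verifier, and the `demo_*` helpers are hypothetical string-based substitutes.

```python
def trim_reasoning(thoughts, extract_answer, answers_agree, patience=2):
    """Keep intermediate thoughts until the answer has stabilised.

    thoughts:       list of intermediate reasoning chunks, in order.
    extract_answer: callable returning the chunk's intermediate answer,
                    or None if the chunk reaches no conclusion.
    answers_agree:  callable deciding whether two answers are equivalent
                    (in practice this would query the verifier model).
    patience:       assumed number of consecutive agreements needed
                    before the rest of the chain is treated as redundant.
    """
    kept, previous, streak = [], None, 0
    for thought in thoughts:
        kept.append(thought)
        answer = extract_answer(thought)
        if answer is None:
            continue  # nothing to verify in this chunk
        if previous is not None and answers_agree(previous, answer):
            streak += 1
        else:
            streak = 0
        previous = answer
        if streak >= patience:
            break  # later chunks would only repeat the same conclusion
    return kept


def demo_extract(thought):
    """Hypothetical stand-in: read the text after 'answer:'."""
    marker = "answer:"
    return thought.split(marker, 1)[1].strip() if marker in thought else None


def demo_agree(a, b):
    """Hypothetical stand-in for a verifier equivalence check."""
    return a == b
```

Because the rule only inspects intermediate conclusions, it needs no fine-tuning of either model, which matches the training-free setting the abstract describes.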
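Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#language model adaptation #probabilistic programming #reasoning
TL;DR:We introduce Type-Compliant Adaptation Cascades (TACs), treating an entire typed workflow as a single probabilistic program parametrized by lightweight PEFT modules, allowing end-to-end training with latent variables.
🎯 Motivation: Optimizing discrete prompts in multi-step LLM workflows is unreliable and struggles to enforce the formal compliance that structured tasks require.
❓ Problem: Proposes Type-Compliant Adaptation Cascades (TACs), a framework that recasts workflow adaptation as learning typed probabilistic programs, to compose LLMs reliably.
🔍 Observations: Discrete prompt optimization is brittle and poorly suited to complex workflows; structured tasks need a more principled way to enforce compliance.
🛠️ Method: Treats the whole workflow, composed of parameter-efficiently adapted LLMs and deterministic logic, as an unnormalized joint distribution, enabling gradient-based end-to-end training with latent intermediate structures; the optimization bias is proven to vanish as the model learns type compliance.
📊 Data & Experiments: Against prompt-optimization baselines, TACs improve FinQA from 12.0% to 24.7%, MGSM-SymPy from 57.1% to 75.9%, MGSM from 1.6% to 27.3%, and MuSR from 36.5% to 62.6%.
⭐ Contributions: A theoretically grounded and empirically validated paradigm for building reliable, task-compliant LLM systems.
Abstract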
Reliably composing Large Language Models (LLMs) for complex, multi-step workflows remains a significant challenge. The dominant paradigm, optimizing discrete prompts in a pipeline, is notoriously brittle and struggles to enforce the formal compliance required for structured tasks. We introduce Type-Compliant Adaptation Cascades (TACs), a framework that recasts workflow adaptation as learning typed probabilistic programs. TACs treat the entire workflow, which is composed of parameter-efficiently adapted LLMs and deterministic logic, as an unnormalized joint distribution. This enables principled, gradient-based training even with latent intermediate structures. We provide theoretical justification for our tractable optimization objective, proving that the optimization bias vanishes as the model learns type compliance. Empirically, TACs significantly outperform state-of-the-art prompt-optimization baselines. Gains are particularly pronounced on structured tasks, improving FinQA from 12.0% to 24.7% for a Qwen 3 8B model, MGSM-SymPy from 57.1% to 75.9% for a Gemma 2 27B model, MGSM from 1.6% to 27.3%, and MuSR from 36.5% to 62.6% for a Gemma 7B model. TACs offer a robust and theoretically grounded paradigm for developing reliable, task-compliant LLM systems.
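Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#transformer #in-context learning #task vector
TL;DR:We explain how task vectors emerge and function in in-context learning, and point out their limitations.
🎯 Motivation: Task vectors accelerate in-context learning inference, but the principles behind their emergence and function remain unclear.
❓ Problem: Studies how task vectors emerge in in-context learning, what they actually do, and where they fail on high-rank mappings.
🔍 Observations: Theory and experiments indicate that a task vector acts as a single in-context demonstration distilled from the original ones, and that task vectors emerge naturally in the loss landscape of linear transformers.
🛠️ Method: Proposes the "Task Vectors as Representative Demonstrations" conjecture, supports it with failure prediction, saliency analysis, and parameter visualization, and suggests injecting multiple task vectors into few-shot prompts to improve performance.
📊 Data & Experiments: Uses linear transformers trained on triplet-formatted prompts, together with practical LLMs, to test both the conjecture and its limits.
⭐ Contributions: Advances the understanding of task vectors and of the mechanisms underlying in-context learning in transformers, and proposes a way to enhance task vectors.
Abstract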
Task vectors are a compelling mechanism for accelerating inference in in-context learning (ICL) by distilling task-specific information into a single, reusable representation. Despite their empirical success, the underlying principles governing their emergence and functionality remain unclear. This work proposes the *Task Vectors as Representative Demonstrations* conjecture, positing that task vectors encode single in-context demonstrations distilled from the original ones. We provide both theoretical and empirical support for this conjecture. First, we show that task vectors naturally emerge in linear transformers trained on triplet-formatted prompts through loss landscape analysis. Next, we predict the failure of task vectors in representing high-rank mappings and confirm this on practical LLMs. Our findings are further validated through saliency analyses and parameter visualization, suggesting an enhancement of task vectors by injecting multiple ones into few-shot prompts. Together, our results advance the understanding of task vectors and shed light on the mechanisms underlying ICL in transformer-based models.
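Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#LLM Reasoning; Multi-agent LLMs
🎯 Motivation: LLMs perform well on complex reasoning, and multi-agent frameworks extend that potential, but poor collaboration between agents limits the gains.
❓ Problem: To counter lazy-agent behavior and agents getting lost in multi-turn interaction, proposes a stable and efficient causal-influence measure and a verifiable reward mechanism.
🔍 Observations: Lazy behavior lets one agent dominate while the other contributes little; over multiple turns, the reasoning agent can become trapped by earlier noisy responses.
🛠️ Method: Measures causal influence to mitigate lazy behavior, and designs a verifiable reward that lets the reasoning agent discard noisy outputs, consolidate instructions, and restart its reasoning when necessary.
📊 Data & Experiments: Extensive experiments show that the framework alleviates lazy-agent behavior and improves collaboration on complex reasoning tasks.
⭐ Contributions: A framework that addresses collaboration failures in multi-agent reasoning and unlocks more of its potential.
Abstract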
Large Language Models (LLMs) trained with reinforcement learning and verifiable rewards have achieved strong results on complex reasoning tasks. Recent work extends this paradigm to a multi-agent setting, where a meta-thinking agent proposes plans and monitors progress while a reasoning agent executes subtasks through sequential conversational turns. Despite promising performance, we identify a critical limitation: lazy agent behavior, in which one agent dominates while the other contributes little, undermining collaboration and collapsing the setup to an ineffective single agent. In this paper, we first provide a theoretical analysis showing why lazy behavior naturally arises in multi-agent reasoning. We then introduce a stable and efficient method for measuring causal influence, helping mitigate this issue. Finally, as collaboration intensifies, the reasoning agent risks getting lost in multi-turn interactions and trapped by previous noisy responses. To counter this, we propose a verifiable reward mechanism that encourages deliberation by allowing the reasoning agent to discard noisy outputs, consolidate instructions, and restart its reasoning process when necessary. Extensive experiments demonstrate that our framework alleviates lazy agent behavior and unlocks the full potential of the multi-agent framework for complex reasoning tasks.
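Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#generative verification #large language model #test-time scaling
TL;DR:We study the factors that influence LLM-based generative verification and apply the findings to verifier-based test-time scaling.
🎯 Motivation: Scaling test-time compute lets LLMs solve increasingly complex problems across domains. In one effective paradigm, generators produce multiple candidate solutions and verifiers judge them without reference answers. This work systematically studies what drives LLM verification dynamics.
❓ Problem: Examines how generative verifiers behave along three dimensions (problem difficulty, generator capability, and verifier generation capability) and how the findings can improve verification strategies for test-time scaling.
🔍 Observations: Three findings: (1) on easy problems, verifiers certify correct responses more reliably; (2) errors from weak generators are easier to detect than errors from strong generators; (3) verification ability generally correlates with the verifier's own problem-solving ability, but the relationship varies with problem difficulty.
🛠️ Method: Studies generative verifiers, which produce chain-of-thought reasoning followed by a binary verdict, and systematically evaluates 14 open-source models (2B to 72B parameters) plus GPT-4o on 12 benchmarks.
📊 Data & Experiments: 12 benchmarks covering mathematical reasoning, knowledge, and natural language reasoning; 14 open-source models from 2B to 72B parameters, plus GPT-4o.
⭐ Contributions: Shows that, with the same verifier, some weak generators can nearly match stronger ones after verification, and identifies limits of verifier scaling: stronger verifiers alone cannot overcome fundamental verification challenges.
Abstract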
Recent advances have shown that scaling test-time computation enables large language models (LLMs) to solve increasingly complex problems across diverse domains. One effective paradigm for test-time scaling (TTS) involves LLM generators producing multiple solution candidates, with LLM verifiers assessing the correctness of these candidates without reference answers. In this paper, we study generative verifiers, which perform verification by generating chain-of-thought (CoT) reasoning followed by a binary verdict. We systematically analyze verification dynamics across three dimensions (problem difficulty, generator capability, and verifier generation capability) through empirical studies on 12 benchmarks across mathematical reasoning, knowledge, and natural language reasoning tasks using 14 open-source models (2B to 72B parameter range) and GPT-4o. Our experiments reveal three key findings about verification effectiveness: (1) Easy problems allow verifiers to more reliably certify correct responses; (2) Weak generators produce errors that are easier to detect than those of strong generators; (3) Verification ability is generally correlated with the verifier's own problem-solving capability, but this relationship varies with problem difficulty. These findings reveal opportunities for optimizing basic verification strategies in TTS applications. First, given the same verifier, some weak generators can nearly match stronger ones in post-verification TTS performance (e.g., the Gemma2-9B to Gemma2-27B performance gap shrinks by 75.7%). Second, we identify cases where strong verifiers offer limited advantages over weak ones, as both fail to provide meaningful verification gains, suggesting that verifier scaling alone cannot overcome fundamental verification challenges.
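The generator-plus-verifier loop described in this abstract can be sketched as verifier-filtered majority voting. This is one common aggregation rule, used here as an assumption; the abstract does not specify how verdicts are combined. The function name is hypothetical, and `verdicts` stands for the binary outputs of a generative verifier.

```python
from collections import Counter


def verified_majority_vote(candidates, verdicts):
    """Pick a final answer from sampled candidates using verifier verdicts.

    candidates: final answers extracted from the generator's samples.
    verdicts:   one boolean per candidate (True means the verifier
                judged the solution correct).
    Falls back to a plain majority vote when the verifier rejects
    every candidate.
    """
    if not candidates or len(candidates) != len(verdicts):
        raise ValueError("need one verdict per candidate and at least one candidate")
    accepted = [c for c, ok in zip(candidates, verdicts) if ok]
    pool = accepted if accepted else candidates
    return Counter(pool).most_common(1)[0][0]
```

In this setup a weak generator only needs to place the correct answer among its samples; the verifier does the rest, which is one way to read the abstract's observation that weak generators can nearly match strong ones after verification.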
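Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Language Models #Variational Reasoning #Reinforcement Learning
TL;DR:We propose a variational reasoning framework that treats thinking traces as latent variables optimized via variational inference, yielding a principled and stable training objective that improves LLM reasoning across diverse benchmarks.
🎯 Motivation: Training objectives for LLM reasoning are often unstable and lack a unified theoretical footing.
❓ Problem: Proposes a variational reasoning framework that treats thinking traces as latent variables and optimizes them through variational inference, yielding a stable objective that improves reasoning across diverse tasks.
🔍 Observations: The derivation shows that rejection-sampling fine-tuning and binary-reward RL (including GRPO) carry an implicit weighting by model accuracy, which creates a previously unnoticed bias toward easier questions.
🛠️ Method: Starting from the evidence lower bound (ELBO), extends it to a multi-trace objective for tighter bounds and introduces a forward-KL formulation that stabilizes training of the variational posterior; rejection-sampling fine-tuning and binary-reward RL are interpreted as local forward-KL objectives.
📊 Data & Experiments: Validated on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks.
⭐ Contributions: A principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving LLM reasoning.
Abstract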
We introduce a **variational reasoning** framework for language models that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), we extend it to a multi-trace objective for tighter bounds and propose a forward-KL formulation that stabilizes the training of the variational posterior. We further show that rejection sampling finetuning and binary-reward RL, including GRPO, can be interpreted as local forward-KL objectives, where *an implicit weighting by model accuracy* naturally arises from the derivation and reveals a previously unnoticed bias toward easier questions. We empirically validate our method on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, our work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving the reasoning ability of language models.
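For readers who want the starting point of this abstract written out, the standard ELBO with a thinking trace as the latent variable looks as follows. The notation is assumed for illustration: $x$ is the question, $z$ the thinking trace, $y$ the answer, $\pi_\theta$ the language model, and $q_\phi$ the variational posterior. The second line is the usual multi-sample (importance-weighted) tightening; whether the paper's multi-trace objective takes exactly this form is an assumption.

```latex
% Single-trace ELBO: the gap is KL(q_phi(z | x, y) || pi_theta(z | x, y)).
\log \pi_\theta(y \mid x)
  = \log \sum_{z} \pi_\theta(z, y \mid x)
  \;\ge\;
  \mathbb{E}_{z \sim q_\phi(\cdot \mid x, y)}
  \left[ \log \frac{\pi_\theta(z, y \mid x)}{q_\phi(z \mid x, y)} \right]

% Multi-trace bound with K i.i.d. traces; it is non-decreasing in K
% and recovers the single-trace ELBO at K = 1.
\log \pi_\theta(y \mid x)
  \;\ge\;
  \mathbb{E}_{z_1, \dots, z_K \sim q_\phi(\cdot \mid x, y)}
  \left[ \log \frac{1}{K} \sum_{k=1}^{K}
         \frac{\pi_\theta(z_k, y \mid x)}{q_\phi(z_k \mid x, y)} \right]
```

Both bounds follow from Jensen's inequality applied to the importance-weighted estimate of $\pi_\theta(y \mid x)$.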
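Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Multimodal Large Language Models #Reasoning
🎯 Motivation: Inspired by DeepSeek-R1-Zero, which showed that reasoning can emerge in LLMs purely through reinforcement learning (RL), this work explores how RL can improve the reasoning of MLLMs.
❓ Problem: Direct RL training struggles to activate complex reasoning such as questioning and reflection in MLLMs because substantial high-quality multimodal reasoning data is lacking; the paper proposes a way to strengthen multimodal reasoning.
🔍 Observations: The lack of high-quality multimodal reasoning data makes RL training hard to optimize, and overthinking after cold start causes further optimization difficulties.
🛠️ Method: First builds a 200K multimodal CoT dataset without human annotation, using an existing MLLM and DeepSeek-R1, for cold-start initialization. Then applies Progressive Thinking Suppression Training (PTST) with GRPO and a hard formatting result reward function to gradually refine complex reasoning on multimodal math data.
📊 Data & Experiments: Builds the Vision-R1-cold dataset. RL training uses only 10K multimodal math samples and yields an average improvement of about 6% across benchmarks. Vision-R1-7B reaches 73.5% on MathVista; with more RL data, Vision-R1-32B and Vision-R1-72B reach 76.4% and 78.2%.
⭐ Contributions: Proposes Vision-R1, which improves MLLM reasoning through a high-quality multimodal CoT dataset and a progressive training strategy, achieves clear gains with little RL data, and releases the datasets, weights, and code.
Abstract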
DeepSeek-R1-Zero has successfully demonstrated the emergence of reasoning capabilities in LLMs purely through Reinforcement Learning (RL).
Inspired by this breakthrough, we explore how RL can be utilized to enhance the reasoning capability of MLLMs. However, direct training with RL struggles to activate complex reasoning capabilities such as questioning and reflection in MLLMs, due to the absence of substantial high-quality multimodal reasoning data.
To address this issue, we propose the reasoning MLLM, Vision-R1, to improve multimodal reasoning capability.
Specifically, we first construct a high-quality multimodal CoT dataset without human annotations by leveraging an existing MLLM and DeepSeek-R1 through modality bridging and data filtering to obtain a 200K multimodal CoT dataset, Vision-R1-cold dataset. It serves as cold-start initialization data for Vision-R1.
To mitigate the optimization challenges caused by overthinking after cold start, we propose Progressive Thinking Suppression Training (PTST) strategy and employ Group Relative Policy Optimization (GRPO) with the hard formatting result reward function to gradually refine the model's ability to learn correct and complex reasoning processes on the multimodal math dataset.
Comprehensive experiments show our model achieves an average improvement of about 6% across various multimodal math reasoning benchmarks using only 10K multimodal math data during RL training.
Vision-R1-7B achieves a 73.5% accuracy on the widely used MathVista benchmark, which is only 0.4% lower than the leading reasoning model, OpenAI O1.
Scaling up the amount of multimodal math data in the RL training, Vision-R1-32B and Vision-R1-72B achieve 76.4% and 78.2% MathVista benchmark scores, respectively.
The datasets, weights, and code will be released at: https://github.com/Osilly/Vision-R1.
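A "hard formatting result reward" of the kind named in this abstract can be sketched as an all-or-nothing check: the response earns reward only if it follows the required format and its final answer is correct. The `<think>`/`<answer>` tag names, the exact-match comparison, and the function name are assumptions for illustration; the abstract does not specify them.

```python
import re

# Assumed output format: one reasoning block followed by one answer block.
_RESPONSE_PATTERN = re.compile(
    r"\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*", re.DOTALL
)


def hard_format_result_reward(response, reference):
    """Return 1.0 only for a well-formatted response with the right answer.

    Any formatting violation yields 0.0 even if the correct answer
    appears somewhere in the text, which is what makes the rule "hard".
    """
    match = _RESPONSE_PATTERN.fullmatch(response)
    if match is None:
        return 0.0
    answer = match.group(2).strip()
    return 1.0 if answer == reference.strip() else 0.0
```

A binary reward like this plugs directly into GRPO, which only needs a scalar score per sampled response to compute group-relative advantages.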
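Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#visual planning
🎯 Motivation: LLMs and their multimodal extensions reason mainly in text even when visual information is present. The authors argue that language is not always the most natural or effective modality for reasoning, especially for spatial and geometric tasks.
❓ Problem: Proposes a purely visual planning paradigm for "vision-first" tasks, as a supplementary channel to language-based reasoning, to address the limits of text-only reasoning on visual tasks.
🔍 Observations: Current models rely on text to express and structure reasoning, overlooking the intuitive advantage of the visual modality for spatial reasoning.
🛠️ Method: Introduces Visual Planning, in which planning proceeds step by step through sequences of images, and VPRL, a GRPO-based reinforcement learning framework for post-training large vision models.
📊 Data & Experiments: Evaluated on the visual navigation tasks FrozenLake, Maze, and MiniBehavior, where visual planning outperforms all text-only planning variants.
⭐ Contributions: Establishes visual planning as a viable supplement to language-based reasoning and shows that VPRL substantially improves planning on visual tasks.
Abstract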
Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometrical information. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations for these "vision-first" tasks, as a supplementary channel to language-based reasoning. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We introduce a novel reinforcement learning framework, Visual Planning via Reinforcement Learning (VPRL), empowered by GRPO for post-training large vision models, leading to substantial improvements in planning in a selection of representative visual navigation tasks, FrozenLake, Maze, and MiniBehavior. Our visual planning paradigm outperforms all other planning variants that conduct reasoning in the text-only space. Our results establish Visual Planning as a viable and promising supplement to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.
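Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Large Language Models #LLMs #Post-Training #Reasoning #theorem proving #Lean #f-divergences #Amari $\alpha$-divergences #Distributional Matching #diversity
TL;DR:We propose using a family of divergences that span mode seeking to mode covering to balance between precision and diversity in training LLMs for reasoning tasks
🎯 Motivation: Tuning LLMs for reasoning with reinforcement learning causes a significant loss of diversity; a method that balances precision and diversity is needed.
❓ Problem: RL implicitly optimizes the reverse KL divergence to a target distribution, so the model neglects parts of that distribution; the paper addresses this.
🔍 Observations: Existing methods are mode-seeking: the model concentrates probability mass on a few high-probability regions of the target and neglects other correct solutions.
🛠️ Method: Builds an explicit target distribution by filtering out incorrect answers, then matches it with the α-divergence family, which gives direct control over the trade-off between mode-seeking and mass-covering behavior.
📊 Data & Experiments: On a Lean theorem-proving benchmark, the method achieves state-of-the-art performance along the coverage-precision Pareto frontier and outperforms all prior methods on the coverage axis.
⭐ Contributions: Unifies prior distribution-matching approaches through the α-divergence family, improves diversity in reasoning tasks, and reaches state-of-the-art results in theorem proving.
Abstract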
Reinforcement Learning (RL) has become the _de facto_ standard for tuning LLMs to solve tasks involving reasoning.
However, growing evidence shows that models trained in such a way often suffer from a significant loss in diversity.
We argue that this arises because RL implicitly optimizes the "mode-seeking" or "zero-forcing" _Reverse KL_ to a target distribution causing the model to concentrate mass on certain high-probability regions of the target while neglecting others.
In this work, we instead begin from an explicit target distribution, obtained by filtering out incorrect answers while preserving the relative probabilities of correct ones.
Starting from a pre-trained LLM, we approximate this target distribution using the $\alpha$-divergence family, which unifies prior approaches and enables direct control of the precision–diversity trade-off by interpolating between mode-seeking and mass-covering divergences.
On a Lean theorem-proving benchmark, our method achieves state-of-the-art performance along the coverage–precision Pareto frontier, outperforming all prior methods on the coverage axis.
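To make the interpolation in this abstract concrete, here is one common parameterization of the Amari $\alpha$-divergence, together with a reading of the filtered target distribution. Conventions for $\alpha$ differ between papers, so the parameterization, the symbols $p$, $\pi_\theta$, $\pi_{\text{base}}$, and the form of the target are assumptions for illustration.

```latex
% Assumed target: the base model restricted to correct answers, which
% preserves the relative probabilities of the correct ones.
p(y \mid x) \;\propto\; \pi_{\text{base}}(y \mid x)\,
  \mathbf{1}[\, y \text{ is correct} \,]

% Amari alpha-divergence between target p and policy pi_theta,
% for alpha not in {0, 1}.
D_\alpha\!\left(p \,\|\, \pi_\theta\right)
  = \frac{1}{\alpha (1 - \alpha)}
    \left( 1 - \sum_{y} p(y)^{\alpha}\, \pi_\theta(y)^{1 - \alpha} \right)

% Limiting cases under this parameterization.
\lim_{\alpha \to 1} D_\alpha(p \,\|\, \pi_\theta)
  = \mathrm{KL}(p \,\|\, \pi_\theta)
  \quad \text{(forward KL, mass-covering)}

\lim_{\alpha \to 0} D_\alpha(p \,\|\, \pi_\theta)
  = \mathrm{KL}(\pi_\theta \,\|\, p)
  \quad \text{(reverse KL, mode-seeking)}
```

At $\alpha = 1/2$ the divergence is proportional to the squared Hellinger distance, so intermediate values of $\alpha$ trade precision against coverage between the two KL extremes.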
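Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Chain-of-Thought reasoning #Simplicity bias #Test-time scaling #Reasoning length calibration
🎯 Motivation: Studies how chain-of-thought (CoT) length affects LLM reasoning performance, challenging the common belief that longer CoTs always perform better.
❓ Problem: Characterizes the non-monotonic relationship between CoT length and task accuracy, and examines how to calibrate CoT length to improve performance.
🔍 Observations: Accuracy follows an inverted U-shaped curve in CoT length; the optimal length increases with task difficulty and decreases with model capability.
🛠️ Method: Uses reinforcement learning to calibrate CoT length dynamically, and an error-accumulation analysis to explain theoretically how CoT length affects performance.
📊 Data & Experiments: Experiments on real-world LLMs and theoretical models; training with optimally sized CoTs and length-aware filtering at inference yield substantial improvements.
⭐ Contributions: Gives a principled explanation of the "overthinking" effect and practical guidelines for calibrating CoT length to task complexity and model capability.
Abstract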
Large Language Models (LLMs) increasingly rely on Chain-of-Thought (CoT) reasoning to solve complex problems. Contrary to the common belief that longer CoTs always improve performance, we demonstrate that **longer is not always better**. Across both real-world LLMs and theoretical models, task accuracy follows an inverted U-shaped curve with respect to CoT length: performance rises initially but declines once reasoning chains become too long. Through controlled experiments, we uncover **scaling behaviors of the optimal CoT length**: it increases with task difficulty but decreases with model capability. This exposes a significant mismatch with current practice, where supervised training often reuses the same CoT data across models and tasks without adaptivity. We further show that Reinforcement Learning (RL) can mitigate this gap by dynamically calibrating CoT length, thereby improving accuracy and offering a new perspective on differences between supervised fine-tuning and RL training. To explain these phenomena, we introduce an error-accumulation analysis that characterizes how reasoning errors propagate across steps and derives the scaling behaviors of CoT length observed empirically. Building on these insights, we show that training with optimally sized CoTs and applying length-aware filtering during inference yields substantial improvements in performance. Taken together, these findings establish a principled explanation of the "overthinking" effect and yield practical guidelines for calibrating CoT length in accordance with task complexity and model capability.
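The inverted-U behavior can be reproduced with a deliberately simple toy model. This is not the paper's error-accumulation analysis: the functional form, the `difficulty` and `capability` scales, and the `step_noise` parameter are all assumptions chosen so that two opposing effects are visible. Splitting a task into more steps makes each step easier, but every extra step adds another chance of error.

```python
import math


def toy_accuracy(num_steps, difficulty, capability, step_noise=0.02):
    """Toy probability of solving a task with a CoT of num_steps steps."""
    per_step_load = difficulty / num_steps
    # Each step succeeds with probability exp(-load^2 / capability);
    # all num_steps steps must succeed.
    decomposition_term = math.exp(-num_steps * per_step_load ** 2 / capability)
    # Independent slip probability per step, accumulated over the chain.
    accumulation_term = (1.0 - step_noise) ** num_steps
    return decomposition_term * accumulation_term


def optimal_length(difficulty, capability, max_steps=200, step_noise=0.02):
    """Chain length in 1..max_steps that maximises the toy accuracy."""
    return max(
        range(1, max_steps + 1),
        key=lambda n: toy_accuracy(n, difficulty, capability, step_noise),
    )
```

In this toy model the optimum has the closed form N* = difficulty / sqrt(capability · ln(1 / (1 − step_noise))), so it grows with task difficulty and shrinks with model capability, matching the direction of the scaling behaviors the abstract reports.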
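Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#LLM #reasoning #test-time compute #RL #exploration
TL;DR:We identify three key ingredients to teach LLMs to explore in-context and improve performance when we extrapolate test-time compute beyond what the LLMs are trained for.
🎯 Motivation: LLM reasoning can be improved by spending more compute at inference time, and the real promise lies in extrapolation: continuing to improve on hard problems when thinking beyond the token budget used in training. Most existing models extrapolate poorly.
❓ Problem: Proposes a way to train LLMs to explore in context at inference time, so that performance keeps improving under budgets larger than those seen in training.
🔍 Observations: Most existing reasoning models do not extrapolate well beyond their maximum training budget; training for in-context exploration is one way to enable extrapolation.
🛠️ Method: The e3 recipe has three ingredients: (1) chaining skills in which the base LLM has asymmetric competence (for example, easy verification with hard generation) to implement in-context search; (2) using "negative" gradients from incorrect traces to amplify exploration during RL; (3) a curriculum that couples task difficulty with the training token budget.
📊 Data & Experiments: e3 produces the best known 1.7B model by AIME'25 and HMMT'25 scores, extrapolates to 2x the training token budget, attains high pass@1, and improves pass@k over the base model.
⭐ Contributions: The e3 recipe, which improves extrapolation of test-time compute through in-context exploration and strengthens the reasoning of small models.
Abstract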
Test-time scaling offers a promising path to improve LLM reasoning by utilizing more compute at inference time; however, the true promise of this paradigm lies in extrapolation (i.e., improvement in performance on hard problems as LLMs keep "thinking" for longer, beyond the maximum token budget they were trained on). Surprisingly, we find that most existing reasoning models do not extrapolate well. We show that one way to enable extrapolation is by training the LLM to perform in-context exploration: training the LLM to effectively spend its test time budget by chaining operations (such as generation, verification, refinement, etc.), or testing multiple hypotheses before it commits to an answer. To enable in-context exploration, we identify three key ingredients as part of our recipe e3: (1) chaining skills that the base LLM has asymmetric competence in, e.g., chaining verification (easy) with generation (hard), as a way to implement in-context search; (2) leveraging "negative" gradients from incorrect traces to amplify exploration during RL, resulting in longer search traces that chain additional asymmetries; and (3) coupling task difficulty with training token budget during training via a specifically-designed curriculum to structure in-context exploration. Our recipe e3 produces the best known 1.7B model according to AIME'25 and HMMT'25 scores, and extrapolates to 2x the training token budget. Our e3-1.7B model not only attains high pass@1 scores, but also improves pass@k over the base model.
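Foundation/Frontier Models (incl. LLMs)
Reasoning and Chain-of-Thought
#Diffusion Language Models #Reinforcement Learning #Reasoning
TL;DR:We propose a novel policy optimization method for dLLM reasoning that reduces the error caused by log-likelihood approximation.
🎯 Motivation: Improving the reasoning of diffusion-based large language models (dLLMs) through reinforcement learning remains an open problem, in particular because likelihood approximation introduces large errors into the RL objective.
❓ Problem: Proposes wd1, a ratio-free policy optimization method that avoids the variance and estimation error introduced by computing the policy ratio for importance sampling.
🔍 Observations: Because the dLLM likelihood is intractable, existing policy optimization must approximate the current, old, and reference policy likelihoods at every step, which adds computational overhead and compounds estimation error.
🛠️ Method: wd1 reformulates the RL objective as a weighted log-likelihood, which needs only one approximation (for the current policy) and no policy ratio; it can be interpreted as energy-guided discrete diffusion training combined with negative sample unlearning.
📊 Data & Experiments: On LLaDA-8B, wd1 outperforms d1 at lower computational cost; the extension wd1++ reaches 44.2% on MATH500 and 84.5% on GSM8K.
⭐ Contributions: A more compute-efficient policy optimization method for dLLM reasoning, with theoretical justification and clear gains on reasoning benchmarks.
Abstract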
Improving the reasoning capabilities of diffusion-based large language models (dLLMs) through reinforcement learning (RL) remains an open problem. The intractability of the dLLM likelihood function necessitates approximating the current, old, and reference policy likelihoods at each policy optimization step. This reliance introduces additional computational overhead, and can lead to large variance and estimation error in the RL objective, particularly in computing the policy ratio for importance sampling. To mitigate these issues, we introduce wd1, a novel ratio-free policy optimization approach that reformulates the RL objective as a weighted log-likelihood, requiring only a single approximation for the current parametrized policy likelihood. We formally show that our proposed method can be interpreted as energy-guided discrete diffusion training combined with negative sample unlearning, thereby confirming its theoretical soundness. In experiments on the LLaDA-8B model, wd1 outperforms diffusion-based GRPO (d1) while requiring lower computational cost, achieving up to a +59% improvement in accuracy. Furthermore, we extend wd1 to denoising-stepwise weighted policy optimization (wd1++), achieving state-of-the-art math performance of 44.2% on MATH500 and 84.5% on GSM8K with only 20 RL training steps.
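The contrast between a ratio-based objective and a weighted log-likelihood can be sketched as follows. The weighting scheme (a softmax over centered rewards for positive weights and over negated centered rewards for negative weights) and the `temperature` parameter are assumptions for illustration, not the paper's exact objective. The point of the sketch is structural: only the current policy's (approximate) log-likelihoods appear, and no policy ratio is formed.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_log_likelihood_loss(log_likelihoods, rewards, temperature=1.0):
    """Ratio-free loss over a group of sampled completions.

    log_likelihoods: approximate log-likelihood of each completion under
                     the current policy (the single approximation needed).
    rewards:         scalar reward for each completion.
    Minimising the loss raises the likelihood of high-reward samples and
    lowers ("unlearns") that of low-reward samples.
    """
    mean_reward = sum(rewards) / len(rewards)
    advantages = [r - mean_reward for r in rewards]
    positive = softmax(advantages, temperature)
    negative = softmax([-a for a in advantages], temperature)
    return sum(
        (w_neg - w_pos) * ll
        for w_pos, w_neg, ll in zip(positive, negative, log_likelihoods)
    )
```

When all rewards in a group are equal, the positive and negative weights cancel and the loss is zero, so such groups contribute no learning signal in this sketch.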
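Efficiency and Compression (131 papers)
Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Large Language Models #Diffusion Models #Model Acceleration #Adaptive Sampling
TL;DR:SlowFast Sampling alternates between a slow exploratory phase and a fast parallel decoding phase, boosting diffusion LLMs by up to 34.22× with minimal quality loss.
🎯 Motivation: Diffusion-based language models (dLLMs) generate tokens in parallel and cut inference latency, making them a promising alternative to autoregressive LLMs. Existing sampling strategies, however, behave statically, which limits efficiency and flexibility.
❓ Problem: Proposes a dynamic sampling method that substantially accelerates dLLM inference with minimal loss in generation quality.
🔍 Observations: Static strategies such as confidence-based or semi-autoregressive decoding are suboptimal; adaptively alternating between exploratory and accelerated decoding greatly speeds up decoding while preserving quality.
🛠️ Method: SlowFast Sampling alternates between an exploratory stage and an accelerated stage, guided by three principles (certainty, convergence, and position) that govern when and where tokens can be decoded confidently. It is further combined with dLLM-Cache to reduce redundant computation.
📊 Data & Experiments: Across benchmarks and models, SlowFast Sampling alone achieves up to 15.63× speedup on LLaDA with minimal accuracy drop, and up to 34.22× when combined with dLLM-Cache.
⭐ Contributions: A dynamic sampling strategy that markedly improves dLLM inference efficiency, outperforms strong autoregressive baselines such as LLaMA3 8B in throughput, and shows that well-designed sampling can unlock fast, high-quality generation.
Abstract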
Diffusion-based language models (dLLMs) have emerged as a promising alternative to traditional autoregressive LLMs by enabling parallel token generation and significantly reducing inference latency. However, existing sampling strategies for dLLMs, such as confidence-based or semi-autoregressive decoding, often suffer from static behavior, leading to suboptimal efficiency and limited flexibility. In this paper, we propose SlowFast Sampling, a novel dynamic sampling strategy that adaptively alternates between exploratory and accelerated decoding stages. Our method is guided by three golden principles: certainty principle, convergence principle, and positional principle, which govern when and where tokens can be confidently and efficiently decoded. We further integrate our strategy with dLLM-Cache to reduce redundant computation. Extensive experiments across benchmarks and models show that SlowFast Sampling achieves up to 15.63× speedup on LLaDA with minimal accuracy drop, and up to 34.22× when combined with caching. Notably, our approach outperforms strong autoregressive baselines like LLaMA3 8B in throughput, demonstrating that well-designed sampling can unlock the full potential of dLLMs for fast and high-quality generation.
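Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#LLM #memory-efficient #quantization #low-bit #Muon optimizer
TL;DR:We present 4-bit-Muon-GRASP, a method for compressing the Muon optimizer using subspace preservation and grid quantization to enhance memory efficiency.
🎯 Motivation: Optimizer states are a major source of memory pressure in LLM training, and low-bit compression of the matrix-orthogonalization-based Muon optimizer remains underexplored.
❓ Problem: Addresses the quantization error that orthogonalization exacerbates when Muon's optimizer states are compressed, and proposes a low-bit method that reduces memory use.
🔍 Observations: The error originates mainly from the top singular subspace and from outlier patterns of the moment matrix that appear across both dimensions.
🛠️ Method: 4-bit-Muon-GRASP compresses Muon's optimizer states to 4 bits with grid quantization while preserving the top singular subspace at minimal overhead.
📊 Data & Experiments: Pre-training on LLaMA-130M, 350M, and 1.1B and fine-tuning 7B models on reasoning tasks show accuracy comparable to full precision while reducing training memory by up to 28%.
⭐ Contributions: A 4-bit compression method for Muon optimizer states that stays close to full-precision accuracy while cutting memory, with source code released.
Abstract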
Training Large Language Models (LLMs) faces severe memory constraints due to the increasing size of model parameters and optimizer states. The Muon optimizer, which is based on matrix orthogonalization, has recently demonstrated significant potential and offers considerable memory advantages over AdamW by utilizing only the first moment. However, how to apply memory-reduction techniques to further compress the optimizer states of Muon remains underexplored. Directly applying existing methods may encounter significant difficulties due to the orthogonalization process. In this work, we investigate the low-bit compression of Muon and systematically analyze the quantization error exacerbated by orthogonalization. We identify that the error primarily originates from the top singular subspace and the outlier patterns of the moment matrix appearing across both dimensions. To address this, we propose 4-bit-Muon-GRASP (GRid And Subspace Preserving), which compresses the Muon optimizer states to 4 bits using grid quantization, while preserving the top singular subspace with minimal overhead. We evaluate 4-bit-Muon-GRASP through pre-training on LLaMA-130M, 350M, and 1.1B architectures and fine-tuning on 7B models for various reasoning tasks. Extensive experimental results show that our 4-bit-Muon-GRASP achieves accuracy comparable to full-precision counterparts while reducing training memory consumption by up to 28%. The source code is publicly available at https://github.com/wuhuaijin/lowbit-Muon.
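The idea of keeping the top singular subspace in full precision while quantizing the rest can be illustrated with a small pure-Python sketch. It makes several simplifying assumptions that differ from the paper: per-row absmax quantization stands in for grid quantization, only the single leading singular direction is preserved, and power iteration is used in place of a library SVD. All function names are hypothetical.

```python
import math
import random


def quantize_rows_4bit(matrix):
    """Symmetric per-row absmax quantization to levels -7..7, dequantized."""
    result = []
    for row in matrix:
        scale = max(abs(x) for x in row) / 7.0
        if scale == 0.0:
            result.append([0.0] * len(row))
            continue
        result.append([max(-7, min(7, round(x / scale))) * scale for x in row])
    return result


def top_singular_triplet(matrix, iterations=50):
    """Power iteration for the leading singular value and vectors.

    Assumes a non-zero matrix whose leading right singular vector is not
    orthogonal to the uniform start vector.
    """
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols
    sigma, u = 0.0, [0.0] * rows
    for _ in range(iterations):
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        norm_u = math.sqrt(sum(x * x for x in u))
        u = [x / norm_u for x in u]
        v = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        sigma = math.sqrt(sum(x * x for x in v))
        v = [x / sigma for x in v]
    return sigma, u, v


def quantize_preserving_top_subspace(matrix):
    """Keep the rank-1 top component exact and quantize only the residual."""
    sigma, u, v = top_singular_triplet(matrix)
    low_rank = [[sigma * ui * vj for vj in v] for ui in u]
    residual = [[m - l for m, l in zip(mr, lr)] for mr, lr in zip(matrix, low_rank)]
    quantized = quantize_rows_4bit(residual)
    return [[l + q for l, q in zip(lr, qr)] for lr, qr in zip(low_rank, quantized)]


def frobenius_error(a, b):
    """Frobenius norm of the difference between two matrices."""
    return math.sqrt(sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))


def demo_matrix(rows=16, cols=24, seed=0):
    """Synthetic matrix with one dominant direction plus small noise."""
    rng = random.Random(seed)
    u = [rng.gauss(0, 1) for _ in range(rows)]
    v = [rng.gauss(0, 1) for _ in range(cols)]
    return [[10.0 * ui * vj + 0.1 * rng.gauss(0, 1) for vj in v] for ui in u]
```

On a matrix dominated by one direction, the residual has a much smaller dynamic range than the original, so the same 4-bit grid resolves it far more finely. Storing the preserved direction costs only one pair of vectors per matrix.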
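Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#dynamic sparse training #low-rank factorization #spectral sparse training #efficient training
🎯 Motivation: As large language models grow, the need to maintain full dense matrices during pre-training has been questioned, motivating parameter-efficient sparse pre-training methods that cut both training and inference cost.
❓ Problem: Dynamic sparse training and low-rank factorization conflict when naively combined, which limits the model's expressivity; no unified framework yet combines the strengths of both.
🔍 Key observation: The sparse and low-rank branches exhibit a cancellation effect, which shows up as output conflicts when the two are combined; the overlap cancellation ratio (OCR) is introduced to quantify it.
🛠️ Method: An alignment loss reduces the disagreement between the dynamic sparse and low-rank branches so that they cooperate; combining the dynamic sparse training method CHTs with low-rank training yields the parameter-efficient approach CHTsL.
📊 Data and experiments: Evaluated on LLaMA60M and LLaMA130M with the OpenWebText and C4 datasets, keeping only 10%, 20%, or 30% of the parameters of dense training; the scheme alleviates the cancellation effect, especially in the Q and K matrices of the attention layers, and improves training stability and performance over the naive combination.
⭐ Contributions: A framework that integrates dynamic sparse training with low-rank training and mitigates the conflict between branches; under the same parameter budget, CHTsL outperforms other parameter-efficient sparse training methods and comes closest to dense training.
Abstract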
With the rapid development of large language models (LLMs), identifying efficient strategies for training such large-scale systems has become increasingly critical. Although LLMs have achieved remarkable success across diverse applications, the necessity of maintaining full dense matrices during pre-training has been questioned, giving rise to parameter-efficient sparse pre-training methods that retain parameter efficiency in both training and inference. These methods can be further divided into connectivity sparse training and spectral sparse training, with dynamic connectivity sparse training and low-rank factorization emerging as representative approaches for the two branches.
However, a unified framework that effectively combines the strengths of both has yet to be established. In this work, we observe that the cancellation effect between the sparse and low-rank branches may limit the expressivity of the model, manifesting as output conflicts when the two components are combined. To address this issue, we first quantify the cancellation effect using the overlap cancellation ratio (OCR) and then propose a novel scheme that integrates dynamic sparse training with low-rank training, introducing a simple yet effective **alignment loss** to mitigate the disagreement between the two branches and promote better collaboration. We validate this scheme by combining a representative dynamic sparse training method, CHTs, with low-rank training, resulting in a new parameter-efficient training approach termed **CHTsL**. The method is evaluated on LLaMA60M and LLaMA130M using the OpenWebText and C4 datasets, where only 10%, 20%, and 30% of the parameters are preserved compared to dense training. Experimental results demonstrate that our proposed scheme effectively alleviates the cancellation effect, especially in the Q and K matrices of the attention layers, and improves training stability and performance compared to the naive combination of sparse and low-rank components. Additionally, the new scheme enables CHTsL to consistently outperform other parameter-efficient sparse training methods under the same parameter budget, achieving performance closest to that of dense training.
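Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#LLMs #Quantization #Optimizers #post-training quantization #quantization-aware training
TL;DR: We propose a systematic study of the effect of optimizer choice on quantization, both during and after training.
🎯 Motivation: Quantization has become standard for efficient deployment, yet systematic evidence on how optimizer choice interacts with quantization is limited.
❓ Problem: How does optimizer choice affect model performance under quantization, for both post-training quantization (PTQ) and quantization-aware training (QAT)?
🔍 Key observation: Outlier-related metrics such as the max-to-mean ratio (MMR) and kurtosis fail to predict PTQ performance across optimizers, because MMR captures only isolated per-layer errors and ignores how quantization errors accumulate and propagate through the network. Optimizers that perform well in full-precision training may not remain optimal under QAT.
🛠️ Method: Train full-precision models from 50M to 1.5B parameters with six optimizers to establish well-tuned baselines, apply PTQ to them, and train quantized models from scratch to measure QAT degradation per optimizer.
📊 Data and experiments: Six optimizers are evaluated under PTQ and QAT across model scales, and the empirical findings are backed by an analytical argument for why MMR fails.
⭐ Contributions: Shows that outlier metrics do not explain optimizer-quantization interactions; finds that Shampoo-trained models degrade least under QAT; derives QAT scaling laws per optimizer, under which Shampoo achieves the highest parameter efficiency.
Abstract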
As new optimizers gain traction and model quantization becomes standard for efficient deployment, a key question arises: how does the choice of optimizer affect model performance in the presence of quantization? Despite progress in both areas, systematic evidence on optimizer–quantization interactions remains limited. To fill this gap, we study the impact of optimizer choice on model robustness under quantization, considering both post-training quantization (PTQ), and quantization-aware training (QAT). We first train full-precision models, ranging from 50M to 1.5B parameters, with six optimizers, to explore the hyperparameter landscape, and establish well-tuned baselines.
We then apply PTQ to evaluate how model performance degrades when trained with different optimizers. We find that outlier-related metrics, such as the max-to-mean ratio (MMR) and Kurtosis, fail to predict the PTQ performance across different optimizers. We show analytically that this is due to the MMR capturing only isolated layer errors, while ignoring how quantization errors accumulate and propagate through the network. To study the QAT degradation, we train quantized models from scratch and compare them to our original-precision baselines. We find that optimizers performing well in the original pretraining setup may not remain optimal under QAT, and that models trained with Shampoo show the lowest accuracy degradation. Finally, we derive scaling laws for quantization-aware training under different optimizers, showing that Shampoo achieves the highest parameter efficiency of all tested optimizers.
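The two outlier metrics named above can be computed per tensor in a few lines. The sketch below uses common textbook definitions, which are an assumption here: MMR as the largest magnitude divided by the mean magnitude, and kurtosis as the fourth central moment over the squared variance (non-excess form). The paper may normalize differently.

```python
from typing import Sequence


def max_to_mean_ratio(values: Sequence[float]) -> float:
    """Largest absolute value divided by the mean absolute value."""
    magnitudes = [abs(v) for v in values]
    return max(magnitudes) / (sum(magnitudes) / len(magnitudes))


def kurtosis(values: Sequence[float]) -> float:
    """Fourth central moment divided by the squared variance (non-excess)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    fourth = sum((v - mean) ** 4 for v in values) / n
    return fourth / variance ** 2


flat = [1.0, -1.0, 1.0, -1.0]    # no outliers
spiky = [1.0, -1.0, 1.0, -5.0]   # one outlier

flat_mmr, spiky_mmr = max_to_mean_ratio(flat), max_to_mean_ratio(spiky)
flat_kurt, spiky_kurt = kurtosis(flat), kurtosis(spiky)
```

Both metrics are computed on one tensor in isolation. That is consistent with the abstract's point that they cannot capture how errors accumulate across layers.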
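Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Machine Learning #LLM
TL;DR: We repurpose the KV cache, traditionally used only for speedup, as a free representation for sampling and reasoning, enabling output-free self-evaluation and adaptive fast/slow thinking with negligible overhead and strong empirical results.
🎯 Motivation: The KV cache is normally used only to speed up autoregressive decoding, but the contextual information it encodes is underused and could serve as a zero-cost lightweight representation for reasoning and sampling.
❓ Problem: How to reuse KV-cache representations in place of recomputing or storing full hidden states, while still supporting downstream tasks such as sampling and reasoning.
🔍 Key observation: Although weaker than dedicated embeddings, KV-derived representations are sufficient for two applications: Chain-of-Embedding and fast/slow thinking switching.
🛠️ Method: Treat the KV cache as a lightweight representation that needs no extra storage or computation, and build on it a representation for Chain-of-Embedding and an adaptive fast/slow thinking switch.
📊 Data and experiments: Chain-of-Embedding is evaluated on the Llama-3.1-8B-Instruct and Qwen2-7B-Instruct models; fast/slow thinking switching is evaluated on Qwen3-8B and DeepSeek-R1-Distill-Qwen-14B, where it reduces token generation by up to 5.7× with minimal accuracy loss.
⭐ Contributions: Establishes the KV cache as a free representation for sampling and reasoning, with gains for Chain-of-Embedding and adaptive reasoning, and points to representation reuse as a new direction for LLM inference.
Abstract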
KV caches, typically used only to speed up autoregressive decoding, encode contextual information that can be reused for downstream tasks at no extra cost. We propose treating the KV cache as a lightweight representation, eliminating the need to recompute or store full hidden states. Despite being weaker than dedicated embeddings, KV-derived representations are shown to be sufficient for two key applications: (i) Chain-of-Embedding, where they achieve competitive or superior performance on Llama-3.1-8B-Instruct and Qwen2-7B-Instruct; and (ii) Fast/Slow Thinking Switching, where they enable adaptive reasoning on Qwen3-8B and DeepSeek-R1-Distill-Qwen-14B, reducing token generation by up to $5.7\times$ with minimal accuracy loss. Our findings establish KV caches as a free, effective substrate for sampling and reasoning, opening new directions for representation reuse in LLM inference.
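Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#knowledge distillation #pretraining #adaptive compute #model interpolation
TL;DR: We identify a phenomenon called boomerang distillation, where distilling a teacher model into a student model enables us to reconstruct intermediate-sized models by incorporating teacher layers into the student with no additional training.
🎯 Motivation: LLMs must be deployed under diverse memory and compute constraints, but building a model family by training each size independently is prohibitively expensive and yields only coarse-grained size options.
❓ Problem: Produce models of intermediate size without any additional training.
🔍 Key observation: Boomerang distillation: after a teacher is distilled into a small student, intermediate-sized models can be reconstructed by re-incorporating blocks of teacher layers into the student.
🛠️ Method: Start from a large teacher, distill it into a small student, then progressively swap blocks of teacher layers back into the student to form interpolated models of different sizes.
📊 Data and experiments: The interpolated models' performance scales smoothly between student and teacher, often matching or surpassing pretrained or distilled models of the same size; the analysis shows that aligning teacher and student through pruning and distillation is essential.
⭐ Contributions: Boomerang distillation enables zero-shot interpolation of model size, a low-cost way to generate fine-grained model families.
Abstract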
Large language models (LLMs) are typically deployed under diverse memory and compute constraints. Existing approaches build model families by training each size independently, which is prohibitively expensive and provides only coarse-grained size options. In this work, we identify a novel phenomenon that we call boomerang distillation: starting from a large base model (the teacher), one first distills down to a small student and then progressively reconstructs intermediate-sized models by re-incorporating blocks of teacher layers into the student without any additional training. This process produces zero-shot interpolated models of many intermediate sizes whose performance scales smoothly between the student and teacher, often matching or surpassing pretrained or distilled models of the same size. We further analyze when this type of interpolation succeeds, showing that alignment between teacher and student through pruning and distillation is essential. Boomerang distillation thus provides a simple and efficient way to generate fine-grained model families, dramatically reducing training cost while enabling flexible adaptation across deployment environments. The code and models are available at [https://github.com/dcml-lab/boomerang-distillation](https://github.com/dcml-lab/boomerang-distillation).
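The reconstruction step can be pictured as list surgery on layer stacks. The sketch below rests on two assumptions that the abstract does not state: each student layer stands in for one contiguous, equally sized block of teacher layers, and patching proceeds from the last layer backwards. `interpolate_layers` and the layer labels are hypothetical names.

```python
from typing import List


def interpolate_layers(student: List[str], teacher: List[str], num_patched: int) -> List[str]:
    """Replace the last `num_patched` student layers with their teacher blocks."""
    if len(teacher) % len(student) != 0:
        raise ValueError("teacher depth must be a multiple of student depth")
    if not 0 <= num_patched <= len(student):
        raise ValueError("num_patched must lie between 0 and the student depth")
    block = len(teacher) // len(student)   # teacher layers per student layer
    kept = len(student) - num_patched      # student layers left untouched
    return student[:kept] + teacher[kept * block:]


student_layers = [f"s{i}" for i in range(4)]
teacher_layers = [f"t{i}" for i in range(8)]

# Every intermediate size between the student (4 layers) and the teacher (8 layers).
family = [interpolate_layers(student_layers, teacher_layers, k) for k in range(5)]
```

With `num_patched=2` the result is `s0, s1, t4, t5, t6, t7`: a six-layer model assembled without any training step.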
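Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Quantization #Quantization-Aware Training #Pre-Training
🎯 Motivation: Quantization-Aware Pre-Training (QAPT) reduces the compute and memory overhead of deep neural networks on edge devices, but existing methods cannot achieve information-theoretic optimality (ITO) and compute efficiency at the same time.
❓ Problem: Existing QAPT methods use compute-efficient data types (e.g. integers) that are not ITO, while existing ITO data types (e.g. Quantile/NormalFloat quantization) are not compute-efficient; the goal is a quantization method that is both.
🔍 Key observation: Because learning is domain-agnostic, a quantizer's output need not reside in the same domain as its input, which removes the apparent trade-off between ITO and compute efficiency.
🛠️ Method: BBQ performs ITO quantization in its input domain and returns the output in a compute-efficient domain, where ITO data types are mapped to compute-efficient data types.
📊 Data and experiments: BBQ lowers perplexity relative to prior SOTA QAPT methods by up to 2 points for 4-bit, 4 points for 3-bit, 5 points for 2-bit, and 18 points for 1-bit models.
⭐ Contributions: BBQ, presented as the first ITO quantization method that is also compute-efficient, with code released.
Abstract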
Quantization-Aware Pre-Training (QAPT) is an effective technique to reduce the compute and memory overhead of Deep Neural Networks while improving their energy efficiency on edge devices. Existing QAPT methods produce models stored in compute-efficient data types (e.g. integers) that are not information theoretically optimal (ITO). On the other hand, existing ITO data types (e.g. Quantile/NormalFloat Quantization) are not compute-efficient.
We propose BBQ, the first ITO quantization method that is also compute-efficient. BBQ builds on our key insight that since learning is domain-agnostic, the output of a quantizer does not need to reside in the same domain as its input. BBQ performs ITO quantization in its input domain, and returns its output in a compute-efficient domain where ITO data types are mapped to compute-efficient data types. Without sacrificing compute efficiency, BBQ outperforms prior SOTA QAPT methods by a perplexity reduction of up to 2 points for 4-bit models, up to 4 points for 3-bit models, up to 5 points for 2-bit models, and up to 18 points for 1-bit models. Code is available at https://github.com/1733116199/bbq.
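For readers unfamiliar with quantile quantization, the ITO data type mentioned above, the following sketch shows its defining property: codes are assigned by rank so that every code is used equally often. This is a generic illustration and not BBQ itself. Note that it returns small integer codes, which only loosely echoes the idea of emitting the output in a different domain from the input. `quantile_codes` and the sample weights are hypothetical.

```python
from collections import Counter
from typing import List


def quantile_codes(values: List[float], bits: int) -> List[int]:
    """Assign each value an integer code by rank, with equal-population bins."""
    levels = 2 ** bits
    order = sorted(range(len(values)), key=lambda i: values[i])
    codes = [0] * len(values)
    for rank, index in enumerate(order):
        codes[index] = rank * levels // len(values)
    return codes


weights = [0.1, -2.0, 0.3, 5.0, -0.2, 0.0, 1.5, -0.7]
codes = quantile_codes(weights, bits=2)
usage = Counter(codes)
```

Each of the four 2-bit codes covers exactly two of the eight weights, regardless of how far apart the values are. A uniform integer grid would instead leave most codes unused whenever a few values sit far from the rest.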
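Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Speculative Decoding; Draft Tree Reward; Tree Optimization
TL;DR: We introduce GTO, which optimizes draft-tree reward to provably increase acceptance length and achieve >7% faster speculative decoding than EAGLE-3, while fine-tuning existing draft models.
🎯 Motivation: Existing training objectives for speculative decoding optimize only a single greedy draft path, whereas decoding follows a tree policy; the mismatch limits achievable speedups.
❓ Problem: Correct this draft policy misalignment by aligning the training objective with the decoding-time tree policy, increasing acceptance length and inference speed.
🔍 Key observation: Training optimizes one greedy path while decoding re-ranks and verifies multiple branches of a draft tree, so the draft model is trained for a policy it is not used with.
🛠️ Method: Two components: a Draft Tree Reward, a sampling-free objective equal to the expected acceptance length of the draft tree under the target model; and Group-based Draft Policy Training, which contrasts trees from the current and a frozen reference draft model and applies a PPO-style surrogate along the longest accepted sequence.
📊 Data and experiments: Across dialogue (MT-Bench), code (HumanEval), and math (GSM8K) with multiple LLMs, GTO increases acceptance length by 7.4% and yields an additional 7.7% speedup over EAGLE-3.
⭐ Contributions: A practical, general fix for draft policy misalignment in speculative decoding, with code and draft models released.
Abstract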
Speculative decoding accelerates large language model (LLM) inference by letting a lightweight draft model propose multiple tokens that the target model verifies in parallel. Yet existing training objectives optimize only a single greedy draft path, while decoding follows a tree policy that re-ranks and verifies multiple branches. This draft policy misalignment limits achievable speedups.
We introduce **Group Tree Optimization** (GTO), which aligns training with the decoding-time tree policy through two components: (i) Draft Tree Reward, a sampling-free objective equal to the expected acceptance length of the draft tree under the target model, directly measuring decoding performance; (ii) Group-based Draft Policy Training, a stable optimization scheme that contrasts trees from the current and a frozen reference draft model, forming debiased group-standardized advantages and applying a PPO-style surrogate along the longest accepted sequence for robust updates. We further prove that increasing our Draft Tree Reward improves acceptance length and speedup.
Across dialogue (MT-Bench), code (HumanEval), and math (GSM8K), and multiple LLMs (e.g., LLaMA-3.1-8B, LLaMA-3.3-70B, Vicuna-1.3-13B, DeepSeek-R1-Distill-LLaMA-8B, Qwen3-8B), GTO increases acceptance length by 7.4% and yields an additional 7.7% speedup over prior state-of-the-art EAGLE-3. By bridging draft policy misalignment, GTO offers a practical, general solution for efficient LLM inference. Code and draft models are available at https://github.com/hsj576/GTO.
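An expected acceptance length can be computed without sampling once each draft-tree node carries a conditional acceptance probability. The sketch below assumes that at most one child of any node is accepted, so by linearity the expectation is the sum, over all nodes, of the probability that the whole path to that node is accepted. How GTO obtains these probabilities from the target model is not described here; the dictionary layout and the `accept` field are hypothetical.

```python
from typing import Any, Dict


def expected_acceptance_length(node: Dict[str, Any], reach_prob: float = 1.0) -> float:
    """Sum, over all descendants, the probability that their full path is accepted."""
    total = 0.0
    for child in node.get("children", []):
        # "accept" is the probability the child is accepted given its parent was.
        path_prob = reach_prob * child["accept"]
        total += path_prob + expected_acceptance_length(child, path_prob)
    return total


# Hypothetical draft tree: the root is the last verified token.
draft_tree = {
    "children": [
        {"accept": 0.6, "children": [{"accept": 0.5}]},
        {"accept": 0.3},
    ]
}
reward = expected_acceptance_length(draft_tree)
```

Here the three draft nodes contribute 0.6, 0.6 × 0.5, and 0.3, for an expected acceptance length of 1.2 tokens.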
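Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Parameter-efficient #LLMs pre-training #cross-layer low-rank #low-rank pre-training.
TL;DR: We propose CR-Net, a low-rank framework for LLM pre-training that leverages cross-layer activation residuals to improve efficiency while maintaining performance and reducing computational and memory costs.
🎯 Motivation: Low-rank architectures can make LLM pre-training more efficient, but current methods suffer from degraded performance, considerable computational overhead, and limited activation-memory savings.
❓ Problem: Exploit the low-rank property of cross-layer activation residuals to improve efficiency while preserving performance and reducing compute and memory requirements.
🔍 Key observation: Inter-layer activation residuals are low-rank, which opens a route to both parameter efficiency and memory savings.
🛠️ Method: A dual-path architecture reconstructs each layer's activations by combining the previous layer's output with a low-rank difference, retaining high-rank information with few parameters; a tailored activation recomputation strategy further reduces memory.
📊 Data and experiments: Pre-training experiments on models from 60M to 7B parameters show CR-Net outperforming state-of-the-art low-rank frameworks while using fewer computational resources and less memory.
⭐ Contributions: The Cross-layer Low-Rank residual Network (CR-Net), a parameter-efficient pre-training framework that balances performance and efficiency.
Abstract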
Low-rank architectures have become increasingly important for efficient large language model (LLM) pre-training, providing substantial reductions in both parameter complexity and memory/computational demands. Despite these advantages, current low-rank methods face three critical shortcomings: (1) compromised model performance, (2) considerable computational overhead, and (3) limited activation memory savings. To address these limitations, we propose **C**ross-layer Low-**R**ank residual **Net**work (**CR-Net**), an innovative parameter-efficient framework inspired by our discovery that inter-layer activation residuals possess low-rank properties. CR-Net implements this insight through a dual-path architecture that efficiently reconstructs layer activations by combining previous-layer outputs with their low-rank differences, thereby maintaining high-rank information with minimal parameters. We further develop a specialized activation recomputation strategy tailored for CR-Net that dramatically reduces memory requirements. Extensive pre-training experiments across model scales from 60M to 7B parameters demonstrate that CR-Net consistently outperforms state-of-the-art low-rank frameworks while requiring fewer computational resources and less memory.
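Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#KV cache #eviction #large language models #llm #long-context generation
TL;DR: We propose a learnable KV eviction method for long-context and long-horizon generation in LLMs.
🎯 Motivation: Compute and memory bottlenecks in long-horizon inference stem from the quadratic cost of self-attention and the ever-growing KV cache; existing memory-bounded strategies either incur high orchestration costs or rely on unreliable attention-based importance proxies.
❓ Problem: A learnable KV eviction method that handles long-context and long-horizon generation under a fixed memory budget.
🔍 Key observation: Selectively retaining important tokens suppresses noise from uninformative ones and can act as a form of regularization; the learned scores also expose layer- and head-specific roles, which is promising for interpretability.
🛠️ Method: A lightweight retention gate predicts, for each token at creation time, a retention score that decays over time; when the memory budget is exceeded, the lowest-scoring tokens are evicted. Training uses distillation from a frozen LLM plus a capacity loss and fine-tunes only the gates.
📊 Data and experiments: On mathematical reasoning, procedural generation, conversational long-memory, and long-context understanding benchmarks, TRIM-KV outperforms strong eviction and learnable retrieval baselines, especially in low-memory regimes, and in some settings surpasses full-cache models.
⭐ Contributions: TRIM-KV, a KV eviction strategy based on learned selective retention that improves efficiency, offers interpretability insights, and improves long-context generation.
Abstract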
Memory and computation remain core bottlenecks in long-horizon LLM inference due to the quadratic cost of self-attention and the ever-growing key-value (KV) cache. Existing strategies for memory-bounded inference, such as quantization, offloading, or heuristic KV eviction, either incur high orchestration costs or rely on unreliable attention-based proxies of importance. We propose TRIM-KV, a novel approach that learns each token’s intrinsic importance at creation time via a lightweight retention gate. Each gate predicts a scalar retention score that decays over time, reflecting the long-term utility of the token for a specific layer and head. Tokens with low scores are evicted when the memory budget is exceeded, ensuring that the cache always contains the most critical tokens. TRIM-KV is trained efficiently through distillation from a frozen LLM combined with a capacity loss, requiring only gate fine-tuning and adding negligible inference overhead. Across mathematical reasoning (GSM8K, MATH-500, AIME24), procedural generation (LongProc), conversational long-memory benchmarks (LongMemEval), and long-context understanding (LongBenchV2 and SCBench), TRIM-KV consistently outperforms strong eviction and learnable retrieval baselines, especially in low-memory regimes. Remarkably, it even surpasses full-cache models in some settings, showing that selective retention can serve as a form of regularization, suppressing noise from uninformative tokens. Qualitative analyses further reveal that learned retention scores align with human intuition, naturally recovering heuristics such as sink tokens, sliding windows, and gist compression without explicit design. Beyond efficiency, retention scores provide insights into layer- and head-specific roles, suggesting a new path toward LLM interpretability.
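The eviction policy described above can be sketched with a toy cache. The decay form is an assumption: each token gets a gate output `beta` in (0, 1) and its effective score after `age` steps is `beta ** age`. The abstract states only that the score decays over time. In TRIM-KV the gate is learned per layer and head, whereas here the `beta` values are supplied by hand and `RetentionCache` is a hypothetical name.

```python
from typing import List, Tuple


class RetentionCache:
    """Toy KV cache that evicts the token with the lowest decayed retention score."""

    def __init__(self, budget: int) -> None:
        self.budget = budget
        self.step = 0
        self.entries: List[Tuple[str, float, int]] = []  # (token, beta, created_at)

    def effective_score(self, entry: Tuple[str, float, int]) -> float:
        """Assumed exponential decay: beta raised to the token's age."""
        _, beta, created_at = entry
        return beta ** (self.step - created_at)

    def add(self, token: str, beta: float) -> None:
        """Insert a token, then evict until the cache fits the budget."""
        self.entries.append((token, beta, self.step))
        self.step += 1
        while len(self.entries) > self.budget:
            victim = min(self.entries, key=self.effective_score)
            self.entries.remove(victim)

    def tokens(self) -> List[str]:
        return [token for token, _, _ in self.entries]


cache = RetentionCache(budget=3)
for token, beta in [("<s>", 0.99), ("the", 0.5), ("answer", 0.9), ("is", 0.6), ("42", 0.95)]:
    cache.add(token, beta)
remaining = cache.tokens()
```

With these hand-picked values the cache first drops "the" and then "is", keeping the start token and the two content tokens.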
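Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#LLM #multi-LLM #multi-agent #communication
TL;DR: We enable LLMs to communicate directly through their internal KV-Cache representations, rather than generating text.
🎯 Motivation: In existing multi-LLM systems the models communicate through text, which loses semantic information and incurs token-by-token generation latency.
❓ Problem: A paradigm in which LLMs exchange semantics directly through the KV-Cache, avoiding the information loss and time cost of intermediate text generation.
🔍 Key observation: Oracle experiments show that enriching KV-Cache semantics improves response quality without increasing cache size, supporting the KV-Cache as a medium for inter-model communication.
🛠️ Method: Cache-to-Cache (C2C) uses a neural network to project the source model's KV-cache and fuse it with the target model's; a learnable gating mechanism selects the target layers that benefit from cache communication.
📊 Data and experiments: C2C achieves 6.4-14.2% higher average accuracy than individual models, outperforms text communication by approximately 3.1-5.4%, and delivers an average 2.5× speedup in latency.
⭐ Contributions: A new paradigm for direct semantic communication between LLMs through the KV-Cache, improving both accuracy and latency, with open-source code.
Abstract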
Multi-LLM systems harness the complementary strengths of diverse Large Language Models, achieving performance and efficiency gains that are not attainable by a single model. In existing designs, LLMs communicate through text, forcing internal representations to be transformed into output token sequences. This process both loses rich semantic information and incurs token-by-token generation latency. Motivated by these limitations, we ask: Can LLMs communicate beyond text? Oracle experiments show that enriching the KV-Cache semantics can improve response quality without increasing cache size, supporting KV-Cache as an effective medium for inter-model communication. Thus, we propose Cache-to-Cache (C2C), a new paradigm for direct semantic communication between LLMs. C2C uses a neural network to project and fuse the source model's KV-cache with that of the target model to enable direct semantic transfer. A learnable gating mechanism selects the target layers that benefit from cache communication. Compared with text communication, C2C utilizes the deep, specialized semantics from both models, while avoiding explicit intermediate text generation. Experiments show that C2C achieves 6.4-14.2% higher average accuracy than individual models. It further outperforms the text communication paradigm by approximately 3.1-5.4%, while delivering an average 2.5x speedup in latency. Our code is available at https://github.com/thu-nics/C2C.
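Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Large language models #Speculative sampling #Auto-regressive generation
TL;DR: We propose Cactus, a speculative sampling method that guarantees controlled divergence from the verifier distribution while increasing throughput.
🎯 Motivation: Speculative sampling strictly enforces that the generated distribution match the verifier's, which is unnecessarily restrictive: slight variations such as top-k or temperature sampling would also be acceptable.
❓ Problem: Raise acceptance rates while keeping divergence from the verifier distribution under control, avoiding the distortion and quality loss of heuristic relaxations.
🔍 Key observation: Typical acceptance sampling (TAS) accepts more tokens using entropy-based heuristics, but it distorts the verifier distribution and can degrade output quality when the verifier encodes critical information.
🛠️ Method: Cactus (constrained acceptance speculative sampling) formalizes speculative sampling as a constrained optimization problem and derives an acceptance rule with controlled divergence from the verifier distribution and higher acceptance rates.
📊 Data and experiments: Empirical results across a wide range of benchmarks confirm the effectiveness of the approach.
⭐ Contributions: A constrained-optimization formulation of speculative sampling and a method built on it that balances throughput against output quality.
Abstract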
Speculative sampling (SpS) has been successful in accelerating the decoding throughput of auto-regressive large language models by leveraging smaller draft models. SpS strictly enforces the generated distribution to match that of the verifier LLM. This is unnecessarily restrictive as slight variations of the verifier's distribution, such as sampling with top-$k$ or temperature, would also be acceptable. Typical acceptance sampling (TAS) alleviates this issue by accepting more tokens using entropy-based heuristics. However, this approach distorts the verifier distribution, potentially degrading output quality when the verifier encodes critical information. In this work, we formalize the speculative sampling algorithm through the lens of constrained optimization. Based on this formulation, we propose Cactus (**c**onstrained **ac**cep**t**ance spec**u**lative **s**ampling), a method that guarantees controlled divergence from the verifier distribution while increasing acceptance rates. Empirical results across a wide range of benchmarks confirm the effectiveness of our approach.
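For reference, the strict baseline that Cactus relaxes is the standard speculative sampling rule: accept a drafted token with probability min(1, p/q), and on rejection resample from the normalized positive part of p - q. The sketch below shows that baseline only, not the Cactus acceptance rule; `p` is the verifier distribution, `q` the draft distribution, and the function names are hypothetical.

```python
from typing import List


def acceptance_probability(p: List[float], q: List[float], token: int) -> float:
    """Probability of accepting a token that was sampled from the draft q."""
    return min(1.0, p[token] / q[token])


def residual_distribution(p: List[float], q: List[float]) -> List[float]:
    """Distribution to resample from after a rejection: normalize max(0, p - q)."""
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0.0:
        return list(p)  # p equals q, so a rejection cannot actually occur
    return [r / total for r in residual]


def verify_token(p: List[float], q: List[float], token: int, u: float) -> bool:
    """Accept the drafted token when the uniform draw u falls below the threshold."""
    return u < acceptance_probability(p, q, token)


verifier = [0.5, 0.5]
draft = [0.9, 0.1]
accept_0 = acceptance_probability(verifier, draft, 0)
accept_1 = acceptance_probability(verifier, draft, 1)
fallback = residual_distribution(verifier, draft)
```

Token 0 is over-proposed by the draft, so it is accepted only 5/9 of the time, and every rejection resamples token 1. Together these two steps make the output follow the verifier distribution exactly, which is the constraint the abstract calls unnecessarily restrictive.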
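Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Mixture of Experts #Load Balancing #Computation Efficiency
TL;DR: We propose capacity-aware inference to mitigate imbalanced token assignments during inference, significantly improving efficiency without compromising performance.
🎯 Motivation: Under expert parallelism, Mixture-of-Experts (MoE) inference suffers from imbalanced token-to-expert assignment: underloaded experts finish early but must wait for overloaded ones, so the most burdened experts dictate overall latency (the "Straggler Effect").
❓ Problem: Mitigate the Straggler Effect with capacity-aware inference that rebalances token assignment, improving efficiency with minimal impact on model performance.
🔍 Key observation: During MoE inference the number of tokens routed to each expert varies widely, so some experts are overloaded while others sit well below capacity, and the overloaded experts' compute time becomes the bottleneck.
🛠️ Method: Capacity-Aware Token Drop enforces a capacity limit per expert by discarding excess tokens from overloaded experts; Capacity-Aware Expanded Drop additionally lets tokens include more local experts in their candidate set before strict local capacity constraints are applied, raising utilization of underused experts.
📊 Data and experiments: Evaluated on language and multimodal MoE models. Token Drop gives a 30% speedup with only 0.9% degradation on OLMoE; Expanded Drop on Mixtral-8×7B-Instruct gives a 0.2% average performance improvement and a 1.85× inference speedup.
⭐ Contributions: Identifies and defines the Straggler Effect in MoE inference, proposes capacity-aware inference to address it, and releases the code.
Abstract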
The Mixture of Experts (MoE) is an effective architecture for scaling large language models by leveraging sparse expert activation to balance performance and efficiency. However, under expert parallelism, MoE suffers from inference inefficiencies due to imbalanced token-to-expert assignment, where underloaded experts complete computations early but must wait for overloaded experts, leading to global delays. We define this phenomenon as the **Straggler Effect**, as the most burdened experts dictate the overall inference latency. To address this, we first propose **Capacity-Aware Token Drop**, which enforces expert capacity limits by discarding excess tokens from overloaded experts, effectively reducing load imbalance with minimal performance impact (e.g., 30% speedup with only 0.9% degradation on OLMoE).
Next, given the presence of low-load experts remaining well below the capacity threshold, we introduce **Capacity-Aware Expanded Drop**, which allows tokens to include additional local experts in their candidate set before enforcing strict local capacity constraints, thereby improving load balance and enhancing the utilization of underused experts.
Extensive experiments on both language and multimodal MoE models demonstrate the effectiveness of our approach, yielding substantial gains in expert utilization, model performance, and inference efficiency, e.g., applying Expanded Drop to Mixtral-8×7B-Instruct yields a 0.2% average performance improvement and a 1.85× inference speedup. The code is released at: https://github.com/CASE-Lab-UMD/Capacity-Aware-MoE.
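A minimal sketch of the token-drop idea follows: cap the number of tokens each expert processes and discard the excess. The abstract does not say which tokens are discarded, so dropping those with the lowest gate score is an assumption. The function name and the tuple layout are hypothetical, and Expanded Drop is not shown.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def capacity_aware_token_drop(
    assignments: List[Tuple[int, int, float]], capacity: int
) -> Dict[int, List[int]]:
    """Keep at most `capacity` tokens per expert, preferring higher gate scores.

    Each assignment is (token_id, expert_id, gate_score).
    """
    per_expert: Dict[int, List[Tuple[float, int]]] = defaultdict(list)
    for token_id, expert_id, score in assignments:
        per_expert[expert_id].append((score, token_id))

    kept: Dict[int, List[int]] = {}
    for expert_id, items in per_expert.items():
        items.sort(reverse=True)  # highest gate score first
        kept[expert_id] = sorted(token_id for _, token_id in items[:capacity])
    return kept


# Expert 0 receives four tokens, expert 1 only one: a straggler in the making.
routing = [(0, 0, 0.9), (1, 0, 0.2), (2, 0, 0.7), (3, 1, 0.8), (4, 0, 0.5)]
balanced = capacity_aware_token_drop(routing, capacity=2)
max_load = max(len(tokens) for tokens in balanced.values())
```

After the drop no expert processes more than two tokens, so the slowest expert bounds the step time at the capacity rather than at the worst-case load.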
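Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#test-time training #self-distillation
TL;DR: We show how to use self-distillation to reduce long context memory consumption.
🎯 Motivation: Answering queries grounded in a large text corpus by placing the whole corpus in the context window is costly to serve, because KV-cache memory grows linearly with input length.
❓ Problem: Train a smaller KV cache offline per corpus (a Cartridge) to reduce long-context memory consumption while maintaining model quality.
🔍 Key observation: Training a Cartridge naively with next-token prediction on the corpus is not competitive with in-context learning (ICL).
🛠️ Method: Self-study, a training recipe that generates synthetic conversations about the corpus and trains the Cartridge with a context-distillation objective so that it replicates the functionality of ICL.
📊 Data and experiments: On challenging long-context benchmarks, Cartridges trained with self-study match ICL performance while using 38.6× less memory and enabling 26.4× higher throughput.
⭐ Contributions: Substantially lowers long-context serving cost, extends effective context length (e.g. from 128k to 484k tokens on MTOB), and yields Cartridges that can be composed at inference time without retraining.
Abstract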
Large language models are often used to answer queries grounded in large text corpora (e.g. codebases, legal documents, or chat histories) by placing the entire corpus in the context window and leveraging in-context learning (ICL). Although current models support contexts of 100K-10M tokens, this setup is costly to serve because the memory consumption of the KV cache scales with input length. We explore an alternative: training a smaller KV cache offline on each corpus. At inference time, we load this trained KV cache, which we call a Cartridge, and decode a response. Critically, the cost of training a Cartridge can be amortized across all the queries referencing the same corpus. However, we find that the naive approach of training the Cartridge with next-token prediction on the corpus is not competitive with ICL. Instead, we propose self-study, a training recipe in which we generate synthetic conversations about the corpus and train the Cartridge with a context-distillation objective. We find that Cartridges trained with self-study replicate the functionality of ICL, while being significantly cheaper to serve. On challenging long-context benchmarks, Cartridges trained with self-study match ICL performance while using 38.6x less memory and enabling 26.4x higher throughput. Self-study also extends the model's effective context length (e.g. from 128k to 484k tokens on MTOB) and surprisingly, leads to Cartridges that can be composed at inference time without retraining.
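The memory pressure that motivates Cartridges follows from simple arithmetic: a standard KV cache stores one key and one value vector per layer, per KV head, per token. The sketch below evaluates that formula for a hypothetical configuration (32 layers, 8 KV heads, head dimension 128, 16-bit values), which is an assumption and not a model taken from the paper.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,
) -> int:
    """Size of a dense KV cache: keys plus values for every layer, head, and token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Hypothetical model configuration, used only for illustration.
per_token = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=1)
long_context = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=128_000)
long_context_gib = long_context / 2 ** 30
```

At 128 KiB per token, a 128k-token corpus occupies 15.625 GiB for a single request under these assumptions.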
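Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Large Language Models #Long Context #Efficiency
🎯 Motivation: The KV cache grows linearly with sequence length, creating significant memory pressure in long-context inference; quantization improves memory efficiency, but low-bit configurations usually cause severe performance degradation.
❓ Problem: Address the degradation of low-bit KV-cache quantization with a channel-aware mixed-precision quantization framework for long-context inference.
🔍 Key observation: Quantization sensitivity varies across individual KV channels, which creates an opportunity for non-uniform bit allocation.
🛠️ Method: ChanMix combines a channel-aware bit reallocation strategy with channel-wise quantization in the 2-bit setting, implemented with custom Triton kernels.
📊 Data and experiments: On the NIAH, RULER, and InfiniteBench benchmarks with the Llama, Mistral, and Qwen model families, ChanMix improves on all baselines by at least 5 absolute percentage points on RULER, while enabling a 2.3× larger batch size and a 1.5× longer context length at inference.
⭐ Contributions: ChanMix, a channel-aware mixed-precision KV-cache quantization framework for long-context inference, with code released.
Abstract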
The key-value (KV) cache plays a vital role in accelerating autoregressive inference for large language models (LLMs). However, its linear memory growth with sequence length poses significant memory bottlenecks, especially in long-context scenarios.
Quantization offers a promising solution for memory efficiency. While existing methods typically apply channel-wise quantization to the key cache and token-wise quantization to the value cache, they suffer from severe performance degradation under low-bit configurations.
Our analysis reveals that quantization sensitivity varies across individual KV channels, presenting an opportunity for non-uniform bit allocation. Following this finding, we propose ChanMix, a mixed-precision quantization framework that supports channel-wise quantization in the 2-bit setting, implemented with custom Triton kernels. To improve low-bit quantization performance, we introduce a channel-aware bit reallocation strategy, which allocates bits according to channel sensitivity.
Through extensive evaluation, ChanMix demonstrates superior performance across the NIAH, RULER, and InfiniteBench benchmarks for the Llama, Mistral, and Qwen model families, achieving improvements of at least 5 absolute percentage points on RULER compared to all baseline methods. Additionally, ChanMix enables a 2.3× increase in batch size and supports a 1.5× longer context length during inference.
Our code is available at https://github.com/cxiliao/ChanMix.
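Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Mixture-of-experts #quantization
🎯 Motivation: Outliers are a fundamental bottleneck for preserving accuracy in low-precision large models, particularly in Mixture-of-Experts (MoE) architectures.
❓ Problem: Under post-training quantization (PTQ), outliers induce substantial quantization error; the work reduces their influence with a unified quantization-and-clustering scheme to make low-precision deployment reliable.
🔍 Key observation: Rotation-based smoothing alleviates the problem by redistributing outlier magnitudes, but residual errors remain and still impede accuracy.
🛠️ Method: CodeQuant smooths activation outliers with a learnable rotation and absorbs weight outliers into fine-tuned cluster centroids, lowering quantization error while maintaining expressive capacity; dedicated GPU and CPU kernels provide the speedup.
📊 Data and experiments: Across diverse MoE models, CodeQuant achieves up to 4.15× speedup while delivering significantly higher accuracy than state-of-the-art quantization approaches.
⭐ Contributions: CodeQuant, a unified quantization-and-clustering scheme for outlier handling in MoE models under low-precision constraints, with code released.
Abstract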
Outliers have emerged as a fundamental bottleneck in preserving accuracy for low-precision large models, particularly within Mixture-of-Experts (MoE) architectures that are increasingly central to large-scale language modeling. Under post-training quantization (PTQ), these outliers induce substantial quantization errors, leading to severe accuracy degradation. While recent rotation-based smoothing techniques alleviate the problem by redistributing outlier magnitudes, residual errors remain and continue to impede reliable low-precision deployment.
In this work, we tackle this challenge by introducing CodeQuant, a unified quantization-and-clustering scheme that combines smoothing activation outliers via learnable rotation with absorbing weight outliers into fine-tuned cluster centroids for MoE. This design reduces the influence of extreme values by fitting them within cluster centroids, thereby lowering quantization error while maintaining expressive capacity. Coupled with a dedicated kernel design for GPU and CPU, CodeQuant achieves up to $4.15\times$ speedup while delivering significantly higher accuracy than state-of-the-art quantization approaches across diverse MoE models. Our results highlight CodeQuant as a promising direction for efficient and accurate deployment of MoE-based large language models under low-precision constraints. Our code is available at https://github.com/SAI-Lab-NYU/CodeQuant.
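Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#quantization-aware training #QAT #neural network quantization #compute optimization #scaling laws #large language models #LLMs #model compression #compute budget allocation #training efficiency #model optimization #quantized neural networks #efficient deep learning
TL;DR: The optimal fraction of quantization-aware training compute (vs. pretrain stage) increases with total compute budget. We derive scaling laws to predict optimal allocation and model loss, enabling higher-quality model training with the same compute.
🎯 Motivation: Quantization-aware training (QAT) is a leading technique for improving the accuracy of quantized networks, but how to split compute between the full-precision (FP) phase and the QAT phase is unclear.
❓ Problem: Predict the loss-optimal split between QAT and FP training at a given compute budget.
🔍 Key observation: Contrary to previous findings, the loss-optimal ratio of QAT to FP training increases with total compute, and the optimal fraction can be predicted accurately from the tokens-per-parameter-byte statistic.
🛠️ Method: Derive a loss scaling law over QAT/FP allocation strategies and QAT bit widths, and propose a cooldown-and-QAT fusion approach that performs learning-rate decay jointly with QAT, eliminating redundant full-precision updates.
📊 Data and experiments: Extensive experiments over compute budgets, QAT bit widths, and model sizes from 86.0M to 2.2B parameters; predictions from the scaling law, including which bit width is optimal under a memory constraint, are verified experimentally.
⭐ Contributions: A loss scaling law for QAT, a compute-allocation recipe, and a cooldown-QAT fusion technique, enabling higher-quality quantized models at the same compute budget.
Abstract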
Quantization-aware training (QAT) is a leading technique for improving the accuracy of quantized neural networks. Previous work has shown that decomposing training into a full-precision (FP) phase followed by a QAT phase yields superior accuracy compared to QAT alone. However, the optimal allocation of compute between the FP and QAT phases remains unclear. We conduct extensive experiments with various compute budgets, QAT bit widths, and model sizes from 86.0M to 2.2B to investigate how different QAT durations impact final performance. We demonstrate that, contrary to previous findings, the loss-optimal ratio of QAT to FP training increases with the total amount of compute. Moreover, the optimal fraction can be accurately predicted for a wide range of model sizes and quantization widths using the tokens-per-parameter-byte statistic. From experimental data, we derive a loss scaling law that predicts both optimal QAT ratios and final model performance across different QAT/FP compute allocation strategies and QAT bit widths. We use the scaling law to make further predictions, which we verify experimentally, including which QAT bit width is optimal under a given memory constraint and how QAT accuracy with different bit widths compares to full-precision model accuracy. Additionally, we propose a novel cooldown and QAT fusion approach that performs learning rate decay jointly with quantization-aware training, eliminating redundant full-precision model updates and achieving significant compute savings. These findings provide practical insights into efficient QAT planning and enable the training of higher-quality quantized models with the same compute budget.
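Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Distributed Training #Foundation Models #Large Language Models #Optimizers #Communication Efficiency #Federated Learning #Distributed Systems #Optimization Theory #Scaling #Robustness
TL;DR: We propose provably convergent local adaptive optimizers with decoupled sync frequencies, empirically reducing communication 170x vs. DDP and 2x vs. Local Adam (prior SOTA), reducing time by 1.3x-2.1x, validated up to billion-scale models.
🎯 Motivation: Scaling foundation-model training with Distributed Data Parallel (DDP) is bandwidth-limited, and existing infrequent-communication methods cannot be trivially applied to adaptive optimizers or lack convergence guarantees.
❓ Problem: A low-communication adaptive optimizer, DES-LOC, that decouples synchronization periods to reduce communication while preserving convergence and stability.
🔍 Key observation: Local SGD synchronizes only model parameters and does not handle optimizer states; Local Adam synchronizes all states uniformly and is provably convergent but triples communication cost; more frequent momentum synchronization permits larger stable step sizes.
🛠️ Method: Assign independent synchronization periods to parameters and momenta; the analysis gives in-expectation and high-probability convergence guarantees and characterizes the stable step-size range.
📊 Data and experiments: On language models of up to 1.7B parameters, DES-LOC communicates 170× less than DDP and 2× less than Local Adam, with 1.3×-2.1× wall-clock speedups over DDP on 100Gb/s links, and it is robust to worker failures.
⭐ Contributions: DES-LOC, a family of optimizers that lowers distributed communication cost and improves training efficiency, with theory and experiments supporting scalability and fault tolerance.
Abstract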
Scaling foundation model training with Distributed Data Parallel (DDP) methods is bandwidth-limited.
Existing infrequent communication methods like Local SGD were designed to synchronize model parameters only and cannot be trivially applied to adaptive optimizers due to additional optimizer states.
Heuristic approaches that keep states local or reset them lack guarantees and can be unstable in compute‑efficient batch regimes; conversely, Local Adam synchronizes all states uniformly and is provably convergent but triples communication costs.
We propose Desynced Low Communication Adaptive Optimizers (DES-LOC), a family of optimizers assigning independent synchronization periods to parameters and momenta, enabling lower communication costs while preserving convergence. Our theoretical analysis shows that while parameter synchronization dominates the asymptotic rate in expectation, high-probability convergence guarantees require at least infrequent synchronization of the second momentum. Furthermore, we prove that more frequent momentum sync permits larger stable step sizes. Experiments on language models of up to 1.7B parameters show that DES-LOC can communicate 170x less than DDP and 2x less than the previous state-of-the-art Local Adam, enabling 1.3x–2.1x wall-clock speedups over DDP for 1-13B models on 100Gb/s links. Furthermore, unlike previous heuristic methods, DES-LOC is robust to worker failures, offering a scalable, efficient, and fault-tolerant solution for foundation model training.
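A back-of-the-envelope sketch of why decoupled periods reduce traffic. It assumes each synchronized state (parameters, first momentum, second momentum) is the same size as one DDP gradient all-reduce; the function names and the example periods are hypothetical and not taken from the paper.

```python
def relative_comm(period_params: int, period_m1: int, period_m2: int) -> float:
    """Average tensors sent per step, in units of one DDP gradient all-reduce."""
    return 1.0 / period_params + 1.0 / period_m1 + 1.0 / period_m2


def states_to_sync(step: int, periods: dict) -> list:
    """Names of the states whose sync period divides the current step."""
    return [name for name, period in periods.items() if step > 0 and step % period == 0]


# Local Adam syncs all three states with one shared period K, so it costs 3/K.
local_adam = relative_comm(64, 64, 64)
# Decoupled periods (illustrative values): momenta are synced less often.
decoupled = relative_comm(64, 128, 256)
```

With a shared period the cost is three times that of syncing parameters alone, which matches the "triples communication costs" remark about Local Adam; giving the momenta longer periods shrinks the two extra terms.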
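Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Large Language Model
TL;DR: We introduce Dynamic Nested Depth (DND), an efficient paradigm that adaptively identifies critical tokens and selectively deepens their computation via nested re-processing.
🎯 Motivation: Current LLMs do not spend enough computation on critical, difficult tokens during inference, which limits their effectiveness.
❓ Problem: Proposes Dynamic Nested Depth (DND), which selectively reprocesses critical tokens to improve model performance.
🔍 Analysis: Redundant computation on easy tokens wastes resources, while insufficient processing of difficult tokens hurts final performance.
🛠️ Method: Uses a router and a dynamic threshold mechanism to identify critical tokens and give them extra processing depth, providing precise compute allocation and stable selection.
📊 Data & Experiments: Across diverse benchmarks, DND improves pre-trained dense and MoE models by 0.87% to 2.61% with a minimal increase in compute.
⭐ Contributions: Proposes a new paradigm for LLMs that dynamically allocates inference computation, improving performance and computational efficiency, and validates the method's generality and transferability.
Abstract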
We introduce Dynamic Nested Depth (DND), a novel method that improves performance for off-the-shelf LLMs by selecting critical tokens to reprocess in a nested depth manner. Specifically, at the end of the given transformer layer, DND identifies more critical tokens with a router and feeds them back for an extra round of processing, effectively "reviewing" difficult tokens while avoiding redundant computation for easier ones. The dynamic selection mechanism is tailored for precise control via two novel strategies: a router controlling loss to enhance token selection distinguishability, and a threshold control scheme to ensure selection stability. We demonstrate the effectiveness of DND by directly integrating it into pre-trained dense and MoE models during a post-training phase. On diverse benchmarks, DND boosts the performance of the dense Qwen3-1.7B, Llama3.2-1B, and Gemma3-1B by 1.88%, 2.61%, and 2.50% and the MoE Qwen3-30B-A3B by 0.87%, all with a minimal parameter and computing increase.
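The nested re-processing idea can be sketched on a toy layer. Tokens are single floats here, and `layer`, `router`, and `threshold` are hypothetical stand-ins; the real method operates on hidden-state vectors with a learned router and a controlled threshold.

```python
from typing import Callable, List


def nested_depth_layer(
    hidden: List[float],
    layer: Callable[[float], float],
    router: Callable[[float], float],
    threshold: float,
) -> List[float]:
    """Run `layer` on every token, then once more on tokens the router flags."""
    first_pass = [layer(h) for h in hidden]
    output = []
    for h in first_pass:
        if router(h) > threshold:
            output.append(layer(h))  # critical token: one extra round of processing
        else:
            output.append(h)  # easy token: no redundant computation
    return output
```

The router is applied to the layer output, following the abstract's description that selection happens at the end of the layer.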
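Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Diffusion-based Large Language Models #Model Optimization and Efficiency #Token Pruning #Model Explainability
TL;DR: DPad is a training-free inference method that optimizes diffusion-based large language models (dLLMs) by refining their inherent Scratchpad Mechanism; it drops redundant suffix tokens to yield significant speedups while maintaining model accuracy.
🎯 Motivation: Diffusion-based LLMs (dLLMs) parallelize text generation by treating decoding as denoising, but predicting all future suffix tokens at every step is computationally expensive and poorly utilized.
❓ Problem: Design a training-free method that reduces redundant suffix computation while balancing generation efficiency and accuracy.
🔍 Analysis: At each decoding step dLLMs retain only a small amount of useful suffix information; most suffix tokens are redundant and waste computation.
🛠️ Method: Proposes DPad, which combines a fixed-length sliding window with distance-decay dropout that deterministically removes distant suffix tokens before attention, cutting computation on redundant suffix tokens.
📊 Data & Experiments: Evaluated on multiple benchmarks with the LLaDA and Dream models; DPad delivers up to 61.4× inference speedup over vanilla dLLMs while maintaining comparable accuracy.
⭐ Contributions: Proposes DPad, a lightweight, training-free inference optimization that sharply lowers the computational cost of dLLMs and offers an efficient, scalable route to long-sequence inference.
Abstract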
Diffusion-based Large Language Models (dLLMs) parallelize text generation by framing decoding as a denoising process, but suffer from high computational overhead since they predict all future suffix tokens at each step while retaining only a small fraction. We propose **Diffusion Scratchpad (DPad)**, a training-free method that restricts attention to a structured subset of suffix tokens, preserving fidelity while eliminating redundancy. DPad integrates two strategies: (i) a *sliding window*, which maintains a fixed-length suffix window, and (ii) *distance-decay dropout*, which deterministically removes distant suffix tokens before attention computation. This concise design is compatible with existing optimizations such as parallel decoding and prefix caching, and lends itself to a lightweight implementation. Comprehensive evaluations across multiple benchmarks on LLaDA and Dream models demonstrate that DPad delivers up to **61.4×** speedup over vanilla dLLMs while maintaining comparable accuracy, highlighting its potential for efficient and scalable long-sequence inference.
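One way to picture the two strategies together is a rule that picks which suffix offsets stay visible to attention. The specific rule below (dense near the current block, then geometrically growing gaps, all inside a fixed window) is a hypothetical stand-in chosen for illustration; the abstract does not give DPad's actual decay schedule.

```python
def kept_suffix_offsets(num_suffix: int, window: int, dense: int, growth: int = 2) -> list:
    """Suffix offsets to keep (0 is the first token after the current block).

    Offsets below `dense` are all kept; beyond that the gap between kept
    offsets multiplies by `growth` each time; nothing at or past `window` is kept.
    """
    limit = min(num_suffix, window)  # sliding window: fixed-length suffix view
    kept = []
    offset, stride = 0, 1
    while offset < limit:
        kept.append(offset)
        if offset + 1 >= dense:
            stride *= growth  # distance decay: sparser the farther we go
        offset += stride
    return kept


# Illustrative values: 100 suffix tokens, window of 32, first 4 kept densely.
example = kept_suffix_offsets(100, 32, 4)
```

Because the rule is deterministic, the kept set can be computed once per step before attention, which fits the abstract's remark that the design composes with prefix caching and parallel decoding.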
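Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Mamba #Multimodal Large Language Models #Token Pruning #Efficiency #Interpretability
🎯 Motivation: Mamba-based multimodal LLMs offer efficiency advantages, but redundant visual tokens still inflate inference cost, with the prefill stage accounting for most of the inference time. Existing pruning methods are mostly designed for Transformers and do not exploit Mamba's internal properties, which limits the balance between efficiency and performance.
❓ Problem: Proposes DTP, a Delta-guided two-stage pruning method that selectively removes redundant visual tokens to lower inference cost, especially prefill latency. It does not rely on Transformer attention; instead it uses Mamba's own parameters and implicit attention patterns.
🔍 Analysis: Redundant visual tokens in multimodal Mamba models increase the computational burden, and prefill is the inference bottleneck. Statistical analysis and experiments show that parameters inside Mamba layers naturally reflect the distribution of token importance, giving pruning a new, interpretable perspective.
🛠️ Method: DTP prunes in two stages: selective pruning at an early layer and complete pruning at a late layer. Token importance scores come directly from Mamba's internal parameters and, combined with implicit attention patterns, determine both the pruning layers and the tokens to remove.
📊 Data & Experiments: Evaluated on diverse benchmarks; DTP cuts computation by nearly 50%, maintains higher task performance than existing pruning methods, and reduces prefill latency by over 35%.
⭐ Contributions: Proposes DTP, a pruning method designed specifically for Mamba-based multimodal models that balances efficiency and performance. Reveals underexplored behaviors of visual tokens within Mamba layers, offering a principled perspective for future Mamba-based pruning techniques.
Abstract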
Multimodal large language models built on the Mamba architecture offer efficiency advantages, yet remain hampered by redundant visual tokens that inflate inference cost, with the prefill stage accounting for the majority of total inference time. We introduce Delta-guided Two-stage Pruning (DTP), a method that progressively reduces token redundancy through selective pruning at an early layer and complete pruning at a late layer. Unlike Transformer-oriented pruning methods, our approach derives token importance directly from Mamba’s internal parameters. The statistical distribution of these importance scores, combined with implicit attention patterns, then provides the basis for determining both the pruning layers and the tokens to be removed. Extensive evaluation across diverse benchmarks shows that DTP cuts computation by nearly 50%, maintains higher task performance than existing pruning methods, and further achieves over a 35% reduction in prefill latency. Beyond efficiency, our analysis reveals previously underexplored behaviors of visual tokens within Mamba layers, suggesting a principled perspective for designing future pruning techniques in Mamba-based Multimodal Large Language Models.
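The two-stage schedule can be sketched as a function that reports which tokens survive at a given layer. The scoring input, the assumption that a larger score means a more important token, and all names and parameters are hypothetical; the abstract only says that scores come from Mamba's internal parameters.

```python
def surviving_tokens(scores, is_visual, layer_idx, early_layer, late_layer, keep_ratio):
    """Indices of tokens still present at `layer_idx` under a two-stage schedule.

    Before `early_layer`: everything is kept.
    From `early_layer`: only the top `keep_ratio` of visual tokens by score survive.
    From `late_layer`: all visual tokens are removed; text tokens always survive.
    """
    indices = range(len(scores))
    visual = [i for i in indices if is_visual[i]]
    text = [i for i in indices if not is_visual[i]]
    if layer_idx >= late_layer:
        return text  # complete pruning of visual tokens
    if layer_idx >= early_layer:
        k = max(1, int(keep_ratio * len(visual)))
        top = sorted(visual, key=lambda i: scores[i], reverse=True)[:k]
        return sorted(text + top)  # selective pruning
    return list(indices)
```

A real implementation would carry the pruned sequence forward from layer to layer; this stateless version only shows the schedule.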
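Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Efficient AI #Large Language Model; LLM Inference
TL;DR: This work exposes the inherent fragility of stability assumptions in KV cache eviction methods and introduces defensive aggregation to counter this issue, reducing quality loss by over 4x compared to leading methods.
🎯 Motivation: LLM inference efficiency is limited by the memory and runtime overhead of the Key-Value cache, so cache eviction needs to improve in order to reduce quality loss.
❓ Problem: Existing eviction methods built on the stability assumption are fragile in extreme cases, causing notable drops in generation quality.
🔍 Analysis: Eviction methods rely on the stability assumption, but the assumption itself fails easily, leaving the default mean aggregation unable to cope with extreme cases.
🛠️ Method: Proposes a defensive aggregation strategy, a two-step, linear-time approach that controls worst-case risk, and extends it to Layer-DefensiveKV with layer-wise budget allocation.
📊 Data & Experiments: Across 18 datasets in seven task domains, under a 20% cache size, DefensiveKV and Layer-DefensiveKV reduce generation quality loss by 2.3× and 4.3× respectively versus the strongest baseline.
⭐ Contributions: Exposes the fragility of the stability assumption, proposes eviction methods that defend against extreme cases and clearly improve quality under a fixed cache budget, and points to worst-case risk management as a direction for cache optimization.
Abstract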
Large language models have revolutionized natural language processing, yet their deployment remains hampered by the substantial memory and runtime overhead of the transformer’s Key-Value cache. To mitigate this, recent methods employ a scoring-aggregation framework to evict unimportant cache entries, based on the "stability assumption" that a fixed subset of entries remains consistently important during generation. However, prior work has largely focused on refining importance indicators for scoring, while defaulting to mean aggregation out of trust in the stability assumption. In this work, we argue that this underlying assumption is inherently fragile, making mean aggregation highly vulnerable in extreme cases. To counter this, we propose a simple yet elegant defensive aggregation strategy: a two-step, linear-time approach that controls worst-case risk, thereby defending against extreme cases with negligible computational overhead. Embodying this strategy, we propose a novel cache eviction method, DefensiveKV, and its extension, Layer-DefensiveKV, which incorporates layer-wise budget allocation. Across seven task domains (18 datasets), our methods reduce generation quality loss by 2.3× and 4.3× respectively, versus the strongest baseline under a 20% cache size. These results set new performance benchmarks and pioneer a promising direction for optimizing cache eviction against underlying fragility through worst-case risk management. Our code is available at https://github.com/FFY0/DefensiveKV.
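To see why mean aggregation is vulnerable, the toy sketch below compares it with a max-based aggregation over a short window of observed importance scores. Using `max` as the worst-case-aware aggregator is an assumption made for illustration; the abstract does not specify DefensiveKV's two steps, and the function name and data are hypothetical.

```python
def select_entries(score_history, budget, defensive=True):
    """Pick which cache entries to keep.

    score_history[t][i] is the importance of cache entry i at observation step t.
    Mean aggregation trusts that importance is stable over time; max aggregation
    keeps any entry that was ever highly important.
    """
    num_entries = len(score_history[0])
    aggregated = []
    for i in range(num_entries):
        column = [row[i] for row in score_history]
        aggregated.append(max(column) if defensive else sum(column) / len(column))
    ranked = sorted(range(num_entries), key=lambda i: aggregated[i], reverse=True)
    return sorted(ranked[:budget])


# Entry 2 is unimportant for two steps, then suddenly critical.
history = [[0.5, 0.4, 0.0], [0.5, 0.4, 0.0], [0.5, 0.4, 0.9]]
mean_kept = select_entries(history, budget=2, defensive=False)
defensive_kept = select_entries(history, budget=2, defensive=True)
```

Mean aggregation evicts entry 2 because its average is low, even though it dominates at the latest step; the max-based variant retains it.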
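Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Diffusion Large Language Models #Discrete Diffusion Models #Inference Acceleration #KV Cache #AR-Diffusion Hybrid
🎯 Motivation: Diffusion LLMs (dLLMs) show promise for text generation, but existing open-source dLLMs still run slower at inference than autoregressive (AR) models of similar size.
❓ Problem: Speed up dLLM inference so that it surpasses comparable AR models while maintaining generation quality.
🔍 Analysis: Current dLLMs are inefficient because multi-token decoding cannot make effective use of the KV cache and generating a block depends on prior blocks being completed.
🛠️ Method: Proposes discrete diffusion forcing (D2F), which combines block-wise autoregressive generation with inter-block parallel decoding and is trained by asymmetric distillation, turning vanilla dLLMs into a more efficient AR-diffusion hybrid.
📊 Data & Experiments: On GSM8K, D2F dLLMs reach more than 2.5× the inference speed of LLaMA3 and Qwen2.5 and more than 50× that of vanilla dLLMs, with comparable output quality.
⭐ Contributions: Breaks the existing inference-speed barrier of dLLMs with the simple and effective D2F strategy and demonstrates its practical acceleration potential for LLM inference.
Abstract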
Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, none of the existing open-source dLLMs have achieved superior inference speed over AR LLMs of similar size. This paper breaks this barrier based on a simple and effective strategy named discrete diffusion forcing (D2F). D2F equips dLLMs with two key capabilities: (1) block-wise autoregressive generation to enable KV cache utilization; (2) prediction of following tokens without requiring completion of prior blocks for inter-block parallel decoding. In this way, the vanilla dLLMs are refurbished into an AR-diffusion hybrid paradigm for efficient inference. D2F can be implemented with an asymmetric distillation process based on pre-trained dLLMs to achieve rapid convergence. We further propose a pipelined parallel decoding algorithm, which enables a trade-off between efficiency and efficacy. Empirically, D2F dLLMs achieve more than **2.5×** the inference speed of LLaMA3 and Qwen2.5 on GSM8K. Compared to the vanilla dLLMs like LLaDA and Dream, the acceleration can be more than **50×** while maintaining comparable output quality.
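Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Linear attention #Hybrid architectures #Distillation #Layer selection #Inference efficiency
🎯 Motivation: Improve LLM inference efficiency without costly pretraining from scratch by exploring hybrid architectures that interleave softmax attention and linear attention layers.
❓ Problem: Improve layer selection, i.e., deciding which pretrained Transformer layers to convert to linear attention, so that the resulting hybrid architecture performs better.
🔍 Analysis: Existing layer selection approaches are limited: they either interleave layers at a fixed ratio or depend on specialized diagnostic datasets, and they do not make full use of layer importance information.
🛠️ Method: Proposes a simple layer selection recipe that derives layer importance scores from a small amount of training on generic text, then applies the existing RADLADS distillation pipeline for the conversion itself.
📊 Data & Experiments: Layer importance scores are derived from generic text data, followed by a small amount of finetuning; the approach proves more effective than uniform interleaving and than approaches driven by specialized diagnostic datasets.
⭐ Contributions: Provides a low-cost, simple, and efficient layer selection recipe for hybrid architectures, offering a new direction for improving LLM inference efficiency.
Abstract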
Distilling pretrained softmax attention Transformers into more efficient hybrid architectures that interleave softmax and linear attention layers is a promising approach for improving the inference efficiency of LLMs without requiring expensive pretraining from scratch. A critical factor in the conversion process is layer selection, i.e., deciding on which layers to convert to linear attention variants. This paper describes a simple and efficient recipe for layer selection that uses layer importance scores derived from a small amount of training on generic text data. Once the layers have been selected we use a recent pipeline for the distillation process itself (RADLADS; Goldstein et al., 2025), which consists of attention weight transfer, hidden state alignment, KL-based distribution matching, followed by a small amount of finetuning. We find that this approach is more effective than existing approaches for layer selection, including heuristics that uniformly interleave linear attention layers based on a fixed ratio, as well as more involved approaches that rely on specialized diagnostic datasets.
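Given per-layer importance scores, the selection step itself reduces to a top-k choice, sketched below next to the fixed-ratio baseline the abstract mentions. How the scores are obtained is not shown; the score values, function names, and layer counts are hypothetical.

```python
def select_by_importance(importance, num_softmax):
    """Keep the `num_softmax` highest-scoring layers as softmax attention."""
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    keep = set(ranked[:num_softmax])
    return ["softmax" if i in keep else "linear" for i in range(len(importance))]


def uniform_interleave(num_layers, every):
    """Baseline heuristic: every `every`-th layer stays softmax attention."""
    return ["softmax" if (i + 1) % every == 0 else "linear" for i in range(num_layers)]


# Hypothetical scores for a 4-layer model.
by_score = select_by_importance([0.9, 0.1, 0.8, 0.3], num_softmax=2)
by_ratio = uniform_interleave(4, every=2)
```

Both layouts keep two softmax layers, but the score-based one places them where the scores say they matter rather than at fixed positions.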
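Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Spiking Neural Network #Optimization
🎯 Motivation: Spiking neural networks (SNNs) offer computational advantages on neuromorphic hardware, but training spiking LLMs from scratch is prohibitively expensive. Converting a pretrained artificial neural network (ANN) into an SNN lowers training cost while preserving performance.
❓ Problem: Existing ANN-to-SNN conversion frameworks neglect activation distributions, so misaligned discrete values introduce latent conversion error. A distribution-aligned coding scheme is needed to reduce it.
🔍 Analysis: Current spiking-neuron coding schemes mostly map to uniformly distributed values that do not match the activation distribution, producing latent conversion error that hurts accuracy and efficiency.
🛠️ Method: Proposes distribution-aware multi-granularity phase coding, which extends conventional phase coding with multiple learnable bases to add representational capacity at different granularities, together with a cost-efficient neuron training procedure driven by hidden-layer activation distributions.
📊 Data & Experiments: Extensive experiments on LLaMA validate the coding scheme and the conversion paradigm; the model reaches ANN-level accuracy with a 42% reduction in the energy consumption of MAC and AC operations.
⭐ Contributions: Builds a distribution-aware phase coding scheme and an improved ANN-to-SNN conversion paradigm, provides a convergence guarantee for the neuron training algorithm, and delivers an efficient spiking LLM.
Abstract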
Spiking large language models (LLMs) offer significant advantages on neuromorphic hardware, yet training them from scratch remains prohibitively expensive. A promising alternative is ANN-to-SNN conversion, which reuses pretrained ANN weights while minimizing conversion error.
However, existing conversion frameworks neglect activation distributions: spiking neurons with rate or temporal coding map activations to uniformly distributed rather than distribution-aligned discrete values, which causes latent conversion error arising from distribution misalignment.
To tackle this problem, we propose a distribution-aware multi-granularity phase coding approach, which achieves reasonable discrete value allocation by minimizing conversion error relative to activation distributions.
Specifically, multi-granularity phase coding extends conventional phase coding with multiple learnable bases, incorporating representational capacity across different granularities.
Building on this coding scheme, we further propose a novel ANN-to-SNN conversion paradigm designed towards lower conversion error.
In particular, our paradigm utilizes the activation distributions of hidden layers to sample data for cost-efficient neuron training, without requiring fine-tuning of model weights.
Theoretically, we provide a convergence guarantee for the neuron training algorithm.
Extensive experiments on the LLaMA model confirm the effectiveness of both our coding scheme and conversion paradigm.
Concretely, our spiking LLM attains the lowest perplexity with ANN-level accuracy, accompanied by a 42% reduction in energy consumption of MAC and AC operations. Our code is available at https://github.com/JLU-Solar/PhaseSNN.
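For intuition, conventional phase coding can be sketched as a weighted spike train in which a spike at phase t contributes base^-(t+1). With base 2 the representable values form a uniform grid, which is the misalignment the abstract criticizes. The greedy encoder and the `base` parameter below are illustrative assumptions; the paper's multi-basis formulation with learnable bases is not given in the abstract.

```python
def phase_encode(x: float, num_phases: int, base: float = 2.0) -> list:
    """Greedy spike train for x in [0, 1): phase t carries weight base ** -(t + 1)."""
    spikes, residual = [], x
    for t in range(num_phases):
        weight = base ** -(t + 1)
        if residual >= weight:
            spikes.append(1)
            residual -= weight
        else:
            spikes.append(0)
    return spikes


def phase_decode(spikes: list, base: float = 2.0) -> float:
    """Value represented by a spike train under the same phase weights."""
    return sum(s * base ** -(t + 1) for t, s in enumerate(spikes))


spikes = phase_encode(0.625, num_phases=3)  # 0.625 = 1/2 + 1/8
reconstructed = phase_decode(spikes)
conversion_error = abs(0.7 - phase_decode(phase_encode(0.7, num_phases=3)))
```

With three base-2 phases, only multiples of 1/8 are exact; 0.7 is rounded down to 0.625, and that gap is the kind of conversion error a distribution-aligned set of bases aims to shrink where activations are dense.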
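Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Large Language Models #Efficient Inference #Adaptive Computation #Test-time Optimization #Monte Carlo Tree Search #Dynamic Layer Routing
TL;DR: Dynamic Layer Routing in LLMs
🎯 Motivation: LLMs pass every input through all Transformer layers, wasting computation on simple queries and lacking flexibility for complex ones; an efficient dynamic inference mechanism is needed.
❓ Problem: Existing adaptive-depth methods require costly inference-time search or retraining and often trade accuracy for efficiency. This work aims for a dynamic inference framework that can be applied to pretrained models without major architectural changes.
🔍 Analysis: Simple queries do not need every layer, while complex queries need deeper reasoning; ignoring the per-task or per-instance demand wastes computation and leaves performance on the table.
🛠️ Method: Proposes Dr.LLM, which adds lightweight per-layer routers for dynamic computation, uses Monte Carlo Tree Search to derive high-quality layer configurations as supervision, and relies on windowed pooling and a class-balanced focal loss for stable, robust routing.
📊 Data & Experiments: On ARC (logic) and DART (math), accuracy improves by up to 3.4 percentage points while saving 5 layers per example on average; on out-of-domain test sets, accuracy drops by only 0.85% while efficiency is retained.
⭐ Contributions: Develops a dynamic inference framework compatible with pretrained models that improves compute efficiency and accuracy without changing base weights, and validates explicitly supervised router training.
Abstract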
Large Language Models (LLMs) process every token through all layers of a transformer stack, causing wasted computation on simple queries and insufficient flexibility for harder ones that need deeper reasoning. Adaptive-depth methods can improve efficiency, but prior approaches rely on costly inference-time search, architectural changes, or large-scale retraining, and in practice often degrade accuracy despite efficiency gains. We introduce Dr.LLM, Dynamic routing of Layers for LLMs, a retrofittable framework that equips pretrained models with lightweight per-layer routers deciding to skip, execute, or repeat a block. Routers are trained with explicit supervision: using Monte Carlo Tree Search (MCTS), we derive high-quality layer configurations that preserve or improve accuracy under a compute budget. Our design, windowed pooling for stable routing, focal loss with class balancing, and bottleneck MLP routers, ensures robustness under class imbalance and long sequences. On ARC (logic) and DART (math), Dr.LLM improves accuracy by up to +3.4%p while saving 5 layers per example on average. Routers generalize to out-of-domain tasks (MMLU, GSM8k, AIME, TruthfulQA, SQuADv2, GPQA, PIQA, AGIEval) with only 0.85% accuracy drop while retaining efficiency, and outperform prior routing methods by up to +7.7%p. Overall, Dr.LLM shows that explicitly supervised routers retrofit frozen LLMs for budget-aware, accuracy-driven inference without altering base weights.
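The skip/execute/repeat decision can be sketched as a forward pass over a list of layer functions. Here the decisions are supplied directly rather than produced by trained routers, and "repeat" is assumed to mean running the block twice; names and toy layers are hypothetical.

```python
SKIP, EXECUTE, REPEAT = "skip", "execute", "repeat"
RUNS = {SKIP: 0, EXECUTE: 1, REPEAT: 2}  # assumption: repeat means two passes


def routed_forward(x, layers, decisions):
    """Apply each layer 0, 1, or 2 times according to its routing decision.

    Returns the final activation and the number of layer executions used.
    """
    executed = 0
    for layer, decision in zip(layers, decisions):
        for _ in range(RUNS[decision]):
            x = layer(x)
        executed += RUNS[decision]
    return x, executed


# Toy three-layer "model" on a scalar activation.
toy_layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
output, cost = routed_forward(1, toy_layers, [EXECUTE, SKIP, REPEAT])
```

The returned execution count is the quantity a compute budget would constrain when searching for good layer configurations.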
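Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#long-context #sparse attention #KV cache eviction #prompt compression
🎯 Motivation: Long-context LLMs face high compute and memory costs, and existing approximate inference methods rely on coarse importance predictions, so more accurate estimation is needed.
❓ Problem: Improve the inference efficiency of long-context models with a framework that uses small draft models to predict token and KV pair importance more accurately.
🔍 Analysis: Theoretical and empirical analyses show that lookahead-based importance estimation helps prune caches and context more precisely, enabling efficient inference.
🛠️ Method: Presents three methods: (1) SpecKV, which uses draft-model lookahead for precise KV cache dropping; (2) SpecPC, which uses draft-model attention to identify and discard less important prompt tokens; (3) SpecKV-PC, which cascades the two techniques.
📊 Data & Experiments: Extensive experiments on long-context benchmarks show higher accuracy than existing baselines while retaining the same gains in memory usage, latency, and throughput.
⭐ Contributions: Unifies and extends existing approximate inference techniques with a framework and new methods that improve the accuracy of efficient long-context inference, backed by theoretical and empirical analysis.
Abstract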
Optimizing inference for long-context large language models (LLMs) is increasingly important due to the quadratic compute and linear memory cost of Transformers. Existing approximate inference methods, including key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on coarse predictions of token or KV pair importance. We unify and extend recent work by introducing a framework for approximate LLM inference that leverages small draft models to more accurately predict token and KV pair importance. We provide novel theoretical and empirical analyses justifying lookahead-based importance estimation techniques. Within this framework, we present: (i) **SpecKV**, the first method to use lookahead with a small draft model to enable precise KV cache dropping; (ii) **SpecPC**, which leverages draft model attention activations to identify and discard less important prompt tokens; and (iii) **SpecKV-PC**, a cascaded compression strategy combining both techniques. Extensive experiments on long-context benchmarks demonstrate that our methods consistently achieve higher accuracy than existing baselines while retaining the same efficiency gains in memory usage, latency, and throughput.
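Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#dLLMs #Inference Acceleration
🎯 Motivation: Diffusion LLMs (dLLMs) excel at text generation thanks to bidirectional attention, but their computational complexity grows cubically with sequence length, limiting long-sequence and real-time use.
❓ Problem: Existing acceleration methods use static caching or parallel decoding and cannot adapt to token properties that change across layers and decoding steps, leaving efficiency gains on the table.
🔍 Analysis: The non-autoregressive denoising steps and the lack of key-value caching are the main reasons for the high complexity of dLLMs, and existing methods do not exploit layer-wise dynamics.
🛠️ Method: Proposes the Dynamic-dLLM framework with Dynamic Cache Updating (DCU), which allocates cache-update budgets according to layer-wise token dynamics, and Adaptive Parallel Decoding (APD), which calibrates decoding thresholds to balance generation quality and efficiency.
📊 Data & Experiments: Tested on LLaDA-8B-Instruct, LLaDA-1.5, and Dream-v0-7B-Instruct across MMLU, GSM8K, and HumanEval; the framework attains an average speedup exceeding 3× while maintaining performance.
⭐ Contributions: Proposes Dynamic-dLLM, a training-free dLLM acceleration framework that outperforms current acceleration methods and offers a plug-and-play solution for efficient dLLM deployment.
Abstract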
Diffusion Large Language Models (dLLMs) offer a promising alternative to autoregressive models, excelling in text generation tasks due to their bidirectional attention mechanisms. However, their computational complexity, scaling as $\mathcal{O}(L^3)$ with sequence length $L$, poses significant challenges for long-sequence and real-time applications, primarily due to the lack of compatibility with key-value caching and the non-autoregressive nature of denoising steps. Existing acceleration methods rely on static caching or parallel decoding strategies, which fail to account for the dynamic behavior of token properties across layers and decoding steps. We propose **Dynamic-dLLM**, a training-free framework that enhances dLLM inference efficiency through two components: Dynamic Cache Updating (DCU), which adaptively allocates cache-update budgets based on layer-wise token dynamics, and Adaptive Parallel Decoding (APD), which dynamically calibrates decoding thresholds to balance generation quality and efficiency. Extensive experiments on models like LLaDA-8B-Instruct, LLaDA-1.5, and Dream-v0-7B-Instruct across benchmarks such as MMLU, GSM8K, and HumanEval demonstrate that Dynamic-dLLM significantly improves inference speed, attaining an average speedup exceeding 3× while maintaining performance. Dynamic-dLLM outperforms state-of-the-art acceleration methods and provides a plug-and-play solution for efficient dLLM deployment without compromising performance. Code and models will be made publicly available.
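Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Artificial Intelligence #Offloading #LLM inference
TL;DR: The paper proposes DynamicInfer, a runtime inference framework that dynamically schedules and offloads neurons between the CPU and GPU, speeding up LLM inference on consumer-grade GPUs.
🎯 Motivation: LLMs perform well on NLP tasks, but their large memory footprint limits efficient deployment on consumer-grade GPUs.
❓ Problem: Existing methods partition neurons statically, which leads to low GPU utilization and higher inference latency.
🔍 Analysis: Inference performance depends on input-dependent neuron activation patterns, so memory and compute resources need more effective dynamic management.
🛠️ Method: Proposes the DynamicInfer framework, comprising a hierarchical neuron caching strategy, a load-aware neuron activation mechanism, and an activation-aware prefetching pipeline that overlaps data transfer with computation.
📊 Data & Experiments: On ReluLLaMA and Prosparse models across multiple hardware platforms, DynamicInfer achieves up to 253% speedup over llama.cpp and 59% over PowerInfer while retaining model accuracy.
⭐ Contributions: Dynamically adapted neuron scheduling and hardware-aware optimization improve LLM inference performance, offering a practical solution for high-performance deployment on resource-constrained devices.
Abstract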
Large Language Models (LLMs) have achieved remarkable success in various NLP tasks, but their enormous memory footprints pose significant challenges for deployment on consumer-grade GPUs.
Prior solutions, such as PowerInfer, combine offloading and sparse activation to reduce memory and computational overhead, but suffer from static neuron partitioning, leading to suboptimal GPU utilization and increased latency.
In this work, we present DynamicInfer, a runtime neuron offloading framework that dynamically adapts neuron scheduling based on input-dependent activation patterns. DynamicInfer introduces (1) a hierarchical neuron caching strategy, (2) a load-aware neuron activation mechanism tailored to heterogeneous hardware, and (3) an activation-aware prefetching pipeline that overlaps data transfer with computation.
Extensive experiments on ReluLLaMA and Prosparse models across multiple hardware platforms demonstrate that DynamicInfer achieves up to 253% speedup over llama.cpp and 59% over PowerInfer, while retaining model accuracy. Our approach offers a practical and scalable solution for high-performance LLM inference on resource-constrained devices.
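Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Diffusion Large Language Model #Inference Acceleration #KV Caching
🎯 Motivation: Diffusion LLMs attract attention for their ability to capture bidirectional context and their potential for parallel generation, but their inference cost is high, which limits practical use.
❓ Problem: Addresses the high computational cost of dLLM inference with an acceleration framework that requires no retraining.
🔍 Analysis: Intermediate representations in diffusion models (keys, values, and hidden states) change only slightly across successive iterations, which opens room for reducing computation.
🛠️ Method: Proposes ES-dLLM, which estimates token importance from intermediate tensor variation and the confidence scores of previous iterations, then skips unimportant tokens in early layers to cut computation.
📊 Data & Experiments: On LLaDA-8B and Dream-7B with an NVIDIA H200 GPU, throughput reaches up to 308.51 TPS, a 5.6× to 16.8× speedup over the vanilla implementation, and exceeds the state-of-the-art caching method.
⭐ Contributions: Proposes ES-dLLM, a training-free inference acceleration method for diffusion LLMs, and validates its efficiency and its preservation of generation quality.
Abstract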
Diffusion large language models (dLLMs) are emerging as a promising alternative to autoregressive models (ARMs) due to their ability to capture bidirectional context and the potential for parallel generation. Despite the advantages, dLLM inference remains computationally expensive as the full input context is processed at every iteration. In this work, we analyze the generation dynamics of dLLMs and find that intermediate representations, including key, value, and hidden states, change only subtly across successive iterations. Leveraging this insight, we propose ES-dLLM, a training-free inference acceleration framework for dLLMs that reduces computation by skipping tokens in early layers based on the estimated importance. Token importance is computed with intermediate tensor variation and confidence scores of previous iterations. Experiments on LLaDA-8B and Dream-7B demonstrate that ES-dLLM achieves throughput of up to 226.57 and 308.51 tokens per second (TPS), respectively, on an NVIDIA H200 GPU, delivering 5.6× to 16.8× speedup over the vanilla implementation and up to 1.85× over the state-of-the-art caching method, while preserving generation quality.
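Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Multimodal Retrieval #Vision–Language Models #Joint Encoding #Efficient Re-ranking #Token Compression
TL;DR: We propose EDJE, an efficient vision–language joint encoder with token-compression that enables fast multimodal re-ranking, achieving up to 53× higher throughput while matching the accuracy of prior joint encoders.
🎯 Motivation: Multimodal retrieval systems mainly rely on embedding models such as CLIP for vector search, and lack efficient vision–language joint rerankers of comparable quality. The authors find that the expensive visual feature-extraction stage of classic joint encoders such as BLIP is the deployment bottleneck.
❓ Problem: Proposes EDJE to address the high visual feature-extraction cost that keeps existing joint encoders from large-scale deployment. It precomputes and compresses visual tokens offline to lighten online inference, enabling efficient vision–language reranking.
🔍 Analysis: Joint-encoder reranking is standard in text retrieval, but the vision–language counterpart is largely absent. The visual processing stage of traditional methods is slow and storage-heavy, which limits use in large-scale retrieval.
🛠️ Method: EDJE precomputes visual tokens offline and compresses them with a lightweight attention-based adapter, so online inference only processes a small set of compressed visual tokens plus the text. This preserves strong retrieval performance while sharply reducing storage and online compute.
📊 Data & Experiments: Experiments on Flickr (zero-shot retrieval) and COCO (fine-tuned retrieval) show accuracy on par with prior methods. The model processes 50k image–text pairs per second and needs only 49kB of disk storage per image.
⭐ Contributions: Proposes EDJE, an Efficient Discriminative Joint Encoder that uses token compression for fast multimodal reranking, with up to 53× higher throughput at matching accuracy, making vision–language joint reranking practical for large-scale retrieval systems.
Abstract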
Multimodal retrieval still leans on embedding-based models like CLIP for fast vector search over pre-computed image embeddings. Yet, unlike text retrieval where joint-encoder rerankers are standard, comparable vision–language rerankers are largely absent. We find that seminal joint encoders such as BLIP are severely bottlenecked by an expensive visual feature-extraction stage, preventing practical deployment at scale. Motivated by this bottleneck, we introduce EDJE, an Efficient Discriminative Joint Encoder that precomputes vision tokens offline and compresses them via a lightweight attention-based adapter, so online inference runs only a compact joint encoder over a small set of visual tokens plus the text. EDJE preserves strong retrieval performance while drastically reducing storage and online compute, enabling high-throughput inference. Specifically, EDJE processes 50k image–text pairs/second while requiring 49kB of disk storage per image, matching prior art on Flickr (zero-shot) and COCO (fine-tuned) retrieval.
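To put the reported figures in deployment terms, the arithmetic below converts the 49kB-per-image and 50k-pairs-per-second numbers into index size and reranking latency. It assumes decimal units (1 GB = 10^6 kB) and that the quoted throughput applies unchanged to the workload; the corpus size and candidate count are hypothetical.

```python
def index_storage_gb(num_images: int, kb_per_image: float = 49.0) -> float:
    """Disk needed for precomputed visual tokens, in decimal gigabytes."""
    return num_images * kb_per_image / 1e6


def rerank_latency_ms(num_candidates: int, pairs_per_second: float = 50_000.0) -> float:
    """Time to score `num_candidates` image-text pairs for one query, in milliseconds."""
    return 1000.0 * num_candidates / pairs_per_second


# Hypothetical deployment: 1M-image corpus, rerank the top 1000 candidates per query.
storage = index_storage_gb(1_000_000)
latency = rerank_latency_ms(1000)
```

Under these assumptions, a one-million-image corpus needs about 49 GB of token storage, and reranking 1000 candidates takes about 20 ms.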
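Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Mixture-of-Experts #Quantization #Theoretical Generalization Guarantees
TL;DR: We propose a theoretically provable method for efficient quantization of large Mixture-of-Experts models.
🎯 Motivation: Sparse Mixture-of-Experts (MoE) models scale language and vision models efficiently, but their large parameter count incurs substantial memory overhead at inference, calling for an effective quantization scheme.
❓ Problem: Uniform quantization loses significant accuracy and existing mixed-precision allocation is computationally heavy; this work studies how to design an efficient, theoretically grounded mixed-precision strategy based on differences in expert sensitivity.
🔍 Analysis: The sensitivity of model performance to an expert's quantization is tied to the change in the router's $l_2$ norm during training and to the maximum intra-neuron variance; experts that matter more need higher precision to limit quantization error.
🛠️ Method: Proposes a theoretically grounded expert-wise mixed-precision strategy that assigns each expert's bit-width from the change in the router's $l_2$ norm and the maximum intra-neuron variance, trading off accuracy against inference cost.
📊 Data & Experiments: Experiments on large MoE models including Switch Transformer and Mixtral show higher accuracy than existing approaches together with lower inference cost, with negligible overhead for bit-width assignment.
⭐ Contributions: Proposes a theory-based expert-wise mixed-precision quantization strategy that balances accuracy and inference efficiency, and demonstrates its applicability to large sparse MoE models.
Abstract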
Sparse Mixture-of-Experts (MoE) allows scaling of language and vision models efficiently by activating only a small subset of experts per input. While this reduces computation, the large number of parameters still incurs substantial memory overhead during inference. Post-training quantization has been explored to address this issue. Because uniform quantization suffers from significant accuracy loss at low bit-widths, mixed-precision methods have been recently explored; however, they often require substantial computation for bit-width allocation and overlook the varying sensitivity of model performance to the quantization of different experts. We propose a theoretically grounded expert-wise mixed-precision strategy that assigns bit-width to each expert primarily based on their *change in router’s* $l_2$ *norm* during training. Experts with smaller changes are shown to capture less frequent but critical features, and model performance is more sensitive to the quantization of these experts, thus requiring higher precision. Furthermore, to avoid allocating experts to lower precision that inject high quantization noise, experts with large *maximum intra-neuron variance* are also allocated higher precision. Experiments on large-scale MoE models, including Switch Transformer and Mixtral, show that our method achieves higher accuracy than existing approaches, while also reducing inference cost and incurring only negligible overhead for bit-width assignment.
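The allocation rule described above can be sketched as follows: experts with the smallest change in the router's norm get higher precision, and so do experts whose maximum intra-neuron variance is large. The bit-widths, the count of high-precision experts, the variance threshold, and the input values are all hypothetical; the paper's exact criteria are not given in the abstract.

```python
def assign_bit_widths(router_norm_change, max_neuron_variance, num_high,
                      variance_threshold, high_bits=4, low_bits=2):
    """Return one bit-width per expert.

    router_norm_change[e]: change in the router's l2 norm for expert e during training.
    max_neuron_variance[e]: maximum intra-neuron variance of expert e.
    """
    num_experts = len(router_norm_change)
    # Smaller router change -> rarer but critical features -> higher precision.
    by_change = sorted(range(num_experts), key=lambda e: router_norm_change[e])
    high = set(by_change[:num_high])
    # Large variance -> high quantization noise at low precision -> higher precision.
    high |= {e for e in range(num_experts) if max_neuron_variance[e] > variance_threshold}
    return [high_bits if e in high else low_bits for e in range(num_experts)]


bits = assign_bit_widths([0.9, 0.1, 0.5, 0.7], [0.1, 0.1, 0.1, 2.0],
                         num_high=1, variance_threshold=1.0)
```

Both signals are cheap per-expert statistics, which is consistent with the claim that bit-width assignment adds negligible overhead.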
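Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Large Language Model #Efficiency
🎯 Motivation: As LLMs scale, growing compute and storage demands make real-world deployment difficult, so methods that improve model efficiency are needed.
❓ Problem: Analyzes redundancy inside Transformer models and proposes an entropy-based pruning strategy that improves compute and storage efficiency without hurting performance.
🔍 Analysis: The entropy of hidden representations decreases in the early blocks and then rises progressively across most subsequent blocks, suggesting that entropy is an effective measure of the information richness of computation blocks.
🛠️ Method: Uses entropy, which directly quantifies uncertainty and information content, in place of cosine similarity, which mainly captures geometric relationships, as the pruning criterion.
📊 Data & Experiments: Extensive experiments show that entropy-based pruning reduces model size while preserving accuracy and surpasses cosine-similarity-based methods.
⭐ Contributions: Proposes an entropy-based pruning method that improves deployment efficiency and offers a new direction for optimizing LLMs.
Abstract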
As large language models continue to scale, their growing computational and storage demands pose significant challenges for real-world deployment. In this work, we investigate redundancy within Transformer-based models and propose an entropy-based pruning strategy to enhance efficiency while maintaining performance. Empirical analysis reveals that the entropy of hidden representations decreases in the early blocks but progressively increases across most subsequent blocks. This trend suggests that entropy serves as a more effective measure of information richness within computation blocks. Unlike cosine similarity, which primarily captures geometric relationships, entropy directly quantifies uncertainty and information content, making it a more reliable criterion for pruning. Extensive experiments demonstrate that our entropy-based pruning approach surpasses cosine similarity-based methods in reducing model size while preserving accuracy, offering a promising direction for efficient model deployment.
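As a rough illustration of an entropy criterion, the sketch below estimates the Shannon entropy of a block's activations with a fixed histogram and computes the entropy change from block to block. The histogram estimator, bin range, and the idea of ranking blocks by entropy change are assumptions made for illustration; the abstract does not state how entropy is estimated or which blocks are pruned.

```python
import math


def shannon_entropy(values, num_bins=4, lo=-1.0, hi=1.0):
    """Histogram estimate (in bits) of the entropy of a list of activations."""
    counts = [0] * num_bins
    width = (hi - lo) / num_bins
    for v in values:
        # Clamp out-of-range values into the first or last bin.
        idx = min(num_bins - 1, max(0, int((v - lo) / width)))
        counts[idx] += 1
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


def entropy_changes(block_entropies):
    """Entropy difference between each block's output and the previous block's."""
    return [b - a for a, b in zip(block_entropies, block_entropies[1:])]


spread = shannon_entropy([-0.9, -0.4, 0.1, 0.6])   # one value per bin
collapsed = shannon_entropy([0.1, 0.1, 0.1, 0.1])  # all values in one bin
```

Unlike cosine similarity between a block's input and output, this quantity depends on how spread out the activations are rather than on their direction.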
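Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Large Language Models #Model Compression #On-Device Inference #Fixed-Point Network
🎯 Motivation: LLMs perform well across diverse applications, but deployment on edge devices is held back by severe memory bottlenecks.
❓ Problem: How to cut memory consumption substantially, to support efficient inference on edge devices, while keeping accuracy high.
🔍 Analysis: The parameter count and compute demands of current LLMs make them hard to run directly on resource-constrained edge devices.
🛠️ Method: Replaces groups of Transformer layers with a lightweight fixed-point network, treating deep computation as solving for an equilibrium state, and introduces Group Pruning Policy Optimization and One-Step KV-Cache to improve memory efficiency and inference performance.
📊 Data & Experiments: On common sense reasoning, mathematical problem solving, and code generation, ELMs prune 28% of parameters while retaining 99% of model accuracy.
⭐ Contributions: Proposes a new memory-efficient compression framework that opens a direction for deploying LLMs at the edge.
Abstract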
Large Language Models (LLMs) excel across diverse applications but remain impractical for edge deployment due to severe memory bottlenecks on edge devices. We propose Equilibrium Language Models (ELMs), a novel compression framework that replaces groups of Transformer layers with a lightweight fixed-point network, reinterpreting deep computation as solving for an equilibrium state. To achieve ELMs, we introduce *Group Pruning Policy Optimization*, which automatically learns optimal pruning intervals. Moreover, we propose *One-Step KV-Cache*, which drastically reduces memory overhead by storing only the final iteration cache without compromising accuracy, enabling effective deployment on edge devices. Across different tasks such as common sense reasoning, mathematical problem solving, and code generation, ELMs prune 28% of parameters while retaining 99% of the accuracy of dense fine-tuned LLMs, establishing a new direction for memory-efficient edge deployment of large models.
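"Solving for an equilibrium state" can be illustrated with plain fixed-point iteration: one small function is applied repeatedly until its output stops changing, standing in for a stack of layers. The scalar toy function, tolerance, and iteration cap are assumptions; the abstract does not describe the solver ELMs use.

```python
def solve_equilibrium(f, x, z0=0.0, tol=1e-8, max_iter=100):
    """Iterate z <- f(z, x) until successive iterates differ by less than `tol`.

    Returns the final state and the number of iterations performed.
    """
    z = z0
    for iteration in range(1, max_iter + 1):
        z_next = f(z, x)
        if abs(z_next - z) < tol:
            return z_next, iteration
        z = z_next
    return z, max_iter


# Toy contraction: z = 0.5 * z + x has the unique fixed point z* = 2x.
state, steps = solve_equilibrium(lambda z, x: 0.5 * z + x, x=1.0)
```

Only the final state is needed downstream, which gives some intuition for why caching a single iteration's key-value entries, as One-Step KV-Cache is described, can save memory.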
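Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#knowledge distillation #large language model #LLM routing
🎯 Motivation: Knowledge distillation is a key technique for transferring knowledge from LLMs to smaller, more efficient models, but with multiple teachers it is prone to knowledge conflicts and high resource consumption.
❓ Problem: Introduces Knowledge Purification, which consolidates the rationales of multiple teacher models into one, reducing conflicts and improving efficiency.
🔍 Analysis: With multiple teachers, traditional distillation suffers performance drops from knowledge conflicts and has high resource demands.
🛠️ Method: Designs and tests five knowledge purification methods from different perspectives, and examines the generality and efficiency of router-based methods.
📊 Data & Experiments: Experiments on multiple datasets show that the proposed methods improve the distilled model's performance and effectively alleviate knowledge conflicts.
⭐ Contributions: Introduces the concept of knowledge purification, develops five purification methods, shows the robust generalization of router-based methods, and offers a new approach to multi-teacher distillation.
Abstract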
Knowledge distillation has emerged as a pivotal technique for transferring knowledge from stronger large language models (LLMs) to smaller, more efficient models. However, traditional distillation approaches face challenges related to knowledge conflicts and high resource demands, particularly when leveraging multiple teacher models. In this paper, we introduce the concept of **Knowledge Purification**, which consolidates the rationales from multiple teacher LLMs into a single rationale, thereby mitigating conflicts and enhancing efficiency. To investigate the effectiveness of knowledge purification, we further propose five purification methods from various perspectives. Our experiments demonstrate that these methods not only improve the performance of the distilled model but also effectively alleviate knowledge conflicts. Moreover, router-based methods exhibit robust generalization capabilities, underscoring the potential of innovative purification techniques in optimizing multi-teacher distillation and facilitating the practical deployment of powerful yet lightweight models.
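Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Functional sparsity of FC; KV cache
TL;DR: We discover that RoPE is intrinsically sparse at the "frequency chunk" level and leverage this to build a zero-cost, query-aware KV cache pruner that rivals full-attention performance.
🎯 Motivation: When LLMs process long inputs, the memory footprint of the KV cache is the main bottleneck. Existing token-pruning methods based on attention sparsity either risk irreversible information loss (static methods) or rely on heuristics that capture query dependence poorly (dynamic methods).
❓ Problem: Build a framework that predicts token importance dynamically, so that query-dependent token importance is captured accurately.
🔍 Analysis: RoPE exhibits functional sparsity at the frequency-chunk (FC) level: a small set of dominant FCs agrees closely with the full attention head and can serve as an efficient basis for token selection.
🛠️ Method: FASA identifies critical tokens using the dominant FCs and computes attention only on the selected subset, achieving query-aware token eviction at no extra computational cost.
📊 Data & Experiments: Across long-context and complex reasoning tasks, FASA outperforms existing token-eviction methods. On LongBench-V1 it reaches nearly 100% of full-KV performance while keeping only 256 tokens, and on AIME24 it achieves a 2.56x speedup using 18.9% of the cache.
⭐ Contributions: FASA, a query-aware KV cache eviction framework based on RoPE sparsity, with strong accuracy and efficiency on long-sequence tasks.
Full abstract (Abstract)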
The deployment of Large Language Models (LLMs) faces a critical bottleneck when handling lengthy inputs: the prohibitive memory footprint of the Key Value (KV) cache. To address this bottleneck, the token pruning paradigm leverages attention sparsity to selectively retain a small, critical subset of tokens. However, existing approaches fall short, with static methods risking irreversible information loss and dynamic strategies employing heuristics that insufficiently capture the query-dependent nature of token importance.
We propose FASA, a novel framework that achieves query-aware token eviction by dynamically predicting token importance.
FASA stems from a novel insight into RoPE: the discovery of functional sparsity at the frequency-chunk (FC) level. Our key finding is that a small, identifiable subset of "dominant" FCs consistently exhibits high contextual agreement with the full attention head. This provides a robust and computationally free proxy for identifying salient tokens.
Building on this insight, FASA first identifies a critical set of tokens using dominant FCs, and then performs focused attention computation solely on this pruned subset.
Across a spectrum of long-context tasks, from sequence modeling to complex CoT reasoning, FASA consistently outperforms all token-eviction baselines and achieves near-oracle accuracy, demonstrating remarkable robustness even under constrained budgets. Notably, on LongBench-V1, FASA reaches nearly 100\% of full-KV performance when keeping only 256 tokens, and achieves a 2.56$\times$ speedup using just 18.9\% of the cache on AIME24.
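The two-stage procedure (select tokens with a cheap proxy, then attend only to the selected subset) can be sketched as follows. This is an illustration under stated assumptions, not FASA itself: `q` and `K` are assumed to be already rotated by RoPE, and `dominant_dims` is a hypothetical, hand-picked set of head dimensions standing in for the dominant frequency chunks that the paper identifies.

```python
import numpy as np

def select_tokens(q, K, dominant_dims, budget):
    """Rank cached tokens by a partial dot product over the selected dimensions."""
    proxy_scores = K[:, dominant_dims] @ q[dominant_dims]
    keep = np.argsort(proxy_scores)[-budget:]
    return np.sort(keep)  # restore positional order

def sparse_attention(q, K, V, keep):
    """Softmax attention restricted to the kept token indices."""
    logits = K[keep] @ q / np.sqrt(q.shape[0])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ V[keep]

rng = np.random.default_rng(0)
n_tokens, head_dim = 64, 16
q = rng.standard_normal(head_dim)
K = rng.standard_normal((n_tokens, head_dim))
V = rng.standard_normal((n_tokens, head_dim))

dominant_dims = np.arange(0, 4)  # hypothetical stand-in for the dominant chunks
keep = select_tokens(q, K, dominant_dims, budget=8)
out = sparse_attention(q, K, V, keep)
```

The proxy score touches only a fraction of the head dimensions, which is why selection is much cheaper than scoring every token with the full head.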
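Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Federated fine-tuning #low-rank Gram matrix #Procrustes alignment
TL;DR: We propose a federated fine-tuning framework that uses a single low-rank Gram matrix and adopts Procrustes alignment on the decomposed matrix to improve fine-tuning performance.
🎯 Motivation: Efficient LLM fine-tuning must preserve downstream performance while lowering communication cost and reducing error across distributed clients. Existing low-rank methods introduce avoidable error and decomposition drift in federated settings.
❓ Problem: Remove the aggregation error and decomposition drift caused by the two low-rank matrices in federated fine-tuning, improving the efficiency and consistency of federated learning.
🔍 Analysis: Existing methods incur error when aggregating and decomposing low-rank matrices, and the decomposition is not unique, which hurts performance. High communication cost is a further obstacle.
🛠️ Method: FLoRG uses a single low-rank matrix and aggregates its Gram matrix; Procrustes alignment reduces decomposition drift so that updates stay consistent across rounds, while communication cost also drops.
📊 Data & Experiments: Experiments on multiple LLM fine-tuning benchmarks against five recent baselines show higher downstream accuracy and lower communication overhead (up to 2041x reduction).
⭐ Contributions: Combines a single low-rank Gram matrix with Procrustes alignment for federated fine-tuning; proves convergence and a tighter convergence bound; demonstrates gains in both accuracy and efficiency.
Full abstract (Abstract)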
Parameter-efficient fine-tuning techniques such as low-rank adaptation (LoRA) enable large language models (LLMs) to adapt to downstream tasks efficiently. Federated learning (FL) further facilitates this process by enabling collaborative fine-tuning across distributed clients without sharing private data. However, the use of two separate low-rank matrices in LoRA for federated fine-tuning introduces two types of challenges. First, aggregation error can arise from separately aggregating the two low-rank matrices.
Second, even if the server aggregates the product of two low-rank matrices, it needs to decompose the aggregated matrix back into low-rank matrices. Since the decomposition is not unique, it can lead to decomposition drift. To tackle the aforementioned challenges, we propose federated low-rank Gram-matrix aggregation (FLoRG), a federated fine-tuning framework which employs a single low-rank matrix for fine-tuning and aggregates its Gram matrix (i.e., the matrix of inner products of its column vectors). FLoRG can eliminate the aggregation error and reduce the communication overhead. It also minimizes the decomposition drift by introducing a Procrustes alignment approach which aligns the decomposed matrix between consecutive fine-tuning rounds for consistent updates. We theoretically analyze the convergence of FLoRG and prove that adopting the Procrustes alignment results in a tighter convergence bound. Experimental results across multiple LLM fine-tuning benchmarks demonstrate that FLoRG outperforms five state-of-the-art baseline schemes by providing higher downstream task accuracy and can reduce the communication overhead by up to 2041$\times$.
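The non-uniqueness of the decomposition and the role of Procrustes alignment can be made concrete with a small sketch. It assumes the factor `A` has shape `r x d`, so that the Gram matrix of its columns is `A.T @ A`; the eigendecomposition-based factorization and all function names are illustrative choices, not necessarily those used in FLoRG.

```python
import numpy as np

def gram(A):
    """Gram matrix of the columns of A (A is r x d, result is d x d)."""
    return A.T @ A

def factor_from_gram(G, r):
    """Recover one rank-r factor A (r x d) with A.T @ A == G via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(G)
    top = np.argsort(eigvals)[-r:]
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return scale[:, None] * eigvecs[:, top].T

def procrustes_align(A_new, A_prev):
    """Rotate A_new by the orthogonal Q that minimizes ||Q @ A_new - A_prev||_F."""
    U, _, Vt = np.linalg.svd(A_prev @ A_new.T)
    return (U @ Vt) @ A_new

rng = np.random.default_rng(0)
A_prev = rng.standard_normal((2, 6))     # factor from the previous round
G = gram(A_prev)                         # what the server would aggregate
A_new = factor_from_gram(G, r=2)         # valid factor, but arbitrarily rotated
A_aligned = procrustes_align(A_new, A_prev)
```

Any `Q @ A` with orthogonal `Q` has the same Gram matrix as `A`, so `A_new` generally differs from `A_prev` even though both reproduce `G`; the alignment step removes that arbitrary rotation.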
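Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Efficient attention #GPUs #Long context LLMs #Sparse attention
🎯 Motivation: The existing Native Sparse Attention (NSA) kernel is hardware-aligned and efficient to train, but its computation pattern limits its applicability to LLMs that use a small number of query heads per GQA group.
❓ Problem: Provide an improved kernel implementation that avoids NSA's efficiency loss when query-head groups are small, as in modern LLMs.
🔍 Analysis: The original NSA kernel is efficient with larger query-head groups, which does not match the small groups used by mainstream LLMs and therefore narrows its applicability.
🛠️ Method: The FSA kernel changes the implementation to support different query-head group sizes and optimizes sparse attention computation on modern GPUs.
📊 Data & Experiments: FSA achieves up to 3.5x kernel-level latency reduction, up to 1.25x end-to-end training speedup, and up to 1.36x prefill-phase speedup in generative inference.
⭐ Contributions: The FSA kernel broadens the applicability and performance of NSA on modern LLMs and provides an efficient implementation for large-scale sparse attention.
Full abstract (Abstract)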
Recent advances in sparse attention mechanisms have demonstrated strong potential for reducing the computational cost of long-context training and inference in large language models (LLMs). Native Sparse Attention (NSA), one state-of-the-art approach, introduces natively trainable, hardware-aligned sparse attention that delivers a substantial system-level performance boost while maintaining accuracy comparable to full attention. However, the kernel implementation of NSA forces a loop order that is only efficient with a relatively large number of query heads in each Grouped Query Attention (GQA) group, whereas existing LLMs widely adopt a much smaller number of query heads in each GQA group; this inconsistency significantly limits the applicability of this sparse algorithmic advance. In this work, we propose **F**lash **S**parse **A**ttention (**FSA**), an alternative kernel implementation that enables efficient NSA computation across a wide range of popular LLMs with varied, smaller numbers of query heads in each GQA group on modern GPUs. Compared to the vanilla NSA kernel implementation, our empirical evaluation demonstrates that FSA achieves (i) up to 3.5x and on average 1.6x kernel-level latency reduction, (ii) up to 1.25x and on average 1.09x end-to-end training speedup on state-of-the-art LLMs, and (iii) up to 1.36x and on average 1.11x prefill-phase speedup in LLM generative inference.
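Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Zeroth‑order optimization #Large language models #Fine‑tuning #Adaptive step size #Batch gradient estimation #Memory efficiency
TL;DR: FZOO achieves fine‑tuning speed within the same order of magnitude as Adam for LLMs while using only inference‑level GPU memory.
🎯 Motivation: LLM fine-tuning is limited by GPU memory: with first-order optimizers such as Adam, the backward pass raises memory usage to more than 10 times the inference level. Zeroth-order optimizers reduce memory, but existing methods such as MeZO converge slowly.
❓ Problem: Design an efficient zeroth-order optimizer, FZOO, that keeps memory low while approaching Adam's fine-tuning speed, improving the speed/memory trade-off of existing zeroth-order methods.
🔍 Analysis: FZOO lowers the number of forward passes needed for convergence with batched one-sided gradient estimates and adapts the step size using the standard deviation of batch losses. Rademacher random vectors further accelerate batched computation.
🛠️ Method: A method based on normalized SGD that combines adaptive step sizes with an optimized gradient-estimation procedure, substantially reducing the computation and memory needed to converge.
📊 Data & Experiments: Evaluated on 11 downstream tasks and several models (RoBERTa-large, the OPT family, Phi-2, Llama3). Compared with MeZO, FZOO improves accuracy by 3% on average and needs 3x fewer forward passes.
⭐ Contributions: The FZOO optimizer, which brings zeroth-order optimization close to first-order performance at lower memory cost; a theoretical analysis establishing equivalence to a normalized-SGD update and convergence guarantees; compatibility with PEFT for further memory savings, making fast single-GPU full-parameter fine-tuning feasible.
Full abstract (Abstract)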
Fine-tuning large language models (LLMs) often faces GPU memory bottlenecks: the backward pass of first-order optimizers like Adam increases memory usage to more than 10 times the inference level (e.g., 633~GB for OPT-30B). Zeroth-order (ZO) optimizers avoid this cost by estimating gradients only from forward passes, yet existing methods like MeZO usually need tens of times more steps to converge. Can this trade-off between speed and memory in ZO be fundamentally improved? Normalized-SGD, for instance, demonstrates strong empirical performance with greater memory efficiency than Adam. In light of this, we introduce FZOO, a Fast Zeroth-Order Optimizer towards Adam-Scale Speed. On the one hand, FZOO reduces the total forward passes needed for convergence by employing batched one-sided estimates that adapt step-sizes based on the standard deviation of batch losses. On the other hand, it accelerates per-batch computation through the use of Rademacher random vector (±1) perturbations, which also enables further speedups through batched evaluation. Extensive experiments on diverse models (including RoBERTa-large, the OPT family (350M-66B), Phi-2, and Llama3) across 11 varied downstream tasks validate FZOO's effectiveness. On average, FZOO outperforms MeZO by +3% in accuracy while requiring 3$\times$ fewer forward passes. Notably, for the RoBERTa-large model, FZOO achieves average improvements of +5.6% in accuracy and an 18$\times$ reduction in forward passes compared to MeZO, achieving convergence speeds comparable to Adam. We also provide theoretical analysis proving FZOO’s formal equivalence to a normalized-SGD update rule and establishing its convergence guarantees. Beyond full-parameter tuning, FZOO plugs smoothly into PEFT techniques, unlocking even larger memory savings. Taken together, our results make single-GPU, high-speed, full-parameter fine-tuning realistic today and point toward future work on memory-efficient pre-training. Code: https://github.com/DKmiyan/FZOO
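The three ingredients named in the abstract (one-sided estimates, Rademacher perturbations, and a step size scaled by the standard deviation of the perturbed losses) can be combined in a toy update. This is a sketch on a quadratic objective, not the released FZOO code; the function name, hyperparameters, and the exact form of the normalization are assumptions.

```python
import numpy as np

def fzoo_like_step(loss_fn, theta, rng, n_perturb=8, eps=1e-3, lr=0.05):
    """One zeroth-order step with Rademacher perturbations (illustrative only)."""
    base = loss_fn(theta)
    # Each row is a random +/-1 direction.
    U = rng.choice([-1.0, 1.0], size=(n_perturb, theta.size))
    losses = np.array([loss_fn(theta + eps * u) for u in U])
    # One-sided differences all reuse the single unperturbed loss.
    direction = ((losses - base)[:, None] * U).mean(axis=0)
    # Dividing by the std of the perturbed losses normalizes the step size.
    sigma = losses.std() + 1e-12
    return theta - lr * direction / sigma

quadratic = lambda t: 0.5 * float(t @ t)

rng = np.random.default_rng(0)
theta = rng.standard_normal(20)
initial_loss = quadratic(theta)
for _ in range(300):
    theta = fzoo_like_step(quadratic, theta, rng)
final_loss = quadratic(theta)
```

Each step costs `n_perturb + 1` forward evaluations and no backward pass, which is the source of the memory savings that zeroth-order methods target.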
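Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Diffusion LLM #Efficient AI
TL;DR: Fast-dLLM v2 transforms pretrained autoregressive LLMs into efficient block diffusion models, matching accuracy while delivering up to 2.5× faster decoding with minimal data and training cost.
🎯 Motivation: Large autoregressive LLMs perform well on natural language tasks, but sequential decoding limits their inference efficiency.
❓ Problem: Efficiently convert a pretrained autoregressive model into a block diffusion model, accelerating decoding without sacrificing performance.
🔍 Analysis: Autoregressive decoding has an inherent efficiency bottleneck, while full-attention diffusion models require very large amounts of training data.
🛠️ Method: Combine a block diffusion mechanism with a complementary attention mask for blockwise bidirectional context modeling, and add a hierarchical cache that stores historical context across blocks and supports efficient parallel generation within a block.
📊 Data & Experiments: With about 1B tokens of fine-tuning, the model is evaluated on diverse benchmarks for generation quality and efficiency.
⭐ Contributions: Up to 2.5x faster decoding, with training data for block diffusion models reduced to 1/500 of that used by full-attention diffusion LLMs.
Full abstract (Abstract)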
Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks, yet their inherent sequential decoding limits inference efficiency. In this work, we propose Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that efficiently adapts pretrained AR models into dLLMs for parallel text generation—requiring only ∼1B tokens of fine-tuning. This represents a 500× reduction in training data compared to full-attention diffusion LLMs such as Dream (580B tokens), while preserving the original model’s performance. Our approach introduces a novel training recipe that combines a block diffusion mechanism with a complementary attention mask, enabling blockwise bidirectional context modeling without sacrificing AR training objectives. To further accelerate decoding, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations across blocks, and a sub-block cache that enables efficient parallel generation within partially decoded blocks. Coupled with our parallel decoding pipeline, Fast-dLLM v2 achieves up to 2.5× speedup over standard AR decoding without compromising generation quality. Extensive experiments across diverse benchmarks demonstrate that Fast-dLLM v2 matches or surpasses AR baselines in accuracy, while delivering state-of-the-art efficiency among dLLMs—marking a significant step toward the practical deployment of fast and accurate LLMs.
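Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Diffusion LLM #efficiency
TL;DR: Fast-dLLM boosts diffusion-based LLM inference speed by introducing block-wise KV caching and confidence-aware parallel decoding, achieving up to 27.6× throughput gains with minimal quality loss.
🎯 Motivation: Diffusion LLMs show promise for non-autoregressive text generation, but their inference speed lags behind autoregressive models: they lack an efficient KV cache, and generation quality drops under parallel decoding.
❓ Problem: Speed up Diffusion LLM inference while limiting the quality loss caused by decoding multiple tokens in parallel, narrowing the gap to autoregressive models.
🔍 Analysis: Without a KV cache, Diffusion LLMs must recompute features, which makes inference slow. The quality drop in parallel decoding stems from the conditional independence assumption, which breaks token dependencies.
🛠️ Method: A block-wise approximate KV cache that enables cache reuse with negligible performance drop, plus a confidence-aware parallel decoding strategy that decodes only high-confidence tokens to mitigate dependency violations.
📊 Data & Experiments: On LLaDA and Dream across multiple LLM benchmarks, throughput improves by up to 27.6x with minimal accuracy loss.
⭐ Contributions: Fast-dLLM substantially accelerates Diffusion LLM inference, closes the performance gap with autoregressive models, and moves Diffusion LLMs toward practical deployment.
Full abstract (Abstract)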
Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation. However, the practical inference speed of open-sourced Diffusion LLMs often lags behind autoregressive models due to the lack of Key-Value (KV) Cache and quality degradation when decoding multiple tokens simultaneously. To bridge this gap, we introduce Fast-dLLM, a method that incorporates a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop. Additionally, we identify the root cause of generation quality degradation in parallel decoding as the disruption of token dependencies under the conditional independence assumption. To address this, Fast-dLLM also proposes a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality. Experimental results on LLaDA and Dream models across multiple LLM benchmarks demonstrate up to 27.6× throughput improvement with minimal accuracy loss, closing the performance gap with autoregressive models and paving the way for practical deployment of Diffusion LLMs.
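The confidence-threshold rule can be sketched for a single decoding step. This is an illustration under stated assumptions: `probs` is a hypothetical per-position token distribution, and the fallback to the single most confident position (so that every step makes progress) is an assumption of this sketch rather than something stated in the abstract.

```python
import numpy as np

def confident_positions(probs, masked, threshold=0.9):
    """Return {position: token} for masked positions whose top-1 probability
    exceeds the threshold; fall back to the most confident masked position."""
    confidence = probs.max(axis=-1)
    tokens = probs.argmax(axis=-1)
    chosen = [i for i in masked if confidence[i] > threshold]
    if not chosen:
        chosen = [max(masked, key=lambda i: confidence[i])]
    return {i: int(tokens[i]) for i in chosen}

# Hypothetical distributions for 4 masked positions over a 3-token vocabulary.
probs = np.array([
    [0.95, 0.03, 0.02],
    [0.50, 0.30, 0.20],
    [0.05, 0.92, 0.03],
    [0.40, 0.35, 0.25],
])
first = confident_positions(probs, masked=[0, 1, 2, 3])
second = confident_positions(probs, masked=[1, 3])
```

In an actual model the distributions would be recomputed after each step, so positions that were uncertain at first can become confident once their neighbors are filled in.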
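Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#large language models #group relative policy optimization #speculative decoding #acceleration
🎯 Motivation: Group relative policy optimization (GRPO) improves LLM reasoning, but training is very slow because several responses must be generated autoregressively for each query, which hinders practical use.
❓ Problem: Existing speculative decoding gives limited speedup under high-concurrency training, so a generation-acceleration method that adapts to concurrency is needed.
🔍 Analysis: The generation phase is GRPO's bottleneck. Existing methods do not handle the changing demands of high-concurrency settings, and updates to the target model cause distributional drift in the draft model, which degrades performance.
🛠️ Method: A concurrency-aware speculative decoding framework that adjusts the drafting and verification strategy to the real-time concurrency level, combined with online draft learning that keeps updating the draft model using feedback from the target model.
📊 Data & Experiments: Across multiple mathematical reasoning datasets and models, the method achieves end-to-end speedups of 2.35x to 2.72x, clearly ahead of the baselines.
⭐ Contributions: A framework that combines concurrency-aware acceleration with online draft learning, improving GRPO training efficiency under high concurrency; code is publicly released.
Full abstract (Abstract)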
Group relative policy optimization (GRPO) has demonstrated significant potential in improving the reasoning capabilities of large language models (LLMs) via reinforcement learning. However, its practical deployment is impeded by an excessively slow training process, primarily attributed to the computationally intensive autoregressive generation of multiple responses per query, which makes the generation phase the primary performance bottleneck. Although speculative decoding presents a promising direction for acceleration, its direct application in GRPO achieves limited speedup under high-concurrency training conditions. To overcome this limitation, we propose a concurrency-aware speculative decoding framework that dynamically adjusts the drafting and verification strategy according to real-time concurrency levels, thereby maximizing the acceleration of the generation process. Furthermore, to address performance degradation arising from distributional drift between the evolving target model and the fixed draft model during training, we introduce an online draft learning mechanism that enables the draft model to continuously adapt using feedback signals from the target model. Experimental results across multiple mathematical reasoning datasets and models demonstrate that the proposed method achieves end-to-end speedups of 2.35x to 2.72x, significantly surpassing baseline approaches in efficiency. The code is available at https://github.com/yedaotian9/FastGRPO.
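Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Memory-efficient Training #Zeroth-order Optimization #Quantization
TL;DR: Fine-tune a quantized large language model with zeroth-order optimization, reducing memory usage by up to 18.4x.
🎯 Motivation: As LLMs grow exponentially in size, GPU memory has become the bottleneck for adapting them to downstream tasks. This work aims to minimize the memory used by weights, gradients, and optimizer states.
❓ Problem: Resolve the precision mismatch between quantized weights and continuous gradients so that zeroth-order optimization can be applied to quantized models. Estimating gradients directly on discrete quantized weights would require repeated de-quantization and re-quantization, adding overhead.
🔍 Analysis: Quantization compresses weight memory, but the gap between discrete values and continuous gradients prevents zeroth-order optimization from being applied directly: gradient approximation lives in a continuous space, whereas quantized weights are discrete.
🛠️ Method: Quantized Zeroth-order Optimization (QZO) perturbs the continuous quantization scale for gradient estimation instead of operating on discrete weights, and uses directional derivative clipping to stabilize training. It is orthogonal to both scalar-based and codebook-based post-training quantization.
📊 Data & Experiments: Validated on models including Llama-2-13B. With 4-bit quantization, memory drops by more than 18x compared with 16-bit full-parameter fine-tuning, and Llama-2-13B can be fine-tuned on a single 24GB GPU.
⭐ Contributions: A unified framework that jointly reduces the memory of weights, gradients, and optimizer states by combining zeroth-order optimization with model quantization, offering a workable option for downstream adaptation under tight memory budgets.
Full abstract (Abstract)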
As the size of large language models grows exponentially, GPU memory has become a bottleneck for adapting these models to downstream tasks. In this paper, we aim to push the limits of memory-efficient training by minimizing memory usage on model weights, gradients, and optimizer states, within a unified framework. Our idea is to eliminate both gradients and optimizer states using zeroth-order optimization, which approximates gradients by perturbing weights during forward passes to identify gradient directions. To minimize memory usage on weights, we employ model quantization, e.g., converting from bfloat16 to int4. However, directly applying zeroth-order optimization to quantized weights is infeasible due to the precision gap between discrete weights and continuous gradients, which would otherwise require de-quantization and re-quantization. To overcome this challenge, we propose Quantized Zeroth-order Optimization (QZO), a simple yet effective approach that perturbs the continuous quantization scale for gradient estimation and uses a directional derivative clipping method to stabilize training. QZO is orthogonal to both scalar-based and codebook-based post-training quantization methods. Compared to full-parameter fine-tuning in 16 bits, QZO can reduce the total memory cost by more than 18$\times$ for 4-bit LLMs, and enables fine-tuning Llama-2-13B within a single 24GB GPU.
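The core move, perturbing the continuous scales while the integer codes stay fixed, can be sketched on a toy regression. This is not the QZO implementation: the per-row scalar quantizer, the two-sided difference, the Gaussian perturbation, and every hyperparameter below are assumptions made for illustration.

```python
import numpy as np

def dequantize(codes, scales):
    """Per-row scalar dequantization: W[i, j] = scales[i] * codes[i, j]."""
    return scales[:, None] * codes

def qzo_like_step(loss_fn, codes, scales, rng, eps=1e-3, lr=5e-3, clip=5.0):
    """One zeroth-order update of the quantization scales; codes never change."""
    z = rng.standard_normal(scales.shape)
    loss_plus = loss_fn(codes, scales + eps * z)
    loss_minus = loss_fn(codes, scales - eps * z)
    # Two-sided estimate of the directional derivative along z.
    d = (loss_plus - loss_minus) / (2.0 * eps)
    d = float(np.clip(d, -clip, clip))  # directional derivative clipping
    return scales - lr * d * z

rng = np.random.default_rng(0)
codes = rng.integers(-8, 8, size=(4, 8))  # fixed integer codes in the int4 range
target = dequantize(codes, np.array([0.5, 1.0, 1.5, 2.0]))
loss_fn = lambda c, s: float(np.mean((dequantize(c, s) - target) ** 2))

scales = np.ones(4)
initial_loss = loss_fn(codes, scales)
for _ in range(1000):
    scales = qzo_like_step(loss_fn, codes, scales, rng)
final_loss = loss_fn(codes, scales)
```

Because only `scales` is updated, no de-quantization and re-quantization of the stored weights is needed between steps, and no gradient or optimizer state is kept.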
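Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Large Language Models #Speculative Decoding #Efficient Training
🎯 Motivation: Improve the efficiency of training draft models for LLM inference acceleration, with a focus on data quality and selection.
❓ Problem: Training a draft model currently requires a large dataset, which is costly, yet samples do not contribute equally to inference performance.
🔍 Analysis: Theory and experiments show that samples for which the target model's predictive distribution is flatter are more valuable for the acceptance rate than samples with sharply peaked distributions.
🛠️ Method: Propose flatness as a new metric and design the dataset distillation method SFDD, which filters the data to keep only high-value samples.
📊 Data & Experiments: On the EAGLE framework, SFDD achieves over 2x training speedup using only 50% of the data, while the inference speedup stays within 4% of the full-dataset baseline.
⭐ Contributions: A data-centric approach to speculative decoding, with the flatness metric and the SFDD method, that substantially improves draft-model training efficiency; code is publicly available.
Full abstract (Abstract)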
Speculative Decoding (SD) is a key technique for accelerating Large Language Model (LLM) inference, but it typically requires training a draft model on a large dataset. We approach this problem from a data-centric perspective, finding that not all training samples contribute equally to the SD acceptance rate. Specifically, our theoretical analysis and empirical validation reveal that tokens inducing flatter predictive distributions from the target model are more valuable than those yielding sharply peaked distributions. Based on this insight, we propose flatness, a new metric to quantify this property, and develop the Sample-level-flatness-based Dataset Distillation (SFDD) approach, which filters the training data to retain only the most valuable samples. Experiments on the EAGLE framework demonstrate that SFDD can achieve over 2$\times$ training speedup using only 50\% of the data, while keeping the final model's inference speedup within 4\% of the full-dataset baseline. This work introduces an effective, data-centric approach that substantially improves the training efficiency for Speculative Decoding. Our code is available at https://github.com/fjm9933/Flatness.
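Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#PEFT; Dynamic Rank; LoRA
🎯 Motivation: Large pre-trained models perform well across many domains, but full fine-tuning is expensive, so parameter-efficient fine-tuning (PEFT) has become mainstream. Existing methods such as LoRA use a fixed low-rank design, which limits flexibility.
❓ Problem: Existing dynamic rank allocation methods do not distinguish importance at the matrix level and have no mechanism to expand capacity in layers that need more adaptation.
🔍 Analysis: A fixed rank cannot match the differing adaptation needs of layers, and heuristic allocation lacks stability and flexibility, which limits performance.
🛠️ Method: FlexLoRA evaluates matrix importance via spectral energy entropy, supports rank pruning and expansion under a global budget, and uses zero-impact initialization for newly added directions to ensure stability.
📊 Data & Experiments: In extensive experiments across benchmarks, FlexLoRA consistently outperforms state-of-the-art baselines.
⭐ Contributions: Addresses the granularity, flexibility, and stability limitations of PEFT methods with an entropy-guided flexible low-rank adaptation framework.
Full abstract (Abstract)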
Large pre-trained models achieve remarkable success across diverse domains, yet fully fine-tuning incurs prohibitive computational and memory costs.
Parameter-efficient fine-tuning (PEFT) has thus become a mainstream paradigm.
Among PEFT methods, Low-Rank Adaptation (LoRA) introduces trainable low-rank matrices and shows strong performance; nevertheless, its fixed-rank design limits flexibility.
Dynamic rank allocation methods mitigate this issue by pruning redundant directions; however, they often rely on heuristic, element-level metrics that globally sort rank directions without matrix-wise distinction, and they lack mechanisms to expand capacity in layers requiring additional adaptation.
To overcome these limitations, we propose FlexLoRA, an entropy-guided flexible low-rank adaptation framework that (i) evaluates matrix importance via spectral energy entropy, (ii) supports rank pruning and expansion under a global budget, and (iii) employs zero-impact initialization for newly added singular directions to ensure stability.
By addressing granularity, flexibility, and stability limitations, FlexLoRA provides a more principled solution for PEFT.
Extensive experiments show that FlexLoRA consistently outperforms state-of-the-art baselines across benchmarks.
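Two of the listed components can be illustrated briefly. The sketch below takes "spectral energy entropy" to mean the Shannon entropy of the normalized squared singular values, and "zero-impact initialization" to mean adding a direction whose singular value starts at zero; both readings are assumptions, and the paper's exact definitions may differ.

```python
import numpy as np

def spectral_energy_entropy(singular_values):
    """Shannon entropy of the normalized squared singular values."""
    energy = np.asarray(singular_values, dtype=float) ** 2
    p = energy / energy.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def expand_rank(U, s, Vt, rng):
    """Append one direction with zero singular value, leaving U diag(s) Vt unchanged."""
    u_new = rng.standard_normal((U.shape[0], 1))
    v_new = rng.standard_normal((1, Vt.shape[1]))
    return np.hstack([U, u_new]), np.append(s, 0.0), np.vstack([Vt, v_new])

rng = np.random.default_rng(0)
U = rng.standard_normal((5, 2))
s = np.array([2.0, 1.0])
Vt = rng.standard_normal((2, 3))

delta_before = (U * s) @ Vt
U2, s2, Vt2 = expand_rank(U, s, Vt, rng)
delta_after = (U2 * s2) @ Vt2
```

A flat spectrum gives high entropy (energy spread over many directions) and a single dominant direction gives entropy near zero, so the score can rank matrices when ranks are pruned or expanded under a shared budget.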
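Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Large Language Models #KV Compression #Context Extension
TL;DR: This paper introduces FreqKV, an efficient context extension method that iteratively compresses key-value states in the frequency domain.
🎯 Motivation: KV cache compression for LLMs usually evicts tokens, which risks losing critical local information in long-context tasks, and performance degrades sharply beyond the pretrained context length. The observation that context information is concentrated in low-frequency components suggests a different approach.
❓ Problem: Address the weak performance of existing KV cache compression on long-context tasks with a framework that compresses KV states in the frequency domain, supporting a more stable, extended context window.
🔍 Analysis: Frequency-domain analysis shows that context information is concentrated in low-frequency components; compressing the KV cache on that basis improves long-context handling while avoiding loss of key information.
🛠️ Method: FreqKV is a parameter-free, architecture-agnostic method that iteratively compresses the growing KV cache in the frequency domain, supporting longer context windows in both prefilling and decoding.
📊 Data & Experiments: With minimal training at 8K length, FreqKV extends the context window of LLaMA-2-7B up to 256K tokens while maintaining stable perplexity, and is evaluated on long-context benchmarks in both prefilling and decoding.
⭐ Contributions: FreqKV, a frequency-domain KV cache compression method that extends the context window to 256K tokens and outperforms existing compression methods on long-context understanding and generation.
Full abstract (Abstract)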
Existing key-value (KV) cache compression methods for large language models (LLMs) often rely on token eviction, which risks losing critical local information in both long prefilling and decoding scenarios. When extrapolating beyond the pretrained context length, their performance degrades sharply on long-context benchmarks. Motivated by the observation in the frequency domain that the context information is concentrated in the low-frequency components, we propose FreqKV, a parameter-free and architecture-agnostic approach. It iteratively compresses the increasing KV cache in the frequency domain, allowing models to process lengthy contexts efficiently. With minimal training at 8K length, FreqKV extends the context window of LLaMA-2-7B up to 256K tokens while maintaining stable perplexity. Extensive experiments on both prefilling and decoding stages demonstrate that FreqKV enables robust context window extension and consistently outperforms existing KV cache compression methods, highlighting its effectiveness for both understanding and generation in long contexts.
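Dropping high-frequency components along the sequence axis can be sketched with a discrete cosine transform. The abstract only says "frequency domain", so the choice of DCT-II, the rescaling factor, and the function names below are assumptions of this sketch, and the iterative scheduling of compression is not shown.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (t + 0.5) * k / n)
    C[0] /= np.sqrt(2.0)
    return C

def compress_kv(states, keep):
    """Shorten (seq_len, dim) states to (keep, dim) by dropping high frequencies."""
    seq_len = states.shape[0]
    coeffs = dct_matrix(seq_len) @ states   # transform along the sequence axis
    low = coeffs[:keep]                     # keep low-frequency components only
    # Inverse transform at the shorter length; rescale to preserve amplitude.
    return np.sqrt(keep / seq_len) * (dct_matrix(keep).T @ low)

rng = np.random.default_rng(0)
keys = rng.standard_normal((16, 4))
compressed = compress_kv(keys, keep=8)
```

The output has half as many positions as the input, and a signal that is constant along the sequence is reproduced exactly, which is the low-frequency behavior the method relies on.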
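Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Large Language models #model pruning
🎯 Motivation: LLMs excel at language understanding and generation, but their large parameter counts limit deployment and inference efficiency.
❓ Problem: Existing pruning methods focus on a single model and cannot exploit the strengths of different fine-tuned variants; this work proposes a pruning strategy that draws on multiple models.
🔍 Analysis: Combining layers from several fine-tuned models can preserve the original model's abilities while substantially reducing parameters, balancing performance and size.
🛠️ Method: Formulate pruning as a zero-order optimization problem over a search space with three operations: layer removal, layer selection from different candidate models, and layer merging.
📊 Data & Experiments: On the Llama2-13B model family, the compressed models retain about 97.3% of the original performance with about 25% of parameters removed, significantly outperforming previous pruning methods.
⭐ Contributions: A pruning method based on layer cutting and stitching across fine-tuned variants that offers a new route to LLM compression and an effective balance between performance and size.
Full abstract (Abstract)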
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. However, such impressive capability typically comes with a substantial model size, which presents significant challenges in deployment and inference. While structured pruning of model parameters offers a promising way to reduce computational costs at deployment time, current methods primarily focus on single-model pruning. In this work, we develop a novel strategy to compress models by strategically combining or merging layers from finetuned model variants, which preserves the original model's abilities by aggregating capabilities accentuated in different finetunes. We pose the optimal tailoring of these LLMs as a zero-order optimization problem, adopting a search space that supports three different operations: (1) Layer removal, (2) Layer selection from different candidate models, and (3) Layer merging. Our experiments demonstrate that this approach leads to competitive model pruning. For example, for the Llama2-13B model family, our compressed models maintain approximately 97.3\% of the original performance while removing ~25\% of parameters, significantly outperforming previous state-of-the-art methods.
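Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Test-Time Scaling #LLMs #Large Language Models #Speculative Decoding #Inference #Inference-Time Scaling #Best-of-n #Soft Best-of-n #PRM #Reward Models #Reward Guidance #KL Regularization #GSI
TL;DR: We describe a novel algorithm for test-time scaling that combines ideas from speculative decoding and best-of-n sampling and has provable guarantees.
🎯 Motivation: Improve LLM decoding efficiency at test time, especially under reward-model guidance, by exploring new algorithms.
❓ Problem: Soft best-of-n is computationally expensive at test time; a decoding method with both high accuracy and low latency is needed.
🔍 Analysis: Experiments show that combining a reward model with a small auxiliary model improves accuracy while substantially reducing inference time.
🛠️ Method: Guided Speculative Inference (GSI) combines soft best-of-n decoding with a reward model and speculative samples from a small auxiliary model, with theoretical guarantees on approximating the optimal policy and its expected reward.
📊 Data & Experiments: Evaluated on reasoning benchmarks (MATH500, OlympiadBench, Minerva Math, MMLU-STEM, GSM8K) across model families; accuracy improves over the compared methods while end-to-end latency drops by up to 28%.
⭐ Contributions: GSI, an efficient decoding algorithm that improves accuracy while keeping computation low, offering a practical option for test-time scaling.
Full abstract (Abstract)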
We propose Guided Speculative Inference (GSI), a novel algorithm for efficient reward-guided decoding in large language models.
GSI combines soft best-of-$n$ test-time scaling with a reward model $r(x,y)$ and speculative samples from a small auxiliary model $\pi_S(y\mid x)$. We provably approximate both the optimal tilted policy
$\pi_{\beta,B}(y\mid x) \propto \pi_B(y\mid x)\exp(\beta\,r(x,y))$ of soft best-of-$n$ under the base model $\pi_B$, as well as the expected reward under the optimal policy. In experiments on reasoning benchmarks (MATH500, OlympiadBench, Minerva Math, MMLU-STEM, GSM8K) and across different model families, our method achieves higher accuracy than standard soft best-of-$n$ with $\pi_S$ and reward-guided speculative decoding (Liao et al., 2025), and in certain settings even outperforms soft best-of-$n$ with $\pi_B$, while reducing end-to-end latency by up to 28%.
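The tilted policy $\pi_{\beta,B}$ above is what plain soft best-of-$n$ approximates: draw $n$ candidates from the base model and pick one with probability proportional to $\exp(\beta\,r)$. The sketch below shows only that selection step with hypothetical candidates and rewards; GSI's use of samples from the small model $\pi_S$ and its correction toward $\pi_B$ are not reproduced here.

```python
import numpy as np

def selection_probs(rewards, beta):
    """Softmax over beta * reward, i.e. weights proportional to exp(beta * r)."""
    logits = beta * np.asarray(rewards, dtype=float)
    logits -= logits.max()  # stabilize the exponentials
    w = np.exp(logits)
    return w / w.sum()

def soft_best_of_n(candidates, rewards, beta, rng):
    """Pick one candidate with probability proportional to exp(beta * reward)."""
    index = rng.choice(len(candidates), p=selection_probs(rewards, beta))
    return candidates[index]

rng = np.random.default_rng(0)
candidates = ["answer A", "answer B", "answer C"]  # hypothetical samples
rewards = [0.1, 0.9, 0.5]                          # hypothetical reward scores

uniform = selection_probs(rewards, beta=0.0)
greedy_pick = soft_best_of_n(candidates, rewards, beta=1000.0, rng=rng)
```

Setting `beta` to zero recovers uniform sampling among the candidates, while a very large `beta` behaves like hard best-of-$n$.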
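Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#MLLMs #Vision Token Pruning #Efficiency and Compression #Interpretability and Analysis
🎯 Motivation: The cost of processing vision tokens in multimodal LLMs (MLLMs) grows quadratically, limiting adoption. Existing progressive vision token pruning methods misjudge the function of shallow layers and use rigid pruning schedules, leaving efficiency gains unrealized.
❓ Problem: HiDrop aligns token pruning with the actual hierarchical function of MLLM layers to compress vision tokens efficiently, improving the pruning strategy and removing the hidden overhead of dynamic token reduction.
🔍 Analysis: Current methods misinterpret the role of the shallow layers, which are largely passive; active fusion of vision and language begins only in deeper layers. Fixed pruning rates also cannot follow how feature importance changes across layers, causing either performance loss or insufficient savings.
🛠️ Method: Late Injection introduces vision tokens only where active fusion begins. Concave Pyramid Pruning with an Early Exit mechanism adjusts pruning rates across middle and deep layers, optimized with an inter-layer similarity measure and a differentiable top-k operator. Persistent positional encoding, FlashAttention-compatible token selection, and related techniques remove hidden overhead.
📊 Data & Experiments: Extensive experiments on standard multimodal benchmarks show that about 90% of vision tokens can be compressed while matching the original performance, with training accelerated by 1.72x. Code is open-sourced.
⭐ Contributions: A token pruning framework aligned with the hierarchical function of MLLM layers; new insight into the hierarchical nature of multimodal fusion through late injection and dynamic pruning; techniques that are ready to use for efficient MLLM deployment.
Full abstract (Abstract)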
The quadratic computational cost of processing vision tokens in Multimodal Large Language Models (MLLMs) hinders their widespread adoption. While progressive vision token pruning offers a promising solution, current methods misinterpret shallow layer functions and use rigid schedules, which fail to unlock the full efficiency potential. To address these issues, we propose HiDrop, a framework that aligns token pruning with the true hierarchical function of MLLM layers. HiDrop features two key innovations: (1) Late Injection, which bypasses passive shallow layers to introduce visual tokens exactly where active fusion begins; and (2) Concave Pyramid Pruning with an Early Exit mechanism to dynamically adjust pruning rates across middle and deep layers. This process is optimized via an inter-layer similarity measure and a differentiable top-$k$ operator. To ensure practical efficiency, HiDrop further incorporates persistent positional encoding, FlashAttention-compatible token selection, and parallel decoupling of vision computation to eliminate hidden overhead associated with dynamic token reduction. Extensive experiments show that HiDrop compresses $\sim$90\% visual tokens while matching the original performance and accelerating training by 1.72$\times$. Our work not only sets a new state-of-the-art for efficient MLLM training and inference but also provides valuable insights into the hierarchical nature of multimodal fusion. The code is released at https://github.com/EIT-NLP/HiDrop.
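Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Diffusion Large Language Models #Inference Acceleration
🎯 Motivation: Widespread LLM use brings inference latency problems; discrete diffusion LLMs mitigate them, but their computational cost remains high.
❓ Problem: Propose a hierarchical decoding framework that substantially improves the inference efficiency of discrete diffusion language models.
🔍 Analysis: Conventional decoding for discrete diffusion models is computationally expensive, which limits practicality.
🛠️ Method: A divide-and-conquer strategy that recursively partitions masked spans and decodes tokens by confidence, increasing the number of tokens generated per forward pass and improving information utilization.
📊 Data & Experiments: On multiple benchmarks, accuracy matches or exceeds existing baselines, with inference up to 17x faster than vanilla decoding.
⭐ Contributions: An efficient hierarchical decoding strategy that offers a practical route to faster inference for discrete diffusion language models.
Full abstract (Abstract)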
Large language models (LLMs) are increasingly widely used and have attracted considerable attention. Although discrete diffusion large language models (dLLMs) mitigate the inference latency inherent in autoregressive LLM decoding, their computational overhead remains substantial. To address this challenge, we propose Hierarchy-dLLM, a hierarchical decoding framework inspired by the divide-and-conquer principle. Our method recursively partitions masked spans into smaller sub-decoding areas and decodes tokens according to their confidence, which substantially increases the number of tokens generated per forward pass and improves information utilization. Extensive experiments conducted on multiple benchmarks demonstrate that Hierarchy-dLLM achieves accuracy comparable to or even surpassing existing baselines. Meanwhile, it is up to 17× faster than vanilla decoding and about 1.5× faster than Fast-dLLM. These results establish hierarchical decoding as a practical solution for efficient dLLM inference.
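Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#LLM #Boolean neural networks
TL;DR: A novel multi-Boolean framework for low-bit LLMs
🎯 Motivation: Weight binarization reduces LLM complexity, but existing approaches have clear limits: post-training binarization is simple but loses much performance, while training-aware methods depend on full-precision latent weights, adding complexity and compute.
❓ Problem: Propose a multi-kernel Boolean parameter framework that, for the first time, fine-tunes LLMs directly in the Boolean domain without latent weights, increasing representational capacity while reducing the complexity of fine-tuning and inference.
🔍 Analysis: Current low-bit quantization and binarization techniques struggle to balance efficiency and performance: they either sacrifice quality or add computation, limiting LLM use in resource-constrained settings.
🛠️ Method: Represent the LLM with multi-kernel Boolean parameters and fine-tune directly in the Boolean domain, removing the dependence on full-precision latent weights and simplifying computation.
📊 Data & Experiments: Extensive experiments across diverse LLMs show that the method outperforms recent ultra-low-bit quantization and binarization techniques.
⭐ Contributions: The first direct fine-tuning of LLMs in the Boolean domain, using a multi-kernel Boolean architecture without latent weights, improving the efficiency and performance of low-bit LLMs.
Full abstract (Abstract)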
Weight binarization has emerged as a promising strategy to reduce the complexity of large language models (LLMs). Existing approaches fall into post-training binarization, which is simple but causes severe performance loss, and training-aware methods, which depend on full-precision latent weights, adding complexity and limiting efficiency. We propose a novel framework that represents LLMs with multi-kernel Boolean parameters and, for the first time, enables direct finetuning of LLMs in the Boolean domain, eliminating the need for latent weights. This enhances representational capacity and dramatically reduces complexity during both finetuning and inference. Extensive experiments across diverse LLMs show our method outperforms recent ultra low-bit quantization and binarization techniques.
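Foundation / Frontier Models (incl. LLMs)
Efficiency and Compression
#Low-Rank Adaptation #Integrated Gradients #Parameter-Efficient Fine-Tuning #Uncertainty-Aware Scoring
TL;DR: This paper proposes IGU-LoRA, an adaptive-rank LoRA method that leverages integrated gradients and uncertainty-aware scoring to improve parameter-efficient fine-tuning of large language models.
🎯 Motivation: As LLM parameter counts grow rapidly, full-parameter fine-tuning becomes too expensive; standard low-rank adaptation assigns the same rank to every layer and ignores differences in layer importance.
❓ Problem: Existing adaptive-rank methods over-rely on local sensitivity and ignore pathwise effects; the paper proposes a more stable and accurate adaptive rank-allocation method.
🔍 Analysis: Conventional methods compute importance scores from instantaneous gradients, which makes the scores unstable and biased and fails to capture non-local, pathwise effects in parameter space.
🛠️ Method: Within-layer sensitivities are computed with Integrated Gradients and combined with an uncertainty-aware scoring scheme that uses exponential moving averages with deviation tracking to suppress noise and calibrate rank allocation.
📊 Data & experiments: Across diverse tasks and architectures, IGU-LoRA consistently outperforms strong PEFT baselines at matched parameter budgets, improving downstream accuracy and robustness.
⭐ Contributions: An Integrated-Gradients-based adaptive-rank method backed by theoretical analysis; an uncertainty-aware mechanism that improves robustness; ablations confirming the role of pathwise sensitivity in rank allocation.
Abstract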
As large language models (LLMs) scale to billions of parameters, full-parameter fine-tuning becomes compute- and memory-prohibitive. Parameter-efficient fine-tuning (PEFT) mitigates this issue by updating only a small set of task-specific parameters while keeping the base model frozen. Among PEFT approaches, low-rank adaptation (LoRA) is widely adopted; however, it enforces a uniform rank across layers despite substantial variation in layer importance, motivating layerwise rank allocation. Recent adaptive-rank variants (e.g., AdaLoRA) allocate ranks based on importance scores, yet typically rely on instantaneous gradients that capture only local sensitivity, overlooking non-local, pathwise effects within the same layer, which yields unstable and biased scores. To address this limitation, we introduce IGU-LoRA, an adaptive-rank LoRA that (i) computes within-layer Integrated Gradients (IG) sensitivities and aggregates them into a layer-level score for rank allocation, and (ii) applies an uncertainty-aware scheme using exponential moving averages with deviation tracking to suppress noisy updates and calibrate rank selection. Theoretically, we prove an upper bound on the composite trapezoidal rule approximation error for parameter-space IG under a pathwise Hessian-Lipschitz condition, which informs the quadrature budget. Across diverse tasks and architectures, IGU-LoRA consistently outperforms strong PEFT baselines at matched parameter budgets, improving downstream accuracy and robustness. Ablations confirm the contributions of pathwise within-layer sensitivity estimates and uncertainty-aware selection to effective rank allocation.
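The two ingredients named in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it shows a composite trapezoidal approximation of parameter-space Integrated Gradients along a straight path, and an exponential moving average with deviation tracking. The toy loss, the zero baseline, the function names, and the smoothing factor `beta` are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple


def integrated_gradients(
    grad_fn: Callable[[Sequence[float]], List[float]],
    theta: Sequence[float],
    baseline: Sequence[float],
    steps: int = 8,
) -> List[float]:
    """Composite trapezoidal estimate of IG along the straight path baseline -> theta."""
    total = [0.0] * len(theta)
    for k in range(steps + 1):
        alpha = k / steps
        point = [b + alpha * (t - b) for t, b in zip(theta, baseline)]
        weight = 0.5 if k in (0, steps) else 1.0  # trapezoid end-point weights
        for i, g in enumerate(grad_fn(point)):
            total[i] += weight * g
    # Scale the averaged gradient by the displacement from the baseline.
    return [(t - b) * s / steps for t, b, s in zip(theta, baseline, total)]


def ema_with_deviation(scores: Sequence[float], beta: float = 0.9) -> Tuple[float, float]:
    """Track a smoothed score and the smoothed absolute deviation from it."""
    mean, dev = scores[0], 0.0
    for s in scores[1:]:
        mean = beta * mean + (1.0 - beta) * s
        dev = beta * dev + (1.0 - beta) * abs(s - mean)
    return mean, dev


# Hypothetical toy loss f(t) = t0^2 + 3*t0*t1 with its analytic gradient.
def toy_grad(t: Sequence[float]) -> List[float]:
    return [2.0 * t[0] + 3.0 * t[1], 3.0 * t[0]]


ig = integrated_gradients(toy_grad, theta=[1.0, 2.0], baseline=[0.0, 0.0])
```

For this toy loss the gradient is linear along the path, so the trapezoidal rule is exact and the attributions sum to f(theta) - f(baseline) = 7. For real networks the approximation error depends on the number of quadrature steps, which is what the paper's error bound is said to inform.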
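Foundation / Frontier Models (incl. LLMs)
Efficiency and Compression
#Large Vision-Language Models #Visual Token Pruning #Rotary Position Embeddings
🎯 Motivation: Large vision-language models incur very high inference costs on high-resolution visual inputs, and existing visual token pruning methods focus mainly on semantic relevance, often discarding tokens that are crucial for spatial reasoning.
❓ Problem: IVC-Prune aims for a training-free, prompt-aware visual token pruning strategy that sharply reduces the token count while keeping both the implicit visual coordinate tokens needed for spatial reasoning and the semantically relevant foreground tokens.
🔍 Analysis: LVLMs implicitly establish visual coordinate systems through Rotary Position Embeddings (RoPE), with specific token positions acting as implicit visual coordinates that are essential for spatial reasoning.
🛠️ Method: IVC tokens are identified by analyzing the mathematical properties of RoPE; foreground tokens are identified by a two-stage process of semantic seed discovery followed by contextual refinement via value-vector similarity.
📊 Data & experiments: Evaluated on 4 representative LVLMs and 20 diverse benchmarks, IVC-Prune reduces visual tokens by about 50% while maintaining ≥99% of the original performance, with improvements on several benchmarks.
⭐ Contributions: A visual token pruning method, presented as the first to combine implicit visual coordinates with semantic relevance; the finding that RoPE implicitly encodes spatial information in LVLMs; a training-free pruning strategy with high performance retention.
Abstract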
Large Vision-Language Models (LVLMs) achieve impressive performance across multiple tasks. A significant challenge, however, is their prohibitive inference cost when processing high-resolution visual inputs. While visual token pruning has emerged as a promising solution, existing methods that primarily focus on semantic relevance often discard tokens that are crucial for spatial reasoning. We address this gap through a novel insight into how LVLMs process spatial reasoning. Specifically, we reveal that LVLMs implicitly establish visual coordinate systems through Rotary Position Embeddings (RoPE), where specific token positions serve as implicit visual coordinates (IVC tokens) that are essential for spatial reasoning. Based on this insight, we propose IVC-Prune, a training-free, prompt-aware pruning strategy that retains both IVC tokens and semantically relevant foreground tokens. IVC tokens are identified by theoretically analyzing the mathematical properties of RoPE, targeting positions at which its rotation matrices approximate the identity matrix or the 90° rotation matrix. Foreground tokens are identified through a robust two-stage process: semantic seed discovery followed by contextual refinement via value-vector similarity. Extensive evaluations across four representative LVLMs and twenty diverse benchmarks show that IVC-Prune reduces visual tokens by approximately 50% while maintaining ≥ 99% of the original performance and even achieving improvements on several benchmarks.
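To make the RoPE criterion concrete, the sketch below ranks positions by how close their 2-D rotation is to the identity or to a 90° rotation. It assumes the standard RoPE frequency schedule (`base ** (-2 * i / head_dim)`) and a single frequency pair; the function names and the ranking rule are hypothetical, and the paper's actual selection procedure may differ (for example, in how it aggregates across frequency pairs).

```python
import math
from typing import List


def rope_angle(position: int, pair_index: int, head_dim: int, base: float = 10000.0) -> float:
    """Rotation angle standard RoPE applies to one 2-D pair at a given position."""
    theta = base ** (-2.0 * pair_index / head_dim)
    return position * theta


def rotation_distance(angle: float, target_angle: float) -> float:
    """Frobenius distance between the 2-D rotation matrices R(angle) and R(target_angle)."""
    dc = math.cos(angle) - math.cos(target_angle)
    ds = math.sin(angle) - math.sin(target_angle)
    # Each of cos and sin appears twice in a 2x2 rotation matrix.
    return math.sqrt(2.0 * (dc * dc + ds * ds))


def closest_positions(num_positions: int, theta: float, target_angle: float, k: int) -> List[int]:
    """The k positions whose rotation by position * theta is nearest to R(target_angle)."""
    ranked = sorted(
        range(num_positions),
        key=lambda m: rotation_distance(m * theta, target_angle),
    )
    return sorted(ranked[:k])


# Toy frequency of a quarter turn per position, chosen so the pattern is easy to see.
near_identity = closest_positions(8, math.pi / 2, 0.0, k=2)
near_quarter_turn = closest_positions(8, math.pi / 2, math.pi / 2, k=2)
```

With a quarter turn per position, the rotation returns to the identity every four positions and hits 90° one position later, which is the kind of periodic structure the abstract refers to.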
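Foundation / Frontier Models (incl. LLMs)
Efficiency and Compression
#quantization #large language models #LLMs
🎯 Motivation: LLMs have very high memory demands during fine-tuning and inference, and existing block-wise quantization methods incur suboptimal quantization error.
❓ Problem: Optimize block-wise quantization to reduce quantization error, and improve language modeling performance through a modified normalization method and a mixed-precision strategy.
🔍 Analysis: Experiments indicate that current quantization methods do not handle zero and large-amplitude weights effectively, and that distributional mismatch noticeably affects modeling accuracy.
🛠️ Method: 4-bit block-wise optimal float (BOF4) and its variant BOF4-S, together with outlier-preserving quantization (OPQ), a mixed-precision strategy that further reduces quantization error.
📊 Data & experiments: BOF4 is derived with both a theoretical and a data-driven solution; variant experiments analyze the representation error of zero and large-amplitude weights; results are evaluated by perplexity.
⭐ Contributions: A family of optimized 4-bit quantizers with the best perplexity among 4-bit block-wise methods; the OPQ mixed-precision strategy; an experimental study of several quantization variants and their effects.
Abstract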
Large language models (LLMs) demand extensive memory capacity during both fine-tuning and inference. To enable memory-efficient fine-tuning, existing methods apply block-wise quantization techniques, such as NF4 and AF4, to the network weights. We show that these quantization techniques incur suboptimal quantization errors. Therefore, as a first novelty, we propose an optimization approach for block-wise quantization. Using this method, we design a family of quantizers named 4-bit block-wise optimal float (BOF4), which consistently reduces the quantization error compared to both baseline methods. We provide both a theoretical and a data-driven solution for the optimization process and prove their practical equivalence. Secondly, we propose a modification to the employed normalization method based on the signed absolute block maximum (BOF4-S), enabling further reduction of the quantization error and empirically achieving less degradation in language modeling performance. Thirdly, we explore additional variations of block-wise quantization methods applied to LLMs through an experimental study on the importance of accurately representing zero and large-amplitude weights on the one hand, and optimization towards various error metrics on the other hand. Lastly, we introduce a mixed-precision quantization strategy dubbed outlier-preserving quantization (OPQ) to address the distributional mismatch induced by outlier weights in block-wise quantization. By storing outlier weights in 16-bit precision (OPQ) while applying BOF4-S, we achieve top performance among 4-bit block-wise quantization techniques w.r.t. perplexity.
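The block-wise scheme discussed here can be sketched as follows: each block is normalized by a per-block scale, every weight is snapped to the nearest codebook level, and dequantization multiplies back by the scale. The codebook below is a uniform placeholder (the actual NF4, AF4, and BOF4 levels are not reproduced), and the `signed=True` branch reflects one reading of "signed absolute block maximum" (dividing by the signed value of the largest-magnitude element); both are assumptions of this sketch.

```python
from typing import List, Sequence, Tuple

# Placeholder codebook: 16 uniformly spaced levels in [-1, 1].
# The real NF4 / AF4 / BOF4 levels differ and are not reproduced here.
UNIFORM_LEVELS = [-1.0 + 2.0 * i / 15.0 for i in range(16)]


def quantize_block(
    block: Sequence[float],
    levels: Sequence[float] = UNIFORM_LEVELS,
    signed: bool = False,
) -> Tuple[float, List[int]]:
    """Return (scale, codes) for one block of weights."""
    peak = max(block, key=abs)  # element with the largest magnitude
    # Unsigned: scale by |peak|. Signed: scale by peak itself, so the peak
    # element always normalizes to +1 (assumed reading of BOF4-S).
    scale = peak if signed else abs(peak)
    if scale == 0.0:
        return 0.0, [0] * len(block)  # all-zero block: any code dequantizes to 0
    codes = []
    for w in block:
        x = w / scale
        codes.append(min(range(len(levels)), key=lambda i: abs(levels[i] - x)))
    return scale, codes


def dequantize_block(
    scale: float, codes: Sequence[int], levels: Sequence[float] = UNIFORM_LEVELS
) -> List[float]:
    """Map codes back to weights."""
    return [scale * levels[c] for c in codes]


block = [0.5, -2.0, 1.0]
scale_u, codes_u = quantize_block(block)
recon_u = dequantize_block(scale_u, codes_u)
scale_s, codes_s = quantize_block(block, signed=True)
recon_s = dequantize_block(scale_s, codes_s)
```

In the unsigned variant the peak element can land on either -1 or +1; in the signed variant it always lands on +1, which leaves the codebook free to treat the negative end differently.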
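Foundation / Frontier Models (incl. LLMs)
Efficiency and Compression
#Knowledge distillation #Directional coverage #Gradient variance #Cross Validation #Best Teacher prediction
TL;DR: GRACE is a gradient-based score that efficiently predicts the best teacher for knowledge distillation, without requiring teacher internals or test data.
🎯 Motivation: Knowledge distillation requires choosing the best teacher model, but existing practice relies on costly trial and error, so an efficient, lightweight teacher-selection method is needed.
❓ Problem: The paper proposes GRACE, a gradient-based score that quantifies teacher effectiveness without access to teacher internals or test data.
🔍 Analysis: GRACE reaches up to 86% Spearman correlation with the performance of the distilled student, indicating strong predictive power.
🛠️ Method: GRACE is computed from distributional properties of the student's gradients; it is connected, from an information-theoretic perspective, to the leave-one-out stability of gradient-based algorithms, and it guides key design decisions in distillation.
📊 Data & experiments: Validated on GSM8K and MATH with LLaMA and OLMo students across several teacher-selection scenarios.
⭐ Contributions: Efficient teacher selection that improves student performance by up to 7.4% over naively using the best-performing teacher, plus fine-grained guidance on generation temperature, teacher size, and model family.
Abstract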
Knowledge distillation is an efficient strategy to use data generated by large “teacher” language models to train smaller capable “student” models, but selecting the optimal teacher for a specific student-task combination requires expensive trial-and-error. We propose a lightweight score called GRACE to quantify how effective a teacher will be for post-training a student model. GRACE measures distributional properties of the student’s gradients without access to a verifier, teacher logits, teacher internals, or test data. From an information-theoretic perspective, GRACE connects to leave-one-out stability of gradient-based algorithms, which controls the generalization performance of the distilled students. On GSM8K and MATH, GRACE correlates strongly (up to 86% Spearman correlation) with the performance of the distilled LLaMA and OLMo students. In particular, training a student using the GRACE-selected teacher can improve the performance by up to 7.4% over naively using the best-performing teacher. Further, GRACE can provide guidance on crucial design choices in distillation, including (1) the best temperature to use when generating from the teacher, (2) the best teacher to use given a size constraint, and (3) the best teacher to use within a specific model family. Altogether, our findings demonstrate that GRACE can efficiently and effectively identify a strongly compatible teacher for a given student and provide fine-grained guidance on how to perform distillation.
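Foundation / Frontier Models (incl. LLMs)
Efficiency and Compression
#Large Language Models #Speculative Decoding
TL;DR: We propose a new dynamic tree speculative decoding method that accounts for inference cost and achieves improvements over baselines.
🎯 Motivation: LLMs suffer significant inference latency because of their autoregressive design and large size, so improving inference efficiency is a key challenge.
❓ Problem: Existing methods neglect the impact of system variables such as GPU configuration and batch size, which makes inference efficiency hard to optimize. The paper proposes a dynamic-tree decoding method that optimizes with inference cost in mind.
🔍 Analysis: Approaches such as EAGLE-2 and EAGLE-3 improve speculative decoding with dynamic trees but do not account for the effect of hardware configuration and batch size, which limits performance in practice.
🛠️ Method: CAST, a dynamic tree decoding method that takes GPU configuration, batch size, and related cost factors into account to dynamically refine the tree structure.
📊 Data & experiments: Comprehensive experiments on six diverse tasks and six LLMs show speeds up to 5.2 times faster than conventional decoding, and the method generally outperforms existing state-of-the-art techniques by 5% to 20%.
⭐ Contributions: CAST, a cost-aware dynamic tree decoding method that substantially improves inference efficiency; the code is publicly released.
Abstract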
Large Language Models (LLMs) face significant inference latency challenges stemming from their autoregressive design and large size. To address this, speculative decoding emerges as a solution, enabling the simultaneous generation and validation of multiple tokens. While recent approaches like EAGLE-2 and EAGLE-3 improve speculative decoding using dynamic tree structures, they often neglect the impact of crucial system variables such as GPU devices and batch sizes.
Therefore, we introduce a new dynamic tree decoding approach called CAST that takes into account inference costs, including factors such as GPU configurations and batch sizes, to dynamically refine the tree structure. Through comprehensive experiments across six diverse tasks and six distinct LLMs, our method achieves speeds up to 5.2 times faster than conventional decoding methods. Moreover, it generally outperforms existing state-of-the-art techniques by 5% to 20%. The code is available at https://github.com/EAGLE-Research/sglang-eagle4.
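Foundation / Frontier Models (incl. LLMs)
Efficiency and Compression
#microscaling #fine-grained #FP4 #quantization #low-precision #llm
TL;DR: Naive microscaling formats hit their limits when block size is too small
🎯 Motivation: Microscaling formats compress models aggressively through per-block tensor quantization, but their accuracy degrades at very small block sizes, and this needs to be understood and fixed.
❓ Problem: Investigate why model outputs degrade when the block size in microscaling quantization becomes too small, and propose a hardware-friendly remedy.
🔍 Analysis: The anomalous degradation is driven by the interplay between narrow tensor distributions and the limited dynamic range of the quantized scales.
🛠️ Method: The sources of quantization error are analyzed experimentally and theoretically, and FP8 unsigned E5M3 (UE5M3) is proposed as a hardware-friendly format for the scales in FP4 microscaling data types.
📊 Data & experiments: The distributions of several large language models are analyzed, and the theoretical framework is checked against data from pretrained-model distributions and ideal ones.
⭐ Contributions: Identifies a limitation of microscaling formats and proposes UE5M3 scales, which achieve performance comparable to conventional UE4M3 scales while removing the need for global scaling operations.
Abstract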
Microscaling data formats leverage per-block tensor quantization to enable aggressive model compression with limited loss in accuracy. Unlocking their potential for efficient training and inference necessitates hardware-friendly implementations that handle matrix multiplications in a native format and adopt efficient error-mitigation strategies. Herein, we report the emergence of a surprising behavior associated with microscaling quantization, whereby the output of a quantized model degrades as the block size is decreased below a given threshold. This behavior clashes with the expectation that a smaller block size should allow for a better representation of the tensor elements. We investigate this phenomenon both experimentally and theoretically, decoupling the sources of quantization error behind it. Experimentally, we analyze the distributions of several Large Language Models and identify the conditions driving the anomalous behavior. Theoretically, we lay down a framework showing remarkable agreement with experimental data from pretrained model distributions and ideal ones. Overall, we show that the anomaly is driven by the interplay between narrow tensor distributions and the limited dynamic range of the quantized scales. Based on these insights, we propose the use of FP8 unsigned E5M3 as a novel hardware-friendly format for the scales in FP4 microscaling data types. We demonstrate that UE5M3 achieves comparable performance to the conventional FP8 unsigned E4M3 scales while obviating the need for global scaling operations on weights and activations.
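Foundation / Frontier Models (incl. LLMs)
Efficiency and Compression
#vector quantization #llm #Moe
🎯 Motivation: Mixture-of-Experts (MoE) models improve performance while staying computationally efficient, but their enormous parameter counts and memory demands limit deployment in resource-constrained environments. Vector quantization (VQ) is a promising route to ultra-low-bit LLM compression.
❓ Problem: Applying VQ directly to MoEs degrades performance, mainly because redundant representations across experts lead to inefficient coding and because output bias amplified by expert aggregation shifts the distribution of quantized outputs.
🔍 Analysis: Expert weights are highly redundant, so quantizing them repeatedly wastes the limited codebook capacity; the cumulative output bias after quantization is amplified and causes a marked accuracy drop.
🛠️ Method: KBVQ-MoE removes redundancy with a Karhunen–Loève Transform (KLT) guided singular value decomposition (SVD) and stabilizes quantized outputs with a bias-correction mechanism based on channel-wise affine compensation.
📊 Data & experiments: Evaluated on several MoE LLMs; for example, 3-bit quantization of Qwen1.5-MoE-A2.7B reaches an average accuracy of 67.99, nearly identical to the FP16 baseline of 68.07.
⭐ Contributions: KBVQ-MoE, a lightweight offline framework that substantially improves the accuracy of extremely low-bit MoE quantization and makes efficient deployment on resource-constrained devices feasible.
Abstract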
Mixture of Experts (MoE) models have achieved great success by significantly improving performance while maintaining computational efficiency through sparse expert activation. However, their enormous parameter sizes and memory demands pose significant challenges for deployment in resource-constrained environments.
Vector Quantization (VQ) offers a promising approach for ultra-low-bit compression in Large Language Models (LLMs) by constructing and leveraging a codebook—where weight vectors are mapped to the most similar discrete codewords within the codebook.
However, its direct application to MoEs suffers from significant performance degradation caused by two critical obstacles: (1) redundant representation among experts leads to VQ repeatedly quantizing similar representations for each expert, resulting in inefficient utilization of the limited codebook capacity; and (2) cumulative output bias, amplified by expert aggregation, leads to distributional shifts in the quantized outputs, resulting in degraded model accuracy.
To this end, we propose KBVQ-MoE, a novel VQ framework to enhance extremely low-bit quantization for MoE-based LLMs.
KBVQ-MoE introduces two lightweight and offline techniques that introduce negligible runtime computational and memory overhead:
(1) Input-driven redundancy elimination, where a Karhunen–Loève Transform (KLT) guided singular value decomposition (SVD) extracts and shares dominant weight components across experts.
(2) Bias-corrected output stabilization, where vector quantization is applied to expert-specific (i.e., non-redundant) representations and the quantized outputs are corrected with channel-wise affine compensation.
Experiments on various MoE LLMs demonstrate that KBVQ-MoE preserves accuracy substantially better than existing quantization methods.
For instance, 3-bit quantization of Qwen1.5-MoE-A2.7B achieves an average accuracy of 67.99, nearly identical to the FP16 baseline of 68.07, underscoring the potential of KBVQ-MoE for efficient deployment on edge devices and other resource-constrained platforms.
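Foundation / Frontier Models (incl. LLMs)
Efficiency and Compression
#transformer #kv cache #compression
TL;DR: We present KVTC, a lightweight transform coder that allows for extended retention of transformer KV-cache via compression.
🎯 Motivation: Serving LLMs at scale requires efficient KV cache management; stale caches occupy scarce GPU memory or force offloading and recomputation.
❓ Problem: Design a lightweight compression algorithm that shrinks KV caches for on-GPU and off-GPU storage, preserving inference efficiency and long-lived context reuse.
🔍 Analysis: KV caches are highly redundant, and existing methods struggle to balance high compression ratios against inference accuracy.
🛠️ Method: KVTC, a transform coder that combines PCA-based feature decorrelation, adaptive quantization, and entropy coding for low-overhead, high-ratio KV cache storage.
📊 Data & experiments: Tested with Llama 3, Mistral NeMo, and R1-Qwen 2.5 on benchmarks including AIME25 and GSM8K; KVTC outperforms existing inference-time baselines while achieving higher compression ratios.
⭐ Contributions: A method that achieves up to 20x compression (40x or higher for specific use cases), improving memory efficiency and long-term cache reuse for large-scale LLM serving.
Abstract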
Serving large language models (LLMs) at scale necessitates efficient key-value (KV) cache management. KV caches can be reused across conversation turns via shared-prefix prompts that are common in iterative code editing and chat. However, stale caches consume scarce GPU memory, require offloading, or force recomputation. We present KVTC, a lightweight transform coder that compresses KV caches for compact on-GPU and off-GPU storage. Drawing on classical media compression, KVTC combines PCA-based feature decorrelation, adaptive quantization, and entropy coding. It requires only a brief initial calibration and leaves model parameters unchanged. By exploiting redundancies in KV caches, KVTC achieves up to 20x compression while maintaining reasoning and long-context accuracy, and 40x or higher for specific use cases. We test KVTC with Llama 3, Mistral NeMo, and R1-Qwen 2.5 models across benchmarks including AIME25, GSM8K, LiveCodeBench, LongBench, MATH-500, MMLU, Qasper and RULER. It consistently outperforms inference-time baselines such as token eviction, quantization, and SVD-based methods, while achieving higher compression ratios. These results support KVTC as a practical building block for memory-efficient LLM serving with reusable KV caches.
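Foundation / Frontier Models (incl. LLMs)
Efficiency and Compression
#Large Language Models #Multi-Agent Systems #Inter-LLM Communication #Multi-agent Debate
TL;DR: We propose KVComm, a communication framework that enables efficient inter-LLM collaboration by selectively sharing key-value pairs, achieving near upper-bound performance with significantly reduced communication cost.
🎯 Motivation: LLMs are widely used in multi-agent systems, but existing communication protocols suffer from high inference cost or information concentration bias, so a more efficient mechanism is needed.
❓ Problem: Provide a more efficient framework that selectively shares KV pairs, avoiding both the information loss of natural-language communication and the inefficiency of hidden-state communication.
🔍 Analysis: Natural-language communication is costly and lossy, while hidden states concentrate information too heavily and are inefficient as a communication medium.
🛠️ Method: KVComm uses a layer-wise selection strategy based on attention importance scores with a Gaussian prior to pick the most informative KV pairs to share.
📊 Data & experiments: Extensive experiments across tasks and model pairs show performance comparable to the upper-bound method while transmitting as few as 30% of the layers' KV pairs.
⭐ Contributions: Shows that KV pairs are an effective medium for inter-LLM communication, offering a path to scalable and efficient multi-agent systems.
Abstract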
Large Language Models (LLMs) are increasingly deployed in multi-agent systems, where effective inter-model communication is crucial. Existing communication protocols either rely on natural language, incurring high inference costs and information loss, or on hidden states, which suffer from information concentration bias and inefficiency. To address these limitations, we propose KVComm, a novel communication framework that enables efficient communication between LLMs through selective sharing of KV pairs. KVComm leverages the rich information encoded in the KV pairs while avoiding the pitfalls of hidden states. We introduce a KV layer-wise selection strategy based on attention importance scores with a Gaussian prior to identify the most informative KV pairs for communication. Extensive experiments across diverse tasks and model pairs demonstrate that KVComm achieves comparable performance to the upper-bound method, which directly merges inputs to one model without any communication, while transmitting as few as 30% of layers' KV pairs. Our study highlights the potential of KV pairs as an effective medium for inter-LLM communication, paving the way for scalable and efficient multi-agent systems.
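Foundation / Frontier Models (incl. LLMs)
Efficiency and Compression
#llm #reasoning #latent reasoning #efficiency
TL;DR: We introduce a latent reasoning method guided by distillation from compressed kv-cache.
🎯 Motivation: LLMs excel with explicit chain-of-thought reasoning, but verbose traces are costly in compute and memory and often redundant. Latent reasoning is an efficient alternative, yet its lack of supervision limits it on complex natural-language reasoning.
❓ Problem: To address the missing supervision, the paper proposes a framework that distills knowledge from a compressed KV-cache into a latent-reasoning model, narrowing the gap between explicit and latent reasoning.
🔍 Analysis: A compressed KV-cache, despite lacking direct token correspondence, holds rich abstract and unstructured knowledge that can serve as a strong supervisory signal for latent reasoning.
🛠️ Method: KaVa uses self-distillation to transfer information from the teacher's compressed KV-cache, using continuous latent tokens to align stepwise KV trajectories.
📊 Data & experiments: The method outperforms strong latent baselines, degrades markedly less when moving from equation-only to natural-language traces, and scales to larger backbones while preserving efficiency.
⭐ Contributions: A supervision method for latent reasoning based on compressed KV-cache distillation, combining the accuracy of CoT-trained teachers with the efficiency of latent inference.
Abstract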
Large Language Models (LLMs) excel at multi-step reasoning problems with explicit chain-of-thought (CoT), but verbose traces incur significant computational costs and memory overhead, and often carry redundant, stylistic artifacts. Latent reasoning has emerged as an efficient alternative that internalizes the thought process, but it suffers from a critical lack of supervision, limiting its effectiveness on complex, natural-language reasoning traces.
In this work we propose KaVa, the first framework that bridges this gap by distilling knowledge directly from a compressed KV-cache of the teacher into a latent-reasoning student via self-distillation, leveraging the representational flexibility of continuous latent tokens to align stepwise KV trajectories. We show that the abstract, unstructured knowledge within compressed KV-cache, which lacks direct token correspondence, can serve as a rich supervisory signal for a latent reasoning student.
Empirically, the approach consistently outperforms strong latent baselines, exhibits markedly smaller degradation from equation-only to natural-language traces, and scales to larger backbones while preserving efficiency. These results establish compressed KV-cache distillation as a scalable supervision signal for latent reasoning, combining the accuracy of CoT-trained teachers with the efficiency and deployability of latent inference.
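Foundation / Frontier Models (incl. LLMs)
Efficiency and Compression
#Large Multimodal Models #Model Compression #Fourier Domain #Matrix Approximation
🎯 Motivation: Large multimodal models (LMMs) perform well on vision-language tasks, but their compute and memory costs hinder deployment. Existing compression methods often decouple low-rank decomposition from quantization, which compounds errors, especially in the presence of cross-modal redundancy.
❓ Problem: The paper proposes LLaVA-FA, an efficient LMM that performs joint low-rank plus quantization approximation in the frequency domain, using the de-correlation and conjugate symmetry properties of the Fourier transform to obtain more compact and accurate weight representations.
🔍 Analysis: When applied to multimodal architectures, existing methods accumulate reconstruction error because low-rank decomposition and quantization are decoupled, and frequency-domain properties have not been fully exploited.
🛠️ Method: Joint low-rank and quantization approximation in the Fourier domain; PolarQuant, a polar-coordinate quantization method for complex matrices; an optional diagonal calibration (ODC) scheme that removes the need for large-scale calibration data.
📊 Data & experiments: Across multiple benchmarks, LLaVA-FA outperforms existing efficient multimodal models while keeping activated parameters minimal and computational cost low.
⭐ Contributions: A joint low-rank and quantization compression method in the frequency domain that reduces reconstruction error; PolarQuant for complex matrices; the lightweight ODC calibration scheme.
Abstract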
Large multimodal models (LMMs) have achieved impressive performance on various vision-language tasks, but their substantial computational and memory costs hinder their practical deployment. Existing compression methods often decouple low-rank decomposition and quantization, leading to compounded reconstruction errors, especially in multimodal architectures with cross-modal redundancy. To address this issue, we propose LLaVA-FA, a novel efficient LMM that performs joint low-rank plus quantization approximation in the frequency domain. By leveraging the de-correlation and conjugate symmetry properties of Fourier transform, LLaVA-FA achieves more compact and accurate weight representations. Furthermore, we introduce PolarQuant, a polar-coordinate quantization method tailored for complex matrices, and an optional diagonal calibration (ODC) scheme that eliminates the need for large-scale calibration data. Extensive experimental results demonstrate that our proposed LLaVA-FA outperforms existing efficient multimodal models across multiple benchmarks while maintaining minimal activated parameters and low computational costs, validating its effectiveness as a powerful solution for compressing LMMs.
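Foundation / Frontier Models (incl. LLMs)
Efficiency and Compression
#llm #decoding intervention #language confusion
🎯 Motivation: LLMs often mix languages unintentionally during generation, which harms output quality. Existing fixes either require retraining or cannot tell harmful confusion from acceptable code-switching.
❓ Problem: The paper proposes the Language Confusion Gate (LCG), a lightweight plug-in that filters unwanted language confusion during decoding without modifying the base model.
🔍 Analysis: Language confusion is infrequent, correct-language tokens are usually among the top predictions, and output token embedding norms are larger for high-resource languages, which biases sampling toward them.
🛠️ Method: The LCG is trained with norm-adjusted self-distillation to predict appropriate language families and masks non-target-language tokens only when needed.
📊 Data & experiments: Evaluated on models including Qwen3, GPT-OSS, Gemma3, and Llama3.1, LCG substantially reduces language confusion without hurting task performance.
⭐ Contributions: A language-aware decoding plug-in that needs no retraining and substantially lowers language confusion, motivated by the observed relationship between embedding norms and sampling bias.
Abstract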
Large language models (LLMs) often experience language confusion, which is the unintended mixing of languages during text generation. Current solutions to this problem either necessitate model retraining or cannot differentiate between harmful confusion and acceptable code-switching. This paper introduces the Language Confusion Gate (LCG), a lightweight, plug-in solution that filters tokens during decoding without altering the base LLM. The LCG is trained using norm-adjusted self-distillation to predict appropriate language families and apply masking only when needed. Our method is based on the findings that language confusion is infrequent, correct-language tokens are usually among the top predictions, and output token embedding norms are larger for high-resource languages, which biases sampling. When evaluated across various models, including Qwen3, GPT-OSS, Gemma3, and Llama3.1, LCG decreases language confusion significantly (often by an order of magnitude) without negatively impacting task performance.
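The masking step of such a gate can be illustrated with a short sketch. It covers only the decode-time filter: the gate decision and the allowed language families are passed in as inputs, whereas in the paper they come from a trained module. The function name, the token-to-family map, and the family labels are hypothetical.

```python
import math
from typing import List, Sequence, Set


def apply_language_gate(
    logits: Sequence[float],
    token_family: Sequence[str],
    allowed_families: Set[str],
    gate_open: bool,
) -> List[float]:
    """Mask tokens outside the allowed language families, but only when the gate fires."""
    if not gate_open:
        return list(logits)  # the common case: leave the distribution untouched
    return [
        x if token_family[i] in allowed_families else -math.inf
        for i, x in enumerate(logits)
    ]


# Hypothetical three-token vocabulary with a language-family label per token.
logits = [1.0, 2.0, 0.5]
families = ["latin", "cjk", "symbol"]
gated = apply_language_gate(logits, families, {"latin", "symbol"}, gate_open=True)
ungated = apply_language_gate(logits, families, {"latin", "symbol"}, gate_open=False)
best_gated = max(range(len(gated)), key=lambda i: gated[i])
```

Because the gate stays closed for most steps, legitimate code-switching and the base model's ranking are left alone; masking only changes the outcome when an off-target token would otherwise win.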
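Foundation / Frontier Models (incl. LLMs)
Efficiency and Compression
#LLM Compression #Post-training Compression #Tucker Decomposition #Sparsity
🎯 Motivation: LLMs have massive parameter counts and are expensive to deploy, which calls for efficient data-free compression that exploits structural redundancy in the weights.
❓ Problem: Existing tensor decomposition methods are capped by a dense core tensor bottleneck. The work aims to break this ceiling and reach higher compression ratios.
🔍 Analysis: Existing methods find a shared low-rank basis, but the dense core tensor grows polynomially with the chosen ranks and becomes a new storage bottleneck.
🛠️ Method: LeSTD, a two-stage framework: an iterative algorithm first finds a high-quality shared orthogonal basis, then an importance-based pruning algorithm learns an ultra-sparse core tensor.
📊 Data & experiments: Applied to Multi-Head Attention blocks, where sparsifying the core tensor and refitting the remaining entries raises the compression ratio while preserving model fidelity.
⭐ Contributions: Removes the dense-core bottleneck of prior tensor methods with a principled sparse tensor decomposition framework for LLM compression.
Abstract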
Large Language Models (LLMs) achieve remarkable success, but their massive parameter counts present significant deployment challenges. Post-training tensor decomposition offers a promising, data-free compression strategy by exploiting structural redundancies within the model weights. However, existing tensor methods face a critical limitation: the dense core tensor bottleneck. While these methods find a shared low-rank basis, the resulting dense core tensor grows polynomially with the chosen ranks, becoming a new storage bottleneck and capping the maximum achievable compression. To overcome this fundamental barrier, we introduce LeSTD (Learning-based Sparse Tensor Decomposition), a novel two-stage framework for the high-ratio compression of Multi-Head Attention (MHA) blocks. LeSTD first employs an iterative algorithm to identify a high-quality shared orthogonal basis that jointly represents all attention heads. Subsequently, it introduces a principled, importance-based pruning algorithm that learns an ultra-sparse core tensor by systematically removing the least salient elements and refitting the remaining ones to preserve model fidelity. By decoupling basis optimization from core sparsification, LeSTD breaks the compression ceiling imposed by the dense core, enabling significantly higher compression ratios than prior methods.
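The dense-core bottleneck is easy to see by counting stored values. In a Tucker decomposition, the factor matrices grow linearly with each rank while the dense core grows with the product of all ranks. The helper below is a small counting sketch with hypothetical names and toy shapes; it ignores the index storage a sparse core would also need.

```python
from math import prod
from typing import Optional, Sequence


def tucker_params(
    shape: Sequence[int],
    ranks: Sequence[int],
    core_nnz: Optional[int] = None,
) -> int:
    """Stored values for a Tucker decomposition: factor matrices plus core.

    If core_nnz is given, the core is counted as sparse with that many
    nonzeros (index overhead is not included in this sketch).
    """
    factors = sum(n * r for n, r in zip(shape, ranks))
    core = prod(ranks) if core_nnz is None else core_nnz
    return factors + core


dense = tucker_params((4, 5, 6), (2, 3, 4))               # 8 + 15 + 24 factors, 24 core
sparse = tucker_params((4, 5, 6), (2, 3, 4), core_nnz=5)  # same factors, 5 core entries
doubled_core = prod((4, 6, 8))                            # core after doubling every rank
```

Doubling all three ranks doubles the factor storage but multiplies the dense core by eight, which is why pruning the core is where the remaining compression headroom lies.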
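Foundation / Frontier Models (incl. LLMs)
Efficiency and Compression
#speculative decoding #reinforcement learning
TL;DR: We use reinforcement learning to train two co-adaptive policies to dynamically coordinate the draft and verification phases, using throughput as the reward signal.
🎯 Motivation: Existing speculative decoding methods allocate time between drafting and verification statically, ignoring the true time cost and the potential for the two phases to cooperate dynamically.
❓ Problem: The paper proposes a method that dynamically coordinates drafting and verification and directly optimizes the throughput of each draft-and-verify cycle.
🔍 Analysis: Prior dynamic approaches optimize proxy metrics such as acceptance length rather than actual time cost, and they treat drafting and verification in isolation.
🛠️ Method: Two co-adaptive policies are trained with reinforcement learning to coordinate the draft and verification phases, with throughput as the objective.
📊 Data & experiments: Evaluated on five LLMs and four tasks, LTD achieves speedup ratios from 2.24x to 4.32x, outperforming the state-of-the-art method Eagle3 by up to 36.4%.
⭐ Contributions: LTD, a reinforcement-learning framework for dynamically coordinating speculative decoding, which improves decoding efficiency.
Abstract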
Speculative decoding accelerates large language model (LLM) inference by using a small draft model to generate candidate tokens for a larger target model to verify. The efficacy of this technique hinges on the trade-off between the time spent on drafting candidates and verifying them. However, current state-of-the-art methods rely on a static time allocation, while recent dynamic approaches optimize for proxy metrics like acceptance length, often neglecting the true time cost and treating the drafting and verification phases in isolation. To address these limitations, we introduce Learning to Draft (LTD), a novel method that directly optimizes for throughput of each draft-and-verify cycle. We formulate the problem as a reinforcement learning environment and train two co-adaptive policies to dynamically coordinate the draft and verification phases. This encourages the policies to adapt to each other and explicitly maximize decoding efficiency.
We conducted extensive evaluations on five diverse LLMs and four distinct tasks. Our results show that LTD achieves speedup ratios ranging from 2.24x to 4.32x, outperforming the state-of-the-art method Eagle3 by up to 36.4%.
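A small calculation shows why acceptance length is only a proxy for throughput. The sketch below uses the common simplifying assumption of chain drafting with an independent per-token acceptance rate `alpha`, under which the expected tokens per cycle with `k` drafted tokens is (1 - alpha^(k+1)) / (1 - alpha). The timing model, parameter names, and numbers are hypothetical and are not taken from the paper, which learns its policies rather than using a closed form.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens per cycle with k drafted tokens and i.i.d. acceptance rate alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def throughput(alpha: float, k: int, t_draft: float, t_verify: float) -> float:
    """Tokens per unit time for one draft-and-verify cycle."""
    return expected_tokens(alpha, k) / (k * t_draft + t_verify)


def best_draft_length(alpha: float, t_draft: float, t_verify: float, max_k: int = 16) -> int:
    """Draft length in [0, max_k] that maximizes throughput under this toy model."""
    return max(range(max_k + 1), key=lambda k: throughput(alpha, k, t_draft, t_verify))


# Drafting as expensive as verifying: longer drafts raise acceptance length
# but lower throughput, so the best choice is not to draft at all.
k_expensive = best_draft_length(alpha=0.5, t_draft=1.0, t_verify=1.0)
# Cheap drafting with a high acceptance rate: drafting pays off.
k_cheap = best_draft_length(alpha=0.9, t_draft=0.05, t_verify=1.0)
```

Expected tokens per cycle always increases with `k`, yet the throughput-optimal `k` depends on the relative cost of the two phases, which is the quantity a throughput reward captures and an acceptance-length objective does not.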
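Foundation / Frontier Models (incl. LLMs)
Efficiency and Compression
#Diffusion LLM
🎯 Motivation: Autoregressive decoding is limited by sequential generation, which caps inference throughput. Diffusion LLMs allow parallel generation, but existing methods do not adapt to input characteristics and miss the best speed-quality balance.
❓ Problem: Replace the fixed heuristic rules used for parallel decoding in diffusion language models with a dynamic, learnable strategy.
🔍 Analysis: Fixed, input-agnostic heuristics fail to adapt across diverse NLP tasks, giving suboptimal trade-offs between decoding speed and quality.
🛠️ Method: Learn2PD trains a lightweight adaptive filter model that predicts, for each position, whether the current prediction already matches the final output; the EoTP mechanism detects the end of the sequence to avoid redundant decoding.
📊 Data & experiments: On the LLaDA benchmark, the method achieves up to 22.58× speedup without any performance drop, and up to 57.51× when combined with KV-Cache.
⭐ Contributions: The Learn2PD dynamic parallel decoding framework and the EoTP mechanism, which speed up diffusion LLM inference while preserving generation quality.
Abstract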
Autoregressive decoding in large language models (LLMs) requires $\mathcal{O}(n)$ sequential steps for $n$ tokens, fundamentally limiting inference throughput. Recent diffusion-based LLMs (dLLMs) enable parallel token generation through iterative denoising. However, current parallel decoding strategies rely on fixed, input-agnostic heuristics (e.g., confidence thresholds), which fail to adapt to input-specific characteristics, resulting in suboptimal speed-quality trade-offs across diverse NLP tasks. In this work, we explore a more flexible and dynamic approach to parallel decoding. We propose **Learning to Parallel Decode (Learn2PD)**, a framework that trains a lightweight and adaptive filter model to predict, for each token position, whether the current prediction matches the final output. This learned filter approximates an oracle parallel decoding strategy that unmasks tokens only when correctly predicted. Importantly, the filter model is learned in a post-training manner, requiring only a small amount of computation to optimize it (minute-level GPU time). Additionally, we introduce **End-of-Text Prediction (EoTP)** to detect decoding completion at the end of sequence, avoiding redundant decoding of padding tokens. Experiments on the LLaDA benchmark demonstrate that our method achieves up to **22.58×** speedup without any performance drop, and up to **57.51×** when combined with KV-Cache.
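The contrast between a fixed confidence threshold and the oracle strategy described here can be sketched for a single decoding step. The learned filter itself is not shown; the function names, the fallback rule in the threshold heuristic, and the toy tokens and confidences are assumptions made for illustration.

```python
from typing import List, Sequence


def threshold_unmask(
    confidences: Sequence[float], masked: Sequence[bool], tau: float
) -> List[int]:
    """Fixed heuristic: unmask masked positions whose confidence reaches tau.

    Falls back to the single most confident masked position so that every
    step makes progress (an assumption of this sketch).
    """
    candidates = [i for i, m in enumerate(masked) if m]
    if not candidates:
        return []
    chosen = [i for i in candidates if confidences[i] >= tau]
    return chosen or [max(candidates, key=lambda i: confidences[i])]


def oracle_unmask(
    predictions: Sequence[str], final_tokens: Sequence[str], masked: Sequence[bool]
) -> List[int]:
    """Oracle: unmask exactly the masked positions that are already predicted correctly."""
    return [
        i for i, m in enumerate(masked)
        if m and predictions[i] == final_tokens[i]
    ]


# Toy step: position 1 is confidently wrong, positions 2 and 3 are correct
# but below the threshold.
predictions = ["the", "cat", "sat", "mat"]
final_tokens = ["the", "dog", "sat", "mat"]
confidences = [0.95, 0.92, 0.40, 0.85]
masked = [True, True, True, True]

by_threshold = threshold_unmask(confidences, masked, tau=0.9)
by_oracle = oracle_unmask(predictions, final_tokens, masked)
```

The threshold commits a wrong token and delays two correct ones, while the oracle unmasks three correct tokens in one step. The oracle needs the final output, so it is unavailable at inference time; a learned filter is meant to approximate its decisions from features available during decoding.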
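Foundation / Frontier Models (incl. LLMs)
Efficiency and Compression
#log #KV cache #generation
🎯 Motivation: Humans learn from past experience and adapt to new tasks, but LLMs struggle to retain and reuse reasoning from earlier tasks at test time.
❓ Problem: Provide a framework that reuses past computation and reasoning so that models perform better on new tasks without sacrificing efficiency or scalability.
🔍 Analysis: Reflection-based memory mechanisms need extra extraction and distillation steps, while existing KV caching techniques target efficiency and do not improve reasoning accuracy.
🛠️ Method: Log-Augmented Generation (LAG) represents task logs as KV caches that store KV values for a selected subset of tokens, and retrieves KV values from relevant logs to augment generation on new tasks.
📊 Data & experiments: On knowledge- and reasoning-intensive datasets, the method significantly outperforms standard agentic systems without logs, as well as approaches based on reflection and existing KV cache techniques.
⭐ Contributions: A generation framework that directly reuses prior reasoning from logs, going beyond efficiency-oriented KV caching by explicitly improving accuracy.
Abstract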
While humans naturally learn and adapt from past experiences, large language models (LLMs) and their agentic counterparts often fail to retain reasoning from previous tasks and apply it to future contexts.
We introduce **L**og-**A**ugmented **G**eneration (LAG), a novel framework that *directly reuses* prior computation and reasoning from past logs at test time, enabling models to learn from previous tasks to perform better on new, unseen challenges, without sacrificing efficiency or scalability.
Our approach represents task logs as key-value (KV) caches that encode the reasoning context of prior tasks, while storing KV values for only a selected subset of tokens. When a new task arises, LAG retrieves KV values from relevant logs to augment generation.
Unlike reflection-based memory mechanisms, which require additional extraction or distillation steps, LAG reuses prior reasoning verbatim.
Moreover, it extends beyond existing KV caching techniques, primarily designed for efficiency, by explicitly improving accuracy through log reuse.
Experiments on knowledge- and reasoning-intensive datasets demonstrate that our method significantly outperforms standard agentic systems without log utilization, as well as existing approaches based on reflection and KV cache techniques.
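Foundation / Frontier Models (incl. LLMs)
Efficiency and Compression
#LLM Efficiency #Key-Value Cache Compression #Long-Context LLM #Inference Optimization
TL;DR: We propose a novel method that augments the LLM with parameter-efficient modules to perform fast and accurate KV cache eviction by predicting the attention pattern of the model's future response.
🎯 Motivation: In long-context tasks the KV cache grows linearly with input length and becomes an efficiency bottleneck. Existing methods evict low-importance KV entries, but eviction quality and cost can still be improved.
❓ Problem: Methods that improve eviction quality by predicting the future response are computationally expensive and impractical; an eviction mechanism that is both efficient and accurate is needed.
🔍 Analysis: Draft-based future-response prediction improves importance estimation but adds substantial prefilling overhead, which limits where it can be used.
🛠️ Method: LookaheadKV, a lightweight framework that predicts importance scores directly with parameter-efficient modules, avoiding explicit draft generation and keeping runtime overhead negligible.
📊 Data & experiments: Across long-context benchmarks and a range of models, LookaheadKV is more accurate than recent baselines and reduces eviction cost by up to 14.5×, leading to faster time-to-first-token.
⭐ Contributions: An efficient KV cache eviction strategy that needs no draft response, implemented as added lightweight modules, improving the efficiency and practicality of long-context inference.
Abstract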
Transformer-based large language models (LLMs) rely on key–value (KV) caching to avoid redundant computation during autoregressive inference. While this mechanism greatly improves efficiency, the cache size grows linearly with the input sequence length, quickly becoming a bottleneck for long‑context tasks. Existing solutions mitigate this problem by evicting prompt KV that are deemed unimportant, guided by estimated importance scores. Notably, a recent line of work proposes to improve eviction quality by “glimpsing into the future”, in which a draft generator produces a surrogate future response approximating the target model's true response, and this surrogate is subsequently used to estimate the importance of cached KV more accurately. However, these approaches rely on computationally expensive draft generation, which introduces substantial prefilling overhead and limits their practicality in real-world deployment. To address this challenge, we propose LookaheadKV, a lightweight eviction framework that leverages the strength of surrogate future response without requiring explicit draft generation. LookaheadKV augments transformer layers with parameter‑efficient modules trained to predict true importance scores with high accuracy. Our design ensures negligible runtime overhead comparable to existing inexpensive heuristics, while achieving accuracy superior to more costly approximation methods. Extensive experiments on long-context understanding benchmarks, across a wide range of models, demonstrate that our method not only outperforms recent competitive baselines in various long-context understanding tasks, but also reduces the eviction cost by up to $14.5$×, leading to significantly faster time-to-first-token. Our code is available at https://github.com/SamsungLabs/LookaheadKV.
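The eviction step that the abstract describes (keep the prompt KV entries with the highest importance scores, drop the rest) can be sketched as follows. This is a minimal illustration rather than the LookaheadKV implementation: `evict_kv`, the toy shapes, and the hand-written `scores` are assumptions, and in the paper the scores would come from the learned parameter-efficient modules.

```python
import numpy as np

def evict_kv(keys, values, scores, budget):
    """Keep the `budget` highest-scoring cached positions, in original order."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    budget = min(budget, scores.shape[0])
    # argpartition finds the top-`budget` indices without a full sort
    keep = np.sort(np.argpartition(scores, -budget)[-budget:])
    return keys[keep], values[keep], keep

rng = np.random.default_rng(0)
keys = rng.standard_normal((8, 4))    # (seq_len, head_dim), toy sizes
values = rng.standard_normal((8, 4))
# Stand-in importance scores (hypothetical values for illustration)
scores = np.array([0.9, 0.1, 0.05, 0.7, 0.02, 0.3, 0.01, 0.8])

k_kept, v_kept, kept_idx = evict_kv(keys, values, scores, budget=4)
print(kept_idx)
```

Sorting the kept indices preserves the original token order, which matters when positional information is tied to cache position.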
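Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Large Language Models #Long-Context Inference #Sparse Attention #Hybrid-Head Attention
🎯 Motivation: Long-context LLM inference is bottlenecked by the rapidly growing key-value cache during decoding, which imposes heavy memory and latency costs.
❓ Problem: Existing methods relieve memory pressure by sharing one set of crucial tokens across layers, but this coarse-grained sharing ignores the functional diversity of attention heads and hurts model performance.
🔍 Analysis: Coarse-grained token sharing weakens head specialization and degrades generation quality, so existing schemes struggle to deliver both efficiency and quality.
🛠️ Method: LycheeDecode, a decoding method built on fine-grained hybrid-head attention with a hardware-efficient top-$k$ selection strategy. A HardKuma-based mechanism partitions heads into a small set of retrieval heads that dynamically identify crucial tokens and a majority of sparse heads that reuse them.
📊 Data & experiments: Evaluated on models such as Llama3 and Qwen3, on long-context understanding benchmarks (LongBench, RULER) and complex reasoning tasks (AIME24, OlympiadBench). It reaches up to a 2.7× speedup at 128K context length while matching, and at times surpassing, the full-attention baseline in generation quality.
⭐ Contributions: By preserving the functional diversity of attention heads, LycheeDecode combines efficiency with generation quality and offers a validated path to efficient long-context inference; the authors state that code, kernels, and models will be made public.
Full abstract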
The proliferation of long-context large language models (LLMs) exposes a key bottleneck: the rapidly expanding key-value cache during decoding, which imposes heavy memory and latency costs. While recent approaches attempt to alleviate this by sharing a single set of crucial tokens across layers, such coarse-grained sharing undermines model performance by neglecting the functional diversity of attention heads. To address this, we propose LycheeDecode, an efficient decoding method centered on a fine-grained hybrid-head attention mechanism that employs a hardware-efficient top-$k$ selection strategy. Specifically, the novel HardKuma-based mechanism partitions attention heads into a small subset of retrieval heads that dynamically identify crucial tokens and a majority of sparse heads that reuse them for efficient computation. Through extensive experiments on leading models like Llama3 and Qwen3 across diverse benchmarks for long-context understanding (e.g., LongBench, RULER) and complex reasoning (e.g., AIME24, OlympiadBench), we demonstrate that LycheeDecode achieves generative quality comparable to, and at times even surpassing, the full-attention baseline. Crucially, this is accomplished with up to a 2.7× speedup at a 128K context length. By preserving the functional diversity of attention heads, our fine-grained strategy overcomes the performance bottlenecks of existing methods, providing a powerful and validated pathway to both efficient and high-quality long-context LLM inference. The implementation code, kernels, and models will be publicly available.
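A rough sketch of one hybrid-head decoding step may help make the retrieval/sparse split concrete. It is written under several assumptions that the abstract does not confirm: the head partition is given as a plain list (the paper learns it with a HardKuma-based mechanism), retrieval heads vote for tokens by taking the maximum score across heads, and selection happens within a single layer. All function names are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def hybrid_head_step(q, K, V, retrieval_heads, k):
    """One decoding step. q: (H, d); K, V: (H, T, d).

    Retrieval heads attend to all T tokens and pick the top-k crucial tokens;
    the remaining (sparse) heads attend only to that shared subset.
    """
    H, T, d = K.shape
    scores = np.einsum("hd,htd->ht", q, K) / np.sqrt(d)
    # Assumed aggregation: a token's vote is its best score over retrieval heads
    votes = scores[list(retrieval_heads)].max(axis=0)
    top = np.sort(np.argpartition(votes, -k)[-k:])
    retrieval = set(retrieval_heads)
    out = np.empty_like(q)
    for h in range(H):
        if h in retrieval:
            out[h] = softmax(scores[h]) @ V[h]            # full attention
        else:
            out[h] = softmax(scores[h, top]) @ V[h, top]  # reuse crucial tokens
    return out, top

rng = np.random.default_rng(0)
H, T, d = 4, 16, 8
q = rng.standard_normal((H, d))
K = rng.standard_normal((H, T, d))
V = rng.standard_normal((H, T, d))
out_sparse, top = hybrid_head_step(q, K, V, retrieval_heads=[0], k=4)
out_full, _ = hybrid_head_step(q, K, V, retrieval_heads=[0], k=T)
```

With `k` equal to the sequence length the sparse heads see every token, so the step reduces to ordinary full attention; that makes a convenient sanity check.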
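Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Distributed Training #Foundation Models #Large Language Models #Optimizers #Communication Efficiency #Federated Learning #Distributed Systems #Optimization Theory #Scaling
TL;DR: MT-DAO, a multi-timescale optimizer, closes the performance gap from infrequent communication in distributed training. It cuts wall-clock time by 6-27% and allows a 720M model to reach its target 35% faster with 5-24% fewer steps than standard DDP.
🎯 Motivation: Distributed data-parallel (DDP) training of large models requires frequent gradient communication, which can saturate bandwidth. Infrequent-communication strategies reduce this overhead but show a performance gap when applied to adaptive optimizers.
❓ Problem: Resolve the time-scale mismatch in optimizer momentum with an optimizer that operates on multiple time scales, balancing communication efficiency and training performance.
🔍 Analysis: Under infrequent synchronization, a conventional adaptive optimizer's momentum decays too quickly to smooth gradients over long intervals, so optimization becomes noise-dominated.
🛠️ Method: MT-DAO, a family of optimizers that uses multiple slow- and fast-moving first momenta, or the gradient, to track update dynamics across time scales, with the first convergence guarantees for this setting.
📊 Data & experiments: In language-model pre-training, MT-DAO reduces iso-token wall-clock time by 6-27% on Ethernet interconnects; at the 720M scale it reaches the target perplexity in 24% fewer steps and 35% less time than the single-momentum DDP baseline.
⭐ Contributions: A multi-timescale distributed adaptive optimizer that shortens training time, closes the performance gap of infrequent communication, and enables effective training across datacenters and wide geographic areas.
Full abstract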
Training large models with distributed data parallelism (DDP) requires frequent communication of gradients across workers, which can saturate bandwidth. Infrequent communication strategies (e.g., Local SGD) reduce this overhead but, when applied to adaptive optimizers, often suffer a performance gap relative to fully synchronous DDP.
We trace this gap to a time-scale mismatch: the optimizer's fast-moving momentum, tuned for frequent updates, decays too quickly to smooth gradients over long intervals, leading to noise-dominated optimization. To address this, we propose MT-DAO, a family of optimizers that employs multiple slow- and fast-moving first momenta or the gradient to track update dynamics across different time scales, for which we provide the first convergence guarantees.
Empirically, for language-model pre-training, this eliminates the performance gap with DDP, outperforming infrequent-communication baselines in perplexity and reducing iso-token wall-clock time by 6-27% on Ethernet interconnects. At the 720M scale, MT-DAO reaches a target perplexity in 24% fewer steps and 35% less time than the single-momentum DDP baseline. MT-DAO enables effective cross-datacenter training and training over wide geographic areas.
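The time-scale mismatch can be illustrated with a small calculation. An exponential moving average with decay `beta` weights a gradient seen `n` updates ago by `beta ** n` relative to the newest one. The sketch below compares a fast and a slow momentum over a hypothetical synchronization interval of 32 local steps; the interval and both decay values are illustrative assumptions, not settings from the paper.

```python
def ema_memory(beta, steps):
    """Relative weight an EMA with decay `beta` still assigns to a gradient
    observed `steps` updates ago."""
    return beta ** steps

sync_interval = 32  # hypothetical number of local steps between synchronizations
fast = ema_memory(0.9, sync_interval)
slow = ema_memory(0.999, sync_interval)
print(f"fast momentum (beta=0.9):   {fast:.3f}")
print(f"slow momentum (beta=0.999): {slow:.3f}")
```

The fast momentum has almost forgotten information from the previous synchronization by the time the next one arrives, while the slow one retains most of it, which is the intuition behind tracking several time scales at once.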
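Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#FP4 #Full Quantization Training #LLM
🎯 Motivation: In low-bit LLM training, anisotropy in the singular value spectra of parameters, activations, and gradients causes quantization bias and spectral distortion that degrade training performance.
❓ Problem: Provide a spectral-domain quantization framework that addresses the performance loss caused by anisotropic singular value spectra in low-bit training.
🔍 Analysis: A small fraction of large singular values dominates the spectrum, producing wide numerical ranges that lead to quantization error and lower training performance.
🛠️ Method: Partition the anisotropic spectrum into narrower sub-distributions that are quantized independently. To cut decomposition cost, exploit the fact that the dominant spectral subspace is preserved under sparse random sampling and under random projection.
📊 Data & experiments: On LLaMA-3 8B trained with 100B tokens, FP4 quantization of weights, activations, and gradients (W4A4G4) yields only a 0.4% training loss gap and a 0.1% drop in downstream accuracy relative to BF16, and surpasses Nvidia's FP4 recipe.
⭐ Contributions: Shows that Metis substantially improves low-bit training quality at negligible decomposition overhead, and releases an open-source implementation.
Full abstract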
This work identifies anisotropy in the singular value spectra of parameters, activations, and gradients as the fundamental barrier to low-bit training of large language models (LLMs). These spectra are dominated by a small fraction of large singular values, inducing wide numerical ranges that cause quantization bias and severe spectral distortion, ultimately degrading training performance. This work presents Metis, a spectral-domain quantization framework that partitions anisotropic spectra into narrower sub-distributions for independent quantization, thereby reducing errors and preserving spectral structure. To minimize overhead, Metis leverages two key properties of the dominant spectral subspace: preservation via sparse random sampling and preservation via random projection, reducing decomposition cost to a negligible level. On LLaMA-3 8B trained with 100B tokens, Metis enables robust W4A4G4 training with FP4 quantization of weights, activations, and gradients, yielding only a 0.4% training loss gap and a 0.1% degradation in downstream accuracy relative to BF16. Beyond matching BF16 fidelity, Metis also surpasses Nvidia’s FP4 recipe, consistently achieving lower loss and higher downstream accuracy while incurring significantly lower computational overhead.
The code implementation for Metis is available at: https://github.com/sii-research/Metis.
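The idea of quantizing parts of the spectrum separately can be sketched on a toy matrix. The example below is an assumption-heavy illustration and not Metis itself: it uses a full SVD (the paper avoids this cost through random sampling and random projection), a single dominant/tail split, and a simple symmetric 4-bit integer quantizer in place of FP4.

```python
import numpy as np

def quantize_symmetric(x, bits=4):
    """Per-tensor symmetric uniform quantizer (a stand-in for FP4)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    if scale == 0:
        return x.copy()
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def split_spectrum(W, r):
    """Split W into its top-r singular component and the remaining tail."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    dominant = (U[:, :r] * s[:r]) @ Vt[:r]
    return dominant, W - dominant

rng = np.random.default_rng(0)
u, v = rng.standard_normal(64), rng.standard_normal(64)
# Toy anisotropic matrix: one large singular direction plus small noise
W = 0.1 * rng.standard_normal((64, 64)) + np.outer(u, v)

dominant, tail = split_spectrum(W, r=1)
W_split = quantize_symmetric(dominant) + quantize_symmetric(tail)
W_direct = quantize_symmetric(W)
print("direct quantization error:", np.linalg.norm(W - W_direct))
print("split quantization error: ", np.linalg.norm(W - W_split))
```

Each part gets its own scale, so the small-magnitude tail is no longer forced onto a grid sized for the dominant component. How much this helps depends on the matrix, so the errors are printed rather than asserted.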
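Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#PEFT #LLM #LoRA
🎯 Motivation: LoRA is widely used for parameter-efficient fine-tuning, but its slow convergence limits both performance and resource efficiency.
❓ Problem: Existing variants struggle to improve performance, memory footprint, and computational efficiency at the same time.
🔍 Analysis: The paper revisits the causes of LoRA's slow convergence and analyzes PEFT methods with respect to memory usage, initialization time, and computational efficiency.
🛠️ Method: Matrix Shard Sharing (MiSS) shards the original weight matrix and updates the shards by sharing a single trainable matrix $\boldsymbol{D}$ initialized to zero; MiSS$^e$ extends it for computational efficiency, low memory footprint, and scalable serving.
📊 Data & experiments: Theoretical analysis and experiments indicate reduced optimization complexity with strong performance; mapping the Pareto frontier shows a favorable balance across performance, memory, and efficiency.
⭐ Contributions: Introduces MiSS and MiSS$^e$ to address LoRA's slow convergence, together with a comprehensive analysis of PEFT methods across memory, initialization time, and compute.
Full abstract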
Low-Rank Adaptation (LoRA) is a widely adopted technique for parameter-efficient fine-tuning, but its slow convergence has spurred the development of numerous variants. Nevertheless, current approaches struggle to achieve simultaneous improvements in performance, memory footprint, and computational efficiency. To address this challenge, we revisit the causes of LoRA’s slow convergence and, based on these insights, propose **M**atr**i**x **S**hard **S**haring (MiSS) that shards the original weight matrix and updates by sharing a single trainable matrix $\boldsymbol{D}$ initialized to zero. To simultaneously ensure computational efficiency, low memory footprint, and scalable serving, we introduce MiSS$^e$. Through theoretical analyses and empirical results, our method reduces optimization complexity while maintaining strong performance, striking a favorable balance between performance, memory, and efficiency. Furthermore, we provide a comprehensive analysis of different PEFT methods with respect to memory usage, initialization time, and computational efficiency. By mapping the Pareto frontier, we show that MiSS achieves a favorable balance across these dimensions, integrating the strengths of prior approaches.
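Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#large language models #model compression #structured pruning
🎯 Motivation: Mixture-of-Experts (MoE) has become a predominant paradigm for scaling LLMs, but its large memory requirements make deployment difficult.
❓ Problem: Existing MoE compression methods often incur considerable accuracy drops; this work aims for compression with minimal accuracy loss.
🔍 Analysis: Existing methods lose 7-14% relative accuracy even at modest compression rates.
🛠️ Method: MoBE decomposes each expert's up/gate matrix as W = AB, where A is unique to the expert and B is reparameterized as a linear combination of basis matrices shared across all experts in the same MoE layer. The factorization is learned by minimizing reconstruction error against the original weights.
📊 Data & experiments: On models including Qwen3-235B-A22B-2507, DeepSeek-V3-0324, and Kimi-K2-Instruct, MoBE reduces parameter counts by 24%-30% with only a 1%-2% accuracy drop (about 2% relative).
⭐ Contributions: Proposes Mixture-of-Basis-Experts (MoBE), which achieves substantial parameter reduction with notably lower accuracy loss than prior methods.
Full abstract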
The Mixture-of-Experts (MoE) architecture has become a predominant paradigm for scaling large language models (LLMs). Despite offering strong performance and computational efficiency, large MoE-based LLMs like DeepSeek-V3-0324 and Kimi-K2-Instruct present serious challenges due to substantial memory requirements in deployment. While recent works have explored MoE compression to address this issue, existing methods often suffer from considerable accuracy drops (e.g., 7-14% relatively) even at modest compression rates. This paper introduces a novel Mixture-of-Basis-Experts (MoBE) method that achieves model compression while incurring minimal accuracy drops. Specifically, each up/gate matrix in an expert is decomposed via a rank decomposition as W = AB, where matrix A is unique to each expert. The relatively larger matrix B is further reparameterized as a linear combination of basis matrices {B_i} shared across all experts within a given MoE layer. The factorization is learned by minimizing the reconstruction error relative to the original weight matrices. Experiments demonstrate that MoBE achieves notably lower accuracy drops compared to prior works. For instance, MoBE can reduce the parameter counts of Qwen3-235B-A22B-2507, DeepSeek-V3-0324 (671B) and Kimi-K2-Instruct (1T) by 24%-30% with only 1%-2% accuracy drop (about 2% drops when measured relatively).
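The factorization W = AB with B built from shared bases can be written down directly, along with the parameter count it implies. The sketch below follows only what the abstract states (a per-expert A and a linear combination of shared basis matrices); the function names, the per-expert coefficient vector `alpha`, and the toy dimensions are assumptions for illustration.

```python
import numpy as np

def mobe_reconstruct(A, alpha, bases):
    """W_hat = A @ B, with B = sum_j alpha[j] * bases[j]."""
    B = np.tensordot(alpha, bases, axes=1)  # (r, d_in)
    return A @ B

def param_counts(n_experts, d_out, d_in, r, n_bases):
    """Parameters for one weight type in one MoE layer: dense vs. factorized."""
    dense = n_experts * d_out * d_in
    factorized = (n_experts * d_out * r      # per-expert A
                  + n_bases * r * d_in       # shared basis matrices
                  + n_experts * n_bases)     # per-expert mixing coefficients
    return dense, factorized

rng = np.random.default_rng(0)
d_out, d_in, r, n_bases = 4, 16, 4, 2        # toy sizes (hypothetical)
A = rng.standard_normal((d_out, r))
alpha = rng.standard_normal(n_bases)
bases = rng.standard_normal((n_bases, r, d_in))
W_hat = mobe_reconstruct(A, alpha, bases)

dense, factorized = param_counts(n_experts=8, d_out=d_out, d_in=d_in, r=r, n_bases=n_bases)
print(W_hat.shape, dense, factorized)
```

The saving comes from the basis term, which is paid once per layer rather than once per expert, so it grows with the number of experts sharing the bases.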
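Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Question Answering #(Large) Language Models
🎯 Motivation: Question answering systems must trade reasoning quality against efficiency on complex questions, but lack a flexible way to adjust response length.
❓ Problem: Provide a QA method that adapts response length to question difficulty, improving response efficiency and lowering inference cost.
🔍 Analysis: The experiments show an emergent behavior termed "intelligent brevity": the model gives shorter responses to simpler queries and longer ones to more complex inputs.
🛠️ Method: Mixture-of-Length (MoL), which combines an information-theoretic difficulty assessment with a dual-objective reward mechanism that adaptively modulates response length.
📊 Data & experiments: On multiple QA benchmarks, MoL achieves competitive accuracy while substantially reducing the number of generated tokens compared to baselines.
⭐ Contributions: Shows that difficulty-aware length modulation is a promising direction for efficient QA with context, with a behavior that suits human-computer interaction.
Full abstract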
We present Mixture-of-Length (MoL), an approach for Question Answering (QA) with context that aims to improve the balance between reasoning quality and response efficiency. Our method introduces a principled difficulty assessment based on information-theoretic principles and a dual-objective reward mechanism that adaptively modulates response length. In our experiments, MoL exhibits an emergent behavior termed "intelligent brevity": the model tends to produce shorter responses for simpler queries and longer ones for more complex inputs. This property is desirable for human-computer interaction and can reduce inference costs. A post-hoc analysis of internal activations suggests a correlation between this output adaptivity and the effective number of layers that contribute during inference. On multiple QA benchmarks, MoL demonstrates competitive accuracy while substantially reducing tokens compared to baselines, indicating that difficulty-aware length modulation is a promising direction for efficient QA with context.
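Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Model Compression #Mixture-of-Experts #Structured Pruning #Expert Pruning
🎯 Motivation: Mixture-of-Experts (MoE) models scale efficiently by activating only a subset of experts per token, but all experts must stay in memory, so structured pruning is needed to cut memory cost.
❓ Problem: Existing structured pruning methods show suboptimal performance and unstable degradation across three dimensions: model architectures, calibration data sources, and calibration sample sizes.
🔍 Analysis: Expert redundancy can be quantified by access frequency and output variance; experts with low usage and stable outputs contribute little to overall performance.
🛠️ Method: MoNE scores expert redundancy by access frequency and output variance and replaces redundant experts with lightweight novices, which are unbiased estimates of the experts' original outputs.
📊 Data & experiments: Across nine downstream tasks at a 25% pruning ratio, average zero-shot accuracy exceeds baselines by up to 2.72, with only a 0.14 drop for Qwen2-57B-A14B.
⭐ Contributions: A more effective and robust expert pruning method that improves zero-shot performance after pruning while reducing the memory overhead of large MoE models.
Full abstract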
Mixture-of-Experts (MoE) enables efficient scaling of large language models by activating only a subset of experts per input token.
However, deploying MoE-based models incurs significant memory overhead due to the need to retain all experts in memory.
While structured pruning is promising to reduce memory costs, existing methods often show suboptimal performance and unstable degradation in three dimensions: model architectures, calibration data sources, and calibration sample sizes.
This paper proposes **M**ixture-**o**f-**N**ovices-and-**E**xperts (**MoNE**), a novel expert pruning method that replaces redundant experts with lightweight novices to achieve effective and robust model compression.
MoNE evaluates expert redundancy based on two metrics: access frequency and output variance.
Experts exhibiting low usage and stable outputs are pruned and replaced with lightweight novices—unbiased estimations of their original outputs—minimizing performance degradation.
Extensive experiments demonstrate that MoNE consistently outperforms baseline methods with minimal accuracy degradation across the three dimensions, confirming its effectiveness and robustness.
Notably, it outperforms baselines by up to 2.72 in average zero-shot accuracy across nine downstream tasks under a 25% pruning ratio, with only a 0.14 performance drop for Qwen2-57B-A14B.
The code is available at https://github.com/zxgx/mode-pd.
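The two redundancy signals and the novice replacement can be sketched on calibration data. The abstract does not say how access frequency and output variance are combined, so the product used below is an assumption, as is taking the novice to be the expert's mean calibration output (one natural unbiased estimate of its expected output). Names and toy data are hypothetical.

```python
import numpy as np

def expert_stats(outputs_per_expert, total_tokens):
    """For each expert, return (access frequency, mean output variance, mean output).

    outputs_per_expert maps expert id -> array (n_routed_tokens, d) of the
    outputs that expert produced on the calibration set.
    """
    stats = {}
    for e, outs in outputs_per_expert.items():
        freq = outs.shape[0] / total_tokens
        var = outs.var(axis=0).mean()
        stats[e] = (freq, var, outs.mean(axis=0))
    return stats

def pick_redundant(stats, n_prune):
    # Assumed scoring rule: low frequency and low variance -> more redundant
    ranked = sorted(stats, key=lambda e: stats[e][0] * stats[e][1])
    return ranked[:n_prune]

rng = np.random.default_rng(0)
outputs = {
    0: rng.standard_normal((90, 4)),   # heavily used, varied outputs
    1: np.full((10, 4), 0.5),          # rarely used, constant outputs
}
stats = expert_stats(outputs, total_tokens=100)
pruned = pick_redundant(stats, n_prune=1)
novice = stats[pruned[0]][2]           # constant vector that replaces the expert
print(pruned, novice)
```

At inference the pruned expert's computation would be replaced by returning `novice`, so only a single vector per pruned expert needs to be stored.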
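Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Parameter-efficient fine-tuning #large language model #low-rank adaptation
🎯 Motivation: Low-rank adaptation is limited in what it can express under a fixed parameter budget, which motivates a more efficient parameter-sharing mechanism for fine-tuning LLMs.
❓ Problem: Provide a fine-tuning method that needs no architectural changes and produces more expressive updates at the same parameter budget, easing the expressivity–efficiency trade-off of existing methods.
🔍 Analysis: Analyses and ablations indicate that non-local parameter sharing acts as an effective regularizer, and that grouping design and budget allocation govern the expressivity–efficiency trade-off.
🛠️ Method: Each adapted weight matrix is built by broadcasting a small set of learned scalars over a fixed tessellation (a pre-defined group assignment of weight entries), which gives randomized, fine-grained sharing of weight updates.
📊 Data & experiments: Across diverse language understanding and generation tasks, MoSA matches or surpasses strong PEFT baselines under strictly matched budgets.
⭐ Contributions: Introduces MoSA as a simple, scalable alternative to LoRA that can be merged into the base model for zero-overhead inference.
Full abstract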
We introduce MoSA, a new parameter-efficient fine-tuning (PEFT) method that replaces low-rank factorization with randomized, fine-grained sharing of weight updates. Each adapted weight matrix is constructed by broadcasting a small set of learned scalars over a fixed tessellation, a pre-defined group assignment of weight entries of the weight matrix, producing expressive changes under the same parameter budget as low-rank adaptation (LoRA). MoSA requires no architectural changes and can be merged into the base model for zero-overhead inference. Across diverse language understanding and generation tasks, MoSA matches or surpasses strong PEFT baselines under strictly matched budgets. Analyses and ablations indicate that non-local parameter sharing acts as an effective regularizer, and that grouping design and budget allocation govern the expressivity–efficiency trade-off. These results position MoSA as a simple, scalable alternative to LoRA. Our code is available at https://github.com/XiequnWang/MoSA-ICLR26.
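Broadcasting learned scalars over a fixed group assignment is a one-line indexing operation, sketched below. The uniform random assignment, the zero initialization of the scalars, and the choice to match a rank-4 LoRA budget are assumptions made for illustration; the abstract specifies only that the tessellation is fixed and that the budget matches LoRA.

```python
import numpy as np

def make_tessellation(shape, n_groups, seed=0):
    """Fixed, randomly drawn group id for every weight entry (assumed scheme)."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_groups, size=shape)

def mosa_delta(scalars, assignment):
    """Weight update: every entry takes the scalar of the group it belongs to."""
    return scalars[assignment]

d_out, d_in, lora_rank = 64, 64, 4
n_groups = lora_rank * (d_out + d_in)   # same trainable-parameter count as rank-4 LoRA
assignment = make_tessellation((d_out, d_in), n_groups)

scalars = np.zeros(n_groups)            # assumed zero init: update starts at zero
scalars[3] = 1.5                        # pretend training changed one scalar
delta = mosa_delta(scalars, assignment)

W = np.random.default_rng(1).standard_normal((d_out, d_in))
W_merged = W + delta                    # merged once, so inference has no extra cost
print(n_groups, int((delta != 0).sum()))
```

Because the entries of one group are scattered across the matrix, a single scalar can move weights in many rows and columns at once, unlike a low-rank update.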
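Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#On-device LLM
🎯 Motivation: Two assumptions about reasoning in LLMs are widely held: that it requires sufficiently large models, and that it requires training on massive datasets. The authors question the second and study what can be achieved with far less data.
❓ Problem: Revisit whether reasoning emergence really requires corpora above 10T tokens, and explore small models that reason well after training on much less data.
🔍 Analysis: By carefully curating and resampling high-quality open-source datasets, about 2T tokens turn out to be sufficient for strong reasoning abilities to emerge.
🛠️ Method: Design data-quality metrics, select and resample beneficial open-source datasets, pre-train on the resampled data, and apply an established post-training procedure to produce MobileLLM-R1, a series of sub-billion-parameter reasoning models.
📊 Data & experiments: Pre-training uses 4.2T tokens resampled from about 2T tokens of curated data. MobileLLM-R1 substantially outperforms prior models trained on fully open-source data, including larger ones, and matches or surpasses Qwen3-0.6B on multiple reasoning benchmarks.
⭐ Contributions: A training recipe for small reasoning models that does not depend on massive data, the MobileLLM-R1 model series, and a public release of models, code, the complete training recipe, data sources, and data mixing ratios.
Full abstract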
The paradigm shift in large language models (LLMs) from instinctive responses to chain-of-thought (CoT) reasoning has fueled two prevailing assumptions: (1) reasoning capabilities only emerge in sufficiently large models, and (2) such capabilities require training on massive datasets. While the first assumption has already been challenged by recent sub-billion-parameter reasoning models such as Qwen3-0.6B and DeepSeek distilled variants, the second remains largely unquestioned. In this work, we revisit the necessity of scaling to extremely large corpora (>10T tokens) for reasoning emergence. By carefully curating and resampling open-source datasets that we identify as beneficial under our designed metrics, we demonstrate that strong reasoning abilities can emerge with far less data. Specifically, we show that only ~2T tokens of high-quality data are sufficient, and pre-training with 4.2T tokens on the dataset resampled from these ~2T tokens, followed by an established post-training procedure, enables the development of MobileLLM-R1, a series of sub-billion-parameter reasoning models that substantially outperform prior models trained on fully open-sourced data. For example, MobileLLM-R1-950M achieves an AIME score of 15.5, compared to just 0.6 for OLMo-2-1.48B and 0.3 for SmolLM-2-1.7B. Remarkably, despite being trained on only 11.7% of the tokens compared to Qwen3’s proprietary 36T-token corpus for pretraining, MobileLLM-R1-950M matches or surpasses Qwen3-0.6B across multiple reasoning benchmarks. To facilitate further research in this direction, we have made the models (https://huggingface.co/collections/facebook/mobilellm-r1) and code (https://github.com/facebookresearch/MobileLLM-R1) publicly available, along with the complete training recipe, data sources, and data mixing ratios.
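Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#ML System #Efficient Decoding
🎯 Motivation: Long-context inference is bottlenecked by KV cache loading during decoding: every generation step transfers the KV cache from off-chip high-bandwidth memory (HBM) to on-chip SRAM.
❓ Problem: Multi-Head Latent Attention (MLA) has a single latent head that cannot be partitioned, which creates a sharding bottleneck in distributed decoding with tensor parallelism (TP).
🔍 Analysis: Under multi-device TP decoding with MLA, every device redundantly loads the complete KV cache, consuming memory bandwidth and diminishing the benefits of TP.
🛠️ Method: Multi-Head Low-Rank Attention (MLRA), which makes latent states partitionable and supports efficient 4-way TP decoding.
📊 Data & experiments: Extensive experiments show state-of-the-art perplexity and downstream task performance, together with a 2.8× decoding speedup over MLA.
⭐ Contributions: An attention design that removes a key bottleneck in distributed decoding, with code, pretrained weights, and training and evaluation data released for reproducibility.
Full abstract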
Long-context inference in large language models is bottlenecked by Key--Value (KV) cache loading during the decoding stage, where the sequential nature of generation requires repeatedly transferring the KV cache from off-chip High-Bandwidth Memory (HBM) to on-chip Static Random-Access Memory (SRAM) at each step. While Multi-Head Latent Attention (MLA) significantly reduces the total KV cache size, it suffers from a sharding bottleneck during distributed decoding via Tensor Parallelism (TP). Since its single latent head cannot be partitioned, each device is forced to redundantly load the complete KV cache for every token, consuming excessive memory traffic and diminishing TP benefits like weight sharding. In this work, we propose Multi-Head Low-Rank Attention (MLRA), which enables partitionable latent states for efficient 4-way TP decoding. Extensive experiments show that MLRA achieves state-of-the-art perplexity and downstream task performance, while also delivering a 2.8$\times$ decoding speedup over MLA. Code is available at https://github.com/SongtaoLiu0823/MLRA. Pretrained weights, along with the training and evaluation data, are available at https://huggingface.co/Soughing/MLRA.
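Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Discrete Diffusion Sampling; Neural Indicator
TL;DR: We propose a general framework for sampling order optimization of discrete diffusion models by using a neural indicator.
🎯 Motivation: Discrete diffusion language models offer parallel decoding that autoregressive models lack, but existing heuristic sampling strategies are inefficient because they sample only a small part of the tokens at each step.
❓ Problem: Optimize the token sampling order of discrete diffusion models to greatly reduce the number of sampling iterations while preserving generation quality.
🔍 Analysis: Fully leveraging the correct predictions available at each step can reduce the number of sampling iterations by an order of magnitude without compromising accuracy.
🛠️ Method: Neural Indicator Sampling (NI Sampling) uses a neural indicator to decide which tokens to sample at each step, and trains the indicator with a trajectory-preserving objective.
📊 Data & experiments: On LLaDA and Dream models across multiple benchmarks, the method achieves up to 14.3× acceleration over full-step sampling with a negligible performance drop.
⭐ Contributions: A general framework for sampling order optimization that consistently outperforms confidence threshold sampling in the accuracy–step trade-off.
Full abstract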
Discrete diffusion language models (dLLMs) have recently emerged as a promising alternative to traditional autoregressive approaches, offering the flexibility to generate tokens in arbitrary orders and the potential of parallel decoding. However, existing heuristic sampling strategies remain inefficient: they choose only a small part of tokens to sample at each step, leaving substantial room for improvement. In this work, we study the problem of token sampling order optimization and demonstrate its significant potential for acceleration. Specifically, we find that fully leveraging correct predictions at each step can reduce the number of sampling iterations by an order of magnitude without compromising accuracy. Based on this, we propose Neural Indicator Sampling (NI Sampling), a general sampling order optimization framework that utilizes a neural indicator to decide which tokens should be sampled at each step. We further propose a novel trajectory-preserving objective to train the indicator. Experiments on LLaDA and Dream models across multiple benchmarks show that our method achieves up to 14.3$\times$ acceleration over full-step sampling with negligible performance drop, and consistently outperforms confidence threshold sampling in the accuracy–step trade-off.
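Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#outliers #Quantization
TL;DR: Flexible arbitrary bit-width non-uniform quantization with multi-level outlier compensation for efficient LLM compression.
🎯 Motivation: As LLMs scale, compressing them efficiently while preserving performance has become a critical challenge. Existing non-uniform quantization methods rely on fixed codebooks and costly optimization, which limits adaptability and efficiency.
❓ Problem: Outlier compensation methods designed for uniform quantization are ill-suited to the anomalous distributions that arise in this non-uniform setting, so the work introduces layer-specific quantization strategies with multi-level outlier compensation.
🔍 Analysis: A new outlier evaluation metric is needed, one that integrates weight perturbation, activation distribution, and perturbation propagation.
🛠️ Method: NuBitQ, a non-uniform quantization framework supporting arbitrary bit-widths, plus an Outlier Compensation Plugin (OCP) that applies multi-level, fine-grained compensation without direct complex Hessian computation or fine-tuning.
📊 Data & experiments: Experiments on multiple tasks and across several model series demonstrate the effectiveness of the approach.
⭐ Contributions: A flexible layer-specific non-uniform quantization scheme, an integrated outlier evaluation metric with its compensation plugin, and lower computational complexity than traditional methods with better adaptability and scalability.
Full abstract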
With the rapid scaling of large language models, achieving efficient compression while maintaining model performance has become a critical challenge. To address the limitations of existing non-uniform quantization methods, which typically rely on fixed codebooks and require costly optimization, we propose a novel arbitrary bit-width non-uniform Quantization (NuBitQ). The framework enables flexible, layer-specific quantization strategies, significantly enhancing adaptability and efficiency. Notably, traditional outlier compensation methods used in uniform quantization are ill-suited for the anomalous distribution characteristics encountered in our context. To address this, we design a novel outlier evaluation metric that integrates weight perturbation, activation distribution, and perturbation propagation. Based on this metric, we further develop an Outlier Compensation Plugin (OCP) that implements multi-level, fine-grained outlier compensation strategies, effectively mitigating performance degradation caused by outliers. Our approach avoids direct complex Hessian computation and fine-tuning, offering strong applicability and scalability. Extensive experiments on multiple tasks and across various model series demonstrate the effectiveness of the proposed approach.
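Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Nonparametric Teaching #Functional Gradient Descent #Attention Learners #Data Efficiency
🎯 Motivation: Attention learners excel at capturing the implicit relationships between sequences and their properties, but their training is costly and needs to become more efficient.
❓ Problem: Propose Attention Neural Teaching (AtteNT), a nonparametric teaching paradigm that accelerates the training of attention learners through example selection.
🔍 Analysis: Theory shows that the evolution of an attention learner under parameter-based gradient descent is consistent with functional gradient descent in nonparametric teaching, which clarifies the role attention plays during training.
🛠️ Method: In the AtteNT framework, a teacher selects a subset from a dense set of sequence-property pairs to accelerate convergence.
📊 Data & experiments: Experiments cover LLMs and vision Transformers (ViTs) in both fine-tuning and training-from-scratch regimes, with training-time reductions of 13.01% for LLMs and 20.58% for ViTs while performance is preserved or improved.
⭐ Contributions: A new nonparametric teaching method (AtteNT) that improves the data efficiency and training speed of attention learners without sacrificing performance.
Full abstract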
Attention learners, neural networks built on the attention mechanism, e.g., transformers, excel at learning the implicit relationships that relate sequences to their corresponding properties, e.g., mapping a given sequence of tokens to the probability of the next token. However, the learning process tends to be costly. To address this, we present a novel paradigm named **Atte**ntion **N**eural **T**eaching (AtteNT) that reinterprets the learning process through a nonparametric teaching perspective. Specifically, the latter provides a theoretical framework for teaching mappings that are implicitly defined (i.e., nonparametric) via example selection. Such an implicit mapping is embodied through a dense set of sequence-property pairs, with the AtteNT teacher selecting a subset to accelerate convergence in attention learner training. By analytically investigating the role of attention on parameter-based gradient descent during training, and recasting the evolution of attention learners, shaped by parameter updates, through functional gradient descent in nonparametric teaching, we show *for the first time* that teaching attention learners is consistent with teaching importance-adaptive nonparametric learners. These new findings readily commit AtteNT to enhancing learning efficiency of attention learners. Specifically, we observe training time reductions of 13.01% for LLMs and 20.58% for ViTs, spanning both fine-tuning and training-from-scratch regimes. Crucially, these gains are achieved without compromising accuracy; in fact, performance is consistently preserved and often enhanced across a diverse set of downstream tasks.
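Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#large language models #reasoning #efficiency #model compression
🎯 Motivation: 4-bit quantization is a memory-optimal choice for non-reasoning models and zero-shot tasks, but in reasoning models the KV cache rather than the weights can dominate memory, so that prescription breaks down.
❓ Problem: Study how a fixed memory budget should be split between model weights and generation length at different model scales, in order to derive better memory optimization strategies.
🔍 Analysis: Models with an effective size below 8-bit 4B parameters achieve better accuracy by allocating memory to larger weights rather than longer generation, while larger models benefit from the opposite. The same threshold determines when parallel scaling becomes memory-efficient and whether KV cache eviction outperforms KV quantization.
🛠️ Method: Systematically compare reasoning models of different scales on mathematical, code generation, and knowledge-intensive tasks, analyze when KV cache quantization or eviction applies, and derive scale-based guidelines.
📊 Data & experiments: Experiments use reasoning tasks from several domains (mathematical reasoning, code generation, knowledge-intensive reasoning) and vary quantization scheme and model scale.
⭐ Contributions: Shows that memory optimization for reasoning models cannot be scale-agnostic: small models should prioritize model capacity over test-time compute, and large models should maximize test-time compute.
Full abstract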
While 4-bit quantization has emerged as a memory-optimal choice for non-reasoning models and zero-shot tasks across scales, we show that this universal prescription fails for reasoning models, where KV cache rather than model size can dominate memory.
Through systematic experiments on mathematical, code generation, and knowledge-intensive reasoning tasks, we find a scale-dependent trade-off: models with an effective size below 8-bit 4B parameters achieve better accuracy by allocating memory to larger weights, rather than longer generation, while larger models benefit from the opposite strategy.
This scale threshold also determines when parallel scaling becomes memory-efficient and whether KV cache eviction outperforms KV quantization.
Our findings show that memory optimization for LLMs cannot be scale-agnostic, while providing principled guidelines: for small reasoning models, prioritize model capacity over test-time compute, while for large ones, maximize test-time compute.
Our results suggest that optimizing reasoning models for deployment requires fundamentally different strategies than those established for non-reasoning ones.
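Why the KV cache can outweigh the weights is easy to check with back-of-the-envelope arithmetic. The sketch below uses the standard size formulas for a decoder with grouped-query attention; the specific configuration (4B parameters, 36 layers, 8 KV heads, head dimension 128, a 32K-token generation) is a hypothetical example and not a model from the paper.

```python
def weight_bytes(n_params, bits):
    """Memory for model weights at a given bit-width."""
    return n_params * bits / 8

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bits, batch=1):
    """Memory for keys and values (factor 2) across all layers and tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * batch * bits / 8

# Hypothetical 4B-parameter model with 4-bit weights and a 16-bit KV cache
w = weight_bytes(4e9, bits=4)
kv = kv_cache_bytes(n_layers=36, n_kv_heads=8, head_dim=128,
                    n_tokens=32768, bits=16)
print(f"weights:  {w / 1e9:.2f} GB")
print(f"KV cache: {kv / 1e9:.2f} GB")
```

In this configuration the cache for a single long generation is more than twice the size of the quantized weights, which is the regime where trading weight precision against generation length becomes a real decision.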
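Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Configuration-aware optimization #Pareto-base configuration search #Quantization #Fine-tuning
🎯 Motivation: Large pre-trained models must be compressed effectively before they can be deployed on edge devices, without the accuracy loss that quantization brings.
❓ Problem: Edge devices have heterogeneous capabilities, and fine-tuning separately for every quantization configuration is computationally prohibitive; the goal is to avoid that repeated fine-tuning.
🔍 Analysis: Adapting a LoRA adapter directly to arbitrary quantization configurations is difficult, and the quality of the training configuration set strongly affects accuracy.
🛠️ Method: CoA-LoRA uses a configuration-aware model that maps each quantization configuration to its low-rank adjustments, and a Pareto-based configuration search that iteratively improves the training configuration set.
📊 Data & experiments: Across many quantization configurations, CoA-LoRA achieves comparable or even superior performance to methods that fine-tune a separate adapter per configuration, at no additional fine-tuning time.
⭐ Contributions: A more efficient way to adapt to quantization configurations, reducing both the compute needed for edge deployment and the accuracy lost to quantization.
Full abstract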
As increasingly large pre-trained models are released, deploying them on edge devices for privacy-preserving applications requires effective compression. Recent works combine quantization with the fine-tuning of high-precision LoRA adapters, which can substantially reduce model size while mitigating the accuracy loss from quantization. However, edge devices have inherently heterogeneous capabilities, while performing configuration-wise fine-tuning for every quantization setting is computationally prohibitive. In this paper, we propose CoA-LoRA, a method that dynamically adjusts the LoRA adapter to arbitrary quantization configurations (i.e., the per-layer bit-width choices of a pre-trained model) without requiring repeated fine-tuning. This is accomplished via a configuration-aware model that maps each configuration to its low-rank adjustments. The effectiveness of this model critically depends on the training configuration set, a collection of configurations chosen to cover different total bit-width budgets. However, constructing a high-quality configuration set is non-trivial. We therefore design a Pareto-based configuration search that iteratively optimizes the training configuration set, yielding more precise low-rank adjustments. Our experiments demonstrate that, unlike the state-of-the-art methods that require fine-tuning a separate LoRA adapter for each configuration, CoA-LoRA incurs no additional time cost while achieving comparable or even superior performance to those methods.
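Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Quantization #Pruning #LLMs
TL;DR: This paper introduces a compensation-based framework for joint quantization and sparsity, and is the first to enable W4A4KV4 quantized + 50% sparse LLMs.
🎯 Motivation: As LLM compression techniques approach their limits, a single method can hardly compress further, which makes combining quantization and sparsity an attractive direction.
❓ Problem: Applied together, quantization and pruning place conflicting requirements on weight distributions: quantization favors compact ranges, while pruning benefits from high variance. The goal is to reconcile the two and limit the performance loss.
🔍 Analysis: Starting from a second-order Hessian objective, the work finds that compensating errors between pruning and quantization effectively reduces model degradation.
🛠️ Method: Optimal Brain Restoration (OBR), a training-free framework that reformulates the objective through surrogate approximation and reaches a closed-form solution via group error compensation.
📊 Data & experiments: On Llama2-7B, W4A4KV4 quantization with 50% sparsity costs only a 1.4 increase in perplexity, with up to 4.72× speedup and 6.4× memory reduction compared to the FP16-dense baseline.
⭐ Contributions: The first to enable W4A4KV4-quantized, 50%-sparse LLMs, with a general training-free solution that improves compression and inference efficiency.
Full abstract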
Recent advances in Large Language Model (LLM) compression, such as quantization and pruning, have achieved notable success. However, as these techniques gradually approach their limits, relying on a single method for further compression has become increasingly challenging. In this work, we explore an alternative solution by combining quantization and sparsity. This joint approach, though promising, introduces new difficulties due to the inherently conflicting requirements on weight distributions: quantization favors compact ranges, while pruning benefits from high variance. To attack this problem, we propose Optimal Brain Restoration (OBR), a general and training-free framework that aligns pruning and quantization by error compensation between both. OBR minimizes performance degradation on downstream tasks by building on a second-order Hessian objective, which is then reformulated into a tractable problem through surrogate approximation and ultimately reaches a closed-form solution via group error compensation. Experiments show that OBR incurs only a 1.4 perplexity degradation on Llama2-7B to enable aggressive W4A4KV4 quantization with 50% sparsity, delivering up to 4.72x speedup and 6.4x memory reduction compared to the FP16-dense baseline.
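Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Speculative Decoding #Joint Intractability #Lossless Verification
🎯 Motivation: Verification is the key bottleneck in speeding up inference while preserving distribution fidelity in speculative decoding. Sequence-level verification accepts more tokens than token-wise verification, but existing solutions rely on surrogate approximations or are limited by partial information.
❓ Problem: Overcome joint intractability and design a lossless verification method that significantly increases the expected number of accepted tokens.
🔍 Analysis: Sequence-level verification outperforms token-wise verification, but existing methods struggle to balance probability mass across branches, which hurts decoding efficiency.
🛠️ Method: Proposes Hierarchical Speculative Decoding (HSD), which overcomes joint intractability by balancing excess and deficient probability mass across accessible branches and provides provably lossless verification.
📊 Data & Experiments: Large-scale experiments across multiple model families and benchmarks show that HSD consistently improves acceptance rates. Integrating it into EAGLE-3 yields a performance gain of over 12%.
⭐ Contributions: An efficient and explainable verification strategy that improves decoding efficiency while preserving distribution fidelity and integrates readily into existing speculative decoding frameworks.
Abstract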
Verification is a key bottleneck in improving inference speed while maintaining distribution fidelity in Speculative Decoding. Recent work has shown that sequence-level verification leads to a higher number of accepted tokens compared to token-wise verification. However, existing solutions often rely on surrogate approximations or are constrained by partial information, struggling with joint intractability. In this work, we propose Hierarchical Speculative Decoding (HSD), a provably lossless verification method that significantly boosts the expected number of accepted tokens and overcomes joint intractability by balancing excess and deficient probability mass across accessible branches. Our extensive large-scale experiments demonstrate that HSD yields consistent improvements in acceptance rates across diverse model families and benchmarks. Moreover, its strong explainability and generality make it readily integrable into a wide range of speculative decoding frameworks. Notably, integrating HSD into EAGLE-3 yields over a 12% performance gain, establishing state-of-the-art decoding efficiency without compromising distribution fidelity. Code is available at https://github.com/ZhouYuxuanYX/Hierarchical-Speculative-Decoding.
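Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Model Pruning #Large Language Model #Data Selection #Efficient Recovery
TL;DR: To achieve efficient capability recovery for pruned LLMs, we propose the PASER method to conduct the post-training data selection.
🎯 Motivation: Model pruning compresses LLMs but often causes significant capability degradation. Existing post-training recovery methods overlook the uneven deterioration across capabilities and incur high computational costs.
❓ Problem: Proposes PASER, which selects post-training data for efficient capability recovery of pruned LLMs. It focuses on recovering the most compromised capabilities under a limited data budget while avoiding interference from irrelevant data.
🔍 Analysis: Capabilities degrade unevenly after pruning; conventional instruction tuning ignores this and is computationally expensive. Some irrelevant instructions also harm recovery.
🛠️ Method: Applies manifold learning and spectral clustering to group recovery instructions in the semantic space into capability-specific instruction sets. The data budget is allocated adaptively according to each capability's degradation, prioritizing samples on which model performance declines the most. Conflicting or irrelevant data is filtered out to mitigate negative tuning effects.
📊 Data & Experiments: Experiments show that PASER significantly outperforms conventional baselines, recovering the general capabilities of pruned LLMs with only 4%-20% of the original post-training data. The authors provide a link to an anonymous code repository.
⭐ Contributions: Proposes PASER, a data selection framework for capability recovery of pruned LLMs. Capability-oriented grouping and budget allocation make recovery efficient, sharply reducing data requirements while improving recovery quality.
Abstract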
Model pruning is an effective approach for compressing large language models (LLMs). However, this process often leads to significant degradation of model capabilities. While post-training techniques such as instruction tuning are commonly employed to recover model performance, existing methods often overlook the uneven deterioration of model capabilities and incur high computational costs. Moreover, some irrelevant instructions may also introduce negative effects to model capacity recovery. To address these challenges, we propose the **P**ost-training d**A**ta **S**election method for **E**fficient pruned large language model **R**ecovery (**PASER**). PASER aims to identify instructions to recover the most compromised model capacities with a certain data budget. Our approach first applies manifold learning and spectral clustering to group recovery instructions in the semantic space, revealing capability-specific instruction sets. Then, the data budget is adaptively allocated across clusters by the degree of corresponding model capability degradation. In each cluster, we prioritize data samples that lead to the most decline of model performance. To mitigate potential negative tuning effects, we also detect and filter out conflicting or irrelevant recovery data. Extensive experiments demonstrate that PASER significantly outperforms conventional baselines, effectively recovering the general capabilities of pruned LLMs while utilizing merely 4\%-20\% of the original post-training data. We provide the anonymous code repository in [Link](https://anonymous.4open.science/r/PASER-E606).
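The adaptive budget allocation described in the abstract can be illustrated with a minimal sketch. The proportional rule, the function name `allocate_budget`, and the degradation scores below are assumptions made for illustration; the abstract only states that the budget follows the degree of capability degradation per cluster.

```python
def allocate_budget(degradation, total_budget):
    """Split a sample budget across capability clusters in proportion
    to each cluster's degradation score (assumed rule, see lead-in)."""
    total = sum(degradation.values())
    if total <= 0:
        raise ValueError("degradation scores must sum to a positive value")
    # Floor of each cluster's proportional share.
    shares = {name: int(total_budget * score / total)
              for name, score in degradation.items()}
    # Hand leftover samples to the most degraded clusters first.
    leftover = total_budget - sum(shares.values())
    ranked = sorted(degradation, key=degradation.get, reverse=True)
    for name in ranked[:leftover]:
        shares[name] += 1
    return shares


# Hypothetical degradation scores for three capability clusters.
degradation = {"math": 0.5, "code": 0.375, "commonsense": 0.125}
large = allocate_budget(degradation, 1000)
small = allocate_budget(degradation, 10)
```

Within each cluster, the selected samples would then be the ones on which the pruned model degrades the most, which this sketch does not cover.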
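Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Efficient Finetuning of Large Language Models #LoRA
TL;DR: A novel lightweight method for module selection for LoRA finetuning
🎯 Motivation: LoRA is a widely used finetuning method for large models, but existing studies have not reached a clear conclusion on adapter placement, leaving room for improvement.
❓ Problem: Proposes a lightweight method that automatically identifies the module types where LoRA adapters should be placed, given a pretrained model and a finetuning task.
🔍 Analysis: The original LoRA paper suggested placing adapters in attention modules, while other works suggested the MLP modules; the results are inconclusive.
🛠️ Method: Introduces PLoP (Precise LoRA Placement), derived from a theoretical analysis, which selects adapter placement based on the pretrained model and the finetuning task.
📊 Data & Experiments: Experiments on supervised finetuning and reinforcement learning for reasoning show that PLoP consistently outperforms, or at worst matches, commonly used placement strategies.
⭐ Contributions: Introduces PLoP, an efficient module selection method that improves LoRA finetuning performance and reduces manual placement effort.
Abstract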
Low-Rank Adaptation (LoRA) is a widely used finetuning method for large models. Its small memory footprint allows practitioners to adapt large models to specific tasks at a fraction of the cost of full finetuning. Different modifications have been proposed to enhance its efficiency by, for example, setting the learning rate, the rank, and the initialization. Another improvement axis is the adapter placement strategy: when using LoRA, practitioners usually pick module types to adapt with LoRA, such as Query and Key modules. Few works have studied the problem of adapter placement, with inconclusive results: the original LoRA paper suggested placing adapters in attention modules, while other works suggested placing them in the MLP modules. Through an intuitive theoretical analysis, we introduce PLoP (Precise LoRA Placement), a lightweight method that allows automatic identification of module types where LoRA adapters should be placed, given a pretrained model and a finetuning task. Through comprehensive experiments on supervised finetuning and reinforcement learning for reasoning, we demonstrate that PLoP consistently outperforms, or in the worst case is competitive with, commonly used placement strategies.
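Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#transformer #autoregressive model #multi-token prediction #generative model #large language models
TL;DR: An LLM framework to predict multiple tokens with arbitrary dependencies in a single model call.
🎯 Motivation: Autoregressive language models are slow because they generate only one token per forward pass, motivating methods that generate multiple tokens at once.
❓ Problem: Proposes a framework that predicts multiple tokens in a single model call, reducing decoding time.
🔍 Analysis: Moving the source of randomness from post-hoc sampling to random input variables makes future tokens deterministic functions of those inputs, so they can be predicted jointly.
🛠️ Method: Proposes Parallel Token Prediction (PTP), trained either by distilling an existing model or through teacher-free inverse autoregressive training, to predict multiple tokens in a single forward pass.
📊 Data & Experiments: On a diverse-task speculative decoding benchmark, PTP achieves a 2.4x speedup; code and checkpoints are released.
⭐ Contributions: Proves that a single PTP call can represent arbitrary dependencies between tokens, opening a new direction for fast LLM generation.
Abstract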
Autoregressive decoding in language models is inherently slow, generating only one token per forward pass. We propose Parallel Token Prediction (PTP), a general-purpose framework for predicting multiple tokens in a single model call. PTP moves the source of randomness from post-hoc sampling to random input variables, making future tokens deterministic functions of those inputs and thus jointly predictable in a single forward pass. We prove that a single PTP call can represent arbitrary dependencies between tokens. PTP is trained by distilling an existing model or through inverse autoregressive training without a teacher. Experimentally, PTP achieves a 2.4$\times$ speedup on a diverse-task speculative decoding benchmark. We provide code and checkpoints at https://github.com/mandt-lab/ptp.
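The idea of moving randomness from post-hoc sampling to the input can be illustrated with inverse-CDF sampling: once a uniform draw `u` is fixed, the sampled token is a deterministic function of `u`. The sketch below is a generic illustration of that principle, not the PTP architecture or training procedure; `token_from_noise` and the toy distribution are assumptions.

```python
import numpy as np


def token_from_noise(probs, u):
    """Map a uniform draw u in [0, 1) to a token id via the inverse CDF.
    Given u, the result is fully deterministic."""
    cdf = np.cumsum(probs)
    # First index whose cumulative mass exceeds u; the min() guards
    # against float round-off in the last bin.
    return int(min(np.searchsorted(cdf, u, side="right"), len(probs) - 1))


probs = np.array([0.1, 0.6, 0.3])           # toy next-token distribution
rng = np.random.default_rng(0)
noise = rng.random(20000)                   # randomness supplied as input
tokens = np.array([token_from_noise(probs, u) for u in noise])
freq = np.bincount(tokens, minlength=3) / len(tokens)
```

Feeding the same `u` always yields the same token, while the empirical frequencies over many draws still follow `probs`.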
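Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#diffusion LLMs #parallel decoding #benchmark
🎯 Motivation: Most autoregressive LLMs are constrained to one-by-one decoding, while diffusion LLMs (dLLMs) promise large speedups through parallel decoding; however, the resulting quality degradation has not been studied systematically.
❓ Problem: Parallel decoding in dLLMs ignores token dependencies; the paper exposes this fundamental limitation through information-theoretic analysis and case studies.
🔍 Analysis: Parallel decoding can cause dramatic quality degradation in real-world scenarios, and current strategies struggle to adapt their degree of parallelism to task difficulty, so they fail to balance speed and quality.
🛠️ Method: Proposes ParallelBench, a benchmark designed specifically for dLLMs, featuring realistic tasks that are trivial for humans and autoregressive LLMs yet challenging for dLLMs under parallel decoding.
📊 Data & Experiments: Using ParallelBench, the authors systematically evaluate dLLMs and autoregressive LLMs, quantify the speed-quality trade-off of parallel decoding, and confirm the shortcomings of current strategies.
⭐ Contributions: Systematically reveals the inherent challenges of parallel decoding in dLLMs, introduces the first benchmark specifically designed for dLLMs, and releases it to guide future decoding methods.
Abstract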
While most autoregressive LLMs are constrained to one-by-one decoding, diffusion LLMs (dLLMs) have attracted growing interest for their potential to dramatically accelerate inference through parallel decoding. Despite this promise, the conditional independence assumption in dLLMs causes parallel decoding to ignore token dependencies, inevitably degrading generation quality when these dependencies are strong. However, existing works largely overlook these inherent challenges, and evaluations on standard benchmarks (e.g., math and coding) are not sufficient to capture the quality degradation caused by parallel decoding. To address this gap, we first provide an information-theoretic analysis of parallel decoding. We then conduct case studies on analytically tractable synthetic list operations from both data distribution and decoding strategy perspectives, offering quantitative insights that highlight the fundamental limitations of parallel decoding. Building on these insights, we propose ParallelBench, the first benchmark specifically designed for dLLMs, featuring realistic tasks that are trivial for humans and autoregressive LLMs yet exceptionally challenging for dLLMs under parallel decoding. Using ParallelBench, we systematically analyze both dLLMs and autoregressive LLMs, revealing that: (i) dLLMs under parallel decoding can suffer dramatic quality degradation in real-world scenarios, and (ii) current parallel decoding strategies struggle to adapt their degree of parallelism based on task difficulty, thus failing to achieve meaningful speedup without compromising quality. Our findings underscore the pressing need for innovative decoding methods that can overcome the current speed-quality trade-off. We release our benchmark to help accelerate the development of truly efficient dLLMs.
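Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#quantization #large language models #model compression
🎯 Motivation: Quantization compresses LLMs and accelerates inference by lowering precision, but outliers in weights and activations cause accuracy degradation, especially in reasoning LLMs where errors accumulate across long chains of thought.
❓ Problem: Existing quantization methods either fail to sufficiently suppress outliers or add significant inference overhead. The goal is a method that handles outliers while keeping inference efficient.
🔍 Analysis: Outliers in weights and activations inflate quantization error and cause marked accuracy drops on reasoning tasks; prior work does not narrow the dynamic range effectively or use hardware efficiently.
🛠️ Method: Proposes Pairwise Rotation Quantization (ParoQuant), which combines independent Givens rotations with channel-wise scaling to even out magnitudes across channels, and co-designs an inference kernel that is hardware-friendly and lightweight at runtime.
📊 Data & Experiments: Under weight-only quantization, ParoQuant improves accuracy over AWQ by an average of 2.4% on reasoning tasks with less than 10% overhead; it also matches the accuracy of state-of-the-art weight-activation quantization methods.
⭐ Contributions: Proposes ParoQuant, an efficient PTQ method that addresses the outlier issue, with a co-designed inference kernel that exploits GPU parallelism at low overhead, and demonstrates accuracy gains on reasoning tasks, paving the way for more efficient deployment of reasoning LLMs.
Abstract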
Post-training quantization (PTQ) compresses the weights and activations of large language models (LLMs) into low-precision representations to reduce memory footprint and accelerate inference. However, the presence of outliers in weights and activations often leads to large quantization errors and severe accuracy degradation, especially in recent reasoning LLMs where errors accumulate across long chains of thought. Existing PTQ methods either fail to sufficiently suppress outliers or introduce significant overhead during inference. In this paper, we propose Pairwise Rotation Quantization (ParoQuant), a PTQ method that combines hardware-efficient and optimizable independent Givens rotations with channel-wise scaling to even out the magnitudes across channels and narrow the dynamic range within each quantization group, effectively addressing the outlier issue. We further co-design the inference kernel to fully exploit GPU parallelism and keep the rotations and scaling lightweight at runtime. Under weight-only quantization, ParoQuant achieves an average 2.4% accuracy improvement over AWQ on reasoning tasks, with less than 10% overhead. ParoQuant also matches the accuracy of state-of-the-art weight-activation quantization methods. This paves the way for more efficient and accurate deployment of reasoning LLMs.
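A minimal sketch of why a Givens rotation helps with outliers: rotating a pair of input channels spreads one large-magnitude channel over two, and because the rotation is orthogonal it can be undone on the activation side without changing the layer output. The fixed angle, the toy weight matrix, and the helper `givens` are assumptions for illustration; ParoQuant optimizes its rotations and adds channel-wise scaling, which this sketch omits.

```python
import numpy as np


def givens(dim, i, j, theta):
    """Return a dim x dim Givens rotation acting on coordinates i and j."""
    G = np.eye(dim)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i], G[j, j] = c, c
    G[i, j], G[j, i] = -s, s
    return G


# Toy weight matrix (out_features x in_features) with one outlier channel.
W = np.ones((4, 8))
W[:, 0] = 64.0
x = np.random.default_rng(0).normal(size=8)

G = givens(8, 0, 1, np.pi / 4)   # rotate channels 0 and 1 by 45 degrees
W_rot = W @ G                    # rotated weights (these would be quantized)
x_rot = G.T @ x                  # matching rotation applied to the input

max_before = np.abs(W).max()
max_after = np.abs(W_rot).max()
```

Here the largest weight magnitude drops from 64 to about 46 while `W_rot @ x_rot` equals `W @ x`, so the narrower dynamic range comes at no cost in exact arithmetic.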
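Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Data Synthesis #Large Language Model #Knowledge Distillation
TL;DR: This paper introduces a pedagogically-inspired data synthesis framework that distills knowledge from teacher to student language models through deficiency diagnosis, curriculum structuring, and stage-wise adaptation.
🎯 Motivation: Knowledge distillation from LLMs makes small models more efficient, but current methods lack pedagogical grounding and do not treat knowledge transfer as a dynamic process.
❓ Problem: Proposes a pedagogically inspired data synthesis framework that distills knowledge by diagnosing student deficiencies, organizing a progressive curriculum, and adapting in stages.
🔍 Analysis: Current data synthesis methods treat distillation as a one-off training task, ignoring the need to match knowledge delivery progressively to the student's cognitive capacity.
🛠️ Method: Designs a three-stage pipeline of Knowledge Identifier, Organizer, and Adapter (IOA), integrating Bloom's Mastery Learning Principles and Vygotsky's Zone of Proximal Development to increase knowledge difficulty in controlled, gradual steps.
📊 Data & Experiments: Experiments use LLaMA-3.1/3.2 and Qwen2.5 as student models on DollyEval, MATH, and HumanEval, with the framework performing especially well on complex reasoning tasks.
⭐ Contributions: A pedagogy-based distillation framework that significantly improves student performance over baseline distillation methods on multiple benchmarks, with students using less than 1/10th of the teacher's parameters.
Abstract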
Knowledge distillation from Large Language Models (LLMs) to smaller models has emerged as a critical technique for deploying efficient AI systems. However, current methods for distillation via synthetic data lack pedagogical awareness, treating knowledge transfer as a one-off data synthesis and training task rather than a systematic learning process. In this paper, we propose a novel pedagogically-inspired framework for LLM knowledge distillation that draws from fundamental educational principles. Our approach introduces a three-stage pipeline, **Knowledge Identifier**, **Organizer**, and **Adapter** (**IOA**), that systematically identifies knowledge deficiencies in student models, organizes knowledge delivery through progressive curricula, and adapts representations to match the cognitive capacity of student models. We integrate Bloom's Mastery Learning Principles and Vygotsky's Zone of Proximal Development to create a dynamic distillation process in which student models approach the teacher model's performance on prerequisite knowledge before advancing, and new knowledge is introduced with controlled, gradual difficulty increments. Extensive experiments using LLaMA-3.1/3.2 and Qwen2.5 as student models demonstrate that IOA achieves significant improvements over baseline distillation methods, with student models retaining 94.7% of teacher performance on DollyEval while using less than 1/10th of the parameters. Our framework particularly excels in complex reasoning tasks, showing a 19.2% improvement on MATH and 22.3% on HumanEval compared with state-of-the-art baselines.
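Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#diffusion #LLM #parallel generation #fast inference #autoregressive #planning #hybrid model
TL;DR: Planned diffusion speeds up LLM inference by denoising parallelized spans from a previously generated plan.
🎯 Motivation: Existing LLMs generate text autoregressively and cannot decode in parallel, which limits inference speed. Discrete diffusion language models allow parallel generation, but the choice of denoising order creates a trade-off between quality and latency.
❓ Problem: Proposes planned diffusion, which lets the model determine its own denoising order to improve the quality-latency trade-off.
🔍 Analysis: Current diffusion language models set the denoising order with heuristics, creating a steep quality-latency trade-off; autoregressive generation is slow and cannot be parallelized.
🛠️ Method: A single model works in two stages: it first autoregressively generates a plan that partitions the response into semantically independent chunks, then denoises those chunks in parallel via diffusion, combining the strengths of both paradigms.
📊 Data & Experiments: On AlpacaEval's 805 instruction-following prompts, planned diffusion achieves a Pareto-optimal quality-latency trade-off: 1.27x to 1.81x speedup over autoregressive generation with only a 0.87% to 5.4% drop in win rate.
⭐ Contributions: A hybrid autoregressive-diffusion generation model that speeds up inference and offers flexible control over the quality-latency trade-off on downstream tasks.
Abstract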
Most existing large language models are autoregressive: they generate text one token at a time, and cannot decode a new token until they have decoded every token before it.
Discrete diffusion language models offer a promising alternative by generating multiple tokens in parallel, but sampling from them requires a _denoising order_, the strategy for deciding which tokens to decode at each step.
Determining the right denoising order is difficult, and existing approaches use heuristics that create a steep trade-off between quality and latency.
We propose _planned diffusion_, a system that trains the model to determine its own denoising order.
Planned diffusion uses a single model that transitions between autoregressive and diffusion-based generation: first, the model autoregressively generates a plan that partitions the response into semantically independent chunks, defining a denoising order that parallelizes sampling across chunks; second, the model executes this plan via diffusion denoising.
On AlpacaEval, a suite of 805 instruction-following prompts, planned diffusion achieves a Pareto-optimal trade-off between quality and latency, with a 1.27x to 1.81x speedup over autoregressive generation and only a 0.87% to 5.4% drop in win rate.
Our empirical results show that planned diffusion exhibits superior performance scaling on downstream tasks compared to autoregressive baselines while offering the runtime flexibility to precisely navigate the quality-latency trade-off.
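Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Large Language Model #Knowledge Distillation
🎯 Motivation: Transferring the reasoning abilities of LLMs to small, efficient student models via distillation remains difficult: distilled models often memorize superficial patterns and generalize poorly.
❓ Problem: Proposes a distillation framework that goes beyond simple mimicry to instill conceptual understanding, overcoming pattern memorization and weak generalization.
🔍 Analysis: Conventional distillation tends to make students memorize surface patterns without deeper reasoning, so they perform poorly on unseen data.
🛠️ Method: Two key innovations: Explanatory Inversion (EI) generates explanatory probes that compel the student to articulate the logic behind an answer rather than memorize it; Explanatory GRPO (EXGRPO) is a reinforcement learning algorithm with a Dialogue Structure Utility Bonus that rewards coherent reasoning across these probes, improving generalization.
📊 Data & Experiments: Evaluated on 12 datasets with Gemma-7b as the student: an average 20.39% increase over zero-shot performance and a 6.02% improvement over state-of-the-art distillation baselines, along with high training efficiency and out-of-distribution generalization.
⭐ Contributions: An explanation-driven distillation framework that improves the student's reasoning depth, generalization, and training efficiency.
Abstract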
Distilling robust reasoning capabilities from large language models (LLMs) into smaller, computationally efficient student models remains an unresolved challenge. Despite recent advances, distilled models frequently suffer from superficial pattern memorization and subpar generalization. To overcome these limitations, we introduce a novel distillation framework that moves beyond simple mimicry to instill a deeper conceptual understanding. Our framework features two key innovations. First, to address pattern memorization, Explanatory Inversion (EI) generates targeted "explanatory probes" that compel the student to articulate the underlying logic behind an answer, rather than just memorizing it. Second, to improve generalization, Explanatory GRPO (EXGRPO) uses a reinforcement learning algorithm with a novel Dialogue Structure Utility Bonus, which explicitly rewards the student for maintaining a coherent reasoning process across these probes. Extensive evaluations on 12 datasets demonstrate significant improvements. Using Gemma-7b as the student model, our method yields an average 20.39% increase over zero-shot performance and a 6.02% improvement over the state-of-the-art distillation baselines. Moreover, models distilled with our method show remarkable training efficiency (e.g., surpassing vanilla fine-tuning with 10-25% training data) and strong generalization to out-of-distribution tasks.
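Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Large Language Model #KV Cache
TL;DR: We discovered a new paradigm for key distribution in LLMs and used it to guide the KV cache compression strategy.
🎯 Motivation: Modern LLMs struggle with long text because of the quadratic complexity of attention. Existing KV cache compression methods fail to preserve semantic integrity effectively.
❓ Problem: Improve the KV cache strategy so that long-sequence processing keeps semantic integrity with high computational efficiency.
🔍 Analysis: In the key embedding space, most tokens resemble their contextual neighbors (position-determined tokens), while a small set of semantic-anchor tokens deviates locally and forms one or several clusters.
🛠️ Method: Proposes ProtoKV, which processes position-determined and semantic-anchored tokens separately, builds semantic prototypes from their characteristics, and forms clusters of semantically similar tokens as compression units.
📊 Data & Experiments: On LongBench, ProtoKV achieves 2.11% higher accuracy than state-of-the-art methods under matched memory constraints.
⭐ Contributions: Reveals the semantic-anchor phenomenon in key embeddings and designs and validates ProtoKV, an efficient KV cache compression framework for long-text processing.
Abstract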
Modern Large Language Models (LLMs) face fundamental challenges in processing long text sequences due to the quadratic complexity of attention mechanisms. Key-Value (KV) cache retention strategies mitigate this issue by selectively preserving salient KV pairs for autoregressive generation. However, existing methods fail to adequately and efficiently preserve the semantic integrity of the compressed representations. In this paper, we discover a prevalent phenomenon in LLMs: within the key embedding space, while most tokens exhibit similarity with their contextual neighbors (which we term position-determined tokens), a small subset of special tokens serving as semantic anchors consistently exhibit a local deviation property and form one or several clusters (which we term semantic-anchored tokens). Motivated by this observation, we propose ProtoKV, which processes these two token categories separately for KV cache compression. Within this framework, we first construct semantic prototypes based on the inherent characteristics of the two token categories, and subsequently form clusters of semantically similar tokens as basic compression units. This approach preserves semantic integrity with high computational efficiency. Experiments on LongBench demonstrate that ProtoKV achieves 2.11% higher accuracy than state-of-the-art methods under matched memory constraints.
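Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Memory Efficient Training #Pre-training #Finetuning #Approximate Matrix Multiplication #Compressed Activations
TL;DR: Significantly reduces QKV projection memory by leveraging Point-Approximate Matrix Multiplication (PAMM).
🎯 Motivation: Multi-head attention is a core LLM component, and its compute and memory efficiency during training receives much attention. The memory consumed by the QKV linear projections, however, is often overlooked.
❓ Problem: Proposes PAMM, a tensor compression technique that sharply reduces the memory of the QKV projections in attention layers while matching or improving final model quality.
🔍 Analysis: Prior work focuses on approximating the scaled dot product, yet the activation memory of the QKV projections remains a significant bottleneck.
🛠️ Method: Compresses the activations of the Q, K, V projections by a factor of up to 512 with PAMM, while remaining composable with efficient attention techniques such as FlashAttention.
📊 Data & Experiments: Pre-training and fine-tuning experiments show that PAMM lowers memory consumption while achieving similar or better perplexity.
⭐ Contributions: Shows that PAMM reduces the QKV projection memory footprint to nearly zero, offering a practical approach to memory-efficient LLM training.
Abstract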
The Multi-Head Attention mechanism is central to LLM operation, and multiple works target its compute and memory efficiency during training.
While most works focus on approximating the scaled dot product, the memory consumption of the linear projections that compute the $Q$, $K$, and $V$ tensors from the input $x$ is often overlooked.
To address this, we propose Point-Approximate Matrix Multiplication (PAMM), a novel tensor compression technique that compresses the activations of the $Q$, $K$, $V$ projections in attention layers by a factor of up to $512\times$, effectively erasing their memory footprint, while achieving similar or better final perplexity. PAMM is fully composable with efficient attention techniques such as FlashAttention, making it a practical and complementary method for memory-efficient LLM training.
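Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Reinforcement Learning #Quantization
TL;DR: We develop a lossless quantized reinforcement learning framework for LLM reasoning
🎯 Motivation: Reinforcement learning with verifiable rewards has become a mainstream way to train reasoning LLMs, but autoregressive decoding makes the rollout stage take up to 70% of total training time, making it the efficiency bottleneck.
❓ Problem: Proposes quantized reinforcement learning, which accelerates rollout with a quantized actor.
🔍 Analysis: Quantized RL faces two challenges: the risk of long-term training collapse, and weight updates so small that the quantization operation can barely capture them.
🛠️ Method: Adaptive Clipping Range dynamically adjusts the clipping ratio, and invariant scaling reduces quantization noise and makes weight updates easier to capture.
📊 Data & Experiments: INT8 and FP8 quantization experiments on DeepScaleR and DAPO achieve 20% to 80% faster rollout during training.
⭐ Contributions: A lossless quantized RL framework whose two techniques improve training efficiency for reasoning LLMs while maintaining model performance.
Abstract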
Reinforcement learning with verifiable rewards (RLVR) has become a trending paradigm for training reasoning large language models (LLMs).
However, due to the autoregressive decoding nature of LLMs, the rollout process becomes the efficiency bottleneck of RL training, accounting for up to 70\% of the total training time.
In this work, we propose Quantized Reinforcement Learning (QuRL) that uses a quantized actor for accelerating the rollout.
We address two challenges in QuRL. First, we propose Adaptive Clipping Range (ACR) that dynamically adjusts the clipping ratio based on the policy ratio between the full-precision actor and the quantized actor, which is essential for mitigating long-term training collapse.
Second, we identify the weight update problem, where weight changes between RL steps are extremely small, making it difficult for the quantization operation to capture them effectively.
We mitigate this problem through an invariant scaling technique that reduces quantization noise and amplifies the weight updates.
We evaluate our method with INT8 and FP8 quantization experiments on DeepScaleR and DAPO, and achieve 20% to 80% faster rollout during training.
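The weight update problem mentioned above can be reproduced with a small numerical sketch: when the change between RL steps is far below the quantization step size, almost every INT8 code stays the same. The symmetric per-tensor quantizer, the weight scale, and the update magnitude are assumptions chosen for illustration, not the paper's setup, and the invariant scaling remedy is not implemented here.

```python
import numpy as np


def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: integer codes plus scale."""
    scale = np.abs(w).max() / 127.0
    codes = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return codes, scale


rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=10_000)          # toy weight tensor
update = rng.normal(scale=1e-6, size=w.shape)    # hypothetical tiny RL step

codes_before, scale_before = quantize_int8(w)
codes_after, scale_after = quantize_int8(w + update)

# Fraction of weights whose quantized code changed after the update.
changed = float(np.mean(codes_before != codes_after))
```

With the update roughly three orders of magnitude smaller than the quantization step, `changed` stays close to zero, so the quantized actor barely sees the new policy.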
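Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Efficient LLM Inference #LLM Prefill Acceleration #Sparse Attention #KV Cache Subselection #Training-Free
🎯 Motivation: During prefill, attention in Transformers is computationally expensive, even though many queries only focus on a small group of keys, which limits LLM inference efficiency.
❓ Problem: Develop a hardware-agnostic, training-free sparse attention algorithm that accelerates Transformer inference under chunked prefill.
🔍 Analysis: Queries with low cosine similarity to the mean query interact with more keys and contribute the most to the final attention logits. Prioritizing them closely approximates full attention.
🛠️ Method: QuoKA first retains a small set of representative queries and then subselects the keys most aligned with those queries.
📊 Data & Experiments: On Needle-In-A-Haystack, LongBench, RULER, and Math500, QuoKA achieves a 3x reduction in time-to-first-token, a 5x attention speedup on Nvidia GPUs, and up to nearly 7x on Intel Xeon CPUs.
⭐ Contributions: QuoKA sparsifies attention efficiently, keeping near-baseline accuracy while using 88% fewer key-value pairs per attention evaluation.
Abstract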
We present QuoKA: Query-oriented KV selection for efficient attention, a training-free and hardware-agnostic sparse attention algorithm for accelerating transformer inference under chunked prefill. While many queries focus on a smaller group of keys in the attention operator, we observe that queries with low cosine similarity with respect to the mean query interact more strongly with more keys and contribute the most to the final attention logits. By prioritizing these low cosine similarity queries, the behavior of full attention during the prefill stage can be closely approximated. QuoKA leverages this observation, accelerating attention by (1) first retaining a small set of representative queries and (2) then subselecting the keys most aligned with those queries. Through experiments on Needle-In-A-Haystack, LongBench, RULER, and Math500, we show that, while realizing a 3× reduction in time-to-first-token, a 5× speedup in attention on Nvidia GPUs, and up to nearly a 7× speedup on Intel Xeon CPUs, QuoKA achieves near-baseline accuracy, utilizing 88% fewer key-value pairs per attention evaluation.
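The two selection steps can be sketched in a few lines. This is a simplified single-head illustration under stated assumptions: the function name `select_kv`, the use of the maximum dot product to score keys against the retained queries, and the random toy tensors are not taken from the paper.

```python
import numpy as np


def select_kv(Q, K, num_queries, num_keys):
    """Pick representative queries, then the keys most aligned with them."""
    # Step 1: cosine similarity of every query to the mean query.
    mean_q = Q.mean(axis=0)
    denom = np.linalg.norm(Q, axis=1) * np.linalg.norm(mean_q) + 1e-12
    cos = (Q @ mean_q) / denom
    # Keep the queries least aligned with the mean query.
    q_idx = np.argsort(cos)[:num_queries]
    # Step 2 (assumed scoring rule): a key's score is its best dot
    # product with any retained query.
    scores = (Q[q_idx] @ K.T).max(axis=0)
    k_idx = np.sort(np.argsort(scores)[-num_keys:])
    return q_idx, k_idx


rng = np.random.default_rng(0)
Q = rng.normal(size=(16, 8))    # queries of one prefill chunk
K = rng.normal(size=(64, 8))    # cached keys
q_idx, k_idx = select_kv(Q, K, num_queries=4, num_keys=8)
```

Attention for the chunk would then be computed only over `K[k_idx]` and the matching values, which is where the reduction in key-value pairs comes from.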
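Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#LLM #Compression #Pruning
🎯 Motivation: LLMs are trained on massive data and accumulate rich semantic knowledge, whereas structured pruning uses only limited calibration data, so output errors are unavoidable. Direct least-squares fitting tends to overfit the calibration set and damage pretrained weights, so an error compensation mechanism is needed.
❓ Problem: Proposes a rotation-constrained compensation method that reduces the error introduced by structured pruning while preserving the geometry of output representations, re-aligning the pruned subspace with the original outputs.
🔍 Analysis: Removing components that strongly contribute to the principal directions of the output makes error recovery difficult. Input dimensions with large variance strongly affect these directions and should be kept preferentially.
🛠️ Method: Updates pruned parameters under a rotation constraint to preserve output geometry, and adds a variance-aware importance score so that dimensions contributing to the principal directions are kept.
📊 Data & Experiments: Tested on Llama-7B and Llama-2-13B with WikiText2 and multiple language understanding benchmarks; the method consistently improves perplexity and task accuracy over existing baselines.
⭐ Contributions: Combines rotation-constrained updates with variance-aware importance scoring, improving the performance of pruned models and validating the approach for LLM compression.
Abstract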
In this paper, we propose a rotation-constrained compensation method to address the errors introduced by structured pruning of large language models (LLMs).
LLMs are trained on massive datasets and accumulate rich semantic knowledge in their representation space.
In contrast, pruning is typically carried out with only a small amount of calibration data, which makes output mismatches unavoidable.
Although direct least-squares fitting can reduce such errors, it tends to overfit to the limited calibration set, destructively modifying pretrained weights.
To overcome this difficulty, we update the pruned parameters under a rotation constraint.
This constrained update preserves the geometry of output representations (i.e., norms and inner products) and simultaneously re-aligns the pruned subspace with the original outputs.
Furthermore, in rotation-constrained compensation, removing components that strongly contribute to the principal directions of the output makes error recovery difficult.
Since input dimensions with large variance strongly affect these principal directions, we design a variance-aware importance score that ensures such dimensions are preferentially kept in the pruned model.
By combining this scoring rule with rotation-constrained updates, the proposed method effectively compensates errors while retaining the components likely to be more important in a geometry-preserving manner.
In the experiments, we apply the proposed method to Llama-7B and Llama-2-13B, and evaluate it on WikiText2 and multiple language understanding benchmarks.
The results demonstrate consistently better perplexity and task accuracy compared with existing baselines.
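A rotation-constrained fit of this kind is closely related to the classical orthogonal Procrustes problem, which has a closed-form SVD solution. The sketch below solves that generic problem on toy data; how the paper applies the constraint to pruned weights is not specified in the abstract, so the setup, names, and data here are assumptions. The solution is orthogonal and may include a reflection; forcing a determinant of +1 would need an extra sign correction.

```python
import numpy as np


def procrustes(A, B):
    """Orthogonal R minimizing ||A @ R - B||_F, via the SVD of A^T B."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt


rng = np.random.default_rng(0)
Y_orig = rng.normal(size=(64, 8))                  # original outputs (toy)
true_R, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # hidden misalignment
noise = 0.01 * rng.normal(size=(64, 8))
Y_pruned = Y_orig @ true_R.T + noise               # misaligned outputs (toy)

R = procrustes(Y_pruned, Y_orig)
err_before = np.linalg.norm(Y_pruned - Y_orig)
err_after = np.linalg.norm(Y_pruned @ R - Y_orig)
```

Because `R` is orthogonal, row norms and inner products of `Y_pruned` are unchanged, which is the geometry-preserving property the abstract refers to, and the identity is a feasible choice, so the error cannot increase.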
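Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#mixture-of-experts #moe #compression #expert pruning #expert merging #merging #pruning #LLM #evaluation
TL;DR: We argue that pruning experts is superior to merging them for one-shot compression of MoE LLMs and introduce a new method, REAP, that achieves nearly lossless performance on generative tasks by minimizing the upper bound of the reconstruction error.
🎯 Motivation: Sparsely-activated Mixture-of-Experts (SMoE) models have large parameter counts that create significant memory overhead, motivating expert compression.
❓ Problem: Recent work favors expert merging, but on generative tasks merging introduces an irreducible error because fine-grained routing control is lost.
🔍 Analysis: Unlike on discriminative tasks, merging experts on generative tasks loses routing and activation detail, producing an error that cannot be removed, whereas expert pruning preserves performance better.
🛠️ Method: Proposes Router-weighted Expert Activation Pruning (REAP), a pruning criterion that considers both router gate-values and expert activation norms to minimize the reconstruction error bound.
📊 Data & Experiments: Evaluated on generative tasks across SMoE models from 20B to 1T parameters, including code generation with Qwen3-Coder-480B and Kimi-K2, where compression is near-lossless even after pruning 50% of experts.
⭐ Contributions: REAP, an expert pruning method for generative tasks, outperforms merging and other pruning methods, especially at 50% compression.
Abstract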
Sparsely-activated Mixture-of-Experts (SMoE) models offer efficient pre-training and low latency but their large parameter counts create significant memory overhead, motivating research into expert compression. Contrary to recent findings favouring expert *merging* on discriminative benchmarks, we find that expert *pruning* is a superior strategy for generative tasks. We demonstrate that existing merging techniques introduce an irreducible error due to the loss of fine-grained routing control over experts. Leveraging this insight, we propose Router-weighted Expert Activation Pruning (REAP), a novel pruning criterion that considers both router gate-values and expert activation norms to minimize the reconstruction error bound. Across a diverse set of SMoE models ranging from 20B to 1T parameters, REAP consistently outperforms merging and other pruning methods on generative benchmarks, especially at 50% compression. Notably, our method achieves near-lossless compression on code generation tasks with Qwen3-Coder-480B and Kimi-K2, even after pruning 50% of experts.
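A criterion that combines router gate-values with expert activation norms can be sketched as follows. The exact form used here (the mean of gate value times activation norm over the tokens routed to each expert) and the names `reap_scores` and `experts_to_prune` are assumptions for illustration; the abstract only states which two quantities the criterion considers.

```python
import numpy as np


def reap_scores(gates, act_norms):
    """Per-expert saliency from (tokens x experts) arrays.
    gates is 0 wherever a token was not routed to the expert."""
    active = gates > 0
    weighted = gates * act_norms
    counts = np.maximum(active.sum(axis=0), 1)   # avoid division by zero
    return weighted.sum(axis=0) / counts


def experts_to_prune(scores, ratio):
    """Indices of the lowest-scoring experts for a given pruning ratio."""
    n = int(len(scores) * ratio)
    return np.sort(np.argsort(scores)[:n])


# Toy calibration batch: 3 tokens, 4 experts, top-2 routing.
gates = np.array([[0.7, 0.3, 0.0, 0.0],
                  [0.6, 0.0, 0.4, 0.0],
                  [0.0, 0.5, 0.0, 0.5]])
act_norms = np.array([[2.0, 1.0, 0.5, 1.0]] * 3)   # ||expert output|| per token

scores = reap_scores(gates, act_norms)
pruned = experts_to_prune(scores, ratio=0.5)
```

In this toy batch the scores are 1.3, 0.4, 0.2, and 0.5, so experts 1 and 2 are removed at a 50% pruning ratio.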
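Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Sparse Attention #Attention Redundancy #Low-rank Approximation
TL;DR: We propose RESA to compensate the results of sparse attention for a more accurate output.
🎯 Motivation: As demand for long contexts grows, the KV cache becomes a bottleneck; sparse attention addresses this by selecting only critical KVs, which can degrade model quality.
❓ Problem: Proposes Residual Estimation, which compensates for the contributions of the remaining KVs that sparse attention ignores, improving output quality.
🔍 Analysis: Attention logits are highly redundant because of their low-rank nature; the principal singular value dominates the spectrum and scales linearly with sequence length, so redundancy grows as sequences get longer.
🛠️ Method: RESA is a training-free framework with a Prior Estimator that derives a rank-1 approximation prior at the end of prefilling and an Online Aggregator that fuses the prior with sparse attention at each decoding step via lightweight scaling and merging.
📊 Data & Experiments: Across various tasks and without extra overhead, RESA improves model quality by up to 26% at the same KV budget, or keeps the same quality with up to 33.2% less KV budget and a 1.23x attention throughput improvement.
⭐ Contributions: RESA, a training-free framework that compensates for what sparse attention discards, improving both efficiency and quality.
Abstract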
Large Language Models (LLMs) have gained significant attention.
The KV cache, stored to avoid the quadratic complexity of attention, becomes a bottleneck as demand for long contexts grows.
Sparse attention (SA) has been proposed to address this by selecting only critical KVs for attention, which may degrade model quality in less sparse scenarios.
To improve quality, rather than selecting more KVs, this paper takes another perspective by estimating the contributions of the remaining KVs, called Residual Estimation.
We find that attention logits (before softmax) exhibit substantial redundancy due to their inherent low-rank nature.
We perform Singular Value Decomposition (SVD) on the logits matrix in prefilling and find that the principal singular value dominates the spectrum and scales linearly with sequence length.
These imply that increasing sequence length leads to replication-like logits growth with significant redundancy.
However, it is impossible to perform SVD at each decoding step in practice due to its heavy costs.
To this end, we propose RESA, a training-free framework compensating SA's output with an estimated low-rank prior of logits.
RESA introduces (i) a Prior Estimator that derives a prior distribution from a typical query as a rank-1 approximation at the end of prefilling, and (ii) an Online Aggregator that fuses the prior with SA at each decoding step via lightweight scaling and merging.
We further show that RESA's effect comes from priors being used as attention bias for knowledge injection.
Extensive experiments show that without extra overheads, RESA improves model quality by up to 26\% across various tasks with the same KV budget compared to state-of-the-art.
Moreover, RESA maintains the same quality with up to 33.2\% KV budget reduction and 1.23$\times$ attention throughput improvement.
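Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Large Language Model #Layer Pruning #Model Compression
TL;DR: This paper presents a theoretical and empirical analysis of layer pruning in Large Language Models, aiming to improve and refine pruning strategies.
🎯 Motivation: LLMs require substantial computational resources because of their scale, which makes deployment in resource-constrained environments difficult and calls for efficient compression.
❓ Problem: Investigates best practices for layer pruning in LLMs and evaluates whether existing fine-tuning methods such as LoRA hold up for post-pruning fine-tuning.
🔍 Analysis: Theory and experiments show that a simple layer pruning approach already performs strongly, and that sophisticated layer selection metrics are not always effective.
🛠️ Method: Prunes the final layers of the model and fine-tunes only the lm_head and the remaining last three layers, supported by theoretical analysis based on gradient flow.
📊 Data & Experiments: Extensive benchmarks on Llama-3.1-8B-It, Llama-3-8B, and Llama-3-70B, consuming thousands of GPU hours in total, show significant performance gains.
⭐ Contributions: A simple and effective layer pruning recipe that surpasses state-of-the-art pruning methods by 2.36%-19.45% across the three models, with open-source code.
Abstract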
Although large language models (LLMs) have achieved remarkable success across various domains, their considerable scale necessitates substantial computational resources, posing significant challenges for deployment in resource-constrained environments. Layer pruning, as a simple yet effective compression method, removes layers of a model directly, reducing computational overhead. However, what are the best practices for layer pruning in LLMs? Are sophisticated layer selection metrics truly effective? Does the LoRA (Low-Rank Adaptation) family, widely regarded as a leading method for pruned model fine-tuning, truly meet expectations when applied to post-pruning fine-tuning? To answer these questions, we dedicate thousands of GPU hours to benchmarking layer pruning in LLMs and gaining insights across multiple dimensions. Our results demonstrate that a simple approach, i.e., pruning the final layers followed by fine-tuning the lm_head and the remaining last three layers, yields remarkably strong performance. These pruning strategies are further supported by theoretical analyses based on the gradient flow. Following this guide, our method surpasses existing state-of-the-art pruning methods by $5.62\%$–$17.27\%$ on Llama-3.1-8B-It, by $2.36\%$–$19.45\%$ on Llama-3-8B and by $4.34\%$–$9.59\%$ on Llama-3-70B. The code is available at https://github.com/yaolu-zjut/Navigation_LLM_layer_pruning.
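The recipe in the abstract (drop the final layers, then fine-tune only the lm_head and the last three remaining layers) can be written down as a small planning helper. The function name and the `layers.N` module naming are hypothetical and do not refer to any specific library; a real implementation would map these names onto the model's actual modules and freeze everything else.

```python
def plan_layer_pruning(num_layers, num_pruned, num_tuned=3):
    """Return the indices of kept layers and the names of modules to train."""
    if not 0 <= num_pruned < num_layers:
        raise ValueError("num_pruned must be in [0, num_layers)")
    # Prune from the end: keep the first (num_layers - num_pruned) layers.
    kept = list(range(num_layers - num_pruned))
    # Train only the last few remaining layers plus the output head.
    trainable = [f"layers.{i}" for i in kept[-num_tuned:]] + ["lm_head"]
    return kept, trainable


# Example: a 32-layer model with the final 8 layers removed.
kept, trainable = plan_layer_pruning(num_layers=32, num_pruned=8)
```

For this example, layers 0 to 23 are kept and only `layers.21`, `layers.22`, `layers.23`, and `lm_head` receive gradient updates.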
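Foundation/Frontier Models (incl. LLMs)
Efficiency & Compression
#Reparameterization #Speculative Decoding
TL;DR:We introduce RepSpec, a training method for speculative decoding that uses structural re-parameterization to temporarily expand the draft model’s capacity during training, without adding inference cost.
🎯 Motivation: As LLM parameter counts grow, autoregressive inference latency rises markedly, while speculative decoding is held back by the draft model's limited capacity.
❓ Problem: Increase the draft model's effective capacity to overcome the bottleneck that the parameter gap imposes on speculative decoding, enabling more efficient inference.
🔍 Analysis: Because the draft model has too few parameters, the length and quality of the candidate sequences it generates in parallel are limited, which hurts overall inference efficiency.
🛠️ Method: RepSpec uses structural re-parameterization to temporarily expand the draft model's capacity during training, then merges the redundant structures at inference time to avoid extra inference cost.
📊 Data & Experiments: Applied to the existing method EAGLE, RepSpec significantly improves accepted sequence length; a hybrid strategy combining linear and nonlinear structures is explored for further gains.
⭐ Contributions: A method combining structural re-parameterization with a hybrid training strategy that substantially increases the accepted sequence length in speculative decoding and improves draft-model training.
Full abstract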
As the parameter size of large language models (LLMs) continues to grow, the latency of autoregressive inference increases due to memory-bound computational inefficiency. To address this, speculative decoding has been proposed, where a large target model verifies multiple tokens generated in parallel by a smaller draft model. However, the performance of speculative decoding is fundamentally limited by the draft model’s capacity, which stems from the parameter gap between the two models. To overcome this limitation, we propose RepSpec, which combines structural re-parameterization with draft model training. During training, redundant linear structures are introduced and later merged into the backbone network during inference, thus enhancing the draft model’s training effectiveness without increasing inference cost. By applying our method to improve the current state-of-the-art approach, EAGLE, we achieve a significant improvement in accepted sequence length. Furthermore, considering the specific characteristics of the speculative decoding scenario, we explore a hybrid training strategy that combines linear and nonlinear structures, which yields a further improvement in acceptance length.
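The reason redundant linear structures add no inference cost is that linear maps compose into a single linear map. The sketch below shows the simplest case, a parallel linear branch folded into the backbone weight. The abstract does not specify RepSpec's branch layout, so the parallel arrangement and the helper names are assumptions.

```python
def matvec(m, x):
    """Multiply matrix m (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def merge_parallel(w, u):
    """Fold a parallel linear branch u into the backbone weight w."""
    return [[a + b for a, b in zip(rw, ru)] for rw, ru in zip(w, u)]


w = [[1.0, 2.0], [0.0, -1.0]]   # backbone weight
u = [[0.5, 0.0], [1.0, 1.0]]    # redundant branch used only during training
x = [3.0, 4.0]

# Training-time forward pass: two branches, outputs summed.
train_out = [a + b for a, b in zip(matvec(w, x), matvec(u, x))]

# Inference-time forward pass: one merged matrix, same output.
merged = merge_parallel(w, u)
infer_out = matvec(merged, x)
print(train_out, infer_out)
```

The same identity is why nonlinear branches, which the abstract mentions for the hybrid strategy, cannot be merged this way without further treatment.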
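Foundation/Frontier Models (incl. LLMs)
Efficiency & Compression
#Data Selection #Data Pruning #Large Language Model #Benchmark Compression
TL;DR:We propose a benchmark compression method that efficiently accelerates the evaluation of large language models (LLMs).
🎯 Motivation: Evaluating LLMs is resource-intensive, and the steady growth of benchmark suites is turning evaluation into a token and compute bottleneck, so benchmarks need to be compressed efficiently.
❓ Problem: Compress benchmarks so that, with far less data, model scores remain accurate and rankings remain stable.
🔍 Analysis: Benchmark data is redundant in both its text and the ranking patterns it induces across models; this redundancy can be exploited to cut the number of evaluation instances without compromising validity.
🛠️ Method: EssenceBench, a three-stage framework: redundancy-aware instance filtering, subset search with a genetic algorithm and a surrogate predictor, and attribution-guided refinement.
📊 Data & Experiments: Validated on multiple leaderboards, including HellaSwag, where 50 of 10K instances (a 200× compression) preserve 95% of model rankings within a 5% shift.
⭐ Contributions: An efficient benchmark compression framework for LLM evaluation that sharply reduces data requirements and speeds up evaluation; source code will be released upon acceptance.
Full abstract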
Benchmark suites for large language models are growing faster than our ability to pay for them. Even when training is already expensive, many use cases require repeated evaluation across many checkpoints, variants, and competing systems, and the steady expansion of benchmark suites increasingly turns evaluation into a bottleneck in tokens and compute. This scale changes what ``useful data'' means. Instead of asking whether an instance is good for training one model, we ask **which instances are necessary to keep the collective ordering of many models stable.** We analyze redundancy at the instance level and find repetition in both the text and the ranking patterns induced across models. Based on this observation, we formulate benchmark compression as a subset optimization problem that targets accurate score reconstruction and ranking preservation at the same time. We propose EssenceBench, a coarse-to-fine framework with three stages: redundancy-aware filtering with text and ranking signals, fitness-driven subset search with an iterative genetic algorithm and a fixed surrogate predictor, and attribution-guided refinement for better coverage under tight budgets. Across multiple leaderboards, EssenceBench achieves lower reconstruction error and stronger ranking preservation than prior approaches while reducing selection time. On HellaSwag with 10K instances, EssenceBench preserves 95\% of model rankings within a 5\% shift using only 50 instances, a 200$\times$ compression. The source code will be made available upon acceptance of the paper.
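Foundation/Frontier Models (incl. LLMs)
Efficiency & Compression
#Large Language Models #Quantization
TL;DR:We reveal that standard compensation-based methods overlook intra-layer dependencies and provide a rectification.
🎯 Motivation: Weight-compensation-based quantization methods perform very well on LLMs but overlook intra-layer dependencies, which leaves the calibration objective sub-optimal.
❓ Problem: Redefine the residual-error calibration objective so that the quantized model's output aligns with the original full-precision model's output at each step.
🔍 Analysis: The residual error stems not only from the output difference of the preceding layer but also from the discrepancy between compensated and original weights, termed the "compensation-aware error".
🛠️ Method: Using the neuron decomposition technique inherited from GPTAQ, the compensation-aware error is efficiently incorporated into the weight update process.
📊 Data & Experiments: Extensive experiments on various LLMs and quantization settings confirm performance gains when the method is combined with GPTQ and GPTAQ.
⭐ Contributions: Analyzes the sub-optimality of existing quantization methods, introduces the compensation-aware error, improves the calibration objective, and releases code.
Full abstract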
Methods based on weight compensation, which iteratively apply quantization and weight compensation to minimize the output error, have recently demonstrated remarkable success in quantizing Large Language Models (LLMs).
The representative work, GPTQ, introduces several key techniques that make such iterative methods practical for LLMs with billions of parameters.
GPTAQ extends this approach by introducing an asymmetric calibration process that aligns the output of each quantized layer with its full-precision counterpart, incorporating a residual error into the weight compensation framework.
In this work, we revisit the formulation of the residual error.
We identify a sub-optimal calibration objective in existing methods: during the intra-layer calibration process, they align the quantized output with the output from compensated weights, rather than the true output from the original full-precision model. Therefore, we redefine the objective to precisely align the quantized model's output with the original output of the full-precision model at each step. We then reveal that the residual error originates not only from the output difference of the preceding layer but also from the discrepancy between the compensated and original weights within each layer, which we name the 'compensation-aware error'.
By inheriting the neuron decomposition technique from GPTAQ, we can efficiently incorporate this compensation-aware error into the weight update process. Extensive experiments on various LLMs and quantization settings demonstrate that our proposed enhancements integrate seamlessly with both GPTQ and GPTAQ, significantly improving their quantization performance. Our code is publicly available at https://github.com/list0830/ResComp.
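Foundation/Frontier Models (incl. LLMs)
Efficiency & Compression
#Long Generation #KV Cache #Compression
🎯 Motivation: LLMs are widely used for long-context tasks, but inference is limited by the KV cache, whose memory footprint grows linearly with sequence length and dominates latency at each decoding step.
❓ Problem: Existing KV cache compression methods do not address the attention errors that accumulate during long decoding, which hurts accuracy in long generation.
🔍 Analysis: In long decoding, the fixed-attention-output paradigm cannot correct past attention approximation errors, so generation quality degrades.
🛠️ Method: RetroAttention, a KV cache update technique that maintains a lightweight output cache and uses new KV entries from subsequent decoding steps to retrospectively revise past attention outputs.
📊 Data & Experiments: On multiple long-generation benchmarks, RetroAttention outperforms state-of-the-art KV compression methods, increasing effective KV exposure by up to 1.6× and accuracy by up to 21.9%.
⭐ Contributions: Breaks the fixed-attention-output paradigm to enable continual correction, improving efficiency and accuracy in long-context generation; anonymized code is provided.
Full abstract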
Large Language Models (LLMs) are increasingly deployed in long-context tasks such as reasoning, code generation, and multi-turn dialogue. However, inference over extended contexts is bottlenecked by the Key-Value (KV) cache, whose memory footprint grows linearly with sequence length and dominates latency at each decoding step. While recent KV cache compression methods identify and load a few important tokens, they focus predominantly on input contexts and fail to address the cumulative attention errors that arise during long decoding. In this paper, we introduce RetroAttention, a novel KV cache update technique that retrospectively revises past attention outputs using newly arrived KV entries from subsequent decoding steps. By maintaining a lightweight output cache, RetroAttention enables past queries to be efficiently supplemented with more context, while incurring minimal latency overhead. This breaks the fixed-attention-output paradigm and allows continual correction of prior approximations. Extensive experiments on long-generation benchmarks show that RetroAttention consistently outperforms state-of-the-art (SOTA) KV compression methods, increasing effective KV exposure by up to 1.6$\times$ and accuracy by up to 21.9\%. We provide anonymized code in the supplementary material.
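One way to picture an output cache that lets a past query be "supplemented with more context" is the standard online-softmax trick: keep the running maximum, denominator, and weighted-value sum for the query, and fold in new KV entries later. The sketch below uses that trick as an assumption; the abstract does not say RetroAttention is implemented this way, and the names `attend` and `output` are hypothetical.

```python
import math


def attend(q, keys, values, state=None):
    """Fold (keys, values) into a cached softmax-attention state for query q.

    state = (running_max, denominator, numerator_vector). Passing a previous
    state revises an earlier output with newly available KV entries.
    """
    if state is None:
        state = (float("-inf"), 0.0, [0.0] * len(values[0]))
    m, denom, numer = state
    for k, v in zip(keys, values):
        score = sum(a * b for a, b in zip(q, k))
        new_m = max(m, score)
        scale = math.exp(m - new_m)        # rescales old terms; exp(-inf) == 0.0
        w = math.exp(score - new_m)
        denom = denom * scale + w
        numer = [n * scale + w * x for n, x in zip(numer, v)]
        m = new_m
    return m, denom, numer


def output(state):
    """Normalize the cached numerator to get the attention output."""
    _, denom, numer = state
    return [n / denom for n in numer]


q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

partial = attend(q, keys[:2], values[:2])            # output over a compressed subset
revised = attend(q, keys[2:], values[2:], partial)   # later supplemented with one more entry
full = attend(q, keys, values)                       # reference: attention over all entries
print(output(partial), output(revised), output(full))
```

The revised output matches attention over the union of entries, so the cache only needs the small per-query state rather than the original scores.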
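Foundation/Frontier Models (incl. LLMs)
Efficiency & Compression
#SVD Compression #Large Language Models
🎯 Motivation: The rapid growth in LLM parameter scale creates a pressing need for effective compression to reduce compute and storage costs.
❓ Problem: Existing SVD-based low-rank compression ignores inter-layer error accumulation, which degrades overall model performance; error propagation and global deviation must be addressed.
🔍 Analysis: Conventional methods compress each layer by minimizing its reconstruction error independently, which fails to suppress the accumulation and amplification of error through the network and hurts accuracy retention.
🛠️ Method: SAES-SVD has two parts. CEALC jointly optimizes local reconstruction and cumulative error compensation and yields a closed-form low-rank solution. ACES adaptively adjusts the weighting coefficient to maximize, at a fixed rank, the ratio between the Frobenius norm of the compressed layer's output and that of the compression objective, making better use of the rank budget.
📊 Data & Experiments: Evaluated across multiple LLM architectures and tasks; at a 0.2 compression ratio on LLaMA-7B, SAES-SVD limits the accuracy drop to 0.02, whereas existing methods drop by more than 0.05 on average.
⭐ Contributions: An SVD compression framework that suppresses cumulative error and narrows the gap between compressed and full-precision models through better cross-layer error compensation.
Full abstract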
The rapid growth in the parameter scale of large language models (LLMs) has created a high demand for efficient compression techniques.
As a hardware-agnostic and highly compatible technique, low-rank compression has been widely adopted. However, existing methods typically compress each layer independently by minimizing per-layer reconstruction error, overlooking a critical limitation: the reconstruction error propagates and accumulates through the network, which leads to amplified global deviations from the full-precision baseline.
To address this, we propose **Self-Adaptive Error Suppression SVD (SAES-SVD)**, an LLM compression framework that jointly optimizes intra-layer reconstruction and inter-layer error compensation.
SAES-SVD is composed of two novel components:
(1) **Cumulative Error-Aware Layer Compression (CEALC)**, which formulates the compression objective as a combination of local reconstruction and weighted cumulative error compensation. Based on it, we derive a closed-form low-rank solution that relies on second-order activation statistics and explicitly aligns each layer's output with its full-precision counterpart to compensate for accumulated errors.
(2) **Adaptive Collaborative Error Suppression (ACES)**, which automatically adjusts the weighting coefficient to enhance the low-rank structure of the compression objective in CEALC. Specifically, the coefficient is optimized to maximize the ratio between the Frobenius norm of the compressed layer's output and that of the compression objective under a fixed rank, thus ensuring that the rank budget is utilized effectively.
Extensive experiments across multiple LLM architectures and tasks show that, without fine-tuning or additional tricks, SAES-SVD consistently improves post-compression performance. For example, at a 0.2 compression ratio on LLaMA-7B, existing methods exhibit an average accuracy drop exceeding 0.05, whereas SAES-SVD restricts the drop to only 0.02. These improvements underscore the potential of SAES-SVD to effectively narrow the gap between compressed models and their full-precision counterparts, paving the way for more reliable compression of LLMs.
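Foundation/Frontier Models (incl. LLMs)
Efficiency & Compression
#Foundation models #LoRA #Homomorphic Encryption
TL;DR:SHE-LoRA: Privacy FL for edge devices via selective homomorphic encryption & adaptive LoRA. Matches non-private performance while cutting comms 99.7% and encryption compute 99.8%, compared to full encryption.
🎯 Motivation: Federated fine-tuning is key to improving LLM performance on domain-specific tasks, but it faces privacy-leakage risks such as gradient inversion attacks.
❓ Problem: Existing privacy-preserving techniques cause performance degradation and high costs, and adapt poorly to heterogeneous client data and device capabilities.
🔍 Analysis: Gradient inversion attacks can recover private data from the training process, and existing defenses struggle to balance privacy against performance.
🛠️ Method: SHE-LoRA combines selective homomorphic encryption with low-rank adaptation. Heterogeneous clients choose which parameters to encrypt based on a parameter sensitivity assessment, and column-aware secure aggregation with customized reparameterization keeps aggregation accurate.
📊 Data & Experiments: Experiments show that SHE-LoRA performs comparably to non-private baselines while cutting communication overhead by 99.71% and encryption time by 99.87% relative to HE baselines.
⭐ Contributions: An efficient, privacy-preserving federated fine-tuning framework that resists attacks while sharply reducing cost, well suited to heterogeneous devices.
Full abstract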
Federated fine-tuning is critical for improving the performance of large language models (LLMs) in handling domain-specific tasks while keeping training data decentralized and private.
However, prior work has shown that clients' private data can actually be recovered via gradient inversion attacks.
Existing privacy preservation techniques against such attacks typically entail performance degradation and high costs, making them ill-suited for clients with heterogeneous data distributions and device capabilities.
In this paper, we propose SHE-LoRA, which integrates selective homomorphic encryption (SHE) and low-rank adaptation (LoRA) to enable efficient and privacy-preserving federated tuning of LLMs in cross-device environments.
Based on model parameter sensitivity assessment, heterogeneous clients adaptively negotiate and select a subset of model parameters for homomorphic encryption.
To ensure accurate model aggregation, we design a column-aware secure aggregation method and customized reparameterization techniques to align the aggregation results with the heterogeneous device capabilities of clients.
Extensive experiments demonstrate that SHE-LoRA maintains performance comparable to non-private baselines, achieves strong resistance to state-of-the-art attacks, and significantly reduces communication overhead by 99.71\% and encryption time by 99.87\%, compared to HE baselines.
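Foundation/Frontier Models (incl. LLMs)
Efficiency & Compression
#Language Models #Knowledge Distillation #Reinforcement Learning
🎯 Motivation: LLMs excel at function calling, but their size limits adoption, so their capabilities need to be transferred to smaller models.
❓ Problem: Existing transfer approaches suffer from overfitting, training instability, ineffective binary rewards for multi-solution tasks, and difficulty combining techniques.
🔍 Analysis: Super-tiny models perform far below large models on complex function calling, and conventional knowledge distillation does not transfer capability well.
🛠️ Method: STAR combines Constrained Knowledge Distillation (CKD), which suppresses confidently incorrect predictions while preserving exploration capacity, with Similarity-guided RL (Sim-RL), which provides a continuous reward signal for policy optimization.
📊 Data & Experiments: In extensive experiments on well-known benchmarks, the 0.6B model is the best among all open models under 1B and surpasses several larger models.
⭐ Contributions: A complete training framework that transfers LLM capabilities to super-tiny models, advancing efficient and accessible AI agents.
Full abstract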
The proliferation of Large Language Models (LLMs) in function calling is pivotal for creating advanced AI agents, yet their large scale hinders widespread adoption, necessitating transferring their capabilities into smaller ones. However, existing paradigms are often plagued by overfitting, training instability, ineffective binary rewards for multi-solution tasks, and the difficulty of synergizing techniques. We introduce STAR: Similarity-guided Teacher-Assisted Refinement, a novel holistic framework that effectively transfers LLMs' capabilities to super-tiny models. STAR consists of two core technical innovations: (1) Constrained Knowledge Distillation (CKD), a training objective that augments top-k forward KL divergence to suppress confidently incorrect predictions, ensuring training stability while preserving exploration capacity for downstream RL; (2) Similarity-guided RL (Sim-RL), an RL mechanism that introduces a fine-grained, similarity-based reward. This provides a robust, continuous, and rich signal for better policy optimization by evaluating the similarity between generated outputs and the ground truth. STAR holistically synergizes these strategies within a cohesive training curriculum, enabling super-tiny models to achieve exceptional performance on complex function calling tasks. Extensive experiments on challenging and renowned benchmarks demonstrate the effectiveness of our method. Our STAR models establish SOTA in their size classes, significantly outperforming baselines. Remarkably, our 0.6B STAR model achieves the best performance among all open models under 1B, surpassing even several well-known open models at a larger scale. STAR demonstrates a training framework that distills capabilities of LLMs into super-tiny models, paving the way for powerful, accessible, and efficient AI agents.
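Foundation/Frontier Models (incl. LLMs)
Efficiency & Compression
#Efficient Video Understanding #Vision-Language Models #Token Pruning #Redundancy Reduction #Predictive Coding
🎯 Motivation: Videos are information-rich but highly redundant in space and time: consecutive frames share similar backgrounds and predictable motion. Current video-language models cannot exploit this redundancy and spend computation on many low-information patches.
❓ Problem: Proposes SURGE, a training-free, backbone-agnostic token reduction method that dynamically measures per-token surprise.
🔍 Analysis: When processing video, most patch tokens carry little, predictable information, so compute is wasted. What is missing is an on-the-fly signal of temporal predictability that decides which tokens merit computation.
🛠️ Method: Prediction error quantifies each token's surprise relative to its recent history; high-surprise tokens are kept and predictable ones are pruned. CLIP-based query relevance is then combined to form a compact spatio-temporal mask focused on key events.
📊 Data & Experiments: On multiple video understanding benchmarks, SURGE reduces tokens by up to 7× and prefill cost by 86–98%, with accuracy within ±1 point of full-token baselines.
⭐ Contributions: A surprise-guided token reduction framework that aligns computation with novelty and enables efficient long-video processing without retraining, offering a general efficiency optimization for video understanding models.
Full abstract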
Videos contain rich information but also high redundancy, as consecutive frames often share similar backgrounds and predictable motions. Current video-language models (VLMs) are unable to exploit this redundancy and therefore perform a significant amount of superfluous computation, processing thousands of patch tokens even when little new information is present. What is missing is an on-the-fly, model-agnostic signal of temporal predictability to decide whether tokens carry unpredictable information that merits computation. We propose SURGE, a training-free and backbone-agnostic method that measures surprise in token space. Surprise scores are defined by the prediction error of each token from its recent history; high-surprise tokens are retained, while predictable ones are pruned. Aggregating scores over time produces a surprise curve that highlights key events, which can be further refined with CLIP-based query relevance to form a compact spatio-temporal mask. Experiments on multiple video understanding benchmarks show that SURGE reduces tokens by up to 7$\times$ and prefill cost by 86–98\%, while maintaining accuracy within $\pm$1 point of full-token baselines. By aligning computation with novelty, SURGE enables video VLMs to handle long contexts efficiently and without retraining.
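The surprise rule described above (score each token by how badly its recent history predicts it, keep the high scorers) can be sketched as follows. The predictor here simply copies the same patch from the previous frame, which is an assumption made for illustration; the abstract does not specify SURGE's predictor, and the function names are hypothetical.

```python
def surprise_scores(prev_frame, cur_frame):
    """Squared error of a 'copy the previous frame' predictor, per patch token."""
    return [
        sum((c - p) ** 2 for c, p in zip(cur, prev))
        for prev, cur in zip(prev_frame, cur_frame)
    ]


def keep_surprising(prev_frame, cur_frame, keep):
    """Return the indices of the `keep` most surprising tokens, in patch order."""
    scores = surprise_scores(prev_frame, cur_frame)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Four patch tokens with 2-d features; patch 2 changes a lot, patch 1 not at all.
prev = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
cur = [[0.0, 0.1], [1.0, 1.0], [5.0, 2.0], [3.0, 3.2]]

kept = keep_surprising(prev, cur, keep=2)
print(kept)
```

Summing the per-token scores of each frame would give one point of the "surprise curve" that the abstract uses to highlight key events.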
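Foundation/Frontier Models (incl. LLMs)
Efficiency & Compression
#Self-Attention #Sparse Representation
TL;DR:We propose Sparse Feature Attention (SFA), which converts dense Q/K into k-sparse codes and computes attention via FlashSFA kernel, preserving near-dense quality while significantly reducing compute, latency, and KV-cache.
🎯 Motivation: The $O(n^2 d)$ cost of self-attention limits scaling to ultra-long sequences; existing methods cut cost along the sequence axis but sacrifice accuracy.
❓ Problem: Explore feature sparsity as a new axis for reducing the cost of self-attention while keeping high-dimensional expressivity and model quality.
🔍 Analysis: Existing methods reduce cost via local windows, kernel approximations, or token-level sparsity, but these degrade accuracy, and short feature embeddings also lose feature diversity.
🛠️ Method: Sparse Feature Attention (SFA) converts queries and keys into $k$-sparse codes and computes sparse overlaps efficiently with the FlashSFA kernel, reducing the cost to $\Theta(n^2 k^2/d)$.
📊 Data & Experiments: In GPT-2 and Qwen3 pretraining, SFA matches dense baselines while running up to 2.5× faster and cutting FLOPs and KV cache by nearly 50%; on synthetic and downstream benchmarks it preserves retrieval accuracy and robustness at long contexts and outperforms short-embedding baselines.
⭐ Contributions: The first to use feature sparsity as a route to efficient self-attention, enabling scaling to very long contexts with quality close to dense baselines.
Full abstract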
Scaling Transformers to ultra-long contexts is bottlenecked by the $O(n^2 d)$ cost of self-attention. Existing methods reduce this cost along the sequence axis through local windows, kernel approximations, or token-level sparsity, but these approaches consistently degrade accuracy. In this paper, we instead explore an orthogonal axis: \emph{feature sparsity}. We propose \textbf{Sparse Feature Attention (SFA)}, where queries and keys are represented as $k$-sparse codes that preserve high-dimensional expressivity while reducing the cost of attention from $\Theta(n^2 d)$ to $\Theta(n^2 k^2/d)$. To make this efficient at scale, we introduce \textbf{FlashSFA}, an IO-aware kernel that extends FlashAttention to operate directly on sparse overlaps without materializing dense score matrices. Across GPT-2 and Qwen3 pretraining, SFA matches dense baselines while improving speed by up to $2.5\times$ and reducing FLOPs and KV-cache by nearly 50\%. On synthetic and downstream benchmarks, SFA preserves retrieval accuracy and robustness at long contexts, outperforming short-embedding baselines that collapse feature diversity. These results establish feature-level sparsity as a complementary and underexplored axis for efficient attention, enabling Transformers to scale to orders-of-magnitude longer contexts with minimal quality loss.
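To see where the $k^2/d$ factor comes from, consider two $k$-sparse codes over $d$ features: their dot product only touches features active in both. The sketch below builds sparse codes by top-$k$ magnitude selection, which is an assumption (the abstract does not say how SFA obtains its codes), and the helper names are hypothetical.

```python
def topk_sparse(vec, k):
    """Keep the k largest-magnitude features as an {index: value} code."""
    idx = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    return {i: vec[i] for i in idx}


def sparse_dot(q, key):
    """Attention score that multiplies only features active in both codes."""
    return sum(v * key[i] for i, v in q.items() if i in key)


d, k = 8, 2
q = topk_sparse([0.9, -0.1, 0.0, 2.0, 0.05, -1.5, 0.2, 0.0], k)
key = topk_sparse([1.0, 0.0, 0.3, -2.0, 0.0, 0.1, 0.0, 3.0], k)
score = sparse_dot(q, key)

# If active indices were uniformly random (an idealization), two k-sparse
# codes over d features would share k * k / d indices on average.
expected_overlap = k * k / d
print(q, key, score, expected_overlap)
```

Under that idealization each of the $n^2$ query-key pairs costs about $k^2/d$ multiplications instead of $d$, which is consistent with the $\Theta(n^2 k^2/d)$ figure quoted in the abstract.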
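Foundation/Frontier Models (incl. LLMs)
Efficiency & Compression
#Large language models #post-training quantization #low-bit neural networks #model compression
TL;DR:This paper presents SliderQuant, a new post-training quantization framework for LLMs, which is superior to existing methods.
🎯 Motivation: Current post-training quantization (PTQ) methods for LLMs typically treat all layers equally, which may be suboptimal at low bit widths, so better layer-wise quantization design is needed.
❓ Problem: Study how sensitivity to quantization differs across LLM layers and use finer-grained quantization design to address the poor low-bit performance of existing methods.
🔍 Analysis: Shallow and deep layers are more sensitive to quantization than intermediate ones, and the first and last layers show significantly larger quantization error than the others. This calls for layer-specific schemes rather than one shared strategy.
🛠️ Method: SliderQuant combines inter-layer sliding quantization and intra-layer sliding quantization, using sliding-window designs and an incremental quantization strategy to reduce quantization errors across layers.
📊 Data & Experiments: On language generation, commonsense reasoning, math, and code tasks, covering the Llama families and a range of other models, SliderQuant outperforms recent methods for both weight-only and weight-activation quantization.
⭐ Contributions: A new PTQ framework, SliderQuant, that adapts quantization to the needs of each LLM layer and significantly improves low-bit quantization performance across a wide range of tasks and models.
Full abstract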
In this paper, we address post-training quantization (PTQ) for large language models (LLMs) from an overlooked perspective: given a pre-trained high-precision LLM, the predominant sequential quantization framework treats different layers equally, but this may not be optimal in challenging bit-width settings. We empirically study the quantization impact of different layers on model accuracy, and observe that: (1) shallow/deep layers are usually more sensitive to quantization than intermediate layers; (2) among shallow/deep layers, the most sensitive one is the first/last layer, which exhibits significantly larger quantization error than others. These empirical observations imply that the quantization design for different layers of LLMs is required on multiple levels instead of a single level shared by all layers. Motivated by this, we propose a new PTQ framework termed **Sliding**-lay**er** **Quant**ization (SliderQuant) that relies on a simple adaptive sliding quantization concept facilitated by a few learnable parameters. The base component of SliderQuant is called inter-layer sliding quantization, which incorporates three types of novel sliding window designs tailored for addressing the varying quantization sensitivity of shallow, intermediate and deep layers. The other component is called intra-layer sliding quantization, which leverages an incremental strategy to quantize each window. As a result, SliderQuant has a strong ability to reduce quantization errors across layers. Extensive experiments on basic language generation, zero-shot commonsense reasoning and challenging math and code tasks with various LLMs, including Llama/Llama2/Llama3/Qwen2.5 model families, DeepSeek-R1 distilled models and large MoE models, show that our method outperforms existing PTQ methods (including the latest PTQ methods using rotation transformations) for both weight-only quantization and weight-activation quantization under diverse bit-width settings.
Code is available at https://github.com/deep-optimization/SliderQuant.
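Foundation/Frontier Models (incl. LLMs)
Efficiency & Compression
#Efficient Evaluation #LLM Evaluation
🎯 Motivation: As LLMs scale up, their performance on many tasks improves markedly, but evaluation also becomes more expensive, especially inference over large numbers of benchmark samples.
❓ Problem: Provide an efficient evaluation method that uses sparse optimization to cut the computational cost of benchmarking while keeping evaluation accurate.
🔍 Analysis: The model-item performance matrix is sparse; representative items can be selected as anchors, and efficient benchmarking can be cast as a sparse optimization problem.
🛠️ Method: SparseEval optimizes anchor weights by gradient descent, refines the anchor set iteratively, uses an MLP to handle the sparse optimization, and introduces an Anchor Importance Score and a Candidate Importance Score for task-aware refinement.
📊 Data & Experiments: Extensive experiments on multiple benchmarks show low estimation error and high Kendall's τ, indicating strong robustness and practical value.
⭐ Contributions: The first to use gradient descent to optimize anchor weights for LLM evaluation; proposes the efficient evaluation method SparseEval, releases code, and validates its effectiveness in practice.
Full abstract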
As large language models (LLMs) continue to scale up, their performance on various downstream tasks has significantly improved. However, evaluating their capabilities has become increasingly expensive, as performing inference on a large number of benchmark samples incurs high computational costs.
In this paper, we revisit the model-item performance matrix and show that it exhibits sparsity, that representative items can be selected as anchors, and that the task of efficient benchmarking can be formulated as a sparse optimization problem.
Based on these insights, we propose SparseEval, a method that, for the first time, adopts gradient descent to optimize anchor weights and employs an iterative refinement strategy for anchor selection. We utilize the representation capacity of MLP to handle sparse optimization and propose the Anchor Importance Score and Candidate Importance Score to evaluate the value of each item for task-aware refinement. Extensive experiments demonstrate the low estimation error and high Kendall’s $\tau$ of our method across a variety of benchmarks, showcasing its superior robustness and practicality in real-world scenarios. Code is available at https://github.com/taolinzhang/SparseEval.
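Kendall's $\tau$, the ranking metric cited above, measures how often two score lists agree on the relative order of a pair of models. A minimal sketch of the tau-a variant follows; the score values are made up for illustration and are not results from the paper.

```python
from itertools import combinations


def kendall_tau(a, b):
    """Kendall's tau-a between two score lists (tied pairs count as neither)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical scores for four models: full benchmark vs. anchor-based estimate.
full = [0.81, 0.74, 0.69, 0.55]
estimated = [0.80, 0.70, 0.72, 0.50]

tau = kendall_tau(full, estimated)
print(round(tau, 4))
```

Here five of the six model pairs keep their order and one pair (the second and third models) flips, giving $\tau = (5 - 1)/6$.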
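Foundation/Frontier Models (incl. LLMs)
Efficiency & Compression
#reinforced sparse attention #token sparsity
TL;DR:Sparsity Forcing is an inference-aligned post-training method that optimizes a joint efficiency–performance reward with multi-budget Top-p exploration via GRPO.
🎯 Motivation: Sparse attention reduces compute by selectively processing salient tokens, but most methods only exploit a model's inherent sparsity and plateau at moderate budgets (about 50% token reduction), leaving little room to cut the budget further without hurting accuracy.
❓ Problem: Existing methods fix rigid sparsity patterns, cannot adapt to input and layer dynamics, and lack direct control over token budgets; the paper proposes a post-training framework that reinforces sparsity.
🔍 Analysis: Current methods either rely on inherent sparsity, which limits how far the budget can drop, or enforce sparsity through trainable sparse attention or sharpness-inducing regularizers, which ignore input and layer dynamics or optimize proxy objectives without direct control over token budgets.
🛠️ Method: Sparsity Forcing is an RL-based post-training method that optimizes a joint efficiency–performance reward via multi-budget Top-p exploration. By contrasting outputs under different token budgets, it rewards answers that are more efficient and correct and penalizes inefficient or incorrect ones, giving an end-to-end, inference-consistent objective.
📊 Data & Experiments: On thirteen image and video benchmarks with Qwen2-VL/Qwen2.5-VL, the token reduction ratio rises from 20% to 75% with minimal accuracy decline; long-context inference memory drops by up to 3× and decoding speeds up by up to 3.3×.
⭐ Contributions: A post-training framework that explicitly reinforces token sparsity in multimodal LLMs, achieving a better efficiency–performance trade-off with faster inference and lower memory use at high accuracy.
Full abstract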
Sparse attention mechanisms aim to reduce computational overhead with minimal accuracy loss by selectively processing salient tokens. Despite their effectiveness, most methods merely exploit a model’s inherent sparsity and thus plateau at moderate budgets (about 50\% token reduction), with little headroom to push budget lower without hurting accuracy.
Other approaches attempt to enforce sparsity through trainable sparse attention or sharpness-inducing regularizers, but these either fix rigid patterns that ignore input and layer dynamics, or optimize proxy objectives without direct control over token budgets.
In this paper, we explicitly reinforce token sparsity in well-posed multimodal large language models (MLLMs) through a simple RL-based post-training framework named $\textit{Sparsity Forcing}$.
Our method explores the efficiency-accuracy trade-off by running multiple rollouts with different token budgets, where both efficiency (token reduction ratio) and performance (answer correctness) are formulated as joint rewards.
By contrasting rollouts within each group, the more efficient and correct answer is rewarded while less efficient or incorrect ones are penalized, thereby turning token saving into an end-to-end, inference-consistent optimization objective.
Across thirteen image and video benchmarks, Sparsity Forcing raises token reduction ratio on Qwen2-VL/Qwen2.5-VL from 20\% to 75\% with minimal accuracy decline,
significantly reducing long-context inference memory by up to 3$\times$ while speeding up decoding by up to 3.3$\times$.
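The group-contrast step can be illustrated with a toy reward and a group-normalized advantage in the style of GRPO. The reward formula below (efficiency counts only when the answer is correct, weighted by `alpha`) is a hypothetical stand-in; the abstract states only that token reduction ratio and answer correctness are combined into joint rewards.

```python
from statistics import mean, pstdev


def joint_reward(correct, reduction_ratio, alpha=0.5):
    """Hypothetical joint reward: efficiency is credited only for correct answers."""
    return (1.0 + alpha * reduction_ratio) if correct else 0.0


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts (zero mean, unit scale)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of the same prompt under different token budgets:
# (answer correct?, fraction of tokens removed)
rollouts = [(True, 0.25), (True, 0.75), (False, 0.90), (True, 0.50)]

rewards = [joint_reward(c, r) for c, r in rollouts]
adv = group_advantages(rewards)
print(rewards, [round(a, 3) for a in adv])
```

The correct rollout with the most aggressive budget receives the largest advantage, while the incorrect rollout is penalized even though it removed the most tokens.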
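Foundation/Frontier Models (incl. LLMs)
Efficiency & Compression
#inference #large language models #speculative decoding
TL;DR:We introduce an asynchronous speculative decoding algorithm wherein the draft model continuously speculates on top of anticipated verification outcomes, thus hiding drafting latency entirely.
🎯 Motivation: The sequential nature of autoregressive decoding limits inference efficiency; speculative decoding gives a partial speedup but still has a sequential dependence between speculation and verification.
❓ Problem: Propose an asynchronous speculative decoding algorithm that runs speculation and verification in parallel, removing drafting latency and improving inference speed.
🔍 Analysis: The bottleneck of speculative decoding is the dependence between its operations: speculation can start only after verification finishes, so compute is underused.
🛠️ Method: Speculative speculative decoding (SSD): while verification is running, the draft model predicts likely verification outcomes and prepares the follow-up speculations in advance.
📊 Data & Experiments: Evaluated with open-source inference engines, the implementation is up to 2× faster than optimized speculative decoding baselines and up to 5× faster than autoregressive decoding.
⭐ Contributions: The SSD algorithm parallelizes the steps of speculative decoding, significantly improving inference efficiency and removing a bottleneck of existing speculative decoding.
Full abstract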
Autoregressive decoding is bottlenecked by its *sequential* nature. Speculative decoding has become a standard way to accelerate inference by using a fast draft model to predict upcoming tokens from a slower target model, and then verifying them *in parallel* with a single target model forward pass. However, speculative decoding itself relies on a *sequential* dependence between speculation and verification. We introduce *speculative speculative decoding* (SSD) to parallelize these operations. While a verification is ongoing, the draft model *predicts* likely verification outcomes and prepares speculations pre-emptively for them. If the actual verification outcome is then in the predicted set, a speculation can be returned immediately, eliminating drafting overhead entirely. We identify three key challenges presented by speculative speculative decoding, and suggest principled methods to solve each. The result is Saguaro, an optimized SSD algorithm. Our implementation is up to 2x faster than optimized speculative decoding baselines and up to 5x faster than autoregressive decoding with open source inference engines.
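The control flow of "prepare speculations for predicted outcomes, return one immediately on a hit" can be sketched with a lookup table. This toy runs sequentially and is not Saguaro: representing a verification outcome as `(num_accepted, bonus_token)`, the `toy_draft` function, and both helper names are assumptions for illustration.

```python
def prepare_speculations(predicted_outcomes, draft_fn):
    """While verification runs, draft a continuation for each predicted outcome."""
    return {outcome: draft_fn(outcome) for outcome in predicted_outcomes}


def next_speculation(actual_outcome, cache, draft_fn):
    """Return (speculation, hit). A hit means no drafting is needed now."""
    if actual_outcome in cache:
        return cache[actual_outcome], True
    return draft_fn(actual_outcome), False   # miss: fall back to drafting now


def toy_draft(outcome):
    """Stand-in draft model: continue counting from the bonus token."""
    _, bonus_token = outcome
    return [bonus_token + i for i in range(1, 4)]


# Outcomes are hypothetical (num_accepted, bonus_token) pairs.
cache = prepare_speculations([(3, 10), (2, 7)], toy_draft)

spec_hit, hit = next_speculation((3, 10), cache, toy_draft)    # predicted outcome
spec_miss, miss_hit = next_speculation((1, 4), cache, toy_draft)  # unpredicted outcome
print(spec_hit, hit, spec_miss, miss_hit)
```

In a real system `prepare_speculations` would overlap with the target model's verification pass; the benefit depends on how often the actual outcome falls in the predicted set.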
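Foundation/Frontier Models (incl. LLMs)
Efficiency & Compression
#Large Language Models #parameter-efficient fine tuning #low-rank adaptation
TL;DR:This paper proposes Stable-LoRA, a weight-shrinkage optimization strategy that enhances stability of LoRA feature learning.
🎯 Motivation: Low-Rank Adaptation (LoRA) is an effective parameter-efficient fine-tuning method, but its feature learning stability is still poorly understood in theory.
❓ Problem: The non-zero initialization that LoRA requires for $A$ undermines its ability to self-stabilize, leading to unstable feature learning and suboptimal performance.
🔍 Analysis: With appropriate hyper-parameters and initialization, LoRA can in principle achieve stable feature learning on its own, but the non-zero initialization used in practice limits this self-stability.
🛠️ Method: Stable-LoRA progressively shrinks the $A$ matrix during the earliest training steps, improving feature learning stability while keeping the benefits of a non-zero start.
📊 Data & Experiments: Across diverse models and tasks, Stable-LoRA consistently outperforms other baselines with no extra memory and negligible computational overhead.
⭐ Contributions: Shows theoretically and empirically that Stable-LoRA eliminates the instability of LoRA feature learning, providing an optimization strategy with essentially no added cost.
Full abstract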
Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient method for fine-tuning Large Language Models. It updates the weight matrix as $W=W_0+sBA$, where $W_0$ is the original frozen weight, $s$ is a scaling factor and $A$, $B$ are trainable low-rank matrices. Despite its robust empirical effectiveness, the theoretical foundations of LoRA remain insufficiently understood, particularly with respect to feature learning stability. In this paper, we first establish that LoRA can, in principle, naturally achieve and sustain stable feature learning (i.e., be self-stabilized) under appropriate hyper-parameters and initializations of $A$ and $B$. However, we also uncover a fundamental limitation: the necessary non-zero initialization of $A$ compromises self-stability, leading to suboptimal performance. To address this challenge, we propose Stable-LoRA, a weight-shrinkage optimization strategy that dynamically enhances the stability of LoRA feature learning. By progressively shrinking $A$ during the earliest training steps, Stable-LoRA is both theoretically and empirically validated to effectively eliminate instability of LoRA feature learning while preserving the benefits of the non-zero start. Experiments show that Stable-LoRA consistently outperforms other baselines across diverse models and tasks, with no additional memory usage and only negligible computation overhead. The code is available at https://github.com/Yize-Wu/Stable-LoRA.
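The update $W=W_0+sBA$ and the idea of shrinking $A$ early in training can be made concrete with a toy example. The geometric schedule in `shrink_a` (scale $A$ by a constant factor for the first few steps) is purely hypothetical, since the abstract does not give Stable-LoRA's schedule; gradient updates are omitted so that only the shrinkage is visible.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_weight(w0, b, a, s):
    """Effective weight W = W0 + s * B @ A."""
    ba = matmul(b, a)
    return [[w + s * d for w, d in zip(rw, rd)] for rw, rd in zip(w0, ba)]


def shrink_a(a, step, shrink_steps=3, rate=0.5):
    """Hypothetical schedule: scale A by (1 - rate) during the first steps only."""
    if step >= shrink_steps:
        return a
    return [[(1.0 - rate) * v for v in row] for row in a]


w0 = [[1.0, 0.0], [0.0, 1.0]]   # frozen pre-trained weight
b = [[0.0], [0.0]]              # B starts at zero (rank 1)
a = [[0.8, -0.4]]               # A starts non-zero
s = 2.0

w_init = lora_weight(w0, b, a, s)   # equals W0 because B is zero

for step in range(5):               # shrinkage applies at steps 0, 1, 2 only
    a = shrink_a(a, step)

print(w_init, a)
```

With `rate=0.5` and three shrink steps, $A$ ends at one eighth of its initial value, while the initial effective weight is untouched because $B$ starts at zero.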
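Foundation/Frontier Models (incl. LLMs)
Efficiency & Compression
#Large Language Models; Efficient Training; Low-Rank; LoRA
🎯 Motivation: Optimizers such as Adam and Muon rely on first- and second-order momenta when training LLMs, which adds significant memory overhead and limits scalability and computational efficiency.
❓ Problem: Use low-rank decomposition to reduce the memory taken by the optimizer's momentum matrices while maintaining optimization performance, enabling efficient pre-training and fine-tuning.
🔍 Analysis: Reformulating the exponential moving average (EMA) as the training of a linear regressor via online gradient flow reveals a potential low-rank structure in the optimizer state.
🛠️ Method: LoRA-Pre, a low-rank optimizer that decomposes the full momentum matrix into a low-rank subspace to cut memory overhead while maintaining optimization performance.
📊 Data & Experiments: On Llama-architecture models from 60M to 1B parameters, LoRA-Pre surpasses existing baselines in both pre-training and fine-tuning, with clear memory-efficiency and performance advantages.
⭐ Contributions: (1) Proposes the low-rank optimizer LoRA-Pre, which effectively reduces memory requirements; (2) is the first to apply a low-rank optimizer to both pre-training and fine-tuning of large models; (3) achieves significant gains across model sizes and tasks.
Full abstract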
Modern optimizers like Adam and Muon are central to training large language models, but their reliance on first- and second-order momenta introduces significant memory overhead, which constrains scalability and computational efficiency.
In this work, we reframe the exponential moving average (EMA) used in these momenta as the training of a linear regressor via online gradient flow.
Building on this equivalence, we introduce LoRA-Pre, a novel low-rank optimizer designed for efficient pre-training.
Specifically, LoRA-Pre reduces the optimizer's memory footprint by decomposing the full momentum matrix into a compact low-rank subspace within the online linear learner, thereby maintaining optimization performance while improving memory efficiency.
We empirically validate LoRA-Pre's efficacy by pre-training models from the Llama architecture family, scaling from 60M to 1B parameters.
LoRA-Pre achieves the highest performance across all model sizes.
Notably, LoRA-Pre demonstrates remarkable rank efficiency, achieving comparable or superior results using only 1/8 the rank of baseline methods.
Beyond pre-training, we evaluate LoRA-Pre's effectiveness in fine-tuning scenarios.
With the same rank, LoRA-Pre consistently outperforms all efficient fine-tuning baselines.
Specifically, compared to standard LoRA, LoRA-Pre achieves substantial improvements of 3.14 points on Llama-3.1-8B and 6.17 points on Llama-2-7B, validating our approach's effectiveness across both pre-training and fine-tuning paradigms.
Our code is publicly available at https://github.com/mrflogs/LoRA-Pre.
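The link between an EMA and online regression can be checked in a few lines. In discrete time, the EMA update $m \leftarrow \beta m + (1-\beta) g$ is exactly one gradient step of size $1-\beta$ on the loss $\tfrac{1}{2}\lVert m - g\rVert^2$. The sketch below verifies this identity numerically; it illustrates only this simplest form of the equivalence (an assumption about how the paper's gradient-flow view discretizes) and does not implement LoRA-Pre's low-rank factorization.

```python
def ema_update(m, g, beta):
    """Standard momentum EMA: m <- beta * m + (1 - beta) * g."""
    return [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, g)]


def regression_step(m, g, lr):
    """One gradient step on 0.5 * ||m - g||^2, whose gradient w.r.t. m is (m - g)."""
    return [mi - lr * (mi - gi) for mi, gi in zip(m, g)]


beta = 0.9
m_ema = [0.0, 0.0]
m_gd = [0.0, 0.0]

# Feed the same stream of (made-up) gradients to both update rules.
for g in ([1.0, -2.0], [0.5, 0.0], [-1.0, 4.0]):
    m_ema = ema_update(m_ema, g, beta)
    m_gd = regression_step(m_gd, g, lr=1.0 - beta)

print(m_ema, m_gd)
```

Once the momentum is viewed as the parameter of an online learner, replacing that parameter with a low-rank product is a natural next step, which is the direction the abstract describes.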
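Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Ternary Quantization #Large Language Models #Edge Computing
🎯 Motivation: Deploying large language models on edge devices requires quantization, but existing methods are hard to use in practice, mainly because they lack efficient hardware support.
❓ Problem: Proposes an optimization method that resolves deadzone trapping, in order to overcome the severe accuracy degradation caused by existing ternary quantization.
🔍 Analysis: Deadzone trapping arises because the affected weights receive only noisy gradients and cannot stably escape the deadzone boundary, which impairs model capacity and optimization.
🛠️ Method: Proposes Tequila, which repurposes deadzone-trapped weights as dynamic biases so that they carry a continuous signal in the forward pass and receive direct gradient signals during backpropagation.
📊 Data and experiments: Evaluated on five benchmarks; on ARC, Tequila improves accuracy by more than 4% over the SOTA baseline, nearly matching full-precision performance, with a 3.0× inference speedup.
⭐ Contributions: Tequila offers a practical and efficient ternary quantization implementation for large language models, improving deployment on edge devices.
View full abstract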
Quantization techniques are essential for the deployment of Large Language Models (LLMs) on edge devices. However, prevailing methods often rely on mixed-precision multiplication that lacks efficient hardware support, making them impractical. Ternary weight quantization addresses this by constraining weights to {-1, 0, 1}, replacing expensive multiplications with hardware-efficient additions. However, such aggressive compression leads to significant accuracy degradation, even after costly quantization-aware training with massive data. We identify the core issue as _**deadzone trapping**: a large number of weights are trapped at the deadzone boundary._ This occurs because these weights receive only noisy, less informative gradients, preventing stable escape from the deadzone and severely impeding model capacity and optimization. To address this issue, we propose **Tequila**, a trapping-free quantization optimization method that reactivates deadzone-trapped weights by repurposing them as dynamic biases. This allows the repurposed weights to provide a continuous signal in the forward pass and, critically, receive direct, meaningful gradient signals during backpropagation, thereby enhancing model capacity and optimization with nearly _zero_ inference overhead. Extensive evaluations demonstrate that Tequila outperforms state-of-the-art (SOTA) ternary quantization methods across five benchmarks. Specifically, on the ARC benchmark, it achieves $>4$% accuracy gain over the SOTA baseline, nearly matching full-precision performance (within $<1$% gap) with a $3.0\times$ inference speedup. Consequently, Tequila offers a highly practical and efficient implementation for the deployment of advanced LLMs in resource-constrained environments. The code is available at https://github.com/Tencent/AngelSlim/tree/tequila/TernaryQuant .
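To make the terms "ternary weights", "deadzone", and "additions instead of multiplications" concrete, here is a minimal sketch of plain threshold-based ternary quantization. It is not Tequila itself: the threshold rule (0.7 times the mean absolute weight) is a commonly used heuristic adopted here as an assumption, and the function names are hypothetical. Weights whose magnitude falls below the threshold land in the deadzone and are quantized to 0.

```python
import numpy as np

def ternarize(w, delta_ratio=0.7):
    """Map weights to {-1, 0, +1} with a symmetric deadzone of half-width delta."""
    delta = delta_ratio * np.mean(np.abs(w))   # assumed heuristic threshold
    t = np.zeros_like(w, dtype=np.int8)
    t[w > delta] = 1
    t[w < -delta] = -1
    mask = t != 0
    # One shared scale per tensor: mean magnitude of the surviving weights.
    alpha = float(np.abs(w[mask]).mean()) if mask.any() else 0.0
    return t, alpha, delta

def ternary_matvec(t, alpha, x):
    """Compute alpha * (t @ x) using only additions and subtractions of x entries."""
    pos = np.where(t == 1, x, 0.0).sum(axis=1)    # add inputs under +1 weights
    neg = np.where(t == -1, x, 0.0).sum(axis=1)   # subtract inputs under -1 weights
    return alpha * (pos - neg)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(6, 16))
x = rng.normal(size=16)
T, alpha, delta = ternarize(W)
print("fraction of weights in the deadzone:", float(np.mean(T == 0)))
print("ternary output:", ternary_matvec(T, alpha, x))
```

In this plain scheme the zeroed weights contribute nothing to the forward pass; the abstract's proposal is to reuse such weights as dynamic biases rather than leave them inactive.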
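Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#LLM #Quantization #Lattice Algorithm #Closest Vector Problem
TL;DR:The GPTQ algorithm is exactly Babai's nearest plane algorithm for the closest vector problem, giving a geometric view of LLM quantization.
🎯 Motivation: GPTQ is a standard algorithm for LLM quantization, but its inner workings are described as a sequence of algebraic updates with no geometric interpretation or worst-case guarantees. The paper aims to expose the mathematics behind GPTQ by connecting it to the classical closest vector problem.
❓ Problem: Clarifies the geometric meaning of GPTQ by showing that it is mathematically equivalent to Babai's nearest plane algorithm, and uses the inherited error bound, which holds when no weights are clipped, to improve on the original GPTQ.
🔍 Analysis: When GPTQ is executed back-to-front on a linear layer, its error propagation step has a clear geometric interpretation. Quantization can therefore be cast as a closest vector problem on a lattice defined by the Hessian matrix of the layer's inputs.
🛠️ Method: Establishes the mathematical equivalence between GPTQ and Babai's nearest plane algorithm and derives an error upper bound. Building on this, designs post-training quantization methods that avoid clipping and provides efficient GPU inference kernels.
📊 Data and experiments: Source code is released. The new clipping-free methods outperform the original GPTQ; the abstract does not specify the evaluation datasets or benchmark results.
⭐ Contributions: Places GPTQ on a firm theoretical footing and connects it to lattice algorithms, opening the door to importing decades of progress in lattice algorithms into the design of future quantization algorithms for billion-parameter models.
View full abstract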
Quantizing the weights of large language models (LLMs) from 16-bit to lower bitwidth is the de facto approach to deploy massive transformers onto more affordable accelerators.
While GPTQ emerged as one of the standard methods for one-shot post-training quantization at LLM scale, its inner workings are described as a sequence of algebraic updates that obscure geometric meaning or worst-case guarantees.
In this work, we show that, when executed back-to-front (from the last to first dimension) for a linear layer, GPTQ is mathematically identical to Babai's nearest plane algorithm for the classical closest vector problem (CVP) on a lattice defined by the Hessian matrix of the layer's inputs.
This equivalence is based on a sophisticated mathematical argument, and has two analytical consequences:
first, the GPTQ error propagation step gains an intuitive geometric interpretation;
second, GPTQ inherits the error upper bound of Babai's algorithm under the assumption that no weights are clipped.
Leveraging this bound, we design post-training quantization methods that avoid clipping, and outperform the original GPTQ.
In addition, we provide efficient GPU inference kernels for the resulting representation.
Taken together, these results place GPTQ on a firm theoretical footing and open the door to importing decades of progress in lattice algorithms towards the design of future quantization algorithms for billion-parameter models.
Source code is available at https://github.com/IST-DASLab/GPTQ-Babai.
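For readers unfamiliar with the lattice side of this equivalence, the following is a minimal sketch of the textbook Babai nearest plane algorithm. It does not implement GPTQ and does not build the lattice from a layer Hessian; the basis `B` and target `t` are arbitrary illustrative inputs, and the function names are hypothetical. The final check uses the classical guarantee that the squared distance to the returned lattice point is at most one quarter of the sum of squared Gram-Schmidt norms.

```python
import numpy as np

def gram_schmidt(B):
    """Gram-Schmidt orthogonalization of the columns of B (without normalization)."""
    n = B.shape[1]
    Bs = np.zeros_like(B, dtype=float)
    for j in range(n):
        v = B[:, j].astype(float)
        for i in range(j):
            mu = (B[:, j] @ Bs[:, i]) / (Bs[:, i] @ Bs[:, i])
            v = v - mu * Bs[:, i]
        Bs[:, j] = v
    return Bs

def babai_nearest_plane(B, t):
    """Return integer coefficients c and the lattice point B @ c close to t."""
    Bs = gram_schmidt(B)
    residual = np.asarray(t, dtype=float).copy()
    coeffs = np.zeros(B.shape[1], dtype=int)
    # Process basis vectors from last to first, rounding one coordinate at a time.
    for j in reversed(range(B.shape[1])):
        c = int(np.rint((residual @ Bs[:, j]) / (Bs[:, j] @ Bs[:, j])))
        coeffs[j] = c
        residual -= c * B[:, j]
    return coeffs, B @ coeffs

B = np.array([[2.0, 1.0],
              [0.0, 3.0]])      # columns are the lattice basis vectors
t = np.array([3.4, 7.1])        # target vector to approximate
coeffs, v = babai_nearest_plane(B, t)
bound = 0.25 * np.sum(gram_schmidt(B) ** 2)
print("coefficients:", coeffs, "lattice point:", v)
print("squared error within bound:", np.sum((t - v) ** 2) <= bound)
```

The back-to-front loop mirrors the order in which the abstract says GPTQ must be executed for the equivalence to hold; rounding a coefficient and subtracting its basis vector plays the role of quantizing one weight and propagating the error.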
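Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Compression #LLM
TL;DR:By understanding that the exponents of generative AI model weights have a low-entropy structure, we developed ECF8, a lossless 8-bit floating-point format that significantly compresses these models without losing accuracy.
🎯 Motivation: As generative AI models keep growing, low-precision computation becomes essential for efficient deployment; conventional floating-point formats are insufficiently optimized, and alternatives incur dequantization overhead.
❓ Problem: Develop a lossless compression method built on a low-precision floating-point format that improves storage efficiency without sacrificing numerical accuracy.
🔍 Analysis: The exponents of generative AI model weights have low entropy, which arises from α-stable distributions that emerge naturally under stochastic gradient descent; the paper also derives a theoretical compression limit.
🛠️ Method: Designs ECF8, a lossless 8-bit floating-point format with entropy-aware encoding and GPU-optimized decoding that exploits exponent concentration to compress weights.
📊 Data and experiments: Experiments on LLMs and DiTs with up to 671B parameters show up to 26.9% memory savings and up to 177.1% throughput acceleration, with fully lossless computation.
⭐ Contributions: Establishes exponent concentration as a statistical law of trained models and advances the theory and practice of lossless low-precision floating-point format design in the FP8 era.
View full abstract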
The scaling of Generative AI (GenAI) models into the hundreds of billions of parameters makes low-precision computation indispensable for efficient deployment. We argue that the fundamental solution lies in developing low-precision _floating-point_ formats, which inherently provide numerical stability, memory savings, and hardware efficiency without dequantization overhead. In this paper, we present a theoretical and empirical study of an _exponent concentration_ phenomenon in GenAI weights: exponents consistently exhibit low entropy across architectures and modalities. We show that this arises naturally from $\alpha$-stable distributions induced by stochastic gradient descent, and we prove tight bounds on the entropy of exponents. Our analysis establishes a theoretical compression limit near FP4.67, which motivates the design of a practical FP8 format. Building on these insights, we propose Exponent-Concentrated FP8 (ECF8), a lossless compression framework with entropy-aware encoding and GPU-optimized decoding. Experiments on LLMs and DiTs up to 671B parameters demonstrate up to 26.9% memory savings and 177.1% throughput acceleration, with perfectly lossless computations, i.e., no deviation in model outputs. Our results establish exponent concentration as a statistical law of trained models and open a principled path for lossless low-precision floating-point design in the FP8 era. Code is available at https://github.com/zeyuyang8/ecf8.
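The notion of exponent entropy can be made tangible with a short measurement script. This is a rough sketch under stated assumptions, not the ECF8 codec: it uses synthetic Gaussian weights stored as float32 (rather than real model weights, FP8 storage, or the α-stable distributions discussed in the abstract), extracts the 8-bit exponent field of each value, and reports the Shannon entropy of that field. The function name is hypothetical.

```python
import numpy as np

def exponent_entropy_bits(weights):
    """Shannon entropy (in bits) of the float32 exponent field of the given values."""
    bits = np.ascontiguousarray(weights, dtype=np.float32).ravel().view(np.uint32)
    exponents = (bits >> 23) & 0xFF            # 8-bit biased exponent of float32
    counts = np.bincount(exponents.astype(np.int64), minlength=256)
    probs = counts[counts > 0] / counts.sum()
    return float(-(probs * np.log2(probs)).sum())

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=100_000).astype(np.float32)  # synthetic "weights"
h = exponent_entropy_bits(w)
print(f"exponent entropy: {h:.2f} bits out of 8 stored bits")
```

When the measured entropy sits well below the number of stored exponent bits, an entropy coder can in principle shrink the exponent field without losing information, which is the intuition behind a lossless format of this kind.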
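Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Memory Efficient Fine Tuning
🎯 Motivation: Fine-tuning large language models is inefficient because of high training memory consumption, so more efficient optimization strategies are needed.
❓ Problem: Existing activation-optimization strategies are data-agnostic, which makes fine-tuning unstable and less effective.
🔍 Analysis: Activations account for a dominant share of memory and are the main bottleneck for fine-tuning efficiency.
🛠️ Method: Proposes TokenSeek, a universal plugin that uses instance-aware token seeking and ditching to achieve significant memory savings with on-par or better performance.
📊 Data and experiments: TokenSeek requires only 14.8% of the memory on Llama3.2 1B with on-par or better performance; additional analysis demonstrates its interpretability.
⭐ Contributions: Advances research on memory optimization and token efficiency for large language models through an efficient and stable fine-tuning method.
View full abstract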
Fine tuning has been regarded as a de facto approach for adapting large language models (LLMs) to downstream tasks, but the high training memory consumption inherited from LLMs makes this process inefficient. Among existing memory efficient approaches, activation-related optimization has proven particularly effective, as activations consistently dominate overall memory consumption. Although prior arts offer various activation optimization strategies, their data-agnostic nature ultimately results in ineffective and unstable fine tuning. In this paper, we propose TokenSeek, a universal plugin solution for various transformer-based models through instance-aware token seeking and ditching, achieving significant fine-tuning memory savings (e.g., requiring only 14.8% of the memory on Llama3.2 1B) with on-par or even better performance. Furthermore, our interpretable token seeking process reveals the underlying reasons for its effectiveness, offering valuable insights for future research on token efficiency. Homepage: runjia.tech/iclr_tokenseek.
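Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Large Language Model #Quantization
TL;DR:Two-stage QAT for reasoning LLMs: mixed-domain calibration preserves abilities and teacher-guided loss restores reasoning, boosting performance of ultra-low-bit reasoning LLMs
🎯 Motivation: Large language models (LLMs) excel at reasoning tasks, but their high computational and memory costs limit deployment.
❓ Problem: Existing quantization-aware training (QAT) methods degrade reasoning capability under ultra-low-bit compression, partly because of the complex knowledge structures introduced during post-training.
🔍 Analysis: Quantization affects pre-training capabilities and reasoning capabilities differently, with marked differences across data domains.
🛠️ Method: Proposes a two-stage QAT pipeline: the first stage quantizes the model with mixed-domain calibration data to preserve capabilities, and the second stage restores reasoning capability with a teacher-guided reward-rectification loss.
📊 Data and experiments: Across six tasks, mixed-domain calibration outperforms single-domain calibration by up to 2.74% on average; on five reasoning benchmarks, the 2-bit Qwen3-8B outperforms PTQ baselines by 50.45% on average.
⭐ Contributions: Proposes a two-stage QAT method tailored to reasoning that markedly improves ultra-low-bit model performance and surpasses competing models on mathematical reasoning with less data.
View full abstract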
Large language models (LLMs) have achieved remarkable performance across diverse reasoning tasks, yet their deployment is hindered by prohibitive computational and memory costs. Quantization-aware training (QAT) enables ultra-low-bit compression (<4 bits per weight), but existing QAT methods often degrade reasoning capability, partly because complex knowledge structures are introduced during the post-training process in LLMs. In this paper, through a systematic investigation of how quantization affects different data domains, we find that its impact on pre-training and reasoning capabilities differs. Building on this insight, we propose a novel two-stage QAT pipeline specifically designed for reasoning LLMs. In the first stage, we quantize the model using mixed-domain calibration data to preserve essential capabilities across domains; in the second stage, we fine-tune the quantized model with a teacher-guided reward-rectification loss to restore reasoning capability. We first demonstrate that mixed-domain calibration outperforms single-domain calibration by up to 2.74% on average over six tasks, including reasoning and pre-training tasks. Subsequent experiments on five reasoning benchmarks show that our 2-bit-quantized Qwen3-8B outperforms post-training quantization (PTQ) baselines by 50.45% on average. Moreover, compared to ultra-low-bit-specialized models such as BitNet-2B4T, our pipeline achieves approximately 2% higher mathematical-reasoning accuracy with fewer than 1B tokens. Code is available at https://github.com/yasu0001/ReasoningQAT
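Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#efficient reasoning LLM #KV cache #test-time learning
🎯 Motivation: Large reasoning models excel on complex problems, but reinforcement learning training requires long rollouts, which makes training inefficient and bounded by time and memory costs.
❓ Problem: Existing sliding-window cache strategies reduce memory use but disrupt long-context reasoning and degrade performance.
🔍 Analysis: Long rollouts demand large amounts of memory during training, and models with fixed-size caches struggle to balance accuracy and efficiency at inference.
🛠️ Method: Proposes Progressive Thought Encoding, a parameter-efficient fine-tuning method that progressively encodes intermediate reasoning into compact representations, lowering training-time memory while keeping inference memory constant.
📊 Data and experiments: Experiments on three models (including Qwen2.5-3B-Instruct) and six mathematical benchmarks show an average improvement of 19.3% over LoRA and 29.9% over the baseline, with absolute gains of up to 23.4 on AIME2024/2025 under tight cache budgets.
⭐ Contributions: Proposes a method that makes reinforcement learning training of large reasoning models substantially more efficient and scalable while retaining high reasoning accuracy under strict memory limits.
View full abstract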
Large reasoning models (LRMs) excel on complex problems but face a critical barrier to efficiency: reinforcement learning (RL) training requires long rollouts for outcome-based rewards, where autoregressive decoding dominates time and memory usage. While sliding-window cache strategies can bound memory, they disrupt long-context reasoning and degrade performance. We introduce Progressive Thought Encoding, a parameter-efficient fine-tuning method that enables LRMs to reason effectively under fixed-size caches. By progressively encoding intermediate reasoning into compact representations, our approach eliminates the need to backpropagate through full-cache rollouts, thereby reducing training-time memory usage, while maintaining constant memory during inference. Experiments on three models, including Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, and DeepSeek-R1-Distill-Llama-8B, across six widely used challenging mathematical benchmarks show consistent gains: our method achieves +19.3\% improvement over LoRA and +29.9\% over the baseline on average, with up to +23.4 absolute gains on AIME2024/2025 under tight cache budgets. These results demonstrate that Progressive Thought Encoding not only improves reasoning accuracy but also makes RL training of LRMs substantially more efficient and scalable under real-world memory constraints.
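Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Speculative Decoding #Large Language Models
🎯 Motivation: Large language models perform strongly across tasks but suffer from high inference latency due to autoregressive generation. The strict exact-match verification in existing speculative decoding discards many semantically valid candidates.
❓ Problem: Proposes a method that loosens the verification criterion, using the target model's own corrective behavior to accept candidates that are semantically valid but not exact matches.
🔍 Analysis: A two-tier mechanism separately identifies positions that allow multiple plausible candidates and distinguishes semantically correct variants from genuine errors, avoiding the waste caused by overly strict verification.
🛠️ Method: Introduces an entropy-level gate and a token-level deferred window, combined with a multi-level acceleration strategy, in a training-free framework that works across arbitrary draft-target pairs.
📊 Data and experiments: Achieves an average 2.81× speedup on Llama-3.1-70B-Instruct and 5.07× on the 405B variant while preserving at least 99% of the target model's accuracy, and outperforms the training-based method EAGLE-3 by 1.62× on out-of-domain datasets.
⭐ Contributions: Proposes a training-free, loosely verified speculative decoding method that substantially accelerates inference and generalizes across models and domains.
View full abstract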
Large language models (LLMs) achieve strong performance across diverse tasks but suffer from high inference latency due to their autoregressive generation.
Speculative Decoding (SPD) mitigates this issue by verifying candidate tokens from a smaller draft model in parallel, yet its strict exact-match verification discards many semantically valid continuations.
We propose Training-Free Loosely Speculative Decoding (FLy), a novel method that loosens the rigid verification criterion by leveraging the target model’s own corrective behavior to judge whether a draft–target mismatch remains semantically valid.
FLy introduces a two-tier mechanism: an entropy-level gate that identifies whether the current token allows multiple plausible alternatives or is nearly deterministic, and a token-level deferred window that distinguishes genuine errors from differently worded yet semantically correct variants.
To further reduce latency, we design a multi-level acceleration strategy that accelerates not only the target model but also the drafter itself.
Owing to its training-free design, FLy composes seamlessly with arbitrary draft–target pairs and generalizes across models and domains without hyperparameter re-tuning.
Experiments show that FLy preserves $\geq$99\% of the target model’s accuracy while achieving an average 2.81$\times$ speedup on Llama-3.1-70B-Instruct and 5.07$\times$ speedup on the 405B variant. Notably, on out-of-domain datasets, our method remains highly effective and outperforms the training-based method EAGLE-3 by 1.62$\times$.
Our code is available at https://github.com/AMD-AGI/FLy.
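The contrast between strict verification and an entropy-gated relaxation can be sketched in a few lines. This is an illustrative reading of the abstract, not FLy's implementation: the threshold value, the three-way decision labels, and all function names are hypothetical, and the token-level deferred window that would later resolve a "defer" decision is not modeled.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def strict_accept(draft_token, target_probs):
    """Exact-match verification: accept only the target model's top token."""
    return draft_token == int(np.argmax(target_probs))

def gated_decision(draft_token, target_probs, entropy_threshold=1.0):
    """Loosened verification with an entropy gate (hypothetical threshold).

    A mismatch at a nearly deterministic position is rejected immediately;
    a mismatch at a high-entropy position is deferred for later checking.
    """
    if strict_accept(draft_token, target_probs):
        return "accept"
    if entropy(target_probs) >= entropy_threshold:
        return "defer"
    return "reject"

peaked = [0.97, 0.01, 0.01, 0.01]   # target model is nearly certain
flat = [0.30, 0.30, 0.20, 0.20]     # several plausible continuations
print(gated_decision(1, peaked))
print(gated_decision(1, flat))
```

Under exact-match verification both mismatches above would be rejected; the gate only changes the outcome where the target distribution itself admits alternatives.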
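Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#LLMs #activation sparsity #efficiency #representations
TL;DR:We provide an overview of activation sparsity in modern LLM architectures and highlight a set of interesting properties of the activations across different model families.
🎯 Motivation: Activation sparsity offers clear benefits for efficiency, robustness, and interpretability in deep neural networks, but how universally it applies to modern large language models is unclear, and the field lacks a unified understanding.
❓ Problem: Methods that rely on exact zero activations do not directly apply to modern LLMs. The study aims to close the gap left by fragmented, model-specific strategies and the lack of general understanding.
🔍 Analysis: Activation sparsity shows universal properties across model families and scales, and the potential for effective sparsity grows with model size. The paper also presents the first analysis of sparsity in diffusion-based LLMs.
🛠️ Method: Introduces a general framework for evaluating sparsity robustness in modern LLMs and systematically investigates sparsity in their feedforward (FFN) layers.
📊 Data and experiments: Experiments span multiple model families and scales and, for the first time, include diffusion-based LLMs.
⭐ Contributions: Provides a comprehensive view of activation sparsity, shows that its relevance increases with model scale, and offers practical guidance for LLM design and acceleration.
View full abstract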
Activation sparsity is an intriguing property of deep neural networks that has been extensively studied in ReLU-based models, due to its advantages for efficiency, robustness, and interpretability.
However, methods relying on exact zero activations do not directly apply to modern Large Language Models (LLMs), leading to fragmented, model-specific strategies for LLM activation sparsity and a gap in its general understanding.
In this work, we introduce a general framework for evaluating sparsity robustness in contemporary LLMs and conduct a systematic investigation of this phenomenon in their feedforward~(FFN) layers.
Our results uncover universal properties of activation sparsity across diverse model families and scales.
Importantly, we observe that the potential for effective activation sparsity grows with model size, highlighting its increasing relevance as models scale.
Furthermore, we present the first study of activation sparsity in diffusion-based LLMs.
Overall, our work provides a comprehensive perspective and practical guidance for harnessing activation sparsity in LLM design and acceleration.
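Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Sparse Activation #Efficient Inference
🎯 Motivation: The computational demands of LLM inference keep rising, making inference efficiency a central challenge. Existing methods typically require retraining or architectural changes, which limits their applicability. Training-free sparse activation is promising but currently suffers from large approximation error and noticeable performance degradation.
❓ Problem: Existing sparse activation methods rely mainly on hidden state magnitudes, which leads to large approximation error. The paper incorporates structural information from the weight matrices to reduce this error and improve performance.
🔍 Analysis: Sparsification based solely on hidden state magnitudes performs poorly, especially at extreme sparsity levels, so a more principled sparsification strategy is needed.
🛠️ Method: Proposes WINA, a training-free sparse activation framework that combines hidden state magnitudes with the ℓ2-norm of the model's weight matrices. The method comes with provably optimal approximation error bounds while remaining practical.
📊 Data and experiments: Experiments cover multiple LLM architectures and datasets; WINA matches or exceeds the accuracy of existing methods at comparable sparsity and sustains performance better at more extreme sparsity levels.
⭐ Contributions: Proposes the WINA framework for training-free sparse activation, with theoretically optimal approximation error bounds and empirical gains validated across models and datasets.
View full abstract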
The ever-increasing computational demands of large language models (LLMs) make efficient inference a central challenge. While recent advances leverage specialized architectures or selective activation, they typically require (re)training or architectural modifications, limiting their broad applicability. Training-free sparse activation, in contrast, offers a plug-and-play pathway to efficiency; however, existing methods often rely solely on hidden state magnitudes, leading to significant approximation error and performance degradation. To address this, we introduce WINA (Weight-Informed Neuron Activation): a simple framework for training-free sparse activation that incorporates both hidden state magnitudes and weight matrix structure. By also leveraging the ℓ2-norm of the model’s weight matrices, WINA yields a principled sparsification strategy with provably optimal approximation error bounds, offering better and tighter theoretical guarantees than prior state-of-the-art approaches. Overall, WINA also empirically outperforms many previous training-free methods across diverse LLM architectures and datasets: not only matching or exceeding their accuracy at comparable sparsity levels, but also sustaining performance better at more extreme sparsity levels. Together, these results position WINA as a practical, theoretically grounded, and broadly deployable solution for efficient inference. Our source code is available at https://github.com/microsoft/wina.
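The difference between magnitude-only selection and a weight-informed criterion can be shown on a tiny example. The sketch below is an assumption-laden reading of the abstract rather than the released WINA code: it scores each input dimension by `|x_i|` times the ℓ2-norm of the matching column of `W` (taking `y = W @ x`), keeps the top `k`, and compares the resulting output error with magnitude-only selection. The function names are hypothetical.

```python
import numpy as np

def topk_mask(scores, k):
    """Boolean mask that keeps the k largest scores."""
    mask = np.zeros(scores.shape, dtype=bool)
    mask[np.argsort(scores)[-k:]] = True
    return mask

def magnitude_mask(x, k):
    """Baseline: keep the k hidden-state entries with the largest magnitude."""
    return topk_mask(np.abs(x), k)

def weight_informed_mask(x, W, k):
    """Keep the k entries with the largest |x_i| * ||W[:, i]||_2."""
    col_norms = np.linalg.norm(W, axis=0)
    return topk_mask(np.abs(x) * col_norms, k)

def approx_error(x, W, mask):
    """L2 error of the layer output when masked-out inputs are set to zero."""
    return float(np.linalg.norm(W @ x - W @ (x * mask)))

W = np.array([[1.0, 0.0],
              [0.0, 10.0]])   # second input feeds a much heavier column
x = np.array([2.0, 1.0])      # first input has the larger magnitude

err_mag = approx_error(x, W, magnitude_mask(x, 1))
err_wi = approx_error(x, W, weight_informed_mask(x, W, 1))
print("magnitude-only error:", err_mag)
print("weight-informed error:", err_wi)
```

In this toy case the small-magnitude input matters more because its column carries more weight, so the weight-informed score keeps it. The example is constructed to show the effect and says nothing about how large the gap is on real models.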
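Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Vision language model #singular value decomposition #quantization
🎯 Motivation: Singular value decomposition (SVD) is widely used to reduce the computational burden of vision language models, yet substantial latency reductions remain hard to achieve in practice. Existing methods deliver limited latency improvement during model execution, so finer-grained optimization is needed.
❓ Problem: To address the limited latency gains of conventional SVD, the paper proposes a finer-grained computational pattern. It accounts for differences in the importance of weight elements, allocates importance adaptively during SVD, and combines the result with quantization to improve efficiency.
🔍 Analysis: Although efficient SVD variants already support low-rank operations, their latency reduction during real model execution is modest. Weight elements do not contribute equally to accuracy, so standard SVD can lose critical information.
🛠️ Method: Introduces Weighted SVD (WSVD), which applies SVD at a finer granularity to reduce latency, adaptively allocates importance to weight elements to preserve accuracy, and extends to quantization of both weights and activations.
📊 Data and experiments: Experiments validate WSVD; the abstract does not name the datasets, but the open-source code supports reproduction. Decoding is more than 1.8× faster while accuracy is preserved.
⭐ Contributions: Proposes WSVD, which achieves over 1.8× decoding speedup while preserving accuracy, and releases the code, offering a new approach to efficient execution of low-precision vision language models.
View full abstract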
Singular Value Decomposition (SVD) has become an important technique for reducing the computational burden of Vision Language Models (VLMs), which play a central role in tasks such as image captioning and visual question answering. Although multiple prior works have proposed efficient SVD variants to enable low-rank operations, we find that in practice it remains difficult to achieve substantial latency reduction during model execution. To address this limitation, we introduce a new computational pattern and apply SVD at a finer granularity, enabling real and measurable improvements in execution latency. Furthermore, recognizing that weight elements differ in their relative importance, we adaptively allocate relative importance to each element during the SVD process to better preserve accuracy, then extend this framework with quantization applied to both weights and activations, resulting in a highly efficient VLM. Collectively, we introduce Weighted SVD (WSVD), which outperforms other approaches by achieving over $1.8\times$ decoding speedup while preserving accuracy. Our code is open-sourced at https://github.com/SAI-Lab-NYU/WSVD.
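As background for the low-rank operations mentioned above, here is a minimal sketch of plain truncated SVD applied to a weight matrix. It is the unweighted baseline, not WSVD: there is no per-element importance, no fine-grained partitioning, and no quantization, and the function name is hypothetical. It shows how one matrix becomes two thin factors and verifies the standard result that the Frobenius error equals the energy of the discarded singular values.

```python
import numpy as np

def low_rank_factors(W, rank):
    """Factor W (m x n) into A (m x rank) and B (rank x n) via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # fold the singular values into the left factor
    B = Vt[:rank, :]
    return A, B, s

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 6))
rank = 2
A, B, s = low_rank_factors(W, rank)

err = np.linalg.norm(W - A @ B)                 # Frobenius norm of the residual
expected = np.sqrt(np.sum(s[rank:] ** 2))       # energy in discarded singular values
print("parameters: full =", W.size, " low-rank =", A.size + B.size)
print("error matches discarded energy:", np.isclose(err, expected))
```

Fewer parameters do not automatically mean lower latency, because two small matrix multiplications can be less hardware-friendly than one larger one; that gap between theoretical and measured savings is the problem the abstract targets.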
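Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#LLMs Compression #LRMs Compression #Quantization #Pruning #Distillation
TL;DR:We run performance benchmarking and mechanistic interpretation to understand the effects of compression on reasoning models.
🎯 Motivation: Compression methods for large reasoning models (quantization, pruning, distillation) aim to improve computational efficiency, but existing studies lack a comprehensive comparison of all three and in-depth interpretability analysis.
❓ Problem: Studies how compression affects reasoning capability and which weights are critical to reasoning performance during compression.
🔍 Analysis: Weight count affects knowledge memorization more than reasoning; the final-layer MLP up projection in distilled models is among the most important components; current quantization methods over-compress critical modules, and protecting a small fraction of weights markedly improves performance.
🛠️ Method: Combines performance benchmarking with mechanistic interpretation, adapting difference-of-means and attribution patching techniques to locate the effects of compression on model weights and the causal relationships between weights and reasoning capabilities.
📊 Data and experiments: Benchmarks quantized, pruned, and distilled DeepSeek-R1 models on four reasoning datasets (AIME 2024, FOLIO, Temporal Sequences, and MuSiQue) and verifies the findings empirically.
⭐ Contributions: Provides a fine-grained interpretation of how compression affects reasoning in LRMs, identifies which weights matter most, and shows that protecting critical weights raises accuracy beyond the state of the art.
View full abstract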
Compression methods, including quantization, distillation, and pruning, improve the computational efficiency of large reasoning models (LRMs). However, existing studies either fail to sufficiently compare all three compression methods on LRMs or lack in-depth interpretation analysis. In this paper, we investigate how the reasoning capabilities of LRMs are compromised during compression, through performance benchmarking and mechanistic interpretation. To uncover the effects of compression on reasoning performance, we benchmark quantized, distilled, and pruned DeepSeek-R1 models on four reasoning datasets (AIME 2024, FOLIO, Temporal Sequences, and MuSiQue). To precisely locate compression effects on model weights, we adapt difference of means and attribution patching techniques, focusing on the activation of every linear component in compressed LRMs, to interpret fine-grained causal relationships between weights and various reasoning capabilities. This fine-grained interpretation addresses a fundamental question of compression: which weights are the most important for reasoning? Overall, we find dynamically quantized 2.51-bit R1 reaches close-to-R1 performance. With empirical verification, we present three main findings that generalize across both R1 and non-R1 LRMs: (1) Weight count has a greater impact on LRMs' knowledge memorization than reasoning, highlighting the risks of pruning and distillation; (2) The MLP up projection in the final layer of distilled LRMs is one of the most important components, offering a new perspective on locating critical weights - a fundamental problem in model compression; and (3) Current quantization methods overly compress the final-layer modules and MLP gate projections, so protecting just 2% of all weights that are excessively compressed can raise average accuracy by 6.57%, greatly surpassing the state-of-the-art.
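Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Large Language Model #KV Cache Compression #Attention Pattern
🎯 Motivation: Attention patterns are crucial to both training and inference of large language models, but prior work has examined individual patterns only and lacks a unifying explanation.
❓ Problem: Proposes TAPPA, a unifying framework that analyzes the mathematical formulation of attention patterns from a temporally continuous perspective, bridging previously fragmented observations.
🔍 Analysis: Attention patterns divide into predictable patterns with clear regularities and unpredictable patterns that appear effectively random; the distinction is explained by the degree of query self-similarity along the temporal dimension.
🛠️ Method: Models and analyzes representative attention patterns mathematically through the joint effect of queries, keys, and Rotary Positional Embeddings (RoPE), and verifies that the resulting insights can guide inference acceleration.
📊 Data and experiments: Tests TAPPA's insights on KV cache compression and LLM pruning tasks, where a simple TAPPA-motivated metric consistently improves over baseline methods.
⭐ Contributions: Proposes the TAPPA framework, which unifies the explanation of diverse attention patterns, provides a mathematical analysis of predictable patterns, and improves baseline methods on inference-efficiency tasks.
View full abstract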
Attention patterns play a crucial role in both training and inference of large language models (LLMs). Prior works have identified individual patterns such as retrieval heads, sink heads, and diagonal traces, yet these observations remain fragmented and lack a unifying explanation. To bridge this gap, we introduce **Temporal Attention Pattern Predictability Analysis (TAPPA), a unifying framework that explains diverse attention patterns by analyzing their underlying mathematical formulations** from a temporally continuous perspective.
TAPPA both deepens the understanding of attention behavior and guides inference acceleration approaches. Specifically, TAPPA characterizes attention patterns as predictable patterns with clear regularities and unpredictable patterns that appear effectively random. Our analysis further reveals that this distinction can be explained by the degree of query self-similarity along the temporal dimension.
Focusing on the predictable patterns, we further provide a detailed mathematical analysis of three representative cases through the joint effect of queries, keys, and Rotary Positional Embeddings (RoPE). We validate TAPPA by applying its insights to KV cache compression and LLM pruning tasks. Across these tasks, a simple metric motivated by TAPPA consistently improves performance over baseline methods. The code is available at [https://github.com/MIRALab-USTC/LLM-TAPPA](https://github.com/MIRALab-USTC/LLM-TAPPA).
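Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Diffusion LLM #Efficient Inference
🎯 Motivation: Diffusion large language models (DLLMs) support fast parallel generation, but they suffer from a severe quality-speed trade-off that limits practical use.
❓ Problem: Existing decoding is irreversible and accumulates early errors, which degrades performance. The paper proposes a training-free decoding algorithm to address this.
🔍 Analysis: Standard decoding is pushed in the wrong direction as early erroneous context accumulates, causing marked quality loss that is most pronounced under fast generation.
🛠️ Method: Proposes Wide-In, Narrow-Out (WINO), which uses a parallel draft-and-verify mechanism to re-mask and refine suspicious tokens during generation.
📊 Data and experiments: Validated on open-source DLLMs such as LLaDA and MMaDA; on the GSM8K math benchmark WINO accelerates inference by 6× while improving accuracy by 2.58%, and on Flickr30K captioning it achieves a 10× speedup with higher performance.
⭐ Contributions: Proposes an efficient, training-free, revocable decoding algorithm that improves the quality-speed trade-off of DLLMs, supported by extensive experiments.
View full abstract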
Diffusion Large Language Models (DLLMs) have emerged as a compelling alternative to autoregressive models, designed for fast parallel generation. However, existing DLLMs are plagued by a severe quality-speed trade-off, where faster parallel decoding leads to significant performance degradation. We attribute this to the irreversibility of standard decoding in DLLMs, which is easily polarized into the wrong decoding direction along with early error context accumulation. To resolve this, we introduce Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revocable decoding in DLLMs. WINO employs a parallel draft-and-verify mechanism, aggressively drafting multiple tokens while simultaneously using the model's bidirectional context to verify and re-mask suspicious ones for refinement. Verified in open-source DLLMs like LLaDA and MMaDA, WINO is shown to decisively improve the quality-speed trade-off. For instance, on the GSM8K math benchmark, it accelerates inference by 6$\times$ while improving accuracy by 2.58%; on Flickr30K captioning, it achieves a 10$\times$ speedup with higher performance. More comprehensive experiments are conducted to demonstrate the superiority and provide an in-depth understanding of WINO.
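Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Diffusion-based large language models #Key-value caching #Inference acceleration
🎯 Motivation: Diffusion-based large language models (dLLMs) show promising performance but have poor inference efficiency, mainly because they rely on bidirectional attention and cannot directly use the standard key-value cache of autoregressive models.
❓ Problem: Proposes a training-free approximate KV cache framework to speed up dLLM inference while also improving generation quality.
🔍 Analysis: dLLM inference is slow because key-value states cannot be cached and reused effectively during decoding, which causes redundant computation; the decoding order also tends to produce unreliable generation at the end of the sequence.
🛠️ Method: Proposes Dual aDaptive Cache (d$^2$Cache), which uses a two-stage fine-grained selection strategy to adaptively update the KV states of selected tokens at each decoding step and caches the KV states of the remaining tokens for reuse, while also enabling quasi left-to-right generation.
📊 Data and experiments: Extensive experiments on two representative dLLMs (LLaDA and Dream) show inference speedups together with maintained or improved generation quality.
⭐ Contributions: Proposes d$^2$Cache, an approximate KV cache framework for dLLMs that the summary describes as the first of its kind, delivering substantial inference speedups and better output quality.
View full abstract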
Diffusion-based large language models (dLLMs), despite their promising performance, still suffer from inferior inference efficiency. This is because dLLMs rely on bidirectional attention and cannot directly benefit from the standard key-value (KV) cache as autoregressive models (ARMs) do. To tackle this issue, we introduce _Dual aDaptive Cache_ (d$^2$Cache), which is a training-free approximate KV cache framework for accelerating dLLM inference. d$^2$Cache features a two-stage fine-grained selection strategy to identify tokens and adaptively update their KV states at each decoding step, while caching the KV states of the remaining tokens for reuse. Furthermore, d$^2$Cache naturally offers a more reliable decoding alternative, which can enable quasi left-to-right generation and mitigate premature overconfidence in tokens at the end of the sequence. Extensive experimental results on two representative dLLMs (i.e., LLaDA and Dream) demonstrate that d$^2$Cache not only achieves substantial inference speedups, but also yields consistent improvements in generation quality. The anonymous evaluation codes are available at https://anonymous.4open.science/r/d2Cache-5538.
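Foundation/Frontier Models (incl. LLMs)
Efficiency and Compression
#Vision-Language Models #Model efficiency #Token merging
TL;DR:Efficient Large Vision-Language Models with two-stage visual token merging strategies in the image encoder and LLM.
🎯 Motivation: Existing acceleration methods for large vision-language models focus on reducing image tokens at the LLM stage and overlook the computational bottleneck of the image encoder itself, so they fall short of true end-to-end acceleration. Because the image encoder is the main source of input tokens to the LLM, reducing visual redundancy at the encoder stage both speeds up the encoder and lightens the LLM's workload.
❓ Problem: Existing methods do not jointly optimize the image encoder and the LLM, which limits speedups and can hurt accuracy. The key question is how to reduce tokens while preserving accuracy by recycling useful information from discarded tokens.
🔍 Analysis: The image encoder is a major contributor to the computational bottleneck, and the redundant visual tokens it produces add to the LLM's processing load. Pruning or merging tokens only at the LLM stage cannot fundamentally solve end-to-end efficiency.
🛠️ Method: Proposes iLLaVA, a framework with a two-stage visual token merging strategy that jointly optimizes the image encoder and the LLM. The merging strategy recycles useful information from otherwise discarded tokens to limit performance loss.
📊 Data and experiments: Evaluated extensively on image and video understanding tasks against state-of-the-art token pruning and merging techniques. iLLaVA achieves up to a 2× throughput boost and a 4× reduction in prefilling time, and enables a larger model to surpass a smaller one in both accuracy and efficiency.
⭐ Contributions: Jointly optimizes the image encoder and the LLM for end-to-end acceleration, which the summary describes as a first, and proposes an information-recycling token merging strategy to balance efficiency and accuracy. Detailed visualizations show how different LVLM components contribute to efficient computation.
View full abstract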
Recent methods have made notable progress in accelerating Large Vision-Language Models (LVLMs) by exploiting the inherent redundancy in visual inputs. Most existing approaches, however, focus narrowly on reducing image tokens before or within the Large Language Model (LLM) stage to lower computational cost. This overlooks other major bottlenecks, particularly the image encoder, which itself requires substantial computation. As a result, these methods fall short of achieving true end-to-end acceleration. Importantly, the image encoder is the primary contributor of input tokens to the LLM. Thus, reducing visual redundancy at the encoder stage not only speeds up the encoder itself but also significantly lightens the workload for the subsequent LLM. Motivated by this, we investigate how to jointly optimize the image encoder and the LLM along with other LVLM components for comprehensive acceleration. To mitigate the risk of performance degradation from token reduction, we propose a novel token merging strategy that recycles useful information from otherwise discarded tokens. Our approach, iLLaVA, delivers consistent improvements across both image and video understanding tasks, achieving up to a 2$\times$ throughput boost and a 4$\times$ reduction in prefilling time. Notably, iLLaVA enables a larger model (e.g., InternVL-2.5 26B) to surpass a smaller counterpart (e.g., InternVL-2.5 8B) in both accuracy and efficiency. Extensive comparisons with state-of-the-art token pruning and merging techniques demonstrate the clear superiority of our method. Finally, we provide detailed visualizations for the merging steps of iLLaVA, offering deeper insights into how different LVLM components contribute to efficient computation.
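Agents and Tool Use (100 papers)
Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Adaptive LLMs #Deep Research #Agent Reasoning
TL;DR:We propose A²FM, a unified 32B model combining agentic, reasoning, and instant modes via adaptive routing and APO, achieving state-of-the-art accuracy with substantially improved cost efficiency.
🎯 Motivation: Current large models split into reasoning-centric models, which strengthen internal reasoning but cannot call tools, and agentic models, which interact well with environments but are weaker at deep reasoning. Both are inefficient on simple tasks, where they tend to overthink or over-call tools.
❓ Problem: To close the gap between the two families in task fit and efficiency, the paper proposes a unified framework that handles simple queries more efficiently while improving both reasoning and agentic capability.
🔍 Analysis: Reasoning-centric and agentic models have different strengths because of their different training objectives, and existing models tend to neglect simple tasks while incurring high resource costs.
🛠️ Method: Proposes the A²FM framework, which integrates agentic, reasoning, and instant modes through task-aware routing and mode alignment, and introduces Adaptive Policy Optimization (APO), which enforces adaptive sampling across modes and applies a cost-regularized reward.
📊 Data and experiments: Evaluated on BrowseComp, AIME25, and HLE; A²FM sets a new state of the art among comparable models on these benchmarks while substantially reducing cost.
⭐ Contributions: Proposes a unified mode routing and alignment framework and, through adaptive policy optimization, improves cost efficiency and accuracy across reasoning, agentic, and simple-query tasks.
View full abstract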
Large language models split into two families: reasoning-centric LLMs, which strengthen internal chain-of-thought reasoning but cannot invoke external tools, and agentic LLMs, which learn to interact with environments and leverage tools but often lag in deep reasoning. This divide arises from fundamentally different training objectives, leading to mismatched strengths and inefficiency on simple queries, where both families tend to overthink or over-call tools. In this work, we present Adaptive Agent Foundation Model (A$^2$FM), a unified framework that follows a route-then-align principle: the model first learns task-aware routing and then aligns mode-specific trajectories under a shared backbone. To address the inefficiency gap, we introduce a third instant mode that handles simple queries directly, preventing unnecessary reasoning or tool calls while complementing the agentic and reasoning modes. To jointly enhance accuracy and efficiency, we propose Adaptive Policy Optimization (APO), which enforces adaptive sampling across modes and applies a cost-regularized reward. On the 32B scale, A$^2$FM achieves 13.4% on BrowseComp, 70.4% on AIME25, and 16.7% on HLE, setting new SOTA among comparable models and performing competitively with frontier LLMs across agentic, reasoning, and general benchmarks. Notably, the adaptive execution achieves a cost of pass of only \$0.00487 per correct answer, cutting cost by 45.2% relative to reasoning and 33.5% relative to agentic, thus delivering substantially higher cost efficiency while maintaining comparable accuracy.
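The two cost-related quantities in the abstract can be sketched numerically. Both definitions below are assumptions made for illustration and are not taken from the A²FM paper: "cost of pass" is read as total inference cost divided by the number of correct answers (based on the phrase "per correct answer"), and the cost-regularized reward is given a simple hypothetical form, correctness minus a penalty weight `lam` times cost. All names and sample values are hypothetical.

```python
def cost_of_pass(costs, correct):
    """Total cost divided by the number of correct answers (assumed definition)."""
    n_correct = sum(1 for c in correct if c)
    if n_correct == 0:
        return float("inf")   # nothing solved, so each pass is infinitely costly
    return sum(costs) / n_correct

def cost_regularized_reward(is_correct, cost, lam=0.2):
    """Hypothetical reward: 1 for a correct answer, minus lam times the cost."""
    return (1.0 if is_correct else 0.0) - lam * cost

# Illustrative per-query costs (in dollars) and outcomes for one mode.
costs = [0.01, 0.02, 0.03]
correct = [True, False, True]
print("cost of pass:", cost_of_pass(costs, correct))
print("reward for a correct, costly answer:", cost_regularized_reward(True, 0.5))
```

A reward of this shape makes a cheap correct answer preferable to an expensive correct one, which is one way a router could learn to send simple queries to an instant mode.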
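Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Multi-hop QA #Reinforcement Learning #GRPO #Large Language Model #LLM agent
🎯 Motivation: LLMs trained with reinforcement learning perform strongly on open-domain QA but still struggle with ambiguous questions that admit multiple valid answers. Existing benchmarks typically assume a single gold answer, which produces inappropriate training signals.
❓ Problem: Proposes A$^2$Search, an annotation-free training framework that automatically identifies ambiguous questions and gathers alternative answers, avoiding the manual annotation cost that is hard to scale to multi-hop QA.
🔍 Analysis: Existing attempts to handle ambiguity rely on costly manual annotation, which is difficult to scale to multi-hop datasets such as HotpotQA and MuSiQue.
🛠️ Method: Builds an automated pipeline that detects ambiguous questions and gathers alternative answers via trajectory sampling and evidence verification, then optimizes the model with RL using a purpose-built $\mathrm{AnsF1}$ reward.
📊 Data & Experiments: On eight open-domain QA benchmarks, A$^2$Search achieves new state-of-the-art results. With a single rollout, A$^2$Search-7B averages 48.4% $\mathrm{AnsF1}@1$ across four multi-hop benchmarks, outperforming strong baselines including the much larger ReSearch-32B (46.2%).
⭐ Contributions: An annotation-free, end-to-end framework for recognizing and handling ambiguity, evidence that embracing ambiguity matters for building reliable QA systems, and a public release of code, data, and model weights.
Abstract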
Recent advances in Large Language Models (LLMs) and Reinforcement Learning (RL) have led to strong performance in open-domain question answering (QA). However, existing models still struggle with questions that admit multiple valid answers. Standard QA benchmarks, which typically assume a single gold answer, overlook this reality and thus produce inappropriate training signals. Existing attempts to handle ambiguity often rely on costly manual annotation, which is difficult to scale to multi-hop datasets such as HotpotQA and MuSiQue.
In this paper, we present A$^2$Search, an annotation-free, end-to-end training framework to recognize and handle ambiguity. At its core is an automated pipeline that detects ambiguous questions and gathers alternative answers via trajectory sampling and evidence verification. The model is then optimized with RL using a carefully designed $\mathrm{AnsF1}$ reward, which naturally accommodates multiple answers.
Experiments on eight open-domain QA benchmarks demonstrate that A$^2$Search achieves new state-of-the-art performance. With only a single rollout, A$^2$Search-7B yields an average $\mathrm{AnsF1}@1$ score of $48.4$% across four multi-hop benchmarks, outperforming all strong baselines, including the substantially larger ReSearch-32B ($46.2$%). Extensive analyses further show that A$^2$Search resolves ambiguity and generalizes across benchmarks, highlighting that embracing ambiguity is essential for building more reliable QA systems. Our code, data, and model weights can be found at https://github.com/zfj1998/A2Search.
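To make the idea of a reward that "naturally accommodates multiple answers" concrete, here is a minimal sketch of a set-level F1 between predicted and reference answers. The abstract does not define $\mathrm{AnsF1}$, so the normalization rules and the exact-match criterion below are assumptions chosen for illustration.

```python
import re
import string


def normalize(answer: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    answer = answer.lower()
    answer = "".join(ch for ch in answer if ch not in string.punctuation)
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)
    return " ".join(answer.split())


def ans_f1(predicted: list[str], references: list[str]) -> float:
    """F1 over the sets of normalized predicted and reference answers."""
    pred = {normalize(a) for a in predicted} - {""}
    gold = {normalize(a) for a in references} - {""}
    hits = len(pred & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


# One of two valid answers found: precision 1.0, recall 0.5.
score = ans_f1(["Paris"], ["paris", "Lyon"])
```

Under this assumed definition, a model that returns every valid answer and nothing else scores 1.0, while guessing extra answers lowers precision.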
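Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Multi-agent orchestration #Real-world travel planning #Constraints-aware planning
TL;DR: General multi-agent framework for real-world constrained travel planning with information search
🎯 Motivation: LLMs remain limited at reasoning and tool use under complex constraints and often fail to produce optimal, grounded solutions. Real-world travel planning requires agents to handle constraints at several levels that also change over time.
❓ Problem: Proposes a general multi-agent framework that handles explicit, implicit, and evolving constraints in real-world travel planning, enabling constraint-aware planning.
🔍 Analysis: Conventional approaches struggle to handle multiple constraints dynamically in complex environments, and solutions built on static planning adapt poorly.
🛠️ Method: ATLAS improves planning through dedicated mechanisms for dynamic constraint management, iterative plan critique, and adaptive interleaved search.
📊 Data & Experiments: On the TravelPlanner benchmark, ATLAS raises the final pass rate from 23.3% to 44.4% over the best alternative. In a realistic setting with live information search and multi-turn feedback, it reaches an 84% final pass rate, versus 59% for ReAct and 27% for a monolithic agent.
⭐ Contributions: The first work to demonstrate quantitative effectiveness on real-world travel planning with live information search and multi-turn feedback.
Abstract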
While Large Language Models (LLMs) have shown remarkable advancements in reasoning and tool use, they often fail to generate optimal, grounded solutions under complex constraints. Real-world travel planning exemplifies these challenges, evaluating agents’ abilities to handle constraints that are explicit, implicit, and even evolving based on interactions with dynamic environments and user needs. In this paper, we present ATLAS, a general multi-agent framework designed to handle the complex nature of constraint awareness in real-world travel planning tasks. ATLAS introduces a principled approach to address the fundamental challenges of constraint-aware planning through dedicated mechanisms for dynamic constraint management, iterative plan critique, and adaptive interleaved search. ATLAS demonstrates state-of-the-art performance on the TravelPlanner benchmark, improving the final pass rate from 23.3% to 44.4% over its best alternative. More importantly, our work is the first to demonstrate quantitative effectiveness on real-world travel planning tasks with live information search and multi-turn feedback. In this realistic setting, ATLAS showcases its superior overall planning performance, achieving an 84% final pass rate, which significantly outperforms baselines including ReAct (59%) and a monolithic agent (27%).
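Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Large Language Model #Reinforcement Learning #Geometry Agent Prover
TL;DR: We propose Complexity Curriculum Reinforcement Learning to train LLMs to solve IMO-level geometry with minimal data, surpassing gold medalists and showing emergent creativity.
🎯 Motivation: Existing AI geometry solvers depend on large-scale data synthesis and search and have weak heuristics for auxiliary constructions, which limits both their effectiveness and their creativity.
❓ Problem: Proposes a method that lets an LLM agent solve International Mathematical Olympiad (IMO) geometry problems above the average gold-medalist level while proposing creative auxiliary constructions.
🔍 Analysis: Expert models for geometry rely on data at scale and are limited by weak construction heuristics; iterative interaction with a symbolic engine and gradually increasing problem complexity improve the depth and quality of solutions.
🛠️ Method: Uses Complexity-Boosting Reinforcement Learning (CBRL) to increase the complexity of synthesized problems across training stages, combined with a dynamic memory mechanism and interaction with a symbolic engine, iteratively proposing propositions and auxiliary constructions and reflecting on the engine's feedback.
📊 Data & Experiments: With only 13K training examples, it solves 44 of 50 IMO geometry problems from 2000-2024, using 0.004% of the data used by AlphaGeometry 2.
⭐ Contributions: Introduces InternGeometry, which substantially improves LLM performance on expert-level geometry and can propose novel auxiliary constructions that do not appear in human solutions.
Abstract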
Large language model (LLM) agents exhibit strong mathematical problem-solving abilities and can even solve International Mathematical Olympiad (IMO) level problems with the assistance of formal proof systems. However, due to weak heuristics for auxiliary constructions, AI for geometry problem solving remains dominated by expert models such as AlphaGeometry 2, which rely heavily on large-scale data synthesis and search for both training and evaluation.
In this work, we make the first attempt to build a medalist-level LLM agent for geometry and present InternGeometry. InternGeometry overcomes the heuristic limitations in geometry by iteratively proposing propositions and auxiliary constructions, verifying them with a symbolic engine, and reflecting on the engine's feedback to guide subsequent proposals. A dynamic memory mechanism enables InternGeometry to conduct more than two hundred interactions with the symbolic engine per problem. To further accelerate learning, we introduce Complexity-Boosting Reinforcement Learning (CBRL), which gradually increases the complexity of synthesized problems across training stages.
Built on InternThinker-32B, InternGeometry solves 44 of 50 IMO geometry problems (2000-2024), exceeding the average gold medalist score (40.9), using only 13K training examples, just 0.004% of the data used by AlphaGeometry 2, demonstrating the potential of LLM agents on expert-level geometry tasks. InternGeometry can also propose novel auxiliary constructions for IMO problems that do not appear in human solutions.
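Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Multi-Agent #Adaptive Collaboration #Policy Optimization #Large Language Models
🎯 Motivation: Scaling individual LLMs has delivered remarkable progress, and scaling collaboration through multi-agent systems (MAS) is the next frontier. Purely autonomous MAS are bounded by the static knowledge horizon of pre-trained models and fail easily on novel challenges.
❓ Problem: Proposes the HILA framework for human-agent collaboration, which learns a metacognitive policy to balance autonomous problem solving against deferring to a human expert.
🔍 Analysis: Purely autonomous MAS cannot handle tasks that require knowledge beyond their training data; bringing in a human expert fills these gaps and avoids collective failure on hard tasks.
🛠️ Method: Designs Dual-Loop Policy Optimization: the inner loop applies Group Relative Policy Optimization (GRPO) with a cost-aware reward to optimize deferral decisions, and the outer loop performs continual learning, turning expert feedback into supervised signals that strengthen reasoning.
📊 Data & Experiments: Experiments on mathematical and problem-solving benchmarks show that HILA with Dual-Loop Policy Optimization consistently outperforms advanced MAS.
⭐ Contributions: A new multi-agent collaboration paradigm that incorporates human experts and uses continual learning for long-term capability growth, establishing a principled foundation for collaborative, continually improving agentic systems.
Abstract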
While scaling individual Large Language Models (LLMs) has delivered remarkable progress, the next frontier lies in scaling collaboration through multi-agent systems (MAS). However, purely autonomous MAS remain ``closed-world'' systems, constrained by the static knowledge horizon of pre-trained models. This limitation makes them brittle on tasks requiring knowledge beyond training data, often leading to collective failure under novel challenges. To address this, we propose the Human-In-the-Loop Multi-Agent Collaboration (HILA) framework, a principled paradigm for human--agent collaboration. HILA trains agents to learn a metacognitive policy that governs when to solve problems autonomously and when to defer to a human expert. To operationalize this policy, we introduce Dual-Loop Policy Optimization, which disentangles immediate decision-making from long-term capability growth. The inner loop applies Group Relative Policy Optimization (GRPO) with a cost-aware reward to optimize deferral decisions, while the outer loop implements continual learning, transforming expert feedback into high-quality supervised signals that strengthen the agent's reasoning ability. Experiments on challenging mathematical and problem-solving benchmarks show that HILA, equipped with Dual-Loop Policy Optimization, consistently outperforms advanced MAS, establishing a principled foundation for collaborative and continually improving agentic systems.
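The inner loop described above pairs GRPO with a cost-aware reward. The following is a minimal sketch of that combination, assuming a reward of task correctness minus a fixed penalty for deferring to the expert. The penalty value, the reward shape, and the use of population standard deviation are illustrative assumptions, not details taken from the abstract.

```python
from statistics import mean, pstdev


def cost_aware_reward(correct: bool, deferred: bool, defer_cost: float = 0.3) -> float:
    """Hypothetical reward: 1 for a correct outcome, minus a cost when a human is asked."""
    return (1.0 if correct else 0.0) - (defer_cost if deferred else 0.0)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to the same prompt:
# solved alone, solved by deferring, failed alone, failed after deferring.
rewards = [
    cost_aware_reward(True, False),
    cost_aware_reward(True, True),
    cost_aware_reward(False, False),
    cost_aware_reward(False, True),
]
advantages = group_relative_advantages(rewards)
```

With this shaping, solving a problem autonomously is preferred over solving it by deferral, and deferring is still preferred over failing, which is the trade-off a deferral policy has to learn.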
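Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#LLM Agent #Agentic System #Failure Attribution
🎯 Motivation: LLM-based agentic systems, which combine multiple models, tool calls, and orchestration protocols, outperform monolithic agents, but the same complexity makes them more fragile, so error sources must be attributed precisely to improve them.
❓ Problem: Current reasoning LLMs perform poorly on agentic system failure attribution, with accuracy generally below 10%, so a more effective diagnostic framework is needed.
🔍 Analysis: Errors in multi-agent systems are diverse and involve complex tool invocations and coordination protocols; existing systems cannot pinpoint the responsible agent or step within long execution traces.
🛠️ Method: Proposes the AgenTracer framework, which annotates failed trajectories through counterfactual replay and programmed fault injection to build the TracerTraj dataset, and trains a lightweight diagnostic model, AgenTracer-8B, with multi-granular reinforcement learning.
📊 Data & Experiments: Using TracerTraj and the Who&When benchmark, experiments show that AgenTracer-8B diagnoses errors accurately and yields 4.8% to 14.2% performance gains for multi-agent systems such as MetaGPT and MaAS.
⭐ Contributions: The first automated framework for annotating failed multi-agent trajectories, plus AgenTracer-8B, which outperforms giant proprietary LLMs by up to 18.18% on failure attribution and supports self-correcting, self-evolving agentic systems.
Abstract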
Large Language Model (LLM)-based agentic systems, often comprising multiple models, complex tool invocations, and orchestration protocols, substantially outperform monolithic agents. Yet this very sophistication amplifies their fragility, making them more prone to system failure. Pinpointing the specific agent or step responsible for an error within long execution traces defines the task of \textbf{agentic system failure attribution}. Current state-of-the-art reasoning LLMs, however, remain strikingly inadequate for this challenge, with accuracy generally below $10\%$. To address this gap, we propose AgenTracer, the first automated framework for annotating failed multi-agent trajectories via counterfactual replay and programmed fault injection, producing the curated dataset TracerTraj. Leveraging this resource, we develop AgenTracer-8B, a lightweight failure tracer trained with multi-granular reinforcement learning, capable of efficiently diagnosing errors in verbose multi-agent interactions. On the Who\&When benchmark, AgenTracer-8B outperforms giant proprietary LLMs like Gemini-2.5-Pro and Claude-4-Sonnet by up to $18.18\%$, setting a new standard in LLM agentic failure attribution. More importantly, AgenTracer-8B delivers actionable feedback to off-the-shelf multi-agent systems like MetaGPT and MaAS with $4.8\sim14.2\%$ performance gains, empowering self-correcting and self-evolving agentic AI.
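Counterfactual replay can be illustrated with a toy example: patch one step of a failed trajectory with a corrected action, re-run, and label the earliest step whose correction flips the outcome to success. The sketch below is an assumption-laden simplification; in particular it keeps later steps fixed instead of regenerating them, and `corrected_action` and `replay` are hypothetical callables standing in for an oracle and an environment.

```python
from typing import Callable, Optional, Sequence, Tuple

Step = Tuple[str, str]  # (agent_name, action)


def attribute_failure(
    trajectory: Sequence[Step],
    corrected_action: Callable[[int, Step], str],
    replay: Callable[[Sequence[Step]], bool],
) -> Optional[Tuple[int, str]]:
    """Return (step_index, agent) of the earliest step whose fix makes the run succeed."""
    if replay(trajectory):
        return None  # nothing to attribute: the run already succeeds
    for t, (agent, _) in enumerate(trajectory):
        patched = list(trajectory)
        patched[t] = (agent, corrected_action(t, trajectory[t]))
        if replay(patched):
            return t, agent
    return None  # no single-step fix recovers the run


def toy_replay(steps: Sequence[Step]) -> bool:
    """Toy environment: the task succeeds only if every action is 'ok'."""
    return all(action == "ok" for _, action in steps)


def toy_fix(index: int, step: Step) -> str:
    """Toy oracle: the corrected action is always 'ok'."""
    return "ok"


failed_run = [("planner", "ok"), ("coder", "bad"), ("tester", "ok")]
culprit = attribute_failure(failed_run, toy_fix, toy_replay)
```

Programmed fault injection works in the opposite direction: a known error is inserted into a successful run, so the label is known by construction.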
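Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Mathematical reasoning #Large Language Models #Reinforcement Learning #Agent
🎯 Motivation: Existing large reasoning models fall short on accuracy and computational efficiency for complex mathematical problems and need computational tools to improve.
❓ Problem: Proposes a framework that combines a language model's reasoning ability with a code interpreter's computational precision to solve problems involving complex mathematical operations efficiently and accurately.
🔍 Analysis: Long chain-of-thought models reason well in natural language, but on mathematical problems they face low accuracy and high resource cost, compounded by scarce tool-use data and complex computation.
🛠️ Method: Contributes (1) an automated method for generating tool-augmented trajectory data, (2) an agentic reinforcement learning framework that interleaves generation with real-time code execution, and (3) an efficient training system that supports multi-round interactive feedback and achieves a 4-5× speedup.
📊 Data & Experiments: Evaluated on competition benchmarks AIME24, AIME25, and HMMT25, where AgentMath-30B-A3B reaches 90.6%, 86.4%, and 73.8% accuracy respectively, substantially outperforming open-source models of comparable size.
⭐ Contributions: A tool-augmented agent framework that markedly improves the mathematical reasoning of language models and lays groundwork for scalable mathematical reasoning agents.
Abstract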
Large Reasoning Models (LRMs) like o3 and DeepSeek-R1 have achieved remarkable progress in natural language reasoning with long chain-of-thought. However, they remain computationally inefficient and struggle with accuracy when solving problems requiring complex mathematical operations. In this work, we present AgentMath, an agent framework that seamlessly integrates language models' reasoning capabilities with code interpreters' computational precision to efficiently tackle complex mathematical problems. Our approach introduces three key innovations: (1) An automated method that converts natural language chain-of-thought into structured tool-augmented trajectories, generating high-quality supervised fine-tuning (SFT) data to alleviate data scarcity; (2) A novel agentic reinforcement learning (RL) paradigm that dynamically interleaves natural language generation with real-time code execution. This enables models to autonomously learn optimal tool-use strategies through multi-round interactive feedback, while fostering emergent capabilities in code refinement and error correction; (3) An efficient training system incorporating innovative techniques, including request-level asynchronous rollout scheduling, agentic partial rollout, and prefix-aware weighted load balancing, achieving a 4-5× speedup and making efficient RL training feasible on ultra-long sequences in scenarios with massive tool calls. Extensive evaluations show that AgentMath achieves state-of-the-art performance on challenging mathematical competition benchmarks including AIME24, AIME25, and HMMT25, substantially outperforming frontier open-source models of comparable size. Specifically, AgentMath-30B-A3B attains 90.6\%, 86.4\%, and 73.8\% accuracy respectively, surpassing OpenAI-o3-mini and Claude-Opus-4.0-Thinking while remaining competitive with OpenAI-o3, Gemini-2.5-Pro, and DeepSeek-R1-671B-0528.
These results validate the effectiveness of our approach and pave the way for building scalable mathematical reasoning agents.
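The interleaving of natural language generation with code execution can be sketched as a simple loop: generate a turn, run any code it contains, append the output, and continue until a turn contains no code. Everything below is a toy under stated assumptions: the `<code>` and `<output>` tags, the `scripted_model` stand-in for an LLM, and the use of `exec` (which is unsafe outside a sandbox) are illustrative choices, not the AgentMath implementation.

```python
import contextlib
import io
import re
from typing import Callable

CODE_TAG = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def run_snippet(source: str) -> str:
    """Execute a snippet and capture stdout; real systems would sandbox this."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(source, {})
    except Exception as exc:  # feed errors back so the model can repair its code
        return f"Error: {type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def solve(question: str, generate: Callable[[str], str], max_rounds: int = 4) -> str:
    """Alternate between model turns and tool outputs until no code is emitted."""
    transcript = question
    for _ in range(max_rounds):
        turn = generate(transcript)
        transcript += "\n" + turn
        match = CODE_TAG.search(turn)
        if match is None:
            break
        transcript += "\n<output>" + run_snippet(match.group(1)) + "</output>"
    return transcript


def scripted_model(transcript: str) -> str:
    """Stand-in for an LLM: first writes code, then reads the tool output."""
    if "<output>" not in transcript:
        return "Let me compute this. <code>print(sum(i * i for i in range(1, 11)))</code>"
    result = transcript.rsplit("<output>", 1)[1].split("</output>")[0]
    return f"The answer is {result}."


final_transcript = solve("What is 1^2 + 2^2 + ... + 10^2?", scripted_model)
```

Returning error messages to the model instead of raising is what gives an RL-trained policy the chance to practice code refinement and error correction.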
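Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Synthetic data #Computer-use agents #Scalable
TL;DR: We present AgentSynth, a scalable pipeline that automatically generates diverse and realistic computer-use tasks and trajectories.
🎯 Motivation: Training generalist computer-use agents depends on human-annotated data, which is expensive and hard to scale, so a cost-effective way to generate tasks is needed.
❓ Problem: Builds an automated, efficient, and scalable pipeline that generates diverse, realistic computer-use tasks and trajectory data for training generalist agents.
🔍 Analysis: When evaluated across increasing difficulty levels, the success rate of current LLM agents drops sharply, from 18% at level 1 to 4% at level 6, confirming the difficulty and discriminative power of the generated tasks.
🛠️ Method: Exploits information asymmetry: subtasks that are simple at generation time are composed into challenging long-horizon tasks, and task complexity is controlled by varying the number of subtasks.
📊 Data & Experiments: Produces a dataset of more than 6,000 tasks at an average cost of only $0.60 per trajectory, and benchmarks agents on it to validate data quality and task difficulty.
⭐ Contributions: The AgentSynth pipeline, which reduces annotation cost and enables diverse task generation, with code and data released publicly.
Abstract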
We introduce AgentSynth, a scalable and cost-efficient pipeline for automatically synthesizing high-quality tasks and trajectory datasets for generalist computer-use agents. Leveraging information asymmetry, AgentSynth constructs subtasks that are simple during generation but significantly more challenging when composed into long-horizon tasks, enabling the creation of over 6,000 diverse and realistic tasks. A key strength of AgentSynth is its ability to precisely modulate task complexity by varying the number of subtasks. Empirical evaluations show that state-of-the-art LLM agents suffer a steep performance drop, from 18\% success at difficulty level 1 to just 4\% at level 6, highlighting the benchmark's difficulty and discriminative power. Moreover, our pipeline achieves a low average cost of \$0.60 per trajectory, orders of magnitude cheaper than human annotations. Our code and data are available at https://github.com/sunblaze-ucb/AgentSynth
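Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#LLM Agents #Context Engineering #Continual Learning #Agent Memory #Test-Time Scaling #Self-Improving LLMs
🎯 Motivation: LLM applications such as agents and domain-specific reasoning rely on context adaptation, but existing methods trade away detail for brevity and lose information over time, hurting both performance and efficiency.
❓ Problem: Addresses the loss of domain insight caused by over-concise contexts and the erosion of detail caused by iterative rewriting, improving context adaptation.
🔍 Analysis: Identifies brevity bias and context collapse as two failure modes that weaken models on complex tasks and motivates a modular, incremental update scheme to counter them.
🛠️ Method: Proposes the ACE framework, which combines generation, reflection, and curation modules and maintains the context through structured, incremental updates.
📊 Data & Experiments: On agent and finance benchmarks, ACE outperforms strong baselines by +10.6% and +8.6% respectively, while significantly reducing adaptation latency and rollout cost.
⭐ Contributions: ACE adapts without labeled supervision by using natural execution feedback, matches the top-ranked production-level agent on the AppWorld leaderboard average and surpasses it on the harder test-challenge split with a smaller open-source model, and shows that low-overhead self-improvement is feasible.
Abstract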
Large language model (LLM) applications such as agents and domain-specific reasoning increasingly rely on *context adaptation*: modifying inputs with instructions, strategies, or evidence, rather than weight updates.
Prior approaches improve usability but often suffer from brevity bias, which drops domain insights for concise summaries, and from context collapse, where iterative rewriting erodes details over time.
We introduce ACE (**A**gentic **C**ontext **E**ngineering), a framework that treats contexts as evolving playbooks that accumulate, refine, and organize strategies through a modular process of generation, reflection, and curation.
ACE prevents collapse with structured, incremental updates that preserve detailed knowledge and scale with long-context models.
Across agent and domain-specific benchmarks, ACE optimizes contexts both offline (e.g., system prompts) and online (e.g., agent memory), consistently outperforming strong baselines: +10.6\% on agents and +8.6\% on finance, while significantly reducing adaptation latency and rollout cost.
Notably, ACE can adapt effectively without labeled supervision, instead leveraging natural execution feedback.
On the AppWorld leaderboard, ACE matches the top-ranked production-level agent on the overall average and surpasses it on the harder test-challenge split, despite using a smaller open-source model.
These results show that comprehensive, evolving contexts enable scalable, efficient, and self-improving LLM systems with low overhead.
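The contrast between incremental updates and wholesale rewriting can be made concrete with a small data structure. The sketch below treats the playbook as a keyed collection of strategy bullets that is changed only through small delta operations, so existing entries are never paraphrased away. The `Bullet` and `Playbook` classes, the `add`/`vote` operations, and the pruning rule are all hypothetical and are not taken from the ACE paper.

```python
from dataclasses import dataclass, field


@dataclass
class Bullet:
    text: str
    helpful: int = 0
    harmful: int = 0


@dataclass
class Playbook:
    bullets: dict[str, Bullet] = field(default_factory=dict)

    def apply_delta(self, delta: list[dict]) -> None:
        """Apply small edits in place; untouched bullets keep their exact wording."""
        for op in delta:
            if op["op"] == "add":
                self.bullets.setdefault(op["id"], Bullet(op["text"]))
            elif op["op"] == "vote":
                bullet = self.bullets[op["id"]]
                bullet.helpful += op.get("helpful", 0)
                bullet.harmful += op.get("harmful", 0)
            else:
                raise ValueError(f"unknown operation: {op['op']}")

    def prune(self) -> None:
        """Drop bullets that execution feedback marked harmful more often than helpful."""
        self.bullets = {k: b for k, b in self.bullets.items() if b.harmful <= b.helpful}

    def render(self) -> str:
        """Serialize the playbook for inclusion in a prompt."""
        return "\n".join(f"- [{k}] {b.text}" for k, b in self.bullets.items())


playbook = Playbook()
playbook.apply_delta([{"op": "add", "id": "s1", "text": "Check API pagination before looping."}])
playbook.apply_delta([
    {"op": "add", "id": "s2", "text": "Retry once on timeout."},
    {"op": "vote", "id": "s1", "helpful": 2},
    {"op": "vote", "id": "s2", "harmful": 1},
])
playbook.prune()
```

Because each update touches only the entries it names, the cost of an adaptation step scales with the size of the delta, not with the size of the whole context.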
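Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#reinforcement learning #large language model #agent #process reward
TL;DR: We propose a general credit-assignment strategy for LLM agent reinforcement learning in interactive environments with implicit step rewards.
🎯 Motivation: Training LLMs as autonomous agents with reinforcement learning suffers from sparse and sometimes unverifiable rewards, which makes credit assignment in interactive environments difficult.
❓ Problem: Proposes a general credit-assignment strategy that uses implicit step-level rewards to improve training under sparse rewards while avoiding the annotation bias and high variance of existing process-supervision methods.
🔍 Analysis: Theoretical analysis shows that the learning objective yields a step-wise reward function learned from trajectory preferences; experiments additionally show better training stability and sample efficiency.
🛠️ Method: Alternately optimizes an implicit process reward model and the policy model, generates implicit step rewards through a multi-turn DPO objective, and combines the resulting step-level advantages with trajectory-level advantages for policy updates.
📊 Data & Experiments: Validated on three challenging agent benchmarks, WebShop, VisualSokoban, and SOTOPIA, covering structured tasks as well as open-ended social interaction with unverifiable rewards.
⭐ Contributions: iStar achieves state-of-the-art results across domains with higher sample efficiency and training stability, and explores efficiently, reaching task success in fewer steps.
Abstract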
Large language models (LLMs) are increasingly developed as autonomous agents using reinforcement learning (agentic RL) that reason and act in interactive environments.
However, sparse and sometimes unverifiable rewards make it extremely challenging to assign credit when training LLM agents that serve as a policy.
Recent work attempts to integrate process supervision into RL but suffers from biased annotation, reward hacking, high variance from overly fine-grained rewards, or failures when state overlap is rare.
We therefore introduce implicit step rewards for agentic RL (**iStar**), a general credit-assignment strategy that integrates seamlessly with standard RL algorithms without relying on additional rollouts or explicit step labels.
In particular, we alternately optimize an implicit process reward model (PRM) and the policy model, generating step rewards for each action via a multi-turn DPO objective. Theoretical analysis shows that this learning objective produces a step-wise reward function learned from trajectory preferences.
Then the implicit step rewards are used to compute step-level advantages, which are combined with trajectory (or episode)-level advantages for policy updates, creating a self-reinforcing training loop.
We evaluate our method on three challenging agent benchmarks, including WebShop and VisualSokoban, as well as open-ended social interactions with unverifiable rewards in SOTOPIA.
Crucially, our method shows superior performance over frontier LLMs and strong RL baselines across domains, achieving state-of-the-art results with higher sample-efficiency and training stability.
Further analysis also demonstrates efficient exploration by **iStar**, with increased rewards at both the step and episode levels while requiring fewer steps to achieve task success.
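A minimal sketch of how implicit step rewards might feed into advantages follows. It assumes the usual DPO-style implicit reward, $\beta \log \frac{\pi_\phi(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}$, and an additive combination of standardized step-level and episode-level advantages with weight `alpha`. The abstract does not give these formulas, so both the reward form and the combination rule are assumptions.

```python
from statistics import mean, pstdev


def implicit_step_rewards(logp_prm: list[float], logp_ref: list[float], beta: float = 0.1) -> list[float]:
    """Assumed implicit reward per action: beta * (log pi_prm - log pi_ref)."""
    return [beta * (p - r) for p, r in zip(logp_prm, logp_ref)]


def _standardize(values: list[float], eps: float = 1e-8) -> list[float]:
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def combined_advantages(
    step_rewards: list[list[float]],
    episode_rewards: list[float],
    alpha: float = 1.0,
) -> list[list[float]]:
    """Per-action advantage = episode-level advantage + alpha * step-level advantage."""
    episode_adv = _standardize(episode_rewards)
    flat_step_adv = _standardize([r for traj in step_rewards for r in traj])
    result, offset = [], 0
    for i, traj in enumerate(step_rewards):
        result.append([episode_adv[i] + alpha * flat_step_adv[offset + j] for j in range(len(traj))])
        offset += len(traj)
    return result


# Two sampled trajectories of two actions each; the first one succeeded.
step_rewards = [
    implicit_step_rewards([-1.0, -2.0], [-1.5, -1.5]),
    implicit_step_rewards([-3.0, -1.5], [-2.0, -2.0]),
]
advantages = combined_advantages(step_rewards, episode_rewards=[1.0, 0.0])
```

The step-level term distinguishes good and bad actions inside one trajectory, which the episode-level term alone cannot do because it assigns every action the same value.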
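Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Alpha Mining #Agentic AI #Quantitative Investment #Self-evolving Agent
TL;DR: AlphaAgentEvo introduces a new evolution-oriented paradigm for alpha mining via self-evolving agentic reinforcement learning, outperforming traditional and LLM baselines, even surpassing state-of-the-art LLMs with only 1.7B to 4B parameters.
🎯 Motivation: Alpha mining searches a vast, noisy space for predictive factors, but traditional evolution-based methods struggle to evolve alphas systematically and are inefficient and hard to interpret.
❓ Problem: Existing methods cannot interpret natural language instructions or extract useful insight from failed attempts, and multi-agent approaches fall into repetitive evolutionary routines because they lack long-term planning and reflection.
🔍 Analysis: Traditional methods such as Genetic Programming lack language understanding, and multi-agent methods evolve inefficiently and cannot actively react to changes in the underlying state, such as market regime shifts.
🛠️ Method: Proposes AlphaAgentEvo, a self-evolving agentic reinforcement learning framework in which a hierarchical reward guides the agent from basic requirements toward harder objectives, developing long-horizon planning and reflective reasoning for continuous evolution.
📊 Data & Experiments: Experiments show more efficient alpha evolution and more diverse, transferable alphas; with only 4B parameters, the framework outperforms LLM-driven evolution methods configured with state-of-the-art closed-source reasoning models.
⭐ Contributions: A systematically self-evolving agent that improves the efficiency and quality of alpha mining, offered as a step toward more principled and scalable quantitative research.
Abstract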
Alpha mining seeks to identify predictive alpha factors that generate excess returns relative to the market from a vast and noisy search space; however, existing evolution-based approaches struggle to facilitate the systematic evolution of alphas. Traditional methods, such as Genetic Programming (GP), cannot interpret natural language instructions and often fail to extract valuable insights from unsuccessful attempts, leading to low interpretability and inefficient exploration. Analogously, without mechanisms for systematic evolution, e.g., long-term planning and reflection, existing multi-agent approaches may easily fall into repetitive evolutionary routines, resulting in inefficient evolution. To overcome these limitations, we introduce AlphaAgentEvo, a self-evolving Agentic Reinforcement Learning (ARL) framework for alpha mining, which moves alpha mining beyond the brittle search-backtest-restart cycle toward a continuous trajectory of evolution. Guided by a hierarchical reward function, our agent engages in self-exploration of the search space, progressively learning basic requirements (e.g., valid tool calls) and then harder objectives (e.g., continuous performance improvements). Through this process, the agent acquires advanced behaviors such as long-horizon planning and reflective reasoning, which enable it to actively react to the underlying state (e.g., market regime shifts) and realize a self-evolving agent, marking a step toward more principled and scalable alpha mining. Extensive experiments demonstrate that AlphaAgentEvo achieves more efficient alpha evolution and generates diverse and transferable alphas, consistently surpassing a wide range of baselines. Notably, with only 4B parameters, it outperforms LLM-driven evolution methods configured with state-of-the-art closed-source reasoning models, highlighting the promise of ARL for next-generation alpha mining.
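Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#information bottleneck #rate-distortion theory #agentic collaboration #large language models #scaling laws
TL;DR: We frame agentic language model systems as an information bottleneck problem, deriving scaling laws and practical design principles for efficient collaboration between LMs.
🎯 Motivation: The design of multi-LM systems lacks information-theoretic guidance; the work studies how compressor and predictor models can cooperate to improve efficiency and performance.
❓ Problem: Examines how compressor and predictor design choices shape downstream performance and proposes a task-independent, information-theoretic measure of compression quality.
🔍 Analysis: Larger compressors are markedly more accurate, more token-efficient, and convey more information per token than smaller ones; scaling the compressor improves system performance more than scaling the predictor.
🛠️ Method: Views the compressor as a noisy channel and introduces an estimator of the mutual information between the context and its compression, used to quantify compression quality.
📊 Data & Experiments: An empirical analysis across five datasets and three model families shows that mutual information predicts downstream performance independently of the task and characterizes the effect of model scale.
⭐ Contributions: An information-bottleneck framing for multi-LM system design, the finding that scaling compressors beats scaling predictors, and an information-theoretic metric that reduces the cost of design sweeps.
Abstract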
Agentic language model (LM) systems power modern applications like "Deep Research" and "Claude Code," and leverage multi-LM architectures to overcome context limitations.
Beneath their apparent diversity lies a recurring pattern: smaller "compressor" LMs (that can even run locally) distill raw context into compact text that is then consumed by larger "predictor" LMs.
Despite their popularity, the design of compressor-predictor systems remains largely ad hoc, with little guidance on how compressor and predictor choices shape downstream performance.
In practice, attributing gains to compression versus prediction requires costly, task-specific pairwise sweeps.
We argue that these agentic system design questions are, at root, information-theoretic. Viewing the compressor LM as a noisy channel, we introduce a simple estimator of mutual information between the context and its compression to quantify compression quality in a task-independent way.
We show that mutual information strongly predicts downstream performance, independent of any specific task. Through an information-theoretic framework, we perform a comprehensive empirical analysis across five datasets and three model families. Results reveal that larger compressors are not only more accurate but also more token-efficient, conveying more bits of information per token.
A 7B Qwen-2.5 compressor, for instance, is $1.6\times$ more accurate, $4.6\times$ more concise, and conveys $5.4\times$ more bits of mutual information per token than its 1.5B sibling. Across datasets, scaling compressors is substantially more effective than scaling predictors, enabling larger on-device compressors to pair with smaller cloud predictors.
Applied to a Deep Research system, these principles enable local compressors as small as 3B parameters to recover 99% of frontier-LM accuracy at 26% of API costs.
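One simple way to estimate the mutual information between contexts $X$ and compressions $Z$ is a Monte Carlo estimate of $\mathbb{E}[\log p(z \mid x) - \log p(z)]$, where the marginal $p(z)$ is approximated by averaging $p(z \mid x_j)$ over the sampled contexts. The sketch below implements that generic estimator from a matrix of log-probabilities. It is an assumption that the paper's estimator has this form, and in practice the matrix entries would come from scoring each compression under the compressor LM.

```python
import math


def logsumexp(values: list[float]) -> float:
    """Numerically stable log of a sum of exponentials."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def mutual_information_bits(logp: list[list[float]]) -> float:
    """Estimate I(X; Z) in bits.

    logp[i][j] is log p(z_i | x_j): the log-probability (natural log) that the
    compressor would emit compression z_i when given context x_j. Entry
    logp[i][i] scores each compression against the context it came from.
    """
    n = len(logp)
    total = 0.0
    for i, row in enumerate(logp):
        log_marginal = logsumexp(row) - math.log(n)  # log of mean_j p(z_i | x_j)
        total += row[i] - log_marginal
    return total / n / math.log(2)


# Compressions that identify their context exactly carry log2(4) = 2 bits.
NEG = -1e9  # stands in for log(0)
informative = [[0.0 if i == j else NEG for j in range(4)] for i in range(4)]

# Compressions that are equally likely under every context carry 0 bits.
uninformative = [[-2.0] * 4 for _ in range(4)]
```

An estimator of this form cannot exceed $\log_2 N$ bits for $N$ sampled contexts, so the sample size bounds the values it can report.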
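Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Lean 4 #Autoformalization #LLM #Graph-of-Thought #Retrieval Augmented Generation
🎯 Motivation: Auto-formalization of theorem statements is essential for automated discovery and verification of research-level mathematics, but current LLMs hallucinate, produce semantic mismatches, and cannot synthesize new definitions.
❓ Problem: To improve semantic accuracy and logical consistency in auto-formalization, the work proposes a method that emulates the reasoning process of human experts.
🔍 Analysis: Current LLMs cannot perform fine-grained semantic checks on complex mathematical statements, and simple checking mechanisms often lead to failed formalizations.
🛠️ Method: Designs a two-phase Graph-of-Thought process that recursively decomposes a statement into a dependency graph and then builds the formalization from grounded concepts, and introduces AriaScorer, a checker that retrieves definitions from Mathlib for term-level grounding.
📊 Data & Experiments: On ProofNet, Aria reaches a 91.6% compilation success rate and 68.5% final accuracy; on FATE-X it scores 44.0% versus 24.0% for the best baseline; on a dataset of homological conjectures it reaches 42.9% final accuracy while all other models score 0%.
⭐ Contributions: The Aria auto-formalization system and the AriaScorer semantic checker, which improve the accuracy and reliability of formalizing complex mathematical statements across several benchmarks.
Abstract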
Accurate auto-formalization of theorem statements is essential for advancing automated discovery and verification of research-level mathematics, yet remains a major bottleneck for LLMs due to hallucinations, semantic mismatches, and their inability to synthesize new definitions.
To tackle these issues, we present Aria (**A**gent for **R**etrieval and **I**terative **A**utoformalization), a system for conjecture-level formalization in Lean that emulates human expert reasoning via a two-phase Graph-of-Thought process: recursively decomposing statements into a dependency graph and then constructing formalizations from grounded concepts. To ensure semantic correctness, we introduce **AriaScorer**, a checker that retrieves definitions from Mathlib for term-level grounding, enabling rigorous and reliable verification.
We evaluate Aria on diverse benchmarks. On ProofNet, it achieves 91.6\% compilation success rate and 68.5\% final accuracy, surpassing previous methods. On FATE-X, a suite of challenging algebra problems from research literature, it outperforms the best baseline with 44.0\% vs. 24.0\% final accuracy. On a dataset of homological conjectures, Aria reaches 42.9\% final accuracy while all other models score 0\%.
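The two-phase process implies an ordering constraint: a concept can only be formalized after the concepts it depends on are grounded. A minimal sketch of that ordering uses a topological sort over the dependency graph. The example graph is invented for illustration, and the real system's decomposition and grounding steps (LLM calls and Mathlib retrieval) are not modeled here.

```python
from graphlib import TopologicalSorter


def formalization_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return concepts so that every concept appears after all of its dependencies.

    `dependencies` maps each concept to the set of concepts it relies on.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical decomposition of a statement about exact sequences.
dependency_graph = {
    "statement": {"chain complex", "exact sequence"},
    "exact sequence": {"module homomorphism"},
    "chain complex": {"module homomorphism"},
    "module homomorphism": set(),
}

order = formalization_order(dependency_graph)
```

`TopologicalSorter` raises `graphlib.CycleError` on circular definitions, which is a useful signal that a decomposition step went wrong.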
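Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Agent #Evaluation #LLM
TL;DR: We propose a new method for inducing metrics for evaluating and improving agents from open-ended human feedback.
🎯 Motivation: Agents are evaluated mainly through task success metrics, which are coarse, depend on manual expert design, and cannot assess fine-grained behavior, limiting how much they help improve agents.
❓ Problem: Proposes a method that induces evaluation metrics from open-ended human feedback, capturing fine-grained agent behaviors and supporting optimization.
🔍 Analysis: Task success metrics do not reward intermediate emergent behaviors and therefore cannot fully account for the complexity of agent behavior.
🛠️ Method: The AutoLibra framework grounds human feedback to agent behavior, clusters similar positive and negative behaviors into concrete metrics, and uses those metrics to prompt LLM-as-a-Judge evaluators.
📊 Data & Experiments: Experiments optimize two meta-metrics, coverage and redundancy, and show that AutoLibra induces more concrete metrics than those in prior agent evaluation benchmarks and discovers new ones.
⭐ Contributions: A task-agnostic evaluation framework that helps human prompt engineers diagnose failures and also supports automatic agent optimization, improving language agents.
Abstract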
Agents are predominantly evaluated and optimized via task success metrics, which are coarse, rely on manual design from experts, and fail to reward intermediate emergent behaviors. We propose AutoLibra, a framework for agent evaluation that transforms open-ended human feedback, e.g., “If you find that the button is disabled, don’t click it again”, or “This agent has too much autonomy to decide what to do on its own”, into metrics for evaluating fine-grained behaviors in agent trajectories. AutoLibra accomplishes this by grounding feedback to an agent’s behavior, clustering similar positive and negative behaviors, and creating concrete metrics with clear definitions and concrete examples, which can be used for prompting LLM-as-a-Judge as evaluators. We further propose two meta-metrics to evaluate the alignment of a set of (induced) metrics with open feedback: “coverage” and “redundancy”. Through optimizing these meta-metrics, we experimentally demonstrate AutoLibra’s ability to induce more concrete agent evaluation metrics than the ones proposed in previous agent evaluation benchmarks and discover new metrics to analyze agents. We also present two applications of AutoLibra in agent improvement: First, we show that AutoLibra can help human prompt engineers diagnose agent failures and improve prompts iteratively. Moreover, we find that AutoLibra can induce metrics for automatic optimization of agents, which makes agents improve through self-regulation. Our results suggest that AutoLibra is a powerful task-agnostic tool for evaluating and improving language agents.
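Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#LLMs #Autonomous Agents #Agent Specialization
TL;DR: We introduce a framework that creates persistent, specialist agent teams through an offline lifecycle of discovery and cultivation, and deploys them with an online policy that efficiently adapts the team's structure for novel tasks.
🎯 Motivation: Existing automated agent design either lacks adaptability or cannot accumulate deep task expertise; a new approach is needed that makes agents persistent, adaptive, and efficient at once.
❓ Problem: Proposes a framework that automatically builds stateful teams of specialist agents that accumulate knowledge and adapt to new tasks without human intervention, reconciling adaptability with specialization.
🔍 Analysis: Contrasts static workflows, which lack adaptability, with per-query optimizers, which prevent the accumulation of agent-level expertise.
🛠️ Method: Introduces the ASpec framework, which discovers specialist archetypes via evolutionary search, cultivates their expertise through experience, and uses a lightweight hierarchical "retain-then-escalate" control policy to decide when to adapt the team structure.
📊 Data & Experiments: Experiments show significant gains on expert-level scientific benchmarks such as GPQA while matching the state of the art on broader-domain tasks.
⭐ Contributions: A framework that manages the full lifecycle of stateful specialist agent teams through discovery and cultivation, improving performance on expert tasks while remaining competitive on general tasks, with a planned code release.
Abstract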
Current automated agent design frameworks produce either static workflows that lack adaptability or per-query optimizers that prevent the accumulation of deep, agent-level task expertise. We propose a new direction that reconciles these paradigms: creating stateful teams of specialist agents that accumulate knowledge over time and can be reconfigured for novel tasks entirely without human intervention. To this end, we introduce \textsc{ASpec}, a framework that manages this full agent lifecycle by first autonomously \textbf{discovering} specialist archetypes via evolutionary search and then \textbf{cultivating} their expertise through experience, mirroring how human experts learn through practice and reflection. We further introduce a lightweight hierarchical control policy, "retain-then-escalate," which governs when to leverage the established agent system versus when to adapt its structure. Through comprehensive experiments, we demonstrate that this approach leads to significant performance gains on expert-level scientific benchmarks like GPQA while matching the state-of-the-art on broader domain tasks, demonstrating a promising path toward agent systems that are simultaneously expert, adaptive, and efficient. We will release the code at https://github.com/myanvoos/ASpec.
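Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#experimental design #Bayesian experimental design #BED #expected information gain #EIG #information gain #Bayesian #uncertainty #LLM #conversational agent #clarification #question asking
🎯 Motivation: Current LLMs are not adaptive or intelligent enough when gathering information and interacting, and need a method that improves this ability.
❓ Problem: Proposes a framework based on Bayesian experimental design (BED) that lets an LLM choose the questions or queries that maximize information gain, improving multi-turn conversation and interaction with external environments.
🔍 Analysis: Approaches based purely on prompting or on simple adaptive strategies do not systematically use the model's own predictive distributions to optimize question design, which limits their performance.
🛠️ Method: Iteratively selects questions or queries by building a probabilistic model from the LLM's predictive distributions and estimating the expected information gain (EIG).
📊 Data & Experiments: Tests based on the 20 Questions game and on active inference of user preferences show substantial gains for BED-LLM over purely prompting-based design generation and other adaptive design strategies.
⭐ Contributions: Combines LLMs with sequential Bayesian experimental design, gives a principled formulation and estimator of the EIG, and improves multi-turn information gathering.
Abstract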
We propose a general-purpose approach for improving the ability of large language models (LLMs) to intelligently and adaptively gather information from a user or other external source using the framework of sequential Bayesian experimental design (BED). This enables LLMs to act as effective multi-turn conversational agents and interactively interface with external environments. Our approach, which we call BED-LLM (Bayesian experimental design with large language models), is based on iteratively choosing questions or queries that maximize the expected information gain (EIG) with respect to a variable of interest given the responses gathered previously. We show how this EIG can be formulated (and then estimated) in a principled way using a probabilistic model derived from the LLM's predictive distributions and provide detailed insights into key decisions in its construction and updating procedure. We find that BED-LLM achieves substantial gains in performance across a wide range of tests based on the 20 Questions game and using the LLM to actively infer user preferences, compared to purely prompting-based design generation and other adaptive design strategies.
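For a yes/no question, the expected information gain reduces to the entropy of the predicted answer minus the expected entropy of the answer given each hypothesis. The sketch below computes it for a toy 20 Questions setup with a hand-written prior and answer likelihoods. In BED-LLM these probabilities would be derived from the LLM's predictive distributions, and the specific hypotheses and numbers here are invented for illustration.

```python
import math


def entropy_bits(probs: list[float]) -> float:
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(prior: dict[str, float], p_yes: dict[str, float]) -> float:
    """EIG of a yes/no question: H(answer) - E_h[H(answer | h)]."""
    marginal_yes = sum(prior[h] * p_yes[h] for h in prior)
    marginal_entropy = entropy_bits([marginal_yes, 1.0 - marginal_yes])
    conditional_entropy = sum(
        prior[h] * entropy_bits([p_yes[h], 1.0 - p_yes[h]]) for h in prior
    )
    return marginal_entropy - conditional_entropy


def posterior(prior: dict[str, float], p_yes: dict[str, float], answer_is_yes: bool) -> dict[str, float]:
    """Bayes update of the belief over hypotheses after observing the answer."""
    weights = {h: prior[h] * (p_yes[h] if answer_is_yes else 1.0 - p_yes[h]) for h in prior}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


prior = {"cat": 0.25, "dog": 0.25, "eagle": 0.25, "salmon": 0.25}
is_mammal = {"cat": 1.0, "dog": 1.0, "eagle": 0.0, "salmon": 0.0}
is_cat = {"cat": 1.0, "dog": 0.0, "eagle": 0.0, "salmon": 0.0}

eig_mammal = expected_information_gain(prior, is_mammal)  # splits the hypotheses in half
eig_cat = expected_information_gain(prior, is_cat)        # rules out only one hypothesis
belief = posterior(prior, is_mammal, answer_is_yes=True)
```

The question that halves the hypothesis space has the higher EIG, which matches the intuition behind good 20 Questions play; the sequential procedure repeats the selection after each belief update.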
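Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#LLM #multi-agent system #visualization
🎯 Motivation: Although deep research has advanced data analysis, data scientists still spend substantial time crafting visualizations by hand, and current methods handle complex datasets and iterative refinement poorly.
❓ Problem: Existing systems have limited capacity for complex datasets and cannot fully automate the path from an initial query to a high-quality visualization.
🔍 Analysis: Simple single-agent and multi-agent systems focus on parsing the initial query and fail to manage data complexity, code errors, and final visualization quality robustly.
🛠️ Method: Proposes CoDA, a multi-agent system with specialized LLM agents for metadata analysis, task planning, code generation, and self-reflection; metadata-focused analysis bypasses token limits, and quality-driven refinement ensures robustness.
📊 Data & Experiments: Extensive evaluations show that CoDA outperforms competitive baselines by up to 41.5% in overall score.
⭐ Contributions: Formalizes a collaborative multi-agent pipeline for visualization automation and shows that integrated agentic workflows can outperform isolated code generation.
Abstract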
Deep research has revolutionized data analysis, yet data scientists still devote substantial time to manually crafting visualizations, highlighting the need for robust automation from natural language queries. However, current systems struggle with complex datasets containing multiple files and iterative refinement. Existing approaches, including simple single- or multi-agent systems, often oversimplify the task, focusing on initial query parsing while failing to robustly manage data complexity, code errors, or final visualization quality. In this paper, we reframe this challenge as a collaborative multi-agent problem. We introduce CoDA, a multi-agent system that employs specialized LLM agents for metadata analysis, task planning, code generation, and self-reflection. We formalize this pipeline, demonstrating how metadata-focused analysis bypasses token limits and quality-driven refinement ensures robustness. Extensive evaluations show CoDA achieves substantial gains in the overall score, outperforming competitive baselines by up to 41.5%. This work demonstrates that the future of visualization automation lies not in isolated code generation but in integrated, collaborative agentic workflows.
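Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#LLM-based Agent #Multi-agent System
🎯 Motivation: Explore how LLM-based agents can keep improving after pretraining through self-evolution, mimicking how humans learn through discussion and collaboration.
❓ Problem: Current reinforcement learning (RL) methods rely on dense external rewards or on intrinsic rewards extracted from the LLM itself; both differ from human self-evolution and lack a mechanism for improving through interaction.
🔍 Analysis: Human self-improvement typically arises from collaboration and discussion; external supervision or intrinsic reward signals alone struggle to match this.
🛠️ Method: CoMAS generates intrinsic rewards from rich discussion dynamics, uses an LLM-as-a-judge mechanism to formulate them, and optimizes each agent's policy with RL, enabling decentralized co-evolution without external supervision.
📊 Data and experiments: CoMAS consistently outperforms untrained agents and reaches state-of-the-art performance in most evaluation settings; ablations confirm that interaction-based reward signals are necessary and show promising scalability as the number and diversity of agents grow.
⭐ Contributions: An effective multi-agent self-evolution framework for LLM-based agents, offering a new paradigm for improving agents without external supervision.
Full abstract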
Self-evolution is a central research topic in enabling large language model (LLM)-based agents to continually improve their capabilities after pretraining. Recent research has witnessed a transition from reinforcement learning (RL)-free to RL-based methods. Current RL-based methods either rely on dense external reward signals or extract intrinsic reward signals from LLMs themselves. However, these approaches diverge from the self-evolution mechanisms observed in human intelligence, where individuals learn and improve through mutual discussion and collaboration. In this work, we introduce Co-Evolving Multi-Agent Systems (CoMAS), a novel framework that enables agents to improve autonomously by learning from inter-agent interactions without external supervision. CoMAS generates intrinsic rewards from rich discussion dynamics, employs an LLM-as-a-judge mechanism to formulate these rewards, and optimizes each agent's policy through RL, thereby enabling decentralized and scalable co-evolution. Experimental results demonstrate that CoMAS consistently outperforms untrained agents and achieves state-of-the-art performance across most evaluation settings. Ablation studies confirm the necessity of interaction-based reward signals and reveal promising scalability as the number and diversity of agents increase. These findings establish CoMAS as a novel and effective paradigm for self-evolution in LLM-based agents.
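Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#large language models #code world models #code generation #information set MCTS #planning #partial observability #two-player games #imperfect information games
TL;DR: Instead of using LLMs-as-a-policy to play games, we use LLMs to implement an explicit code world model and combine it with a planner to play games, including imperfect information ones.
🎯 Motivation: Prompting an LLM to generate game moves directly has drawbacks such as frequent illegal moves and strategically shallow play; new approaches are needed to improve verifiability, strategic depth, and adaptability.
❓ Problem: Generate an executable code world model (CWM) and combine it with a planning algorithm to handle strategic decision-making and imperfect information more effectively.
🔍 Analysis: Direct move generation relies on the LLM's implicit, fragile pattern matching, which leads to rule violations and shallow strategy.
🛠️ Method: Use the LLM to write the game model as Python code, covering state transitions, legal-move enumeration, and termination checks, and also to generate heuristic value functions and inference functions that make planning more efficient and estimate hidden state.
📊 Data and experiments: Evaluated on 10 games, 4 of them newly created for the paper; 5 are perfect-information and 5 are imperfect-information. The method outperforms or matches Gemini 2.5 Pro in 9 of the 10 games.
⭐ Contributions: Turns game rules and trajectories into a verifiable code model usable by high-performance planners; combines LLM semantic understanding with deep search for stronger strategic play; and adapts more easily to new games.
Full abstract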
The reasoning abilities of large language models (LLMs) are increasingly being applied to classical board and card games, but the dominant approach---involving prompting for direct move generation---has significant drawbacks. It relies on the model's implicit, fragile pattern-matching capabilities, leading to frequent illegal moves and strategically shallow play. Here we introduce an alternative approach: We use the LLM to translate natural language rules and game trajectories into a formal, executable world model represented as Python code. This generated model---comprising functions for state transition, legal move enumeration, and termination checks---serves as a verifiable simulation engine for high-performance planning algorithms like Monte Carlo tree search (MCTS). In addition, we prompt the LLM to generate heuristic value functions (to make MCTS more efficient), and inference functions (to estimate hidden states in imperfect information games). Our method offers three distinct advantages compared to directly using the LLM as a policy: (1) Verifiability: The generated CWM serves as a formal specification of the game's rules, allowing planners to algorithmically enumerate valid actions and avoid illegal moves, contingent on the correctness of the synthesized model; (2) Strategic Depth: We combine LLM semantic understanding with the deep search power of classical planners; and (3) Generalization: We direct the LLM to focus on the meta-task of data-to-code translation, enabling it to adapt to new games more easily. We evaluate our agent on 10 different games, of which 4 are novel and created for this paper. Five of the games are fully observed (perfect information), and five are partially observed (imperfect information). We find that our method outperforms or matches Gemini 2.5 Pro in 9 out of the 10 considered games.
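As an illustration of what a code world model might look like, the sketch below hand-writes one for a toy game (single-pile Nim: take 1 to 3 stones, whoever takes the last stone wins) and plugs it into a planner. Everything here is an assumption for illustration: the game, the function names, and the planner, which is exhaustive negamax search rather than the MCTS the abstract mentions. In the paper, the model code is generated by the LLM from rules and trajectories.

```python
from functools import lru_cache
from typing import NamedTuple

class State(NamedTuple):
    stones: int  # stones left in the pile
    player: int  # player to move: 0 or 1

def legal_moves(s: State) -> list[int]:
    """Legal move enumeration: take 1, 2, or 3 stones if available."""
    return [k for k in (1, 2, 3) if k <= s.stones]

def apply_move(s: State, k: int) -> State:
    """State transition; rejects illegal moves instead of guessing."""
    if k not in legal_moves(s):
        raise ValueError(f"illegal move {k} in {s}")
    return State(s.stones - k, 1 - s.player)

def is_terminal(s: State) -> bool:
    """Termination check: the game ends when the pile is empty."""
    return s.stones == 0

@lru_cache(maxsize=None)
def value(s: State) -> int:
    """+1 if the player to move wins under best play, else -1 (negamax)."""
    if is_terminal(s):
        return -1  # the previous player took the last stone and won
    return max(-value(apply_move(s, k)) for k in legal_moves(s))

def best_move(s: State) -> int:
    """Planner: pick the move that leaves the opponent worst off."""
    return max(legal_moves(s), key=lambda k: -value(apply_move(s, k)))
```

Because the planner only ever calls `legal_moves` and `apply_move`, it cannot emit an illegal move as long as the model itself is correct, which is the verifiability argument made in the abstract.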
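Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#query routing #model selection #distributed system #self-awareness of LLM
TL;DR: We introduce DiSRouter, a distributed system where LLMs leverage "self-awareness" to route queries among themselves, outperforming conventional centralized routers.
🎯 Motivation: LLMs vary widely in performance and cost, and existing routing systems struggle to balance the two flexibly and efficiently.
❓ Problem: Routing with a centralized external router cannot capture the knowledge boundaries of different models, which hurts performance and scalability.
🔍 Analysis: A centralized router cannot adapt to a dynamic, diverse model ecosystem, whereas a distributed self-routing design lets each model judge from its own competence whether to handle a query.
🛠️ Method: DiSRouter, a distributed self-routing system with a two-stage Self-Awareness Training pipeline that improves each model's judgment of its own competence, which it uses to decide whether to answer a query or forward it.
📊 Data and experiments: Experiments across various scenarios show significantly higher utility than existing routing methods, strong generalization to out-of-domain tasks, and effective separation of easy and hard queries.
⭐ Contributions: Introduces query routing driven by LLM self-awareness, improving modularity and efficiency and supporting more flexible multi-agent system design.
Full abstract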
The proliferation of Large Language Models (LLMs) has created a diverse ecosystem of models with highly varying performance and costs, necessitating effective query routing to balance performance and expense. Current routing systems often rely on a centralized external router trained on a fixed set of LLMs, making them inflexible and prone to poor performance since the small router cannot fully understand the knowledge boundaries of different LLMs. We introduce DiSRouter (Distributed Self-Router), a novel paradigm that shifts from centralized control to distributed routing. In DiSRouter, a query traverses a network of LLM agents, each independently deciding whether to answer or route to other agents based on its own self-awareness—its ability to judge its competence. This distributed design offers superior flexibility, scalability, and generalizability. To enable this, we propose a two-stage Self-Awareness Training pipeline that enhances each LLM's self-awareness. Extensive experiments demonstrate that DiSRouter significantly outperforms existing routing methods in utility across various scenarios, effectively distinguishes between easy and hard queries, and shows strong generalization to out-of-domain tasks. Our work validates that leveraging an LLM's intrinsic self-awareness is more effective than external assessment, paving the way for more modular and efficient multi-agent systems.
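The answer-or-forward decision can be sketched as a simple control loop. This is a toy illustration under stated assumptions, not the DiSRouter system: the network is reduced to a linear chain ordered from cheapest to most capable, the `self_confidence` callables are hypothetical stand-ins for a trained model's self-assessment, and the last agent always answers.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    name: str
    self_confidence: Callable[[str], float]  # hypothetical self-assessment in [0, 1]
    threshold: float = 0.5                   # answer only if confidence >= threshold

def route(query: str, chain: list[Agent]) -> tuple[str, list[str]]:
    """Return (name of the answering agent, names of all agents visited)."""
    if not chain:
        raise ValueError("chain must contain at least one agent")
    visited = []
    for position, agent in enumerate(chain):
        visited.append(agent.name)
        is_last = position == len(chain) - 1
        # Each agent decides locally; no central router inspects the query.
        if is_last or agent.self_confidence(query) >= agent.threshold:
            return agent.name, visited
    raise AssertionError("unreachable: the last agent always answers")

# Hypothetical agents: the small model is only confident on short queries.
small = Agent("small", lambda q: 0.9 if len(q) < 20 else 0.1)
large = Agent("large", lambda q: 1.0)

easy = route("2 + 2 = ?", [small, large])
hard = route("Prove that there are infinitely many primes.", [small, large])
```

Here `easy` is handled by the small agent alone, while `hard` is forwarded to the large one, so cost is spent on the larger model only when the smaller one declines.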
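Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#LLM-based agent systems #failure analysis #intervention
TL;DR: Intervention-driven debugging advances beyond log-based attribution by validating and repairing failures in LLM-based multi-agent systems.
🎯 Motivation: LLM-based multi-agent systems are hard to debug because failures arise from long, complex interaction chains, and traditional log-based failure attribution has limitations.
❓ Problem: Address the lack of validation in log-based debugging and the unreliability of single-step attribution, using intervention-driven debugging to improve accuracy and effectiveness.
🔍 Analysis: Failure attributions produced by existing methods are often unverified, and in multi-step interactions several distinct interventions can each independently repair a failure.
🛠️ Method: The DoVer framework combines hypothesis generation with verification through targeted interventions and evaluates task success rather than attribution accuracy alone.
📊 Data and experiments: On datasets derived from GAIA and AssistantBench, DoVer turns 18–28% of failed trials into successes, achieves up to 16% milestone progress, and validates or refutes 30–60% of failure hypotheses.
⭐ Contributions: An intervention-driven debugging approach that improves the reliability of debugging LLM-based multi-agent systems and points to a more scalable debugging path.
Full abstract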
Large language model (LLM)–based multi-agent systems are challenging to debug because failures often arise from long, branching interaction traces. The prevailing practice is to leverage LLMs for log-based failure localization, attributing errors to a specific agent and step. However, this paradigm has two key limitations: (i) log-only debugging lacks validation, producing untested hypotheses, and (ii) single-step or single-agent attribution is often ill-posed, as we find that multiple distinct interventions can independently repair the failed task. To address the first limitation, we introduce DoVer, an intervention-driven debugging framework, which augments hypothesis generation with active verification through targeted interventions (e.g., editing messages, altering plans). For the second limitation, rather than evaluating on attribution accuracy, we focus on measuring whether the system resolves the failure or makes quantifiable progress toward task success, reflecting a more outcome-oriented view of debugging. On the datasets derived from GAIA and AssistantBench, DoVer flips 18–28% of failed trials into successes, achieves up to 16% milestone progress, and validates or refutes 30-60% of failure hypotheses. Our findings highlight intervention as a practical mechanism for improving reliability in agentic systems and open opportunities for more robust, scalable debugging methods for LLM-based multi-agent systems.
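Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Doctor Agent #Clinical Inquiry #Agentic Reinforcement Learning
🎯 Motivation: A doctor's core outpatient abilities are accurate medical decision-making and strategic, empathetic patient inquiry. Existing LLMs do well on medical decision benchmarks but lack effective consultation skills, so they fall short in real clinical scenarios.
❓ Problem: Build an AI doctor agent that masters both accurate decision-making and strategic inquiry, closing the gap in patient communication and multi-turn consultation.
🔍 Analysis: Existing models fail to ask high-yield questions and lack a guiding inquiry strategy, which hurts both clinical capability and patient experience.
🛠️ Method: The Doctor-R1 framework comprises a multi-agent interactive environment, a two-tiered reward architecture that separately optimizes decision-making and inquiry skills, and an experience repository that grounds policy learning in high-quality prior trajectories.
📊 Data and experiments: Evaluated on HealthBench and MAQuE with multi-faceted metrics such as communication quality, user experience, and task accuracy, and compared against open-source and proprietary models.
⭐ Contributions: An AI doctor agent that combines medical decision-making with patient inquiry, surpasses existing models with higher parameter efficiency, and shows superior clinical capability and patient-centric performance in human expert evaluations.
Full abstract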
The professionalism of a human doctor in outpatient service depends on two core abilities: the ability to make accurate medical decisions and the medical consultation skill to conduct strategic, empathetic patient inquiry. Existing Large Language Models (LLMs) have achieved remarkable accuracy on medical decision-making benchmarks. However, they often lack the ability to conduct strategic and empathetic consultation, which is essential for real-world clinical scenarios. To address this gap, we propose Doctor-R1, an AI doctor agent trained to master both capabilities by asking high-yield questions and conducting strategic multi-turn inquiry to guide decision-making. Our framework introduces three key components: a multi-agent interactive environment, a two-tiered reward architecture that separately optimizes clinical decision-making and communicative inquiry skills, and an experience repository to ground policy learning in high-quality prior trajectories. We evaluate Doctor-R1 on OpenAI's HealthBench and MAQuE, assessed across multi-faceted metrics, such as communication quality, user experience, and task accuracy. Remarkably, Doctor-R1 surpasses state-of-the-art open-source specialized LLMs by a substantial margin with higher parameter efficiency and outperforms powerful proprietary models. Furthermore, the human expert evaluations show that Doctor-R1 achieves superior clinical capability and patient-centric performance, demonstrating the effectiveness of the framework.
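Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Large Language Models #LLM Agents #Tool Learning #Multi-turn tool use #Reinforcement Learning
🎯 Motivation: High-quality training data for LLM agents is scarce, which limits performance on complex multi-turn tool-use tasks.
❓ Problem: Propose Environment Tuning, a training paradigm that addresses overfitting under supervised fine-tuning and the cold-start problem in reinforcement learning.
🔍 Analysis: A structured curriculum, environment augmentation, and fine-grained reward design markedly improve training stability and exploration efficiency while avoiding the performance collapse seen when training on static trajectories.
🛠️ Method: Environment Tuning learns directly from problem instances in a dynamic environment, with corrective feedback and fine-grained progress rewards that strengthen agent behavior on complex tasks.
📊 Data and experiments: Using only 400 problem instances from the Berkeley Function-Calling Leaderboard (BFCL) benchmark, the method is competitive with strong baselines in distribution and generalizes better out of distribution.
⭐ Contributions: A shift from training on static trajectories to dynamic, environment-based exploration, offering a route to more robust and data-efficient LLM agents.
Full abstract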
Large Language Model (LLM) agents show great promise for complex multi-turn tool-use tasks, but their development is often hampered by the extreme scarcity of high-quality training data. Supervised fine-tuning (SFT) on synthetic data leads to overfitting, whereas standard reinforcement learning (RL) struggles with a critical cold-start problem and training instability. To address these challenges, we introduce $\textbf{Environment Tuning}$, a novel training paradigm that enables agents to learn complex behaviors directly from problem instances without relying on pre-collected expert trajectories. $\textbf{Environment Tuning}$ orchestrates this learning process through a structured curriculum, actionable environment augmentation that provides corrective feedback, and fine-grained progress rewards to ensure stable and efficient exploration. Using only 400 problem instances from the Berkeley Function-Calling Leaderboard (BFCL) benchmark, our method not only achieves competitive in-distribution performance against strong baselines but also demonstrates superior out-of-distribution generalization, overcoming the performance collapse common to SFT-based approaches. Our work presents a paradigm shift from supervised fine-tuning on static trajectories to dynamic, environment-based exploration, paving the way for training more robust and data-efficient agents.
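Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#LLM #Autonomous Agents
🎯 Motivation: Autonomous agents that perform multi-step tasks in complex environments could advance robotics, scientific discovery, and web automation, but existing methods are limited in closed-loop decision-making and cost efficiency.
❓ Problem: LLMs struggle with closed-loop decision-making because of static pretraining and limited temporal grounding; existing methods depend either on costly real-time interaction or on brittle imitation policies, and cannot deliver safety and efficiency together.
🔍 Analysis: Most search-based or RL-based methods need many environment interactions on complex tasks, which raises latency and the risk of executing irreversible actions.
🛠️ Method: DreamPhase plans through offline imagination: a learned latent world model simulates future branches, branches are scored with an uncertainty-aware value and filtered by a safety gate, and the best branch is distilled into a short natural-language reflection that conditions the next policy query.
📊 Data and experiments: On environments such as WebShop and ALFWorld, DreamPhase substantially reduces API calls and executed irreversible actions relative to baselines, with better sample efficiency and safety.
⭐ Contributions: An efficient, safe, and scalable imagination-driven planning framework for autonomous agents on complex tasks; the code is released.
Full abstract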
Autonomous agents capable of perceiving complex environments, understanding instructions, and performing multi-step tasks hold transformative potential across domains such as robotics, scientific discovery, and web automation. While large language models (LLMs) provide a powerful foundation, they struggle with closed-loop decision-making due to static pretraining and limited temporal grounding. Prior approaches either rely on expensive, real-time environment interactions or brittle imitation policies, both with safety and efficiency trade-offs. We introduce DreamPhase, a modular framework that plans through offline imagination. A learned latent world model simulates multi-step futures in latent space; imagined branches are scored with an uncertainty-aware value and filtered by a safety gate. The best branch is distilled into a short natural-language reflection that conditions the next policy query, improving behavior without modifying the LLM. Crucially, DreamPhase attains its performance with substantially fewer real interactions: on WebShop, average API calls per episode drop from $\sim$40 with ARMAP-M (token-level search) to $<10$ with DreamPhase, a $4\times$ reduction that lowers latency and reduces executed irreversible actions by $\sim 5\times$ on WebShop (4.9$\times$ on ALFWorld) per incident logs. Across web, science, and embodied tasks, DreamPhase improves sample efficiency, safety, and cost over search-based and reward-based baselines. This offers a scalable path toward safe, high-performance autonomous agents via imagination-driven planning. Code: \url{https://anonymous.4open.science/r/DreamPhase-A8AD/README.md}.
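The branch-selection step described above (score imagined branches with an uncertainty-aware value, then filter with a safety gate) can be sketched as follows. The concrete choices are assumptions made for illustration, since the abstract does not give formulas: the score is the ensemble mean minus `beta` times the ensemble standard deviation, and the gate is a fixed threshold on a predicted probability of an irreversible action.

```python
from statistics import mean, pstdev

def score_branch(value_samples, beta=1.0):
    """Assumed uncertainty-aware value: ensemble mean minus beta * spread."""
    return mean(value_samples) - beta * pstdev(value_samples)

def select_branch(branches, risk_limit=0.1, beta=1.0):
    """Pick the best-scoring branch among those that pass the safety gate.

    Each branch is a dict with hypothetical fields:
      'plan'   -- identifier of the imagined action sequence
      'values' -- value estimates from an ensemble of value heads
      'risk'   -- predicted probability of an irreversible action
    Returns None if every branch is gated out.
    """
    safe = [b for b in branches if b["risk"] <= risk_limit]
    if not safe:
        return None
    return max(safe, key=lambda b: score_branch(b["values"], beta))

branches = [
    {"plan": "A", "values": [0.9, 0.1], "risk": 0.00},  # promising but uncertain
    {"plan": "B", "values": [0.6, 0.6], "risk": 0.05},  # moderate and certain
    {"plan": "C", "values": [1.0, 1.0], "risk": 0.50},  # best value, fails the gate
]
chosen = select_branch(branches)
```

Branch C has the highest value but is removed by the gate, and branch A is penalized for disagreement between its value estimates, so B is chosen.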
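Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Agents; Computer Use; Large Language Models; Vision Language Models
TL;DR: PC Agent-E demonstrates efficient agent training with a small set of human trajectories augmented with Claude 3.7 Sonnet, achieving 141% improvement and surpassing Claude 3.7 Sonnet by 10%.
🎯 Motivation: Obtaining large-scale, high-quality trajectory data is a key bottleneck for human-like computer-use agents. This work aims for more efficient agent training with less reliance on large numbers of human demonstrations.
❓ Problem: Traditional methods depend on many human-annotated trajectories, which are expensive and hard to scale. The work targets the scarcity of high-quality trajectory data and proposes an efficient alternative.
🔍 Analysis: Existing models such as Claude 3.7 Sonnet can generate diverse decisions, but direct distillation yields limited gains. The key is combining a small amount of high-quality human data with automated AI synthesis.
🛠️ Method: The PC Agent-E framework starts from only 312 human-annotated trajectories, then uses Claude 3.7 Sonnet to synthesize diverse alternative action decisions for them, greatly enriching the training data.
📊 Data and experiments: Builds and releases an improved benchmark, WindowsAgentArena-V2. The PC Agent-E model trained on the enriched trajectories achieves a 141% relative improvement and surpasses Claude 3.7 Sonnet by 10% in relative terms.
⭐ Contributions: 1. An efficient agent training framework that significantly reduces reliance on large-scale human demonstrations. 2. Evidence that combining a small amount of human data with AI data synthesis is effective. 3. An improved benchmark for evaluating related work.
Full abstract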
Scaling up high-quality trajectory data has long been a critical bottleneck for developing human-like computer use agents. We introduce PC Agent-E, an efficient agent training framework that significantly reduces reliance on large-scale human demonstrations. Starting with just 312 human-annotated computer use trajectories, we further augment them by synthesizing diverse alternative action decisions with Claude 3.7 Sonnet. Trained on these enriched trajectories, our PC Agent-E model achieved a remarkable 141% relative improvement, and even surpassed Claude 3.7 Sonnet by 10% in relative terms on WindowsAgentArena-V2, an improved benchmark we also released. By integrating robust human computer use skills with automated AI data synthesis capabilities, our method not only brought substantial improvements over training on human trajectories alone, but also significantly surpassed direct distillation from Claude 3.7 Sonnet.
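Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Large language model #Tool invocation #Tool-call reward model
TL;DR: We propose a Tool-call Reward Model that provides fine-grained signals for tool invocation and adapts classical RL algorithms, significantly enhancing LLMs' tool usage compared to outcome-only reward methods.
🎯 Motivation: LLMs remain limited in using external tools, and RL methods that rely only on outcome rewards suffer from coarse credit assignment and gradient conflict.
❓ Problem: Propose a Tool-call Reward Model (TRM) whose fine-grained reward signals evaluate each tool invocation rather than only the final result.
🔍 Analysis: Conventional reward models are limited on complex tasks; in fine-grained evaluation of tool calls in particular, they are prone to reward hacking and gradient conflict.
🛠️ Method: A systematic TRM construction workflow, combined with refined credit assignment and turn-level advantage estimation for smooth integration with RL algorithms such as PPO and GRPO.
📊 Data and experiments: A 3B-parameter TRM trained on 10K samples performs robustly; on search-based QA and Python code-based math tasks, integrating TRM consistently outperforms outcome-only reward RL.
⭐ Contributions: Introduces TRM for tool invocation, proposes an integration method compatible with classical RL algorithms, and validates it across models of different sizes.
Full abstract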
Large Language Models (LLMs) have recently alleviated limitations in outdated internal knowledge and computational inaccuracies by invoking external tools such as search engines and code generation. While reinforcement learning (RL) has substantially enhanced tool usage in LLMs, most existing agentic RL approaches rely solely on outcome-only reward signals, which assign credit at a coarse granularity and often induce gradient conflict (e.g., correct tool calls may be penalized due to incorrect final answers). To address this, we propose the *Tool-call Reward Model* (TRM), a specialized process reward model meticulously designed to evaluate and reward each tool invocation. Since previous PRM research has predominantly focused on traditional reasoning tasks such as step-wise mathematical reasoning, the introduction of TRM brings two unique challenges: (1) limited understanding of how to construct effective TRMs, including data requirements and model size; and (2) difficulties integrating TRM with classical RL algorithms such as PPO and GRPO, where naive adaptation may lead to reward hacking (minimizing tool calls to avoid penalties). To tackle these challenges, we establish a systematic TRM construction workflow and propose refined credit assignment and turn-level advantage estimation for effective integration with PPO and GRPO. Experiments show that a 3B TRM trained on 10K samples achieves robust performance. On search-based QA and Python code-based math tasks, integrating TRM consistently outperforms outcome-only reward RL methods across models of different sizes.
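Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Large Language Models #Agents #Test Time Learning
TL;DR: We introduce the J-TTL benchmark and EvoTest, a system where an LLM agent learns at test time as a second agent evolves its entire configuration from gameplay experience, no fine-tuning needed.
🎯 Motivation: Current AI agents cannot learn complex skills on the fly at test time, which limits their performance in novel environments and their practical utility.
❓ Problem: Provide a benchmark and a framework for measuring and improving test-time learning, overcoming the shortcomings of existing adaptation methods such as reflection and memory.
🔍 Analysis: Existing methods struggle on the Jericho Test-Time Learning benchmark and fail to improve meaningfully across consecutive episodes of the same game.
🛠️ Method: EvoTest, with an Actor Agent that plays the game and an Evolver Agent that analyzes the episode transcript and iteratively revises the actor's configuration, enabling learning without fine-tuning.
📊 Data and experiments: On J-TTL, EvoTest outperforms all baselines and is the only method that wins two games (Detective and Library), while no baseline wins any.
⭐ Contributions: The J-TTL benchmark and the EvoTest framework, showing that evolving the agent configuration improves performance at test time without fine-tuning.
Full abstract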
A fundamental limitation of current AI agents is their inability to learn complex skills on the fly at test time, often behaving like “clever but clueless interns” in novel environments. This severely limits their practical utility. To systematically measure and drive progress on this challenge, we first introduce the Jericho Test-Time Learning (J-TTL) benchmark. J-TTL is a new evaluation setup where an agent must play the same game for several consecutive episodes, attempting to improve its performance from one episode to the next. On J-TTL, we find that existing adaptation methods like reflection, memory, or reinforcement learning struggle. To address the challenges posed by our benchmark, we present EvoTest, an evolutionary test-time learning framework that improves an agent without any fine-tuning or gradients—by evolving the entire agentic system after every episode. EvoTest has two roles: the Actor Agent, which plays the game, and the Evolver Agent, which analyzes the episode transcript to propose a revised configuration for the next run. This configuration rewrites the prompt, updates memory by logging effective state–action choices, tunes hyperparameters, and learns the tool-use routines. On our J-TTL benchmark, EvoTest consistently increases performance, outperforming not only reflection and memory-only baselines but also more complex online fine-tuning methods. Notably, our method is the only one capable of winning two games (Detective and Library), while all baselines fail to win any.
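Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Multi-Hop RAG #Efficiency #Reasoning #SLMs
🎯 Motivation: Reinforcement learning has improved small language models on reasoning-heavy tasks, but gains on multi-hop QA with retrieval-augmented generation have been limited; a better balance between efficiency and accuracy is needed.
❓ Problem: How to use an RL framework to reduce retrieval steps in multi-hop QA while keeping accuracy, improving scalability.
🔍 Analysis: Existing methods for multi-hop QA often need large amounts of training data, and their policies favor reasoning depth, which makes them inefficient.
🛠️ Method: FrugalRAG, in two stages: supervised fine-tuning on a full-exploration policy that generates broad sub-queries, then RL that adaptively prunes search depth by question difficulty, rewarding both correctness and frugality.
📊 Data and experiments: On HotPotQA and other benchmarks, about 1,000 training examples suffice for state-of-the-art efficiency-accuracy tradeoffs; on BrowseCompPlus the method generalizes zero-shot and surpasses several baselines.
⭐ Contributions: Uses RL to reduce retrieval steps rather than increase reasoning steps, improving the efficiency-accuracy tradeoff in multi-hop QA.
Full abstract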
Reinforcement learning (RL) based on the final answer's reward has driven recent progress in small language models (SLMs) on reasoning-heavy tasks such as math and code. However, applying the same techniques to retrieval-augmented generation (RAG) benchmarks like multi-hop QA has yielded limited gains—often trailing supervised or prompting-only baselines. Instead, we argue that a viable path for RL in multi-hop QA is to use test-time scaling judiciously, for optimizing both the final answer accuracy and the efficiency in reaching that answer.
We propose FrugalRAG, a two-stage finetuning framework that adaptively _reduces_ the number of retrieval steps based on a question's difficulty. First, we train an SLM with supervised finetuning on a full-exploration policy that generates broad sub-queries. Then, we apply RL to adaptively prune search depth based on question difficulty, directly rewarding policies that balance correctness with frugality. Unlike prior approaches requiring 10× more data, our method achieves competitive performance with only ~1,000 examples. On HotPotQA and other multi-hop QA benchmarks, FrugalRAG attains state-of-the-art efficiency–accuracy tradeoffs, cutting retrieval cost nearly in half. Moreover, on the challenging BrowseCompPlus benchmark, it generalizes zero-shot and surpasses SLM-based and other baselines. These results demonstrate the use of RL—not to increase reasoning steps but to reduce them—as an effective solution for scalable, efficient RAG.
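A reward that balances correctness with frugality could take many forms. The sketch below shows one simple possibility, offered as an assumption rather than the paper's actual reward: a correct answer earns 1.0 minus a penalty proportional to the fraction of the retrieval budget used, and an incorrect answer earns nothing. The function name and the `penalty` parameter are hypothetical.

```python
def frugal_reward(correct: bool, steps_used: int, max_steps: int,
                  penalty: float = 0.5) -> float:
    """Assumed reward: correctness discounted by retrieval budget spent.

    correct    -- whether the final answer matched the reference
    steps_used -- number of retrieval calls the policy made
    max_steps  -- retrieval budget (must be positive)
    penalty    -- maximum reward lost when the whole budget is spent
    """
    if max_steps <= 0 or not 0 <= steps_used <= max_steps:
        raise ValueError("need 0 <= steps_used <= max_steps and max_steps > 0")
    if not correct:
        return 0.0  # retrieval savings never compensate for a wrong answer
    return 1.0 - penalty * steps_used / max_steps

# Two correct rollouts on the same question: the shorter search is preferred.
short_search = frugal_reward(True, steps_used=2, max_steps=4)
long_search = frugal_reward(True, steps_used=4, max_steps=4)
```

Keeping the wrong-answer reward at zero means the policy cannot gain by stopping early on a hard question, which matches the stated goal of pruning depth only where the question allows it.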
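Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#prompt optimization #natural language #reflection #large language models #agent design #agent discovery #code optimization #compound AI systems #genetic #language based learning #evolutionary algorithms
TL;DR: GEPA uses natural language reflection to optimize prompts, outperforming GRPO and MIPROv2 while needing far fewer rollouts.
🎯 Motivation: LLMs are commonly adapted to downstream tasks with RL methods such as GRPO, which need many rollouts and are inefficient; the interpretable nature of language offers a richer learning medium.
❓ Problem: Reduce the number of rollouts needed for task optimization while using natural language reflection to optimize prompts more efficiently and accurately.
🔍 Analysis: RL-based optimization is limited by sparse scalar rewards, which makes it slow and costly, whereas describing problems and summarizing rules in natural language can greatly improve learning efficiency.
🛠️ Method: GEPA combines natural language reflection with Genetic-Pareto optimization: it analyzes task feedback, improves prompts step by step, and merges complementary lessons from multiple attempts.
📊 Data and experiments: Across six tasks, GEPA outperforms GRPO by 6 percentage points on average and by up to 19pp while using up to 35x fewer rollouts. It also outperforms MIPROv2 by over 10pp (for example, +12pp on AIME-2025) and shows promising results as an inference-time search strategy for code optimization.
⭐ Contributions: A reflection-based optimization framework that outperforms current optimizers with far fewer rollouts; the code is released.
Full abstract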
Large language models (LLMs) are increasingly adapted to downstream tasks via reinforcement learning (RL) methods like Group Relative Policy Optimization (GRPO), which often require thousands of rollouts to learn new tasks. We argue that the interpretable nature of language often provides a much richer learning medium for LLMs, compared to policy gradients derived from sparse, scalar rewards. To test this, we introduce GEPA (Genetic-Pareto), a prompt optimizer that thoroughly incorporates natural language reflection to learn high-level rules from trial and error.
Given any AI system containing one or more LLM prompts, GEPA samples trajectories (e.g., reasoning, tool calls, and tool outputs) and reflects on them in natural language to diagnose problems, propose and test prompt updates, and combine complementary lessons from the Pareto frontier of its own attempts. As a result of GEPA's design, it can often turn even just a few rollouts into a large quality gain.
Across six tasks, GEPA outperforms GRPO by 6 percentage points on average and by up to 19pp, while using up to 35x fewer rollouts. GEPA also outperforms the leading prompt optimizer, MIPROv2, by over 10 percentage points (e.g., +12pp on AIME-2025), and demonstrates promising results as an inference-time search strategy for code optimization. We release our code at https://github.com/gepa-ai/gepa.
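The idea of keeping a Pareto frontier of attempts, rather than a single best prompt, can be illustrated with a standard non-dominated filter over per-task scores. This is a generic sketch, not GEPA's selection rule: the candidate names and scores are made up, and the actual algorithm in the released code may select candidates differently.

```python
def dominates(a, b):
    """True if score vector a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scores):
    """Return the names of candidates that no other candidate dominates.

    scores -- dict mapping a candidate prompt name to a tuple of per-task scores
    """
    return sorted(
        name for name, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != name)
    )

# Hypothetical prompt candidates scored on two tasks.
scores = {
    "p0": (0.5, 0.5),
    "p1": (0.9, 0.2),  # strong on task 1 only
    "p2": (0.3, 0.8),  # strong on task 2 only
    "p3": (0.4, 0.4),  # worse than p0 on both tasks
}
front = pareto_front(scores)
```

Only `p3` is dropped. The specialists `p1` and `p2` survive even though neither has the best average, so the lessons they encode remain available to combine in later candidates.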
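Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Tool Learning #Large Language Model #Graph Data Mining
TL;DR: We propose GTool, which is the first work aiming to enhance the tool planning ability of LLMs under incomplete tool dependencies.
🎯 Motivation: Tool planning links natural language understanding to task execution, but existing methods treat tools as isolated components and ignore the dependencies among them, producing invalid plans.
❓ Problem: When tool dependencies are incomplete, LLMs struggle to pick the right tools during planning, especially with a large toolset.
🔍 Analysis: Existing methods underuse dependency information, and planning suffers from the lack of mechanisms for identifying and inferring dependencies.
🛠️ Method: GTool builds a request-specific tool graph to select tools efficiently, generates a graph token that encodes dependency information in a form LLMs can use, and adds a missing dependency prediction task to improve reliability.
📊 Data and experiments: With a lightweight (7B) LLM backbone, extensive experiments show more than 29.6% improvement over state-of-the-art baselines.
⭐ Contributions: The first tool planning approach targeting incomplete dependencies, which integrates with various LLM backbones without extensive retraining and substantially improves planning in that setting.
Full abstract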
Tool planning with large language models (LLMs), referring to selecting, organizing, and preparing the tools necessary to complete a user request, bridges the gap between natural language understanding and task execution. However, current works treat different tools as isolated components and fail to leverage the inherent dependencies of tools, leading to invalid planning results. Since tool dependencies are often incomplete, it becomes challenging for LLMs to accurately identify the appropriate tools required by a user request, especially when confronted with a large toolset. To solve this challenge, we propose GTool, which is the first work aiming to enhance the tool planning ability of LLMs under incomplete dependencies. GTool constructs a request-specific tool graph to select tools efficiently and generate the \<graph token\> which provides sufficient dependency information understandable by LLMs. Moreover, a missing dependency prediction task is designed to improve the reliability of GTool with incomplete dependencies. Without trimming LLMs, GTool can be seamlessly integrated with various LLM backbones without extensive retraining. Extensive experiments show that GTool achieves more than 29.6% performance improvements compared with the state-of-the-art (SOTA) baselines with a light-weight (7B) LLM backbone.
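Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Large Language Model #Safety
🎯 Motivation: DNA language models show strong generative ability in synthetic genome design, which raises the biosafety risk that they could be misused to design human viruses.
❓ Problem: Probe the biosafety vulnerabilities of DNA language models with a systematic testing framework, to expose their potential to generate pathogen sequences and inform more robust safeguards.
🔍 Analysis: In high-priority human virus scenarios, the dual-use risk of DNA language models grows markedly with model scale.
🛠️ Method: GeneBreaker, an end-to-end attack framework: an LLM agent with customized bioinformatics tools designs high-homology yet non-pathogenic prompts, beam search guided by PathoLM and log-probability heuristics steers generation, and a BLAST- and function-annotation-based pipeline identifies successful jailbreaks.
📊 Data and experiments: Introduces JailbreakDNABench, a benchmark centered on high-priority human viruses; GeneBreaker jailbreaks the latest Evo series models across 6 viral categories, with attack success rates up to 60% for Evo2-40B.
⭐ Contributions: A systematic biosafety evaluation of DNA language models, the GeneBreaker framework and a new benchmark, evidence that scaling amplifies dual-use risk, and a case for stronger safety alignment and tracing mechanisms.
Full abstract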
DNA, encoding genetic instructions for almost all living organisms, fuels groundbreaking advances in genomics and synthetic biology. Recently, DNA Language Models have achieved success in designing synthetic functional DNA sequences, even whole genomes of novel bacteriophage, verified with wet lab experiments. Such remarkable generative power also brings severe biosafety concerns about whether DNA language models can design human viruses. With the goal of exposing vulnerabilities and informing the development of robust safeguarding techniques, we perform a systematic biosafety evaluation of DNA language models through the lens of jailbreak attacks. Specifically, we introduce JailbreakDNABench, a benchmark centered on high-priority human viruses, together with an end-to-end jailbreak framework, GeneBreaker. GeneBreaker integrates three key components: (1) an LLM agent equipped with customized bioinformatics tools to design high-homology yet non-pathogenic jailbreak prompts, (2) beam search guided by PathoLM and log-probability heuristics to steer sequence generation toward pathogen-like outputs, and (3) a BLAST- and function-annotation–based evaluation pipeline to identify successful jailbreaks. On JailbreakDNABench, GeneBreaker successfully jailbreaks the latest Evo series models across 6 viral categories consistently (up to 60\% Attack Success Rate for Evo2-40B). Further case studies on SARS-CoV-2 spike protein and HIV-1 envelope protein demonstrate the sequence and structural fidelity of jailbreak output, while evolutionary modeling of SARS-CoV-2 underscores biosecurity risks. Our findings also reveal that scaling DNA language models amplifies dual-use risks, motivating enhanced safety alignment and tracing mechanisms.
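Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#LLM Agent #Reinforcement Learning #Data Synthesis #Generalizability
TL;DR: We transform static coding problems into interactive multi-turn tool-use environments, enabling LLM agents to learn through reinforcement learning and improve their generalization ability for OOD tasks.
🎯 Motivation: Tool-augmented LLMs generalize poorly to new tools and unseen workflows, so more effective training frameworks are needed for real-world tasks.
❓ Problem: Convert static coding problems into interactive multi-turn tool-use environments, addressing the poor generalization of current RL training beyond development settings.
🔍 Analysis: Code execution reflects structural patterns of real-world workflows, but conventional training offers little support for structured tool-use environments, leaving models brittle to new tasks and tools.
🛠️ Method: CodeGym synthesizes diverse, verifiable, and controllable tool-use environments by extracting atomic functions or logic from coding problems into callable tools and building multi-turn task configurations for the model to explore.
📊 Data and experiments: Models of varying sizes and chain-of-thought configurations are trained in CodeGym; Qwen2.5-32B-Instruct gains 8.7 absolute accuracy points on the OOD benchmark $\tau$-Bench.
⭐ Contributions: CodeGym, a scalable general-purpose RL environment for training tool use, with markedly better performance on unseen tasks and publicly released code.
Full abstract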
Tool-augmented large language models (LLMs), hereafter LLM agents, leverage external tools to solve diverse tasks and interface with the real world. However, current training practices largely rely on supervised fine-tuning (SFT) over static trajectories or reinforcement learning (RL) on narrow tasks, which generalize poorly beyond development settings and lead to brittleness with new tools and unseen workflows. Because code execution reflects many structural patterns of real-world workflows, we use coding problems as a structured substrate to build tool-use agent training environments with diverse task configurations. To this end, we introduce **CodeGym**, a scalable framework that synthesizes diverse, verifiable, and controllable multi-turn tool-use environments for agent RL, enabling LLM agents to explore and master various workflows actively. CodeGym converts static coding problems into interactive environments by extracting atomic functions or logic into callable tools, yielding verifiable tasks that span various tool-execution workflows. Models of varying sizes and chain-of-thought configurations trained in CodeGym exhibit consistent out-of-distribution generalizability; for example, Qwen2.5-32B-Instruct achieves an absolute accuracy gain of 8.7 points on the OOD benchmark $\tau$-Bench. These results highlight CodeGym as a step toward scalable general-purpose RL environments for training tool-use behaviors that align with real-world agent workflows. Our code is publicly available at https://github.com/StigLidu/CodeGym.
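To show what converting a static coding problem into an interactive environment can mean, here is a toy sketch for the problem "return the largest number in a list". It is an assumption-laden illustration, not CodeGym's API: the class `MaxValueEnv`, its tool names, and the scripted agent are all hypothetical. The hidden list plays the role of the problem input, the tools expose atomic operations, and the reference solution serves as the verifier.

```python
class MaxValueEnv:
    """Toy tool-use environment: find the largest number in a hidden, non-empty list."""

    def __init__(self, numbers):
        self._numbers = list(numbers)
        self._answer = max(self._numbers)  # reference solution acts as the verifier
        self.calls = 0                     # number of tool calls made so far
        self.done = False

    # --- callable tools extracted from the original solution's logic ---
    def length(self):
        """Tool: report how many elements the hidden list has."""
        self.calls += 1
        return len(self._numbers)

    def read(self, index):
        """Tool: reveal the element at one position."""
        self.calls += 1
        return self._numbers[index]

    def submit(self, value):
        """End the episode and return a verifiable reward (1.0 if correct)."""
        self.done = True
        return 1.0 if value == self._answer else 0.0

def scripted_agent(env):
    """Stand-in for an LLM policy: scan the list through tool calls only."""
    best = env.read(0)
    for i in range(1, env.length()):
        best = max(best, env.read(i))
    return env.submit(best)

env = MaxValueEnv([3, 9, 2])
reward = scripted_agent(env)
```

The agent never sees the list directly, so the task becomes a multi-turn workflow, and the reward can be checked automatically because the original problem has a known solution.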
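Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#LLM Collaboration #Multi-Agent LLM
🎯 Motivation: As the number of LLMs and benchmarks grows rapidly, mechanisms for orchestrating multiple models to improve task performance are needed; existing methods fall short in model selection, communication, and response integration.
❓ Problem: Address the core issues of multi-agent collaboration within one framework: selecting relevant models, improving communication between models, and integrating responses efficiently.
🔍 Analysis: Experiments indicate that existing methods handle multi-agent settings poorly: involving too many models at once reduces efficiency, and the lack of an effective communication mechanism hurts overall performance.
🛠️ Method: A graph-based agent collaboration framework in which node sampling selects relevant models, edges are built from response relevance, directed message passing improves communication and response integration, and graph pooling produces a single final answer.
📊 Data and experiments: Evaluated on multi-domain benchmarks (MMLU, MMLU-Pro, GPQA) and domain-specific benchmarks (MATH, HumanEval, MedMCQA) with a pool of 6 cross-domain models; selecting only 3 models outperforms baselines that use all 6 at once.
⭐ Contributions: Graph-of-Agents, a framework that uses structured message passing for efficient multi-agent collaboration, offering both scalability and effectiveness across multi-domain tasks.
Abstract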
With an ever-growing zoo of LLMs and benchmarks, the need to orchestrate multiple models for improved task performance has never been more pressing. While frameworks like Mixture-of-Agents (MoA) attempt to coordinate LLMs, they often fall short in terms of (1) selecting relevant agents, (2) facilitating effective inter-agent communication, and (3) integrating responses efficiently. In this work, we propose Graph-of-Agents (GoA), a new graph-based framework for modeling multi-agent LLM communication. Our approach begins with node sampling, selecting only the most relevant agents by leveraging model cards that summarize each model’s domain, task specialization, and other characteristics. Next, we construct edges between the selected agents by evaluating their responses against one another to determine relevance ordering. Directed message passing is then performed from highly relevant agents to less relevant ones to enhance their responses, followed by reverse message passing to refine the original responses of the more relevant agents. Finally, the updated responses are aggregated via graph-based pooling (e.g., max or mean pooling) to produce a single, unified answer. We evaluate GoA on diverse multi-domain benchmarks (MMLU, MMLU-Pro, GPQA) and domain-specific benchmarks (MATH, HumanEval, MedMCQA), with an agent pool of 6 LLMs spanning multiple domains. Surprisingly, GoA achieves superior performance using only 3 selected agents, outperforming recent multi-agent LLM baselines that utilize all 6 agents simultaneously. By adopting a graph structure, GoA offers both scalability and effectiveness through structured message passing, positioning it as a strong candidate for navigating the challenges of the ever-growing LLM zoo.
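The node sampling and message-passing order described in this abstract can be sketched as a small scheduling function. This is a simplification under stated assumptions: a single precomputed relevance score per agent stands in for both the model-card-based selection and the response-based relevance ordering, and the function name `message_passing_schedule` is hypothetical. The LLM calls themselves are omitted.

```python
def message_passing_schedule(relevance: dict[str, float], k: int = 3):
    """Pick the k most relevant agents and list who sends messages to whom.

    relevance: hypothetical per-agent relevance scores (higher = more relevant).
    Returns (selected agents, forward edges, reverse edges).
    """
    # Node sampling: keep only the top-k agents, most relevant first.
    selected = sorted(relevance, key=relevance.get, reverse=True)[:k]
    # Forward pass: each agent sends to every less relevant selected agent.
    forward = [
        (selected[i], selected[j])
        for i in range(len(selected))
        for j in range(i + 1, len(selected))
    ]
    # Reverse pass: less relevant agents send refinements back.
    backward = [(dst, src) for src, dst in forward]
    return selected, forward, backward
```

For a pool scored `{"math": 0.9, "code": 0.4, "med": 0.7, "law": 0.1}` with `k=3`, the forward edges run from `math` to `med` and `code`, and from `med` to `code`; the final pooling step would then aggregate the three updated responses.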
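Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Large language model #Tool Learning #Reinforcement Learning
🎯 Motivation: Reinforcement learning from verifiable rewards has markedly improved LLMs as interactive coding agents, but existing methods ignore process-verifiable environment feedback, which limits learning.
❓ Problem: Existing algorithms estimate the advantage at each reasoning step inaccurately and make poor use of intermediate feedback from code execution, which hurts optimization and learning quality.
🔍 Analysis: Process-verifiable intermediate feedback, such as syntax errors and runtime exceptions, offers finer-grained guidance, yet current policy optimization frameworks do not use it to correct model behavior.
🛠️ Method: GVPO integrates outcome-verifiable rewards with process-verifiable feedback in an advantage shaping framework, balancing short-term process guidance against long-term outcome alignment to improve credit assignment and convergence.
📊 Data and experiments: A 32B-parameter agent trained with GVPO in the AppWorld environment outperforms OpenAI's o1 agent by 12.7% on the Test-C split and the strongest 32B RL-trained baseline by 3.7%.
⭐ Contributions: An RL algorithm that combines the two kinds of verifiable signal, giving more stable optimization and stronger generalization for LLMs on interactive coding tasks.
Abstract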
Recent advancements in reinforcement learning from verifiable rewards (RLVR), particularly through Group Relative Policy Optimization (GRPO), have significantly improved the capabilities of large language models (LLMs) for interactive coding agents.
However, these methods overlook process-verifiable environment feedback (e.g., code execution failures), leading to inaccurate advantage estimation at each reasoning step and insufficient learning.
To address this issue, we propose Group Verification-based Policy Optimization (GVPO), a novel RL algorithm that introduces an advantage shaping framework integrating both outcome-verifiable and process-verifiable signals.
While outcome-verifiable rewards ensure alignment with long-term task objectives, process-verifiable feedback derived from intermediate execution traces (e.g., syntax errors, runtime exceptions) serves as corrective shaping terms at the step level.
By jointly leveraging these two forms of verifiability, GVPO achieves more accurate credit assignment, balancing short-term process guidance with long-term outcome alignment.
This unified formulation yields more stable optimization, faster convergence, and stronger generalization in complex interactive environments.
A 32B-parameter agent trained with GVPO in the AppWorld environment outperforms OpenAI’s o1 agent by 12.7\% on the more challenging Test-C split and surpasses the strongest 32B RL-trained state-of-the-art baseline by 3.7\%.
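The abstract does not give the exact shaping formula, so the following is only a minimal sketch of the general idea: a GRPO-style group-relative outcome advantage is computed per rollout, and a fixed penalty is subtracted at steps where execution failed. The function name, the `penalty` value, and the additive form are assumptions for illustration, not GVPO's actual definition.

```python
from statistics import mean, pstdev


def shaped_advantages(outcome_rewards, step_failed, penalty=0.5, eps=1e-6):
    """Return per-step advantages for each rollout in a group.

    outcome_rewards: one final (outcome-verifiable) reward per rollout.
    step_failed: per rollout, a list of booleans; True marks a step whose
        code execution raised an error (process-verifiable feedback).
    penalty: hypothetical size of the step-level corrective term.
    """
    mu = mean(outcome_rewards)
    sigma = pstdev(outcome_rewards)
    result = []
    for reward, failures in zip(outcome_rewards, step_failed):
        # Group-relative term shared by every step of the rollout.
        base = (reward - mu) / (sigma + eps)
        # Step-level shaping: failed steps receive a lower advantage.
        result.append([base - penalty * float(failed) for failed in failures])
    return result
```

With this form, two steps of the same successful rollout no longer share one advantage value: the step that triggered a runtime exception is credited less than the step that executed cleanly.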
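Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Large Language Models #Reinforcement Learning #Evolution Strategies #Scientific Discovery
TL;DR: We introduce HELIX, a Hierarchical Evolutionary Reinforcement Learning framework with In-context eXperiences, achieving superior performance over GPT-4o pipelines on open-ended scientific tasks
🎯 Motivation: LLMs with reasoning abilities show promise on complex scientific problems, but existing methods explore inefficiently and generalize poorly on domain-specific, open-ended tasks.
❓ Problem: Improve exploration efficiency and solution quality on complex scientific problems, overcoming the limits of current learning-based methods.
🔍 Analysis: Existing approaches rely either on carefully designed workflows or on pure learning; their exploration is narrow, their gains are limited, and they make poor use of the flexible solution spaces of open-ended tasks.
🛠️ Method: HELIX combines hierarchical evolutionary reinforcement learning with in-context experiences: a diverse, high-quality pool of candidate solutions broadens exploration, and iterative policy refinement progressively raises solution quality.
📊 Data and experiments: On circle packing, HELIX reaches a state-of-the-art sum of radii of 2.63598308 using only a 14B model; on the Adult and Bank Marketing datasets, it improves average F1 by 5.95 points over a carefully engineered GPT-4o pipeline.
⭐ Contributions: A framework that fuses evolutionary search with reinforcement learning and improves exploration efficiency and solution quality on open-ended scientific problems.
Abstract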
Large language models (LLMs) with reasoning abilities have demonstrated growing promise for tackling complex scientific problems. Yet such tasks are inherently domain-specific, unbounded and open-ended, demanding exploration across vast and flexible solution spaces. Existing approaches, whether purely learning-based or reliant on carefully designed workflows, often suffer from limited exploration efficiency and poor generalization.
To overcome these challenges, we present **HELIX**, a **H**ierarchical **E**volutionary reinforcement **L**earning framework with **I**n-context e**X**periences. HELIX introduces two key novelties: (i) a diverse yet high-quality pool of candidate solutions that broadens exploration through in-context learning, and (ii) reinforcement learning for iterative policy refinement that progressively elevates solution quality. This synergy enables the discovery of more advanced solutions. On the circle packing task, HELIX achieves a state-of-the-art result with a sum of radii of 2.63598308 using only a 14B model. Across standard machine learning benchmarks, HELIX further surpasses GPT-4o with a carefully engineered pipeline, delivering an average F1 improvement of 5.95 points on the Adult and Bank Marketing datasets.
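Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Multi-agent System #Federated Learning #LLM-based Agent
🎯 Motivation: Federated learning is a powerful way to train models on decentralized data, but the complexity of designing and deploying robust systems undermines it: strategies for data heterogeneity, system constraints, and other challenges must be selected, combined, and tuned.
❓ Problem: Remove the bottleneck of complex, brittle federated learning system design by automating system synthesis and evaluation.
🔍 Analysis: Traditional solutions depend heavily on manual tuning and bespoke design, adapt poorly to diverse task requirements, and are brittle.
🛠️ Method: Helmsman, an LLM-based multi-agent framework with three phases: interactive human-in-the-loop planning, modular code generation, and a closed loop of autonomous evaluation and refinement, taking the user from a high-level specification to a complete system.
📊 Data and experiments: Introduces AgentFL-Bench, a new benchmark of 16 diverse tasks; extensive experiments show the generated solutions are competitive with, and often better than, established hand-crafted baselines.
⭐ Contributions: A framework for automatically synthesizing federated learning systems plus an accompanying benchmark, reducing manual effort in engineering complex decentralized AI systems; the code is released.
Abstract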
Federated Learning (FL) offers a powerful paradigm for training models on decentralized data, but its promise is often undermined by the immense complexity of designing and deploying robust systems. The need to select, combine, and tune strategies for multifaceted challenges like data heterogeneity and system constraints has become a critical bottleneck, resulting in brittle, bespoke solutions. To address this, we introduce Helmsman, a novel LLM-based multi-agent framework that automates the end-to-end synthesis of federated learning systems from high-level user specifications. It emulates a principled research and development workflow through three collaborative phases: (1) interactive human-in-the-loop planning to formulate a sound research plan, (2) modular code generation by supervised generative agent teams, and (3) a closed-loop of autonomous evaluation and refinement in a sandboxed simulation environment. To facilitate rigorous evaluation, we also introduce AgentFL-Bench, a new benchmark comprising 16 diverse tasks designed to assess the system-level generation capabilities of LLM-driven agentic systems in FL. Extensive experiments demonstrate that our approach generates solutions competitive with, and often superior to, established hand-crafted baselines. Our work represents a significant step towards the automated engineering of complex decentralized AI systems. Code is available at: https://github.com/haoyuan-l/Helmsman.
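Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Retrieval Augmented Generation #LLM Agent #Agentic RAG #Question Answering #Reinforcement Learning
TL;DR: This work introduces HiPRAG, a RL training method that uses hierarchical process rewards to teach agentic RAG systems to search more efficiently by reducing sub-optimal behaviors like over-search and under-search.
🎯 Motivation: Retrieval-augmented generation (RAG) agents often search sub-optimally, either over-searching or under-searching, which causes unnecessary overhead and unreliable outputs.
❓ Problem: Existing RL-based training mostly relies on outcome rewards and lacks fine-grained control; the paper uses process rewards to improve the search and reasoning behavior of RAG systems.
🔍 Analysis: Over-search makes the model retrieve information it already knows, while under-search skips retrieval that is needed; the former adds unnecessary cost and the latter lowers accuracy.
🛠️ Method: HiPRAG decomposes the agent's reasoning trajectory into discrete, parsable steps, judges the necessity of each search decision, and applies a hierarchical process reward that favors optimal search and non-search steps.
📊 Data and experiments: On Qwen2.5 and Llama-3.2 models across seven QA benchmarks, HiPRAG reaches average accuracies of 65.4% (3B) and 67.2% (7B) and cuts the over-search rate from over 27% in prior baselines to 2.3%, while also lowering the under-search rate.
⭐ Contributions: A process reward framework that improves the reasoning efficiency and quality of retrieval-based agents and generalizes across RL algorithms, model families, sizes, and types.
Abstract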
Agentic Retrieval-Augmented Generation (RAG) is a powerful technique for incorporating external information that Large Language Models (LLMs) lack, enabling better problem solving and question answering. However, suboptimal search behaviors exist widely, such as over-search (retrieving information already known) and under-search (failing to search when necessary), which leads to unnecessary overhead and unreliable outputs. Current training methods, which typically rely on outcome-based rewards in a Reinforcement Learning (RL) framework, lack the fine-grained control needed to address these inefficiencies. To overcome this, we introduce $\textbf{Hi}$erarchical $\textbf{P}$rocess Rewards for Efficient agentic $\textbf{RAG}$ (HiPRAG), a novel training methodology that incorporates a fine-grained, knowledge-grounded process reward into the RL training. Our approach evaluates the necessity of each search decision on-the-fly by decomposing the agent's reasoning trajectory into discrete, parsable steps. We then apply a hierarchical reward function that provides an additional bonus based on the proportion of optimal search and non-search steps, on top of commonly used outcome and format rewards. Experiments on the Qwen2.5 and Llama-3.2 models across seven diverse QA benchmarks show that our method achieves average accuracies of 65.4\% (3B) and 67.2\% (7B), outperforming strong agentic RAG baselines. This is accomplished while dramatically improving search efficiency, reducing the over-search rate from over 27\% in baselines from previous work to just 2.3\% and concurrently lowering the under-search rate. These results demonstrate the efficacy of optimizing the reasoning process itself, not just the final outcome. Further experiments and analysis demonstrate that HiPRAG shows good generalizability across a wide range of RL algorithms, model families, sizes, and types. 
This work demonstrates the importance and potential of fine-grained control through RL, for improving the efficiency and optimality of reasoning for search agents.
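A hedged sketch of how such a reward could be composed: an outcome term, a format term, and a bonus proportional to the share of steps judged optimal. The weights and the gating rule (bonus only when the answer is correct and well formatted) are assumptions chosen for illustration; the abstract states only that the bonus is added on top of outcome and format rewards.

```python
def hierarchical_reward(correct, well_formatted, step_is_optimal,
                        format_weight=0.2, bonus_weight=0.4):
    """Combine outcome, format, and process terms into one scalar reward.

    correct: whether the final answer matches the reference.
    well_formatted: whether the trajectory parses into discrete steps.
    step_is_optimal: one boolean per step; True if the search (or the
        decision not to search) was judged necessary and sufficient.
    format_weight, bonus_weight: hypothetical coefficients.
    """
    reward = (1.0 if correct else 0.0) + (format_weight if well_formatted else 0.0)
    # Assumed gating: the process bonus applies only to correct,
    # well-formatted trajectories with at least one parsed step.
    if correct and well_formatted and step_is_optimal:
        reward += bonus_weight * sum(step_is_optimal) / len(step_is_optimal)
    return reward
```

Under these assumed weights, a correct, well-formatted trajectory with three optimal steps out of four scores higher than an equally correct one that over-searches at every step, which is the pressure toward efficient search the abstract describes.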
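Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Large Language Models #Reinforcement Learning #Imperfect Information Game #Strategic Reasoning
TL;DR: This work systematically studies LLMs in poker, uncovering heuristic, factual, and knowing–doing flaws, and introduces ToolPoker, a tool-integrated framework using external solvers to reach state-of-the-art gameplay and professional-level reasoning.
🎯 Motivation: As LLMs are increasingly applied in high-stakes domains, their ability to reason strategically under uncertainty becomes critical; poker, which demands rigorous game-theoretic reasoning, is a natural testbed.
❓ Problem: LLMs perform poorly at poker: they fail to compete with traditional algorithms and show three recurring flaws, namely reliance on heuristics, factual misunderstandings, and a gap between knowing and doing.
🔍 Analysis: LLM actions in poker diverge from their own reasoning, and behavior cloning with step-level reinforcement learning improves reasoning style without achieving accurate game-theoretic play.
🛠️ Method: ToolPoker, a tool-integrated reasoning framework that uses external solvers to choose GTO-consistent actions and to produce more precise, professional-style explanations.
📊 Data and experiments: Experiments on multiple realistic poker tasks show that ToolPoker achieves state-of-the-art gameplay and produces reasoning traces that closely follow game-theoretic principles.
⭐ Contributions: A systematic study of LLM shortcomings in poker, a characterization of the core flaws, and the tool-integrated ToolPoker framework, which reaches professional-level strategic reasoning and gameplay.
Abstract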
As Large Language Models (LLMs) are increasingly applied in high-stakes domains, their ability to reason strategically under uncertainty becomes critical. Poker provides a rigorous testbed, requiring not only strong actions but also principled, game-theoretic reasoning. In this paper, we conduct a systematic study of LLMs in multiple realistic poker tasks, evaluating both gameplay outcomes and reasoning traces. Our analysis reveals LLMs fail to compete against traditional algorithms and identifies three recurring flaws: reliance on heuristics, factual misunderstandings, and a “knowing–doing” gap where actions diverge from reasoning. An initial attempt with behavior cloning and step-level reinforcement learning improves reasoning style but remains insufficient for accurate game-theoretic play. Motivated by these limitations, we propose ToolPoker, a tool-integrated reasoning framework that combines external solvers for GTO-consistent actions with more precise professional-style explanations. Experiments demonstrate that ToolPoker achieves state-of-the-art gameplay while producing reasoning traces that closely reflect game-theoretic principles.
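Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Multi-hop QA #RAG #Reasoning
TL;DR: We propose HybridDeepSearcher, a scalable search agent that dynamically integrates parallel and sequential strategies, trained on HDS-QA, a novel hybrid-hop dataset with supervised trajectories.
🎯 Motivation: Large reasoning models combined with retrieval-augmented generation (RAG) can already perform multi-step reasoning, but the breadth and depth of document retrieval remain limited on complex tasks.
❓ Problem: Single-query search steps scale poorly to tasks that need broad exploration, and generating multiple independent queries at once limits deeper sequential reasoning; the paper integrates parallel and sequential search dynamically to scale search on complex tasks.
🔍 Analysis: HybridDeepSearcher is more robust and achieves higher evidence coverage on questions that require more evidence, and its performance scales with additional test-time search resources.
🛠️ Method: HybridDeepSearcher combines parallel and sequential search strategies and adjusts between them dynamically through reasoning-query-retrieval loops to support complex multi-hop reasoning.
📊 Data and experiments: Builds the HDS-QA dataset, which integrates parallel and sequential search and supervises answer trajectories; the method outperforms the state of the art on all five benchmarks, improving F1 by +15.9 on FanOutQA and +11.5 on a subset of BrowseComp.
⭐ Contributions: A model with a dynamic search strategy that improves reasoning on complex tasks, the HDS-QA dataset, and a planned public release of code and data.
Abstract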
Large reasoning models (LRMs) combined with retrieval-augmented generation (RAG) have enabled deep research agents capable of multi-step reasoning with external knowledge retrieval. However, previous methods that extend reasoning with single-query search steps struggle to scale to complex tasks demanding broad document exploration. Meanwhile, approaches that generate multiple independent queries simultaneously may limit deeper, sequential reasoning. To address these limitations, we propose HybridDeepSearcher that dynamically integrates parallel and sequential search strategies to enable effective search scaling. To support training, we introduce HDS-QA, a novel dataset that seamlessly integrates broad parallel search with sequential search reasoning, providing answer trajectories in the form of reasoning-query-retrieval loops with parallel sub-queries. Across all five benchmarks, our approach significantly outperforms the state-of-the-art, improving F1 scores by +15.9 on FanOutQA and +11.5 on a subset of BrowseComp. Further analysis reveals that HybridDeepSearcher effectively scales performance with additional test-time search resources and demonstrates robustness on questions requiring more evidence, achieving higher evidence coverage. We include the code in the supplementary materials and will release the dataset and code publicly.
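Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Code Localization #Large Language Models #Agent Memory
TL;DR: We improve code localization by augmenting language agents with repository memory built from commit history, and show that such memory could significantly boost performance on SWE-bench benchmarks.
🎯 Motivation: Code localization is a core challenge in software engineering, but existing methods overlook long-term memory for language agents and make little use of the history recorded as a codebase evolves.
❓ Problem: Existing methods handle each instance from scratch with no repository memory, so they cannot exploit knowledge of module functionality or links to previously resolved issues.
🔍 Analysis: Human developers rely on long-term memory for code localization, whereas language agents do not accumulate and reuse past experience in the same way.
🛠️ Method: Build a non-parametric memory from the repository's commit history, covering recent commits with their linked issues and functionality summaries of actively evolving code, and give the agent tools to retrieve from it.
📊 Data and experiments: Evaluated on SWE-bench-verified and the more recent SWE-bench-live; the memory significantly improves LocAgent on both.
⭐ Contributions: A method for building agent memory from commit history that significantly improves code localization and moves agent design closer to how human developers work.
Abstract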
Code localization is a fundamental challenge in repository-level software engineering tasks such as bug fixing. While existing methods equip language agents with comprehensive tools/interfaces to fetch information from the repository, they overlook the critical aspect of *memory*, where each instance is typically handled from scratch assuming no prior repository knowledge. In contrast, human developers naturally build long-term repository memory, such as the functionality of key modules and associations between various bug types and their likely fix locations. In this work, we augment language agents with such memory by leveraging a repository's *commit history* - a rich yet underutilized resource that chronicles the codebase's evolution. We introduce tools that allow the agent to retrieve from a non-parametric memory encompassing recent historical commits and linked issues, as well as functionality summaries of actively evolving parts of the codebase identified via commit patterns. We demonstrate that augmenting such a memory can significantly improve LocAgent, a state-of-the-art localization framework, on both SWE-bench-verified and the more recent SWE-bench-live benchmarks. Our research contributes towards developing agents that can accumulate and leverage past experience for long-horizon tasks, more closely emulating the expertise of human developers.
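Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Turn-Level Reward #Search Agent #Agentic RL
TL;DR: We propose IGPO, a simple and effective reinforcement learning framework with turn-level reward for training multi-turn search agents.
🎯 Motivation: LLM-based agents are increasingly trained with reinforcement learning to improve multi-turn reasoning and knowledge acquisition in search settings, but existing rewards are too sparse to support multi-turn training well.
❓ Problem: Outcome-based rewards are sparse, which aggravates advantage collapse, makes fine-grained credit assignment difficult, and lowers sample efficiency, especially on long trajectories.
🔍 Analysis: When all rollouts receive identical rewards they provide no useful learning signal; the correctness of intermediate turns is obscured; and a single outcome signal per rollout limits data utilization.
🛠️ Method: Information Gain-based Policy Optimization (IGPO) defines turn-level rewards from the model's own belief updates and combines them with outcome supervision into dense reward signals, without external reward models or costly Monte Carlo estimation.
📊 Data and experiments: Experiments on in-domain and out-of-domain benchmarks show that IGPO consistently outperforms strong baselines in multi-turn scenarios, with higher accuracy and better data efficiency.
⭐ Contributions: A policy optimization framework with information-gain-based turn-level rewards that improves the reward mechanism for multi-turn interaction, raises data efficiency and final performance, and ships with public code.
Abstract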
Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments through tool use, particularly in search-based settings that require multi-turn reasoning and knowledge acquisition. However, existing approaches typically rely on outcome-based rewards that are provided only upon generating the final answer. This reward sparsity becomes particularly problematic in multi-turn settings, where long trajectories exacerbate three critical issues: (i) advantage collapse, where all rollouts receive identical rewards and provide no useful learning signals; (ii) lack of fine-grained credit assignment, where the correctness of intermediate turns is obscured, especially in long-horizon tasks; and (iii) poor sample efficiency, where each rollout yields only a single outcome signal, leading to low data utilization. In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training. IGPO models each interaction turn as an incremental process of acquiring information about the ground truth, and defines turn-level rewards as the marginal increase in the policy's probability of producing the correct answer. Unlike prior process-level reward approaches that depend on external reward models or costly Monte Carlo estimation, IGPO derives intrinsic rewards directly from the model's own belief updates. These intrinsic turn-level rewards are combined with outcome-level supervision to form dense reward signals. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that IGPO consistently outperforms strong baselines in multi-turn scenarios, achieving higher accuracy and improved data efficiency. Our code is available at https://github.com/GuoqingWang1/IGPO.
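The turn-level reward defined in this abstract (the marginal increase in the policy's probability of the correct answer) can be written down in a few lines. The sketch below assumes those probabilities have already been measured after each turn; the function name `turn_rewards` and the way the outcome reward is appended for the final answer turn are illustrative choices, not the released IGPO implementation.

```python
def turn_rewards(answer_probs, outcome_reward):
    """Build a dense reward sequence for one multi-turn rollout.

    answer_probs: answer_probs[0] is the policy's probability of the
        ground-truth answer before any interaction; answer_probs[t] is the
        same probability after intermediate turn t.
    outcome_reward: reward for the final answer turn (e.g. 1.0 if correct).
    """
    # Information gain of each intermediate turn: how much that turn raised
    # the probability assigned to the correct answer.
    gains = [
        answer_probs[t] - answer_probs[t - 1]
        for t in range(1, len(answer_probs))
    ]
    # The last (answer) turn is supervised by the outcome reward.
    return gains + [outcome_reward]
```

A turn that retrieves a misleading document lowers the probability of the correct answer and therefore receives a negative reward, so two rollouts with the same final outcome can still yield different learning signals.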
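Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#multi-turn #RL #GPU kernel #code generation
🎯 Motivation: Writing GPU kernels is critical for AI system efficiency, yet it is difficult and highly iterative; because correctness and speedup are verifiable rewards, it is a natural setting for reinforcement learning (RL).
❓ Problem: Incorporate the iterative nature of CUDA kernel generation and optimization into training while handling practical challenges such as learning from long trajectories and attributing reward across turns.
🔍 Analysis: Multi-turn refinement improves both the correctness and the speed of generated kernels, with clear gains over the base model and frontier models; scaling serial refinement helps more than parallel sampling.
🛠️ Method: A flexible multi-turn RL recipe, used to train Kevin, that handles long trajectories and cross-turn reward attribution for step-by-step kernel optimization.
📊 Data and experiments: Relative to the base model QwQ-32B, correctness of generated pure-CUDA kernels rises from 56% to 82% and mean speedup over the PyTorch Eager baseline rises from 0.53x to 1.10x, surpassing o4-mini (0.78x).
⭐ Contributions: Kevin is the first model trained with multi-turn RL for CUDA kernel generation and optimization; it improves correctness and speed and shows that serial refinement scales better than parallel sampling at test time.
Abstract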
Writing GPU kernels is a challenging task and critical for AI systems' efficiency. It is also highly iterative: domain experts write code and improve performance through execution feedback. Moreover, it presents verifiable rewards like correctness and speedup, making it a natural environment to apply Reinforcement Learning (RL). To explicitly incorporate the iterative nature of this process into training, we develop a flexible multi-turn RL recipe that addresses unique challenges encountered in real-world settings, such as learning from long trajectories and effective reward attribution across turns. We present Kevin - K(ernel D)evin, the first model trained with multi-turn RL for CUDA kernel generation and optimization. In our evaluation setup, Kevin shows significant gains over its base model (QwQ-32B), improving correctness of generated kernels (in pure CUDA) from 56% to 82% and mean speedup from 0.53x to 1.10x of baseline (PyTorch Eager), and surpassing frontier models like o4-mini (0.78x). Finally, we study its behavior across test-time scaling axes: we found scaling serial refinement more beneficial than parallel sampling. In particular, when given more refinement turns, Kevin shows a higher rate of improvement.
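Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#coder LLM #Agentless #SWE-Agent #Reinforcement Learning
TL;DR: We present Kimi-Dev that obtains 60.4% pass rate on SWE-bench Verified with Agentless training, and demonstrate it provides strong skill priors that enable efficient and effective SWE-Agent adaptation.
🎯 Motivation: LLMs are increasingly applied to software engineering, but solutions are split between multi-turn SWE-Agent frameworks and workflow-based Agentless methods with single-turn verifiable steps; how to connect the two is the central question.
❓ Problem: Combine Agentless training with SWE-Agent adaptation so that skill priors (localization, code editing, self-reflection) make coding agents more efficient and effective.
🔍 Analysis: Reasoning-intensive Agentless training induces structured skill priors that support SWE-Agent adaptation and transfer across frameworks.
🛠️ Method: Curate an Agentless training recipe and release the open-source model Kimi-Dev, trained on reasoning-intensive single-turn steps, then adapt it to the SWE-Agent setting with additional SFT on 5k publicly available trajectories.
📊 Data and experiments: Kimi-Dev achieves 60.4% on SWE-bench Verified, the best among workflow approaches, and reaches 48.6% pass@1 as a SWE-Agent, on par with Claude 3.5 Sonnet (241022 version).
⭐ Contributions: Shows that skill priors from Agentless training can bridge workflow and agentic frameworks, and provides Kimi-Dev as a strong starting point for adapting software engineering agents efficiently.
Abstract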
Large Language Models (LLMs) are increasingly applied to software engineering (SWE), with SWE-bench as a key benchmark. Solutions are split into SWE-Agent frameworks with multi-turn interactions and workflow-based Agentless methods with single-turn verifiable steps. We argue these paradigms are not mutually exclusive: reasoning-intensive Agentless training induces skill priors, including localization, code edit, and self-reflection that enable efficient and effective SWE-Agent adaptation. In this work, we first curate the Agentless training recipe and present Kimi-Dev, an open-source SWE LLM achieving 60.4\% on SWE-bench Verified, the best among workflow approaches. With additional SFT adaptation on 5k publicly-available trajectories, Kimi-Dev powers SWE-Agents to 48.6\% pass@1, on par with that of Claude 3.5 Sonnet (241022 version). These results show that structured skill priors from Agentless training can bridge workflow and agentic frameworks for transferable coding agents.
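Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#multi-agent system #clinical reasoning #medical question answering
🎯 Motivation: In clinical practice, physicians defer decisions when information is insufficient in order to prevent misdiagnosis; existing LLMs instead tend to give overconfident answers in medical settings and lack an effective abstention mechanism.
❓ Problem: Conventional abstention methods rely on model self-assessment and cannot identify knowledge boundaries against external medical evidence; the paper develops a more systematic abstention strategy.
🔍 Analysis: Existing models struggle to recognize knowledge gaps when information is insufficient, mainly because they lack a structured way to explore and evaluate medical knowledge.
🛠️ Method: KnowGuard follows an investigate-before-abstain paradigm with two stages, evidence discovery and evidence evaluation: it explores medical knowledge systematically through knowledge graph expansion and direct retrieval, then ranks the evidence using patient context and conversation history.
📊 Data and experiments: Evaluated on open-ended multi-round clinical benchmarks that measure the trade-off between diagnostic accuracy and interaction efficiency; diagnostic accuracy improves by 3.93% with an average of 5.74 conversation turns.
⭐ Contributions: An abstention mechanism built on knowledge graph exploration that outperforms state-of-the-art abstention approaches in multi-round clinical reasoning.
Abstract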
In clinical practice, physicians refrain from making decisions when patient information is insufficient. This behavior, known as abstention, is a critical safety mechanism preventing potentially harmful misdiagnoses. Recent investigations have reported the application of large language models (LLMs) in medical scenarios. However, existing LLMs struggle with abstention, frequently providing overconfident responses despite incomplete information. This limitation stems from conventional abstention methods relying solely on model self-assessments, which lack systematic strategies to identify knowledge boundaries with external medical evidence. To address this, we propose **KnowGuard**, a novel *investigate-before-abstain* paradigm that integrates systematic knowledge graph exploration for clinical decision-making. Our approach consists of two key stages operating on a shared contextualized evidence pool: 1) an evidence discovery stage that systematically explores the medical knowledge space through graph expansion and direct retrieval, and 2) an evidence evaluation stage that ranks evidence using multiple factors to adapt exploration based on patient context and conversation history. This two-stage approach enables systematic knowledge graph exploration, allowing models to trace structured reasoning paths and recognize insufficient medical evidence. We evaluate our abstention approach using open-ended multi-round clinical benchmarks that mimic realistic diagnostic scenarios, assessing abstention quality through accuracy-efficiency trade-offs beyond existing closed-form evaluations. Experimental evidence clearly demonstrates that KnowGuard outperforms state-of-the-art abstention approaches, improving diagnostic accuracy by 3.93\% through effective diagnostic interactions averaging 5.74 conversation turns.
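Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#LLM deception #Long-horizon interaction
🎯 Motivation: Deception is pervasive in human communication and a growing concern for LLMs; existing studies are mostly confined to single-turn prompts and miss the deception that unfolds over long-horizon, task-driven interactions.
❓ Problem: Existing evaluations fail to capture dynamic context and changing trust; the paper introduces LH-Deception, a simulation framework for systematically quantifying LLM deception in long-horizon interactions.
🔍 Analysis: Deception varies by model, increases under event pressure, and consistently erodes the supervisor's trust; qualitative analysis reveals long-horizon phenomena such as "chains of deception" that single-turn evaluation cannot detect.
🛠️ Method: A multi-agent system with a performer agent that completes tasks, a supervisor agent that evaluates progress and tracks trust, and an independent deception auditor that reviews full trajectories to identify when and how deception occurs.
📊 Data and experiments: Extensive experiments on 11 frontier models, both closed-source and open-source, using sequences of interdependent tasks under dynamic contextual pressure.
⭐ Contributions: A foundation for systematically evaluating deception in long-horizon LLM interactions and for assessing future models in trust-sensitive settings.
Abstract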
Deception is a pervasive feature of human communication and an emerging concern in large language models (LLMs). While recent studies document instances of LLM deception, most evaluations remain confined to single-turn prompts and fail to capture the long-horizon interactions in which deceptive strategies typically unfold. We introduce a new simulation framework, LH-Deception, for a systematic, empirical quantification of deception in LLMs under extended sequences of interdependent tasks and dynamic contextual pressures. LH-Deception is designed as a multi-agent system: a performer agent tasked with completing tasks and a supervisor agent that evaluates progress, provides feedback, and maintains evolving states of trust. An independent deception auditor then reviews full trajectories to identify when and how deception occurs. We conduct extensive experiments across 11 frontier models, spanning both closed-source and open-source systems, and find that deception is model-dependent, increases with event pressure, and consistently erodes supervisor trust. Qualitative analyses further reveal emergent, long-horizon phenomena, such as "chains of deception", which are invisible to static, single-turn evaluations. Our findings provide a foundation for evaluating future LLMs in real-world, trust-sensitive contexts.
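Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Large Language Models #Evolutionary Optimization #AI for Science #Materials Discovery
TL;DR: LLM-guided Evolution for Materials design (LLEMA) is a unified evolutionary framework that couples the scientific knowledge embedded in large language models with chemistry-informed evolutionary rules and memory-based refinement.
🎯 Motivation: Materials discovery spans vast chemical and structural spaces and must satisfy multiple, often conflicting objectives, so it needs dedicated optimization methods.
❓ Problem: How to combine the scientific knowledge in LLMs with chemistry-informed evolutionary rules to optimize multi-objective materials discovery.
🔍 Analysis: By combining rule-guided generation, memory-based refinement, and surrogate prediction, LLEMA finds candidates that are chemically plausible, thermodynamically stable, and aligned with the target properties; ablations confirm each component matters.
🛠️ Method: In each iteration an LLM proposes candidate materials, a surrogate-augmented oracle estimates their properties, and a multi-objective scorer updates success/failure memories that guide subsequent generations.
📊 Data and experiments: Evaluated on 14 realistic tasks spanning electronics, energy, coatings, optics, and aerospace; LLEMA achieves higher hit rates and better Pareto front quality than generative and LLM-only baselines.
⭐ Contributions: A unified framework that couples LLMs with evolutionary rules and multi-objective optimization, improving the efficiency and quality of practical materials design.
Abstract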
Materials discovery requires navigating vast chemical and structural spaces while satisfying multiple, often conflicting, objectives. We present LLM-guided Evolution for MAterials discovery (**LLEMA**), a unified framework that couples the scientific knowledge embedded in large language models with chemistry-informed evolutionary rules and memory-based refinement. At each iteration, an LLM proposes crystallographically specified candidates under explicit property constraints; a surrogate-augmented oracle estimates physicochemical properties; and a multi-objective scorer updates success/failure memories to guide subsequent generations. Evaluated on **14 realistic tasks** that span electronics, energy, coatings, optics, and aerospace, LLEMA discovers candidates that are chemically plausible, thermodynamically stable, and property-aligned, achieving higher hit rates and improved Pareto front quality relative to generative and LLM-only baselines. Ablation studies confirm the importance of rule-guided generation, memory-based refinement, and surrogate prediction. By enforcing synthesizability and multi-objective trade-offs, LLEMA provides a principled approach to accelerating practical materials discovery. Project website: https://scientific-discovery.github.io/llema-project/
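Since the abstract reports Pareto front quality, a short reminder of what the Pareto front is may help. The sketch below filters a list of candidates down to the non-dominated ones, assuming every objective is to be maximized; the function name `pareto_front` and the tuple representation are hypothetical, and LLEMA's own scorer is not described in enough detail here to reproduce.

```python
def pareto_front(candidates):
    """Return the candidates that no other candidate dominates.

    candidates: list of equal-length tuples of objective values,
        all assumed to be maximized.
    """

    def dominates(a, b):
        # a dominates b if it is at least as good on every objective
        # and strictly better on at least one.
        return all(x >= y for x, y in zip(a, b)) and any(
            x > y for x, y in zip(a, b)
        )

    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]
```

For candidates scored as `(1, 5)`, `(2, 4)`, `(1, 4)`, and `(3, 1)`, only `(1, 4)` is dropped, because `(2, 4)` is at least as good on both objectives and better on the first. The remaining three represent different trade-offs between conflicting objectives.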
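Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#reinforcement learning #self-imitation learning #large language model #agentic learning #llm agents
🎯 Motivation: Reinforcement learning is central to training LLMs on long-horizon, sparsely rewarded agent tasks, but it faces the fundamental exploration-exploitation trade-off; existing methods encourage exploration by maximizing policy entropy, which can destabilize training under multi-turn distribution shift.
❓ Problem: Balance exploration and exploitation progressively, guided by the agent's own experience, without falling into entropy collapse or runaway divergence.
🔍 Analysis: A replay buffer of good experience supports off-policy updates, and curriculum scheduling steers policy entropy across stages, giving a smooth transition from early exploration of the environment to exploitation of successful tactics.
🛠️ Method: SPEAR, a self-imitation learning recipe that combines self-imitation with intrinsic reward shaping under a curriculum schedule; a strong baseline, Dr.BoT, is built from a bag of industrial RL optimization tricks.
📊 Data and experiments: SPEAR raises the success rates of GRPO/GiGPO/Dr.BoT by up to 16.1%/5.1%/8.6% on ALFWorld and 20.7%/11.8%/13.9% on WebShop, and boosts Dr.BoT by up to 3.8% on AIME24 and 6.1% on AIME25, at 10%–25% extra theoretical complexity and negligible runtime overhead.
⭐ Contributions: A plug-and-play recipe that improves RL stability and success rates across tasks and offers a new route to the exploration-exploitation problem.
Abstract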
Reinforcement learning (RL) is the dominant paradigm for sharpening strategic tool use capabilities of LLMs on long-horizon, sparsely-rewarded agent tasks, yet it faces a fundamental challenge of exploration-exploitation trade-off. Existing studies stimulate exploration through the lens of policy entropy, but such mechanical entropy maximization is prone to RL instability due to the multi-turn distribution shifting. In this paper, we target the progressive exploration-exploitation balance under the guidance of the agent's own experiences without succumbing to either entropy collapsing or runaway divergence. We propose SPEAR, a self-imitation learning (SIL) recipe for training agentic LLMs. It extends the vanilla SIL, where a replay buffer stores good experience for off-policy update, by gradually steering the policy entropy across stages. Specifically, the proposed curriculum scheduling harmonizes intrinsic reward shaping and self-imitation to 1) expedite exploration via frequent tool interactions at the beginning, and 2) strengthen exploitation of successful tactics upon convergence towards familiarity with the environment. We also combine bag-of-tricks of industrial RL optimizations for a strong baseline Dr.BoT to demonstrate our effectiveness. In ALFWorld and WebShop, SPEAR increases the success rates of GRPO/GiGPO/Dr.BoT by up to 16.1\%/5.1\%/8.6\% and 20.7\%/11.8\%/13.9\%, respectively. In AIME24 and AIME25, SPEAR boosts Dr.BoT by up to 3.8\% and 6.1\%, respectively. Such gains incur only 10\%–25\% extra theoretical complexity and negligible runtime overhead in practice, demonstrating the plug-and-play scalability of SPEAR.
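Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#RL #reasoning #LLM #tool use #prompting
TL;DR: We introduce the Conductor, a new kind of language model trained with reinforcement learning to automatically discover powerful coordination strategies among LLMs
🎯 Motivation: LLMs from different providers excel in different domains, but there is no systematic way to coordinate their capabilities.
❓ Problem: Train a Conductor model with reinforcement learning to discover effective coordination strategies among LLMs automatically and achieve stronger reasoning than any single model.
🔍 Analysis: The Conductor learns to design targeted communication topologies, allocate work among LLMs, and write focused prompts that draw out each model's individual capabilities.
🛠️ Method: Train the Conductor with reinforcement learning over randomized agent pools so that it adapts to arbitrary sets of agents, and allow it to select itself as a worker, which produces recursive topologies and improves test-time performance.
📊 Data and experiments: On challenging reasoning benchmarks such as LiveCodeBench and GPQA, a 7B Conductor exceeds every individual worker and adapts robustly to both open- and closed-source LLMs.
⭐ Contributions: Shows that coordination among LLMs can be learned through RL and introduces recursive topologies as a form of dynamic test-time scaling.
Abstract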
Powerful large language models (LLMs) from different providers have been expensively trained and finetuned to specialize across varying domains. In this work, we introduce a new kind of Conductor model trained with reinforcement learning to automatically discover powerful coordination strategies among LLMs. Our Conductor learns not only to design targeted communication topologies for effective agent-to-agent collaboration, but also to prompt engineer focused instructions to the LLMs to maximally leverage their individual capabilities. We show that, by learning optimal coordination strategies over pools of powerful worker LLMs, a 7B Conductor achieves significant performance gains beyond any individual worker, attaining state-of-the-art results in challenging reasoning benchmarks, such as LiveCodeBench and GPQA. By training with randomized agent pools, our conductor effectively adapts to arbitrary sets of open- and closed-source agents, meeting any user requirements. Furthermore, allowing the Conductor to select itself as a worker gives rise to recursive topologies, elevating performance with a new form of dynamic test-time scaling through online iterative adaptation.
More broadly, ours is among the early works demonstrating that language model coordination can be unlocked through RL, where powerful coordination strategies emerge naturally in LLMs through pure end-to-end reward maximization.
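Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Verifiers #Verification #Digital Agents #Web Agents #GUI Agents #Robotics #Large Language Models #Test Time Scaling #WebArena #OSWorld #Reward Models #open-endedness #LLMs-as-judges #Vision Language Models
🎯 Motivation: Verifiers have driven AI progress in math, code, and games, but in open-ended domains such as computer use it remains hard to turn human intuition into scalable rules. Multimodal LLMs (MLLMs), with their world knowledge, human-preference alignment, and reasoning ability, are a promising solution.
❓ Problem addressed: Tackles "agreement bias" in MLLM verifiers, that is, their tendency to over-validate agent behavior. The bias is pervasive and harms methods that rely on MLLM evaluations, such as filtered behavior cloning and self-improvement.
🔍 Observations: MLLMs show a systematic agreement bias when evaluating web navigation, computer use, and robotics tasks. The bias appears across model families and evaluation templates and persists under test-time scaling.
🛠️ Method: Proposes Self-Grounded Verification (SGV), a lightweight two-step method. The MLLM first generates broad priors about the desired behavior; it then reasons over and evaluates the candidate trajectory conditioned on those self-generated priors, making better use of its knowledge, alignment, and reasoning.
📊 Data and experiments: Covers 13+ model families and 28+ evaluation templates, using trajectories of varying lengths from diverse agents, across web navigation, computer use, and robotics. An updated VisualWebArena and a reduced subset are released.
⭐ Main contributions: Identifies agreement bias in MLLM verifiers; proposes SGV, which improves failure detection by up to 25pp and accuracy by 14pp across models and environments; surpasses the previous best results on several tasks through self-improvement and online supervision; releases an updated benchmark with strong baselines, more human-aligned evaluators, and efficient parallelism.
Full abstract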
Verifiers—functions assigning rewards to agent behavior—have been key to AI progress in domains such as math, code, and games. However, extending these gains to domains without clear-cut success criteria (e.g., computer use) remains a challenge: while humans can recognize desired outcomes, translating this intuition into scalable rules is nontrivial. Multimodal LLMs (MLLMs) emerge as a promising solution, given vast world knowledge, human-preference alignment, and reasoning capabilities. We evaluate MLLMs as verifiers across web navigation, computer use, and robotics, spanning 13+ model families, 28+ evaluation templates, curated trajectories from diverse agents and of varying lengths, and distinct verifier applications. We identify a critical limitation: a strong tendency for MLLMs to over-validate agent behavior—a phenomenon we term agreement bias. This bias is pervasive across models, resilient to test-time scaling, and can harm methods relying on MLLM evaluations, such as filtered behavior cloning and self-improvement. We provide guidance on the design and evaluation of MLLM verifiers, and introduce Self-Grounded Verification (SGV), a lightweight method that harnesses MLLMs' own sampling mechanisms by modulating (un)conditional generation to better leverage their knowledge, alignment, and reasoning. SGV operates in two steps: first, the MLLM is elicited to generate broad priors about desired behavior, independent of the data under evaluation. Then, conditioned on self-generated priors, it reasons over and evaluates a candidate trajectory. Our methods yield gains across models and environments, improving failure detection by up to 25pp and accuracy by 14pp, with benefits extending to downstream applications. In self-improvement and online supervision, SGV boosts task completion of a GUI specialist in OSWorld, a diffusion policy in robomimic, and a ReAct agent in VisualWebArena—setting a new state of the art, surpassing the previous best by 20pp. 
Finally, we release an updated version of VisualWebArena featuring strong agent baselines, more human-aligned evaluators, high-fidelity environment parallelism, runtime speedups exceeding 10x, and VisualWebArena-Lite, a 1/3-scale subset with comparable evaluation fidelity. Our code, models, and data are publicly available at [our project page](https://mshalimay.github.io/agreement-bias-sgv/).
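The two SGV steps described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate` stands in for any text-in/text-out MLLM call, and the prompt wording and the SUCCESS/FAILURE answer format are assumptions.

```python
from typing import Callable


def self_grounded_verify(task: str, trajectory: str,
                         generate: Callable[[str], str]) -> bool:
    """Two-step verification: elicit priors first, then judge conditioned on them."""
    # Step 1: priors about desired behavior, generated WITHOUT seeing the trajectory.
    prior_prompt = (
        f"Task: {task}\n"
        "Describe what a successful completion of this task looks like."
    )
    priors = generate(prior_prompt)

    # Step 2: evaluate the candidate trajectory conditioned on the self-generated priors.
    verdict_prompt = (
        f"Task: {task}\n"
        f"Expected behavior:\n{priors}\n"
        f"Trajectory:\n{trajectory}\n"
        "Does the trajectory satisfy the expected behavior? "
        "Answer SUCCESS or FAILURE."
    )
    verdict = generate(verdict_prompt)
    return verdict.strip().upper().startswith("SUCCESS")
```

Keeping the trajectory out of the first prompt is the point of the design: the priors cannot be anchored on what the agent actually did.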
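Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#large language model #LLM memory
🎯 Motivation: Large language models (LLMs) struggle to use historical interaction information in dynamic, complex environments. Memory systems extend their stateless interactions by storing, retrieving, and using persistent information.
❓ Problem addressed: Existing memory systems improve information handling but usually incur high time and compute costs, so a design that balances performance and efficiency is needed.
🔍 Observations: When the interaction history is complex, the resource cost of conventional memory systems limits their adoption and creates a clear bottleneck in inference efficiency.
🛠️ Method: Proposes LightMem, inspired by the Atkinson–Shiffrin model of human memory. It organizes memory into three stages (sensory, short-term, and long-term) that provide lightweight compression, topic-based organization, and offline updates.
📊 Data and experiments: On LongMemEval with GPT and Qwen backbones, LightMem improves accuracy over strong baselines by up to 10.9% while cutting token usage by up to 117×, API calls by up to 159×, and runtime by more than 12×.
⭐ Main contributions: Presents a lightweight, efficient memory system that improves how LLMs use information in dynamic environments, offers a new staged memory design, and plans to release the code.
Full abstract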
Despite their remarkable capabilities, Large Language Models (LLMs) struggle to effectively leverage historical interaction information in dynamic and complex environments. Memory systems enable LLMs to move beyond stateless interactions by introducing persistent information storage, retrieval, and utilization mechanisms. However, existing memory systems often incur substantial time and computational overhead. To this end, we introduce a new memory system called LightMem, which strikes a balance between the performance and efficiency of memory systems. Inspired by the Atkinson–Shiffrin model of human memory, LightMem organizes memory into three complementary stages. First, cognition-inspired sensory memory rapidly filters irrelevant information through lightweight compression and groups information by topic. Next, topic-aware short-term memory consolidates these topic-based groups, organizing and summarizing content for more structured access. Finally, long-term memory with sleep-time update employs an offline procedure that decouples consolidation from online inference.
Experiments on LongMemEval with GPT and Qwen backbones show that LightMem outperforms strong baselines in accuracy (up to 10.9% gains) while reducing token usage by up to 117×, API calls by up to 159×, and runtime by over 12×. Code will be released on GitHub.
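Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Large Language Model #Self-play #Multi-Agent System #Strategic Games
TL;DR: We propose a self-play framework for enhancing general multi-agent capabilities of LLMs via reinforcement learning on strategic games.
🎯 Motivation: LLMs that can cooperate and compete in multi-agent systems are a key step toward more advanced intelligence. Reinforcement learning works well for single-agent tasks, but long-horizon credit assignment and agent-specific advantage estimation remain open challenges in multi-agent settings.
❓ Problem addressed: Handles long-horizon credit assignment in multi-turn, multi-agent scenarios, together with agent-specific advantage estimation, to strengthen cooperative and competitive ability.
🔍 Observations: After self-play training, performance in held-out strategic games improves by up to 28.7%, and the acquired ability generalizes beyond games to reasoning benchmarks.
🛠️ Method: Proposes MARSHAL, an end-to-end RL framework that trains through self-play in both cooperative and competitive games, using a turn-level advantage estimator and agent-specific advantage normalization.
📊 Data and experiments: Trains Qwen3-4B with self-play and evaluates on reasoning benchmarks including AIME and GPQA-Diamond. Zero-shot gains reach up to 10.0% on AIME, 7.6% on GPQA-Diamond, and 3.5% on average across all benchmarks.
⭐ Main contributions: Shows that self-play in strategic games substantially improves LLM reasoning in multi-agent settings, and opens a direction for studying generalization in multi-agent tasks.
Full abstract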
Developing Large Language Models (LLMs) to cooperate and compete effectively within multi-agent systems (MASs) is a critical step towards more advanced intelligence. While reinforcement learning (RL) has proven effective for enhancing reasoning in single-agent tasks, its extension to multi-turn, multi-agent scenarios remains underexplored due to the challenges of long-horizon credit assignment and agent-specific advantage estimation. To address these challenges, we introduce **MARSHAL**, an end-to-end RL framework that incentivizes **M**ulti-**A**gent **R**easoning through **S**elf-play wit**H** str**A**tegic **L**LMs in both cooperative and competitive games. MARSHAL features a turn-level advantage estimator that aligns learning signals with each interaction for credit assignment, and an agent-specific advantage normalization to stabilize multi-agent training. By learning with self-play across cooperative and competitive games, MARSHAL agents trained from Qwen3-4B develop strong strategic abilities, with up to $28.7$\% performance improvements in held-out games. More importantly, the capability acquired through self-play generalizes beyond games, yielding consistent performance gains of MASs in reasoning benchmarks. When integrated into leading MASs, our MARSHAL agent achieves significant zero-shot performance gains of up to $10.0$\% on AIME, $7.6$\% on GPQA-Diamond, and $3.5$\% on average across all benchmarks. These results establish self-play in strategic games as a powerful approach for developing generalizable multi-agent reasoning capabilities in LLMs.
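The agent-specific advantage normalization mentioned above can be illustrated with a small sketch. It assumes the common mean/standard-deviation form applied separately per agent; the abstract does not give MARSHAL's exact estimator, so the function name, the `(agent_id, advantage)` input layout, and the `eps` term are illustrative choices.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_per_agent(turns, eps=1e-8):
    """Normalize turn-level advantages separately for each agent.

    turns: list of (agent_id, raw_advantage), one entry per turn.
    Returns the normalized advantages in the original turn order.
    """
    by_agent = defaultdict(list)
    for agent, adv in turns:
        by_agent[agent].append(adv)

    # Per-agent statistics, so agents with different reward scales do not
    # dominate each other's learning signal.
    stats = {agent: (mean(vals), pstdev(vals)) for agent, vals in by_agent.items()}

    return [(adv - stats[agent][0]) / (stats[agent][1] + eps) for agent, adv in turns]
```

For example, with agent A receiving raw advantages 1 and 3 and agent B receiving 10 and 30, both agents end up with normalized values close to -1 and +1, even though B's raw scale is ten times larger.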
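Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Large Language Model #Multi-Agent #Reinforcement Learning
🎯 Motivation: Explores how to train and run multi-agent LLM systems efficiently, addressing the performance and scalability limits of single-agent systems.
❓ Problem addressed: Designs a framework that supports efficient multi-agent interaction and policy training while improving inference efficiency and task collaboration.
🔍 Observations: After convergence, multi-agent systems outperform single-agent systems under the same inference budget.
🛠️ Method: Proposes the MARTI framework, with centralized multi-agent interaction, distributed policy training, and asynchronous multi-turn rollouts. It combines rule-based verifiable rewards with LLM-based generative rewards.
📊 Data and experiments: Extensive experiments on diverse mathematical tasks confirm the advantage of multi-agent systems on complex reasoning.
⭐ Main contributions: Provides a scalable framework that opens new research directions in collaboration and complex reasoning for multi-agent LLM systems.
Full abstract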
We present MARTI (Multi-Agent Reinforced Training and Inference), an open-source framework designed to facilitate scalable and efficient learning of multi-agent LLM systems. MARTI supports centralized multi-agent interactions and distributed policy training, with the added capability of multi-turn asynchronous rollouts to enhance training efficiency. The framework includes dynamic workflows for multi-agent interactions, which integrate both rule-based verifiable rewards and LLM-based generative rewards. We validate the effectiveness of MARTI through comprehensive experiments on diverse mathematical tasks, demonstrating that multi-agent LLM-based systems outperform single-agent systems within the same inference budget after convergence. Our contributions lay the foundation for exploring scalable collaborations within LLM-based multi-agent systems and advancing the capabilities of large reasoning models.
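Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Multi-Agent System #LLM Agent
TL;DR: We introduce MAS$^2$, a paradigm predicated on the principle of recursive self-generation: a multi-agent system that autonomously architects bespoke multi-agent systems for diverse problems.
🎯 Motivation: LLM-powered multi-agent systems are advancing quickly and their capacity for self-evolution draws growing attention, yet existing approaches cope poorly with the uncertainty of dynamic environments.
❓ Problem addressed: Existing automatic multi-agent systems mostly follow a "generate-once-and-deploy" pattern, which lacks the flexibility and robustness that complex problems require.
🔍 Observations: Conventional systems perform poorly in complex scenarios such as deep research and code generation, and generalize weakly across backbone models.
🛠️ Method: Proposes MAS$^2$, built around a "generator-implementer-rectifier" tri-agent team that composes and adaptively rectifies a target agent system. The meta-agents are trained with Collaborative Tree Optimization.
📊 Data and experiments: Across seven benchmarks, MAS$^2$ improves performance in complex scenarios by up to 19.6% and yields gains of up to 15.1% with previously unseen backbones, while staying on the Pareto frontier of the cost-performance trade-off.
⭐ Main contributions: Builds a recursively self-generating, self-configuring, and self-rectifying multi-agent system, offering a new paradigm for dynamic, complex tasks.
Full abstract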
The past two years have witnessed the meteoric rise of Large Language Model (LLM)-powered multi-agent systems (MAS), which harness collective intelligence and exhibit a remarkable trajectory toward self-evolution. This paradigm has rapidly progressed from manually engineered systems that require bespoke configuration of prompts, tools, roles, and communication protocols toward frameworks capable of automated orchestration. Yet, dominant automatic multi-agent systems, whether generated by external modules or a single LLM agent, largely adhere to a rigid *generate-once-and-deploy* paradigm, rendering the resulting systems brittle and ill-prepared for the dynamism and uncertainty of real-world environments.
To transcend this limitation, we introduce MAS$^2$, a paradigm predicated on the principle of recursive self-generation: a multi-agent system that autonomously architects bespoke multi-agent systems for diverse problems. Technically, we devise a "*generator-implementer-rectifier*" tri-agent team capable of dynamically composing and adaptively rectifying a target agent system in response to real-time task demands. Collaborative Tree Optimization is proposed to train and specialize these meta-agents. Extensive evaluation across seven benchmarks reveals that MAS$^2$ achieves performance gains of up to $19.6\%$ over state-of-the-art MAS in complex scenarios such as deep research and code generation. Moreover, MAS$^2$ exhibits superior cross-backbone generalization, effectively leveraging previously unseen LLMs to yield improvements of up to $15.1\%$. Crucially, these gains are attained without incurring excessive token costs, as MAS$^2$ consistently resides on the Pareto frontier of cost-performance trade-offs.
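Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Multimodal #RAG #Vision-Language #Agent #Benchmark
🎯 Motivation: As demand grows for step-wise, cross-modal, knowledge-grounded reasoning, multimodal LLMs are moving from the fixed retrieve-then-generate paradigm toward agentic retrieval-augmented generation. Existing benchmarks focus on simple QA with short retrieval chains and do not evaluate adaptive planning or multimodal reasoning in depth.
❓ Problem addressed: Presents MC-Search, the first benchmark for agentic multimodal RAG with long, structured reasoning chains. It annotates chains across five reasoning structures and verifies them hop by hop against evidence, filling the gap in evaluating multi-step adaptive planning.
🔍 Observations: Benchmarking six leading multimodal models reveals systematic issues, including over- and under-retrieval and modality-misaligned planning, which points to bottlenecks in planning accuracy and retrieval fidelity.
🛠️ Method: Builds a chain-annotated dataset in which each example specifies sub-questions, retrieval modalities, supporting facts, and intermediate answers. Also proposes Search-Align, a process-supervised fine-tuning framework that uses verified reasoning chains to improve planning and retrieval in open-source models.
📊 Data and experiments: MC-Search contains 3,333 high-quality examples averaging 3.7 hops and introduces process-level metrics. Models are evaluated through a unified agentic pipeline, and fine-tuning with Search-Align improves planning and retrieval fidelity.
⭐ Main contributions: Establishes the first agentic multimodal RAG benchmark with long annotated chains, introduces hop-wise evidence verification (HAVE) and process-level evaluation, and develops a process-supervised fine-tuning method.
Full abstract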
With the increasing demand for step-wise, cross-modal, and knowledge-grounded reasoning, multimodal large language models (MLLMs) are evolving beyond the traditional fixed retrieve-then-generate paradigm toward more sophisticated agentic multimodal retrieval-augmented generation (MM-RAG). Existing benchmarks, however, mainly focus on simplified QA with short retrieval chains, leaving adaptive planning and multimodal reasoning underexplored. We present MC-Search, the first benchmark for agentic MM-RAG with long, step-wise annotated reasoning chains spanning five representative reasoning structures. Each example specifies sub-questions, retrieval modalities, supporting facts, and intermediate answers, with fidelity ensured by HAVE (Hop-wise Attribution and Verification of Evidence), resulting in 3,333 high-quality examples averaging 3.7 hops. Beyond answer accuracy, MC-Search introduces new process-level metrics for reasoning quality, stepwise retrieval and planning accuracy. By developing a unified agentic MM-RAG pipeline, we benchmark six leading MLLMs and reveal systematic issues such as over- and under-retrieval and modality-misaligned planning. Finally, we introduce Search-Align, a process-supervised fine-tuning framework leveraging verified reasoning chains, showing that our data not only enables faithful evaluation but also improves planning and retrieval fidelity in open-source MLLMs.
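Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#deep research #reasoning #context compression
TL;DR: We propose an RL-based learning framework that enables interactive agents to solve long-horizon tasks with constant context.
🎯 Motivation: On multi-turn, long-horizon tasks, the context of modern language agents grows without bound, which raises compute cost and degrades reasoning, especially on out-of-distribution input lengths.
❓ Problem addressed: How to let an agent interact and solve long-horizon tasks efficiently at a near-constant context size, through memory consolidation and efficient reasoning.
🔍 Observations: Existing methods rely on full-context prompting and append every past turn regardless of relevance, so memory grows without bound and performance drops.
🛠️ Method: Proposes MEM1, a reinforcement learning framework that updates a compact internal state at every turn, jointly supporting memory consolidation and reasoning. Rollout trajectory truncation is used to train the integration of prior memory with new observations.
📊 Data and experiments: Evaluated on internal retrieval QA, open-domain web QA, and multi-turn web shopping. On an augmented multi-hop QA dataset, MEM1-7B improves performance by 3.5× and reduces memory usage by 3.7× relative to Qwen2.5-14B-Instruct.
⭐ Main contributions: Shows that reasoning-driven memory consolidation is effective for long-horizon tasks and offers a scalable way to train agents that solve tasks over many interactions.
Full abstract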
Modern language agents often need to solve long-horizon tasks requiring multiple turns of interaction with the environment, where they retrieve external information, adapt to observations, and answer interdependent queries. Yet, most LLM systems rely on full-context prompting, appending all past turns regardless of their relevance. This leads to unbounded memory growth, increased computational costs, and degraded reasoning performance on out-of-distribution input lengths because the LLM forgets the context. We introduce MEM1, an end-to-end reinforcement learning framework that enables agents to operate with near-constant context size when solving long-horizon tasks. At each turn, MEM1 updates a compact shared internal state that jointly supports memory consolidation and reasoning. Leveraging reinforcement learning (RL) and rollout trajectory truncation, we train a MEM1 agent to develop internal states that integrate prior memory with new observations from the environment while strategically discarding irrelevant or redundant information. Experiments across three domains, including internal retrieval QA, open-domain web QA, and multi-turn web shopping, show that MEM1-7B improves performance by 3.5$\times$ while reducing memory usage by 3.7$\times$ compared to Qwen2.5-14B-Instruct on an augmented multi-hop QA dataset with 16 objectives in each task, and generalizes beyond the training horizon. Our results demonstrate the promise of reasoning-driven memory consolidation as a scalable alternative to existing solutions for training long-horizon task-solving agents that involve multiple interactions, where both efficiency and performance are optimized. Code can be found at https://github.com/MIT-MI/MEM1.
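To make the near-constant-context idea concrete, here is a minimal sketch of the interaction loop: each turn's prompt holds only the question, the compact state, and the newest observation, never the full history. In MEM1 the state update is learned with RL; here `update_state` and `env_step` are hypothetical callables supplied by the caller, and prompt size is approximated by character count.

```python
from typing import Callable, List, Tuple


def run_episode(question: str,
                env_step: Callable[[str], str],
                update_state: Callable[[str, str, str], str],
                max_turns: int) -> Tuple[str, List[int]]:
    """Run a multi-turn episode while carrying only a compact internal state."""
    state = ""                      # replaces the ever-growing turn history
    context_sizes: List[int] = []
    for _ in range(max_turns):
        observation = env_step(state)
        # The prompt for this turn: question + compact state + newest observation.
        context_sizes.append(len(question) + len(state) + len(observation))
        # Consolidate: fold the observation into the state, discarding the rest.
        state = update_state(question, state, observation)
    return state, context_sizes
```

If `update_state` keeps the state bounded, the per-turn prompt size stays bounded no matter how many turns the episode runs, which is the property full-context prompting lacks.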
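Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Think with images #Medical visual reasoning #Medical VQA #Agentic reinforcement learning
🎯 Motivation: Current medical vision-language models rely mostly on text-only reasoning and make poor use of visual evidence.
❓ Problem addressed: Proposes MedVR, a reinforcement learning framework for annotation-free medical visual reasoning that improves reliability and transparency.
🔍 Observations: The text-only paradigm leads to weak fine-grained visual analysis and a risk of visual hallucination.
🛠️ Method: Introduces Entropy-guided Visual Regrounding (EVR) and Consensus-based Credit Assignment (CCA), two mechanisms that work together to support learning of visual reasoning.
📊 Data and experiments: Evaluated on several public medical VQA benchmarks, where MedVR reaches state-of-the-art performance without annotations for intermediate steps.
⭐ Main contributions: Advances the robustness and interpretability needed for clinical deployment of medical AI.
Full abstract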
Medical Vision-Language Models (VLMs) hold immense promise for complex clinical tasks, but their reasoning capabilities are often constrained by text-only paradigms that fail to ground inferences in visual evidence. This limitation not only curtails performance on tasks requiring fine-grained visual analysis but also introduces risks of visual hallucination in safety-critical applications. Thus, we introduce MedVR, a novel reinforcement learning framework that enables annotation-free visual reasoning for medical VLMs. Its core innovation lies in two synergistic mechanisms: Entropy-guided Visual Regrounding (EVR) uses model uncertainty to direct exploration, while Consensus-based Credit Assignment (CCA) distills pseudo-supervision from rollout agreement. Without any human annotations for intermediate steps, MedVR achieves state-of-the-art performance on diverse public medical VQA benchmarks, significantly outperforming existing models. By learning to reason directly with visual evidence, MedVR promotes the robustness and transparency essential for accelerating the clinical deployment of medical AI.
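Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Agent Memory #Latent Reasoning #LLM Agent
🎯 Motivation: Memory in current LLM agents is limited to parameter updates or external databases, and neither captures the tight interweaving of reasoning and memory found in human cognition. The work aims at a memory system closer to human cognition that strengthens an agent's ability to self-evolve.
❓ Problem addressed: Overcomes the separation of reasoning and memory in conventional designs by building a framework that invokes and generates latent memory on its own, so the agent can augment memory during reasoning.
🔍 Observations: MemGen spontaneously develops human-like memory faculties, including planning memory, procedural memory, and working memory, which suggests a trend toward more naturalistic machine cognition.
🛠️ Method: Proposes MemGen, consisting of a memory trigger that monitors the reasoning state and a memory weaver that generates latent memory from the current state, producing a mutually reinforcing cycle of memory and cognition.
📊 Data and experiments: Across eight benchmarks, MemGen outperforms external memory systems such as ExpeL and AWM by up to 38.22%, exceeds GRPO by up to 13.44%, and shows strong cross-domain generalization.
⭐ Main contributions: Develops the dynamic generative memory framework MemGen; shows that human-like memory faculties can emerge without explicit supervision; improves agent performance across diverse scenarios.
Full abstract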
Agent memory shapes how Large Language Model (LLM)-powered agents, akin to the human brain, progressively refine themselves through environment interactions. Existing paradigms remain constrained: parametric memory forcibly adjusts model parameters, and retrieval-based memory externalizes experience into structured databases, yet neither captures the fluid interweaving of reasoning and memory that underlies human cognition. To address this gap, we propose MemGen, a dynamic generative memory framework that equips agents with a human-esque cognitive faculty. It consists of a *memory trigger*, which monitors the agent’s reasoning state to decide explicit memory invocation, and a *memory weaver*, which takes the agent's current state as stimulus to construct a latent token sequence as machine-native memory to enrich its reasoning. In this way, MemGen enables agents to recall and augment latent memory throughout reasoning, producing a tightly interwoven cycle of memory and cognition. Extensive experiments across eight benchmarks show that MemGen surpasses leading external memory systems such as ExpeL and AWM by up to $38.22\%$, exceeds GRPO by up to $13.44\%$, and exhibits strong cross-domain generalization ability. More importantly, we find that without explicit supervision, MemGen spontaneously evolves distinct human-like memory faculties, including planning memory, procedural memory, and working memory, suggesting an emergent trajectory toward more naturalistic forms of machine cognition.
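Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Large Language Model #Agent #Reinforcement Learning #Meta Learning
🎯 Motivation: RL-trained language agents perform poorly on tasks that require active exploration and trial-and-error learning, which limits their ability to adapt to complex environments.
❓ Problem addressed: Proposes a general Meta-RL framework that lets language agents explore actively at test time and learn from environment feedback, improving task adaptation.
🔍 Observations: RL-trained agents explore inefficiently in multi-turn tasks and generalize weakly to new tasks.
🛠️ Method: Designs LaMer, which combines a cross-episode training framework that optimizes long-term reward with in-context policy adaptation via reflection, adjusting the policy without gradient updates.
📊 Data and experiments: On Sokoban, MineSweeper, and Webshop, LaMer improves over RL baselines by 11%, 14%, and 19%, respectively, and generalizes better to harder or unseen tasks.
⭐ Main contributions: Shows that Meta-RL can induce exploration in language agents, improving adaptation to new environments and generalization across tasks.
Full abstract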
Reinforcement learning (RL) has enabled the training of Large Language Model (LLM) agents to interact with the environment and to solve multi-turn long-horizon tasks. However, RL-trained agents often struggle in tasks that require active exploration and fail to adapt efficiently from trial-and-error experience. In this paper, we present LaMer, a general Meta-RL framework that enables LLM agents to actively explore and learn from environment feedback at test time. LaMer consists of two key components: (i) a cross-episode training framework to encourage exploration and long-term reward optimization; and (ii) in-context policy adaptation via reflection, allowing the agent to adapt its policy from task feedback signals without gradient updates. Experiments across diverse environments show that LaMer significantly improves performance over RL baselines, with 11\%, 14\%, and 19\% performance gains on Sokoban, MineSweeper and Webshop, respectively. Moreover, LaMer also demonstrates better generalization to more challenging or previously unseen tasks compared to the RL-trained agents. Overall, our results demonstrate that meta-reinforcement learning provides a principled approach to induce exploration in language agents, enabling more robust adaptation to novel environments through learned exploration strategies.
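Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#language agent #multi-agent system
🎯 Motivation: LLMs deployed as multiple collaborating agents excel at complex tasks, but designing the prompts and topologies of a multi-agent system is highly complex. The work analyzes the design space in depth to improve multi-agent system design.
❓ Problem addressed: Automates the design of effective multi-agent systems by optimizing prompts and topologies, improving interaction and collaboration among agents.
🔍 Observations: Prompts and topologies are the key factors behind effective multi-agent systems, yet existing methods do not optimize the two jointly.
🛠️ Method: Proposes Multi-Agent System Search (MASS), which interleaves prompt and topology optimization over three stages: block-level (local) prompt optimization, workflow topology optimization, and workflow-level (global) prompt optimization.
📊 Data and experiments: Extensive experiments show that MASS-optimized systems outperform existing alternatives by a substantial margin.
⭐ Main contributions: Reveals the central role of prompts and topologies in multi-agent system design; proposes the MASS framework and its optimization stages; distills design principles for effective multi-agent systems.
Full abstract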
Large language models, employed as multiple agents that interact and collaborate with each other, have excelled at solving complex tasks. The agents are programmed with prompts that declare their functionality, along with the topologies that orchestrate interactions across agents. Designing prompts and topologies for multi-agent systems (MAS) is inherently complex. To automate the entire design process, we first conduct an in-depth analysis of the design space aiming to understand the factors behind building effective MAS. We reveal that prompts together with topologies play critical roles in enabling more effective MAS design. Based on the insights, we propose Multi-Agent System Search (MASS), a MAS optimization framework that efficiently exploits the complex MAS design space by interleaving its optimization stages, from local to global, from prompts to topologies, over three stages: 1) block-level (local) prompt optimization; 2) workflow topology optimization; 3) workflow-level (global) prompt optimization, where each stage is conditioned on the iteratively optimized prompts/topologies from former stages. We show that MASS-optimized multi-agent systems outperform a spectrum of existing alternatives by a substantial margin. Based on the MASS-found systems, we finally propose design principles behind building effective multi-agent systems.
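Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#generalist agent #GUI agent #embodied agent #MoE
🎯 Motivation: Existing multimodal agents target either GUI or embodied scenarios, while real-world tasks often require interleaved interaction across both. To build a generalist agent that acts in both 2D and 3D worlds, the work studies how to train jointly on GUI and embodied data.
❓ Problem addressed: Mixing GUI and embodied data during training degrades performance. The two data types are synergistic at shallow layers and conflict at deep layers, so the conflict has to be resolved at the parameter level.
🔍 Observations: Data-mixing experiments show synergy between GUI and embodied data in shallow-layer representations and interference in deep-layer parameters, resembling the cerebrum-cerebellum mechanism in the human brain.
🛠️ Method: Proposes a Layer-heterogeneous MoE that separates parameters at deep layers to remove conflict and shares parameters at shallow layers to exploit synergy. It also unifies the action spaces of GUI and embodied tasks and builds a large multi-source training set.
📊 Data and experiments: Trained on a unified dataset drawn from diverse sources, OmniActor outperforms agents trained only on GUI or only on embodied data in their respective tasks, with the largest gains on GUI tasks.
⭐ Main contributions: Provides the first systematic account of layer-wise synergy and conflict between GUI and embodied data; proposes the Layer-heterogeneous MoE architecture; delivers a high-performance generalist agent for both 2D and 3D scenarios.
Full abstract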
Multimodal large language models are progressively advancing toward multimodal agents that can proactively execute tasks. Existing research on multimodal agents primarily targets either GUI or embodied scenarios, corresponding to interactions within the 2D virtual world and the 3D physical world, respectively. However, many real-world tasks inherently require agents to interleave interactions across both types of environments. We initially mix GUI and embodied data to train models, but find performance degradation caused by data conflicts. Further analysis reveals that GUI and embodied data exhibit synergy at shallow layers but conflict at deep layers, resembling the cerebrum-cerebellum mechanism in the human brain. To this end, we introduce a high-performance generalist agent, OmniActor, designed from both structural and data perspectives. First, we propose a Layer-heterogeneous MoE that separates parameters at deep layers to eliminate conflict, while sharing parameters at shallow layers to leverage synergy. This design enables OmniActor to outperform agents trained solely on GUI or embodied data in their respective tasks. Furthermore, we unify the action spaces of GUI and embodied tasks and collect large-scale datasets from diverse sources for training. This substantially enhances the performance of OmniActor across various scenarios, especially in GUI tasks.
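Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Proactive Agents #Uncertainty Estimation #Cost-Sensitive Learning #Adaptive Computation #Calibration #Knowledge Distillation
TL;DR: PRISM learns a cost-sensitive, decision-theoretic gate that proactively intervenes only when help is needed and welcome, triggering selective slow reasoning to cut false alarms, latency, and compute while boosting accuracy.
🎯 Motivation: Proactive agents must weigh when to intervene against whether the user will accept the help. Existing systems rely on brittle heuristics or indiscriminate long reasoning and offer little control over the benefit-burden trade-off.
❓ Problem addressed: Proposes a cost-sensitive selective intervention framework in which the agent intervenes only when help is needed and likely to be accepted, reducing false alarms, compute, and latency while improving precision.
🔍 Observations: With a gate and a dual-process reasoning architecture, resource-intensive slow reasoning is best concentrated on ambiguous or high-stakes cases near the decision boundary.
🛠️ Method: Designs a decision-theoretic gate and trains with gate-aligned distillation, yielding a dual-process framework in which the intervention gate is decoupled from the response policy, so control stays tunable and auditable.
📊 Data and experiments: On the open-source ProactiveBench benchmark, PRISM reduces false alarms by 22.78% and improves F1 by 20.14% over strong baselines.
⭐ Main contributions: Presents PRISM, a framework for precise, efficient, and controllable proactive agents built on decision-theoretic gating, selective slow reasoning, and aligned distillation, with code, models, and resources released.
Full abstract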
Proactive agents must decide not only what to say but also whether and when to intervene. Many current systems rely on brittle heuristics or indiscriminate long reasoning, which offers little control over the benefit-burden tradeoff. We formulate the problem as cost-sensitive selective intervention and present PRISM, a novel framework that couples a decision-theoretic gate with a dual-process reasoning architecture. At inference time, the agent intervenes only when a calibrated probability of user acceptance exceeds a threshold derived from asymmetric costs of missed help and false alarms. Inspired by festina lente (Latin: "make haste slowly"), we gate by an acceptance-calibrated, cost-derived threshold and invoke a resource-intensive Slow mode with counterfactual checks only near the decision boundary, concentrating computation on ambiguous and high-stakes cases. Training uses gate-aligned, schema-locked distillation: a teacher running the full PRISM pipeline provides dense, executable supervision on unlabeled interaction traces, while the student learns a response policy that is explicitly decoupled from the intervention gate to enable tunable and auditable control. On ProactiveBench, PRISM reduces false alarms by 22.78% and improves F1 by 20.14% over strong baselines. These results show that principled decision-theoretic gating, paired with selective slow reasoning and aligned distillation, yields proactive agents that are precise, computationally efficient, and controllable. To facilitate reproducibility, we release our code, models, and resources at https://prism-festinalente.github.io/; all experiments use the open-source ProactiveBench benchmark.
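The cost-derived threshold can be made concrete with the standard two-action decision rule. Intervening when the user would not accept costs `cost_false_alarm`; staying silent when the user would have accepted costs `cost_missed_help`. Intervening has the lower expected cost when `(1 - p) * cost_false_alarm < p * cost_missed_help`, that is, when `p > cost_false_alarm / (cost_false_alarm + cost_missed_help)`. The abstract does not state PRISM's exact formula, so this rule, the `margin` band that routes borderline cases to slow reasoning, and all names below are assumptions for illustration.

```python
def intervention_threshold(cost_false_alarm: float, cost_missed_help: float) -> float:
    """Break-even acceptance probability under asymmetric costs (assumed rule)."""
    return cost_false_alarm / (cost_false_alarm + cost_missed_help)


def gate(p_accept: float, cost_false_alarm: float, cost_missed_help: float,
         margin: float = 0.1) -> str:
    """Return 'intervene', 'stay_silent', or 'slow' for cases near the boundary.

    p_accept is assumed to be a calibrated probability that the user accepts help.
    margin is a hypothetical half-width of the band routed to slow reasoning.
    """
    tau = intervention_threshold(cost_false_alarm, cost_missed_help)
    if abs(p_accept - tau) <= margin:
        return "slow"          # ambiguous case: spend extra compute here
    return "intervene" if p_accept > tau else "stay_silent"
```

With a false alarm costing 1 and a missed help costing 3, the threshold is 0.25: a cheap false alarm relative to a costly miss makes the agent more willing to speak up.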
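Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Skill Induction #Agent #Polymorphism #Continual Learning #Large Language Models
TL;DR: We propose PolySkill, a framework that guides web agents to induce skills that generalize and transfer better across different websites, unlike existing methods that produce over-specialized, non-transferable skills.
🎯 Motivation: As LLMs are increasingly deployed in dynamic environments, existing agent learning methods generalize and transfer skills poorly, with particularly weak skill reuse across websites.
❓ Problem addressed: Proposes PolySkill, which uses polymorphic abstraction to overcome over-specialized skills that fail to generalize or transfer.
🔍 Observations: Existing methods tend to optimize skills for a single website, so the skills apply poorly in new environments and do not support continual learning.
🛠️ Method: Borrowing polymorphism from software engineering, decouples a skill's abstract goal (what it accomplishes) from its concrete implementation (how it is executed), giving a compositional skill-learning framework that generalizes better.
📊 Data and experiments: PolySkill improves skill reuse by 1.7× on seen websites, raises success rates by up to 9.4% on Mind2Web and 13.9% on unseen websites, and cuts steps by more than 20%. In self-exploration without specified tasks, it also improves the quality of proposed tasks and the generality of learned skills.
⭐ Main contributions: Introduces the separation of skill goal and execution, producing a skill-acquisition framework that supports continual learning and broad applicability and improves learning and transfer on the open web.
Full abstract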
Large language models (LLMs) are moving beyond static uses and are now powering agents that learn during their interaction with external environments. For example, agents can learn reusable skills while navigating web pages or toggling new tools. However, existing methods for skill learning often create skills that are over-specialized to a single website and fail to generalize.
We introduce PolySkill, a new framework that enables agents to learn generalizable and compositional skills. The core idea, inspired by polymorphism in software engineering, is to decouple a skill's abstract goal (*what* it accomplishes) and its concrete implementation (*how* it is executed). Experiments show that our method (1) improves skill reuse by 1.7x on seen websites and (2) boosts success rates by up to 9.4\% on Mind2Web and 13.9\% on unseen websites, while reducing steps by over 20\%. (3) In self-exploration settings without specified tasks, our framework improves the quality of proposed tasks and enables agents to learn generalizable skills that work across different sites.
By enabling the agent to identify and refine its own goals, PolySkill gives the agent a better curriculum, leading to the acquisition of more generalizable skills compared to baseline methods. Our findings show that separating a skill's goal from its execution is a crucial step toward developing autonomous agents that can learn and generalize across the open web continuously. Our code can be found at https://github.com/simonucl/PolySkill.
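The goal/implementation split borrowed from polymorphism can be pictured with an abstract base class. This is only an illustration of the software-engineering idea, not PolySkill's actual skill format: the class names, the two hypothetical sites, and the action strings are all invented for the example.

```python
from abc import ABC, abstractmethod
from typing import List


class SearchProduct(ABC):
    """Abstract skill: WHAT it accomplishes (find a product by query)."""

    @abstractmethod
    def run(self, query: str) -> List[str]:
        """Return the low-level actions that perform the search on one site."""


class ShopASearch(SearchProduct):
    """HOW the skill is executed on hypothetical site A."""

    def run(self, query: str) -> List[str]:
        return ["click #search-box", f"type {query!r}", "press Enter"]


class ShopBSearch(SearchProduct):
    """HOW the skill is executed on hypothetical site B."""

    def run(self, query: str) -> List[str]:
        return ["open menu", "click 'Find items'", f"type {query!r}", "click 'Go'"]


def buy_cheapest(search: SearchProduct, query: str) -> List[str]:
    """Composite skill written against the abstract goal, so it works on any site."""
    return search.run(query) + [
        "sort by price ascending",
        "open first result",
        "click 'Add to cart'",
    ]
```

Because `buy_cheapest` depends only on the abstract `SearchProduct`, supporting a new website means adding one implementation class rather than relearning the composite skill.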
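Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#computer-use agents #evaluation #benchmark #code-generation #multimodal
TL;DR: We introduce Programming with Pixels (PwP), an environment for evaluating computer-use agents (CUAs) for software engineering. Our evaluations on the 15-task PwP-Bench reveal current CUA limitations and potential directions for their improvement.
🎯 Motivation: Evaluations of computer-use agents (CUAs) have focused on simple scenarios, so it is unclear whether such generalist agents can automate complex, specialized work such as software engineering. The paper examines how capable generalist CUAs are on software engineering tasks.
❓ Problem addressed: To fill the lack of a comprehensive evaluation environment, the paper introduces Programming with Pixels (PwP), the first computer-use environment for software engineering, together with PwP-Bench, a benchmark spanning multiple modalities, programming languages, and skill sets.
🔍 Observations: With purely visual interaction, leading open-weight and closed-weight CUAs perform significantly worse than specialized coding agents. Given direct access to just two APIs, file editing and bash operations, the same CUAs improve sharply and often match specialized agents. Adding IDE tool APIs improves every model further.
🛠️ Method: Builds the PwP environment, in which agents visually control an IDE to carry out diverse software engineering tasks; assembles PwP-Bench from 15 existing and new tasks; evaluates state-of-the-art CUAs extensively and analyzes their bottlenecks.
📊 Data and experiments: Evaluates on the 15 tasks of PwP-Bench, covering multiple modalities, programming languages, and skill sets. Compares leading open- and closed-weight CUAs under different levels of API access (visual only, basic APIs, extended APIs).
⭐ Main contributions: Establishes PwP and PwP-Bench as the first comprehensive computer-use evaluation environment and benchmark for software engineering. Shows that current CUAs fall short mainly because of limited visual grounding and failure to exploit the rich environment, and identifies clear directions for improvement. Positions software engineering as a natural domain for testing whether generalist CUAs can reach specialist-level performance on complex tasks.
Full abstract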
Computer-use agents (CUAs) hold the promise of performing a wide variety of general tasks, but current evaluations have primarily focused on simple scenarios.
It therefore remains unclear whether such generalist agents can automate more sophisticated and specialized work such as software engineering (SWE).
To investigate this, we introduce Programming with Pixels (PwP), the first comprehensive computer-use environment for software engineering, where agents visually control an IDE to perform diverse software engineering tasks.
To enable holistic evaluation, we also introduce PwP-Bench, a benchmark of 15 existing and new software-engineering tasks spanning multiple modalities, programming languages, and skillsets.
We perform an extensive evaluation of state-of-the-art open-weight and closed-weight CUAs and find that when interacting purely visually, they perform significantly worse than specialized coding agents.
However, when the same CUAs are given direct access to just two APIs—file editing and bash operations—performance jumps, often reaching the levels of specialized agents despite having a task-agnostic design.
Furthermore, when given access to additional IDE tools via text APIs, all models show further gains.
Our analysis shows that current CUAs fall short mainly due to limited visual grounding and the inability to take full advantage of the rich environment, leaving clear room for future improvements.
PwP establishes software engineering as a natural domain for benchmarking whether generalist computer-use agents can reach specialist-level performance on sophisticated tasks.
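Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#spatial reasoning #agent #VLM
🎯 Motivation: Vision-language models (VLMs) face two bottlenecks in spatial reasoning: inadequate 3D understanding caused by pre-training on 2D data, and redundant 3D information that often interferes with reasoning.
❓ Problem addressed: The work proposes the MSSR framework, which constructs a minimal sufficient set of information to address both inadequate 3D perception and the reasoning failures caused by redundant information.
🔍 Observations: Spatial reasoning failures in existing VLMs stem mainly from the limited 3D understanding that 2D-centric training produces, and from redundant information in complex scenes disrupting the key reasoning path.
🛠️ Method: A dual-agent framework. A Perception Agent uses a toolbox, including the SOG module, to extract sufficient 3D information; a Reasoning Agent prunes redundant details and requests missing ones in a closed loop, producing the minimal sufficient set that is used to generate the answer.
📊 Data and experiments: Extensive experiments on two challenging benchmarks show that the method significantly improves accuracy while remaining highly interpretable, reaching state-of-the-art performance.
⭐ Contributions: First to propose the minimal sufficient set principle and implement it as the MSSR framework; the novel SOG module robustly aligns language with directions; the framework yields interpretable reasoning paths and a way to generate high-quality training data for future models.
View full abstract (Abstract)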
Spatial reasoning, the ability to ground language in 3D understanding, remains a persistent challenge for Vision-Language Models (VLMs). We identify two fundamental bottlenecks: *inadequate* 3D understanding capabilities stemming from 2D-centric pre-training, and reasoning failures induced by *redundant* 3D information.
To address these, we first construct a Minimal Sufficient Set (MSS) of information before answering a given question: a *compact* selection of 3D perception results from *expert models*. We introduce **MSSR** (Minimal Sufficient Spatial Reasoner), a dual-agent framework that implements this principle. A *Perception Agent* programmatically queries 3D scenes using a versatile perception toolbox to extract sufficient information, including a novel **SOG** (Situated Orientation Grounding) module that robustly extracts language-grounded directions. A *Reasoning Agent* then iteratively refines this information to pursue minimality, pruning redundant details and requesting missing ones in a closed loop until the MSS is curated.
Extensive experiments demonstrate that our method, by explicitly pursuing both sufficiency and minimality, significantly improves accuracy and achieves state-of-the-art performance across two challenging benchmarks. Furthermore, our framework produces interpretable reasoning paths, offering a promising source of high-quality training data for future models. Source code will be made publicly available.
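Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Multi-modal Embodied Agent #Unified Generative Model #Auto-Regressive World Model
🎯 Motivation: In complex open-world environments, existing agents either have only one of reasoning and imagination or integrate multiple modules inefficiently, which limits the learning efficiency and generalization of the policy.
❓ Problem addressed: Proposes RIG, the first end-to-end generalist policy that synergizes reasoning and imagination, to improve the policy's sample efficiency, generalization, and robustness.
🔍 Observations: Current methods do not explicitly model the inherent correlation among reasoning, action, and environment dynamics, so policy learning is inefficient and generalization is limited.
🛠️ Method: Builds a progressive data pipeline that integrates reasoning and imagination content into trajectories collected from existing agents, and trains end to end by jointly learning reasoning and next-image generation.
📊 Data and experiments: The dataset is built from trajectories collected by existing agents. Experiments show a more than 17x improvement in sample efficiency and validate generalization, robustness, and test-time scaling.
⭐ Contributions: First synergy of reasoning and imagination in an end-to-end policy; explicit modeling of the correlation between actions and environment dynamics; a reason, imagine, and self-correct inference mechanism that improves policy performance.
View full abstract (Abstract)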
Reasoning before action and imagining potential outcomes (i.e., world models) are essential for embodied agents operating in complex open-world environments. Yet, prior work either incorporates only one of these abilities in an end-to-end agent or integrates multiple specialized models into an agent system, limiting the learning efficiency and generalization of the policy.
Thus, this paper makes the first attempt to synergize Reasoning and Imagination in an end-to-end Generalist policy, termed RIG.
To train RIG in an end-to-end manner, we construct a data pipeline that progressively integrates and enriches the content of imagination and reasoning in the trajectories collected from existing agents. The joint learning of reasoning and next image generation explicitly models the inherent correlation between reasoning, action, and dynamics of environments. It thus exhibits more than $17\times$ sample efficiency improvements and generalization in comparison with previous works.
During inference, RIG first reasons about the next action, produces a potential action, and then predicts that action's outcome, which gives the agent a chance to review and self-correct based on the imagination before taking real actions.
Experimental results show that the synergy of reasoning and imagination not only improves the robustness, generalization, and interoperability of generalist policy but also enables test-time scaling to enhance overall performance.
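Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#LLM #RL #Agent
🎯 Motivation: Existing reinforcement learning methods usually optimize only task success and ignore the quality of reasoning along the way, which leads to poor generalization on complex long-horizon tasks.
❓ Problem addressed: Introduces process-level supervision to improve the efficiency and reliability of reasoning paths in reinforcement learning, addressing the brittleness caused by inefficient exploration.
🔍 Observations: Conventional methods reinforce inefficient or flawed reasoning paths, causing frequent redundant actions and weak error recovery, which limits agent robustness and interpretability.
🛠️ Method: Proposes the RLVMR framework, which combines rewards for verifiable meta-reasoning behaviors with the final task outcome and uses a critic-free policy gradient method to unify the process and outcome signals.
📊 Data and experiments: On the ALFWorld and ScienceWorld benchmarks, the 7B model reaches an 83.6% success rate on the most difficult task split, clearly outperforming existing methods.
⭐ Contributions: Integrates a dual process-outcome reward mechanism that improves reasoning quality and task efficiency, yielding more robust and interpretable agents for long-horizon tasks.
View full abstract (Abstract)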
The development of autonomous agents for complex, long-horizon tasks is a central goal in AI. However, dominant training paradigms face a critical limitation: reinforcement learning (RL) methods that optimize solely for final task success often reinforce flawed or inefficient reasoning paths, a problem we term inefficient exploration. This leads to agents that are brittle and fail to generalize, as they learn to find solutions without learning how to reason coherently. To address this, we introduce RLVMR, a novel framework that integrates dense, process-level supervision into end-to-end RL by rewarding verifiable, meta-reasoning behaviors. RLVMR equips an agent to explicitly tag its cognitive steps, such as planning, exploration, and reflection, and provides programmatic, rule-based rewards for actions that contribute to effective problem-solving. These process-centric rewards are combined with the final outcome signal and optimized using a critic-free policy gradient method. On the challenging ALFWorld and ScienceWorld benchmarks, RLVMR achieves new state-of-the-art results, with our 7B model reaching an 83.6% success rate on the most difficult unseen task split. Our analysis confirms these gains stem from improved reasoning quality, including significant reductions in redundant actions and enhanced error recovery, leading to more robust, efficient, and interpretable agents.
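How process-level and outcome rewards might be combined can be sketched as follows. This is only an illustration under stated assumptions: the `Step` structure, the `useful` flag (standing in for a rule-based check), the reward weights, and the group-normalized advantage (one common critic-free choice) are all hypothetical and are not taken from the RLVMR paper.

```python
from dataclasses import dataclass

ALLOWED_TAGS = {"planning", "exploration", "reflection"}


@dataclass
class Step:
    tag: str      # cognitive tag the agent attached to this step
    useful: bool  # result of a hypothetical rule-based check on the step


def trajectory_reward(steps, success, process_weight=0.1, outcome_reward=1.0):
    """Sum small rewards for verified tagged steps, plus the final outcome reward."""
    process = sum(process_weight for s in steps if s.tag in ALLOWED_TAGS and s.useful)
    return process + (outcome_reward if success else 0.0)


def group_advantages(rewards):
    """Critic-free advantages: normalize rewards within a group of rollouts."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

No value network appears anywhere in the sketch: each rollout is scored against the other rollouts for the same task, which is what makes the update critic-free in this assumed setup.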
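Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Large Language Models #Agent
🎯 Motivation: Reasoning models trained with reinforcement learning fall short on structured problem solving, especially geometric reasoning, concise computation, and complex equation solving, and need the strengths of tools to improve.
❓ Problem addressed: Proposes ReTool, which improves long-form reasoning through tool-integrated learning and lets the model learn to use tools efficiently during dynamic interaction.
🔍 Observations: Experiments show that ReTool significantly outperforms conventional baselines on challenging math benchmarks and exhibits generalization of autonomous tool use as well as emergent behaviors such as code self-correction.
🛠️ Method: Dynamically interleaves real-time code execution with natural language reasoning, and applies reinforcement learning on outcome feedback to refine tool invocation patterns step by step without human priors.
📊 Data and experiments: On the AIME math competition benchmark, the 32B model reaches 67% accuracy with only 400 training steps, well above the baseline; in extended settings it reaches 72.5% accuracy, clearly surpassing the OpenAI model used for comparison.
⭐ Contributions: Proposes an outcome-driven reinforcement learning framework for tool integration, offers a new approach to complex mathematical reasoning, and shows the promise and emergent behaviors of neuro-symbolic systems.
View full abstract (Abstract)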
While reasoning models trained with reinforcement learning (RL) excel in reasoning, they struggle in scenarios requiring structured problem-solving, such as geometric reasoning, concise computation, or complex equation solving, areas where computational tools like code interpreters (CI) demonstrate distinct advantages. To bridge this gap, we propose ReTool, which enhances long-form reasoning with tool-integrated learning, including two key features: (1) dynamic interleaving of real-time code execution within natural language reasoning processes, and (2) an automated RL paradigm that allows policy rollouts with multi-turn real-time code execution and teaches the model when and how to invoke tools based on outcome feedback. ReTool employs a systematic training framework, beginning with synthetic code-augmented long-form reasoning data for cold-start training. Subsequent RL training leverages task outcomes as rewards to iteratively refine the model's tool use strategy, enabling autonomous discovery of optimal tool invocation patterns without human priors. Experiments on the challenging math olympiad benchmark AIME demonstrate ReTool's superiority: our 32B model achieves 67% accuracy with 400 training steps, outperforming the text-based RL baseline (40% accuracy, 1080 steps) in both performance and efficiency. Remarkably, ReTool-32B attains 72.5% accuracy in extended settings, surpassing OpenAI's o1-preview by 27.9%. Further analysis reveals generalization to broader tool-use scenarios and emergent behaviors such as code self-correction, signaling an "aha moment" in which the model autonomously masters adaptive tool use. These findings highlight the promise of outcome-driven tool integration for advancing complex mathematical reasoning and offer new insights into hybrid neuro-symbolic systems.
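Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Large Language Model #Reinforcement Learning #Code LLM #multi-turn RL
🎯 Motivation: Reinforcement learning with verifiable rewards shows promise for improving LLM reasoning, but existing methods neither optimize verification explicitly nor fully exploit reliable signals from realistic environments, which results in weak self-verification and limited test-time scaling.
❓ Problem addressed: Widens the asymmetry between generation and verification and explicitly optimizes self-verification, to improve test-time scaling and reliability.
🔍 Observations: Existing methods rely too heavily on outcome rewards and do not optimize verification in depth; the resulting weak verification cannot support the evolution of long-horizon reasoning and limits multi-turn test-time scaling.
🛠️ Method: Proposes the ReVeal framework, which builds an iterative generation-verification mechanism through multi-turn reinforcement learning, introduces TAPO for fine-grained turn-level credit assignment, co-evolves code generation and test generation, and uses tool feedback to strengthen self-verification.
📊 Data and experiments: On LiveCodeBench, training uses only three turns, yet inference scales to more than 20 turns and significantly improves Pass@k, indicating that the model's reasoning boundary has expanded.
⭐ Contributions: Proposes ReVeal, a scalable multi-turn reinforcement learning framework that explicitly optimizes self-verification, demonstrates its potential for stronger code reasoning and for building autonomous agents, and releases the code.
View full abstract (Abstract)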
Reinforcement learning with verifiable rewards (RLVR) has advanced the reasoning capabilities of large language models. However, existing methods rely solely on outcome rewards, without explicitly optimizing verification or leveraging reliable signals from realistic environments, leading to unreliable self-verification and limited test-time scaling. To address this, we widen the verification–generation asymmetry by explicitly optimizing self-verification, making it a reliable driver of deeper test-time scaling. We introduce ReVeal, a multi-turn Reinforcement learning framework that evolves code generation through self-Verification and tool-based evaluation. ReVeal structures long-horizon reasoning as iterative generation–verification turns and incorporates TAPO for turn-level credit assignment, fostering the co-evolution of code and test generation. At inference, this strengthened self-verification enables the model to use self-constructed tests and tool feedback to continuously evolve code for 20+ turns on LiveCodeBench despite training on only three. It also significantly improves Pass@k, indicating stronger exploration that expands the reasoning boundaries of the base model. These findings highlight the promise of ReVeal as a scalable paradigm for RL training and test-time scaling, paving the way for more robust and autonomous AI agents. Code is available at [https://ReVeal.github.io/](https://shimly-2.github.io/ReVeal.github.io/).
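For readers unfamiliar with the Pass@k metric mentioned above, a widely used unbiased estimator draws n samples per problem, counts the c correct ones, and computes 1 - C(n-c, k) / C(n, k). The sketch below implements that standard estimator; whether ReVeal reports exactly this estimator is an assumption, since the abstract does not say.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given that c out of n generated samples passed."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark-level score is then the mean of `pass_at_k` over all problems.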
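Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Real-time reasoning #Language model agents #Parallel reasoning architecture
🎯 Motivation: Real-world agents must make timely and logically sound decisions in dynamic environments, but existing language models cannot cope effectively with continuous environmental change.
❓ Problem addressed: Formulates real-time reasoning as a new problem and addresses agents' inability to combine complex reasoning with fast reaction in rapidly changing environments.
🔍 Observations: Experiments show that even state-of-the-art language models struggle to be both logical and timely under either paradigm (fast reaction or complex planning).
🛠️ Method: Proposes AgileThinker, a hybrid reasoning framework that engages fast reaction and deep reasoning at the same time to balance reasoning depth against response latency.
📊 Data and experiments: Builds an evaluation platform called Real-time Reasoning Gym; experiments show that AgileThinker outperforms single-paradigm agents as task difficulty and time pressure increase.
⭐ Contributions: First to formulate the real-time reasoning problem and provide a testbed for it; builds the hybrid reasoning method AgileThinker; lays a foundation for research on time-constrained AI systems.
View full abstract (Abstract)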
Agents in the real world must make not only logical but also *timely* judgments. This requires continuous awareness of the dynamic environment: hazards emerge, opportunities arise, and other agents act, while the agent's reasoning is still unfolding. Despite advances in language model reasoning, existing approaches fail to account for this dynamic nature. We introduce *real-time reasoning* as a new problem formulation for agents in evolving environments and build **Real-time Reasoning Gym** to demonstrate it. We study two paradigms for deploying language models in agents: (1) reactive agents, which employ language models with *bounded reasoning computation for rapid responses*, and (2) planning agents, which allow *extended reasoning computation for complex problems*. Our experiments show that even state-of-the-art models struggle with making logical and timely judgments in either paradigm. To address this limitation, we propose **AgileThinker**, which simultaneously engages *both reasoning paradigms*. AgileThinker consistently outperforms agents engaging only one reasoning paradigm as the task difficulty and time pressure rise, effectively balancing reasoning depth and response latency. Our work establishes real-time reasoning as a critical testbed for developing practical agents and provides a foundation for research in temporally constrained AI systems, highlighting a path toward real-time capable agents.
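Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Large language models #LLM reasoning #Agentic multi-turn reasoning
🎯 Motivation: Active reasoning requires LLM agents to interact over multiple turns and gather information strategically, a process that depends on accurate belief tracking. Existing models often suffer from belief deviation because of limited reasoning ability, which leads to loss of state awareness and inefficient behavior.
❓ Problem addressed: Targeting belief deviation, the work proposes a method that suppresses its negative effect on trajectory quality and optimization in reinforcement learning, thereby improving the reasoning ability of LLM agents.
🔍 Observations: Belief deviation accumulates and propagates errors through the trajectory, making credit assignment in reinforcement learning inaccurate, limiting exploration, and increasing repetitive and inefficient actions.
🛠️ Method: Proposes ${T^3}$, which monitors and detects excessive belief deviation and truncates the uninformative part of a trajectory, so that training trajectories keep their informative content and negative tail effects are reduced.
📊 Data and experiments: Experiments on five challenging tasks show that ${T^3}$ markedly improves training stability, raises performance by up to 30 points, and cuts token cost by up to 34%.
⭐ Contributions: Establishes belief control as a key principle in active reasoning, proposes a method that effectively addresses belief deviation, and demonstrates its value for building robust LLM agents.
View full abstract (Abstract)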
Active reasoning requires large language model (LLM) agents to interact with external sources and strategically gather information to solve problems in multiple turns. Central to this process is belief tracking: maintaining an accurate representation of the underlying state and uncertainty in understanding and solving the problem. However, due to limited reasoning capabilities, LLM-based agents often suffer from belief deviation: their internal beliefs drift from the true problem state, leading to loss of state awareness and uninformative or repetitive actions. Once this happens, errors compound in the trajectories used for reinforcement learning (RL), leading to misattributed credit and limited exploration. To address this issue, we propose to track belief deviation and develop $\mathbf{T^3}$, a simple yet principled method that detects excessive deviation and truncates training trajectories to suppress uninformative tail effects. Hence, $\mathbf{T^3}$ preserves credit for informative prefixes and systematically improves policy optimization. Across 5 challenging tasks, $\mathbf{T^3}$ consistently enhances training stability and yields performance gains of up to 30 points while cutting token cost by up to 34%. These results highlight belief control as a key principle for building robust LLM agents capable of active reasoning.
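The truncation idea can be sketched as a simple stopping rule over a per-step signal. The abstract does not specify how deviation is detected, so everything here is an assumption made for illustration: the per-step `info_gain` proxy, the `min_gain` threshold, and the `patience` window are hypothetical.

```python
def truncate_trajectory(info_gain, min_gain=0.0, patience=2):
    """Return how many leading steps of a trajectory to keep for training.

    `info_gain` is a hypothetical per-step proxy for how much each action
    reduced uncertainty. When `patience` consecutive steps are uninformative,
    the trajectory is cut just before that stalled run begins.
    """
    stalled = 0
    for t, gain in enumerate(info_gain):
        stalled = stalled + 1 if gain <= min_gain else 0
        if stalled >= patience:
            return t + 1 - patience  # keep only the informative prefix
    return len(info_gain)
```

With `info_gain = [0.5, 0.3, 0.0, 0.0, 0.4]` and the defaults, only the first two steps are kept, so credit is assigned to the informative prefix rather than to the stalled tail.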
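Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Tool Creation #Tool-Augmented Reasoning
🎯 Motivation: Large language models strengthen their reasoning with external tools, but many tasks lack predefined tools, and existing methods that rely on internal knowledge work poorly on tasks outside the model's knowledge scope.
❓ Problem addressed: Proposes RefTool, a reference-guided framework that generates tools from external materials, to overcome the limits of the model's internal knowledge on knowledge-intensive tasks.
🔍 Observations: Tool generation based on the model's internal knowledge struggles with tasks outside its knowledge scope, whereas grounding tool generation in reference content improves the accuracy and faithfulness of the tools.
🛠️ Method: RefTool has two modules. The tool creation module builds executable tools from reference materials, validates them with examples, and organizes them hierarchically; the tool utilization module selects suitable tools from the toolbox to solve problems.
📊 Data and experiments: Experiments on causality, physics, and chemistry benchmarks show that RefTool improves average accuracy by 12.3% over existing methods and generalizes efficiently to non-scientific tasks such as low-resource language translation.
⭐ Contributions: Guides tool creation with external reference materials to move past the limits of internal model knowledge; proposes a hierarchical structure that makes tool selection more effective, offering a new approach to general reasoning in knowledge-intensive domains.
View full abstract (Abstract)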
Large Language Models (LLMs) can enhance their reasoning capabilities by using external tools. However, many tasks lack predefined tools. Prior works have explored instructing LLMs to generate tools on their own, but such approaches depend heavily on internal knowledge and struggle when tasks fall outside the model’s knowledge scope. To address this limitation, we propose RefTool, a reference-guided framework for automatic tool creation that leverages external materials, such as textbooks and knowledge snippets. RefTool consists of two modules: (1) tool creation, where LLMs generate executable tools from reference content, validate them using illustrative examples, and organize them hierarchically into a toolbox; and (2) tool utilization, where LLMs navigate the toolbox structure to select and apply the appropriate tools to solve problems. Experiments on causality, physics, and chemistry benchmarks demonstrate that RefTool outperforms existing tool-creation and domain-specific reasoning methods by 12.3% on average accuracy, while being cost-efficient and broadly generalizable to non-scientific tasks, e.g., extremely low-resource language translation. Analyses reveal that grounding tool creation in references produces accurate and faithful tools, and that the hierarchical structure facilitates effective tool selection. RefTool enables LLMs to overcome internal knowledge limitations, advancing generalizable reasoning in knowledge-intensive domains.
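Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Token-level Policy Gradients Reshape;Tool-use Large Language Model; Entropy-aware; Reinforcement Learning; Reasoning Model
🎯 Motivation: Large language models have moved from passive generation to goal-directed agents that actively invoke external tools. Existing reinforcement learning methods rely only on sparse rewards and ignore the particularities of tool-use tasks, which makes training inefficient.
❓ Problem addressed: Establishes a theoretical link between policy entropy and training stability in tool-use tasks, shows that structured, low-entropy tokens are the primary determinants of reward, and optimizes the training process accordingly.
🔍 Observations: Low-entropy tokens contribute substantially to reward in tool-use tasks, while existing methods ignore the balance between semantics and structure, which increases gradient variance and hurts convergence.
🛠️ Method: Proposes ResT, which reshapes the policy gradient through entropy-aware token reweighting and progressively upweights reasoning tokens, so that training shifts smoothly from structural correctness to semantic reasoning.
📊 Data and experiments: Evaluated on BFCL and API-Bank, ResT improves over baselines by up to 8.76% on single-turn and multi-turn tasks and surpasses GPT-4o by 1.50% to 4.11%.
⭐ Contributions: Proposes an entropy-driven token reweighting strategy that markedly improves the stability and effectiveness of policy training for tool use, offering a new approach to reinforcement learning for complex language models.
View full abstract (Abstract)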
Large language models (LLMs) transcend passive generation and act as goal-directed agents by invoking external tools. Reinforcement learning (RL) offers a principled framework for optimizing these emergent tool-use policies, yet the prevailing paradigm relies exclusively on sparse outcome rewards and lacks consideration of the particularity of tool-use tasks, inflating policy-gradient variance and resulting in inefficient training.
To better understand and address these challenges, we first establish a theoretical link between policy entropy and training stability of tool-use tasks, which reveals that structured, low-entropy tokens are primary determinants of rewards. Motivated by this insight, we propose Reshaped Token-level policy gradients (ResT) for tool-use tasks. ResT reshapes the policy gradient through entropy-informed token reweighting, progressively upweighting reasoning tokens as training proceeds. This scheme enables a smooth shift from structural correctness to semantic reasoning and stabilizes convergence in multi-turn tool-use tasks. Evaluation on BFCL and API-Bank shows that ResT outperforms strong baselines by up to 8.76%. When fine-tuned on a 4B base LLM, ResT further surpasses GPT-4o by 4.11% on single-turn tasks and 1.50% on multi-turn base tasks. Code is available at https://github.com/1229095296/ResT_Tool_use_LLM.
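Entropy-informed token reweighting can be sketched in a few lines. The abstract does not give ResT's actual weighting rule, so the entropy threshold, the linear schedule, and the weight values below are hypothetical choices meant only to show the shape of the idea: low-entropy (structural) tokens keep full weight, and high-entropy (reasoning) tokens gain weight as training progresses.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def token_weights(entropies, progress, threshold=0.5,
                  low=1.0, high_start=0.2, high_end=1.0):
    """Weight per token; `progress` in [0, 1] is the fraction of training done."""
    high = high_start + (high_end - high_start) * progress
    return [low if h < threshold else high for h in entropies]


def reweighted_pg_loss(logprobs, advantage, weights):
    """Weighted policy-gradient surrogate loss for one sampled response."""
    total = sum(weights)
    return -sum(w * advantage * lp for w, lp in zip(weights, logprobs)) / total
```

Early in training (`progress` near 0) the loss is dominated by structural tokens; by the end (`progress` near 1) all tokens contribute equally in this assumed schedule.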
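Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Semi-Structured Data #Reinforcement Learning #Information Extraction
🎯 Motivation: Semi-structured data is widespread in HTML tables, lists, and infoboxes, but its complex formatting limits usability, and extracting structured information from it remains difficult.
❓ Problem addressed: Existing methods either lack generalization or are computationally expensive; SCRIBES aims to generate efficient, reusable scripts for web-scale information extraction.
🔍 Observations: Layouts are similar across webpages, and this similarity can serve as a signal for optimizing and generalizing extraction scripts.
🛠️ Method: A reinforcement learning framework that uses layout similarity as the reward signal and trains iteratively on synthetic annotations generated from CommonCrawl data to improve generalization.
📊 Data and experiments: Experiments on large-scale CommonCrawl data show an improvement of over 13% in script quality and of over 4% in GPT-4o question answering accuracy.
⭐ Contributions: Proposes a scalable reinforcement learning framework for semi-structured data extraction that substantially improves script quality and downstream question answering accuracy.
View full abstract (Abstract)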
Semi-structured content in HTML tables, lists, and infoboxes accounts for a substantial share of factual data on the web, yet the formatting complicates usage, and reliably extracting structured information from them remains challenging. Existing methods either lack generalization or are resource-intensive due to per-page LLM inference. In this paper, we introduce SCRIBES (**SCRI**pt-**B**ased Semi-Structured Content **E**xtraction at Web-**S**cale), a novel reinforcement learning framework that leverages layout similarity across webpages within the same site as a reward signal. Instead of processing each page individually, SCRIBES generates reusable extraction scripts that can be applied to groups of structurally similar webpages. Our approach further improves by iteratively training on synthetic annotations from in-the-wild CommonCrawl data. Experiments show that our approach outperforms strong baselines by over 13% in script quality and boosts downstream question answering accuracy by more than 4% for GPT-4o, enabling scalable and resource-efficient web information extraction.
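Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Coding Agent #Reward Model #Test-time Scaling #Reinforcement Learning
🎯 Motivation: The development of coding agents relies heavily on execution-based feedback. That approach demands large collections of unit test cases, and the feedback is sparse and cannot distinguish partially successful trajectories, which limits its effectiveness.
❓ Problem addressed: Explores execution-free reward models for software engineering agents, to provide finer-grained feedback that works for both test-time scaling and reinforcement learning.
🔍 Observations: Two verifiers with similar test-time performance can behave very differently in reinforcement learning, which shows that test-time ability and reinforcement learning ability are decoupled; classification accuracy and calibration are identified as the key factors.
🛠️ Method: Designs SWE-RM, a mixture-of-experts reward model with 30B total parameters and 3B activated at inference, and improves its robustness through experiments on factors such as training data scale and policy mixtures.
📊 Data and experiments: Experiments on SWE-Bench show substantial gains in test-time scaling and reinforcement learning for several open-source models; for example, the accuracy of Qwen3-Coder-Flash rises from 51.6% to 62.0%.
⭐ Contributions: Proposes and validates SWE-RM, an execution-free reward model that achieves state-of-the-art performance among open-source software engineering agents, and identifies the key design metrics for reward models in reinforcement learning.
View full abstract (Abstract)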
Execution-based feedback like unit testing is widely used in the development of coding agents through test-time scaling (TTS) and reinforcement learning (RL). This paradigm requires scalable and reliable collection of unit test cases to provide accurate feedback, and the resulting feedback is often sparse and cannot effectively distinguish between trajectories that are both successful or both unsuccessful. In contrast, execution-free feedback from reward models can provide more fine-grained signals without depending on unit test cases. Despite this potential, execution-free feedback for realistic software engineering (SWE) agents remains underexplored. While aiming to develop versatile reward models that are effective across both TTS and RL, we observe that two verifiers with nearly identical TTS performance can nevertheless yield very different results in RL. Intuitively, TTS primarily reflects the model's ability to select the best trajectory, but this ability does not necessarily generalize to RL. To address this limitation, we identify two additional aspects that are crucial for RL training: classification accuracy and calibration. We then conduct comprehensive controlled experiments to investigate how to train a robust reward model that performs well across these metrics. In particular, we analyze the impact of various factors such as training data scale, policy mixtures, and data source composition.
Guided by these investigations, we introduce SWE-RM, an accurate and robust reward model adopting a mixture-of-experts architecture with 30B total parameters and 3B activated during inference. SWE-RM substantially improves SWE agents on both TTS and RL performance. For example, it increases the accuracy of Qwen3-Coder-Flash from 51.6% to 62.0%, and Qwen3-Coder-Max from 67.0% to 74.6% on SWE-Bench Verified using TTS, achieving new state-of-the-art performance among open-source models. On RL training, SWE-RM lifts the resolve rate of execution-based counterparts by 3 absolute points on SWE-Bench Verified.
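The two properties highlighted above, classification accuracy and calibration, can be measured as follows. The abstract does not say which calibration metric is used, so the expected calibration error (ECE) with equal-width bins shown here is an assumed, commonly used choice; the function names and the 0.5 decision threshold are likewise illustrative.

```python
def accuracy(scores, labels, threshold=0.5):
    """Fraction of trajectories whose thresholded score matches the 0/1 label."""
    hits = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return hits / len(scores)


def expected_calibration_error(scores, labels, n_bins=10):
    """ECE: bin-size-weighted gap between mean score and empirical success rate."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((s, y))
    n = len(scores)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_score = sum(s for s, _ in b) / len(b)
        success_rate = sum(y for _, y in b) / len(b)
        ece += len(b) / n * abs(mean_score - success_rate)
    return ece
```

Two reward models can rank trajectories equally well, and so look identical under best-of-n selection, while differing in ECE, which is one way to picture why similar TTS results need not imply similar RL behavior.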
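Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#LLM Agent #Reinforcement Learning #Data Synthesis
TL;DR: We introduce DreamGym, a unified framework that synthesizes diverse agent experiences to enable scalable RL for LLM agents, outperforming baselines in synthetic and sim-to-real settings with minimal real interactions.
🎯 Motivation: Practical use of reinforcement learning is held back by costly environment interaction, limited task diversity, and complex infrastructure, all of which obstruct the collection of experience data at scale.
❓ Problem addressed: Proposes DreamGym, a unified framework that synthesizes diverse agent experiences, to address the high cost and poor scalability of experience collection in reinforcement learning.
🔍 Observations: Conventional methods depend on expensive real-environment interaction, and a lack of diverse, scalable experience limits how effectively and efficiently agents learn.
🛠️ Method: DreamGym simulates environment dynamics with a reasoning-based experience model, combines an experience replay buffer initialized from offline real-world data with online experience enrichment, and adaptively generates new tasks to support online curriculum learning.
📊 Data and experiments: Experiments cover diverse environments and agent backbones, including non-RL-ready tasks and sim-to-real transfer. DreamGym outperforms all baselines by over 30% on non-RL-ready tasks such as WebArena and matches existing methods with far fewer real interactions.
⭐ Contributions: Proposes DreamGym, the first unified framework that synthesizes diverse experiences for scalable reinforcement learning, greatly reducing the need for real interactions while improving training efficiency and transfer performance.
View full abstract (Abstract)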
While reinforcement learning (RL) can empower autonomous agents by enabling self-improvement through interaction, its practical adoption remains challenging due to costly rollouts, limited task diversity, unreliable reward signals, and infrastructure complexity, all of which obstruct the collection of scalable experience data. To address these challenges, we introduce DreamGym, the first unified framework designed to synthesize diverse experiences with scalability in mind to enable effective online RL training for autonomous agents. Rather than relying on expensive real-environment rollouts, DreamGym distills environment dynamics into a reasoning-based experience model that derives consistent state transitions and feedback signals through step-by-step reasoning, enabling scalable agent rollout collection for RL. To improve the stability and quality of transitions, DreamGym leverages an experience replay buffer initialized with offline real-world data and continuously enriched with fresh interactions to actively support agent training. To improve knowledge acquisition, DreamGym adaptively generates new tasks that challenge the current agent policy, enabling more effective online curriculum learning. Experiments across diverse environments and agent backbones demonstrate that DreamGym substantially improves RL training, both in fully synthetic settings and in sim-to-real transfer scenarios. On non-RL-ready tasks like WebArena, DreamGym outperforms all baselines by over 30%. And in RL-ready but costly settings, it matches GRPO and PPO performance using only synthetic interactions. When transferring a policy trained purely on synthetic experiences to real-environment RL, DreamGym yields significant additional performance gains while requiring far fewer real-world interactions, providing a scalable warm-start strategy for general-purpose RL.
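Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Continual Pre-training #Deep Research Agent #Agentic Training #Data Synthesis
TL;DR: The first work to introduce agentic continual pre-training into the agent training pipeline, providing strong agentic foundation models.
🎯 Motivation: Existing large language models underperform on agentic tasks, and open-source implementations in particular cannot carry out complex multi-step reasoning and tool use effectively. The authors aim to solve this by building stronger agentic foundation models.
❓ Problem addressed: In the conventional training pipeline, the model must learn agentic behaviors and align to expert demonstrations at the same time, which creates optimization conflicts and hurts performance. The core problem is the lack of robust agentic foundation models.
🔍 Observations: During post-training, models perform poorly on agentic tasks because they must handle several objectives at once. The problem is especially pronounced in open-source implementations.
🛠️ Method: Proposes Agentic Continual Pre-training (Agentic CPT) and integrates it into the training pipeline of deep research agents to produce strong agentic foundation models. Based on this approach, the authors develop the deep research agent model AgentFounder.
📊 Data and experiments: AgentFounder-30B is evaluated on 10 benchmarks and reaches state-of-the-art results while retaining strong tool-use ability, for example 39.9% on BrowseComp-en, 43.3% on BrowseComp-zh, and 31.5% Pass@1 on HLE.
⭐ Contributions: First to introduce agentic continual pre-training into the agent training pipeline, building a robust agentic foundation model that improves performance across multiple benchmarks and strengthens tool use.
View full abstract (Abstract)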
Large language models (LLMs) have evolved into agentic systems capable of autonomous tool use and multi-step reasoning for complex problem-solving. However, post-training approaches building upon general-purpose foundation models consistently underperform in agentic tasks, particularly in open-source implementations. We identify the root cause: the absence of robust agentic foundation models forces models during post-training to simultaneously learn diverse agentic behaviors while aligning them to expert demonstrations, thereby creating fundamental optimization tensions. To this end, we are the first to propose incorporating Agentic Continual Pre-training (Agentic CPT) into the deep research agent training pipeline to build powerful agentic foundation models. Based on this approach, we develop a deep research agent model named AgentFounder. We evaluate our AgentFounder-30B on 10 benchmarks and achieve state-of-the-art performance while retaining strong tool-use ability, notably 39.9% on BrowseComp-en, 43.3% on BrowseComp-zh, and 31.5% Pass@1 on HLE.
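Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Data Analysis #LLM Agents #Agent Training
TL;DR: This paper introduces DataMind, a scalable data synthesis and agent training pipeline designed to build generalist data-analytic agents.
🎯 Motivation: Current data-analytic agents depend on prompt engineering over proprietary models, while open-source models struggle with multi-format data files and long-horizon reasoning. Automated scientific discovery creates a pressing need for generalist data-analytic agents.
❓ Problem addressed: To build generalist data-analytic agents, the work tackles insufficient data resources, improper training strategies, and unstable code-based multi-turn rollout.
🔍 Observations: Existing open-source solutions fall short on diverse data formats, complex task reasoning, and stability, which limits their practical value.
🛠️ Method: Proposes the DataMind training recipe, comprising a fine-grained task taxonomy, recursive task composition, knowledge-augmented trajectory sampling with multi-stage filtering, a dynamically adjustable loss objective, and a memory-frugal, stable code-based rollout framework.
📊 Data and experiments: Curates the DataMind-12K dataset, which spans multiple domains and task categories. DataMind-14B averages 71.16% on multiple benchmarks, outperforming the proprietary baselines DeepSeek-V3.1 and GPT-5; DataMind-7B scores 68.10%, the best among open-source models.
⭐ Contributions: Proposes DataMind, a generalist agent training framework for complex data analysis; releases the high-quality DataMind-12K dataset and the leading open-source models DataMind-7B and DataMind-14B; summarizes training insights for the community.
View full abstract (Abstract)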
Data-analytic agents are emerging as a key catalyst for automated scientific discovery and for the vision of Innovating AI. Current approaches, however, rely heavily on prompt engineering over proprietary models, while open-source models struggle with the diverse-format, large-scale data files and the long-horizon, multi-step reasoning that real-world analytics demands. This paper introduces DataMind, a scalable data synthesis and agent training recipe designed to build generalist data-analytic agents. DataMind tackles three key challenges in building open-source data-analytic agents: insufficient data resources, improper training strategy, and unstable code-based multi-turn rollout. Concretely, DataMind applies 1) a fine-grained task taxonomy and a recursive easy-to-hard task composition mechanism to increase the diversity and difficulty of synthesized queries; 2) a knowledge-augmented trajectory sampling strategy followed by model-based and rule-based filtering; 3) a dynamically adjustable training objective combining both SFT and RL losses; 4) a memory-frugal and stable code-based multi-turn rollout framework. Built on DataMind, we curate DataMind-12K, a high-quality trajectory set spanning diverse domains, task categories, and data file formats for data-analytic tasks. Trained on DataMind-12K, our DataMind-14B achieves state-of-the-art performance with an average score of 71.16% on multiple data analysis benchmarks, outperforming the strongest proprietary baselines DeepSeek-V3.1 and GPT-5. Our DataMind-7B also performs best among all open-source models with a score of 68.10%. We also incorporate some empirical insights gained from our exploratory trials into the analysis experiments, aiming to provide actionable insights about agentic training for the community. We will release DataMind-12K, DataMind-7B, and DataMind-14B for the community's future research.
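Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#data science #embodied ai #agents #computer use agents
TL;DR: Data generation pipeline for digital AI agents.
🎯 Motivation: Multimodal large language models (MLLMs) hold great promise for training interactive agents in computer use, web navigation, and robotics. However, the lack of high-quality downstream agentic task datasets that are diverse, executable, and verifiable limits how far post-training can scale.
❓ Problem addressed: Existing task generation methods rely mainly on human annotation or on prompting MLLMs that lack environment information, which is either costly or poorly scalable. The paper proposes a scalable task generation pipeline that synthesizes diverse, feasible tasks automatically by exploring the environment.
🔍 Observations: Methods that rely on human annotation have limited coverage, while directly prompting MLLMs yields few high-quality, verifiable tasks because the model lacks concrete state information about the downstream environment. This blocks large-scale training of agents that understand digital user interfaces (UIs).
🛠️ Method: Proposes the AutoPlay framework, which has an exploration stage and a task generation stage. In the exploration stage, an MLLM agent systematically explores the interactive environment to discover new states and functionalities; in the generation stage, diverse, executable, and verifiable tasks are synthesized from the exploration trajectories and task guideline prompts.
📊 Data and experiments: Tens of thousands of tasks are generated automatically across 20 Android applications and 13 Ubuntu applications and used to train mobile-use and computer-use agents. Training on this data raises the success rate of MLLM agents in the corresponding scenarios by up to 20.0%.
⭐ Contributions: Proposes a scalable automatic task generation pipeline that needs no human annotation, substantially reducing the dependence of agent post-training on human-produced data. The generated tasks support large-scale demonstration synthesis and reinforcement learning with an MLLM verifier, which improves agent performance further.
View full abstract (Abstract)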
Post-training Multimodal Large Language Models (MLLMs) to build interactive agents holds promise across domains such as computer use, web navigation, and robotics. A key challenge in scaling such post-training is the lack of high-quality downstream agentic task datasets with tasks that are diverse, feasible, and verifiable. Existing approaches for task generation rely heavily on human annotation or on prompting MLLMs with limited downstream environment information, which is either costly or poorly scalable, as it yields tasks with limited coverage. To remedy this, we present AutoPlay, a scalable pipeline for task generation that explicitly explores interactive environments to discover possible interactions and current state information to synthesize environment-grounded tasks. AutoPlay operates in two stages: (i) an exploration phase, where an MLLM explorer agent systematically uncovers novel environment states and functionalities, and (ii) a task generation phase, where a task generator leverages exploration trajectories and a set of task guideline prompts as context to synthesize diverse, executable, and verifiable tasks. We show AutoPlay generates $20$k tasks across $20$ Android applications and $10$k tasks across 13 Ubuntu applications to train mobile-use and computer-use agents. AutoPlay-generated tasks enable large-scale task demonstration synthesis without human annotation by employing an MLLM task executor and verifier. This data enables training MLLM-based UI agents that improve success rates by up to $20.0\%$ on mobile-use and $10.9\%$ on computer-use scenarios. In addition, AutoPlay-generated tasks combined with MLLM verifier-based rewards enable scaling reinforcement learning training of UI agents, leading to an additional $5.7\%$ gain. These results establish AutoPlay as a scalable approach for post-training capable MLLM agents, reducing reliance on human annotation.
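Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Evolution #LLMs #Scientific discovery
TL;DR: We introduce a new framework using LLMs to advance automated scientific discovery with SotA efficiency and significant results across engineering and scientific fields.
🎯 Motivation: Current LLM-driven scientific discovery is sample-inefficient: it needs thousands of samples to identify effective solutions.
❓ Problem: Introduces ShinkaEvolve, a framework that improves sample efficiency to advance open-ended automated scientific discovery across engineering and scientific domains.
🔍 Analysis: Existing LLM-driven frameworks explore the search space inefficiently and pay a high cost per generated solution, which limits their broader use.
🛠️ Method: Three innovations: a parent sampling technique that balances exploration and exploitation, code-novelty rejection sampling, and a bandit-based LLM ensemble selection strategy.
📊 Data & Experiments: On the canonical circle-packing optimization task, ShinkaEvolve finds a new state-of-the-art solution with only 150 samples. It is also validated on broader engineering problems: agentic harnesses for AIME mathematical reasoning, ALE-Bench competitive programming solutions, and mixture-of-experts load-balancing losses for LLM training.
⭐ Contributions: Demonstrates sample-efficient LLM-driven discovery with a scalable framework, and provides the full code, to be open-sourced, to accelerate automated discovery across diverse computational problems.
Abstract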
We introduce ShinkaEvolve: a new framework leveraging large language models (LLMs) to advance scientific discovery with state-of-the-art performance and efficiency. The field of LLM-driven scientific discovery has seen significant progress, but has yet to overcome a critical limitation: sample inefficiency, requiring thousands of samples to identify effective solutions. ShinkaEvolve takes a concrete step towards addressing this limitation by introducing three key innovations: a parent sampling technique balancing exploration and exploitation, code novelty rejection-sampling for efficient search space exploration, and a bandit-based LLM ensemble selection strategy. When applied to the canonical circle-packing optimization task, ShinkaEvolve discovers a new state-of-the-art circle packing solution using only 150 samples, orders of magnitude fewer than prior frameworks. Furthermore, applied to a broader set of engineering problems, ShinkaEvolve designs robust agentic harnesses for AIME mathematical reasoning tasks, identifies improvements to ALE-Bench competitive programming solutions, and discovers novel mixture-of-experts load balancing loss functions to stabilize LLM training itself. We provide ShinkaEvolve's full code together with this submission, which will be open-sourced to accelerate advances in open-ended automated discovery across diverse computational problems.
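The abstract names a "bandit-based LLM ensemble selection strategy" without giving details. As a rough illustration only, the sketch below uses the textbook UCB1 rule to choose which LLM (arm) to query next. The choice of UCB1, the function names, and the reward functions are assumptions made for this example and are not taken from ShinkaEvolve.

```python
import math
import random


def ucb1_select(counts, total_rewards, c=1.4):
    """Return the index of the arm (hypothetical LLM) with the best UCB1 score."""
    # Query every arm once before applying the confidence bound.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    total = sum(counts)
    scores = [
        total_rewards[a] / counts[a] + c * math.sqrt(math.log(total) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def run_bandit(reward_fns, steps, seed=0):
    """Simulate `steps` selections; reward_fns[i](rng) stands in for the
    improvement obtained when LLM i proposes a program mutation (assumption)."""
    rng = random.Random(seed)
    counts = [0] * len(reward_fns)
    totals = [0.0] * len(reward_fns)
    for _ in range(steps):
        arm = ucb1_select(counts, totals)
        counts[arm] += 1
        totals[arm] += reward_fns[arm](rng)
    return counts
```

With one weak and one strong reward function, the selection counts concentrate on the stronger arm while the weaker one is still sampled occasionally, which is the exploration/exploitation balance a bandit selector is meant to provide.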
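Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Vision Language Models #Planning #PDDL #LLM Tool Use
TL;DR: We propose VLMFP, a Dual-VLM-guided framework that can autonomously generate both PDDL problem and domain files for formal visual planning.
🎯 Motivation: Vision language models (VLMs) show promise for visual planning but struggle with precise spatial and long-horizon reasoning, whereas Planning Domain Definition Language (PDDL) planners excel at formal long-horizon planning but cannot process visual input directly. Recent work tries to combine the two, yet automatically generating PDDL domain files, which encode the planning rules, remains difficult and usually depends on human expertise or environment interaction.
❓ Problem: Proposes VLMFP, a dual-VLM-guided framework that autonomously generates both PDDL problem files and domain files for formal visual planning, addressing the lack of automation in rule encoding.
🔍 Analysis: VLMs generate PDDL problem files reasonably well, but accurately generating domain files that contain the planning rules remains very hard, which limits how well visual-to-formal planning generalizes in practice.
🛠️ Method: VLMFP combines SimVLM, which simulates action outcomes, with GenVLM, which generates and iteratively refines PDDL files by aligning symbolic execution with the simulated outcomes. This enables generalization at several levels: unseen instances, visual appearances, and game rules.
📊 Data & Experiments: Evaluated on 6 grid-world domains. SimVLM averages 87.3% and 86.0% on scenario understanding and action simulation for seen and unseen appearances, respectively; VLMFP reaches 70.0% and 54.1% planning success on unseen instances under seen and unseen appearances, respectively. The framework also scales to complex long-horizon 3D planning tasks with partial observability and diverse visual variations.
⭐ Contributions: A dual-VLM framework that autonomously generates complete PDDL problem and domain files, converting visual planning problems into formal rules automatically and generalizing across instances, appearances, and rules.
Abstract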
Vision Language Models (VLMs) show strong potential for visual planning but struggle with precise spatial and long-horizon reasoning, while Planning Domain Definition Language (PDDL) planners excel at formal long-horizon planning but cannot interpret visual inputs. Recent works combine these complementary advantages by translating visual problems into PDDL. However, while VLMs can generate PDDL problem files satisfactorily, accurately generating PDDL domain files, which encode planning rules, remains challenging and typically requires human expertise or environment interaction.
We propose VLMFP, a Dual-VLM-guided framework that autonomously generates both PDDL problem and domain files for formal visual planning. VLMFP combines a SimVLM that simulates action consequences with a GenVLM that generates and iteratively refines PDDL files by aligning symbolic execution with simulated outcomes, enabling multiple levels of generalization across unseen instances, visual appearances, and game rules.
We evaluate VLMFP on 6 grid-world domains and demonstrate its generalization capability. On average, SimVLM achieves 87.3\% and 86.0\% scenario understanding and action simulation for seen and unseen appearances, respectively. With the guidance of SimVLM, VLMFP attains 70.0\% and 54.1\% planning success on unseen instances in seen and unseen appearances, respectively. We further demonstrate that VLMFP scales to complex long-horizon 3D planning tasks, including multi-robot collaboration and assembly scenarios with partial observability and diverse visual variations. Project page: https://sites.google.com/view/vlmfp.
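Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#wisdom of crowds #LLM #multiagent systems
🎯 Motivation: In human society, collective decisions often beat individual judgment, whereas large language models (LLMs) typically give a single answer and lack diversity. Inspired by the wisdom of crowds, the work asks whether simulating many diverse answers improves prediction.
❓ Problem: LLMs have limited ability to predict complex, varied human behavior, especially in settings that need multi-faceted understanding, such as ad effectiveness and video memorability.
🔍 Analysis: Crowds predict more accurately because they pool diverse perspectives, independent judgments, and distributed knowledge, which cancels individual biases; a single answer cannot reflect the actual preferences of a diverse population.
🛠️ Method: Proposes Social Agents, a multi-agent framework that simulates human-like personas with diverse demographic and psychographic attributes. Each persona independently rates the input stimulus with a quantitative score and a qualitative rationale, and the opinions are aggregated into a predicted distribution of preferences.
📊 Data & Experiments: Across eleven behavioral prediction tasks, Social Agents outperforms single-LLM baselines by up to 67.45% on simple judgments and up to 9.88% on complex interpretive tasks, and individual persona predictions reach Pearson correlations of up to 0.71 with human judgments.
⭐ Contributions: Offers an interpretable and scalable crowd-simulation tool that improves behavioral prediction and supports societal decision making, showing the potential of applying the wisdom of crowds to LLMs.
Abstract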
In human society, collective decision making has often outperformed the judgment of individuals. Classic examples range from estimating livestock weights to predicting elections and financial markets, where averaging many independent guesses often yields results more accurate than experts. These successes arise because groups bring together diverse perspectives, independent voices, and distributed knowledge, combining them in ways that cancel individual biases. This principle, known as the Wisdom of Crowds, underpins practices in forecasting, marketing, and preference modeling. Large Language Models (LLMs), however, typically produce a single definitive answer. While effective in many settings, this uniformity overlooks the diversity of human judgments shaping responses to ads, videos, and webpages. Inspired by how societies benefit from diverse opinions, we ask whether LLM predictions can be improved by simulating not one answer but many. We introduce Social Agents, a multi-agent framework that instantiates a synthetic society of human-like personas with diverse demographic (e.g., age, gender) and psychographic (e.g., values, interests) attributes. Each persona independently appraises a stimulus such as an advertisement, video, or webpage, offering both a quantitative score (e.g., click-through likelihood, recall score, likability) and a qualitative rationale. Aggregating these opinions produces a distribution of preferences that more closely mirrors real human crowds. Across eleven behavioral prediction tasks, Social Agents outperforms single-LLM baselines by up to 67.45% on simple judgments (e.g., webpage likability) and 9.88% on complex interpretive reasoning (e.g., video memorability). Social Agents’ individual persona predictions also align with human judgments, reaching Pearson correlations up to 0.71.
These results position computational crowd simulation as a scalable, interpretable tool for improving behavioral prediction and supporting societal decision making.
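Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#LLM-based Agents #Process Supervision #Curriculum Learning
TL;DR: We introduce HPL, a hierarchical framework that resolves the granularity mismatch in agent alignment by optimizing preferences over semantically coherent "action groups" (sub-tasks), guided by a dual-layer curriculum.
🎯 Motivation: As large language models (LLMs) are used for complex long-horizon problems, preference alignment becomes essential. Existing methods, however, suffer from a granularity mismatch between trajectory-level and step-level supervision, which hurts efficiency and behavior optimization.
❓ Problem: Introduces a multi-level preference learning framework that reconciles the stability of trajectory-level signals with the fine granularity of step-level signals, improving the agent's ability to solve complex tasks.
🔍 Analysis: Trajectory-level preference optimization sees the whole task but cannot attribute credit to individual behaviors; step-level optimization is more detailed but is data-inefficient and statistically noisy, and it fails to capture multi-step structured behaviors whose effect only shows over several actions.
🛠️ Method: Proposes Hierarchical Preference Learning (HPL), which jointly optimizes trajectory-level, step-level, and group-level preference signals using action groups and a dual-layer curriculum. Expert trajectories are decomposed into semantically coherent action groups, contrasting suboptimal groups are generated, and a curriculum is scheduled by group length and by the reward gap between preferred and dispreferred groups.
📊 Data & Experiments: On three challenging agent benchmarks, HPL outperforms existing state-of-the-art methods, and the analyses support the effectiveness of both the hierarchical loss and the curriculum design.
⭐ Contributions: A hierarchical preference optimization framework with a curriculum-based training strategy that resolves the granularity mismatch in existing preference alignment methods and improves LLM agents on complex tasks.
Abstract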
Large Language Models (LLMs) as autonomous agents are increasingly tasked with solving complex, long-horizon problems.
Aligning these agents via preference-based methods like Direct Preference Optimization (DPO) is a promising direction, yet it faces a critical granularity mismatch.
Trajectory-level DPO provides stable signals but blurs where credit should be assigned within long trajectories, whereas step-level DPO offers fine-grained supervision but can be statistically noisy and data-inefficient when Monte Carlo rollouts are limited, and struggles to fully exploit multi-step structured behaviors that only reveal their effect over several actions.
To balance this trade-off, we introduce **H**ierarchical **P**reference **L**earning (HPL), a hierarchical framework that optimizes LLM agents by leveraging preference signals at multiple, synergistic granularities.
While HPL incorporates trajectory- and step-level DPO for global and local policy stability, its core innovation lies in group-level preference optimization guided by a dual-layer curriculum.
Our approach first decomposes expert trajectories into semantically coherent action groups and then generates contrasting suboptimal groups to enable preference learning at a fine-grained, sub-task level.
Then, instead of treating all preference pairs equally, HPL introduces a curriculum scheduler that organizes the learning process from simple to complex.
This curriculum is structured along two axes: the group length, representing sub-task complexity, and the sample difficulty, defined by the reward gap between preferred and dispreferred action groups.
Experiments on three challenging agent benchmarks show that HPL outperforms existing state-of-the-art methods.
Our analyses demonstrate that the hierarchical DPO loss effectively integrates preference signals across multiple granularities, while the dual-layer curriculum is crucial for enabling the agent to solve a wide range of tasks, from simple behaviors to complex multi-step sequences.
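The two curriculum axes described above (group length and reward gap) can be sketched as a simple ordering over preference pairs. This is a minimal illustration, not the authors' scheduler: the `GroupPair` type, the function names, and the assumption that a larger reward gap means an easier pair are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupPair:
    """A preferred/dispreferred action-group pair (hypothetical container)."""
    group_len: int      # number of actions in the group (sub-task complexity)
    reward_gap: float   # reward(preferred) - reward(dispreferred)


def curriculum_order(pairs):
    """Order pairs from simple to complex: short groups first and, within a
    length, large reward gaps (assumed easier to tell apart) first."""
    return sorted(pairs, key=lambda p: (p.group_len, -p.reward_gap))


def stage_pairs(pairs, max_len, min_gap):
    """Select the pairs admitted at one curriculum stage."""
    return [p for p in pairs if p.group_len <= max_len and p.reward_gap >= min_gap]
```

A training loop under these assumptions would start with a small `max_len` and a large `min_gap`, then relax both thresholds so that longer groups and harder pairs enter later.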
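Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#LLM #multi agent system #reinforcement learning
🎯 Motivation: Multi-agent systems (MAS) and reinforcement learning (RL) are both widely used to improve the agentic performance of large language models (LLMs), but combining them raises unresolved challenges.
❓ Problem: Standard GRPO cannot handle prompts that vary by role and turn in MAS, and no existing training system supports both single-policy and multi-policy training.
🔍 Analysis: In MAS, prompts differ by role and turn, so GRPO's grouping assumptions break down; existing training systems also struggle to support MAS-workflow rollouts and on-policy updates.
🛠️ Method: Proposes AT-GRPO, which consists of an RL algorithm that groups samples by agent and by turn, and a system that supports both single-policy and multi-policy training.
📊 Data & Experiments: Evaluated on game, planning, coding, and math tasks. On long-horizon planning, accuracy rises from a 14.0–47.0% single-agent RL baseline to 96.0–99.5%; reasoning improves by an average of 3.87–7.62% on coding and 9.0–17.93% on math.
⭐ Contributions: An RL algorithm adapted to MAS together with a supporting training system, yielding substantial gains across domains; the code is publicly available.
Abstract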
Multi-Agent Systems (MAS) and Reinforcement Learning (RL) are both widely adopted to improve large language model (LLM) agentic performance. MAS strengthens task-specialized performance via role-based orchestration; RL leverages environment rewards to train stronger policies, such as Group Relative Policy Optimization (GRPO)-style optimization. Yet applying on-policy RL training to MAS is underexplored. While promising, it poses several challenges. On the algorithm side, standard GRPO grouping assumptions fail in MAS because prompts differ by role and turn. On the system side, the training system needs to support MAS-workflow-based rollouts and on-policy updates for both single and multiple policy models. To address these issues, we introduce AT-GRPO, consisting of (i) an Agent- and Turn-wise grouped RL algorithm tailored for MAS and (ii) a system to support both single-policy and multi-policy training. Across game, plan, coding, and math tasks, AT-GRPO demonstrates substantial performance gains across diverse domains. Especially on long-horizon planning tasks, AT-GRPO boosts accuracy from a 14.0–47.0% single-agent RL baseline to 96.0–99.5%. Furthermore, it improves reasoning performance, with an average gain of 3.87–7.62% on coding and 9.0–17.93% on math. The code is available at https://github.com/pettingllms-ai/PettingLLMs.
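To make the grouping idea concrete, the sketch below computes GRPO-style normalized advantages within groups keyed by (agent role, turn) instead of over all samples for a question. It is a simplified illustration under stated assumptions: the sample dictionary layout and the function name are hypothetical, and the actual AT-GRPO algorithm may define groups and rewards differently.

```python
from collections import defaultdict
from statistics import mean, pstdev


def grouped_advantages(samples, eps=1e-8):
    """Compute group-relative advantages per (agent, turn) group.

    `samples` is a list of dicts with keys 'agent', 'turn', 'reward'
    (a hypothetical layout chosen for this example).
    """
    groups = defaultdict(list)
    for i, s in enumerate(samples):
        # Samples are comparable only if they share a role and a turn,
        # because their prompts otherwise differ.
        groups[(s["agent"], s["turn"])].append(i)

    advantages = [0.0] * len(samples)
    for indices in groups.values():
        rewards = [samples[i]["reward"] for i in indices]
        mu, sigma = mean(rewards), pstdev(rewards)
        for i in indices:
            advantages[i] = (samples[i]["reward"] - mu) / (sigma + eps)
    return advantages
```

Normalizing inside each group keeps a coder's turn-0 samples from being compared with a tester's turn-1 samples, which is the failure of the standard grouping assumption that the abstract points to.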
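Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Large language model #math reasoning #tool use #test-time scaling #small language model
🎯 Motivation: Test-time compute scaling improves small language models (sLMs), but verification has mostly relied on a larger model, so whether sLMs can do the verification themselves is an open and important question.
❓ Problem: Investigates whether sLMs can reliably verify candidate outputs under test-time scaling, and addresses their weakness on verification steps that require memorization.
🔍 Analysis: Even with knowledge distillation from larger verifiers, sLMs struggle with memorization-heavy verification such as numerical calculation and fact-checking.
🛠️ Method: Proposes Tool-integrated verification (T1), a two-stage framework that first filters candidates with external tools and then lets an sLM perform the final verification, offloading the memorization-heavy steps to tools.
📊 Data & Experiments: On the MATH benchmark, a Llama-3.2 1B model with T1 under test-time scaling outperforms the much larger Llama-3.1 8B model; T1 also improves the verification accuracy of process reward models and critic models.
⭐ Contributions: Shows that tool integration substantially strengthens the verification ability of sLMs, offering a new option for test-time compute scaling with small models.
Abstract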
Recent studies have demonstrated that test-time compute scaling effectively improves the performance of small language models (sLMs).
However, prior research has mainly examined test-time compute scaling with an additional larger model as a verifier, leaving verification by sLMs underexplored.
In this work, we investigate whether sLMs can reliably verify the output candidates under test-time scaling.
We find that even with knowledge distillation from larger verifiers, sLMs struggle with verification tasks requiring memorization, such as numerical calculations and fact-checking.
To address this limitation, we propose Tool-integrated verification (T1), a two-stage framework that first filters candidates with external tools and then uses an sLM for final verification, offloading memorization-heavy steps to tools such as a code interpreter.
Within T1, we prove that offloading to external tools reduces the memorization burden on sLMs and improves test-time scaling performance.
Experiments on the MATH benchmark demonstrate that, with T1, a Llama-3.2 1B model under test-time scaling outperforms the significantly larger Llama-3.1 8B model.
Moreover, T1 improves the verification accuracy of both process reward models (PRMs) and critic models.
Our findings highlight the potential of tool integration to substantially improve the verification abilities of sLMs.
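The two-stage flow of T1 (tool-based filtering, then a final verifier) can be sketched as follows. This is an illustrative outline only: `check` stands in for an external tool such as a code interpreter, `verifier_score` stands in for the sLM verifier, and both names and the fallback rule are assumptions rather than the paper's implementation.

```python
def tool_filter(candidates, check):
    """Stage 1: keep only candidates that pass the external tool check."""
    return [c for c in candidates if check(c)]


def select_answer(candidates, check, verifier_score):
    """Stage 2: let the (hypothetical) sLM verifier rank the survivors.

    If the tool rejects every candidate, fall back to ranking all of them
    (a design choice made for this sketch).
    """
    survivors = tool_filter(candidates, check) or candidates
    return max(survivors, key=verifier_score)


# Toy usage: each candidate claims a product (a, b, claimed_result).
candidates = [(17, 23, 381), (17, 23, 391), (17, 23, 401)]
exact_check = lambda c: c[0] * c[1] == c[2]   # arithmetic done by the tool
naive_verifier = lambda c: c[2]               # a weak scorer that prefers larger numbers
best = select_answer(candidates, exact_check, naive_verifier)
```

In this toy case the weak scorer alone would pick the wrong candidate, while the tool filter removes the arithmetic errors before the scorer is consulted.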
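Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Large Language Models #Tool-Augmented Reasoning #Test-Time Scaling #Multi-Agent Systems #Code Interpreter #Search
TL;DR: TUMIX combines diverse tool-use agents (text, code, search) into a dynamic mixture, outperforming state-of-the-art test-time scaling with higher accuracy at half the cost.
🎯 Motivation: When large language models are combined with tools such as a code interpreter and search, practical guidance on optimal tool use is lacking, especially for reasoning over diverse kinds of questions.
❓ Problem: How to combine textual reasoning, coding, and search effectively, allocating tool use dynamically to raise reasoning accuracy while lowering inference cost.
🔍 Analysis: The diversity and quality of tool-use strategies are key drivers of reasoning performance, but existing solutions do not achieve efficiency and accuracy at the same time.
🛠️ Method: Proposes TUMIX, which runs multiple agents in parallel, each with its own tool-use strategy, and has them iteratively share and refine answers across rounds.
📊 Data & Experiments: With Gemini-2.5-Pro and Gemini-2.5-Flash on key reasoning benchmarks, TUMIX improves average accuracy by up to 3.55% over the best baseline at near-equal inference cost; halting refinement once confidence is sufficient preserves performance at only 49% of the inference cost.
⭐ Contributions: A dynamic multi-agent tool-use mixture framework that achieves higher accuracy without a significant increase in inference cost, plus the finding that LLMs can auto-optimize agent designs.
Abstract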
While integrating tools like Code Interpreter and Search has significantly enhanced Large Language Model (LLM) reasoning in models like ChatGPT Agent and Gemini-Pro, practical guidance on optimal tool use is lacking. The core challenge is effectively combining textual reasoning, coding, and search for diverse questions. In this paper, we propose Tool-Use Mixture (TUMIX), an ensemble framework that runs multiple agents in parallel, each employing distinct tool-use strategies and answer paths. Agents in TUMIX iteratively share and refine responses based on the question and previous answers. In experiments, TUMIX achieves significant gains over state-of-the-art tool-augmented and test-time scaling methods, delivering an average accuracy improvement of up to 3.55\% over the best baseline on Gemini-2.5-Pro and Gemini-2.5-Flash across key reasoning benchmarks, with near-equal inference costs. We find that agent diversity and quality are crucial and can be enhanced by using LLMs to auto-optimize agent designs. Furthermore, TUMIX can halt refinement upon reaching sufficient confidence, preserving performance at only 49\% of the inference cost. Further scaling can achieve higher performance, albeit at a greater cost.
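Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Compound AI System
🎯 Motivation: As large language models are deployed in compound AI systems, optimizing long-horizon workflows that span multiple modules becomes a key challenge, and existing methods that propagate textual feedback globally hit performance bottlenecks.
❓ Problem: Addresses two depth-scaling failure modes in long-horizon workflows: exploding textual gradients (feedback length grows explosively with depth) and vanishing textual gradients (models overemphasize recent or early feedback, and compressed feedback gradually loses specificity).
🔍 Analysis: Global feedback propagation in long workflows leads to overly long messages and amplified evaluation biases, while multi-hop propagation progressively blurs the information and hurts downstream decisions.
🛠️ Method: Proposes Textual Equilibrium Propagation (TEP), a local learning principle inspired by Equilibrium Propagation in energy-based models. It has a free phase and a nudged phase, combining local optimization with bounded global adjustment and avoiding the compute cost and signal degradation of global feedback.
📊 Data & Experiments: On long-horizon QA benchmarks and a multi-agent tool-use dataset, TEP is more accurate and efficient than global propagation methods such as TextGrad, with gains that grow with depth, while keeping black-box LLM components practical.
⭐ Contributions: TEP mitigates exploding and vanishing textual gradients and provides a new mechanism for optimizing long workflows across modules in compound AI systems.
Abstract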
Large language models (LLMs) are increasingly deployed as part of compound AI systems which coordinate multiple modules (e.g., retrievers, tools, verifiers) over long-horizon workflows. Although recent frameworks that propagate textual feedback globally (e.g., TextGrad) make it feasible to optimize such pipelines, we identify two depth-scaling failure modes in long-horizon agentic workflows: 1) exploding textual gradient, where textual feedback grows exponentially with depth, leading to prohibitively long messages and amplified evaluation biases; and 2) vanishing textual gradient, where limited long-context ability causes models to overemphasize recent or early feedback, while compression of lengthy feedback causes downstream messages to lose specificity gradually as they propagate many hops upstream. To mitigate these issues, we introduce Textual Equilibrium Propagation (TEP), a local learning principle inspired by Equilibrium Propagation in energy-based models. TEP includes two phases: 1) a free phase where local LLM critics iteratively refine prompts until reaching equilibrium (no further improvements are suggested); and 2) a nudged phase which applies proximal prompt edits with bounded modification intensity, using task-level objectives that propagate via forward signaling rather than backward feedback chains. This design supports local prompt optimization followed by controlled adaptation toward global goals without the computational burden and signal degradation of global textual backpropagation. Across long-horizon QA benchmarks and a multi-agent tool-use dataset, TEP consistently improves accuracy and efficiency over global propagation methods such as TextGrad, with gains that increase at greater depths, while preserving the practicality of black-box LLM components in deep compound AI systems.
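Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Large Language Models #Long Horizon #Agents
TL;DR: We measure long-horizon execution capability of LLMs, identify a failure mode where models self-condition on their own errors, and show benefits of increasing model size and sequential test-time compute.
🎯 Motivation: Short-task benchmarks may give the illusion of diminishing returns from scaling, yet marginal gains in single-step accuracy can compound into exponential gains in the length of tasks a model can complete.
❓ Problem: Analyzes why models fail when simple tasks are made longer, arguing that the cause is execution mistakes rather than an inability to reason, and studies how to improve execution.
🔍 Analysis: Larger models correctly execute significantly more turns even when small models already have near-perfect single-turn accuracy. Per-step accuracy degrades as the number of steps grows, partly because of a self-conditioning effect: models become more error-prone when the context contains their own earlier errors.
🛠️ Method: Isolates execution capability by explicitly providing the knowledge and plan, then measures the effect of model size and of thinking. Thinking mitigates self-conditioning, whereas scaling model size alone does not.
📊 Data & Experiments: Benchmarks frontier thinking models on the length of tasks they can execute in a single turn, and analyzes how model size and sequential test-time compute affect long-task performance.
⭐ Contributions: Separates execution from reasoning as a source of failure and highlights the large benefits of scaling model size and sequential test-time compute for long-horizon tasks.
Abstract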
Does continued scaling of large language models (LLMs) yield diminishing returns? In this work, we show that short-task benchmarks may give an illusion of slowing progress, as even marginal gains in single-step accuracy can compound into exponential improvements in the length of tasks a model can successfully complete. Then, we argue that failures of LLMs when simple tasks are made longer arise from mistakes in execution, rather than an inability to reason. So, we propose isolating execution capability, by explicitly providing the knowledge and plan needed to solve a long-horizon task. First, we find that larger models can correctly execute significantly more turns even when small models have near-perfect single-turn accuracy. We then observe that the per-step accuracy of models degrades as the number of steps increases. This is not just due to long-context limitations---curiously, we observe a self-conditioning effect---models become more likely to make mistakes when the context contains their errors from prior turns. Self-conditioning does not reduce by just scaling the model size. But, we find that thinking mitigates self-conditioning, and also enables execution of much longer tasks in a single turn. We conclude by benchmarking frontier thinking models on the length of tasks they can execute in a single turn. Overall, by focusing on the ability to execute, we hope to reconcile debates on how LLMs can solve complex reasoning problems yet fail at simple tasks when made longer, and highlight the massive benefits of scaling model size and sequential test-time compute for long-horizon tasks.
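The compounding claim can be checked with a small calculation. Assuming a constant per-step accuracy p and independent steps (an idealization; the abstract itself reports that per-step accuracy degrades with length), a task of H steps succeeds with probability p^H, so the longest task completed with success rate at least s is H = floor(ln s / ln p). The function name and the default threshold of 0.5 below are choices made for this example.

```python
import math


def horizon_length(step_accuracy, target_success=0.5):
    """Longest task length H with step_accuracy ** H >= target_success,
    assuming independent steps with constant accuracy (idealized model)."""
    if not 0.0 < step_accuracy < 1.0:
        raise ValueError("step_accuracy must lie strictly between 0 and 1")
    if not 0.0 < target_success < 1.0:
        raise ValueError("target_success must lie strictly between 0 and 1")
    return math.floor(math.log(target_success) / math.log(step_accuracy))


h_99 = horizon_length(0.99)     # per-step accuracy of 99%
h_999 = horizon_length(0.999)   # per-step accuracy of 99.9%
```

Under these assumptions, raising per-step accuracy from 0.99 to 0.999, a change of less than one percentage point, extends the 50%-success horizon from 68 to 692 steps.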
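Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#State Space Models #Mamba #Length Generalization #LLM #Transformers
🎯 Motivation: State Space Models (SSMs) are efficient for long sequences thanks to fixed-size memory and linear compute scaling, yet they cannot accurately solve truly long-form generation problems, which limits where they can be applied.
❓ Problem: Overcomes this theoretical limitation by giving SSMs interactive access to tools and problem-dependent training data, enabling generalization to arbitrary problem length and complexity.
🔍 Analysis: Without external tools, SSMs cannot solve complex long-form tasks; with tool use, they generalize markedly better on arithmetic, reasoning, and coding tasks.
🛠️ Method: Proposes tool-augmented SSMs that interact with external tools and are trained on task-specific data, improving their ability to solve complex problems and to generalize across lengths.
📊 Data & Experiments: Experiments on a variety of arithmetic, reasoning, and coding tasks show that tool-augmented SSMs achieve strong length generalization.
⭐ Contributions: Provides theoretical and empirical support for tool use and external interaction as a remedy, positioning SSMs as a potentially efficient alternative to Transformers in interactive, tool-based, and agentic settings.
Abstract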
State Space Models (SSMs) have become the leading alternative to Transformers for sequence modeling tasks. Their primary advantage is efficiency in long-context and long-form generation, enabled by fixed-size memory and linear scaling of computational complexity. We begin this work by showing a simple theoretical result stating that SSMs cannot accurately solve any "truly long-form" generation problem (in a sense we formally define), undermining their main competitive advantage. However, we show that this limitation can be mitigated by allowing SSMs interactive access to external tools. In fact, we show that given the right choice of tool access and problem-dependent training data, SSMs can learn to solve any tractable problem and generalize to arbitrary problem length/complexity (i.e., achieve length generalization). Following our theoretical finding, we demonstrate that tool-augmented SSMs achieve remarkable length generalization on a variety of arithmetic, reasoning, and coding tasks. These findings highlight SSMs as a potential efficient alternative to Transformers in interactive tool-based and agentic settings.
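Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Large Language Models #Tool-Augmented LLMs #Scalable Tool Use #Tool Learning #Collaborative Semantics
TL;DR: We enable LLMs to use massive toolsets by learning compact, compositional tokens whose structure is guided by collaborative semantics, boosting scalability and multi-tool reasoning.
🎯 Motivation: Retrieval-based tool-use pipelines face a dual semantic challenge: their encoders fail to capture complex semantics, and the language model itself lacks intrinsic tool knowledge, which limits reasoning.
❓ Problem: Targets the scalability and semantic bottlenecks caused by assigning each tool a unique identifier, proposing a framework that learns tool relationships from collaborative semantics to improve scaling and generalization.
🔍 Analysis: Semantically isolated tool identifiers prevent the model from recognizing how tools collaborate, and the growing number of identifiers makes vocabulary expansion unscalable.
🛠️ Method: Proposes ToolWeaver, which encodes tool semantics and co-usage relationships into hierarchical code sequences and integrates these structured codes into the LLM through a generative alignment stage.
📊 Data & Experiments: Evaluated with nearly 47,000 tools, demonstrating advantages in large-scale tool use, generalization, and semantic understanding.
⭐ Contributions: Improves the scalability of tool vocabularies and the multi-tool reasoning of LLMs, laying a scalable and semantically aware foundation for tool-augmented agents.
Abstract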
Prevalent retrieval-based tool-use pipelines struggle with a dual semantic challenge: their retrievers often employ encoders that fail to capture complex semantics, while the Large Language Model (LLM) itself lacks intrinsic tool knowledge from its natural language pretraining. Generative methods offer a powerful alternative by unifying selection and execution, tasking the LLM to directly learn and generate tool identifiers. However, the common practice of mapping each tool to a unique new token introduces substantial limitations: it creates a scalability and generalization crisis, as the vocabulary size explodes and each tool is assigned a semantically isolated token. This approach also creates a semantic bottleneck that hinders the learning of collaborative tool relationships, as the model must infer them from sparse co-occurrences of monolithic tool IDs within a vast library. To address these limitations, we propose ToolWeaver, a novel generative tool learning framework that encodes tools into hierarchical sequences. This approach makes vocabulary expansion logarithmic to the number of tools. Crucially, it enables the model to learn collaborative patterns from the dense co-occurrence of shared codes, rather than the sparse co-occurrence of monolithic tool IDs. We generate these structured codes through a novel tokenization process designed to weave together a tool's intrinsic semantics with its extrinsic co-usage patterns. These structured codes are then integrated into the LLM through a generative alignment stage, where the model is fine-tuned to produce the hierarchical code sequences. Evaluation results with nearly 47,000 tools show that ToolWeaver significantly outperforms state-of-the-art methods, establishing a more scalable, generalizable, and semantically-aware foundation for advanced tool-augmented agents.
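The logarithmic vocabulary growth follows from simple counting: with a codebook of K codes per level and L levels, K^L tools can be addressed with only K·L new tokens. The sketch below illustrates that arithmetic with plain positional (base-K) codes. This is an assumption for illustration; ToolWeaver learns its codes from tool semantics and co-usage patterns instead of assigning them by index, and all function names here are hypothetical.

```python
def code_depth(num_tools, codebook_size):
    """Smallest number of levels L with codebook_size ** L >= num_tools."""
    depth, capacity = 1, codebook_size
    while capacity < num_tools:
        capacity *= codebook_size
        depth += 1
    return depth


def new_tokens(num_tools, codebook_size):
    """New vocabulary entries needed: one codebook per level."""
    return codebook_size * code_depth(num_tools, codebook_size)


def index_to_code(index, codebook_size, depth):
    """Map a tool index to a fixed-length code sequence (base-K digits,
    most significant first). Stand-in for a learned hierarchical code."""
    digits = []
    for _ in range(depth):
        index, digit = divmod(index, codebook_size)
        digits.append(digit)
    return digits[::-1]
```

For 47,000 tools, a codebook of 256 codes needs 2 levels and 512 new tokens under this scheme, compared with 47,000 new tokens when each tool receives its own identifier.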
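Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Tree Search #LLM #Agent #Reinforcement Learning
🎯 Motivation: Reinforcement learning has greatly improved the agentic capabilities of large language models, but in long-term, multi-turn tasks, methods driven only by outcome rewards suffer from sparse supervision.
❓ Problem: To counter sparse supervision, proposes a grouped agent RL method based on tree search that extracts learning signals and optimizes the policy more efficiently.
🔍 Analysis: Compared with chain-based sampling, tree-structured trajectories naturally yield step-wise supervision signals from outcome rewards alone, which suits sparse-reward settings better.
🛠️ Method: Proposes Tree-GRPO, which shares common prefixes to obtain more rollouts within a fixed budget and estimates grouped relative advantages at both the intra-tree and inter-tree levels; the intra-tree objective is shown to be equivalent to step-level direct preference learning.
📊 Data & Experiments: Across 11 datasets and 3 types of QA tasks, Tree-GRPO outperforms chain-based RL.
⭐ Contributions: A tree-search-based RL framework that makes better use of supervision signals in long-term tasks, supported by both theoretical analysis and experiments.
Abstract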
Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs).
In long-term and multi-turn agent tasks, existing approaches driven solely by outcome rewards often suffer from the problem of sparse supervision.
To address this challenge, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search, where each tree node represents a complete agent interaction step.
By sharing common prefixes, the tree search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls.
Moreover, we find that the tree-structured trajectory naturally allows the construction of step-wise process supervision signals even when using only the outcome reward.
Based on this, Tree-GRPO estimates the grouped relative advantages both on intra-tree and inter-tree levels.
Through theoretical analysis, we demonstrate that the objective of intra-tree level group relative policy optimization is equivalent to that of step-level direct preference learning.
Experiments across 11 datasets and 3 types of QA tasks demonstrate the superiority of the proposed tree-based RL over the chain-based RL method.
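Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#evolutionary strategies #multi-agent LLM systems #role-based delegation #logits-to-agent mapping
TL;DR: A 0.6B coordinator whose ES-evolved 10k-param head reads the penultimate token to pick agents and assign worker/thinker/verifier roles, beating single models/routers without SFT or RL.
🎯 Motivation: Combining multiple foundation models is promising, but weight merging is limited by mismatched architectures and closed APIs, so an effective coordination mechanism is needed.
❓ Problem: Develops a lightweight coordinator that handles role assignment and task delegation among models while avoiding the performance ceiling of single-model or router-based systems.
🔍 Analysis: The coordinator's hidden-state representations provide rich contextualization of the input, and the chosen evolutionary strategy works well under high dimensionality and strict budget constraints.
🛠️ Method: A coordinator made of a compact language model (about 0.6B parameters) and a lightweight head (about 10K parameters) is optimized with an evolutionary strategy to perform role-based dynamic delegation and model selection.
📊 Data & Experiments: Extensive experiments on coding, math, reasoning, and domain-knowledge tasks show strong results and generalization to out-of-distribution tasks, including a new record of 86.2% on LiveCodeBench.
⭐ Contributions: Proposes and validates a role-based, multi-turn coordination framework that outperforms existing methods, with theoretical and empirical support for evolutionary strategies in high-dimensional optimization.
Abstract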
Combining diverse foundation models is promising, but weight-merging is limited by mismatched architectures and closed APIs. **Trinity** addresses this with a lightweight coordinator that orchestrates collaboration among large language models (LLMs). The coordinator, comprising a compact language model ($\approx 0.6$B parameters) and a lightweight head ($\approx 10$K parameters), is optimized with an evolutionary strategy for efficient and adaptive delegation. **Trinity** processes queries over multiple turns, where at each turn the coordinator assigns one of three roles (*Thinker*, *Worker*, or *Verifier*) to a selected LLM, effectively offloading complex skill acquisition from the coordinator itself. Extensive experiments demonstrate that **Trinity** consistently outperforms individual models and existing methods in various tasks, including coding, math, reasoning, and domain knowledge, while robustly generalizing to out-of-distribution tasks. On established benchmarks, **Trinity** achieves state-of-the-art performance, including a new record of $86.2\%$ on LiveCodeBench. Theoretical and empirical analyses highlight two key factors driving this success: (1) the coordinator’s hidden-state representations provide rich contextualization of inputs, and (2) under high dimensionality and strict budget constraints, the separable Covariance Matrix Adaptation Evolution Strategy algorithm provides substantial advantages over RL, imitation learning, and random search, leveraging potential block-$\varepsilon$-separability.
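Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#GUI Grounding #GUI Agents #Multimodal Large Language Model
🎯 Motivation: Existing GUI grounding work treats user instructions as a static proxy for intent, overlooking how instruction diversity and quality affect performance. The study finds a 23.3% flaw rate in the instructions of existing datasets and shows that exploiting instruction diversity yields up to a 76% relative improvement, which motivates treating instructions dynamically.
❓ Problem: Proposes the Instruction-as-Reasoning paradigm, which treats instructions as dynamic analytical pathways so that the model can select the most effective pathway during reasoning. A two-stage training framework first applies supervised fine-tuning on multi-perspective instructions and then uses reinforcement learning to optimize pathway selection and composition.
🔍 Analysis: The high flaw rate of instructions in current GUI grounding datasets limits model performance. Instruction diversity has a large effect on performance, and static instruction handling cannot exploit it.
🛠️ Method: Stage one performs supervised fine-tuning on synthesized, diverse instructions to instill multi-perspective reasoning; stage two uses reinforcement learning to optimize how reasoning pathways are selected and composed, producing the UI-Ins-7B and UI-Ins-32B models.
📊 Data & Experiments: State-of-the-art results on five grounding benchmarks, including UI-I2E-Bench, where UI-Ins-32B reaches 87.3% accuracy; on AndroidWorld, UI-Ins-7B as the executor achieves a 74.1% success rate. Further analysis shows how reasoning can be formulated to help grounding and how the method mitigates policy collapse in the SFT+RL framework.
⭐ Contributions: Establishes the Instruction-as-Reasoning paradigm and a two-stage training framework, releases the UI-Ins model family, systematically shows that instruction diversity strengthens GUI grounding, and releases all code and models.
Abstract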
GUI grounding, which maps natural-language instructions to actionable UI elements, is a core capability of GUI agents. Prior work largely treats instructions as a static proxy for user intent, overlooking the impact of instruction diversity and quality on grounding performance. Through a careful investigation of existing grounding datasets, we find a 23.3% flaw rate in their instructions and show that inference-time exploitation of instruction diversity yields a relative performance improvement of up to 76%.
In this paper, we introduce the Instruction-as-Reasoning paradigm, treating instructions as dynamic analytical pathways that offer distinct perspectives and enabling the model to select the most effective pathway during reasoning. To achieve this, we propose a two-stage training framework: supervised fine-tuning on synthesized, diverse instructions to instill multi-perspective reasoning, followed by reinforcement learning to optimize pathway selection and composition.
Our resulting models, UI-Ins-7B and UI-Ins-32B, achieve state-of-the-art results on five challenging grounding benchmarks and exhibit emergent reasoning, selectively composing and synthesizing novel instruction pathways at inference. In particular, UI-Ins-32B attains the best grounding accuracy, scoring 87.3% on UI-I2E-Bench, 57.0% on ScreenSpot-Pro, and 84.9% on MMBench-GUI L2. Furthermore, our model demonstrates strong agentic potential, achieving a 74.1% success rate on AndroidWorld using UI-Ins-7B as the executor. Our in-depth analysis reveals additional insights such as how reasoning can be formulated to enhance rather than hinder grounding performance, and how our method mitigates policy collapse in the SFT+RL framework. All code and models are released.
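Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Agentic RL #Asynchronous RL #Search Agent
TL;DR: This paper introduces training expert-level search agents with large-scale RL. By using a 128-turn limit and high-quality synthetic data, the agent learns complex long-horizon search strategies, reaching competitive results on frontier benchmarks.
🎯 Motivation: Existing open-source search agents depend heavily on commercial large language models (LLMs); training a high-performing search agent independently with reinforcement learning would reduce that dependence.
❓ Problem: How to use large-scale end-to-end reinforcement learning, without commercial APIs, to train a long-horizon search agent capable of complex reasoning.
🔍 Analysis: A single-model agent trained with RL can approach or surpass agents that rely on commercial LLMs, and it shows strong zero-shot transferability.
🛠️ Method: A two-stage agentic process synthesizes high-quality QA pairs as training data, and large-scale long-horizon RL lets the agent take up to 128 actions per rollout.
📊 Data & Experiments: On GAIA, xBench, and Frames, ASearcher scores 58.1, 51.1, and 74.5 using only basic search tools; with an additional summary tool and aggregation over 16 parallel rollouts, the scores rise to 71.8, 75.0, and 83.4.
⭐ Contributions: A search agent trained purely by RL without commercial APIs for data or tools, evidence for the potential of RL scaling on long-horizon agentic tasks, and a release of all code and data.
Abstract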
Recent advancements in LLM-based agents have demonstrated remarkable capabilities in handling knowledge-intensive tasks using external tools. One representative example is the search agent. Existing open-source search agents heavily rely on advanced commercial LLMs: they either collect trajectories from the larger, stronger models for supervised fine-tuning or directly use them as specialized tools. In this work, we develop ASearcher, a single-model search agent purely trained by reinforcement learning (RL) without using any commercial APIs for data or tools. Based on an RL-trained QwQ-32B model, ASearcher is capable of conducting complex reasoning, such as uncertainty analysis and conflict verification, and achieves performance comparable to commercial search agents. Two key techniques unlock such long-horizon information-seeking abilities: first, we design a two-stage agentic process to synthesize high-quality QA pairs as the training data for RL; second, we conduct large-scale long-horizon RL, allowing the agent to take up to 128 actions per rollout for sufficient exploration. In particular, after RL training, ASearcher achieves scores of 58.1 on GAIA, 51.1 on xBench, and 74.5 on Frames using only basic search tools. Furthermore, ASearcher also demonstrates strong zero-shot transferability: it can be further augmented with an additional summary tool, which is supported by DeepSeek-V3, and with test-time scaling, which aggregates the answers from 16 parallel rollouts. With both zero-shot enhancements, ASearcher's scores further rise to 71.8, 75.0, and 83.4, respectively, outperforming OpenAI DeepResearch and Kimi-Researcher and suggesting the great potential of RL scaling for agentic tasks. We release all the code and data at an anonymous link. The model will be released after the review process.
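Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#Reinforcement Learning #Vision Language Model #Reasoning
TL;DR: Reinforcement learning finetuning can enable vision language models to think with intermediate image reasoning steps.
🎯 Motivation: Reinforcement learning finetuning (RFT) has markedly improved the reasoning ability of large language models (LLMs), but work on vision-language models (VLMs) is mostly limited to text reasoning conditioned on the original image and does not integrate visual reasoning steps into the response. Test-time methods introduce visual steps but lack a training mechanism.
❓ Problem addressed: VLMs cannot effectively incorporate intermediate visual reasoning steps when generating multimodal chains of thought. The paper proposes VTool-R1, the first RFT framework that trains VLMs to interleave text and visual reasoning steps, that is, to "think with images".
🔍 Observations: Existing RFT methods for VLMs usually perform only text reasoning conditioned on the original image and ignore the generation of visual reasoning steps, while test-time visual reasoning methods lack systematic training, so models do not learn when and how to generate useful intermediate visual steps.
🛠️ Method: VTool-R1 integrates Python-based visual editing tools into the RFT process and trains with outcome-based rewards, so that VLMs learn to generate visual reasoning steps strategically without relying on process supervision.
📊 Data and experiments: Extensive experiments on structured visual reasoning over charts and tables show that VTool-R1 improves reasoning performance by teaching VLMs to "think with images" and to generate multimodal chains of thought with tools.
⭐ Contributions: Proposes the first RFT framework that trains VLMs to generate multimodal chains of thought with interleaved text and visual reasoning steps, and open-sources the code to support future research on multi-turn multimodal reasoning.
Abstract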
Reinforcement learning finetuning (RFT) has significantly advanced the reasoning capabilities of large language models (LLMs) by enabling long chains of thought, multi-turn self-correction, and effective tool use. While recent works attempt to extend RFT to vision-language models (VLMs), these efforts largely focus on text-only reasoning conditioned on original image inputs, and do not incorporate visual reasoning in the response. In contrast, test-time methods like Visual Sketchpad incorporate visual steps but lack training mechanisms.
We introduce VTool-R1, the first RFT framework that trains VLMs to generate multimodal chains of thought by interleaving text and intermediate visual reasoning steps. VTool-R1 integrates Python-based visual editing tools into the RFT process, enabling VLMs to learn when and how to generate visual reasoning steps that enhance the final output quality. Trained with outcome-based rewards, our approach elicits strategic visual tool use for multimodal reasoning without relying on process-based supervision. Extensive experiments on structured visual reasoning over charts and tables show that VTool-R1 enhances reasoning performance by teaching VLMs to "think with images" and generate multimodal chains of thought with tools. To support future research in multi-turn multimodal reasoning, we open-source our code at https://github.com/VTOOL-R1/vtool-r1.
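Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#multi-agent system #visual hallucination snowballing
TL;DR: Using visual flow to relay information mitigates visual hallucination snowballing in multi-agent systems.
🎯 Motivation: Multi-agent systems driven by vision-language models are prone to a new failure mode when carrying out complex tasks: multi-agent visual hallucination snowballing. A hallucination arises in a single agent and is amplified because subsequent agents over-rely on the textual flow to relay visual information.
❓ Problem addressed: The paper aims to mitigate visual hallucination snowballing in multi-agent systems. The core idea is to introduce a visual flow that relays key visual evidence, reducing over-reliance on the textual flow and thereby suppressing the accumulation and amplification of hallucinations over multiple turns.
🔍 Observations: Turn-wise, layer-wise, and token-wise attention analyses show that hallucination snowballing is directly tied to a reduction in visual attention allocation. A subset of vision tokens shows a unimodal attention peak in the middle layers and preserves visual evidence well, but this pattern fades in deeper agent turns, which ultimately lets hallucinations spread.
🛠️ Method: Proposes ViF, a lightweight, model-agnostic mitigation paradigm. It builds a visual flow from selected visual relay tokens to pass information between agents and applies attention reallocation to reinforce this pattern, improving the retention of visual evidence.
📊 Data and experiments: Experiments cover four common multi-agent structures and ten base models on eight benchmarks. The results show that the method markedly reduces hallucination snowballing and consistently improves performance. The source code is public.
⭐ Contributions: Presents the first systematic analysis of the mechanism behind multi-agent visual hallucination snowballing and proposes ViF, a general visual-flow mitigation paradigm. The method improves robustness and performance across multiple benchmarks and models, offering a new approach to reliability in multimodal multi-agent systems.
Abstract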
Multi-agent systems (MAS) powered by Visual Language Models (VLMs) enable challenging tasks but suffer from a novel failure mode, multi-agent visual hallucination snowballing, where hallucinations are seeded in a single agent and amplified by the following ones due to over-reliance on textual flow to relay visual information. Through turn-, layer-, and token-wise attention analyses, we provide detailed insights into the essence of hallucination snowballing in terms of the reduction of visual attention allocation. This leads us to identify a subset of vision tokens with a unimodal attention peak in middle layers that best preserve visual evidence but gradually diminish in deeper agent turns, resulting in visual hallucination snowballing in MAS. Thus, we propose ViF, a lightweight, model-agnostic mitigation paradigm that relays inter-agent messages with Visual Flow powered by the selected visual relay tokens and applies attention reallocation to amplify this pattern. Experimental results demonstrate that our method markedly reduces hallucination snowballing, consistently improving performance across eight benchmarks based on four common MAS structures and ten base models. The source code is publicly available at: https://github.com/YU-deep/ViF.git.
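Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#web agents #tool use #LLMs #agentic reasoning
TL;DR: Web agents that learn tools autonomously, leading to improved success rate and efficiency.
🎯 Motivation: Current web agents rely on step-by-step UI interaction and heavy LLM reasoning to automate complex browser tasks and struggle with dynamic layouts and long task chains, whereas humans work efficiently by using the high-level functionality that websites provide (such as search, filter, and sort).
❓ Problem addressed: Proposes a framework that autonomously discovers latent website functionality and invokes it as tools, addressing the brittleness and inefficiency of existing methods in dynamic settings.
🔍 Observations: Existing web-agent methods perform poorly on complex dynamic tasks mainly because they depend on inefficient step-by-step reasoning; exploiting built-in website functionality can substantially improve robustness and efficiency.
🛠️ Method: Introduces the WALT framework, which reverse-engineers latent website functionality into callable tools spanning discovery (search, filter, sort), communication, and content management, reducing low-level interaction and improving task execution efficiency.
📊 Data and experiments: Achieves the highest success rates on VisualWebArena and WebArena (52.9% and 50.1%, respectively). On Online-Mind2Web, a benchmark of 139 real-world websites, it autonomously discovers 252 tools and improves the success rate by 20.5% over a tool-free baseline.
⭐ Contributions: Proposes a robust and generalizable paradigm for browser automation that simplifies reasoning through tool-level operations, markedly improves the success rate and efficiency of web agents, and releases the code.
Abstract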
Web agents promise to automate complex browser tasks, but current methods remain brittle -- relying on step-by-step UI interactions and heavy LLM reasoning that break under dynamic layouts and long horizons. Humans, by contrast, exploit website-provided functionality through high-level operations like search, filter, and sort. We introduce WALT (Web Agents that Learn Tools), a framework that reverse-engineers latent website functionality into deterministic, callable tools. Rather than hypothesizing ad-hoc skills, WALT exposes robust implementations of automations already designed into websites, spanning discovery (search, filter, sort), communication (post, comment, upvote), and content management (create, edit, delete). Tools abstract away low-level execution: instead of reasoning about how to click and type, agents simply call search(query) or create(listing). This shifts the computational burden from fragile step-by-step reasoning to reliable tool invocation. On VisualWebArena and WebArena, WALT achieves state-of-the-art success rates (52.9% on VisualWebArena, 50.1% on WebArena) with fewer steps and less LLM-dependent reasoning. On Online-Mind2Web, a benchmark of 139 real-world websites, WALT autonomously discovers 252 tools and improves success rate by 20.5% over a tool-free baseline, establishing a robust and generalizable paradigm for browser automation. Code: https://github.com/SalesforceAIResearch/WALT
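The abstract describes agents calling `search(query)` or `create(listing)` instead of reasoning about clicks and keystrokes. A minimal sketch of that calling pattern, assuming a plain name-to-callable registry; `ToolRegistry`, the in-memory `LISTINGS` data, and the two tools are hypothetical stand-ins, not WALT's actual implementation:

```python
from typing import Any, Callable, Dict, List


class ToolRegistry:
    """Maps tool names to callables (a stand-in for tools discovered on a site)."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        if name in self._tools:
            raise ValueError(f"tool {name!r} already registered")
        self._tools[name] = fn

    def call(self, name: str, **kwargs: Any) -> Any:
        if name not in self._tools:
            raise KeyError(f"unknown tool {name!r}")
        return self._tools[name](**kwargs)

    def names(self) -> List[str]:
        return sorted(self._tools)


# Hypothetical in-memory "site" used only for illustration.
LISTINGS: List[Dict[str, Any]] = [
    {"title": "Road bike", "price": 300},
    {"title": "Mountain bike", "price": 450},
    {"title": "Desk lamp", "price": 20},
]


def search(query: str) -> List[Dict[str, Any]]:
    """Case-insensitive substring match on listing titles."""
    q = query.lower()
    return [item for item in LISTINGS if q in item["title"].lower()]


def sort_by_price(items: List[Dict[str, Any]], descending: bool = False) -> List[Dict[str, Any]]:
    """Return a new list ordered by price."""
    return sorted(items, key=lambda item: item["price"], reverse=descending)


registry = ToolRegistry()
registry.register("search", search)
registry.register("sort", sort_by_price)

# Two deterministic tool calls replace a sequence of low-level UI actions.
results = registry.call("sort", items=registry.call("search", query="bike"), descending=True)
```

In a real system each registered callable would wrap a replayable browser or HTTP action rather than an in-memory list.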
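Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#code agent #website generation #large language model
TL;DR: We propose WebGen-Agent, a novel website-generation agent that leverages comprehensive and multi-level visual feedback to iteratively generate and refine the website codebase.
🎯 Motivation: Existing LLM-based code-generation agents underperform on website codebase generation because they rely only on simple code-execution feedback and cannot assess the actual visual and interactive quality of the generated code.
❓ Problem addressed: Given the importance of visual and interaction feedback in website generation, the paper proposes an agent system that uses multi-level visual feedback to generate and refine code iteratively.
🔍 Observations: Current code agents verify website generation only through code-execution results and overlook visual appearance and user-interaction experience, so generation quality is hard to guarantee.
🛠️ Method: Proposes WebGen-Agent, which uses a vision-language model to analyze website screenshots and GUI-agent testing, producing multi-level quality scores and textual feedback, and introduces Step-GRPO, a reinforcement learning algorithm that uses step-level rewards for process supervision.
📊 Data and experiments: On WebGen-Bench, raises the accuracy of Claude 3.5 Sonnet from 26.4% to 51.9% and its appearance score from 3.0 to 3.9, outperforming the previous state-of-the-art agent system.
⭐ Contributions: Designs the first website-generation agent framework that combines visual feedback with step-level reinforcement learning, and uses multi-level scoring with a backtracking and select-best mechanism to generate high-quality website codebases iteratively and efficiently.
Abstract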
Agent systems powered by large language models (LLMs) have demonstrated impressive performance on repository-level code-generation tasks. However, for tasks such as website codebase generation, which depend heavily on visual effects and user-interaction feedback, current code agents rely only on simple code execution for feedback and verification. This approach fails to capture the actual quality of the generated code. In this paper, we propose WebGen-Agent, a novel website-generation agent that leverages comprehensive and multi-level visual feedback to iteratively generate and refine the website codebase. Detailed and expressive text descriptions and suggestions regarding the screenshots and GUI-agent testing of the websites are generated by a visual language model (VLM), together with scores that quantify their quality. The screenshot and GUI-agent scores are further integrated with a backtracking and select-best mechanism, enhancing the performance of the agent. Utilizing the accurate visual scores inherent in the WebGen-Agent workflow, we further introduce Step-GRPO with Screenshot and GUI-Agent Feedback to improve the ability of LLMs to act as the agent-engine model. By using the screenshot and GUI-agent scores at each step as the reward in Step-GRPO, we provide a dense and reliable process supervision signal, which effectively improves the model's website-generation ability. On the WebGen-Bench dataset, WebGen-Agent increases the accuracy of Claude 3.5 Sonnet from 26.4% to 51.9% and its appearance score from 3.0 to 3.9, outperforming the previous state-of-the-art agent system. Additionally, our training approach increases the accuracy of Qwen2.5-Coder-7B-Instruct from 38.9% to 45.4% and raises the appearance score from 3.4 to 3.7.
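The abstract mentions that screenshot and GUI-agent scores feed a backtracking and select-best mechanism but gives no rule. A minimal sketch under stated assumptions: the two scores are simply summed, the agent backtracks when the last `patience` steps all score below the best earlier step, and the final output is the highest-scoring step. The `Step` class, the summation, and the `patience` parameter are hypothetical:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    index: int
    screenshot_score: float
    gui_score: float

    @property
    def total(self) -> float:
        # Assumption: the two scores are combined by simple addition.
        return self.screenshot_score + self.gui_score


def select_best(history: List[Step]) -> Step:
    """Return the highest-scoring step; the earliest one wins ties."""
    if not history:
        raise ValueError("history is empty")
    return max(history, key=lambda s: s.total)


def should_backtrack(history: List[Step], patience: int = 2) -> bool:
    """True when the last `patience` steps all scored below the best earlier step."""
    if patience < 1 or len(history) <= patience:
        return False
    best_before = max(s.total for s in history[:-patience])
    return all(s.total < best_before for s in history[-patience:])


history = [
    Step(0, 2.0, 3.0),  # total 5.0
    Step(1, 4.0, 3.0),  # total 7.0
    Step(2, 3.0, 3.0),  # total 6.0
    Step(3, 2.0, 2.0),  # total 4.0
]
best = select_best(history)
```

With this history the agent would return to step 1, since both later steps scored lower.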
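Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#lm agent #retrieval-augmented generation #Reinforcement Learning
🎯 Motivation: Existing search agents perform poorly in interactive information retrieval: their tool use stays shallow and interaction errors accumulate.
❓ Problem addressed: Proposes a new method that lengthens the search agent's tool-use chains and improves answer accuracy while strengthening performance in real-world environments.
🔍 Observations: Introducing a self-reflection mechanism effectively extends tool-use chains while reducing the error accumulation caused by multi-turn interaction.
🛠️ Method: Designs a two-stage training framework that combines cold start with reinforcement learning and trains the model on a large dataset annotated with reflection patterns, strengthening the agent's capacity for deep interaction.
📊 Data and experiments: Reaches 72.3% and 90.0% accuracy on HotpotQA and SimpleQA, respectively, with a single 14B model, and generalizes well to out-of-distribution data.
⭐ Contributions: Proposes the WebSeer search agent, which markedly improves deep-interaction ability and answer accuracy on complex retrieval tasks and reports state-of-the-art results on HotpotQA and SimpleQA.
Abstract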
Search agents have achieved significant advancements in enabling intelligent information retrieval and decision-making within interactive environments.
Although reinforcement learning has been employed to train agentic models capable of more dynamic interactive retrieval, existing methods are limited by shallow tool-use depth and the accumulation of errors over multiple iterative interactions.
In this paper, we present WebSeer, a more intelligent search agent trained via reinforcement learning enhanced with a self-reflection mechanism. Specifically, we construct a large dataset annotated with reflection patterns and design a two-stage training framework that unifies cold start and reinforcement learning within the self-reflection paradigm for real-world web-based environments, which enables the model to generate longer and more reflective tool-use trajectories.
Our approach substantially extends tool-use chains and improves answer accuracy. Using a single 14B model, we achieve state-of-the-art results on HotpotQA and SimpleQA, with accuracies of 72.3% and 90.0%, respectively, and demonstrate strong generalization to out-of-distribution datasets.
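Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#generative engine optimization #generative engines #preference rule discovery #reinforcement learning
🎯 Motivation: Generative search engines retrieve documents with large language models and generate natural-language responses, markedly improving user experience and becoming a new form of search. This pushes content providers to look for optimization methods that attract more users.
❓ Problem addressed: Investigates how to learn a generative engine's content preference rules automatically and use them to rewrite web content for greater traction while preserving search utility.
🔍 Observations: Generative engines have distinctive content preferences that shape how they use retrieved content when generating responses. These preferences can be extracted and analyzed to guide content optimization.
🛠️ Method: Proposes the AutoGEO framework, which prompts large language models to explain engine preferences, extracts preference rules from the explanations, and applies them to content rewriting. The rules serve both as context for a prompt-based API system and as training rewards for a cost-effective model.
📊 Data and experiments: Validated on the standard GEO-Bench and two newly constructed benchmarks based on real user queries; the results show improved content traction with search utility preserved. Analyses confirm the robustness of the rules and their ability to capture domain-specific preferences.
⭐ Contributions: Introduces the AutoGEO system and its cost-effective variant, and releases the preference rules and code, offering a new direction for generative engine optimization.
Abstract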
By employing large language models (LLMs) to retrieve documents and generate natural language responses, Generative Engines, such as Google AI overview and ChatGPT, provide significantly enhanced user experiences and have rapidly become the new form of search. Their rapid adoption also drives the need for Generative Engine Optimization (GEO), as content providers are eager to gain more traction from them. In this paper, we introduce AutoGEO, a framework to automatically learn generative engine preferences when using retrieved content for response generation, and rewrite web content for more such traction. AutoGEO first prompts frontier LLMs to explain generative engine preferences and extracts meaningful preference rules from these explanations. Then it uses preference rules as context engineering for AutoGEO_API, a prompt-based GEO system, and as rule-based rewards to train AutoGEO_Mini, a cost-effective GEO model. Experiments on the standard GEO-Bench and two newly constructed benchmarks using real user queries demonstrate the effectiveness of AutoGEO in enhancing content traction while preserving search utility. Analyses confirm the learned rules' robustness and ability to capture unique preferences in different domains, and the AutoGEO systems' ability to embed them in content optimization. The learned preference rules, our models, and the code are released at https://github.com/cxcscmu/AutoGEO
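Foundation/Frontier Models (incl. LLMs)
Agents and Tool Use
#agent; Vision Language Model; Uncertainty
TL;DR: We present a framework that reframes coordination as a decentralized market for uncertainty.
🎯 Motivation: Vision-language models (VLMs) can power strong multi-agent systems, but coordinating heterogeneous agents under information asymmetry often drives costs out of control. Existing methods rely on heuristic proxies that ignore cost and collapse the structure of uncertainty, leading to provably suboptimal coordination that is hard to scale economically.
❓ Problem addressed: Proposes Agora, a framework that recasts multi-agent coordination as a decentralized market for uncertainty. Market mechanisms turn epistemic uncertainty into a tradable asset, and rational economic rules drive profitable exchange among agents, yielding cost-efficient coordination.
🔍 Observations: Existing coordination paradigms, such as Mixture-of-Agents and knowledge-based routers, depend on cost-blind heuristic proxies that collapse uncertainty structure. The result is soaring coordination cost and suboptimal performance, which limits the economic viability and scalability of multi-agent visual systems.
🛠️ Method: Agora formalizes epistemic uncertainty (perceptual, semantic, inferential) as a structured, tradable asset. It introduces a market-aware broker that extends Thompson Sampling to initiate collaboration and steer the system toward cost-efficient equilibria.
📊 Data and experiments: Experiments on five multimodal benchmarks (MMMU, MMBench, MathVision, InfoVQA, CC-OCR) show that Agora beats existing VLMs and heuristic multi-agent strategies in both performance (for example, +8.5% accuracy over the best baseline on MMMU) and cost (a reduction of more than 3×).
⭐ Contributions: Proposes Agora, the first framework to formalize multi-agent coordination as a market for uncertainty, and establishes market-based coordination as a principled, scalable paradigm for building economically viable multi-agent visual intelligence systems, validated experimentally for effectiveness and efficiency.
Abstract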
Vision-Language Models (VLMs) enable powerful multi-agent systems, but scaling them is economically unsustainable: coordinating heterogeneous agents under information asymmetry often causes costs to spiral. Existing paradigms, such as Mixture-of-Agents and knowledge-based routers, rely on heuristic proxies that ignore costs and collapse uncertainty structure, leading to provably suboptimal coordination.
We introduce Agora, a framework that reframes coordination as a decentralized market for uncertainty. Agora formalizes epistemic uncertainty into a structured, tradable asset (perceptual, semantic, inferential), and enforces profitability-driven trading among agents based on rational economic rules. A market-aware broker, extending Thompson Sampling, initiates collaboration and guides the system toward cost-efficient equilibria. Experiments on five multimodal benchmarks (MMMU, MMBench, MathVision, InfoVQA, CC-OCR) show that Agora outperforms strong VLMs and heuristic multi-agent strategies, e.g., achieving +8.5% accuracy over the best baseline on MMMU while reducing cost by over 3×. These results establish market-based coordination as a principled and scalable paradigm for building economically viable multi-agent visual intelligence systems.
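The abstract says the broker extends Thompson Sampling but does not describe the extension. A minimal sketch of the underlying idea only: standard Beta-Bernoulli Thompson Sampling, with an assumed linear cost penalty subtracted from each sampled success probability. The class name `ThompsonBroker`, the `cost_weight` parameter, and the penalty form are hypothetical and are not Agora's actual broker:

```python
import random
from typing import Dict, Iterable


class ThompsonBroker:
    """Picks which agent to consult by sampling from Beta posteriors."""

    def __init__(self, agents: Iterable[str], costs: Dict[str, float],
                 cost_weight: float = 0.1, seed: int = 0) -> None:
        self.rng = random.Random(seed)
        self.costs = dict(costs)
        self.cost_weight = cost_weight
        # Beta(1, 1) prior on each agent's success probability.
        self.alpha = {a: 1.0 for a in agents}
        self.beta = {a: 1.0 for a in self.alpha}

    def choose(self) -> str:
        """Sample a success rate per agent, subtract the cost penalty, take the best."""
        def utility(agent: str) -> float:
            sampled = self.rng.betavariate(self.alpha[agent], self.beta[agent])
            return sampled - self.cost_weight * self.costs[agent]
        return max(self.alpha, key=utility)

    def update(self, agent: str, success: bool) -> None:
        """Bayesian update of the chosen agent's posterior."""
        if success:
            self.alpha[agent] += 1.0
        else:
            self.beta[agent] += 1.0


def simulate(true_rates: Dict[str, float], costs: Dict[str, float],
             rounds: int = 500, seed: int = 1) -> Dict[str, int]:
    """Run the broker against Bernoulli agents and count how often each is picked."""
    broker = ThompsonBroker(true_rates, costs, seed=seed)
    env = random.Random(seed + 1)
    picks = {a: 0 for a in true_rates}
    for _ in range(rounds):
        agent = broker.choose()
        picks[agent] += 1
        broker.update(agent, env.random() < true_rates[agent])
    return picks


picks = simulate({"strong": 0.9, "weak": 0.1}, {"strong": 1.0, "weak": 1.0})
```

With equal costs the broker concentrates on the agent with the higher success rate; raising `cost_weight` or an agent's cost shifts the choice toward cheaper agents.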
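Multimodal Foundation Models: 74 papers
Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Vision Language Models #Spatial Reasoning
🎯 Motivation: Aims to bridge single-view 2D images and multi-view 3D data for more accurate cross-view spatial reasoning. Current methods often depend on multi-frame annotation or lack a unified representation space, which limits their use in open scenes.
❓ Problem addressed: Addresses the weaknesses of earlier methods in flexible region prompting and cross-frame spatial reasoning without requiring exhaustive multi-frame labeling. The model accepts 2D region annotations on any frame or direct 3D annotations and is better at inferring spatial relations for objects that are not visible in the same view.
🔍 Observations: Existing vision-language models are limited in 3D spatial reasoning, especially when the target objects do not appear in the same view. Without a space that unifies 2D and 3D representations, scene understanding lacks depth and accuracy.
🛠️ Method: Proposes SR-3D, which connects 2D and 3D data through a shared visual token space. The core idea is to enrich 2D visual features with 3D positional embeddings so that strong 2D priors improve cross-frame spatial reasoning. Region prompts can be bounding boxes, segmentation masks, or direct 3D regions.
📊 Data and experiments: Extensive experiments on general 2D vision-language benchmarks and specialized 3D spatial benchmarks show state-of-the-art performance. The model also applies to in-the-wild videos without sensory 3D inputs or ground-truth 3D annotations, where it infers spatial relationships and metric measurements.
⭐ Contributions: Presents the first vision-language model to unify 2D and 3D through a shared representation space with flexible region prompting. It markedly improves cross-view spatial reasoning, especially when objects do not co-occur, and opens a route to 3D-aware applications on in-the-wild video.
Abstract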
We present a Spatial Region 3D (SR-3D) aware vision-language model that connects single-view 2D images and multi-view 3D data through a shared visual token space. SR-3D supports flexible region prompting, allowing users to annotate regions with bounding boxes, segmentation masks on any frame, or directly in 3D, without the need for exhaustive multi-frame labeling. We achieve this by enriching 2D visual features with 3D positional embeddings, which allows the 3D model to draw upon strong 2D priors for more accurate spatial reasoning across frames, even when objects of interest do not co-occur within the same view. Extensive experiments on both general 2D vision-language and specialized 3D spatial benchmarks demonstrate that SR-3D achieves state-of-the-art performance, underscoring its effectiveness in unifying the 2D and 3D representation spaces for scene understanding. Moreover, we observe applicability to in-the-wild videos without sensory 3D inputs or ground-truth 3D annotations, where SR-3D accurately infers spatial relationships and metric measurements. We show more qualitative results at https://www.anjiecheng.me/sr3d.
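Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Large Vision-Language Model #Hallucination Mitigation #Activation Editing #inference-time
🎯 Motivation: Large vision-language models have made substantial progress on cross-modal tasks, but their language bias causes object hallucination, which hinders trustworthy AI applications. Existing mitigation methods make little use of guidance from factual textual semantics and therefore struggle to reduce language bias explicitly.
❓ Problem addressed: To tackle object hallucination, the paper proposes AFTER, an adaptive factual-guided activation editing method that uses factual semantics to steer internal activations from the biased direction toward the factual one. It handles category, attribute, and relation hallucination and intervenes at inference time at minimal cost.
🔍 Observations: Object hallucination falls mainly into category, attribute, and relation hallucination and stems from language bias that makes the model generate content inconsistent with the visual input. Existing editing methods do not use factual textual semantics effectively, which limits their ability to reduce bias.
🛠️ Method: AFTER has two core modules. Factual-Augmented Activation Steering (FAS) provides general factual guidance by modeling visual-textual associations. Query-Adaptive Offset Optimization (QAO) generates a query-specific editing offset, increasing the diversity and granularity of the edits.
📊 Data and experiments: Experiments on three widely used LVLMs and standard hallucination benchmarks show up to a 16.3% reduction in hallucination over the baseline on AMBER and confirm effectiveness across models and tasks.
⭐ Contributions: Proposes AFTER, the first adaptive factual-guided visual-textual editing framework, which mitigates hallucination at fine granularity through a two-stage mechanism. It lowers hallucination rates at inference time at low cost and offers a new approach to trustworthy cross-modal modeling.
Abstract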
Large Vision-Language Models (LVLMs) have achieved substantial progress in cross-modal tasks. However, due to language bias, LVLMs are susceptible to object hallucination, which can be primarily divided into category, attribute, and relation hallucination, significantly impeding trustworthy AI applications. Editing the internal activations of LVLMs has shown promising effectiveness in mitigating hallucinations with minimal cost. However, previous editing approaches neglect the effective guidance offered by factual textual semantics, thereby struggling to explicitly mitigate language bias. To address these issues, we propose Adaptive Factual-guided Visual-Textual Editing for hallucination mitigation (AFTER), which comprises Factual-Augmented Activation Steering (FAS) and Query-Adaptive Offset Optimization (QAO), to adaptively guide the original biased activations towards factual semantics. Specifically, FAS is proposed to provide factual and general guidance for activation editing, thereby explicitly modeling the precise visual-textual associations. Subsequently, QAO introduces a query-aware offset estimator to establish query-specific editing from the general steering vector, enhancing the diversity and granularity of editing. Extensive experiments on standard hallucination benchmarks across three widely adopted LVLMs validate the efficacy of the proposed AFTER, notably achieving up to a 16.3% reduction of hallucination over the baseline on the AMBER benchmark. Our code and data will be released for reproducibility.
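The abstract describes a general steering vector plus a query-specific offset applied to internal activations, without giving formulas. A minimal sketch of generic activation steering, assuming the general vector is the difference between mean "factual" and mean "biased" activations and the edit is additive with a scalar strength. Both assumptions, and all function names, are illustrative and are not AFTER's actual FAS or QAO:

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(factual_acts: Sequence[Sequence[float]],
                    biased_acts: Sequence[Sequence[float]]) -> Vector:
    """Assumed general direction: mean factual activation minus mean biased activation."""
    mean_factual = mean_vector(factual_acts)
    mean_biased = mean_vector(biased_acts)
    return [f - b for f, b in zip(mean_factual, mean_biased)]


def edit_activation(hidden: Sequence[float], general: Sequence[float],
                    offset: Sequence[float], strength: float = 1.0) -> Vector:
    """Shift one hidden state by strength * (general vector + query-specific offset)."""
    if not (len(hidden) == len(general) == len(offset)):
        raise ValueError("dimension mismatch")
    return [h + strength * (g + o) for h, g, o in zip(hidden, general, offset)]


general = steering_vector([[1.0, 2.0], [3.0, 2.0]], [[0.0, 1.0], [2.0, 1.0]])
edited = edit_activation([0.0, 0.0], general, offset=[0.5, -0.5], strength=2.0)
```

In the paper the offset comes from a learned query-aware estimator; here it is passed in by hand to keep the sketch self-contained.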
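Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#SAM2 #LVLM
🎯 Motivation: Existing video object segmentation (VOS) methods mostly rely on feature matching and cannot model high-level, object-centric conceptual representations, so they struggle with drastic appearance changes and complex dynamic scenes.
❓ Problem addressed: Aims to move VOS from conventional feature matching to concept-driven reasoning by building and using high-level conceptual priors, improving segmentation robustness in semantically complex scenes.
🔍 Observations: Conventional VOS methods perform poorly on dynamic multi-scene videos because feature matching is easily disrupted by appearance changes and scene transitions; stronger semantic understanding is needed.
🛠️ Method: Proposes the Segment Concept (SeC) framework, which uses a large vision-language model (LVLM) to integrate visual cues across frames and progressively build an object-level concept representation, injecting concept-level features when a new scene appears to balance semantic reasoning against computational overhead.
📊 Data and experiments: To evaluate high-level conceptual reasoning systematically, the authors build SeCVOS, a benchmark of 160 multi-scenario videos. Experiments on SeCVOS and standard benchmarks show that SeC substantially outperforms existing methods such as SAM 2, with an 11.8-point improvement over SAM 2.1 on SeCVOS.
⭐ Contributions: Proposes SeC, the first concept-driven VOS framework, introduces SeCVOS, the first VOS benchmark focused on semantically complex scenes, and sets a new state of the art on VOS through concept-level representation modeling.
Abstract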
We propose Segment Concept (SeC), a concept-driven video object segmentation (VOS) framework that shifts from conventional feature matching to the progressive construction and utilization of high-level, object-centric representations. SeC employs Large Vision-Language Models (LVLMs) to integrate visual cues across diverse frames, constructing robust conceptual priors. To balance semantic reasoning with computational overhead, SeC forwards the LVLMs only when a new scene appears, injecting concept-level features at those points.
To rigorously assess VOS methods in scenarios demanding high-level conceptual reasoning and robust semantic understanding, we introduce the Semantic Complex Scenarios Video Object Segmentation benchmark (SeCVOS). SeCVOS comprises 160 manually annotated multi-scenario videos designed to challenge models with substantial appearance variations and dynamic scene transformations. Empirical evaluations demonstrate that SeC substantially outperforms state-of-the-art approaches, including SAM 2 and its advanced variants, on both SeCVOS and standard VOS benchmarks. In particular, SeC achieves an 11.8-point improvement over SAM 2.1 on SeCVOS, establishing a new state-of-the-art in concept-aware VOS.
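SeC forwards the LVLM only when a new scene appears, but the abstract does not say how a new scene is detected. A minimal sketch of the gating idea, assuming each frame is summarized by a small feature vector and a scene change is declared when the mean absolute difference from the current reference frame exceeds a threshold. The distance measure, the threshold, and the function names are hypothetical:

```python
from typing import List, Optional, Sequence


def frame_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equally sized feature vectors."""
    if len(a) != len(b) or not a:
        raise ValueError("feature vectors must be non-empty and equally sized")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def frames_needing_lvlm(frame_features: Sequence[Sequence[float]],
                        threshold: float = 0.5) -> List[int]:
    """Indices of frames where the expensive LVLM call would be made."""
    selected: List[int] = []
    reference: Optional[Sequence[float]] = None
    for index, features in enumerate(frame_features):
        if reference is None or frame_distance(features, reference) > threshold:
            selected.append(index)
            reference = features  # the new scene becomes the reference
    return selected


features = [
    [0.0, 0.0],  # first frame: always processed
    [0.1, 0.0],  # same scene
    [1.0, 1.0],  # scene change
    [1.0, 0.9],  # same scene
    [0.0, 0.0],  # scene change back
]
lvlm_frames = frames_needing_lvlm(features, threshold=0.5)
```

All other frames would be handled by the cheaper segmentation path, which is where the compute saving comes from.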
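Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#ECG #foundation model #benchmark #representation learning
TL;DR: We provide a comprehensive benchmark for ECG foundation models
🎯 Motivation: The 12-lead electrocardiogram is a classic diagnostic tool, but existing machine learning research is confined to specific tasks or datasets and lacks a unified evaluation framework. Foundation models (FMs) are expected to adapt broadly, yet the key factors behind their performance remain unclear.
❓ Problem addressed: Systematically evaluates ECG foundation models with different architectures, analyzes how they scale under limited labels and why their performance differs, and explores effective routes to ECG representation.
🔍 Observations: Architectures differ markedly across tasks. ECG-CPC, a compact structured state-space model, performs strongly in many tasks, challenging the assumption that scale comes first.
🛠️ Method: Benchmarks eight ECG foundation models on 26 clinically relevant tasks from 12 public datasets under fine-tuning and frozen settings, and analyzes how performance changes with dataset size.
📊 Data and experiments: The experiments cover 1,650 regression and classification targets across areas such as adult ECG interpretation, with in-depth comparisons of label efficiency and internal representations.
⭐ Contributions: Establishes a comprehensive benchmark for ECG foundation models, shows how architectural inductive biases affect performance, stresses that scale is not the deciding factor, and broadens the research perspective on ECG analysis.
Abstract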
The 12-lead electrocardiogram (ECG) is a long-standing diagnostic tool. Yet machine learning for ECG interpretation remains fragmented, often limited to narrow tasks or datasets. Foundation models (FMs) promise broader adaptability, but fundamental questions remain: Which architectures generalize best? How do models scale with limited labels? What explains performance differences across model families? We benchmarked eight ECG FMs on 26 clinically relevant tasks using 12 public datasets comprising 1,650 regression and classification targets. Models were evaluated under fine-tuning and frozen settings, with scaling analyses across dataset sizes. Results show heterogeneous performance across domains: in adult ECG interpretation, three FMs consistently outperformed strong supervised baselines. In contrast, ECG-CPC, a compact structured state-space model, dominated 5 of 7 task categories, demonstrating that architecture matters more than scale. FMs improved label efficiency 3.3-9× over supervised baselines, though scaling behaviors varied across architectures. Representation analysis reveals that models with similar performance learn markedly different internal structures, suggesting multiple viable paths to effective ECG representation. Overall, while FMs show promise for adult ECG analysis, substantial gaps remain in cardiac structure, outcome prediction, and patient characterization. ECG-CPC's strong performance despite being orders of magnitude smaller challenges the assumption that FM quality requires massive scale, highlighting architectural inductive biases as an untapped opportunity.
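Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Empirical Study #Large Vision-Language Model #Benchmark #Evaluation
🎯 Motivation: Many studies evaluate large vision-language models (LVLMs) holistically, but fine-grained image tasks, which are fundamental to computer vision, still lack a systematic benchmark, limiting our understanding of these models' perception abilities.
❓ Problem addressed: Builds FG-BMK, the first comprehensive benchmark of LVLM ability on fine-grained image tasks, to fill this gap and encourage research on fine-grained visual understanding.
🔍 Observations: The evaluation shows that current LVLM performance on fine-grained image tasks is strongly affected by training paradigm, modality alignment, susceptibility to perturbation, and fine-grained category reasoning, exposing the limits of existing models in this area.
🛠️ Method: Designs an evaluation framework with both human-oriented and machine-oriented perspectives, examining fine-grained visual understanding along two dimensions: semantic recognition and fine-grained feature representation.
📊 Data and experiments: FG-BMK contains 1.01 million questions and 0.28 million images, and extensive experiments on 12 representative LVLMs/VLMs confirm the benchmark's coverage and usefulness.
⭐ Contributions: Proposes FG-BMK, the first large-scale fine-grained image evaluation benchmark, and open-sources the code. The experiments reveal core weaknesses of current LVLMs and provide guidance for future data construction and model design.
Abstract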
Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception capabilities, garnering significant attention. While numerous evaluation studies have emerged, assessing LVLMs both holistically and on specialized tasks, fine-grained image tasks—fundamental to computer vision—remain largely unexplored. To fill this gap, we introduce a comprehensive fine-grained evaluation benchmark, i.e., FG-BMK, comprising 1.01 million questions and 0.28 million images. Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives, focusing on their semantic recognition and fine-grained feature representation capabilities. Through extensive experiments on twelve representative LVLMs/VLMs, we uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance. This work provides critical insights into the limitations of current LVLMs and offers guidance for future data construction and model design in the development of more advanced LVLMs. Our code is open-source and available at https://github.com/SEU-VIPGroup/FG-BMK.
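Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Multi-Granular Language Learning #Medical Image Analysis #Multimodal Learning
🎯 Motivation: Image-text pretraining models such as CLIP succeed in general domains, but their single-label, single-granularity alignment does not fit medical imaging, where multi-label and multi-granularity descriptions coexist. A pretraining framework is needed that targets medical image understanding and handles multi-granular, multi-label information.
❓ Problem addressed: Addresses the limits of existing vision-language pretraining models on complex medical images, in particular their inability to align images with multi-granular, multi-label text descriptions. MGLL is proposed to improve both multi-label and cross-granularity alignment.
🔍 Observations: Medical images are usually associated with diagnostic information at several levels, from the overall lesion to local detail. Current contrastive models mainly match an image globally to a single text description and ignore these hierarchical, multi-label associations, which limits performance on fine-grained medical tasks.
🛠️ Method: MGLL is a contrastive learning framework. It uses structured multi-label supervision, integrates text descriptions across granularities, and adds soft-label supervision with point-wise constraints to strengthen alignment. It uses a smooth KL divergence to keep granularities consistent and can be plugged into existing vision-language models efficiently.
📊 Data and experiments: The authors build large-scale multi-granular datasets for pretraining and evaluate MGLL on multiple downstream datasets, where it outperforms other state-of-the-art methods.
⭐ Contributions: Proposes the Multi-Granular Language Learning framework for multi-label and cross-granularity understanding of medical images. It is a computationally efficient plug-and-play module that integrates multi-granular text information, validated on the constructed datasets and across multiple tasks.
Abstract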
Recent advances in image-text pretraining have significantly enhanced visual understanding by aligning visual and textual representations. Contrastive Language-Image Pretraining (CLIP) has played a pivotal role in multimodal learning. However, its focus on single-label, single-granularity alignment limits its effectiveness in complex domains such as medical imaging, where images often correspond to multiple labels across different levels of granularity. To address this, we propose Multi-Granular Language Learning (MGLL), a contrastive learning framework designed to improve both multi-label and cross-granularity alignment. MGLL leverages structured multi-label supervision, integrates textual descriptions across granularities, and introduces soft-label supervision with point-wise constraints to enhance alignment. MGLL employs smooth Kullback–Leibler (KL) divergence to ensure cross-granularity consistency while maintaining computational efficiency as a plug-and-play module for vision-language models. Pretrained on our constructed large-scale multi-granular datasets and evaluated across multiple datasets, MGLL outperforms other state-of-the-art methods in downstream tasks. The code is available at https://github.com/HUANGLIZI/MGLL.
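Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Prompt Weighting #Prompt Ensembling #Pre-trained Models #Vision-Language Models
🎯 Motivation: Zero-shot classification with pre-trained vision-language models (VLMs) is sensitive to the choice of prompt. Existing methods ensemble multiple prompts with a weighting vector shared by all classes, ignoring the conditional dependence between prompts and classes.
❓ Problem addressed: Existing prompt weighting methods unreasonably assume that prompt weights are independent of the class. The paper proposes class-aware zero-shot prompt reweighting (CARPRT), which models prompt-class dependence without training.
🔍 Observations: In practice the suitability of a prompt varies greatly by class (for example, "an aerial view of" fits "airport" but not "apple"), while existing methods share prompt weights across classes, which leads to suboptimal predictions.
🛠️ Method: CARPRT computes class-specific prompt weights: for each class and prompt it averages the image-text relevance scores of the images predicted as that class under that prompt, then normalizes the result to obtain class-aware weights.
📊 Data and experiments: On standard image classification benchmarks, CARPRT outperforms existing class-independent reweighting methods, showing that modeling prompt-class dependence matters for zero-shot prediction and for VLM applications that rely on prompt ensembling.
⭐ Contributions: Proposes the first class-aware zero-shot prompt reweighting framework, which captures prompt-class relevance without training, validates it on multiple benchmarks, and offers an improvement path for VLM applications based on prompt ensembling.
Abstract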
Pre-trained vision-language models (VLMs) enable zero-shot image classification by computing the similarity score between an image and textual descriptions, typically formed by inserting a class label (e.g., "cat") into a prompt (e.g., "a photo of a").
Since the score for a given image-class pair is sensitive to the choice of prompt, existing studies ensemble multiple prompts using a weighting vector to aggregate scores across different prompts.
Yet, in current strategies, the weighting vector assigned to each prompt is shared across all classes, implicitly assuming that prompts are conditionally independent of classes, which often does not hold in practice, as a prompt like "an aerial view of" might be apt for "airport" but ill-suited for "apple".
To address this, we propose class-aware zero-shot prompt reweighting (CARPRT).
This scoring scheme adjusts the weighting vector for each class label by capturing the class-specific relevance of different prompts in a training-free manner.
For each class label and every available prompt, we quantify their class-specific relevance by averaging image–text relevance scores over the images predicted as that class under the given prompt. These estimates are then normalized to derive class-specific weights.
Evaluations on standard image classification benchmarks show that CARPRT outperforms existing class-independent reweighting methods, confirming that modeling prompt-class dependencies is crucial for effective zero-shot prediction and even broader VLM-based application settings that rely on prompt ensembling. Our code is available at https://github.com/tmlr-group/CARPRT.
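The weighting scheme described in the abstract can be sketched as follows. This is an illustrative reading, not the released CARPRT code: `scores[p][i][c]` is assumed to hold the relevance of image `i` to class `c` under prompt `p`, normalization is assumed to be a simple division by the per-class sum (the paper may use a different normalizer), and classes that no prompt ever predicts fall back to uniform weights.

```python
from typing import List

Scores = List[List[List[float]]]  # scores[prompt][image][class]


def class_aware_weights(scores: Scores) -> List[List[float]]:
    """Return weights[class][prompt], each row summing to 1."""
    n_prompts = len(scores)
    n_classes = len(scores[0][0])
    relevance = [[0.0] * n_prompts for _ in range(n_classes)]
    for p, per_image in enumerate(scores):
        sums = [0.0] * n_classes
        counts = [0] * n_classes
        for row in per_image:
            predicted = max(range(n_classes), key=row.__getitem__)
            sums[predicted] += row[predicted]
            counts[predicted] += 1
        for c in range(n_classes):
            if counts[c]:
                # Mean score over images predicted as class c under prompt p.
                relevance[c][p] = sums[c] / counts[c]
    weights = []
    for c in range(n_classes):
        total = sum(relevance[c])
        if total > 0:
            weights.append([r / total for r in relevance[c]])
        else:
            weights.append([1.0 / n_prompts] * n_prompts)  # assumed fallback
    return weights


def predict(scores: Scores, weights: List[List[float]]) -> List[int]:
    """Classify each image with class-specific weighted prompt ensembling."""
    n_prompts, n_images, n_classes = len(scores), len(scores[0]), len(weights)
    predictions = []
    for i in range(n_images):
        combined = [
            sum(weights[c][p] * scores[p][i][c] for p in range(n_prompts))
            for c in range(n_classes)
        ]
        predictions.append(max(range(n_classes), key=combined.__getitem__))
    return predictions


toy_scores = [
    [[0.8, 0.2], [0.3, 0.7]],  # prompt 0: image 0 -> class 0, image 1 -> class 1
    [[0.6, 0.4], [0.4, 0.6]],  # prompt 1: same predictions, lower confidence
]
toy_weights = class_aware_weights(toy_scores)
toy_predictions = predict(toy_scores, toy_weights)
```

Because the weights are indexed by class first, a prompt can count heavily for one class and lightly for another, which is the dependence that a single shared weight vector cannot express.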
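Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#MLLM #Self-Distillation #Fine-Grained Perception
TL;DR: We propose an efficient method to improve MLLM's fine-grained perception by training a module to predict Regions-of-Interest using clean pseudo-labels distilled from the model's own noisy attention maps.
🎯 Motivation: Multimodal large language models (MLLMs) need high-resolution visual information for fine-grained perception, but processing whole images is too costly. Existing region-based attention mechanisms face a dilemma: training-based methods depend on large annotated datasets, while training-free methods are computationally inefficient and less accurate.
❓ Problem addressed: Proposes the Self-Distilled Region Proposal Network (SD-RPN), which needs no human annotation and resolves the trade-off between efficiency and accuracy. It generates high-quality pseudo-labels from the MLLM's noisy attention maps and uses them to train a lightweight region detector.
🔍 Observations: Existing training-free methods rely on the model's internal attention and require multi-pass prefill or slow auto-regressive decoding, which makes them inefficient and limits accuracy. Training-based methods are constrained by the cost of large-scale annotation.
🛠️ Method: Denoises and disambiguates the noisy attention maps from the MLLM's middle layers to produce high-quality pseudo region labels, then trains an efficient single-forward-pass region proposal network, decoupling region identification from generation.
📊 Data and experiments: Built on the LLaVA-1.5 architecture and trained on only about 10K question-answer pairs, the method gains more than 10% absolute accuracy on unseen benchmarks including TextVQA, DocVQA, and V-Star.
⭐ Contributions: Proposes an efficient, annotation-free self-distilled region proposal framework that improves the data efficiency and generalization of fine-grained perception in MLLMs and offers a scalable path to practical deployment.
Abstract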
Multimodal Large Language Models (MLLMs) require high-resolution visual information to perform fine-grained perception, yet processing entire high-resolution images is computationally prohibitive.
While recent methods leverage a Region-of-Interest (RoI) mechanism to focus on salient areas, they typically present a difficult trade-off: training-based approaches depend on large-scale annotated datasets, while training-free methods that utilize the model's internal attention are computationally inefficient and less accurate, requiring either multi-pass prefill stages or reliance on the slow auto-regressive decoding process.
In this paper, we propose an efficient, annotation-free Self-Distilled Region Proposal Network (SD-RPN) that resolves this trade-off. The SD-RPN is built around a pipeline that transforms the noisy attention maps from the MLLM's middle layers into high-quality pseudo-RoI labels by explicitly denoising the signal and resolving ambiguity. We use these labels to train a lightweight Region Proposal Network (RPN) that learns a more precise localization. This RPN is also highly efficient, predicting the RoI in a single forward pass using features from the MLLM's middle layers, decoupling RoI identification from the auto-regressive generation and avoiding costly multi-pass operations.
To validate our approach, we integrate the framework into the LLaVA-1.5 architecture. Despite being trained on only a small number (e.g., 10K) of question-answer pairs, our method demonstrates exceptional data efficiency and generalization, achieving over a 10% absolute accuracy improvement on unseen benchmarks, including TextVQA, DocVQA, and V-Star. Our work presents a practical and scalable solution for enhancing the fine-grained perception of MLLMs without requiring costly supervision or full model fine-tuning. Code is available at https://github.com/YuHengsss/SD-RPN.
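Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#VLM #Vision Language Learning #Continual Learning
🎯 Motivation: Continual vision-language learning in dynamic environments is crucial for multimodal tasks, but deployed systems must learn from non-stationary data streams under strict privacy and memory limits. Naive finetuning causes catastrophic forgetting and harms transfer, so a method is needed that stays both stable and plastic without storing raw data.
❓ Problem addressed: CoMem targets the stability-plasticity trade-off in continual learning without replaying raw data, supporting reuse and recombination of knowledge across domains and tasks in privacy-constrained settings.
🔍 Observations: Conventional continual learning methods forget easily under tight memory and mostly depend on raw-data replay, which conflicts with privacy requirements. Knowledge without compositional structure also limits cross-task transfer.
🛠️ Method: CoMem treats compositional structure as the unit of memory and incrementally organizes knowledge into a compact graph of concepts and relations. It rehearses directly in feature space by conditioning on sampled subgraphs, keeps part-whole predictions coherent with a lightweight compositional consistency objective, and limits feature drift with teacher-informed, uncertainty-aware filtering.
📊 Data and experiments: Experiments cover cross-domain retrieval, structured concept learning, and continual multimodal VQA, evaluated on benchmarks such as SVLC, VQACL, and CLOVE under matched memory and parameter budgets. CoMem reaches state-of-the-art retention and transfer with consistent gains.
⭐ Contributions: Proposes a new paradigm that treats structure as memory and rehearses in feature space, enabling privacy-friendly continual adaptation. Compositional concept-graph memory and feature-space rehearsal provide a testable and efficient solution for multimodal continual learning.
Abstract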
Continual vision–language learning is crucial for multimodal tasks such as image–text retrieval, visual question answering, and grounded reasoning in dynamic environments, yet deployed systems must learn from non-stationary streams under strict privacy and memory budgets, where naïve finetuning forgets and harms transfer. We aim to sustain stable yet plastic capability in this setting without storing raw data, enabling reuse and recombination across domains and tasks. We present CoMem, a framework that treats compositional structure as the unit of memory and rehearsal: it incrementally organizes knowledge into a compact graph of concepts and relations and rehearses directly in feature space by conditioning practice signals on sampled subgraphs. A lightweight compositional consistency objective keeps part–whole predictions coherent, while teacher-informed, uncertainty-aware filtering limits off-manifold drift. Across cross-domain retrieval, structured concept learning, and continual multimodal VQA, CoMem achieves state-of-the-art retention and transfer alongside consistent gains on SVLC and VQACL/CLOVE under matched memory and parameter budgets. By casting structure as memory and rehearsing where learning happens (feature space), CoMem provides a privacy-friendly and testable paradigm for reliable continual adaptation without raw exemplars.
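Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Vision Language Models #Data Contamination
TL;DR: We devise a novel contamination detection method for vision language models.
🎯 Motivation: Vision-language models (VLMs) achieve state-of-the-art results on many benchmarks, but their reliance on internet-scale, often proprietary pretraining data raises a key concern: test-set leakage that inflates performance. Prior work mostly proposes data decontamination or benchmark redesign for LLMs, while methods for detecting contaminated VLMs remain under-studied.
❓ Problem addressed: The paper aims to fill the gap in detection methods for contaminated VLMs, in particular the fact that existing detectors fail or behave inconsistently on them. It deliberately contaminates open-source VLMs to expose these shortcomings and designs a new, simple, and effective detection scheme.
🔍 Observations: Experiments show that when test-set information is mixed into a VLM's training data (contamination), existing detection methods either fail outright or behave inconsistently, which underlines the need for detection methods designed for multimodal contamination.
🛠️ Method: Proposes detection based on multi-modal semantic perturbation: apply controlled semantic perturbations and observe the model's responses. The core idea is that a contaminated VLM fails to generalize under perturbation while a clean model stays stable, which separates the two.
📊 Data and experiments: Open-source VLMs are deliberately contaminated on several popular benchmarks to build the test setting. The method is validated under multiple realistic contamination strategies, confirming its robustness and effectiveness. Code and the perturbed dataset are released.
⭐ Contributions: Offers the first systematic exploration of VLM contamination detection and a simple, effective detector based on multi-modal semantic perturbation. Extensive experiments verify robustness across contamination strategies, and the code and perturbed dataset are open-sourced for the community.
Abstract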
Recent advances in Vision–Language Models (VLMs) have achieved state-of-the-art performance on numerous benchmark tasks. However, the use of internet-scale, often proprietary, pretraining corpora raises a critical concern for both practitioners and users: inflated performance due to test-set leakage. While prior works have proposed mitigation strategies such as decontamination of pretraining data and benchmark redesign for LLMs, the complementary direction of developing detection methods for contaminated VLMs remains underexplored. To address this gap, we deliberately contaminate open-source VLMs on popular benchmarks and show that existing detection approaches either fail outright or exhibit inconsistent behavior. We then propose a novel, simple yet effective detection method based on multi-modal semantic perturbation, demonstrating that contaminated models fail to generalize under controlled perturbations. Finally, we validate our approach across multiple realistic contamination strategies, confirming its robustness and effectiveness. The code and perturbed dataset are released at https://github.com/jadenpark0/mm-perturb.
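The detection idea in this abstract (a contaminated model fails to generalize under controlled perturbations) can be illustrated by comparing accuracy on original items with accuracy on perturbed counterparts. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the 0.2 threshold, and the toy multiple-choice data are all hypothetical.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one item")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def perturbation_gap(orig_preds, orig_answers, pert_preds, pert_answers) -> float:
    """Accuracy on original items minus accuracy on perturbed items."""
    return accuracy(orig_preds, orig_answers) - accuracy(pert_preds, pert_answers)


def looks_contaminated(gap: float, threshold: float = 0.2) -> bool:
    """Flag a model whose accuracy drops sharply under perturbation.

    The threshold is a hypothetical value chosen for illustration only.
    """
    return gap > threshold


# Toy data: the perturbation changes which option is correct.
orig_answers = ["A", "B", "C", "D"]
pert_answers = ["B", "C", "D", "A"]

# A model that memorised the original answers repeats them on perturbed items.
memorised_gap = perturbation_gap(orig_answers, orig_answers, orig_answers, pert_answers)
# A model that actually reads the input also answers the perturbed items correctly.
clean_gap = perturbation_gap(orig_answers, orig_answers, pert_answers, pert_answers)
```

In this toy setup the memorising model is flagged and the clean one is not; how the perturbed items are built is the substantive part of the method and is not modelled here.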
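Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Diffusion Multimodal Large Language Models; Information flow
TL;DR: Understanding and Mitigating the “Repeat Curse” from the Perspective of Information Flow.
🎯 Motivation: Diffusion-based multimodal large language models (dMLLMs) suffer from high inference latency and commonly use caching to speed up decoding, but caching tends to trigger repetitive text generation, which the authors call the "Repeat Curse".
❓ Problem addressed: The study investigates the mechanism behind the Repeat Curse from an information-flow perspective and proposes a plug-and-play method that mitigates repetition and improves generation quality.
🔍 Observations: Context tokens act as semantic anchors that guide the final prediction, and their entropy converges in deeper layers. Repetition is closely tied to disrupted information flow in context tokens and to their entropy failing to converge in deeper layers.
🛠️ Method: CoTA enhances the attention of context tokens to preserve information-flow patterns and adds a penalty term to the confidence score during decoding, so that uncertain context tokens do not drive the output.
📊 Data and experiments: Extensive experiments show that CoTA is markedly effective at reducing repetition and yields consistent gains on general tasks.
⭐ Contributions: First to explain the cause of the Repeat Curse in dMLLMs from an information-flow perspective, and proposes an effective plug-and-play remedy, offering a new angle on optimizing decoding in large models.
Abstract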
Recent diffusion-based Multimodal Large Language Models (dMLLMs) suffer from high inference latency and therefore rely on caching techniques to accelerate decoding. However, the application of cache mechanisms often introduces undesirable repetitive text generation, a phenomenon we term the Repeat Curse. To better investigate the underlying mechanism behind this issue, we analyze repetitive generation through the lens of information flow. Our work reveals three key findings: (1) context tokens aggregate semantic information as anchors and guide the final predictions; (2) as information propagates across layers, the entropy of context tokens converges in deeper layers, reflecting the model’s growing prediction certainty; (3) repetition is typically linked to disruptions in the information flow of context tokens and to the inability of their entropy to converge in deeper layers. Based on these insights, we present CoTA, a plug-and-play method for mitigating repetition. CoTA enhances the attention of context tokens to preserve intrinsic information flow patterns, while introducing a penalty term to the confidence score during decoding to avoid outputs driven by uncertain context tokens. With extensive experiments, CoTA demonstrates significant effectiveness in alleviating repetition and achieves consistent performance improvements on general tasks. Code is available at https://github.com/ErikZ719/CoTA
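Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Vision-language Pretraining #Cortical surface modeling #Lifespan
🎯 Motivation: The cerebral cortex carries rich neurobiological information for understanding development, aging, and disease, but existing cortical representation learning methods are restricted to specific age ranges and do not generalize across the lifespan. Vision-language models are promising, yet building a unified framework faces three key challenges.
❓ Problem addressed: The paper presents CortiLife, the first unified vision-language framework covering the full lifespan, designed to handle the non-Euclidean structure of cortical surfaces, the homogenization of individual folding patterns caused by registration, and age-related distribution shifts in cortical features.
🔍 Observations: Existing methods struggle to model cortical feature changes across age ranges in a unified way, and standard registration removes individual anatomical differences. Vision-language models open a new route for fusing multimodal biomarkers with structured metadata.
🛠️ Method: CortiLife designs a surface tokenizer that partitions the cortical surface into patches based on an icosahedron and encodes them at multiple levels, combining local topology, global interactions, and patch-wise distributional patterns. It pairs masked self-distillation with metadata language prompts, embedding attributes such as age and sex into the text encoder.
📊 Data and experiments: Validated on downstream tasks: two encoder-frozen tasks (age prediction and cortical parcellation) and four encoder fine-tuning tasks (brain disorder diagnosis). CortiLife outperforms existing baselines across age stages and modalities.
⭐ Contributions: Proposes the first lifespan-aware unified framework for cortical representation learning; mitigates registration-induced homogenization and feature distribution shift through multi-level surface encoding and metadata fusion; and demonstrates strong generalization across ages and modalities.
Abstract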
The human cerebral cortex encodes rich neurobiological information that is essential for understanding brain development, aging, and disease. Although various cortical representation learning methods have been proposed, existing models are typically restricted to stage-specific cohorts and lack generalization across the lifespan. While recent vision-language models offer a promising direction, building a unified framework for cortical representation faces three key challenges: (1) the non-Euclidean manifold structure of cortical surfaces, (2) homogenization of individual folding patterns induced by registration, and (3) distribution shifts of cortical features across the lifespan. To address these issues, we present CortiLife, the first unified vision-language framework for lifespan-aware cortical representation learning. Specifically, CortiLife introduces a surface tokenizer that integrates icosahedron-based surface patchification with multi-level patch encoding to transform complex cortical manifolds into compact token representations. The multi-level encoding incorporates three complementary streams that capture local topology, global interactions, and patch-wise distributional patterns, effectively mitigating the challenges of homogenization and distribution shifts. Furthermore, CortiLife integrates masked self-distillation with metadata language prompting, embedding information such as age, sex, health status, and attribution type into the text encoder to better capture individual-specific cortical representations while enabling both age-aware and modality-aware modeling. Extensive experiments on downstream tasks, including two encoder-frozen tasks (age prediction and cortical parcellation) and four encoder fine-tuning tasks (brain disorder diagnosis), demonstrate that CortiLife consistently outperforms state-of-the-art baselines across different age stages and modality types, underscoring its effectiveness and generalization ability.
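Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#embodied ai #vision-language-action models #inverse dynamics models
TL;DR: Desktop gaming data effectively pretrains embodied AI: 152× compression via OWA Toolkit, YouTube pseudo-labeling with Generalist-IDM, achieving 96.6% on LIBERO manipulation and 83.3% on CANVAS navigation with 1.3K hours of data.
🎯 Motivation: Large language models succeeded by exploiting internet-scale text, but embodied AI is held back by the high cost of collecting physical trajectory data. Desktop environments, especially games, provide rich sensorimotor interaction data while keeping the structured observation-action coupling that embodied learning needs, suggesting a way around the data bottleneck.
❓ Problem addressed: The paper aims to show that desktop interaction data can serve as an effective pretraining substrate for robotic embodied AI tasks. Unlike prior work that stayed domain-specific (e.g., VPT for Minecraft) or kept data proprietary (e.g., SIMA), D2E builds a complete, public pipeline from scalable desktop data collection to verified transfer in embodied domains.
🔍 Observations: The high cost of large-scale physical trajectory data is a key bottleneck for embodied AI. Digital environments such as desktop games can supply large, diverse sensor-action interaction sequences at low cost, and the transferable sensorimotor primitives they contain make a solution possible.
🛠️ Method: D2E has three core components: (1) the OWA Toolkit, which unifies diverse desktop interactions into a standard format with 152× compression; (2) Generalist-IDM, which generalizes zero-shot to unseen games through timestamp-based event prediction and enables internet-scale pseudo-labeling; and (3) VAPT, which transfers desktop-pretrained representations to physical manipulation and navigation.
📊 Data and experiments: Pretraining uses more than 1,300 hours of data (259 hours of human demonstrations and over 1,000 hours of pseudo-labeled gameplay). The 1B-parameter model reaches a 96.6% success rate on LIBERO manipulation and 83.3% on CANVAS navigation, matching or surpassing baselines up to 7× larger (such as π₀ and OpenVLA).
⭐ Contributions: Proposes D2E, the first complete public framework from desktop to embodied AI, and shows that desktop pretraining is effective and efficient. The approach combines compression, zero-shot pseudo-labeling, and transfer learning, and all resources are open-sourced.
Abstract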
Large language models leverage internet-scale text data, yet embodied AI remains constrained by the prohibitive costs of physical trajectory collection.
Desktop environments, particularly gaming, offer a compelling alternative: they provide rich sensorimotor interactions at scale while maintaining the structured observation-action coupling essential for embodied learning.
We present D2E (Desktop to Embodied AI), a framework that demonstrates desktop interactions can serve as an effective pretraining substrate for robotics embodied AI tasks.
Unlike prior work that remained domain-specific (e.g., VPT for Minecraft) or kept data proprietary (e.g., SIMA), D2E establishes a complete pipeline from scalable desktop data collection to verified transfer in embodied domains.
Our framework comprises three components: (1) the OWA Toolkit that unifies diverse desktop interactions into a standardized format with 152× compression, (2) the Generalist-IDM that achieves strong zero-shot generalization across unseen games through timestamp-based event prediction, enabling internet-scale pseudo-labeling, and (3) VAPT that transfers desktop-pretrained representations to physical manipulation and navigation.
Using 1.3K+ hours of data (259 hours of human demonstrations and 1K+ hours of pseudo-labeled gameplay), our 1B-parameter model achieves 96.6% success on LIBERO manipulation and 83.3% on CANVAS navigation, matching or surpassing models up to 7× larger, such as π₀ (3.3B) and OpenVLA (7B).
These results demonstrate that sensorimotor primitives learned from digital interactions transfer effectively to real-world physical tasks, establishing desktop pretraining as a practical paradigm for embodied AI.
All resources are publicly available at https://worv-ai.github.io/d2e.
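Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Multimodal Large Language Models #Multimodal Reasoning #Reinforcement Learning
🎯 Motivation: Large vision-language models excel at multimodal understanding, but their reasoning remains mostly text-based and struggles to integrate visual information deeply, unlike human cognition, which relies on images. The work aims to encourage models to "think with images" and narrow this gap.
❓ Problem addressed: To address the difficulty of integrating visual information into reasoning, the paper proposes DeepEyes, trained end-to-end with reinforcement learning. The ability to "think with images" emerges on its own, without cold-start supervised fine-tuning on pre-collected reasoning data.
🔍 Observations: Current models rely mainly on textual reasoning and under-use visual information. With active perception, the model learns to ground its reasoning strategically in visual information, improving several tasks, and its perception behavior evolves from exploration to efficient exploitation.
🛠️ Method: The core method is end-to-end reinforcement learning that guides the model toward active perception, using a tailored data selection and reward strategy so that it learns to ground its reasoning in visual information.
📊 Data and experiments: Evaluated on general perception and reasoning benchmarks, as well as grounding, hallucination, and mathematical reasoning tasks. Results show significant gains, and the model exhibits diverse thinking patterns resembling human visual reasoning.
⭐ Contributions: Proposes DeepEyes, whose novelty is that RL-guided active perception lets "thinking with images" emerge natively, without external models or APIs. It achieves notable gains across tasks and reveals how active perception evolves during training, offering a new direction for multimodal reasoning research.
Abstract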
Large Vision-Language Models excel at multimodal understanding but struggle to deeply integrate visual information into their predominantly text-based reasoning processes, a key challenge in mirroring human cognition. To address this, we introduce DeepEyes, a model that learns to "think with images", trained end-to-end with reinforcement learning without requiring pre-collected reasoning data for cold-start supervised fine-tuning (SFT). Notably, this ability emerges natively, leveraging the model's own grounding capability as an intrinsic function rather than relying on external specialized models or APIs. We enable this capability through active perception, where the model learns to strategically ground its reasoning in visual information, guided by a tailored data selection and reward strategy. DeepEyes achieves significant performance gains on general perception and reasoning benchmarks and also demonstrates improvement in grounding, hallucination, and mathematical reasoning tasks. Interestingly, we observe the distinct evolution of active perception from initial exploration to efficient and accurate exploitation, and diverse thinking patterns that closely mirror human visual reasoning processes. Code is available at https://github.com/Visual-Agent/DeepEyes.
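Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Metric depth #Vision language model #Spatial reasoning
TL;DR: The first proof that VLMs can reach expert-model-level depth estimation accuracy without architecture or loss changes
🎯 Motivation: Expert pure-vision models reach super-human accuracy on 3D understanding tasks such as metric depth estimation, but they need task-specific architectures and losses. VLMs have strong semantic understanding yet still struggle to understand 3D from 2D inputs. The study asks whether a VLM can reach expert-level metric depth accuracy without changing its architecture or loss.
❓ Problem addressed: The work addresses the capability bottleneck of VLMs in accurate metric depth estimation from a single 2D image, a key 3D understanding task, so that they match expert pure-vision models without task-specific architecture or loss design.
🔍 Observations: Leading VLMs fall short in 3D spatial reasoning; the main bottlenecks are pixel reference and cross-dataset camera ambiguity. Text-based supervised fine-tuning with sparse labels, without a dense prediction head or complex regression loss, is enough to unlock strong 3D understanding.
🛠️ Method: DepthLM resolves pixel reference with visual prompting and handles cross-dataset camera ambiguity with intrinsic-conditioned augmentation. It uses only text-based supervised fine-tuning with sparse labels, without modifying the VLM backbone or adding task-specific losses.
📊 Data and experiments: With a smaller model, DepthLM surpasses the accuracy of the most advanced VLMs by more than 2× on metric depth estimation, making VLMs comparable with pure-vision expert models on this task for the first time. Code and models are released.
⭐ Contributions: First demonstration that a VLM can reach expert-level metric depth accuracy without architecture or loss changes. DepthLM is simple and effective, performs well on this core task, and opens the way for a single VLM to cover multiple 3D tasks.
Abstract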
Vision language models (VLMs) can flexibly address various vision tasks through text interactions. Although successful in semantic understanding, state-of-the-art VLMs including GPT-5 still struggle in understanding 3D from 2D inputs. On the other hand, expert pure vision models achieve super-human accuracy in metric depth estimation, a key 3D understanding task. However, they require task-specific architectures and losses. This difference motivates us to ask: Can VLMs reach expert-level accuracy without architecture or loss change? We take per-pixel metric depth estimation as the representative task and show that the answer is yes! Surprisingly, comprehensive analysis shows that text-based supervised fine-tuning with sparse labels is sufficient for VLMs to unlock strong 3D understanding; no dense prediction head or complex regression/regularization loss is needed. The bottleneck lies in pixel reference and cross-dataset camera ambiguity, which we address through visual prompting and intrinsic-conditioned augmentation. With much smaller models, our method DepthLM surpasses the accuracy of most advanced VLMs by over 2x, making VLMs for the first time comparable with pure vision models. The simplicity of DepthLM also enables a single VLM to cover various 3D tasks beyond metric depth. Code and model are available at https://github.com/facebookresearch/DepthLM_Official.
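Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#voice conversation model #parallel speech-text #end-to-end #dual-resolution
🎯 Motivation: End-to-end speech generation with large language models has drawn wide attention in recent years, but existing approaches that combine speech and text generation suffer from insufficient mutual awareness between the two.
❓ Problem addressed: Existing methods either generate discrete speech tokens independently or, when generating jointly, face a large frequency gap between speech and text tokens, which limits multimodal generation quality.
🔍 Observations: Current methods mostly use 12.5Hz input speech representations, which is computationally costly and leaves a marked frequency discrepancy between speech and text, preventing full use of the LLM's capabilities.
🛠️ Method: Proposes DrVoice, built on joint autoregressive modeling with dual-resolution speech representations that lower the LLM input frequency to 5Hz, improving multimodal awareness and generation efficiency.
📊 Data and experiments: On benchmarks including OpenAudioBench, VoiceBench, UltraEval-Audio, and Big Bench Audio, DrVoice-7B sets a new state of the art.
⭐ Contributions: Designs a dual-resolution speech representation that better matches speech and text token rates, and delivers an open-source 7B-parameter speech foundation model that sets a new bar for speech-text multimodal generation.
Abstract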
Recent studies on end-to-end (E2E) speech generation with large language models (LLMs) have attracted significant community attention, with multiple works extending text-based LLMs to generate discrete speech tokens. Existing E2E approaches primarily fall into two categories: (1) Methods that generate discrete speech tokens independently without incorporating them into the LLM’s autoregressive process, resulting in text generation being unaware of concurrent speech synthesis. (2) Models that generate interleaved or parallel speech-text tokens through joint autoregressive modeling, enabling mutual modality awareness during generation. This paper presents DrVoice, a parallel speech-text voice conversation model based on joint autoregressive modeling, featuring dual-resolution speech representations. Notably, while current methods mainly use a 12.5Hz input audio representation, our proposed dual-resolution mechanism reduces the input frequency for the LLM to 5Hz. This significantly reduces computational cost and alleviates the frequency discrepancy between speech and text tokens, which in turn better exploits LLMs’ capabilities. Experimental results demonstrate that DrVoice-7B establishes a new state of the art (SOTA) on prominent speech benchmarks including OpenAudioBench, VoiceBench, UltraEval-Audio and Big Bench Audio, making it a leading open-source speech foundation model among ∼7B models.
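To make the 12.5Hz versus 5Hz comparison concrete, the arithmetic below counts how many speech tokens the LLM would receive for one clip at each rate. It is a worked example only: the 10-second clip length and the function name are assumptions, and nothing here reflects how DrVoice actually produces its representations.

```python
def tokens_per_clip(duration_s: float, rate_hz: float) -> float:
    """Number of speech tokens the LLM sees for a clip of the given length."""
    if duration_s < 0 or rate_hz <= 0:
        raise ValueError("duration must be non-negative and rate positive")
    return duration_s * rate_hz


clip_seconds = 10.0  # hypothetical clip length, chosen for easy arithmetic
baseline_tokens = tokens_per_clip(clip_seconds, 12.5)  # 12.5Hz input representation
dual_res_tokens = tokens_per_clip(clip_seconds, 5.0)   # 5Hz input to the LLM
saving = 1.0 - dual_res_tokens / baseline_tokens       # relative reduction in tokens

print(f"{baseline_tokens:.0f} -> {dual_res_tokens:.0f} tokens ({saving:.0%} fewer)")
```

For a 10-second clip this goes from 125 to 50 input tokens, a 60% reduction in speech tokens, which also brings the speech token rate closer to that of text.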
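Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Multi-modal Large Language Model #3D Understanding #Large Language Model
🎯 Motivation: Encoder-based 3D LMMs are limited in handling varying point cloud resolutions and in semantic alignment. The paper explores whether encoder-free architectures can replace encoders in 3D multimodal settings and improve generality and semantic understanding.
❓ Problem addressed: Tackles two bottlenecks of encoder-based designs: poor adaptation to point cloud resolutions that vary at inference time, and the mismatch between encoder point features and the semantic needs of the LLM. Removing the pre-trained encoder lets the LLM take over 3D encoding, improving cross-task adaptability and semantic consistency.
🔍 Observations: Encoder-free architectures have shown early promise in 2D LMMs, but 3D scenes, with sparse point clouds and complex geometry, have lacked systematic study. Features from conventional encoders struggle to meet the LLM's high-level semantic needs, capping performance on 3D understanding.
🛠️ Method: Proposes an LLM-embedded Semantic Encoding strategy in pre-training, combined with a Hybrid Semantic Loss to extract high-level semantics, and a Hierarchical Geometry Aggregation strategy in instruction tuning that injects a geometric inductive bias into the LLM layers to sharpen perception of local detail. The result is ENEL, the first encoder-free 3D LMM.
📊 Data and experiments: Evaluated on classification, captioning, and VQA, the 7B ENEL model rivals the 13B state-of-the-art model, scoring 57.91%, 61.0%, and 55.20% respectively. The results suggest encoder-free architectures are a highly promising replacement in 3D understanding.
⭐ Contributions: First systematic demonstration that encoder-free architectures are feasible for 3D LMMs, with a two-stage strategy of semantic encoding and geometry aggregation. The ENEL model and code are open-sourced, giving 3D multimodal research a new paradigm and encouraging efficient, lightweight architectures.
Abstract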
Encoder-free architectures have been preliminarily explored in 2D Large Multimodal Models (LMMs), yet it remains an open question whether they can be effectively applied to 3D understanding scenarios. In this paper, we present the first comprehensive investigation into the potential of encoder-free architectures to alleviate the challenges of encoder-based 3D LMMs. These long-standing challenges include the failure to adapt to varying point cloud resolutions during inference and the point features from the encoder not meeting the semantic needs of Large Language Models (LLMs). We identify key aspects for 3D LMMs to remove the pre-trained encoder and enable the LLM to assume the role of the 3D encoder: 1) We propose the LLM-embedded Semantic Encoding strategy in the pre-training stage, exploring the effects of various point cloud self-supervised losses. We also present the Hybrid Semantic Loss to extract high-level semantics. 2) We introduce the Hierarchical Geometry Aggregation strategy in the instruction tuning stage. This incorporates inductive bias into the LLM layers to focus on the local details of the point clouds. To this end, we present the first Encoder-free 3D LMM, **ENEL**. Our 7B model rivals the state-of-the-art model, PointLLM-PiSA-13B, achieving 57.91%, 61.0%, and 55.20% on the classification, captioning, and VQA tasks, respectively. Our results show that the encoder-free architecture is highly promising for replacing encoder-based architectures in the field of 3D understanding. The code is released at https://github.com/Ivan-Tang-3D/ENEL.
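Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Multimodal reasoning #Multimodal RL #Multimodal Large Language Model #Attention Analysis
🎯 Motivation: The mechanisms of the cold-start stage in multimodal large models are not well understood, although this initialization is critical for later training. Existing work lacks an effective metric for how much a model attends to visual information.
❓ Problem addressed: Introduces the Visual Attention Score (VAS) to quantify how much a model attends to visual tokens, designs training-free inference-time interventions that adjust attention allocation, and develops the AVAR cold-start framework to improve multimodal reasoning systematically.
🔍 Observations: Reasoning performance correlates strongly with VAS (r=0.9616), yet conventional multimodal cold-start does not noticeably raise VAS, while text-only cold-start does. This counter-intuitive phenomenon is named Lazy Attention Localization.
🛠️ Method: AVAR integrates visual-anchored data synthesis, attention-guided objectives, and visual-anchored reward shaping. Separately, training-free interventions manipulate attention allocation directly at inference time. AVAR is applied to Qwen2.5-VL-7B.
📊 Data and experiments: Evaluated on 7 multimodal reasoning benchmarks, with an average gain of 7.0%. Ablations confirm that each AVAR component contributes step-wise to the gains. Code, data, and models are released for reproducibility.
⭐ Contributions: First to reveal a direct link between attention allocation in the cold-start stage and final performance; proposes the VAS metric and an effective cold-start framework; and gives multimodal model training a new theoretical perspective and practical tools.
Abstract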
The cold-start initialization stage plays a pivotal role in training Multimodal Large Reasoning Models (MLRMs), yet its mechanisms remain insufficiently understood. To analyze this stage, we introduce the Visual Attention Score (VAS), an attention-based metric that quantifies how much a model attends to visual tokens. We find that reasoning performance is strongly correlated with VAS (r=0.9616): models with higher VAS achieve substantially stronger multimodal reasoning. Surprisingly, multimodal cold-start fails to raise VAS, leaving distributions close to the base model, whereas text-only cold-start induces a clear increase. We term this counter-intuitive phenomenon Lazy Attention Localization. To validate its causal role, we design training-free interventions that directly manipulate attention allocation at inference time, yielding consistent 1-2% gains without retraining. Building on these insights, we propose Attention-Guided Visual Anchoring and Reflection (AVAR), a comprehensive cold-start framework that integrates visual-anchored data synthesis, attention-guided objectives, and visual-anchored reward shaping. Applied to Qwen2.5-VL-7B, AVAR delivers an average gain of 7.0% across 7 multimodal reasoning benchmarks. Ablation studies further confirm that each component of AVAR contributes step-wise to the overall gains. The code, data, and models are available at https://github.com/lrlbbzl/Qwen-AVAR.
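The abstract describes VAS only as an attention-based metric for how much a model attends to visual tokens. One plausible reading, sketched below, is the share of attention mass that query tokens place on visual-token positions, averaged over queries. This definition, the function name, and the toy attention matrix are assumptions for illustration; the paper's exact formula may differ (for example in which layers, heads, or query positions it aggregates).

```python
from typing import Sequence


def visual_attention_score(attn_rows: Sequence[Sequence[float]],
                           is_visual: Sequence[bool]) -> float:
    """Mean share of attention mass that query tokens place on visual tokens.

    attn_rows[i][j] is the attention weight from query token i to key token j.
    Rows are normalised by their own sum, so they need not sum exactly to 1.
    """
    if not attn_rows:
        raise ValueError("need at least one attention row")
    shares = []
    for row in attn_rows:
        if len(row) != len(is_visual):
            raise ValueError("row length must match the visual-token mask")
        total = sum(row)
        visual = sum(w for w, v in zip(row, is_visual) if v)
        shares.append(visual / total)
    return sum(shares) / len(shares)


# Toy example: two visual tokens followed by two text tokens.
mask = [True, True, False, False]
attn = [
    [0.4, 0.3, 0.2, 0.1],  # 70% of this query's attention is on visual tokens
    [0.1, 0.1, 0.4, 0.4],  # 20% of this query's attention is on visual tokens
]
vas = visual_attention_score(attn, mask)  # mean of 0.7 and 0.2
```

Under this reading, a score near 0 means the model mostly ignores the image, which is the situation the abstract associates with weaker multimodal reasoning.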
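Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Native Vision-Language Models #Vision-Language Primitive #Holistic Vision-Language Buffer
TL;DR: A novel family of native VLMs built from first principles, capable of rivaling top-tier modular counterparts across diverse real-world scenarios.
🎯 Motivation: Native vision-language models are on the rise, but their fundamental constraints and development barriers remain unclear, which hinders further exploration. The paper sets out to clarify these issues and make the field more accessible.
❓ Problem addressed: Defines the basic construction principles of native VLMs and proposes NEO, a new model family that greatly narrows the gap with mainstream modular models in real-world scenarios.
🔍 Observations: The core open questions are what fundamentally separates native from modular models and how far those barriers can be overcome; the field also lacks openness and accessibility, which slows progress.
🛠️ Method: States three principles for a native VLM primitive: align pixel and word representations in a shared semantic space, integrate vision and language capabilities deeply, and natively support unified cross-modal encoding, aligning, and reasoning.
📊 Data and experiments: With 390M image-text examples, NEO develops visual perception from scratch while mitigating vision-language conflicts inside a dense, monolithic model, demonstrating efficient visual perception.
⭐ Contributions: Positions the NEO family as a cornerstone for scalable native VLMs, together with a rich set of reusable components that support a cost-effective, extensible ecosystem.
Abstract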
The edifice of native Vision-Language Models (VLMs) has emerged as a rising contender to typical modular VLMs, shaped by evolving model architectures and training paradigms. Yet, two lingering clouds cast shadows over its widespread exploration and promotion: (1) What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome? (2) How can research in native VLMs be made more accessible and democratized, thereby accelerating progress in the field? In this paper, we clarify these challenges and outline guiding principles for constructing native VLMs. Specifically, one native VLM primitive should: (i) effectively align pixel and word representations within a shared semantic space; (ii) seamlessly integrate the strengths of formerly separate vision and language modules; (iii) inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning. Hence, we launch NEO, a novel family of native VLMs built from first principles, greatly narrowing the gap with top-tier modular counterparts across diverse real-world scenarios. With 390M image-text examples, NEO efficiently develops visual perception from scratch while mitigating vision-language conflicts inside a dense and monolithic model crafted from our elaborate primitives. We position NEO as a cornerstone for scalable and powerful native VLM development, paired with a rich set of reusable components that foster a cost-effective and extensible ecosystem.
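Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Large Multimodal Models #Multi-token Prediction #Non-Autoregressive Learning
🎯 Motivation: As LLMs extend to multimodal settings, speech-to-speech (S2S) systems rely mainly on autoregressive (AR) methods, overlooking that text depends on target-target relations while audio depends on source-target relations. This limits how well models handle mixed audio-text modalities.
❓ Problem addressed: Proposes the Text-to-Talk (TtT) framework to unify audio and text generation by combining AR text generation with non-autoregressive audio diffusion, addressing the modality mismatch and low generation efficiency of existing methods.
🔍 Observations: Existing multimodal models typically apply one AR scheme to interleaved audio and text, but the continuity and strong contextual dependence of audio differ from the discrete generation logic of text; applying AR directly leads to inconsistent training objectives and limited inference speed.
🛠️ Method: TtT uses a single Transformer that combines AR text generation with NAR audio generation based on absorbing discrete diffusion. A modality-aware attention mechanism enforces causal decoding for text while allowing bidirectional modeling within audio spans. Three training strategies reduce train-test discrepancies, and inference uses block-wise diffusion to synthesize audio in parallel.
📊 Data and experiments: In comprehensive experiments on Audio-QA, ASR, AAC, and S2S benchmarks, TtT surpasses strong AR and NAR baselines, and ablations and training-strategy analyses confirm the contribution of each component.
⭐ Contributions: Proposes the first unified audio-text generation framework, achieving cross-modal synergy through joint training with non-autoregressive audio generation; designs a modality-aware attention mechanism and training strategies that improve generation quality and efficiency; and validates effectiveness and generalization across multiple tasks.
Abstract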
Recent advances in large language models (LLMs) have attracted significant interest in extending their capabilities to multimodal scenarios, particularly for speech-to-speech (S2S) conversational systems. However, existing multimodal models handling interleaved audio and text rely on autoregressive (AR) methods, overlooking that text depends on target-target relations whereas audio depends mainly on source-target relations. In this work, we propose Text-to-Talk (TtT), a unified audio-text framework that integrates AR text generation with non-autoregressive (NAR) audio diffusion in a single Transformer. By leveraging the any-order AR property of absorbing discrete diffusion, our approach provides a unified training objective for text and audio. To support this hybrid generation paradigm, we design a modality-aware attention mechanism that enforces causal decoding for text while allowing bidirectional modeling within audio spans, and further introduce three training strategies that reduce train-test discrepancies. During inference, TtT employs block-wise diffusion to synthesize audio in parallel while flexibly handling variable-length outputs. Comprehensive experiments on audio question answering (Audio-QA), automatic speech recognition (ASR), automated audio captioning (AAC) and S2S benchmarks show that TtT consistently surpasses strong AR and NAR baselines, with additional ablation and training-strategy analyses confirming the contribution of each component.
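The modality-aware attention described in the abstract (causal for text, bidirectional within audio spans) can be pictured as a boolean mask over an interleaved sequence. The sketch below is one way to build such a mask under assumptions that are not from the paper: positions are labelled "t" (text) or "a" (audio), consecutive audio positions form one span, and every position may also attend to all earlier positions.

```python
from typing import List


def modality_aware_mask(modalities: List[str]) -> List[List[bool]]:
    """Build mask[i][j], True when position i may attend to position j.

    Every position attends causally (j <= i). Audio positions ("a") can
    additionally see the whole contiguous audio span they belong to,
    including later positions inside that span.
    """
    n = len(modalities)
    # Give each contiguous run of audio positions its own span id; text gets -1.
    span_id = [-1] * n
    current = -1
    for i, m in enumerate(modalities):
        if m not in ("t", "a"):
            raise ValueError(f"unknown modality label: {m!r}")
        if m == "a":
            if i == 0 or modalities[i - 1] != "a":
                current += 1
            span_id[i] = current
    # Start from a causal mask, then open up attention inside each audio span.
    mask = [[j <= i for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            if span_id[i] != -1 and span_id[i] == span_id[j]:
                mask[i][j] = True
    return mask


# Two text tokens, a two-token audio span, then one more text token.
mask = modality_aware_mask(["t", "t", "a", "a", "t"])
```

Here the first audio token can see the second (`mask[2][3]`), while text tokens never see anything to their right, so text decoding stays causal.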
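Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Multimodal Large Language Models
🎯 Motivation: Current multimodal large models lack a basic ability to reliably reflect on and refine visual outcomes during visual reasoning and generation, which limits the trustworthiness and controllability of next-generation multimodal systems.
❓ Problem addressed: Proposes the Generative Universal Verifier as a multimodal meta-reasoning plugin that gives vision-language models the ability to reflect on and refine visual outcomes, improving generation reliability and reasoning accuracy.
🔍 Observations: The ViVerBench benchmark shows that existing VLMs underperform across 16 categories of critical tasks, leaving a substantial gap from human-level visual verification.
🛠️ Method: Designs automated pipelines to build large-scale visual verification data and trains OmniVerifier-7B, the first omni-capable generative verifier. Also proposes OmniVerifier-TTS, a sequential test-time scaling paradigm that raises the upper bound of generative ability through iterative fine-grained optimization.
📊 Data and experiments: Builds the ViVerBench benchmark and validates the approach on T2I-ReasonBench and GenEval++, where OmniVerifier-TTS gains +3.7 and +4.3 respectively, outperforming parallel test-time scaling methods such as Best-of-N.
⭐ Contributions: Introduces the Generative Universal Verifier concept, establishes a visual verification benchmark, and exposes the capability gap; trains the first omni-capable visual verifier and identifies three atomic capabilities and their synergy; and extends the verifier to generation-editing and world-model reasoning scenarios.
Abstract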
We introduce *Generative Universal Verifier*, a novel concept and plugin designed for next-generation multimodal reasoning in vision-language models and unified multimodal models, providing the fundamental capability of reflection and refinement on visual outcomes during the reasoning and generation process. This work makes three main contributions: (1) We build **ViVerBench**, a comprehensive benchmark spanning 16 categories of critical tasks for evaluating visual outcomes in multimodal reasoning. Results show that existing VLMs consistently underperform across these tasks, underscoring a substantial gap from human-level capability in reliable visual verification. (2) We design two automated pipelines to construct large-scale visual verification data and train **OmniVerifier-7B**, the first omni-capable generative verifier trained for universal visual verification, which achieves notable gains on ViVerBench (+8.3). Through training, we identify three atomic capabilities in visual verification and demonstrate how they generalize and interact synergistically. (3) We propose **OmniVerifier-TTS**, a sequential test-time scaling paradigm that leverages the universal verifier to bridge image generation and editing within unified models, enhancing the upper bound of generative ability through iterative fine-grained optimization. Beyond generation, we extend the universal verifier to broader world-modeling interleaved reasoning scenarios. Empirically, OmniVerifier-TTS achieves improvements on T2I-ReasonBench (+3.7) and GenEval++ (+4.3), outperforming existing parallel test-time scaling methods, such as Best-of-N. By endowing multimodal reasoning with reliable visual verification, OmniVerifier advances both reliable reflection during generation and scalable test-time refinement, marking a step toward more trustworthy and controllable next-generation reasoning systems.
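The contrast the abstract draws between sequential test-time scaling and parallel schemes such as Best-of-N can be sketched as a generate, verify, edit loop: instead of sampling N independent outputs and picking one, a single output is refined using the verifier's feedback. The code below is a minimal sketch of that control flow only. The callables `generate`, `verify`, and `edit` are hypothetical placeholders, and the toy stand-ins operate on strings rather than images.

```python
from typing import Callable, Tuple


def sequential_refine(prompt: str,
                      generate: Callable[[str], str],
                      verify: Callable[[str, str], Tuple[bool, str]],
                      edit: Callable[[str, str], str],
                      max_rounds: int = 3) -> Tuple[str, int]:
    """Generate once, then verify and edit until the verifier accepts.

    Returns the final output and the number of edit rounds that were applied.
    If the round budget runs out, the last edited output is returned unverified.
    """
    output = generate(prompt)
    for round_idx in range(max_rounds):
        ok, feedback = verify(prompt, output)
        if ok:
            return output, round_idx
        output = edit(output, feedback)
    return output, max_rounds


# Toy stand-ins: the "image" is a string and the verifier wants two cats in it.
def toy_generate(prompt: str) -> str:
    return "cat"


def toy_verify(prompt: str, output: str) -> Tuple[bool, str]:
    missing = 2 - output.count("cat")
    return missing <= 0, f"add {missing} cat(s)"


def toy_edit(output: str, feedback: str) -> str:
    return output + " cat"


final, rounds = sequential_refine("two cats", toy_generate, toy_verify, toy_edit)
```

In the toy run the first draft fails verification, one edit fixes it, and the loop stops after a single refinement round.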
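Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Vision Encoder #Multimodal Large Language Model #Fine-Grain Perception
🎯 Motivation: Vision encoders for MLLMs focus on global image representations and lack fine-grained regional perception, a limitation that stems from the scarcity of large-scale fine-grained annotations and of a matching pre-training paradigm.
❓ Problem addressed: Proposes GranViT, a Vision Transformer that combines fine-grained feature extraction with region-level autoregressive training. Region-level annotations and bidirectional regression strengthen local visual representation and semantic alignment.
🔍 Observations: Because fine-grained annotated data is scarce and pre-training frameworks are lacking, existing models show clear weaknesses in detail perception and localization reasoning, which limits MLLMs on fine-grained visual tasks.
🛠️ Method: Builds the Gran-29M dataset (29 million images with over 180 million region annotations) and a pretraining-adaptation framework. Training uses bidirectional regression between bounding boxes and captions, plus a self-distillation mechanism that strengthens localization constraints.
📊 Data and experiments: After large-scale pretraining on Gran-29M, the model achieves state-of-the-art results on fine-grained recognition, multimodal VQA, and OCR understanding, and transfers well across LLMs.
⭐ Contributions: Proposes the first vision encoder that integrates fine-grained perception with region-level autoregressive training; builds a large-scale region-annotated dataset; and introduces a bidirectional regression framework with self-distillation that markedly improves local representation.
Abstract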
Vision encoders are indispensable for the impressive performance of Multimodal Large Language Models (MLLMs) in vision–language tasks such as visual question answering and reasoning. However, existing vision encoders focus on global image representations but overlook fine-grained regional analysis. They are limited in fine-grained perception due to the scarcity of fine-grained annotated data and the lack of a fine-grained pre-training paradigm. In this paper, we propose GranViT, a novel Vision Transformer that integrates fine-grained feature extraction with semantic alignment to Large Language Models (LLMs) via region-level autoregressive training. We first construct Gran-29M, a dataset comprising 29 million natural and OCR images paired with over 180 million high-quality region-level annotations, to enable large-scale fine-grained pretraining. We then develop a pretraining-adaptation framework along with a self-distillation mechanism to train fine-grained GranViT on Gran-29M. We fully exploit the fine-grained annotations in Gran-29M, using bounding-box-to-caption regression during pretraining to enhance the localized visual representation of the vision encoder, and caption-to-bounding-box regression during adaptation to improve vision feature utilization and localization for the LLM. We further incorporate a self-distillation mechanism that imposes explicit localization constraints on the vision encoder to strengthen its regional reasoning capability. Extensive experiments show that GranViT surpasses existing vision encoders and attains strong transferability to varying LLMs. Remarkably, it achieves state-of-the-art results on fine-grained recognition, multimodal VQA, and OCR understanding.
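Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#image caption #benchmark #region understanding
🎯 Motivation: MLLMs excel at holistic understanding but are limited in fine-grained analysis of dense scenes and in understanding complex relationships. Existing region-level models usually treat a given region in isolation and neglect crucial global context.
❓ Problem addressed: To address missing global context and weak modeling of cross-region interactions, the paper proposes GAR for comprehensive region-level visual understanding. A new feature replay technique and multi-prompt modeling strengthen precise perception and complex reasoning.
🔍 Observations: Current region-level MLLMs make poor use of global cues, which limits deep analysis of complex scenes. Existing benchmarks also fail to assess advanced abilities such as multi-region interaction and compositional reasoning.
🛠️ Method: An RoI-aligned feature replay technique brings in the necessary global context and supports modeling interactions among multiple prompts. This shifts the paradigm from passive description to active dialogue, answering free-form questions about any region.
📊 Data and experiments: Builds the GARBench evaluation suite, covering single-region comprehension and multi-region complex reasoning. GAR-1B outperforms DAM-3B by 4.5 points on DLC-Bench, and zero-shot GAR-8B even outperforms the in-domain VideoRefer-7B.
⭐ Contributions: Proposes GAR, the first region-level visual understanding model that integrates global context, and builds the multi-dimensional GARBench benchmark. The model shows strong transfer and advanced reasoning on both images and videos; code and data will be released.
Abstract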
While Multimodal Large Language Models (MLLMs) excel at holistic understanding, they struggle with the dense world, i.e., complex scenes requiring fine-grained analysis of intricate details and object inter-relationships. Region-level MLLMs have been a promising step. However, previous attempts are generally optimized to understand given regions in isolation, neglecting crucial global contexts. To address this, we introduce Grasp Any Region (GAR) for comprehensive region-level visual understanding. Empowered by an effective RoI-aligned feature replay technique, GAR supports (1) precise perception by leveraging necessary global contexts, and (2) modeling interactions between multiple prompts. Together, it then naturally achieves (3) advanced compositional reasoning to answer specific free-form questions about any region, shifting the paradigm from passive description to active dialogue. Moreover, we construct GARBench, which not only provides a more accurate evaluation of single-region comprehension, but also, more importantly, measures interactions and complex reasoning across multiple regions. Empirically, GAR-1B not only maintains the state-of-the-art captioning capabilities, e.g., outperforming DAM-3B +4.5 on DLC-Bench, but also excels at modeling relationships between multiple prompts with advanced comprehension capabilities, even surpassing InternVL3-78B on GARBench-VQA. More importantly, our zero-shot GAR-8B even outperforms in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating its strong comprehension capabilities can be easily transferred to videos. Code and data will be released to the community.
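Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#navigation foundation models #Vision-and-Language Navigation
🎯 Motivation: Navigation methods built on large vision-language models are mostly end-to-end mappings that suffer from fragmented actions, high latency, and difficulty with dynamic obstacles. A navigation foundation model is needed that combines high-level reasoning with low-level execution.
❓ Problem addressed: Proposes DualVLN, the first dual-system vision-language navigation foundation model, whose high-level and low-level systems cooperate to address shortcomings in real-time control, adaptation to dynamic environments, and trajectory smoothness.
🔍 Observations: Current end-to-end VLN methods output discrete short-horizon actions directly, which yields incoherent motion and high compute latency and cannot handle challenges such as real-time obstacle avoidance in dynamic environments.
🛠️ Method: A dual-system architecture: System 2 is a VLM-based global planner that predicts mid-term waypoint goals through image-grounded reasoning; System 1 is a lightweight multi-modal Diffusion Transformer policy that combines explicit pixel goals with latent features to generate smooth trajectories.
📊 Data and experiments: The model outperforms prior methods on standard VLN benchmarks, and real-world experiments confirm robust long-horizon planning and real-time adaptation in dynamic environments.
⭐ Contributions: Introduces the first dual-system VLN foundation model; decoupled training preserves the VLM's generalization while giving interpretable local navigation; and a diffusion policy generates continuous trajectories, improving real-time control and motion coherence in dynamic environments.
Abstract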
While recent large vision-language models (VLMs) have improved generalization in vision-language navigation (VLN), existing methods typically rely on end-to-end pipelines that map vision-language inputs directly to short-horizon discrete actions. Such designs often produce fragmented motions, incur high latency, and struggle with real-world challenges like dynamic obstacle avoidance.
We propose DualVLN, the first dual-system VLN foundation model that synergistically integrates high-level reasoning with low-level action execution. System 2, a VLM-based global planner, "grounds slowly" by predicting mid-term waypoint goals via image-grounded reasoning. System 1, a lightweight, multi-modal conditioning Diffusion Transformer policy, "moves fast" by leveraging both explicit pixel goals and latent features from System 2 to generate smooth and accurate trajectories.
The dual-system design enables robust real-time control and adaptive local decision-making in complex, dynamic environments. By decoupling training, the VLM retains its generalization, while System 1 achieves interpretable and effective local navigation. DualVLN outperforms prior methods across all VLN benchmarks, and real-world experiments demonstrate robust long-horizon planning and real-time adaptability in dynamic environments.
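The slow-planner / fast-controller split described above can be illustrated with a toy control loop. This is only a sketch under strong assumptions: the VLM planner is replaced by a precomputed list of waypoints, the diffusion policy by a proportional step on a 1-D track, and `run_dual_system`, `plan_every`, and `gain` are hypothetical names that do not come from the paper.

```python
def run_dual_system(start, goals, plan_every=4, steps=8, gain=0.5):
    """Simulate a slow planner and a fast controller on a 1-D track.

    goals: waypoint the (stand-in) slow planner emits at each replanning tick.
    plan_every: number of fast control steps between two planner updates.
    """
    position, waypoint, trajectory = start, start, []
    for t in range(steps):
        if t % plan_every == 0:
            # Slow system: refresh the waypoint only occasionally.
            waypoint = goals[t // plan_every]
        # Fast system: move part of the way toward the current waypoint.
        position += gain * (waypoint - position)
        trajectory.append(position)
    return trajectory


if __name__ == "__main__":
    print(run_dual_system(0.0, [8.0, 0.0]))
```

The point of the sketch is the rate mismatch: the controller keeps producing smooth motion on every tick while the waypoint changes only when the slower system has a new plan.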
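Foundation / Frontier Models (incl. LLMs)
Multimodal Foundation Models
#LVLMs-Saliency; Saliency-Guided Rejection Sampling; Local Coherence Reinforcement; Hallucination
🎯 Motivation: Existing methods rely only on forward-pass attention and cannot reliably distinguish hallucinated from correct content in LVLM outputs. This ignores what gradient signals reveal about how token influence propagates.
❓ Problem: Quantify the grounding strength of each output token by fusing attention with gradient information, in order to diagnose and reduce hallucinations in LVLMs.
🔍 Analysis: A key pattern emerges: hallucinations occur when prior output tokens have low saliency for the next-token prediction, which indicates a failure of contextual memory.
🛠️ Method: A dual-mechanism inference-time framework: (1) Saliency-Guided Rejection Sampling, which dynamically filters out candidate tokens whose saliency falls below a context-adaptive threshold during decoding; (2) a Local Coherence Reinforcement module, which strengthens attention from the current token to recent outputs to counteract the identified "forgetting" behavior.
📊 Data & experiments: Experiments show that the method significantly reduces hallucinations across multiple LVLMs.
⭐ Contributions: Introduces LVLMs-Saliency, a gradient-aware diagnostic tool; reveals the saliency pattern behind hallucinations; and designs an inference-time framework comprising SGRS and LocoRE, offering a robust and interpretable way to improve model reliability.
Abstract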
Recent studies have investigated attention dynamics in large vision language models (LVLMs), yet existing methods remain limited in reliably distinguishing hallucinated from correct outputs, primarily because they rely solely on forward-pass attention, ignoring gradient-based signals that reveal how token influence propagates through the model. To bridge this gap, we introduce LVLMs-Saliency, a gradient-aware diagnostic tool that quantifies the grounding strength of each output token by fusing attention weights with their gradients. Through analysis, we identify a decisive pattern: hallucinations occur when prior output tokens show low saliency to the next token prediction, indicating a failure of contextual memory. Building on this insight, we propose a dual-mechanism inference-time framework: (1) Saliency-Guided Rejection Sampling (SGRS), which dynamically filters candidate tokens during decoding by rejecting those with saliency below a context-adaptive threshold, thereby preventing coherence-breaking tokens from entering the sequence; and (2) Local Coherence Reinforcement (LocoRE), a lightweight plug-and-play module that strengthens attention from the current token to its most recent outputs, actively counteracting the “forgetting” behavior identified by LVLMs-Saliency. Experimental results demonstrate that our method significantly reduces hallucinations across multiple LVLMs, offering a robust and interpretable solution to improve model reliability.
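A minimal sketch of how saliency-guided rejection could look, under assumptions the abstract does not spell out: saliency is taken here as the mean absolute product of attention weight and gradient over prior output tokens, and the context-adaptive threshold as a fixed fraction of the running mean saliency of accepted tokens. The function names, the `ratio` parameter, and the fallback rule are all hypothetical.

```python
def token_saliency(attn, grad):
    """Mean |attention * gradient| over prior output tokens (assumed fusion rule)."""
    if not attn:
        return 0.0
    return sum(abs(a * g) for a, g in zip(attn, grad)) / len(attn)


def sgrs_select(candidates, history, ratio=0.5):
    """Pick the most probable candidate whose saliency clears the threshold.

    candidates: list of (token, probability, saliency) tuples.
    history: saliency values of previously accepted tokens.
    ratio: assumed fraction of the running mean used as the threshold.
    """
    threshold = ratio * (sum(history) / len(history)) if history else 0.0
    accepted = [c for c in candidates if c[2] >= threshold]
    # Fall back to the full candidate list if every candidate was rejected.
    pool = accepted or candidates
    return max(pool, key=lambda c: c[1])[0]


if __name__ == "__main__":
    cands = [("cat", 0.6, 0.1), ("dog", 0.3, 0.9)]
    print(sgrs_select(cands, history=[1.0, 1.0]))
```

In this toy call the higher-probability candidate is rejected because its saliency falls below the threshold, so the lower-probability but better-grounded candidate is chosen.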
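Foundation / Frontier Models (incl. LLMs)
Multimodal Foundation Models
#mitigating hallucination #feature editing #LVLMs
🎯 Motivation: Large vision-language models excel at multimodal reasoning and complex scene understanding but still hallucinate, producing outputs that contradict visual facts. Existing retraining and contrastive decoding methods are effective but costly in training resources and inference overhead, respectively, which limits their practicality.
❓ Problem: Proposes a hallucination-aware framework for editing intermediate representations that dynamically detects and removes hallucination representations at minimal extra compute cost, giving efficient and controllable hallucination mitigation.
🔍 Analysis: Hallucination arises from inconsistency between model outputs and visual input. Existing mitigation methods struggle to balance performance and efficiency: retraining is resource-hungry, while contrastive decoding doubles the inference burden.
🛠️ Method: Dynamically detect hallucination features in the model's intermediate representations and edit them in a targeted way, improving output accuracy while keeping inference efficient.
📊 Data & experiments: Extensive experiments on existing benchmarks show state-of-the-art performance and confirm efficient, robust hallucination elimination and controllability.
⭐ Contributions: A lightweight hallucination-aware editing framework that achieves state-of-the-art mitigation at low compute cost; highlights the method's efficiency, robustness, and controllability; code is open-sourced.
Abstract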
Large Vision-Language Models have demonstrated exceptional performance in multimodal reasoning and complex scene understanding. However, these models still face significant hallucination issues, where outputs contradict visual facts. Recent research on hallucination mitigation has focused on retraining methods and Contrastive Decoding (CD) methods. While both methods perform well, retraining methods require substantial training resources, and CD methods introduce dual inference overhead. These factors hinder their practical applicability. To address the above issue, we propose a framework for dynamically detecting hallucination representations and performing hallucination-eliminating edits on these representations. With minimal additional computational cost, we achieve state-of-the-art performance on existing benchmarks. Extensive experiments demonstrate the effectiveness of our approach, highlighting its efficient and robust hallucination elimination capability and its powerful controllability over hallucinations. Code is available at https://github.com/ASGO-MM/HIRE.
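Foundation / Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Hallucination Mitigation #Large Vision-Language Models #Attention Intervention
TL;DR: We propose AGE, a training-free framework that mitigates hallucinations in LVLMs by imitating truth-grounded attention patterns of real tokens, yielding more accurate and trustworthy multimodal reasoning.
🎯 Motivation: Large vision-language models perform well at multimodal reasoning but remain prone to hallucinations that are inconsistent with visual evidence. Existing mitigation methods often rely on external modules or coarse decoding adjustments, overlooking the fine-grained differences in attention dynamics between real and hallucinated tokens.
❓ Problem: Proposes the training-free AGE framework, which applies fine-grained, layer-wise interventions that imitate the attention patterns of real tokens, reducing hallucinations while preserving fluency.
🔍 Analysis: Real and hallucinated tokens show stage-specific differences in attention behavior; hallucinations arise when the model fails to reproduce the attention patterns of real tokens.
🛠️ Method: AGE introduces two lightweight interventions: imitating image attention (derived from differences between real and hallucinated tokens) and imitating text attention when semantic grounding is required, giving layer-wise attention-guided enhancement.
📊 Data & experiments: Evaluated extensively on COCO image captioning, POPE, and MME with models including LLaVA, MiniGPT-4, and mPLUG-Owl2; AGE consistently reduces hallucinations without additional training.
⭐ Contributions: Proposes intervention guided by truth-grounded attention patterns, designs the general training-free framework AGE, and validates its effect on LVLM reliability across multiple benchmarks and models.
Abstract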
Large Vision-Language Models (LVLMs) achieve impressive multimodal reasoning but remain prone to hallucinations, generating content inconsistent with visual evidence.
Existing mitigation methods often rely on auxiliary modules or coarse decoding-time adjustments, overlooking the fine-grained dynamics that distinguish truthful (real) tokens from hallucinatory ones.
In this paper, we introduce AGE (Attention-aware Truth-Guided Enhancement), a training-free framework that performs fine-grained, layer-wise interventions guided by attention patterns of real tokens.
Our analysis reveals that real and hallucinated tokens follow distinct stage-specific attention behaviors, and hallucinations emerge when models fail to reproduce these behaviors.
AGE addresses this by introducing two lightweight interventions: (i) Imitating the image attention, derived from discrepancies between real and hallucinated tokens, and (ii) Imitating the text attention when semantic grounding is required.
Extensive experiments on widely used benchmarks, including COCO Image Captioning, POPE, and MME, demonstrate that AGE consistently mitigates hallucinations across diverse LVLMs such as LLaVA, MiniGPT-4, and mPLUG-Owl2, without additional training or loss of fluency.
Our results highlight that imitating truth-grounded attention dynamics is a simple yet powerful principle to improve the reliability of LVLMs.
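Foundation / Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Interleaving Reasoning #Text-to-Image Generation
🎯 Motivation: Unified multimodal understanding-and-generation models have made notable progress in image generation, but in instruction following and detail preservation they still lag well behind systems such as GPT-4o that tightly couple understanding with generation. Inspired by recent advances in interleaving reasoning, the authors explore whether this paradigm can further improve text-to-image generation.
❓ Problem: Introduce a reasoning framework that alternates thinking and image generation to improve adherence to the details of complex text instructions, visual quality, and aesthetics while preserving semantic consistency, narrowing the gap to leading systems.
🔍 Analysis: Current unified multimodal models often lose details or deviate from instructions during image generation. The key cause is the lack of an iterative, fine-grained reasoning process, so a single generation pass struggles to achieve both semantic accuracy and fine-grained quality.
🛠️ Method: An Interleaving Reasoning Generation (IRG) framework that alternates text thinking and image synthesis in stages: it first produces text thinking to guide an initial image, then reflects to refine details, image quality, and aesthetics. A companion learning method (IRGL) focuses on strengthening initial generation and high-quality execution of reflections.
📊 Data & experiments: Builds the IRGL-300K dataset, covering six decomposed learning modes, to train text thinking and full thinking-image trajectories. Experiments yield absolute gains of 5 to 10 points on multiple benchmarks, confirming marked improvements in visual quality and fine-grained fidelity.
⭐ Contributions: An early, systematic application of the interleaving reasoning paradigm to text-to-image generation; proposes an iteratively refinable generation framework and learning method; releases code, models, and datasets to support further work in this direction.
Abstract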
Unified multimodal understanding and generation models have recently achieved significant improvements in image generation capability, yet a large gap remains in instruction following and detail preservation compared to systems that tightly couple comprehension with generation, such as GPT-4o.
Motivated by recent advances in interleaving reasoning, we explore whether such reasoning can further improve text-to-image (T2I) generation.
We introduce Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis: the model first produces a text-based thinking to guide an initial image, then reflects on the result to refine fine-grained details, visual quality, and aesthetics while preserving semantics.
To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two sub-goals: (1) strengthening the initial think-and-generate stage to establish core content and base quality, and (2) enabling high-quality textual reflection and faithful implementation of those refinements in a subsequent image.
We curate IRGL-300K, a 300K-scale dataset organized into six decomposed learning modes that jointly cover learning text-based thinking and full thinking–image trajectories.
Starting from a unified foundation model that natively emits interleaved text–image outputs, our two-stage training first builds robust thinking and reflection, then efficiently tunes the IRG pipeline in the full thinking–image trajectory data.
Extensive experiments show SoTA performance, yielding absolute gains of 5–10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN, alongside substantial improvements in visual quality and fine-grained fidelity.
As an early exploration, our results demonstrate that interleaving reasoning is a powerful paradigm for advancing T2I.
The code, model weights, and datasets will be released at https://github.com/Osilly/Interleaving-Reasoning-Generation.
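Foundation / Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Large language Model #MLLMs #Vision Encoder #Machine Learning
TL;DR: Adding more vision encoders to MLLMs often yields redundancy or even harms performance.
🎯 Motivation: Multimodal large language models (MLLMs) increasingly adopt multiple vision encoders, and the authors question whether this is actually necessary. The prevailing assumption is that diverse pretraining objectives yield complementary visual signals; the authors test this assumption empirically.
❓ Problem: Reveal redundancy among vision encoders in multi-encoder MLLMs. Through systematic experiments, examine whether adding encoders really improves performance, and quantify the redundancy.
🔍 Analysis: Multiple encoders are often substantially redundant or even harmful: OCR and chart tasks show strong specialization, while on general VQA tasks the encoders are largely interchangeable. Masking specific encoders can even raise accuracy on a specific task category by up to 16%.
🛠️ Method: Two quantitative metrics: the Conditional Utilization Rate (CUR), which measures an encoder's marginal contribution in the presence of others, and the Information Gap (IG), which captures heterogeneity in encoder utility. Both are validated through systematic encoder-masking experiments.
📊 Data & experiments: Experiments on multimodal benchmarks covering OCR, chart, VQA, and knowledge-based tasks compare full-encoder models against masked variants; single- and dual-encoder variants recover over 90% of baseline performance on most non-OCR tasks.
⭐ Contributions: Challenges the "more encoders are better" heuristic and provides quantifiable diagnostics for redundancy. Offers empirical grounding for more efficient multimodal architectures, with encoder masking yielding up to a 3.6% overall performance gain over the full model.
Abstract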
Recent multimodal large language models (MLLMs) increasingly integrate multiple vision encoders to improve performance on various benchmarks, assuming that diverse pretraining objectives yield complementary visual signals.
However, we show this assumption often fails in practice. Through systematic encoder masking across representative multi-encoder MLLMs, we find that performance typically degrades gracefully—and sometimes even improves—when selected encoders are masked, revealing pervasive encoder redundancy.
To quantify this effect, we introduce two principled metrics: the Conditional Utilization Rate (CUR), which measures an encoder’s marginal contribution in the presence of others, and the Information Gap (IG), which captures heterogeneity in encoder utility within a model.
Using these tools, we observe: (i) strong specialization on tasks like OCR & Chart, where a single encoder can dominate with a CUR > 90%, (ii) high redundancy on general VQA and knowledge-based tasks, where encoders are largely interchangeable, and (iii) instances of detrimental encoders with negative CUR.
Notably, masking specific encoders can yield up to 16% higher accuracy on a specific task category and 3.6% overall performance boost compared to the full model.
Furthermore, single- and dual- encoder variants recover over 90% of baseline on most non-OCR tasks. Our analysis challenges the “more encoders are better” heuristic in MLLMs and provides actionable diagnostics for developing more efficient and effective multimodal architectures.
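To make the two metrics concrete, here is a small numeric sketch. The abstract does not give the formulas, so both definitions below are assumptions chosen for illustration: CUR as the relative accuracy drop when one encoder is masked, and IG as the spread between the largest and smallest CUR within a model. The encoder names and accuracy values are invented.

```python
def conditional_utilization_rate(acc_full, acc_masked):
    """Relative accuracy drop when one encoder is masked (assumed definition)."""
    return (acc_full - acc_masked) / acc_full


def information_gap(curs):
    """Spread between the most and least useful encoder (assumed definition)."""
    return max(curs) - min(curs)


# Hypothetical accuracies: the full model, then with each encoder masked in turn.
acc_full = 0.80
acc_masked = {"enc_a": 0.08, "enc_b": 0.78, "enc_c": 0.84}
curs = {name: conditional_utilization_rate(acc_full, acc)
        for name, acc in acc_masked.items()}

if __name__ == "__main__":
    for name, cur in curs.items():
        print(name, round(cur, 3))
    print("IG", round(information_gap(list(curs.values())), 3))
```

Under these assumed definitions, `enc_a` dominates (masking it removes most of the accuracy), `enc_b` is nearly redundant, and `enc_c` has a negative CUR because the model does better without it, which mirrors the three regimes listed in the abstract.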
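Foundation / Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Vision Language Model
TL;DR: We invert the paradigm of feeding images into large language models, and show that language can guide the image encoder to learn more sophisticated features and combat hallucinations.
🎯 Motivation: Vision foundation models are usually trained as static feature extractors, which shifts the burden of task adaptation onto large downstream models. The authors propose a new paradigm: use language itself to dynamically guide the vision encoder rather than only feeding visual features into a language model. The aim is to make visual representations more controllable and generalizable and to reduce reliance on downstream adaptation.
❓ Problem: Vision foundation models depend on large downstream models for task adaptation, are prone to hallucination, and generalize poorly. The paper proposes dynamically guiding the vision encoder with language instructions, which focuses attention on context-relevant features and improves controllability and generalization.
🔍 Analysis: Conventional methods feed images to large language models as static features, which can leave the features poorly targeted to the task and cause visual hallucinations. A language-guided vision encoder can dynamically extract task-relevant features, reducing interference from irrelevant information and improving perceptual accuracy.
🛠️ Method: Language-Instructed Vision Embeddings (LIVE) use language as high-level guidance to produce task-centric embeddings at inference time, without task-specific retraining. This lets the encoder focus on context-relevant aspects of the input and makes the representation more controllable.
📊 Data & experiments: Reduces visual hallucinations on the MMVP benchmark (+34 points), outperforms vision-language models with orders of magnitude more parameters on visual question answering, and generalizes well to unseen instructions and tasks.
⭐ Contributions: Inverts the paradigm of feeding images into large language models by using language to dynamically guide the vision encoder toward finer-grained features. LIVE enables adaptive, instruction-driven visual intelligence with improved controllability and generalization.
Abstract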
Vision foundation models are typically trained as static feature extractors, forcing the burden of task adaptation onto large downstream models. We propose a different paradigm: instead of solely feeding visual features into language, we use language itself to dynamically guide the vision encoder. Our method, Language-Instructed Vision Embeddings (LIVE), leverages language as high-level guidance to produce task-centric embeddings at inference time, without requiring task-specific retraining. This enables the encoder to focus attention on contextually relevant aspects of the input, yielding more controllable and generalizable representations. Empirically, LIVE reduces visual hallucinations (+34 points on MMVP), outperforms vision–language models with orders of magnitude more parameters on visual question answering, and generalizes to unseen instructions and tasks, offering a direct path toward adaptive, instruction-driven visual intelligence.
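Foundation / Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Speech–Text Models #Latent Patching #Multimodal Alignment #Large Language Models
TL;DR: We introduce Latent Speech-Text Transformer, which patches long speech token sequences into latent units, improving text–speech transfer while cutting pre-training and inference compute, and significantly outperforming existing speech-text LLMs.
🎯 Motivation: Existing autoregressive speech-text models suffer from modality imbalance: speech token sequences are far longer than text, which leads to poor compute efficiency and insufficient cross-modal alignment.
❓ Problem: Proposes the Latent Speech-Text Transformer, which aggregates long speech token sequences into higher-level latent speech patches to balance sequence-modeling granularity between speech and text and improve compute efficiency.
🔍 Analysis: Overly long speech token sequences skew pre-training and inference compute toward speech, hinder cross-modal knowledge transfer, and substantially slow performance scaling.
🛠️ Method: Latent speech patching aggregates speech tokens into higher-level autoregressive units that can align with text units while compactly capturing recurring acoustic patterns such as silence.
📊 Data & experiments: On story-completion benchmarks under compute-controlled and data-controlled settings, with models scaled from 420M to 7B parameters, LST gains up to 6.5% absolute on speech HellaSwag.
⭐ Contributions: Proposes LST, which improves speech-text transfer efficiency and performance, stabilizes ASR adaptation, and reduces the autoregressive sequence length and compute cost of ASR and TTS inference.
Abstract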
Auto-regressive speech–text models pre-trained on interleaved text tokens and discretized speech tokens demonstrate strong speech understanding and generation, yet remain substantially less compute-efficient than text LLMs, partly due to the much longer sequences of speech tokens relative to text. This modality imbalance disproportionately allocates pre-training and inference compute to speech, potentially hindering effective cross-modal alignment and slowing performance scaling by orders of magnitude. We introduce the Latent Speech-Text Transformer (LST), which aggregates speech tokens into latent speech patches that serve as higher-level autoregressive units. This design aligns the sequence-modeling granularity between speech and text while improving computational efficiency. The resulting patches can align with textual units to facilitate cross-modal knowledge transfer and compactly capture recurring acoustic patterns such as silence. Across story-completion benchmarks under both compute-controlled and data-controlled settings, LST consistently improves speech accuracy while also improving text performance, achieving up to +6.5% absolute gain on speech HellaSwag in compute-controlled training (+5.3% in data-controlled training). Under compute-controlled scaling from 420M to 1.8B parameters in a near compute-optimal regime, gains grow with scale, and improvements persist up to 7B parameters under fixed-token budgets. These benefits extend to downstream tasks: LST stabilizes ASR adaptation and reduces the effective autoregressive sequence length during ASR and TTS inference, lowering computational cost without degrading reconstruction quality. The Code is available at https://github.com/facebookresearch/lst.
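The effect of patching on sequence length can be shown with a toy aggregation step. This sketch assumes fixed-size patches and mean pooling of token embeddings; the abstract does not say how LST forms or encodes its patches, so `patch_speech_embeddings` and `patch_size` are illustrative stand-ins rather than the paper's mechanism.

```python
def patch_speech_embeddings(embeddings, patch_size):
    """Average consecutive speech-token embeddings into one vector per patch."""
    patches = []
    for start in range(0, len(embeddings), patch_size):
        chunk = embeddings[start:start + patch_size]
        dim = len(chunk[0])
        # Mean-pool each dimension over the tokens in this patch.
        patches.append([sum(vec[d] for vec in chunk) / len(chunk)
                        for d in range(dim)])
    return patches


if __name__ == "__main__":
    speech = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(patch_speech_embeddings(speech, patch_size=2))
```

With a patch size of `k`, the autoregressive model sees roughly `1/k` as many speech positions, which is the source of the compute savings the abstract describes.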
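Foundation / Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Masked Diffusion Model #Unified Multi-modal model
TL;DR: We built a state-of-the-art unified masked diffusion model for image understanding, object grounding, image generation, and editing tasks.
🎯 Motivation: Existing unified masked diffusion models (MDMs) are limited in image understanding and generation, for example supporting only simple image-level understanding and low-resolution generation. This work aims to build a unified multimodal model that handles both advanced image understanding and high-resolution generation.
❓ Problem: Lavida-O addresses the shortcomings of existing MDMs on unified multimodal tasks, in particular image understanding, object grounding, editing, and high-quality generation, overcoming earlier limits on resolution, task coverage, and efficiency.
🔍 Analysis: Current unified MDMs such as MMaDa and Muddit fall short on complex understanding tasks and high-resolution generation, while autoregressive and continuous diffusion models such as Qwen2.5-VL and FluxKontext-dev leave room for improvement in speed and quality. This points to the potential of a unified framework for modality alignment and quality gains.
🛠️ Method: An Elastic Mixture-of-Transformers (Elastic-MoT) architecture couples a lightweight generation branch with a larger understanding branch, supported by token compression, universal text conditioning, and stratified sampling for efficient, high-quality generation. Planning and iterative self-reflection further improve generation and editing quality.
📊 Data & experiments: Evaluated on benchmarks including RefCOCO object grounding, GenEval text-to-image generation, and ImgEdit image editing. Lavida-O outperforms existing autoregressive and continuous diffusion models in quality and inference speed.
⭐ Contributions: Lavida-O supports multiple tasks within a single framework with high-quality generation, achieves state-of-the-art results on several benchmarks, and offers a scalable approach to multimodal reasoning and generation.
Abstract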
We propose Lavida-O, a unified Masked Diffusion Model (MDM) for multimodal understanding and generation. Unlike existing multimodal MDMs such as MMaDa and Muddit which only support simple image-level understanding tasks and low-resolution image generation, Lavida-O presents a single framework that enables image-level understanding, object grounding, image editing, and high-resolution (1024px) text-to-image synthesis. Lavida-O incorporates a novel Elastic Mixture-of-Transformers (Elastic-MoT) architecture that couples a lightweight generation branch with a larger understanding branch, supported by token compression, universal text conditioning and stratified sampling for efficient and high-quality generation. Lavida-O further incorporates planning and iterative self-reflection in image generation and editing tasks, seamlessly boosting generation quality with its understanding capabilities. Lavida-O achieves state-of-the-art performance on a wide range of benchmarks including RefCOCO object grounding, GenEval text-to-image generation, and ImgEdit image editing, outperforming existing autoregressive models and continuous diffusion models such as Qwen2.5-VL and FluxKontext-dev, while offering considerable speedup at inference. These advances establish Lavida-O as a new paradigm for scalable multimodal reasoning and generation.
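Foundation / Frontier Models (incl. LLMs)
Multimodal Foundation Models
#LLM pre-training #MLLMs #multi-modality
TL;DR: Explore and understand the visual priors within LLMs and thus build better MLLMs.
🎯 Motivation: Although large language models are trained on text alone, they unexpectedly develop rich visual priors. The paper aims to systematically reveal the nature and composition of these priors, laying groundwork for building better multimodal large models.
❓ Problem: Characterize the structure, origins, and scaling behavior of LLM visual priors, and explore how to exploit them to build multimodal models more efficiently.
🔍 Analysis: Visual priors decompose into separable perception and reasoning priors with distinct scaling trends and data origins. The reasoning prior comes mainly from pre-training on reasoning-centric data such as code and math, and it transfers; the perception prior emerges more diffusely from broad corpora and is more sensitive to the vision encoder and visual instruction tuning data.
🛠️ Method: Systematically analyzes the full MLLM construction pipeline (from LLM pre-training to visual alignment and supervised fine-tuning) through a large set of controlled experiments, and proposes a data-centric recipe for pre-training vision-aware LLMs based on the findings.
📊 Data & experiments: Based on over 100 controlled experiments consuming 500,000 GPU-hours, spanning five model scales, a wide range of data categories and mixtures, and multiple adaptation setups. Also introduces the Multi-Level Existence Bench (MLE-Bench) to support future research.
⭐ Contributions: Reveals the two-part structure and formation of LLM visual priors and proposes a targeted data recipe for pre-training, offering a way to deliberately cultivate visual priors during language pre-training for the next generation of multimodal LLMs.
Abstract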
Large Language Models (LLMs), despite being trained on text alone, surprisingly develop rich visual priors. These priors allow latent visual capabilities to be unlocked for vision tasks with a relatively small amount of multimodal data, and to perform symbolic visual generation tasks without ever having seen an image. Through systematic analysis, we reveal that visual priors (the implicit, emergent knowledge about the visual world acquired during language pre-training) are composed of separable perception and reasoning priors with unique scaling trends and origins. We show that an LLM's latent visual reasoning ability is predominantly developed by pre-training on reasoning-centric data (e.g., code, math, academia) and scales progressively. This reasoning prior acquired from language pre-training is transferable and universally applicable to visual reasoning. In contrast, the perception prior emerges more diffusely from broad corpora, and perception ability is more sensitive to the vision encoder and visual instruction tuning data. In parallel, text describing the visual world proves crucial, though its performance impact saturates rapidly. Leveraging these insights, we propose a data-centric recipe for pre-training vision-aware LLMs and verify it in 1T token scale pre-training. Our findings are grounded in over 100 controlled experiments consuming 500,000 GPU-hours, spanning the full MLLM construction pipeline, from LLM pre-training to visual alignment and supervised multimodal fine-tuning, across five model scales, a wide range of data categories and mixtures, and multiple adaptation setups. Along with our main findings, we also propose and investigate several hypotheses, and introduce a Multi-Level Existence Bench (MLE-Bench) to facilitate future research. Together, this work provides a new way of deliberately cultivating visual priors from language pre-training, paving the way for the next generation of multimodal LLMs.
We recommend a visit to our project page (https://junlinhan.github.io/projects/lsbs/) for an interactive reading.
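Foundation / Frontier Models (incl. LLMs)
Multimodal Foundation Models
#chemical foundation models #protein foundation models #low rank adaptation #olfaction #multi-modal model #computational neuroscience
TL;DR: We benchmark chemical foundation models for odorant-receptor binding prediction and introduce LORAX, a LoRA-based cross-attention model that outperforms existing approaches and yields more informative odorant representations.
🎯 Motivation: Featurizing odorants to predict their properties is difficult because of the complex activation patterns they evoke in the olfactory system. Structurally similar odorants can elicit distinct activation at both the receptor and perceptual levels, and there is no universally accepted approach to feature design.
❓ Problem: For odorant-receptor binding prediction, evaluate the effectiveness of features from chemical foundation models and develop a new method that produces more informative, olfaction-specific representations.
🔍 Analysis: Feature-based approaches that rely on pre-trained chemical foundation models do not significantly outperform classical hand-designed physicochemical descriptors on odorant-receptor binding prediction. The two kinds of representation overlap heavily in information, which indicates that fine-tuning is needed to obtain novel and better odorant representations.
🛠️ Method: LORAX, an odorant-receptor affinity prediction model based on low-rank adaptation (LoRA) and cross-attention. Fine-tuning produces olfaction-specific representations whose feature space is closer to olfactory neural representations.
📊 Data & experiments: Benchmarks existing chemical foundation model representations against hand-designed physicochemical descriptors on odorant-receptor binding prediction. LORAX outperforms existing models on the prediction tasks.
⭐ Contributions: Systematically evaluates chemical foundation models on olfactory tasks and exposes the limits of pre-trained features for this problem. Proposes LORAX, whose fine-tuned representations are more informative and whose predictive performance exceeds existing methods.
Abstract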
Featurizing odorants to enable robust prediction of their properties is difficult due to the complex activation patterns that odorants evoke in the olfactory system. Structurally similar odorants can elicit distinct activation patterns in both the sensory periphery (i.e., at the receptor level) and downstream brain circuits (i.e., at a perceptual level). Despite efforts to design odorant features to better predict how they interact with the olfactory system, there is still no universally accepted approach to this problem. We demonstrate that feature-based approaches that rely on pre-trained foundation models to generate odorant representations do not significantly outperform classical hand-designed features on odorant-receptor binding tasks. Instead, we show that it is necessary to fine-tune these features to increase predictive performance. To show this, we introduce a new model that creates olfaction-specific representations: LoRA-based Odorant-Receptor Affinity prediction with CROSS-attention (LORAX). We compare existing chemical foundation model representations to hand-designed physicochemical descriptors using feature-based methods and identify large information overlap between these representations, highlighting the necessity of fine-tuning to generate novel and superior odorant representations. We show that LORAX produces a feature space more closely aligned with olfactory neural representation, enabling it to outperform existing models on predictive tasks.
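For readers unfamiliar with the low-rank adaptation that LORAX builds on, the sketch below shows the standard LoRA weight update, W + (alpha / r) * B A, in plain Python. It illustrates generic LoRA only; how LORAX places adapters or wires its cross-attention is not described in the abstract, and the helper names here are hypothetical.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def lora_weight(w, b, a, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank.

    w: frozen weight, shape (d_out, d_in).
    b: trainable factor, shape (d_out, r); usually initialised to zeros.
    a: trainable factor, shape (r, d_in).
    """
    rank = len(a)
    delta = matmul(b, a)
    scale = alpha / rank
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


if __name__ == "__main__":
    w = [[1.0, 0.0], [0.0, 1.0]]
    print(lora_weight(w, b=[[1.0], [2.0]], a=[[3.0, 4.0]], alpha=1.0))
```

Only `b` and `a` are trained, so the number of trainable parameters scales with the rank rather than with the full weight matrix, which is what makes fine-tuning a large chemical foundation model affordable.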
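Foundation / Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Unified #Multimodal Large Language Models #understanding #generation #hybrid tokenizer
TL;DR: We present Manzano, a simple and scalable unified framework that substantially reduces the tension between visual understanding and generation by coupling a hybrid image tokenizer with a well-curated training recipe.
🎯 Motivation: Multimodal LLMs that both understand and generate visual content hold great potential, but existing open-source models often face a performance trade-off between the two.
❓ Problem: Design a unified multimodal framework that eases the tension between understanding and generation through a hybrid visual tokenizer and training recipe.
🔍 Analysis: In existing unified models, visual understanding and generation often trade off against each other, which prevents both capabilities from coexisting in a single model.
🛠️ Method: A shared vision encoder feeds lightweight adapters that provide continuous embeddings for understanding and discrete tokens for generation; a unified autoregressive model predicts semantics, and an auxiliary diffusion decoder produces pixels.
📊 Data & experiments: Trained jointly on understanding and generation data; experiments show state-of-the-art results among unified models, competitive with specialist models particularly on text-rich tasks.
⭐ Contributions: A simple, scalable unified multimodal framework; a hybrid tokenizer that minimizes task conflict and benefits from scaling; validation that understanding and generation can be trained jointly.
Abstract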
Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models, and is competitive with specialist models, particularly on text-rich evaluation. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.
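Foundation / Frontier Models (incl. LLMs)
Multimodal Foundation Models
#generation #multimodal diffusion language model
🎯 Motivation: Thinking-aware generation is meant to improve performance on complex tasks, but its sequential, autoregressive form has a failure mode in which error propagation can degrade performance. The study diagnoses this issue systematically through cross-modal alignment analysis.
❓ Problem: Proposes MMaDA-Parallel, a parallel multimodal diffusion framework, to address performance degradation caused by misalignment between the reasoning text and the generated image. Continuous bidirectional cross-modal interaction keeps generation semantically consistent.
🔍 Analysis: Analysis on the new ParaBench benchmark shows that performance degradation correlates strongly with poor alignment between the generated reasoning and the final image, which highlights the weakness of existing autoregressive methods in cross-modal consistency.
🛠️ Method: MMaDA-Parallel supports continuous, bidirectional interaction between text and images throughout the denoising trajectory. Training consists of supervised fine-tuning followed by Parallel Reinforcement Learning (ParaRL), which applies semantic rewards along the trajectory to enforce cross-modal consistency.
📊 Data & experiments: Introduces ParaBench to evaluate both text and image output modalities. On ParaBench the model improves Output Alignment by 6.9% over the state-of-the-art model Bagel, with markedly better cross-modal alignment and semantic consistency.
⭐ Contributions: Proposes and validates MMaDA-Parallel, establishing a more robust paradigm for thinking-aware image synthesis, and introduces the ParaBench benchmark and the ParaRL optimization strategy for evaluating and improving cross-modal consistency.
Abstract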
While thinking-aware generation aims to improve performance on complex tasks, we identify a critical failure mode where existing sequential, autoregressive approaches can paradoxically degrade performance due to error propagation.
To systematically analyze this issue, we propose ParaBench, a new benchmark designed to evaluate both text and image output modalities. Our analysis using ParaBench reveals that this performance degradation is strongly correlated with poor alignment between the generated reasoning and the final image.
To resolve this, we propose a parallel multimodal diffusion framework, MMaDA-Parallel, that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory. MMaDA-Parallel is trained with supervised finetuning and then further optimized by Parallel Reinforcement Learning (ParaRL), a novel strategy that applies semantic rewards along the trajectory to enforce cross-modal consistency. Experiments validate that our model significantly improves cross-modal alignment and semantic consistency, achieving a 6.9% improvement in Output Alignment on ParaBench compared to the state-of-the-art model, Bagel, establishing a more robust paradigm for thinking-aware image synthesis.
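Foundation / Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Video Large Language Models #Information Flow Analysis #Video Question Answering
TL;DR: This paper presents a systematic analysis of where and how information flows in VideoLLMs for temporal reasoning in VideoQA, revealing key patterns and effective pathways.
🎯 Motivation: Despite the progress of VideoLLMs on video question answering, how they internally extract and propagate video and text information remains unclear. The paper aims to systematically reveal the internal information flow of VideoLLMs during temporal reasoning.
❓ Problem: Investigate where and how information flows inside VideoLLMs during temporal reasoning, in particular the key processes of cross-frame interaction and cross-modal integration.
🔍 Analysis: Consistent cross-layer patterns emerge: active cross-frame interaction in early-to-middle layers, followed by progressive video-language alignment and integration in middle layers, after which the model generates answers in middle-to-late layers.
🛠️ Method: Applies mechanistic interpretability techniques to analyze the internal information flow of VideoLLMs, and uses the analysis to identify effective information pathways, showing that performance is retained when a large share of attention edges is suppressed.
📊 Data & experiments: The analysis spans diverse VideoQA tasks. LLaVA-NeXT-7B-Video-FT retains its performance with about 58% of attention edges suppressed, which supports the existence of core information pathways.
⭐ Contributions: Provides a mechanistic blueprint of how VideoLLMs perform temporal reasoning, with practical insights for improving interpretability and downstream generalization.
Abstract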
Video Large Language Models (VideoLLMs) extend the capabilities of vision-language models to spatiotemporal inputs, enabling tasks such as video question answering (VideoQA). Despite recent advances in VideoLLMs, their internal mechanisms on where and how they extract and propagate video and textual information remain less explored. In this study, we investigate the internal information flow of VideoLLMs using mechanistic interpretability techniques. Our analysis reveals consistent patterns across diverse VideoQA tasks: (1) temporal reasoning in VideoLLMs initiates with active cross-frame interactions in early-to-middle layers, (2) followed by progressive video-language integration in middle layers. This is facilitated by alignment between video representations and linguistic embeddings containing temporal concepts. (3) Upon completion of this integration, the model is ready to generate correct answers in middle-to-late layers. (4) Based on our analysis, we show that VideoLLMs retain their VideoQA performance by selecting these effective information pathways while suppressing a substantial amount of attention edges, e.g., 58% in LLaVA-NeXT-7B-Video-FT. These findings provide a blueprint for how VideoLLMs perform temporal reasoning and offer practical insights for improving model interpretability and downstream generalization.
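Suppressing attention edges, as in finding (4), can be pictured as removing selected query-to-key connections and renormalizing what remains. The sketch below is an assumption-laden illustration on a tiny attention matrix: it zeroes post-softmax weights and renormalizes each row, which is equivalent to masking the corresponding logits before the softmax. The function name and the choice of blocked edges are hypothetical and are not taken from the paper.

```python
def knock_out_edges(attn, blocked):
    """Zero selected (query, key) attention edges and renormalise each row.

    attn: row-stochastic attention matrix as a list of rows.
    blocked: set of (query_index, key_index) pairs to suppress.
    """
    out = []
    for q, row in enumerate(attn):
        kept = [0.0 if (q, k) in blocked else w for k, w in enumerate(row)]
        total = sum(kept)
        # Leave an all-blocked row as zeros rather than dividing by zero.
        out.append([w / total for w in kept] if total > 0 else kept)
    return out


if __name__ == "__main__":
    attn = [[0.5, 0.5], [0.25, 0.75]]
    print(knock_out_edges(attn, blocked={(1, 0)}))
```

Comparing task accuracy before and after such an intervention is one way to test whether a given group of edges (for example, cross-frame edges in a band of layers) carries information the model actually needs.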
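Foundation / Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Omnimodal #Multimodal Learning #Discrete Flow Matching
🎯 Motivation: Next-generation multimodal foundation models capable of any-to-any generation and multi-turn interaction are core to general AI systems and pivotal for human-machine interaction.
❓ Problem: Existing autoregressive models struggle to balance understanding and generation, while hybrid or decoupled designs are redundant and limited in applicability (for example, to cross-modal retrieval). NExT-OMNI aims to overcome these limits through unified modeling.
🔍 Analysis: Most current multimodal models are constrained by autoregressive architectures, which leaves understanding and generation unbalanced. Existing strategies handle the tasks separately within a unified framework, but their redundant, non-integrated designs limit broader application.
🛠️ Method: Adopts a discrete flow paradigm, using metric-induced probability paths and kinetic optimal velocities for unified modeling that supports any-to-any understanding and generation, and broadens applicability through concise unified representations rather than task-decoupled designs.
📊 Data & experiments: Trained on large-scale interleaved text, image, video, and audio data; competitive on multimodal understanding and generation benchmarks, and better than prior unified models on multi-turn multimodal interaction and cross-modal retrieval.
⭐ Contributions: Introduces the open-source omnimodal foundation model NExT-OMNI, which achieves unified modeling via discrete flow matching with improved response efficiency and broader applicability.
Abstract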
Next-generation multimodal foundation models capable of any-to-any cross-modal generation and multi-turn interaction will serve as core components of artificial general intelligence systems, playing a pivotal role in human-machine interaction. However, most existing multimodal models remain constrained by autoregressive architectures, whose inherent limitations prevent a balanced integration of understanding and generation capabilities. Although hybrid and decoupling strategies have been explored to address these tasks within unified frameworks separately, their redundant, non-integrated designs limit their applicability to broader scenarios, such as cross-modal retrieval. In this work, we introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms. By leveraging metric-induced probability paths and kinetic optimal velocities, NExT-OMNI natively supports any-to-any understanding and generation with enhanced response efficiency, while enabling broader application scenarios through concise unified representations rather than task-decoupled designs. Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal understanding and generation benchmarks, while outperforming prior unified models in multi-turn multimodal interaction and cross-modal retrieval, highlighting its architectural advantages as a next-generation multimodal foundation model. The code is available at https://github.com/ritzz-ai/Next-OMNI.
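Foundation / Frontier Models (incl. LLMs)
Multimodal Foundation Models
#unified generation and understanding
🎯 Motivation: Unified multimodal LLMs promise seamless integration of understanding and generation, but existing autoregressive architectures face an inherent conflict between semantics and structure.
❓ Problem: Proposes ORION to address the root problem that optimizing structural reconstruction for generation causes catastrophic forgetting of semantic understanding in autoregressive models.
🔍 Analysis: When a monolithic autoregressive architecture is optimized for low-level reconstructability in generation, high-level semantic understanding suffers, producing a semantic-structural conflict.
🛠️ Method: A non-linear vision head decouples structural pressure from shared representations, a Representation Consistency Loss aligns semantics during generation, and a progressive training strategy ties the two together.
📊 Data & experiments: Trained on high-quality multimodal data; on a single autoregressive backbone, it matches or exceeds models with more complex designs.
⭐ Contributions: Validates monolithic autoregression as a simple, effective path toward integrated multimodal intelligence, balancing understanding and generation without task-specific parameters.
Abstract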
Unified multimodal Large Language Models (MLLMs) hold great promise for seamlessly integrating understanding and generation. However, monolithic autoregressive architectures, despite their elegance and conversational fluency, suffer from a fundamental semantic–structural conflict: optimizing for low-level reconstructability in generation leads to catastrophic forgetting of high-level semantic understanding. We present ORION, a unified framework that resolves this conflict through Decoupling and Alignment. A non-linear vision head decouples structural pressures from shared representations, while a novel Representation Consistency Loss explicitly aligns semantics during generation. Together with a curated progressive training recipe and high-quality multimodal data, our method enables balanced optimization of both capabilities. Built purely on a monolithic autoregressive backbone without task-specific separate parameters, ORION achieves performance on par with or exceeding recent state-of-the-art unified models that rely on more complex designs. These results validate monolithic autoregression as a simple, effective, and competitive path toward truly integrated multimodal intelligence.
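Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#document analysis #tampered text detection #vision foundation model
TL;DR:The first generalist Image Manipulation Localization (IML) model that unifies interpretable IML on four major IML domains, along with a high-quality dataset constructed by a novel method.
🎯 Motivation: Existing image manipulation localization (IML) methods rely on task-specific designs and do not support joint training across tasks well, which limits their performance in real applications.
❓ Problem: Proposes Omni-IML, the first generalist IML model, to address the performance drop caused by joint multi-task training while improving interpretability and breadth of applicability.
🔍 Observations: Jointly training on multiple IML tasks significantly degrades existing methods; the lack of high-quality datasets and interpretation modules further holds the field back.
🛠️ Method: Introduces three core components: a Modal Gate Encoder, a Dynamic Weight Decoder, and an Anomaly Enhancement module. An automatic chain-of-thought annotation technique produces the high-quality data.
📊 Data & Experiments: Builds the Omni-273k dataset, which includes natural language descriptions of tampered artifacts, and shows experimentally that a single model reaches state-of-the-art performance on all four major IML tasks.
⭐ Contributions: Proposes a generalist IML model together with a high-quality dataset and an interpretation module, offering a practical solution and a research direction for image forensics.
Full abstract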
Existing Image Manipulation Localization (IML) methods rely heavily on task-specific designs, making them perform well only on the target IML task, while joint training on multiple IML tasks causes significant performance degradation, hindering real applications. To this end, we propose Omni-IML, the first generalist model designed to unify IML across diverse tasks. Specifically, Omni-IML achieves generalization through three key components: (1) a Modal Gate Encoder, which adaptively selects the optimal encoding modality per sample, (2) a Dynamic Weight Decoder, which dynamically adjusts decoder filters to the task at hand, and (3) an Anomaly Enhancement module that leverages box supervision to highlight the tampered regions and facilitate the learning of task-agnostic features. Beyond localization, to support interpretation of the tampered images, we construct Omni-273k, a large high-quality dataset that includes natural language descriptions of tampered artifacts. It is annotated through our automatic, chain-of-thoughts annotation technique. We also design a simple-yet-effective interpretation module to better utilize these descriptive annotations. Our extensive experiments show that our single Omni-IML model achieves state-of-the-art performance across all four major IML tasks, providing a valuable solution for practical deployment and a promising direction of generalist models in image forensics. Our code and dataset are available at https://github.com/qcf-568/OmniIML.
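Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#unified model; generation helps understanding; 3d scene understanding; novel view synthesis
TL;DR:Generation helps understanding in 3D scenes.
🎯 Motivation: Explores the principle that generation facilitates understanding in 3D scenes, extending unified multimodal understanding and generation to 3D scenes based on multiview images.
❓ Problem: Addresses the separation between existing 3D understanding models and generation tasks by jointly modeling 3D scene understanding, novel view synthesis, and geometry estimation.
🔍 Observations: Combining explicit geometric constraints with spatiotemporal modeling lets generation tasks supply complementary information that enriches the model's holistic understanding of 3D scenes.
🛠️ Method: Proposes Omni-View, composed of an understanding model, a texture module, and a geometry module, trained with a two-stage strategy that jointly optimizes understanding and generation.
📊 Data & Experiments: Achieves a state-of-the-art score of 55.4 on VSI-Bench, outperforming existing specialized 3D understanding models, while performing strongly on novel view synthesis and 3D scene generation.
⭐ Contributions: Validates the "generation facilitates understanding" principle in 3D scenes and open-sources code and pretrained models for research on unified 3D understanding and generation.
Full abstract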
This paper presents Omni-View, which extends unified multimodal understanding and generation to 3D scenes based on multiview images, exploring the principle that "generation facilitates understanding". Consisting of an understanding model, a texture module, and a geometry module, Omni-View jointly models scene understanding, novel view synthesis, and geometry estimation, enabling synergistic interaction between 3D scene understanding and generation tasks. By design, it leverages the spatiotemporal modeling capabilities of its texture module, which is responsible for appearance synthesis, alongside the explicit geometric constraints provided by its dedicated geometry module, thereby enriching the model's holistic understanding of 3D scenes. Trained with a two-stage strategy, Omni-View achieves a state-of-the-art score of 55.4 on the VSI-Bench benchmark, outperforming existing specialized 3D understanding models, while simultaneously delivering strong performance in both novel view synthesis and 3D scene generation. The code and pretrained models are open-sourced at https://github.com/AIDC-AI/Omni-View .
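Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Robustness #Vision-Language-Action Models
TL;DR:We evaluate and enhance the robustness of VLAs under 17 uncertainties in 4 modalities.
🎯 Motivation: Vision-Language-Action (VLA) models face multimodal perturbations when deployed in the real world, yet existing methods typically address only visual perturbations and overlook the uncertainties that arise in actions, instructions, and other modalities. The paper aims to comprehensively evaluate and improve VLA robustness to cross-modal perturbations.
❓ Problem: Proposes RobustVLA to handle perturbations in VLA inputs and outputs, filling the gap left by prior work that gives little attention to robustness in modalities such as actions and environments.
🔍 Observations: Evaluating 17 perturbations across four modalities shows that actions are the most fragile modality, that existing visual robustness methods do not generalize to other modalities, and that π₀ is comparatively robust.
🛠️ Method: The method has two parts. For output robustness, it performs offline robust optimization against the worst-case action noise that maximizes mismatch in the flow matching objective. For input robustness, it requires the model to produce consistent actions across input variations that preserve task semantics. A multi-armed bandit formulation with an upper confidence bound algorithm automatically identifies the most harmful noise.
📊 Data & Experiments: On the LIBERO benchmark, RobustVLA yields average absolute gains of 12.6% on the π₀ backbone and 10.4% on the OpenVLA backbone. Inference is 50.6x faster than a visual robustness method that requires an external LLM. On a real FR5 robot, it outperforms π₀ by 65.6% in success rate with only 25 demonstrations.
⭐ Contributions: Systematically evaluates VLA robustness under perturbations in four modalities, proposes a unified robustness framework combining adversarial training with consistency constraints, and verifies its performance and efficiency gains in simulation and on a real robot.
Full abstract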
In Vision–Language–Action (VLA) models, robustness to real-world perturbations is critical for deployment. Existing methods target simple visual disturbances, overlooking the broader multi-modal perturbations that arise in actions, instructions, environments, and observations. Here, we first evaluate the robustness of mainstream VLAs under 17 perturbations across four modalities. We find that (1) actions are the most fragile modality, (2) existing visually robust VLAs do not gain robustness in other modalities, and (3) $\pi_0$ demonstrates superior robustness. To build multi-modal robust VLAs, we propose RobustVLA against perturbations in VLA inputs and outputs. For output robustness, we perform offline robust optimization against worst-case action noise that maximizes mismatch in the flow matching objective. This can be seen as adversarial training, label smoothing, and outlier penalization. For input robustness, we enforce consistent actions across input variations that preserve task semantics. To account for multiple perturbations, we formulate robustness as a multi-armed bandit problem and apply an upper confidence bound algorithm to automatically identify the most harmful noise. Experiments on LIBERO demonstrate that RobustVLA delivers absolute gains over baselines of 12.6\% on the $\pi_0$ backbone and 10.4\% on the OpenVLA backbone across all 17 perturbations, achieving 50.6x faster inference than the existing visual-robust BYOVLA, which requires external LLMs, and a 10.4\% gain under mixed perturbations. On the real-world FR5 robot, under four types of multimodal perturbations, RobustVLA shows strong low-data performance, outperforming $\pi_0$ by $65.6\%$ success rate with 25 demonstrations. Even with abundant demos, our method still outperforms $\pi_0$ by 30\% success rate. Code and demo videos are available at https://github.com/gakakulicc/RobustVLA.
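The bandit step described in this abstract can be illustrated with a generic UCB1 loop. This is a minimal sketch, not the authors' implementation: the arm definitions, the binary "harm" signal, the exploration constant `c`, and the `toy_harm` simulator are all assumptions made for illustration.

```python
import math
import random


def ucb_select(counts, mean_harm, t, c=2.0):
    """Pick the perturbation arm with the highest upper confidence bound."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm  # try every arm once before using the bound
    scores = [m + math.sqrt(c * math.log(t) / n) for m, n in zip(mean_harm, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def run_bandit(harm_fn, num_arms, steps, seed=0):
    """Repeatedly pick a perturbation type and track how harmful it turned out to be."""
    rng = random.Random(seed)
    counts = [0] * num_arms
    mean_harm = [0.0] * num_arms
    for t in range(1, steps + 1):
        arm = ucb_select(counts, mean_harm, t)
        harm = harm_fn(arm, rng)
        counts[arm] += 1
        mean_harm[arm] += (harm - mean_harm[arm]) / counts[arm]  # running mean
    return counts, mean_harm


def toy_harm(arm, rng):
    """Hypothetical simulator: 1.0 means the policy failed under this perturbation."""
    true_failure_rate = [0.1, 0.3, 0.8, 0.2]
    return 1.0 if rng.random() < true_failure_rate[arm] else 0.0


counts, mean_harm = run_bandit(toy_harm, num_arms=4, steps=2000)
most_pulled = max(range(4), key=counts.__getitem__)
```

In this toy setup the loop spends most of its budget on the arm with the highest failure rate, which is the behavior one would want when selecting the most harmful noise to train against.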
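Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#3D Computer Vision #Multimodal Large Language Model #Spatial Intelligence #Embodied AI
TL;DR:We show that RGB-only MLLMs are fundamentally flawed for spatial reasoning due to an inherent geometric ambiguity, and propose a camera-aware MLLM framework that incorporates camera intrinsics for robust, generalizable spatial intelligence.
🎯 Motivation: RGB-only multimodal large language models show promise on spatial reasoning tasks such as 3D localization and navigation, but the authors argue they are fundamentally flawed because they ignore camera parameters. These models entangle an object's physical properties with the camera's perspective and cannot resolve the resulting geometric ambiguity, which limits generalization.
❓ Problem: Proposes a camera-aware MLLM framework to overcome the weak cross-camera generalization of RGB-only models. The framework injects camera intrinsics, applies data augmentation, and distills geometric priors so that the model learns generalizable spatial reasoning.
🔍 Observations: Because RGB-only models ignore camera parameters, they conflate object properties with camera perspective and face an irresolvable ambiguity. As a result they overfit to the camera distribution of the training data instead of learning true, generalizable 3D geometric principles.
🛠️ Method: Camera intrinsics are injected through a dense embedding that conditions each visual token. A camera-aware data augmentation strategy synthetically varies camera parameters, forcing the model to disentangle camera properties from scene content. Geometric priors are distilled from a 3D vision foundation model.
📊 Data & Experiments: Extensive experiments on spatially-grounded tasks, including cross-camera generalization tests, show that camera-aware models substantially outperform naive RGB-only models, especially when the camera distribution changes.
⭐ Contributions: Exposes a fundamental flaw of RGB-only MLLMs in spatial reasoning and proposes a camera-aware framework as the remedy, arguing that camera awareness is a prerequisite for robust, generalizable spatial intelligence.
Full abstract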
Multimodal Large Language Models (MLLMs) that directly process RGB inputs for tasks like 3D localization and navigation have shown remarkable potential. However, we argue that these "RGB-only" approaches are fundamentally flawed in their ability to generalize across cameras. By ignoring camera parameters, they entangle an object's physical properties with the camera's perspective, creating an irresolvable ambiguity. We show this leads MLLMs to overfit to the training camera distribution, rather than learning true and generalizable 3D geometric principles. To address this, we propose a Camera-Aware MLLM framework for spatial MLLMs. It learns generalizable spatial reasoning by: (i) injecting camera intrinsics via a dense embedding that conditions each visual token; (ii) introducing a camera-aware data augmentation strategy that synthetically varies camera parameters, forcing the model to disentangle camera properties from scene content; and (iii) distilling geometric priors from a 3D vision foundation model. Extensive experiments demonstrate that camera-aware MLLMs substantially outperform their naive counterparts, particularly in cross-camera generalization tests on spatially-grounded tasks, indicating that camera-awareness is not only beneficial but also a prerequisite for robust and generalizable spatial intelligence in MLLMs.
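One common way to turn camera intrinsics into a dense, per-token signal is to compute the viewing ray of every pixel and pool the rays per image patch. The sketch below assumes a pinhole camera and a 16-pixel patch size; the abstract does not specify the embedding, so the function names and the ray-based encoding are illustrative assumptions rather than the paper's design.

```python
import numpy as np


def pixel_ray_directions(fx, fy, cx, cy, height, width):
    """Return an (H, W, 3) array of unit ray directions in the camera frame."""
    # Pixel centers; meshgrid with default 'xy' indexing gives (H, W) grids.
    us, vs = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    dirs = np.stack([(us - cx) / fx, (vs - cy) / fy, np.ones_like(us)], axis=-1)
    return dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)


def patch_ray_embedding(rays, patch):
    """Average the rays inside each patch, giving one 3-vector per visual token."""
    h, w, _ = rays.shape
    grid = rays[: h - h % patch, : w - w % patch]
    grid = grid.reshape(h // patch, patch, w // patch, patch, 3)
    return grid.mean(axis=(1, 3))


rays = pixel_ray_directions(fx=500.0, fy=500.0, cx=320.0, cy=240.0, height=480, width=640)
patch_rays = patch_ray_embedding(rays, patch=16)  # one ray per 16x16 patch
```

Changing `fx` and `fy` changes every patch ray, so two images with identical pixels but different focal lengths would present different conditioning signals to the model.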
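Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#3D Computer Vision #3D Vision-language Modeling #Part-aware 3D understanding #Multimodal Large Language Model
🎯 Motivation: Existing 3D multimodal large language models lack a structured understanding of object parts, which limits part-level generation and editing. The work aims to build a part-aware, native 3D multimodal model that unifies diverse 3D tasks.
❓ Problem: Addresses the lack of structured outputs and a unified interface in 3D tasks. Diverse tasks such as part-level detection, description, and editing are encoded as a single, coherent structured program behind one interface.
🔍 Observations: Existing 3D multimodal models usually couple symbolic planning with geometric synthesis, which limits flexibility and extensibility. Part-level understanding has to handle semantic descriptions, bounding boxes, and edit commands together, which calls for structured outputs.
🛠️ Method: A dual-encoder architecture is pre-trained to disentangle structure from semantics. Instruction tuning then teaches the model to take an RGB point cloud and a natural language prompt and autoregressively generate a structured token sequence encoding part-level bounding boxes, semantic descriptions, and edit commands.
📊 Data & Experiments: The model is instruction-tuned on a large-scale, part-centric dataset. Experiments show state-of-the-art performance in grounded Q&A, compositional generation, and localized editing through one unified interface.
⭐ Contributions: Proposes Part-X-MLLM, a part-aware native 3D multimodal large language model that unifies diverse 3D tasks as structured programs. By decoupling symbolic planning from geometric synthesis, it offers a single language-native frontend for any compatible geometry engine.
Full abstract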
We introduce Part-X-MLLM, a native 3D multimodal large language model that unifies diverse 3D tasks by formulating them as programs in a structured, executable grammar. Given an RGB point cloud and a natural language prompt, our model autoregressively generates a single, coherent token sequence encoding part-level bounding boxes, semantic descriptions, and edit commands. This structured output serves as a versatile interface to drive downstream geometry-aware modules for part-based generation and editing. By decoupling the symbolic planning from the geometric synthesis, our approach allows any compatible geometry engine to be controlled through a single, language-native frontend. We pre-train a dual-encoder architecture to disentangle structure from semantics and instruction-tune the model on a large-scale, part-centric dataset. Experiments demonstrate that our model excels at producing high-quality, structured plans, enabling state-of-the-art performance in grounded Q&A, compositional generation, and localized editing through one unified interface. Project page: https://chunshi.wang/Part-X-MLLM/
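Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#vision–language learning #VQA #replay-free
🎯 Motivation: Foundation vision-language models deployed on non-stationary data streams must be updated continually without access to past data, yet naive fine-tuning damages their zero-shot recognition and prompt robustness.
❓ Problem: Seeks a replay-free continual learning principle that preserves the pre-trained model's cross-modal generalization under domain or prompt shifts.
🔍 Observations: Conventional fine-tuning breaks the cross-modal alignment structure learned in pre-training, so the model loses zero-shot generalization to earlier tasks and robustness to prompt changes as it learns new data.
🛠️ Method: Proposes Prompt-Invariant CCA Certificates (Pi-CCA), a geometry-first approach that summarizes image-text alignment in a compact certificate capturing the top-k canonical spectrum and subspace. During adaptation, the model matches this summary using only mini-batch statistics and gains prompt robustness by averaging over perturbations.
📊 Data & Experiments: Evaluated on four benchmarks: MTIL, X-TAIL, VLCL, and ConStruct-VL. Pi-CCA achieves state-of-the-art performance among replay-free continual multimodal learning methods.
⭐ Contributions: By optimizing alignment invariants rather than proxy signals, Pi-CCA offers a simple, generator-free, constant-memory path to continual adaptation while retaining strong zero-shot ability and resilience to prompt and style shifts.
Full abstract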
When deployed on non-stationary data streams, foundation vision-language models require continual updates without access to past data. However, naive fine-tuning undermines their zero-shot recognition capabilities and prompt robustness. We seek a replay-free principle that preserves pre-trained cross-modal generalization under domain/prompt shifts. We introduce Prompt-Invariant CCA Certificates (Pi-CCA), a geometry-first approach that summarizes image--text alignment with a compact certificate capturing the top-k canonical spectrum and subspace. During adaptation, we match this summary using only mini-batch statistics and induce prompt robustness via averaging over perturbations. Across MTIL, X-TAIL, VLCL, and ConStruct-VL, Pi-CCA achieves state-of-the-art performance among replay-free methods.
By optimizing alignment invariants rather than proxy signals, Pi-CCA provides a simple, generator-free, constant-memory path to continual adaptation with strong zero-shot retention and resilience to prompt/style shifts.
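The "canonical spectrum" part of such a certificate can be computed from a mini-batch with standard CCA. The sketch below is a textbook whitening-plus-SVD computation, not the paper's code: the regularizer `eps`, the squared-error `certificate_gap` penalty, and the synthetic features are assumptions, and the subspace term mentioned in the abstract is omitted.

```python
import numpy as np


def topk_canonical_correlations(x, y, k, eps=1e-4):
    """Top-k canonical correlations between two feature batches of shape (n, d)."""
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    n = x.shape[0]
    cxx = x.T @ x / (n - 1) + eps * np.eye(x.shape[1])  # regularized covariances
    cyy = y.T @ y / (n - 1) + eps * np.eye(y.shape[1])
    cxy = x.T @ y / (n - 1)

    def inv_sqrt(c):
        vals, vecs = np.linalg.eigh(c)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    # Singular values of the whitened cross-covariance are the canonical correlations.
    t = inv_sqrt(cxx) @ cxy @ inv_sqrt(cyy)
    return np.linalg.svd(t, compute_uv=False)[:k]


def certificate_gap(reference_spectrum, x, y):
    """Hypothetical penalty: squared distance between stored and current spectra."""
    current = topk_canonical_correlations(x, y, k=len(reference_spectrum))
    return float(np.sum((current - reference_spectrum) ** 2))


rng = np.random.default_rng(0)
shared = rng.normal(size=(256, 4))  # signal common to both modalities
img = np.hstack([shared, rng.normal(size=(256, 4))])
txt = np.hstack([shared + 0.1 * rng.normal(size=(256, 4)), rng.normal(size=(256, 4))])
reference = topk_canonical_correlations(img, txt, k=4)
```

During adaptation, a term like `certificate_gap(reference, new_img_feats, new_txt_feats)` could be added to the training loss so that the alignment spectrum stays close to the stored one without replaying old data.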
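Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#multimodal grounding #MLLM hallucination #alignment
🎯 Motivation: Multimodal large language models (MLLMs) perform well on vision-language tasks but still struggle with fine-grained visual understanding and are prone to hallucination. Both problems stem from over-reliance on linguistic priors at the expense of actual visual information, which leaves outputs detached from the image.
❓ Problem: Aims to strengthen the visual understanding of MLLMs and suppress hallucinations with a post-multimodal alignment framework that anchors outputs firmly in visual and textual evidence.
🔍 Observations: Hallucinations arise mainly because the model leans on linguistic priors during inference and is distracted from the real visual information, so outputs often fail to reflect the objects and relations actually present in the image.
🛠️ Method: Proposes the MMGrounded-PostAlign framework, with a visual grounding module and a textual grounding module. The visual module uses a negative rejection mechanism to identify objects in the image and exclude non-existent entities; the textual module uses a selective reasoning mechanism that adjusts the reasoning strategy to the complexity of the query.
📊 Data & Experiments: Evaluated extensively on benchmarks including POPE, HaloQuest, ReasonSeg, MME, and MMBench, with significant improvements in fine-grained visual understanding and hallucination suppression.
⭐ Contributions: Introduces a post-multimodal alignment framework whose dual grounding anchors outputs in visual and textual evidence, along with negative rejection and selective reasoning mechanisms that mitigate hallucination and make multimodal alignment more robust.
Full abstract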
Multimodal Large Language Models (MLLMs) have shown remarkable performance in vision-language tasks, such as image captioning and visual question answering. However, these models often struggle with fine-grained visual understanding and are prone to hallucinations, primarily due to over-reliance on linguistic priors that distract them from leveraging actual visual information. This results in outputs that are often unanchored in the visual content, leading to errors. To address these challenges, we introduce MMGrounded-PostAlign, a post-multimodal alignment framework designed to enhance the visual understanding capabilities of MLLMs and mitigate hallucinations. In the framework, the visual grounding module identifies the referred objects in the image, while the textual grounding module generates the rationale for the final answer. This dual grounding approach ensures that outputs are firmly anchored in both visual and textual evidence. In particular, we incorporate a negative rejection mechanism within the visual grounding module to distinguish between grounded entities and non-existent objects influenced by linguistic biases. Moreover, we propose a selective reasoning mechanism within the textual grounding module to adjust the model’s reasoning strategy based on the complexity of the query. These innovations together work to resolve the issues associated with hallucinations and enhance the overall alignment between visual and textual modalities. Extensive evaluations on benchmarks such as POPE, HaloQuest, ReasonSeg, MME, and MMBench demonstrate significant improvements in fine-grained visual understanding and hallucination suppression, showcasing the effectiveness of our approach in real-world multimodal tasks.
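Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#MLLMs #visual attention sink
🎯 Motivation: Multimodal large language models (MLLMs) do well on vision-language tasks, but their output layers are often suboptimal: intermediate decoder layers outperform the final one, which indicates underused model capacity.
❓ Problem: Targets the failed modality fusion and underuse of visual information caused by the Visual Attention Re-sinking phenomenon in MLLMs, in order to improve model performance.
🔍 Observations: Visual Attention Re-sinking is caused by sparse attention gradients that result from the dominance of textual supervision. Attention heads evolve into sink heads that prioritize low-semantic backgrounds, biasing outputs toward textual priors.
🛠️ Method: Proposes a parameter-free Sink Attention Dynamic Sparsification (SADS) framework that dynamically identifies and retains all vision heads, which focus on semantically relevant regions, while sparsifying sink heads and preserving essential global context through a shared head.
📊 Data & Experiments: Integrated into a range of MLLMs and evaluated on 20 benchmarks spanning five task categories (visual grounding, general VQA, OCR-related VQA, vision-centric tasks, and visual hallucination mitigation), the method yields substantial gains and speeds up inference by 10.3%.
⭐ Contributions: By exposing and reversing Visual Attention Re-sinking, the work offers a new route to maximizing MLLM capability; the parameter-free framework surpasses supervised fine-tuning.
Full abstract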
Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision-language tasks, yet they frequently exhibit suboptimal output layers, where intermediate decoder layers outperform the final ones, signaling underutilized model capacity. In this work, we delve into the root causes and attribute this issue to the Visual Attention Re-sinking phenomenon, precipitated by attention gradient sparsity driven by textual supervision dominance. This degradation causes attention heads to evolve into sink heads that prioritize low-semantic backgrounds, thereby disrupting modality fusion, neglecting visual information, and biasing outputs toward textual priors, ultimately impairing model performance. To mitigate this, we introduce a parameter-free Sink Attention Dynamic Sparsification (SADS) framework that dynamically identifies and retains all vision heads (concentrating visual attention on semantically relevant regions) while sparsifying sink heads, preserving essential global context through a shared head. Integrated into diverse MLLMs, our framework yields substantial performance gains across 20 benchmarks spanning five task categories (visual grounding, general VQA, OCR-related VQA, vision-centric tasks, and visual hallucination mitigation), surpassing supervised fine-tuning while boosting inference speed by 10.3\%. This approach offers a novel avenue for maximizing MLLMs' capabilities.
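Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Multimodal LLM #Reinforcement Learning #Vision Model #Visual representation
TL;DR:We study how SFT and RL affect not only MLLMs but also their vision encoders, and formulate a simple recipe, PIVOT, for evolving vision models for use in MLLM.
🎯 Motivation: MLLM research has generally underestimated the role of the vision encoder and overlooked how reinforcement learning affects visual representations. The paper sets out to show how different training strategies reshape both the vision encoder and the MLLM as a whole.
❓ Problem: Analyzes experimentally how SFT and RL affect the representational ability of the vision encoder, and develops an efficient way to improve the vision backbone of an MLLM.
🔍 Observations: RL shows a clear advantage over supervised fine-tuning on strongly vision-related VQA tasks and produces stronger, more localized visual representations, improving the vision encoder.
🛠️ Method: Proposes Preference-Instructed Vision OpTimization (PIVOT), a recipe for training vision encoders for MLLMs at less than 1% of the computational cost of standard vision pretraining.
📊 Data & Experiments: Visual representations are assessed through experiments ranging from ImageNet classification and segmentation to gradient visualization, which support the advantage of PIVOT.
⭐ Contributions: Shows that the training strategy has a decisive effect on visual representations and provides an efficient way to build strong vision backbones at much lower compute cost.
Full abstract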
A dominant assumption in Multimodal Language Model (MLLM) research is that its performance is largely inherited from the LLM backbone, given its immense parameter scale and remarkable capabilities.
This has created a void in the understanding of the vision encoder, which determines 'how MLLMs perceive images'.
The recent shift in MLLM training paradigms, from Supervised Finetuning (SFT) to Reinforcement Learning (RL), magnifies this oversight—namely, the significant lack of analysis on how such training reshapes the vision encoder as well as the MLLM.
To address this, we first investigate the impact of training strategies on MLLMs, where RL shows a clear advantage over SFT on strongly vision-related VQA benchmarks.
Motivated by this, we conduct a critical yet under-explored analysis of the vision encoder of MLLMs through diverse and in-depth experiments, ranging from ImageNet classification and segmentation to gradient visualization.
Our results demonstrate that an MLLM's post-training strategy (i.e., SFT or RL) not only leads to distinct outcomes on MLLM downstream tasks, but also fundamentally reshapes the MLLM's underlying visual representations.
Specifically, our main finding is that RL produces stronger and more localized visual representations compared to SFT, boosting the ability of the vision encoder for MLLM.
We then reframe our findings into a simple recipe for building strong vision encoders for MLLMs, Preference-Instructed Vision OpTimization (PIVOT).
When integrated into MLLMs, a PIVOT-trained vision encoder outperforms even larger and more heavily-trained counterparts, despite requiring less than 1\% of the computational cost of standard vision pretraining. This result opens an effective and efficient path for advancing the vision backbones of MLLMs.
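Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#3d spatial reasoning #3d visual grounding
TL;DR:3D visual grounding is one of the cornerstones of spatial reasoning
🎯 Motivation: Explores effective spatial representations that bridge 3D visual grounding and spatial reasoning, with the goal of a unified, self-contained 3D spatial reasoning framework. Representational shortcomings in existing 3D LLMs prevent the two from being integrated deeply.
❓ Problem: Proposes the GS-Reasoner model and a dual-path pooling mechanism that builds a unified 3D representation, enabling autoregressive grounding without external modules and improving spatial reasoning. The GCoT dataset is created to support joint study of the two.
🔍 Observations: Current 3D LLMs lack a unified representation that captures semantic and geometric information together. They therefore either ground poorly or depend heavily on external modules, which blocks seamless integration of grounding and spatial reasoning.
🛠️ Method: A dual-path pooling mechanism tightly aligns geometric features with semantic and positional cues to build a unified, image patch-based 3D representation. With this representation, GS-Reasoner performs autoregressive 3D visual grounding entirely without external modules.
📊 Data & Experiments: Introduces the GCoT dataset, which contains 3D bounding box annotations and step-by-step reasoning paths. Extensive experiments show strong 3D visual grounding results, which in turn significantly improve spatial reasoning to a state-of-the-art level.
⭐ Contributions: Presents GS-Reasoner, the first 3D LLM to achieve autoregressive grounding without external modules, establishing a unified and self-contained spatial reasoning framework, and releases the GCoT dataset as a resource connecting visual grounding and spatial reasoning.
Full abstract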
In this paper, we claim that 3D visual grounding is one of the cornerstones of spatial reasoning and introduce the $\textit{Grounded-Spatial Reasoner (GS-Reasoner)}$ to explore effective spatial representations that bridge the gap between them. Existing 3D LLMs suffer from the absence of a unified 3D representation capable of jointly capturing semantic and geometric information. This deficiency is manifested either in poor performance on grounding or in an excessive reliance on external modules, ultimately hindering the seamless integration of grounding and spatial reasoning. To address this, we propose a simple yet effective \emph{dual-path pooling} mechanism that tightly aligns geometric features with both semantic and positional cues, constructing a unified image patch-based 3D representation that encapsulates all essential information without extra tokens. Leveraging this holistic representation, GS-Reasoner is the first 3D LLM that achieves autoregressive grounding entirely without external modules while delivering performance comparable to state-of-the-art models, establishing a unified and self-contained framework for 3D spatial reasoning. To further bridge grounding and spatial reasoning, we introduce the $\textit{Grounded Chain-of-Thought (GCoT)}$ dataset. This dataset is meticulously curated to include both 3D bounding box annotations for objects referenced in reasoning questions and step-by-step reasoning paths that integrate grounding as a core component of the problem-solving process. Extensive experiments demonstrate that GS-Reasoner achieves impressive results on 3D visual grounding, which in turn significantly enhances its spatial reasoning capabilities, leading to state-of-the-art performance.
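Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#MLLM #multi-modal #reasoning
🎯 Motivation: The reasoning component of current multimodal large language models (MLLMs) often relies on an outdated LLM, and upgrading it directly requires expensive retraining.
❓ Problem: Proposes a perception-reasoning decoupling framework that separates the MLLM's perception from an external text-only LLM reasoner, so that frontier text reasoning models can be used at low cost.
🔍 Observations: Existing MLLMs are held back by their outdated internal LLMs, and each upgrade requires repeating vision-language alignment training, which is costly and inefficient.
🛠️ Method: A modular perception-reasoning decoupling design has the MLLM focus on producing detailed textual descriptions. Visual Perception Optimization (VPO), a reinforcement learning algorithm, optimizes the perceptual output so that it aligns with the downstream reasoning task.
📊 Data & Experiments: The RAPID framework is validated on several multimodal reasoning benchmarks, where it yields significant gains and can be paired with different state-of-the-art LLM reasoners without retraining.
⭐ Contributions: Establishes a scalable, decoupled paradigm for multimodal reasoning; proposes the VPO algorithm to align perception with the reasoning task; and enables flexible inference-time scaling that sharply reduces the cost of model updates.
Full abstract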
Recent breakthroughs in reasoning language models have significantly advanced text-based reasoning. On the other hand, Multi-modal Large Language Models (MLLMs) still lag behind, hindered by their outdated internal LLMs. Upgrading these LLMs is often prohibitively expensive, as it requires costly vision-language alignment retraining. To address this issue, we introduce Perception-Reasoning Decoupling, which modularizes the MLLM’s reasoning component and makes it easily replaceable. This approach redefines the MLLM's role to convert multi-modal inputs into detailed textual outputs that can be processed by any powerful, external, text-only LLM reasoners. To align the MLLM's perceptual output with the final reasoning task, we propose a novel reinforcement learning algorithm called Visual Perception Optimization (VPO). VPO rewards the MLLM based on the correctness of answers generated by the external reasoner to produce faithful and query-relevant captions. Together, this decoupling pipeline and VPO form our Reasoning-Aligned PerceptIon Decoupling (RAPID) approach. Empirical results show that RAPID achieves significant performance gains on multi-modal reasoning benchmarks. Crucially, RAPID enables a novel inference-time scaling paradigm: Once trained with VPO, the MLLM can be paired with any state-of-the-art LLM reasoner for consistent performance improvement without retraining.
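The decoupled pipeline and the answer-correctness reward described above can be sketched in a few lines. Everything here is a stand-in: `toy_captioner` and `toy_reasoner` are hypothetical placeholders for the MLLM and the external text-only LLM, the prompt template is invented, and exact-match scoring is only one simple way to judge correctness.

```python
def decoupled_answer(image, question, captioner, reasoner):
    """Perception step produces text; a separate text-only reasoner answers from it."""
    caption = captioner(image, question)
    prompt = f"Image description: {caption}\nQuestion: {question}\nAnswer:"
    return caption, reasoner(prompt)


def vpo_reward(answer, reference):
    """Reward the caption by whether the external reasoner's answer is correct."""
    return 1.0 if answer.strip().lower() == reference.strip().lower() else 0.0


def toy_captioner(image, question):
    # Hypothetical stand-in for the MLLM; 'image' is a dict of known facts.
    return f"The image shows {image['count']} red apples on a table."


def toy_reasoner(prompt):
    # Hypothetical stand-in for a text-only LLM: returns the first number it sees.
    numbers = [word for word in prompt.split() if word.isdigit()]
    return numbers[0] if numbers else "unknown"


caption, answer = decoupled_answer(
    {"count": 3}, "How many apples are there?", toy_captioner, toy_reasoner
)
reward = vpo_reward(answer, "3")
```

Because the reasoner only sees text, swapping `toy_reasoner` for a stronger model needs no change to the perception side, which is the property the abstract highlights.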
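Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#large vision-language models multi-round visual reasoning
TL;DR:We propose RegionReasoner, a reinforcement learning framework that enforces grounded reasoning by requiring each reasoning trace to explicitly cite the corresponding reference bounding boxes.
🎯 Motivation: Current large vision-language models rely on single-step or text-only reasoning, which limits their ability to refine understanding iteratively across multiple visual contexts. New benchmarks and methods are needed.
❓ Problem: Targets the lack of spatial grounding in reasoning traces during multi-round visual reasoning. A reinforcement learning framework requires reasoning steps to cite bounding boxes explicitly, ensuring grounded reasoning and semantic coherence.
🔍 Observations: In iterative reasoning scenarios, existing systems fail to combine global scene information with local region information, leaving spatial grounding imprecise and inconsistent and hurting multi-round reasoning.
🛠️ Method: RegionReasoner uses reinforcement learning to force each reasoning trace to cite the reference bounding boxes, applies a global-local consistency reward that aligns reasoning steps with scene and region captions, and is optimized with structured rewards that combine grounding fidelity and semantic alignment.
📊 Data & Experiments: Builds RegionDial-Bench, a multi-round visual reasoning benchmark covering detection and segmentation tasks. RegionReasoner-7B considerably improves multi-round reasoning accuracy, spatial grounding precision, and global-local consistency.
⭐ Contributions: Introduces the new multi-round visual reasoning benchmark RegionDial-Bench and the RegionReasoner reinforcement learning framework for grounded reasoning, establishing a strong baseline for iterative visual reasoning.
Full abstract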
Large vision-language models have achieved remarkable progress in visual reasoning, yet most existing systems rely on single-step or text-only reasoning, limiting their ability to iteratively refine understanding across multiple visual contexts.
To address this limitation, we introduce a new multi-round visual reasoning benchmark with training and test sets spanning both detection and segmentation tasks, enabling systematic evaluation under iterative reasoning scenarios.
We further propose RegionReasoner, a reinforcement learning framework that enforces grounded reasoning by requiring each reasoning trace to explicitly cite the corresponding reference bounding boxes, while maintaining semantic coherence via a global–local consistency reward.
This reward extracts key objects and nouns from both global scene captions and region-level captions, aligning them with the reasoning trace to ensure consistency across reasoning steps.
RegionReasoner is optimized with structured rewards combining grounding fidelity and global–local semantic alignment.
Experiments on the detection and segmentation tasks of our newly introduced benchmark, RegionDial-Bench, show that RegionReasoner-7B considerably improves multi-round reasoning accuracy, spatial grounding precision, and global–local consistency, establishing a strong baseline for this emerging research direction.
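A structured reward of the kind described above might combine a box-citation term with a keyword-consistency term. The sketch below is an assumption-laden illustration: the `[x1, y1, x2, y2]` citation format, exact-match box comparison, keyword overlap as the consistency measure, and the equal weights are all invented here and are not taken from the paper.

```python
import re

# Assumed citation format inside a reasoning trace: [x1, y1, x2, y2] with integers.
BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def citation_reward(trace, reference_boxes):
    """Fraction of reference boxes that the trace cites explicitly."""
    cited = {tuple(map(int, match)) for match in BOX_PATTERN.findall(trace)}
    refs = {tuple(box) for box in reference_boxes}
    return len(cited & refs) / len(refs) if refs else 0.0


def consistency_reward(trace, caption_keywords):
    """Fraction of caption keywords (objects, nouns) that also appear in the trace."""
    words = set(re.findall(r"[a-z]+", trace.lower()))
    keys = {keyword.lower() for keyword in caption_keywords}
    return len(words & keys) / len(keys) if keys else 0.0


def total_reward(trace, reference_boxes, caption_keywords, w_ground=0.5, w_consist=0.5):
    """Weighted sum of the grounding and consistency terms (weights are arbitrary)."""
    return (w_ground * citation_reward(trace, reference_boxes)
            + w_consist * consistency_reward(trace, caption_keywords))


trace = "The dog in [10, 20, 110, 220] sits left of the bench in [150, 40, 300, 200]."
reward = total_reward(
    trace,
    reference_boxes=[(10, 20, 110, 220), (150, 40, 300, 200)],
    caption_keywords=["dog", "bench", "park"],
)
```

Here both reference boxes are cited and two of the three caption keywords appear in the trace, so the reward is high but not maximal.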
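Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#foundation models #open vocabulary segmentation #semantic instance segmentation #object tracking
TL;DR:We present a strong model and challenging benchmark to advance open-vocabulary concept segmentation in images and videos.
🎯 Motivation: Open-vocabulary concept segmentation poses challenges for detecting and tracking objects in images and videos, and progress needs a unified model and high-quality data.
❓ Problem: Proposes a model that handles segmentation and tracking in a unified way from concept prompts, addressing concept-driven instance segmentation and identity assignment in images and videos.
🔍 Observations: Current methods fall short in open-vocabulary segmentation accuracy and concept recognition, and struggle with semantic complexity and data diversity.
🛠️ Method: Builds a unified model that combines an image-level detector with a memory-based video tracker, driven by concept prompts. A presence head decouples recognition from localization and improves detection accuracy.
📊 Data & Experiments: Develops a high-quality image and video dataset with 4M unique concept labels for training and evaluation. Experiments show the model doubles the accuracy of existing systems on image and video segmentation with concept prompts.
⭐ Contributions: Presents the SAM 3 model and the new SA-Co benchmark, both released openly, providing an effective tool and a challenging standard for promptable concept segmentation.
Full abstract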
We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., “yellow school bus”), image exemplars, or a combination of both. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and unique identities for all matching object instances. To advance PCS, we build a scalable data engine that produces a high-quality dataset with 4M unique concept labels, including hard negatives, across images and videos. Our model consists of an image-level detector and a memory-based video tracker that share a single backbone. Recognition and localization are decoupled with a presence head, which boosts detection accuracy. SAM 3 doubles the accuracy of existing systems in both image and video PCS, and improves previous SAM capabilities on visual segmentation tasks. We open source SAM 3 along with our new Segment Anything with Concepts (SA-Co) benchmark for promptable concept segmentation.
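One plausible reading of "decoupling recognition and localization with a presence head" is that a single image-level score answers whether the concept is present at all, while per-instance scores only rank candidate locations. The sketch below multiplies the two; this combination rule, the function name, and the threshold are assumptions for illustration and are not confirmed by the abstract.

```python
def combine_scores(presence_prob, instance_scores, threshold=0.5):
    """Gate per-instance localization scores by an image-level presence probability.

    Returns the combined scores and the indices of instances that pass the threshold.
    """
    final = [presence_prob * score for score in instance_scores]
    keep = [index for index, score in enumerate(final) if score >= threshold]
    return final, keep


# Concept likely present: the well-localized candidate survives.
final_present, keep_present = combine_scores(0.9, [0.8, 0.3])

# Concept likely absent (e.g. a hard negative prompt): nothing survives,
# even though one candidate looks well localized.
final_absent, keep_absent = combine_scores(0.1, [0.8, 0.3])
```

Under this reading, hard negatives are handled by the presence score alone, so the per-instance scores do not need to encode whether the concept exists in the image.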
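Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#computer vision #human pose #segmentation #transformers #foundation models
TL;DR:High resolution transformers for human-centric images.
🎯 Motivation: Human-centric vision tasks call for an efficient model that handles high-resolution images and a diverse set of tasks while improving generalization and output accuracy.
❓ Problem: Addresses the shortcomings of the previous generation on complex human-centric vision tasks, including feature quality, supported resolution, and multi-task extensibility.
🔍 Observations: The proposed unified pretraining objective learns low-level details and high-level semantics together and is better suited to a wider range of downstream tasks.
🛠️ Method: Combines masked image reconstruction with self-distilled contrastive objectives for pretraining on 1 billion high-quality human images, and adopts windowed attention so that the hierarchical variants support 4K images.
📊 Data & Experiments: Uses a curated dataset of 1 billion high-quality human images and improves the quality and quantity of task annotations; quantitative metrics show gains across multiple tasks.
⭐ Contributions: Presents the Sapiens2 model family, which improves pose estimation (+4 mAP), body-part segmentation (+24.3 mIoU), and normal estimation (45.6% lower angular error), and extends to new tasks such as pointmap and albedo estimation.
Full abstract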
We present Sapiens2, a model family of high-resolution transformers for human-centric vision focused on generalization, versatility, and high-fidelity outputs. Our model sizes range from 0.4 to 5 billion parameters, with native 1K resolution and hierarchical variants that support 4K. Sapiens2 substantially improves over its predecessor in both pretraining and post-training. First, to learn features that capture low-level details (for dense prediction) and high-level semantics (for zero-shot or few-label settings), we combine masked image reconstruction with self-distilled contrastive objectives. Our evaluations show that this unified pretraining objective is better suited for a wider range of downstream tasks. Second, along the data axis, we pretrain on a curated dataset of 1 billion high-quality human images and improve the quality and quantity of task annotations. Third, architecturally, we incorporate advances from frontier models that enable longer training schedules with improved stability. Our 4K models adopt windowed attention to reason over longer spatial context and are pretrained with 2K output resolution. Sapiens2 sets a new state-of-the-art and improves over the first generation on pose (+4 mAP), body-part segmentation (+24.3 mIoU), normal estimation (45.6% lower angular error) and extends to new tasks such as pointmap and albedo estimation.
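Windowed attention restricts each token to attend only within a local spatial window, which keeps cost manageable at high resolution. The sketch below shows the generic partition-then-attend pattern on a toy token grid; it omits the learned query, key, and value projections and any window shifting, and it is not the Sapiens2 architecture, whose details the abstract does not give.

```python
import numpy as np


def window_partition(tokens, window):
    """Split an (H, W, C) token grid into (num_windows, window * window, C)."""
    h, w, c = tokens.shape
    assert h % window == 0 and w % window == 0, "grid must be divisible by window"
    x = tokens.reshape(h // window, window, w // window, window, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, c)


def windowed_self_attention(tokens, window):
    """Softmax attention computed independently inside each window.

    For brevity the tokens serve as queries, keys, and values directly
    (no learned projections).
    """
    wins = window_partition(tokens, window)
    scores = wins @ wins.transpose(0, 2, 1) / np.sqrt(wins.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ wins


tokens = np.random.default_rng(0).normal(size=(8, 8, 16))
out = windowed_self_attention(tokens, window=4)  # 4 windows of 16 tokens each
```

With an 8x8 grid and 4x4 windows, each attention matrix is 16x16 instead of 64x64, and the saving grows quadratically with image size.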
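Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Large Vision Language Model #Image Caption
TL;DR:We present ScaleCap, a scalable image captioning framework.
🎯 Motivation: Large vision-language models (LVLMs) exhibit multimodal bias and linguistic bias in image captioning. These biases cause uneven descriptive granularity and hallucinated content, limiting how complete and accurate the captions are.
❓ Problem: Proposes ScaleCap, which uses a scalable debiasing strategy and an increased inference budget to enrich and calibrate captions progressively, improving accuracy, balance, and informativeness.
🔍 Observations: Multimodal bias in LVLMs leads to imbalanced granularity, with some elements described in detail and others skimmed over; linguistic bias leads to hallucinated descriptions of non-existent objects.
🛠️ Method: Two components. Heuristic question answering generates content-specific questions from the image and answers them to inject relevant information step by step. Contrastive sentence rating uses sentence-level offline contrastive decoding to identify and remove hallucinations caused by linguistic bias.
📊 Data & Experiments: Modality alignment experiments support the approach: pretraining an LVLM on 450K images annotated by ScaleCap gives consistent gains on 11 widely used benchmarks. Two further tasks, replacing images with captions in VQA and reconstructing images from captions, show the richness and fidelity of the captions.
⭐ Contributions: Proposes ScaleCap, a scalable image captioning framework whose debiasing strategy addresses both multimodal and linguistic bias in LVLMs; experiments show more accurate, balanced, and informative captions that serve as high-quality annotations for downstream tasks.
Full abstract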
This paper presents ScaleCap, a scalable image captioning strategy that generates comprehensive and detailed image captions. The key challenges of high-quality image captioning lie in the inherent biases of LVLMs: multimodal bias resulting in imbalanced descriptive granularity, offering detailed accounts of some elements while merely skimming over others; linguistic bias leading to hallucinated descriptions of non-existent objects. To address these issues, we propose a scalable debiased captioning strategy, which continuously enriches and calibrates the caption with increased inference budget. Specifically, we propose two novel components: heuristic question answering and contrastive sentence rating. The former generates content-specific questions based on the image and answers them to progressively inject relevant information into the caption. The latter employs sentence-level offline contrastive decoding to effectively identify and eliminate hallucinations caused by linguistic biases. With increased inference cost, more heuristic questions are raised by ScaleCap to progressively capture additional visual details, generating captions that are more accurate, balanced, and informative. Extensive modality alignment experiments demonstrate the effectiveness of ScaleCap. Annotating 450K images with ScaleCap and using them for LVLM pretraining leads to consistent performance gains across 11 widely used benchmarks. Furthermore, ScaleCap showcases superb richness and fidelity of generated captions with two additional tasks: replacing images with captions in VQA task, and reconstructing images from captions to assess semantic coverage.
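As a rough illustration of contrastive sentence rating, the sketch below keeps a caption sentence only if its likelihood rises by some margin when the image is visible. The function name, the `margin` parameter, and the log-probability values are assumptions made for this example; the abstract does not specify the scoring rule.

```python
def rate_sentences(sentences, logp_with_image, logp_without_image, margin=0.5):
    """Keep sentences that the image supports more than the language prior does."""
    kept = []
    for sent, lp_img, lp_txt in zip(sentences, logp_with_image, logp_without_image):
        # Contrast score: gain in log-likelihood attributable to seeing the image.
        score = lp_img - lp_txt
        if score >= margin:
            kept.append(sent)
    return kept


caption = ["A dog sits on a red sofa.", "A frisbee lies next to the dog."]
# Hypothetical average per-token log-probabilities from an LVLM.
with_img = [-0.9, -1.4]
without_img = [-2.1, -1.3]
print(rate_sentences(caption, with_img, without_img))
```

In this toy case the second sentence is about as likely without the image as with it, so it is treated as a probable product of linguistic bias and dropped.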
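Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Vision Language Model #Multi-modal QA #Mechanistic Interpretability
TL;DR: We show that VLMs often perceive the right evidence but fail to use it, and propose a simple inference-time method that highlights evidence to improve accuracy.
🎯 Motivation: Vision-language models (VLMs) perform well on multimodal question answering but often answer incorrectly even when the visual evidence for the correct answer is present. This work systematically investigates whether such failures come from not perceiving the evidence or from not using it effectively.
❓ Problem addressed: Targets the widespread "seeing but not believing" phenomenon, in which VLMs answer incorrectly although the evidence is present, and proposes a simple inference-time intervention to improve answer accuracy.
🔍 Analysis: Layer-wise attention dynamics show that shallow layers attend mainly to text, while deeper layers attend sparsely but reliably to evidence regions. VLMs still perceive the evidence when they output wrong answers, which reveals a mismatch between perception and reasoning.
🛠️ Method: A training-free inference-time intervention that highlights deep-layer evidence regions through selective attention-based masking, guiding the model to make better use of the evidence it has already encoded.
📊 Data and experiments: Experiments cover several major VLM families, including LLaVA, Qwen, Gemma, and InternVL, and show consistent accuracy improvements across them.
⭐ Main contributions: Shows that VLMs encode reliable visual evidence internally but under-utilize it, and proposes an efficient inference-time intervention that bridges perception and reasoning, advancing the diagnostic understanding and reliability of VLMs.
Abstract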
Vision-Language Models (VLMs) achieve strong results on multimodal tasks such as visual question answering, yet they can still fail even when the correct visual evidence is present. In this work, we systematically investigate whether these failures arise from not perceiving the evidence or from not leveraging it effectively. By examining layer-wise attention dynamics, we find that shallow layers focus primarily on text, while deeper layers sparsely but reliably attend to localized evidence regions. Surprisingly, VLMs often perceive the visual evidence when outputting incorrect answers, a phenomenon we term "seeing but not believing" that widely exists in major VLM families. Building on this, we introduce an inference-time intervention that highlights deep-layer evidence regions through selective attention-based masking. It requires no training and consistently improves accuracy across multiple families, including LLaVA, Qwen, Gemma, and InternVL. These results show that VLMs encode reliable evidence internally but under-utilize it, and that making such signals explicit can bridge the gap between perception and reasoning, advancing the diagnostic understanding and reliability of VLMs.
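A minimal sketch of attention-based highlighting, assuming the deep-layer attention over image patches is already available as a flat list. The names `keep_ratio` and `dim_factor` are hypothetical; the abstract does not say how the mask is built or applied.

```python
def highlight_evidence(patch_attention, keep_ratio=0.25, dim_factor=0.2):
    """Return per-patch weights: 1.0 for the most attended patches, dim_factor otherwise."""
    n_keep = max(1, int(len(patch_attention) * keep_ratio))
    ranked = sorted(range(len(patch_attention)),
                    key=lambda i: patch_attention[i], reverse=True)
    top = set(ranked[:n_keep])
    return [1.0 if i in top else dim_factor for i in range(len(patch_attention))]


# Hypothetical deep-layer attention mass over eight image patches.
deep_layer_attention = [0.01, 0.02, 0.40, 0.03, 0.35, 0.02, 0.10, 0.07]
print(highlight_evidence(deep_layer_attention))
```

The resulting weights could be multiplied into the patch pixels or patch embeddings before a second forward pass, so the model sees the evidence regions emphasized; which of these the authors use is not stated here.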
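Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#large vision language models #large language models #contrastive decoding #multimodal learning
🎯 Motivation: Large vision-language models (LVLMs) are prone to hallucination. Existing visual contrastive decoding methods apply generic visual augmentations that ignore the specific context of the text query, which limits their effectiveness.
❓ Problem addressed: Proposes a training-free, query-adaptive decoding strategy that overcomes the limits of generic visual augmentation and improves factual consistency.
🔍 Analysis: Existing methods do not exploit semantic alignment between the query and the visual augmentation, and they select candidate tokens with a fixed rule that ignores the full information in the logit distribution.
🛠️ Method: A self-augmentation prompting strategy dynamically aligns the semantics of the query and the visual augmentation, and an adaptive thresholding algorithm adjusts the candidate-token set size according to output sparsity.
📊 Data and experiments: Extensive experiments on four LVLMs and seven benchmarks show significantly better factual consistency than state-of-the-art decoding methods.
⭐ Main contributions: Highlights the importance of query-dependent augmentation and entropy-aware decoding for LVLM generation; the code will be released upon acceptance.
Abstract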
Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal capabilities, but they inherit the tendency to hallucinate from their underlying language models. While visual contrastive decoding has been proposed to mitigate this issue, existing methods often apply generic visual augmentations that disregard the specific context provided by the text query, limiting their effectiveness. This study introduces a novel training-free decoding strategy that addresses these limitations, featuring two key contributions. First, a self-augmentation prompting strategy that leverages the intrinsic knowledge of the model to dynamically align semantics between the query and the visual augmentation. Second, an adaptive thresholding algorithm that adjusts the size of the next-token candidate set based on the output sparsity, utilizing full information from the logit distribution. Extensive experiments across four LVLMs and seven benchmarks demonstrate that the proposed decoding significantly enhances factual consistency compared to state-of-the-art decoding methods. This work highlights the importance of integrating query-dependent augmentation and entropy-aware decoding for improving the generation quality of LVLMs. The source code will be released upon acceptance.
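One way an entropy-aware candidate threshold could work is sketched below. This is an assumption-laden illustration, not the paper's algorithm: the linear scaling of the cutoff by normalized entropy, the `base_beta` parameter, and the direction of the adjustment (flatter distribution, more candidates) are all choices made for this example.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_candidates(logits, base_beta=0.5):
    """Indices of tokens kept as decoding candidates (needs at least two logits)."""
    probs = softmax(logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    norm_entropy = entropy / math.log(len(probs))  # about 0 when peaked, 1 when uniform
    # Assumed rule: a flatter distribution lowers the cutoff and admits more tokens.
    beta = base_beta * (1.0 - norm_entropy)
    cutoff = beta * max(probs)
    return [i for i, p in enumerate(probs) if p >= cutoff]


print(adaptive_candidates([10.0, 0.0, 0.0, 0.0]))  # peaked output
print(adaptive_candidates([1.0, 1.0, 1.0, 1.0]))   # uniform output
```

A contrastive decoding step would then compare logits from the original and augmented views only over the returned candidate indices.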
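Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Vision Language Models #VLMs #Multi Modal Language Models #Spatial Intelligence #Spatial Reasoning
TL;DR: We propose MindCube and find existing VLMs perform poorly on it. Supervising models to first generate cognitive maps and then reason upon them proves to be a quite effective approximation for spatial mental modeling from limited views.
🎯 Motivation: Asks whether vision-language models (VLMs) can, like humans, form a spatial mental model of a full scene from only a few views. The lack of this ability in existing VLMs is the main motivation.
❓ Problem addressed: Introduces the MindCube benchmark to quantify how well VLMs build spatial mental models from limited views, and explores several ways to improve on the near-random performance of existing models.
🔍 Analysis: Existing VLMs perform poorly at spatial reasoning from limited views, exposing serious deficits in cognitive mapping, perspective-taking, and mental simulation.
🛠️ Method: The core approach is "map-then-reason": the model is jointly trained to first generate a cognitive map and then reason over it. Reinforcement learning is added on top to improve performance further.
📊 Data and experiments: MindCube contains 21,154 questions across 3,268 images. Map-then-reason training raises accuracy from 37.8% to 60.8%, and adding reinforcement learning raises it to 70.7%.
⭐ Main contributions: Exposes a key gap in the spatial mental modeling of VLMs and proposes the map-then-reason framework together with the MindCube benchmark, significantly improving model understanding of unobservable space.
Abstract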
Can Vision Language Models (VLMs) imagine the full scene from just a few views, like humans do? Humans form spatial mental models, internal representations of unseen space, to reason about layout, perspective, and motion. Our new MindCube benchmark with 21,154 questions across 3,268 images exposes this critical gap, where existing VLMs exhibit near-random performance. Using MindCube, we systematically evaluate how well VLMs build robust spatial mental models through representing positions (cognitive mapping), orientations (perspective-taking), and dynamics (mental simulation for "what-if" movements). We then explore three approaches to help VLMs approximate spatial mental models, including unseen intermediate views, natural language reasoning chains, and cognitive maps. The significant improvement comes from a synergistic approach, "map-then-reason", that jointly trains the model to first generate a cognitive map and then reason upon it. By training models to reason over these internal maps, we boosted accuracy from 37.8% to 60.8% (+23.0%). Adding reinforcement learning pushed performance even further to 70.7% (+32.9%). Our key insight is that such scaffolding of spatial mental models, actively constructing and utilizing internal structured spatial representations with flexible reasoning processes, significantly improves understanding of unobservable space.
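Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Vision-Language Models #Spatial Reasoning
🎯 Motivation: Spatial reasoning remains a fundamental challenge for vision-language models (VLMs), and current approaches struggle to achieve robust performance. The work aims to close this gap with a progressive training methodology.
❓ Problem addressed: Proposes SpatialLadder: builds the SpatialLadder-26k dataset covering multimodal spatial reasoning tasks and designs a three-stage progressive training framework.
🔍 Analysis: Existing methods try to learn spatial reasoning directly, without the hierarchical foundations of perception and understanding, which is the root cause of their poor robustness.
🛠️ Method: Three-stage progressive training: establish spatial perception through object localization, develop spatial understanding through multi-dimensional spatial tasks, and strengthen complex reasoning via reinforcement learning with verifiable rewards.
📊 Data and experiments: A multimodal dataset of 26,610 samples spanning object localization and single-image, multi-view, and video tasks. The 3B-parameter model improves by 23.4% on average over the base model on spatial reasoning benchmarks, surpasses GPT-4o by 20.8%, and improves by 7.2% on out-of-domain benchmarks.
⭐ Main contributions: A systematic methodology for building spatial intelligence, a standardized multimodal dataset, and evidence that progressive training from perception to reasoning is key to robust spatial intelligence.
Abstract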
Spatial reasoning remains a fundamental challenge for Vision-Language Models (VLMs), with current approaches struggling to achieve robust performance despite recent advances. We identify that this limitation stems from a critical gap: existing methods attempt to learn spatial reasoning directly without establishing the hierarchical foundations of perception and understanding. To address this challenge, we present a comprehensive methodology for building spatial intelligence progressively. We introduce SpatialLadder-26k, a multimodal dataset containing 26,610 samples spanning object localization, single-image, multi-view, and video spatial reasoning tasks, constructed through a standardized pipeline that ensures systematic coverage across modalities. Building on this dataset, we design a three-stage progressive training framework that (1) establishes spatial perception through object localization, (2) develops spatial understanding through multi-dimensional spatial tasks, and (3) strengthens complex reasoning via reinforcement learning with verifiable rewards. This approach yields SpatialLadder, a 3B-parameter model that achieves state-of-the-art performance on spatial reasoning benchmarks, with 23.4% average improvement over the base model, surpassing GPT-4o by 20.8% and Gemini-2.0-Flash by 10.1%. Notably, SpatialLadder maintains strong generalization with 7.2% improvement on out-of-domain benchmarks, demonstrating that progressive training from perception to reasoning is essential for robust spatial intelligence.
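Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Spatial Priors #Robot Manipulation #Instruction Following
🎯 Motivation: Large vision-language models excel at multimodal understanding but fall short on embodied tasks, where high-level instructions must be turned into low-level motor actions.
❓ Problem addressed: Builds a vision-language-action (VLA) model that uses spatial priors to bridge linguistic instructions and robot-specific control for more precise action generation.
🔍 Analysis: Conventional VLA models can lose spatial grounding during action learning, which leads to poor instruction execution and unstable behavior on complex, long-horizon tasks.
🛠️ Method: SP-VLA uses two stages. Spatial grounding pre-training learns transferable spatial priors through scalable point, box, and trajectory prediction on web-scale and robot-specific data. Spatially guided action post-training then encourages the model to produce richer spatial priors to guide action generation, keeping spatial grounding and policy learning consistent.
📊 Data and experiments: On SimplerEnv, SP-VLA improves from 66.1 to 84.6 on Google Robot and from 54.7 to 73.2 on WidowX, a new state of the art. It also generalizes better to unseen objects and paraphrased instructions and is more robust to long-horizon perturbations.
⭐ Main contributions: Systematically integrates spatial priors into VLA models through a scalable, spatially guided training framework that improves performance, generalization, and robustness on robot manipulation; code, data, and model checkpoints will be released.
Abstract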
Large vision–language models (VLMs) excel at multimodal understanding but fall short when extended to embodied tasks, where instructions must be transformed into low-level motor actions. We introduce SP-VLA, a dual-system **V**ision–**L**anguage–**A**ction framework that leverages **S**patial **P**riors as a bridge between linguistic instructions and embodiment-specific control.
SP-VLA aligns action learning with spatial priors through two stages: (i) spatial grounding pre-training, which equips the VLM with transferable priors via scalable point, box, and trajectory prediction from both web-scale and robot-specific data, and (ii) spatially guided action post-training, which encourages the model to produce richer spatial priors to guide action generation via spatial prompting.
This design preserves spatial grounding during policy learning and promotes consistent optimization across spatial and action objectives. Empirically, SP-VLA achieves substantial improvements over vanilla VLA, with performance increasing from 66.1 to 84.6 on Google Robot and from 54.7 to 73.2 on WidowX, establishing new state-of-the-art results on SimplerEnv. It also demonstrates stronger generalization to unseen objects and paraphrased instructions, as well as robustness to long-horizon perturbations in real-world settings. These results highlight scalable spatially guided training as a promising direction for robust, generalizable robot learning. We will release code, data, and model checkpoints to support future research.
See more visualization results at the anonymous page: https://sp-vla-anonymous.vercel.app
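Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Multimodal Reasoning #LVLM #Reinforcement Learning
TL;DR: This paper introduces VPPO, which integrates token-level visual perception into multimodal RLVR to enhance the reasoning capabilities of Large Vision-Language Models.
🎯 Motivation: Most multimodal reasoning methods neglect the role of visual perception in RLVR optimization. This work explores multimodal RLVR from the new perspective of token perception.
❓ Problem addressed: Visual dependency is not modeled explicitly in multimodal reinforcement learning. The paper quantifies the visual dependency of each generated token in order to improve LVLM reasoning.
🔍 Analysis: A fine-grained analysis of chain-of-thought processes yields two findings. First, token perception in a rollout trajectory is sparse, with only a small fraction of tokens showing high visual dependency. Second, overall visual dependency differs markedly across trajectories.
🛠️ Method: Visually-Perceptive Policy Optimization (VPPO) refines the learning signal through a dual mechanism: it reweights each trajectory's advantage by its overall visual dependency and applies policy updates only to perceptually pivotal tokens.
📊 Data and experiments: On eight perception and reasoning benchmarks, VPPO shows substantial gains over leading open-source RL-tuned models at both the 7B and 32B scales.
⭐ Main contributions: Establishes a token-level perceptual perspective for analyzing multimodal RLVR and proposes an effective optimization strategy that significantly improves the multimodal reasoning of LVLMs.
Abstract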
While Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Vision-Language Models (LVLMs), most existing methods in multimodal reasoning neglect the critical role of visual perception within the RLVR optimization process. In this paper, we undertake a pioneering exploration of multimodal RLVR through the novel perspective of token perception, which measures the visual dependency of each generated token. With a granular analysis of Chain-of-Thought (CoT) processes, we uncover two key insights: first, token perception in a rollout trajectory is sparsely distributed, where only a small fraction of tokens have high visual dependency for visually-grounded reasoning; second, different trajectories exhibit significant divergence in their overall visual dependency. Based on these observations, we propose **V**isually-**P**erceptive **P**olicy **O**ptimization (**VPPO**), a novel policy gradient algorithm that explicitly leverages token perception to refine the learning signal. Specifically, VPPO achieves this through a dual mechanism: it reweights a trajectory's advantage by its overall visual dependency, and focuses policy updates exclusively on perceptually pivotal tokens. On a comprehensive suite of eight perception and reasoning benchmarks, VPPO demonstrates substantial gains over leading open-source RL-tuned models, with its effectiveness consistently validated across 7B and 32B model scales. Our findings not only establish a new token-level perceptual perspective for analyzing multimodal RLVR but also present a novel and effective optimization strategy to significantly enhance the multimodal reasoning capabilities of LVLMs.
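The dual mechanism can be sketched as follows, under the assumption that a token's visual dependency is measured as the KL divergence between the policy's next-token distribution with the real image and with a masked image. That measure, the `top_fraction` parameter, and the mean used for the trajectory score are illustrative choices not confirmed by the abstract.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def vppo_signals(dists_with_image, dists_masked_image, top_fraction=0.4):
    """Per-token dependency, a mask over pivotal tokens, and a trajectory-level score."""
    dep = [kl(p, q) for p, q in zip(dists_with_image, dists_masked_image)]
    n_keep = max(1, int(len(dep) * top_fraction))
    cutoff = sorted(dep, reverse=True)[n_keep - 1]
    token_mask = [1 if d >= cutoff else 0 for d in dep]
    trajectory_dependency = sum(dep) / len(dep)
    return dep, token_mask, trajectory_dependency


# Three generated tokens over a two-word vocabulary (toy numbers).
with_image = [[0.5, 0.5], [0.9, 0.1], [0.5, 0.5]]
masked = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
dep, mask, traj = vppo_signals(with_image, masked)
print(mask, round(traj, 3))
```

In a policy-gradient update, the trajectory's advantage would be scaled by a function of `traj`, and the per-token loss multiplied by `mask` so that only the pivotal tokens receive gradient.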
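Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Remote Sensing #Foundation Model #Geospatial
🎯 Motivation: Deep learning for remote sensing increasingly relies on diverse satellite imagery, but the training data of existing foundation models is limited in scale, geographic coverage, and spectral diversity, which hampers learning globally transferable representations.
❓ Problem addressed: Improve how well foundation models adapt to global remote-sensing data by fusing data from multiple sensors and optimizing the sampling strategy, and reduce the effect of imbalanced land-cover class distributions on task performance.
🔍 Analysis: Existing models generalize poorly on classification and segmentation and do not fully exploit multi-sensor imagery or the rich distribution of land-cover classes.
🛠️ Method: TerraFM is a self-supervised framework that treats radar and optical sensing modalities as natural augmentations and unifies them via modality-specific patch embeddings and adaptive cross-attention fusion. Training combines local-global contrastive learning with a dual-centering mechanism that adds class-frequency-aware regularization.
📊 Data and experiments: Trained on globally distributed Sentinel-1 and Sentinel-2 imagery and evaluated on GEO-Bench and Copernicus-Bench, where it outperforms prior models on classification and segmentation.
⭐ Main contributions: A model that builds unified remote-sensing representations through cross-modal self-supervised learning, handles long-tailed classes better, and generalizes better; code and pretrained models will be publicly released.
Abstract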
Modern Earth observation (EO) increasingly leverages deep learning to harness the scale and diversity of satellite imagery across sensors and regions. While recent foundation models have demonstrated promising generalization across EO tasks, many remain limited by the scale, geographical coverage, and spectral diversity of their training data, factors critical for learning globally transferable representations. In this work, we introduce TerraFM, a scalable self-supervised learning model that leverages globally distributed Sentinel-1 and Sentinel-2 imagery, combined with large spatial tiles and land-cover aware sampling to enrich spatial and semantic coverage. By treating sensing modalities as natural augmentations in our self-supervised approach, we unify radar and optical inputs via modality-specific patch embeddings and adaptive cross-attention fusion. Our training strategy integrates local-global contrastive learning and introduces a dual-centering mechanism that incorporates class-frequency-aware regularization to address long-tailed distributions in land cover. TerraFM achieves strong generalization on both classification and segmentation tasks, outperforming prior models on GEO-Bench and Copernicus-Bench. Our code and pretrained models will be publicly released.
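Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Compositional reasoning #multimodal learning #test-time adaptation #evaluation metrics #vision-language models
🎯 Motivation: Frontier multimodal models appear weak at compositional reasoning, but existing evaluation metrics may systematically underestimate their true capability.
❓ Problem addressed: Introduces a new evaluation metric and a test-time matching algorithm to evaluate compositional reasoning more accurately and to improve it.
🔍 Analysis: Existing evaluation is biased and underestimates model performance. After correction, performance rises markedly, and with test-time matching SigLIP-B16 even surpasses GPT-4.1 on MMVP-VLM.
🛠️ Method: A group matching score that evaluates model capability more faithfully, and Test-Time Matching (TTM), an unsupervised, iterative, self-improving algorithm.
📊 Data and experiments: Validated on 16 dataset variants, including Winoground, MMVP-VLM, and WhatsUp, with relative gains of up to 85.7%.
⭐ Main contributions: Exposes the evaluation artifact and offers a correction for it; proposes a general test-time matching algorithm that sets new state-of-the-art results on several benchmarks.
Abstract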
Frontier AI models have achieved remarkable progress, yet recent studies suggest they struggle with compositional reasoning, often performing at or below random chance on established benchmarks. We revisit this problem and show that widely used evaluation metrics systematically underestimate model capability. To correct this artifact, we introduce a group matching score that more faithfully evaluates model capability. Moreover, correctness under the new metric can be translated into correctness under existing metrics via a simple overfitting step. This adjustment enables SigLIP-B16 to surpass all previous results and GPT-4.1 to yield the first result surpassing estimated human performance on Winoground. Building on this insight, we propose Test-Time Matching (TTM), an iterative, self-improving algorithm that further bootstraps model performance without any external supervision. TTM delivers additional, non-trivial improvements: for example, TTM enables SigLIP-B16 to surpass GPT-4.1 on MMVP-VLM, establishing a new state of the art. Importantly, TTM remains broadly effective even on benchmarks without metric-induced effects or group structures, achieving relative gains up to 85.7% on challenging datasets such as WhatsUp. Across 16 dataset variants spanning diverse setups, our experiments demonstrate that TTM consistently improves model performance and advances the frontier of compositional reasoning.
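To make the metric difference concrete, the sketch below contrasts a Winoground-style group score with one plausible reading of a group matching score: a group counts as correct when the ground-truth pairing has the strictly highest total similarity among all pairings. This reading is an assumption for illustration; the abstract does not define the score.

```python
from itertools import permutations


def group_score(s):
    """Winoground-style group score for a 2x2 matrix s[image][caption]."""
    text_ok = s[0][0] > s[0][1] and s[1][1] > s[1][0]
    image_ok = s[0][0] > s[1][0] and s[1][1] > s[0][1]
    return text_ok and image_ok


def group_match_score(s):
    """True if the identity pairing beats every other one-to-one pairing in total similarity."""
    k = len(s)
    identity = tuple(range(k))
    totals = {perm: sum(s[i][perm[i]] for i in range(k))
              for perm in permutations(range(k))}
    return all(totals[identity] > t for perm, t in totals.items() if perm != identity)


# Toy similarities: image 1 scores high with both captions.
sims = [[0.30, 0.25],
        [0.32, 0.40]]
print(group_score(sims), group_match_score(sims))
```

Here the pairwise comparison fails because of one inequality (0.30 is not above 0.32), while the matching view still recovers the correct pairing (0.70 versus 0.57 in total), which illustrates how a stricter metric can understate what the model knows.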
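Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Vision Transformer #Visual Attention Sink #Attention Sink #Multimodal LLM #Large Vision Language Model
🎯 Motivation: Large vision-language models (LVLMs) typically consist of a Vision Transformer (ViT) and a large language model (LLM), but how the vision encoder passes information to the LLM, and which visual tokens matter, remains unclear. The work asks which ViT tokens are most important for understanding and reasoning and how those signals propagate.
❓ Problem addressed: Identifies a class of high-norm visual tokens in the ViT, called ViT attention sinks, which existing LVLM architectures often overlook but which matter for high-level semantic understanding and reasoning. The goal is more effective propagation of visual information from the ViT to the LLM.
🔍 Analysis: Whereas prior work focuses on attention sinks inside the LLM, attention sinks also appear in the ViT. These tokens tend to carry high-level semantic concepts from the image and help the LLM understand and reason more effectively.
🛠️ Method: Qualitative and quantitative analyses of the information embedded in ViT sink tokens, plus training-free and training-based approaches that make better use of how the LLM interprets this information.
📊 Data and experiments: Experiments across a range of LVLMs and visual reasoning tasks, including mathematical problem solving, logical inference, and geometric understanding, show substantial gains from explicitly using these tokens.
⭐ Main contributions: Systematically identifies and studies attention sinks in the ViT and shows their role in carrying high-level semantics. The proposed methods improve performance on several visual reasoning tasks, highlighting the untapped potential of these tokens for visual reasoning.
Abstract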
Large Vision Language Models (LVLMs) have recently emerged as powerful architectures capable of understanding and reasoning over both visual and textual information. These models typically rely on two key components: a Vision Transformer (ViT) and a Large Language Model (LLM). ViT encodes visual content into a sequence of image tokens and serves as the perceptual front-end -- the eyes of the model. In contrast, the LLM interprets these tokens to perform high-level reasoning, generates responses, and functions as the cognitive core -- the brain of the model. However, it remains unclear which visual tokens contribute most significantly to understanding and reasoning, and how effectively these signals are propagated from ViT to the LLM. While most existing works have focused on identifying attention sinks, low-semantic tokens receiving disproportionately high attention, within the LLM, we shift the focus to the vision encoder by identifying a class of high-norm visual tokens from ViT, referred to as ViT attention sinks -- a problem that has been rarely studied but is indeed very important for LVLMs. Our findings show that these ViT sinks encapsulate high-level semantic concepts from images, allowing the LLM to perform more effective understanding and reasoning. Despite their importance, these sink tokens are often overlooked in existing LVLM architectures. To explore their contribution, we present both qualitative and quantitative analyses of the information embedded in these sink tokens. We also propose both training-free and training-based approaches to better leverage how this information is interpreted by the LLM, and to what extent. By explicitly utilizing these tokens, we demonstrate substantial improvements across a range of LVLMs and visual reasoning tasks, including but not limited to mathematical problem solving, logical inference, and geometric understanding, highlighting the untapped potential of ViT attention sinks in enhancing visual reasoning.
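Since the abstract characterizes ViT sinks as high-norm visual tokens, a simple detector can be sketched by comparing each token's L2 norm with the median norm. The `ratio` threshold and the median-based rule are assumptions for this example, not the paper's criterion.

```python
import math


def find_sink_tokens(token_embeddings, ratio=3.0):
    """Return indices of tokens whose L2 norm exceeds ratio times the median norm."""
    norms = [math.sqrt(sum(x * x for x in tok)) for tok in token_embeddings]
    median = sorted(norms)[len(norms) // 2]
    return [i for i, n in enumerate(norms) if n > ratio * median]


# Toy ViT output: five tokens with two channels each; token 2 has an outlier norm.
tokens = [[1.0, 0.0], [0.0, 1.0], [10.0, 10.0], [1.0, 1.0], [0.5, 0.5]]
print(find_sink_tokens(tokens))
```

The returned indices could then be used to route sink tokens to the LLM separately from ordinary patch tokens, for example by placing them first in the visual sequence.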
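Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#multimodal large language model #large language model #speech language model
TL;DR: We present a true speech-to-speech LLM that understands and generates speech directly, without text intermediates, achieving state-of-the-art spoken QA.
🎯 Motivation: Spoken dialogue systems mostly use cascaded pipelines with text as the intermediate medium, which discards paralinguistic cues and limits expressivity. The work explores a true speech-to-speech dialogue model that needs no text guidance.
❓ Problem addressed: Presents a true speech-to-speech large language model that understands and generates speech directly, without text intermediates. It removes the reliance on text intermediates that remains in existing end-to-end methods and narrows the gap between text-guided and direct speech generation.
🔍 Analysis: Cascaded systems are effective but discard paralinguistic cues, and text intermediates become a bottleneck. Recent end-to-end methods reduce latency but still depend on text for understanding and generation, which limits expressivity.
🛠️ Method: A modality-based layer-splitting architecture combined with a frozen pre-training strategy, which preserves the reasoning and knowledge of the pretrained text LLM while adding native speech understanding and generation.
📊 Data and experiments: The model achieves state-of-the-art results on spoken question answering, speech-to-speech performance comparable to existing text-guided systems, and competitive text performance.
⭐ Main contributions: Establishes a paradigm for end-to-end speech interaction without text guidance; code and models will be released to support further research on true speech-to-speech foundation models.
Abstract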
Spoken dialogue systems often rely on cascaded pipelines that transcribe, process, and resynthesize speech. While effective, this design discards paralinguistic cues and limits expressivity. Recent end-to-end methods reduce latency and better preserve these cues, yet still rely on text intermediates, creating a fundamental bottleneck. We present a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. Our approach combines a modality-based layer-splitting architecture with a frozen pre-training strategy, preserving the reasoning and knowledge of pretrained text LLMs while adding native speech capabilities. Experiments show that our model achieves state-of-the-art results in spoken question answering and delivers comparable speech-to-speech performance relative to existing text-guided systems, while still maintaining competitive text performance. By narrowing the gap between text-guided and direct speech generation, our work establishes a new paradigm for expressive and efficient end-to-end speech interaction. We will release our code and models to support further research in true speech-to-speech foundation models.
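Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#MLLM #Self-improvement #Unification
TL;DR: We systematically explore the internal gap in unified MLLMs, which typically manifests as understanding being stronger than generation, covering empirical validation, mitigation methods, mechanistic analysis, and the design of improved approaches.
🎯 Motivation: Unified multimodal LLMs (MLLMs) aim to combine generation and understanding, yet they commonly show an internal gap in which understanding outperforms generation, limiting balanced development. The paper analyzes this gap systematically and explores turning it into a driver of self-improvement.
❓ Problem addressed: Mitigate the non-unification in which generation is weaker than understanding, and promote joint improvement of both, through an internal-gap-based self-improvement framework that needs no external signals.
🔍 Analysis: Large-scale evaluation confirms a marked internal gap across multiple MLLMs and tasks, stemming mainly from weak generation rather than misunderstanding. Further analysis shows that generation and understanding share learning dynamics, which provides a theoretical basis for co-improvement.
🛠️ Method: The model's stronger understanding scores its own generations to build image data for post-training (e.g., SFT and DPO), which directly improves generation and promotes unification. A curriculum learning approach dynamically expands the post-training data for further gains.
📊 Data and experiments: Experiments across models and tasks show that post-training significantly improves generation and promotes unification. Self-improvement produces a co-improvement effect between generation and understanding, explained by aligned learning dynamics under a shared empirical neural tangent kernel.
⭐ Main contributions: Systematically confirms the internal gap in MLLMs and proposes a self-improvement framework that needs no external signals; reveals co-improvement of generation and understanding in post-training and explains it theoretically; designs a curriculum learning strategy that further improves performance and unification.
Abstract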
Although unified MLLMs aim to unify generation and understanding, they are considered to exhibit an internal gap, with understanding outperforming generation. Through large‑scale evaluation across multiple MLLMs and tasks, we confirm the widespread non‑unification of MLLMs, and demonstrate that it indeed stems from weak generation rather than misunderstanding. This finding motivates us to propose a simple yet effective internal gap-based self-improvement framework, which mitigates internal gaps by leveraging stronger understanding to guide weaker generation without relying on any external signals. We validate this strategy through comprehensive experiments: scoring generations with understanding to construct image data for post-training (e.g., SFT and DPO) significantly improves generation while promoting unification. Furthermore, we empirically discover a co-improvement effect of such self-improvement, a phenomenon well known in pre-training but underexplored in post-training. Specifically, as generation improves, understanding becomes more effective at detecting false positives that were previously misclassified as prompt‑aligned. To explain this effect, we extend learning dynamic theory to the MLLM setting, showing that the shared empirical neural tangent kernel between generation and understanding encourages aligned learning dynamics, thereby driving co-improvement. This interplay between generation and understanding further motivates a curriculum learning approach for stronger self‑improvement: progressively enhanced understanding and generation revisit samples underutilized by pre‑trained MLLMs, dynamically expanding post‑training data and leading to improved performance and unification.
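The data-construction step (scoring generations with the understanding branch to build post-training data) might look like the following for DPO-style preference pairs. The `understanding_score` callable, the `min_gap` filter, and the best-versus-worst pairing are hypothetical choices made for this sketch.

```python
def build_preference_pair(prompt, candidates, understanding_score, min_gap=0.2):
    """Pick the best- and worst-scored generations as a (chosen, rejected) pair."""
    scored = sorted(candidates, key=lambda c: understanding_score(prompt, c), reverse=True)
    best, worst = scored[0], scored[-1]
    gap = understanding_score(prompt, best) - understanding_score(prompt, worst)
    if gap < min_gap:
        return None  # scores too close to give a useful training signal
    return {"prompt": prompt, "chosen": best, "rejected": worst}


# Toy stand-in for the MLLM's understanding branch: prompt-alignment scores per image id.
toy_scores = {"img_a": 0.9, "img_b": 0.4, "img_c": 0.6}
scorer = lambda prompt, image_id: toy_scores[image_id]
print(build_preference_pair("a red cube on a blue sphere", ["img_a", "img_b", "img_c"], scorer))
```

For SFT, only the `chosen` samples would be kept; a curriculum could re-run this step after each round so that improved understanding rescores previously discarded samples.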
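Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#large vision-language models #multimodality #language prior
TL;DR: A formal framework for understanding and quantifying the language prior in LVLMs by contrasting the chain-of-embedding between visual and blind contexts.
🎯 Motivation: Large vision-language models (LVLMs) perform well on multimodal tasks but often rely too heavily on language priors memorized during pre-training and under-use visual information. Existing analyses mostly rely on input-output probing, which cannot reveal how vision influences decisions inside the model.
❓ Problem addressed: Systematically understand and quantify the language prior in LVLMs by contrasting the chain-of-embedding under visual and blind (text-only) contexts to study layer-wise representation dynamics.
🔍 Analysis: A universal phenomenon emerges: every LVLM has a Visual Integration Point (VIP), a critical layer at which visual information begins to meaningfully reshape hidden representations and influence decoding for multimodal reasoning.
🛠️ Method: A chain-of-embedding analysis framework identifies the VIP by contrasting representations across contexts. The Total Visual Integration (TVI) estimator then aggregates the representational discrepancy beyond the VIP to quantify how strongly the visual query influences generation.
📊 Data and experiments: Experiments cover 60 model-dataset combinations spanning 10 LVLMs and 6 benchmarks. The VIP emerges consistently, and TVI reliably predicts the strength of the language prior.
⭐ Main contributions: Presents the first systematic analysis of the language prior in LVLMs through the lens of chain-of-embedding, and proposes the VIP and TVI as measures of visual integration, giving a principled toolkit for diagnosing and understanding model behavior.
Abstract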
Large vision-language models (LVLMs) achieve strong performance on multimodal tasks, yet they often default to their language prior (LP), that is, textual patterns memorized during pre-training, while under-utilizing visual evidence.
Prior analyses of LP mostly rely on input–output probing, which fails to reveal the internal mechanisms governing when and how vision influences model behavior. To address this gap, we present the first systematic analysis of language prior through the lens of chain-of-embedding, which examines the layer-wise representation dynamics within LVLMs.
Our analysis reveals a universal phenomenon: each model exhibits a Visual Integration Point (VIP), a critical layer at which visual information begins to meaningfully reshape hidden representations and influence decoding for multimodal reasoning.
Building on this observation, we introduce the Total Visual Integration (TVI) estimator, which aggregates representational discrepancy beyond the VIP to quantify how strongly the visual query influences response generation. Across 60 model–dataset combinations spanning 10 contemporary LVLMs and 6 benchmarks, we demonstrate that VIP consistently emerges, and that TVI reliably predicts the strength of language prior. This offers a principled toolkit for diagnosing and understanding language prior in LVLMs.
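A toy version of the chain-of-embedding contrast is sketched below. It assumes one hidden vector per layer for the visual and blind runs, uses cosine distance as the discrepancy, takes the first layer above a fixed `threshold` as the VIP, and averages the distances from that layer onward as TVI. Each of these choices is an assumption; the abstract does not give the exact estimator.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)


def vip_and_tvi(hidden_with_image, hidden_blind, threshold=0.1):
    """Return (VIP layer index, TVI) from per-layer hidden states of both runs."""
    dists = [cosine_distance(h, g) for h, g in zip(hidden_with_image, hidden_blind)]
    vip = next((i for i, d in enumerate(dists) if d > threshold), None)
    if vip is None:
        return None, 0.0  # vision never reshapes the representation
    tail = dists[vip:]
    return vip, sum(tail) / len(tail)


# Four layers, two-dimensional hidden states (toy numbers).
visual = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
blind = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.0], [1.0, 0.0]]
print(vip_and_tvi(visual, blind))
```

Under this reading, a low TVI would indicate that the response is driven mostly by the language prior, since the image changes the later-layer representations very little.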
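Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Unified Multimodal Language Models
TL;DR: We found that during training, severe conflicts arise between the text and visual modalities in both the shallow and deep layers of UMMs. Our proposed Uni-X mitigates this issue and achieves strong performance.
🎯 Motivation: Unified multimodal models (UMMs) built on a shared autoregressive transformer are architecturally simple, but joint multimodal training produces severe gradient conflicts between the vision and text modalities in the shallow and deep layers.
❓ Problem addressed: Resolve the gradient conflicts in UMM training that arise from the different low-level statistical properties of visual and text data, improving training efficiency and model performance.
🔍 Analysis: The conflicts stem from fundamental differences between images and text in shallow-layer and deep-layer representations. In the middle layers, representations become more abstract and semantically aligned, and conflicts diminish markedly.
🛠️ Method: Uni-X is a two-end-separated, middle-shared architecture: the shallow and deep layers use modality-specific parameters, while the middle layers are shared, enabling high-level semantic fusion with fewer gradient conflicts.
📊 Data and experiments: Under identical training conditions, Uni-X trains more efficiently. Scaled to 3B parameters with more training data, it reaches a GenEval score of 82 for image generation and performs strongly on text and vision understanding, matching or surpassing 7B AR-based UMMs.
⭐ Main contributions: Proposes the Uni-X architecture, which mitigates multimodal gradient conflicts and provides a parameter-efficient, scalable foundation for unified multimodal modeling.
Abstract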
Unified Multimodal Models (UMMs) built on shared autoregressive (AR) transformers are attractive for their architectural simplicity. However, we identify a critical limitation: when trained on multimodal inputs, modality-shared transformers suffer from severe gradient conflicts between vision and text, particularly in shallow and deep layers. We trace this issue to the fundamentally different low-level statistical properties of images and text, while noting that conflicts diminish in middle layers where representations become more abstract and semantically aligned.
To overcome this challenge, we propose Uni-X, a two-end-separated, middle-shared architecture. Uni-X dedicates its initial and final layers to modality-specific processing, while maintaining shared parameters in the middle layers for high-level semantic fusion. This X-shaped design not only eliminates gradient conflicts at both ends but also further alleviates residual conflicts in the shared layers.
Extensive experiments validate the effectiveness of Uni-X. Under identical training conditions, Uni-X achieves superior training efficiency compared to strong baselines. When scaled to 3B parameters with larger training data, Uni-X matches or surpasses 7B AR-based UMMs, achieving a GenEval score of 82 for image generation alongside strong performance in text and vision understanding tasks.
These results establish Uni-X as a parameter-efficient and scalable foundation for future unified multimodal modeling.
Our code is available at https://github.com/CURRENTF/Uni-X.
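Two small helpers illustrate the ideas in this abstract: measuring gradient conflict as the cosine similarity between the text-loss and vision-loss gradients of a layer, and laying out which layers are modality-specific versus shared. The function names and the layer counts are hypothetical and are not taken from the Uni-X repository.

```python
import math


def gradient_cosine(g_text, g_vision):
    """Cosine similarity between two flattened gradients; negative values indicate conflict."""
    dot = sum(a * b for a, b in zip(g_text, g_vision))
    norm = math.sqrt(sum(a * a for a in g_text)) * math.sqrt(sum(b * b for b in g_vision))
    return dot / norm


def layer_plan(n_layers, n_front, n_back, modality):
    """Name the parameter set each layer uses for a given modality (X-shaped layout)."""
    plan = []
    for i in range(n_layers):
        separated = i < n_front or i >= n_layers - n_back
        plan.append(f"{modality}_{i}" if separated else f"shared_{i}")
    return plan


print(gradient_cosine([1.0, 0.0], [-1.0, 0.0]))  # fully opposed gradients
print(layer_plan(6, 2, 2, "vision"))
print(layer_plan(6, 2, 2, "text"))
```

With this layout, text and vision tokens never update the same parameters in the first and last layers, so conflicting gradients at the two ends cannot interfere, while the middle layers still see both modalities.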
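Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Unified understanding and generation; Large language models; 3D generation; 3D vision; Spatial understanding
TL;DR: Unified 3D Understanding and Generation via Geometric-Semantic Encoding
🎯 Motivation: Unified architectures have made strong progress on image understanding and generation, but integrating 3D tasks remains challenging and largely unexplored. The paper aims to fill this gap with a unified framework for 3D understanding and generation.
❓ Problem addressed: Existing methods struggle to integrate 3D understanding and generation. The paper proposes the first unified understanding and generation framework for 3D modalities.
🔍 Analysis: Current unified models focus on 2D image tasks. 3D tasks, with their complex spatial relations and geometric structure, have not been fully explored in unified architectures.
🛠️ Method: UniUGG uses an LLM to comprehend and decode sentences and 3D representations. Its core is a spatial decoder based on a latent diffusion model that generates high-quality 3D representations. A geometric-semantic learning strategy pretrains the vision encoder to capture semantic and geometric cues jointly.
📊 Data and experiments: Extensive experiments show advantages in visual representation, spatial understanding, and 3D generation; the abstract does not name specific datasets.
⭐ Main contributions: Proposes UniUGG, the first unified framework for 3D understanding and generation; designs a spatial decoder and a geometric-semantic learning strategy that improve spatial understanding and generation; shows advantages across several tasks.
Abstract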
Despite the impressive progress on understanding and generating images shown by recent unified architectures, the integration of 3D tasks remains challenging and largely unexplored. In this paper, we introduce UniUGG, the first unified understanding and generation framework for 3D modalities. Our unified framework employs an LLM to comprehend and decode sentences and 3D representations. At its core, we propose a spatial decoder leveraging a latent diffusion model to generate high-quality 3D representations. This allows for the generation and imagination of 3D scenes based on a reference image and an arbitrary view transformation, while retaining support for spatial visual question answering (VQA) tasks. Additionally, we propose a geometric-semantic learning strategy to pretrain the vision encoder. This design jointly captures the input's semantic and geometric cues, enhancing both spatial understanding and generation. Extensive experimental results demonstrate the superiority of our method in visual representation, spatial understanding, and 3D generation.
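Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#JEPA #VLM #video-language #efficiency
TL;DR: We introduce a vision-language model based on JEPA that achieves competitive scores while being more efficient during training and inference.
🎯 Motivation: Existing vision-language models (VLMs) are computationally inefficient in training and inference because they typically generate text autoregressively. The work explores a more efficient vision-language modeling paradigm based on the JEPA architecture.
❓ Problem addressed: Proposes a JEPA-based VLM that predicts the target text as continuous embeddings rather than discrete tokens, reducing trainable parameters and inference cost while maintaining performance.
🔍 Analysis: Classical VLMs generate text autoregressively over many discrete tokens, which is computationally costly. Learning in an abstract representation space lets the model capture task-relevant semantics while ignoring surface-level linguistic variation, which improves efficiency.
🛠️ Method: The model predicts continuous embeddings of the target text instead of generating discrete tokens. Training takes place in an abstract representation space with fewer trainable parameters; at inference, a lightweight text decoder is invoked only when text is needed.
📊 Data and experiments: On eight video classification and eight video retrieval datasets, average performance surpasses CLIP, SigLIP2, and Perception Encoder. On four VQA datasets the model is comparable to classical VLMs despite having only 1.6B parameters.
⭐ Main contributions: Proposes VL-JEPA, which has 50% fewer trainable parameters than a comparable token-space VLM and whose selective decoding reduces decoding operations by about 2.85×. The model supports classification, retrieval, and VQA without any architecture modification.
Abstract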
We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA predicted embeddings into text. We show that VL-JEPA natively supports selective decoding that reduces the number of decoding operations by ~2.85× while maintaining similar performance compared to non-adaptive uniform decoding. Beyond generation, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architecture modification. On eight video classification and eight video retrieval datasets, the average performance of VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves comparable performance to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets—GQA, TallyQA, POPE, and POPEv2—despite having only 1.6B parameters.
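Selective decoding could be realized in several ways; the sketch below shows one assumed policy in which the text decoder is invoked only when the predicted embedding has drifted far enough from the last decoded one. The drift criterion, the `threshold` value, and the toy `decode` function are illustrative and are not taken from the paper.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def selective_decode(embeddings, decode, threshold=0.3):
    """Decode a stream of predicted embeddings, reusing the last text while the embedding is stable."""
    outputs, calls = [], 0
    last_emb, last_text = None, None
    for emb in embeddings:
        if last_emb is None or cosine_distance(emb, last_emb) > threshold:
            last_text = decode(emb)  # the expensive step, run only on a semantic change
            last_emb = emb
            calls += 1
        outputs.append(last_text)
    return outputs, calls


# Toy stream of predicted embeddings and a stand-in for the lightweight text decoder.
stream = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
toy_decoder = lambda e: "person walking" if e[0] > e[1] else "person sitting"
print(selective_decode(stream, toy_decoder))
```

In this toy stream the decoder runs twice for four predictions; uniform decoding would run it four times.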
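Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#video-based 3D MLLM #geometric priors #Cross-Task Adapter #Metric Depth calibration
🎯 Motivation: Existing 3D multimodal large language models rely on 3D data inputs, which limits their scalability and generalization. The work aims at 3D scene understanding from video input alone, to make real-world deployment practical.
❓ Problem: To remove the dependence of 3D-MLLMs on 3D data inputs, the paper proposes Vid-LLM, which processes video directly without external 3D data.
🔍 Observations: 2D vision-language reasoning has advanced substantially, but 3D scene understanding is still bottlenecked by data dependence, which hinders broad use in real-world settings.
🛠️ Method: Designs a Cross-Task Adapter (CTA) module that aligns 3D geometric priors with vision-language representations; introduces a Metric Depth Model that recovers real-scale geometry to ensure geometric consistency; fine-tunes with a two-stage distillation optimization strategy for stable training.
📊 Data & Experiments: Extensive experiments on multiple benchmarks covering 3D question answering, 3D dense captioning, and 3D visual grounding verify the model's effectiveness and multi-task capability.
⭐ Contributions: Proposes Vid-LLM, a video-based 3D-MLLM that needs no external 3D data; integrates geometric priors with multimodal representations; demonstrates strong performance on 3D scene understanding tasks.
Abstract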
Recent developments in Multimodal Large Language Models (MLLMs) have significantly improved Vision–Language (VL) reasoning in 2D domains. However, extending these capabilities to 3D scene understanding remains a major challenge. Existing 3D Multimodal Large Language Models (3D-MLLMs) often depend on 3D data inputs, which limits scalability and generalization. To address this limitation, we propose Vid-LLM, a video-based 3D-MLLM that directly processes video inputs without requiring external 3D data, making it practical for real-world deployment. In our method, geometric priors are used directly to improve scene perception. To integrate the geometric cues into the MLLM compactly, we design a Cross-Task Adapter (CTA) module to align the 3D geometric priors with the vision-language representations. To ensure geometric consistency and integrity, we introduce a Metric Depth Model that recovers real-scale geometry from the reconstruction outputs. Finally, the model is fine-tuned with a two-stage distillation optimization strategy, which yields fast convergence and stabilizes training. Extensive experiments across diverse benchmarks verify the effectiveness of our method on 3D Question Answering, 3D Dense Captioning, and 3D Visual Grounding tasks, demonstrating its superior multi-task capabilities.
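Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Video; Diffusion; LLM
TL;DR:Video-GPT treats video as a new language for visual world modeling.
🎯 Motivation: Inspired by the success of GPT in NLP, the authors argue that language sequences cannot fully describe the spatial-temporal details of the visual world. Video sequences capture those details better, so video is treated as a new language for modeling the visual world.
❓ Problem: Addresses the challenge of modeling spatial-temporal detail in video, and the limitations of existing methods on both short-term and long-term video tasks.
🔍 Observations: Video is a natural carrier of spatial-temporal sequences and can be modeled by analogy to language sequences, but traditional methods struggle to handle generation and prediction in a unified way.
🛠️ Method: Proposes Video-GPT with a "next clip diffusion" pretraining paradigm: the model autoregressively denoises a noisy clip conditioned on the clean clips in its history, which unifies short-term generation and long-term prediction.
📊 Data & Experiments: Achieves state-of-the-art video prediction on the Physics-IQ Benchmark (34.97, versus 23.64 for Kling and 20.89 for Wan), and is adapted to 6 mainstream video tasks spanning generation and understanding.
⭐ Contributions: Treats video as a "new language" for visual modeling; the proposed next clip diffusion paradigm unifies video generation and prediction and shows strong generalization across tasks.
Abstract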
GPT has shown remarkable success in natural language processing. However, the language sequence is not sufficient to describe spatial-temporal details in the visual world. Alternatively, the video sequence is good at capturing such details. Motivated by this fact, we propose a concise Video-GPT in this paper by treating video as a new language for visual world modeling. By analogy to next token prediction in GPT, we introduce a novel next clip diffusion paradigm for pretraining Video-GPT. Unlike previous work, this paradigm allows Video-GPT to tackle both short-term generation and long-term prediction, by autoregressively denoising the noisy clip according to the clean clips in the history. Extensive experiments show that Video-GPT achieves state-of-the-art performance on video prediction, which is the key factor towards world modeling (Physics-IQ Benchmark: Video-GPT 34.97 vs. Kling 23.64 vs. Wan 20.89). Moreover, it can be well adapted to 6 mainstream video tasks in both video generation and understanding, showing its strong generalization capacity on downstream tasks.
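Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Multimodal Large Language Models #Self-supervised Learning #Post-training #Reinforcement Learning #Visual Jigsaw
🎯 Motivation: Current reinforcement-learning-based post-training for multimodal large language models is text-centric: dense visual inputs are used only to extract sparse cues, so the model's intrinsic understanding of visual signals is not fundamentally improved.
❓ Problem: Proposes Visual Jigsaw, a self-supervised, vision-centric post-training framework that strengthens fine-grained, temporal, and 3D visual understanding in MLLMs without extra visual generative components or human annotation.
🔍 Observations: Existing vision-related post-training methods still rely on text as an intermediary or add visual generative designs, and do not exploit the visual signal itself for self-supervised learning, which limits gains in visual understanding.
🛠️ Method: Visual inputs are partitioned and shuffled, and the model must output the correct ordering in natural language. This ordering task is combined with reinforcement learning from verifiable rewards (RLVR) to form a generic self-supervised post-training paradigm.
📊 Data & Experiments: The framework is instantiated on three visual modalities (images, videos, and 3D data); experiments show substantial improvements in fine-grained perception, temporal reasoning, and 3D spatial understanding.
⭐ Contributions: Proposes a generic self-supervised visual post-training framework that combines an ordering task with RLVR; demonstrates its effectiveness across visual modalities and aims to inspire further research on vision-centric pretext task design.
Abstract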
Reinforcement learning based post-training has recently emerged as a powerful paradigm for enhancing the alignment and reasoning capabilities of multimodal large language models (MLLMs). While *vision-centric* post-training is crucial for enhancing MLLMs’ intrinsic understanding of visual signals, current post-training paradigms are predominantly *text-centric*, where dense visual inputs are only leveraged to extract sparse cues for text-based reasoning. There exist a few approaches in this direction, however, they often still rely on text as an intermediate mediator or introduce additional visual generative designs. In this work, we introduce **Visual Jigsaw**, a generic *self-supervised* post-training framework designed to strengthen visual understanding in MLLMs. Visual Jigsaw is formulated as a general ordering task: visual inputs are partitioned, shuffled, and the model must reconstruct the visual information by producing the correct permutation in natural language. This naturally aligns with reinforcement learning from verifiable rewards (RLVR), requires no additional visual generative components, and derives its supervisory signal automatically without any annotations. We instantiate Visual Jigsaw across three visual modalities, including images, videos, and 3D data. Extensive experiments demonstrate substantial improvements in fine-grained perception, temporal reasoning, and 3D spatial understanding. Our findings highlight the potential of self-supervised vision-centric tasks in post-training MLLMs and aim to inspire further research on vision-centric pretext designs.
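The ordering task described above lends itself to a verifiable reward, because the correct permutation is known by construction. The following is a minimal sketch under stated assumptions: the answer format (a space- or comma-separated list of indices), the exact-match binary reward, and all function names are illustrative choices, not the paper's specification.

```python
import random

def make_jigsaw(pieces, seed=0):
    """Shuffle pieces and return (shuffled pieces, ground-truth answer).

    answer[k] is the position in the shuffled list of the piece that
    originally sat in slot k, so [shuffled[i] for i in answer] == pieces.
    """
    rng = random.Random(seed)
    order = list(range(len(pieces)))
    rng.shuffle(order)                      # order[i] = original index shown at position i
    shuffled = [pieces[j] for j in order]
    answer = [order.index(k) for k in range(len(pieces))]
    return shuffled, answer

def parse_permutation(text, n):
    """Parse a model reply such as '2, 0, 1'; return None if it is not a permutation of 0..n-1."""
    try:
        perm = [int(tok) for tok in text.replace(",", " ").split()]
    except ValueError:
        return None
    return perm if sorted(perm) == list(range(n)) else None

def jigsaw_reward(text, answer):
    """Binary verifiable reward (an assumption): 1.0 only for the exact permutation."""
    perm = parse_permutation(text, len(answer))
    return 1.0 if perm == answer else 0.0

shuffled, answer = make_jigsaw(list("ABCD"), seed=1)
reconstructed = [shuffled[i] for i in answer]
```

The supervisory signal comes entirely from the shuffle itself, which matches the abstract's point that no annotations are required. A partial-credit reward would be a natural variation, though the abstract does not say whether one is used.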
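Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Image Reconstruction #Image Generation
TL;DR:A family of powerful discrete visual tokenizers designed to resolve the long-standing conflict between compression efficiency and reconstruction fidelity.
🎯 Motivation: Existing visual tokenizers face a long-standing conflict between compression efficiency and reconstruction fidelity, and a better design is needed to improve visual generation.
❓ Problem: Design a new discrete visual tokenizer that achieves both a high compression ratio and high reconstruction quality, improving downstream visual generation.
🔍 Observations: Prior tokenizers are limited by memory and computation, so reconstruction quality drops at high compression ratios; their decoders also lack a noise-variable prior and therefore cannot model fine visual detail probabilistically.
🛠️ Method: Proposes the WeTok tokenizer with two components: Group-wise lookup-free Quantization (GQ), which improves compression efficiency and codebook scalability, and Generative Decoding (GD), which models the distribution of visual data conditioned on the discrete tokens.
📊 Data & Experiments: On the ImageNet 50k validation set, the high-fidelity setting reaches a zero-shot rFID of 0.12, better than FLUX-VAE (0.18) and SD-VAE 3.5 (0.19); at a 768× compression ratio it reaches a zero-shot rFID of 3.49, substantially better than Cosmos (4.57).
⭐ Contributions: Through group-wise quantization and generative decoding, addresses the long-standing tension between compression and reconstruction in visual tokenizers and reports leading results on the evaluated benchmarks.
Abstract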
The visual tokenizer is a critical component for vision generation. However, existing tokenizers often face an unsatisfactory trade-off between compression ratio and reconstruction fidelity. To fill this gap, we introduce a powerful and concise WeTok tokenizer, which surpasses the previous leading tokenizers via two core innovations. (1) Group-wise lookup-free Quantization (GQ). We partition the latent features into groups, and perform lookup-free quantization for each group. As a result, GQ can efficiently overcome memory and computation limitations of prior tokenizers, while achieving a reconstruction breakthrough with more scalable codebooks. (2) Generative Decoding (GD). Different from prior tokenizers, we introduce a generative decoder with a prior of extra noise variable. In this case, GD can probabilistically model the distribution of visual data conditioned on discrete tokens, allowing WeTok to reconstruct visual details, especially at high compression ratios. On the ImageNet 50k validation set, at a high-fidelity setting, WeTok achieves a record-low zero-shot rFID of 0.12, outperforming leading continuous tokenizers like FLUX-VAE (0.18) and SD-VAE 3.5 (0.19) with 400% compression ratio. Furthermore, in a high-compression regime, WeTok achieves a zero-shot rFID of 3.49 at a 768× compression ratio, substantially surpassing Cosmos, which scores 4.57 at only 50% of our compression ratio.
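To make the idea behind GQ concrete, the sketch below applies sign-based lookup-free quantization independently to each channel group. It assumes the common lookup-free formulation in which every latent dimension is binarized to ±1 and a group's token index is the integer encoded by its bits; the function name, the bit ordering, and the absence of any training-time machinery (straight-through gradients, entropy losses) are simplifications for illustration.

```python
import numpy as np

def groupwise_lfq(z, num_groups):
    """Quantize the last axis of z group by group, with no codebook lookup.

    z:          (..., d) latent features
    num_groups: number of channel groups; d must be divisible by it
    Returns (q, indices): q has entries in {-1, +1} with z's shape, and
    indices has shape (..., num_groups), each in [0, 2**(d // num_groups)).
    """
    d = z.shape[-1]
    assert d % num_groups == 0, "channel dim must split evenly into groups"
    g = d // num_groups
    q = np.where(z > 0, 1.0, -1.0)                      # sign quantization
    bits = (q > 0).astype(np.int64).reshape(*z.shape[:-1], num_groups, g)
    weights = 2 ** np.arange(g)                         # little-endian bit weights
    indices = (bits * weights).sum(axis=-1)
    return q, indices

z = np.array([[0.3, -1.2, 0.7, 0.1, -0.5, -0.2]])
q, indices = groupwise_lfq(z, num_groups=3)
```

With `d` channels in one group the implicit codebook has `2**d` entries, whereas splitting into groups of size `g` keeps each per-group index in a range of only `2**g`. That is one way to read the abstract's claim about overcoming memory and computation limits.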
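Foundation/Frontier Models (incl. LLMs)
Multimodal Foundation Models
#Multi-Modal LLMs #Spatial Reasoning #3D Vision
TL;DR:We introduce pySpatial, a visual programming framework that flexibly composes spatial tools (e.g., 3D reconstruction, camera movements, and novel view synthesis) to enable MLLMs to explicitly reason in 3D space for diverse spatial reasoning tasks.
🎯 Motivation: Multi-modal large language models (MLLMs) are strong at general-purpose perception and reasoning but remain limited on tasks that require genuine 3D spatial understanding and struggle with spatial relations. Existing methods lack explicit modeling of structured 3D space.
❓ Problem: Proposes pySpatial, a visual programming framework that generates Python code to call spatial tools, turning raw 2D image sequences into an explorable 3D scene so that MLLMs can reason explicitly over structured spatial representations. The method needs no gradient-based fine-tuning and operates zero-shot.
🔍 Observations: MLLMs perform poorly on 3D spatial reasoning involving depth, occlusion, and viewpoint change. The underlying cause is the lack of a mechanism for explicitly manipulating and reasoning about scene geometry and topology; the models rely only on implicit 2D visual features.
🛠️ Method: Built on visual programming: given an image sequence and a natural-language query, the model composes function calls to spatial tools such as 3D reconstruction, camera-pose recovery, and novel-view synthesis, building an explorable 3D environment that supports subsequent reasoning.
📊 Data & Experiments: Evaluated on the MindCube and Omni3D-Bench benchmarks, where pySpatial consistently surpasses strong MLLM baselines (for example, it outperforms GPT-4.1-mini by 12.94% on MindCube). Real-world indoor navigation experiments further validate the route plans it generates.
⭐ Contributions: Designs a zero-shot, training-free spatial reasoning framework that lifts 2D vision to explicit 3D reasoning through tool composition; reports clear gains on multiple benchmarks and shows practical potential in real-world tasks such as robot navigation.
Abstract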
Multi-modal Large Language Models (MLLMs) have demonstrated strong capabilities in general-purpose perception and reasoning, but they still struggle with tasks that require spatial understanding of the 3D world. To address this, we introduce pySpatial, a visual programming framework that equips MLLMs with the ability to interface with spatial tools via Python code generation. Given an image sequence and a natural-language query, the model composes function calls to spatial tools including 3D reconstruction, camera-pose recovery, novel-view rendering, etc. These operations convert raw 2D inputs into an explorable 3D scene, enabling MLLMs to reason explicitly over structured spatial representations. Notably, pySpatial requires no gradient-based fine-tuning and operates in a fully zero-shot setting. Experimental evaluations on the challenging MindCube and Omni3D-Bench benchmarks demonstrate that our framework pySpatial consistently surpasses strong MLLM baselines; for instance, it outperforms GPT-4.1-mini by 12.94% on MindCube. Furthermore, we conduct real-world indoor navigation experiments where the robot can successfully traverse complex environments using route plans generated by pySpatial, highlighting the practical effectiveness of our approach. Our project website will be available at https://pySpatial.github.io.
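Model Architecture (74 papers)
Foundation/Frontier Models (incl. LLMs)
Model Architecture
#attention sinks #compression valleys #deep transformer-based LLMs
🎯 Motivation: Attention sinks and compression valleys are two notable phenomena in large language models, but they have been studied in isolation. The authors investigate how the two are connected and where they come from.
❓ Problem: Proves theoretically that both attention sinks and compression valleys stem from massive activations in the residual stream, and bounds the entropy reduction caused by the resulting compression.
🔍 Observations: Experiments show that when the beginning-of-sequence token develops extreme activation norms in the middle layers, attention sinks and compression valleys emerge simultaneously; the effect is tied to how the model organizes computation across depth.
🛠️ Method: Proposes the Mix-Compress-Refine theory of information flow, which divides Transformer computation into three phases: broad mixing in the early layers, compressed computation in the middle layers, and selective refinement in the late layers.
📊 Data & Experiments: Experiments on several models ranging from 410M to 120B parameters confirm the theoretical predictions, and targeted ablations further support the framework.
⭐ Contributions: Unifies attention sinks and compression valleys, proposes the Mix-Compress-Refine theory to explain how LLMs organize information flow, and helps explain why embedding tasks perform best at intermediate layers while generation tasks benefit from full-depth processing.
Abstract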
Attention sinks and compression valleys have attracted significant attention as two puzzling phenomena in large language models, but have been studied in isolation. In this work, we present a surprising connection between attention sinks and compression valleys, tracing both to the formation of massive activations in the residual stream. We prove theoretically that massive activations necessarily produce representational compression and establish bounds on the resulting entropy reduction. Through experiments across several models (410M--120B parameters), we confirm that when the beginning-of-sequence token develops extreme activation norms in the middle layers, both compression valleys and attention sinks emerge simultaneously. Targeted ablation validates our theoretical predictions. This unified view motivates us to propose the Mix-Compress-Refine theory of information flow, as an attempt to explain how LLMs organize their computation in depth by controlling attention and representational compression via massive activations. Specifically, we posit that Transformer-based LLMs process tokens in three distinct phases: (1) broad mixing in the early layers, (2) compressed computation with limited mixing in the middle layers, and (3) selective refinement in the late layers. Our framework helps explain why embedding tasks perform best at intermediate layers, whereas generation tasks benefit from full-depth processing, clarifying differences in task-dependent representations.
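The link between a massive activation and representational compression can be illustrated numerically. The sketch below uses the entropy of the normalized squared singular-value spectrum of a token-by-feature matrix as the compression measure. That is one common choice, and whether it matches the paper's exact definition is an assumption; the matrix sizes and the 1000× scaling are arbitrary toy values.

```python
import numpy as np

def spectral_entropy(X):
    """Entropy of the normalized squared singular values of X (tokens x features).

    Low entropy means the representation is dominated by a few directions,
    i.e. it is more "compressed".
    """
    s = np.linalg.svd(X, compute_uv=False)
    p = s**2 / np.sum(s**2)
    p = p[p > 0]                                  # avoid log(0)
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 16))                     # 32 tokens, 16 features, all similar norms

X_massive = X.copy()
X_massive[0] *= 1000.0                            # token 0 plays the role of a massive-norm BOS token

h_plain = spectral_entropy(X)
h_massive = spectral_entropy(X_massive)
```

In this toy setting a single huge-norm row captures almost all of the spectral mass, so `h_massive` collapses toward zero while `h_plain` stays near its maximum of `log(16)`. This mirrors, in miniature, the claim that massive activations necessarily produce compression.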
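Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Large language model #permutation language model
🎯 Motivation: Diffusion language models offer any-order generation and bidirectional conditioning, but their single-step dependency limits modeling depth and stability, so generation quality and practicality need to improve.
❓ Problem: Reformulate diffusion-style training into a language generation model with structured multi-layer dependencies and greater flexibility, while retaining generation quality and decoding efficiency.
🔍 Observations: Diffusion language models often yield lower sample quality and stability than autoregressive models, a limitation that shows up particularly on complex language tasks.
🛠️ Method: Proposes the A3 framework, which extends the autoregressive factorization to arbitrary token groups and generation orders, implemented with a two-stream attention architecture and a progressive adaptation strategy.
📊 Data & Experiments: A3 is evaluated on question answering, commonsense reasoning, and story infilling; it outperforms diffusion-based models while keeping decoding flexible.
⭐ Contributions: Provides a unified framework for language modeling that combines the probabilistic rigor of autoregressive models with the bidirectional flexibility of diffusion models.
Abstract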
Diffusion language models enable any-order generation and bidirectional conditioning, offering appealing flexibility for tasks such as infilling, rewriting, and self-correction. However, their formulation—predicting one part of a sequence from another within a single-step dependency—limits modeling depth and often yields lower sample quality and stability than autoregressive (AR) models. To address this, we revisit autoregressive modeling as a foundation and reformulate diffusion-style training into a structured multi-group prediction process. We propose Any-order Any-subset Autoregressive modeling (A3), a generalized framework that extends the standard AR factorization to arbitrary token groups and generation orders. A3 preserves the probabilistic rigor and multi-layer dependency modeling of AR while inheriting diffusion models' flexibility for parallel and bidirectional generation. We implement A3 through a two-stream attention architecture and a progressive adaptation strategy that transitions pretrained AR models toward any-order prediction. Experiments on question answering, commonsense reasoning, and story infilling demonstrate that A3 outperforms diffusion-based models while maintaining flexible decoding. This work offers a unified approach for a flexible, efficient, and novel language modeling paradigm. Code is at https://github.com/PKU-ML/Any-order-Any-subset-AR.
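The generalization described above can be written as a worked equation. The notation here is an assumption introduced for illustration (the abstract gives no formula): the positions $\{1,\dots,T\}$ are partitioned into groups $G_1,\dots,G_K$, and $\sigma$ is a permutation giving the order in which groups are generated.

```latex
% Any-order, any-subset factorization (notation assumed for illustration)
p_\theta(x) \;=\; \prod_{k=1}^{K}
  p_\theta\!\left( x_{G_{\sigma(k)}} \,\middle|\, x_{G_{\sigma(1)}}, \dots, x_{G_{\sigma(k-1)}} \right),
\qquad \bigsqcup_{k=1}^{K} G_k = \{1, \dots, T\}.

% Standard left-to-right AR is the special case K = T, G_k = \{k\}, \sigma = \mathrm{id}:
p_\theta(x) \;=\; \prod_{t=1}^{T} p_\theta\!\left( x_t \,\middle|\, x_1, \dots, x_{t-1} \right).
```

Read this way, each factor still conditions on everything generated so far, which is the multi-step dependency the abstract contrasts with single-step diffusion prediction. Larger groups and free ordering give the parallel, bidirectional flexibility.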
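Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Bidirectional models #Transformer-based encoders
TL;DR:We propose a new bidirectional attention-free encoder.
🎯 Motivation: Compact bidirectional encoders remain a key technology for industrial NLP under tight compute and memory budgets. Their strength comes from self-attention, which delivers high-quality contextualization with sequence-level parallelism. Bidirectional encoders that do not rely on attention are a promising alternative to explore.
❓ Problem: Existing Transformer-based bidirectional encoders leave room for improvement in long-sequence efficiency and use of compute. The goal is an attention-free encoder architecture that handles long contexts more efficiently.
🔍 Observations: Comparing existing encoder architectures suggests that attention, while flexible, is a bottleneck in compute and memory. Avey, an autoregressive attention-free architecture, offers a simpler and more efficient processing pipeline.
🛠️ Method: Re-architects Avey for the encoder-only setting with decoupled static and dynamic parameterizations, stability-oriented normalization, and neural compression.
📊 Data & Experiments: On standard token-classification and information-retrieval benchmarks, the reformulated architecture consistently outperforms four widely used Transformer-based encoders and scales more efficiently to long contexts.
⭐ Contributions: Proposes an efficient bidirectional attention-free encoder by reformulating Avey for the encoder-only paradigm with several architectural innovations, improving both task performance and long-context scaling on key NLP tasks.
Abstract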
Compact pretrained bidirectional encoders remain the backbone of industrial NLP under tight compute and memory budgets. Their effectiveness stems from self-attention’s ability to deliver high-quality bidirectional contextualization with sequence-level parallelism, as popularized by BERT-style architectures. Recently, **Avey** was introduced as an autoregressive, attention-free alternative that naturally admits an encoder-only adaptation. In this paper, we reformulate Avey for the encoder-only paradigm and propose several innovations to its architecture, including decoupled static and dynamic parameterizations, stability-oriented normalization, and neural compression. Results show that this reformulated architecture compares favorably to four widely used Transformer-based encoders, consistently outperforming them on standard token-classification and information-retrieval benchmarks while scaling more efficiently to long contexts.
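Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Large Language Models #Diffusion Language Model #Training-Free
🎯 Motivation: Diffusion large language models (DLLMs) offer parallel generation and global context modeling, but the architectural requirement of a fixed generation length forces a trade-off between task performance and computational efficiency.
❓ Problem: Removes the fixed-generation-length limitation of DLLMs with DAEDAL, a training-free strategy for dynamic adaptive length expansion.
🔍 Observations: Although the inference framework is rigid, the model's internal signals correlate with the optimal response length for a given task, which makes dynamic length allocation possible.
🛠️ Method: DAEDAL works in two phases. Before denoising, it starts from a short initial length and iteratively expands it to a coarse task-appropriate length, guided by a sequence completion metric. During denoising, it expands insufficient regions by inserting mask tokens.
📊 Data & Experiments: Experiments on DLLMs show that DAEDAL matches, and in some cases exceeds, meticulously tuned fixed-length baselines while improving computational efficiency through a higher effective token ratio.
⭐ Contributions: Resolves the static length constraint of DLLMs, improving both performance and generation efficiency and narrowing a gap with autoregressive models.
Abstract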
Diffusion Large Language Models (DLLMs) are emerging as a powerful alternative to the dominant Autoregressive Large Language Models, offering efficient parallel generation and capable global context modeling. However, the practical application of DLLMs is hindered by a critical architectural constraint: the need for a statically predefined generation length. This static length allocation leads to a problematic trade-off: insufficient lengths cripple performance on complex tasks, while excessive lengths incur significant computational overhead and sometimes result in performance degradation. While the inference framework is rigid, we observe that the model itself possesses internal signals that correlate with the optimal response length for a given task. To bridge this gap, we leverage these latent signals and introduce DAEDAL, a novel training-free denoising strategy that enables Dynamic Adaptive Length Expansion for Diffusion Large Language Models. DAEDAL operates in two phases: 1) Before the denoising process, DAEDAL starts from a short initial length and iteratively expands it to a coarse task-appropriate length, guided by a sequence completion metric. 2) During the denoising process, DAEDAL dynamically intervenes by pinpointing and expanding insufficient generation regions through mask token insertion, ensuring the final output is fully developed. Extensive experiments on DLLMs demonstrate that DAEDAL achieves performance comparable, and in some cases superior, to meticulously tuned fixed-length baselines, while simultaneously enhancing computational efficiency by achieving a higher effective token ratio. By resolving the static length constraint, DAEDAL unlocks new potential for DLLMs, bridging a critical gap with their Autoregressive counterparts and paving the way for more efficient and capable generation.
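The two phases can be sketched as plain control flow. Everything model-specific is stubbed out here: `completion_score` is a hypothetical callback standing in for the sequence completion metric, the per-position `confidence` values stand in for whatever internal signal flags an insufficient region, and the thresholds and step sizes are made-up defaults. The abstract does not specify any of these details.

```python
MASK = "<mask>"

def expand_initial_length(completion_score, init_len=32, step=32,
                          max_len=512, threshold=0.9):
    """Phase 1: grow the canvas until the completion metric says it is long enough.

    completion_score(length) -> float in [0, 1] is a hypothetical callback.
    """
    length = init_len
    while length < max_len and completion_score(length) < threshold:
        length = min(length + step, max_len)
    return length

def expand_insufficient_regions(tokens, confidence, threshold=0.3, extra=2):
    """Phase 2: replace each low-confidence masked slot with (1 + extra) mask tokens.

    Treating "low confidence at a still-masked position" as the sign of an
    insufficient region is an assumption made for this sketch.
    """
    out = []
    for tok, conf in zip(tokens, confidence):
        if tok == MASK and conf < threshold:
            out.extend([MASK] * (1 + extra))
        else:
            out.append(tok)
    return out

# Toy completion metric: pretend the task needs about 200 tokens.
toy_score = lambda n: min(1.0, n / 200.0)
coarse_len = expand_initial_length(toy_score)

expanded = expand_insufficient_regions(
    ["a", MASK, "b", MASK], [0.9, 0.1, 0.8, 0.7])
```

In a real system both functions would sit inside the denoising loop and query the DLLM; the point of the sketch is only that the length becomes a quantity adjusted at inference time rather than a fixed hyperparameter.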
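Foundation/Frontier Models (incl. LLMs)
Model Architecture
#spiking neuron models #spiking neural networks #dendritic integration #brain-inspired computing
🎯 Motivation: The standard LIF neuron model only sums input currents linearly, whereas biological neurons integrate inputs nonlinearly and can perform computations such as XOR. A model closer to biology is needed to increase computational power.
❓ Problem: Design a new neuron model that integrates inputs nonlinearly to support more complex classification tasks while keeping computational cost comparable to existing models.
🔍 Observations: Theory shows that a single DLIF neuron can capture correlations between inputs, enabling nonlinear classification; at the network level, DLIF neurons preserve and propagate correlation structure from the input layer to the readout layer.
🛠️ Method: Proposes the DLIF model, which adds a bilinear dendritic integration rule derived from neurophysiological experiments to overcome the functional limits of the LIF model.
📊 Data & Experiments: Evaluated with ResNet, VGG, and Transformer architectures on static datasets (CIFAR-10/100, ImageNet) and neuromorphic datasets (DVS-Gesture, DVS-CIFAR10); DLIF surpasses LIF and other advanced models at comparable computational cost.
⭐ Contributions: Develops the DLIF neuron model, which is both biologically plausible and computationally efficient, offering a new direction for next-generation brain-inspired computing.
Abstract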
As a widely used neuron model in Spiking Neural Networks (SNNs), the Leaky Integrate-and-Fire (LIF) model assumes the linear summation of injected currents. However, recent studies have revealed that a biological neuron can integrate inputs nonlinearly and perform computations such as XOR while an LIF neuron cannot. To bridge this gap, we propose the Dendritic LIF (DLIF) model, which incorporates a bilinear dendritic integration rule derived from neurophysiological experiments. At the single-neuron level, we theoretically demonstrate that a DLIF neuron can capture input correlations, enabling it to perform nonlinear classification tasks. At the network level, we prove that DLIF neurons can preserve and propagate correlation structures from the input layer to the readout layer. These theoretical findings are further confirmed by our numerical experiments. Extensive experiments across diverse architectures (including ResNet, VGG, and Transformer) demonstrate that DLIF achieves state-of-the-art performance on static (CIFAR-10/100, ImageNet) and neuromorphic (DVS-Gesture, DVS-CIFAR10) benchmarks, surpassing LIF and other advanced alternatives while maintaining comparable computational cost. This work provides a biologically plausible and computationally powerful spiking neuron model, paving the way for next-generation brain-inspired computing.
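The XOR claim can be checked with a toy neuron. The sketch below assumes the bilinear rule takes the form "linear sum plus pairwise product terms", with hand-picked coefficients and a very simple leaky integrate-and-fire update. The paper's actual parametrization, which is derived from neurophysiology, may differ, and none of the constants below come from it.

```python
import numpy as np

def simulate_spikes(current, steps=50, tau=2.0, v_th=0.5):
    """Count spikes of a simple LIF neuron driven by a constant input current."""
    v, spikes = 0.0, 0
    for _ in range(steps):
        v += (current - v) / tau          # leaky integration toward the input
        if v >= v_th:
            spikes += 1
            v = 0.0                       # hard reset after a spike
    return spikes

def linear_current(x, w):
    """Classic LIF input: plain weighted sum."""
    return float(np.dot(w, x))

def bilinear_current(x, w, k):
    """Assumed DLIF-style input: weighted sum plus pairwise terms k[i, j] * x_i * x_j (i < j)."""
    x = np.asarray(x, dtype=float)
    return float(np.dot(w, x) + x @ np.triu(k, 1) @ x)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
w = np.array([1.0, 1.0])
k = np.array([[0.0, -2.0],
              [0.0,  0.0]])               # negative interaction cancels the (1, 1) case

lif_fires = [simulate_spikes(linear_current(x, w)) > 0 for x in inputs]
dlif_fires = [simulate_spikes(bilinear_current(x, w, k)) > 0 for x in inputs]
```

With these weights the linear neuron fires for any active input (an OR gate), while the bilinear term suppresses the response when both inputs are on, giving XOR from a single neuron. No choice of `w` and threshold can make the purely linear version compute XOR, since XOR is not linearly separable.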
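Foundation/Frontier Models (incl. LLMs)
Model Architecture
#byte-level language modeling #tokenization
TL;DR:A tokenizer-free language model based on an information-theoretic chunker.
🎯 Motivation: Current language models rely on fixed subword tokenization, which leads to brittle and counterintuitive behavior even in strong reasoning models. A new approach is needed that adapts to the input dynamically and drops the pre-defined tokenizer.
❓ Problem: Proposes a tokenizer-free language model architecture that learns on its own how to segment raw byte streams into semantically meaningful units.
🔍 Observations: Prior self-tokenizing methods depend on brittle, human-designed heuristics and adapt poorly, while fixed tokenization limits the generality and robustness of language models.
🛠️ Method: Uses an information-theoretic, compression-driven segmentation strategy: the coding rate of latent representations measures the information cost of grouping bytes, and chunk boundaries are decided dynamically during processing.
📊 Data & Experiments: Experiments show that the proposed ByteFlow Net architecture outperforms both BPE-based Transformers and previous byte-level architectures.
⭐ Contributions: Provides evidence that end-to-end, tokenizer-free language modeling is feasible and can be more effective, and proposes an information-grounded approach to more adaptive and robust language modeling.
Abstract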
Modern language models (LMs) still rely on fixed, pre-defined subword tokenizations. Once a tokenizer is trained, the LM can only operate at this fixed level of granularity, which often leads to brittle and counterintuitive behaviors even in otherwise strong reasoning models. We introduce **ByteFlow Net**, a new architecture that removes tokenizers entirely and instead enables models to learn their own segmentation of raw byte streams into semantically meaningful units. Our approach is grounded in information theory: ByteFlow Net performs compression-driven segmentation based on the coding rate of latent representations, allowing the model to dynamically evaluate the information cost of grouping bytes and decide chunk boundaries during processing. Unlike prior self-tokenizing methods that depend on brittle heuristics with human-designed inductive biases, ByteFlow Net adapts its internal representation granularity to the input itself. Experiments demonstrate that this compression-based chunking strategy yields substantial performance gains, with ByteFlow Net outperforming both BPE-based Transformers and previous byte-level architectures. These results suggest that end-to-end, tokenizer-free modeling is not only feasible but also more effective, opening a path toward more adaptive, robust, and information-grounded language models.
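Foundation/Frontier Models (incl. LLMs)
Model Architecture
#model routing #continual routing #constrained routing #unconstrained routing
🎯 Motivation: AI tasks differ in complexity and are best handled by different computation strategies, so an effective system for routing tasks to strategies is crucial. Prior methods train a single model over all strategies and are costly to retrain, which makes continual routing difficult.
❓ Problem: Addresses three weaknesses of current routing systems: full retraining whenever a new strategy appears, poor generalization, and sub-optimal decisions caused by a single input representation.
🔍 Observations: Prior routers typically use one model for all strategies and one input representation, which limits their ability to capture the complexity of the routing problem and leads to sub-optimal routing.
🛠️ Method: Proposes CONCUR, a modular framework that trains a separate predictor for each strategy, supports both constrained and unconstrained routing (with or without a budget), and uses multiple representations of both tasks and strategies.
📊 Data & Experiments: Experiments cover in-distribution and out-of-distribution, knowledge-intensive and reasoning-intensive tasks. In both continual and non-continual settings, CONCUR achieves higher end-to-end accuracy and lower inference cost than the best single strategy and existing routing techniques, and it also reduces training cost in the continual setting.
⭐ Contributions: Builds CONCUR, a modular continual routing framework that allows new strategies to be added cheaply, improves routing performance in both continual and non-continual settings, and mitigates the high retraining cost and weak generalization of earlier methods.
Abstract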
AI tasks differ in complexity and are best addressed with different computation strategies (e.g., combinations of models and decoding methods). Hence, an effective routing system that maps tasks to the appropriate strategies is crucial.
Most prior methods build the routing framework by training a *single* model across *all* strategies, which demands full retraining whenever new strategies appear and leads to high overhead. Attempts at such continual routing, however, often face difficulties with generalization.
Prior models also typically use a *single* input representation, limiting their ability to capture the full complexity of the routing problem and leading to sub-optimal routing decisions.
To address these gaps, we propose CONCUR, a **con**tinual routing framework that supports both **c**onstrained and **u**nconstrained **r**outing (i.e., routing with or without a budget).
Our *modular* design trains a separate predictor model for each strategy, enabling seamless incorporation of new strategies with low additional training cost.
Our predictors also leverage *multiple* representations of both tasks and computation strategies to better capture overall problem complexity.
Experiments on both in-distribution and out-of-distribution, knowledge- and reasoning-intensive tasks show that our method outperforms the best single strategy and strong existing routing techniques with higher end-to-end accuracy and lower inference cost in both continual and non-continual settings, while also reducing training cost in the continual setting.
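The modular design can be sketched as one predictor per strategy plus a selection rule. In the sketch below the predictors are hand-written stubs, the costs are made-up numbers, and the "highest predicted success within budget" rule is an assumption about how constrained and unconstrained routing could be realized. The abstract does not give the actual decision rule or the representations used.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Strategy:
    name: str
    cost: float                              # hypothetical per-query cost
    predictor: Callable[[str], float]        # predicted success probability for a task

def route(task: str, strategies: List[Strategy],
          budget: Optional[float] = None) -> str:
    """Pick the strategy with the highest predicted success.

    budget=None  -> unconstrained routing
    budget=value -> constrained routing: only strategies with cost <= budget qualify
    Ties are broken in favour of the cheaper strategy.
    """
    feasible = [s for s in strategies if budget is None or s.cost <= budget]
    if not feasible:
        raise ValueError("no strategy fits the budget")
    return max(feasible, key=lambda s: (s.predictor(task), -s.cost)).name

# Stub predictors standing in for separately trained per-strategy models.
strategies = [
    Strategy("small", cost=1.0, predictor=lambda t: 0.9 if len(t) < 20 else 0.4),
    Strategy("large", cost=10.0, predictor=lambda t: 0.95),
]

unconstrained = route("2 + 2 = ?", strategies)
constrained = route("2 + 2 = ?", strategies, budget=5.0)

# Continual setting: a new strategy is appended; existing predictors are untouched.
strategies.append(Strategy("medium", cost=4.0, predictor=lambda t: 0.93))
after_update = route("2 + 2 = ?", strategies, budget=5.0)
```

Because each predictor is independent, adding "medium" required training (here, writing) only one new predictor. This is the property the abstract credits for low training cost in the continual setting.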
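Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Neural architecture search #hybrid models #efficient ML
TL;DR:An efficient search framework for hybrid neural architecture design
🎯 Motivation: Hybrid architectures that combine several computational primitives have performed well recently, but existing work explores the design space by hand, which is inefficient and costly.
❓ Problem: Design an efficient framework for exploring and scaling hybrid neural architectures, addressing the large design space and high training cost.
🔍 Observations: How the different computational primitives are interleaved in a hybrid model affects model quality, but there has been no systematic way to explore the options.
🛠️ Method: Proposes Composer, a modular hybrid architecture search framework that explores architectures at small scale and extrapolates the best ones to larger scale using the proposed scaling strategies.
📊 Data & Experiments: Experiments span models from 350M to 8B parameters; compared with Llama 3.2 and previous state-of-the-art baselines, the discovered architectures consistently reduce validation loss and improve downstream accuracy by up to 2-2.1% on average.
⭐ Contributions: Develops the Composer framework and discovers new hybrid LLM architectures that outperform Llama 3.2 and other baselines in quality while improving both training and inference efficiency.
Abstract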
Hybrid model architectures that combine computational primitives (e.g., Attention, MLP) in different ratios have shown promising performance beyond Transformers. Some studies have shown that different interleavings of primitives can affect model quality as well. However, prior works explore the hybrid model architecture design space manually. Due to the large design space and training costs, discovering hybrid models that combine key computational primitives for pre-training is challenging. In this work, we take a principled approach in designing a modular hybrid model architecture search framework — Composer. Composer explores model architectures at a small scale and extrapolates the top-performing model architectures to a larger scale using our proposed scaling strategies. Using Composer, we discover new hybrid LLM architectures that outperform Llama 3.2. Compared to Llama 3.2 and previous state-of-the-art baselines, the new model architectures consistently reduce validation loss at parameter scales of 350M-8B and improve evaluation accuracy on the downstream tasks by up to 2-2.1% on average while improving both training and inference efficiency.
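Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Language Models #Autoregressive Language Models #Autoregressive Image Generation
🎯 Motivation: Autoregressive models have recently shown promising results in image generation, and combining them with diffusion losses optimizes generation further, but condition errors remain a problem. The work aims to make generation more stable and consistent.
❓ Problem: Condition errors in autoregressive conditional generation make the condition distribution unstable. The paper analyzes how diffusion loss affects these errors and proposes a refinement method to address "condition inconsistency".
🔍 Observations: Theoretical analysis shows that patch denoising optimization in autoregressive models with diffusion loss mitigates condition errors and leads to a stable condition distribution, with the influence of condition error decaying exponentially.
🛠️ Method: Proposes a condition refinement approach based on Optimal Transport theory, formulating refinement as a Wasserstein Gradient Flow that is shown to converge toward the ideal condition distribution.
📊 Data & Experiments: Experiments demonstrate the method's advantage over diffusion models and autoregressive models with diffusion loss.
⭐ Contributions: Gives a theoretical comparison of conditional diffusion and autoregressive models with diffusion loss, proposes an optimal-transport-based condition refinement method, and validates it experimentally on image generation.
Abstract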
Recent studies have explored autoregressive models for image generation, with promising results, and have combined diffusion models with autoregressive frameworks to optimize image generation via diffusion losses. In this study, we present a theoretical analysis of diffusion and autoregressive models with diffusion loss, highlighting the latter's advantages. We present a theoretical comparison of conditional diffusion and autoregressive diffusion with diffusion loss, demonstrating that patch denoising optimization in autoregressive models effectively mitigates condition errors and leads to a stable condition distribution. Our analysis also reveals that autoregressive condition generation refines the condition, causing the condition error influence to decay exponentially. In addition, we introduce a novel condition refinement approach based on Optimal Transport (OT) theory to address "condition inconsistency". We theoretically demonstrate that formulating condition refinement as a Wasserstein Gradient Flow ensures convergence toward the ideal condition distribution, effectively mitigating condition inconsistency. Experiments demonstrate the superiority of our method over diffusion models and autoregressive models with diffusion loss.
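Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Mixture-of-Experts #Large language models #Auxiliary loss #Expert-router coupling #Expert specialization
🎯 Motivation: Mixture-of-Experts models lack explicit constraints that align the router's decisions with the experts' capabilities, which limits model performance.
❓ Problem: Proposes the ERC auxiliary loss, which couples router decisions with expert capabilities to improve expert specialization and model quality.
🔍 Observations: The ERC loss imposes two constraints: each expert must respond more strongly to its own proxy token (its router embedding) than to any other expert's proxy token, and each proxy token must elicit a stronger activation from its own expert than from any other expert.
🛠️ Method: Feeds perturbed router embeddings through the experts to obtain intermediate activations and applies the lightweight ERC loss to them so that experts focus on the tokens routed to them. The loss operates on only $n^2$ activations, where $n$ is the number of experts, independent of batch size.
📊 Data & Experiments: Validated by pre-training MoE-LLMs from 3B to 15B parameters with extensive analysis on trillions of tokens; the loss also allows quantitative control and tracking of expert specialization.
⭐ Contributions: Proposes a low-cost expert-router coupling method that improves Mixture-of-Experts performance and offers insight into MoE behavior.
Abstract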
Mixture-of-Experts (MoE) models lack explicit constraints to ensure the router's decisions align well with the experts' capabilities, which ultimately limits model performance. To address this, we propose expert-router coupling (ERC) loss, a lightweight auxiliary loss that tightly couples the router's decisions with expert capabilities. Our approach treats each expert's router embedding as a proxy token for the tokens assigned to that expert, and feeds perturbed router embeddings through the experts to obtain intermediate activations. The ERC loss enforces two constraints on these activations: (1) Each expert must exhibit higher activation for its own proxy token than for the proxy tokens of any other expert. (2) Each proxy token must elicit stronger activation from its corresponding expert than from any other expert. These constraints jointly ensure that each router embedding faithfully represents its corresponding expert's capability, while each expert specializes in processing the tokens actually routed to it. The ERC loss is computationally efficient, operating only on $n^2$ activations, where $n$ is the number of experts. This represents a fixed cost independent of batch size, unlike prior coupling methods that scale with the number of tokens (often millions per batch). Through pre-training MoE-LLMs ranging from 3B to 15B parameters and extensive analysis on trillions of tokens, we demonstrate the effectiveness of the ERC loss. Moreover, the ERC loss offers flexible control and quantitative tracking of expert specialization levels during training, providing valuable insights into MoEs.
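The two constraints can be expressed as a hinge penalty over an $n \times n$ activation matrix. In the sketch below, `A[i, j]` is assumed to be a scalar activation strength of expert `i` on the proxy token of expert `j`. How that scalar is computed, the `alpha` scaling factor, and the averaging over $n^2$ entries are illustrative assumptions rather than the paper's exact loss.

```python
import numpy as np

def erc_style_loss(A, alpha=1.0):
    """Hinge penalty enforcing diagonal dominance of A along rows and columns.

    Row constraint (1):    A[i, i] should be >= A[i, j]  (expert i prefers its own proxy token)
    Column constraint (2): A[j, j] should be >= A[i, j]  (proxy j prefers its own expert)
    alpha is a hypothetical scaling of the diagonal term (alpha = 1 gives the plain inequality).
    """
    n = A.shape[0]
    diag = np.diag(A)
    off = ~np.eye(n, dtype=bool)                                 # off-diagonal mask
    row_violation = np.maximum(A - alpha * diag[:, None], 0.0)   # compare with A[i, i]
    col_violation = np.maximum(A - alpha * diag[None, :], 0.0)   # compare with A[j, j]
    return float((row_violation[off].sum() + col_violation[off].sum()) / (n * n))

coupled = np.array([[2.0, 1.0],
                    [0.5, 3.0]])      # both constraints satisfied
uncoupled = np.array([[1.0, 2.0],
                      [0.5, 3.0]])    # expert 0 reacts more to expert 1's proxy token

loss_coupled = erc_style_loss(coupled)
loss_uncoupled = erc_style_loss(uncoupled)
```

The matrix has one entry per (expert, proxy token) pair, so the cost depends only on the number of experts and not on how many real tokens are in the batch. This is the efficiency argument made in the abstract.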
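Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Transformers architecture #positional encodings #Transformers theory #large language models
🎯 Motivation: Language understanding and production require encoding the positional and symbolic information of words in a sentence independently; in Transformers, this is mainly achieved through positional encodings.
❓ Problem: Studies how, and how well, existing positional encodings such as RoPE encode positional and symbolic information independently, and how attention-head behavior affects model performance.
🔍 Observations: Part of RoPE's success has been attributed to encoding positional information with large frequencies and semantic information with small frequencies; the study finds a strong correspondence between attention-head behavior and frequency use.
🛠️ Method: Gives general definitions and a quantitative metric for positional versus symbolic head behavior, proves that the two behaviors are mutually exclusive, and designs tasks to test how heads use frequencies.
📊 Data & Experiments: Analyzes RoPE-based Transformer LLMs and builds canonical tasks that are purely positional or purely symbolic to test whether performance depends on the heads' access to the appropriate frequencies.
⭐ Contributions: Clarifies how RoPE's properties relate to model behavior; provides a theoretical framework for quantifying attention-head behavior; shows that Transformer performance can be controlled by controlling which frequencies the attention heads can access.
Abstract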
An important aspect underlying language understanding and production is the ability to independently encode positional and symbolic information of the words within a sentence. In Transformers, positional information is typically encoded using Positional Encodings (PEs). One such popular PE, namely Rotary PE (RoPE), has been widely used due to its empirical success.
Recently, it has been argued that part of RoPE's success emerges from its ability to encode robust positional and semantic information using large and small frequencies, respectively. In this work, we perform a deeper dive into the positional versus symbolic dichotomy of attention-head behavior, both at the theoretical and empirical level. We provide general definitions of what it means for a head to behave positionally or symbolically, prove that these are two mutually exclusive behaviors, and develop a metric to quantify them.
We apply our framework to analyze Transformer-based LLMs using RoPE and find that all heads exhibit a strong correspondence between behavior and frequency use.
Finally, we introduce canonical tasks designed to be either purely positional or symbolic, and demonstrate that the Transformer performance causally relates to the ability of attention heads to leverage the appropriate frequencies. In particular, we show that we can control the Transformer performance by controlling which frequencies the attention heads can access. Altogether, our work provides a detailed understanding of RoPE, and how its properties relate to model behavior.
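For readers less familiar with RoPE, the sketch below implements the standard rotary encoding and splits an attention score into per-frequency contributions. The `rope` function follows the usual formulation with base 10000. The `per_frequency_scores` helper is an illustrative device for seeing which frequencies contribute to a score, not necessarily the intervention used in the paper.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary positional encoding to a 1-D vector x (even length) at position pos."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)    # one frequency per 2-D pair, large to small
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin              # rotate each (x1, x2) pair by its angle
    out[1::2] = x1 * sin + x2 * cos
    return out

def per_frequency_scores(q, k, m, n, base=10000.0):
    """Contribution of each frequency pair to the attention logit <rope(q, m), rope(k, n)>."""
    rq, rk = rope(q, m, base), rope(k, n, base)
    return rq[0::2] * rk[0::2] + rq[1::2] * rk[1::2]

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

score_a = rope(q, 5) @ rope(k, 3)      # positions (5, 3): offset 2
score_b = rope(q, 12) @ rope(k, 10)    # positions (12, 10): same offset
contributions = per_frequency_scores(q, k, 5, 3)
```

`score_a` and `score_b` agree because the logit depends only on the relative offset. Summing a subset of `contributions` gives the score a head would see if it could access only those frequencies, which is the kind of control the abstract describes.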
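Foundation/Frontier Models (incl. LLMs)
Model Architecture
#MoE #Mixture of experts #sparsity #interpretability
TL;DR:An interpretable, probabilistic sparse mixture of experts based on the Dirichlet distribution.
🎯 Motivation: Mixture-of-Experts (MoE) models perform very well in large-scale language modeling, but existing routers rely on a non-differentiable mechanism that limits performance and scalability. An interpretable, sparse routing scheme is needed to improve efficiency and give finer control over expert contributions.
❓ Problem: Standard Top-k+Softmax routing conflates two decisions, which experts to activate and how to distribute contributions among them, which hurts both performance and interpretability.
🔍 Observations: Because standard routing couples expert activation with contribution allocation, optimization is constrained, expert roles become less distinct, and learning is less efficient.
🛠️ Method: Proposes Dirichlet-Routed MoE (DirMoE), an end-to-end differentiable router built on a Dirichlet variational autoencoder framework. Expert selection is modeled by a Bernoulli component relaxed with Gumbel-Sigmoid, and contribution allocation by a Dirichlet component trained with implicit reparameterization, which decouples the two decisions.
📊 Data & Experiments: DirMoE matches or exceeds other routing methods while improving expert specialization.
⭐ Contributions: Designs a new interpretable routing mechanism, trains it with a variational ELBO objective that includes a direct sparsity penalty controlling the expected number of active experts, and introduces a hyperparameter schedule that moves the router from an exploratory to a definitive state.
Abstract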
Mixture-of-Experts (MoE) models have demonstrated exceptional performance in large-scale language models. Existing routers typically rely on non-differentiable Top-$k$+Softmax, limiting their performance and scalability. We argue that two distinct decisions, which experts to activate and how to distribute expert contributions among them, are conflated in standard Top-$k$+Softmax. We introduce Dirichlet-Routed MoE (DirMoE), a novel end-to-end differentiable routing mechanism built on a Dirichlet variational autoencoder framework. This design fundamentally disentangles the core routing problems: expert selection, modeled by a Bernoulli component, and expert contribution among chosen experts, handled by a Dirichlet component. The entire forward pass remains fully differentiable through the use of Gumbel-Sigmoid relaxation for the expert selection and implicit reparameterization for the Dirichlet distribution. Our training objective, a variational ELBO, includes a direct sparsity penalty that precisely controls the number of active experts in expectation, alongside a schedule for key hyperparameters that guides the model from an exploratory to a definitive routing state. Moreover, our DirMoE router matches or exceeds other methods while improving expert specialization.
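The two decoupled decisions can be sketched as follows: a relaxed Bernoulli gate per expert (Gumbel-Sigmoid) and a Dirichlet draw for the contribution shares. This is only a sampling sketch under stated assumptions: the way gates and shares are combined (elementwise product, then renormalization), the temperature, and all names are hypothetical, and no gradients or implicit reparameterization are implemented.

```python
import math
import random

def gumbel_sigmoid(logit, temperature, rng):
    """Relaxed Bernoulli sample: sigmoid((logit + logistic noise) / T)."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    noise = math.log(u) - math.log(1.0 - u)  # difference of two Gumbels is logistic
    return 1.0 / (1.0 + math.exp(-(logit + noise) / temperature))

def sample_dirichlet(concentration, rng):
    """Dirichlet sample obtained by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in concentration]
    total = sum(draws)
    return [d / total for d in draws]

def route(select_logits, concentration, temperature=0.5, seed=0):
    """Return soft selection gates and normalized expert weights."""
    rng = random.Random(seed)
    gates = [gumbel_sigmoid(l, temperature, rng) for l in select_logits]
    shares = sample_dirichlet(concentration, rng)
    masked = [g * s for g, s in zip(gates, shares)]  # assumed combination rule
    total = sum(masked)
    return gates, [m / total for m in masked]

gates, weights = route([2.0, -1.0, 0.5, -3.0], [1.0, 1.0, 1.0, 1.0])
```

Lowering `temperature` pushes the gates toward 0 or 1, which is the usual way such relaxations approach a hard selection.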
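Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Large Language Model #LLMs #Minimal Criterion Coevolution #Evolutionary Model Merging #Synthetic Data #Quality-Diversity #Open-endedness
TL;DR: Open-ended coevolution of LLMs and synthetic data (without explicit optimization) discovers a population of LLMs that is superior to baselines.
🎯 Motivation: Frontier model developers want to train models continually so that they acquire diverse and novel capabilities, moving beyond training tied to static datasets or reward functions.
❓ Problem: The current training paradigm requires manually launching every run with a fixed configuration, which limits open-ended growth of model capabilities.
🔍 Analysis: Coevolving models and tasks can progressively discover populations of models with novel skills, without explicit optimization, and these populations exceed baselines in capability coverage.
🛠️ Method: The AC/DC framework coevolves models and tasks in an open-ended way, evolving LLMs through model merging and tasks through synthetic data generation, so that the archive of model capabilities grows and coverage improves.
📊 Data and experiments: LLM populations produced by AC/DC achieve broad capability coverage on downstream benchmarks without explicit benchmark optimization, and performance in multi-agent best-of-N selection improves.
⭐ Contributions: A new paradigm in which coevolution yields diverse, continually innovating LLM capabilities, pointing toward open-ended LLM development.
Abstract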
Frontier model developers aim to train models continually to possess emergent, diverse capabilities.
To extend capabilities, the current pre-training and post-training paradigm requires manually starting training runs with static datasets or reward functions every time.
Addressing this limitation, our work pursues the insight that open-endedness (via the coevolution of models and tasks) can discover models with increasingly novel skills in a single run.
We introduce a new model development framework that extends coevolution to large language model (LLM) discovery, open-ended *Assessment Coevolving with Diverse Capabilities* (AC/DC).
AC/DC evolves both LLMs via model merging and natural language tasks via synthetic data generation.
AC/DC discovers growing archives of LLMs that surpass the capabilities of larger LLMs while taking up less GPU memory.
In particular, our LLM populations achieve a broader Coverage of expertise than other curated models or baselines on downstream benchmarks, without *any* explicit benchmark optimization.
Furthermore, AC/DC improves Coverage over time, continually innovates on tasks and models, and improves performance in multi-agent best-of-N selection.
Our findings highlight the potential of coevolution as a means of discovering broader sets of capabilities from base LLMs.
Overall, AC/DC brings us one step closer to a profoundly new paradigm of LLM development, where continual improvements to the diversity of model capabilities can be accelerated by leveraging existing models as stepping stones to increasingly powerful models.
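Foundation/Frontier Models (incl. LLMs)
Model Architecture
#diffusion language model #discrete diffusion #masked diffusion model #language model
TL;DR: We propose RemeDi, a new diffusion-based text generation model that introduces remasking, allowing the model to detect and resample low-confidence tokens during generation.
🎯 Motivation: Existing masked diffusion language models struggle to correct wrong tokens that have already been generated, because they lack a mechanism for identifying errors.
❓ Problem: A new remasking mechanism lets the model detect and resample low-confidence tokens during generation, improving the flexibility and quality of text generation.
🔍 Analysis: Once a token is fixed in a masked diffusion model it usually cannot be changed, so errors accumulate and degrade generation quality.
🛠️ Method: The model jointly predicts token distributions and per-token confidence scores, and is trained with a remask-aware pipeline consisting of supervised fine-tuning and reward-driven reinforcement learning.
📊 Data and experiments: Experiments on multiple datasets show that RemeDi achieves state-of-the-art results among open-source diffusion language models.
⭐ Contributions: Introduces remasking as a mechanism, designs a remask-aware training pipeline, and demonstrates improved text generation quality.
Abstract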
Mask-based Diffusion Language Models (DLMs) struggle to revise incorrect tokens: once a token is generated, it typically remains fixed. The key challenge is to identify potential errors in the inputs. In this paper, we propose Remasking-enabled Diffusion Language Model (RemeDi), a mask-based DLM that introduces remasking as another fundamental mechanism, enabling more flexible text refinement in diffusion-based text generation. To achieve this, RemeDi jointly predicts token distributions and per-token confidence scores at each step. The confidence scores determine which tokens are unmasked after the current step, allowing the model to identify low-quality tokens and remask them. These remasked tokens can be resampled with richer context in subsequent steps. We design a remask-aware pipeline to train this ability, including supervised fine-tuning, which teaches the model to detect and remask incorrect tokens in addition to predicting masked tokens, and reinforcement learning, which optimizes full generation trajectories toward higher rewards. Experiments show that RemeDi achieves state-of-the-art results among open-source DLMs on multiple datasets.
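The role of the confidence scores can be pictured with a minimal sketch of one refinement step. It assumes a hypothetical model has already produced a proposed token and a confidence for every position; the top-k selection rule and the `<mask>` symbol are illustrative choices, not details confirmed by the abstract.

```python
MASK = "<mask>"  # placeholder mask symbol (assumed)

def remask_step(proposals, confidences, num_keep):
    """Keep the `num_keep` most confident positions and remask the rest.

    Every position is ranked, so a token kept in an earlier step can be
    masked again if its confidence is now low.
    """
    order = sorted(range(len(proposals)),
                   key=lambda i: confidences[i], reverse=True)
    keep = set(order[:num_keep])
    return [proposals[i] if i in keep else MASK
            for i in range(len(proposals))]

proposals = ["the", "cat", "sat", "on", "teh", "mat"]
confidences = [0.90, 0.80, 0.70, 0.85, 0.10, 0.75]
refined = remask_step(proposals, confidences, num_keep=5)
```

In this toy call the misspelled token at index 4 has the lowest confidence, so it is the one that gets remasked and would be resampled in the next step.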
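Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Diffusion Language Models
TL;DR: We introduce a simple yet effective method for diffusion language models to perform variable-length generation.
🎯 Motivation: Diffusion language models offer flexible, any-order infilling, but the fixed-length mask constraint holds back their code infilling performance and practical use.
❓ Problem: DreamOn addresses the core limitation that diffusion language models cannot generate variable-length outputs dynamically.
🔍 Analysis: Fixed-length generation performs poorly on code infilling, especially when the mask size does not match the ideal completion length.
🛠️ Method: Length-control states are added to the diffusion process so that the model decides on its own whether to expand or contract the output. This requires only minimal changes to the training objective of existing models and no architectural changes.
📊 Data and experiments: On HumanEval-Infilling and SantaCoder-FIM, DreamOn performs on par with state-of-the-art autoregressive models and matches the oracle performance obtained with ground-truth length.
⭐ Contributions: Removes a key obstacle to variable-length generation in diffusion language models, improving their flexibility and practicality for deployment.
Abstract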
Diffusion Language Models (DLMs) present a compelling alternative to autoregressive models, offering flexible, any-order infilling without specialized prompting design. However, their practical utility is blocked by a critical limitation: the requirement of a fixed-length masked sequence for generation. This constraint severely degrades code infilling performance when the predefined mask size mismatches the ideal completion length.
To address this, we propose DreamOn, a novel diffusion framework that enables dynamic, variable-length generation. DreamOn augments the diffusion process with two length control states, allowing the model to autonomously expand or contract the output length based solely on its own predictions. We integrate this mechanism into existing DLMs with minimal modifications to the training objective and no architectural changes.
Built upon Dream-Coder-7B and DiffuCoder-7B, DreamOn achieves infilling performance on par with state-of-the-art autoregressive models on HumanEval-Infilling and SantaCoder-FIM and matches oracle performance achieved with ground-truth length.
Our work removes a fundamental barrier to the practical deployment of DLMs, significantly advancing their flexibility and applicability for variable-length generation.
Our code is available at https://github.com/DreamLM/DreamOn.
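Foundation/Frontier Models (incl. LLMs)
Model Architecture
#deep learning #architecture #tokenization
TL;DR: We introduce H-Net: an end-to-end hierarchical network that compresses raw data through a recursive, data-dependent dynamic chunking process.
🎯 Motivation: Recent progress in language models centers on learning from raw data, yet pre-processing steps such as tokenization keep models from being fully end-to-end.
❓ Problem: A dynamic chunking mechanism learns content- and context-dependent segmentation, replacing tokenization and enabling end-to-end hierarchical sequence modeling.
🔍 Analysis: With matched compute and data, a single-stage hierarchical network operating on bytes outperforms a Transformer operating on BPE tokens; additional stages of hierarchy further strengthen the model's ability to capture multiple levels of abstraction.
🛠️ Method: Dynamic chunking is trained jointly with an explicit hierarchical network (H-Net), and applying the chunking recursively improves performance and scaling.
📊 Data and experiments: H-Nets pretrained on English show improved character-level robustness and learn data-dependent chunking strategies. In domains with weak tokenization heuristics, such as Chinese, code, and DNA sequences, data efficiency improves by up to nearly 4x.
⭐ Contributions: An end-to-end hierarchical architecture that removes the tokenization pipeline and improves performance and scaling across several languages and modalities.
Abstract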
Major progress on language models (LMs) in recent years has largely resulted from moving away from specialized models designed for specific tasks to general models based on powerful architectures (e.g. the Transformer) that learn everything from raw data. Despite this trend, pre-processing steps such as tokenization remain a barrier to true end-to-end foundation models. We introduce a collection of new techniques that enable a dynamic chunking mechanism, which automatically learns content- and context-dependent segmentation strategies jointly with the rest of the model. Incorporating this into an explicit hierarchical network (H-Net) allows replacing the (implicitly hierarchical) tokenization–LM–detokenization pipeline with a single model learned fully end-to-end. When compute- and data-matched, an H-Net with one stage of hierarchy operating at the byte level outperforms a strong Transformer language model operating over BPE tokens. Iterating the hierarchy to multiple stages further increases its performance by modeling multiple levels of abstraction, demonstrating significantly better scaling with data and matching the token-based Transformer of twice its size. H-Nets pretrained on English show significantly increased character-level robustness, and qualitatively learn meaningful data-dependent chunking strategies without any heuristics or explicit supervision. Finally, the H-Net's improvement over tokenized pipelines is further increased in languages and modalities with weaker tokenization heuristics, such as Chinese and code, or DNA sequences (nearly 4x improvement in data efficiency over baselines), showing the potential of true end-to-end models that learn and scale better from unprocessed data.
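Foundation/Frontier Models (incl. LLMs)
Model Architecture
#LVLM #Multi-Image Understanding #Training-free
🎯 Motivation: Large vision-language models do well on single-image tasks but degrade when given multiple images. The core issue is cross-image information leakage: the model struggles to tell apart information from different images. Existing delimiter tokens do not solve this and need to be made more effective.
❓ Problem: A training-free method scales the hidden states of delimiter tokens to improve multi-image understanding. It reinforces intra-image interaction and limits cross-image interaction, so the model distinguishes and reasons over images more accurately. The method also helps on text-only tasks.
🔍 Analysis: Existing LVLMs mark image boundaries with delimiter tokens, but analysis shows that these tokens fail to block cross-image information leakage. Information from different inputs gets mixed, which hurts performance, especially when multiple sources must be clearly separated.
🛠️ Method: The hidden states of delimiter tokens are scaled to strengthen their effect. This reinforces information exchange within an image and suppresses unwanted cross-image interaction. The method needs no additional training or inference cost and is easy to apply to existing models.
📊 Data and experiments: Gains are shown on the multi-image benchmarks Mantis, MuirBench, MIRB, and QBench2, and on the multi-document and multi-table benchmarks TQABench, MultiNews, and WCEP-10, which supports the generality of the method.
⭐ Contributions: Analyzes why delimiter tokens fail to prevent cross-image information leakage and proposes a training-free delimiter-scaling method that improves multi-image and multi-text understanding at no extra cost.
Abstract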
Large Vision-Language Models (LVLMs) achieve strong performance on single-image tasks, but their performance declines when multiple images are provided as input.
One major reason is the cross-image information leakage, where the model struggles to distinguish information across different images.
Existing LVLMs already employ delimiter tokens to mark the start and end of each image, yet our analysis reveals that these tokens fail to effectively block cross-image information leakage.
To enhance their effectiveness, we propose a method that scales the hidden states of delimiter tokens.
This enhances the model’s ability to preserve image-specific information by reinforcing intra-image interaction and limiting undesired cross-image interactions.
Consequently, the model is better able to distinguish between images and reason over them more accurately.
Experiments show performance gains on multi-image benchmarks such as Mantis, MuirBench, MIRB and QBench2.
We further evaluate our method on text-only tasks that require clear distinction.
The method improves performance on multi-document and multi-table understanding benchmarks, including TQABench, MultiNews and WCEP-10.
Notably, our method requires no additional training or inference cost.
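The core operation, multiplying the hidden states at delimiter positions by a constant, can be sketched in a few lines. The layer at which this is applied, the scale value, and the token ids are not given in the abstract; everything below is a hypothetical stand-in using plain lists instead of tensors.

```python
def scale_delimiter_states(hidden_states, token_ids, delimiter_ids, scale):
    """Return a copy of `hidden_states` with delimiter positions scaled.

    hidden_states: one vector (list of floats) per token.
    token_ids:     token id at each position.
    delimiter_ids: set of ids that mark image (or document) boundaries.
    """
    return [
        [h * scale for h in vec] if tok in delimiter_ids else list(vec)
        for vec, tok in zip(hidden_states, token_ids)
    ]

states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
ids = [7, 99, 8]                      # 99 plays the role of a delimiter id
scaled = scale_delimiter_states(states, ids, {99}, scale=2.0)
```

Because only a multiplication at a few positions is involved, a change of this kind adds essentially nothing to inference cost, which is consistent with the claim in the abstract.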
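Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Mixture of Experts #Large Language Model #Pretraining
🎯 Motivation: The Mixture-of-Experts (MoE) architecture scales language models but is prone to expert homogenization, which limits its potential.
❓ Problem: Expert Divergence Learning is a new pre-training strategy that mitigates homogenization by encouraging functional specialization among experts.
🔍 Analysis: Homogenization makes experts functionally redundant; encouraging routing to diverge across data domains leads to more organized expert specialization.
🛠️ Method: A label-driven auxiliary loss uses domain labels to maximize the Jensen-Shannon divergence between the expert routing distributions of different domains, producing diverged routing across domains and more consistent routing within a domain.
📊 Data and experiments: MoE models of up to 15 billion parameters are pre-trained from scratch. They reach a lower language modeling loss and improve on a range of downstream benchmarks, with negligible extra training compute.
⭐ Contributions: A divergence-based expert learning strategy that mitigates homogenization in MoE models and increases functional specialization and downstream performance.
Abstract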
The Mixture-of-Experts (MoE) architecture is a powerful technique for scaling language models, yet it often suffers from expert homogenization, where experts learn redundant functionalities, thereby limiting MoE's full potential. To address this, we introduce Expert Divergence Learning, a novel pre-training strategy that explicitly encourages functional specialization among experts. Our method incorporates a label-driven auxiliary loss that leverages domain labels inherent in pre-training corpora to maximize the Jensen-Shannon Divergence between the expert routing distributions of different data domains. This optimization objective guides the model to develop diverged routing policies for varied domains and closer routing policies for the same domain, which leads to emergent and organized expert specialization. We validate our approach by pre-training MoE models of up to 15 billion parameters from scratch. Experimental results demonstrate that models trained with Expert Divergence Learning not only achieve a lower language modeling loss but also exhibit significant performance improvements across a diverse range of downstream benchmarks. Further analysis confirms that our method effectively mitigates expert homogenization and brings greater functional specialization, all with negligible computational overhead during training.
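A minimal sketch of the quantity being maximized: the Jensen-Shannon divergence between per-domain routing distributions, turned into a loss by negating its mean over domain pairs. How routing distributions are aggregated per domain, and how this term is weighted against the language modeling loss, are assumptions here; the domain names and numbers are made up.

```python
import math

def kl(p, q):
    """KL divergence with the convention 0 * log(0 / q) = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence (natural log, so at most log 2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def divergence_loss(domain_routing):
    """Negative mean pairwise JSD between per-domain routing distributions."""
    names = sorted(domain_routing)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    total = sum(jsd(domain_routing[a], domain_routing[b]) for a, b in pairs)
    return -total / len(pairs)

homogeneous = {"code": [0.25, 0.25, 0.25, 0.25],
               "math": [0.25, 0.25, 0.25, 0.25]}
specialized = {"code": [0.70, 0.20, 0.05, 0.05],
               "math": [0.05, 0.05, 0.20, 0.70]}
```

Minimizing `divergence_loss` rewards the `specialized` pattern over the `homogeneous` one, since identical routing across domains gives zero divergence.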
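Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Mixture of Experts #Game Theory
TL;DR: NAMEx uses Nash bargaining and complex momentum to merge experts more fairly and efficiently, outperforming prior methods across tasks.
🎯 Motivation: Expert merging in existing Sparse Mixture of Experts frameworks usually lacks a principled weighting mechanism, which raises efficiency and fairness concerns.
❓ Problem: A game-theoretic approach to expert merging accounts for cooperative and competitive dynamics among experts while improving the efficiency and balance of merging.
🔍 Analysis: Reframing expert merging through game theory reveals cooperative and competitive dynamics among experts and gives a more principled account of the merging process.
🛠️ Method: NAMEx incorporates Nash bargaining into the merging process and adds complex momentum to accelerate expert propagation, with theoretical convergence guarantees.
📊 Data and experiments: NAMEx is evaluated on language modeling, text classification, image classification, and zero-shot robustness under data corruption, and its scalability is shown on large models including Qwen1.5-MoE and DeepSeek-MoE.
⭐ Contributions: The NAMEx framework, which brings game theory and complex momentum to Sparse Mixture of Experts merging and improves efficiency, balance, and scalability.
Abstract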
Existing expert merging strategies for Sparse Mixture of Experts (SMoE) typically rely on input-dependent or input-independent averaging of expert parameters, but often lack a principled weighting mechanism. In this work, we reinterpret expert merging through the lens of game theory, revealing cooperative and competitive dynamics among experts. Based on this perspective, we introduce Nash Merging of Experts (NAMEx), a novel framework that incorporates Nash Bargaining into the merging process, enabling more balanced and efficient collaboration among experts. Additionally, we incorporate complex momentum into NAMEx to accelerate expert propagation with theoretical guarantees for convergence. Extensive experiments across language modeling, text classification, image classification, and zero-shot robustness under data corruption show that NAMEx consistently outperforms competing methods while integrating seamlessly with popular MoE architectures. Finally, we demonstrate NAMEx’s scalability by applying it to large-scale systems, including Qwen1.5-MoE (14B) and DeepSeek-MoE (16B), where it proves effective in both zero-shot and fine-tuning settings.
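Foundation/Frontier Models (incl. LLMs)
Model Architecture
#generative flow networks #language models
🎯 Motivation: Standard autoregressive language models generate text token by token from a fixed vocabulary, which confines them to a tree-structured state space and limits flexibility and expressiveness. Dynamic-vocabulary methods overlook that a sentence can be composed of spans of different lengths, which gives the state space a directed acyclic graph (DAG) structure.
❓ Problem: Prior language models based on Generative Flow Networks (GFlowNets) stay in tree-structured spaces and cannot fully explore and generalize over richer state spaces. A method that explicitly models the DAG structure is needed to improve the diversity and quality of span generation.
🔍 Analysis: Span generation with a dynamic vocabulary is biased toward the chosen path, so exploration of alternative compositional paths is restricted.
🛠️ Method: FoSS builds a dynamic span vocabulary by flexibly segmenting retrieved text, which makes the state space DAG-structured. With specialized reward models, GFlowNets explore diverse compositional paths to generate high-quality text.
📊 Data and experiments: FoSS improves MAUVE scores by up to 12.5% on text generation and gains 3.5% on knowledge-intensive tasks, and it keeps its advantage over strong baselines as model size, data, and retrieval corpus grow.
⭐ Contributions: The FoSS framework, which moves beyond tree-structured generation to a DAG state space and produces diverse, high-quality text over a dynamic vocabulary with improved task performance and scaling.
Abstract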
Standard autoregressive language models generate text token-by-token from a fixed vocabulary, inducing a *tree-structured state space* when viewing token sampling as an action, which limits flexibility and expressiveness. Recent work introduces dynamic vocabulary by sampling retrieved text spans but overlooks that the same sentence can be composed of spans of varying lengths, and so lacks explicit modeling of the *directed acyclic graph (DAG) state space*. This restricts exploration of compositional paths and biases generation toward the chosen path. Generative Flow Networks (GFlowNets) are powerful for efficiently exploring and generalizing over state spaces, particularly those with a DAG structure. However, prior GFlowNets-based language models operate at the token level and remain confined to tree-structured spaces, limiting their potential. In this work, we propose **F**low **o**f **S**pan**S** (**FoSS**), a principled GFlowNets framework for span generation. FoSS constructs a dynamic span vocabulary by segmenting the retrieved text flexibly, ensuring a DAG-structured state space, which allows GFlowNets to explore diverse compositional paths and improve generalization. With specialized reward models, FoSS generates diverse, high-quality text. Empirically, FoSS improves MAUVE scores by up to 12.5% over Transformer on text generation and achieves 3.5% gains on knowledge-intensive tasks, consistently outperforming state-of-the-art methods. Scaling experiments further demonstrate that FoSS benefits from larger models, more data, and richer retrieval corpora, retaining its advantage over strong baselines.
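Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Bidirectional Language Models #Information Bottleneck #Mutual Information #FlowNIB #Layer-wise Analysis #Context Understanding #Natural Language Understanding
TL;DR: Bidirectional LMs retain more mutual information per layer than unidirectional ones, and our FlowNIB method measures this to explain and predict their superior downstream performance.
🎯 Motivation: Bidirectional language models understand context better than unidirectional ones, but the theoretical reason has not been clearly explained. This work analyzes the source of the advantage from an information bottleneck perspective.
❓ Problem: How do bidirectional and unidirectional models differ in the mutual information they preserve about the input and the target, and how does that difference explain the better downstream performance of bidirectional models?
🔍 Analysis: In theory, bidirectional models preserve more mutual information about both input and target, giving richer features. Empirically, each layer of a bidirectional model carries more mutual information than the layers of unidirectional models of similar or even larger size.
🛠️ Method: FlowNIB is a lightweight post-hoc framework that takes the input data, the labels, and the layer activations and simultaneously estimates input-layer and layer-label mutual information, quantifying how much information each layer retains.
📊 Data and experiments: On natural language understanding benchmarks (e.g., GLUE), commonsense reasoning, and regression tasks, bidirectional models outperform unidirectional ones.
⭐ Contributions: An information bottleneck account of the advantage of bidirectional language models, the FlowNIB method for estimating layer-wise mutual information, and an empirical study of its relation to context understanding and downstream performance.
Abstract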
Bidirectional language models (LMs) consistently show stronger context understanding than unidirectional models, yet the theoretical reason remains unclear. We present a simple information bottleneck (IB) perspective: bidirectional representations preserve more mutual information (MI) about both the input and the target, yielding richer features for downstream tasks. We adopt a layer-wise view and hypothesize that, at comparable capacity, bidirectional layers retain more useful signal than unidirectional ones. To test this claim empirically, we present Flow Neural Information Bottleneck (FlowNIB), a lightweight, post-hoc framework capable of estimating comparable mutual information values for individual layers in LMs, quantifying how much mutual information each layer carries about the input and target. FlowNIB takes three inputs, (i) the original LM’s inputs/dataset, (ii) ground-truth labels, and (iii) layer activations, and simultaneously estimates the mutual information for both the input-layer and layer-label pairs. Empirically, bidirectional LM layers exhibit higher mutual information than those of similar, and even larger, unidirectional LMs. As a result, bidirectional LMs outperform unidirectional LMs across extensive experiments on NLU benchmarks (e.g., GLUE), commonsense reasoning, and regression tasks, demonstrating superior context understanding.
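Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Sequence Modeling #Attention #Transformer
🎯 Motivation: Standard attention stores keys and values losslessly but reads them through a per-head convex average, which prevents channel-wise selection and limits how flexibly information can be combined.
❓ Problem: A new read mechanism removes the channel-selection restriction of standard attention while keeping the asymptotic complexity unchanged.
🔍 Analysis: Standard attention uses the $(q,k)$ scoring distribution directly as read weights, so the read cannot be adjusted after the fact using information in the values.
🛠️ Method: The Free Energy Mixer (FEM) applies a value-driven, per-channel log-linear tilt to the prior given by $(q,k)$ and produces a posterior read, enabling channel-wise selection while preserving parallelism and complexity.
📊 Data and experiments: Across NLP, vision, and time-series tasks, FEM consistently outperforms strong baselines, including standard attention and linear RNNs/SSMs, at matched parameter budgets.
⭐ Contributions: A free-energy-based attention read that lifts the channel-selection restriction and improves a range of sequence modeling tasks at unchanged complexity.
Abstract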
Standard attention stores keys/values losslessly but reads them via a per-head convex average, blocking channel-wise selection. We propose the Free Energy Mixer (FEM): a free-energy (log-sum-exp) read that applies a value-driven, per-channel log-linear tilt to a fast prior (e.g., from queries/keys in standard attention) over indices. Unlike methods that attempt to improve and enrich the $(q,k)$ scoring distribution, FEM treats it as a prior and yields a value-aware posterior read at unchanged complexity, smoothly moving from averaging to per-channel selection as the learnable inverse temperature increases, while still preserving parallelism and the original asymptotic complexity ($O(T^2)$ for softmax; $O(T)$ for linearizable variants). We instantiate a two-level gated FEM that is plug-and-play with standard and linear attention, linear RNNs and SSMs. It consistently outperforms strong baselines on NLP, vision, and time-series at matched parameter budgets.
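The log-sum-exp read described above can be written per channel $c$ as $\frac{1}{\beta}\log\sum_i p_i \exp(\beta v_{i,c})$, where $p$ is the prior over indices and $\beta$ an inverse temperature. The sketch below implements only this generic formula under assumed names; the gated two-level parameterization from the paper is not reproduced.

```python
import math

def softmax(scores):
    """Numerically stable softmax, used here as the (q, k) prior."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def free_energy_read(prior, values, beta):
    """Per-channel read (1 / beta) * log sum_i prior[i] * exp(beta * values[i][c])."""
    out = []
    for c in range(len(values[0])):
        column = [v[c] for v in values]
        shift = max(column)  # subtract the max for numerical stability
        total = sum(p * math.exp(beta * (x - shift))
                    for p, x in zip(prior, column))
        out.append(shift + math.log(total) / beta)
    return out

prior = softmax([2.0, 0.5, -1.0])
values = [[1.0, -2.0], [0.0, 3.0], [-1.0, 0.5]]

averaged = [sum(p * v[c] for p, v in zip(prior, values)) for c in range(2)]
soft_read = free_energy_read(prior, values, beta=1e-4)   # close to the average
hard_read = free_energy_read(prior, values, beta=1e3)    # close to per-channel max
```

With a small `beta` the read reduces to the usual convex average, while a large `beta` lets each channel pick its own index (here index 0 for channel 0 and index 1 for channel 1), which is the averaging-to-selection transition the abstract describes.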
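Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Rotary Position Embedding #Position Interpolation #Extrapolation #Large Language Model
TL;DR: We show that the location of the dominant frequency band is governed jointly by the base and the training sequence length.
🎯 Motivation: Rotary Position Embedding (RoPE) is widely used in large language models, but the relation between its base $\theta$ and long-context performance is not well understood. It is worth studying how to choose the base to balance interpolation and extrapolation.
❓ Problem: Show how the location of the RoPE frequency band depends jointly on the base and the training sequence length, and analyze the trade-off between interpolation and extrapolation.
🔍 Analysis: A high-norm "frequency band" dimension appears consistently across models, and the dimensions with frequencies below the band are only weakly used. The location of the band is determined by the base and the training length, and the band forms early in pre-training.
🛠️ Method: Replace the RoPE dimensions below the frequency band with NoPE at inference time and vary the base $\theta$ to study how the band forms and how it affects interpolation and extrapolation.
📊 Data and experiments: Experiments on several language models locate the band, vary the base, and extend the context, measuring interpolation and extrapolation under each condition.
⭐ Contributions: Identifies the role of the frequency band in RoPE and its dependence on the base and training length, clarifies the interpolation-extrapolation trade-off governed by the base, and gives practical guidance for choosing $\theta$ when extending context.
Abstract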
Rotary Position Embeddings (RoPE) are widely adopted in LLMs, and it is commonly believed that larger base frequencies $\theta$ yield better long-context performance. In this paper, we show that a high-norm RoPE dimension, referred to as the “frequency band,” consistently emerges across multiple models, and we focus on this band to reveal the trade-offs inherent in RoPE. We find that replacing the RoPE dimensions below the frequency band with NoPE during inference has little effect on performance, indicating that these lower-frequency dimensions are only weakly utilized. We further find that the location of the frequency band depends on the RoPE base $\theta$ and the training sequence length. Moreover, the band forms early during pre-training and persists even after context extension via position interpolation.
Notably, we show that setting $\theta$ to the training length shifts the band toward lower frequencies and improves extrapolation, whereas increasing $\theta$ enhances interpolation but reduces extrapolation, revealing a clear trade-off between interpolation and extrapolation.
We believe this work is a step toward a sharper understanding of positional embeddings in LLMs, with falsifiable diagnostics and practical guidance for choosing $\theta$ that support scaling to longer contexts.
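One way to build intuition for why the base and the training length interact is to compare each rotary pair's wavelength with the training length. The sketch below uses the standard RoPE frequency schedule $\theta^{-2i/d}$; the counting diagnostic and the example sizes are assumptions for illustration and are not the paper's frequency-band measurement, which is based on norms.

```python
import math

def rope_wavelengths(head_dim, base):
    """Wavelength in tokens of each rotary pair: 2 * pi * base ** (2i / head_dim)."""
    return [2 * math.pi * base ** (2 * i / head_dim)
            for i in range(head_dim // 2)]

def pairs_completing_a_period(head_dim, base, train_len):
    """Number of rotary pairs whose wavelength fits inside the training length."""
    return sum(1 for w in rope_wavelengths(head_dim, base) if w <= train_len)

# Hypothetical settings: 128-dimensional heads, 4096-token training length.
small_base = pairs_completing_a_period(128, 10_000.0, 4096)
large_base = pairs_completing_a_period(128, 500_000.0, 4096)
```

Raising the base stretches the wavelengths, so fewer pairs see a full rotation during training; pairs that never complete a period are the ones most exposed to unseen angles when the context grows beyond the training length.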
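Foundation/Frontier Models (incl. LLMs)
Model Architecture
#State Space Model #Mamba #Graph Signal Processing #Adaptive Filter Bank
TL;DR: HADES reinterprets Mamba2 as a graph-based adaptive filter bank, achieving efficient and interpretable sequence modeling with fewer parameters.
🎯 Motivation: State-space models (SSMs) offer linear-time sequence modeling, but existing methods such as Mamba2 neither analyze nor exploit the structure of their multi-head recurrence. An improved framework is proposed to increase efficiency and interpretability.
❓ Problem: Redesign the multi-head recurrence of Mamba2 by interpreting it as an adaptive filter bank from graph signal processing, addressing its parameter cost and lack of hierarchical design.
🔍 Analysis: Mamba2 performs strongly on benchmarks, but its heads recur independently, which leaves the potential of structured frequency filters unused. A hierarchical architecture with shared low-pass filters and expert high-pass filters provides frequency adaptivity.
🛠️ Method: HADES builds on graph signal processing, treating the model as an adaptive filter bank on a line graph and applying a structured bias to the parameter $\Delta$ to obtain hierarchical, efficient state-space modeling.
📊 Data and experiments: On language modeling, commonsense reasoning, and long-context retrieval, HADES performs comparably to Mamba2 while using only 58.9% of the original parameters.
⭐ Contributions: Brings graph signal processing into neural sequence modeling with a hierarchical adaptive filter framework that reduces parameter count and improves efficiency and interpretability.
Abstract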
State-space models (SSMs) offer efficient alternatives to attention with linear-time recurrence. Mamba2, a recent SSM-based language model, uses selective input gating and a multi-head structure, enabling parallel computation and strong benchmark performance. However, its multi-head recurrence operates independently without structured utilization or analysis. In this work, we propose a novel method called **H**ierarchical **AD**aptive filter bank for **E**fficient **S**SMs (*HADES*), a Graph Signal Processing (GSP)-inspired framework that reinterprets Mamba2 as an adaptive filter bank on a line graph. Our hierarchical architecture introduces two filter types: shared filters for global low-pass behavior and expert filters for local high-pass behavior, achieved through structured bias on the parameter $\Delta$. *HADES* achieves comparable performance to baseline models including Mamba2 across various benchmarks in language modeling, commonsense reasoning, and long-context retrieval, while using only **58.9%** of the original parameters. In this regard, *HADES* bridges GSP and neural sequence modeling, enabling efficient, hierarchical, and interpretable filtering within state-space models.
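Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Linear Attention #Language Model #Foundation Model
🎯 Motivation: Linear attention, an efficient alternative to Softmax attention, keeps improving in language modeling through more sophisticated decay-matrix designs, but structural complexity has mostly stayed at the Diagonal-Plus-Rank-1 level. Studying more complex decay structures can advance linear attention further.
❓ Problem: The decay matrices of existing linear attention mechanisms have limited expressiveness and cannot fully capture complex relations in sequence data; the algorithms also need to be more general to support efficient parallel processing.
🔍 Analysis: Current baselines leave a clear gap on large-scale language modeling and retrieval tasks, and linear attention needs more expressive structures to reach new state-of-the-art results.
🛠️ Method: HDLA uses an efficient matrix decomposition to obtain a Diagonal-Plus-Rank-2 structure. A general chunk-wise parallel algorithm supports rank-enhanced decay structures and higher-rank key-value outer products.
📊 Data and experiments: On language modeling, retrieval tasks, and the synthetic MAD benchmark, HDLA at the 2.8B parameter scale sets new state-of-the-art results among linear attention baselines, with gains of up to 80% and 58.2% on retrieval tasks and an average score improvement of 4.39–7.66 on MAD.
⭐ Contributions: The more expressive HDLA linear attention mechanism and a general chunk-wise parallel algorithm, which together provide an algorithmic foundation for designing rank-enhanced structures.
Abstract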
Linear attention mechanisms have emerged as efficient alternatives to Softmax attention, exhibiting steady improvements in language modeling capabilities driven by increasingly sophisticated designs for decay matrices, though their structural complexity has typically been limited to the Diagonal-Plus-Rank-1 level. To further advance the understanding and capabilities of linear attention via more complex decay structures, this work makes two primary contributions: (1) We propose the HDLA linear attention mechanism, which utilizes efficient matrix decomposition to achieve a Diagonal-Plus-Rank-2 structure, thereby extending the decay matrix to a broader, more expressive, rank-enhanced and structured class. (2) We propose a more general chunk-wise parallel algorithm that accommodates both diagonal-plus-rank-$r_{ab}$ decay structure and key-value outer products of rank $r_{kv}$, thus providing a versatile foundation for future research. Comprehensive experiments demonstrate that, compared to linear attention baselines, HDLA sets new SOTA results on language modeling and retrieval tasks at the 2.8B parameter scale, delivers gains of up to 80% and 58.2% over baselines on the retrieval-based MQAR and RULER tasks, respectively, and achieves an average score improvement of 4.39–7.66 on the synthetic MAD benchmark. Our proposed HDLA model and the rank-generalized chunk-wise parallel algorithm together provide a versatile algorithmic foundation and promising research prospects for the design of rank-enhanced, structured linear attention mechanisms.
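Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Test Time Memorization #Online Optimization #Recurrent Neural Networks
🎯 Motivation: Inspired by attentional bias in human cognition, this work reconceptualizes deep learning architectures in terms of how they memorize and learn, with the aim of improving foundation models.
❓ Problem: How can the design of the attentional bias and memory objective improve modern architectures, and can a general design framework make models more broadly applicable and expressive?
🔍 Analysis: The work relates attentional bias in deep learning architectures to their internal memory objective and reinterprets forgetting mechanisms as a form of retention regularization.
🛠️ Method: Miras is a general framework defined by four choices: the attentional bias objective, the retention gate, the associative memory architecture, and the memory learning algorithm. Different configurations are built to study how the resulting models differ.
📊 Data and experiments: On language modeling, commonsense reasoning, recall-intensive, and time series tasks, instances of the framework outperform Transformers and modern linear recurrent models.
⭐ Contributions: Defines and extends the concept of attentional bias, proposes the Miras framework for designing architectures, and shows strong results across several tasks.
Abstract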
Designing efficient and effective architectural backbones has been at the core of research efforts to enhance the capability of foundation models. Inspired by the human cognitive phenomenon of attentional bias (the natural tendency to prioritize certain events or stimuli), we reconceptualize neural architectures, including Transformers, Titans, and modern linear recurrent neural networks, as associative memory modules with attentional bias. We define and formalize the concept of attentional bias as the internal memory objective of deep learning architectures. We show that existing deep learning architectures leverage the same attentional bias based on the $L_2$ loss function. Going beyond the $L_2$ loss function, we present a set of alternative attentional bias configurations along with their effective approximations. We then reinterpret forgetting mechanisms in modern deep learning architectures as a form of retention regularization. Building upon these insights, we present Miras, a general framework to design deep learning architectures based on the choice of attentional bias objective, retention gate, associative memory architecture, and memory learning algorithm. Our experiments show that different designs yield models with varying strengths. Furthermore, our special instances of Miras achieve exceptional performance in language modeling, commonsense reasoning, recall-intensive, and time series tasks, outperforming Transformers and other modern linear recurrent models.
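Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Mixture of Experts #Mixture of LoRA Experts #Dynamic routing #Fully differentiable #LoRA #MoE
🎯 Motivation: Combining parameter-efficient fine-tuning with mixture-of-experts adapts large language models to downstream tasks effectively, but existing methods are limited by assigning a fixed number of experts and need a dynamic alternative.
❓ Problem: A learnable dynamic routing mechanism allocates experts flexibly per token and per layer, and removes the non-differentiability and hyperparameter dependence of conventional TopK routing.
🔍 Analysis: A fixed TopK cannot adapt to varying task complexity; flexible control over the number of activated experts improves adaptability and performance.
🛠️ Method: A differentiable routing function with a closed-form solution replaces TopK selection, and an analytical sparsity control objective regularizes the number of activated experts.
📊 Data and experiments: Experiments with Qwen3-1.7B and Llama-3.2-3B on a diverse set of benchmarks show the highest average scores compared with state-of-the-art baselines.
⭐ Contributions: A learnable dynamic expert allocation mechanism that improves task performance, learns token-dependent and layer-wise allocation, and provides a sparsity control objective to manage resource use.
Abstract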
Recent studies have shown that combining parameter-efficient fine-tuning (PEFT) with mixture-of-experts (MoE) is an effective strategy for adapting large language models (LLMs) to downstream tasks. However, most existing approaches rely on conventional TopK routing, which requires careful hyperparameter tuning and assigns a fixed number of experts to each token. In this work, we propose LD-MoLE, a Learnable Dynamic routing mechanism for Mixture of LoRA Experts that enables adaptive, token-dependent, and layer-wise expert allocation. Our method replaces the non-differentiable TopK selection with a differentiable routing function and a closed-form solution. Moreover, our design allows the model to adaptively determine the number of experts to activate for each token at different layers. In addition, we introduce an analytical sparsity control objective to regularize the number of activated experts. Extensive experiments on the Qwen3-1.7B and Llama-3.2-3B models show that LD-MoLE achieves the highest average scores compared to state-of-the-art baselines across a diverse set of benchmarks. Our method not only achieves superior performance, but also demonstrates the ability to learn token-dependent and layer-wise expert allocation.
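The abstract does not name the routing function, so the sketch below substitutes sparsemax, a well-known closed-form projection onto the probability simplex, purely as a stand-in. It shows the property the abstract emphasizes: the number of experts with nonzero weight varies with the logits instead of being fixed at K. The logits are made up.

```python
def sparsemax(logits):
    """Euclidean projection of `logits` onto the probability simplex.

    Closed form: find the largest k with 1 + k * z_(k) > sum of the top-k
    sorted logits, then subtract the matching threshold tau and clip at 0.
    """
    z_sorted = sorted(logits, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, z in enumerate(z_sorted, start=1):
        cumsum += z
        if 1 + k * z > cumsum:
            tau = (cumsum - 1) / k  # threshold for a support of size k
    return [max(z - tau, 0.0) for z in logits]

def num_active(weights):
    """Count experts that receive a nonzero routing weight."""
    return sum(1 for w in weights if w > 0)

confident_token = sparsemax([2.0, 1.0, 0.1])   # one expert clearly dominates
ambiguous_token = sparsemax([1.0, 0.9, 0.1])   # two experts are close
```

In this toy case the confident token activates one expert and the ambiguous token activates two, without any TopK hyperparameter.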
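Foundation/Frontier Models (incl. LLMs)
Model Architecture
#transformers #language models #invertibility #injectivity #inversion #privacy
TL;DR: We prove that transformers are (a.s.) injective and propose an algorithm that provably inverts their hidden representations back to the original input prompt.
🎯 Motivation: Transformer components such as non-linear activations and normalization are not injective, which suggests that inputs cannot be fully recovered from hidden representations. The question matters for the transparency and safety of language models.
❓ Problem: Prove mathematically that the continuous hidden representations of language models are in fact injective, and give an efficient algorithm that recovers the original input from them.
🔍 Analysis: Theory, together with billions of collision tests on six state-of-the-art language models that found no collisions, supports the result.
🛠️ Method: The SipIt algorithm exploits injectivity to reconstruct the exact input from hidden activations with linear-time guarantees.
📊 Data and experiments: Collision tests on six mainstream language models confirm injectivity empirically and demonstrate exact inversion in practice.
⭐ Contributions: Establishes injectivity as a fundamental and exploitable property of language models, with implications for transparency, interpretability, and safe deployment.
Abstract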
Transformer components such as non-linear activations and normalization are inherently non-injective, suggesting that different inputs could map to the same output and prevent exact recovery of the input from a model’s representations. In this paper, we challenge this view. First, we prove mathematically that transformer language models mapping discrete input sequences to their corresponding sequence of continuous representations are injective and therefore lossless, a property established at initialization and preserved during training. Second, we confirm this result empirically through billions of collision tests on six state-of-the-art language models, and observe no collisions. Third, we operationalize injectivity: we introduce SipIt, the first algorithm that provably and efficiently reconstructs the exact input text from hidden activations, establishing linear-time guarantees and demonstrating exact invertibility in practice. Overall, our work establishes injectivity as a fundamental and exploitable property of language models, with direct implications for transparency, interpretability, and safe deployment.
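To see why injectivity plus causality makes exact inversion possible, consider a toy causal encoder whose state at each step depends only on the prefix. The sketch recovers the input token by token by testing vocabulary candidates against the observed state. The encoder, vocabulary, and tolerance are invented for illustration; this is not the SipIt algorithm itself, only the general idea of sequential matching.

```python
import math

EMBED = {"a": 0.1, "b": -0.4, "c": 0.7, "d": 1.3}  # toy vocabulary (hypothetical)

def step(state, token):
    """Toy causal update; tanh is strictly monotone, so each step is injective."""
    return math.tanh(0.5 * state + EMBED[token])

def encode(tokens):
    """Return the hidden state after each prefix of `tokens`."""
    states, state = [], 0.0
    for tok in tokens:
        state = step(state, tok)
        states.append(state)
    return states

def invert(states, tol=1e-12):
    """Recover tokens left to right by matching each observed state."""
    tokens, state = [], 0.0
    for target in states:
        match = next(t for t in EMBED if abs(step(state, t) - target) < tol)
        tokens.append(match)
        state = target  # the prefix is now known exactly
    return tokens

original = list("badcab")
recovered = invert(encode(original))
```

Each position needs at most one pass over the vocabulary, so the number of candidate evaluations grows linearly with sequence length for a fixed vocabulary.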
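Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Linear Attention #Transformer #Kernel learning
🎯 Motivation: The quadratic complexity of softmax attention limits the scalability of Transformers on high-resolution vision tasks. Existing linear attention variants usually replace softmax with Gaussian kernels, but these lack theoretical grounding and tend to oversuppress mid-range token interactions.
❓ Problem: Proposes a Laplacian-kernel alternative that preserves fine-grained token information while addressing the loss of expressiveness under low-rank approximation.
🔍 Analysis: Theoretical analysis and empirical observations motivate the Laplacian kernel, which balances complexity against expressiveness and avoids the oversuppression of mid-range interactions seen in existing methods.
🛠️ Method: Replaces softmax with a Laplacian kernel and counters low-rank degradation with a provably injective feature map. A Nyström approximation solved by Newton-Schulz iteration avoids costly matrix inversion and SVD, and custom CUDA kernels provide an efficient implementation.
📊 Data & Experiments: Experiments on ImageNet show that LaplacianFormer achieves strong performance-efficiency trade-offs while improving attention expressiveness.
⭐ Contributions: A new Laplacian-kernel attention mechanism with an efficient, scalable implementation; a better balance of performance and complexity, supported by theory and experiments; a Transformer variant suitable for edge deployment.
Abstract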
The quadratic complexity of softmax attention presents a major obstacle for scaling Transformers to high-resolution vision tasks. Existing linear attention variants often replace the softmax with Gaussian kernels to reduce complexity, but such approximations lack theoretical grounding and tend to oversuppress mid-range token interactions. We propose LaplacianFormer, a Transformer variant that employs a Laplacian kernel as a principled alternative to softmax, motivated by empirical observations and theoretical analysis. To address expressiveness degradation under low-rank approximations, we introduce a provably injective feature map that retains fine-grained token information. For efficient computation, we adopt a Nyström approximation of the kernel matrix and solve the resulting system using Newton-Schulz iteration, avoiding costly matrix inversion and SVD. We further develop custom CUDA implementations for both the kernel and solver, enabling high-throughput forward and backward passes suitable for edge deployment. Experiments on ImageNet show that LaplacianFormer achieves strong performance-efficiency trade-offs while improving attention expressiveness. Code is available at https://mike7472727.github.io/laplacianformer.github.io/ (LaplacianFormer).
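For readers unfamiliar with Newton-Schulz iteration, the sketch below shows the classical form for approximating a matrix inverse using only matrix multiplications, X <- X (2I - A X). It is a generic textbook version written in plain Python for illustration. The exact system LaplacianFormer solves and its CUDA solver are not described in the abstract, so nothing here should be read as the authors' implementation.

```python
def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def newton_schulz_inverse(a, iterations=30):
    """Approximate the inverse of a square, non-singular matrix `a`."""
    n = len(a)
    norm_1 = max(sum(abs(a[i][j]) for i in range(n)) for j in range(n))  # max column sum
    norm_inf = max(sum(abs(v) for v in row) for row in a)                 # max row sum
    # Standard safe start: X0 = A^T / (||A||_1 * ||A||_inf) makes the iteration converge.
    x = [[a[j][i] / (norm_1 * norm_inf) for j in range(n)] for i in range(n)]
    for _ in range(iterations):
        ax = matmul(a, x)
        correction = [[(2.0 if i == j else 0.0) - ax[i][j] for j in range(n)]
                      for i in range(n)]
        x = matmul(x, correction)  # X <- X (2I - A X)
    return x


inverse = newton_schulz_inverse([[4.0, 1.0], [1.0, 3.0]])
```

Because each step is two matrix products, the method maps well onto hardware that is optimised for matmuls, which is presumably why it is attractive as a replacement for explicit inversion or SVD.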
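Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Attention Mechanism; Sequence Modeling; Test-Time Training; Local Linear Regression; Associative Memory; Hardware-Efficient Attention
🎯 Motivation: Transformer architectures succeed across many domains, but research has concentrated on efficient alternatives to Softmax Attention and paid less attention to more expressive, theoretically grounded mechanisms. This paper aims to fill that gap.
❓ Problem: Design an attention mechanism that has theoretical advantages and is also practical, addressing the trade-off between associative memory and computational efficiency in existing methods.
🔍 Analysis: A bias-variance trade-off analysis shows that the proposed LLA has theoretical advantages over Linear and Softmax Attention for associative memory, and also exposes its computational challenges.
🛠️ Method: Proposes Local Linear Attention (LLA), derived from a regression view in nonparametric statistics, with two memory-efficient primitives and the hardware-efficient FlashLLA algorithm to address compute and memory bottlenecks.
📊 Data & Experiments: Validated on test-time regression, associative recall, and state tracking, among other tasks; LLA adapts to non-stationarity, outperforms strong baselines in test-time training, and shows promising scalability.
⭐ Contributions: A theoretically grounded attention mechanism (LLA) with an optimized hardware implementation that significantly reduces memory overhead; experiments show strong results across tasks and broaden the settings in which attention mechanisms apply.
Abstract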
Transformer architectures have achieved remarkable success in various domains. While efficient alternatives to Softmax Attention have been widely studied, the search for more expressive mechanisms grounded in theoretical insight, even at greater computational cost, has been relatively underexplored. In this work, we bridge this gap by proposing Local Linear Attention (LLA), a novel attention mechanism derived from nonparametric statistics through the lens of test-time regression. First, we show that LLA offers theoretical advantages over Linear and Softmax Attention for associative memory via a bias-variance trade-off analysis. Next, we address its computational challenges and propose two memory-efficient primitives to tackle the $\Theta(n^2d)$ and $\Theta(nd^2)$ complexity. We then introduce FlashLLA, a hardware-efficient, blockwise algorithm that enables scalable and parallel computation on modern accelerators. In addition, we implement and profile a customized inference kernel that significantly reduces memory overheads. Finally, we empirically validate the advantages and limitations of LLA on test-time regression, in-context regression, associative recall and state tracking tasks. Experimental results demonstrate that LLA effectively adapts to non-stationarity, outperforming strong baselines in test-time training and in-context learning, and exhibiting promising evidence for its scalability and applicability in large-scale models.
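The nonparametric idea behind the name can be sketched in one dimension. A kernel-weighted average (a local constant fit) is the usual regression reading of softmax-style attention, while a local linear fit also estimates a slope around the query. The example below is only an illustration of that classical contrast, assuming a Gaussian kernel and scalar inputs; it is not the LLA layer, and the function names are hypothetical.

```python
import math


def kernel_weights(xs, x0, bandwidth):
    """Gaussian kernel weights of each sample relative to the query x0."""
    return [math.exp(-((x - x0) ** 2) / (2 * bandwidth ** 2)) for x in xs]


def local_constant(xs, ys, x0, bandwidth=1.0):
    """Kernel-weighted average of the targets."""
    w = kernel_weights(xs, x0, bandwidth)
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)


def local_linear(xs, ys, x0, bandwidth=1.0):
    """Intercept of the weighted least-squares line fitted around x0."""
    w = kernel_weights(xs, x0, bandwidth)
    d = [x - x0 for x in xs]
    s0 = sum(w)
    s1 = sum(wi * di for wi, di in zip(w, d))
    s2 = sum(wi * di * di for wi, di in zip(w, d))
    t0 = sum(wi * yi for wi, yi in zip(w, ys))
    t1 = sum(wi * di * yi for wi, di, yi in zip(w, d, ys))
    return (s2 * t0 - s1 * t1) / (s0 * s2 - s1 * s1)


xs, ys = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]  # exactly y = 2x + 1
at_boundary_constant = local_constant(xs, ys, 3.0)   # pulled below 7 by earlier points
at_boundary_linear = local_linear(xs, ys, 3.0)       # recovers 7 up to rounding
```

At the boundary of the data the local constant estimate is biased toward the interior, whereas the local linear estimate is exact on linear targets. This is the kind of bias reduction a bias-variance analysis would quantify.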
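Foundation/Frontier Models (incl. LLMs)
Model Architecture
#subquadratic architecture #triton kernel #structured matrices
TL;DR: We introduce a tensor attention framework and propose log-linear attention, which expands beyond fixed-size hidden states to achieve log-linear complexity.
🎯 Motivation: Attention is the core of the Transformer, but its quadratic compute and linear memory complexity are bottlenecks for sequence modeling. Linear attention and state-space models improve efficiency, yet remain limited by their fixed-size hidden state.
❓ Problem: Introduce an attention mechanism that overcomes the fixed-size hidden state limitation and balances compute cost against expressiveness for more efficient sequence modeling.
🔍 Analysis: Existing linear attention models train efficiently through matmul-rich parallelization, but at their core they are still RNNs and therefore cannot fully model richer context.
🛠️ Method: Proposes log-linear attention, which replaces the fixed-size hidden state with a logarithmically growing set of hidden states and admits a matmul-rich parallel form whose compute cost is log-linear in sequence length.
📊 Data & Experiments: Instantiates log-linear variants of Mamba-2 and Gated DeltaNet and compares them with their linear-time counterparts, finding that they perform well.
⭐ Contributions: A general log-linear attention framework that balances the efficiency of linear attention with the expressiveness of softmax attention and can be applied on top of existing linear attention variants.
Abstract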
The attention mechanism in Transformers is an important primitive for accurate and scalable sequence modeling. Its quadratic-compute and linear-memory complexity, however, remains a significant bottleneck. Linear attention and state-space models enable linear-time, constant-memory sequence modeling and can moreover be trained efficiently through matmul-rich parallelization across sequence length. However, at their core these models are still RNNs, and thus their use of a fixed-size hidden state to model the context is a fundamental limitation. This paper develops log-linear attention, an attention mechanism that balances linear attention's efficiency and the expressiveness of softmax attention. Log-linear attention replaces the fixed-size hidden state with a logarithmically growing set of hidden states. We show that with a particular growth function, log-linear attention admits a similarly matmul-rich parallel form whose compute cost is log-linear in sequence length. Log-linear attention is a general framework and can be applied on top of existing linear attention variants. As case studies, we instantiate log-linear variants of two recent architectures, Mamba-2 and Gated DeltaNet, and find they perform well compared to their linear-time variants.
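The abstract mentions "a particular growth function" without defining it. One simple way to obtain a logarithmically growing set of summaries, shown here purely as an assumption for illustration, is to split the prefix of length t into power-of-two sized segments following the binary representation of t, in the style of a Fenwick tree. Each segment would then hold one hidden state. The helper `prefix_buckets` is hypothetical.

```python
def prefix_buckets(t):
    """Split positions [0, t) into power-of-two sized segments, oldest and largest first."""
    buckets, start = [], 0
    for bit in reversed(range(t.bit_length())):
        size = 1 << bit
        if t & size:  # this power of two appears in the binary expansion of t
            buckets.append((start, start + size))
            start += size
    return buckets


segments = prefix_buckets(13)                    # 13 = 8 + 4 + 1
state_count = len(prefix_buckets(1_000_000))     # bounded by the bit length of t
```

With this partition the number of states for a prefix of length t is the number of set bits of t, so it never exceeds floor(log2 t) + 1, and recent tokens fall into the smallest segments.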
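Foundation/Frontier Models (incl. LLMs)
Model Architecture
#large language models #LLMs #reasoning #looped transformers #efficient inference #parameter sharing
🎯 Motivation: Looped Transformers perform well on reasoning tasks in the language domain, but existing methods fix the number of loop iterations and cannot flexibly adapt their computational depth.
❓ Problem: Design a Transformer whose loop depth can be adjusted to the compute budget, improving reasoning efficiency and quality across different budgets.
🔍 Analysis: Looped architectures have an inductive bias toward latent reasoning; shorter loops can yield informative representations and longer loops can refine them, but this flexibility had not been studied thoroughly.
🛠️ Method: Proposes LoopFormer, trained on variable-length trajectories with a shortcut-consistency scheme that keeps representations consistent across loops of different lengths, avoiding drift or stagnation.
📊 Data & Experiments: On language modeling and reasoning benchmarks, the model remains robust under aggressive compute constraints and scales gracefully as the budget increases.
⭐ Contributions: Demonstrates the adaptive potential of looped Transformers, points toward budget-controllable large language models, and improves their practicality across tasks.
Abstract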
Looped Transformers have emerged as an efficient and powerful class of models for reasoning in the language domain. Recent studies show that these models achieve strong performance on algorithmic and reasoning tasks, suggesting that looped architectures possess an inductive bias toward latent reasoning. However, prior approaches fix the number of loop iterations during training and inference, leaving open the question of whether these models can flexibly adapt their computational depth under variable compute budgets. We introduce LoopFormer, a looped Transformer trained on variable-length trajectories to enable budget-conditioned reasoning. Our core contribution is a shortcut-consistency training scheme that aligns trajectories of different lengths, ensuring that shorter loops yield informative representations while longer loops continue to refine them. LoopFormer conditions each loop on the current time and step size, enabling representations to evolve consistently across trajectories of varying length rather than drifting or stagnating. Empirically, LoopFormer demonstrates robust performance on language modeling and reasoning benchmarks even under aggressive compute constraints, while scaling gracefully with additional budget. These results show that looped Transformers are inherently suited for adaptive language modeling, opening a path toward controllable and budget-aware large language models.
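Foundation/Frontier Models (incl. LLMs)
Model Architecture
#State Space Models #Mamba #LLMs #Subquadratic Models
TL;DR: Mamba-3, an inference-first SSM that pushes on core SSM principles: improved discretization for better quality, complex dynamics for new capabilities, and MIMO updates for efficient inference.
🎯 Motivation: Inference efficiency has become a central concern in LLM design. Transformers deliver strong quality, but their quadratic compute makes inference expensive.
❓ Problem: Address the loss of model quality and capability that existing linear models accept in exchange for efficiency, and the fact that their theoretically linear inference remains hardware-inefficient in practice.
🔍 Analysis: Existing linear models fail on some tasks, such as state tracking; architectures need to move beyond the current trade-off between compute efficiency and model quality.
🛠️ Method: Introduces three improvements grounded in the state space model (SSM) view: a more expressive recurrence derived from SSM discretization, a complex-valued state update rule for richer state tracking, and a multi-input, multi-output (MIMO) formulation that improves performance without increasing decode latency.
📊 Data & Experiments: At the 1.5B scale, Mamba-3 improves average downstream accuracy by 0.6 percentage points over the next best model, and the MIMO variant adds a further 1.2 points; it also matches Mamba-2 perplexity with half the state size.
⭐ Contributions: An inference-first SSM that combines improved discretization, complex dynamics, and a MIMO design to advance the performance-efficiency frontier, with gains across retrieval, state-tracking, and downstream language modeling tasks.
Abstract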
Scaling inference-time compute has emerged as an important driver of LLM performance, making inference efficiency a central focus of model design alongside model quality. While current Transformer models deliver strong quality, their quadratic compute and linear memory requirements make inference expensive. This has spurred the development of sub-quadratic models with reduced compute and constant memory requirements.
However, many recent linear models trade off model quality and capability for algorithmic efficiency, failing on tasks such as state tracking. Moreover, their theoretically linear inference remains hardware-inefficient in practice.
Guided by an inference-first perspective, we introduce three core methodological improvements inspired by the state space model (SSM) viewpoint of linear models. We combine: (1) a more expressive recurrence derived from SSM discretization, (2) a complex-valued state update rule enabling richer state tracking, and (3) a multi-input, multi-output (MIMO) formulation that improves model performance without increasing decode latency.
Together with architectural refinements, Mamba-3 achieves significant gains across retrieval, state-tracking, and downstream language modeling tasks. At the 1.5B scale, Mamba-3 improves average downstream accuracy by 0.6 percentage points compared to the next best model (Gated DeltaNet), with the MIMO variant further improving accuracy by an additional 1.2 points, for a total gain of 1.8 points.
Across state-size experiments, Mamba-3 achieves comparable perplexity to Mamba-2 despite using half the state size. These results demonstrate that Mamba-3 advances the performance–efficiency frontier.
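Why a complex-valued update can help state tracking is easiest to see on parity. A unit-modulus complex transition can rotate the state, so a single complex number can flip between two values as 1-bits arrive. The toy below is an illustration under that assumption and is not Mamba-3's update rule; `parity_via_rotation` is a hypothetical helper.

```python
import cmath


def parity_via_rotation(bits):
    """Track the parity of a bit stream with one complex state rotated by pi per 1-bit."""
    state = 1 + 0j
    for bit in bits:
        state *= cmath.exp(1j * cmath.pi * bit)  # input-dependent rotation
    return 0 if state.real > 0 else 1


odd = parity_via_rotation([1, 0, 1, 1])   # three 1-bits
even = parity_via_rotation([1, 1, 0, 0])  # two 1-bits
```

The state stays on the unit circle throughout, so the information is carried in its phase rather than its magnitude.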
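Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Recursive Transformer #Language Model #Parameter Sharing #Parameter Efficiency
TL;DR: We diagnose why recursive transformers underperform and propose a targeted solution for building stronger recursive backbones.
🎯 Motivation: Recursive Transformers reduce parameter count effectively, but under matched compute they underperform non-recursive models, so their architecture needs improvement.
❓ Problem: Address undifferentiated computation and information overload in recursive Transformers in order to improve computational efficiency and functional diversity.
🔍 Analysis: Probing hidden states shows that the performance bottlenecks stem from repeating a similar computational pattern at every iteration and from storing long-lived and transient information in the same hidden state.
🛠️ Method: Proposes the Memory-as-State-Highways (MeSH) scheme, which externalizes state management into an explicit memory buffer and uses lightweight routers to diversify computation across iterations, inducing functional specialization.
📊 Data & Experiments: On the Pythia suite (160M–6.9B parameters), the MeSH-enhanced model outperforms its larger non-recursive counterpart at the 1.4B scale, improving average downstream accuracy by 1.06% with 33% fewer non-embedding parameters.
⭐ Contributions: Establishes MeSH as a scalable and principled architecture for recursive models, offering a new direction for improving recursive Transformers.
Abstract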
Recursive transformers reuse parameters and iterate over hidden states multiple times, decoupling compute depth from parameter depth.
However, under matched compute, recursive models with fewer parameters often lag behind non-recursive counterparts.
By probing hidden states, we trace this performance gap to two primary bottlenecks: __undifferentiated computation__, where the core is forced to adopt a similar computational pattern at every iteration, and __information overload__, where long-lived and transient information must coexist in a single hidden state.
To address the issues, we introduce a **Me**mory-as-**S**tate-**H**ighways **(MeSH)** scheme, which externalizes state management into an explicit memory buffer and employs lightweight routers to dynamically diversify computation across iterations.
Probing visualizations confirm that MeSH successfully resolves the pathologies by inducing functional specialization across iterations. On the Pythia suite (160M–6.9B), MeSH-enhanced recursive transformers consistently improve over recursive baselines and outperform their larger non-recursive counterpart at the 1.4B scale, improving average downstream accuracy by +1.06% with 33% fewer non-embedding parameters. Our analysis establishes MeSH as a scalable and principled architecture for building stronger recursive models.
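Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Sequence modeling #test-time training #RNN transformer alternatives
🎯 Motivation: Sequence modeling is dominated by causal Transformers, whose memory and compute requirements grow linearly during inference, so more efficient alternatives are needed. Recent work on linearizing softmax has produced strong RNN architectures whose stability and scalability still need improvement.
❓ Problem: RNN models such as DeltaNet, Mamba, or xLSTM have constant memory and compute cost, but they only approximately optimize their in-context regression objective. The paper proposes a stable, parallelizable Mesa layer that scales to demanding sequence modeling.
🔍 Analysis: Earlier RNN models are limited by online learning rules that solve the in-context regression objective only approximately. Optimizing the in-context loss exactly improves language modeling and downstream performance, especially on long-context tasks.
🛠️ Method: Proposes a numerically stable, chunkwise parallelizable Mesa layer that minimizes the in-context loss to optimality at every time step with a fast conjugate gradient solver, removing the earlier version's need to run sequentially in time.
📊 Data & Experiments: Experiments span model sizes up to the billion-parameter scale and cover language modeling and long-context tasks; the layer reaches lower perplexity and higher downstream benchmark performance than previous RNN architectures.
⭐ Contributions: A new Mesa layer that spends extra inference-time compute on solving an optimization problem inside the network, improving sequence modeling and connecting to the trend of increasing test-time compute to improve performance.
Abstract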
Sequence modeling is currently dominated by causal transformer architectures that use softmax self-attention. Although widely adopted, transformers require scaling memory and compute linearly during inference. A recent stream of work linearized the softmax operation, resulting in powerful recurrent neural network (RNN) models with constant memory and compute costs such as DeltaNet, Mamba or xLSTM. These models can be unified by noting that their recurrent layer dynamics can all be derived from an in-context regression objective, approximately optimized through an online learning rule. Here, we join this line of work and introduce a numerically stable, chunkwise parallelizable version of the recently proposed Mesa layer (von Oswald et al., 2024), which could only run sequentially in time and was therefore not scalable. This layer again stems from an in-context loss, but one that is now minimized to optimality at every time point using a fast conjugate gradient solver. Through an extensive suite of experiments up to the billion-parameter scale, we show that optimal test-time training enables reaching lower language modeling perplexity and higher downstream benchmark performance than previous RNNs, especially on tasks requiring long context understanding. This performance gain comes at the cost of additional flops spent during inference time. Our results are therefore intriguingly related to recent trends of increasing test-time compute to improve performance, here by spending compute to solve sequential optimization problems within the neural network itself.
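The phrase "minimized to optimality using a conjugate gradient solver" can be illustrated with a small ridge-regression readout. The sketch assumes scalar values, a ridge term `ridge`, and the objective sum_i (w·k_i - v_i)^2 + ridge·|w|^2; it solves (sum_i k_i k_i^T + ridge·I) x = q with conjugate gradient and reads out sum_i v_i (k_i·x). It is a sequential toy, not the chunkwise parallel algorithm from the paper, and `mesa_readout` is a hypothetical name.

```python
def conjugate_gradient(matvec, b, iterations=50, tol=1e-24):
    """Solve A x = b for a symmetric positive definite A given only via matvec."""
    x = [0.0] * len(b)
    r = list(b)
    p = list(r)
    rs = sum(ri * ri for ri in r)
    for _ in range(iterations):
        if rs < tol:
            break
        ap = matvec(p)
        alpha = rs / sum(pi * api for pi, api in zip(p, ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def mesa_readout(keys, values, query, ridge=1e-6):
    """Prediction of the ridge-regression solution fitted on (keys, values) at `query`."""
    def matvec(v):
        # Applies (sum_i k_i k_i^T + ridge * I) to v without forming the matrix.
        out = [ridge * vi for vi in v]
        for k in keys:
            dot = sum(ki * vi for ki, vi in zip(k, v))
            out = [oi + dot * ki for oi, ki in zip(out, k)]
        return out

    x = conjugate_gradient(matvec, query)
    return sum(val * sum(ki * xi for ki, xi in zip(k, x)) for k, val in zip(keys, values))


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [2.0, -1.0, 1.0]            # generated by w = (2, -1)
prediction = mesa_readout(keys, values, [3.0, 4.0])
```

The extra solver iterations are where the additional inference-time flops mentioned in the abstract would be spent.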
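Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Large language models (LLM) #Pre-training #Mixture-of-Experts (MoE)
🎯 Motivation: Whether sparse Mixture-of-Experts (MoE) language models can surpass dense architectures under strictly equal resources (same total parameter count, training compute, and data budget) is a question of significant practical value that remains under-explored.
❓ Problem: Proposes a new perspective and methodological framework for studying the advantages and resource efficiency of MoE architectures systematically.
🔍 Analysis: An MoE model whose activation rate lies in an optimal region can outperform its dense counterpart under the same resources, and this optimal region stays consistent across model sizes.
🛠️ Method: Optimizes the MoE architecture design, identifies the optimal activation-rate region experimentally, and uses data reuse to resolve the trade-off between additional data and performance.
📊 Data & Experiments: Large-scale experiments train nearly 200 models at the 2B scale and over 50 at the 7B scale, cumulatively processing 50 trillion tokens.
⭐ Contributions: Shows that MoE models can surpass dense models under strictly equal resource constraints and introduces the notion of an optimal activation-rate region that holds across scales; all code and models are to be released publicly.
Abstract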
Mixture-of-Experts (MoE) language models dramatically expand model capacity and achieve remarkable performance without increasing per-token compute. However, can MoEs surpass dense architectures under strictly equal resource constraints, that is, when the total parameter count, training compute, and data budget are identical? This question remains under-explored despite its significant practical value and potential. In this paper, we propose a novel perspective and methodological framework to study this question thoroughly. First, we comprehensively investigate the architecture of MoEs and achieve an optimal model design that maximizes the performance. Based on this, we subsequently find that an MoE model with activation rate in an optimal region is able to outperform its dense counterpart under the same total parameter count, training compute, and data budget. More importantly, this optimal region remains consistent across different model sizes. Although the enhanced performance comes at the cost of additional data, we show that this trade-off can be resolved by reusing data. We validate our findings through extensive experiments, training nearly 200 language models at 2B scale and over 50 at 7B scale, cumulatively processing 50 trillion tokens. All code and models will be released publicly.
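Foundation/Frontier Models (incl. LLMs)
Model Architecture
#large language models #mixture-of-depth-recurrent transformer #latent space #test-time reasoning
TL;DR: Mixture-of-Depth-Recurrent Transformers for Test-Time Reasoning
🎯 Motivation: Large language models reason well by generating step-by-step reasoning before the final answer, but existing approaches such as the depth-recurrent Transformer are limited on tasks that demand exploration.
❓ Problem: Existing models use a single, chain-like propagation mechanism, which handles reasoning tasks requiring complex exploration and diversity poorly.
🔍 Analysis: Increasing recurrence depth improves performance, but a single-chain model cannot fully exploit the solution space, which caps further gains in reasoning ability.
🛠️ Method: Proposes the Mixture-of-Depth-Recurrent (MoDr) Transformer, based on dynamic multi-branch routing: a LoRA-based multi-branch dynamic relay with learnable hard-gate routing explores the latent space, and an auxiliary-loss-free load balancing strategy guards against routing collapse.
📊 Data & Experiments: On mathematical reasoning benchmarks, MoDr improves over the original Huginn model and its fine-tuned variant by +7.2% and +2.48%, respectively; on commonsense reasoning benchmarks the gains are +21.21% and +1.52%.
⭐ Contributions: A more adaptive and exploratory MoDr Transformer that improves performance on mathematical and commonsense reasoning, together with a dynamic routing mechanism designed for efficient computation.
Abstract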
Large Language Models have demonstrated superior reasoning capabilities by generating step-by-step reasoning in natural language before deriving the final answer. Recently, Geiping et al. introduced 3.5B-Huginn as an alternative to this paradigm, a depth-recurrent Transformer that increases computational depth per token by reusing a recurrent block in latent space. Despite its performance gains with increasing recurrences, this approach is inadequate for tasks demanding exploration and adaptivity, a limitation arising from its single, chain-like propagation mechanism. To address this, we propose a novel dynamic multi-branch routing approach for Huginn, termed the Mixture-of-Depth-Recurrent (MoDr) Transformer, which enables effective exploration of the solution space by shifting linear latent reasoning into a LoRA-based multi-branch dynamic relay mode with learnable hard-gate routing. Meanwhile, we introduce an auxiliary-loss-free load balancing strategy to mitigate potential routing collapse. Our empirical results reveal that MoDr achieves average accuracy improvements of +7.2% and +2.48% over the original Huginn model and its fine-tuned variant, respectively, across various mathematical reasoning benchmarks, and improvements of +21.21% and +1.52% on commonsense reasoning benchmarks.
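Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Efficient/Low-Resource Methods for NLP #Linear Sequence Modeling #Machine Learning for NLP
🎯 Motivation: Linear sequence modeling methods train and infer efficiently, but they compress the entire sequence into a single fixed-size memory state and therefore perform poorly on recall-intensive tasks.
❓ Problem: A single memory state limits memory capacity; the paper proposes a new architecture that increases capacity while reducing interference.
🔍 Analysis: Linear-complexity methods are efficient but limited on language tasks with high recall demands, and the cause is the single-memory-state design.
🛠️ Method: Introduces the MoM architecture, in which a router network directs input tokens to specific states among multiple independent memory states, raising total memory capacity and reducing interference while keeping linear complexity.
📊 Data & Experiments: MoM outperforms current linear sequence models on downstream language tasks, particularly recall-intensive ones, and even reaches performance comparable to Transformer models.
⭐ Contributions: A mixture-of-memories framework that strengthens memory modeling while retaining computational efficiency, opening a new direction for linear sequence modeling.
Abstract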
Linear sequence modeling methods, such as linear attention, state space modeling, and linear RNNs, offer significant efficiency improvements by reducing the complexity of training and inference. However, these methods typically compress the entire input sequence into a single fixed-size memory state, which leads to suboptimal performance on recall-intensive tasks. To address this limitation, we introduce a novel architecture called Mixture-of-Memories (MoM). MoM utilizes multiple independent memory states, with a router network directing input tokens to specific memory states. This approach greatly enhances the overall memory capacity while minimizing memory interference. MoM serves as a general framework that can be seamlessly combined with diverse memory update mechanisms across linear models. As a result, MoM performs exceptionally well on recall-intensive tasks, surpassing existing linear sequence modeling techniques. Despite incorporating multiple memory states, the computation of each memory state remains linear in complexity, allowing MoM to retain linear complexity during training and constant complexity during inference. Our experimental results show that MoM outperforms current linear sequence models on downstream language tasks, particularly recall-intensive tasks, and even achieves performance comparable to Transformer models.
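The interference argument can be sketched with the simplest linear-attention memory, a sum of key-value outer products. The toy below assumes hard top-1 routing and a hand-written router, both simplifications introduced for illustration; the class `MixtureOfMemories` and its methods are hypothetical and do not reproduce the paper's router or update rules.

```python
class MixtureOfMemories:
    """Several independent outer-product memories; a router picks one per key."""

    def __init__(self, num_memories, key_dim, value_dim, router):
        self.memories = [[[0.0] * value_dim for _ in range(key_dim)]
                         for _ in range(num_memories)]
        self.router = router  # hypothetical: maps a key to a memory index

    def write(self, key, value):
        memory = self.memories[self.router(key)]
        for i, k in enumerate(key):
            for j, v in enumerate(value):
                memory[i][j] += k * v  # linear-attention style update M += k v^T

    def read(self, query):
        memory = self.memories[self.router(query)]
        return [sum(query[i] * memory[i][j] for i in range(len(query)))
                for j in range(len(memory[0]))]


key_a, key_b = [0.8, 0.6], [0.8, -0.6]  # unit-norm but not orthogonal keys

single = MixtureOfMemories(1, 2, 1, router=lambda key: 0)
routed = MixtureOfMemories(2, 2, 1, router=lambda key: 0 if key[1] >= 0 else 1)
for memory in (single, routed):
    memory.write(key_a, [1.0])
    memory.write(key_b, [2.0])

mixed_up = single.read(key_a)[0]  # contaminated by the value stored under key_b
clean = routed.read(key_a)[0]     # close to the stored value 1.0
```

Each write and read touches one fixed-size matrix, so per-token cost stays constant regardless of how many tokens have been stored.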
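Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Generative Models #Autoregressive Models #Diffusion Models #Text-to-image
🎯 Motivation: Current autoregressive models for text-to-image generation either rely on heavy diffusion models or suffer quantization loss; a unified treatment of discrete text tokens and continuous image tokens is needed.
❓ Problem: Proposes NextStep-1, whose architecture addresses the compute cost and quantization loss of existing methods and improves image quality and editing capability.
🔍 Analysis: Experiments show that a prediction objective over continuous tokens yields high-fidelity image synthesis and also supports strong image editing.
🛠️ Method: NextStep-1 pairs a 14B autoregressive model with a 157M flow matching head and is trained on discrete text tokens and continuous image tokens with next-token prediction objectives.
📊 Data & Experiments: Evaluated on text-to-image generation tasks, where it achieves state-of-the-art performance among autoregressive models.
⭐ Contributions: A unified autoregressive model over discrete text tokens and continuous image tokens that reaches state-of-the-art performance for autoregressive text-to-image generation, with code and models released to support open research.
Abstract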
Prevailing autoregressive (AR) models for text-to-image generation either rely on heavy, computationally-intensive diffusion models to process continuous image tokens, or employ vector quantization (VQ) to obtain discrete tokens with quantization loss. In this paper, we push the autoregressive paradigm forward with NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, trained on discrete text tokens and continuous image tokens with next-token prediction objectives. NextStep-1 achieves state-of-the-art performance for autoregressive models in text-to-image generation tasks, exhibiting strong capabilities in high-fidelity image synthesis. Furthermore, our method shows strong performance in image editing, highlighting the power and versatility of our unified approach. To facilitate open research, we have released our code and models to the community at https://github.com/stepfun-ai/NextStep-1.
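Foundation/Frontier Models (incl. LLMs)
Model Architecture
#transformers #simplicity bias #noise stability #regularization methods #spectral concentration
TL;DR: We introduce noise stability in Transformer models as an alternative proxy for explaining simplicity bias and propose a corresponding regularization method that we observe accelerates grokking.
🎯 Motivation: Understanding simplicity bias in deep learning helps build reliable AI systems. The existing average sensitivity metric does not generalize naturally to real-valued domains and fails to explain the input dependence observed in modern LLMs.
❓ Problem: Proposes noise stability as a new simplicity metric that overcomes these theoretical and practical limitations and captures a model's robustness to input noise more comprehensively.
🔍 Analysis: Modern Transformers show "junta-like" dependence on a subset of inputs, which the authors relate to simplicity bias. Average sensitivity captures only single-token perturbations and misses correlated noise applied to all input coordinates at once.
🛠️ Method: Provides a theoretical analysis of noise stability for single-layer attention and ReLU MLP layers, handles multi-layer propagation with a covariance interval propagation approach, and builds a noise stability regularization method on top.
📊 Data & Experiments: On algorithmic and next-token-prediction tasks, the regularizer catalyzes grokking and accelerates training by approximately 35% and 75%, respectively.
⭐ Contributions: Establishes noise stability as an effective tool for understanding simplicity bias in Transformers and develops a regularizer that accelerates training, offering a new perspective for deep learning theory and practice.
Abstract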
Understanding simplicity biases in deep learning offers a promising path toward developing reliable AI. A common metric for this, inspired by Boolean function analysis, is average sensitivity, which captures a model's robustness to single-token perturbations. We argue that average sensitivity has two key limitations: it lacks a natural generalization to real-valued domains and fails to explain the "junta-like" input dependence we empirically observe in modern LLMs. To address these limitations, we propose *noise stability* as a more comprehensive simplicity metric. Noise stability expresses a model's robustness to correlated noise applied to *all* input coordinates simultaneously. We provide a theoretical analysis of noise stability for single-layer attention and ReLU MLP layers and tackle the multi-layer propagation problem with a covariance interval propagation approach. Building on this theory, we develop a practical *noise stability regularization* method. Experiments on algorithmic and next-token-prediction tasks show that our regularizer consistently catalyzes grokking and accelerates training by approximately 35% and 75%, respectively. Our results establish noise stability as a powerful tool for understanding and improving modern Transformers.
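For intuition, the classical Boolean definition of noise stability is Stab_ρ(f) = E[f(x) f(y)], where x is uniform on {-1, +1}^n and each y_i equals x_i with probability (1 + ρ)/2 and is flipped otherwise. The sketch below computes it exactly by enumeration for tiny n. It covers only this textbook Boolean case, not the real-valued extension or the regularizer developed in the paper, and the example functions are chosen here for illustration.

```python
from itertools import product


def noise_stability(f, n, rho):
    """Exact Stab_rho(f) = E[f(x) f(y)] for f: {-1,+1}^n -> {-1,+1}."""
    keep, flip = (1 + rho) / 2, (1 - rho) / 2
    total = 0.0
    for x in product((-1, 1), repeat=n):
        for flips in product((False, True), repeat=n):
            prob = 1.0
            for flipped in flips:
                prob *= flip if flipped else keep
            y = tuple(-xi if flipped else xi for xi, flipped in zip(x, flips))
            total += prob * f(x) * f(y)
    return total / 2 ** n


def dictator(x):
    return x[0]                      # depends on one coordinate


def parity(x):
    return x[0] * x[1] * x[2]        # depends on every coordinate


def majority(x):
    return 1 if sum(x) > 0 else -1


stabilities = {f.__name__: noise_stability(f, 3, 0.8) for f in (dictator, parity, majority)}
```

Functions that depend on few coordinates stay stable under noise (the dictator gives ρ), while parity, which depends on all coordinates jointly, decays as ρ^n. That ordering is what makes noise stability usable as a simplicity measure.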
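Foundation/Frontier Models (incl. LLMs)
Model Architecture
#local routing consistency #MoE analysis #expert offloading
TL;DR: We introduce *local routing consistency* as a critical property for efficient expert offloading, conduct empirical analysis across various MoE LLMs, and provide practical insights for MoE architecture and cache system design.
🎯 Motivation: When large MoE models are deployed on memory-constrained devices, expert caching becomes critical, yet the local consistency of expert activations remains understudied.
❓ Problem: Proposes two metrics for the local routing consistency of MoE models, to assess how well a model suits expert caching and to improve memory efficiency.
🔍 Analysis: Local routing consistency trades off against local load balance, and domain-specialized experts contribute more to routing consistency than vocabulary-specialized ones.
🛠️ Method: Designs two metrics, SRP and SCH, which quantify local routing consistency through the coverage of a fixed group of experts and through cache hit rate, and uses toy models to verify the key factors.
📊 Data & Experiments: Empirical analysis of 20 MoE models with diverse sizes and architectures, comparing how different settings affect local routing consistency and cache effectiveness.
⭐ Contributions: Shows that most models balance cache effectiveness and efficiency at cache sizes about twice the number of active experts, and that structures such as shared experts can lower routing consistency, giving guidance for efficient MoE architecture and cache system design.
Abstract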
Mixture-of-Experts (MoE) enables efficient scaling of large language models (LLMs) with sparsely activated experts during inference.
To effectively deploy large MoE models on memory-constrained devices, many systems introduce expert offloading which caches a subset of experts in fast memory, leaving others on slow memory to run on CPU or load on demand.
While some research has exploited the locality of expert activations, where consecutive tokens activate similar experts, the degree of this **local routing consistency** varies across models and remains understudied.
In this paper, we propose two metrics to measure local routing consistency of MoE models:
(1) **Segment Routing best Performance (SRP)**, which evaluates how well a fixed group of experts can cover the needs of a segment of tokens, and
(2) **Segment Cache best Hit rate (SCH)**, which measures the hit rate of an expert cache utilizing a length of future information under a cache limit.
We analyze 20 MoE LLMs with diverse sizes and architectures and use toy models to verify key factors related to local routing consistency.
We find a strong trade-off between local routing consistency and *local* load balance, while showing that *global* load balance can coexist with local routing consistency.
Meanwhile, settings like shared experts that decrease expert combination space can lead to low local routing consistency.
We further reveal that domain-specialized experts contribute more to routing consistency than vocabulary-specialized ones, and that most models balance between cache effectiveness and efficiency with cache sizes approximately twice the active experts.
These findings pave the way for memory-efficient MoE design and deployment without compromising inference speed.
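Foundation/Frontier Models (incl. LLMs)
Model Architecture
#diffusion models #autoregressive models #large language models #expressiveness
🎯 Motivation: Diffusion language models have become strong competitors to autoregressive models thanks to parallel and any-order generation, but their computational power and limitations have not been studied systematically.
❓ Problem: Investigates whether non-autoregressive generation in diffusion models can solve problems beyond the reach of autoregressive models, and proposes an extension that addresses complexity and flexibility challenges in reasoning problems.
🔍 Analysis: Masked Diffusion Models with sufficiently large context are computationally universal, but their ability to generate in any order does not expand what autoregressive models can already solve.
🛠️ Method: Proposes a new form of generation, "any-process generation", which extends the model with remask, insert, and delete operations, enabling self-correction, variable-length editing, and adaptive parallelism.
📊 Data & Experiments: Theory and experiments show that the new capabilities scale to much harder reasoning problems that existing models cannot solve, particularly generation tasks in which objects evolve through non-sequential processes.
⭐ Contributions: A systematic comparison of the computational power of diffusion and autoregressive models, and a more powerful generation process that extends the potential of LLMs to domains such as coding and science.
Abstract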
Diffusion language models have recently emerged as a competitive alternative to autoregressive language models. Beyond next-token generation, they are more efficient and flexible by enabling parallel and any-order token generation. However, despite empirical successes, their computational power and fundamental limitations remain poorly understood. In this paper, we formally study whether non-autoregressive generation in Masked Diffusion Models (MDM) enables solving problems beyond the reach of Auto-Regressive Models (ARM). Our results show that MDM with sufficiently large context length is computationally universal with decoding steps matching the optimal parallel time complexity in PRAM. However, when controlling for other factors, MDM's flexibility to generate in any-order does not expand what ARM can already solve. To address this, we propose a new form of generation called any-process generation, which extends MDM with capabilities to remask, insert and delete tokens, allowing self-correction, length-variable editing, and adaptive parallelism. Theoretically and empirically, we demonstrate these capabilities enable scalability to significantly harder reasoning problems that are otherwise intractable for ARM and vanilla MDM. Additionally, they prove essential for generation tasks where objects naturally evolve through non-sequential processes, crucial for extending current LLMs beyond natural language to domains such as coding and science.
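Foundation/Frontier Models (incl. LLMs)
Model Architecture
#in-context learning; linear attention; linear dynamical systems; kalman filter; time series
TL;DR: This paper studies how linear attention layers in-context learn linear dynamical systems and shows the optimal weight construction implements one step of Gradient Descent relative to an autoregression objective of window size one.
🎯 Motivation: Studies the expressive power of linear attention layers for in-context learning of linear dynamical systems, as a step toward modeling real-world scenarios with non-i.i.d., noisy data.
❓ Problem: How an optimal weight construction lets a linear attention layer learn linear dynamical systems in context, and how that construction relates to classical iterative algorithms.
🔍 Analysis: The optimal weight construction of a linear attention layer is equivalent to one step of gradient descent on an autoregression objective of window size one; for larger window sizes it connects to a generalization of the Preconditioned Conjugate Gradient method.
🛠️ Method: Analyzes the link between the linear attention weight construction and gradient descent, and uses numerical experiments to examine the extension to larger window sizes.
📊 Data & Experiments: Trains on sequences generated by noise-corrupted linear dynamical systems with Gaussian perturbations and backs the theoretical findings with numerical evidence.
⭐ Contributions: Clarifies the in-context learning ability of linear attention layers and offers plausible hypotheses for recent observations that place transformer performance on par with the Kalman Filter, extending existing theory.
Abstract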
This paper studies the expressive power of linear attention layers for in-context learning (ICL) of linear dynamical systems (LDS). We consider training on sequences of inexact observations produced by noise-corrupted LDSs, with all perturbations being Gaussian. Importantly, this non-i.i.d. data setting is a significant step towards modeling real-world scenarios. We provide the optimal weight construction for a single linear-attention layer and show its equivalence to one step of Gradient Descent relative to an autoregression objective of window size one. Guided by experiments, we uncover a connection to a generalization of the Preconditioned Conjugate Gradient method for larger window sizes. We back our findings with numerical evidence. These results add to the existing understanding of transformers’ expressivity as in-context learners and offer plausible hypotheses for recent observations that place their performance on par with that of the Kalman Filter — the optimal model-dependent learner for this setting.
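The simplest version of the "one step of gradient descent" correspondence can be checked numerically. Starting from W = 0, one gradient step with learning rate `lr` on 0.5 · sum_i |x_{i+1} - W x_i|^2 gives W = lr · sum_i x_{i+1} x_i^T, so the prediction W x_T equals an unnormalised linear attention readout with keys x_i, values x_{i+1}, and query x_T. The sketch assumes zero initialisation and no preconditioning, which may differ from the optimal weights derived in the paper; both function names are hypothetical.

```python
def gd_step_prediction(states, lr):
    """Predict the next state after one gradient step from W = 0."""
    dim = len(states[0])
    w = [[0.0] * dim for _ in range(dim)]
    # The gradient at W = 0 is -sum_i x_{i+1} x_i^T, so the step adds lr * x_{i+1} x_i^T.
    for prev, nxt in zip(states, states[1:]):
        for r in range(dim):
            for c in range(dim):
                w[r][c] += lr * nxt[r] * prev[c]
    last = states[-1]
    return [sum(w[r][c] * last[c] for c in range(dim)) for r in range(dim)]


def linear_attention_prediction(states, lr):
    """Same prediction written as attention without softmax."""
    query = states[-1]
    out = [0.0] * len(query)
    for key, value in zip(states, states[1:]):
        score = sum(k * q for k, q in zip(key, query))  # plain dot-product score
        out = [o + lr * score * v for o, v in zip(out, value)]
    return out


trajectory = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.19], [0.7, 0.27]]
by_gd = gd_step_prediction(trajectory, lr=0.5)
by_attention = linear_attention_prediction(trajectory, lr=0.5)
```

Both functions compute the same sum in a different order, which is the whole content of the equivalence in this simplified setting.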
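Foundation/Frontier Models (incl. LLMs)
Model Architecture
#world modeling #programmatic RL #probabilistic program #symbolic rule learning #intrinsically motivated and open-ended learning
🎯 Motivation: Symbolic world modeling represents environment dynamics as an executable program, but prior work focuses on simple, deterministic environments with abundant data and human guidance. This work targets complex, stochastic environments and self-directed exploration without human-provided rewards or goals.
❓ Problem: Proposes a framework for building symbolic world models in complex, stochastic environments, capturing the dynamics accurately with a very small interaction budget and no human guidance.
🔍 Analysis: Examines why distinguishing and predicting future states is hard under budget-limited self-exploration, and identifies the computational cost of modeling irrelevant attributes under stochastic dynamics.
🛠️ Method: Designs the OneLife framework, which models environment dynamics with conditionally activated programmatic laws inside a probabilistic programming framework; a dynamic computation graph routes inference and optimization only through the relevant laws, avoiding scaling problems.
📊 Data & Experiments: Evaluated on Crafter-OO, a reimplementation of the popular Crafter environment with an object-oriented symbolic state and a pure transition function. State ranking and state fidelity are measured, and OneLife outperforms a strong baseline on 16 of 23 scenarios.
⭐ Contributions: A framework for automatically constructing symbolic world models of complex, unknown environments that is also useful for planning, advancing unguided learning and symbolic rule inference in reinforcement learning.
Abstract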
Symbolic world modeling is the task of inferring and representing the transitional dynamics of an environment as an executable program. Previous research on symbolic world modeling has focused on simple, deterministic environments with abundant data and human-provided guidance. We address the more realistic and challenging problem of learning a symbolic world model in a complex, stochastic environment with severe constraints: a limited interaction budget where the agent has only “one life” to explore a hostile environment and no external guidance in the form of human-provided, environment-specific rewards or goals. We introduce OneLife, a framework that models world dynamics through conditionally-activated programmatic laws within a probabilistic programming framework. Each law operates through a precondition-effect structure, allowing it to remain silent on irrelevant aspects of the world state and predict only the attributes it directly governs. This creates a dynamic computation graph that routes both inference and optimization only through relevant laws for each transition, avoiding the scaling challenges that arise when all laws must contribute to predictions about a complex, hierarchical state space, and enabling accurate learning of stochastic dynamics even when most rules are inactive at any given moment. To evaluate our approach under these demanding constraints, we introduce a new evaluation protocol that measures (a) state ranking, the ability to distinguish plausible future states from implausible ones, and (b) state fidelity, the ability to generate future states that closely resemble reality. We develop and evaluate our framework on Crafter-OO, our reimplementation of the popular Crafter environment that exposes a structured, object-oriented symbolic state and a pure transition function that operates on that state alone. OneLife can successfully learn key environment dynamics from minimal, unguided interaction, outperforming a strong baseline on 16 out of 23 scenarios tested. We also demonstrate the world model’s utility for planning, where rollouts simulated within the world model successfully identify superior strategies in multi-step goal-oriented tasks. Our work establishes a foundation for autonomously constructing programmatic world models of unknown, complex environments.
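The precondition-effect structure described in this abstract can be illustrated with a small sketch. This is not the OneLife implementation: the `Law` class, the `step` function, and the toy Crafter-like attributes (`wood`, `food`, `near_tree`, `table`) are all hypothetical, and the laws here are deterministic, whereas the paper places them in a probabilistic programming framework.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]


@dataclass
class Law:
    """A hypothetical law: fires only when its precondition holds."""
    name: str
    precondition: Callable[[State, str], bool]
    # The effect returns ONLY the attributes this law governs.
    effect: Callable[[State, str], Dict[str, int]]


def step(state: State, action: str, laws: List[Law]) -> Tuple[State, List[str]]:
    """Apply every active law to the pre-transition state; inactive laws stay silent."""
    next_state = dict(state)
    active = []
    for law in laws:
        if law.precondition(state, action):
            next_state.update(law.effect(state, action))
            active.append(law.name)
    return next_state, active


laws = [
    Law("collect_wood",
        lambda s, a: a == "collect" and s["near_tree"] == 1,
        lambda s, a: {"wood": s["wood"] + 1}),
    Law("hunger",
        lambda s, a: s["food"] > 0,
        lambda s, a: {"food": s["food"] - 1}),
    Law("craft_table",
        lambda s, a: a == "craft" and s["wood"] >= 1,
        lambda s, a: {"wood": s["wood"] - 1, "table": 1}),
]

state = {"wood": 0, "food": 3, "near_tree": 1, "table": 0}
next_state, active = step(state, "collect", laws)
```

Only the two laws whose preconditions hold touch the state, so the set of active laws plays the role of the per-transition computation graph.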
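Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Mixture of Experts #memorization #reasoning #scaling laws #large language models
TL;DR:Memorization skills consistently benefit from higher sparsity, while reasoning skills require balancing active FLOPs with total tokens per parameter; the optimal point shifts with the compute budget.
🎯 Motivation: Scaling of large language models has been guided by empirical scaling laws, but their coefficients shift whenever the architecture or data pipeline changes. Mixture-of-Experts (MoE) models add sparsity as a new dimension, and its effect on different capabilities needs study.
❓ Problem addressed: Investigates how MoE sparsity affects memorization skills and reasoning skills, and how to choose the optimal sparsity under a given compute budget.
🔍 Key observations: Among models with identical training loss, those with greater active compute reach higher reasoning accuracy. Memorization tasks improve with more parameters, while reasoning tasks benefit from an optimal tokens-per-parameter (TPP) ratio, which indicates that reasoning is data-hungry.
🛠️ Main method: Trains families of MoE models that systematically vary total parameters, active parameters, and top-k routing under fixed compute budgets, disentangling pre-training loss from downstream accuracy.
📊 Data and experiments: Runs multiple model configurations under fixed compute budgets to measure the effects of active compute and TPP. Neither GRPO post-training nor increased test-time compute changes the trends. All code, data sources, and logs are released to support reproducibility and future work.
⭐ Main contributions: Argues that optimal MoE sparsity must be determined jointly by active FLOPs and TPP, revising the classical picture of compute-optimal scaling, and shows that memorization and reasoning tasks follow different optimization principles.
Full abstract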
Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes.
Mixture‑of‑Experts (MoE) models, now standard in state‑of‑the‑art systems, introduce a new sparsity dimension that current dense‑model frontiers overlook.
We investigate how MoE sparsity influences two distinct capability regimes: memorization skills and reasoning skills.
By training MoE families that vary total parameters, active parameters, and top-$k$ routing under fixed compute budgets, we disentangle pre-training loss from downstream accuracy.
Our results reveal two principles. First, Active FLOPs: models with identical training loss but greater active compute achieve higher reasoning accuracy. Second, Total tokens per parameter (TPP): memorization tasks improve with more parameters, while reasoning tasks benefit from optimal TPP, indicating that reasoning is data-hungry.
Neither reinforcement learning post-training (GRPO) nor increased test-time compute alters these trends.
We therefore argue that optimal MoE sparsity must be determined jointly by active FLOPs and TPP, revising the classical picture of compute-optimal scaling.
All code, data sources, and logs are released to facilitate reproducibility and future work.
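The two quantities named in this abstract can be made concrete with a short arithmetic sketch. It rests on two assumptions that the abstract does not state: training FLOPs are approximated by the common rule of thumb 6 × active parameters × tokens, and TPP is read as training tokens divided by total parameters. The configurations and the budget are invented for illustration.

```python
def training_flops(active_params: float, tokens: float) -> float:
    """Rule-of-thumb estimate (an assumption): 6 * active parameters * tokens."""
    return 6.0 * active_params * tokens


def tokens_per_parameter(total_params: float, tokens: float) -> float:
    """TPP, read here (an assumption) as tokens divided by TOTAL parameters."""
    return tokens / total_params


budget = 6e20  # hypothetical fixed training budget in FLOPs

# name -> (total parameters, active parameters); all values are illustrative
configs = {
    "dense": (1e9, 1e9),
    "moe_sparse": (8e9, 1e9),
    "moe_sparser": (4e9, 0.5e9),
}

summary = {}
for name, (total, active) in configs.items():
    tokens = budget / (6.0 * active)  # tokens affordable at this budget
    summary[name] = {
        "tokens": tokens,
        "tpp": tokens_per_parameter(total, tokens),
        "flops": training_flops(active, tokens),
    }
```

Under these assumptions, raising sparsity at fixed active parameters keeps the token count unchanged but lowers TPP (100 for the dense model versus 12.5 for `moe_sparse`), which is the tension between capacity and data that the abstract attributes to reasoning tasks.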
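Foundation/Frontier Models (incl. LLMs)
Model Architecture
#RNN #Mamba #SSM #Transformers #Parallelization #Parallel scan #Nonlinear
TL;DR:We break the sequential bottleneck of nonlinear RNNs, enabling training of billion-scale LSTM/GRU models, competitive with modern architectures
🎯 Motivation: The inherently sequential computation of RNNs has long limited parallelism and blocked large-scale training, which led to the dominance of parallelizable architectures such as Transformers and SSMs. The linear structure of SSMs, however, limits their ability to express complex nonlinear sequence dependencies.
❓ Problem addressed: Proposes the ParaRNN framework, which breaks the sequence-parallelization barrier for nonlinear RNNs and enables efficient training of large LSTM/GRU models that are competitive with mainstream modern architectures.
🔍 Key observations: Casting the sequence of nonlinear recurrence relations as a single system of equations, and solving it with Newton iterations and custom parallel reductions, removes much of the cost of sequential evaluation and greatly speeds up training.
🛠️ Main method: Solves the recurrence in parallel with Newton iterations combined with custom parallel reductions, and applies the approach to adaptations of the LSTM and GRU architectures.
📊 Data and experiments: Trains nonlinear RNNs with 7B parameters whose perplexity is comparable to similarly sized Transformer and Mamba2 models, with speedups of up to 665× over naive sequential application.
⭐ Main contributions: Introduces and implements ParaRNN, which removes the parallelization bottleneck of nonlinear RNNs and supports large-scale training, and releases the codebase as open source to accelerate research on efficient sequence modeling.
Full abstract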
Recurrent Neural Networks (RNNs) laid the foundation for sequence modeling, but their intrinsic sequential nature restricts parallel computation, creating a fundamental barrier to scaling. This has led to the dominance of parallelizable architectures like Transformers and, more recently, State Space Models (SSMs). While SSMs achieve efficient parallelization through structured linear recurrences, this linearity constraint limits their expressive power and precludes modeling complex, nonlinear sequence-wise dependencies.
To address this, we present ParaRNN, a framework that breaks the sequence-parallelization barrier for nonlinear RNNs. Building on prior work, we cast the sequence of nonlinear recurrence relationships as a single system of equations, which we solve in parallel using Newton's iterations combined with custom parallel reductions. Our implementation achieves speedups of up to $665\times$ over naive sequential application, allowing training nonlinear RNNs at unprecedented scales. To showcase this, we apply ParaRNN to adaptations of LSTM and GRU architectures, successfully training models of 7B parameters that attain perplexity comparable to similarly-sized Transformers and Mamba2 architectures.
To accelerate research in efficient sequence modeling, we release the ParaRNN codebase as an open-source framework for automatic training-parallelization of nonlinear RNNs, enabling researchers and practitioners to explore new nonlinear RNN models at scale.
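The idea of treating the whole recurrence as one system of equations can be sketched on a scalar toy RNN, h_t = tanh(w·h_{t-1} + x_t). This is a minimal illustration, not the ParaRNN code: the function names are hypothetical, and the linear recurrence inside each Newton step is solved with a plain loop here, standing in for the parallel reduction the abstract describes.

```python
import math


def sequential_rnn(xs, w, h0=0.0):
    """Reference: evaluate h_t = tanh(w * h_{t-1} + x_t) one step at a time."""
    hs, h = [], h0
    for x in xs:
        h = math.tanh(w * h + x)
        hs.append(h)
    return hs


def newton_rnn(xs, w, h0=0.0, max_iters=50, tol=1e-12):
    """Solve all T equations h_t - tanh(w * h_{t-1} + x_t) = 0 jointly."""
    hs = [0.0] * len(xs)  # initial guess for the whole trajectory
    for it in range(max_iters):
        prev = [h0] + hs[:-1]
        pred = [math.tanh(w * p + x) for p, x in zip(prev, xs)]
        resid = [h - f for h, f in zip(hs, pred)]
        if max(abs(r) for r in resid) < tol:
            return hs, it
        # Jacobian of step t with respect to h_{t-1}
        jac = [w * (1.0 - f * f) for f in pred]
        # Newton step: delta_t = jac_t * delta_{t-1} - resid_t, a LINEAR
        # recurrence. A parallel scan could solve it; a loop is used here.
        deltas, d = [], 0.0
        for a, r in zip(jac, resid):
            d = a * d - r
            deltas.append(d)
        hs = [h + d for h, d in zip(hs, deltas)]
    return hs, max_iters


xs = [0.5, -1.0, 0.3, 0.8, -0.2, 1.5, 0.0, -0.7]
w = 0.9
reference = sequential_rnn(xs, w)
solution, n_iter = newton_rnn(xs, w)
```

Each Newton step makes at least one more leading state exact, so the iteration needs at most as many steps as the sequence length, and usually far fewer. The payoff comes from the inner linear recurrence being parallelizable.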
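Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Large Language Model #Fine-Tuning
🎯 Motivation: Parameter-efficient fine-tuning (PEFT) methods are increasingly important for rapidly adapting large language models, but Prefix-Tuning has shown limited effectiveness on modern high-performing models, so its architecture needs to be improved.
❓ Problem addressed: The effectiveness of Prefix-Tuning is limited by a tradeoff, inside the attention head, between the contributions of the input prompt and the parameterized prefix, which hurts its performance on modern models.
🔍 Key observations: The authors show empirically that the bottleneck of Prefix-Tuning on modern LLMs stems from the inherent limitation of placing the prefix module inside the attention head.
🛠️ Main method: Proposes PrefixMemory-Tuning, which moves the prefix module out of the attention head and makes it more expressive.
📊 Data and experiments: Across diverse benchmarks, PrefixMemory-Tuning consistently outperforms existing Prefix-Tuning methods and is competitive with modern PEFT methods on several general benchmarks.
⭐ Main contributions: Proposes an improved architecture that makes Prefix-Tuning substantially stronger and restores its potential as a competitive direction in LLM fine-tuning research.
Full abstract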
Parameter-Efficient Fine-Tuning (PEFT) methods have become crucial for rapidly adapting large language models (LLMs) to downstream tasks. Prefix-Tuning, an early and effective PEFT technique, demonstrated the ability to achieve performance comparable to full fine-tuning with significantly reduced computational and memory overhead. However, despite its earlier success, its effectiveness in training modern state-of-the-art LLMs has been very limited. In this work, we demonstrate empirically that Prefix-Tuning underperforms on LLMs because of an inherent tradeoff between the contribution of input prompt and parameterized prefix within the attention head. This motivates us to introduce PrefixMemory-Tuning, an architecture that generalizes the principles of Prefix-Tuning while addressing its shortcomings by shifting the prefix module out of the attention head itself and improving its expressiveness. Our experiments show that, across diverse benchmarks, PrefixMemory-Tuning consistently outperforms existing Prefix-Tuning methods. Notably, it achieves competitive performance with modern PEFTs on several general benchmarks, highlighting a potential extension of Prefix-Tuning approaches to become state-of-the-art. Our findings suggest that by overcoming its inherent limitations, Prefix-Tuning can remain a competitive and relevant research direction in the landscape of parameter-efficient LLM adaptation.
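One way to read the tradeoff inside the attention head is through a standard property of softmax: attention over a concatenated prefix and input equals a convex combination of attention over each part. The sketch below checks that identity numerically with scalar values. It illustrates in-head Prefix-Tuning only, not the PrefixMemory-Tuning architecture, and all names and numbers are hypothetical.

```python
import math


def attend(scores, values):
    """Softmax-weighted average of scalar values."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def prefix_share(prefix_scores, input_scores):
    """Fraction of the softmax mass that lands on the prefix positions."""
    z_prefix = sum(math.exp(s) for s in prefix_scores)
    z_input = sum(math.exp(s) for s in input_scores)
    return z_prefix / (z_prefix + z_input)


prefix_scores, prefix_values = [0.5, 1.0], [10.0, 20.0]
input_scores, input_values = [2.0, 0.0, 1.0], [1.0, 2.0, 3.0]

alpha = prefix_share(prefix_scores, input_scores)
joint = attend(prefix_scores + input_scores, prefix_values + input_values)
mixed = (alpha * attend(prefix_scores, prefix_values)
         + (1.0 - alpha) * attend(input_scores, input_values))
```

Because `alpha` and `1 - alpha` sum to one, any attention mass the prefix gains is taken from the input prompt, which is consistent with the tradeoff the abstract describes.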
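Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Rotary Position Embedding #Frequency Entropy #Large Language Model
TL;DR:Frequency Entropy enables analysis of RoPE on a rotational pair basis, allowing measurement of RoPE's periodicity and bands.
🎯 Motivation: RoPE is widely used in Transformers to encode positional information, but the internal structure of its frequency dimensions is not systematically understood, and prior studies report conflicting findings on the roles of high- and low-frequency dimensions.
❓ Problem addressed: Provides a systematic framework that interprets the frequency dimensions of RoPE in a unified way and quantifies the effective utilization of each dimension.
🔍 Key observations: Analysis of the Llama-4 model shows that the periodicity captured by FE appears in RoPE layers but not in NoPE layers. The dimensions where energy concentrates are spread across the spectrum rather than confined to specific dimensions, and extreme-entropy dimensions are often redundant.
🛠️ Main method: Proposes Frequency Entropy (FE), a metric that quantifies the effective utilization of each RoPE frequency dimension, together with a per-dimension analysis of how the sinusoidal components of RoPE contribute to model representations and where energy concentrates.
📊 Data and experiments: In experiments on Llama-4, attenuating extreme-entropy dimensions at inference yields downstream accuracy that is statistically indistinguishable from the baseline, with modest perplexity improvements on average.
⭐ Main contributions: Provides FE as a simple, general diagnostic for RoPE, with implications for both analysis and model design.
Full abstract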
Rotary Position Embeddings (RoPE) are widely used in Transformers to encode positional information in token representations, yet the internal frequency structure of RoPE remains poorly understood. Previous studies have reported conflicting findings on the roles of high- and low-frequency dimensions, offering empirical observations but no unifying explanation. In this paper, we present a systematic framework that bridges these disparate results. We introduce Frequency Entropy (FE), a metric that quantifies the effective utilization of each RoPE frequency dimension, and we provide an analysis of how RoPE’s sinusoidal components contribute to model representations on a per-dimension basis. Based on an analysis of the Llama-4 model, which incorporates both RoPE and NoPE layers, we find that the periodicity captured by FE appears in RoPE layers but not in NoPE layers. Furthermore, FE identifies dimensions in which energy concentrates under RoPE. These characteristics are observed across the spectrum rather than being confined to specific dimensions. Moreover, attenuating extreme-entropy dimensions at inference yields downstream accuracy that is statistically indistinguishable from the baseline, with modest perplexity improvements on average, suggesting that such dimensions are often redundant. Overall, FE provides a simple, general diagnostic for RoPE with implications for analysis and design.
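For readers unfamiliar with the rotational pairs that FE analyzes, the sketch below shows standard RoPE on a toy vector. It does not implement Frequency Entropy, whose definition is not given in the abstract. Pairing adjacent dimensions and using base 10000 are common conventions assumed here; some implementations pair dimension i with i + d/2 instead.

```python
import math


def rope_frequencies(head_dim, base=10000.0):
    """One rotation frequency per pair of dimensions."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def apply_rope(vec, position, base=10000.0):
    """Rotate each adjacent pair (vec[2i], vec[2i+1]) by position * theta_i."""
    out = []
    for i, theta in enumerate(rope_frequencies(len(vec), base)):
        angle = position * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.1]

# Same relative offset (3) at two different absolute positions.
score_near = dot(apply_rope(q, 5), apply_rope(k, 2))
score_far = dot(apply_rope(q, 13), apply_rope(k, 10))
```

The two scores agree because each pair contributes a term that depends only on the position difference, which is why per-pair (per-frequency) analysis is a natural unit for studying RoPE.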
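Foundation/Frontier Models (incl. LLMs)
Model Architecture
#attention #transformers #model robustness
TL;DR:New attention formulation obtained by normalizing only the keys. Produces stable training, improved performance and robustness.
🎯 Motivation: The attention mechanism in Transformers is powerful, but query and key norms that grow arbitrarily can make training unstable.
❓ Problem addressed: Normalizes the key vectors in attention to prevent norm blow-up during training and to improve stability and robustness.
🔍 Key observations: Even in simple Transformer models, easy-to-learn spurious patterns in the data can drive query and key norms to grow arbitrarily, which destabilizes training under standard attention.
🛠️ Main method: Proposes a new attention formulation, QUEST (QUEry-modulated Spherical aTtention), which constrains the keys to a hyperspherical latent space while still letting each token control the sharpness of its attention distribution.
📊 Data and experiments: Experiments focus on vision applications, with additional domains for validation, and show improvements in training stability, performance, and robustness to data corruptions and adversarial attacks.
⭐ Main contributions: Introduces QUEST, a simple drop-in replacement for standard attention that improves training stability, model performance, and robustness.
Full abstract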
The Transformer model architecture has become one of the most widely used in deep learning and the attention mechanism is at its core. The standard attention formulation uses a softmax operation applied to a scaled dot product between query and key vectors. We explore the role played by norms of the queries and keys, which can cause training instabilities when they arbitrarily increase. We demonstrate how this can happen even in simple Transformer models, in the presence of easy-to-learn spurious patterns in the data. We propose a new attention formulation, QUEry-modulated Spherical aTtention (QUEST), that constrains the keys to a hyperspherical latent space, while still allowing individual tokens to flexibly control the sharpness of the attention distribution. QUEST can be easily used as a drop-in replacement for standard attention. We focus on vision applications while also exploring other domains to highlight the method's generality. We show that (1) QUEST trains without instabilities and (2) produces models with improved performance (3) that are robust to data corruptions and adversarial attacks.
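A rough sketch of key-only normalization follows. The abstract does not give the exact QUEST formula, so this is an assumed reading: keys are projected to unit L2 norm while queries are left unnormalized, so the query norm acts as a per-token sharpness control. The function names are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def unit(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def key_normalized_weights(query, keys):
    """Attention weights with unit-norm keys and an unnormalized query (assumed form)."""
    scores = [sum(q * k for q, k in zip(query, unit(key))) for key in keys]
    return softmax(scores)


keys = [[1.0, 0.0], [0.0, 3.0]]
query = [1.0, 0.5]

base = key_normalized_weights(query, keys)
# Inflating key norms has no effect on the attention distribution.
inflated = key_normalized_weights(query, [[100.0 * x for x in k] for k in keys])
# Scaling the query sharpens the distribution.
sharper = key_normalized_weights([4.0 * x for x in query], keys)
```

In this reading, growth in key norms cannot change the logits, while each token can still sharpen or flatten its own attention through its query.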
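Foundation/Frontier Models (incl. LLMs)
Model Architecture
#diffusion #autoregressive #large language model
🎯 Motivation: Autoregressive models are slowed by sequential inference. Masked diffusion models decode in parallel, but they cannot use KV caching and must learn dependencies over an intractable space of token combinations, which limits generation quality and efficiency.
❓ Problem addressed: Integrates sequence reorganization with causal attention to address both the computational overhead and the incoherent generation of masked diffusion models.
🔍 Key observations: Masked diffusion models cannot reuse the KV cache, and the complexity of learning over a high-dimensional token combination space degrades both efficiency and accuracy.
🛠️ Main method: Proposes ReFusion, which lifts parallel decoding from the token level to the slot level. It interleaves inter-slot diffusion-based selection with intra-slot autoregressive infilling, and reorders newly generated slots ahead of the remaining masks after each iteration.
📊 Data and experiments: On seven diverse benchmarks, ReFusion improves on prior masked diffusion models by 34% in performance with an average speedup of over 18×, and closes the gap to strong autoregressive models while keeping a 2.33× average speedup.
⭐ Main contributions: Introduces ReFusion, which substantially improves masked diffusion by enabling efficient parallel inference with full KV cache reuse while preserving generation quality.
Full abstract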
Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that integrates sequence reorganization into the causal attention framework. By elevating parallel decoding from the token level to a higher slot level, ReFusion interleaves inter-slot diffusion-based selection with intra-slot autoregressive infilling, while reordering newly generated slots ahead of the remaining masks after each iteration. Consequently, this design simultaneously unlocks full KV cache reuse and reduces learning complexity from an intractable token combination space to a manageable slot-level permutation space. Extensive experiments on seven diverse benchmarks show that ReFusion not only overwhelmingly surpasses prior MDMs with a 34\% performance gain and an over 18$\times$ speedup on average, but also bridges the performance gap to strong ARMs while maintaining a 2.33$\times$ average speedup.
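Foundation/Frontier Models (incl. LLMs)
Model Architecture
#foundation models #relational deep learning #relational data #transformer
TL;DR:A novel architecture for relational data that shows strong zero-shot abilities on unseen datasets after pre-training.
🎯 Motivation: Pretrained Transformers adapt zero-shot to new sequence modeling tasks, but relational domains lack architectures that transfer across datasets and tasks.
❓ Problem addressed: Relational data varies widely in its heterogeneous schemas, graph structures, and functional dependencies, which makes a general model hard to design. The goal is an architecture that applies directly to unseen datasets and tasks.
🔍 Key observations: The Relational Transformer (RT) shows strong zero-shot transfer on relational data: a 22M-parameter model averages 93% of fully supervised AUROC on binary classification tasks with a single forward pass.
🛠️ Main method: Proposes an architecture that specifies tasks via task table prompting, tokenizes cells with table/column metadata, is pretrained with masked token prediction, and uses a Relational Attention mechanism over columns, rows, and primary-foreign key links.
📊 Data and experiments: Pretrained and evaluated on RelBench datasets spanning tasks such as churn and sales forecasting. In the same zero-shot comparison, a 27B-parameter LLM reaches 84% of fully supervised AUROC. Fine-tuning yields state-of-the-art results with high sample efficiency.
⭐ Main contributions: Develops the Relational Transformer as a practical path toward zero-shot foundation models for relational data; the code, models, and data are publicly released.
Full abstract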
Pretrained transformers readily adapt to new sequence modeling tasks via zero-shot prompting, but relational domains still lack architectures that transfer across datasets and tasks.
The core challenge is the diversity of relational data, with varying heterogeneous schemas, graph structures and functional dependencies.
In this paper, we present the Relational Transformer (RT) architecture, which can be pretrained on diverse relational databases and directly applied to unseen datasets and tasks without task- or dataset-specific fine-tuning, or retrieval of in-context examples. RT (i) incorporates task specification via task table prompting, (ii) tokenizes cells with table/column metadata, (iii) is pretrained via masked token prediction, and (iv) utilizes a novel Relational Attention mechanism over columns, rows, and primary-foreign key links.
Pretrained on RelBench datasets spanning tasks such as churn and sales forecasting, RT attains strong zero-shot performance, averaging 93% of fully supervised AUROC on binary classification tasks with a single forward pass of a 22M parameter model, as opposed to 84% for a 27B LLM.
Fine-tuning yields state-of-the-art results with high sample efficiency. Our experimental analyses show that RT's zero-shot transfer leverages task context, relational attention patterns and schema semantics. Overall, RT provides a practical path toward foundation models for relational data.
Code, models, data: https://github.com/snap-stanford/relational-transformer.
Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Vision-Language Models #Multimodal Position Encoding
TL;DR:We analyze multimodal RoPE, distill three guidelines, and introduce MHRoPE and MRoPE‑I—plug‑and‑play variants that consistently outperform prior methods.
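🎯 Motivation: Multimodal position encoding is essential for vision-language models but has seen little systematic study. The paper fills this gap with a comprehensive analysis of multimodal RoPE.
❓ Problem addressed: Existing multimodal position encoding designs lack clear guiding principles, which leads to ambiguous layouts, limited representational capacity, and poor transfer of pre-trained priors.
🔍 Key observations: Analyzing the two core components of RoPE, position design and frequency allocation, reveals three key guidelines: positional coherence, full frequency utilization, and preservation of textual priors.
🛠️ Main method: Proposes two plug-and-play variants, Multi-Head RoPE (MHRoPE) and MRoPE-Interleave (MRoPE-I), which follow the three guidelines and require no architectural changes.
📊 Data and experiments: Extensive experiments on diverse benchmarks, covering general and fine-grained multimodal understanding, show that the new methods consistently outperform existing approaches.
⭐ Main contributions: Presents a systematic analysis of multimodal RoPE that distills three design guidelines, proposes the two plug-and-play variants MHRoPE and MRoPE-I, and reports significant gains across multiple benchmarks.
Full abstract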
Multimodal position encoding is essential for vision-language models, yet it has received little systematic investigation. We conduct a comprehensive analysis of multimodal Rotary Positional Embedding (RoPE) by examining its two core components: position design and frequency allocation. Through extensive experiments, we identify three key guidelines: positional coherence, full frequency utilization, and preservation of textual priors, which ensure unambiguous layout, rich representation, and faithful transfer from the pre-trained LLM. Based on these insights, we propose Multi-Head RoPE (MHRoPE) and MRoPE-Interleave (MRoPE-I), two simple and plug-and-play variants that require no architectural changes. Our methods consistently outperform existing approaches across diverse benchmarks, with significant improvements in both general and fine-grained multimodal understanding. Code is available at https://github.com/JJJYmmm/Multimodal-RoPEs.
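Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Large Language Models #Mixture of Experts #Manifold Regularization
TL;DR:Aligning the routing weight manifold with the manifold of task embedding can significantly improve existing MoE LLMs' downstream task performance by 6-16% in accuracy with lightweight post-training of routers.
🎯 Motivation: Sparse Mixture-of-Experts (MoE) lets large language models scale capability efficiently, but the routers in existing MoE LLMs are consistently suboptimal, which hurts generalization.
❓ Problem addressed: Aligns the manifold of routing weights with that of task embeddings to improve the downstream performance and generalization of existing MoE models.
🔍 Key observations: Across broad downstream tasks, the routers in existing MoE LLMs show a 10-20% accuracy gap to optimal routing.
🛠️ Main method: Proposes a lightweight post-training strategy, Routing Manifold Alignment (RoMA), which adds a manifold regularization term and fine-tunes only the routers. The term encourages the routing weights of each sample to be close to those of its successful neighbors in a task embedding space.
📊 Data and experiments: Fine-tunes the routers of two recent MoE LLMs and evaluates on diverse benchmarks; extensive comparisons with baselines show substantial improvements from RoMA.
⭐ Main contributions: Improves MoE router quality and raises downstream accuracy by 6-16%, while unifying task understanding (by embedding models) with solution generation (by MoE LLMs).
Full abstract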
Sparse Mixture-of-Experts (MoE) architectures have been widely adopted in recent large language models since they can efficiently scale up model capability without increasing the inference cost. However, evaluations on broad downstream tasks reveal a consistent suboptimality of the routers in existing MoE LLMs, which results in a severe performance gap (e.g., 10-20% in accuracy) to the optimal routing. In this paper, we show that aligning the manifold of routing weights with that of task embedding via post-training can effectively reduce the gap and improve MoE LLMs’ generalization performance. Our method, “Routing Manifold Alignment (RoMA)”, introduces an additional manifold regularization term in the post-training objective and only requires lightweight finetuning of routers (with other parameters frozen). Specifically, the regularization encourages the routing weights of each sample to be close to those of its successful neighbors (whose routing weights lead to correct answers) in a task embedding space. Consequently, samples targeting similar tasks will share similar expert choices across layers. Building such bindings between tasks and experts over different samples is essential to achieve better generalization. Moreover, RoMA demonstrates the advantage of unifying the task understanding (by embedding models) with solution generation (by MoE LLMs). In experiments, we finetune routers in two recent MoE LLMs using RoMA. Evaluations on diverse benchmarks and extensive comparisons with baselines show the substantial improvement brought by RoMA.
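The regularization described in this abstract can be sketched as a penalty on the distance between a sample's routing weights and those of its nearest successful neighbors in task embedding space. The exact objective is not given in the abstract, so the form below (cosine-similarity weights, squared Euclidean distance, k nearest neighbors) is an assumption, and every name is hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def manifold_regularizer(task_emb, routing, success, k=1):
    """Assumed form: pull each sample's routing weights toward those of its
    k most similar SUCCESSFUL neighbors in task embedding space."""
    total = 0.0
    n = len(routing)
    for i in range(n):
        neighbors = sorted(
            ((cosine(task_emb[i], task_emb[j]), j)
             for j in range(n) if j != i and success[j]),
            reverse=True,
        )
        for sim, j in neighbors[:k]:
            total += max(sim, 0.0) * sq_dist(routing[i], routing[j])
    return total / n


task_emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
success = [True, False, True]  # sample 1 currently answers incorrectly

misaligned = manifold_regularizer(task_emb, [[0.7, 0.3], [0.2, 0.8], [0.1, 0.9]], success)
aligned = manifold_regularizer(task_emb, [[0.7, 0.3], [0.7, 0.3], [0.1, 0.9]], success)
```

In the toy data, the failing sample 1 sits next to the successful sample 0 in task space, so the penalty shrinks to zero once its routing weights match those of sample 0. In the method described above, only router parameters would receive gradients from such a term.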
Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Sparse Transformer #Parametric scaling #Embedding Layers #Foundation Models #Pre-training #Model Architecture
TL;DR:STEM replaces each FFN up-projection with a per-layer embedding lookup to scale parametric capacity without increasing per-token compute or cross-device communication, yielding FLOP-efficient performance gains.
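🎯 Motivation: Fine-grained sparsity raises parametric capacity but brings training instability, load-balancing difficulties, and communication overhead, which calls for a simpler and more efficient solution.
❓ Problem addressed: Designs a method that scales parametric capacity without increasing per-token compute or cross-device communication.
🔍 Key observations: Conventional sparse models suffer from load imbalance and runtime routing constraints. A static, token-indexed lookup alleviates these issues, and the learned embedding spaces have a large angular spread, which enhances knowledge storage capacity.
🛠️ Main method: Proposes STEM, which replaces the FFN up-projection with a layer-local embedding lookup while keeping the gate and down-projection dense, and which supports CPU offload with asynchronous prefetch.
📊 Data and experiments: Validated at the 350M and 1B model scales, with improvements of up to roughly 3-4% in average downstream performance and notable gains on knowledge- and reasoning-heavy benchmarks (ARC-Challenge, OpenBookQA, GSM8K, MMLU).
⭐ Main contributions: Designs STEM, which combines high parametric capacity with low compute cost and is simpler to train and deploy than existing fine-grained sparse models.
Full abstract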
Fine-grained sparsity promises higher parametric capacity without proportional per-token compute, but often suffers from training instability, load balancing, and communication overhead. We introduce STEM (Scaling Transformers with Embedding Modules), a static, token-indexed approach that replaces the FFN up-projection with a layer-local embedding lookup while keeping the gate and down-projection dense. This removes runtime routing, enables CPU offload with asynchronous prefetch, and decouples capacity from both per-token FLOPs and cross-device communication. Empirically, STEM trains stably despite extreme sparsity. It improves downstream performance over dense baselines while reducing per-token FLOPs and parameter accesses (eliminating roughly one-third of FFN parameters). STEM learns embedding spaces with large angular spread, which enhances its knowledge storage capacity. In addition, STEM strengthens long-context performance: as sequence length grows, more distinct parameters are activated, yielding practical test-time capacity scaling. Across 350M and 1B model scales, STEM delivers up to ~3-4% improvements in average downstream performance, with notable gains on knowledge and reasoning-heavy benchmarks (ARC-Challenge, OpenBookQA, GSM8K, MMLU). Overall, STEM is an effective way of scaling parametric memory while remaining simpler to train and deploy than existing fine-grained sparse models.
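The architectural change can be sketched by comparing a dense gated FFN with a variant whose up-projection is a table lookup indexed by token id. This is a toy illustration rather than the STEM code: the SiLU gate, the tiny shapes, and all names are assumptions.

```python
import math


def matvec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def silu(z):
    return z / (1.0 + math.exp(-z))


def dense_gated_ffn(x, w_gate, w_up, w_down):
    """Baseline: down(act(gate x) * (up x)), with three dense projections."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])


def stem_style_ffn(x, token_id, w_gate, up_table, w_down):
    """Sketch: the up-projection matmul becomes a per-layer lookup by token id."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = up_table[token_id]  # no matmul, no runtime routing decision
    return matvec(w_down, [g * u for g, u in zip(gate, up)])


w_gate = [[0.5, 0.2], [-0.3, 0.8], [1.0, -1.0]]  # d_ff x d_model
w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # d_ff x d_model
w_down = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]     # d_model x d_ff
x = [1.0, -0.5]

# Row 0 is set to what the dense up-projection would produce for x, to show
# that the two layers coincide in that case; row 1 is an unrelated entry.
up_table = [matvec(w_up, x), [1.0, 1.0, 1.0]]

dense_out = dense_gated_ffn(x, w_gate, w_up, w_down)
stem_out_same = stem_style_ffn(x, 0, w_gate, up_table, w_down)
stem_out_other = stem_style_ffn(x, 1, w_gate, up_table, w_down)
```

Since the looked-up vector depends only on the token id, it can be fetched ahead of time, which is what makes offloading and prefetching plausible.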
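Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Scaling Laws #Model Architecture #Inference-Efficient
🎯 Motivation: As large language models scale up, inference cost has become a pressing concern, which makes the trade-off between model architecture and inference efficiency an important question.
❓ Problem addressed: Studies how key architectural factors (hidden size, the MLP-to-attention parameter ratio, and grouped-query attention) affect inference cost and accuracy, and searches for architectures that are both inference-efficient and accurate.
🔍 Key observations: A conditional scaling law augmented with architectural information reliably predicts optimal architectural choices, and the resulting models outperform existing open-source baselines in both inference efficiency and accuracy.
🛠️ Main method: Extends the Chinchilla framework with a conditional scaling law and adds a search framework over hidden size, MLP-to-attention ratio, and grouped-query attention configuration.
📊 Data and experiments: Trains more than 200 models spanning 80M to 3B parameters and 8B to 100B training tokens to fit and validate the conditional scaling law.
⭐ Main contributions: Proposes a conditional scaling law with architectural information. Under the same training budget, the optimized architectures achieve up to 2.1% higher accuracy and 42% greater inference throughput than LLaMA-3.2.
Full abstract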
Scaling the number of parameters and the size of training data has proven to be an effective strategy for improving large language model (LLM) performance. Yet, as these models grow increasingly powerful and widely deployed, the cost of inference has become a pressing concern. Despite its importance, the trade-off between model accuracy and inference efficiency remains underexplored. In this work, we examine how key architectural factors, hidden size, the allocation of parameters between MLP and attention (mlp-to-attention ratio), and grouped-query attention (GQA), influence both inference cost and accuracy. We introduce a conditional scaling law that augments the Chinchilla framework with architectural information, along with a search framework for identifying architectures that are simultaneously inference-efficient and accurate. To validate our approach, we train more than 200 models spanning 80M to 3B parameters and 8B to 100B training tokens, and fit the proposed conditional scaling law. Our results show that the conditional scaling law reliably predicts optimal architectural choices and that the resulting models outperform existing open-source baselines. Under the same training budget, optimized architectures achieve up to 2.1\% higher accuracy and 42\% greater inference throughput compared to LLaMA-3.2.
Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Linear Attention #Language Model
TL;DR:SSE is a partitioned state expansion method under a row-sparse update framework for linear attention, improving retrieval and reasoning.
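🎯 Motivation: Transformers struggle with long contexts because of quadratic computation and linear memory growth, so more effective context compression is needed.
❓ Problem addressed: Existing linear attention variants compress context into fixed-size states but often degrade on in-context retrieval and reasoning tasks; the goal is to strengthen context representation and reasoning accuracy.
🔍 Key observations: Sparse state updates and state expansion in linear attention extend receptive fields and reduce information interference, which yields more discriminative state representations.
🛠️ Main method: Proposes a row-sparse state update formulation based on softmax top-k row selection, together with Sparse State Expansion (SSE), which expands the contextual state into multiple partitions and decouples parameter size from state capacity while keeping the sparse row-selection paradigm.
📊 Data and experiments: Validates SSE on language modeling, in-context retrieval, and mathematical reasoning benchmarks. After reinforcement learning training, the 2B SSE-H model scores 64.5 on AIME24 and 50.2 on AIME25, the best result among small reasoning models.
⭐ Main contributions: Presents an efficient architecture for long-context modeling that improves retrieval and mathematical reasoning through sparse updates and state expansion, and sets a new reference point for small reasoning models.
Full abstract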
The Transformer architecture, despite its widespread success, struggles with long-context scenarios due to quadratic computation and linear memory growth. While various linear attention variants mitigate these efficiency constraints by compressing context into fixed-size states, they often degrade performance in tasks such as in-context retrieval and reasoning. To address this limitation and achieve more effective context compression, we propose two key innovations. First, we introduce a row-sparse update formulation for linear attention by conceptualizing state updating as information categorization. This enables sparse state updates via softmax-based top-$k$ row selection, thereby extending receptive fields and reducing information interference. Second, we present Sparse State Expansion (SSE) within the sparse framework, which expands the contextual state into multiple partitions, effectively decoupling parameter size from state capacity while maintaining the sparse row-selection paradigm. Supported by efficient parallelized implementations, our design achieves highly discriminative state representations. We extensively validate SSE in both pure linear and hybrid (SSE-H) architectures across language modeling, in-context retrieval, and mathematical reasoning benchmarks. SSE demonstrates strong retrieval performance and scales favorably with state size. Moreover, after reinforcement learning (RL) training, our 2B SSE-H model achieves state-of-the-art mathematical reasoning performance among small reasoning models, scoring 64.5 on AIME24 and 50.2 on AIME25, significantly outperforming similarly sized open-source Transformers. These results highlight SSE as a promising and efficient architecture for long-context modeling.
Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Normalization Layer
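🎯 Motivation: RMSNorm, the predominant normalization layer in Transformers, discards the norm of its input, and its static scaling factor cannot adapt to diverse inputs and distribution shifts. This limits model performance, especially in zero-shot scenarios.
❓ Problem addressed: Dynamically adjusts the scaling coefficient so that input norm information is preserved, addressing the inability of standard RMSNorm to adapt when the input distribution varies widely.
🔍 Key observations: RMSNorm discards norm information in the forward pass, and a static scaling factor struggles to capture changes in the input distribution, which limits representational capacity and affects optimization and generalization.
🛠️ Main method: Proposes SeeDNorm, which adjusts the scaling coefficient based on the current input to obtain data-dependent, self-rescaled normalization that preserves input norm information, and proposes solutions to the training instability this can cause.
📊 Data and experiments: Validated across model sizes in large language model pre-training and in supervised and unsupervised computer vision tasks, where SeeDNorm consistently outperforms RMSNorm, LayerNorm, and DyT.
⭐ Main contributions: Adds a minimal number of parameters with negligible impact on efficiency while consistently outperforming RMSNorm and other common normalization layers, advancing dynamic normalization methods.
Full abstract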
The normalization layer constitutes an essential component in neural networks. In transformers, the predominantly used RMSNorm constrains vectors to a unit hypersphere, followed by dimension-wise rescaling through a learnable scaling coefficient $\gamma$ to maintain the representational capacity of the model. However, RMSNorm discards the input norm information in the forward pass, and a static scaling factor $\gamma$ may be insufficient to accommodate the wide variability of input data and distributional shifts, thereby limiting further performance improvements, particularly in zero-shot scenarios that large language models routinely encounter. To address this limitation, we propose SeeDNorm, which enhances the representational capability of the model by dynamically adjusting the scaling coefficient based on the current input, thereby preserving the input norm information and enabling data-dependent, self-rescaled dynamic normalization. During backpropagation, SeeDNorm retains the ability of RMSNorm to dynamically adjust the gradient according to the input norm. We provide a detailed analysis of the training optimization for SeeDNorm and propose corresponding solutions to address potential instability issues that may arise when applying SeeDNorm. We validate the effectiveness of SeeDNorm across models of varying sizes in large language model pre-training as well as supervised and unsupervised computer vision tasks. By introducing a minimal number of parameters and with negligible impact on model efficiency, SeeDNorm achieves consistently superior performance compared to previously commonly used normalization layers such as RMSNorm and LayerNorm, as well as element-wise activation alternatives to normalization layers like DyT.
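The claim that RMSNorm discards input norm information can be checked directly. The sketch below implements plain RMSNorm only; it does not implement SeeDNorm, whose formula is not given in the abstract. The epsilon value and function name are illustrative choices.

```python
import math


def rms_norm(x, gamma, eps=1e-6):
    """Standard RMSNorm: divide by the root mean square, then rescale by gamma."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gamma, x)]


x = [1.0, -2.0, 3.0, 0.5]
gamma = [1.0, 1.0, 1.0, 1.0]  # static, input-independent scaling

small = rms_norm(x, gamma)
large = rms_norm([10.0 * v for v in x], gamma)
```

Both calls return essentially the same vector even though the second input is ten times larger, so the magnitude of the input is invisible downstream. A static `gamma` cannot restore it, which is the gap an input-dependent scaling coefficient is meant to close.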
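Foundation/Frontier Models (incl. LLMs)
Model Architecture
#state space models #linear RNNs #linear transformers #sequence modeling
TL;DR:Propose new architecture Transformer-PSM that generalizes state space models to use softmax attention
🎯 Motivation: Modern sequence models must support both parallelizable training and fast sequential inference, and a broader framework is needed to characterize the full class of models with this property.
❓ Problem addressed: Asks how to characterize and build architectures that support parallel prefix-scan evaluation together with linear-time, constant-space sequential inference.
🔍 Key observations: Empirically, Prefix-Scannable Models (PSMs) retain the expressivity of Transformer-based architectures while matching the inference efficiency of state space models, in some cases with better length generalization than either.
🛠️ Main method: Defines PSMs by relaxing the state aggregation operator to allow arbitrary, potentially non-associative functions such as softmax attention, which unifies element-wise RNNs and linear transformers.
📊 Data and experiments: Evaluated on small-scale language modeling and on canonical synthetic tasks, including state tracking and associative recall, showing advantages in inference efficiency and generalization.
⭐ Main contributions: Proposes the Transformer-PSM architecture, which generalizes state space models to use softmax attention, unifies existing architectures, and advances the theory of sequence modeling.
Full abstract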
Modern neural sequence models are designed to meet the dual mandate of parallelizable training and fast sequential inference. Recent developments have given rise to various models, such as Gated Linear Attention (GLA) and Mamba, that achieve such "sequential-parallel duality." This raises a natural question: can we characterize the full class of neural sequence models that support near-constant-time parallel evaluation and linear-time, constant-space sequential inference? We begin by describing a broad class of such models (state space models) as those whose state updates can be computed using the classic parallel prefix scan algorithm with a custom associative aggregation operator. We then define a more general class, Prefix-Scannable Models (PSMs), by relaxing the state aggregation operator to allow arbitrary (potentially non-associative) functions such as softmax attention. This generalization unifies many existing architectures, including element-wise RNNs (e.g., Mamba) and linear transformers (e.g., GLA, Mamba2, mLSTM), while also introducing new models with softmax-like operators that achieve O(1) amortized compute per token and log(N) memory for sequence length N. We empirically evaluate such models on illustrative small-scale language modeling and canonical synthetic tasks, including state tracking and associative recall. Empirically, we find that PSMs retain the expressivity of transformer-based architectures while matching the inference efficiency of state space models, in some cases exhibiting better length generalization than either.
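The associative case that the abstract starts from can be sketched with a prefix scan over affine maps, which reproduces the linear recurrence h_t = a_t·h_{t-1} + b_t. The code covers only this state-space-style case, not the non-associative operators that PSMs add, and the scan is written in a step-doubling (Hillis-Steele) form whose inner list comprehension is what a parallel machine would execute at once. All names are hypothetical.

```python
def combine(left, right):
    """Compose two affine maps h -> a*h + b, applying `left` first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def inclusive_scan(items, op):
    """Step-doubling prefix scan: about log2(n) rounds, each fully parallelizable."""
    out = list(items)
    step = 1
    while step < len(out):
        out = [out[i] if i < step else op(out[i - step], out[i])
               for i in range(len(out))]
        step *= 2
    return out


def sequential_recurrence(a, b, h0):
    """Reference: h_t = a_t * h_{t-1} + b_t evaluated one step at a time."""
    hs, h = [], h0
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        hs.append(h)
    return hs


a = [0.5, 2.0, -1.0, 0.25, 3.0]
b = [1.0, 0.0, 2.0, -1.0, 0.5]
h0 = 1.0

prefixes = inclusive_scan(list(zip(a, b)), combine)
scanned = [A * h0 + B for A, B in prefixes]
reference = sequential_recurrence(a, b, h0)
```

The scan is valid because `combine` is associative. Relaxing that requirement, as PSMs do, means the parallel and sequential evaluation orders are no longer guaranteed to agree by algebra alone.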
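Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Large Language Models #Multimodal Large Language Models #Hallucination #Context Forgetting
🎯 Motivation: Large language models (LLMs) and vision-language models commonly suffer from hallucination and context forgetting. Prior studies suggest that attention drift, where the model's focus shifts from the initial input toward newly generated tokens, is a primary cause, and it harms the faithfulness of generation.
❓ Problem addressed: Targets the hallucination and context forgetting that attention drift causes when LLMs generate long sequences. The core idea is to use the model's intrinsic attention sink to anchor key context information, improving the accuracy and consistency of the output.
🔍 Key observations: The paper highlights an intrinsic property, the attention sink: models consistently allocate high attention to the first token of a sequence (⟨BOS⟩). This token could serve as a stable information anchor but has not been exploited for that purpose.
🛠️ Main method: Proposes SINKTRACK, a training-free, plug-and-play context anchoring technique. It injects key contextual features (for example, from the input image or instruction) into the representation of the ⟨BOS⟩ token, turning it into an information anchor that keeps the model tied to the initial context throughout generation.
📊 Data and experiments: Evaluated on textual and multimodal tasks, with gains of +18.9% on QuAC using Llama3.1-8B-Instruct and +23.0% on M3CoT using Qwen2.5-VL-7B-Instruct. Consistent gains across architectures and scales indicate robustness and generalizability.
⭐ Main contributions: Uses the attention sink to build SINKTRACK, an efficient and general context anchoring method that requires no training, adds negligible inference overhead, and mitigates hallucination and forgetting in both textual and multimodal tasks.
Full abstract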
Large language models (LLMs) suffer from hallucination and context forgetting. Prior studies suggest that attention drift is a primary cause of these problems, where LLMs' focus shifts towards newly generated tokens and away from the initial input context. To address this, we make use of a related, intrinsic characteristic of LLMs: attention sink – the tendency to consistently allocate high attention to the very first token (i.e., ⟨BOS⟩) of a sequence. Concretely, we propose an advanced context anchoring method, SINKTRACK, which treats ⟨BOS⟩ as an information anchor and injects key contextual features (such as those derived from the input image or instruction) into its representation. As such, the LLM remains anchored to the initial input context throughout the entire generation process. SINKTRACK is training-free, plug-and-play, and introduces negligible inference overhead. Experiments demonstrate that SINKTRACK mitigates hallucination and context forgetting across both textual (e.g., +18.9% on QuAC with Llama3.1-8B-Instruct) and multi-modal (e.g., +23.0% on M3CoT with Qwen2.5-VL-7B-Instruct) tasks. Its consistent gains across different architectures and scales underscore its robustness and generalizability. We also analyze its underlying working mechanism from the perspective of information delivery. Our source code is available at anonymous GitHub.
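Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Steering #MoE #Mixture-of-Experts #LLM #Safety
TL;DR:A framework for steering MoE LLMs by detecting and controlling behavior-associated experts.
🎯 Motivation: As large language models are deployed more widely, keeping their behavior safe and controllable becomes essential. The Mixture-of-Experts (MoE) architecture adds flexibility but also introduces potential vulnerabilities and risks of uncontrolled behavior.
❓ Problem addressed: Existing methods struggle to control LLM behavior efficiently at inference time. The work aims to control behavior dynamically, without retraining, by activating or deactivating experts associated with specific behaviors.
🔍 Key observations: Comparing paired inputs that demonstrate opposite behaviors (for example, safe versus unsafe) shows that some experts activate preferentially for particular behaviors, which makes them effective control points for model output.
🛠️ Main method: Proposes the SteerMoE framework, which detects key experts from how often they activate on contrasting inputs and then selectively activates or deactivates them during inference to adjust model behavior.
📊 Data and experiments: Across 11 benchmarks and 6 LLMs, steering raises safety by up to 20% and faithfulness by up to 27%. Conversely, unsafe steering lowers safety by 41% on its own and by 100% when combined with existing jailbreak methods, bypassing all safety guardrails.
⭐ Main contributions: Proposes SteerMoE, a lightweight, effective, and widely applicable test-time control method, and reveals vulnerabilities specific to MoE LLMs.
Full abstract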
Mixture-of-Experts (MoE) in Large Language Models (LLMs) routes each token through a subset of specialized Feed-Forward Networks (FFN), known as experts. We present SteerMoE, a framework to steer MoE models by detecting and controlling behavior-associated experts. We detect key experts by comparing how often they activate between paired inputs that demonstrate opposite behaviors (e.g., safe vs. unsafe). By selectively activating or deactivating such experts during inference, we control behaviors like faithfulness and safety without fine-tuning. Across 11 benchmarks and 6 LLMs, our steering raises safety by up to +20% and faithfulness by +27%. Alternatively, unsafe steering drops safety by -41% alone, and -100% when combined with existing jailbreak methods, bypassing all safety guardrails. Overall, SteerMoE offers a lightweight, effective, and widely applicable test-time control, while revealing unique vulnerabilities in MoE LLMs.
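The detection step described above (comparing how often each expert activates on paired inputs) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the routing traces are toy data, and `expert_activation_rate`, `steer_router_logits`, and the large additive `bias` are hypothetical names and choices made for the example.

```python
import numpy as np

def expert_activation_rate(routed, num_experts):
    """Fraction of tokens that route to each expert.

    routed: int array (num_tokens, top_k) holding the expert ids chosen per token.
    """
    counts = np.bincount(routed.ravel(), minlength=num_experts)
    return counts / routed.shape[0]

def steer_router_logits(logits, promote, suppress, bias=1e4):
    """Hypothetical steering: push chosen experts into or out of the top-k."""
    steered = logits.copy()
    steered[..., promote] += bias
    steered[..., suppress] -= bias
    return steered

num_experts = 8
# Toy top-2 routing traces for paired prompts (assumed data, one row per token).
safe_routes = np.array([[0, 1], [0, 2], [0, 3], [1, 0]])
unsafe_routes = np.array([[5, 1], [5, 2], [5, 3], [1, 5]])

# Experts whose activation frequency differs most between the two behaviors.
diff = (expert_activation_rate(safe_routes, num_experts)
        - expert_activation_rate(unsafe_routes, num_experts))
safe_expert = int(np.argmax(diff))
unsafe_expert = int(np.argmin(diff))

# Steer one token's router logits toward the safe-associated expert.
logits = np.zeros((1, num_experts))
steered = steer_router_logits(logits, [safe_expert], [unsafe_expert])
top2 = np.argsort(-steered[0])[:2]
```

In a real MoE model the routing traces would come from the router's top-k selections over many tokens, and the steering would be applied inside each MoE layer before expert selection.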
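Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Recurrent Neural Networks #Sequence Modeling
🎯 Motivation: Recurrent neural networks (RNNs) with deep test-time memorization modules are promising, but slow training and low hardware utilization have left them underexplored. Existing parallelization methods also force a speed-versus-performance trade-off through the choice of chunk size.
❓ Problem: Resolve the chunk-size conflict that prevents RNN training from being both efficient to train and accurate at inference, improving training speed and accuracy together.
🔍 Observations: Large chunks speed up training but degrade model performance, while small chunks improve results but sharply reduce training efficiency; with a single fixed chunk size the two goals conflict.
🛠️ Method: TNT is a two-stage training paradigm. Stage one uses a hierarchical memory: a global module processes large chunks for long-range context while parallel local modules handle fine-grained details, which raises hardware utilization. Stage two fine-tunes only the local memory modules at a smaller chunk size for high-accuracy inference.
📊 Data and experiments: Evaluated on Titans and TTT models, TNT trains up to 17× faster than the most accurate baseline configuration while also improving accuracy.
⭐ Contributions: TNT removes a training-efficiency bottleneck that has held back these RNNs, laying a practical foundation for more expressive RNNs and for narrowing the performance gap with Transformers.
Abstract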
Recurrent neural networks (RNNs) with deep test-time memorization modules, such as Titans and TTT, represent a promising, linearly-scaling paradigm distinct from Transformers. While these expressive models do not yet match the peak performance of state-of-the-art Transformers, their potential has been largely untapped due to prohibitively slow training and low hardware utilization.
Existing parallelization methods force a fundamental conflict governed by the chunksize hyperparameter: large chunks boost speed but degrade performance, necessitating a fixed, suboptimal compromise. To solve this challenge, we introduce TNT, a novel training paradigm that decouples training efficiency from inference performance through a two-stage process. Stage one is an efficiency-focused pre-training phase utilizing a hierarchical memory. A global module processes large, hardware-friendly chunks for long-range context, while multiple parallel local modules handle fine-grained details. Crucially, by periodically resetting local memory states, we break sequential dependencies to enable massive context parallelization. Stage two is a brief fine-tuning phase where only the local memory modules are adapted to a smaller, high-resolution chunksize, maximizing accuracy with minimal overhead. Evaluated on Titans and TTT models, TNT achieves a substantial acceleration in training speed—up to 17$\times$ faster than the most accurate baseline configuration—while simultaneously improving model accuracy. This improvement removes a critical scalability barrier, establishing a practical foundation for developing expressive RNNs and facilitating future work to close the performance gap with Transformers.
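Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Curvature #transformers
TL;DR: We propose a method to analyze and enforce the stability of training transformers.
🎯 Motivation: Training large Transformers is prone to loss spikes and divergence, which waste compute; existing stability-control methods are complex and hard to apply at scale.
❓ Problem: Provide a fast online curvature estimator and control curvature by progressively growing network depth, addressing training instability.
🔍 Observations: Training instabilities coincide with surges in preconditioned curvature, and curvature grows with network depth.
🛠️ Method: A fast online curvature estimator based on Hessian-vector products, plus an architecture warm-up scheme that gradually increases network depth to stabilize training.
📊 Data and experiments: Experiments on large Transformers show efficient curvature tracking and fewer instabilities than existing stabilization techniques, without slowing convergence.
⭐ Contributions: An efficient curvature estimator, the finding that curvature grows with depth, and architecture warm-up, which together reduce training instability in large-scale Transformer training.
Abstract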
Training billion-parameter Transformers is often brittle, with transient loss spikes and divergence that waste compute. Even though the recently developed Edge of Stability (EoS) theory provides a powerful tool to understand and control the stability of optimization methods via the (preconditioned) curvature, these curvature-controlling methods are not popular in large-scale Transformer training due to the complexity of curvature estimation. To this end, we first introduce a fast online estimator of the largest (preconditioned) Hessian eigenvalue (i.e., curvature) based on a warm-started variant of power iteration with Hessian–vector products. We show theoretically, and verify empirically, that the proposed method makes per-iteration curvature tracking feasible at billion-parameter scale while being more accurate. Using this tool, we find that training instabilities coincide with surges in preconditioned curvature and that curvature grows with depth. Motivated by these observations, we propose architecture warm-up: progressively growing network depth to carefully control the preconditioned Hessian and stabilize training. Experiments on large Transformers validate that our approach enables efficient curvature tracking and reduces instabilities compared to existing state-of-the-art stabilization techniques without slowing down convergence.
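The idea of warm-starting power iteration can be illustrated on a toy problem. The sketch below is not the paper's estimator: it uses an explicit, unpreconditioned symmetric matrix as a stand-in for the Hessian (so the Hessian-vector product is just a matrix-vector product), and the drift model, iteration counts, and all names are assumptions made for the example. It shows why reusing the previous step's eigenvector lets later steps get away with very few iterations.

```python
import numpy as np

def power_iteration(hvp, v0, num_iters):
    """Estimate the dominant Hessian eigenvalue using only Hessian-vector products."""
    v = v0 / np.linalg.norm(v0)
    for _ in range(num_iters):
        hv = hvp(v)
        v = hv / np.linalg.norm(hv)
    return float(v @ hvp(v)), v  # Rayleigh quotient and eigenvector estimate

rng = np.random.default_rng(0)
dim = 50
# Toy symmetric "Hessian" with top eigenvalue 10 and the rest in [0, 1].
q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
spectrum = np.concatenate(([10.0], rng.uniform(0.0, 1.0, dim - 1)))
base_hessian = (q * spectrum) @ q.T

# Symmetric perturbation with unit spectral norm, used to mimic slow drift.
noise = rng.standard_normal((dim, dim))
noise = (noise + noise.T) / 2
noise /= np.linalg.norm(noise, 2)

v = rng.standard_normal(dim)
estimates, true_values = [], []
for step in range(5):
    hessian_t = base_hessian + 0.05 * step * noise
    # Cold start needs many iterations; warm-started steps reuse v and need few.
    iters = 20 if step == 0 else 2
    eig, v = power_iteration(lambda u: hessian_t @ u, v, iters)
    estimates.append(eig)
    true_values.append(float(np.linalg.eigvalsh(hessian_t)[-1]))
```

In actual training the `hvp` callable would be computed by automatic differentiation rather than from an explicit matrix, and the preconditioner would be folded into the product.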
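Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Softmax Bottleneck #Transformer #Output Projection Matrix #Large Language Models
TL;DR: We show that randomly initialized or trained output projection matrices can successfully produce exact probabilities for the top m tokens for rather large values of m.
🎯 Motivation: The output projection matrix in Transformer architectures induces a softmax bottleneck that may limit which probability distributions can be expressed, and hence how accurately LLMs can match the statistics of natural language.
❓ Problem: Assess whether output projection matrices can produce accurate probabilities for the top-m tokens, and whether the softmax bottleneck significantly limits model capability.
🔍 Observations: Theory and experiments show that even a randomly initialized output projection matrix can produce accurate top-m token probabilities for rather large m.
🛠️ Method: Derive theoretical bounds on the ability of output projection matrices to specify top-m probabilities and check them empirically, for both random and trained matrices.
📊 Data and experiments: Empirical tests on random and trained matrices support the theoretical bounds.
⭐ Contributions: Evidence questioning whether the softmax bottleneck is a major limitation for LLMs fitting natural-language probabilities, together with bounds on the maximum number of probabilities that any trained output projection matrix can specify.
Abstract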
In many popular transformer architectures, an output projection matrix linearly maps lower-dimensional embeddings into a higher-dimensional space of logits. It has been shown that this leads to a softmax bottleneck that prevents the production of arbitrary probability distributions. It has been argued that this limits large language models (LLMs) in their ability to express next token probabilities that perfectly align with the statistics of natural language. We focus on the ability of such models to produce accurate probabilities for just the top-$m$ tokens. We provide theoretical bounds that show that even a randomly initialized projection matrix can successfully do this for rather large values of $m$, supported by empirical results on both random and trained matrices. This raises questions about whether the softmax bottleneck significantly limits the capabilities of LLMs. We also derive bounds on the maximum number of probabilities that any trained output projection matrix can specify.
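For readers unfamiliar with the softmax bottleneck itself, the following sketch shows the standard rank argument behind it; it does not reproduce the paper's top-$m$ bounds. With a projection matrix of shape (vocab, d), the matrix of log-probabilities over any set of contexts equals the logit matrix minus one normalizer per context, so its rank is at most d + 1. The sizes and random matrices here are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d, contexts = 200, 8, 500

W = rng.standard_normal((vocab, d))      # output projection matrix
H = rng.standard_normal((d, contexts))   # one hidden state per context (column)

logits = W @ H                           # rank at most d
# Column-wise log-softmax: subtract each context's log-normalizer.
log_probs = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))

# Subtracting a per-column constant adds at most one to the rank.
rank = np.linalg.matrix_rank(log_probs)
```

A full-rank log-probability matrix over 500 contexts would need rank 200 here, which is why arbitrary distributions are out of reach; the abstract's question is how much this matters when only the top-$m$ probabilities need to be right.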
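Foundation/Frontier Models (incl. LLMs)
Model Architecture
#MultiModal Large Language Model #Pre-Normalization
🎯 Motivation: Current multimodal large language models (MLLMs) widely adopt the Pre-Norm architecture, which produces a severe norm disparity between visual tokens and text tokens. This disparity can undermine cross-modal feature fusion, yet its dynamics have lacked thorough theoretical analysis and empirical validation.
❓ Problem: Through theoretical and empirical analysis, the paper shows that the norm disparity induces an "asymmetric update dynamic" and proposes a simple norm-alignment method to address it, aiming to improve MLLMs on both multimodal and text-only tasks.
🔍 Observations: Theory indicates that high-norm visual tokens exhibit "representational inertia": their semantics update much more slowly than those of low-norm text tokens, creating asymmetric update dynamics. Measurements across a range of mainstream MLLMs confirm that the norm disparity persists and that the resulting imbalance in update rates is widespread.
🛠️ Method: Insert a single, carefully initialized LayerNorm layer after the visual projector to align the norms of visual and text features, mitigating asymmetric updates and enabling more effective cross-modal fusion.
📊 Data and experiments: Experiments use the LLaVA-1.5 architecture and a wide suite of multimodal benchmarks. Notably, the method also yields significant gains on text-only evaluations such as MMLU, suggesting an improvement in overall capability.
⭐ Contributions: A formal theoretical analysis of the asymmetric update dynamics caused by norm disparity in Pre-Norm MLLMs, and a simple, validated norm-alignment fix that helps on multimodal tasks and also improves general language understanding.
Abstract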
Multimodal Large Language Models (MLLMs), which couple pre-trained vision encoders and language models, have shown remarkable capabilities. However, their reliance on the ubiquitous Pre-Norm architecture introduces a subtle yet critical flaw: a severe norm disparity between the high-norm visual tokens and the low-norm text tokens. In this work, we present a formal theoretical analysis demonstrating that this imbalance is not a static issue. Instead, it induces an "asymmetric update dynamic," where high-norm visual tokens exhibit a "representational inertia," causing them to transform semantically much slower than their textual counterparts. This fundamentally impairs effective cross-modal feature fusion. Our empirical validation across a range of mainstream MLLMs confirms that this theoretical dynamic (the persistence of norm disparity and the resulting asymmetric update rates) is a prevalent phenomenon. Based on this insight, we propose a remarkably simple yet effective solution: inserting a single, carefully initialized LayerNorm layer after the visual projector to enforce norm alignment. Experiments conducted on the LLaVA-1.5 architecture show that this intervention yields significant performance gains not only on a wide suite of multimodal benchmarks but also, notably, on text-only evaluations such as MMLU, suggesting that resolving the architectural imbalance leads to a more holistically capable model.
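The norm-alignment idea can be sketched as follows. The abstract does not say how the LayerNorm is "carefully initialized", so the choice below (a uniform gain set so that normalized visual tokens land at the average text-token norm) is an assumption made for illustration, as are the toy token statistics and all names.

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    """Standard LayerNorm over the last axis."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gain + bias

rng = np.random.default_rng(0)
d = 64
text_tokens = rng.standard_normal((10, d))            # low-norm text embeddings
visual_tokens = 50.0 * rng.standard_normal((20, d))   # high-norm projector outputs

# Assumed initialization: after normalization each token has norm ~ sqrt(d),
# so a uniform gain of target_norm / sqrt(d) maps it to the text-token norm.
target_norm = np.linalg.norm(text_tokens, axis=-1).mean()
gain = np.full(d, target_norm / np.sqrt(d))
bias = np.zeros(d)

aligned = layer_norm(visual_tokens, gain, bias)
aligned_norms = np.linalg.norm(aligned, axis=-1)
```

Before alignment the visual tokens in this toy setup are roughly 50 times larger in norm than the text tokens; afterwards every visual token sits at the text-token scale, and the gain and bias remain trainable as in any LayerNorm.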
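Foundation/Frontier Models (incl. LLMs)
Model Architecture
#LLM #Multilinguistic #Interpretability
🎯 Motivation: The internal mechanisms by which LLMs perform multilingual translation are not fully understood, in particular the role the attention mechanism plays.
❓ Problem: Clarify the relationship between attention and translation ability, and identify the role of specific attention heads (termed token alignment heads) in mapping source-language tokens to target-language tokens.
🔍 Observations: These heads are universal, sparse, consistent across language pairs, and causally important; ablating them disproportionately harms translation while having a varied impact on other multilingual tasks.
🛠️ Method: Systematically verify these properties across models using ablation experiments, and trace how the heads form and evolve during pre-training.
📊 Data and experiments: The properties of token alignment heads are validated on multilingual data; the heads are then used to filter translation training data, and the effect on translation ability is measured.
⭐ Contributions: Establishes the key role of token alignment heads, traces their evolution during pre-training (rapid proliferation, stabilization, eventual pruning), and shows that they can be used to filter multilingual training data to improve translation performance.
Abstract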
Recently, large language models (LLMs) have made remarkable progress, with multilingual capability emerging as a core foundational strength. However, the internal mechanisms by which these models perform translation remain incompletely understood. In this paper, we elucidate the relationship between the attention mechanism in LLMs and their translation abilities. We find that certain attention heads, which we term token alignment heads, are specifically responsible for mapping tokens from the source language to the target language during inference.
Through a systematic investigation across various models, we confirm that these token alignment heads exhibit several key characteristics: (1) Universality: They are present in all LLMs we studied. (2) Sparsity: They constitute only a small fraction of all attention heads. (3) Consistency: The set of token alignment heads activated by the model shows strong consistency across different language pairs. (4) Causality: Interventionally removing these heads leads to a sharp decline in the model's translation performance, while randomly removing non-token alignment heads has little impact on translation ability. (5) Functional Specificity: Ablating token alignment heads disproportionately harms translation but has a varied impact on other multilingual tasks. We also traced the formation of token alignment heads during pre-training, revealing an evolutionary path of rapid proliferation, stabilization, and eventual pruning. Furthermore, we leverage these token alignment heads to filter multilingual training data, and our experiments show that the filtered data can enhance the translation capabilities of the models.
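Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Scaling Laws #Mixture-of-Experts #Large Language Models
TL;DR: We conduct a large-scale empirical study, training over 300 models up to 28B parameters, to systematically investigate the relationship between MoE architectural configurations and efficiency.
🎯 Motivation: Mixture-of-Experts (MoE) architectures scale LLMs efficiently, but predicting model capacity for a given MoE configuration remains unresolved.
❓ Problem: Introduce Efficiency Leverage (EL), a metric for the computational advantage of an MoE model over a dense equivalent, and establish a unified scaling law that predicts the EL of an MoE architecture.
🔍 Observations: EL is driven mainly by the expert activation ratio and the total compute budget, both following predictable power laws; expert granularity acts as a non-linear modulator with a clear optimal range.
🛠️ Method: A large-scale empirical study training over 300 models (up to 28B parameters) to relate MoE configuration (activation ratio, granularity, compute budget) to EL, from which a unified scaling law is derived.
📊 Data and experiments: To validate the scaling law, MoE-mini (0.85B active parameters) and a 6.1B dense model were trained on the same 1T-token high-quality dataset; MoE-mini matched the dense model while using over 7× fewer computational resources.
⭐ Contributions: A new metric (EL), an analysis of the key factors behind MoE efficiency, and a validated unified scaling law, giving a principled, empirically grounded basis for scaling efficient MoE models.
Abstract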
Mixture-of-Experts (MoE) has become a dominant architecture for scaling Large Language Models (LLMs) efficiently by decoupling total parameters from computational cost. However, this decoupling creates a critical challenge: predicting the model capacity of a given MoE configuration (e.g., expert activation ratio and granularity) remains an unresolved problem. To address this gap, we introduce Efficiency Leverage (EL), a metric quantifying the computational advantage of an MoE model over a dense equivalent. We conduct a large-scale empirical study, training over 300 models up to 28B parameters, to systematically investigate the relationship between MoE architectural configurations and EL. Our findings reveal that EL is primarily driven by the expert activation ratio and the total compute budget, both following predictable power laws, while expert granularity acts as a non-linear modulator with a clear optimal range. We integrate these discoveries into a unified scaling law that accurately predicts the EL of an MoE architecture based on its configuration.
To validate our derived scaling laws, we designed and trained MoE-mini, a model with only 0.85B active parameters, alongside a 6.1B dense model for comparison. When trained on an identical 1T high-quality token dataset, MoE-mini matched the performance of the 6.1B dense model while consuming over 7x fewer computational resources, thereby confirming the accuracy of our scaling laws.
This work provides a principled and empirically-grounded foundation for the scaling of efficient MoE models.
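The "over 7x" figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes the common rule of thumb that training compute is roughly 6 × (active parameters) × (tokens); this approximation is not stated in the abstract and ignores routing overhead and attention cost, so it is only a rough consistency check.

```python
def training_flops(active_params, tokens):
    """Rough training cost using the common C ~ 6 * N * D approximation (an assumption)."""
    return 6.0 * active_params * tokens

tokens = 1e12                                   # identical 1T-token dataset
dense_flops = training_flops(6.1e9, tokens)     # 6.1B dense model
moe_flops = training_flops(0.85e9, tokens)      # MoE-mini, 0.85B active parameters

# Ratio of dense to MoE compute at matched performance.
leverage = dense_flops / moe_flops
```

Under this approximation the token count cancels and the ratio reduces to 6.1 / 0.85, a little above 7, which agrees with the reported saving.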
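Foundation/Frontier Models (incl. LLMs)
Model Architecture
#in-context learning #markov chain #transformers #mirror descent #mixture models #latent variables
🎯 Motivation: Sequence modelling requires determining which past tokens in the context are causally relevant and how important they are. Attention layers in transformers perform this process, but the mechanisms they learn remain poorly understood.
❓ Problem: Introduce a framework based on Mixture of Transition Distributions, in which a latent variable determines how past tokens influence the next one, and study how transformers learn the unobserved mixture weights in-context.
🔍 Observations: Theory and experiments show that transformers can implement Mirror Descent to learn these mixture weights, which reflects how they estimate the importance of context tokens.
🛠️ Method: An explicit three-layer transformer construction that exactly implements one step of Mirror Descent, with a proof that the resulting estimator is a first-order approximation of the Bayes-optimal predictor.
📊 Data and experiments: Transformers trained from scratch have predictive distributions, attention patterns, and learned transition matrices that closely match the construction; deeper models reach performance comparable to multi-step Mirror Descent.
⭐ Contributions: Identifies a mechanism by which transformers learn mixture weights in-context, backs it with theory, and shows empirically that gradient descent finds solutions consistent with it.
Abstract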
Sequence modelling requires determining which past tokens are causally relevant from the context and their importance: a process inherent to the attention layers in transformers, yet whose underlying learned mechanisms remain poorly understood. In this work, we formalize the task of estimating token importance as an in-context learning problem by introducing a framework based on Mixture of Transition Distributions, where a latent variable determines the influence of past tokens on the next. The distribution over this latent variable is parameterized by unobserved mixture weights that transformers must learn in-context. We demonstrate that transformers can implement Mirror Descent to learn these weights from the context. Specifically, we give an explicit construction of a three-layer transformer that exactly implements one step of Mirror Descent and prove that the resulting estimator is a first-order approximation of the Bayes-optimal predictor. Corroborating our construction and its learnability via gradient descent, we empirically show that transformers trained from scratch learn solutions consistent with our theory: their predictive distributions, attention patterns, and learned transition matrix closely match the construction, while deeper models achieve performance comparable to multi-step Mirror Descent.
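To make "one step of Mirror Descent on the mixture weights" concrete, here is a minimal sketch using the entropic mirror map (the exponentiated-gradient update), which keeps the weights on the probability simplex. It is a generic illustration, not the paper's transformer construction: the transition matrix, the toy sequence, the step size, and the function names are all assumptions.

```python
import numpy as np

def mtd_log_likelihood_grad(seq, P, weights):
    """Gradient of the mean log-likelihood under a Mixture of Transition Distributions.

    The model predicts p(x_t) = sum_k weights[k] * P[x_{t-k-1}, x_t].
    """
    K = len(weights)
    grad = np.zeros(K)
    for t in range(K, len(seq)):
        # Likelihood of the observed token under each lag's transition.
        per_lag = np.array([P[seq[t - k - 1], seq[t]] for k in range(K)])
        grad += per_lag / (weights @ per_lag)
    return grad / (len(seq) - K)

def mirror_descent_step(weights, grad, step_size):
    """Entropic mirror step (exponentiated gradient), renormalized onto the simplex."""
    updated = weights * np.exp(step_size * grad)
    return updated / updated.sum()

# Transition matrix that strongly prefers moving to the next state (rows sum to 1).
P = np.full((3, 3), 0.05)
for i in range(3):
    P[i, (i + 1) % 3] = 0.9

# Toy sequence in which x_t = (x_{t-2} + 1) mod 3, so lag 2 is the relevant one.
seq = [(t // 2) % 3 for t in range(12)]

weights = np.array([0.5, 0.5])  # uniform start over lag 1 and lag 2
grad = mtd_log_likelihood_grad(seq, P, weights)
new_weights = mirror_descent_step(weights, grad, step_size=1.0)
```

Starting from uniform weights, a single step already shifts mass toward the lag-2 component, because the lag-1 transition explains only half of the observed tokens well.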
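Foundation/Frontier Models (incl. LLMs)
Model Architecture
#boolean analysis #simplicity bias #transformer #feature noise
🎯 Motivation: Study how noisy features affect a Transformer's ability to learn a target function, i.e., its behavior in noise-robust learning.
❓ Problem: Determine whether Transformers trained on data with noisy features correctly predict labels for noiseless features, and where their ability to learn boolean functions breaks down.
🔍 Observations: Transformers succeed on a selection of k-sparse parity and majority functions, but typically fail at noise-robust learning of random k-juntas, especially when the boolean sensitivity of the optimal solution is lower than that of the target function.
🛠️ Method: Exploit the Transformer's simplicity bias to trap it in an incorrect solution, then test whether an additional loss term penalizing high-sensitivity solutions lets it escape.
📊 Data and experiments: Boolean-function datasets are used to compare Transformers with LSTMs under feature noise.
⭐ Contributions: Shows that the Transformer's simplicity bias can cause it to fail at learning boolean functions under feature noise, and that an additional sensitivity-based loss term can help it escape the resulting incorrect solution.
Abstract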
Noise is ubiquitous in data used to train large language models, but it is not well understood whether these models are able to correctly generalize to inputs generated without noise. Here, we study noise-robust learning: are transformers trained on data with noisy features able to find a target function that correctly predicts labels for noiseless features? We show that transformers succeed at noise-robust learning for a selection of $k$-sparse parity and majority functions, compared to LSTMs which fail at this task for even modest feature noise. However, we find that transformers typically fail at noise-robust learning of random $k$-juntas, especially when the boolean sensitivity of the optimal solution is smaller than that of the target function. We argue that this failure is due to a combination of two factors: transformers' bias toward simpler functions, combined with an observation that the empirically optimal function for noise-robust learning has lower sensitivity than the target function. We test this hypothesis by exploiting transformers' simplicity bias to trap them in an incorrect solution, but show that transformers can escape this trap by training with an additional loss term penalizing high-sensitivity solutions. Overall, we find that transformers are particularly ineffective for learning boolean functions in the presence of feature noise.
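Since the argument turns on boolean sensitivity, a short worked example may help. The sketch below computes average sensitivity by brute force (the mean, over all inputs, of the number of single-bit flips that change the output) for a 3-sparse parity and a 3-sparse majority on 5 input bits. The particular functions and input size are illustrative choices, not taken from the paper.

```python
from itertools import product

def average_sensitivity(f, n):
    """Mean number of single-bit flips that change f, over all 2**n inputs."""
    total = 0
    for bits in product((0, 1), repeat=n):
        y = f(bits)
        for i in range(n):
            flipped = bits[:i] + (1 - bits[i],) + bits[i + 1:]
            total += f(flipped) != y
    return total / 2 ** n

def sparse_parity(bits):
    """Parity of the first three bits; the remaining bits are irrelevant."""
    return (bits[0] + bits[1] + bits[2]) % 2

def sparse_majority(bits):
    """Majority vote of the first three bits."""
    return int(bits[0] + bits[1] + bits[2] >= 2)

parity_sens = average_sensitivity(sparse_parity, 5)      # 3.0
majority_sens = average_sensitivity(sparse_majority, 5)  # 1.5
```

Every relevant bit always flips a parity, so its average sensitivity equals the number of relevant bits, whereas majority only changes near the decision boundary. A learner biased toward low sensitivity therefore treats these two function families very differently.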
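Foundation/Frontier Models (incl. LLMs)
Model Architecture
#memory network #moe #pretrain #long context
TL;DR: Previous Memory-layer attempts have only matched the performance of 2-expert MoE models, falling significantly short of state-of-the-art 8-expert configurations. We present UltraMemV2 to close this performance gap.
🎯 Motivation: Existing memory-layer architectures only match 2-expert MoE models and fall well short of 8-expert MoE configurations, while MoE models suffer from high memory access costs.
❓ Problem: Propose UltraMemV2, a redesigned memory-layer architecture that closes this performance gap while keeping memory access low and improving performance on long-context tasks.
🔍 Observations: Memory-layer architectures need very few memory accesses, but earlier designs underperform state-of-the-art MoE models; experiments indicate that activation density matters more for performance than total sparse parameter count.
🛠️ Method: Five improvements to the memory-layer design: integrating memory layers into every Transformer block, simplifying value expansion with single linear projections, adopting FFN-based value processing from PEER, principled parameter initialization, and rebalancing the memory-to-FFN computation ratio.
📊 Data and experiments: Validated at scale with models of up to 2.5B activated parameters out of 120B total. UltraMemV2 reaches parity with 8-expert MoE models and improves memory-intensive tasks: +1.6 points on long-context memorization, +6.2 on multi-round memorization, and +7.9 on in-context learning.
⭐ Contributions: Brings memory-layer architectures to performance parity with 8-expert MoE models at low memory access cost, offering an alternative for efficient sparse computation.
Abstract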
While Mixture of Experts (MoE) models achieve remarkable efficiency by activating only subsets of parameters, they suffer from high memory access costs during inference. Memory-layer architectures offer an appealing alternative with very few memory accesses, but previous attempts like UltraMem have only matched the performance of 2-expert MoE models, falling significantly short of state-of-the-art 8-expert configurations. We present UltraMemV2, a redesigned memory-layer architecture that closes this performance gap. Our approach introduces five key improvements: integrating memory layers into every transformer block, simplifying value expansion with single linear projections, adopting FFN-based value processing from PEER, implementing principled parameter initialization, and rebalancing memory-to-FFN computation ratios. Through extensive evaluation, we demonstrate that UltraMemV2 achieves performance parity with 8-expert MoE models under the same computation and parameter budget but with significantly lower memory access. Notably, UltraMemV2 shows superior performance on memory-intensive tasks, with improvements of +1.6 points on long-context memorization, +6.2 points on multi-round memorization, and +7.9 points on in-context learning. We validate our approach at scale with models up to 2.5B activated parameters from 120B total parameters, and establish that activation density has greater impact on performance than total sparse parameter count. Our work brings memory-layer architectures to performance parity with state-of-the-art MoE models, presenting a compelling alternative for efficient sparse computation.
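Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Probabilistic embeddings #Embedding convolution #Uncertainty-aware similarity
🎯 Motivation: Embedding models perform inconsistently across tasks and domains, and no single model dominates. Existing ensemble methods ignore model-specific uncertainty, which limits robustness and reliability downstream.
❓ Problem: Propose an ensemble method that captures and uses embedding uncertainty to improve both performance and robustness.
🔍 Observations: Most existing ensemble methods operate on deterministic embeddings and neither quantify nor exploit model uncertainty, which limits ensemble quality.
🛠️ Method: Uncertainty-driven Embedding Convolution (UEC) converts deterministic embeddings into probabilistic ones post hoc, computes adaptive ensemble coefficients from the embedding uncertainty, and scores similarity with an uncertainty-aware function that serves as a theoretically grounded, efficient surrogate for distributional distances.
📊 Data and experiments: Experiments on diverse benchmarks show that UEC consistently improves performance and robustness.
⭐ Contributions: A new uncertainty-aware method for ensembling embeddings, with theoretical grounding and practical gains in performance and reliability.
Abstract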
Text embeddings are essential components in modern NLP pipelines. Although numerous embedding models have been proposed, no single model consistently dominates across domains and tasks. This variability motivates the use of ensemble techniques to combine complementary strengths. However, most existing ensemble methods operate on deterministic embeddings and fail to account for model-specific uncertainty, limiting their robustness and reliability in downstream applications. To address these limitations, we propose Uncertainty-driven Embedding Convolution (UEC). UEC first transforms deterministic embeddings into probabilistic ones in a post-hoc manner. It then computes adaptive ensemble coefficients based on embedding uncertainty, derived from a principled surrogate-loss formulation. Additionally, UEC employs an uncertainty-aware similarity function that directly incorporates uncertainty into the similarity scoring, providing a theoretically grounded and efficient surrogate to distributional distances. Extensive experiments on diverse benchmarks demonstrate that UEC consistently improves both performance and robustness by leveraging principled uncertainty modeling.
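The abstract does not give UEC's coefficient formula, so the sketch below substitutes a well-known stand-in, inverse-variance weighting of independent Gaussian embeddings, purely to illustrate what uncertainty-dependent ensemble coefficients look like. The isotropic-variance assumption, the function name, and the toy numbers are all hypothetical.

```python
import numpy as np

def combine_gaussian_embeddings(means, variances):
    """Inverse-variance weighted combination of independent Gaussian embeddings.

    means: (M, d) array, one mean embedding per model.
    variances: (M,) array, one isotropic variance per model (an assumption).
    """
    precisions = 1.0 / variances
    coeffs = precisions / precisions.sum()
    mean = coeffs @ means
    # Variance of a weighted sum of independent Gaussians.
    variance = float(np.sum(coeffs ** 2 * variances))
    return coeffs, mean, variance

means = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
variances = np.array([0.1, 0.4])   # the first model is more confident

coeffs, mean, variance = combine_gaussian_embeddings(means, variances)
```

Here the more confident model receives the larger coefficient (0.8 versus 0.2), and the combined variance is lower than either model's own, which is the qualitative behavior an uncertainty-aware ensemble aims for.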
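Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Mixture of Experts #Large Language Models #Foundation Model
TL;DR: We rethink the MoE with Nadaraya-Watson Kernel and propose KERN router to replace Softmax router to achieve better performance.
🎯 Motivation: MoE models have long used Softmax as the router score function. The choice has become standard practice but lacks a principled justification.
❓ Problem: Investigate whether a router function other than Softmax can improve the performance of MoE models and LLMs.
🔍 Observations: MoE shares the same mathematical formulation as Nadaraya-Watson regression; both FFNs and MoE can be viewed as special cases of it, with the kernel function corresponding to the input neurons of the output layer.
🛠️ Method: Kernel Inspired Router with Normalization (KERN), an FFN-style router function with ReLU activation and ℓ2 normalization that adds no extra cost.
📊 Data and experiments: Comprehensive experiments on MoE models and LLMs validate the effectiveness of the KERN router function.
⭐ Contributions: Revisits MoE routing through Nadaraya-Watson regression, proposes the zero-additional-cost KERN router, and validates it experimentally as an alternative to the standard Softmax router.
Abstract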
Mixture-of-Experts (MoE) has become a cornerstone in recent state-of-the-art large language models (LLMs). Traditionally, MoE relies on $\mathrm{Softmax}$ as the router score function to aggregate expert output, a design choice that has persisted from the earliest MoE models to modern LLMs, and is now widely regarded as standard practice. However, the necessity of using $\mathrm{Softmax}$ to project router weights into a probability simplex remains an unchallenged assumption rather than a principled design choice. In this work, we first revisit the classical Nadaraya–Watson regression and observe that MoE shares the same mathematical formulation as Nadaraya–Watson regression. Furthermore, we show that both feed-forward neural network (FFN) and Mixture-of-Experts (MoE) can be interpreted as special cases of Nadaraya–Watson regression, where the kernel function corresponds to the input neurons of the output layer. Motivated by these insights, we propose the **zero-additional-cost** Kernel Inspired Router with Normalization ($\mathrm{KERN}$), an FFN-style router function, as an alternative to $\mathrm{Softmax}$. We demonstrate that this router generalizes both $\mathrm{Sigmoid}$- and $\mathrm{Softmax}$-based routers. **Based on empirical observations and established practices in FFN implementation, we recommend the use of $\mathrm{ReLU}$ activation and $\ell_2$-normalization in $\mathrm{KERN}$ router function.** Comprehensive experiments in MoE and LLM validate the effectiveness of the proposed FFN-style router function $\mathrm{KERN}$.
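The contrast between a Softmax router and a router built from ReLU plus $\ell_2$ normalization can be sketched as below. The abstract names the ingredients but not the exact formula, so the order of operations, the `eps` guard, and the function names are assumptions; this is an illustration of the general shape, not the paper's $\mathrm{KERN}$ definition.

```python
import numpy as np

def softmax_router(x, W):
    """Standard router: gate weights lie on the probability simplex."""
    logits = x @ W
    logits = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

def relu_l2_router(x, W, eps=1e-6):
    """Assumed FFN-style router: ReLU scores scaled to unit l2 norm."""
    scores = np.maximum(x @ W, 0.0)
    norm = np.linalg.norm(scores, axis=-1, keepdims=True)
    return scores / np.maximum(norm, eps)

x = np.array([[1.0]])                 # one token with a 1-d hidden state (toy)
W = np.array([[1.0, -1.0, 2.0]])      # router weights for three experts

soft_gates = softmax_router(x, W)
kern_gates = relu_l2_router(x, W)
```

In this toy case the Softmax gates are all strictly positive and sum to one, while the ReLU-based gates assign exactly zero to the expert with a negative score and have unit $\ell_2$ norm rather than unit sum.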
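Foundation/Frontier Models (incl. LLMs)
Model Architecture
#MoE #LLM #compression #attention
TL;DR: In this work, we present the first identification and systematic study of a distinct subset of experts, termed Super Experts. We analyze their characteristics, distributions, and critical functional roles within MoE LLMs.
🎯 Motivation: Recent work uses expert-level compression to make Mixture-of-Experts LLMs (MoE LLMs) more efficient, but lacks a deeper understanding of the heterogeneous importance of experts and of the models' inner workings.
❓ Problem: Provide the first systematic study of a subset of experts that play a pivotal role in the model's forward inference, termed Super Experts (SEs).
🔍 Observations: SEs show rare but extreme activation outliers in the down_proj output. Their distribution is model-specific, data-agnostic, and unaffected by post-training. Pruning a handful of SEs substantially degrades performance, particularly in mathematical reasoning.
🛠️ Method: Analyze the distribution and activation patterns of SEs and the effect of pruning them on the attention mechanism, and build an automated tool for rapid and accurate SE profiling.
📊 Data and experiments: Experiments on open-source MoE LLMs (e.g., Qwen3-30B-A3B) show substantial performance drops after pruning SEs across a variety of tasks, with the largest impact on mathematical reasoning.
⭐ Contributions: Introduces the Super Expert concept, shows that SEs are the primary source of the systematic outlier mechanism in MoE LLMs, advances the understanding of MoE LLM internal dynamics, and provides an automated SE profiling tool with code.
Abstract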
Leveraging the intrinsic importance differences among experts, recent research has explored expert-level compression techniques to enhance the efficiency of Mixture-of-Experts (MoE) large language models (LLMs).
However, existing approaches often rely on empirical heuristics to identify critical experts, while lacking a deeper understanding of the heterogeneous importance of experts and the inner workings of MoE LLMs.
In this study, we report, for the first time, the discovery and systematic investigation of a distinct subset of experts that play a pivotal role in the model's forward inference.
These experts are prevalent in open-source MoE LLMs, and despite their extremely limited number, pruning them results in a substantial decline in model performance (e.g., pruning just three out of 6,144 experts causes Qwen3-30B-A3B to generate repetitive and uninformative outputs).
We refer to these experts as Super Experts (SEs).
Our comprehensive analysis provides progressively deeper insights into SEs:
(i) SEs are characterized by rare but extreme activation outliers in the output of the down_proj, which give rise to massive activations in the hidden states between decoder layers.
Moreover, the distribution of SEs is model-specific, data-agnostic, and remains unaffected by post-training processes.
(ii) By pruning SEs, we assess their significance across a variety of tasks, revealing their considerable impact on the model's overall performance, particularly in mathematical reasoning.
(iii) We further investigate why compressing SEs exerts such a pronounced impact.
We show that, in MoE LLMs, SEs serve as the primary source of the systematic outlier mechanism in Transformers, and that compressing them profoundly disrupts this process, ultimately causing the collapse of attention sinks.
These findings advance the understanding of the internal dynamics of MoE LLMs, filling an important gap in the current knowledge.
In addition, we developed an automated tool for rapid and accurate SE profiling.
The code is provided in the supplementary materials.
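Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Large Language Models #Persona Modeling
🎯 Motivation: Humans shift between personas depending on social context, and LLMs show a similar flexibility. Existing approaches typically model personas through external knowledge or parameter updates, without asking what the model already contains. This work asks whether LLMs already embed persona knowledge that can be used without external intervention.
❓ Problem: Do LLMs need external context or parameter updates to adapt to different personas, or is the relevant information already embedded in their parameter space?
🔍 Observations: Using small calibration datasets, the study finds distinct activation signatures for different personas, and can locate opposing subnetworks for binary-opposing personas such as introvert versus extrovert.
🛠️ Method: A masking strategy guided by activation statistics isolates lightweight persona subnetworks; a contrastive pruning strategy sharpens the separation in binary-opposition scenarios. The method is training-free and relies only on the existing parameter space.
📊 Data and experiments: Across diverse evaluation settings, the resulting subnetworks show significantly stronger persona alignment than baselines that require external knowledge, while being more efficient.
⭐ Contributions: Shows that diverse human-like behaviors are already embedded in the LLM parameter space, and proposes a controllable, interpretable persona-modeling strategy that needs no external knowledge, suggesting a new direction for personalized language models.
Abstract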
Humans shift between different personas depending on social context. Large Language Models (LLMs) demonstrate a similar flexibility in adopting different personas and behaviors. Existing approaches, however, typically adapt such behavior through external knowledge such as prompting, retrieval-augmented generation (RAG), or fine-tuning.
We ask: do LLMs really need external context or parameters to adapt to different behaviors, or do they already have such knowledge embedded in their parameters?
In this work, we show that LLMs already contain persona-specialized subnetworks in their parameter space. Using small calibration datasets, we identify distinct activation signatures associated with different personas. Guided by these statistics, we develop a masking strategy that isolates lightweight persona subnetworks. Building on the findings, we further discuss: how can we discover opposing subnetworks from the model that lead to binary-opposing personas, such as introvert-extrovert?
To further enhance separation in binary opposition scenarios, we introduce a contrastive pruning strategy that identifies parameters responsible for the statistical divergence between opposing personas. Our method is entirely training-free and relies solely on the language model's existing parameter space. Across diverse evaluation settings, the resulting subnetworks exhibit significantly stronger persona alignment than baselines that require external knowledge while being more efficient. Our findings suggest that diverse human-like behaviors are not merely induced in LLMs, but are already embedded in their parameter space—pointing toward a new perspective on controllable and interpretable personalization in large language models. Our code is available at https://github.com/Ruimeng-Ye/Persona.git.
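Foundation/Frontier Models (incl. LLMs)
Model Architecture
#Large Language Models #Attention Mechanisms #Training-free Methods #Inference-time Optimization #Model Interpretability #Unsupervised Learning #Attention Sink
TL;DR: We introduce ZeroTuning, a training-free method that enhances LLM performance by tuning attention to the initial token, a simple yet powerful and universal control point.
🎯 Motivation: Existing training-free methods rely on auxiliary heuristics to identify important task-specific tokens, which can introduce bias and limits applicability in some settings.
❓ Problem: Propose ZeroTuning, which improves LLM performance without training by adjusting attention to the initial token, simplifying existing approaches.
🔍 Observations: In theory, adding lightweight biases to the initial token's attention logits systematically shifts and reshapes downstream attention patterns, an effect amplified by the token's role as an attention sink. Empirically, effects are stronger in early layers, and attention heads differ in their preferred scaling.
🛠️ Method: Two ZeroTuning variants: a supervised mode calibrated on validation examples and an unsupervised mode that minimizes output entropy. No parameter updates are needed, and the method is kernel-agnostic (it works with SDPA and FlashAttention).
📊 Data and experiments: Evaluated on 15 datasets. On Llama-3.1-8B, relative improvements are 19.9% on classification, 4.5% on question answering, and 2.1% on dialogue; gains persist with quantized inference and as context length increases.
⭐ Contributions: ZeroTuning improves performance with no parameter updates and no KV-cache or decoding changes; it needs only four lines of modification to the attention code, providing a lightweight tool for inference-time improvement and interpretability research.
Abstract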
Token-level attention tuning -- a class of training-free methods including Post-hoc Attention Steering (PASTA) and Attention Calibration (ACT) -- has emerged as a promising approach for improving frozen LLMs via interpretable interventions. However, these methods rely on auxiliary heuristics to identify important task-specific tokens, which can introduce bias and limit applicability when token importance is ambiguous or when optimized kernels make attention maps inaccessible. We propose a simpler alternative: intervening only on the initial token (e.g., <BOS> in LLaMA). We theoretically show that adding lightweight biases to this token’s attention logits systematically shifts and reshapes downstream attention patterns -- an effect amplified by its natural role as an attention sink. Empirically, we find that this tuning can improve LLM performance and better elicit pretrained knowledge, with stronger effects in early layers and distinct scaling preferences across attention heads. Building on these findings, we introduce ZeroTuning, a training-free method that improves LLM performance by applying head-specific attention adjustments to the initial token, requiring no parameter updates. We present two variants: a supervised mode that calibrates on validation examples, and an unsupervised mode that directly minimizes output entropy. ZeroTuning requires no KV-cache or decoding changes and is kernel-agnostic (works with SDPA and FlashAttention). It requires only four lines of modification to standard \texttt{LlamaAttention} code, achieves gains across 15 datasets, and outperforms prior, more complex methods. For example, on Llama-3.1-8B, it yields relative improvements of 19.9% on classification, 4.5% on question answering, and 2.1% on dialogue. ZeroTuning also works out of the box with quantized inference and maintains its improvements as context length increases. 
Our work provides a lightweight tool for inference-time improvement, advancing both optimization and interpretability. Our code and runnable demo are available at https://anonymous.4open.science/r/ZeroTuning.
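The link between "adding a bias to the initial token's attention logit" and "rescaling its attention weight" can be checked numerically. The sketch below is a generic identity about softmax, not the ZeroTuning code: multiplying the attention on position 0 by a factor gamma and renormalizing gives the same result as adding log(gamma) to that position's logit. The single-head toy logits, the absence of a causal mask, and the names are simplifications assumed for the example.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scale_initial_token(attn, gamma):
    """Multiply attention to position 0 by gamma, then renormalize each row."""
    scaled = attn.copy()
    scaled[..., 0] *= gamma
    return scaled / scaled.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.standard_normal((4, 6))   # 4 queries x 6 keys for one head (toy)
gamma = 0.5                            # hypothetical per-head scaling factor

baseline = softmax(logits)
via_weights = scale_initial_token(baseline, gamma)

# Equivalent intervention at the logit level, before the softmax.
biased = logits.copy()
biased[..., 0] += np.log(gamma)
via_logits = softmax(biased)
```

Because the logit-level form needs no access to the attention map, it is compatible with fused attention kernels, which is consistent with the kernel-agnostic claim in the abstract.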
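LLM Pre-training (69 papers)
Foundation/Frontier Models (incl. LLMs)
LLM Pre-training
#Continual Pretrain #Large Language Models #Parameter-Efficient Training
TL;DR: A novel continual pretraining framework that combines adaptive model expansion with dynamic parameter updating to efficiently adapt large language models to new domains while preserving existing knowledge.
🎯 Motivation: Conventional continual pretraining of LLMs often suffers from catastrophic forgetting and limited domain capacity; the uniform layer expansion used by existing methods struggles to balance general and domain learning.
❓ Problem: Use the functional differences among layers and units to design selective expansion and decoupled tuning, so that domain adaptation injects new knowledge efficiently while retaining general capabilities.
🔍 Observations: Layers and units in LLMs are functionally specialized, and some are critical to general capabilities, suggesting that parameter expansion and optimization should be function-aware.
🛠️ Method: ADEPT is a two-stage framework: it first duplicates the layers least critical for the general domain to add representational capacity, then applies unit-wise decoupled tuning that assigns asymmetric learning rates to balance knowledge injection and retention.
📊 Data and experiments: On mathematical and medical domains, ADEPT outperforms full-parameter continual pretraining by up to 5.76% on general benchmarks and 5.58% on target-domain benchmarks, with only 15% of parameters tuned and less than 50% of the training time.
⭐ Contributions: An efficient and robust method for domain-adaptive continual pretraining, with ablations and analysis supporting the necessity of targeted expansion and decoupled optimization.
Abstract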
Conventional continual pretraining (CPT) for large language model (LLM) domain adaptation often suffers from catastrophic forgetting and limited domain capacity. Existing strategies adopt layer expansion, introducing additional trainable parameters to accommodate new knowledge. However, the uniform expansion and updates still entangle general and domain learning, undermining its effectiveness. Our pilot studies reveal that LLMs exhibit functional specialization, where layers and units differentially encode general-critical capabilities, suggesting that parameter expansion and optimization should be function-aware. We then propose ADEPT, Adaptive Expansion and Dynamic Decoupled Tuning for continual pretraining, a two-stage framework for domain-adaptive CPT. ADEPT first performs General-Competence Guided Selective Layer Expansion, duplicating layers least critical for the general domain to increase representational capacity while minimizing interference with general knowledge. It then applies Adaptive Unit-Wise Decoupled Tuning, disentangling parameter units within expanded layers according to their general-domain importance and assigning asymmetric learning rates to balance knowledge injection and retention. Experiments on mathematical and medical domains show that ADEPT outperforms full-parameter CPT by up to 5.76% on the general benchmarks and 5.58% on the target domain benchmarks with only 15% of parameters tuned and less than 50% training time. Ablation studies, theoretical analysis, and extended investigations further demonstrate the necessity of targeted expansion and decoupled optimization, providing new principles for efficient and robust domain-adaptive CPT. Our code is open-sourced at https://github.com/PuppyKnightUniversity/ADEPT
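Foundation / Frontier Models (incl. LLMs)
LLM Pretraining
#Knowledge-based QA #Memory of LLMs
🎯 Motivation: LLMs underperform on question answering, often producing wrong answers because of hallucination and uncertainty, yet their parameters may hold correct knowledge that goes unused.
❓ Problem: Probe the latent knowledge of LLMs, measure the correct information implicit in their memory, and improve prompt and decoding design for knowledge QA.
🔍 Analysis: Even when an LLM outputs a wrong or "unsure" answer, the correct answer often appears among its high-probability candidates.
🛠️ Method: The Hits@k metric, which evaluates latent knowledge retention independently of the answer surface form, together with a systematic analysis of how generation strategies suppress correct answers.
📊 Data & experiments: Quantitative experiments validate Hits@k and measure how few-shot prompting strategies that allow "unsure" outputs affect answer quality on knowledge QA.
⭐ Contributions: Shows that LLMs store more factual knowledge than standard QA accuracy reflects, and offers Hits@k plus practical guidance for prompt and decoding design in knowledge-intensive tasks.
Abstract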
Large language models (LLMs) have shown promise as parametric knowledge bases, but often underperform on question answering (QA) tasks due to hallucinations and uncertainty. While prior work attributes these failures to knowledge gaps in the model’s parameters, we uncover a complementary phenomenon: LLMs frequently retain correct knowledge even when generating incorrect or ``unsure'' answers.
By analyzing the token-level output distributions, we find that correct answers often appear among high-probability candidates, despite not being selected. Motivated by this, we propose Hits@k, a novel metric to evaluate latent knowledge retention independent of answer surface form. Our experiments reveal that LLMs possess significantly more factual knowledge than is reflected by standard QA accuracy.
Building on these insights, we further examine the prevailing few-shot QA paradigm. We find that prompting strategies which allow ``unsure'' outputs can inadvertently suppress correct answers by discouraging low-confidence generation. We design a set of quantitative experiments to measure this suppression effect, offering practical guidance for future prompt and decoding design in knowledge-intensive tasks.
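The abstract does not spell out how Hits@k is computed, so the following is only a sketch under an assumed definition: each question comes with candidate answers ranked by model probability, and a question counts as a hit when the gold answer matches one of the top k candidates. The case-insensitive matching rule and the sample data are hypothetical.

```python
def hits_at_k(ranked_candidates, gold_answers, k):
    """Fraction of questions whose gold answer appears among the top-k candidates."""
    hits = 0
    for candidates, gold in zip(ranked_candidates, gold_answers):
        top_k = [c.strip().lower() for c in candidates[:k]]
        if gold.strip().lower() in top_k:
            hits += 1
    return hits / len(gold_answers)

# Candidates are listed from most to least probable (made-up data).
ranked = [
    ["unsure", "Paris", "Lyon"],
    ["Berlin", "Bonn", "Munich"],
    ["unsure", "Oslo", "Bergen"],
]
gold = ["Paris", "Berlin", "Stockholm"]

top1 = hits_at_k(ranked, gold, k=1)
top2 = hits_at_k(ranked, gold, k=2)
```

In this toy data the first question is answered "unsure" at rank 1 although the correct answer sits at rank 2, which is the kind of suppressed knowledge the metric is meant to expose.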
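Foundation / Frontier Models (incl. LLMs)
LLM Pretraining
#Large language model; Knowledge augmentation; Knowledge graph;
TL;DR: This paper proposes AtlasKV, a scalable, effective, and general way to augment LLMs with billion-scale KGs in less than 20GB GPU VRAM, where KG2KV and HiKVP are introduced to integrate KG triples at scale with sub-linear time and memory complexity.
🎯 Motivation: Retrieval-augmented generation incurs high latency for large-scale knowledge integration because it depends on external retrieval modules and long retrieved contexts.
❓ Problem: Provide a parametric knowledge integration method that handles the memory cost and inference latency of very large knowledge graphs without retraining or external retrievers.
🔍 Analysis: Expensive searches and much longer contexts limit how practical existing methods are for knowledge augmentation at scale.
🛠️ Method: AtlasKV, which uses the KG2KV and HiKVP modules to integrate KG triples into LLMs with sub-linear time and memory complexity.
📊 Data & experiments: Experiments at the scale of 1B triples show strong knowledge grounding and generalization with less than 20GB of VRAM.
⭐ Contributions: Integrates billion-scale knowledge graphs into LLMs under a small GPU memory budget, with no external retriever, no long context prior, and no retraining when adapting to new knowledge.
Abstract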
Retrieval-augmented generation (RAG) has shown some success in augmenting large language models (LLMs) with external knowledge. However, as a non-parametric knowledge integration paradigm for LLMs, RAG methods heavily rely on external retrieval modules and the retrieved textual context prior. Especially for very large scale knowledge augmentation, they would introduce substantial inference latency due to expensive searches and much longer relevant context. In this paper, we propose a parametric knowledge integration method, called $\textbf{AtlasKV}$, a scalable, effective, and general way to augment LLMs with billion-scale knowledge graphs (KGs) (e.g. 1B triples) using very little GPU memory cost (e.g. less than 20GB VRAM). In AtlasKV, we introduce KG2KV and HiKVP to integrate KG triples into LLMs at scale with sub-linear time and memory complexity. It maintains strong knowledge grounding and generalization performance using the LLMs' inherent attention mechanism, and requires no external retrievers, long context priors, or retraining when adapting to new knowledge.
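Foundation / Frontier Models (incl. LLMs)
LLM Pretraining
#Teacher Forcing #Multi-Token Prediction #Pretraining #Large Language Models
TL;DR: We propose future summary pretraining to better capture long-range dependencies and enhance reasoning capabilities of LLMs
🎯 Motivation: Next-token prediction limits LLMs on long-horizon reasoning, planning, and creative writing, so a different pretraining objective is needed.
❓ Problem: Design a pretraining objective that captures long-range dependencies by predicting a summary of the future, addressing the shortcomings of multi-token prediction.
🔍 Analysis: Next-token and multi-token prediction mostly capture short-range dependencies and are weak at capturing long-term information, which limits reasoning and generation.
🛠️ Method: Future summary prediction (FSP), which trains an auxiliary head to predict a compact representation of the long-term future, using either handcrafted summaries or learned summaries produced by a reverse language model.
📊 Data & experiments: Large-scale pretraining with 3B- and 8B-parameter models, evaluated on math, reasoning, and coding benchmarks.
⭐ Contributions: Proposes FSP and shows that it improves over both next-token and multi-token prediction.
Abstract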
Next-token prediction (NTP) has driven the success of large language models (LLMs), but it struggles with long-horizon reasoning, planning, and creative writing, with these limitations largely attributed to teacher-forced training. Multi-token prediction (MTP) partially mitigates these issues by predicting several future tokens at once, but it mostly captures short-range dependencies and offers limited improvement. We propose future summary prediction (FSP), which trains an auxiliary head to predict a compact representation of the long-term future, preserving information relevant for long-form generations. We explore two variants of FSP: handcrafted summaries, for example, a bag of words summary of the future of the sequence, and learned summaries, which use embeddings produced by a reverse language model trained from right to left. Large-scale pretraining experiments (3B and 8B-parameter models) demonstrate that FSP provides improvements over both NTP and MTP across math, reasoning, and coding benchmarks.
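As an illustration of the handcrafted variant, here is a minimal sketch of a bag-of-words summary of the future. The fixed window size and the multi-hot encoding are assumptions made for the example, and `future_bag_of_words` is a hypothetical name; the abstract specifies neither.

```python
def future_bag_of_words(token_ids, position, vocab_size, window):
    """Multi-hot vector marking which tokens occur in the next `window` positions."""
    target = [0] * vocab_size
    for tok in token_ids[position + 1 : position + 1 + window]:
        target[tok] = 1
    return target

tokens = [3, 1, 4, 1, 5]  # made-up token ids
target = future_bag_of_words(tokens, position=0, vocab_size=6, window=3)
```

An auxiliary head would be trained to predict `target` from the hidden state at `position`, alongside the usual next-token loss.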
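Foundation / Frontier Models (incl. LLMs)
LLM Pretraining
#LLM pretraining #efficient LLMs #metadata
🎯 Motivation: Metadata is an emerging way to accelerate LLM pretraining, but prior work focused on a single signal (URLs), leaving open whether other forms of metadata could yield greater benefits.
❓ Problem: Test whether a wider range of metadata types improves pretraining efficiency, and improve how metadata is integrated.
🔍 Analysis: Finer-grained metadata, such as document quality indicators, accelerates training, and both the position of the metadata and how it is learned affect the outcome.
🛠️ Method: Metadata appending, where predicting metadata serves as an auxiliary task, plus learnable meta-tokens trained with a masked loss.
📊 Data & experiments: Experiments compare metadata types and positions, and probing of latent representations shows how metadata shapes learning.
⭐ Contributions: Broadens the set of useful metadata, introduces new integration strategies, and gives practical guidelines for improving the efficiency and effectiveness of LLM pretraining.
Abstract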
Incorporating metadata in large language model (LLM) pretraining has recently emerged as a promising approach to accelerate training. However, prior work highlighted only one useful signal, URLs, leaving open the question of whether other forms of metadata could yield greater benefits. In this study, we investigate a wider range of metadata types and find that others, such as fine-grained indicators of document quality, can also accelerate pretraining when prepended. We identify a common feature among effective metadata: they encode information at a finer granularity. We further introduce metadata appending as a means of improving training efficiency, where predicting appropriate metadata as an auxiliary task can help speed up pretraining. In addition, learnable meta-tokens trained with masked loss can recover part of the speedup by inducing quality-aware latent structure. Using probing, we analyze latent representations to understand how metadata shapes learning. Together, these results yield practical guidelines for integrating metadata to improve both the efficiency and effectiveness of LLM pretraining.
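Foundation / Frontier Models (incl. LLMs)
LLM Pretraining
#Diffusion LLMs #Masked Diffusion Models #Training Variance #Training Stability #Mask Schedule #Mask Sampling
TL;DR: We stabilize MDM training by deriving a variance decomposition and introducing two core methods: P-POTS, which is Pareto-optimal among all unbiased t-samplers, and MIRROR, which complements it. Experiments yield clear gains on final performance.
🎯 Motivation: Masked diffusion models (MDMs) are a promising alternative to autoregressive models (ARMs), but their training variance is much higher, which leads to noisy gradients and unstable optimization. A theoretical explanation and a systematic fix have been missing.
❓ Problem: Derive the first decomposition of MDM training variance into three sources (masking pattern noise, masking rate noise, and data noise) and design methods that reduce it to narrow the gap with ARMs.
🔍 Analysis: ARMs are affected only by data noise, while MDMs carry extra variance from masking. As a result, MDMs and ARMs that are competitive at initialization diverge after task-specific training, with MDMs falling behind.
🛠️ Method: Two core methods. P-POTS is a Pareto-optimal t-sampler that samples harder t values more often with appropriately smaller update steps. MIRROR uses negatively correlated samples to reduce masking pattern noise.
📊 Data & experiments: On complex reasoning tasks, accuracy improves by 7–8% over standard MDM training, and run-to-run variability drops to near ARM levels.
⭐ Contributions: A theoretical decomposition of MDM training variance and a set of variance-reduction methods, led by P-POTS and MIRROR, that improve both stability and accuracy.
Abstract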
Masked diffusion models (MDMs) are a promising alternative to autoregressive models (ARMs), but they suffer from **inherently** much higher training variance. High variance leads to noisier gradient estimates and unstable optimization, so even equally strong pretrained MDMs and ARMs that are competitive at initialization often diverge after task-specific training, with MDMs falling far behind. Currently, there has been no theoretical explanation or systematic solution. In this paper, we derive **the first decomposition** of MDM training variance into three sources: {A} masking pattern noise, {B} masking rate noise, and {C} data noise -- while ARMs are only affected by {C}. This cleanly explains the fundamental training gap. Building on this foundation, we design six variance-reduction methods, including two core methods: (1) P-POTS, a **Pareto-optimal** $t$-sampler that minimizes training variance by sampling harder $t$ values more often with appropriately smaller update steps, and (2) MIRROR, which uses negatively correlated samples to reduce {A}. Experiments show that, compared to standard MDM training, our methods improve accuracy by **7–8\%** on complex reasoning tasks, while simultaneously reducing run-to-run variability to **near ARM levels**, substantially narrowing the gap with strong ARM baselines; in most settings, even the best baseline method runs remain below the worst run of our method.
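The abstract says only that MIRROR uses negatively correlated samples to reduce masking pattern noise. One standard way to obtain such samples is antithetic sampling, sketched below; whether MIRROR is built this way is an assumption, and `antithetic_masks` is a hypothetical helper rather than the paper's procedure.

```python
import random

def antithetic_masks(seq_len, mask_rate, rng):
    """Return two masks whose per-position decisions are negatively correlated.

    Both masks are driven by the same uniform draws: one uses u, the other 1 - u.
    Each position is still masked with probability `mask_rate` in either mask.
    """
    u = [rng.random() for _ in range(seq_len)]
    mask_a = [x < mask_rate for x in u]
    mask_b = [(1.0 - x) < mask_rate for x in u]
    return mask_a, mask_b

rng = random.Random(0)
mask_a, mask_b = antithetic_masks(seq_len=16, mask_rate=0.3, rng=rng)
```

Averaging the losses computed under `mask_a` and `mask_b` keeps the estimator unbiased while the negative correlation between the two masks can lower its variance. For mask rates below 0.5 the two masks never cover the same position.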
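Foundation / Frontier Models (incl. LLMs)
LLM Pretraining
#scaling law; agent; LLM
TL;DR: Using LLM agents to discover scaling laws for LLMs
🎯 Motivation: Discovering scaling laws that predict model performance relies on slow, case-specific human experimentation and calls for automation.
❓ Problem: Investigate whether LLM agents can automate scaling law discovery and improve on the accuracy and usefulness of existing approaches.
🔍 Analysis: Existing agents struggle to produce accurate law formulas, and the discovered laws extrapolate more accurately than established human-derived ones.
🛠️ Method: SLDAgent, an evolution-based agent that co-optimizes the scaling law model and its parameters, which lets it explore complex relationships between variables autonomously.
📊 Data & experiments: Over 5,000 experiments collected from existing literature, curated into seven diverse scaling law discovery tasks.
⭐ Contributions: Shows for the first time that automatically discovered laws extrapolate more accurately than human-derived ones across all tasks, and verifies their utility in pretraining and finetuning.
Abstract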
Discovering scaling laws for predicting model performance at scale is a fundamental and open-ended challenge, mostly reliant on slow, case-specific human experimentation. To investigate the potential for LLMs to automate this process, we collect over 5,000 experiments from existing literature and curate seven diverse scaling law discovery tasks. While existing agents struggle to produce accurate law formulas, this paper introduces SLDAgent, an evolution-based agent that co-optimizes the scaling law model and the parameters, enabling it to autonomously explore complex relationships between variables. For the first time, we demonstrate that SLDAgent can automatically discover laws that exhibit consistently more accurate extrapolation than their established, human-derived counterparts across all tasks.
Through comprehensive analysis, we elucidate why these discovered laws are superior and verify their practical utility in both pretraining and finetuning applications. This work establishes a new paradigm for agentic scientific discovery, showing that AI systems can understand their own scaling behavior, and can contribute novel and practical knowledge back to the research community.
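Foundation / Frontier Models (incl. LLMs)
LLM Pretraining
#Optimizer #AdamW
TL;DR: Improving Training with One Line of Code
🎯 Motivation: The stability and efficiency of optimizers for transformer pretraining is a long-standing concern, and the search for improvements over AdamW has had limited success.
❓ Problem: Achieve faster and more stable optimization with a single-line change to any momentum-based optimizer.
🔍 Analysis: Theory shows that the modification preserves Adam's Hamiltonian function and keeps the convergence guarantee under Lyapunov analysis, and it reveals a whole new family of optimizers.
🛠️ Method: A single-line modification to momentum-based optimizers, yielding cautious optimizers such as C-AdamW and C-Lion, backed by theoretical analysis of convergence.
📊 Data & experiments: Consistent speed-ups on LLM pretraining and post-training tasks, and better results in MAE pretraining with minimal extra hyperparameter tuning.
⭐ Contributions: A one-line change that opens a new direction for optimizer design, supported by both theory and experiments, and applicable across momentum-based optimizers.
Abstract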
AdamW has been the default optimizer for transformer pretraining. For many years, our community has searched for faster and more stable optimizers, with only limited success. In this work, we propose a \textbf{single-line modification in PyTorch} to any momentum-based optimizer, which we rename as a cautious optimizer, e.g., C-AdamW and C-Lion. Our theoretical result shows that this modification preserves Adam's Hamiltonian function and it does not break the convergence guarantee under the Lyapunov analysis. In addition, a whole new family of optimizers is revealed by our theoretical insight. Among them, we pick the simplest one for empirical experiments, showing not only consistent speed-up on LLM pretraining and post-training tasks, but also better results in MAE pretraining, with minimal extra tuning on hyperparameters.
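The abstract does not state what the single-line modification is. The sketch below shows one plausible reading, offered as an assumption rather than the paper's confirmed method: drop the components of the proposed update whose sign disagrees with the current gradient, then rescale by the fraction of components that survive. It is written in plain Python instead of PyTorch so it stays self-contained, and `cautious_update` is a hypothetical name.

```python
def cautious_update(update, grad, eps=1e-3):
    """Zero out update components whose sign disagrees with the gradient,
    then rescale by the fraction of surviving components (floored at eps)."""
    mask = [1.0 if u * g > 0 else 0.0 for u, g in zip(update, grad)]
    scale = max(sum(mask) / len(mask), eps)
    return [u * m / scale for u, m in zip(update, mask)]

update = [0.5, -0.2, 0.1, -0.4]  # made-up momentum-based step direction
grad = [1.0, 0.3, 0.2, -0.1]     # made-up current gradient
masked = cautious_update(update, grad)
```

Here the second component is dropped because the momentum direction and the gradient point in opposite directions, and the remaining three are scaled up by 1 / 0.75.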
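Foundation / Frontier Models (incl. LLMs)
LLM Pretraining
#Convex optimization #Scaling law #Hyperparameter transfer
TL;DR: Convexity emerges in deep learning for general optimizers and model architectures, precisely characterizing and predicting loss at large scale.
🎯 Motivation: The optimization dynamics of non-convex deep learning losses are hard to analyze or control, yet they look convex-like across tasks, models, optimizers, and hyperparameters.
❓ Problem: Examine how far convexity and Lipschitz continuity apply in deep learning, and use them to control the loss dynamics precisely through the learning rate schedule.
🔍 Analysis: Training becomes weakly convex after a short period, the loss is predictable from an upper bound on the last iterate, and this bound informs how the optimal learning rate scales.
🛠️ Method: Scaling laws for learning rates and losses, built from the convexity perspective, that extrapolate up to 80× across training horizons and 70× across model sizes.
📊 Data & experiments: Validation across multiple tasks, models, and hyperparameter settings confirms the predictive power at scale.
⭐ Contributions: A convexity-based way to control loss dynamics, plus learning rate scaling laws that give a principled basis for hyperparameter transfer.
Abstract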
Deep learning has a non-convex loss landscape, and its optimization dynamics are hard to analyze or control. Nevertheless, the dynamics can be empirically convex-like across various tasks, models, optimizers, hyperparameters, etc. In this work, we examine the applicability of convexity and Lipschitz continuity in deep learning, in order to precisely control the loss dynamics via the learning rate schedules. We illustrate that deep learning quickly becomes weakly convex after a short period of training, and the loss is predictable by an upper bound on the last iterate, which further informs the scaling of optimal learning rate. Through the lens of convexity, we build scaling laws of learning rates and losses that extrapolate as much as $80\times$ across training horizons and $70\times$ across model sizes.
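Foundation / Frontier Models (incl. LLMs)
LLM Pretraining
#data curation #speech language models #pretraining #synthetic data
TL;DR: We study three data-centric methods to improve speech-language interleaved pretraining with a focus on spoken question-answering capabilities
🎯 Motivation: Spoken question answering is a core capability of interactive AI systems, but the lack of controlled ablations on pretraining data processing and curation makes it hard to tell which factors drive SpeechLM performance.
❓ Problem: Study, in a data-centric way, how to process, synthesize, and interleave speech-language pretraining data to improve spoken QA.
🔍 Analysis: Similar studies in other data modalities have produced substantial gains, but speech-language data has not yet received the same systematic exploration.
🛠️ Method: Three questions are examined: (1) how to process raw web-crawled audio, (2) how to construct synthetic datasets, and (3) how to interleave (text, audio) segments into training sequences.
📊 Data & experiments: Controlled data-centric ablations guide the pretraining of SpeLangy, a 3.8B-parameter SpeechLM that outperforms models up to 3x larger.
⭐ Contributions: Shows the impact of effective data curation on SpeechLMs and offers design guidance; SpeLangy beats models up to 3x larger by 10.2% absolute.
Abstract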
Spoken Question-Answering (SQA) is a core capability for useful and interactive artificial intelligence systems. Recently, several speech-language models (SpeechLMs) have been released with a specific focus on improving their SQA performance. However, a lack of controlled ablations of pretraining data processing and curation makes it challenging to understand what factors account for performance, despite substantial gains from similar studies in other data modalities. In this work, we address this gap by conducting a data-centric exploration for pretraining SpeechLMs. We focus on three questions fundamental to speech-language pretraining data: (1) how to process raw web-crawled audio content for speech-text pretraining, (2) how to construct synthetic datasets to augment web-crawled data and (3) how to interleave (text, audio) segments into training sequences. We apply the insights from our controlled data-centric ablations to pretrain a 3.8B-parameter SpeechLM, called SpeLangy, that outperforms models that are up to 3x larger by 10.2% absolute performance. We hope our findings highlight the impact of effective data curation and guide future data-centric exploration in SpeechLMs.
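Foundation / Frontier Models (incl. LLMs)
LLM Pretraining
#Pretraining
TL;DR: We study LLM pretraining with logit distillation and show that it comes with a tradeoff: it boosts test-time scaling but impairs in-context learning. We analyze this tradeoff in detail and develop mitigation strategies.
🎯 Motivation: Distillation has regained prominence in LLM pretraining, but its effect on key capabilities such as test-time scaling and in-context learning is underexplored.
❓ Problem: Analyze the trade-off that distilled pretraining creates between test-time scaling and in-context learning, and find ways to mitigate it.
🔍 Analysis: Distilled pretraining markedly improves test-time scaling but impairs in-context learning, particularly the kind modeled via induction heads.
🛠️ Method: Study distilled pretraining in a bigram-model sandbox to isolate the common factor behind these observations, then use the insights to inform pretraining design choices.
📊 Data & experiments: The study is motivated by the Llama-3.2 and Gemma model families and validates the phenomenon and its mechanism in the simplified setting.
⭐ Contributions: Explains the trade-off behind distilled pretraining and offers design guidance for practitioners.
Abstract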
In the past year, distillation has seen a renewed prominence in large language model (LLM) pretraining, exemplified by the Llama-3.2 and Gemma model families. While distillation has historically been shown to improve statistical modeling, its effects on new paradigms key to modern LLMs, such as test-time scaling and in-context learning, remain underexplored. In this work, we make three main contributions. First, we show that pretraining with distillation yields models that exhibit remarkably better test-time scaling. Second, we observe that this benefit comes with a trade-off: distillation impairs in-context learning capabilities, particularly the one modeled via induction heads. Third, to demystify these findings, we study distilled pretraining in a sandbox of a bigram model, which helps us isolate the common principal factor behind our observations. Finally, using these insights, we shed light on various design choices for pretraining that should help practitioners going forward.
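For readers unfamiliar with logit distillation, the sketch below shows the usual per-position objective: the KL divergence from the teacher's next-token distribution to the student's. It assumes an untempered softmax and a single position, and it is not taken from the paper; the function names and logits are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def distillation_loss(student_logits, teacher_logits):
    """KL(teacher || student) over the vocabulary for one position."""
    log_p_teacher = log_softmax(teacher_logits)
    log_p_student = log_softmax(student_logits)
    return sum(
        math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_teacher, log_p_student)
    )

teacher = [2.0, 1.0, 0.1]  # made-up teacher logits
student = [1.5, 1.2, 0.3]  # made-up student logits
loss = distillation_loss(student, teacher)
```

The loss is zero when the student matches the teacher exactly and positive otherwise, so the student is trained toward the teacher's full distribution instead of a one-hot target.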
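Foundation / Frontier Models (incl. LLMs)
LLM Pretraining
#language model #pretraining #training objective #mixed training objective #masked diffusion
TL;DR: We simultaneously train language models on autoregressive and masked-diffusion objectives, resulting in flexible models that outperform the single-objective models in both settings.
🎯 Motivation: Autoregressive language models train efficiently but are sensitive to overfitting, while masked-diffusion models resist overfitting but train less efficiently. Combining their strengths is attractive.
❓ Problem: Train on both the autoregressive and masked-diffusion objectives at once to resolve the trade-off between training efficiency and overfitting.
🔍 Analysis: Autoregressive models overfit under heavy data repetition, where masked-diffusion models are more stable. Dual-objective training helps in every evaluated setting.
🛠️ Method: Optimize both objectives jointly with no architectural changes, and search empirically for the best balance between them.
📊 Data & experiments: 50 language models trained and evaluated under varying levels of data repetition, measured on both autoregressive and masked-diffusion performance.
⭐ Contributions: Shows that dual-objective training combines the advantages of both objectives, and that the optimal balance is similar whichever downstream mode is targeted.
Abstract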
This paper combines autoregressive and masked-diffusion training objectives without any architectural modifications, resulting in flexible language models that outperform single-objective models. Autoregressive modeling has been a popular approach, partly because of its training efficiency; however, that comes at the cost of sensitivity to overfitting. On the other hand, masked-diffusion models are less efficient to train while being more resilient to overfitting. In this work, we demonstrate that dual-objective training achieves the best of both worlds. To derive the optimal balance between both objectives, we train and evaluate 50 language models under varying levels of data repetition. We show that it is optimal to combine both objectives under all evaluated settings and that the optimal balance is similar whether targeting autoregressive or masked-diffusion downstream performance.
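Foundation / Frontier Models (incl. LLMs)
LLM Pretraining
#Large models #Data-centric AI #AutoML
🎯 Motivation: Large and diverse datasets have driven progress in large models, but choosing the optimal data mixture remains an open problem.
❓ Problem: FastMix automates data mixture discovery without predefined heuristics or resource-intensive simulations, and covers both pre-training and post-training.
🔍 Analysis: Mixture selection is recast as a bilevel optimization problem, and optimizing mixture ratios is shown to be equivalent to assigning per-source loss weights under uniform source sampling.
🛠️ Method: An iterative procedure that alternates between updating model parameters (inner loop) and updating mixture ratios from validation feedback (outer loop), which makes the mixture optimizable by gradients.
📊 Data & experiments: In pre-training, FastMix outperforms RegMix and CLIMB with a search cost of 1.3 GPU-hours; in post-training, it achieves the best result with 2.2 GPU-hours.
⭐ Contributions: An efficient, automated method for data mixture optimization that sharply reduces search cost while improving model performance.
Abstract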
While large and diverse datasets have driven recent advances in large models, identifying the optimal data mixture for pre-training and post-training remains a significant open problem. We address this challenge with FastMix, a novel framework that automates data mixture discovery while training only a single proxy model. Instead of relying on predefined heuristics or resource-intensive simulations, FastMix jointly optimizes mixture coefficients and model parameters, substantially improving efficiency and scalability over prior approaches. At the core of FastMix is a reformulation of mixture selection as a bilevel optimization problem.
Under this reformulation, we show that optimizing mixture ratios is mathematically equivalent to assigning per-source loss weights under uniform source sampling. This embeds the mixture coefficients directly into the differentiable iterative optimization objective, enabling efficient, gradient-based optimization of both mixture and model. To solve the optimization problem, FastMix implements an approximate iterative optimization procedure, alternating between (i) updating model parameters on data sampled according to current mixture ratios (inner loop) and (ii) updating mixture ratios based on validation feedback (outer loop). Across pre- and post-training, FastMix outperforms baselines while drastically reducing search cost: in pre-training, it attains an average score of 48.2 with 1.3 GPU-hours ($\times 550$ vs. RegMix; $\times 55$ vs. CLIMB), and in post-training (SFT) it leads with 65.4 with a $+5.5$ gain over the next best, completing search in 2.2 GPU-hours compared to the 115 GPU-hours required by CLIMB/RegMix.
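The stated equivalence between mixture ratios and per-source loss weights can be checked numerically in expectation. The sketch below assumes K sources with fixed per-source expected losses and uses weights w_s = K · p_s; the paper's exact formulation may differ, and the numbers are made up.

```python
def expected_loss_mixture(ratios, losses):
    """Expected loss when source s is sampled with probability ratios[s]."""
    return sum(p * l for p, l in zip(ratios, losses))

def expected_loss_uniform_weighted(ratios, losses):
    """Expected loss under uniform source sampling with loss weights K * ratios[s]."""
    k = len(losses)
    weights = [k * p for p in ratios]
    return sum(w * l for w, l in zip(weights, losses)) / k

ratios = [0.5, 0.3, 0.2]   # made-up mixture ratios
losses = [2.0, 1.0, 4.0]   # made-up per-source expected losses

mixture_view = expected_loss_mixture(ratios, losses)
weighted_view = expected_loss_uniform_weighted(ratios, losses)
```

Both views give the same objective, which is why the mixture coefficients can sit directly inside a differentiable loss and be updated by gradient steps in the outer loop.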
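Foundation / Frontier Models (incl. LLMs)
LLM Pretraining
#large language models #memorization #knowledge acquisition #datasets
TL;DR: We develop a synthetic dataset that enables us to study memorization and generalization of factual knowledge in LLMs.
🎯 Motivation: Language models learn both the structure of language and facts about the world, but how they memorize facts is still poorly understood.
❓ Problem: Study how language models memorize facts versus verbatim sequences during training, and how the two relate.
🔍 Analysis: Language models are known to memorize long training sequences verbatim, while the mechanism of fact memorization is much less well understood.
🛠️ Method: A synthetic dataset of fictional events with question-answer pairs, designed to separate the two kinds of memorization.
📊 Data & experiments: Webtext-like documents about fictional events are generated synthetically, training experiments tease apart different forms of memorization, and the challenges of building realistic fictional data are documented.
⭐ Contributions: A new synthetic dataset for studying fact memorization alongside verbatim memorization, giving researchers a tool for understanding knowledge acquisition in language models.
Abstract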
When language models are trained on textual data, they acquire both knowledge about the structure of language as well as knowledge of facts about the world.
At inference time, their knowledge of facts can be leveraged to solve interesting problems and perform useful knowledge work for users.
It is well known that language models can verbatim memorize long sequences from their training data.
However, it is much less well understood how language models memorize facts seen during training.
In this work, we propose a new dataset to specifically empower researchers to study the dual processes of fact memorization and verbatim sequence memorization.
The dataset consists of synthetically-generated, webtext-like documents about fictional events, as well as question-answer pairs about the events.
We conduct training experiments showing how synthetic data about fictional events can be effective in teasing apart different forms of memorization.
We also document the challenges in effectively building realistic, fictional synthetic data.
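Foundation / Frontier Models (incl. LLMs)
LLM Pretraining
#LLMs #Arithmetic #Embedding #Numbers
🎯 Motivation: Language models treat numbers as ordinary tokens, which causes frequency bias and fragments numbers into several tokens, limiting how well models handle them.
❓ Problem: Embed each number as a single token using Fourier features, avoiding frequency bias and fragmentation and improving accuracy and efficiency on numerical tasks.
🔍 Analysis: Pre-trained LLMs internally learn Fourier-like features for number tokens.
🛠️ Method: FoNE maps numbers directly into the embedding space with their Fourier features, using only two embedding dimensions per digit and keeping each number intact.
📊 Data & experiments: A 38M-parameter Transformer trained from scratch with FoNE outperforms a fine-tuned Llama-3.2-1B on addition, subtraction, and multiplication and reaches 100% test accuracy. On 6-digit decimal addition, FoNE needs 64× less data than subword and digit-wise embeddings and uses 3× and 6× fewer tokens per number, respectively.
⭐ Contributions: An efficient and accurate number embedding that addresses frequency bias and fragmentation, improves arithmetic performance, and lowers data and compute requirements.
Abstract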
Language models treat numbers in the same way as ordinary word tokens, which introduces two major issues: (1) embeddings of numerical tokens primarily reflect their frequency in text corpora rather than their inherent numerical properties, leading to frequency bias, and (2) numbers are often split into multiple tokens, forcing the model to aggregate these pieces to recover their values. Inspired by the observation that pre-trained Large Language Models (LLMs) internally learn Fourier-like features for number tokens, we propose **Fo**urier **N**umber **E**mbedding **(FoNE)**, a novel method that directly maps numbers into the embedding space with their Fourier features. FoNE encodes each number as a single token with only two embedding dimensions per digit, effectively capturing numerical values without fragmentation.
Compared to traditional subword and digit-wise embeddings, FoNE achieves higher accuracy on arithmetic tasks, requires significantly less training data, and offers more efficient training and inference.
A $38$M-parameter Transformer trained from scratch with FoNE outperforms a fine-tuned Llama-3.2-1B model on addition, subtraction, and multiplication. FoNE is also the only method that achieves $100\%$ accuracy on over 100,000 test examples across these tasks. On 6-digit decimal addition, FoNE needs 64$\times$ less data than subword and digit-wise embeddings to reach $\ge 99\%$ accuracy, while using 3$\times$ and 6$\times$ fewer tokens per number, respectively.
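To make "two embedding dimensions per digit" concrete, here is a minimal sketch for non-negative integers. It assumes that the i-th pair is (cos, sin) of 2πx / 10^i, a choice consistent with the abstract but not confirmed by it; the function names are hypothetical, and decimals and the projection into the model's embedding width are left out.

```python
import math

def fourier_number_embedding(x, num_digits):
    """Two features per digit: (cos, sin) of 2*pi*x / 10**i for i = 1..num_digits."""
    features = []
    for i in range(1, num_digits + 1):
        angle = 2.0 * math.pi * x / (10 ** i)
        features.extend([math.cos(angle), math.sin(angle)])
    return features

def decode_mod(features, i):
    """Recover x mod 10**i from the i-th (cos, sin) pair."""
    cos_v, sin_v = features[2 * (i - 1)], features[2 * (i - 1) + 1]
    angle = math.atan2(sin_v, cos_v) % (2.0 * math.pi)
    return round(angle / (2.0 * math.pi) * 10 ** i) % (10 ** i)

emb = fourier_number_embedding(472, num_digits=3)
```

Each pair encodes the number modulo a power of ten, so the value can be read back from a single token's embedding without stitching together several sub-tokens.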
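Foundation / Frontier Models (incl. LLMs)
LLM Pretraining
#Pretraining #Supervised Finetuning #Reasoning #LLM
TL;DR: Dissecting the Synergy of Reasoning Data in Pretraining and SFT
🎯 Motivation: Reasoning ability is usually improved by post-training on high-quality reasoning data, but the role of such data in pretraining is unclear, which calls for a systematic study of when to introduce it.
❓ Problem: Examine how reasoning data affects models when introduced during pretraining versus post-training, and how to allocate it to maximize reasoning ability.
🔍 Analysis: Front-loading reasoning data is critical for building foundational capabilities that later fine-tuning cannot fully recover. Diversity matters most in pretraining, while SFT is more sensitive to data quality.
🛠️ Method: A multi-stage experimental design that varies the scale, quality, and diversity of reasoning data at different training stages.
📊 Data & experiments: Reasoning-intensive data is injected at the pretraining and post-training stages in controlled comparisons, quantifying the gain from each allocation strategy.
⭐ Contributions: The first systematic study of reasoning data across training stages, an asymmetric principle for data allocation, and a challenge to the conventional separation of language modeling and reasoning.
Abstract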
The prevailing paradigm for enhancing the reasoning abilities of Large Language Models (LLMs) revolves around post-training on high-quality, reasoning-intensive data. While emerging literature suggests that reasoning data is increasingly incorporated also during the mid-training stage---a practice that is relatively more proprietary and less openly characterized---the role of such data in pretraining remains unclear. In particular, due to the opaqueness of pretraining corpora in most frontier models, the effect of reasoning data introduced at different phases of pre- and/or post-training is relatively less reported in the scientific literature. This raises several important but unsettled questions: Is adding reasoning data earlier during pre-training any better than introducing it during post-training, when the token counts are controlled? Could earlier inclusion risk overfitting and harm generalization, or instead establish durable foundations that later fine-tuning cannot recover? To address these questions, we conduct the first systematic study of how reasoning data—varying in scale, diversity, and quality—affects LLM performance when introduced at different stages of training. Our findings reveal that front-loading reasoning data into pretraining is critical (19% average gain), establishing foundational capabilities that cannot be fully replicated by later-stage SFT, even with more data. We uncover an asymmetric principle for optimal data allocation: pretraining benefits most from broad diversity in reasoning patterns (11% average gain), while SFT is more sensitive to data quality (15% average gain with high quality data). Furthermore, we show that high-quality pretraining data has latent effects, activated only after SFT, and that naively scaling SFT data can be detrimental, washing away the benefits of early reasoning injection. 
Collectively, our results challenge the conventional separation of language modeling and reasoning, providing a principled guide for strategically allocating data across the entire training pipeline to build more capable models.
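Foundation / Frontier Models (incl. LLMs)
LLM Pretraining
#LLM pretraining #Curriculum Learning #Model Weight Average
TL;DR: Use model weight average to enhance curriculum learning in LLM pretraining.
🎯 Motivation: LLMs are trained on data of mixed quality because high-quality data is scarce, even after careful curation. Curriculum-based pretraining, which orders data from low to high quality, is a natural way to use high-quality data better, yet prior studies report limited gains.
❓ Problem: Identify why curriculum strategies underperform, namely their incompatibility with decaying learning rate schedules, and reconcile data ordering with the optimization schedule.
🔍 Analysis: With a constant learning rate, curriculum training substantially outperforms random shuffling, but the advantage diminishes under standard learning rate decay.
🛠️ Method: Two mitigations: (1) a more moderate decay schedule whose final learning rate is only moderately smaller than the peak, and (2) replacing learning rate decay with model averaging, i.e., a weighted average of the final few checkpoints.
📊 Data & experiments: Validated on 1.5B-parameter models trained over 30B tokens with several data-quality metrics. Combining both strategies improves the average benchmark score by 1.64% over random shuffling.
⭐ Contributions: Re-evaluates curriculum-based LLM pretraining, exposes the interaction between learning rate schedules and data ordering, and argues for co-designing data curricula with optimization methods.
Abstract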
Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality as determined by a quality metric. However, prior studies have reported limited improvements from such curriculum-based pretraining strategies. This work identifies a critical factor constraining these methods: the incompatibility between the ascending data quality order and the decaying learning rate (LR) schedule. We find that while curriculum-based training substantially outperforms random shuffling when using a constant LR, its advantage diminishes under standard LR decay schedules. Our experiments show this incompatibility can be mitigated by two simple strategies: (1) employing a more moderate LR decay schedule, where the final LR is only moderately smaller than the peak LR, and (2) replacing LR decay with model averaging, i.e., computing a weighted average of the final few checkpoints. By combining these strategies, we improve the average score on a suite of standard benchmarks by 1.64% over random shuffling, without additional data refinement. Validated on 1.5B-parameter models trained over 30B tokens with various data-quality metrics, our findings call for a re-evaluation of curriculum-based LLM pretraining and underscore the potential of co-designing data curricula with optimization methods.
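The second strategy, a weighted average of the final few checkpoints, can be sketched as follows. Checkpoints are represented as plain dictionaries of scalar parameters for brevity, and the weights are made up; the abstract does not say how the paper chooses them.

```python
def average_checkpoints(checkpoints, weights):
    """Weighted average of parameter dicts that share the same keys."""
    total = sum(weights)
    averaged = {}
    for name in checkpoints[0]:
        averaged[name] = (
            sum(w * ckpt[name] for w, ckpt in zip(weights, checkpoints)) / total
        )
    return averaged

# Three made-up checkpoints from the end of training, oldest first.
ckpts = [{"w": 1.0, "b": 0.0}, {"w": 2.0, "b": 0.5}, {"w": 4.0, "b": 1.0}]
avg = average_checkpoints(ckpts, weights=[1.0, 1.0, 2.0])
```

Averaging smooths out late-training noise in a way that is similar in spirit to learning rate decay, while leaving the learning rate high enough for the high-quality data at the end of the curriculum to keep having an effect.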
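Foundation / Frontier Models (incl. LLMs)
LLM Pretraining
#data #sampling #data efficiency #LLMs #data curation #data quality
🎯 Motivation: Training LLMs is expensive, so data-efficient methods that improve the trade-off between model quality and resource consumption are valuable.
❓ Problem: Study data selection based on quality estimates and on coverage and diversity, aiming to improve training efficiency and performance with less data.
🔍 Analysis: Coverage sampling often recovers the performance of training on the full dataset, while quality-based selection can beat full-data training even with far less data.
🛠️ Method: AskLLM uses an instruction-tuned LLM to assess the quality of each training example directly, and density sampling models the data distribution to select a diverse sample.
📊 Data & experiments: 22 data curation techniques are tested on the pre-training of T5-style models, across hundreds of pre-training runs and post fine-tuning evaluation tasks.
⭐ Contributions: Shows that AskLLM and density sampling are the best methods in their respective categories, and that AskLLM outperforms full-data training even when only 10% of the data is kept.
Abstract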
The training of large language models (LLMs) is expensive. In this paper, we study data-efficient approaches for pre-training LLMs, i.e., techniques that aim to optimize the Pareto frontier of model quality and training resource/data consumption. We seek to understand the tradeoffs associated with data selection routines based on (i) expensive-to-compute data-quality estimates, and (ii) maximization of coverage and diversity-based measures in the feature space. Our first technique, AskLLM, leverages the zero-shot reasoning capabilities of instruction-tuned LLMs to directly assess the quality of a training example. To target coverage, we propose density sampling, which models the data distribution to select a diverse sample. Testing the effect of $22$ different data curation techniques on the pre-training of T5-style models, involving hundreds of pre-training runs and post fine-tuning evaluation tasks, we find that AskLLM and density are the best methods in their respective categories. While coverage sampling techniques often recover the performance of training on the entire dataset, training on data curated via AskLLM consistently outperforms full-data training---even when we sample only $10$\% of the original dataset, while converging up to $70$\% faster.
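Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#In-Context Learning #Attention Mechanism #In-Context Gradient Descent #Transformer #Universal Approximation
TL;DR: We prove that a minimal Transformer with frozen weights emulates a broad class of algorithms by in-context prompting
🎯 Motivation: Examines how a fixed-weight Transformer can emulate algorithms through in-context prompting, tying algorithm execution directly to the prompting mechanism.
❓ Problem: Prove that a minimal Transformer with frozen weights can emulate a broad class of algorithms, and show its potential as a general-purpose algorithm executor.
🔍 Key observations: A single-head softmax attention layer can reproduce a specific class of functions to arbitrary precision; with suitably constructed prompts, a fixed-weight two-layer softmax attention module can emulate that whole class of algorithms.
🛠️ Method: Design prompts that encode an algorithm's parameters into token representations, creating sharp dot-product gaps that force the softmax attention layer to carry out the intended computation without any parameter update.
📊 Data and experiments: Numerical experiments corroborate the theory and show that prompts alone can drive the Transformer to perform task-specific algorithms.
⭐ Contributions: Establishes a prompt-based framework for algorithmic universality, shows the algorithmic generality of Transformer models, and gives a theoretical basis for prompt-driven use of GPT-style models.
Abstract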
We prove that a minimal Transformer with frozen weights emulates a broad class of algorithms by in-context prompting.
We formalize two modes of in-context algorithm emulation.
In the *task-specific mode*, for any continuously differentiable function $f: \mathbb{R} \to \mathbb{R}$, we construct a single-head softmax attention layer whose forward pass reproduces functions of the form $f(w^\top x - y)$ to arbitrary precision.
This general template subsumes many popular machine learning algorithms (e.g., gradient descent, linear regression, ridge regression).
In the *prompt-programmable mode*, we prove universality:
a single fixed-weight two-layer softmax attention module emulates all algorithms from the task-specific class (i.e., each implementable by a single softmax attention) via only prompting.
Our key idea is to construct prompts that encode an algorithm’s parameters into token representations, creating sharp dot-product gaps that force the softmax attention to follow the intended computation.
This construction requires no feed-forward layers and no parameter updates.
All adaptation happens through the prompt alone.
Numerical results corroborate our theory.
These findings forge a direct link between in-context learning and algorithmic emulation, and offer a simple mechanism for large Transformers to serve as prompt-programmable interpreters of algorithms.
They illuminate how GPT-style foundation models may swap algorithms via prompts alone, and establish a form of algorithmic universality in modern Transformer models.
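The role of "sharp dot-product gaps" can be illustrated with a toy softmax attention: once the gap between the best-matching key and the others is large, the softmax output is essentially the value attached to that key. The sketch below is only an illustration of this mechanism, assuming a hypothetical `scale` parameter that stands in for the gap the prompt construction creates; it is not the paper's construction.

```python
import math


def softmax_attention(query, keys, values, scale=1.0):
    """Single-head softmax attention for one query (pure Python)."""
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    norm = sum(exps)
    probs = [e / norm for e in exps]
    dim = len(values[0])
    return [sum(p * v[i] for p, v in zip(probs, values)) for i in range(dim)]


keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0], [20.0], [60.0]]
query = [0.0, 1.0]  # matches the second key

soft = softmax_attention(query, keys, values, scale=1.0)    # blurred mixture
sharp = softmax_attention(query, keys, values, scale=50.0)  # near hard lookup
```

With a small gap the output mixes all three values; with a large gap it collapses onto the value of the matching key, which is how a prompt can steer a frozen attention layer toward a chosen computation.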
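Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#Test-time Training #Large language model #LLM
🎯 Motivation: The conventional "train then deploy" paradigm prevents large language models (LLMs) from adapting their weights to the continuous stream of new information that arrives in real-world tasks.
❓ Problem: Existing test-time training (TTT) methods, when applied to LLMs, suffer from architectural incompatibility, computational inefficiency, and fast-weight objectives that are misaligned with language modeling.
🔍 Key observations: Because of poorly chosen objectives and costly weight updates, existing methods are hard to scale to efficient, dynamic adaptation over long and complex contexts.
🛠️ Method: Proposes In-Place Test-Time Training (In-Place TTT), which uses the final projection matrix of the MLP blocks as adaptable fast weights, designs an objective aligned with the next-token-prediction task of autoregressive language models, and adds a chunk-wise update mechanism for scalability.
📊 Data and experiments: As an in-place enhancement, In-Place TTT lets a 4B-parameter model achieve superior performance on tasks with contexts up to 128k; when pretrained from scratch, it also outperforms competitive TTT-related approaches; ablations analyze the key design choices.
⭐ Contributions: A TTT solution that requires no retraining from scratch and integrates seamlessly into LLMs, pointing to a new direction for continual learning in LLMs.
Abstract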
The static "train then deploy" paradigm fundamentally limits Large Language Models (LLMs) from dynamically adapting their weights in response to continuous streams of new information inherent in real-world tasks. Test-Time Training (TTT) offers a compelling alternative by updating a subset of model parameters (fast weights) at inference time, yet its potential in the current LLM ecosystem is hindered by critical barriers including architectural incompatibility, computational inefficiency and misaligned fast weight objectives for language modeling. In this work, we introduce **In-Place Test-Time Training (In-Place TTT)**, a framework that seamlessly endows LLMs with Test-Time Training ability. In-Place TTT treats the final projection matrix of the ubiquitous MLP blocks as its adaptable fast weights, enabling a "drop-in" enhancement for LLMs without costly retraining from scratch. Furthermore, we replace TTT's generic reconstruction objective with a tailored, theoretically-grounded objective explicitly aligned with the Next-Token-Prediction task governing autoregressive language modeling. This principled objective, combined with an efficient chunk-wise update mechanism, results in a highly scalable algorithm compatible with context parallelism. Extensive experiments validate our framework's effectiveness: as an in-place enhancement, it enables a 4B-parameter model to achieve superior performance on tasks with contexts up to 128k, and when pretrained from scratch, it consistently outperforms competitive TTT-related approaches. Ablation study results further provide deeper insights into our design choices. Collectively, our results establish In-Place TTT as a promising step towards a paradigm of continual learning in LLMs.
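Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#Embedding Model #LLMs #Retriever
🎯 Motivation: Work on LLM-based text embeddings has mostly focused on data scaling or synthesis, with little attention to training techniques and data quality, which limits performance.
❓ Problem: Propose a set of training techniques and high-quality data strategies that improve the performance and generalization of LLMs on embedding tasks.
🔍 Key observations: With better training techniques and high-quality data, a small model can outperform models of comparable size and approach models far larger than itself, so model size alone does not determine embedding quality.
🛠️ Method: A three-stage training pipeline (weakly supervised pretraining, supervised fine-tuning, and contrastive distillation with fine-grained soft signals), combined with focal-style reweighting and online hard-negative mixing to emphasize difficult samples and enrich hard negatives.
📊 Data and experiments: Curates over 20 categories of pretraining data and 100 categories of fine-tuning and distillation data, using task-specific instructions, hard-negative mining, and example-based multi-class labeling; on the Massive Text Embedding Benchmark the models outperform models of comparable size and rival models 3-26x larger.
⭐ Contributions: Develops the KaLM-Embedding-V2 embedding models, which deliver strong embeddings under 1B parameters; defines an efficient training pipeline and high-quality data strategy that set a new standard for compact models; releases code, data, and models.
Abstract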
Recent advancements in Large Language Model (LLM)-based text embedding models primarily focus on data scaling or synthesis, yet training techniques and data quality remain underexplored, which constrains performance. In this work, we propose KaLM-Embedding-V2 from the Lychee-KaLM team, a series of versatile and compact embedding models, systematically incentivizing advanced embedding capability in LLMs by superior training techniques and high-quality data. For model architecture, we implement the models on a 0.5B compact size with simple mean-pooling to produce fixed-length embeddings and remove the causal attention mask to enable fully bidirectional representation learning. For training techniques, we propose a progressive multi-stage training pipeline: pre-training on weakly supervised large-scale datasets, fine-tuning with supervised high-quality datasets, and contrastive distillation with fine-grained soft signals, integrated with focal-style reweighting and online hard-negative mixing to emphasize difficult samples and enrich hard negatives, respectively. For training data, we curate over 20 categories for pre-training and 100 categories for fine-tuning and contrastive distillation, to improve both performance and generalization, leveraging task-specific instructions, hard-negative mining, and example-based multi-class labeling to ensure high quality. Combining these techniques, our KaLM-Embedding-V2 series achieves state-of-the-art performance on the Massive Text Embedding Benchmark, outperforming models of comparable size and rivaling models 3-26x larger, setting a new standard for versatile and compact embedding models under 1B parameters. The code, data, and models are available at https://kalm-embedding.github.io/.
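Two of the ingredients named above, mean-pooling into a fixed-length embedding and focal-style reweighting of hard samples, can be sketched in a few lines. This is a minimal illustration: `mean_pool` and `focal_weight` are hypothetical names, and the `(1 - p) ** gamma` form is borrowed from the standard focal loss as an assumption, since the abstract does not give the exact reweighting formula.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the embeddings of non-padding tokens into one vector."""
    kept = [e for e, m in zip(token_embeddings, attention_mask) if m]
    dim = len(token_embeddings[0])
    return [sum(e[i] for e in kept) / len(kept) for i in range(dim)]


def focal_weight(p_correct, gamma=2.0):
    """Focal-style weight: small for easy samples, large for hard ones.

    p_correct is the probability the model assigns to the positive pair.
    The exact form used by KaLM-Embedding-V2 is an assumption here.
    """
    return (1.0 - p_correct) ** gamma


# Two real tokens and one padding token (mask 0) that must be ignored.
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
easy, hard = focal_weight(0.9), focal_weight(0.1)
```

Multiplying each sample's contrastive loss by such a weight shifts the gradient toward pairs the model still gets wrong.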
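Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#Knowledge Fusion #Model Merging #Large Language Model #Task Vector
TL;DR: GraftLLM enables efficient cross-capability transfer in large LLMs via compact SkillPacks, preserving knowledge, preventing forgetting, and supporting heterogeneous model fusion.
🎯 Motivation: Cross-capability transfer is a key challenge for LLMs in multi-task integration, model compression, and knowledge fusion; more efficient transfer would improve adaptability and performance.
❓ Problem: For large heterogeneous models, knowledge distillation with full-parameter fine-tuning overlooks the student model's inherent capability and risks catastrophic forgetting, while PEFT methods struggle to absorb knowledge from source LLMs, which limits model fusion.
🔍 Key observations: Existing merging approaches focus on small, homogeneous models and do not meet the needs of large heterogeneous settings, where parameter conflicts and forgetting arise.
🛠️ Method: Proposes GraftLLM, which stores source-model capabilities in a target model + SkillPack format and applies a module-aware adaptive compression strategy to parameter updates, giving efficient storage while preserving task-specific knowledge.
📊 Data and experiments: Experiments across various scenarios show that GraftLLM outperforms existing techniques in knowledge transfer, knowledge fusion, and forget-free learning.
⭐ Contributions: A scalable and efficient cross-capability transfer method, GraftLLM, that supports forget-free continual learning, knowledge fusion, and fusion of heterogeneous LLMs.
Abstract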
Cross-capability transfer represents a key challenge in large language model (LLM) research, particularly in multi-task integration, model compression, and knowledge fusion. Recent works such as FuseLLM and FuseChat have shown the potential of transferring multiple model capabilities to lightweight models, thereby enhancing adaptability and efficiency. This motivates our investigation into more efficient methods for cross-capability transfer. However, existing merging approaches primarily focus on small, homogeneous models, limiting their applicability.
For large, heterogeneous models, knowledge distillation with full-parameter fine-tuning often overlooks the student model’s inherent capability and risks catastrophic forgetting, while PEFT methods struggle to effectively absorb knowledge from source LLMs.
To address these issues, we introduce **GraftLLM**, a novel grafting-based method that stores source model capabilities in a target model + SkillPack format. This approach preserves general capabilities, reduces parameter conflicts, and supports forget-free continual learning and model fusion. We employ a module-aware adaptive compression strategy for parameter updates, ensuring efficient storage while **preserving task-specific knowledge**. The resulting SkillPack serves as a compact and transferable knowledge carrier, ideal for **heterogeneous LLM fusion**.
Experiments across various scenarios demonstrate that GraftLLM outperforms existing techniques in knowledge transfer, knowledge fusion, and forget-free learning, providing a scalable and efficient solution for cross-capability transfer.
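Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#Large Language Models #Pretraining #Concepts #Sparse Autoencoders
🎯 Motivation: The standard pretraining objective is next-token prediction, which optimizes token-level perplexity and makes it hard to capture the deeper semantics carried by continuous concepts.
❓ Problem: Propose a pretraining framework that combines discrete next-token prediction with continuous concept representations to improve training efficiency and model performance.
🔍 Key observations: Combining concept learning with interleaving in the hidden state improves sample efficiency and outperforms standard next-token prediction and knowledge distillation.
🛠️ Method: Introduces CoCoMix, which predicts continuous concepts learned from a pretrained sparse autoencoder and mixes them into the model's hidden state by interleaving them with token hidden representations, trained end to end.
📊 Data and experiments: Evaluated on multiple benchmarks covering language modeling and downstream reasoning tasks, with gains in sample efficiency, accuracy, and interpretability.
⭐ Contributions: A method that combines discrete prediction with continuous concepts, improving performance and interpretability and offering a transparent, steerable way to guide the model's internal reasoning process.
Abstract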
Next token prediction has been the standard training objective used in large language model pretraining. Representations are learned as a result of optimizing for token-level perplexity. We propose Continuous Concept Mixing (CoCoMix), a novel pretraining framework that combines discrete next token prediction with continuous concepts. Specifically, CoCoMix predicts "continuous concepts" learned from a pretrained sparse autoencoder and mixes them into the model's hidden state by interleaving with token hidden representations. Through experiments on multiple benchmarks, including language modeling and downstream reasoning tasks, we show that CoCoMix is more sample efficient and consistently outperforms standard next token prediction and knowledge distillation. We find that combining both concept learning and interleaving in an end-to-end framework is critical to performance gains. Furthermore, CoCoMix enhances interpretability and steerability by allowing direct inspection and modification of the predicted concept, offering a transparent way to guide the model’s internal reasoning process.
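Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#Large Language Models #Efficient Training #Representation Learning
TL;DR: We propose a Late-to-Early Training (LET) method that lets LLMs explicitly learn later knowledge in earlier steps and earlier layers, leading to faster and improved LLM training.
🎯 Motivation: Pretraining large language models is costly and slows iteration. Whether existing small pretrained models can be used to accelerate the training of larger ones is an important and underexplored question.
❓ Problem: Proposes the Late-to-Early Training (LET) paradigm, which uses late-layer representations of a pretrained model to guide the early layers of the target model during early training, aiming for faster training and better performance.
🔍 Key observations: Standard pretraining ignores the link between early layers and later knowledge; LET lets early steps learn later knowledge and early layers learn late-layer representations, improving convergence and generalization.
🛠️ Method: Two mechanisms, late-to-early-step learning and late-to-early-layer learning, guide the model to absorb the pretrained model's late-stage knowledge and representations during early training.
📊 Data and experiments: Validated on 1.4B and 7B parameter models with the Pile dataset. The 1.4B model trains up to 1.6× faster with nearly 5% higher downstream task accuracy, even when the pretrained model has 10x fewer parameters than the target model.
⭐ Contributions: Proposes the LET paradigm for accelerating LLM training, shows the value of shaping what is learned early in training, and demonstrates LET's generality and performance gains experimentally.
Abstract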
As Large Language Models (LLMs) achieve remarkable empirical success through scaling model and data size, pretraining has become increasingly critical yet computationally prohibitive, hindering rapid development. Despite the availability of numerous pretrained LLMs developed at significant computational expense, a fundamental real-world question remains underexplored: Can we leverage existing small pretrained models to accelerate the training of larger models? In this paper, we propose a Late-to-Early Training (LET) paradigm that enables LLMs to explicitly learn later knowledge in earlier steps and earlier layers. The core idea is to guide the early layers of an LLM during early training using representations from the late layers of a pretrained (i.e. late training phase) model. We identify two key mechanisms that drive LET's effectiveness: late-to-early-step learning and late-to-early-layer learning. These mechanisms significantly accelerate training convergence while robustly enhancing both language modeling capabilities and downstream task performance, enabling faster training with superior performance. Extensive experiments on 1.4B and 7B parameter models demonstrate LET's efficiency and effectiveness. Notably, when training a 1.4B LLM on the Pile dataset, our method achieves up to 1.6× speedup with nearly 5% improvement in downstream task accuracy compared to standard training, even when using a pretrained model with 10x fewer parameters than the target model.
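Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#meta-tokens #language models #pre-training #positional encoding
TL;DR: We pre-train language models with inserted meta-tokens, demonstrating strong performance and length generalization on synthetic tasks for long-context modeling. We explain these results based on positional encoding and implicit context compression.
🎯 Motivation: Transformer-based language models struggle to capture distant contextual information reliably, which limits long-context learning.
❓ Problem: Proposes meta-tokens with a dedicated meta-attention mechanism to improve the model's ability to capture distant context.
🔍 Key observations: An information-theoretic analysis shows that meta-tokens sharpen the positional encoding and act as content-based anchors that compress the preceding context and implicitly cache it.
🛠️ Method: Insert meta-tokens into the language model, add a meta-attention mechanism alongside causal multi-head attention, and pretrain.
📊 Data and experiments: Pretrained on fewer than 100B tokens; strong performance on a suite of long-context synthetic tasks, with length generalization up to 2× the context window after extension with YaRN.
⭐ Contributions: A simple, data-efficient pretraining method based on meta-tokens and meta-attention that achieves length generalization, together with new mechanistic insights.
Abstract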
Transformer-based language models (LMs) notably struggle to reliably capture distant contextual information. This work introduces a novel approach using meta-tokens (special tokens injected during pre-training) paired with a dedicated meta-attention mechanism to guide LMs to use these tokens. We pre-train a language model equipped with meta-attention in addition to causal multi-head attention on <100B tokens, achieving strong performance on a suite of synthetic tasks. Our method facilitates length generalization up to 2× the context window after extension with YaRN. We provide an information-theoretic analysis which reveals that meta-tokens *sharpen* the positional encoding, allowing them to operate as content-based anchors that compress preceding context and “cache” it within the meta-token. We empirically confirm this by visualizing model internals to study the residual stream. Together, our findings demonstrate that meta-tokens and meta-attention provide a simple, data-efficient pre-training method, grounded by new mechanistic insights into their role in enabling length generalization behavior.
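Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#factuality #tail knowledge #synthetic data #synthetic continued pretraining
TL;DR: We let the model generate self-learning strategies and train on them at scale to learn tail facts more consistently.
🎯 Motivation: Large language models store a great deal of knowledge, but learning and recalling specific facts is unreliable, and practitioners lack tools to ensure a body of knowledge is learned consistently.
❓ Problem: Proposes the Active Reading framework, which uses self-generated learning strategies to absorb tail knowledge systematically and make knowledge learning more consistent and reliable.
🔍 Key observations: What a model remembers depends largely on how prevalent a fact is in the training data and on other poorly understood factors, so knowledge acquisition varies widely.
🛠️ Method: With Active Reading, the model generates its own learning strategies for a given set of material and is trained on the resulting data at scale.
📊 Data and experiments: Expert 8B models trained by applying Active Reading to the source documents reach 66% on a Wikipedia-grounded subset of SimpleQA and 26% on FinanceBench; WikiExpert-8B is trained on 1 trillion generated tokens.
⭐ Contributions: Active Reading improves the consistency and accuracy of knowledge learned by LLMs; the released WikiExpert-8B outcompetes models with hundreds of billions of parameters on factual QA.
Abstract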
LLMs are known to store vast amounts of knowledge in their parametric memory.
However, learning and recalling facts from this memory is known to be unreliable, depending largely on the prevalence of particular facts in the training data and other factors which are poorly understood.
Practitioners are lacking tools which will allow them to ensure that the models learn a given body of knowledge reliably and consistently.
To this end, we propose Active Reading: a framework where we train models to study a given set of material with self-generated learning strategies.
First, we demonstrate models trained with Active Reading on expert domains absorb significantly more knowledge than vanilla finetuning and other data augmentations.
We train expert 8B models that achieve 66% on a Wikipedia-grounded subset of SimpleQA (+313% relative over vanilla finetuning) and 26% on FinanceBench (+160% relative over vanilla finetuning) by applying Active Reading to the source documents for each benchmark.
Finally, we show that Active Reading can be utilized at pre-training scale to build more factual models.
As a demonstration of this, we release WikiExpert-8B, a Wikipedia-expert model trained on 1 trillion generated tokens, which outcompetes models with hundreds of billions of parameters on factual QA.
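Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#Compression #Information Theory #Learning #Generalisation #LLMs #Interpretability
TL;DR: LLMs learn an optimal compression of the internet.
🎯 Motivation: The structure of the representational spaces of large language models (LLMs) is poorly understood, which limits interpretability and comparison with human learning.
❓ Problem: Proposes an information-theoretic framework that treats LLMs as lossy compression systems, in order to explain how they learn and to predict their performance.
🔍 Key observations: Over training, LLMs retain the information relevant to their objective and approach the Information Bottleneck bound on compression; different models compress differently, likely because of differences in data and training recipes.
🛠️ Method: Analyzes how the optimality of a model's compression and the information it retains predict downstream performance, relating representations to performance within one information-theoretic framework.
📊 Data and experiments: Evaluates an array of open-weights models on a wide range of benchmarks to test the relationship between compression optimality and performance.
⭐ Contributions: A scalable information-theoretic view that gives a unified account of how LLMs learn, improving interpretability and performance prediction.
Abstract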
Despite the increasing prevalence of large language models (LLMs), we still have a limited understanding of how their representational spaces are structured. This limits our ability to interpret how and what they learn or relate them to learning in humans. We argue LLMs are best seen as an instance of lossy compression, where over training they learn by retaining only information in their training data relevant to their objective(s). We show that pre-training results in models that are optimally compressed for next-sequence prediction, approaching the Information Bottleneck bound on compression. Across an array of open weights models, each compresses differently, likely due to differences in the data and training recipes used. However, even across different families of LLMs, the optimality of a model's compression, and the information present in it, can predict downstream performance across a wide array of benchmarks, letting us directly link representational structure to actionable insights about model performance. In the general case the work presented here offers a unified Information-Theoretic framing for how these models learn that is deployable at scale.
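For reference, the Information Bottleneck bound mentioned above is usually stated through the standard IB Lagrangian. The notation below ($X$ for the input, $Y$ for the prediction target, $Z$ for the learned representation, $\beta$ for the trade-off coefficient) is the textbook convention and is an assumption here, since the abstract does not fix symbols.

```latex
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)
```

The first term rewards compression (the representation keeps little information about the input), the second rewards retaining what is useful for prediction; a model is "optimally compressed" in this sense when it lies close to the curve traced out by the minimizers as $\beta$ varies.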
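Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#external memory #parametric memory
🎯 Motivation: Current ways of improving knowledge use in large language models trade off efficiency against capability: non-parametric retrieval-augmented generation on one side, parametric fine-tuning on the other.
❓ Problem: Proposes a lightweight parametric module that internalizes retrieval patterns, replacing high-latency access to external documents while avoiding the forgetting caused by fine-tuning.
🔍 Key observations: Non-parametric methods are slow at inference and integrate knowledge only shallowly; fine-tuning methods can degrade general capabilities and risk catastrophic forgetting.
🛠️ Method: Pretrain an MLP memory module to imitate the behavior of a $k$NN retriever and combine it with a Transformer decoder through probability interpolation, giving fully parametric access to retrieved knowledge.
📊 Data and experiments: 12.3% relative improvement on five question-answering benchmarks, 5.2 points absolute gain across nine general NLP tasks, up to 10 points fewer hallucinations on HaluEval, and 2.5× faster inference than RAG.
⭐ Contributions: A practical approach that combines efficient inference with effective knowledge access, bridging non-parametric and parametric methods and improving model performance and reliability.
Abstract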
Modern approaches to enhancing Large Language Models' factual accuracy and knowledge utilization face a fundamental trade-off: non-parametric retrieval-augmented generation (RAG) provides flexible access to external knowledge but suffers from high inference latency and shallow integration, while parametric fine-tuning methods like LoRA risk catastrophic forgetting and degraded general capabilities. In this work, we propose MLP Memory, a lightweight parametric module that learns to internalize retrieval patterns without explicit document access. By pretraining an MLP to imitate a $k$NN retriever's behavior on the entire pretraining dataset, we create a differentiable memory component that captures the benefits of retrieval-based knowledge access in a fully parametric form. Our architecture integrates this pretrained MLP Memory with Transformer decoders through simple probability interpolation, achieving 12.3% relative improvement on five question-answering benchmarks and 5.2 points absolute gain across nine general NLP tasks, while reducing hallucinations by up to 10 points on HaluEval. Moreover, MLP Memory delivers 2.5× faster inference than RAG with superior accuracy. Our findings show that learning retrieval patterns parametrically bridges the gap between efficient inference and effective knowledge access, offering a practical alternative to both RAG and fine-tuning approaches.
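The "simple probability interpolation" step can be sketched as a convex combination of two next-token distributions. This is a minimal illustration, assuming both distributions are plain lists over the same vocabulary; the function name `interpolate` and the mixing weight `lam` are hypothetical, and the abstract does not state the value or schedule of the weight the paper uses.

```python
def interpolate(p_lm, p_mem, lam=0.3):
    """Mix the decoder's distribution with the memory module's distribution.

    lam is the weight on the memory distribution (an assumed parameter).
    A convex combination of two distributions is again a distribution.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(p_lm) != len(p_mem):
        raise ValueError("distributions must share a vocabulary")
    return [lam * m + (1.0 - lam) * l for l, m in zip(p_lm, p_mem)]


p_lm = [0.7, 0.2, 0.1]   # Transformer decoder's next-token probabilities
p_mem = [0.1, 0.8, 0.1]  # memory module's next-token probabilities
mixed = interpolate(p_lm, p_mem, lam=0.5)
```

In this toy case the decoder alone prefers token 0, but after mixing, token 1 becomes the most likely, which is how the memory can correct a factual miss without touching the decoder's weights.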
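Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#generation #evaluation #memorization #novelty #benchmark #creativity
TL;DR: We quantify and evaluate the novelty in LLM generations as the harmonic mean of the n-gram originality and output quality, and analyse how model scale, training methods and inference methods impact it.
🎯 Motivation: As large language models (LLMs) are used for ideation and scientific discovery, evaluating the novelty of their output becomes important. Prior work measures only originality with respect to training data and ignores output quality.
❓ Problem: Proposes a novelty metric, the harmonic mean of originality and quality, that addresses the two failure modes of existing evaluation: original but low-quality output, and high-quality but memorized output.
🔍 Key observations: Increasing model scale and applying post-training improve novelty through higher output quality; improving the base model at the same scale mainly raises originality; inference-time methods have a much smaller effect and often raise originality at the expense of quality.
🛠️ Method: Defines novelty as the harmonic mean of the fraction of n-grams in a generation that are unseen in the training data and a task-specific quality score.
📊 Data and experiments: Experiments on three families of open-data models (OLMo, OLMo-2, and Pythia) and three creative tasks (story completion, poetry writing, creative tool use), comparing model generations with human-written text from the internet.
⭐ Contributions: Introduces a novelty metric that balances originality and quality, shows that model scale and post-training improve novelty, and identifies the limits of current inference-time elicitation methods as a direction for future work.
Abstract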
As large language models (LLMs) are increasingly used for ideation and scientific discovery, it is important to evaluate their ability to generate novel output. Prior work evaluates novelty as originality with respect to model training data, but original outputs can be of low quality. In contrast, non-expert judges more reliably score quality but may favor memorized outputs, limiting the reliability of human preference as a metric. We introduce a new novelty metric for LLM generations that balances originality and quality: the harmonic mean of the fraction of n-grams unseen during training and a task-specific quality score. Using this framework, we identify trends that affect the novelty of generations from three families of open-data models (OLMo, OLMo-2, and Pythia) on three creative tasks: story completion, poetry writing, and creative tool use. We find that model-generated text from some base LLMs is less novel than human-written text from the internet. However, increasing model scale (OLMo 1B to 7B to 32B) and post-training reliably improves novelty due to improvements in output quality. We also find that improving the base model at the same scale (e.g., OLMo 7B to OLMo-2 7B) leads to higher novelty due to higher originality. Finally, we observe that inference-time methods, such as prompting and providing novel in-context examples, have a much smaller effect on novelty, often increasing originality at the expense of quality. This highlights the need for further research into more effective elicitation strategies as we use models for creative applications.
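The metric described above can be written down directly: originality is the fraction of n-grams in a generation that do not occur in the training data, and novelty is the harmonic mean of originality and a quality score. The sketch below assumes whitespace tokenization, a quality score already scaled to [0, 1], and a tiny in-memory "training corpus"; these choices and all function names are illustrative assumptions rather than the paper's implementation.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def originality(generated_tokens, training_ngrams, n):
    """Fraction of the generation's n-grams that are unseen in training."""
    grams = ngrams(generated_tokens, n)
    if not grams:
        return 0.0
    unseen = sum(1 for g in grams if g not in training_ngrams)
    return unseen / len(grams)


def novelty(orig_score, quality_score):
    """Harmonic mean of originality and quality (both assumed in [0, 1])."""
    if orig_score + quality_score == 0:
        return 0.0
    return 2 * orig_score * quality_score / (orig_score + quality_score)


seen = set(ngrams("the cat sat on the mat".split(), 2))
generated = "the cat sat on a red mat".split()
orig_score = originality(generated, seen, 2)  # 3 of 6 bigrams are unseen
score = novelty(orig_score, 0.8)              # 0.8 is a made-up quality score
```

Because the harmonic mean is dominated by the smaller argument, a fully original but worthless output (quality 0) and a polished but memorized output (originality 0) both score 0.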
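Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#privacy auditing #natural identifiers #dataset inference #differential privacy #LLMs
TL;DR: We propose to leverage natural identifiers that are unique strings generated from random seeds, such as Ethereum addresses, to enable reliable privacy and data auditing of pretrained LLMs.
🎯 Motivation: Privacy auditing of large language models is difficult: existing methods either insert special data during training or depend on non-member datasets that are hard to obtain, which limits scalability and practicality.
❓ Problem: Proposes natural identifiers as a new tool for privacy and data auditing that avoids retraining the model or constructing special datasets.
🔍 Key observations: Natural identifiers such as cryptographic hashes and shortened URLs occur commonly in training data, and their format allows unlimited additional random samples to be generated from the same distribution, which makes post-hoc auditing possible.
🛠️ Method: Use random strings generated in the format of natural identifiers in place of inserted canary data, enabling differential privacy auditing and dataset inference without retraining or access to a non-member dataset.
📊 Data and experiments: Experiments show that natural identifiers support post-hoc privacy auditing and dataset inference without any retraining.
⭐ Contributions: First use of natural identifiers for LLM privacy auditing and dataset inference, making post-hoc audits far more scalable and practical at no additional training cost.
Abstract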
Assessing the privacy of large language models (LLMs) presents significant challenges. In particular, most existing methods for auditing *differential privacy* require the insertion of specially crafted canary data *during training*, making them impractical for auditing already-trained models without costly retraining. Additionally, *dataset inference*, which audits whether a suspect dataset was used to train a model, is *infeasible* without access to a private non-member held-out dataset. Yet, such held-out datasets are often unavailable or difficult to construct for real-world cases since they have to be from the same distribution (IID) as the suspect data. These limitations severely hinder the ability to conduct scalable, *post-hoc* audits. To enable such audits, this work introduces **natural identifiers (NIDs)** as a novel solution to the above-mentioned challenges. NIDs are structured random strings, such as cryptographic hashes and shortened URLs, naturally occurring in common LLM training datasets. Their format enables the generation of unlimited additional random strings from the same distribution, which can act as alternative canaries for audits and as same-distribution held-out data for dataset inference. Our evaluation highlights that indeed, using NIDs, we can facilitate post-hoc differential privacy auditing *without any retraining* and enable dataset inference for any suspect dataset containing NIDs without the need for a private non-member held-out dataset.
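The key property of natural identifiers is that fresh strings can be drawn from the same format at will. The sketch below generates strings that match only the surface format of an Ethereum address (`0x` followed by 40 hexadecimal characters) and filters out any that already occur in a suspect dataset. It is a minimal illustration: real addresses are derived from key pairs, and the generation procedure used in the paper is not described in the abstract, so the function names and the uniform-hex assumption are hypothetical.

```python
import random
import re

ADDRESS_PATTERN = re.compile(r"^0x[0-9a-f]{40}$")


def random_address_like(rng):
    """A random string in the surface format of an Ethereum address."""
    return "0x" + "".join(rng.choice("0123456789abcdef") for _ in range(40))


def sample_heldout(n, seen, seed=0):
    """Draw n fresh identifier-like strings that do not appear in `seen`.

    `seen` is the set of identifiers found in the suspect dataset; the
    returned strings can serve as same-format non-member examples.
    """
    rng = random.Random(seed)
    heldout = []
    while len(heldout) < n:
        candidate = random_address_like(rng)
        if candidate not in seen and candidate not in heldout:
            heldout.append(candidate)
    return heldout


members = {"0x" + "a" * 40}  # toy stand-in for identifiers in the training data
non_members = sample_heldout(5, members)
```

An auditor could then compare the model's loss on `members` with its loss on `non_members`; since both share one format, a systematic gap points to memorization rather than distribution shift.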
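Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#online learning #bandits #LLM routing #staged deployment #streaming model arrivals #regret bounds #budget/capacity constraints
TL;DR: StageRoute periodically redeploys LLMs and cost-aware routes queries online to track a streaming model frontier with near-optimal regret.
🎯 Motivation: As large language models (LLMs) are released and retired at a rapid pace, providers must manage a streaming inventory of models under a concurrency cap and per-query cost budgets.
❓ Problem: How to optimize stage-wise deployment and per-query routing online so as to minimize regret and use resources efficiently.
🔍 Key observations: Rapid model turnover makes the deployed set change from stage to stage, and conventional solutions struggle to make near-optimal decisions under cost and throughput constraints.
🛠️ Method: Proposes the hierarchical algorithm StageRoute: it selects up to $M_{\max}$ models for the next stage using reward upper-confidence and cost lower-confidence bounds, and routes each incoming query by solving a budget- and throughput-constrained bandit subproblem over the deployed set.
📊 Data and experiments: Across diverse workloads and under tight budgets, StageRoute tracks a strong oracle, supporting both the theory and its practical value.
⭐ Contributions: A new hierarchical online algorithm with a proven regret of $\tilde{\mathcal{O}}(T^{2/3})$ and a matching lower bound, establishing near-optimality, validated empirically on dynamic workloads.
Abstract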
The rapid pace at which new large language models (LLMs) appear, and older ones become obsolete, forces providers to manage a streaming inventory under a strict concurrency cap and per-query cost budgets. We cast this as an online decision problem that couples *stage-wise deployment* (at fixed maintenance windows) with *per-query routing* among live models. We introduce *StageRoute*, a hierarchical algorithm that (i) optimistically selects up to $M_{\max}$ models for the next stage using reward upper-confidence and cost lower-confidence bounds, and (ii) routes each incoming query by solving a budget- and throughput-constrained bandit subproblem over the deployed set. We prove a regret of $\tilde{\mathcal{O}}(T^{2/3})$ with a matching lower bound, establishing near-optimality, and validate the theory empirically: *StageRoute* tracks a strong oracle under tight budgets across diverse workloads.
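The optimistic selection in step (i) can be illustrated with a much simplified rule: compute an upper-confidence bound on each model's reward and a lower-confidence bound on its cost, then keep the `m_max` models with the best optimistic reward-per-cost ratio. Ranking by this ratio is an assumption made for brevity (the paper's selection step and its constrained routing subproblem are not specified in the abstract), as are the function name, the confidence bonus, and the reward range of [0, 1].

```python
import math


def select_models(stats, t, m_max, min_cost=1e-6):
    """Pick up to m_max models by optimistic reward-per-cost.

    stats maps a model name to (n_observations, mean_reward, mean_cost).
    Models never observed get an infinite bonus, so they are tried first.
    """
    scored = []
    for name, (n, mean_reward, mean_cost) in stats.items():
        if n > 0:
            bonus = math.sqrt(2 * math.log(max(t, 2)) / n)
        else:
            bonus = float("inf")
        reward_ucb = mean_reward + bonus               # optimistic reward
        cost_lcb = max(mean_cost - bonus, min_cost)    # optimistic cost
        scored.append((reward_ucb / cost_lcb, name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:m_max]]


stats = {
    "model_a": (100, 0.9, 1.0),
    "model_b": (100, 0.5, 1.0),
    "model_c": (0, 0.0, 0.0),   # newly arrived, no observations yet
}
deployed = select_models(stats, t=200, m_max=2)
```

Optimism in both directions is what lets a newly arrived model enter the deployed set at the next maintenance window instead of being starved of traffic.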
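Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#Large language models #anticipatory capacity
TL;DR: Next-ToBE introduces a soft-target distribution that activates and refines anticipatory capacity in LLMs, improving reasoning performance by looking ahead beyond the immediate next token.
🎯 Motivation: Autoregressive LLMs have some capacity to anticipate long-range future tokens, but how to systematically enhance and use this capacity to improve reasoning remains unclear.
❓ Problem: Introduces Next-ToBE to activate and refine the anticipatory capacity of LLMs and to relax the overly rigid target used in standard next-token prediction.
🔍 Key observations: The model's current softmax probabilities already correlate strongly with tokens in the future window, but this capacity is suppressed by the one-hot target.
🛠️ Method: Build a soft target that spans several future tokens, with weights set dynamically from temporal and semantic relevance and informed by the model's own predictive tendency, and use it for fine-tuning or pretraining.
📊 Data and experiments: Experiments show that Next-ToBE improves reasoning performance and is more memory- and compute-efficient than the MTP baselines.
⭐ Contributions: An effective strategy for extending the prediction horizon of LLMs that improves reasoning and offers a way to cultivate anticipatory capacity during pretraining.
Abstract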
Auto-regressive large language models (LLMs) exhibit a non-trivial capacity to "anticipate" long-range future tokens despite being trained to predict only one token at a time. Nevertheless, how to systematically profile, enhance and leverage such capacity to practically improve LLM reasoning performance remains unclear. In this paper, we propose **Next Token-Bag Exploitation (Next-ToBE)** to tackle this challenge. Next-ToBE quantifies an LLM's anticipatory capacity by measuring how well tokens in the future window are pre-captured by the model’s current softmax probabilities. This capacity is strongly correlated with LLM generative quality but often suppressed by the rigid one-hot objective in next-token prediction. To address this, we replace the one-hot target vector in next-token prediction with a soft target distribution spanning additional future tokens. Specifically, the immediate next token retains the highest importance, while more distant "look-ahead tokens" are also included to enrich supervision, with their importance dynamically determined by temporal and semantic relevance patterns to inject forward-looking pressure.
Besides, the fitting process emphasizes the model’s intrinsic anticipatory tendency, thus preserving the confidence and fidelity of the pre-trained model to improve training stability.
Overall, Next-ToBE not only effectively activates LLM anticipatory capacity through fine-tuning, yielding notable gains in reasoning performance with higher memory and computational efficiency than the MTP baselines, but also shows great potential in the pretraining setting by successfully cultivating this capacity from scratch. These results highlight its value as an effective strategy to extend the prediction horizon of LLMs, enabling them to see further and reason better.
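The idea of replacing the one-hot target with a soft distribution over upcoming tokens can be sketched with a fixed geometric decay over the look-ahead window. This is a deliberate simplification: in the paper the weights are determined dynamically from temporal and semantic relevance and from the model's own predictions, none of which is modeled here, and the function name `soft_target` and the `decay` parameter are hypothetical.

```python
def soft_target(future_tokens, vocab_size, decay=0.5):
    """Soft training target over the next few tokens.

    future_tokens[0] is the immediate next token; later entries are
    look-ahead tokens. With decay <= 0.5 the immediate next token always
    keeps the largest share, even if a later token repeats.
    """
    target = [0.0] * vocab_size
    for offset, token_id in enumerate(future_tokens):
        target[token_id] += decay ** offset   # weights 1, decay, decay^2, ...
    total = sum(target)
    return [value / total for value in target]


# Vocabulary of 6 token ids; the next three tokens are 3, then 1, then 4.
target = soft_target([3, 1, 4], vocab_size=6)
```

Training against such a target with cross-entropy still rewards predicting the next token most, while also giving partial credit for probability mass placed on tokens that are about to appear.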
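Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#language modeling #pondering language models #pretraining #continuous embedding space
TL;DR: We pretrain language models to ponder within a continuous embedding space.
🎯 Motivation: Humans ponder before articulating complex sentence elements, which enables deeper cognitive processing. This work brings that pondering process into language models.
❓ Problem: Existing language models emit each token after a single forward pass, with no chance to refine their prediction. The work proposes a self-supervised mechanism that lets the model refine its prediction iteratively in embedding space within a single generation step.
🔍 Key observations: Models that ponder in embedding space perform clearly better on downstream tasks, and the effect holds across architectures.
🛠️ Method: Within a single token generation step, the model does not sample an actual token; it forms a weighted sum of all token embeddings according to the predicted distribution and feeds that embedding back as input for another forward pass.
📊 Data and experiments: Uses the GPT-2, Pythia, and LLaMA architectures and evaluates on 9 downstream benchmarks. Pondering-enhanced models significantly outperform the official Pythia models, and smaller pondering models can surpass larger standard ones.
⭐ Contributions: A mechanism for pondering in embedding space that is learned through self-supervision without human annotations, improves efficiency and performance, and is shown to be general across architectures.
Abstract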
Humans ponder before articulating complex sentence elements, enabling deeper cognitive processing through focused effort.
In this work, we introduce this pondering process into language models by repeatedly invoking the forward process within a single token generation step. During pondering, instead of generating an actual token sampled from the prediction distribution, the model ponders by yielding a weighted sum of all token embeddings according to the predicted token distribution. The generated embedding is then fed back as input for another forward pass. We show that the model can learn to ponder in this way through self-supervised learning, without any human annotations.
Experiments across three widely used open-source architectures—GPT-2, Pythia, and LLaMA—and extensive downstream task evaluations demonstrate the effectiveness and generality of our method. On 9 downstream benchmarks, our pondering-enhanced Pythia models significantly outperform the official Pythia models. Notably, our PonderPythia models demonstrate remarkable effectiveness: PonderPythia-2.8B surpasses Pythia-6.9B and rivals Pythia-12B, while our PonderPythia-1B matches TinyLlama-1.1B, a model trained on 10 times more data.
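The pondering step described in the abstract, forming "a weighted sum of all token embeddings according to the predicted token distribution", is a single matrix-vector product. The sketch below shows it with plain lists; the function name `ponder_embedding` and the toy embedding table are illustrative assumptions, and the surrounding model (how the result is fed back and how many pondering passes are used) is not modeled.

```python
def ponder_embedding(probs, embedding_table):
    """Expected token embedding under the predicted distribution.

    probs[i] is the predicted probability of token i and
    embedding_table[i] is that token's embedding vector.
    """
    dim = len(embedding_table[0])
    return [
        sum(p * row[i] for p, row in zip(probs, embedding_table))
        for i in range(dim)
    ]


embedding_table = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]

# The model is undecided between tokens 0 and 1: the pondering input is
# a blend of both embeddings rather than a commitment to either token.
blended = ponder_embedding([0.5, 0.5, 0.0], embedding_table)

# A one-hot prediction reduces to an ordinary embedding lookup.
looked_up = ponder_embedding([0.0, 0.0, 1.0], embedding_table)
```

Because the blend is a differentiable function of the predicted probabilities, gradients can flow through the pondering passes during ordinary self-supervised training.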
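Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#Learning rate schedules #Large language models (LLMs)
🎯 Motivation: How the learning rate schedule used in pretraining affects performance after supervised fine-tuning (SFT) has not been studied enough.
❓ Problem: Examines the Warmup-Stable-Only (WSO) schedule, which uses no learning rate decay, and tests whether it improves performance after SFT.
🔍 Key observations: Learning rate decay lowers pretraining loss but leads models into sharper minima that adapt less well to downstream tasks; WSO preserves flatter minima and better downstream performance.
🛠️ Method: Keep the learning rate constant after warmup with no decay (WSO), compare against decay-based schedules, and analyze the performance gap between the pretraining and fine-tuning stages.
📊 Data and experiments: With 1B and 8B parameter models, across regimes including mid-training and over-training, WSO consistently outperforms decay-based schedules after SFT.
⭐ Contributions: Shows that learning rate decay can harm adaptability after pretraining, shows that WSO improves downstream adaptability, and gives practical guidance for training and model release.
Abstract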
We investigate the role of learning rate scheduling in the large-scale pre-training of large language models, focusing on its influence on downstream performance after supervised fine-tuning (SFT).
Decay-based learning rate schedulers are widely used to minimize pre-training loss.
However, despite their widespread use, how these schedulers affect performance after SFT remains underexplored.
In this paper, we examine Warmup-Stable-Only (WSO), which maintains a constant learning rate after warmup without any decay.
Through experiments with 1B and 8B parameter models, we show that WSO consistently outperforms decay-based schedulers in terms of performance after SFT, even though decay-based schedulers may exhibit better performance after pre-training.
The result also holds across different regimes with mid-training and over-training.
Loss landscape analysis further reveals that decay-based schedulers lead models into sharper minima, whereas WSO preserves flatter minima that support adaptability.
These findings indicate that applying LR decay to improve pre-training metrics may compromise downstream adaptability.
Our work also provides practical guidance for training and model release strategies, highlighting that pre-training models with WSO enhances their adaptability for downstream tasks.
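To make the schedule concrete, here is a minimal sketch of WSO next to a decay-based schedule. The abstract only states that WSO keeps the learning rate constant after warmup; the linear warmup shape, the linear decay used for comparison, and all function and parameter names are assumptions made for illustration.

```python
def wso_lr(step, peak_lr, warmup_steps):
    """Warmup-Stable-Only: linear warmup (assumed), then constant forever."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr

def warmup_stable_decay_lr(step, peak_lr, warmup_steps, total_steps, decay_steps):
    """Comparison schedule: same warmup and plateau, then linear decay (assumed).

    Assumes warmup_steps <= total_steps - decay_steps.
    """
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return wso_lr(step, peak_lr, warmup_steps)
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / decay_steps

# Hypothetical settings: 100 steps, 10 warmup steps, 20 decay steps.
wso = [wso_lr(s, 1e-3, 10) for s in range(100)]
wsd = [warmup_stable_decay_lr(s, 1e-3, 10, 100, 20) for s in range(100)]
```

The two schedules are identical until the decay phase begins; only the decay-based one lowers the learning rate at the end of pre-training.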
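Foundation/Frontier Models (incl. LLMs)
LLM Pre-training
#Pretrained Large Language Models #Knowledge Offloading
TL;DR:A new class of language models that offloads factual knowledge to an external database rather than encoding it in their parameters
🎯 Motivation: Conventional language models encode both linguistic patterns and factual knowledge in their parameters, which makes specific facts hard to inspect, verify, or update. A more transparent, editable representation of knowledge is needed.
❓ Problem: Externalize factual knowledge to a database so the model depends less on facts stored in its parameters, making that knowledge easier to manage.
🔍 Analysis: Because knowledge in existing models is entangled in the parameters, individual facts cannot be directly edited or verified, which is a poor fit for knowledge that changes.
🛠️ Method: Introduces Limited Memory Language Models (LMLM). During pre-training, externally retrieved factual values are masked from the training loss, so the model learns to look facts up rather than memorize them.
📊 Data & Experiments: On standard benchmarks, LMLMs achieve competitive performance compared to significantly larger LLMs.
⭐ Contributions: A new class of language models that pairs the model with an external knowledge base that is explicit, editable, and verifiable.
Abstract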
Neural language models are black boxes: both linguistic patterns and factual knowledge are distributed across billions of opaque parameters. This entangled encoding makes it difficult to reliably inspect, verify, or update specific facts. We introduce Limited Memory Language Models (LMLM), a new class of language models that externalizes factual knowledge to an external database during pre-training rather than memorizing it. Our pre-training approach strategically masks externally retrieved factual values from the training loss, thereby teaching the model to perform targeted lookups rather than relying on memorization in model weights. Our experiments demonstrate that LMLMs achieve competitive performance compared to significantly larger LLMs on standard benchmarks, while offering the advantages of explicit, editable, and verifiable knowledge bases.
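The loss-masking idea can be illustrated with a small sketch: tokens that were filled in from the external database contribute nothing to the loss. The abstract does not describe the data format, so the boolean mask and the function name below are hypothetical.

```python
def masked_nll(token_log_probs, is_retrieved_value):
    """Average negative log-likelihood over tokens that are NOT retrieved facts.

    token_log_probs: log-probability the model assigned to each target token.
    is_retrieved_value: True where the token came from an external lookup
        (hypothetical mask; such positions are excluded from the loss).
    """
    kept = [
        -log_prob
        for log_prob, masked in zip(token_log_probs, is_retrieved_value)
        if not masked
    ]
    if not kept:
        return 0.0
    return sum(kept) / len(kept)

# Three target tokens; the middle one is a retrieved factual value.
loss = masked_nll([-1.0, -2.0, -3.0], [False, True, False])
```

Because the masked positions never produce a gradient, the model is not pushed to store those values in its weights.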
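Foundation/Frontier Models (incl. LLMs)
LLM Pre-training
#scaling laws #data efficiency #pre-training
TL;DR:Since compute grows faster than the web, we design simple recipes that improve the asymptote of compute scaling laws to be 5x data efficient, offering better performance with sufficient compute.
🎯 Motivation: Compute is growing faster than the supply of web text, so the paper asks how to pre-train under a fixed data budget with no compute constraint.
❓ Problem: Existing data-constrained recipes (more epochs, more parameters) overfit; the goal is recipes with a better scaling-law asymptote and therefore higher data efficiency.
🔍 Analysis: Increasing epoch count or parameter count alone overfits. With tuned regularization (the optimal weight decay is 30x larger than standard practice), loss decreases monotonically following a power law in parameter count. Ensembling independently trained models reaches a significantly lower loss asymptote than the regularized recipe alone.
🛠️ Method: Combine epoching, regularization, parameter scaling, and ensemble scaling, judge each recipe by the asymptote of its scaling law, and distill the ensemble into a smaller student model.
📊 Data & Experiments: At 200M tokens, the best combined recipe is 5.17x more data-efficient than the baseline, and data scaling laws predict that the gain persists at higher token budgets. A student 8x smaller retains 83% of the ensembling benefit, and pre-training evals improve by 9%.
⭐ Contributions: Simple algorithmic recipes that make pre-training substantially more data-efficient in a compute-rich setting, an evaluation approach based on scaling-law asymptotes, and evidence that the gains transfer to downstream benchmarks.
Abstract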
Since compute grows much faster than web text available for language model pre-training, we ask how one should approach pre-training under fixed data and no compute constraints. We first show that existing data-constrained approaches of increasing epoch count and parameter count overfit, and we improve upon such recipes by tuning regularization, finding that the optimal weight decay is $30\times$ larger than standard practice. Since our regularized recipe monotonically decreases loss following a power law in parameter count, we estimate its best possible performance via the \textbf{asymptote} of its scaling law rather than the performance at a fixed compute budget. We then identify that ensembling independently trained models achieves a significantly lower loss asymptote than the regularized recipe. Our best intervention combining epoching, regularization, parameter scaling, and ensemble scaling achieves an asymptote at 200M tokens using $5.17\times$ less data than our baseline, and our data scaling laws predict that this improvement persists at higher token budgets. We find that our data efficiency gains can be realized at smaller parameter counts as we can distill an ensemble into a student model that is 8$\times$ smaller and retains $83$% of the ensembling benefit. Finally, our interventions designed for validation loss generalize to downstream benchmarks, achieving a $9$% improvement for pre-training evals. Our results show that simple algorithmic improvements can enable significantly more data-efficient pre-training in a compute-rich future.
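Foundation/Frontier Models (incl. LLMs)
LLM Pre-training
#Language Models #Pre-training #Reasoning #Evaluation #Efficiency
TL;DR:We enable small proxy models to reliably predict large model reasoning performance using next-token prediction on reasoning traces with task-aligned weighting, dramatically reducing pre-training recipe search cost.
🎯 Motivation: Pre-training LLMs is prohibitively expensive, so recipes are tuned on small proxy models. Reasoning, however, is an emergent capability that appears reliably only at larger sizes (often above 7B parameters), so small models give unreliable signals.
❓ Problem: Introduces rBridge, which lets small proxy models (≤1B parameters) predict the reasoning performance of large models.
🔍 Analysis: Small proxies become predictive when their evaluation signal aligns more closely with both the pre-training objective and the target task.
🛠️ Method: Weight negative log-likelihood by task alignment, using reasoning traces from frontier models as gold labels.
📊 Data & Experiments: rBridge cuts dataset ranking cost by over 100x relative to the best baseline, achieves the strongest correlation across six reasoning benchmarks at 1B to 32B scale, and transfers predictive relationships across pre-training recipes at 1B to 7B scale.
⭐ Contributions: A practical, lower-cost way to explore reasoning-oriented pre-training by bridging small proxies and large-model reasoning performance.
Abstract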
Given the prohibitive cost of pre-training large language models, it is essential to leverage smaller proxy models to optimize recipes before scaling up. However, this approach becomes challenging for reasoning capabilities, which exhibit emergent behavior that only appears reliably at larger model sizes, often exceeding 7B parameters. To address this, we introduce rBridge, showing that small proxies ($\leq$1B) can effectively predict large-model reasoning by aligning more closely with (1) the pre-training objective and (2) the target task. rBridge achieves this by weighting negative log-likelihood with task alignment, using reasoning traces from frontier models as gold labels. In our experiments, rBridge (i) reduces dataset ranking costs by over 100$\times$ relative to the best baseline, (ii) achieves the strongest correlation across six reasoning benchmarks at 1B to 32B scale, and (iii) transfers predictive relationships across pre-training recipes at 1B to 7B scale. These findings indicate that rBridge offers a practical path for exploring reasoning-oriented pre-training at lower cost.
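Foundation/Frontier Models (incl. LLMs)
LLM Pre-training
#training re-evaluation curve #data curriculum / data placement #large language model (LLM) pre-training #AdamW EMA timescale #learning-rate schedules #tokens-per-parameter ratio
TL;DR:We evaluate fully-trained LLMs on their original training data, measuring retention across steps; a predictive model of the resulting "re-evaluation curve" identifies optimal spots for high-quality data, surpassing default end-of-training placement.
🎯 Motivation: Data curricula are central to successful LLM training, but the principles governing optimal data placement remain unclear.
❓ Problem: Introduces the training re-evaluation curve (TREC), a diagnostic that measures how well a trained model retains data as a function of when the data was seen, and predicts it in advance to guide data placement.
🔍 Analysis: TRECs for models from 111M to 3.9B parameters show that placing high-quality data at low points on the TREC significantly improves performance.
🛠️ Method: Predict the TREC from AdamW's implicit EMA coefficients to design curricula before training, and use predicted TRECs to analyze data placement in published training recipes.
📊 Data & Experiments: Aligning high-quality data with TREC minima improves continual pre-training of a 3.9B-parameter LLM trained on 900B tokens.
⭐ Contributions: TREC as a guide for curriculum design, together with a way to predict it before training rather than only observing it afterward.
Abstract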
Data curriculums have become central to successful LLM training, yet principles governing optimal data placement remain unclear. We introduce the *training re-evaluation curve (TREC)*, a diagnostic that retrospectively evaluates training batches *using the final model weights*. The TREC characterizes how well a trained model retains training data as a function of *when* the data was encountered during training. Analyzing TRECs for models from 111M to 3.9B parameters, we show that placing high-quality data at low points on the TREC significantly improves performance. Importantly, while a TREC is initially observable only after training, we demonstrate it can be *predicted in advance* from AdamW’s implicit EMA coefficients, enabling proactive curriculum design. By predicting TRECs for published training recipes, we explain prior ablations and reveal suboptimal data placements. We also align high-quality data with TREC minima in order to improve continual pre-training of a 3.9B-parameter LLM trained on 900B tokens.
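The abstract does not give the prediction formula, but one way to read "AdamW's implicit EMA coefficients" is through the decoupled weight-decay update $\theta_t = (1 - \eta_t \lambda)\,\theta_{t-1} - \eta_t u_t$, which makes the final weights an exponential moving average of per-step updates. The sketch below computes the weight each step receives in the final parameters under that reading; it is an assumption for illustration, not the paper's predictor, and the function name is hypothetical.

```python
def adamw_ema_coefficients(lrs, weight_decay):
    """Weight of each step's update in the final weights under AdamW decay.

    With alpha_t = lr_t * weight_decay, step i contributes
    alpha_i * prod_{j > i} (1 - alpha_j). Returns (coefficients, init_weight),
    where init_weight is the remaining weight on the initial parameters.
    """
    coeffs = [0.0] * len(lrs)
    carry = 1.0  # product of (1 - alpha_j) over all later steps
    for i in range(len(lrs) - 1, -1, -1):
        alpha = lrs[i] * weight_decay
        coeffs[i] = alpha * carry
        carry *= 1.0 - alpha
    return coeffs, carry

# Hypothetical constant schedule: later steps carry more weight.
coeffs, init_weight = adamw_ema_coefficients([0.1, 0.1, 0.1], 1.0)
```

The coefficients and the initial-weight term always sum to one, so the learning-rate schedule and weight decay together determine how strongly each part of training is represented at the end.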
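Foundation/Frontier Models (incl. LLMs)
LLM Pre-training
#language models #large language models #scaling laws #evaluations #generative evaluations #sampling
TL;DR:Scaling laws for generative evals of language models during pretraining
🎯 Motivation: Neural scaling laws have driven the rapid growth of language models, but the scaling behavior of generative evaluations such as mathematical problem-solving and software engineering remains under-explored.
❓ Problem: Compare pretraining scaling laws for fitting pass-at-$k$ on generative evaluations, predict pass-at-$k$ of the most expensive model from cheaper ones, and analyze how the new hyperparameter affects scaling behavior.
🔍 Analysis: Generative evaluations introduce new hyperparameters (here, $k$) that modulate both the scaling-law parameters and the predictability of performance; the laws differ markedly in parameter stability and slightly in predictive performance.
🛠️ Method: Three pretraining scaling laws that differ in their covariates: (1) pretraining compute, (2) model parameters and pretraining tokens, (3) log likelihoods of gold reference solutions.
📊 Data & Experiments: The compute and parameters+tokens laws stabilize only over the last $\sim 1.5$ to $2.5$ orders of magnitude, whereas the gold reference likelihood law converges across $\sim 5$ orders. All three predict comparably, with the compute law slightly worse for small $k$ and the gold reference law slightly worse for large $k$.
⭐ Contributions: A framework for forecasting generative performance, with a proof that the compute scaling law is the compute-optimal envelope of the parameters-and-tokens law.
Abstract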
Neural scaling laws have driven the field's ever-expanding exponential growth in parameters, data and compute. While scaling behaviors for pretraining losses and discriminative benchmarks are well established, generative benchmarks such as mathematical problem-solving or software engineering remain under-explored.
We propose and evaluate three different pretraining scaling laws for fitting pass-at-$k$ on generative evaluations and for predicting pass-at-$k$ of the most expensive model using cheaper models.
Our three scaling laws differ in the covariates used: (1) pretraining compute, (2) model parameters and pretraining tokens, (3) log likelihoods of gold reference solutions.
First, we demonstrate that generative evaluations introduce new hyperparameters (in our setting, $k$) that act as a control lever for scaling behavior, modulating both the scaling law parameters and the predictability of performance.
Second, we identify a stark difference in parameter stability: while the compute and parameters+tokens laws stabilize for only the last $\mathord{\sim}1.5\mathord{-}2.5$ orders of magnitude, the gold reference likelihood law is uniquely stable, converging across $\mathord{\sim}5$ orders. Third, in terms of predictive performance, we find all three scaling laws perform comparably, although the compute law predicts slightly worse for small $k$ and the gold reference law predicts slightly worse for large $k$. Finally, we establish a theoretical connection, proving that the compute scaling law emerges as the compute-optimal envelope of the parameters-and-tokens law. Our framework provides researchers and practitioners with insights and methodologies to forecast generative performance, accelerating progress toward models that can reason, solve, and create.
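For readers unfamiliar with the metric, pass-at-$k$ is commonly estimated from $n$ samples of which $c$ are correct using the unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$. The abstract does not say which estimator the authors use, so the sketch below is the standard formulation, shown only to make the role of $k$ concrete.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct ones.

    Probability that at least one of k samples drawn without replacement
    from the n generated samples is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples and 5 correct, a single draw succeeds half the time.
single_draw = pass_at_k(10, 5, 1)
```

Raising $k$ pushes the estimate toward 1 for any problem with at least one correct sample, which is one reason $k$ changes the shape of the fitted scaling curves.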
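Foundation/Frontier Models (incl. LLMs)
LLM Pre-training
#Large language models #pretraining #memory #long-tail #knowledge #reasoning #forgetting
TL;DR:We pretrain transformers with hierarchical parametric memories that automatically store long-tail world knowledge, fetched by context at inference time to boost small language model performance.
🎯 Motivation: Language models currently gain performance by scaling parameters, but compressing all world knowledge into parameters is unnecessary (only a fraction is used per prompt) and impractical for edge devices with limited inference-time memory and compute.
❓ Problem: Design a memory-augmented architecture and pretraining strategy so that small models can use long-tail knowledge efficiently under inference-time resource limits.
🔍 Analysis: Small models with hierarchical memories match regular models with far more parameters, and the hierarchical feed-forward memories work robustly across transformer architectures, whether added during pretraining or post-hoc.
🛠️ Method: Large hierarchical parametric memory banks store long-tail world knowledge, while the small language model acts as an anchor for common knowledge and general reasoning; memory is fetched by context.
📊 Data & Experiments: In trillion-token-scale experiments, a 160M-parameter model augmented with an 18M-parameter memory fetched from a 4.6B memory bank performs comparably to a regular model with more than 2x the parameters; parametric memories are scaled to over 21B parameters.
⭐ Contributions: An efficient mechanism for storing long-tail knowledge that lowers the parameter count needed in the base model while preserving performance, validated across transformer architectures.
Abstract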
The impressive performance gains of modern language models currently rely on scaling parameters: larger models store more world knowledge and reason better. Yet compressing all world knowledge into parameters is unnecessary, as only a fraction is used per prompt, and impractical for edge devices with limited inference-time memory and compute. We address this shortcoming with a memory-augmented architecture and a pretraining strategy aligned with existing hardware paradigms. We introduce small language models that access large hierarchical parametric memory banks encoding world knowledge. Our pretraining learns to store long-tail world knowledge in the memory parameters, while the small language model acts as an anchor capturing common knowledge and general reasoning abilities. Through trillion-token-scale experiments, we show significant gains: a 160M-parameter model augmented with an 18M-parameter memory fetched from a 4.6B memory bank obtains comparable performance to a regular model with more than 2x the parameters. We study the optimal type and size of parametric memories in transformers, scaling them to over 21B parameters. We find that our proposed hierarchical feed-forward memories work robustly across transformer architectures, whether added during pretraining or post-hoc.
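Foundation/Frontier Models (incl. LLMs)
LLM Pre-training
#LLM #large language models #pretraining #data filtering #data pruning
TL;DR:Token frequency stats can replace perplexity for LLM data filtering: 1000× faster, equally effective.
🎯 Motivation: LLM pretraining needs effective data filtering, but perplexity (PPL)-based filtering is slow and unreliable on noisy or out-of-distribution samples.
❓ Problem: Proposes a prior-based filtering method that replaces perplexity with corpus-level term frequency statistics, cutting cost while improving filtering quality.
🔍 Analysis: PPL-based filtering performs well but is costly and unreliable on noisy and out-of-distribution samples; token priors estimated from term frequencies serve as a fast proxy for PPL.
🛠️ Method: Estimate token priors from corpus-level term frequencies and filter documents by the mean and standard deviation of their token priors, with no model inference.
📊 Data & Experiments: The prior-based filter achieves the highest average performance across 20 downstream benchmarks, ahead of PPL-based filtering, and also applies to code, math, and multilingual corpora.
⭐ Contributions: A simple yet powerful filter that cuts time cost by over 1000x compared to PPL-based filtering while achieving the highest average performance.
Abstract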
As large language models (LLMs) are pretrained on massive web corpora, careful selection of data becomes essential to ensure effective and efficient learning. While perplexity (PPL)-based filtering has demonstrated strong performance, it suffers from drawbacks: substantial time costs and inherent unreliability of the model when handling noisy or out-of-distribution samples. In this work, we propose a simple yet powerful alternative: a prior-based data filtering method that estimates token priors using corpus-level term frequency statistics, inspired by linguistic insights on word roles and lexical density. Our approach filters documents based on the mean and standard deviation of token priors, serving as a fast proxy to PPL while requiring no model inference. Despite its simplicity, the prior-based filter achieves the highest average performance across 20 downstream benchmarks, while reducing time cost by over 1000× compared to PPL-based filtering. We further demonstrate its applicability to symbolic languages such as code and math, and its dynamic adaptability to multilingual corpora without supervision.
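A minimal sketch of the filtering idea follows: estimate each token's prior from corpus-wide frequency, then keep documents whose mean and standard deviation of token priors fall inside chosen bounds. The pre-tokenized input, the threshold values, and every function name are assumptions; the abstract does not specify how the bounds are chosen.

```python
from collections import Counter
from statistics import mean, pstdev

def token_priors(corpus):
    """Relative frequency of each token across the whole corpus."""
    counts = Counter(tok for doc in corpus for tok in doc)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

def prior_stats(doc, priors):
    """Mean and (population) standard deviation of a document's token priors."""
    values = [priors.get(tok, 0.0) for tok in doc]
    return mean(values), pstdev(values)

def prior_filter(corpus, mean_bounds, std_bounds):
    """Keep documents whose prior statistics lie within the given bounds."""
    priors = token_priors(corpus)
    kept = []
    for doc in corpus:
        if not doc:
            continue  # skip empty documents
        mu, sigma = prior_stats(doc, priors)
        if mean_bounds[0] <= mu <= mean_bounds[1] and std_bounds[0] <= sigma <= std_bounds[1]:
            kept.append(doc)
    return kept

# Toy corpus of pre-tokenized documents with hypothetical thresholds.
corpus = [["a", "a", "b"], ["a", "c"]]
kept = prior_filter(corpus, mean_bounds=(0.45, 1.0), std_bounds=(0.0, 1.0))
```

Everything here is counting and arithmetic over the corpus, which is why such a filter needs no model inference.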
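Foundation/Frontier Models (incl. LLMs)
LLM Pre-training
#Reinforcement Learning #Pretraining #Reasoning #Large Language Models
TL;DR:A verifier‑free reinforcement pretraining framework for language modeling.
🎯 Motivation: Large reasoning models are typically pre-trained with next-token prediction loss, and reinforcement learning is introduced only in the last phase of post-training; this paradigm may not be optimal.
❓ Problem: Proposes a pretraining objective that brings exploration, the core spirit of reinforcement learning, into pretraining so that independent thinking behavior is learned earlier.
🔍 Analysis: Treating chain-of-thought as an exploratory action and rewarding it by the information gain it provides for predicting future tokens encourages the model to think before predicting what comes next.
🛠️ Method: The reward is the increase in log-likelihood of the next token when conditioning on both the context and a sampled reasoning chain, compared with the context alone. This verifier-free signal is dense and can be applied to the full document stream.
📊 Data & Experiments: On Qwen3-1.7B-Base, RLP lifts the overall average on an eight-benchmark math-and-science suite by 19%, with the largest gains on reasoning-heavy tasks such as AIME25 and MMLU-Pro; on NVIDIA-Nemotron-Nano-12B-v2-Base, the overall average rises from 42.81% to 61.32%.
⭐ Contributions: RLP reframes reinforcement learning for reasoning as a pretraining objective on ordinary text, with gains across architectures and model sizes.
Abstract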
The dominant paradigm for training large reasoning models starts with pre-training using next-token prediction loss on vast amounts of data. Reinforcement learning, while powerful in scaling reasoning, is introduced only as the very last phase of post-training, preceded by supervised fine-tuning. While dominant, is this an optimal way of training? In this paper, we present RLP, an information-driven reinforcement pretraining objective, that brings the core spirit of reinforcement learning---exploration---to the last phase of pretraining. The key idea is to treat chain-of-thought as an exploratory action, with rewards computed based on the information gain it provides for predicting future tokens. This training objective essentially encourages the model to think for itself before predicting what comes next, thus teaching an independent thinking behavior earlier in the pretraining. More concretely, the reward signal measures the increase in log-likelihood of the next token when conditioning on both context and a sampled reasoning chain, compared to conditioning on context alone. This approach yields a verifier-free dense reward signal, allowing for efficient training for the full document stream during pretraining. Specifically, RLP reframes reinforcement learning for reasoning as a pretraining objective on ordinary text, bridging the gap between next-token prediction and the emergence of useful chain-of-thought reasoning. Pretraining with RLP on Qwen3-1.7B-Base lifts the overall average across an eight‑benchmark math‑and‑science suite by 19%. With identical post‑training, the gains compound, with the largest improvements on reasoning‑heavy tasks such as AIME25 and MMLU‑Pro. Applying RLP to the hybrid NVIDIA-Nemotron-Nano-12B-v2-Base increases the overall average from 42.81% to 61.32% and raises the average on scientific reasoning by 23%, demonstrating scalability across architectures and model sizes.
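The reward described above is a difference of two log-likelihoods, which a short sketch makes explicit. The function name and the example probabilities are illustrative only; how RLP obtains the two probabilities and uses the reward in training is not covered here.

```python
import math

def information_gain_reward(p_with_thought, p_without_thought):
    """Log-likelihood gain on the next token from conditioning on a thought.

    p_with_thought: probability of the true next token given the context
        and a sampled reasoning chain.
    p_without_thought: probability of the same token given the context alone.
    """
    return math.log(p_with_thought) - math.log(p_without_thought)

# A thought that doubles the probability of the true token earns log(2).
reward = information_gain_reward(0.5, 0.25)
```

The reward is positive when the reasoning chain helps predict the next token, zero when it makes no difference, and negative when it hurts, so no external verifier is needed.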
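Foundation/Frontier Models (incl. LLMs)
LLM Pre-training
#Large Language Models #Data Augmentation #Synthetic Pretraining Data
🎯 Motivation: Scaling LLMs is hampered by data scarcity and by the performance degradation caused by excessive data repetition during training.
❓ Problem: Generate diverse, contextually rich variations of existing text to relieve the repetition bottleneck and support more effective scaling.
🔍 Analysis: Prior augmentation approaches rely on complex, predefined seed systems. The analysis examines the role of synthesis principles in generation quality and the nuances of evaluating model capabilities with standard loss metrics.
🛠️ Method: The Massive Genre-Audience (MGA) reformulation method adaptively generates genre-audience pairs and systematically reformulates existing corpora into diverse variations.
📊 Data & Experiments: Builds the 770-billion-token MGACorpus; MGA scales better than data repetition and upsampling in both data budget and model size (up to 13B parameters).
⭐ Contributions: MGA as a reliable way to substantially augment training datasets, alleviating repetition bottlenecks and enabling more efficient scaling of LLMs.
Abstract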
Despite the impressive capabilities of large language models across various tasks, their continued scaling is severely hampered not only by data scarcity but also by the performance degradation associated with excessive data repetition during training.
To overcome this critical bottleneck, we introduce the Massive Genre-Audience (MGA) reformulation method, a framework designed to augment corpora in a way that supports more effective model performance scaling.
Instead of relying on complex, predefined seed systems, MGA systematically reformulates existing corpora into diverse, contextually-rich variations by adaptively generating genre-audience pairs.
We present this framework and the resulting 770 billion token MGACorpus, created as a practical instantiation of our methodology.
We experimentally validate MGA's core benefits by demonstrating superior scaling properties, in terms of both model size and data budget, against data repetition and upsampling (up to 13B parameters).
Furthermore, our comprehensive analysis investigates the role of synthesis principles in generation quality and reveals nuances in evaluating model capabilities using standard loss metrics.
Our work shows that a systematic framework like MGA provides a reliable pathway to substantially augment training datasets, effectively alleviating repetition bottlenecks and enabling more efficient scaling of large language models.
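Foundation/Frontier Models (incl. LLMs)
LLM Pre-training
#Data Selection #Data Mixing #Data Curation #Large Language Models
🎯 Motivation: Existing LLM data curation methods operate offline, detached from training. This adds engineering overhead and requires re-running the whole pipeline when the model or task changes. Offline methods also change data size through hard filtering or resampling, often sacrificing diversity and harming generalization.
❓ Problem: Recast data curation as an online reweighting problem in which sample importance is adjusted dynamically during training rather than through static pre-processing, keeping the number of training samples unchanged.
🔍 Analysis: Offline curation (selection, mixing) enforces a static data distribution and is brittle under model or task shifts; filtering or resampling can reduce diversity and limit cross-task generalization. Online reweighting avoids these limitations.
🛠️ Method: ADAPT reweights training samples through adaptive per-sample learning rates guided by similarity-based quality signals. It acts as an implicit curriculum learner, shifting focus from coarse-grained patterns to fine-grained semantic distinctions as the model evolves.
📊 Data & Experiments: On instruction tuning and large-scale pretraining, ADAPT consistently outperforms offline selection/mixing and prior online methods under equal FLOPs, with stronger cross-benchmark generalization.
⭐ Contributions: Frames data curation as online reweighting and proposes ADAPT, which improves generalization without changing the number of training samples.
Abstract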
Data curation is a critical yet under-explored area in large language model (LLM) training. Existing methods, such as data selection and mixing, operate in an offline paradigm, detaching themselves from training. This separation introduces engineering overhead and makes the curation brittle: the entire pipeline must be re-run under model/task shifts. Moreover, offline methods alter data size through hard filtering or resampling, often sacrificing data diversity and harming generalization. We propose to rethink data curation as an online reweighting problem, where sample importance is dynamically adjusted during training via loss weighting rather than static pre-processing. Specifically, we introduce ADAPT (Adaptive Data reweighting for Pretraining and FineTuning), a dynamic online framework that reweights training samples with adaptive per-sample learning rates guided by similarity-based quality signals, without changing the number of training samples. Unlike offline methods that enforce a static data distribution, ADAPT acts as an implicit curriculum learner, progressively shifting focus from coarse-grained patterns to fine-grained semantic distinctions as the model evolves. Experiments on both instruction tuning and large-scale pretraining show that ADAPT consistently outperforms offline selection/mixing and prior online methods, achieving stronger cross-benchmark generalization under equal FLOPs.
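The general shape of online reweighting can be sketched as follows: every sample stays in the batch, and a per-sample weight derived from a quality signal scales its loss. The abstract does not say how ADAPT maps similarity scores to weights, so the temperature-scaled softmax below, the normalization to a mean weight of one, and all names are assumptions rather than the paper's method.

```python
import math

def sample_weights(similarities, temperature=1.0):
    """Map similarity-based quality scores to weights with mean 1 (assumed form)."""
    m = max(similarities)
    exps = [math.exp((s - m) / temperature) for s in similarities]
    total = sum(exps)
    n = len(similarities)
    return [n * e / total for e in exps]

def reweighted_loss(losses, weights):
    """Batch loss with per-sample weights; no sample is dropped or duplicated."""
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)

# Hypothetical batch: the second sample has a higher quality signal.
weights = sample_weights([0.2, 0.8], temperature=0.5)
batch_loss = reweighted_loss([1.0, 3.0], weights)
```

Scaling a sample's loss by a weight has the same effect on the gradient as scaling its learning rate, which is one way to interpret "adaptive per-sample learning rates".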
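Foundation/Frontier Models (incl. LLMs)
LLM Pre-training
#data #datasets #pretraining #pre-training #retrieval #llm #llms #test time compute
🎯 Motivation: LLMs acquire broad task-solving ability from massive pre-training corpora, but how efficiently pre-training extracts knowledge from the data has received little study.
❓ Problem: Quantify how much dataset value pre-training leaves behind by reusing pre-training data at test time, and how this changes across scale.
🔍 Analysis: Pre-training and then retrieving from standard, largely open-sourced datasets yields significant accuracy gains, which persist through decontamination.
🛠️ Method: Combine retrieval-augmented generation with test-time compute, using the extra compute to parse the retrieved context.
📊 Data & Experiments: Accuracy improves on MMLU, Math-500, and SimpleQA. On MMLU, retrieval acts as a roughly 5x compute multiplier over pre-training alone, and additional test-time compute gives a 10 percentage point improvement for the public LLaMA 3.1 8B model.
⭐ Contributions: Evidence that current pre-training methods do not make full use of the information in existing datasets, leaving significant room for progress.
Abstract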
Large language models learn from their vast pre-training corpora, gaining the ability to solve an ever increasing variety of tasks; yet although researchers work to improve these datasets, there is little effort to understand how efficient the pre-training apparatus is at extracting ideas and knowledge from the data. In this work, we use retrieval augmented generation along with test-time compute as a way to quantify how much dataset value was left behind by the process of pre-training, and how this changes across scale. We demonstrate that pre-training then retrieving from standard and largely open-sourced datasets results in significant accuracy gains in MMLU, Math-500, and SimpleQA, which persist through decontamination. For MMLU we observe that retrieval acts as a ~5x compute multiplier versus pre-training alone. We show that these results can be further improved by leveraging additional compute at test time to parse the retrieved context, demonstrating a 10 percentage point improvement on MMLU for the public LLaMA 3.1 8B model. Overall, our results suggest that today's pre-training methods do not make full use of the information in existing pre-training datasets, leaving significant room for progress.
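Foundation/Frontier Models (incl. LLMs)
LLM Pre-training
#machine unlearning #large language models
TL;DR:We propose a method for unlearning by leveraging model checkpoints during training, achieving improved performance for data-level unlearning
🎯 Motivation: LLMs may be trained on private data, copyrighted material, or inaccurate information, and removing its influence through complete retraining is computationally prohibitive.
❓ Problem: Remove the influence of specific datapoints at low computational cost while leaving the rest of the model intact.
🔍 Analysis: Precisely unlearning the influence of data on a large language model has proven to be a major challenge for existing algorithms.
🛠️ Method: MSA (Model State Arithmetic) uses checkpoints saved during pretraining to estimate and counteract the effect of the targeted datapoints.
📊 Data & Experiments: Across multiple benchmarks, models, and evaluation metrics, MSA is competitive with existing unlearning algorithms and often outperforms them.
⭐ Contributions: The MSA algorithm, a step toward more flexible LLMs that are capable of data erasure.
Abstract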
Large language models are trained on massive corpora of web data, which may include private data, copyrighted material, factually inaccurate data, or data that degrades model performance. Eliminating the influence of such problematic datapoints on a model through complete retraining (by repeatedly pretraining the model on datasets that exclude these specific instances) is computationally prohibitive. To address this, unlearning algorithms have been proposed that aim to eliminate the influence of particular datapoints at a low computational cost, while leaving the rest of the model intact. However, precisely unlearning the influence of data on a large language model has proven to be a major challenge. In this work, we propose a new algorithm, MSA (**M**odel **S**tate **A**rithmetic), for unlearning datapoints in large language models. MSA utilizes prior model checkpoints (artifacts that record model states at different stages of pretraining) to estimate and counteract the effect of targeted datapoints. Our experimental results show that MSA achieves competitive performance and often outperforms existing machine unlearning algorithms across multiple benchmarks, models, and evaluation metrics, suggesting that MSA could be an effective approach towards more flexible large language models that are capable of data erasure.
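Foundation/Frontier Models (incl. LLMs)
LLM Pre-training
#Large Language Models #Downstream Metrics #Pretraining #Evaluation #Benchmarks #LLM
🎯 Motivation: Scaling laws for LLMs traditionally focus on proxy metrics such as pretraining loss, and predicting downstream task performance has been considered unreliable; a direct framework is needed.
❓ Problem: Propose and validate a framework that models downstream accuracy directly as a function of the training budget.
🔍 Analysis: For a fixed token-to-parameter ratio, a simple two-parameter scaling law accurately describes downstream accuracy.
🛠️ Method: A training-budget-based scaling framework that extrapolates from a set of smaller experiments to predict accuracy at larger budgets.
📊 Data & Experiments: Validated on models with up to 17B parameters trained on up to 350B tokens; the framework predicts the accuracy of a target model with up to 6.7x larger training budget.
⭐ Contributions: A new scaling-law framework, plus a released list of model losses and downstream evaluation results at various scales to support reproducibility.
Abstract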
While scaling laws for Large Language Models (LLMs) traditionally focus on proxy metrics like pretraining loss, predicting downstream task performance has been considered unreliable. This paper challenges that view by proposing a direct framework to model the scaling of downstream accuracy from the training budget. We demonstrate that for a fixed token-to-parameter ratio, a simple two-parameter scaling law accurately describes this relationship. Our findings are validated by experiments on models with up to 17B parameters trained on up to 350B tokens, showing that the scaling of downstream capabilities can be described by a scaling law. Furthermore, we extend this framework to extrapolate and predict the accuracy of a target model with up to 6.7x larger training budget based on a set of smaller experiments. We release a complete list of model losses and downstream evaluation results at various scales to support reproducibility and encourage future research.
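Foundation/Frontier Models (incl. LLMs)
LLM Pre-training
#diffusion #discrete diffusion #diffusion language models #scaling #scaling laws #optimal batch size #critical batch size
TL;DR:We find that uniform diffusion language models outscale both masked diffusion and autoregressive models in terms of both compute- and data-bound scaling.
🎯 Motivation: LLM pre-training consumes vast compute and data, making scaling behavior a key factor in comparing methods. Discrete diffusion language models (DLMs) have been proposed as an alternative to autoregressive language models (ALMs), but their scaling behavior has not been fully explored.
❓ Problem: Characterize how DLMs scale under different noise types, paying close attention to hyperparameters such as batch size and learning rate, and compare them with ALMs.
🔍 Analysis: Scaling behavior depends strongly on the noise type and differs considerably from ALMs. All noise types converge to similar loss values in compute-bound scaling, but uniform diffusion needs more parameters and less data than masked diffusion for compute-efficient training, which makes it promising when data is constrained.
🛠️ Method: Smoothly interpolate between masked and uniform diffusion, and reformulate the discrete diffusion ELBO in terms of signal-to-noise ratio, which simplifies both theory and implementation.
📊 Data & Experiments: A uniform diffusion model scaled to 10B parameters and trained for $10^{22}$ FLOPs confirms the predicted scaling behavior; it is the largest publicly known uniform diffusion model to date.
⭐ Contributions: Links DLM scaling behavior to noise type, reformulates the discrete diffusion ELBO, shows the potential of uniform diffusion in data-constrained training, and states that training code and models are to be open-sourced.
Abstract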
Modern LLM pre-training consumes vast amounts of compute and training data, making the scaling behavior, or scaling laws, of different models a key distinguishing factor.
Discrete diffusion language models (DLMs) have been proposed as an alternative to autoregressive language models (ALMs).
However, their scaling behavior has not yet been fully explored, with prior work suggesting that they require more data and compute to match the performance of ALMs.
We study the scaling behavior of DLMs on different noise types by smoothly interpolating between masked and uniform diffusion while paying close attention to crucial hyperparameters such as batch size and learning rate.
Our experiments reveal that the scaling behavior of DLMs strongly depends on the noise type and is considerably different from ALMs.
While all noise types converge to similar loss values in compute-bound scaling, we find that uniform diffusion requires more parameters and less data for compute-efficient training compared to masked diffusion, making it a promising candidate in data-constrained training environments.
We scale our uniform diffusion model up to 10B parameters trained for $10^{22}$ FLOPs, confirming the predicted scaling behavior and making it the largest publicly known uniform diffusion model to date.
In the process of deriving the scaling laws, we reformulate the discrete diffusion ELBO in terms of signal-to-noise ratio, closing the gap to continuous diffusion theory and simplifying both theory and implementation.
Training code and models are open-sourced: upon acceptance
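Foundation/Frontier Models (incl. LLMs)
LLM Pre-training
#Scaling laws #diffusion models #transformers
TL;DR:We demonstrate the existence of scaling laws for diffusion transformers.
🎯 Motivation: Diffusion transformers (DiT) perform well in image and video generation, but their scaling laws, which would predict the optimal model size and data requirements for a given compute budget, are less explored.
❓ Problem: Confirm that scaling laws exist for DiT and characterize how pretraining loss relates to the compute budget.
🔍 Analysis: Pretraining loss follows a power law across compute budgets, and its trend matches generation performance (e.g., FID), even across datasets.
🛠️ Method: Run experiments over compute budgets from 1e17 to 6e18 FLOPs, fit the scaling law, and extrapolate it to predict text-to-image generation loss for a 1B-parameter model at 1.5e21 FLOPs.
📊 Data & Experiments: Experiments across multiple datasets validate the scaling law and the mapping from compute budget to synthesis quality.
⭐ Contributions: Scaling laws for DiT, giving a predictable benchmark for assessing model performance and data quality at reduced cost.
Abstract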
Diffusion transformers (DiT) have already achieved appealing synthesis and scaling properties in content recreation, e.g., image and video generation.
However, the scaling laws of DiT, which usually offer precise predictions of the optimal model size and data requirements for a specific compute budget, are less explored.
Therefore, we conduct experiments across a broad range of compute budgets, from 1e17 to 6e18 FLOPs, to confirm the existence of scaling laws in DiT for the first time. Concretely, the pretraining loss of DiT also follows a power-law relationship with the compute involved.
Based on the scaling law, we can not only determine the optimal model size and required data but also accurately predict the text-to-image generation loss given a model with 1B parameters and a compute budget of 1.5e21 FLOPs.
Additionally, we also demonstrate that the trend of pretraining loss matches the generation performances (e.g., FID), even across various datasets, which complements the mapping from compute to synthesis quality and thus provides a predictable benchmark that assesses model performance and data quality at a reduced cost.
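A power-law relationship between loss and compute, $L = a \cdot C^{b}$, is a straight line in log-log space, so it can be fitted by ordinary least squares on the logarithms. The sketch below shows that fit and an extrapolation; the data points are synthetic values generated for the example, not results from the paper, and the function names are hypothetical.

```python
import math

def fit_power_law(compute, loss):
    """Fit loss = a * compute**b by least squares in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b

def predict_loss(a, b, compute):
    """Extrapolate the fitted power law to a new compute budget."""
    return a * compute ** b

# Synthetic points on an exact power law (a=5.0, b=-0.1), for illustration only.
budgets = [1e17, 3e17, 1e18, 6e18]
losses = [5.0 * c ** -0.1 for c in budgets]
a, b = fit_power_law(budgets, losses)
extrapolated = predict_loss(a, b, 1.5e21)
```

On real measurements the points scatter around the line, so the quality of the extrapolation depends on how well the small-budget runs follow the power law.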
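Foundation/Frontier Models (incl. LLMs)
LLM Pre-training
#Training loss curve collapse #Compute-efficient LLM pre-training #Tokens-per-parameter (TPP) #AdamW EMA timescale #Learning-rate schedules #Scale-stable dynamics (μP) #Early stopping for hyperparameter tuning
TL;DR:We show that loss curves *collapse* across LLM scales when training at fixed TPP and with AdamW timescale set optimally for that TPP, making collapse a marker of compute-efficient training and a tool for tuning, diagnostics, and early stopping.
🎯 Motivation: Efficient LLM training depends on key quantities scaling predictably with model and dataset size, but it is unclear whether the predictability of whole training curves holds under practical scaling recipes.
❓ Problem: Test whether loss curves collapse across model scales when hyperparameters are set well, and use collapse for training diagnostics and tuning.
🔍 Analysis: When optimization hyperparameters are set optimally for the data budget, normalized loss curves collapse across scales; collapse is therefore a signature of compute-efficient training.
🛠️ Method: Train at fixed tokens-per-parameter (TPP) with the AdamW EMA timescale set optimally for that TPP, and test collapse under joint scaling of width, depth, learning rate, batch size, and weight decay.
📊 Data & Experiments: At scale, deviation from collapse gives a sensitive, early diagnostic of training pathologies, and the predictability of collapsed curves enables early stopping in hyperparameter tuning; a competitive LLM family, Celerity, is trained using these insights.
⭐ Contributions: Establishes loss-curve collapse as a marker of compute-efficient training and as a practical tool for diagnostics, hyperparameter tuning, and early stopping.
Abstract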
Effective LLM training depends on predictable scaling of key quantities—such as final loss and optimal hyperparameters—with model and dataset size. Qiu et al. (2025) recently showed that this predictability can extend beyond scalars: whole training loss curves can *collapse* onto a universal trajectory after a simple normalization. What remains unclear is whether this phenomenon persists for LLM families trained under *practical scaling recipes*, where width, depth, learning rate, batch size, and weight decay are scaled jointly. We show that it does: loss curves collapse across scales precisely when optimization hyperparameters are set optimally for the given data budget, in accordance with recent empirical scaling laws. Collapse therefore emerges as a signature of compute-efficient training. We demonstrate two applications at scale: (1) deviation-from-collapse provides a sensitive, early diagnostic of training pathologies, and (2) predictability of collapsed curves enables early stopping in large-scale hyperparameter tuning. Finally, we train a competitive LLM family, *Celerity*, using these insights, establishing collapse as an effective tool for developing efficient LLMs.
Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#encoders #pretraining #objective #mlm #ntp #retrieval
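TL;DR:We train encoders and decoders on identical data and compare the architectures on various tasks, beating ModernBERT and Llama 3.2 with open data.
🎯 Motivation: The LLM community focuses on decoder-only models, yet many use cases such as classification and retrieval still rely on encoder-only models. Prior comparisons of the two architectures are confounded by differing parameter counts, training techniques, and datasets.
❓ Problem: Introduces Ettin, an open suite of paired models trained with one recipe, to compare encoder and decoder architectures on equal footing and to probe the limits of adapting one architecture to the other's tasks.
🔍 Observations: Encoders excel at classification and retrieval, while decoders excel at generation. Adapting a decoder to encoder tasks (or vice versa) through continued training falls short of simply using a model trained with the target objective.
🛠️ Method: Trains paired encoder-only and decoder-only models from 17 million to 1 billion parameters on up to 2 trillion tokens of fully open data, with a shared recipe that is SOTA in both categories.
📊 Data & experiments: Compares against ModernBERT, Llama 3.2, and SmolLM2, showing that Ettin models lead at their respective sizes, and releases 200+ checkpoints plus the training data for further analysis.
⭐ Contributions: A suite of encoder and decoder models trained under one controlled recipe, evidence for where each architecture fits across classification, retrieval, and generation, results that beat comparable open models, and fully open artifacts.
Abstract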
The large language model (LLM) community focuses almost exclusively on decoder-only language models, since they are easier to use for text generation. However, a large subset of the community still uses encoder-only models for tasks such as classification or retrieval. Previous work has attempted to compare these architectures, but is forced to make comparisons with models that have different numbers of parameters, training techniques, and datasets. We introduce the SOTA open-data Ettin suite of models: paired encoder-only and decoder-only models ranging from 17 million parameters to 1 billion, trained on up to 2 trillion tokens. Using the same recipe for both encoder-only and decoder-only models produces SOTA recipes in both categories for their respective sizes, beating ModernBERT as an encoder and Llama 3.2 and SmolLM2 as decoders. Like previous work, we find that encoder-only models excel at classification and retrieval tasks while decoders excel at generative tasks. However, we show that adapting a decoder model to encoder tasks (and vice versa) through continued training is subpar compared to using only the reverse objective (e.g., a 400M encoder outperforms a 1B decoder on MNLI, and vice versa for generative tasks). We open-source all artifacts of this study including training data, training order segmented by checkpoint, and 200+ checkpoints to allow future work to analyze or extend all aspects of training.
Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#language models #tokenization #pretraining #finetuning #subword understanding
TL;DR:A simple stochastic tokenization method—randomly splitting tokens before pretraining—that dramatically improves fine-grained, subword-level understanding in language models without any compromise to benchmark performance or increase in training cost.
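🎯 Motivation: LLMs handle subword-level tasks such as multi-digit numbers, spelling mistakes, and abbreviations poorly, in part because current tokenization obscures the fine-grained structure of words.
❓ Problem: Overcome the limits of current tokenization for subword-level understanding without raising training cost or sacrificing benchmark performance.
🔍 Observations: Existing alternatives, such as character-level and dropout tokenization, significantly increase computational cost and give inconsistent improvements.
🛠️ Method: StochasTok is a simple stochastic tokenization scheme that randomly splits tokens during training, exposing the model to the internal structure of words.
📊 Data & experiments: Experiments on subword-level language games, including character counting, substring identification, and math tasks, confirm its effectiveness and show that it integrates at any stage of an existing training pipeline.
⭐ Contributions: Substantially improves subword-level understanding at no extra training cost, shows that post-training with StochasTok can instill this ability into existing pretrained models, and suggests potential for larger models.
Abstract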
Subword-level understanding is integral to numerous tasks, including understanding multi-digit numbers, spelling mistakes, abbreviations, rhyming, and wordplay. Despite this, current large language models (LLMs) still struggle disproportionately with seemingly simple subword-level tasks, like counting the number of 'r's in 'strawberry'. A key factor behind these failures is tokenization, which obscures the fine-grained structure of words. Current alternatives, such as character-level and dropout tokenization methods, significantly increase computational costs and provide inconsistent improvements. In this paper, we revisit tokenization and introduce StochasTok, a simple, efficient stochastic tokenization scheme that randomly splits tokens during training, allowing LLMs to ‘see’ their internal structure. Our experiments show that pretraining with StochasTok substantially improves LLMs’ downstream performance across multiple subword-level language games, including character counting, substring identification, and math tasks. Furthermore, StochasTok’s simplicity allows seamless integration at any stage of the training pipeline, and we demonstrate that post-training with StochasTok can instill improved subword understanding into existing pretrained models, thus avoiding costly pretraining from scratch. These dramatic improvements achieved with a minimal change suggest StochasTok holds exciting potential when applied to larger, more capable models.
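The random token splitting described above can be sketched roughly as follows. This is an illustrative approximation, assuming string tokens, a split probability `p`, and the constraint that both halves of a split must already be in the vocabulary; the function name and these details are assumptions rather than the paper's exact procedure.

```python
import random


def stochastic_split(tokens, vocab, p=0.1, rng=None):
    """With probability p, replace a token by two in-vocabulary pieces."""
    if rng is None:
        rng = random.Random()
    out = []
    for tok in tokens:
        # All ways to cut the token into two pieces that both exist in the vocabulary.
        splits = [
            (tok[:i], tok[i:])
            for i in range(1, len(tok))
            if tok[:i] in vocab and tok[i:] in vocab
        ]
        if splits and rng.random() < p:
            out.extend(rng.choice(splits))
        else:
            out.append(tok)
    return out


vocab = {"straw", "berry", "s", "traw", "st", "raw", "b", "erry"}
print(stochastic_split(["straw", "berry"], vocab, p=1.0, rng=random.Random(0)))
```

Because a split only re-segments a token, concatenating the output always reproduces the original text, so the scheme changes what the model sees without changing the data itself.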
Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#Large Language Models (LLMs) #Citation Networks #Graph Neural Networks (GNNs)
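🎯 Motivation: LLMs are increasingly used to generate reference lists, so it matters whether their citations can be told apart from human ones.
❓ Problem: Distinguish LLM-generated reference lists from human ones, using graph-structural and semantic features to expose detectable patterns.
🔍 Observations: Structurally, LLM-generated citation graphs differ little from real ones; semantically, they leave clear, detectable fingerprints.
🛠️ Method: Compares structure-only node features with title/abstract embeddings, using a random forest on graph-level aggregates and graph neural networks (GNNs) with node features.
📊 Data & experiments: Builds paired ground-truth and LLM-generated citation graphs for 10,000 papers from SciSciNet (about 275,000 references) and checks robustness across LLMs and embedding models.
⭐ Contributions: Shows that LLM reference lists closely mimic human citation topology while differing detectably in semantics, which points detection and debiasing toward content signals.
Abstract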
Large language models are increasingly used to curate bibliographies, raising the question: are their reference lists distinguishable from human ones? We build paired citation graphs, ground truth and GPT-4o-generated (from parametric knowledge), for 10,000 focal papers ($\approx$ 275k references) from SciSciNet, and add a field-matched random baseline that preserves out-degree and field distributions while breaking latent structure. We compare (i) structure-only node features (degree/closeness/eigenvector centrality, clustering, edge count) with (ii) 3072-D title/abstract embeddings, using a random forest (RF) on graph-level aggregates and Graph Neural Networks (GNNs) with node features. Structure alone barely separates GPT from ground truth (RF accuracy $\approx$ 0.60) despite cleanly rejecting the random baseline ($\approx$ 0.89–0.92). By contrast, embeddings sharply increase separability: RF on aggregated embeddings reaches $\approx$ 0.83, and GNNs with embedding node features achieve 93% test accuracy on GPT vs. ground truth. We show the robustness of our findings by replicating the pipeline with Claude Sonnet 4.5 and with multiple embedding models (OpenAI and SPECTER), with RF separability for ground truth vs. Claude $\approx 0.77$ and clean rejection of the random baseline. Thus, LLM bibliographies, generated purely from parametric knowledge, closely mimic human citation topology, but leave detectable semantic fingerprints; detection and debiasing should target content signals rather than global graph structure.
Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#language model #pretraining #synthetic data
TL;DR:New pretraining paradigm that bootstraps model performance via synthetic data -- no external teacher needed.
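🎯 Motivation: Standard language model pretraining does not efficiently model the rich correlations across documents, which leaves performance on the table.
❓ Problem: Proposes a pretraining paradigm that needs no external teacher and uses synthetic data to exploit inter-document information that current methods underuse.
🔍 Observations: Standard pretraining learns causal correlations among tokens within a single document and makes little use of latent relations between documents.
🛠️ Method: Synthetic Bootstrapped Pretraining (SBP) first learns a model of relations between documents from the pretraining data, then synthesizes a large new corpus for joint training.
📊 Data & experiments: In a compute-matched setup, 3B- and 6B-parameter models are pretrained from scratch on up to 1T tokens; SBP consistently improves on a strong repetition baseline.
⭐ Contributions: A synthetic-document pretraining method that models inter-document relations, delivers up to 60% of the improvement attainable by an oracle with access to 20x more unique data, and admits a Bayesian interpretation.
Abstract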
We introduce Synthetic Bootstrapped Pretraining (SBP), a language model (LM) pretraining procedure that first learns a model of relations between documents from the pretraining dataset and then leverages it to synthesize a vast new corpus for joint training.
While the standard pretraining teaches LMs to learn causal correlations among tokens within a single document, it is not designed to efficiently model the rich, learnable inter-document correlations that can potentially lead to better performance.
We validate SBP by designing a compute-matched pretraining setup and pretrain a 3B-parameter and a 6B-parameter model on up to 1T tokens from scratch.
We find SBP consistently improves upon a strong repetition baseline and delivers up to 60% of performance improvement attainable by an oracle upper bound with access to 20x more unique data.
Qualitative analysis reveals that the synthesized documents go beyond mere paraphrases -- SBP first abstracts a core concept from the seed material and then crafts a new narration on top of it.
Besides strong empirical performance, SBP admits a natural Bayesian interpretation: the synthesizer implicitly learns to abstract the latent concepts shared between related documents.
Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#Large Language Models #Autoregressive Modeling #Generative Modeling #Efficient Training
TL;DR:DIST2Loss trains discrete models to respect token distances, boosting performance in diverse domains, especially with limited data.
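🎯 Motivation: Discrete autoregressive models are trained against one-hot targets, which ignore metric relationships between tokens (for example, differences between numerical values or spatial coordinates) and limit performance on tasks where distance is meaningful. DIST2Loss closes this gap with a distance-aware training objective.
❓ Problem: Because one-hot targets cannot express token distances, the paper proposes reward-weighted target distributions derived from predefined distances, improving visual grounding, robotic manipulation, LLM alignment, and image generation.
🔍 Observations: One-hot targets treat tokens as unrelated, so the model cannot exploit their distance semantics; the cost in efficiency and quality is largest when data is limited or the task depends on metric structure.
🛠️ Method: DIST2Loss replaces one-hot targets with reward-weighted distributions computed from token distances. It can be read as the closed-form solution to entropy-regularized policy optimization, which avoids the sampling and instability of reinforcement learning.
📊 Data & experiments: Experiments span visual grounding (tighter bounding boxes), robotic manipulation (faster action learning), LLM alignment (better reward modeling), and vector-quantized image generation, showing gains in data efficiency and downstream performance.
⭐ Contributions: A simple, general distance-aware supervision scheme for discrete autoregressive models as an alternative to one-hot targets, with a derivation that keeps the core mechanism of reinforcement learning in a stable training objective, and gains across multiple domains.
Abstract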
Large language models (LLMs) operate as autoregressive predictors over discrete token vocabularies, a formulation that has enabled their adaptation far beyond natural language to vision, robotics, and multimodal reasoning. However, training against one-hot targets disregards metric relationships between tokens and limits effectiveness on tasks where distance is meaningful, such as numerical values, spatial coordinates, or quantized embeddings. We introduce DIST2Loss, a distance-aware objective for discrete autoregressive models that replaces one-hot targets with reward-weighted distributions derived from predefined token distances. DIST2Loss can be interpreted as the closed-form solution to entropy-regularized policy optimization with known per-token rewards, retaining the core mechanism of reinforcement learning while avoiding sampling, rollouts, and instability. Our experiments show that DIST2Loss improves data efficiency and downstream performance across diverse domains. It yields tighter bounding boxes in visual grounding, accelerates robotic manipulation by improving action learning, enhances reward modeling for LLM alignment, and strengthens vector-quantized image generation. These results demonstrate that distance-aware supervision offers a simple and general alternative to one-hot supervision for discrete autoregressive models.
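A distance-aware target of the kind described above can be sketched in a few lines. This is a minimal illustration, assuming integer tokens whose distance is the absolute difference of their indices, a reward equal to the negative distance, and a temperature `tau`; the softmax-of-reward form matches the usual closed-form solution of entropy-regularized policy optimization, but the specific distance, names, and temperature are assumptions, not the paper's configuration.

```python
import math


def distance_aware_target(target_idx, vocab_size, tau=1.0):
    """Soft target q(v) proportional to exp(-|v - target| / tau)."""
    weights = [math.exp(-abs(v - target_idx) / tau) for v in range(vocab_size)]
    total = sum(weights)
    return [w / total for w in weights]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def distance_aware_loss(logits, target_idx, tau=1.0):
    """Cross-entropy between the soft target and the model distribution."""
    q = distance_aware_target(target_idx, len(logits), tau)
    logp = log_softmax(logits)
    return -sum(qi * lp for qi, lp in zip(q, logp))


# A prediction one bin away from the target is penalized less than one four bins away.
near = [0.0] * 10
near[4] = 5.0
far = [0.0] * 10
far[9] = 5.0
print(distance_aware_loss(near, 5), distance_aware_loss(far, 5))
```

With a very small `tau` the target approaches one-hot, so ordinary cross-entropy is recovered as a limiting case of this sketch.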
Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#Large Language Models #Scaling Laws #Misalignment #Bias-Variance
TL;DR:We propose decomposing AI error with the bias and variance framework, and analyze scaling behavior across task complexity and model scale.
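🎯 Motivation: As AI grows more capable it is entrusted with more general and consequential tasks, and the risks from failure grow with task scope, so the failure modes of highly capable AI need to be understood.
❓ Problem: Characterizes how advanced AI models fail, either by systematically pursuing unintended goals or by unpredictable, incoherent behavior, using a bias-variance decomposition of their errors.
🔍 Observations: The longer models spend reasoning and acting, the more incoherent their failures become. How incoherence changes with model scale depends on the task and experiment, but in several settings larger models are more incoherent than smaller ones.
🛠️ Method: Using a bias-variance decomposition, measures error incoherence as the fraction of error that stems from variance rather than bias over test-time randomness, and tracks it across task complexity and model scale.
📊 Data & experiments: Experiments across multiple tasks and frontier models evaluate error incoherence at different reasoning and action lengths.
⭐ Contributions: Introduces the error incoherence measure, shows that scale alone is unlikely to eliminate incoherence, and argues that this raises the relative importance of alignment research on reward hacking and goal misspecification.
Abstract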
As AI becomes more capable, we entrust it with more general and consequential tasks. The risks from failure grow more severe with increasing task scope. It is therefore important to understand the ways extremely capable AI models will fail: Will they fail by systematically pursuing goals we do not intend? Or will they fail by being a hot mess, and taking nonsensical actions that do not further any goal? We operationalize this question using a bias-variance decomposition of the errors made by AI models: An AI's *error incoherence* on a task is measured over test-time randomness as the fraction of its error that stems from variance rather than bias in task outcome. Across all tasks and frontier models we measure, we find that the longer models spend reasoning and taking actions, *the more incoherent* their failures become. We observe that error incoherence changes with model scale in a way that is task and experiment dependent. However, in several settings larger, more capable models are more incoherent than smaller models.
Consequently, scale alone seems unlikely to eliminate incoherence. Instead, as more capable AIs pursue harder tasks, requiring more sequential action and thought, our results predict failures to be accompanied by more incoherent behavior.
This suggests a future where AIs sometimes cause industrial accidents (due to unpredictable misbehavior), but are less likely to exhibit consistent pursuit of a misaligned goal.
This increases the relative importance of alignment research targeting reward hacking or goal misspecification.
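For a scalar task outcome under squared error, the decomposition behind error incoherence can be sketched as follows. This is a simplified illustration assuming repeated runs of the same task yield numeric outcomes and that error is measured as mean squared error, so that error = bias² + variance; the paper's exact outcome measure may differ, and the function name is hypothetical.

```python
from statistics import fmean, pvariance


def error_incoherence(outcomes, target):
    """Fraction of mean squared error that comes from variance rather than bias."""
    bias_sq = (fmean(outcomes) - target) ** 2
    variance = pvariance(outcomes)
    total = bias_sq + variance
    return variance / total if total > 0 else 0.0


# Consistently wrong in the same way: all error is bias, incoherence is 0.
print(error_incoherence([2.0, 2.0, 2.0, 2.0], target=0.0))
# Right on average but scattered: all error is variance, incoherence is 1.
print(error_incoherence([-1.0, 1.0, -1.0, 1.0], target=0.0))
```

In this reading, a model that reliably pursues the wrong goal scores near 0, while a "hot mess" whose failures differ from run to run scores near 1.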
Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#Tokenisation #tokenization #language modelling #compression #LLM #NLP
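TL;DR:We prove that selecting a tokeniser which maximises a dataset's compression is NP-complete and does not admit a PTAS (unless P=NP), even when inputs are defined over a binary alphabet.
🎯 Motivation: Recent work proved tokenisation NP-complete, but only under the assumption of unboundedly large alphabets, which is unrealistic in practice; the hardness of tokenisation over bounded alphabets needs its own analysis.
❓ Problem: Determines the computational complexity of tokenisation over fixed-size alphabets (such as bytes or Unicode characters) and whether the hardness depends on alphabet size.
🔍 Observations: Even over binary alphabets, both variants are NP-complete and APX-hard, and direct tokenisation stays NP-complete over unary alphabets, so the intractability is fundamental rather than an artifact of large alphabets.
🛠️ Method: Analyses two variants, bottom-up tokenisation (choosing a sequence of merge operations) and direct tokenisation (choosing a vocabulary), and proves their hardness over bounded alphabets with complexity-theoretic arguments.
📊 Data & experiments: The work is purely theoretical; all results are proved mathematically, with no datasets involved.
⭐ Contributions: Proves NP-completeness and APX-hardness of tokenisation over bounded alphabets, explains why practical algorithms such as BPE and UnigramLM are heuristic, and points to approximation algorithms as an important path forward.
Abstract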
Recent works have shown that tokenisation is $\mathsf{NP}$-complete. However, these works assume tokenisation is applied to inputs with unboundedly large alphabets—an unrealistic assumption, given that in practice tokenisers operate over fixed-size alphabets, such as bytes or Unicode-characters. We close this gap by analysing tokenisation over bounded alphabets, considering two natural variants: bottom-up tokenisation and direct tokenisation, where we must, respectively, select a sequence of merge operations or a vocabulary whose application optimally compresses a dataset. We prove that even with binary alphabets, both variants are not only $\mathsf{NP}$-complete, but also $\mathsf{APX}$-hard and thus admit no polynomial-time approximation scheme (unless $\mathsf{P}=\mathsf{NP}$). We further show that direct tokenisation remains $\mathsf{NP}$-complete even when applied to unary alphabets. These results establish that the computational intractability of tokenisation is not an artifact of large alphabets or complex constructions, but a fundamental barrier. Overall, our results explain why current practical algorithms such as BPE and UnigramLM are heuristic, and point toward approximation algorithms being an important path going forward for tokenisation research.
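To make the bottom-up variant concrete, here is a minimal sketch of the greedy BPE-style heuristic over a binary alphabet: at each step it merges the most frequent adjacent pair. This greedy rule is the standard BPE heuristic, not an optimal solver; the hardness results above concern choosing the merge sequence that compresses best, which this sketch does not attempt.

```python
from collections import Counter


def apply_merge(seq, pair):
    """Replace every non-overlapping occurrence of `pair` with the merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def greedy_bpe(text, num_merges):
    """Greedy heuristic: repeatedly merge the most frequent adjacent pair."""
    seq = list(text)
    merges = []
    for _ in range(num_merges):
        counts = Counter(zip(seq, seq[1:]))
        if not counts:
            break
        pair, _freq = counts.most_common(1)[0]
        merges.append(pair)
        seq = apply_merge(seq, pair)
    return merges, seq


merges, compressed = greedy_bpe("01010101", num_merges=2)
print(merges, compressed, len(compressed))
```

The length of the returned sequence is the compression objective; the greedy choice is cheap but carries no optimality guarantee, which is consistent with the abstract's point that practical tokenisers are heuristic.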
Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#Label Aggregation
TL;DR:We created CrowdFM, a foundation model for label aggregation, which is pre-trained once on synthetic data to accurately combine noisy crowdsourced labels on any new dataset without retraining.
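🎯 Motivation: Inferring ground truth from noisy crowdsourced labels is a core challenge in machine learning; the traditional approach of dataset-specific parameter estimation does not scale and cannot transfer knowledge.
❓ Problem: Existing universal aggregation models fail to capture the structural and behavioral complexity of human crowdsourced annotation, which limits their real-world performance.
🔍 Observations: Crowdsourced annotation exhibits diverse behavioral patterns and complex structural interactions that prior models do not exploit, which hurts generalization.
🛠️ Method: CrowdFM is a foundation model built on a bipartite graph neural network and pre-trained on a large, domain-randomized synthetic dataset; size-invariant initialization and attention-based message passing let it learn universal principles of collective intelligence.
📊 Data & experiments: On 22 real-world benchmarks, the single fixed model consistently matches or surpasses bespoke per-dataset methods in both accuracy and efficiency.
⭐ Contributions: A label-aggregation foundation model that generalizes to new, unseen datasets without retraining; learned representations that support downstream applications such as worker assessment and task assignment; and a public codebase.
Abstract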
Inferring ground truth from noisy, crowdsourced labels is a fundamental challenge in machine learning. For decades, the dominant paradigm has relied on dataset-specific parameter estimation, a non-scalable method that fails to transfer knowledge. Recent efforts toward universal aggregation models do not account for the structural and behavioral complexities of human-annotated crowdsourcing, resulting in poor real-world performance. To address this gap, we introduce CrowdFM, a foundation model for crowdsourced label aggregation. At its core, CrowdFM is a bipartite graph neural network that is pre-trained on a vast, domain-randomized synthetic dataset to learn diverse behavioral patterns. By leveraging a size-invariant initialization and attention-based message passing, it learns universal principles of collective intelligence and generalizes to new, unseen datasets. Extensive experiments on 22 real-world benchmarks show that our single, fixed model consistently matches or surpasses bespoke, per-dataset methods in both accuracy and efficiency. Furthermore, the representations learned by CrowdFM readily support diverse downstream applications, such as worker assessment and task assignment. Codes are available at https://github.com/liiuhaao/CrowdFM.
Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#large language models (LLMs) #pretraining #experiments #memorization
TL;DR:We show that it is possible to conduct multiple pretraining experiments during the training of a single LLM.
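🎯 Motivation: Controlled pretraining experiments are a powerful tool for studying how training data shapes LLM behavior, but the compute cost of pretraining limits their scope.
❓ Problem: Proposes running multiple pretraining experiments simultaneously within a single LLM training run to reduce compute cost.
🔍 Observations: A single run replicates prior results on data contamination, poisoning, and memorization, and yields new findings on knowledge acquisition, mathematical reasoning, and watermarking.
🛠️ Method: Conducts all experiments inside one training run, for example by dynamically updating the training data until the model acquires a given piece of knowledge, and uses continual pretraining dependence testing (CPDT) to test for interactions between experiments.
📊 Data & experiments: Ten experiments are performed while training models of up to 2.7B parameters on 210B tokens.
⭐ Contributions: Shows that a single pretraining run can host multiple experiments, enabling rigorous scientific experimentation on a compute budget, and introduces CPDT for testing interactions between experiments.
Abstract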
Recent work has demonstrated that controlled pretraining experiments are a powerful tool for studying the relationship between training data and large language model (LLM) behavior.
However, the computational cost of pretraining presents a significant constraint. To overcome this constraint, we propose a new approach where multiple experiments are conducted simultaneously during a *single* training run. We validate our approach by performing ten experiments while training on 210B tokens, with models of up to 2.7B parameters. Although models are trained only once, we can replicate the results of multiple previous works on data contamination, poisoning, and memorization. We also conduct novel investigations into knowledge acquisition, mathematical reasoning, and watermarking. For example, we dynamically update the training data until a model acquires a particular piece of knowledge. Remarkably, the influence of the experiments on the model's training dynamics and overall performance is minimal. However, interactions between experiments may act as a confounder in our approach. We propose continual pretraining dependence testing (CPDT), a novel technique to test for interactions with continual pretraining experiments, finding them to be negligible in our setup. Overall, our results suggest that performing multiple pretraining experiments within a single training run can enable rigorous scientific experimentation with large models on a compute budget.
Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#calibration #LLM #semantic #uncertainty #theory
TL;DR:We show that LLMs can be semantically calibrated, and we develop theory for when and why.
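🎯 Motivation: LLMs often lack meaningful confidence estimates for the semantic content of their outputs, which calls for studying semantic calibration and the theory behind it.
❓ Problem: Asks whether LLMs are calibrated at the semantic level, and why this can emerge from next-token-prediction training alone.
🔍 Observations: Base LLMs are well calibrated on semantic answers in open-ended question answering, but instruction tuning and chain-of-thought reasoning break this calibration.
🛠️ Method: Defines $B$-calibration, a notion of calibration parameterized by the choice of equivalence classes, and derives a mechanism from a connection between calibration and local loss optimality that predicts when calibration holds and when it breaks.
📊 Data & experiments: Experiments on several open-ended question-answering tasks validate the predictions for base models, instruction-tuned models, and chain-of-thought reasoning.
⭐ Contributions: Provides what the authors describe as the first principled explanation of when and why semantic calibration emerges in LLMs, supported by experiments that connect calibration to next-token training.
Abstract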
Large Language Models (LLMs) often lack meaningful confidence estimates for the semantic content of their outputs. While base LLMs are known to exhibit next-token calibration, it remains unclear whether they can assess confidence in the actual meaning of their responses beyond the token level. We find that, when using a certain sampling-based notion of semantic calibration, base LLMs are remarkably well-calibrated: they can meaningfully assess confidence in various open-ended question-answering tasks, despite training only on next-token prediction. To formalize this phenomenon, we introduce "$B$-calibration," a notion of calibration parameterized by the choice of equivalence classes. Our main theoretical contribution establishes a mechanism for why semantic calibration emerges in base LLMs, leveraging a recent connection between calibration and local loss optimality. This theoretical mechanism leads to a testable prediction: base LLMs will be semantically calibrated when they can easily predict their own distribution over semantic answer classes before generating a response. We state three implications of this prediction, which we validate through experiments: (1) Base LLMs are semantically calibrated across question-answering tasks, (2) instruction-tuning procedures systematically break this calibration, and (3) chain-of-thought reasoning breaks calibration (intuitively because models cannot predict their final answers before completing their generation). To our knowledge, our work provides the first principled explanation of when and why semantic calibration emerges in LLMs.
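A sampling-based notion of semantic confidence can be sketched as follows. This is a generic illustration, not the paper's formal definition of $B$-calibration: it assumes a user-supplied `to_class` function that maps each sampled response to a semantic equivalence class, takes confidence to be the empirical share of the most common class, and scores calibration with a standard binned expected calibration error. All names here are hypothetical.

```python
from collections import Counter


def semantic_confidence(samples, to_class):
    """Most common semantic class among sampled answers and its empirical share."""
    classes = [to_class(s) for s in samples]
    top, count = Counter(classes).most_common(1)[0]
    return top, count / len(classes)


def expected_calibration_error(confidences, correct, num_bins=10):
    """Binned gap between mean confidence and accuracy, weighted by bin size."""
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(ok for _, ok in members) / len(members)
            ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


def normalize(answer):
    """Toy equivalence relation: ignore case, surrounding spaces, trailing period."""
    return answer.strip().strip(".").lower()


answer, confidence = semantic_confidence(["Paris", "paris.", " Paris ", "Lyon"], normalize)
print(answer, confidence)
```

In practice the equivalence classes would come from something stronger than string normalization; the choice of that grouping is exactly what the parameter $B$ captures in the abstract.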
Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#language models #tokenization #automata #transducers
TL;DR:We present a method for converting a language model over one set of tokens into a language model over another set of tokens
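🎯 Motivation: Language models define distributions over strings, but downstream tasks often need a different output format, for example word-level predictions from a byte-pair model or amino-acid sequences from a DNA model.
❓ Problem: Given a deterministic string-to-string transformation, turns a language model over one output space into a fully functional language model over the transformed space.
🔍 Observations: Applying a deterministic transformation to model outputs induces a new distribution, as with any function of a random variable, yet prior work did not formalize the result as a full language model. Transformations expressible as finite-state transducers admit efficient algorithms.
🛠️ Method: Composes a language model with a finite-state transducer (FST) and marginalizes over the source strings that map to a given target, leaving model parameters unchanged and enabling conditioning on transformed outputs; an exact algorithm and an efficient approximation are provided.
📊 Data & experiments: Experiments cover three domains, converting from tokens to bytes, from tokens to words, and from DNA to amino acids, and demonstrate inference-time adaptation of pretrained models.
⭐ Contributions: A general framework for language models derived from deterministic string-to-string transformations, with algorithms, theoretical analysis, and experiments across several applications.
Abstract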
Modern language models define distributions over strings, but downstream tasks often require different output formats. For instance, a model that generates byte-pair strings does not directly produce word-level predictions, and a DNA model does not directly produce amino-acid sequences. In such cases, a deterministic string-to-string transformation can convert the model's output to the desired form.
This is a familiar pattern in probability theory: applying a function $f$ to a random variable $X\sim p$ yields a transformed random variable $f(X)$ with an induced distribution. While such transformations are occasionally used in language modeling, prior work does not treat them as yielding new, fully functional language models. We formalize this perspective and introduce a general framework for language models derived from deterministic string-to-string transformations. We focus on transformations representable as finite-state transducers---a commonly used state-machine abstraction for efficient string-to-string mappings. We develop algorithms that compose a language model with an FST to *marginalize* over source strings mapping to a given target, propagating probabilities through the transducer without altering model parameters and enabling *conditioning* on transformed outputs. We present an exact algorithm, an efficient approximation, and a theoretical analysis. We conduct experiments in three domains: converting language models from tokens to bytes, from tokens to words, and from DNA to amino acids. These experiments demonstrate inference-time adaptation of pretrained language models to match application-specific output requirements.
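The induced distribution of $f(X)$ can be illustrated by brute force on a tiny, fully enumerated source distribution. This sketch only shows the marginalization that defines the transformed model; it assumes the source distribution fits in a dictionary, which is exactly what the FST-based algorithms in the abstract are designed to avoid. The toy probabilities are invented for illustration.

```python
from collections import defaultdict


def pushforward(source_dist, transform):
    """Sum the probability of every source string that maps to the same target."""
    target_dist = defaultdict(float)
    for source, prob in source_dist.items():
        target_dist[transform(source)] += prob
    return dict(target_dist)


# Toy token-level model: two different tokenizations spell the same string "abc".
token_model = {
    ("ab", "c"): 0.5,
    ("a", "bc"): 0.3,
    ("a", "b", "d"): 0.2,
}
char_model = pushforward(token_model, "".join)
print(char_model)
```

The probability of "abc" in the transformed model is the sum over both tokenizations, which is the marginalization step the abstract refers to; model parameters are never touched.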
Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#hyperparameter transfer #hyperparameter tuning #scaling laws #optimization dynamics #maximal update parameterization #science of deep learning
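🎯 Motivation: As deep learning models grow, exhaustive hyperparameter optimization becomes prohibitively expensive. Scale-aware hyperparameters let optimal settings found on small models transfer directly to large ones with minimal performance loss.
❓ Problem: A deeper conceptual understanding of what enables fast hyperparameter transfer is missing, in particular why approaches such as the Maximal Update Parameterization (μP) transfer well when scaling model width.
🔍 Observations: Analysis of synthetic and practical settings shows when fast transfer holds: in some cases it offers a provable computational advantage, and in others it fails even under μP.
🛠️ Method: A new decomposition of the optimization trajectory identifies one component that converges rapidly with model width and determines the optimal hyperparameters, and another that keeps improving the loss as width grows but has negligible impact on hyperparameter choice.
📊 Data & experiments: Provides quantitative examples in synthetic settings and validates the decomposition empirically in practical settings such as LLM training.
⭐ Contributions: A systematic conceptual framework for fast hyperparameter transfer, a conjecture about the key mechanism behind it, and empirical validation in practice.
Abstract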
The growing scale of deep learning models has rendered exhaustive hyperparameter (HP) optimization prohibitively expensive. A promising solution is the use of scale-aware HPs, which can enable direct transfer of optimal settings from small-scale grid searches to large models with minimal performance loss. Such approaches are useful when the optimal settings converge "fast" enough with scale. While approaches like the Maximal Update Parameterization ($\mu$P) have empirically displayed fast transfer when scaling model width, a deeper conceptual understanding of the mechanisms that enable this is still missing. Our work establishes a systematic conceptual framework for analyzing fast HP transfer across different synthetic and practical scenarios. In synthetic settings, we present various quantitative examples where transfer either offers a provable computational advantage or fails even under $\mu$P.
We then propose a key property that enables the fast transfer often observed in practice: through a novel decomposition of the optimization trajectory, we identify one component that rapidly converges with model width and determines the optimal HPs, and the other that continues to improve the loss with increased width but has negligible impact on HP choice. We conjecture that this decomposition elucidates the key mechanisms behind fast transfer and empirically validate it in practical settings such as LLM training.
Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#Backdoor Defense #Anomaly Detection #Gradient-Based Attribution #Attention Mechanisms #Explainability #Pre-trained Language Models
TL;DR:An Explainable Defense via Gradient-Attention Anomaly Scoring for Pre-trained Language Models
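🎯 Motivation: Pre-trained language models perform well across NLP tasks but remain vulnerable to backdoor attacks, in which trigger patterns embed malicious behavior and threaten model security.
❓ Problem: Designs an explainable defense that resists backdoor attacks by detecting the attention and gradient anomalies that appear when a trigger is activated.
🔍 Observations: On poisoned inputs, the trigger token dominates both attention and gradient attribution, overriding the surrounding context.
🛠️ Method: An inference-time anomaly score that combines token-level attention and gradient information to detect backdoor behavior, flag anomalous inputs, and localize triggers.
📊 Data & experiments: Extensive experiments on text classification tasks across diverse backdoor attack scenarios show significantly lower attack success rates than existing baselines.
⭐ Contributions: An explainable defense based on gradient-attention anomaly scoring that improves backdoor detection and trigger localization, together with an interpretability-driven analysis of the scoring mechanism.
Abstract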
Pre-trained language models have achieved remarkable success across a wide range of natural language processing (NLP) tasks, particularly when fine-tuned on large, domain-relevant datasets. However, they remain vulnerable to backdoor attacks, where adversaries embed malicious behaviors using trigger patterns in the training data. These triggers remain dormant during normal usage, but, when activated, can cause targeted misclassifications. In this work, we investigate the internal behavior of backdoored pre-trained encoder-based language models, focusing on the consistent shift in attention and gradient attribution when processing poisoned inputs, where the trigger token dominates both attention and gradient signals, overriding the surrounding context. We propose an inference-time defense that constructs anomaly scores by combining token-level attention and gradient information. Extensive experiments on text classification tasks across diverse backdoor attack scenarios demonstrate that our method significantly reduces attack success rates compared to existing baselines. Furthermore, we provide an interpretability-driven analysis of the scoring mechanism, shedding light on trigger localization and the robustness of the proposed defense.
Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#Performance Prediction #Scaling Law #Large Language Models #Pretraining
TL;DR:We developed a Clustering-On-Difficulty (COD) framework that accurately predicts LLM downstream performance, achieving a 1.55% average prediction error on a 70B parameter model.
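🎯 Motivation: The rising cost of LLM training calls for accurate prediction of downstream task performance, in order to understand scaling properties.
❓ Problem: Existing methods cope poorly with suddenly emerging capabilities and with uneven task difficulty, which cause prediction error and unstable metrics.
🔍 Observations: Capabilities can appear abruptly at critical model scales, and differences in difficulty and scaling patterns across tasks add to the uncertainty of predictions.
🛠️ Method: The Clustering-On-Difficulty (COD) framework clusters tasks by difficulty scaling features, predicts cluster-wise performance with a performance scaling law, and extrapolates from the predictable subset to the full evaluation set through a mapping function.
📊 Data & experiments: Applied to an LLM with 70B parameters, the framework reaches a 1.55% average prediction error across eight key benchmarks.
⭐ Contributions: A stable, accurate performance-prediction method that gives actionable insight into scaling properties and supports training monitoring during pre-training.
Abstract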
The escalating scale and cost of Large Language Model (LLM) training necessitate accurate pre-training prediction of downstream task performance for comprehensive understanding of scaling properties. This is challenged by: 1) the emergence phenomenon, where unpredictable capabilities appear suddenly at critical model scales; and 2) uneven task difficulty and inconsistent performance scaling patterns, leading to high metric variability. Current prediction methods lack accuracy and reliability. We propose a Clustering-On-Difficulty (COD) framework for downstream performance prediction. The COD framework clusters tasks by their difficulty scaling features, thereby constructing a more stable and predictable task subset that exhibits well-behaved scaling characteristics with the increase of compute budget. We adopt a performance scaling law to predict cluster-wise performance with theoretical support. Predictable subset performance acts as an intermediate predictor for the full evaluation set. We further derive a mapping function to accurately extrapolate the performance of the subset to the full set. Applied to an LLM with 70B parameters, COD achieved a 1.55% average prediction error across eight key LLM benchmarks, thus providing actionable insights for scaling properties and training monitoring during LLM pre-training.
Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#loss landscape #empirical theory #pre-training #fine-tuning
TL;DR:Explore the Basin Phenomenon in LLM Landscape.
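🎯 Motivation: As LLMs scale up they become markedly more stable in parameter space, but their behavior under local perturbations still needs to be understood to improve performance and robustness.
❓ Problem: Reveals the basin phenomenon in the LLM loss landscape and studies how basin properties govern whether capabilities are preserved or degraded.
🔍 Observations: Pre-training creates a basic capability basin, and alignment fine-tuning forms specific capability basins. Performance is stable inside a basin and collapses outside it. Adversarial fine-tuning moves along nearly worst-case directions and degrades capabilities rapidly.
🛠️ Method: Combines theoretical analysis with experiments to show how basin size bounds performance degradation under fine-tuning and guarantees robustness to input perturbations.
📊 Data & experiments: Covers several pre-training and fine-tuning settings, including safety, math, and coding capabilities, together with an analysis of worst-case directions.
⭐ Contributions: Identifies basins in the LLM loss landscape, shows that basin size bounds capability degradation and supports robustness, and argues that enlarging basins benefits fine-tuning and stability.
Abstract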
We discover the emergence of basins in the loss landscape of large language models. As model scale increases, LLMs become progressively more resilient to random perturbations in the parameter space, giving rise to expansive stability regions where models exhibit nearly identical performance, but outside of which their capabilities collapse. We observe that pre-training creates a basic capability basin, and subsequent alignment fine-tuning forms specific capability basins (e.g., safety, math, coding). Thus, we argue that benign fine-tuning confined to the basin should preserve prior capabilities. Besides, we also analyze the loss landscape for worst-case directions, which is consistently sharp and detrimental. We find that adversarial fine-tuning moves along the nearly worst-case directions, thus rapidly degrading model capabilities. Finally, we provide a theoretical analysis demonstrating that the basin size bounds the performance degradation of any fine-tuning, including the adversarial ones, while also guaranteeing the model robustness w.r.t. input perturbations, suggesting the benefit of enlarging basins.
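One way to picture a basin is to walk away from the trained parameters along a random direction and record how far the loss stays within a tolerance. The following is a toy sketch under stated assumptions: parameters are a plain list of floats, `loss_fn` is any callable, and the radius grid and tolerance are arbitrary. It is not the authors' measurement protocol.

```python
import math
import random


def random_direction(dim, rng):
    """Unit-norm Gaussian direction in parameter space."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def basin_radius(loss_fn, params, direction, tolerance, radii):
    """Largest probed radius at which the loss increase stays within tolerance."""
    base = loss_fn(params)
    last_ok = 0.0
    for r in radii:
        moved = [p + r * d for p, d in zip(params, direction)]
        if loss_fn(moved) - base > tolerance:
            break
        last_ok = r
    return last_ok


def toy_loss(params):
    """Quadratic bowl standing in for a real evaluation loss."""
    return sum(p * p for p in params)


rng = random.Random(0)
direction = random_direction(8, rng)
print(basin_radius(toy_loss, [0.0] * 8, direction, tolerance=1.1, radii=[0.5, 1.0, 1.5, 2.0]))
```

Repeating the probe over many random directions, and separately over adversarially chosen ones, would correspond to the contrast the abstract draws between wide random-direction basins and sharp worst-case directions.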
Foundation/Frontier Models (incl. LLMs)
LLM Pretraining
#llm pre-training #learning rate schedule #checkpoint merging #decay-free approach
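🎯 Motivation: Decay-free learning rate schedules have recently matched the performance of traditional decay schedules, and model merging has emerged as a particularly promising technique in this area.
❓ Problem: Existing work lacks a unified theory connecting learning rate decay and model merging, and has not examined how strongly merge duration affects performance.
🔍 Observations: Experiments show that merge duration matters more for model performance than checkpoint interval or the number of merged checkpoints, and that high-quality annealing data improves results.
🛠️ Method: Warmup-Stable and Merge (WSM) is a general framework that connects learning rate decay with model merging, expressing different decay strategies as model averaging schemes while remaining compatible with diverse optimizers.
📊 Data & experiments: The framework improves over the Warmup-Stable-Decay (WSD) approach by +3.5% on MATH, +2.9% on HumanEval, and +5.5% on MMLU-Pro, and the gains carry over to supervised fine-tuning.
⭐ Contributions: A unified theoretical framework linking learning rate decay and model merging, evidence that merge duration is the key factor, and consistent gains over WSD on multiple benchmarks.
Abstract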
Recent advances in learning rate (LR) scheduling have demonstrated the effectiveness of decay-free approaches that eliminate the traditional decay phase while maintaining competitive performance. Model merging techniques have emerged as particularly promising solutions in this domain. We present Warmup-Stable and Merge (WSM), a general framework that establishes a formal connection between learning rate decay and model merging. WSM provides a unified theoretical foundation for emulating various decay strategies (including cosine decay, linear decay, and inverse square root decay) as principled model averaging schemes, while remaining fully compatible with diverse optimization methods. Through extensive experiments, we identify merge duration (the training window for checkpoint aggregation) as the most critical factor influencing model performance, surpassing the importance of both checkpoint interval and merge quantity. With high-quality annealing data, our framework consistently outperforms the widely adopted Warmup-Stable-Decay (WSD) approach across multiple benchmarks, achieving significant improvements of +3.5% on MATH, +2.9% on HumanEval, and +5.5% on MMLU-Pro. The performance advantages extend to supervised fine-tuning scenarios, highlighting WSM's potential for long-term model refinement.
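The link between decay and averaging can be illustrated with a simple identity. Under the simplifying assumption that a decay schedule merely rescales each recorded update $\theta_i - \theta_{i-1}$ by a factor $w_i$, the decayed endpoint $\theta_0 + \sum_i w_i(\theta_i - \theta_{i-1})$ equals a weighted average $\sum_j c_j \theta_j$ with $c_0 = 1 - w_1$, $c_j = w_j - w_{j+1}$, and $c_n = w_n$. The sketch below checks this numerically; it is an illustration of the idea, not the paper's derivation, and all names are hypothetical.

```python
def merge_coefficients(decay_factors):
    """Checkpoint weights c_0..c_n equivalent to scaling update i by decay_factors[i-1]."""
    coeffs = [1.0 - decay_factors[0]]
    for j in range(len(decay_factors) - 1):
        coeffs.append(decay_factors[j] - decay_factors[j + 1])
    coeffs.append(decay_factors[-1])
    return coeffs


def merge_checkpoints(checkpoints, coeffs):
    """Weighted average of checkpoints, each a list of floats."""
    dim = len(checkpoints[0])
    return [sum(c * ckpt[k] for c, ckpt in zip(coeffs, checkpoints)) for k in range(dim)]


def replay_with_decay(checkpoints, decay_factors):
    """Re-apply the recorded updates, each scaled by its decay factor."""
    theta = list(checkpoints[0])
    for i, w in enumerate(decay_factors, start=1):
        theta = [
            t + w * (cur - prev)
            for t, cur, prev in zip(theta, checkpoints[i], checkpoints[i - 1])
        ]
    return theta


checkpoints = [[0.0, 1.0], [1.0, 0.5], [1.5, 0.25], [1.75, 0.125]]
linear_decay = [0.75, 0.5, 0.25]
coeffs = merge_coefficients(linear_decay)
print(coeffs, merge_checkpoints(checkpoints, coeffs), replay_with_decay(checkpoints, linear_decay))
```

In this toy case a linear decay corresponds to a uniform average of the checkpoints, which gives some intuition for why averaging over a longer window (merge duration) behaves like a longer decay phase.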
Foundation / Frontier Models (incl. LLMs)
LLM Pre-training
#maximal update parametrization #llm #pretraining #hyperparameter transfer #learning dynamics #adamw #mup #weight decay #hyperparameter tuning #scaling law #transformer
TL;DR:Empirically-focused study showing µP requires weight decay to successfully transfer learning rates across model sizes in practice.
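🎯 Motivation: Hyperparameter tuning is extremely expensive when training large models, so transferring the optimal learning rate from small models to large ones is a key problem. µP proposes a learning rate scaling designed to keep the update dynamics of internal representations stable, but its assumptions may not fully hold in practical settings.
❓ Problem: Examines the limits of µP's assumptions and the actual role of weight decay in stabilizing training dynamics, with the aim of improving learning rate transfer, and tests in practice how well learning rate transfer works and how well µP's design fits it.
🔍 Analysis: µP's scaling assumptions hold only briefly early in training, whereas weight decay keeps the update dynamics of internal representations consistent across widths for the rest of training. This suggests µP acts mainly as an implicit learning rate warmup and that its transfer performance depends on weight decay.
🛠️ Method: Runs a large-scale empirical study of µP's assumptions and of weight decay's effect on the stability of update dynamics, and designs modified learning rate warmup schedules that can largely replace µP's scaling.
📊 Data & Experiments: Compares the practical effects of µP and weight decay in LLM training and other settings where learning rate transfer is most valuable, analyzing how the dynamics and the effect of learning rate scaling change across model widths.
⭐ Contributions: Challenges prevailing beliefs about µP's role in learning rate transfer, and highlights the importance of weight decay and the viability of modified warmup schedules as a replacement, offering a new perspective on hyperparameter transfer.
Abstract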
Transferring the optimal learning rate from small to large neural networks can enable efficient training at scales where hyperparameter tuning is otherwise prohibitively expensive. To this end, the Maximal Update Parameterization (µP) proposes a learning rate scaling designed to keep the update dynamics of internal representations stable across different model widths. However, the scaling rules of µP rely on strong assumptions, particularly about the geometric alignment of a layer’s inputs with both its weights and gradient updates. In this large-scale empirical investigation, we show that these assumptions hold only briefly at the start of training in the practical setups where learning rate transfer is most valuable, such as LLM training. For the remainder of training it is weight decay rather than µP that correctly stabilizes the update dynamics of internal representations across widths, facilitating learning rate transfer. This suggests µP's scaling primarily acts as a form of implicit learning rate warmup, allowing us to largely replace it with modified warmup schedules. Together these findings fundamentally challenge prevailing beliefs about learning rate transfer and can explain empirical observations such as why µP requires the independent weight decay variant for good transfer.
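The abstract's remark about the independent weight decay variant can be made concrete with a small numeric sketch. The abstract does not define the variant, so the code makes two assumptions. First, µP scales the hidden-layer learning rate as 1/width under Adam. Second, "independent" means the per-step decay is not multiplied by the learning rate, in contrast to the coupled form `1 - lr * weight_decay` used by `torch.optim.AdamW`. All numeric values are hypothetical.

```python
def mup_hidden_lr(base_lr, base_width, width):
    """Assumed µP-style rule for hidden matrices under Adam: lr scales as 1/width."""
    return base_lr * base_width / width


def per_step_shrink(lr, weight_decay, independent):
    """Fraction of each weight removed by decay in a single optimizer step.

    Coupled (AdamW-style):  lr * weight_decay
    Independent (assumed):  weight_decay alone, unaffected by lr
    """
    return weight_decay if independent else lr * weight_decay


base_lr, base_width = 1e-2, 256  # hypothetical tuning setup at the small width
widths = [256, 1024, 4096]

# Coupled decay inherits the 1/width factor from the µP learning rate.
coupled = [
    per_step_shrink(mup_hidden_lr(base_lr, base_width, w), 0.1, independent=False)
    for w in widths
]

# Independent decay is chosen to match the coupled value at the base width.
independent = [
    per_step_shrink(mup_hidden_lr(base_lr, base_width, w), 1e-3, independent=True)
    for w in widths
]

for w, c, i in zip(widths, coupled, independent):
    print(f"width={w:5d}  coupled shrink={c:.2e}  independent shrink={i:.2e}")
```

Under these assumptions the coupled shrinkage weakens by 16x between width 256 and width 4096, while the independent shrinkage stays fixed. This is one plausible reading of why the decay variant matters for transfer across widths.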
Foundation / Frontier Models (incl. LLMs)
LLM Pre-training
#Cross-Entropy Loss #Error-Entropy #Neural Scaling Laws #Loss Decomposition #Large Language Models
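🎯 Motivation: The cross-entropy scaling law has long been a core tool for guiding large language model development, but it breaks down at very large scales, which calls for an investigation of its underlying mechanism.
❓ Problem: Identifies the root cause of the breakdown of the cross-entropy scaling law and proposes a framework that describes large-model behavior more accurately.
🔍 Analysis: At very large scales the cross-entropy scaling law is no longer accurate: loss decreases more slowly than predicted. This indicates that cross-entropy as a whole does not scale; only one of its hidden components does.
🛠️ Method: Introduces a decomposition of cross-entropy into three parts (Error-Entropy, Self-Alignment, and Confidence) and shows, both theoretically and empirically, that the decomposition captures the training dynamics and optimization objectives.
📊 Data & Experiments: Experiments on multiple datasets and 32 models spanning five orders of magnitude in size show that only error-entropy follows a robust power-law scaling, while the other two terms remain largely invariant.
⭐ Contributions: Proposes the error-entropy scaling law as a replacement for the traditional cross-entropy scaling law, describing LLM training behavior more accurately and supporting model development and understanding.
Abstract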
The cross-entropy scaling law has long served as a key tool for guiding the development of large language models. It shows that cross-entropy loss decreases at a predictable power-law rate as the model size increases. However, recent evidence indicates that this law breaks down at very large scales: the loss decreases more slowly than expected, which causes significant trouble for developing large language models. In this paper, we hypothesize that the root cause lies in the fact that cross-entropy itself does not truly scale; instead, only one of its hidden components does. To investigate this, we introduce a novel decomposition of cross-entropy into three parts: Error-Entropy, Self-Alignment, and Confidence. We show both theoretically and empirically that this decomposition precisely captures the training dynamics and optimization objectives. Through extensive experiments on multiple datasets and 32 models spanning five orders of magnitude in size, we find that only error-entropy follows a robust power-law scaling, while the other two terms remain largely invariant. Moreover, error-entropy constitutes the dominant share of cross-entropy in small models but diminishes in proportion as models grow larger. This explains why the cross-entropy scaling law appears accurate at small scales but fails at very large ones. Our findings establish the error-entropy scaling law as a more accurate description of model behavior. We believe it will have wide applications in the training, understanding, and future development of large language models.
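The argument that a sum of one scaling term and roughly invariant terms stops following a clean power law can be checked numerically. The sketch below uses synthetic data only. The constant `invariant_part` is a hypothetical stand-in for the non-scaling terms, not the paper's definitions of Self-Alignment or Confidence, and the coefficients 5.0 and 0.1 are arbitrary.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**(-b) in log-log space; returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic model sizes spanning five orders of magnitude.
sizes = [10 ** k for k in range(5, 11)]

# A component that follows an exact power law (arbitrary coefficients).
scaling_part = [5.0 * n ** -0.1 for n in sizes]

# Hypothetical stand-in for terms that do not change with model size.
invariant_part = 1.5
total = [s + invariant_part for s in scaling_part]

a, b = fit_power_law(sizes, scaling_part)
_, b_total = fit_power_law(sizes, total)
print(f"scaling component exponent: {b:.4f}")
print(f"apparent exponent of the sum: {b_total:.4f}")
```

The fit recovers the exponent of the scaling component, while the apparent exponent of the sum is smaller. As the scaling component shrinks relative to the invariant part, the sum flattens further, which matches the abstract's account of why the law looks accurate at small scales and fails at large ones.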
Foundation / Frontier Models (incl. LLMs)
LLM Pre-training
#xLSTM #Transformers #Scaling Laws #Sequence Modeling #TFLA #Linear Attention #Inference
TL;DR:Scaling laws for linear time-complexity xLSTM model, comparing against Transformers for both training and inference.
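🎯 Motivation: Extends scaling laws for large language models to xLSTM, an architecture with linear time complexity in context length, to assess its performance and potential relative to the dominant Transformer.
❓ Problem: Studies how xLSTM scales with parameter count and context length, and compares it with Transformers in both training and inference.
🔍 Analysis: For the same compute budget, xLSTM reaches lower cross-entropy loss than Transformers in typical training and inference scenarios, helped by its linear complexity.
🛠️ Method: Uses IsoFLOP and parametric fit approaches to analyze xLSTM's scaling behavior across model sizes and training data sizes, and examines how the optimal model size depends on context length.
📊 Data & Experiments: Covers model sizes from 80M to 7B parameters and 2B to 2T training tokens, comparing the scaling of xLSTM and Transformers in compute-optimal and over-training regimes.
⭐ Contributions: Shows that xLSTM Pareto-dominates Transformers when scaled efficiently, providing guidance for designing more efficient and flexible language models.
Abstract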
Scaling laws play a central role in the success of Large Language Models (LLMs), enabling the prediction of model performance relative to compute budgets prior to training.
While Transformers have been the dominant architecture, recent alternatives such as xLSTM offer linear complexity with respect to context length while remaining competitive in the billion-parameter regime.
We conduct a comparative investigation on the scaling behavior of Transformers and xLSTM along the following lines, providing insights to guide future model design and deployment.
First, we study the scaling behavior for xLSTM in compute-optimal and over-training regimes using both IsoFLOP and parametric fit approaches on a wide range of model sizes (80M-7B) and number of training tokens (2B-2T).
Second, we examine the dependence of optimal model sizes on context length, a pivotal aspect that was largely ignored in previous work.
Finally, we analyze inference-time scaling characteristics.
Our findings reveal that in typical LLM training and inference scenarios, xLSTM scales favorably compared to Transformers.
Notably, xLSTM models consistently Pareto-dominate Transformer models, delivering lower cross-entropy loss for the same compute budget.
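A rough cost model shows why context length matters when comparing a quadratic-attention architecture with a linear-time one. This is a back-of-the-envelope sketch, not the FLOP accounting used in the paper. It ignores constant factors, feed-forward layers, and kernel efficiency, and the `state_size` parameter and all numeric values are hypothetical.

```python
def attention_mixing_cost(context_len, d_model):
    """Rough sequence-mixing cost of full self-attention: grows as T^2 * d."""
    return context_len ** 2 * d_model


def recurrent_mixing_cost(context_len, d_model, state_size):
    """Rough sequence-mixing cost of a fixed-state recurrent mixer: grows as T * d * s."""
    return context_len * d_model * state_size


d_model, state_size = 4096, 256  # hypothetical model dimensions

for context_len in (2048, 8192, 32768):
    quadratic = attention_mixing_cost(context_len, d_model)
    linear = recurrent_mixing_cost(context_len, d_model, state_size)
    print(f"T={context_len:6d}  attention/recurrent cost ratio = {quadratic / linear:.0f}")
```

In this simplified model the cost ratio equals `context_len / state_size`, so it grows linearly with context length. This is the kind of dependence that would make the compute-optimal model size shift with context length.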