Actor-Critic / PPO family (201 papers)
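🎯 Motivation: Optimizing the dynamic composition of pretraining data is crucial for LLM generalization, but existing methods fail to balance computational efficiency, sample efficiency, and structural flexibility.
❓ Problem: Proposes a dynamic data mixing method from a reinforcement learning perspective to address the bottlenecks of current dynamic mixing strategies in training efficiency and adaptability to diverse pipelines.
🔍 Analysis: Theoretical analysis proves that the parameterized data mixing policy acts as a dynamic linear surrogate that maximizes the constructive interference of gradients.
🛠️ Method: Proposes AC-ODM, which supports two modes: a proxy mode (a policy learned on a small model is transferred to a larger one) and a non-proxy mode (direct end-to-end training), to suit different application scenarios.
📊 Data and experiments: Validated on several architectures including Pythia-1B; AC-ODM markedly improves convergence speed and downstream accuracy while adding very little time and memory overhead.
⭐ Main contributions: Proposes AC-ODM, an efficient and flexible dynamic data mixing method that markedly improves LLM pretraining efficiency and generalization and offers a new paradigm for future data optimization.
Abstract: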
Optimizing pretraining data composition is pivotal for LLM generalization. While dynamic mixing outperforms static strategies by capturing evolving training dynamics, current methods fail to reconcile computational efficiency with sample efficiency and structural flexibility for diverse pipelines. We introduce Actor-Critic Online Data Mixing (AC-ODM), which approaches data mixing from a reinforcement learning perspective with a parameterized policy that we theoretically prove to act as a dynamic linear surrogate maximizing the constructive interference of gradients. To enhance practical flexibility, AC-ODM supports two operational modes: (i) a proxy mode for fixed, pre-prepared corpora, where a policy learned on a small model is transferred to a larger target; and (ii) a non-proxy mode for direct end-to-end training from scratch without priors. Empirically, AC-ODM significantly outperforms prior methods in convergence speed and downstream accuracy across various architectures. On Pythia-1B, it reaches optimal validation perplexity using up to 66% fewer training steps than competitive baselines, delivering a 27.5% relative improvement in MMLU accuracy and a 2.23× higher pass@1 on HumanEval, all while incurring a virtually negligible (~0.4%) per-step wall-clock increase and only 2% additional memory overhead.
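🎯 Motivation: Current speech large language models (SLLMs) lack verifiable acoustic evidence in emotion reasoning, while SSL encoders offer strong acoustic representations but lack interpretability.
❓ Problem: Aims to bridge the gap between the reasoning ability of SLLMs and the acoustic representations of SSL encoders, improving the recognition and explanation of complex emotional expressions.
🔍 Analysis: Emotions are complex and co-occurring; existing methods often treat minority annotations as noise and fail to use this information to make emotion reasoning more complete.
🛠️ Method: Proposes the ADEPT framework, which uses a multi-turn inquiry process to turn emotion recognition into a structured pipeline of candidate generation, evidence collection, and adjudication, and combines GRPO with an Evidence Trust Gate to reinforce evidence-based reasoning.
📊 Data and experiments: Experiments show that ADEPT improves primary emotion accuracy in most settings, substantially improves minor emotion recognition, and produces auditable explanations.
⭐ Main contributions: Shifts the paradigm from consensus learning to ambiguity-driven emotion reasoning and treats minority annotations as informative signals, improving both recognition accuracy and interpretability.
Abstract: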
Speech Large Language Models (SLLMs) enable high-level emotion reasoning, but often produce ungrounded, text-biased judgments without verifiable acoustic evidence. In contrast, SSL encoders such as WavLM yield strong acoustic representations yet remain opaque discriminative models that offer limited interpretability. To bridge this gap, we introduce the Agentic Decoding of Emotion via Probing Tools (ADEPT) framework, which reframes emotion recognition as a multi-turn inquiry process rather than a single-pass prediction. ADEPT transforms an SLLM into an agent that maintains an evolving candidate set and adaptively invokes dedicated semantic and acoustic probing tools within a structured pipeline of candidate generation, evidence collection, and adjudication. Crucially, ADEPT enables a paradigm shift from consensus learning to ambiguity-driven emotion reasoning. Since human affect exhibits complexity and co-occurrence of emotions, we leverage minority annotations as informative signals instead of discarding them as noise. Finally, we integrate Group Relative Policy Optimization (GRPO) with the Evidence Trust Gate to explicitly couple tool-usage behaviors with prediction quality and enforce evidence-based reasoning. Experiments demonstrate that ADEPT improves primary emotion accuracy in most cases while substantially improving minor emotion characterization, producing explanations grounded in auditable evidence.
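🎯 Motivation: Agentic reinforcement learning is seen as an important paradigm for solving complex interactive tasks, but the instability of policy optimization calls for systematic analysis and remedies.
❓ Problem: Proposes a unified framework for analyzing how different policy gradient dimensions affect agentic RL, identifies the key sources of training instability, and designs a stable optimization method.
🔍 Analysis: Fine-grained analysis reveals the role and potential impact of multiple policy gradient dimensions in agentic RL, yielding a unified view of training instability.
🛠️ Method: Proposes the ARLArena analysis framework and, guided by its findings, designs SAMPO, which improves the stability of policy training while maintaining strong performance.
📊 Data and experiments: SAMPO's stability and performance are verified on a variety of agentic tasks with open-sourced code; results show strong and stable training across tasks.
⭐ Main contributions: Provides a unified policy gradient perspective on agentic RL, proposes a stable method, and improves training stability on complex tasks as well as the reproducibility of LLM-based agent training pipelines.
Abstract: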
Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks. In this paper, we first propose ARLArena, a fair and systematic analysis framework that encompasses a broad spectrum of ARL algorithms and decomposes policy optimization (PO) through multiple policy gradient dimensions. Through this fine-grained analysis, we distill a unified perspective on ARL and, guided by the identified governing factors, propose SAMPO, a stable agentic PO method designed to mitigate the dominant sources of instability in ARL. Empirically, SAMPO achieves consistently stable training and strong performance across diverse agentic tasks. Overall, this study provides a unifying policy gradient perspective for ARL and offers practical guidance for building stable and reproducible LLM-based agent training pipelines. Our codebase is open-sourced at https://anonymous.4open.science/r/SAMPO-02B3.
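🎯 Motivation: Healthcare and marketing need optimized individualized dosing policies to maximize utility, but experiments are costly and budgets are limited, so efficient active policy learning is required.
❓ Problem: Existing methods mainly target binary treatments and effect estimation; policy optimization for continuous doses is underexplored.
🔍 Analysis: Exploiting the structure of dose-response curves, the policy optimization regret is shown to be bounded by the gradient variance at the estimated optimal doses.
🛠️ Method: Proposes GVALID, a batch acquisition strategy that selects samples to minimize the target gradient variance, enabling efficient learning of individualized dosing policies.
📊 Data and experiments: Experiments show that GVALID outperforms existing methods under strict budget constraints.
⭐ Main contributions: Introduces the first active learning framework for continuous dose optimization, theoretically links gradient variance to policy regret, and designs an efficient sampling algorithm.
Abstract: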
In domains such as healthcare and marketing, learning optimal individualized dosing policies to maximize utility is crucial, yet high experimental costs impose strict budget constraints, necessitating efficient active policy learning. Existing active learning methods in causal inference primarily focus on binary treatments and effect estimation, leaving continuous dosing and policy optimization underexplored. To address this gap, we propose an active learning framework tailored for optimal policy learning. Exploiting the inherent structure of dose-response curves, we theoretically show that the policy optimization regret is bounded by the expected posterior gradient variance at the estimated optimal doses. Motivated by this result, we introduce Gradient Variance Active Learning for Individualized Dosing (GVALID), a batch acquisition strategy that greedily selects samples to minimize target gradient variance for efficient policy learning. Experiments demonstrate that GVALID achieves superior performance under strict budget constraints.
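🎯 Motivation: As LLMs are widely used for code generation, protecting code intellectual property requires watermarking techniques that respect the syntactic constraints of code.
❓ Problem: Proposes a framework that preserves code functionality when embedding a watermark and makes the watermark statistically detectable through subtle deviations in the generated code.
🔍 Analysis: Watermarked code faces a trade-off between functionality and detectability, which requires balancing process-level and outcome-level feedback signals.
🛠️ Method: A policy-driven, RL-based approach uses a parameterized model to bias token choices during next-token prediction, with Gumbel Top-k reparameterization enabling gradient-based optimization.
📊 Data and experiments: Comparative evaluation on multiple benchmarks shows that the method outperforms existing techniques in both watermark detectability and code functionality.
⭐ Main contributions: Develops CodeTracer, which embeds detectable watermarks while preserving code functionality; proposes a reward system that combines execution feedback with watermark signals; releases code to support follow-up research.
Abstract: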
As LLMs increasingly generate production code, protecting intellectual property demands watermarking techniques that respect code's strict syntactic constraints. In this work, we introduce CodeTracer, an innovative adaptive code watermarking framework underpinned by a reinforcement learning training paradigm. At its core, CodeTracer features a policy-driven approach that utilizes a parameterized model to intelligently bias token choices during next-token prediction. This strategy ensures that embedded watermarks maintain code functionality while exhibiting subtle yet statistically detectable deviations from typical token distributions. To facilitate policy learning, we devise a comprehensive reward system that seamlessly integrates execution feedback with watermark embedding signals, balancing process-level and outcome-level rewards. To enable gradient-based optimization of these discrete watermarking decisions, we employ Gumbel Top-k reparameterization. Extensive comparative evaluations demonstrate that CodeTracer outperforms state-of-the-art baselines across multiple benchmarks in both watermark detectability and code functionality. Our code is available at https://anonymous.4open.science/r/CodeTracer-B8EE.
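The Gumbel Top-k step mentioned in the abstract can be illustrated with a minimal sketch of the standard Gumbel top-k trick: perturb each logit with independent Gumbel(0, 1) noise and keep the k largest, which draws k distinct items without replacement. This shows only the sampling side; the differentiable relaxation and how CodeTracer applies it to token choices are not described in the abstract, and the function name `gumbel_top_k` is hypothetical.

```python
import math
import random


def gumbel_top_k(logits, k, rng=random):
    """Sample k distinct indices via the Gumbel top-k trick (illustrative sketch)."""
    perturbed = []
    for index, logit in enumerate(logits):
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        perturbed.append((logit + gumbel_noise, index))
    # Keep the k largest perturbed logits.
    perturbed.sort(reverse=True)
    return [index for _, index in perturbed[:k]]
```

Calling `gumbel_top_k(logits, k, random.Random(seed))` makes the draw reproducible; items with larger logits are more likely to appear early in the returned list.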
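🎯 Motivation: Reinforcement learning has become a key paradigm for LLMs, but for diffusion models the RL objective differs from the pretraining objective, which hurts optimization efficiency.
❓ Problem: To address the higher variance and slower convergence of existing methods, proposes a method that unifies the pretraining and RL objectives.
🔍 Analysis: Theoretical analysis shows that DDPO is an implicit form of score/flow matching with noisy targets, which increases variance and slows convergence.
🛠️ Method: Proposes Advantage Weighted Matching (AWM), which uses the score/flow matching loss and weights each sample by its advantage, raising the influence of high-reward samples while keeping the same modeling objective as pretraining.
📊 Data and experiments: On the GenEval, OCR, and PickScore benchmarks with Stable Diffusion 3.5 Medium and FLUX, convergence is up to 34× faster than Flow-GRPO without compromising generation quality.
⭐ Main contributions: Unifies the pretraining and RL objectives for diffusion models, proposes AWM to reduce variance and speed up convergence, and provides source code.
Abstract: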
Reinforcement Learning (RL) has emerged as a central paradigm for advancing Large Language Models (LLMs), where both pre-training and RL post-training stages are grounded in the same log-likelihood formulation. In contrast, recent RL approaches for diffusion models, most notably Denoising Diffusion Policy Optimization (DDPO), optimize an objective different from the pretraining objective, namely the score/flow matching loss. In this work, we establish a novel theoretical analysis: DDPO is an implicit form of score/flow matching with noisy targets, which increases variance and slows convergence. Building on this analysis, we introduce Advantage Weighted Matching (AWM), a policy-gradient method for diffusion. It uses the score/flow-matching loss and reweights each sample by its advantage. In effect, AWM raises the influence of high-reward samples and suppresses low-reward ones while keeping the modeling objective identical to pretraining. This unifies pretraining and RL conceptually and practically and reduces variance, yielding faster convergence. This simple yet effective design yields substantial benefits: on the GenEval, OCR, and PickScore benchmarks, AWM delivers up to a 34× speedup over Flow-GRPO (which builds on DDPO), when applied to Stable Diffusion 3.5 Medium and FLUX, without compromising generation quality. Code is provided in the supplementary material.
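The reweighting idea in the abstract can be made concrete with a small sketch: compute a per-sample flow-matching error and weight it by that sample's advantage. Everything beyond that sentence is an assumption made for illustration: the group-standardized advantage, the velocity target convention (noise minus sample), and the function names are not taken from the paper.

```python
def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within a group (assumed, GRPO-style baseline)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


def flow_matching_error(pred_velocity, noise, sample):
    """Mean squared error against the velocity target (assumed: noise - sample)."""
    target = [n - s for n, s in zip(noise, sample)]
    return sum((p - t) ** 2 for p, t in zip(pred_velocity, target)) / len(target)


def advantage_weighted_loss(pred_velocities, noises, samples, rewards):
    """Average of per-sample matching errors, each weighted by its advantage."""
    advantages = group_advantages(rewards)
    errors = [
        flow_matching_error(p, n, s)
        for p, n, s in zip(pred_velocities, noises, samples)
    ]
    return sum(a * e for a, e in zip(advantages, errors)) / len(errors)
```

Minimizing this quantity pulls the model toward samples with positive advantage and pushes it away from samples with negative advantage, while the per-sample term stays the ordinary matching loss.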
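🎯 Motivation: Managing an agent's thoughts and observations in multi-turn interaction can improve efficiency, but existing methods do not distinguish how thought necessity and observation utility vary across turns.
❓ Problem: Proposes a method for adaptively omitting redundant thoughts and observations during interaction, improving both effectiveness and efficiency.
🔍 Analysis: Quantitative analysis clarifies how thoughts and observations affect agent efficiency and effectiveness differently across turns.
🛠️ Method: Proposes the Agent-Omit framework, which fine-tunes omission behavior on cold-start data and then applies reinforcement learning with a dual sampling mechanism and a tailored reward.
📊 Data and experiments: On five agent benchmarks, Agent-Omit-8B performs comparably to seven frontier LLM agents and achieves a better effectiveness-efficiency trade-off than seven efficient-agent methods.
⭐ Main contributions: Proposes a unified training framework for adaptive omission of redundant thoughts and observations, supported by both theoretical and empirical evidence of its efficiency and effectiveness.
Abstract: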
Managing agent thought and observation during multi-turn agent-environment interactions is an emerging strategy to improve agent efficiency. However, existing studies treat entire interaction trajectories equally, overlooking that thought necessity and observation utility vary across turns. To this end, we first conduct quantitative investigations into how thought and observation affect agent effectiveness and efficiency. Based on our findings, we propose Agent-Omit, a unified training framework that empowers LLM agents to adaptively omit redundant thoughts and observations. Specifically, we first synthesize a small amount of cold-start data, including both single-turn and multi-turn omission scenarios, to fine-tune the agent for omission behaviors. Furthermore, we introduce an omit-aware agentic reinforcement learning approach, incorporating a dual sampling mechanism and a tailored omission reward to incentivize the agent's adaptive omission capability. Theoretically, we prove that the deviation of our omission policy is upper-bounded by KL-divergence. Experimental results on five agent benchmarks show that our constructed Agent-Omit-8B obtains performance comparable to seven frontier LLM agents and achieves a better effectiveness-efficiency trade-off than seven efficient LLM agent methods. Our code and data are available at https://anonymous.4open.science/r/Agent-Omit/
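🎯 Motivation: Existing GRPO-based text-to-image methods propagate rewards without distinguishing the local effect of each step and ignore within-trajectory dependencies, which leads to step-wise reward sparsity.
❓ Problem: Proposes TP-GRPO, a new GRPO framework that models step-level rewards and long-term effects to alleviate reward sparsity and improve generation.
🔍 Analysis: Early denoising actions can affect later states through delayed, implicit interactions, and existing methods do not capture how influence varies across steps.
🛠️ Method: TP-GRPO introduces step-level incremental rewards to provide a dense learning signal, detects key steps (turning points) through sign changes in those rewards, and assigns them an aggregated long-term reward to capture their delayed impact.
📊 Data and experiments: Extensive experiments show that TP-GRPO exploits reward signals more effectively and consistently improves generation on multiple metrics.
⭐ Main contributions: Introduces step-level rewards and a turning-point mechanism to address reward sparsity and long-term effect modeling in an efficient, hyperparameter-free framework.
Abstract: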
Deploying GRPO on Flow Matching models has proven effective for text-to-image generation. However, existing paradigms typically propagate an outcome-based reward to all preceding denoising steps without distinguishing the local effect of each step. Moreover, current group-wise ranking mainly compares trajectories at matched timesteps and ignores within-trajectory dependencies, where certain early denoising actions can affect later states via delayed, implicit interactions. We propose TurningPoint-GRPO (TP-GRPO), a GRPO framework that alleviates step-wise reward sparsity and explicitly models long-term effects within the denoising trajectory. TP-GRPO makes two key innovations: (i) it replaces outcome-based rewards with step-level incremental rewards, providing a dense, step-aware learning signal that better isolates each denoising action’s "pure" effect, and (ii) it identifies turning points (steps that flip the local reward trend and make subsequent reward evolution consistent with the overall trajectory trend) and assigns these actions an aggregated long-term reward to capture their delayed impact. Turning points are detected solely via sign changes in incremental rewards, making TP-GRPO efficient and hyperparameter-free. Extensive experiments also demonstrate that TP-GRPO exploits reward signals more effectively and consistently improves generation.
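A rough sketch of the two ingredients described above, incremental rewards and sign-change detection, is given below. It assumes a reward estimate is available after every denoising step, checks only the flipping step itself against the overall trend, and uses "final reward minus the reward before the step" as the aggregated long-term reward. These are simplifying assumptions for illustration, and all names are hypothetical.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def incremental_rewards(state_rewards):
    """state_rewards[i] is the reward estimate after i denoising steps."""
    return [b - a for a, b in zip(state_rewards, state_rewards[1:])]


def turning_point_rewards(state_rewards):
    """Map each detected turning-point step to an assumed aggregated reward."""
    deltas = incremental_rewards(state_rewards)
    overall_trend = sign(state_rewards[-1] - state_rewards[0])
    result = {}
    for t in range(1, len(deltas)):
        # A flip means this step's gain has the opposite sign of the previous one.
        flipped = sign(deltas[t]) * sign(deltas[t - 1]) < 0
        if flipped and sign(deltas[t]) == overall_trend:
            # Assumed aggregation: total change from before step t to the end.
            result[t] = state_rewards[-1] - state_rewards[t]
    return result
```

For a trajectory whose reward first falls and then rises to a net gain, only the step where the rise begins is flagged, and it receives the whole subsequent gain rather than just its own increment.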
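🎯 Motivation: Reinforcement learning with verifiable rewards is often viewed as a tree pruning process, but this process is prone to recursive space contraction, which destroys the policy's ability to explore.
❓ Problem: Existing remedies such as KL regularization are limited: they create gradient conflicts and cannot balance correctness and diversity. This work proposes a new paradigm to mitigate the problem.
🔍 Analysis: Reveals the mechanism behind recursive space contraction: positive sharpening and negative squeezing jointly drive the sampling probability of valid alternatives toward zero.
🛠️ Method: Proposes Anchored Policy Optimization (APO), which defines a safe manifold from a high-confidence support set, selectively applies a restorative force during error correction, and still allows efficient sharpening, in order to prevent exploration collapse.
📊 Data and experiments: Experiments on mathematical benchmarks show that APO significantly improves Pass@1 while restoring the Pass@K diversity lost by standard methods.
⭐ Main contributions: Proposes replacing global shape matching with support coverage; shows theoretically that APO maximizes support coverage and enables exploration to recover; demonstrates empirically that the method breaks the accuracy-diversity trade-off.
Abstract: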
Reinforcement Learning with Verifiable Rewards (RLVR) is increasingly viewed as a tree pruning mechanism. However, we identify a systemic pathology termed Recursive Space Contraction (RSC), an irreversible collapse driven by the combined dynamics of positive sharpening and negative squeezing, where the sampling probability of valid alternatives vanishes. While Kullback-Leibler (KL) regularization aims to mitigate this, it imposes a rigid Shape Matching constraint that forces the policy to mimic the reference model's full density, creating a gradient conflict with the sharpening required for correctness. We propose Anchored Policy Optimization (APO), shifting the paradigm from global Shape Matching to Support Coverage. By defining a Safe Manifold based on the reference model's high-confidence support, APO permits aggressive sharpening for efficiency while selectively invoking a restorative force during error correction to prevent collapse. We theoretically derive that APO serves as a gradient-aligned mechanism to maximize support coverage, enabling an Elastic Recovery that re-inflates valid branches. Empirical evaluations on mathematical benchmarks demonstrate that APO breaks the accuracy-diversity trade-off, significantly improving Pass@1 while restoring the Pass@K diversity typically lost by standard policy gradient methods.
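🎯 Motivation: Code repair is an important capability for language models: given a buggy program and test cases, the model must produce a fix that passes the tests. The work aims to scale supervision for code repair using model-generated bug-fix tasks.
❓ Problem: Addresses the drift between bugs generated during self-play and real-world bugs, improving repair performance in practice.
🔍 Analysis: Single-model self-play can generate increasingly difficult bugs, but training may leave the model good at fixing self-generated bugs and worse at fixing real ones.
🛠️ Method: Proposes Anchored Self Play (ASP), which adds a code-embedding similarity reward to guide bug generation and mixes reference bugs into fixer training to prevent drift away from real bugs.
📊 Data and experiments: Introduces the BugSourceBench benchmark, covering human-authored bugs, human-edited buggy LM code, and errors in LM-generated code. Experiments show that ASP improves repair across bug sources.
⭐ Main contributions: ASP raises fix rates across bug sources, improving the average fix rate by 25% (relative) / 7.2 percentage points (absolute) over standard self-play, with clear gains on both LM-error bugs and human-authored bugs.
Abstract: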
Code repair is an important capability for language models (LMs): given a buggy program and unit tests, an LM must produce a fixed program that passes the tests. We aim to scale supervision for code repair by having an LM generate bug-fix tasks with unconstrained edits, using unit tests as the only verifier. We propose generator-fixer self-play, in which a single model is trained with reinforcement learning to alternate between generating bugs and fixing them. As the fixer improves, the generator adapts to produce increasingly difficult bugs, yielding an automatic curriculum. However, because unit tests certify correctness but not realism, we find that the generator can drift from bugs encountered in practice, improving repair on self-generated bugs while degrading on real-world bugs. We propose Anchored Self Play (ASP), which anchors self-play with a small reference set by (i) adding a code-embedding similarity reward to guide generation and (ii) mixing reference bugs into fixer training to prevent drift. To reflect LM-assisted programming, where bugs come from humans, LMs, and human edits of LM code, we introduce BugSourceBench, a code repair benchmark spanning human-authored bugs, human-edited buggy LM code, and errors in LM-generated code. Across bug sources, ASP achieves the best fix rates, improving average fix rate by +25% (relative) / +7.2 pp (absolute) over standard self-play, with gains on both LM-error bugs (+100% relative / +11 pp absolute) and human-authored bugs (+7.1% relative / +3.4 pp absolute).
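Each result above pairs a relative gain with an absolute gain, and together the two numbers determine the baseline: the relative gain is the absolute gain divided by the baseline fix rate. The small helper below recovers the implied baseline and improved rates. The function name is hypothetical, and because the reported figures are rounded, the recovered values are approximate.

```python
def implied_fix_rates(relative_gain, absolute_gain_pp):
    """Return (baseline, improved) fix rates in percent implied by the two gains.

    relative_gain is a fraction (0.25 for +25%); absolute_gain_pp is in
    percentage points. Uses relative_gain = absolute_gain_pp / baseline.
    """
    baseline = absolute_gain_pp / relative_gain
    return baseline, baseline + absolute_gain_pp


average = implied_fix_rates(0.25, 7.2)
lm_error = implied_fix_rates(1.00, 11.0)
human_authored = implied_fix_rates(0.071, 3.4)
```

Under this reading, the average fix rate moves from roughly 28.8% to 36.0%, LM-error bugs from about 11% to 22%, and human-authored bugs from about 47.9% to 51.3%.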
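🎯 Motivation: Multimodal large language models doing time-series anomaly detection rely on coarse heuristic reasoning and struggle with the detailed reasoning that complex multi-dimensional time series require.
❓ Problem: Reinforces the model to ground its reasoning in precise structural details of the time series, unifying anomaly classification, localization, and explanation, and overcoming the weakness of existing models in fine-grained reasoning.
🔍 Analysis: Existing approaches reason poorly over time-series data and miss the subtle features of multi-dimensional anomalies, which hurts classification and localization accuracy.
🛠️ Method: Proposes the AnomSeer framework, which introduces expert reasoning chains generated from classical analyses and designs time-series grounded policy optimization (TimerPO), including an optimal-transport-based time-series advantage and an orthogonal projection that keeps the auxiliary signal from interfering with the detection objective.
📊 Data and experiments: Across datasets covering diverse anomaly scenarios, AnomSeer with Qwen2.5-VL-3B/7B-Instruct shows notable gains in classification and localization accuracy, especially on point and frequency-driven anomalies.
⭐ Main contributions: Proposes AnomSeer to address the lack of fine-grained reasoning about time-series anomalies in multimodal language models, improving classification, localization, and explanation while providing plausible reasoning traces that support its decisions.
Abstract: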
Time-series anomaly detection (TSAD) with multimodal large language models (MLLMs) is an emerging area, yet a persistent challenge remains: MLLMs rely on coarse time-series heuristics but struggle with multi-dimensional, detailed reasoning, which is vital for understanding complex time-series data. We present AnomSeer to address this by reinforcing the model to ground its reasoning in precise, structural details of time series, unifying anomaly classification, localization, and explanation. At its core, an expert chain-of-thought trace is generated to provide a verifiable, fine-grained reasoning from classical analyses (e.g., statistical measures, frequency transforms). Building on this, we propose a novel time-series grounded policy optimization (TimerPO) that incorporates two additional components beyond standard reinforcement learning: a time-series grounded advantage based on optimal transport and an orthogonal projection to ensure this auxiliary granular signal does not interfere with the primary detection objective. Across diverse anomaly scenarios, AnomSeer, with Qwen2.5-VL-3B/7B-Instruct, outperforms larger commercial baselines in classification and localization accuracy, particularly on point- and frequency-driven exceptions. Moreover, it produces plausible reasoning traces that support its conclusions.
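🎯 Motivation: In visual perception policy learning, intermediate reasoning chains are usually expressed in natural language, but this form of reasoning performs poorly on perception tasks, so a form better suited to vision is needed.
❓ Problem: Because visual perception requires spatial, object-centric reasoning, proposes structured visual reasoning that avoids the ambiguity and mismatch of linguistic reasoning.
🔍 Analysis: Purely linguistic reasoning chains operate in a semantic space, which is inconsistent with the spatial and object-centric demands of visual perception and degrades performance.
🛠️ Method: Proposes Artemis, which structures each intermediate reasoning step as a (label, bounding-box) pair, giving a clear and verifiable visual state and enabling intermediate state tracking and direct supervision.
📊 Data and experiments: Trained on grounding and detection samples in natural image domains; the model generalizes to counting and geometric perception tasks.
⭐ Main contributions: Designs a spatially grounded, object-centric chain rule and a unified architecture that removes the reliance on task-specific designs, improving the scalability and generality of perception policies.
Abstract: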
Recent reinforcement-learning frameworks for visual perception policy usually incorporate intermediate reasoning chains expressed in natural language. Empirical observations indicate that such purely linguistic intermediate reasoning often reduces performance on perception tasks. We argue that the core issue lies not in reasoning per se but in the form of reasoning: while these chains perform semantic reasoning in an unstructured linguistic space, visual perception requires reasoning in a spatial and object-centric space. In response, we introduce Artemis, a perception-policy learning method that performs structured visual reasoning, where each intermediate step is represented as a (label, bounding-box) pair capturing a verifiable visual state. This design enables explicit tracking of intermediate states and direct supervision of proposal quality, and avoids the ambiguity introduced by language-based reasoning. Building upon verifiable and spatially grounded reasoning chains, Artemis provides a unified architecture for diverse perceptual tasks, without requiring the task-specific designs relied upon by prior perceptual policy models. Trained using grounding and detection samples in natural image domains, Artemis generalizes to counting and geometric perception tasks. At its core, a spatially grounded, object-centric chain rule provides a principled foundation for scalable and general perceptual policies.
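🎯 Motivation: The reasoning patterns of large language models (LLMs) remain opaque, and conventional reinforcement learning (RL) assigns uniform credit across the whole generation, making it hard to tell pivotal steps from routine ones.
❓ Problem: Interprets LLM reasoning through the attention mechanism and aligns optimization with the model's internal dynamics to enable finer-grained policy optimization.
🔍 Analysis: Distinguishes locally and globally focused attention heads: local heads show a sawtooth pattern near the diagonal, reflecting phrasal chunk processing, while global heads expose key tokens that strongly influence future generation.
🛠️ Method: Proposes two metrics that quantify attention distributions, identifies a recurring "preplan-and-anchor" mechanism, and designs three RL strategies that dynamically assign credit to critical nodes.
📊 Data and experiments: The strategies are tested on a variety of reasoning tasks and yield consistent performance gains.
⭐ Main contributions: Reveals the preplan-and-anchor rhythm in LLM reasoning and builds RL strategies with fine-grained credit assignment on top of it, improving reasoning performance.
Abstract: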
The reasoning patterns of large language models (LLMs) remain opaque, and reinforcement learning (RL) typically assigns uniform credit across an entire generation, blurring the distinction between pivotal and routine steps. This work treats attention as a natural substrate for interpreting LLM reasoning and a window for aligning optimization with its internal dynamics. We first distinguish attention heads between locally and globally focused information processing and reveal that locally focused heads produce a sawtooth pattern near the diagonal indicating phrasal chunks, while globally focused heads expose tokens that exert broad downstream influence over future tokens. We quantify these with two metrics measuring the extent of backward attention within a clipped window and the average attention a token receives from subsequent tokens, respectively. Taken together, these signals reveal a recurring preplan-and-anchor mechanism, where the model first performs a long-range contextual reference to generate an introductory token, which is immediately followed by or coincides with a semantic anchor token that organizes subsequent reasoning. Leveraging these insights, we introduce three novel RL strategies that dynamically perform targeted credit assignment to critical nodes (preplan tokens, anchor tokens, and their temporal coupling) and show consistent performance gains across various reasoning tasks.
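The two attention metrics are described only in words, so the following is one plausible reading rather than the paper's definitions. It assumes a single head's causal attention matrix `attn`, where `attn[t][s]` is the weight that query position `t` puts on key position `s` (with `s <= t`). The first function measures how far back each token looks within a clipped window; the second measures how much attention each token receives from later tokens. Function names and the exact normalization are assumptions.

```python
def windowed_backward_distance(attn, window):
    """Attention-weighted mean look-back distance (t - s) for s in [t - window, t]."""
    scores = []
    for t, row in enumerate(attn):
        start = max(0, t - window)
        mass = sum(row[s] for s in range(start, t + 1))
        if mass == 0:
            scores.append(0.0)
            continue
        weighted = sum(row[s] * (t - s) for s in range(start, t + 1))
        scores.append(weighted / mass)
    return scores


def future_attention_received(attn):
    """Mean attention each position s receives from later query positions t > s."""
    n = len(attn)
    received = []
    for s in range(n):
        later = [attn[t][s] for t in range(s + 1, n)]
        received.append(sum(later) / len(later) if later else 0.0)
    return received
```

In this reading, a token with a large look-back distance is a candidate preplan token, and a token with high future attention received is a candidate anchor token.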
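🎯 Motivation: Current large audio language models are good at perception but limited on complex reasoning that requires precise acoustic measurements, so external tools need to be integrated effectively.
❓ Problem: Existing approaches cannot balance comprehensive tool use with contextual relevance and suffer from information overload or ineffective selection; a way to decide tool use dynamically is needed.
🔍 Analysis: External tools can extract fine-grained acoustic features, but ineffective tool calls may add noise without addressing the model's bottleneck.
🛠️ Method: Proposes AuTAgent, a reinforcement learning framework that uses sparse-feedback training and a differential reward mechanism to decide dynamically when and which tools to invoke.
📊 Data and experiments: On the MMAU Test-mini and MMAR benchmarks, accuracy improves by 4.20% / 6.20% and 9.80% / 8.00% for open-source and closed-source backbones, respectively, with strong transferability.
⭐ Main contributions: Proposes the AuTAgent framework, which addresses the tool-integration bottleneck, demonstrates the complementary role of external tools in audio model reasoning, and improves overall model performance.
Abstract: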
Large Audio Language Models (LALMs) excel at perception but struggle with complex reasoning requiring precise acoustic measurements. While external tools can extract fine-grained features like exact tempo or pitch, effective integration remains challenging: naively using all tools causes information overload, while prompt-based selection fails to assess context-dependent utility. To address this, we propose AuTAgent (Audio Tool Agent), a reinforcement learning framework that learns when and which tools to invoke. By employing a sparse-feedback training strategy with a novel Differential Reward mechanism, the agent learns to filter out irrelevant tools and invokes external assistance only when it yields a net performance gain over the base model. Experimental results confirm that AuTAgent compensates for the representation bottleneck of LALMs by providing verifiable acoustic evidence. It improves accuracy by 4.20% / 6.20% and 9.80% / 8.00% for open-source and closed-source backbones on the MMAU Test-mini and the MMAR benchmarks, respectively. In addition, further experiments demonstrate exceptional transferability. We highlight the complementary role of external tools in augmenting audio model reasoning.
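🎯 Motivation: Enable large language models to dynamically select and use external tools during long chain-of-thought reasoning, improving adaptability and complex task solving.
❓ Problem: Existing methods assume a fixed tool inventory, which limits adaptability to new or evolving toolsets.
🔍 Analysis: A fixed toolset weakens the model's ability to choose the best tool in multi-step reasoning and dynamic tasks, hurting final performance and generalization.
🛠️ Method: Proposes the AutoTool framework, comprising SFT- and RL-based trajectory stabilization and KL-regularized Plackett–Luce ranking for multi-step tool selection.
📊 Data and experiments: Builds a 200k-example multi-task tool-selection dataset covering 1,000+ tools and 100+ tasks, and trains Qwen3-8B and Qwen2.5-VL-7B, with notable gains across ten benchmarks.
⭐ Main contributions: AutoTool achieves dynamic tool selection and integration, outperforms existing methods on mathematics, science reasoning, code generation, and multimodal understanding, and generalizes well to unseen tools.
Abstract: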
Agentic reinforcement learning has advanced large language models (LLMs) to reason through long chain-of-thought trajectories while interleaving external tool use. Existing approaches assume a fixed inventory of tools, which limits the adaptability of LLM agents to new or evolving toolsets. We present AutoTool, a training framework that equips LLM agents with dynamic tool-selection capabilities throughout their reasoning trajectories. AutoTool employs a dual-phase optimization pipeline: (i) SFT and RL-based trajectory stabilization for coherent reasoning, and (ii) KL-regularized Plackett–Luce Ranking to refine consistent multi-step tool selection. We further build a 200k dataset with explicit tool-selection rationales across 1,000+ tools and 100+ tasks spanning mathematics, science, code generation, and multimodal reasoning. Across ten diverse benchmarks, we train two base models, Qwen3-8B and Qwen2.5-VL-7B, with AutoTool. With fewer parameters, AutoTool consistently outperforms advanced LLM agents and tool-integration methods, yielding average gains of 6.4% in math & science reasoning, 4.5% in search-based QA, 7.7% in code generation, and 6.9% in multimodal understanding. In addition, AutoTool exhibits stronger generalization by dynamically leveraging unseen tools from evolving toolsets during inference.
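For readers unfamiliar with the Plackett-Luce model named above, the sketch below computes the log-likelihood of a ranking under the standard model: items are chosen one at a time, each with probability proportional to the exponential of its score among the items still remaining. How AutoTool scores tools and applies the KL regularizer is not specified in the abstract, so neither is shown; the function name is hypothetical.

```python
import math


def plackett_luce_log_likelihood(scores, ranking):
    """Log-probability of `ranking` (best first) under a Plackett-Luce model.

    scores[i] is the real-valued score of item i; ranking lists item indices.
    """
    remaining = list(ranking)
    total = 0.0
    for item in ranking:
        # Numerically stable log-sum-exp over the items not yet placed.
        top = max(scores[j] for j in remaining)
        log_norm = top + math.log(sum(math.exp(scores[j] - top) for j in remaining))
        total += scores[item] - log_norm
        remaining.remove(item)
    return total
```

With equal scores every ordering of three items has probability 1/6, and rankings that place higher-scored items first receive a higher log-likelihood, which is what a ranking-based training signal would push toward.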
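🎯 Motivation: In reinforcement learning for program repair, feedback is sparse and rewards are coarse, making it hard to identify which edits actually fix the bug.
❓ Problem: Proposes a framework that refines the granularity of reward assignment to strengthen guidance toward the important edit regions.
🔍 Analysis: The sequence-level rewards used by existing methods cannot identify the critical edit locations, which limits the gains from reinforcement learning.
🛠️ Method: BoostAPR uses a three-stage pipeline: supervised fine-tuning, training dual reward models (sequence-level and line-level), and PPO optimization in which the line-level model redistributes rewards.
📊 Data and experiments: Trained on SWE-Gym and evaluated on four benchmarks, showing cross-language transfer and competitive performance.
⭐ Main contributions: Proposes line-level reward assignment that markedly improves program repair, and demonstrates cross-language transfer and competitive open-source performance.
Abstract: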
Reinforcement learning for program repair is hindered by sparse execution feedback and coarse sequence-level rewards that obscure which edits actually fix bugs. We present BoostAPR, a three-stage framework: (1) supervised fine-tuning on execution-verified demonstrations with reasoning traces, (2) training dual reward models—a sequence-level assessor and a line-level credit allocator—from execution outcomes, and (3) PPO optimization where the line-level model redistributes rewards to critical edit regions. This line-level credit assignment operates at an intermediate granularity naturally suited to code changes. Trained on SWE-Gym and evaluated on four benchmarks, BoostAPR achieves 40.7% on SWE-bench Verified (+22.9pp over the base model), 24.8% on Defects4J (Python→Java transfer), 84.5% on HumanEval-Java, and 95.0% on QuixBugs, showing competitive open-source performance with strong cross-language generalization.
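The third stage, in which a line-level model redistributes a sequence-level reward toward critical edit regions, might look like the sketch below. The softmax weighting over line scores and the even split across a line's tokens are assumptions chosen for illustration; the abstract does not say how BoostAPR normalizes or distributes the credit, and the function name is hypothetical.

```python
import math


def redistribute_reward(sequence_reward, line_scores, tokens_per_line):
    """Split one sequence-level reward into per-token rewards via line scores.

    line_scores[i] is a (hypothetical) line-level credit score for line i and
    tokens_per_line[i] is the positive number of tokens on that line.
    """
    top = max(line_scores)
    exps = [math.exp(score - top) for score in line_scores]
    normalizer = sum(exps)
    token_rewards = []
    for weight, count in zip(exps, tokens_per_line):
        line_share = sequence_reward * weight / normalizer
        # Spread the line's share evenly over its tokens.
        token_rewards.extend([line_share / count] * count)
    return token_rewards
```

The per-token rewards sum to the original sequence reward, so the total signal seen by PPO is unchanged while lines with higher credit scores receive a larger share.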
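🎯 Motivation: Enhancing the reasoning ability of large language models (LLMs) has become a major direction for reinforcement learning (RL), but existing PPO-Clip-based algorithms inherently suffer from exploration collapse and need a principled fix.
❓ Problem: Identifies the root cause of PPO-Clip's failure: it measures policy discrepancy with a Euclidean metric, which is inconsistent with the intrinsic geometry of the policy Riemannian manifold, leading to unbalanced exploration and eventual collapse.
🔍 Analysis: The Euclidean metric makes updates overly conservative in low-probability regions and overly aggressive in high-probability regions, upsetting the balance between exploration and exploitation.
🛠️ Method: Proposes RIPO, which guarantees isometric policy updates on the Riemannian manifold, resolving the geometric mismatch and improving the exploration-exploitation balance.
📊 Data and experiments: Extensive experiments on seven competition-level benchmarks show significant gains over existing methods, with up to a 60% improvement over GRPO on AIME24.
⭐ Main contributions: (1) Reveals the geometric flaw of PPO-Clip; (2) proposes RIPO for isometric policy optimization; (3) experimentally verifies RIPO's advantages across tasks.
Abstract: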
Reinforcement learning (RL) has become a dominant paradigm for enhancing LLMs' reasoning capabilities. However, RL algorithms with PPO-Clip are inherently limited by exploration collapse. Subsequent works remain primarily heuristic and fail to identify the essential cause of PPO-Clip’s failure. This work reveals the fundamental flaw of PPO-Clip: it implicitly measures policy discrepancy using a Euclidean metric, which is theoretically inconsistent with the intrinsic geometry of the policy Riemannian manifold. This geometric mismatch results in overly conservative updates in low-probability regions and overly aggressive updates in high-probability regions, ultimately collapsing exploration. To correct this geometric flaw, we propose Riemannian Isometric Policy Optimization (RIPO), which guarantees isometric policy updates on the Riemannian manifold, effectively balancing exploration and exploitation. We further show that RIPO achieves a favorable bias-variance trade-off, which stabilizes optimization. Extensive experiments demonstrate that RIPO significantly surpasses existing LLM RL algorithms across seven competition-level benchmarks (up to 60% improvement over GRPO on AIME24).
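The asymmetry described above can be seen with simple arithmetic on the standard PPO-Clip rule, which stops rewarding changes once the probability ratio leaves the interval [1 - ε, 1 + ε]. The sketch below converts that ratio interval into the range of new probabilities it corresponds to for a given old probability. It illustrates plain PPO-Clip only and says nothing about how RIPO corrects it; the function name and the default ε = 0.2 are assumptions.

```python
def clipped_probability_range(p_old, epsilon=0.2):
    """Range of new token probabilities inside the PPO-Clip ratio interval.

    The ratio p_new / p_old is kept within [1 - epsilon, 1 + epsilon], so the
    allowed absolute change scales with p_old (and is capped at 1.0).
    """
    low = p_old * (1.0 - epsilon)
    high = min(1.0, p_old * (1.0 + epsilon))
    return low, high


rare_low, rare_high = clipped_probability_range(0.01)
common_low, common_high = clipped_probability_range(0.5)
```

A token with probability 0.01 can move only within about 0.008 to 0.012, while a token with probability 0.5 can move within 0.4 to 0.6, so the same clip width permits far larger absolute changes where the policy is already confident.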
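🎯 Motivation: Reinforcement learning (RL) improves the reasoning ability of large language models (LLMs) but reduces output diversity, so more efficient distribution-matching techniques are needed.
❓ Problem: Existing methods treat the partition function only as a normalizer and do not exploit the per-prompt reward information (online accuracy) it contains, which creates a sample-efficiency bottleneck.
🔍 Analysis: Theory shows that the partition function is related to per-prompt accuracy estimates, so it can be reinterpreted as a difficulty scheduling signal that improves training efficiency.
🛠️ Method: Proposes the PACED-RL framework, which uses accuracy estimates to prioritize informative prompts and a replay mechanism prioritized by accuracy-estimate error; both reuse information already produced during GFlowNet training, which saves compute.
📊 Data and experiments: Experiments on multiple benchmarks show that PACED-RL significantly outperforms GRPO and prior GFlowNet methods, confirming its sample-efficiency advantage.
⭐ Main contributions: Repurposes the partition function to make distribution-matching training more efficient; by reusing signals that already exist, it reduces compute overhead and improves results.
Abstract: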
Reward-maximizing RL methods enhance the reasoning performance of LLMs, but often reduce the diversity among outputs. Recent works address this issue by adopting GFlowNets, training LLMs to match a target distribution while jointly learning its partition function. In contrast to prior works that treat this partition function solely as a normalizer, we reinterpret it as a per-prompt expected-reward (i.e., online accuracy) signal, leveraging this unused information to improve sample efficiency. Specifically, we first establish a theoretical relationship between the partition function and per-prompt accuracy estimates. Building on this key insight, we propose Partition Function-Guided RL (PACED-RL), a post-training framework that leverages accuracy estimates to prioritize informative question prompts during training, and further improves sample efficiency through an accuracy estimate error–prioritized replay. Crucially, both components reuse information already produced during GFlowNet training, effectively amortizing the compute overhead into the existing optimization process. Extensive experiments across diverse benchmarks demonstrate strong performance improvements over GRPO and prior GFlowNet approaches, highlighting PACED-RL as a promising direction for more sample-efficient distribution-matching training for LLMs.
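🎯 Motivation: Existing Text-to-SQL methods rely on static workflows and struggle with out-of-distribution and long-tail scenarios, which limits their real-world applicability.
❓ Problem: Aims to develop an adaptive framework that lets the system construct workflows at inference time according to the reasoning required, improving performance and scalability.
🔍 Analysis: Theory and experiments show that optimal dynamic policies outperform the best static workflow, with the gains driven by heterogeneity across candidate workflows.
🛠️ Method: Proposes SquRL, a reinforcement learning framework that strengthens LLM reasoning in workflow construction, using a rule-based reward function, dynamic actor masking, and pseudo rewards to improve training efficiency.
📊 Data and experiments: On mainstream Text-to-SQL benchmarks, dynamic workflow construction clearly outperforms the best static methods, especially on complex and out-of-distribution queries.
⭐ Main contributions: Introduces dynamic workflow construction and the SquRL reinforcement learning framework, markedly improving the applicability and performance of Text-to-SQL systems in complex scenarios.
Abstract: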
Text-to-SQL has recently achieved impressive progress, yet remains difficult to apply effectively in real-world scenarios. This gap stems from the reliance on single static workflows, fundamentally limiting scalability to out-of-distribution and long-tail scenarios. Instead of requiring users to select suitable methods through extensive experimentation, we attempt to enable systems to adaptively construct workflows at inference time. Through rigorous theoretical and empirical analysis, we demonstrate that optimal dynamic policies consistently outperform the best static workflow, with performance gains fundamentally driven by heterogeneity across candidate workflows. Motivated by this, we propose SquRL, a reinforcement learning framework that enhances LLMs' reasoning capability in adaptive workflow construction. We design a rule-based reward function and introduce two effective training mechanisms: dynamic actor masking to encourage broader exploration, and pseudo rewards to improve training efficiency. Experiments on widely-used Text-to-SQL benchmarks demonstrate that dynamic workflow construction consistently outperforms the best static workflow methods, with especially pronounced gains on complex and out-of-distribution queries.
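🎯 Motivation: Decoding-based regression, which recasts regression as sequence generation, is promising, but discrete token-level objectives align poorly with continuous numerical values, limiting precision and generalization.
❓ Problem: Overcome the failure of token-level constraints to capture the global magnitude of the target value, and improve the precision and generality of decoding-based regression.
🔍 Analysis: Token-level cross-entropy objectives do not reflect the global coherence of a continuous target value, which limits precision and generalization.
🛠️ Method: Proposes GenRe$^2$, an RL-based generative regression method that formulates generation as a Markov decision process with sequence-level rewards, combining policy gradient methods (to preserve error magnitudes) with dense expert supervision (to address temporal credit assignment).
📊 Data and experiments: Extensive experiments on tabular regression, code metric prediction, and generative reward modeling show that GenRe$^2$ consistently outperforms traditional baselines.
⭐ Contributions: (1) a decoding-based regression framework built on sequence-level reinforcement learning; (2) a way to address the misalignment between token-level objectives and continuous values; (3) a robust paradigm for general-purpose numerical prediction, supported by experiments across several task types.
Full abstract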
Decoding-based regression, which reformulates regression as a sequence generation task, has emerged as a promising paradigm of applying large language models for numerical prediction. However, its progress is hindered by the misalignment between discrete token-level objectives (e.g., cross-entropy) and continuous numerical values. Existing approaches relying on token-level constraints often fail to capture the global magnitude of the target value, limiting their precision and generalization. In this paper, we propose to unlock the potential of decoding-based regression via reinforcement learning. We formulate the generation process as a Markov decision process, utilizing sequence-level rewards to enforce global numerical coherence. Under this framework, we present GenRe$^2$, which combines policy gradient methods to preserve error magnitudes with dense expert supervision, resolving the temporal credit assignment challenge. Extensive experiments across tabular regression, code metric prediction and generative reward modeling demonstrate that GenRe$^2$ consistently outperforms traditional baselines, establishing a robust paradigm for general-purpose numerical prediction.
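The misalignment between token-level objectives and numerical magnitude can be seen with two digit strings. The following is a minimal sketch with invented numbers; counting mismatched digits is used as a crude stand-in for a token-level loss and is not the objective used in the paper.

```python
def token_mismatches(target: str, prediction: str) -> int:
    """Count digit positions that differ (a crude proxy for a token-level loss)."""
    if len(target) != len(prediction):
        raise ValueError("this toy proxy assumes equal-length digit strings")
    return sum(t != p for t, p in zip(target, prediction))

def absolute_error(target: str, prediction: str) -> int:
    """Sequence-level numerical error between the decoded values."""
    return abs(int(target) - int(prediction))

target = "1000"
for prediction in ("9000", "1999"):
    print(prediction,
          token_mismatches(target, prediction),
          absolute_error(target, prediction))
```

The prediction "9000" has a single wrong token but is off by 8000, while "1999" has three wrong tokens and is off by only 999. A reward computed on the decoded number ranks the two predictions in the opposite order from the token-level proxy.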
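🎯 Motivation: Group-based RL methods have been very successful at improving LLMs and have been extended to agentic tasks, but they capture the contribution of individual steps poorly, especially valuable steps hidden inside failed trajectories.
❓ Problem: Existing methods rely on coarse trajectory-level attribution and cannot capture each step's contribution to the task goal; this work proposes finer-grained credit assignment.
🔍 Analysis: Conventional methods attribute credit according to final outcomes and overlook information about key steps latent in the trajectories, which limits training efficiency and the fidelity of credit assignment.
🛠️ Method: Proposes GraphGPO, which aggregates all rollout trajectories into a unified state-transition graph, estimates each state's distance to the task goal from the graph's global information, and assigns step-level credit through a graph-based advantage.
📊 Data and experiments: Evaluated on a range of challenging benchmarks, where GraphGPO shows significantly improved training efficiency and state-of-the-art performance.
⭐ Contributions: A graph-based credit assignment framework that addresses the coarseness of trajectory-level attribution and improves the efficiency and effectiveness of RL training.
Full abstract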
Group-based reinforcement learning (RL) methods have achieved remarkable success in improving the performance of large language models (LLMs) and have been rapidly extended to agentic tasks. However, their credit assignment relies heavily on coarse-grained trajectory-level attribution according to final outcomes, making it difficult to capture the contribution of individual steps, such as valuable steps obscured within failed trajectories. To uncover latent information and enable more faithful step-level credit assignment, we propose Graph-based Group Policy Optimization (GraphGPO), which first aggregates all rollout trajectories into a unified state-transition graph and then estimates the distance from each state to the task goal using the global information encoded in the graph. Finally, GraphGPO assigns credit to each edge by estimating a graph-based advantage, based on how much the transition reduces the distance to the task goal. In this way, GraphGPO significantly improves training efficiency and achieves state-of-the-art performance across a range of challenging benchmarks.
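The pipeline described in the abstract (merge rollouts into a graph, estimate distance to the goal, credit each edge by the distance it removes) can be sketched as follows. This is an assumption-laden toy: hop count via breadth-first search stands in for the paper's distance estimate, and the `unreachable` penalty and all state names are hypothetical.

```python
from collections import deque

def build_graph(trajectories):
    """Merge rollouts (lists of hashable states) into one set of directed edges."""
    edges = set()
    for states in trajectories:
        edges.update(zip(states, states[1:]))
    return edges

def distance_to_goal(edges, goal):
    """Breadth-first search on reversed edges: hops from each state to the goal."""
    predecessors = {}
    for src, dst in edges:
        predecessors.setdefault(dst, []).append(src)
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        node = queue.popleft()
        for prev in predecessors.get(node, []):
            if prev not in dist:
                dist[prev] = dist[node] + 1
                queue.append(prev)
    return dist

def edge_advantages(edges, dist, unreachable):
    """Advantage of an edge = reduction in distance to the goal along that edge."""
    return {
        (src, dst): dist.get(src, unreachable) - dist.get(dst, unreachable)
        for src, dst in edges
    }

rollouts = [
    ["s0", "s1", "s2", "goal"],  # successful rollout
    ["s0", "s1", "s3", "s4"],    # failed rollout that shares a useful first step
]
edges = build_graph(rollouts)
dist = distance_to_goal(edges, "goal")
adv = edge_advantages(edges, dist, unreachable=len(edges) + 1)
```

In this example the step `s0 -> s1` receives positive credit even when it occurs in the failed rollout, because the merged graph shows that `s1` lies on a path to the goal, whereas `s1 -> s3` is penalized for leaving that path.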
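🎯 Motivation: LLM post-training techniques fall into supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), each with its own trade-off: SFT can generalize poorly, while RFT can learn unexpected behaviours. The work seeks a unified view that combines their strengths.
❓ Problem: Address the poor generalization of SFT as a form of behaviour cloning, RFT's tendency to learn unexpected behaviours and its sensitivity to the initial policy, and explore how the two can be combined.
🔍 Analysis: Experiments show that SFT and RFT are complementary, and that existing standalone or parallel mixed-policy strategies do not fully exploit the strengths of both.
🛠️ Method: Proposes Prefix-RFT, a hybrid approach that combines learning from demonstration with learning from exploration, coordinating the two fine-tuning modes through prefix sampling.
📊 Data and experiments: An empirical study using mathematical reasoning problems as a test bed, with ablations on robustness to the quality and quantity of demonstration data.
⭐ Contributions: Proposes and validates Prefix-RFT, a simple yet effective hybrid fine-tuning method that surpasses standalone SFT and RFT as well as existing parallel mixed-policy RFT methods, supported by a unified view of the two paradigms and by empirical analysis.
Full abstract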
Existing LLM post-training techniques are broadly categorized into supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). Each paradigm presents a distinct trade-off: (1) SFT excels at mimicking demonstration data, but can lead to problematic generalization as a form of behaviour cloning. (2) Conversely, RFT can significantly enhance a model's performance but is prone to learn unexpected behaviours, and its performance is sensitive to the initial policy. In this paper, we propose a unified view of these methods and introduce Prefix-RFT, a hybrid approach that synergizes learning from both demonstration and exploration. Using mathematical reasoning problems as a test bed, we empirically demonstrate that Prefix-RFT is simple yet effective. Not only does it surpass the performance of standalone SFT and RFT, but it also outperforms parallel mixed-policy RFT methods. Our analysis highlights the complementary nature of SFT and RFT, validating that Prefix-RFT effectively harmonizes them. Further ablation studies confirm the method's robustness to variations in the quality and quantity of demonstration data.
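🎯 Motivation: RL with function approximation faces a trade-off between sample complexity and computational complexity; in low-rank MDPs in particular, existing algorithms depend on computationally expensive oracles, which limits their practicality.
❓ Problem: Proposes an efficient algorithm that relies only on a policy evaluation oracle, removing the computational bottleneck of existing algorithms for low-rank MDPs while remaining sample efficient.
🔍 Analysis: In low-rank MDPs, policy evaluation is the most computationally efficient oracle, provided that supervised learning can be solved efficiently.
🛠️ Method: Designs an optimistic actor–critic algorithm that uses only the policy evaluation oracle, avoiding the expensive planning or optimization oracles assumed in prior work, and extends the results to approximately low-rank MDPs.
📊 Data and experiments: Experiments on several standard Gym benchmarks support the theoretical results and indicate that the method is usable in practice.
⭐ Contributions: Establishes a hierarchy of the computational efficiency of RL oracles in low-rank MDPs; proposes an actor–critic algorithm that is both sample efficient and computationally feasible; extends the theory to approximately low-rank settings that cover a broader range of real-world environments.
Full abstract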
Reinforcement learning (RL) is a fundamental framework for sequential decision-making, in which an agent learns an optimal policy through interactions with an unknown environment. In settings with function approximation, many existing RL algorithms achieve favorable sample complexity, but often rely on computationally intractable oracles. In this paper, we use supervised learning as a computational proxy to establish a clear hierarchy of commonly adopted RL oracles under low-rank Markov Decision Processes (MDPs). This hierarchy shows that policy evaluation is the most computationally efficient oracle, provided that supervised learning can be efficiently solved. Motivated by this observation, we propose a novel optimistic actor–critic algorithm that relies solely on the policy evaluation oracle. We prove that our algorithm outperforms the existing sample complexity guarantees for low-rank MDPs while avoiding computationally expensive planning or optimization oracles commonly assumed in prior works. We further extend our theoretical results to approximately low-rank MDPs and demonstrate that this setting captures a broad class of real-world environments. Finally, we validate our theoretical results with experiments on several standard Gym benchmarks.
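🎯 Motivation: Device-cloud collaboration is promising for deploying large language models, but existing methods struggle to decide, based on task complexity, whether cloud assistance is needed. The authors aim to give the on-device model the ability to decide for itself.
❓ Problem: Existing routers have difficulty judging task difficulty from the prompt alone, especially for complex reasoning. The work uses RL-based post-training to improve the on-device model's decision-making and to balance local processing against cloud offloading.
🔍 Analysis: External routers frequently make wrong decisions, which can skew processing toward exclusively local or exclusively cloud execution and limits the effectiveness of device-cloud collaboration. A system that decides internally and trades off dynamically is needed.
🛠️ Method: Casts post-training as a reward maximization problem, designs hierarchical rewards that encourage local problem solving and judicious cloud offloading, and develops a group-level policy gradient algorithm with adaptive prompt filtering to mitigate policy collapse.
📊 Data and experiments: Experiments on on-device-scale LLaMA and Qwen models across multiple reasoning benchmarks show that the method outperforms baselines and narrows the gap to full cloud LLMs.
⭐ Contributions: Proposes a unified device-cloud collaborative inference approach in which the on-device model decides internally when to offload; designs hierarchical rewards and a policy optimization technique that improve on-device reasoning while balancing the frequency of cloud calls.
Full abstract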
Device-cloud collaboration holds promise for deploying large language models (LLMs), leveraging lightweight on-device models for efficiency while relying on powerful cloud models for superior reasoning. A central challenge in this setting is determining, for each incoming query, whether it should be processed locally or offloaded to the cloud. Existing approaches typically rely on external routers, which often struggle to determine difficulty from the prompt itself, especially for tasks involving complex reasoning. Motivated by this limitation, we propose enabling on-device LLMs to decide internally whether to invoke cloud assistance at inference time, with this capability instilled through reinforcement learning based post-training. Casting on-device LLM post-training as a reward maximization problem, we design hierarchical rewards to encourage local problem solving and judicious cloud offloading. To solve the resulting problem, we develop an algorithm featuring a group-level policy gradient that stabilizes optimization, together with adaptive prompt filtering that provides complementary learning signals to mitigate policy collapse (i.e., exclusive local execution or exclusive cloud offloading). Extensive experiments on on-device-scale LLaMA and Qwen models across multiple reasoning benchmarks show that our method consistently outperforms baselines and significantly narrows the gap to full cloud LLMs.
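🎯 Motivation: Large vision-language models perform well at multimodal reasoning, but existing RL methods lack explicit counterfactual enhancement and causal learning mechanisms, leading to severe grounding failures.
❓ Problem: Proposes CFPO, a framework that uses a cross-modal counterfactual enhancement mechanism to address neglect of visual evidence and drift in long chain-of-thought reasoning, improving causal consistency.
🔍 Analysis: RL-trained models tend to over-rely on language priors and ignore visual information, or exhibit hallucination drift during long reasoning chains.
🛠️ Method: CFPO regularizes the policy by maximizing the discrepancy between the model's predictions and those from a counterfactual state in which critical visual cues are suppressed, enforcing causal consistency between visual perception and textual reasoning; it is compatible with algorithms such as GRPO and DAPO.
📊 Data and experiments: Extensive experiments show gains of 3.17%-6.25% over standard RL baselines and 1.32%-2.13% over the state-of-the-art perception-aware method PAPO.
⭐ Contributions: Proposes the counterfactual policy optimization framework CFPO with a cross-modal counterfactual enhancement mechanism, significantly improving multimodal causal reasoning without external supervision or additional reward models.
Full abstract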
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in multimodal reasoning. However, prevailing reinforcement learning (RL) paradigms lack explicit counterfactual enhancement and causal learning mechanisms. This fundamental deficiency results in severe grounding failures, manifesting as a tendency to ignore visual evidence in favor of language priors or exhibiting hallucination drift during long chain-of-thought reasoning. To address this root cause, we propose CounterFactual Policy Optimization (CFPO), a novel framework that enforces causal consistency between visual perception and textual reasoning. CFPO introduces a cross-modal counterfactual enhancement mechanism, which regularizes the policy by maximizing the discrepancy between the model’s predictions and those from a counterfactual state where critical visual cues are suppressed. This approach seamlessly integrates with standard algorithms like GRPO and DAPO without requiring external reward models or additional supervision. Extensive experiments demonstrate that CFPO significantly improves reasoning fidelity, achieving consistent gains of 3.17%-6.25% over standard RL baselines and 1.32%-2.13% over the state-of-the-art perception-aware method (PAPO).
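🎯 Motivation: Existing RL-based optimization methods cause loss of character fidelity and style collapse in role-playing, so RL objectives need to be realigned with the role-playing task.
❓ Problem: Proposes a character-centric optimization framework that keeps persona and behaviour consistent in role-playing tasks.
🔍 Analysis: Conventional methods prioritize context-specific task utility over persona alignment, so generated content lacks the character's emotional consistency and distinctiveness.
🛠️ Method: Improves character distinctiveness and stability through three mechanisms: decoupling task logic from stylistic rewards, dynamically adapting optimization constraints, and using generic responses as negative baselines.
📊 Data and experiments: Experiments cover multiple role-playing scenarios and show improvements over existing methods on metrics such as character consistency and emotional expression.
⭐ Contributions: Proposes CRPO, which optimizes RL objectives from a character-centric perspective and offers a new approach to distinctive and consistent role-playing agents.
Full abstract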
Recent advancements in Reinforcement Learning (RL), particularly Group Relative Policy Optimization (GRPO), have significantly enhanced the reasoning capabilities of Large Language Models. However, applying these problem-centric optimization methods to role-playing agents often leads to a loss of character fidelity and style collapse, as they prioritize context-specific utility over persona alignment. To address this, we propose Character-Centric Group Relative Policy Optimization (CRPO), a framework designed to realign RL objectives with the role-playing task. CRPO improves character distinctiveness through three mechanisms: decoupling task logic from stylistic rewards to resolve gradient conflicts, dynamically adapting optimization constraints based on character complexity, and utilizing generic responses as negative baselines to prevent the model from reverting to a common distribution. Extensive experiments demonstrate that CRPO outperforms existing methods in consistency, emotion and others.
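🎯 Motivation: Address the optimal control of the Mountain Car problem, a canonical RL benchmark, and explore simpler policies for low-dimensional control tasks.
❓ Problem: Solves the Mountain Car problem analytically and derives an optimal control solution, closing a gap that had been open for 36 years.
🔍 Analysis: Finds a large gap between modern RL agents and the theoretical optimum, even though the optimal control itself is quite simple.
🛠️ Method: Introduces Chebyshev policies, a universal (dense) class of RL policies that can be trained as drop-in replacements for neural networks, substantially reducing the parameter count and improving training efficiency.
📊 Data and experiments: Evaluated on Mountain Car and on further environments including a real-world non-linear motion control testbed; Chebyshev policies outperform neural networks with PPO, ARS, and REINFORCE.
⭐ Contributions: A lightweight, efficient policy class for low-dimensional control tasks that combines performance, explainability, and real-time capability.
Full abstract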
We analytically solve the Mountain Car problem, a canonical benchmark in RL, and derive an optimal control solution, closing a gap after 36 years. This enables us to reveal two surprising insights: The optimal control is quite simple, yet modern RL agents display a large gap to optimality. Motivated by the analysis of the optimal control, we introduce Chebyshev policies as a universal (i.e. dense) class of RL policies from first principles. They can be trained as drop-in replacements of neural nets, reducing the regret by a factor of 4.18, while requiring 268 times fewer parameters, fostering sample efficiency, explainability and real-time capability. Chebyshev policies are evaluated on further RL environments, including a real-world non-linear motion control testbed. They consistently improve performance over neural nets with PPO, ARS and REINFORCE. Our results demonstrate how Chebyshev policies offer a compelling and lightweight alternative or addition to neural nets for low-dimensional control tasks.
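To make the idea of a polynomial policy concrete, the sketch below evaluates a small tensor-product Chebyshev expansion over a two-dimensional state. It is a hypothetical parameterization, not the construction from the paper: the `tanh` squashing, the assumption that position and velocity are already scaled to [-1, 1], and the example coefficients are all choices made here for illustration.

```python
import math

def chebyshev_basis(x: float, degree: int) -> list[float]:
    """T_0..T_degree at x in [-1, 1], using T_{k+1}(x) = 2x T_k(x) - T_{k-1}(x)."""
    values = [1.0, x]
    for _ in range(2, degree + 1):
        values.append(2.0 * x * values[-1] - values[-2])
    return values[: degree + 1]

def chebyshev_policy(position: float, velocity: float, coeffs) -> float:
    """Action in (-1, 1) from a tensor-product Chebyshev expansion of the state.

    `coeffs` is a square (degree + 1) x (degree + 1) matrix of trainable weights.
    """
    degree = len(coeffs) - 1
    tp = chebyshev_basis(position, degree)
    tv = chebyshev_basis(velocity, degree)
    raw = sum(coeffs[i][j] * tp[i] * tv[j]
              for i in range(degree + 1) for j in range(degree + 1))
    return math.tanh(raw)

# Degree-1 example with 4 parameters: the only non-zero term is T_0(pos) * T_1(vel),
# so the policy pushes in the direction of the current velocity.
coeffs = [[0.0, 1.0],
          [0.0, 0.0]]
action = chebyshev_policy(position=0.0, velocity=0.5, coeffs=coeffs)
```

A degree-d policy of this form has (d + 1)^2 parameters, which illustrates why such a policy class can be far smaller than a neural network on low-dimensional control tasks.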
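🎯 Motivation: RLVR is a central paradigm for scaling LLM reasoning, but its optimization suffers from training instability and suboptimal convergence.
❓ Problem: The hard-clipping mechanism discards many high-value signals near the clipping boundary, which limits performance; a method is needed to recover them.
🔍 Analysis: A systematic study of the GRPO objective identifies the rigid decision made by hard clipping, which discards high-value near-boundary signals, as the primary bottleneck.
🛠️ Method: Proposes Near-boundary Stochastic Rescue (NSR), a lightweight method that stochastically retains near-boundary tokens to recover lost signal and can be interpreted as inducing an implicit gradient decay in expectation.
📊 Data and experiments: Experiments across model sizes from 7B to 30B and both dense and MoE architectures show that NSR improves training stability and performance over strong baselines such as DAPO and GSPO.
⭐ Contributions: Addresses signal loss under hard clipping with NSR, a simple and efficient method that significantly improves the performance and stability of RLVR training.
Full abstract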
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a central paradigm for scaling LLM reasoning, yet its optimization often suffers from training instability and suboptimal convergence. Through a systematic dissection of the GRPO-based objective, we reveal that the rigid clipping decision inherent to the hard-clipping mechanism is the primary bottleneck. Specifically, we find that many high-value signals lie in the near-boundary region just beyond the clipping threshold, and are thus discarded. Motivated by this diagnosis, we propose Near-boundary Stochastic Rescue (NSR), a minimal, plug-and-play modification that stochastically retains these slightly out-of-bound tokens to recover lost signals. While NSR, via stochastic sampling, can be interpreted as inducing an implicit gradient decay in expectation, our ablations reveal that its stochastic, boundary-local rescue mechanism is consistently more effective than deterministic gradient decay. Validated by extensive experiments across model sizes from 7B to 30B and both dense and MoE architectures, as a plug-and-play solution, NSR substantially improves training stability and delivers consistent gains over strong baselines such as DAPO and GSPO.
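The difference between hard clipping and a stochastic near-boundary rescue can be sketched as a per-token keep/drop decision. This is only one possible reading of the abstract: the width of the near-boundary band (`margin`), the retention probability (`keep_prob`), and the function name are hypothetical, and the abstract does not specify how NSR actually samples.

```python
import random

def keep_token(ratio, advantage, eps=0.2, margin=0.1, keep_prob=0.5, rng=random):
    """Return True if the token's gradient is kept.

    Hard clipping drops a token once its importance ratio leaves [1 - eps, 1 + eps]
    in the direction the advantage pushes it. This sketch additionally keeps tokens
    that are at most `margin` beyond the boundary with probability `keep_prob`.
    """
    if advantage >= 0:
        overshoot = ratio - (1 + eps)
    else:
        overshoot = (1 - eps) - ratio
    if overshoot <= 0:
        return True                      # inside the clip range: always kept
    if overshoot <= margin:
        return rng.random() < keep_prob  # near-boundary: stochastic rescue
    return False                         # far outside: dropped, as in hard clipping

rng = random.Random(0)
kept = sum(keep_token(1.25, advantage=1.0, rng=rng) for _ in range(10_000))
print(kept / 10_000)  # fraction of near-boundary tokens that were rescued
```

Under these assumptions a near-boundary token contributes its gradient with probability `keep_prob`, so its expected contribution is the unclipped gradient scaled by `keep_prob`. That is one way to read the abstract's remark that NSR induces an implicit gradient decay in expectation.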
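🎯 Motivation: Existing image editing models often cause unwanted changes in non-target regions when editing a target region, which harms content consistency.
❓ Problem: Proposes CoCoEdit, a post-training framework based on region-regularized reinforcement learning that balances editing quality and content consistency.
🔍 Analysis: Existing rewards are spatially agnostic and cannot distinguish the contributions of edited and non-edited regions.
🛠️ Method: Uses reinforcement learning with a pixel-level similarity reward and a region-based regularizer that preserves non-edited regions for high-reward samples and encourages editing effects for low-reward samples.
📊 Data and experiments: Augments existing datasets and curates a training set of 40K high-quality samples; annotates editing masks for GEdit-Bench and ImgEdit-Bench and introduces pixel-level similarity metrics to evaluate content consistency and editing quality.
⭐ Contributions: CoCoEdit achieves better editing scores than state-of-the-art models and significantly better content consistency, measured by PSNR/SSIM and human subjective ratings; the code will be released.
Full abstract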
Image editing has achieved impressive results with the development of large-scale generative models. However, existing models mainly focus on the editing effects of intended objects and regions, often leading to unwanted changes in unintended regions. We present a post-training framework for Content-Consistent Editing (CoCoEdit) by using region-regularized reinforcement learning. We first augment existing editing datasets with refined instructions and masks, from which 40K diverse and high-quality samples are curated as the training set. We introduce a pixel-level similarity reward that complements MLLM-based rewards, enabling models to ensure both editing quality and content consistency during the editing process. To overcome the spatial-agnostic nature of the rewards, we propose a region-based regularizer, aiming to preserve non-edited regions for high-reward samples while encouraging editing effects for low-reward samples. For evaluation, we annotate editing masks for GEdit-Bench and ImgEdit-Bench, introducing pixel-level similarity metrics to measure content consistency and editing quality. Applying CoCoEdit to Qwen-Image-Edit and FLUX-Kontext, we achieve not only superior editing scores to state-of-the-art models, but also significantly better content consistency, measured by PSNR/SSIM metrics and human subjective ratings. Code will be released.
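🎯 Motivation: Standard GRPO uses uniform sampling and near-uniform weighting, which allocates computation inefficiently and limits gains in LLM reasoning. The benefit of each update depends strongly on question difficulty and the model's current competence.
❓ Problem: In current RL methods, hard questions are under-trained because correct rollouts are rare, which limits discovery. The paper aims to improve training efficiency by dynamically adjusting weights and sampling.
🔍 Analysis: Identifies three recurring dynamics: probability inflation, advantage contraction as accuracy rises, and hierarchical convergence in which easy questions saturate quickly while hard questions remain discovery-limited. Together they show that the benefit of training depends strongly on question difficulty and model confidence.
🛠️ Method: Proposes CoDaPO, which assigns each question a bounded value from rollout confidence and empirical difficulty, uses it to reweight policy updates, and resamples high-value questions within minibatches to increase discovery on hard questions under a fixed compute budget.
📊 Data and experiments: Evaluated on seven benchmarks against multiple RL methods; CoDaPO consistently improves reasoning accuracy.
⭐ Contributions: A confidence- and difficulty-adaptive training strategy that addresses the inefficient allocation of computation in GRPO and extends the use of RL for improving LLM reasoning.
Full abstract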
RL with verifiable rewards can substantially improve LLM reasoning, yet standard GRPO-style training often uses uniform sampling and near-uniform weighting, leading to inefficient computation allocation. We study GRPO by tracking token log-probabilities, group-normalized advantages, and induced token-level update weights. This reveals three recurring dynamics: probability inflation, advantage contraction as accuracy rises, and hierarchical convergence, where easy questions quickly saturate while hard questions remain discovery-limited due to rare correct rollouts. These findings imply that the benefit of each update depends strongly on both question difficulty and the model’s current competence. Motivated by this, we propose Confidence and Difficulty-adaptive Policy Optimization (CoDaPO), which assigns each question a bounded value from rollout confidence and empirical difficulty, then uses it to reweight policy updates and resample high-value questions within minibatches to increase discovery under a fixed compute budget. Across seven benchmarks, CoDaPO consistently improves accuracy over other RL methods.
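🎯 Motivation: Reinforcement learning has made important progress in language model reasoning, but its dependence on verifiable rewards limits its applicability. Verifier-free RL methods offer a new route, yet existing designs decouple reasoning traces from answer information.
❓ Problem: Existing methods sample reasoning traces conditioned only on the question, which leads to inefficient exploration and incoherence between traces and final answers. The work proposes a framework that couples reasoning and answers to improve model performance.
🔍 Analysis: Methods that use answer probabilities as reward signals do not integrate answer information when generating reasoning traces, so exploration is scattered and traces are less coherent with the answers.
🛠️ Method: Proposes Coupled Variational Reinforcement Learning (CoVRL), which constructs and optimizes a composite distribution that integrates the prior and posterior distributions, enabling efficient exploration while preserving coherence between reasoning and answer.
📊 Data and experiments: Experiments on mathematical and general reasoning benchmarks show that CoVRL improves on the base model by 12.4% and on existing verifier-free methods by a further 2.3%.
⭐ Contributions: A framework that bridges variational inference and reinforcement learning by optimizing a coupled distribution, offering a principled and empirically validated way to improve the reasoning ability of language models.
Full abstract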
While reinforcement learning has achieved impressive progress in language model reasoning, it is constrained by the requirement for verifiable rewards. Recent verifier-free RL methods address this limitation by utilizing the probabilities that LLMs generate reference answers as reward signals. However, these approaches typically sample reasoning traces conditioned only on the question. This design decouples reasoning-trace sampling from answer information, leading to inefficient exploration and incoherence between traces and final answers. In this paper, we propose Coupled Variational Reinforcement Learning (CoVRL), which bridges variational inference and reinforcement learning by coupling prior and posterior distributions through a hybrid sampling strategy. By constructing and optimizing a composite distribution that integrates these two distributions, CoVRL enables efficient exploration while preserving strong thought-answer coherence. Extensive experiments on mathematical and general reasoning benchmarks show that CoVRL improves performance by 12.4% over the base model and achieves an additional 2.3% improvement over state-of-the-art verifier-free RL baselines, providing a principled framework for enhancing the general reasoning capabilities of language models.
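🎯 Motivation: Two-stage ranking architectures are standard in large-scale search, recommendation, and retrieval-augmented generation systems, but end-to-end training of the early-stage ranker (ESR) remains challenging, in particular because the high variance of policy gradient methods limits scalability.
❓ Problem: Proposes a new RL method that addresses the exploding variance that arises when standard policy gradient ignores the contribution of each specific item in the candidate set to the reward.
🔍 Analysis: Vanilla policy gradient propagates the gradient through the joint probability of the candidate set rather than the marginal probability of individual items, which is the source of the high variance.
🛠️ Method: Proposes "credit-assigned" policy gradient (CA-PG), which computes gradients by marginalizing over all candidate sets that contain the target item, reducing variance while preserving the ability to learn the correct ranking.
📊 Data and experiments: Experiments on synthetic and real-world data with the canonical Plackett-Luce model show that CA-PG improves convergence speed and training stability, especially for large candidate sets.
⭐ Contributions: A scalable credit-assigned policy gradient method that significantly reduces variance in end-to-end training of early-stage rankers, making RL more practical for large-scale information retrieval systems.
Full abstract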
Large-scale search, recommendation, and retrieval-augmented generation (RAG) systems typically employ a two-stage architecture: an early-stage ranker (ESR) generates a candidate set, which is subsequently re-ranked by a late-stage ranker (LSR). While there are many reinforcement learning (RL) methods for training the LSR, end-to-end training of the ESR has proven challenging. In particular, naive application of "vanilla" policy gradient (V-PG) is not scalable for candidate-set sizes relevant for practical use due to exploding variance. This issue arises because V-PG propagates the gradient to the joint probability of the candidate sets, ignoring the contribution of each specific item in the candidate set to the reward. To mitigate this issue, we propose a novel "credit-assigned" PG (CA-PG), which computes gradients with respect to the probability that the target item is chosen in any candidate set, i.e. marginalizing over all candidate sets that contain it. Our theoretical analysis reveals that CA-PG significantly reduces the variance of V-PG by marginalizing over the specific composition of the candidate set, while preserving the ability to learn the correct ranking of actions under a reasonably aligned LSR policy. Experiments on both synthetic and real-world data demonstrate that CA-PG improves the convergence speed and training stability for ESRs utilizing the canonical Plackett-Luce model, especially when the candidate-set size is large.
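The two quantities contrasted in the abstract, the joint probability of one sampled candidate set and the marginal probability that a given item is included in any candidate set, can be computed exactly for a tiny Plackett-Luce model. The sketch below uses brute-force enumeration, which is feasible only for a handful of items and is not how the paper estimates gradients; the scores are invented.

```python
from itertools import permutations
import math

def pl_ordered_prob(order, scores):
    """Plackett-Luce probability of drawing `order` as the first len(order) picks."""
    weights = [math.exp(s) for s in scores]
    remaining = sum(weights)
    prob = 1.0
    for item in order:
        prob *= weights[item] / remaining
        remaining -= weights[item]
    return prob

def inclusion_probs(scores, k):
    """Marginal probability that each item appears in the size-k candidate set."""
    marginals = [0.0] * len(scores)
    for order in permutations(range(len(scores)), k):
        p = pl_ordered_prob(order, scores)
        for item in order:
            marginals[item] += p
    return marginals

scores = [2.0, 1.0, 0.0, -1.0]  # hypothetical early-stage ranker logits
joint = pl_ordered_prob((0, 2), scores)   # probability of one specific ordered set
marginals = inclusion_probs(scores, k=2)  # per-item inclusion probabilities
```

The joint probability changes with every other member of the sampled set, whereas the inclusion probability of an item depends only on that item and the scores. Differentiating the latter is what "marginalizing over all candidate sets that contain it" refers to.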
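🎯 Motivation: Black-box prompt tuning (BBPT) aims to optimize input prompts, but existing methods cannot address prompt interpretability and query efficiency at the same time.
❓ Problem: Proposes CRL-BPT, a curriculum reinforcement learning framework that dynamically weights its training objectives to guide the generation of interpretable and efficient prompts.
🔍 Analysis: Current methods lack an effective balance between imitating reference prompts and exploring new patterns, which limits both performance and interpretability.
🛠️ Method: Proposes a dynamic curriculum schedule that weights an imitation loss and an innovation loss over time, together with historical loss normalization and relative reward calibration to stabilize training.
📊 Data and experiments: Extensive experiments show that, under a strict budget of API calls, CRL-BPT reaches state-of-the-art performance and generates highly interpretable prompts.
⭐ Contributions: Applies curriculum reinforcement learning to black-box prompt tuning, improving prompt interpretability and query efficiency; the code is publicly available.
Full abstract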
Black-box prompt tuning (BBPT) aims to optimize input prompts for large models where internal parameters and gradients are inaccessible. However, existing methods fail to simultaneously address the dual challenges of prompt interpretability and query efficiency. To address these challenges, we propose CRL-BPT, a curriculum reinforcement learning framework that utilizes a large language model as an agent to generate human-readable prompts. Specifically, CRL-BPT implements a dynamic curriculum schedule on two auxiliary objectives: an imitation loss and an innovation loss. By dynamically weighting these objectives, CRL-BPT regularizes the RL process, guiding the agent from mimicking reference prompts to discovering novel patterns. Additionally, we introduce tailored stabilization mechanisms comprising historical loss normalization and relative reward calibration to ensure robust training. Extensive experiments demonstrate that CRL-BPT establishes new state-of-the-art performance and generates highly interpretable prompts under a strict budget of API calls. Code is available at https://anonymous.4open.science/r/CRL-BPT.
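🎯 Motivation: Large reasoning models (LRMs) need effective tool use and reasoning to solve real-world problems, but existing models decompose tasks poorly, which hurts their performance on complex problems.
❓ Problem: Address repetitive and meaningless reflective reasoning by LRMs in complex tool-use scenarios by improving task decomposition and restoring reflective reasoning.
🔍 Analysis: Empirical analysis shows that LRMs frequently exhibit "Lazy Reasoning", mainly because inadequate task decomposition leads to repeated, redundant reasoning.
🛠️ Method: Proposes D-CORE, a two-stage training framework: the first stage uses self-distillation to incentivize task decomposition, and the second uses diversity-aware reinforcement learning to restore reflective reasoning.
📊 Data and experiments: Validated across multiple benchmarks and model scales; on BFCLv3, D-CORE-8B reaches 77.7% accuracy, surpassing the best-performing 8B model by 5.7%, and D-CORE-14B sets a new state of the art at 79.3%.
⭐ Contributions: Identifies and addresses the "Lazy Reasoning" phenomenon in LRMs with a two-stage framework that significantly improves task decomposition and complex tool use, outperforming 70B models with a model 5× smaller.
Full abstract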
Effective tool use and reasoning are essential capabilities for large reasoning models (LRMs) to address complex real-world problems. Through empirical analysis, we identify a prevalent "Lazy Reasoning" phenomenon, where LRMs frequently engage in repetitive and meaningless reflective reasoning. This occurs primarily due to their inadequate ability to decompose tasks when reasoning in complex tool use scenarios. To address this, we propose a two-stage training framework, D-CORE (Decomposing tasks and Composing Reasoning processes), that first incentivizes the LRM's task decomposition reasoning capability via self-distillation, followed by diversity-aware reinforcement learning (RL) to restore the LRM's reflective reasoning capability. D-CORE achieves robust tool-use improvements across diverse benchmarks and model scales. Experiments on BFCLv3 demonstrate the superiority of our method: D-CORE-8B reaches 77.7% accuracy, surpassing the best-performing 8B model by 5.7%. Meanwhile, D-CORE-14B establishes a new state-of-the-art at 79.3%, outperforming 70B models despite being 5× smaller. The source code and data sample are in the supplementary material.
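🎯 Motivation: Speculative reasoning has been proposed to accelerate reasoning-intensive generation in large multimodal models, but its effectiveness is often limited by misalignment between speculative drafts and target-verified reasoning.
❓ Problem: Improve the quality and efficiency of speculative reasoning while avoiding the performance loss caused by propagation of step-level errors.
🔍 Analysis: In existing speculative reasoning frameworks, draft steps and target verification are often misaligned, which hurts both generation quality and efficiency.
🛠️ Method: Introduces Speculative Alignment Policy Optimization (SAPO), an RL objective for training draft models; a Threshold-based Verification Mechanism (TBVM) that reduces error propagation; and a Fully Parallel Speculative Reasoning (FPSR) framework that parallelizes multi-step reasoning for stable, efficient inference.
📊 Data and experiments: Evaluated on several reasoning-heavy benchmarks, achieving up to 2.49× speedup while preserving the target model's accuracy.
⭐ Contributions: Substantially improves the efficiency of speculative reasoning with a fully parallel framework and a stable verification mechanism, balancing reasoning quality against speed for multimodal models.
Full abstract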
Speculative reasoning has recently been proposed as a means to accelerate reasoning-intensive generation in large multimodal models, but its effectiveness is often constrained by misalignment between speculative drafts and target-verified reasoning. In this work, we introduce DREAM-R, a framework that substantially improves the performance of speculative reasoning. At its core, DREAM-R employs Speculative Alignment Policy Optimization (SAPO), a reinforcement-learning objective that trains draft models to generate reasoning steps that are both faithful to target trajectories and concise. We further propose a Threshold-based Verification Mechanism (TBVM) that uses a ratio-based criterion to provide stable and interpretable acceptance of speculative steps only when positive evidence clearly dominates, thereby preventing error propagation. Building on these components, we develop a Fully Parallel Speculative Reasoning (FPSR) framework that parallelizes draft generation, target-side reasoning, and verification across multi-step reasoning, enabling early stopping and clean fallback. Experiments on reasoning-heavy benchmarks demonstrate up to 2.49× speedup while preserving target-model accuracy, yielding substantial efficiency gains without compromising reasoning quality.
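🎯 Motivation: Study how to learn dexterous bimanual manipulation policies from human hand-object demonstrations, for long-horizon tasks with demanding manipulation requirements.
❓ Problem: Address the large action space, spatiotemporal discontinuities, and the embodiment gap between human and robot hands.
🔍 Analysis: On long-horizon bimanual tasks learned from human demonstrations, existing methods fall short in tracking object states and in precise manipulation.
🛠️ Method: Proposes DexMachina, a curriculum-based algorithm in which virtual object controllers with decaying strength first drive the object toward its target states, so that the policy gradually learns to take over under motion and contact guidance.
📊 Data and experiments: Releases a simulation benchmark covering diverse tasks and dexterous hands, and shows that DexMachina significantly outperforms baseline methods.
⭐ Contributions: Provides a platform for functional comparison of hardware designs, reports key findings about hardware capabilities, and lowers the barrier to future research.
Full abstract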
We study the problem of functional retargeting: learning dexterous manipulation policies to track object states from human hand-object demonstrations. We focus on long-horizon, bimanual tasks with articulated objects, which are challenging due to the large action space, spatiotemporal discontinuities, and the embodiment gap between human and robot hands. We propose DexMachina, a novel curriculum-based algorithm. The key idea is to use virtual object controllers with decaying strength: an object is first driven automatically towards its target states, such that the policy can gradually learn to take over under motion and contact guidance. We release a simulation benchmark with a diverse set of tasks and dexterous hands, and show that DexMachina significantly outperforms baseline methods. Our algorithm and benchmark enable a functional comparison for hardware designs, and we present key findings informed by quantitative and qualitative results. With the recent surge in dexterous hand development, we hope this work will provide a useful platform for identifying desirable hardware capabilities and lower the barrier for contributing to future research. Videos and more at dexmachina-submission.github.io
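🎯 Motivation: RL fine-tuning of large language models suffers from diversity collapse, in which outputs lack variety, and a principled remedy is needed. Existing countermeasures lack theoretical grounding and often trade correctness for diversity.
❓ Problem: Prove why RL fine-tuning causes diversity collapse through a selection and reinforcement bias, and propose an improved mechanism that avoids it.
🔍 Analysis: Observes that any reward modification aimed at diversity collapse only needs to be applied to correct trajectories, and explains theoretically when existing heuristics help and why they fail.
🛠️ Method: Proposes differential smoothing, a principled method that modifies the reward only on correct trajectories and improves both correctness and diversity.
📊 Data and experiments: Experiments on models from 1B to 7B parameters, covering CountDown and real-world mathematical reasoning tasks; Pass@1 and Pass@k improve by up to 6.7% on the AIME24 dataset.
⭐ Contributions: A formal account of diversity collapse in RL fine-tuning, a more general differential smoothing method, and experiments across several domains that confirm its effectiveness and generality.
Full abstract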
It is widely recognized that reinforcement learning (RL) fine-tuning of large language models often leads to diversity collapse, where outputs lack variety. Prior work has proposed a range of heuristics to counteract this effect, but these methods are ad hoc: they frequently trade off correctness for diversity, their effectiveness varies across tasks, and in some cases they even contradict one another. In this work, we place these observations on a rigorous foundation. We first provide a formal proof of why RL fine-tuning exhibits diversity collapse via a selection and reinforcement bias. Next, we make a key observation that any reward modification to address diversity collapse only needs to be applied on the correct trajectories. Building directly on this analysis, we introduce a principled method, differential smoothing, that provably improves both correctness and diversity, outperforming vanilla RL as well as widely used entropy-based heuristics. Our theory precisely characterizes when existing heuristics help and why they fail, while showing that differential smoothing is universally superior. Extensive experiments with models from 1B to 7B parameters, across domains including CountDown and real-world mathematical reasoning, demonstrate consistent gains. Differential smoothing improves both Pass@1 and Pass@k, with up to 6.7% improvements on the AIME24 dataset.
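🎯 Motivation: Existing RL methods for fine-tuning masked diffusion language models need the marginal likelihood, which is computationally intractable. The authors look for a more efficient algorithm that improves performance and training stability.
❓ Problem: Proposes an algorithm that avoids computing the marginal likelihood, removing the main source of intractability in RL fine-tuning of masked diffusion language models.
🔍 Analysis: Exploits a dynamical relation between the unmasking posterior of the base model and the posterior that targets the reward-tilted distribution, which yields a simpler and more efficient fine-tuning procedure.
🛠️ Method: Proposes Discrete Tilt Matching (DTM), formulated as a cross-entropy loss that requires only forward evaluation of rewards and whose variance can be adaptively controlled.
📊 Data and experiments: Fine-tunes LLaDA-8B-Instruct on the Sudoku, Countdown, and MATH500 benchmarks, where DTM achieves higher accuracy at lower compute cost than prior RL-based fine-tuning methods.
⭐ Contributions: A new fine-tuning method (DTM) that avoids intractable likelihood evaluation, improves the efficiency and stability of training masked diffusion language models, and performs well across several tasks.
Full abstract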
Masked diffusion large language models (dLLMs) are a promising alternative to autoregressive generation. While reinforcement learning (RL) algorithms have been adapted to be compatible with dLLMs for fine-tuning them, their reliance on the computation of the marginal likelihood to evaluate policy objectives is intractable. To overcome this, we exploit a dynamical relation between the unmasking posterior of the base model and that which targets the reward-tilted distribution to derive Discrete Tilt Matching (DTM), an algorithm that avoids intractable likelihood evaluation entirely. DTM can be phrased as a cross-entropy loss that only requires forward evaluation of rewards and whose variance can be adaptively controlled, improving training stability. We motivate DTM on maze planning tasks, and show that fine-tuning LLaDA-8B-Instruct with DTM achieves higher accuracy at lower compute costs than prior RL-based fine-tuning methods across the Sudoku, Countdown, and MATH500 benchmarks.
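🎯 Motivation: Existing diffusion reinforcement learning needs about 50 denoising steps per sample, which makes sampling slow and limits practical use.
❓ Problem: Proposes DMSampler, a framework that accelerates sampling in diffusion RL by using fast distillation models.
🔍 Analysis: A distilled sampler drastically reduces the number of sampling steps, operates without classifier-free guidance, and often yields better sample quality.
🛠️ Method: Uses a dual iterative training scheme that alternately optimizes the policy model and the distilled sampler, with hybrid distillation sampling for training stability and reward-aware distillation to preserve high-reward capabilities.
📊 Data and experiments: On text-to-image and text-to-video generation, DMSampler improves textual accuracy on OCR-specific benchmarks and outperforms existing diffusion RL methods on comprehensive benchmarks, reaching state-of-the-art performance.
⭐ Contributions: Substantially reduces the sampling cost of diffusion RL, improves sample quality, and introduces an efficient training scheme with strong experimental results.
Full abstract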
We present DMSampler, a framework that accelerates diffusion reinforcement learning by using fast distillation models as its training-time sampling engine. It overcomes the key bottleneck of sampling from the policy model—typically requiring around 50 denoising steps—by employing a co-evolving distilled sampler that needs only 4–8 steps, yielding an order-of-magnitude speedup. This approach inherently offers several advantages: it drastically reduces sampling steps, operates without classifier-free guidance to prevent potential optimization bias, and often yields superior sample quality due to more deterministic denoising trajectories. The core of DMSampler is a dual iterative training scheme, where the policy model and the distillation sampler are alternately optimized to convergence. This scheme is enhanced by two key innovations: hybrid distillation sampling, which blends outputs from both models to ensure training stability, and reward-aware distillation, which explicitly preserves high-reward capabilities during knowledge transfer. Extensive experiments on text-to-image and text-to-video generation demonstrate that DMSampler produces a final policy model which achieves state-of-the-art performance—significantly boosting textual accuracy on OCR-specific benchmarks and outperforming existing diffusion RL methods on comprehensive GenEval and VBench benchmarks.
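🎯 Motivation: Reinforcement learning, especially RL from verifiable rewards, is a key phase of LLM training. Optimization practice still mostly follows pretraining and supervised fine-tuning, even though RL differs fundamentally from those stages.
❓ Problem: Asks whether RL can drop the memory-hungry AdamW optimizer in favor of the more memory-efficient SGD while matching or exceeding current performance.
🔍 Analysis: AdamW's adaptive learning rate and momentum matter less in RL; full fine-tuning with SGD updates fewer than 0.02% of model parameters without any sparsity-promoting regularization.
🛠️ Method: A hypothesis-driven analysis that compares AdamW and SGD in RL training, focusing on SGD's memory efficiency and optimization ability.
📊 Data and experiments: Experiments on RL for LLMs show that SGD uses far less memory than AdamW and matches or even outperforms it.
⭐ Contributions: Demonstrates the potential of SGD in RL, shows that RL updates can be far more parameter-efficient than assumed, and offers new insight into RL optimization dynamics that challenges conventional assumptions about LLM training.
Abstract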
Reinforcement learning (RL), particularly RL from verifiable reward (RLVR), has become a crucial phase of training large language models (LLMs) and a key focus of current scaling efforts. However, optimization practices in RL largely follow those of next-token-prediction stages (e.g., pretraining and supervised fine-tuning), despite the fundamental differences between RL and these stages emphasized by recent work. One such practice is the use of the AdamW optimizer, which is widely adopted for training large-scale transformers despite its high memory overhead. Our analysis shows that both momentum and adaptive learning rate of AdamW are less influential in RL than in SFT, leading us to hypothesize that RL benefits less from Adam’s per-parameter adaptive learning rates and momentum. Confirming our hypothesis, our experiments demonstrate that the substantially more memory-efficient SGD, which is known to perform poorly in supervised learning of large-scale transformers, matches or even outperforms AdamW in RL for LLMs. Remarkably, full fine-tuning with SGD updates fewer than 0.02% of model parameters without any sparsity-promoting regularization, more than 1,000 times fewer than AdamW. Our analysis offers potential reasons for this update sparsity. Our findings provide fresh insights into the optimization dynamics of RL in LLMs and demonstrate that RL can be substantially more parameter-efficient than previously recognized.
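As a rough illustration of the memory argument, the sketch below counts optimizer-state memory only. It relies on the standard fact that Adam-style optimizers keep two moment estimates per parameter, while plain SGD without momentum keeps none. The 7B parameter count and the 4-byte state precision are hypothetical, and `optimizer_state_bytes` is an illustrative helper, not something taken from the paper.

```python
def optimizer_state_bytes(num_params, optimizer, bytes_per_value=4):
    """Bytes of optimizer state, excluding weights and gradients."""
    # Number of extra per-parameter tensors each optimizer keeps.
    state_tensors = {"adamw": 2, "sgd_momentum": 1, "sgd": 0}
    if optimizer not in state_tensors:
        raise ValueError(f"unknown optimizer: {optimizer}")
    return num_params * state_tensors[optimizer] * bytes_per_value


num_params = 7_000_000_000  # hypothetical 7B-parameter model
adamw_gb = optimizer_state_bytes(num_params, "adamw") / 1e9
sgd_gb = optimizer_state_bytes(num_params, "sgd") / 1e9
print(f"AdamW state: {adamw_gb:.0f} GB, plain SGD state: {sgd_gb:.0f} GB")
```

Real training runs also hold weights, gradients, and activations, so this is only the part of the footprint that the choice of optimizer changes.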
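🎯 Motivation: High-quality kernels are critical for scalable AI systems, but training language models to generate kernel code is hampered by insufficient data, fragile environments, and exploitable rewards.
❓ Problem: Tackles reward hacking and lazy optimization with new methods and a new environment that improve the robustness and performance of kernel generation.
🔍 Analysis: Conventional policy-gradient methods suffer from bias, and models may prioritize superficial correctness over real speedup.
🛠️ Method: Proposes the KernelGYM environment, which supports reward-hacking checks and multi-turn data collection; introduces TRLOO to remove the policy-gradient bias; and uses PR and PRS to improve training stability and output quality.
📊 Data and experiments: Evaluated on Kernelbench. The trained Dr. Kernel-14B is competitive with Claude-4.5-Sonnet and, with sequential test-time scaling, outperforms GPT-5 and Claude-4.5-Sonnet on the level-2 subset.
⭐ Contributions: Designs the distributed GPU environment KernelGYM, proposes TRLOO for multi-turn RL, and develops the kernel-generation model Dr. Kernel-14B.
Abstract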
High-quality kernels are critical for scalable AI systems, and enabling LLMs to generate such code would advance AI development. However, training LLMs for this task requires sufficient data and a robust environment, and the process is often vulnerable to _reward hacking_ and _lazy optimization_: models may hack training rewards or prioritize trivial correctness over meaningful speedup. In this paper, we systematically study reinforcement learning (RL) for kernel generation. We first design **KernelGYM**, a robust distributed GPU environment that supports reward-hacking checks, data collection from multi-turn interactions, and long-term RL training. Building on KernelGYM, we investigate effective multi-turn RL methods and identify a biased policy gradient issue caused by self-inclusion in GRPO. To solve this, we propose Turn-level Reinforce-Leave-One-Out (**TRLOO**) to provide unbiased advantage estimation for multi-turn RL. To alleviate lazy optimization, we incorporate mismatch correction for training stability and introduce Profiling-based Rewards (**PR**) and Profiling-based Rejection Sampling (**PRS**). The trained model, Dr. Kernel-14B, reaches performance competitive with Claude-4.5-Sonnet on Kernelbench. Finally, we study sequential test-time scaling for Dr. Kernel-14B, which even **outperforms** GPT-5 and Claude-4.5-Sonnet on the Kernelbench level-2 subset.
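The self-inclusion issue can be illustrated with the generic leave-one-out baseline. The sketch below is not the paper's turn-level TRLOO: it only contrasts a group-mean baseline that includes each sample's own reward with one averaged over the other samples. The standard-deviation scaling used by GRPO is omitted, and both function names are illustrative.

```python
from statistics import mean


def mean_baseline_advantages(rewards):
    """Baseline is the group mean, which includes the sample itself."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def leave_one_out_advantages(rewards):
    """Baseline for each sample is the mean reward of the other samples."""
    n = len(rewards)
    if n < 2:
        raise ValueError("leave-one-out needs at least two samples")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


rewards = [1.0, 0.0, 0.0, 0.0]
print(mean_baseline_advantages(rewards))
print(leave_one_out_advantages(rewards))
```

With the group mean, each sample's own reward leaks into its baseline and shrinks the advantage by a factor of (n - 1) / n; the leave-one-out baseline is independent of the sample it is applied to.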
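🎯 Motivation: Turning natural language into optimization models requires accurate translation from text to mathematical formulations; existing methods struggle with silent modeling errors that produce invalid results.
❓ Problem: Proposes an RL framework that handles modeling errors by auditing and revising the drafted program to improve the final result.
🔍 Analysis: Existing methods rely on intermediate feedback for repair but do not internalize the ability to revise, so incorrect formulations still execute successfully while yielding invalid results.
🛠️ Method: DA-RL, a two-step iterative workflow that optimizes a shared-parameter policy with terminal-only verification feedback so that structured self-correction emerges.
📊 Data and experiments: Experiments on several natural-language-to-optimization datasets show marked gains in accuracy and self-correction for models trained with DA-RL.
⭐ Contributions: Proposes the Draft-and-Audit RL framework, which improves the accuracy and robustness of natural-language-to-optimization modeling and enables cross-turn synergy in the policy.
Abstract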
Natural language to optimization (NL2Opt) requires translating unstructured text into executable mathematical models. Beyond simple syntax errors, this task suffers from silent modeling failures, where incorrect formulations execute successfully but yield invalid results. We propose Draft-and-Audit RL (DA-RL), a framework that learns optimization modeling as a two-step iterative workflow. Unlike inference-time scaffolds that rely on intermediate solver feedback to guide repairs, DA-RL optimizes a shared-parameter policy using terminal-only verification: the model is rewarded solely based on the execution of the final audited program. This constraint forces the model to internalize rubric-guided revision as a learned capability and encourages the emergence of cross-turn synergy, where the policy learns to generate drafts that are structurally amenable to self-correction.
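🎯 Motivation: Existing RL optimization is task-specific, which weakens the generalization of Vision-Language-Action (VLA) models. Cross-task feature representations are essential for generality.
❓ Problem: Addresses RL optimizers overfitting to a narrow task set by using dynamic optimization to improve VLA generalization.
🔍 Analysis: Analyzes in depth the task-specialization phenomenon in RL optimization and highlights the key role of cross-task feature representations in improving generality.
🛠️ Method: Proposes the DyGRO-VLA framework, which captures cross-task latent representations based on information-theoretic principles and dynamically refines policy optimization via a mixture of RL residuals to mitigate adverse interference.
📊 Data and experiments: Validated on the LIBERO and RoboTwin2 benchmarks and in real-world settings, with consistent improvements over baselines under multi-task training and distribution shift.
⭐ Contributions: Proposes a dynamic grouped-residual optimization framework that markedly improves cross-task generalization of VLA models and advances RL in multi-task settings.
Abstract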
Recent progress in Reinforcement Learning (RL) provides a principled approach to optimizing Vision-Language-Action (VLA) models, facilitating a shift from trajectory imitation to active learning in the task environment. Despite improvements in control precision, most RL optimizers remain task-specific, which reduces VLA models from generalist controllers to policies that overfit to a narrow set of tasks. In this study, we conduct an in-depth analysis of this phenomenon and highlight the importance of cross-task feature representations for improving the generalizability of VLA models. Motivated by this finding, we introduce DyGRO-VLA, a two-stage optimization framework that 1) effectively captures cross-task latent representations based on information-theoretic principles, and 2) dynamically refines policy optimization via a mixture-of-RL-residuals. DyGRO-VLA enables the RL optimizer to exploit task-relevant latent information while strategically mitigating adverse interference on the learned representations throughout the optimization process. We evaluate our approach on the LIBERO and RoboTwin2 benchmarks and further validate it in the real world, demonstrating consistent improvements over strong baselines under multi-task training and distribution shift.
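🎯 Motivation: Precise understanding of egocentric human-environment interaction is essential for next-generation intelligent agents such as assistive robots, yet existing multimodal LLMs fall short in accuracy and generalization.
❓ Problem: Improve the reasoning accuracy and generalization of multimodal LLMs for egocentric interaction understanding and pixel-level grounding.
🔍 Analysis: Existing methods perform poorly when unifying scene-level analysis with instance-level grounding, which limits cross-modal parsing and interaction understanding.
🛠️ Method: Proposes the RL-based EARL framework, which combines a two-stage parsing pipeline with an Analysis-guided Feature Synthesizer, uses a global interaction descriptor as a semantic prior for query-oriented reasoning, and optimizes the policy with a multi-faceted reward.
📊 Data and experiments: On Ego-IRGBench, EARL reaches 65.48% cIoU for pixel grounding, 8.37% above the previous state-of-the-art RL-based methods, and generalizes well in out-of-distribution evaluations.
⭐ Contributions: Proposes the unified RL framework EARL, with a new Analysis-guided Feature Synthesizer and a multi-faceted reward mechanism, markedly improving the accuracy and generalization of egocentric interaction understanding.
Abstract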
A precise and comprehensive understanding of human-environment interactions in egocentric vision is essential for next-generation intelligent agents, such as assistive robotics. While existing multimodal large language models (MLLMs) support unified reasoning from scene-level analysis to instance-specific grounding, their accuracy and generalization remain limited. To this end, this paper introduces a novel Egocentric Analysis-guided RL-based method (EARL) that employs Group Relative Policy Optimization (GRPO) to enhance the interaction understanding of MLLMs in first-person vision. Specifically, EARL adopts a two-stage parsing framework including coarse-grained interpretation and fine-grained response. The first stage holistically interprets egocentric interactions and generates a structured textual description. The second stage produces the language answer and corresponding pixel-level grounding mask in response to the user query. To bridge the two stages, we extract a global interaction descriptor from the first stage and treat it as a semantic prior, which is then integrated via a novel Analysis-guided Feature Synthesizer (AFS) to support query-oriented reasoning. Furthermore, to effectively guide policy optimization, we design a sophisticated, multi-faceted reward mechanism that incorporates format correctness, answer relevance, and grounding accuracy. Experimental results demonstrate that EARL achieves an impressive 65.48% cIoU on the Ego-IRGBench benchmark for pixel grounding, surpassing previous state-of-the-art RL-based methods by 8.37%. Superior performance in out-of-distribution evaluations further validates EARL's generalization capability.
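🎯 Motivation: RL post-training aligns language models effectively, but the training process is complex, costly, and unstable.
❓ Problem: Proposes a training-free inference method that samples directly from the optimal RL policy, avoiding the post-training process and improving efficiency.
🔍 Analysis: The energy term is estimated with Monte Carlo, and experiments confirm that this improves generation quality, especially on reasoning, coding, and science tasks.
🛠️ Method: Builds Energy-Guided Test-Time Scaling (ETS), which combines a reference policy model with an energy term estimated by online Monte Carlo with a provable convergence rate, plus acceleration frameworks and importance-sampling estimators for efficiency.
📊 Data and experiments: Evaluated on reasoning, coding, and science benchmarks with several language models (autoregressive and diffusion), showing consistent gains in generation quality.
⭐ Contributions: Proposes ETS, a training-free inference framework for RL alignment that improves generation quality and inference efficiency while avoiding the cost and instability of post-training.
Abstract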
Reinforcement Learning (RL) post-training alignment for language models is effective, but also costly and unstable in practice, owing to its complicated training process. To address this, we propose a training-free inference method to sample directly from the optimal RL policy. The transition probability applied to Masked Language Modeling (MLM) consists of a reference policy model and an energy term. Based on this, our algorithm, Energy-Guided Test-Time Scaling (ETS), estimates the key energy term via online Monte Carlo, with a provable convergence rate. Moreover, to ensure practical efficiency, ETS leverages modern acceleration frameworks alongside tailored importance sampling estimators, substantially reducing inference latency while provably preserving sampling quality. Experiments on MLM (including autoregressive models and diffusion language models) across reasoning, coding, and science benchmarks show that our ETS consistently improves generation quality, validating its effectiveness and design.
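For orientation, the "optimal RL policy" is commonly taken to be the closed-form solution of KL-regularized reward maximization. The notation below (prompt $q$, full response $y$, partial state $s_t$, reference model $\pi_{\mathrm{ref}}$, reward $r$, temperature $\beta$) is assumed for illustration and may differ from the paper's. The second expression shows one standard way such a target factorizes into a reference transition times an expectation, which is the kind of quantity Monte Carlo rollouts can estimate.

```latex
\pi^{*}(y \mid q) \;\propto\; \pi_{\mathrm{ref}}(y \mid q)\,\exp\!\big(r(q, y)/\beta\big),
\qquad
p^{*}(s_{t+1} \mid s_t) \;\propto\; p_{\mathrm{ref}}(s_{t+1} \mid s_t)\;
\mathbb{E}_{y \sim \pi_{\mathrm{ref}}(\cdot \mid s_{t+1})}\!\big[\exp\!\big(r(q, y)/\beta\big)\big].
```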
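🎯 Motivation: RL with verifiable rewards improves LLM reasoning, but the learning signal tends to collapse during training, which creates a performance bottleneck.
❓ Problem: Existing methods discard advantage-degenerated rollouts that still contain valuable signal, which limits training gains.
🔍 Analysis: As training proceeds, the reward standard deviation over many prompts' rollouts drops to zero, so the policy-gradient update becomes ineffective.
🛠️ Method: Proposes EchoRL, which uses the entropy pattern of model outputs to identify useful clips in advantage-degenerated rollouts and adds them to the RL objective as an auxiliary supervision signal.
📊 Data and experiments: Validated on 10 benchmarks, 5 LLMs, and 7 RLVR methods, where EchoRL consistently improves training performance at minimal extra compute.
⭐ Contributions: Proposes the modular EchoRL method, which addresses signal degeneration in RLVR and improves RL training efficiency for LLM reasoning.
Abstract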
Reinforcement Learning with Verifiable Rewards (RLVR) is an effective post-training route for strengthening the reasoning capability of large language models. However, as training proceeds, the learning signal can collapse, making the training gain marginal. Specifically, a growing fraction of prompts' rollouts become advantage-degenerated: all the self-generated rollouts are verified as successful, so the standard deviation of their rewards is zero and each rollout's advantage degenerates to zero as well. With such advantages, the policy gradient eventually vanishes, capping training performance. We argue that some of these rollouts still contain valuable learning signals that existing RLVR methods discard. In this paper, inspired by an analysis of the entropy pattern behind golden trajectories produced by external expert models, we propose EchoRL to better exploit advantage-degenerated rollouts and further improve training performance. EchoRL is a lightweight module that first identifies an EchoClip from verified-success rollouts based on their step-level entropy values, and then feeds this clip back as an auxiliary supervision signal in the RL objective. Extensive experiments across 10 benchmarks, 5 LLM backbones, and 7 popular RLVR post-training methods demonstrate that EchoRL consistently improves RLVR post-training with minimal overhead.
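The degeneration described above follows directly from group-normalized advantages. A minimal sketch, assuming the usual GRPO-style form (reward minus group mean, divided by group standard deviation plus a small `eps`); the function name and the `eps` value are illustrative:

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-6):
    """GRPO-style advantages for the rollouts of a single prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


mixed = group_normalized_advantages([1.0, 0.0, 1.0, 0.0])
all_success = group_normalized_advantages([1.0, 1.0, 1.0, 1.0])
print(mixed)        # non-zero advantages: the prompt still gives a gradient
print(all_success)  # all zeros: the prompt contributes nothing to the update
```

When every rollout of a prompt is verified as successful, each centered reward is exactly zero, so the prompt drops out of the policy-gradient update regardless of how the rollouts differ internally.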
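🎯 Motivation: Generative models for 3D molecular conformations must respect Euclidean symmetries and concentrate probability on thermodynamically favorable, mechanically stable structures. Existing E(3)-equivariant diffusion models tend to reproduce data biases instead of capturing the equilibrium distribution of a high-fidelity Hamiltonian.
❓ Problem: Proposes Elign, a framework that amortizes the cost of physics-based guidance by removing the bottlenecks of quantum-chemical evaluation and repeated queries.
🔍 Analysis: Current physics-based guidance needs expensive quantum-chemical computation and must repeat such queries during every generation run, which limits practical use.
🛠️ Method: Replaces expensive DFT computation with a pretrained machine-learning force field (MLFF) and moves physical steering into the training phase via RL, fine-tuning the denoising policy with the FED-GRPO algorithm.
📊 Data and experiments: Elign generates conformations with better DFT energies and stability than the baselines, while inference stays as fast as unguided sampling.
⭐ Contributions: Proposes a generative framework that amortizes quantum-chemical cost, improving conformation quality while keeping inference fast enough for practical use.
Abstract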
Generative models for 3D molecular conformations must respect Euclidean symmetries and concentrate probability mass on thermodynamically favorable, mechanically stable structures. However, E(3)-equivariant diffusion models often reproduce biases from semi-empirical training data rather than capturing the equilibrium distribution of a high-fidelity Hamiltonian. While physics-based guidance can correct this, it faces two computational bottlenecks: expensive quantum-chemical evaluations (e.g., DFT) and the need to repeat such queries at every sampling step. We present Elign, a post-training framework that amortizes both costs. First, we replace expensive DFT evaluations with a faster, pretrained foundational machine-learning force field (MLFF) to provide physical signals. Second, we eliminate repeated run-time queries by shifting physical steering to the training phase. To achieve the second amortization, we formulate reverse diffusion as a reinforcement learning problem and introduce Force--Energy Disentangled Group Relative Policy Optimization (FED-GRPO) to fine-tune the denoising policy. FED-GRPO includes a potential-based energy reward and a force-based stability reward, which are optimized and group-normalized independently. Experiments show that Elign generates conformations with lower gold-standard DFT energies and forces, while improving stability. Crucially, inference remains as fast as unguided sampling, since no energy evaluations are required during generation.
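🎯 Motivation: The reasoning of multimodal LLMs is distorted by extreme samples, which calls for a more stable normalization method.
❓ Problem: Addresses the sensitivity of std-based normalization to extreme samples in multimodal models, to improve reasoning ability and robustness.
🔍 Analysis: Because of perceptual complexity and reasoning uncertainty, multimodal models are prone to normalization distortion from samples whose rewards are almost all positive or almost all negative.
🛠️ Method: Proposes difficulty-aware group normalization (Durian), which groups samples by difficulty, measured via visual entropy and model confidence, and shares the standard deviation within each group.
📊 Data and experiments: Experiments on multiple multimodal reasoning benchmarks show improved performance and reduced sensitivity to extreme samples.
⭐ Contributions: Brings a difficulty-aware mechanism into group normalization, improving the robustness and reasoning performance of multimodal LLMs and giving them a more stable optimization tool.
Abstract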
Reinforcement Learning with Verifiable Rewards (RLVR) and Group Relative Policy Optimization (GRPO) have significantly advanced the reasoning capabilities of large language models. Extending these methods to multimodal settings, however, faces a critical challenge: the instability of std-based normalization, which is easily distorted by extreme samples with nearly all-positive or all-negative rewards. Unlike pure-text LLMs, multimodal models are particularly sensitive to such distortions, as both perceptual and reasoning errors influence their responses. To address this, we characterize each sample by its difficulty, defined through perceptual complexity (measured via visual entropy) and reasoning uncertainty (captured by model confidence). Building on this characterization, we propose difficulty-aware group normalization (Durian), which re-groups samples by difficulty levels and shares the std within each group. Our approach preserves GRPO's intra-group distinctions while eliminating sensitivity to extreme cases, yielding significant performance gains across multiple multimodal reasoning benchmarks.
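One way to read "shares the std within each group" is sketched below. This is an assumption-heavy illustration, not the paper's algorithm: rewards are centered per prompt as in GRPO, but the scale is a pooled standard deviation over all prompts in the same difficulty bucket. The bucket labels are hypothetical inputs (the abstract derives difficulty from visual entropy and model confidence, which is not reproduced here), and `shared_std_advantages` is an illustrative name.

```python
from collections import defaultdict
from math import sqrt


def shared_std_advantages(prompts, eps=1e-6):
    """prompts: list of (difficulty_level, rewards) pairs, one per prompt."""
    centered_prompts = []
    pooled = defaultdict(list)
    for level, rewards in prompts:
        mu = sum(rewards) / len(rewards)
        centered = [r - mu for r in rewards]  # keeps intra-prompt ordering
        centered_prompts.append((level, centered))
        pooled[level].extend(centered)
    # One standard deviation per difficulty bucket, shared by its prompts.
    shared_std = {
        level: sqrt(sum(c * c for c in values) / len(values))
        for level, values in pooled.items()
    }
    return [
        [c / (shared_std[level] + eps) for c in centered]
        for level, centered in centered_prompts
    ]


prompts = [
    ("easy", [1.0, 1.0, 1.0, 0.0]),  # nearly all-positive rewards
    ("easy", [1.0, 0.0, 1.0, 0.0]),
    ("hard", [0.0, 0.0, 0.0, 1.0]),  # nearly all-negative rewards
]
for advantages in shared_std_advantages(prompts):
    print([round(a, 3) for a in advantages])
```

Because the two "easy" prompts share one scale, a prompt with a single outlier reward is no longer normalized by its own small standard deviation.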
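🎯 Motivation: Diffusion large language models (dLLMs) are an alternative to autoregressive LLMs (AR-LLMs) with potentially higher inference throughput, but they need reinforcement learning to improve further on reasoning tasks.
❓ Problem: Existing RL algorithms are not well suited to the unique characteristics of dLLMs, particularly for improving reasoning.
🔍 Analysis: A small training batch size is identified as a key challenge when implementing distribution matching and needs dedicated techniques.
🛠️ Method: Proposes Distribution Matching Policy Optimization (DMPO), which matches the dLLM policy distribution to the optimal reward-tilted distribution via cross-entropy optimization and introduces a weight baseline subtraction technique to mitigate the small-batch problem.
📊 Data and experiments: On multiple reasoning benchmarks and without supervised fine-tuning, DMPO improves accuracy by up to 54.3% over previous SOTA baselines and by 66.41% over the base model.
⭐ Contributions: Designs DMPO, a theoretically grounded RL fine-tuning method that markedly improves dLLM reasoning performance and shows the value of the distribution-matching framework for language models.
Abstract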
Diffusion large language models (dLLMs) are promising alternatives to autoregressive large language models (AR-LLMs), as they potentially allow higher inference throughput. Reinforcement learning (RL) is a crucial component for dLLMs to achieve comparable performance with AR-LLMs on important tasks, such as reasoning. However, RL algorithms that are well-suited for dLLMs' unique characteristics have yet to be developed. This paper proposes **Distribution Matching Policy Optimization (DMPO)**, a principled and theoretically grounded RL fine-tuning method specifically designed to enhance the reasoning capabilities of dLLMs by matching the dLLM policy distribution to the optimal, reward-tilted one through cross-entropy optimization. We identify a key challenge in the implementation with a small training batch size and propose several effective solutions through a novel weight baseline subtraction technique. DMPO exhibits superior performance on multiple reasoning benchmarks without supervised fine-tuning, with an accuracy improvement of up to $54.3\%$ over previous SOTA baselines and $66.41\%$ over the base model, underscoring the effectiveness of the distribution matching framework.
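To make "reward-tilted" and "weight baseline subtraction" concrete, here is a minimal sketch under stated assumptions: samples from the current policy are weighted by a softmax of reward over a temperature `beta`, and the baseline is taken to be the uniform weight 1/n. The abstract does not specify either choice, so both are illustrative, as are the function names.

```python
from math import exp


def tilted_weights(rewards, beta=1.0):
    """Self-normalized weights proportional to exp(reward / beta)."""
    top = max(rewards)  # subtract the max for numerical stability
    unnormalized = [exp((r - top) / beta) for r in rewards]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def baseline_subtracted(weights):
    """Center the weights so below-average samples get negative weight."""
    baseline = 1.0 / len(weights)  # assumed baseline: the uniform weight
    return [w - baseline for w in weights]


rewards = [1.0, 0.0, 0.0, 0.0]
weights = tilted_weights(rewards, beta=0.5)
print(weights)
print(baseline_subtracted(weights))
```

In a weighted cross-entropy loss, purely positive weights push up the likelihood of every sampled response, which matters more when the batch is small; centered weights push low-reward samples down instead.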
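🎯 Motivation: Vision-Language-Action (VLA) models must be adapted to real-world robot hardware, but demonstrations are costly, so adaptation has to work under a limited data budget.
❓ Problem: The conventional diversity-maximizing strategy can be inefficient because of estimation noise; the work studies the trade-off between coverage and density of the data distribution to optimize adaptation.
🔍 Analysis: Identifies a "diversity trap" and formalizes a coverage-density trade-off that decomposes policy error into estimation and extrapolation terms, revealing an interior optimal allocation under a fixed budget.
🛠️ Method: Proposes Anchor-Centric Adaptation (ACA), a two-stage framework that first stabilizes the policy at core anchors through repeated demonstrations, then selectively expands to high-risk boundaries via teacher-forced error mining and constrained residual updates.
📊 Data and experiments: Real-robot experiments validate the coverage-density trade-off and show that ACA significantly improves task reliability and success rate under the same budget.
⭐ Contributions: Analyzes the diversity trade-off in robot adaptation, proposes the anchor-centric adaptation framework, and demonstrates its advantage experimentally, opening a path to low-cost robot adaptation.
Abstract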
While Vision-Language-Action (VLA) models offer broad general capabilities, deploying them on specific hardware requires real-world adaptation to bridge the embodiment gap. Since robot demonstrations are costly, this adaptation must often occur under a strict data budget. In this work, we identify a critical **diversity trap**: the standard heuristic of "maximizing coverage" by collecting diverse, single-shot demonstrations can be self-defeating due to non-vanishing estimation noise. We formalize this phenomenon as a **Coverage--Density Trade-off**. By decomposing the policy error into estimation (density) and extrapolation (coverage) terms, we characterize an interior optimal allocation of unique conditions for a fixed budget. Guided by this analysis, we propose **Anchor-Centric Adaptation (ACA)**, a two-stage framework that first stabilizes a policy skeleton through repeated demonstrations at core anchors, then selectively expands coverage to high-risk boundaries via teacher-forced error mining and constrained residual updates. Real-robot experiments validate our trade-off framework and demonstrate that ACA significantly improves task reliability and success rates over standard diverse sampling strategies under the same budget.
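🎯 Motivation: LLMs are usually optimized with RL to produce a single best answer, which is a poor fit for applications such as medical diagnosis that need a range of answers with uncertainty estimates.
❓ Problem: Proposes an RL approach that changes the objective so the model generates multiple candidate answers in a single forward pass, internalizing part of inference-time search.
🔍 Analysis: Single-answer training tends to regenerate the dominant mode and represents diversity and uncertainty poorly, which hurts downstream use.
🛠️ Method: Proposes Multi-RLVR and Multi-RLCR. The former extends RLVR to multiple answers with a set-level reward; the latter adds a Brier-score-based calibration objective so that each answer carries a calibrated uncertainty estimate.
📊 Data and experiments: On question-answering and medical diagnostic benchmarks, the approach beats single-answer baselines on diversity, recall, and set-level calibration, and is also more token-efficient.
⭐ Contributions: Proposes a compute-efficient multi-answer RL framework that improves diversity, uncertainty estimation, and generation efficiency, offering an alternative to inference-time scaling.
Abstract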
Large language models (LMs) are typically post-trained via RL to produce a single best answer per query, implicitly optimizing for modal correctness. While effective for benchmark accuracy, this approach is not ideal for many applications of interest, such as medical diagnosis, which would benefit from models generating a set of plausible answers (ideally paired with uncertainty estimates). This paper describes a multi-answer reinforcement learning (RL) approach for enabling LMs to do this, where we modify the RL objective to train models to explicitly generate multiple candidate answers in a single forward pass, internalizing aspects of inference-time search into the model’s generative process. We instantiate this approach through Multi-Answer Reinforcement Learning with Verifiable Rewards (Multi-RLVR), which generalizes ordinary RLVR to the multi-answer case with a set-level reward. We further extend this approach to Multi-Answer Reinforcement Learning with Calibrated Rewards (Multi-RLCR), which adds a set-level Brier score-based calibration objective to enable LMs to output calibrated uncertainty estimates associated with each answer in the output set. Multi-answer training promotes explicit representation of alternative hypotheses rather than repeated generation of the dominant mode. Across question-answering and medical diagnostic benchmarks, we observe improved diversity, recall, and set-level calibration scores compared to single answer-trained baselines. We further observe that models trained with our approach are more token-efficient, requiring fewer tokens to generate multiple answers than competing approaches. These results position multi-answer RL as a principled and compute-efficient alternative to inference-time scaling.
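The Brier score behind the calibration objective is the squared gap between a stated confidence and the 0/1 outcome. The sketch below averages it over an answer set; how the paper aggregates over the set is not given in the abstract, so the plain mean and the name `set_brier_score` are assumptions.

```python
def set_brier_score(confidences, is_correct):
    """Mean squared gap between each confidence and its 0/1 correctness."""
    if not confidences or len(confidences) != len(is_correct):
        raise ValueError("need one correctness label per confidence")
    gaps = [(p - float(y)) ** 2 for p, y in zip(confidences, is_correct)]
    return sum(gaps) / len(gaps)


# Two candidate answers: a confident correct one and a hedged wrong one.
print(set_brier_score([0.9, 0.2], [True, False]))
# Overconfidence in a wrong answer is penalized much more heavily.
print(set_brier_score([0.9, 0.9], [True, False]))
```

Lower is better, and a reward could use the negated score so that well-calibrated answer sets score higher.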
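🎯 Motivation: Randomized experiments are widely used to evaluate interventions in dynamic systems, but under nonstationarity current methods suffer from high bias and high variance and cannot accurately estimate the global average treatment effect (GATE).
❓ Problem: Proposes a new Truncated Policy Gradient (TPG) estimator that reduces bias and variance in nonstationary Markovian settings, improving the evaluation of intervention effects.
🔍 Analysis: Nonstationarity makes the effect of an intervention on current and future system states hard to model, and existing methods cannot reliably handle the interplay of short-term and long-term effects.
🛠️ Method: Replaces instantaneous outcomes with short-horizon outcome trajectories and truncates the first-order approximation to the GATE, giving an estimator with a policy-gradient interpretation and provable reductions in bias and variance.
📊 Data and experiments: Two real-world case studies validate the theory and show that the TPG estimator attains low bias and low variance in complex nonstationary settings.
⭐ Contributions: Proposes a new policy-gradient-based estimator that addresses intervention evaluation under nonstationarity and provides theoretical and practical foundations for randomized experiments in complex dynamic systems.
Abstract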
Randomized experiments (or A/B tests) are widely used to evaluate interventions in dynamic systems such as recommendation platforms, marketplaces, and digital health. In these settings, interventions affect both current and future system states, so estimating the global average treatment effect (GATE) requires accounting for temporal dynamics, which is especially challenging in the presence of nonstationarity; existing approaches suffer from high bias, high variance, or both. In this paper, we address this challenge via the novel Truncated Policy Gradient (TPG) estimator, which replaces instantaneous outcomes with short-horizon outcome trajectories. The estimator admits a policy-gradient interpretation: it is a truncation of the first-order approximation to the GATE, yielding provable reductions in bias and variance in nonstationary Markovian settings. We further establish a central limit theorem for the TPG estimator and develop a consistent variance estimator that remains valid under nonstationarity with single-trajectory data. We validate our theory with two real-world case studies. The results show that a well-calibrated TPG estimator attains low bias and variance in practical nonstationary settings, and highlight the value of the policy gradient approach in the design of effective estimators despite complex dynamics.
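🎯 Motivation: RL with verifiable rewards (RLVR) effectively improves LLM reasoning, but existing methods optimize the policy from scratch, with high sampling cost and inefficient use of accumulated experience.
❓ Problem: Reusing experience as fixed reasoning trajectories causes policy mismatch, which sharply reduces the efficiency of experience reuse as the model evolves.
🔍 Analysis: Model capability and policy behavior change during training, so fixed-trajectory reuse cannot adapt to the changing policy and hurts optimization.
🛠️ Method: Proposes Experience-Augmented Policy Optimization (EAPO), which uses an action-level experience prior to inject experience selectively at critical decision points, with an adapted importance-sampling scheme for stable and unbiased learning.
📊 Data and experiments: Experiments with Qwen-2.5-math 7b and Qwen-3-8B on five benchmarks show that EAPO consistently improves reasoning performance over state-of-the-art RLVR methods.
⭐ Contributions: Designs EAPO, an action-level, policy-adaptive experience-reuse framework that offers a new approach to policy optimization and markedly improves LLM reasoning.
Abstract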
Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful paradigm for improving the reasoning capabilities of large language models (LLMs). However, existing RLVR methods typically rely on on-policy optimization from scratch, resulting in high sampling costs and inefficient utilization of accumulated experience. As model capabilities and policy behaviors evolve during training, recent attempts to reuse experience via fixed reasoning trajectories further suffer from policy mismatch. Motivated by these limitations, we argue that experience in RLVR should not be reused as fixed reasoning trajectories, but instead expressed in a policy-adaptive manner. In this work, we propose Experience-Augmented Policy Optimization (EAPO), which leverages a prior RL-optimized policy as an action-level experience prior and selectively injects experience at critical decision points during rollout. To ensure stable and unbiased learning from experience-augmented rollouts, EAPO further incorporates an adapted importance sampling scheme. Experiments with Qwen-2.5-math 7b and Qwen-3-8B on five different benchmarks demonstrate that EAPO consistently improves reasoning performance over state-of-the-art RLVR methods.
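🎯 Motivation: Multi-turn agents face continuously changing decision contexts and goals, and existing approaches to experience reuse adapt poorly to context.
❓ Problem: Proposes a hybrid memory strategy that partially reuses successful experience to improve the self-evolution of multi-turn tool-use policies.
🔍 Analysis: Full trajectories are hard to transfer and tool-level reuse ignores context, which limits existing methods in complex scenarios.
🛠️ Method: Designs a hybrid-memory framework that combines a tool graph with compact episodic summaries, dynamically balances episodic recall and procedural execution at inference, and introduces a memory-guided RL paradigm to improve exploration over long trajectories.
📊 Data and experiments: On multi-turn tool-use benchmarks, inference-time performance improves by more than 50% over strong baselines, and RL policy performance improves by more than 40% on out-of-distribution tasks.
⭐ Contributions: Provides a multi-turn agent framework that dynamically balances episodic and procedural memory, markedly improving inference performance and policy generalization while making RL exploration more effective.
Abstract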
As intents unfold and environments change, multi-turn agents face continuously shifting decision contexts. Although reusing past experience is intuitively appealing, existing approaches remain limited: full trajectories are often too context-specific to transfer, while tool-level reuse ignores the context and environment. In this paper, we introduce a hybrid episodic–procedural memory strategy (H-EPM) that enables experience-induced self-evolution of multi-turn tool-use policies, by adaptively reusing partially overlapping successful experiences in both inference and training. Inspired by human episodic–procedural integration, we build a tool graph from accumulated trajectories, where recurring tool-to-tool dependencies capture procedural routines and each edge is augmented with compact episodic summaries of relevant context. At inference, the agent dynamically balances episodic recall for contextual reasoning and procedural execution for routine steps. Beyond inference, H-EPM introduces a memory-guided reinforcement learning paradigm that directly addresses a core challenge in multi-turn agent RL: ineffective exploration over long trajectories. By biasing exploration toward historically successful tool transitions, H-EPM learns a stronger policy that generalizes during inference without relying on domain-specific experience collection. Experiments show that H-EPM consistently delivers substantial inference-time gains over strong baselines across multi-turn tool-use benchmarks, reaching up to 50%+. It also boosts RL policy performance, achieving up to 40%+ improvement on out-of-distribution tasks.
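🎯 Motivation: Conventional protein design pipelines separate structure generation from sequence design; sequence design usually ignores functional objectives and suffers from a mismatch between training and inference.
❓ Problem: Addresses the training-inference mismatch under the *Best-of-N* inference protocol while incorporating functional constraints, to improve the performance and precision of sequence design.
🔍 Analysis: Standard methods and inverse folding models are usually function-agnostic, cannot make effective use of the functional objectives of structure generation, and face sequence-structure degeneracy.
🛠️ Method: Proposes the FIDIA reinforcement learning framework, which uses function-constrained rewards and policy optimization aligned with *Best-of-N* inference to directly raise the fitness and expected reward of designed sequences.
📊 Data and experiments: Validated on a general motif scaffolding benchmark, with further case studies on vaccine design and affinity-enhancing enzyme design that show effectiveness in complex biopharmaceutical settings.
⭐ Contributions: Combines function-informed design with inference-aligned optimization, markedly improving the success rate and precision of protein sequence design and broadening where such methods apply.
Abstract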
Computational protein design typically employs a sequential workflow of structure generation followed by sequence (re)design. While structure generators can be explicitly conditioned on functional objectives, inverse folding models are constrained by their function-agnostic nature and sequence-structure degeneracy. More critically, the associated training objectives do not account for the *Best-of-N* (BoN) inference protocol, resulting in a fundamental training-inference misalignment. Here, we propose FIDIA, a reinforcement learning framework that enables **F**unction-**I**nformed sequence **D**esign via **I**nference-**A**ligned policy optimization. Specifically, FIDIA integrates functional constraints into composite rewards and explicitly optimizes the induced policy under BoN toward high-fitness sequence regions. We achieve this via a grounded gradient estimator that directly maximizes the expected maximum reward. FIDIA consistently outperforms both standard and RL-optimized baselines in success rate and precision on a general motif scaffolding benchmark. Further experiments on real-world cases including vaccine and affinity-enhancing enzyme design validate FIDIA’s efficacy in complex therapeutic and biocatalytic contexts.
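The quantity a Best-of-N objective targets, the expected maximum reward over N draws, can be estimated from a larger pool of sampled rewards with a standard order-statistics formula. The sketch below shows only that value estimate, not FIDIA's gradient estimator; `expected_best_of_n` is an illustrative name.

```python
from math import comb


def expected_best_of_n(rewards, n):
    """Average of the maximum over all size-n subsets of the sampled rewards."""
    m = len(rewards)
    if not 1 <= n <= m:
        raise ValueError("need 1 <= n <= number of sampled rewards")
    ordered = sorted(rewards)
    # The i-th smallest value (1-indexed) is the maximum of exactly
    # comb(i - 1, n - 1) of the comb(m, n) subsets of size n.
    weighted = sum(comb(i - 1, n - 1) * r for i, r in enumerate(ordered, start=1))
    return weighted / comb(m, n)


sampled = [0.0, 1.0, 2.0]
print(expected_best_of_n(sampled, 1))  # plain mean of the samples
print(expected_best_of_n(sampled, 2))  # mean of the pairwise maxima
print(expected_best_of_n(sampled, 3))  # the single largest sample
```

Setting n = 1 recovers the ordinary mean reward, which is what standard training objectives optimize, so the gap between the two values is one way to see the training-inference mismatch.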
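🎯 Motivation: Iterative generative policies such as diffusion models and flow matching are highly expressive for continuous control, but their action log-densities are not directly accessible, which complicates maximum-entropy reinforcement learning.
❓ Problem: Proposes FLAC, a likelihood-free framework that regulates policy stochasticity by penalizing the kinetic energy of the velocity field, avoiding the need for action log-densities.
🔍 Analysis: Casts policy optimization as a Generalized Schrödinger Bridge problem relative to a high-entropy reference process; staying close to that reference while optimizing return requires no explicit action density.
🛠️ Method: Uses kinetic energy as a proxy for divergence from the reference process, derives an energy-regularized policy iteration scheme, and designs a practical off-policy algorithm that tunes the kinetic energy automatically via a Lagrangian dual mechanism.
📊 Data and experiments: On high-dimensional benchmarks, FLAC achieves superior or comparable performance relative to strong baselines while avoiding explicit density estimation.
⭐ Contributions: Proposes a new maximum-entropy RL framework that brings kinetic-energy regularization into policy optimization and yields a practical algorithm without explicit likelihood estimation.
Abstract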
Iterative generative policies, such as diffusion models and flow matching, offer superior expressivity for continuous control but complicate Maximum Entropy Reinforcement Learning because their action log-densities are not directly accessible. To address this, we propose **Field Least-Energy Actor-Critic (FLAC)**, a likelihood-free framework that regulates policy stochasticity by penalizing the kinetic energy of the velocity field. Our key insight is to formulate policy optimization as a Generalized Schrödinger Bridge (GSB) problem relative to a high-entropy reference process (e.g., uniform). Under this view, the maximum-entropy principle emerges naturally as staying close to a high-entropy reference while optimizing return, without requiring explicit action densities. In this framework, kinetic energy serves as a physically grounded proxy for divergence from the reference: minimizing path-space energy bounds the deviation of the induced terminal action distribution. Building on this view, we derive an energy-regularized policy iteration scheme and a practical off-policy algorithm that automatically tunes the kinetic energy via a Lagrangian dual mechanism. Empirically, FLAC achieves superior or comparable performance on high-dimensional benchmarks relative to strong baselines, while avoiding explicit density estimation.
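A minimal sketch of what "penalizing the kinetic energy of the velocity field" can look like for a flow-based policy, under several assumptions not stated in the abstract: a scalar action, Euler integration over unit time, the conventional 1/2 factor in the energy, and a fixed penalty coefficient `alpha` in place of the Lagrangian dual tuning. All names are illustrative.

```python
def rollout_with_kinetic_energy(velocity, x0, num_steps=10):
    """Euler-integrate dx/dt = velocity(x, t) on [0, 1] and accumulate energy."""
    dt = 1.0 / num_steps
    x, energy = x0, 0.0
    for k in range(num_steps):
        v = velocity(x, k * dt)
        energy += 0.5 * v * v * dt  # Riemann sum of 0.5 * |v|^2 over time
        x += v * dt
    return x, energy


def regularized_objective(q_value, energy, alpha):
    """Critic value minus an energy penalty; no action log-density needed."""
    return q_value - alpha * energy


# Hypothetical velocity field that transports the initial noise by a constant 2.0.
action, energy = rollout_with_kinetic_energy(lambda x, t: 2.0, x0=0.0)
print(action, energy)
print(regularized_objective(q_value=1.0, energy=energy, alpha=0.1))
```

The penalty depends only on quantities computed during sampling, which is what makes this style of regularizer usable when the action density itself cannot be evaluated.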
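🎯 Motivation: In-context localization finds a target object in a query image from a few support examples. Existing methods perform poorly in category-agnostic settings and depend on explicit category supervision, which limits instance-level use.
❓ Problem: Proposes an in-context visual localization framework without category supervision that addresses semantic bias and weak visual correspondence and improves instance-level localization accuracy.
🔍 Analysis: Existing methods lean heavily on semantic priors, struggle with unnamed or instance-specific objects, and rely on model scale rather than on the training objective, so they generalize poorly.
🛠️ Method: A two-stage training framework that optimizes in-context attention between support bounding boxes and query images, then refines localization with reinforcement learning using Group Relative Policy Optimization (GRPO) to reduce localization error.
📊 Data and experiments: A 7B-parameter model trained with these objectives outperforms models of up to 72B parameters on in-context localization; extensive ablations assess the contribution of each component.
⭐ Contributions: Provides an instance-level localization framework without category supervision, shows that better objectives can beat scaling alone, and strengthens visual grounding for applications such as personalized image search and editing.
Abstract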
In-context localization (ICL) seeks to localize a target object specified by a small set of support examples in a query image, operating on the fly without training or parameter updates. Despite rapid advances in vision–language models (VLMs), achieving category-agnostic and visually grounded ICL remains an open problem, even though it is essential for applications such as image editing, personalized visual search, and retrieval. Existing methods are fragile and rely on explicit category supervision, which not only limits applicability in realistic settings with unnamed or instance-specific objects but also introduces category bias that steers predictions toward semantic priors rather than visual evidence. We introduce a two-stage training framework that explicitly optimizes in-context attention between support bounding boxes and query images without category supervision. We further refine localization via reinforcement learning using Group Relative Policy Optimization (GRPO) to directly minimize localization error. This formulation enforces visual correspondence over semantic priors, yielding robust instance-level localization. Empirically, a 7B-parameter model trained with our objectives outperforms models up to 72B parameters, demonstrating that context-aware localization objectives can surpass scaling alone. Comprehensive ablations validate the contribution of each component.
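🎯 Motivation: Scheduling problems arise wherever one item is selected from candidates based on their states, and they call for a framework of interpretable, transferable scheduling rules.
❓ Problem: Proposes a framework based on a factorized scheduling principle that decomposes complex scheduling problems under identifiability constraints while improving interpretability and transferability.
🔍 Analysis: Scheduling essentially reduces to priority ranking; conventional methods fall short in generalizing across system scales and in dynamic optimization.
🛠️ Method: Designs the factorized scheduling principle (FSP) framework, which decomposes the principle into univariate and pairwise functions and learns it with a policy-based objective combined with a temporal-difference signal defined on the condition distribution.
📊 Data and experiments: Experiments on synthetic and realistic scheduling tasks confirm the framework's advantages in performance, interpretability, and zero-shot generalization across systems.
⭐ Contributions: Proposes a new scheduling framework that combines structured decomposition with dynamic optimization, markedly improving the interpretability and cross-task transferability of scheduling rules.
Abstract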
Scheduling problems arise from repeatedly selecting one item from a set of candidates based on their states. These problems often reduce to assigning priority scores and choosing the highest-ranked item. In this work, we propose a factorized scheduling principle (FSP) framework to learn interpretable and transferable scheduling rules. The FSP framework represents system states as condition distributions and decomposes a global scheduling principle into additive univariate and pairwise components with identifiability constraints. The scheduling principle enables the framework to maintain a simple priority-based structure during deployment. This principle is learned by using a policy-based objective combined with a temporal-difference signal defined on the condition distribution. Experiments on synthetic and realistic scheduling tasks demonstrate the FSP framework's strong performance, interpretability, and zero-shot generalization across different system scales.
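The additive structure described above (univariate plus pairwise components, then pick the top-ranked item) can be sketched as follows. This simplifies the abstract considerably: candidates are plain feature dictionaries rather than condition distributions, the component functions are hand-written rather than learned, and identifiability constraints are omitted. All names and numbers are hypothetical.

```python
def priority(features, univariate, pairwise):
    """Additive score: one term per feature plus one term per feature pair."""
    score = sum(f(features[name]) for name, f in univariate.items())
    score += sum(g(features[a], features[b]) for (a, b), g in pairwise.items())
    return score


def schedule(candidates, univariate, pairwise):
    """Index of the candidate with the highest priority score."""
    return max(
        range(len(candidates)),
        key=lambda i: priority(candidates[i], univariate, pairwise),
    )


# Hypothetical components: favor long waits, penalize large jobs, and add
# an interaction so that large jobs that have waited long are not starved.
univariate = {"wait": lambda w: 0.5 * w, "size": lambda s: -1.0 * s}
pairwise = {("wait", "size"): lambda w, s: 0.1 * w * s}
candidates = [{"wait": 2.0, "size": 1.0}, {"wait": 6.0, "size": 2.0}]
print(schedule(candidates, univariate, pairwise))
```

Because every term depends on at most two features, each component can be plotted and inspected on its own, which is where the interpretability of such a rule comes from.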
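🎯 Motivation: Visuomotor policies aim to learn complex manipulation tasks from demonstrations, but generating smooth, coherent trajectories remains challenging, particularly in balancing precision with foresight.
❓ Problem: Existing methods optimize action distributions while typically neglecting inter-chunk coherence, which produces inconsistent actions over long horizons and hinders coherent trajectory generation.
🔍 Analysis: Inter-chunk discontinuities significantly impair the learning of long-horizon actions, indicating the need for a policy that attends to both proximal precision and long-range planning.
🛠️ Method: Proposes FocalPolicy, which combines Frequency-Optimized Chunking with Locally Anchored flow matching, supervising proximal actions with a time-domain alignment objective and improving cross-chunk coherence through frequency-domain structural regularization.
📊 Data & Experiments: Extensive experiments show that FocalPolicy consistently outperforms existing methods and that the proposed modules generalize effectively to other baseline models.
⭐ Contributions: Introduces a visuomotor policy that combines time-domain and frequency-domain supervision, markedly improving cross-chunk action coherence along with learning efficiency and generality.
View full abstract (Abstract)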
Visuomotor policies aim to learn complex manipulation tasks from expert demonstrations. However, generating smooth and coherent trajectories remains challenging, as it requires balancing proximal precision with distal foresight. Existing approaches typically focus on optimizing intra-chunk action distributions, often neglecting the inter-chunk coherence. Consequently, inter-chunk discontinuities significantly impede the learning of coherent long-horizon actions. To overcome this limitation and achieve a synergetic balance between precision and foresight, we propose FocalPolicy, a foresight-aware visuomotor policy that combines Frequency-Optimized Chunking with Locally Anchored flow matching. We introduce a foresight composite objective that supervises time-domain alignment within the proximal actions while regularizing frequency-domain structure over multiple future action chunks to improve cross-chunk coherence. To efficiently learn complex action distributions, we design locally anchored sampling to enhance target signal propagation efficiency during consistency flow matching training. Extensive experiments demonstrate that our method consistently outperforms existing approaches while further validating the effective generalizability of our proposed modules to other baseline models. The project will be released as open source.
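🎯 Motivation: For 3D object segmentation in complex scene point clouds, existing methods rely on scene-level human annotations or can identify only simple objects, and they lack effective object priors.
❓ Problem: Proposes a framework that needs no human annotations and improves multi-class object segmentation in complex scenes when labels are unavailable.
🔍 Analysis: Conventional methods lack semantic and geometric priors during learning, which limits 3D object segmentation, especially in unsupervised and long-tail settings.
🛠️ Method: Designs a superpoint-based object discovery agent guided by semantic and geometric reward modules to merge superpoints incrementally; the rewards are derived from self-supervised 2D/3D foundation models.
📊 Data & Experiments: Extensive experiments on multiple benchmarks show that the method outperforms existing baselines, including in zero-shot and long-tail scenarios.
⭐ Contributions: Develops a scalable, label-free 3D object segmentation framework with a reward design based on feedback from self-supervised foundation models, strengthening generalization and applicability.
View full abstract (Abstract)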
We address the challenging task of 3D object segmentation in complex scene point clouds without relying on any scene-level human annotations during training. Existing methods are typically constrained to identifying simple objects, primarily due to insufficient object priors in the learning process. In this paper, we present FoundObj, a novel framework featuring a superpoint-based object discovery agent that incrementally merges suitable neighboring superpoints, guided by our innovative semantic and geometric reward modules. These modules synergistically leverage semantic and geometric priors from self-supervised 2D/3D foundation models, providing complementary feedback to the object discovery agent and enabling robust identification of multi-class objects through reinforcement learning. Extensive experiments on diverse benchmarks demonstrate that our approach consistently outperforms existing baselines. Notably, our method exhibits strong generalization in zero-shot and long-tail scenarios, underscoring its potential for scalable, label-free 3D object segmentation.
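🎯 Motivation: Existing diffusion policies struggle to provide semantically meaningful behavior control and efficient behavior generalization in trajectory planning.
❓ Problem: Proposes a parameterized diffusion framework that better balances stochasticity with precise behavior steering, enabling efficient adaptation under complex constraints and the discovery of new behaviors.
🔍 Analysis: By relating distances in a latent manifold to the semantic similarity of trajectories, the work examines the limits of conventional diffusion methods for behavior steering.
🛠️ Method: Designs Parameterized Diffusion Policy (PDP), which structures a latent manifold in a smooth continuous space to support smooth interpolation between strategies and efficient generalization to unseen constraints.
📊 Data & Experiments: Evaluated on complex multimodal benchmarks in simulation and on real-robot hardware, with the largest gains in environments that require discovering novel behaviors.
⭐ Contributions: Introduces the PDP framework, which markedly improves the behavior steering and generalization of diffusion policies and adapts to new constraints without updating policy weights.
View full abstract (Abstract)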
We propose Parameterized Diffusion Policy (PDP), a framework that learns a diffusion policy parameterized in a smooth continuous space. By structuring a latent manifold such that distances between latents' values reflect the semantic similarity of physical trajectories, we transform diffusion from a mechanism of stochastic diversity into a precise tool for behavior steering. Our approach also enables smooth interpolation between known strategies and efficient generalization to novel constraints without the need to update policy weights. We demonstrate that PDP significantly improves adaptation performance on complex multimodal benchmarks in both simulation and real-robot hardware compared to regular diffusion policy, particularly in scenarios requiring the discovery of novel behaviors.
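🎯 Motivation: In long-horizon interaction it is hard to distinguish the contribution of individual actions, and the resulting high optimization variance limits reinforcement learning on complex tasks.
❓ Problem: Proposes a method that reduces optimization variance in long-horizon interaction and so stabilizes the performance of RL-trained models.
🔍 Analysis: Theory and experiments show that aggregating semantically similar states and actions in an intent space lowers estimator variance and improves policy performance.
🛠️ Method: Designs Hindsight Policy Optimization (HPO), which extracts low-variance learning signals in the intent space from the Wasserstein distance between the current policy distribution and the hindsight distribution.
📊 Data & Experiments: Experiments on public datasets confirm robustness and performance gains on long-horizon interaction tasks; the code is publicly available.
⭐ Contributions: Introduces an RL optimization method that addresses optimization variance in long-horizon tasks and improves the stability of policy performance.
View full abstract (Abstract)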
Reinforcement learning (RL) has become a widely adopted technique for improving large language models (LLMs) on complex tasks. Despite this progress, existing RL methods still face challenges in training agents with longer-horizon interactions. One major bottleneck is distinguishing the contribution of different actions in long-horizon interaction, leading to high optimization variance. To address this, we introduce a novel policy gradient method, Hindsight Policy Optimization (HPO), that projects both the current policy distribution and the hindsight distribution into an intent space and extracts low-variance learning signals from the Wasserstein distance between them. We theoretically and empirically show that aggregating semantically similar states and actions in the intent space yields a bounded-variance estimator and improves policy performance stably. Our code is available online.
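🎯 Motivation: LLM agents are increasingly deployed in long-running settings where learning at test time matters, yet most existing methods rely on hand-designed memory update rules that are hard to align with multi-step objectives.
❓ Problem: Trains the memory update process itself to improve a frozen LLM's performance across sequential interactions, resolving the mismatch between memory updates and multi-step objectives.
🔍 Analysis: Existing methods cannot assign reward signals stably in multi-turn tasks, which limits training and prevents fine-grained credit assignment.
🛠️ Method: Proposes MemoPilot, a plug-in memory copilot that formulates memory updating as a multi-turn decision problem and optimizes it end-to-end with multi-turn GRPO; a turn-wise reward signal and a context-independent, turn-level advantage estimate improve training stability and precision.
📊 Data & Experiments: Evaluated on multi-round Rock-Paper-Scissors (RPS) and Limit Texas Hold'em (LHE), MemoPilot ranks first in Elo ratings on both games (1762 on LHE, 1590 on RPS), outperforming strong baselines and proprietary models.
⭐ Contributions: Designs the MemoPilot memory copilot, which substantially improves the test-time learning of LLM agents, and proposes a general turn-wise reward signal and advantage estimation method that improve training stability and multi-task adaptability.
View full abstract (Abstract)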
Large language model (LLM) agents are increasingly deployed in long-running settings where improving through experience at test time becomes important. A common approach is to update an explicit memory after each interaction to guide future decisions. However, most existing methods rely on hand-designed prompting rules, making it difficult to align memory updates with downstream objectives over multi-step horizons consistently. We propose MemoPilot, a plug-in memory copilot that explicitly trains the memory update process to improve a frozen LLM's performance across sequential interactions. We formulate memory updating as a multi-turn decision problem and optimize it end-to-end with multi-turn GRPO. Our training recipe introduces (i) a turn-wise reward signal and (ii) a context-independent, turn-level advantage estimation across rollouts, enabling finer-grained credit assignment and more stable training in multi-turn settings. We evaluate MemoPilot on two testbeds: multi-round Rock-Paper-Scissors (RPS) and Limit Texas Hold'em (LHE). Across both environments, MemoPilot substantially improves test-time learning of a frozen player over strong baselines, ranking first in Elo ratings on both games (1762 on LHE and 1590 on RPS) and outperforming all baseline memory methods and proprietary models, including Deepseek-V3.2.
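As a rough illustration of turn-level advantage estimation across rollouts, the sketch below normalizes the turn-wise reward of each rollout against the other rollouts at the same turn index. The equal-length rollouts, the mean and standard-deviation normalization, and the function name `turn_level_advantages` are assumptions; the abstract does not give the exact estimator.

```python
from statistics import mean, pstdev


def turn_level_advantages(turn_rewards, eps=1e-8):
    """turn_rewards[i][t] is the reward of rollout i at turn t (equal lengths assumed)."""
    num_turns = len(turn_rewards[0])
    advantages = [[0.0] * num_turns for _ in turn_rewards]
    for t in range(num_turns):
        # Compare rollouts only against each other at the same turn index.
        column = [rollout[t] for rollout in turn_rewards]
        mu, sigma = mean(column), pstdev(column)
        for i, r in enumerate(column):
            advantages[i][t] = (r - mu) / (sigma + eps)
    return advantages


# Four hypothetical rollouts of three turns each (1.0 = turn won, 0.0 = turn lost).
rewards = [
    [1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
    [0.0, 1.0, 1.0],
]
adv = turn_level_advantages(rewards)
```

A turn on which every rollout earns the same reward (the last column here) contributes no learning signal, while turns with mixed outcomes receive per-turn credit instead of one trajectory-level value.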
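🎯 Motivation: Pretrained robot policies often struggle to master highly complex skills efficiently; the work uses distribution-contractive reinforcement learning to raise their performance.
❓ Problem: How to refine a pretrained generative policy with broad behavioral coverage into a specialized policy with a high success rate.
🔍 Analysis: Amplifying high-success behaviors from online feedback enables stable, efficient learning of complex long-horizon manipulation tasks.
🛠️ Method: Proposes the DICE-RL framework, which combines diffusion-based pretraining with a sample-efficient residual off-policy RL scheme that pairs selective behavior regularization with value-guided action selection.
📊 Data & Experiments: Extensive experiments in simulation and on a real robot show that DICE-RL improves performance reliably, with strong stability and sample efficiency.
⭐ Contributions: Introduces the DICE-RL framework, which effectively improves pretrained generative policies and offers a new approach to applying reinforcement learning to complex tasks.
View full abstract (Abstract)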
We introduce Distribution Contractive Reinforcement Learning (DICE-RL), a framework that uses reinforcement learning (RL) as a “distribution contractor” to refine pretrained generative robot policies. DICE-RL turns a pretrained behavior prior into a high-performing “pro” policy by amplifying high-success behaviors from online feedback. We pretrain a diffusion-based policy for broad behavioral coverage, then finetune it with a stable, sample-efficient residual off-policy RL framework that combines selective behavior regularization with value-guided action selection. Extensive experiments and analyses show that DICE-RL reliably improves performance with strong stability and sample efficiency, enabling mastery of complex long-horizon manipulation skills both in simulation and on a real robot. Project website: [dice-rl-anonymous.github.io](https://dice-rl-anonymous.github.io/).
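🎯 Motivation: Reinforcement learning has become a key means of improving the reasoning of large language models, while supervised fine-tuning performs poorly on out-of-distribution tasks, which calls for reasoning mechanisms that generalize better.
❓ Problem: Develops a reasoning mechanism with compositional generalization that handles unfamiliar task configurations by extracting reusable modules from complex reasoning traces.
🔍 Analysis: The exploratory nature of RL uncovers latent structure in reasoning traces and thereby enables compositional generalization; learning from compound traces can generalize better than training on isolated atomic modules.
🛠️ Method: Proposes a Hierarchical Latent Selection Model that decomposes reasoning traces into basic operation modules (skills) and routing mechanisms, supported by theoretical arguments and RL-based exploration.
📊 Data & Experiments: Controlled experiments validate the theory, showing that RL can extract modules from compound traces and recombine them to solve new tasks; the interplay of SFT and RL is also examined.
⭐ Contributions: Proposes a hierarchical latent-variable model that explains compositional generalization; shows theoretically that RL can identify latent structure; and designs an effective SFT-plus-RL protocol that widens module coverage and strengthens exploration.
View full abstract (Abstract)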
Reinforcement learning (RL) has emerged as a key mechanism for transforming LLMs into robust reasoners. While supervised fine-tuning (SFT) often limits models to the distribution of observed reasoning traces, RL post-training significantly improves performance on out-of-distribution (OOD) tasks that require unfamiliar recombinations of familiar steps. We argue that this improvement is driven by **compositional generalization**, which we formalize through a **Hierarchical Latent Selection Model**. In this framework, reasoning traces are generated by a cascade of discrete latent selection variables corresponding to reusable atomic modules, including both skills (local operations) and routing mechanisms (how intermediate information is selected, reused, and composed). We theoretically show that RL’s exploratory nature provides sufficient coverage to identify latent structure and enable compositional generalization. We design controlled experiments to validate this theory. Our results demonstrate that RL can extract atomic modules from compound traces and recombine them to solve new configurations. Moreover, we find that training on compound traces can yield stronger generalization than training on isolated atomic modules. Finally, we investigate relations between SFT and RL and identify an effective protocol in which SFT ensures coverage of all atomic modules, while RL focuses on novel compositions beyond the SFT support to encourage exploration.
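🎯 Motivation: Reinforcement learning with verifiable rewards is central to post-training large reasoning models, but widely used algorithms such as GRPO suffer from diversity collapse, so the balance between accuracy and diversity needs better handling.
❓ Problem: Standard GRPO tends to concentrate on a single correct mode (a winner-take-all regime); the work redesigns the algorithm from a geometric viewpoint to mitigate this while avoiding the accuracy-diversity trade-off.
🔍 Analysis: On the probability simplex, GRPO induces a collision field whose flow moves toward the vertices, sharply reducing the number of correct modes and leading to a late-stage entropy crash.
🛠️ Method: Proposes G$^2$RPO, which edits the vector field by adding, at the advantage level, granularity bonuses based on mode probabilities to encourage correct but underrepresented modes and so increase diversity.
📊 Data & Experiments: With 7B and 14B models trained on a math reasoning task and evaluated on AIME 2024/2025, G$^2$RPO raises correct-mode coverage by 172%--205% and improves pass@1 by +1.4 to +7.9 points relative to GRPO.
⭐ Contributions: Introduces a geometrically motivated RL algorithm that achieves both diversity and performance, offering a way to balance accuracy and entropy without a trade-off.
View full abstract (Abstract)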
Reinforcement learning with verifiable rewards (RLVR) is a cornerstone of post-training for large reasoning models, yet widely used algorithms such as Group Relative Policy Optimization (GRPO) often exhibit \textbf{diversity collapse}. We provide a geometric diagnosis by formalizing GRPO as a dynamical flow on the probability simplex. Under a mode-based coarse-graining of rollouts, we show that GRPO induces a \textbf{collision field} over correct modes, monotonically pushing towards simplex vertices and thus yielding a \textbf{winner-take-all} regime. To address this systematically, we introduce \textbf{G$^2$RPO (Geometric GRPO)}, which reshapes RLVR via principled \textbf{vector-field editing}. Concretely, we intervene at the advantage level by adding granularity bonuses inversely proportional to mode probabilities, encouraging underrepresented correct modes. The bonus has a natural geometric interpretation, and its potential performance side effects can be mitigated, thereby avoiding the usual accuracy--diversity trade-off. In experiments with 7B and 14B models trained on a math reasoning task and evaluated on \textbf{AIME 2024/2025}, GRPO loses up to \textbf{57\%} of active correct modes. In contrast, G$^{2}$RPO increases active correct-mode coverage by \textbf{172\%--205\%}, reduces concentration on any single correct mode, prevents the late-stage \emph{entropy crash}, and improves \texttt{pass@1} by \textbf{+1.4} to \textbf{+7.9} points relative to GRPO. Overall, diversity is not merely a regularizer but a \textbf{geometric property} to be controlled to improve the model without trapping it in a single dominant strategy.
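The idea of adding, at the advantage level, a bonus inversely proportional to mode probability can be sketched as follows. The mode labels, the coefficient `beta`, and the use of empirical within-group frequencies as mode probabilities are illustrative assumptions; the paper's coarse-graining of rollouts into modes and its mitigation of side effects are not reproduced here.

```python
from collections import Counter
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within the group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def advantages_with_mode_bonus(rewards, modes, beta=0.1):
    """Add beta / p(mode) to the advantage of every correct rollout (hypothetical form)."""
    base = grpo_advantages(rewards)
    correct_modes = [m for r, m in zip(rewards, modes) if r > 0]
    counts = Counter(correct_modes)
    total = len(correct_modes)
    shaped = []
    for a, r, m in zip(base, rewards, modes):
        if r > 0:
            p = counts[m] / total  # empirical probability of this correct mode
            a += beta / p
        shaped.append(a)
    return shaped


# Six hypothetical rollouts: three correct via one strategy, one via a rarer strategy.
rewards = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
modes = ["algebra", "algebra", "algebra", "geometry", "guess", "guess"]
adv = advantages_with_mode_bonus(rewards, modes)
```

Under plain group-relative normalization all four correct rollouts receive the same advantage; with the bonus, the rare correct mode is reinforced more strongly than the dominant one, which counteracts the drift toward a single strategy.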
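🎯 Motivation: Language models must satisfy many human preferences at once, so RL pipelines have begun to incorporate multiple rewards; existing multi-reward optimization, however, does not account for the distinctness of the individual rewards, which hurts training.
❓ Problem: When GRPO is applied directly, combining several rewards erases resolution in the advantage values, which impairs convergence and can cause training to fail; the work aims to fix this and make multi-reward optimization more effective.
🔍 Analysis: Normalizing the combined rewards collapses distinct reward combinations into identical values, discarding fine-grained training signal and reducing accuracy and training stability.
🛠️ Method: Proposes GDPO, which normalizes each reward separately to preserve relative differences and avoid the collapse, improving the accuracy and stability of multi-reward optimization.
📊 Data & Experiments: Compares the methods on tool calling, math reasoning, and coding reasoning, measuring correctness (accuracy, bug ratio) and constraint adherence (format, length); GDPO outperforms GRPO on all tasks.
⭐ Contributions: Proposes GDPO, which improves multi-reward RL optimization, demonstrates its generality and stability across tasks, and strengthens the alignment of language models with human preferences.
View full abstract (Abstract)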
As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) in the multi-reward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
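The collapse described here is easy to reproduce on a toy group. The sketch below contrasts normalizing the summed reward with normalizing each reward separately before summing; the two-reward setup, the unweighted sum, and the function names are assumptions, and any further normalization or weighting used in the paper is not modeled.

```python
from statistics import mean, pstdev


def normalize(values, eps=1e-8):
    """Standardize a list of values within the group."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def coupled_advantages(reward_rows):
    """GRPO-style: sum the rewards of each rollout, then normalize within the group."""
    return normalize([sum(row) for row in reward_rows])


def decoupled_advantages(reward_rows):
    """Decoupled: normalize each reward column within the group, then sum per rollout."""
    columns = [normalize(list(col)) for col in zip(*reward_rows)]
    return [sum(vals) for vals in zip(*columns)]


# Hypothetical group of four rollouts with (correctness, format) rewards.
group = [
    (1.0, 0.0),  # correct answer, wrong format
    (0.0, 1.0),  # wrong answer, right format
    (0.0, 0.0),  # wrong answer, wrong format
    (0.0, 1.0),  # wrong answer, right format
]
coupled = coupled_advantages(group)
decoupled = decoupled_advantages(group)
```

In this group the first two rollouts have the same total reward, so the coupled scheme gives them identical advantages even though one solved the task and the other did not. The decoupled scheme keeps them apart because correctness is rarer in the group than good formatting.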
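🎯 Motivation: Reinforcement learning shows promise for post-training multimodal agents, but its data efficiency is poor, especially when interaction data are scarce and quickly become outdated.
❓ Problem: Proposes a policy optimization objective that improves the stability and efficiency of RL with limited interaction data.
🔍 Analysis: Hard clipping in existing truncated importance sampling methods produces zero gradients, which severely harms the efficiency and stability of policy updates.
🛠️ Method: Proposes policy optimization based on Gaussian importance sampling, replacing hard clipping with a Gaussian trust weight defined on the log-ratio to damp extreme weights smoothly, and introducing a tunable constraint on the update magnitude.
📊 Data & Experiments: Experiments span replay buffers of different sizes, from near on-policy to highly stale data, and show better bias-variance trade-off, training stability, and sample efficiency than the baselines.
⭐ Contributions: Designs a policy optimization framework based on Gaussian trust weights with theoretical guarantees of stability and robustness; achieves state-of-the-art performance with improved sample efficiency and training stability; code is provided to support follow-up research.
View full abstract (Abstract)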
Post-training with reinforcement learning (RL) has recently shown strong promise for advancing multimodal agents beyond supervised imitation. However, RL remains limited by poor data efficiency, particularly in settings where interaction data are scarce and quickly become outdated. To address this challenge, GIPO (Gaussian Importance sampling Policy Optimization) is proposed as a policy optimization objective based on truncated importance sampling, replacing hard clipping with a log-ratio–based Gaussian trust weight to softly damp extreme importance ratios while maintaining non-zero gradients. Theoretical analysis shows that GIPO introduces an implicit, tunable constraint on the update magnitude, while concentration bounds guarantee robustness and stability under finite-sample estimation. Experimental results show that GIPO achieves state-of-the-art performance among clipping-based baselines across a wide range of replay buffer sizes, from near on-policy to highly stale data, while exhibiting superior bias–variance trade-off, high training stability and improved sample efficiency. Code is provided in supplementary material.
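To make the contrast between hard clipping and a smooth trust weight concrete, the sketch below compares the gradient gate of a PPO-style clipped surrogate with a Gaussian weight on the log importance ratio. The functional form `exp(-0.5 * (log_ratio / sigma) ** 2)` and the parameter `sigma` are assumptions inferred from the phrase "log-ratio-based Gaussian trust weight"; the paper's exact objective may differ.

```python
import math


def gaussian_trust_weight(log_ratio, sigma=0.5):
    """Smooth weight in (0, 1]: equals 1 on-policy and decays for extreme ratios (assumed form)."""
    return math.exp(-0.5 * (log_ratio / sigma) ** 2)


def hard_clip_gate(log_ratio, advantage, clip=0.2):
    """Return 1.0 if a PPO-style clipped surrogate still passes gradient, else 0.0."""
    ratio = math.exp(log_ratio)
    if advantage > 0 and ratio > 1 + clip:
        return 0.0
    if advantage < 0 and ratio < 1 - clip:
        return 0.0
    return 1.0


# Compare the two gates for increasingly off-policy samples with a positive advantage.
ratios = [1.0, 1.5, 3.0]
soft = [gaussian_trust_weight(math.log(r)) for r in ratios]
hard = [hard_clip_gate(math.log(r), advantage=1.0) for r in ratios]
```

The hard gate switches to zero as soon as the ratio leaves the clip range, so stale samples stop contributing entirely; the Gaussian weight shrinks their contribution gradually and never reaches exactly zero, with `sigma` controlling how quickly trust decays.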
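🎯 Motivation: Process reward models (PRMs) allow finer-grained credit assignment in reinforcement learning, but their relationship to the more common outcome reward models (ORMs) is not well understood.
❓ Problem: Proves that GRPO equipped with an ORM is, under mild assumptions, equivalent to an objective with a Monte-Carlo-based PRM, then uses this view to analyze a flaw in the algorithm and propose a fix.
🔍 Analysis: The GRPO objective interacts with imbalanced process steps and rewards in a way that hinders exploration and exploitation under different conditions.
🛠️ Method: Uses theoretical analysis to expose the implicit PRM structure in GRPO and modifies the algorithm into $\lambda$-GRPO to mitigate the flaw and improve performance.
📊 Data & Experiments: LLMs tuned with $\lambda$-GRPO outperform those tuned with standard GRPO on downstream reasoning tasks and reach peak performance faster, with negligible impact on training time and cost.
⭐ Contributions: Reveals the PRM structure hidden in GRPO, identifies and fixes a flaw in the original algorithm, and shows how to improve model performance without an explicit PRM.
View full abstract (Abstract)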
Process reward models (PRMs) allow for fine-grained credit assignment in reinforcement learning (RL), and seemingly contrast with outcome reward models (ORMs), which assign a single reward to an entire trajectory. However, we provide theoretical proof in this work that the Group Relative Policy Optimization (GRPO) RL algorithm equipped with an ORM is in fact equivalent to a PRM-aware RL objective equipped with a non-trivial, Monte-Carlo-based PRM (given mild assumptions). Leveraging the framework of GRPO-as-a-PRM, we identify a flaw in the GRPO objective that interacts with imbalanced process steps and rewards to hinder both exploration and exploitation (under different conditions). We propose a simple modification to the algorithm to mitigate this defect ($\lambda$-GRPO), and show that LLMs tuned with $\lambda$-GRPO outperform LLMs tuned with standard GRPO on downstream reasoning tasks, and reach peak performance more rapidly. These results show that we can leverage the hidden, built-in PRM structure within the vanilla GRPO algorithm to boost model performance without employing an explicit PRM, and with a negligible impact on training time and cost.
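🎯 Motivation: Existing contrastive multi-view clustering methods depend on a pre-defined number of clusters, which limits their flexibility in real-world settings without prior knowledge; clustering with unknown $K$ is therefore an important problem.
❓ Problem: Proposes the GROK framework, which introduces a cluster decision agent that determines the optimal number of clusters $K$ automatically for multi-view clustering with unknown $K$.
🔍 Analysis: Conventional methods cannot adapt or optimize effectively when the cluster number is unknown and fail to explore the true cluster structure of the data.
🛠️ Method: GROK uses a deep reinforcement learning mechanism based on group relative policy optimization (GRPO), in which three cooperating phases (state perception, group decision, and geometric feedback) autonomously explore the best cluster structure.
📊 Data & Experiments: Tests on multiple datasets show that GROK achieves superior clustering performance in unknown-$K$ scenarios compared with existing methods.
⭐ Contributions: Brings GRPO into the unsupervised domain, designs a closed loop of decision, reward, and feedback, determines the cluster number autonomously, and markedly improves multi-view clustering performance.
View full abstract (Abstract)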
Existing contrastive multi-view clustering methods rely on a pre-defined cluster number, limiting their flexibility in real-world scenarios lacking prior knowledge. To address this, we propose GROK, a novel framework driven by a cluster decision agent for unknown-$K$ multi-view clustering. It pioneers the adaptation of group relative policy optimization (GRPO), a reinforcement learning strategy for LLM reasoning, into the unsupervised domain to autonomously determine the optimal $K$. Specifically, the agent orchestrates the clustering process through three synergistic phases. First, in the state perception phase, we employ a structure-aware adaptive backbone to aggregate multi-view data, providing the agent with consistent and discriminative consensus observations. Second, in the group decision phase, we introduce an action space divide-and-conquer strategy and an adaptive reward function. Equipped with these mechanisms, the agent performs group sampling and relative advantage estimation within the discrete action space of candidate $K$ values, autonomously searching for the optimal $K$ via reward maximization. Finally, in the geometric feedback phase, a geometric clustering guidance mechanism transforms the agent's structural hypotheses into explicit differentiable constraints to reshape feature manifolds, thereby closing the perception-decision-feedback loop. Experimental results demonstrate that GROK achieves superior clustering performance in unknown-$K$ scenarios by autonomously exploring the underlying cluster structure.
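🎯 Motivation: Current RL methods for code generation are optimized mainly on Python and generalize weakly to other programming languages. Multilingual solutions offer richer semantics and a wider search space, but independent training per language suffers from optimization imbalance and a lack of knowledge transfer.
❓ Problem: Proposes a cross-lingual joint optimization strategy to address the difficulty of optimizing low-resource programming languages and the lack of cross-language transfer.
🔍 Analysis: Independent training cannot exploit transfer from high-resource languages and leaves optimization unbalanced across languages, which hurts generation quality for low-resource languages.
🛠️ Method: Proposes GXPO, which forms training groups from solutions to the same problem in different programming languages and jointly optimizes language-specific and cross-language signals for more balanced optimization and better transfer.
📊 Data & Experiments: Extends LiveCodeBench into the multilingual benchmark ML-LCB, covering 8 programming languages. GXPO performs well in the multilingual setting, with especially large gains on low-resource languages.
⭐ Contributions: Proposes GXPO, a scalable multilingual RL framework; introduces ML-LCB, which unifies the evaluation of multilingual code generation; and demonstrates clear advantages on low-resource programming languages.
View full abstract (Abstract)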
Current reinforcement learning (RL) methods for code generation are predominantly optimized on Python, showing weak generalization to other programming languages (PLs). Although leveraging multilingual solutions offers richer semantics and a wider search landscape, naive independent training across languages suffers from optimization imbalance and fails to effectively transfer knowledge from high-resource languages. We propose Group Cross-lingual Relative Policy Optimization (GXPO), which forms training groups by generating solutions for the same problem in multiple PLs and jointly optimizes language-specific and cross-language signals, enabling more balanced optimization and improved transfer to low-resource PLs. We additionally introduce Multilingual LiveCodeBench (ML-LCB), extending LiveCodeBench to a unified multilingual evaluation setting. On ML-LCB across 8 PLs, GXPO consistently improves performance, with pronounced gains on low-resource PLs, demonstrating scalable multilingual RL for language-consistent code generation.
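🎯 Motivation: Studies the initialization of Low-Rank Adaptation (LoRA) under Reinforcement Learning with Verifiable Rewards (RLVR), addressing the poorly understood gap between its behavior under supervised fine-tuning (SFT) and under RLVR.
❓ Problem: How to initialize the low-rank matrices in RLVR to improve stability and performance.
🔍 Analysis: Standard LoRA can outperform PiSSA and MiLoRA under RLVR; these structurally initialized variants not only underperform but may also make training unstable.
🛠️ Method: Develops a theoretical analysis of geometry-preserving orthonormal initialization and, guided by it, designs two new LoRA variants, LoRA-RLPO and LoRA-RLMO.
📊 Data & Experiments: Experiments on mathematical reasoning benchmarks show that orthonormal initialization stabilizes RLVR training and outperforms standard LoRA, and the analysis explains the shortcomings of PiSSA and MiLoRA in RLVR.
⭐ Contributions: Shows theoretically that orthonormal initialization minimizes the gap to full fine-tuning, proposes new variants that stabilize RLVR training, and exposes the limitations of existing variants in RLVR.
View full abstract (Abstract)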
Low-Rank Adaptation (LoRA) and its variants enable parameter-efficient fine-tuning of large language models under the supervised fine-tuning (SFT) paradigm. However, their efficacy and behavior under Reinforcement Learning with Verifiable Rewards (RLVR) are less well understood. In particular, two structurally initialized LoRA variants, PiSSA and MiLoRA, which outperform standard LoRA under SFT, can underperform standard LoRA under RLVR and may even exhibit training instability. These observations suggest that how to initialize the low-rank matrices in RLVR remains unclear. In this work, we develop a theoretical analysis of LoRA in RLVR, showing that orthonormal initialization achieves the minimal gap between LoRA’s outcome and that of full fine-tuning. Guided by this insight, we propose geometry-preserving orthonormal initialization for low-rank adaptation in RLVR, leading to two new variants, LoRA-RLPO and LoRA-RLMO. Experiments on mathematical reasoning benchmarks show that our orthonormal initialization stabilizes RLVR training and outperforms standard LoRA, contrasting with PiSSA and MiLoRA. Finally, our unified analysis also explains why PiSSA and MiLoRA can underperform in RLVR, which may be of independent interest.
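🎯 Motivation: Multi-hop fact verification requires complex reasoning across pieces of evidence, but large language models often hallucinate or break the logical chain, and existing methods do not explicitly model the causal dependencies between evidence and claims.
❓ Problem: Introduces a framework that grounds reasoning in a Structural Causal Model, treating fact verification as a constructive causal inference process to improve transparency and accuracy.
🔍 Analysis: Experiments reveal an inverted-U relationship between reasoning chain length and verification accuracy: excessive structural complexity degrades performance.
🛠️ Method: Uses a rule-based reinforcement learning strategy built on Group Relative Policy Optimization (GRPO) to dynamically balance structural depth and conciseness.
📊 Data & Experiments: Extensive experiments on the HoVer and EX-FEVER datasets confirm the framework's advantage on multi-hop fact verification.
⭐ Contributions: Proposes a framework that combines a causal model with policy optimization, markedly improving the reliability and interpretability of complex fact verification and surpassing existing multi-hop reasoning baselines.
View full abstract (Abstract)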
Multi-Hop Fact Verification (MHFV) necessitates complex reasoning across disparate evidence, posing significant challenges for Large Language Models (LLMs) which often suffer from hallucinations and fractured logical chains. Existing methods, while improving transparency via Chain-of-Thought (CoT), lack explicit modeling of the causal dependencies between evidence and claims. In this work, we introduce a novel framework that grounds reasoning in a Structural Causal Model (SCM), treating verification as a constructive causal inference process. We empirically identify an "inverted U-shaped" correlation between reasoning chain length and accuracy, revealing that excessive structural complexity degrades performance. To address this, we propose a Rule-based Reinforcement Learning strategy using Group Relative Policy Optimization (GRPO). This approach dynamically optimizes the trade-off between structural depth and conciseness. Extensive experiments on HoVer and EX-FEVER demonstrate that our SCM-GRPO framework significantly outperforms state-of-the-art baselines, offering a reliable and interpretable solution for complex fact verification.
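🎯 Motivation: In long-horizon tasks, agents based on large language models (LLMs) face a significant challenge because sparse, outcome-based rewards make it hard to assign credit to intermediate steps.
❓ Problem: Addresses inefficient learning dynamics caused by the coupling between policy gradients and entropy, including inefficiently small updates for confident correct actions and potentially excessive updates for uncertain ones.
🔍 Analysis: Conventional policy gradient methods suffer from this entropy coupling, which both limits efficient learning of correct actions and can destabilize exploration.
🛠️ Method: Proposes Entropy-Modulated Policy Gradients (EMPG), which recalibrates the learning signal by amplifying updates for confident correct actions, penalizing confident errors, and attenuating updates from uncertain steps, and adds a bonus term that encourages more predictable solution paths.
📊 Data & Experiments: Experiments on three challenging agent tasks (WebShop, ALFWorld, Deep Search) show substantial gains over strong policy gradient baselines.
⭐ Contributions: Tackles reward sparsity in long-horizon tasks with the EMPG framework, markedly improving the efficiency and stability of policy training for LLM-based agents.
View full abstract (Abstract)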
In long-horizon tasks, recent agents based on Large Language Models (LLMs) face a significant challenge: sparse, outcome-based rewards make it difficult to assign credit to intermediate steps. Previous methods mainly focus on creating dense reward signals to guide learning, either through traditional reinforcement learning techniques like inverse reinforcement learning or by using Process Reward Models for step-by-step feedback. In this paper, we identify a fundamental problem in the learning dynamics of LLMs: the magnitude of policy gradients is inherently coupled with the entropy, which leads to inefficient small updates for confident correct actions and potentially destabilizing large updates for uncertain ones. To resolve this, we propose Entropy-Modulated Policy Gradients (EMPG), a framework that recalibrates the learning signal based on step-wise uncertainty and the final task outcome. EMPG amplifies updates for confident correct actions, penalizes confident errors, and attenuates updates from uncertain steps to stabilize exploration. We further introduce a bonus term for future clarity that encourages agents to find more predictable solution paths. Through comprehensive experiments on three challenging agent tasks, WebShop, ALFWorld, and Deep Search, we demonstrate that EMPG achieves substantial performance gains and significantly outperforms strong policy gradient baselines.
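One way to picture the recalibration is to scale a trajectory-level advantage by a factor that decreases with step-wise entropy, as in the sketch below. The exponential scaling, the min-max entropy normalization, the rate `k`, and the function names are assumptions chosen for illustration, not the paper's formulas, and the future-clarity bonus is not modeled.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one step's action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def modulated_advantages(step_probs, outcome_advantage, k=1.0):
    """Scale the outcome advantage per step: confident steps up, uncertain steps down."""
    ents = [entropy(p) for p in step_probs]
    lo, hi = min(ents), max(ents)
    span = (hi - lo) or 1.0  # avoid division by zero when all entropies match
    raw = [math.exp(-k * (h - lo) / span) for h in ents]
    mean_raw = sum(raw) / len(raw)
    # Dividing by the mean keeps the average scale factor at 1.
    return [outcome_advantage * g / mean_raw for g in raw]


# Two hypothetical steps: one confident, one uncertain.
steps = [[0.97, 0.02, 0.01], [0.4, 0.3, 0.3]]
success = modulated_advantages(steps, outcome_advantage=1.0)
failure = modulated_advantages(steps, outcome_advantage=-1.0)
```

With a positive outcome the confident step receives a larger update than the uncertain one; with a negative outcome the same scaling turns into a stronger penalty for the confident error, matching the three behaviors listed in the abstract.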
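🎯 Motivation: Identifies a semantic decomposition in robot action sequences that separates task-level motion intent from execution-level refinements, with the aim of improving action generation and precision.
❓ Problem: Uses spectral analysis to address the difficulty conventional methods have in capturing both the global motion trajectory and fine-grained behavior.
🔍 Analysis: The discrete cosine transform (DCT) shows that low-frequency components reflect the global trajectory, while high-frequency components carry precise timing, alignment, and contact behavior.
🛠️ Method: Proposes Causal Spectral Policy (CSP), which treats action generation as a causal coarse-to-fine process: coarse motion is predicted from observation and language, and fine corrections are generated conditioned on the realized trajectory.
📊 Data & Experiments: Evaluated on precision manipulation tasks in simulation and the real world, CSP consistently outperforms strong baselines on precision-sensitive tasks; human-inspired teleoperation noise injection is introduced as an augmentation to improve robustness.
⭐ Contributions: Reveals the spectral structure of action sequences, proposes a policy learning framework based on causal spectral decomposition, and markedly improves performance and generalization on precision manipulation.
View full abstract (Abstract)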
In this paper, we identify a semantic decomposition in robot action sequences, separating task-level motion intent from execution-level refinements. By analyzing actions in the spectral domain using the discrete cosine transform (DCT), we observe that low-frequency components capture global motion trajectories, while high-frequency components encode precise timing, alignment, and contact behaviors. Motivated by this structure, we propose Causal Spectral Policy (CSP), which models action generation as a causal coarse-to-fine process: coarse motion is predicted from observation and language, and fine corrections are generated conditionally on the realized trajectory. Across simulation and real-world evaluations, CSP consistently outperforms strong baselines on precision-sensitive manipulation tasks. Additionally, we propose human-inspired teleoperation noise injection as a data augmentation method under which our approach demonstrates strong robustness to noisy demonstrations.
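The spectral split described here can be reproduced on a one-dimensional action sequence with an orthonormal DCT-II and its inverse. The sketch below is written with the standard library only; the cutoff index and the toy trajectory are assumptions, and the paper's choice of frequency bands is not specified in the abstract.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n)))
    return out


def idct(c):
    """Inverse of the orthonormal DCT-II above (an orthonormal DCT-III)."""
    n = len(c)
    out = []
    for t in range(n):
        total = 0.0
        for k in range(n):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c[k] * math.cos(math.pi * (t + 0.5) * k / n)
        out.append(total)
    return out


def split_bands(actions, cutoff):
    """Split a sequence into a low-frequency part (k < cutoff) and a high-frequency part."""
    coeffs = dct(actions)
    low = idct([c if k < cutoff else 0.0 for k, c in enumerate(coeffs)])
    high = idct([c if k >= cutoff else 0.0 for k, c in enumerate(coeffs)])
    return low, high


# Toy single-joint trajectory: a slow ramp plus a small alternating correction.
actions = [0.1 * t + 0.02 * (-1) ** t for t in range(16)]
low, high = split_bands(actions, cutoff=4)
```

Because the transform is linear, the two parts always sum back to the original sequence, so a coarse prediction of `low` followed by a conditional prediction of `high` loses no information.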
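🎯 Motivation: Reinforcement learning can refine the behavior of large language models through reward signals, but state value estimation, a key component, remains understudied in post-training.
❓ Problem: Proposes methods that make state value estimation in LLM post-training more accurate, enabling more stable and efficient RL training.
🔍 Analysis: Critics in standard approaches such as PPO typically degenerate toward a coarse group-average estimate that does not reflect state value accurately.
🛠️ Method: Proposes two methods: Numca computes state values using the numbers that appear in responses as the state representation, and Hista uses the semantic information in hidden states to group disjoint responses.
📊 Data & Experiments: Builds the State Value Estimation Benchmark (SVEB); experiments show that the improved estimates raise training performance across several RL algorithms.
⭐ Contributions: Proposes two new state value estimation methods, builds a benchmark for evaluating state value estimates, and shows that they improve the stability and performance of LLM post-training.
View full abstract (Abstract)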
Reinforcement Learning (RL) refines large language models (LLMs) by directly optimizing model behavior with reward signals. Although accurate state value estimation is essential for stable training in classical RL settings, it remains an understudied challenge in LLM post-training. In this work, we demonstrate that accurate value estimation can stabilize and improve post-training. First, we construct the State Value Estimation Benchmark (SVEB) and show that critics of standard approaches like PPO simply degenerate toward a coarse group-average baseline. To overcome this, we propose two techniques. The first is a heuristic method, *Numca*, which uses numbers in responses as the state representation to calculate state value. The second is a general hidden-state-based framework, *Hista*, which utilizes the semantic information in hidden states to group disjoint responses. Experiments show that, when equipped with these improved estimates, training consistently achieves better performance across different RL algorithms.
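🎯 Motivation: Conventional RL algorithms for large language models cannot distinguish decisive reasoning steps from irrelevant content, so reward signals are assigned too coarsely and reasoning performance is limited.
❓ Problem: Traces the information-flow structure induced by the model's attention to follow reasoning that targets the answer region, improving reward assignment for strongly relevant tokens.
🔍 Analysis: Existing methods rely on local heuristics and ignore the global structure of information propagation, so they miss key reasoning steps and long-range dependencies.
🛠️ Method: Proposes FlowTracer, which aggregates attention weights into a directed acyclic graph, reweights edges to keep only influence that can reach the answer region, extracts an information-flow backbone as the reasoning path, and scores tokens by flow throughput.
📊 Data & Experiments: Experiments on multiple reasoning tasks show consistent improvements over baseline algorithms.
⭐ Contributions: Delivers fine-grained, information-flow-based token reward shaping that focuses learning on answer-relevant reasoning steps and improves RL-trained models on reasoning tasks.
View full abstract (Abstract)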
Token-level credit assignment remains a key obstacle for reinforcement learning (RL) in large language models (LLMs), where RL recipes typically treat all tokens equally, failing to distinguish decisive reasoning steps from routine formatting or fluent filler. Recent attempts leverage model-internal signals to assign finer-grained credit, but these are often point-wise heuristics that ignore the global structure of information propagation. We propose FlowTracer, an RL framework that traces \emph{answer-targeted reasoning flow} on an attention-induced directed acyclic graph, in which nodes correspond to tokens and edge capacities come from aggregated attention weights, and derives token credit from this global structure. The edge capacities are reweighted to retain only the influence that can reach the answer region, while enforcing local flow conservation so intermediate tokens neither lose nor gain effective mass due to path length or irrelevant branches. On this graph, FlowTracer extracts an information-flow backbone connecting the question to the answer and scores tokens by flow throughput, revealing high-impact hubs and aggregation checkpoints that mediate long-range dependencies. These derived importances are used to shape token-level rewards, enabling learning signals to focus precisely on the tokens that route information toward (or away from) correct answers and delivering consistent performance gains across a range of reasoning tasks.
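🎯 Motivation: Solving complex geometric problems requires interleaved reasoning, but current multimodal large language models are limited in combining diagram generation with logical deduction.
❓ Problem: Identifies the root cause of the performance degradation that supervised fine-tuning (SFT) shows on interleaved plotting-and-reasoning tasks and improves the model's grasp of causal dependencies.
🔍 Analysis: SFT mainly achieves distributional alignment and fails to internalize the causal dependency between plotting and reasoning steps, causing a marked drop in reasoning performance.
🛠️ Method: Proposes Faire, a reinforcement learning framework that introduces three causal constraints to move from surface imitation to functional alignment.
📊 Data & Experiments: Extensive experiments on challenging geometric reasoning benchmarks confirm that Faire internalizes the plotting process and achieves competitive performance.
⭐ Contributions: Exposes the limitation of SFT, proposes the Faire RL framework, and demonstrates clear performance gains on geometric reasoning tasks.
View full abstract (Abstract)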
Solving complex geometric problems inherently requires \textit{interleaved reasoning}: a tight alternation between constructing diagrams and performing logical deductions. Although recent Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities in visual generation and plotting, we identify a counter-intuitive and underexplored phenomenon. Naively applying Supervised Fine-Tuning (SFT) on interleaved plot–solution data leads to a substantial degradation in reasoning performance compared to text-only baselines. We argue that this failure stems from a fundamental limitation of SFT, which primarily induces \textit{distributional alignment}: the model learns to reproduce the surface format of interleaved plotting but fails to internalize the causal dependency between the generated plot and reasoning steps. To overcome this limitation, we propose Faire (\textbf{F}unctional \textbf{a}lignment for \textbf{i}nterleaved \textbf{re}asoning), a reinforcement learning framework that enforces three causal constraints to move beyond superficial imitation toward \textit{functional alignment}. Extensive experiments show that Faire induces a qualitative shift in model behavior in which the plotting is effectively internalized, yielding competitive performance on challenging geometric reasoning benchmarks.
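🎯 Motivation: Large language models improve accuracy through long chains of thought, but inference cost rises substantially. Existing methods offer limited control over how reasoning effort is allocated across tokens.
❓ Problem: Proposes an information-theoretic post-training framework that optimizes token-level allocation of information, improving reasoning efficiency and cutting redundant computation.
🔍 Analysis: A theoretical analysis shows that information-aware token-level optimization can reduce reasoning verbosity without harming correctness.
🛠️ Method: Designs the IAPO framework, which assigns token-wise advantages based on each token's conditional mutual information with the final answer and suppresses low-utility reasoning steps.
📊 Data and experiments: Across several reasoning datasets, IAPO reduces reasoning length by up to 36% while achieving higher accuracy than existing token-efficient RL methods.
⭐ Contributions: Provides a general information-aware approach that improves LLM reasoning efficiency, with code released to support further research.
Abstract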
Large language models increasingly rely on long chains of thought to improve accuracy, yet such gains come with substantial inference-time costs. We revisit token-efficient post-training and argue that existing sequence-level reward-shaping methods offer limited control over how reasoning effort is allocated across tokens. To bridge the gap, we propose IAPO, an information-theoretic post-training framework that assigns token-wise advantages based on each token’s conditional mutual information (MI) with the final answer. This yields an explicit, principled mechanism for identifying informative reasoning steps and suppressing low-utility exploration. We provide a theoretical analysis showing that our IAPO can induce monotonic reductions in reasoning verbosity without harming correctness. Empirically, IAPO consistently improves reasoning accuracy while reducing reasoning length by up to 36\%, outperforming existing token-efficient RL methods across various reasoning datasets. Our results demonstrate that information-aware advantage shaping is a powerful and general direction for token-efficient post-training. The code is available at https://anonymous.4open.science/r/agent_rl-107E/.
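🎯 Motivation: Reinforcement learning often produces high-frequency oscillatory control signals that undermine safety and stability in physical deployment, so existing methods need to be improved to achieve smooth control.
❓ Problem: Explicit action chunking enlarges the policy output dimension, which makes optimization difficult and is incompatible with standard step-wise interaction. This work proposes an implicit action chunking framework to address this.
🔍 Analysis: Existing methods fail to balance temporal abstraction with reactive control, and the optimization problems caused by an expanded action space hinder smooth continuous control.
🛠️ Method: Proposes Dual-Window Smoothing (DWS): an execution window ensures physical smoothness, a value window corrects critic bias, and a temporal regularizer based on first-order action differences promotes global continuity.
📊 Data and experiments: Experiments on the DeepMind Control Suite, industrial energy management tasks, and complex vision-based driving tasks show that DWS outperforms state-of-the-art baselines such as LipsNet++ and SmODE, and reaches a 100% success rate on the driving tasks.
⭐ Contributions: Proposes DWS, an implicit action chunking method that does not expand the action space, improving control smoothness, safety, and action stability, and broadening the settings in which RL can be physically deployed.
Abstract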
Reinforcement learning often produces high-frequency oscillatory control signals that undermine the safety and stability required for physical deployment. Explicit action chunking addresses this by predicting fixed-horizon trajectories but increases the policy output dimension to $\mathbb{R}^{hd}$, leading to optimization difficulties and incompatibility with standard step-wise interaction. To overcome these challenges, this paper proposes Dual-Window Smoothing (DWS), an implicit action chunking framework for smooth continuous control. Unlike explicit methods, DWS enforces temporal coherence without expanding the action space. It uses a dual-window design: an execution window that ensures physical smoothness through deterministic modulation, and a value window that aligns temporal-difference targets over the horizon to correct critic bias caused by open-loop execution. DWS also includes a lightweight actor-side temporal regularizer based on first-order action differences to promote global continuity. This design effectively bridges the gap between temporal abstraction and reactive step-wise control. Experiments on benchmarks including the DeepMind Control Suite and industrial energy management tasks with vector states show that DWS outperforms state-of-the-art (SOTA) baselines such as LipsNet++ and SmODE. In complex vision-based autonomous driving tasks, DWS achieves smoother control and safer behavior with reduced jitter, and attains a 100% success rate.
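To make the "first-order action differences" idea concrete, here is a minimal sketch of a smoothness penalty over a short action sequence. It is a generic illustration, not the DWS regularizer itself: the function name, the squared L2 norm, and the averaging are all assumptions, since the abstract does not give the exact form.

```python
from typing import Sequence


def action_smoothness_penalty(actions: Sequence[Sequence[float]]) -> float:
    """Mean squared first-order difference between consecutive actions.

    Hypothetical helper: the norm and the averaging are illustrative choices.
    """
    if len(actions) < 2:
        return 0.0
    total = 0.0
    for prev, curr in zip(actions, actions[1:]):
        # Squared L2 distance between a_t and a_{t+1}.
        total += sum((c - p) ** 2 for p, c in zip(prev, curr))
    return total / (len(actions) - 1)


# A slowly ramping 1-D control signal versus an oscillating one.
smooth = [[0.0], [0.1], [0.2], [0.3]]
jittery = [[0.0], [1.0], [0.0], [1.0]]
smooth_penalty = action_smoothness_penalty(smooth)
jittery_penalty = action_smoothness_penalty(jittery)
```

In an actor loss, a term like this would typically be scaled by a small coefficient and added to the policy objective, so that oscillating action sequences are penalized more than smooth ones.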
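🎯 Motivation: Multi-turn human-AI collaboration is central to interactive services, but optimizing it is hampered by sparse intermediate rewards and highly stochastic user responses.
❓ Problem: Proposes an optimization method for sparse reward signals and unstable user behavior, to improve policy learning during interaction.
🔍 Analysis: Existing methods depend on volatile fine-grained reward signals and struggle to capture preferences that are semantically consistent with human judgment.
🛠️ Method: Proposes ITPO, which uses an implicit process reward model to derive fine-grained turn-wise rewards from sparse outcome signals and applies a normalization mechanism to improve training stability.
📊 Data and experiments: Evaluated on three multi-turn collaborative tasks (math tutoring, document writing, and medical recommendation) in combination with PPO, GRPO, and RLOO, showing better convergence than existing baselines.
⭐ Contributions: Proposes a robust turn-wise reward optimization framework and shows empirically that it improves multi-turn collaborative tasks and agrees with human judgment.
Abstract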
Multi-turn human-AI collaboration is fundamental to deploying interactive services such as adaptive tutoring, conversational recommendation, and professional consultation. However, optimizing these interactions via reinforcement learning is hindered by the sparsity of verifiable intermediate rewards and the high stochasticity of user responses. To address these challenges, we introduce Implicit Turn-wise Policy Optimization (ITPO). ITPO leverages an implicit process reward model to derive fine-grained, turn-wise process rewards from sparse outcome signals. Unlike volatile token-level rewards, these turn-level signals exhibit superior robustness and may utilize a normalization mechanism to further enhance training stability. We evaluate ITPO across three representative multi-turn collaborative tasks: math tutoring, document writing, and medical recommendation. Empirical results demonstrate that ITPO, when combined with PPO, GRPO, or RLOO, consistently achieves better convergence than existing baselines. Detailed trajectory analysis confirms that ITPO infers turn-wise preferences that are semantically aligned with human judgment.
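🎯 Motivation: Asymmetric actor-critic methods are widely used in partially observable reinforcement learning, but they usually condition the critic on the full state, which is often unavailable in practice. A framework that relies less on full-state access is needed.
❓ Problem: Proposes a framework in which the critic can be conditioned on arbitrary state-dependent privileged signals without access to the full state, addressing this practical constraint on asymmetric training.
🔍 Analysis: Any such privileged signal yields unbiased policy gradient estimates, which substantially expands the set of admissible privileged information and raises the question of which signal to select.
🛠️ Method: Designs two informativeness criteria for selecting privileged signals: a dependence-based test that can be applied before training, and a post-hoc criterion based on improvements in value prediction accuracy.
📊 Data and experiments: On partially observable benchmark tasks and synthetic environments, carefully selected privileged signals match or outperform full-state asymmetric baselines while relying on strictly less state information.
⭐ Contributions: Proposes an asymmetric framework that needs no full-state access and widens the range of usable privileged signals; designs criteria for selecting privileged signals; validates the framework across multiple tasks.
Abstract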
Asymmetric actor-critic methods are widely used in partially observable reinforcement learning, but typically assume full state observability to condition the critic during training, which is often unrealistic in practice. We introduce the informed asymmetric actor-critic framework, allowing the critic to be conditioned on arbitrary state-dependent privileged signals without requiring access to the full state. We show that any such privileged signal yields unbiased policy gradient estimates, substantially expanding the set of admissible privileged information. This raises the problem of selecting the most adequate privileged information in order to improve learning. For this purpose, we propose two novel informativeness criteria: a dependence-based test that can be applied prior to training, and a criterion based on improvements in value prediction accuracy that can be applied post-hoc. Empirical results on partially observable benchmark tasks and synthetic environments demonstrate that carefully selected privileged signals can match or outperform full-state asymmetric baselines while relying on strictly less state information.
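🎯 Motivation: Knowledge base question answering (KBQA) must bridge the semantic gap between natural language and knowledge graphs, but existing methods often produce hallucinated queries or template-bound reasoning when generating logical forms.
❓ Problem: Design a framework that strengthens the ability of large language models to optimize through interaction, avoiding hallucination and shallow imitation in logical form generation.
🔍 Analysis: Current methods either generate queries without verifying them against the knowledge graph or rely too heavily on rigid templates, without truly understanding the environment.
🛠️ Method: Proposes KBQA-R1, which treats KBQA as a multi-turn decision process, optimizes reasoning strategies with reinforcement learning over a structured action space, and resolves the cold-start problem with aligned data produced by Referenced Rejection Sampling (RRS).
📊 Data and experiments: Experiments on WebQSP, GrailQA, and GraphQuestions show state-of-the-art performance.
⭐ Contributions: Applies reinforcement learning to optimize interactive reasoning in KBQA; designs Referenced Rejection Sampling to improve the quality of cold-start reasoning data; achieves leading results on multiple benchmarks.
Abstract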
Knowledge Base Question Answering (KBQA) challenges models to bridge the gap between natural language and strict knowledge graph schemas by generating executable logical forms. While Large Language Models (LLMs) have advanced this field, current approaches often struggle with a dichotomy of failure: they either generate hallucinated queries without verifying schema existence or exhibit rigid, template-based reasoning that mimics synthesized traces without true comprehension of the environment. To address these limitations, we present **KBQA-R1**, a framework that shifts the paradigm from text imitation to interaction optimization via Reinforcement Learning. Treating KBQA as a multi-turn decision process, our model learns to autonomously navigate the knowledge base using a structured action space, refining its reasoning strategies based on concrete execution feedback rather than static supervision. Furthermore, we introduce Referenced Rejection Sampling (RRS), a data synthesis method that resolves cold-start challenges by strictly aligning reasoning traces with ground-truth action sequences. Extensive experiments on WebQSP, GrailQA, and GraphQuestions demonstrate that KBQA-R1 achieves state-of-the-art performance. Code is available at https://anonymous.4open.science/r/KBQA-R1-814F.
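🎯 Motivation: Large language models using chain-of-thought reasoning often produce long, incorrect responses, wasting substantial compute. Dynamic mid-generation abstention mitigates this by terminating unpromising reasoning traces early, during generation.
❓ Problem: Provide principled guidance for the dynamic abstention decision that balances compute against information, rather than relying on purely empirical approaches.
🔍 Analysis: Most existing abstention methods decide before or after generation and do not cut off unpromising reasoning traces mid-generation. The potential gains from dynamic abstention remain largely untapped.
🛠️ Method: Models abstention as an explicit action within a regularized reinforcement learning framework, with an abstention reward parameter controlling the trade-off between compute and information. Shows theoretically that abstaining when the value function falls below this reward outperforms natural baselines, and derives an efficient method to approximate the value function.
📊 Data and experiments: Empirical results on mathematical reasoning tasks show improved selective accuracy over existing methods.
⭐ Contributions: Presents a formal analysis of dynamic abstention and a principled value-function-based decision rule, improving the efficiency and selective accuracy of LLM reasoning.
Abstract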
Large language models (LLMs) using chain-of-thought reasoning often waste substantial compute by producing long, incorrect responses. Abstention can mitigate this by withholding outputs unlikely to be correct. While most abstention methods decide to withhold outputs before or after generation, dynamic mid-generation abstention considers early termination of unpromising reasoning traces at each token position. Prior work has explored empirical variants of this idea, but principled guidance for the abstention rule remains lacking. We present a formal analysis of dynamic abstention for LLMs, modeling abstention as an explicit action within a regularized reinforcement learning framework. An abstention reward parameter controls the trade-off between compute and information. We show that abstaining when the value function falls below this reward strictly outperforms natural baselines under general conditions. We further derive a principled and efficient method to approximate the value function. Empirical results on mathematical reasoning tasks support our theory and demonstrate improved selective accuracy over existing methods.
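The decision rule described above (abstain once the value estimate falls below the abstention reward) can be sketched in a few lines. This is only an illustration under stated assumptions: the per-token value estimates are taken as given, and the function name and example numbers are hypothetical rather than taken from the paper.

```python
from typing import Optional, Sequence


def first_abstention_step(values: Sequence[float], abstain_reward: float) -> Optional[int]:
    """Return the first token position where the estimated value drops below
    the abstention reward, or None if generation should run to completion.

    `values[t]` is an assumed estimate of the value of continuing at position t.
    """
    for t, value in enumerate(values):
        if value < abstain_reward:
            return t
    return None


# Hypothetical value estimates for a reasoning trace that degrades over time.
trace_values = [0.8, 0.6, 0.3, 0.1]
stop_at = first_abstention_step(trace_values, abstain_reward=0.4)
never_stop = first_abstention_step(trace_values, abstain_reward=0.05)
```

Raising `abstain_reward` makes the rule stop earlier and save more compute, while lowering it lets more traces run to completion, which is the compute versus information trade-off the abstract refers to.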
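🎯 Motivation: Universal morphology control aims to learn one policy across heterogeneous robot morphologies, but existing methods are computationally expensive, generalize poorly across tasks, and must be retrained for each new task.
❓ Problem: How to design an efficient controller architecture that lowers deployment overhead while enabling effective policy transfer across tasks.
🔍 Analysis: Transformer controllers are effective but computationally costly, and existing methods transfer policies poorly.
🛠️ Method: Proposes DivMorph, which factorizes Transformer weights via SVD into basic knowledge units and uses dynamic soft gating, conditioned on task and morphology embeddings, to modulate these units into universal components (learngenes) and task- and morphology-specific components (tailors), achieving knowledge disentanglement and efficient deployment.
📊 Data and experiments: Extensive experiments show state-of-the-art performance, with 3.3× better sample efficiency for cross-task transfer and a 16.7× smaller model for single-agent deployment.
⭐ Contributions: Designs DivMorph, a modular training paradigm that uses knowledge decomposition and dynamic recomposition to improve policy transfer efficiency and reduce resource consumption in morphology control.
Abstract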
Universal morphology control aims to learn a universal policy that generalizes across heterogeneous robot morphologies, with Transformer-based controllers emerging as a dominant choice. However, such architectures incur substantial computational costs, resulting in high deployment overhead, and existing methods exhibit limited cross-task generalization, necessitating training from scratch for each new task. To this end, we propose DivMorph, a modular training paradigm that leverages knowledge diversion to learn \textit{decomposable controllers}. DivMorph factorizes randomly initialized Transformer weights into \textit{basic knowledge units} via SVD and employs dynamic soft gating, conditioned on task and morphology embeddings, to adaptively modulate these units into universal \textit{learngenes} and morphology- and task-specific \textit{tailors} during training, thereby achieving knowledge disentanglement. By selectively activating relevant components, DivMorph adaptively recomposes the controller, enabling efficient policy deployment and effective policy transfer to novel tasks. Extensive experiments demonstrate that DivMorph achieves state-of-the-art performance, improving sample efficiency for cross-task transfer by 3.3$\times$ and reducing model size for single-agent deployment by 16.7$\times$.
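🎯 Motivation: Large language models perform well on complex instruction following, but precise control of output length remains difficult, mainly because the model has an intrinsic deficit in length cognition.
❓ Problem: Proposes LARFT, a training framework that uses reinforcement learning and hindsight length awareness to close the gap between the model's length cognition and its behavior.
🔍 Analysis: Current methods usually impose length constraints through external signals or optimization objectives, overlooking the model's own lack of awareness of length.
🛠️ Method: Converts on-policy data into hindsight length awareness tasks and jointly optimizes the model's internal representation of length and its policy, to achieve reliable length instruction following.
📊 Data and experiments: Tested on four base models, with an average improvement of 20.92 points across three length instruction following benchmarks and only a marginal decline of 1.45 points on four general capability benchmarks.
⭐ Contributions: Proposes the LARFT training framework, which substantially improves length instruction following while largely preserving general capability.
Abstract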
Despite the strong performance of Large Language Models (LLMs) on complex instruction-following tasks, precise control of output length remains a persistent challenge. Existing methods primarily attempt to enforce length constraints by externally imposing length signals or optimization objectives, while largely overlooking the underlying limitation: the model's intrinsic deficit in length cognition. To address this, we propose \textbf{LARFT} (\textbf{L}ength-\textbf{A}ware \textbf{R}einforcement \textbf{F}ine-\textbf{T}uning), a training framework that aligns the model's length cognition with its action. Specifically, LARFT integrates length-oriented reinforcement learning with a hindsight length awareness. By transforming on-policy data into hindsight self-awareness tasks where the model learns to identify the actual length of its own generation, LARFT jointly optimizes the model’s internal representation of length information and refines its policy to satisfy length constraints, thereby achieving precise and reliable length instruction following. Extensive experiments across four base models demonstrate that LARFT outperforms existing baselines, achieving an average improvement of \textbf{+20.92} points across three length instruction following benchmarks with only a marginal decline of \textbf{-1.45} points on four general capability benchmarks.
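🎯 Motivation: High-capacity generative policies perform well in behavior cloning but are limited by narrow demonstration coverage and distribution shift, and directly fine-tuning large action decoders with reinforcement learning is often unstable and sample inefficient.
❓ Problem: Design a lightweight adaptation method that improves the task performance of a frozen generative policy without modifying the action decoder, while preserving its multimodal structure.
🔍 Analysis: Improving a generative policy requires avoiding the unstable decoding dynamics caused by out-of-distribution queries, and controlling the perturbation magnitude while optimizing downstream value.
🛠️ Method: Proposes Lagrangian Perturbation Diffusion Steering (LP-DS), which learns a compact noise-space perturbation module that shifts the Gaussian noise inputs, and optimizes this module with a Lagrangian trust-region objective for stable policy improvement.
📊 Data and experiments: Tested on RoboMimic manipulation, OpenAI Gym locomotion, and Adroit dexterous manipulation benchmarks, where LP-DS improves sample efficiency, success rate, and return while maintaining action-space diversity.
⭐ Contributions: Proposes an efficient and stable policy improvement method that improves performance on several benchmarks, with return improvements of up to 25% over prior baselines.
Abstract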
Behavior cloning with high-capacity generative policies achieves strong imitation performance, but performance is often constrained by limited demonstration coverage and sensitivity to distribution shift. While reinforcement learning can improve task performance, directly fine-tuning large action decoders is often unstable and sample inefficient. We propose **Lagrangian Perturbation Diffusion Steering (LP-DS)**, a lightweight adaptation method that improves a frozen generative policy while preserving its multimodal structure. LP-DS learns a compact noise-space perturbation module that shifts Gaussian noise inputs before decoding, enabling policy improvement without modifying the action decoder. To prevent off-manifold latent queries and unstable denoising dynamics, we optimize this module with a Lagrangian trust-region objective that maximizes downstream value while constraining perturbation magnitude, yielding stable and sample-efficient learning. Across RoboMimic manipulation, OpenAI Gym locomotion, and Adroit dexterous manipulation benchmarks, LP-DS improves sample efficiency, success, and return while maintaining diverse behavior, as quantified by higher action-space entropy using the Kozachenko--Leonenko k-nearest neighbor estimator, with return improvements of up to 25\% over prior baselines. Anonymous project page: https://sites.google.com/view/lp-ds/home.
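One plausible way to write down a "Lagrangian trust-region objective that maximizes downstream value while constraining perturbation magnitude" is sketched below. The notation is entirely assumed for illustration: $D$ denotes the frozen decoder, $\delta_\phi$ the learned noise-space perturbation, $Q$ a learned critic, $\epsilon$ the trust-region radius, and $\lambda$ the Lagrange multiplier. The abstract does not state the exact form used in LP-DS.

```latex
% Hypothetical notation: D is the frozen decoder, \delta_\phi the perturbation module,
% Q a learned critic, \epsilon the trust-region radius, \lambda \ge 0 the multiplier.
\begin{aligned}
\max_{\phi}\quad
  & \mathbb{E}_{s,\; z \sim \mathcal{N}(0, I)}
    \Big[ Q\big(s,\, D(s,\, z + \delta_\phi(s, z))\big) \Big] \\
\text{s.t.}\quad
  & \mathbb{E}_{s,\, z}\big[ \lVert \delta_\phi(s, z) \rVert_2^2 \big] \le \epsilon, \\[4pt]
\mathcal{L}(\phi, \lambda) \;=\;
  & -\,\mathbb{E}\Big[ Q\big(s,\, D(s,\, z + \delta_\phi(s, z))\big) \Big]
    + \lambda \Big( \mathbb{E}\big[ \lVert \delta_\phi(s, z) \rVert_2^2 \big] - \epsilon \Big),
    \qquad \lambda \ge 0 .
\end{aligned}
```

Under this reading, $\phi$ is updated to decrease $\mathcal{L}$ while $\lambda$ is updated to increase it, so the multiplier grows whenever the average perturbation exceeds $\epsilon$ and the perturbed noise is kept close to the Gaussian prior the decoder was trained on.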
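🎯 Motivation: Selecting and fine-tuning a pretrained large language model is time-consuming and costly. The best choice must balance performance against practical deployment constraints, which traditional methods handle poorly.
❓ Problem: Proposes a multi-objective AutoML framework that efficiently identifies the Pareto-optimal candidate models for a task-specific dataset, improving selection efficiency and reducing wasted resources.
🔍 Analysis: Traditional selection requires repeated trial and error, and the model with the lowest test loss is not necessarily the best choice in practice, so resources are allocated inefficiently.
🛠️ Method: Uses landmark fine-tuning to produce early performance indicators for candidate models, and meta-learns a selection policy from historical performance data via reinforcement learning in order to construct the Pareto front.
📊 Data and experiments: On held-out datasets, the method reduces search time by an average of 73% compared with exhaustive search while covering more than 99% of the optimal target space hypervolume.
⭐ Contributions: Develops LAMPS, an open-source multi-objective AutoML framework that improves model selection efficiency and resource allocation.
Abstract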
Selecting a pretrained large language model (LLM) to fine-tune for a task-specific dataset can be time-consuming and costly. With several candidate models available to choose from, varying in size, architecture, and pretraining data, finding the best model for a specific task often involves extensive trial and error. In addition, the "best" model may not necessarily be the one with the lowest test loss, as practical considerations such as deployment costs, inference throughput, and limited search budgets might also play crucial roles. To address this, we introduce LAMPS (LAnguage Model Pareto Selection), a novel and open-source multi-objective AutoML framework that meta-learns a resource allocation policy to efficiently identify (or approximate) the Pareto front of candidate LLMs for a task-specific dataset. It is based on two key ideas: (1) landmark fine-tuning, which generates early performance indicators of the candidate models, and (2) meta-learning via reinforcement learning, which learns an effective selection policy from historical performance data (a meta-dataset). Our results show that, on held-out datasets, LAMPS reduces search time by an average of 73\% compared to exhaustive search, while still covering more than 99\% of the optimal target space hypervolume.
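Since the framework targets the Pareto front of candidate models rather than a single best model, a small sketch of Pareto filtering may help. It assumes two objectives that are both minimized (test loss and a deployment cost), and the candidate names and numbers are made up for illustration. It is not LAMPS code.

```python
from typing import Dict, List, Tuple


def pareto_front(candidates: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the names of candidates that no other candidate dominates.

    Each value is (test_loss, deployment_cost); lower is better for both.
    """
    front = []
    for name, (loss, cost) in candidates.items():
        dominated = any(
            # `other` dominates if it is no worse on both objectives
            # and strictly better on at least one.
            (o_loss <= loss and o_cost <= cost) and (o_loss < loss or o_cost < cost)
            for other, (o_loss, o_cost) in candidates.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Hypothetical candidates: (test loss, relative deployment cost).
models = {
    "small": (0.90, 1.0),
    "medium": (0.70, 3.0),
    "large": (0.60, 8.0),
    "bloated": (0.75, 9.0),  # worse than "large" on both objectives
}
front = pareto_front(models)
```

Here `bloated` is excluded because `large` is better on both objectives, while the other three each represent a different loss versus cost trade-off.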
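🎯 Motivation: Reinforcement learning with combinatorial action spaces is hard because the action set grows exponentially and is governed by complex constraints, making direct policy parameterization impractical. Existing methods fall short in expressiveness and generality.
❓ Problem: Proposes a new policy representation that brings the expressiveness of modern generative policies to combinatorial RL while guaranteeing action feasibility by design.
🔍 Analysis: Existing methods embed task-specific value functions or learn deterministic structured policies, sacrificing generality and policy expressiveness.
🛠️ Method: Proposes LSFlow, which learns a stochastic policy in a compact continuous latent space via spherical flow matching, uses a combinatorial optimization solver to map latent samples to valid structured actions, and introduces a smoothed Bellman operator to handle the discontinuous value landscape.
📊 Data and experiments: Across several challenging combinatorial RL tasks, LSFlow outperforms state-of-the-art baselines by an average of 20.6%.
⭐ Contributions: Combines generative policies with the structure of combinatorial actions through a latent spherical flow policy, guaranteeing feasibility and delivering performance gains.
Abstract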
Reinforcement learning (RL) with combinatorial action spaces remains challenging because feasible action sets are exponentially large and governed by complex feasibility constraints, making direct policy parameterization impractical. Existing approaches embed task-specific value functions into constrained optimization programs or learn deterministic structured policies, sacrificing generality and policy expressiveness. We propose a solver-induced \emph{latent spherical flow policy} that brings the expressiveness of modern generative policies to combinatorial RL while guaranteeing feasibility by design. Our method, LSFlow, learns a \emph{stochastic} policy in a compact continuous latent space via spherical flow matching, and delegates feasibility to a combinatorial optimization solver that maps each latent sample to a valid structured action. To improve efficiency, we train the value network directly in the latent space, avoiding repeated solver calls during policy optimization. To address the piecewise-constant and discontinuous value landscape induced by solver-based action selection, we introduce a smoothed Bellman operator that yields stable, well-defined learning targets. Empirically, our approach outperforms state-of-the-art baselines by an average of 20.6\% across a range of challenging combinatorial RL tasks.
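The idea of delegating feasibility to a solver can be illustrated with a deliberately simple constraint. The sketch below assumes the feasible set is "select exactly k of n items" and that the solver maximizes a linear objective given by the latent vector, in which case the solver reduces to a top-k selection. Both the constraint and the linear objective are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Sequence


def solve_top_k(latent: Sequence[float], k: int) -> List[int]:
    """Map a continuous latent vector to a feasible 0/1 action with exactly k ones.

    Stand-in for a combinatorial solver: for this cardinality constraint,
    maximizing the inner product <latent, action> means picking the k largest entries.
    """
    order = sorted(range(len(latent)), key=lambda i: latent[i], reverse=True)
    chosen = set(order[:k])
    return [1 if i in chosen else 0 for i in range(len(latent))]


# Two nearby latent samples map to the same structured action, which is why
# the induced value landscape is piecewise constant in the latent space.
action_a = solve_top_k([0.10, 0.90, -0.30, 0.50], k=2)
action_b = solve_top_k([0.11, 0.90, -0.30, 0.50], k=2)
```

Every output satisfies the constraint regardless of what the policy samples, and small changes in the latent often leave the action unchanged, which is the discontinuity that a smoothed learning target is meant to address.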
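🎯 Motivation: Modern robots struggle to keep actions temporally and spatially consistent when executing at high frequency, which limits their performance on complex tasks.
❓ Problem: Proposes learning high-frequency continuous actions in a latent space to improve the temporal and spatial consistency of high-frequency control.
🔍 Analysis: Conventional action chunking cannot produce smooth, consistent actions at high frequencies (for example 60 Hz), leading to jerky or uncoordinated motion.
🛠️ Method: Uses a variational autoencoder (VAE) to move high-frequency action learning from the action space to a latent space, and designs a Reuse-then-Refine strategy to improve continuity between adjacent action chunks.
📊 Data and experiments: Tested on three real-world contact-rich robotic tasks, where the method completes tasks with smoother motion and fewer pauses.
⭐ Contributions: Improves the temporal and spatial consistency of robots performing complex tasks under high-frequency control, and proposes an action refinement method suited to asynchronous inference.
Abstract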
Modern robotic policies increasingly rely on action chunking to execute complex tasks in the physical world. While action chunking improves temporal consistency at moderate action frequencies, it becomes insufficient when the action frequency is further increased (e.g., to 60~Hz). At such high frequencies, policies often fail to generate actions that are both temporally smooth and spatially consistent. We address this challenge by shifting high-frequency action learning from the action space to a latent space with a variational autoencoder (VAE). This formulation significantly improves both temporal and spatial consistency of high-frequency control. To enable smooth real-time execution, we further introduce Reuse-then-Refine, a chunk-level refinement strategy that improves continuity between adjacent action chunks under asynchronous inference. As a result, robots controlled by our policy can execute complex contact-rich tasks continuously, with fewer pauses and jerky motions. Experiments on three real-world contact-rich robotic tasks show that our approach consistently completes tasks with smooth motions.
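🎯 Motivation: Memory is increasingly important for large language model agents, but most existing systems build memory offline and independently of the query, which is inefficient and can discard critical information.
❓ Problem: Design a runtime memory framework that can adjust the balance between performance and cost, addressing the limits of existing methods in cost control and query adaptivity.
🔍 Analysis: Experiments reveal the trade-offs of the budget tiers (low/mid/high) along method complexity, inference behavior, and model capacity, clarifying where each strategy is strongest under different budget limits.
🛠️ Method: Proposes BudgetMem, in which a lightweight router performs budget-tier routing across memory modules; the router is a neural policy trained with reinforcement learning to balance task performance against memory construction cost.
📊 Data and experiments: On LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines in the high-budget setting and offers better accuracy–cost trade-offs under tight budgets.
⭐ Contributions: Designs and validates a memory framework with explicit budget control, explores three tiering strategies (implementation, reasoning, and capacity), and systematically analyzes their trade-offs across budget regimes.
Abstract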
Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query-agnostic memory construction that can be inefficient and may discard query-critical information. Although runtime memory utilization is a natural alternative, prior work often incurs substantial overhead and offers limited explicit control over the performance-cost trade-off. In this work, we present \textbf{BudgetMem}, a runtime agent memory framework for explicit, query-aware performance–cost control. BudgetMem structures memory processing as a set of memory modules, each offered in three budget tiers (i.e., \textsc{Low}/\textsc{Mid}/\textsc{High}). A lightweight router performs budget-tier routing across modules to balance task performance and memory construction cost, which is implemented as a compact neural policy trained with reinforcement learning. Using BudgetMem as a unified testbed, we study three complementary strategies for realizing budget tiers: implementation (method complexity), reasoning (inference behavior), and capacity (module model size). Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines when performance is prioritized (i.e., high-budget setting), and delivers better accuracy–cost frontiers under tighter budgets. Moreover, our analysis disentangles the strengths and weaknesses of different tiering strategies, clarifying when each axis delivers the most favorable trade-offs under varying budget regimes.
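🎯 Motivation: Diffusion language models (dLLMs) now match autoregressive models on many tasks, but their inference efficiency can still be improved, especially through the sampling strategy.
❓ Problem: Current heuristic unmasking strategies require manual tuning and work poorly with larger block sizes. This work aims to design an efficient sampling strategy that needs no manual tuning.
🔍 Analysis: Heuristics such as confidence thresholding improve sample quality and token throughput, but their performance degrades as block size grows, exposing their limits.
🛠️ Method: Formalizes masked diffusion sampling as a Markov decision process and trains a sampling policy based on a single-layer Transformer that maps dLLM token confidences to unmasking decisions.
📊 Data and experiments: Experiments show that the trained policies match state-of-the-art heuristics in semi-autoregressive (block) generation and outperform them in the full-diffusion setting.
⭐ Contributions: Proposes a sampling policy trained with reinforcement learning that overcomes the limits of heuristic methods and improves dLLM generation in the full-diffusion setting.
Abstract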
Diffusion (Large) Language Models (dLLMs) now match the downstream performance of their autoregressive counterparts on many tasks, while holding the promise of being more efficient during inference. One critical design aspect of dLLMs is the \textit{sampling procedure} that selects which tokens to unmask at each diffusion step. Indeed, recent work has found that heuristic strategies such as confidence thresholding improve both sample quality and token throughput compared to random unmasking. However, such heuristics have downsides: they require manual tuning, and we observe that their performance degrades with larger block sizes. In this work, we instead propose to train sampling procedures using reinforcement learning. Specifically, we formalize masked diffusion sampling as a Markov decision process in which the dLLM serves as the environment, and propose a lightweight policy based on a single-layer transformer that maps dLLM token confidences to unmasking decisions. Our experiments show that these trained policies match the performance of state-of-the-art heuristics when combined with semi-autoregressive (block) generation, while outperforming them in the full-diffusion setting.
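For reference, the confidence-thresholding heuristic that the learned policy is compared against can be sketched as follows. This is a simplified illustration: the function name and the fallback of unmasking the single most confident token when nothing clears the threshold are assumptions, added so that every step makes progress.

```python
from typing import Dict, List


def select_unmask_positions(confidences: Dict[int, float], threshold: float) -> List[int]:
    """Choose which masked positions to unmask at one diffusion step.

    `confidences` maps each still-masked position to the model's top-token
    probability there. Positions at or above `threshold` are unmasked together.
    """
    if not confidences:
        return []
    chosen = sorted(pos for pos, conf in confidences.items() if conf >= threshold)
    if not chosen:
        # Assumed fallback: always unmask at least the most confident position.
        chosen = [max(confidences, key=confidences.get)]
    return chosen


# Hypothetical confidences for three masked positions.
step_1 = select_unmask_positions({3: 0.95, 5: 0.40, 7: 0.91}, threshold=0.9)
step_2 = select_unmask_positions({5: 0.40, 9: 0.20}, threshold=0.9)
```

The threshold is the manually tuned knob the abstract mentions: a lower value unmasks more tokens per step and raises throughput, at the risk of committing to low-confidence tokens.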
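🎯 Motivation: Supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) are common ways to improve LLM reasoning, but existing methods for combining them tend to create conflicts.
❓ Problem: In existing single-stage integration, supervised updates are not uniformly beneficial and can weaken reward optimization. This work proposes a supervision mechanism that improves RL optimization.
🔍 Analysis: Integration by direct reweighting or scheduling fails to improve reward gains and makes training dynamics unstable.
🛠️ Method: Proposes BRIDGE, which uses two nested optimization loops during meta-training, with a lightweight low-rank adapter that coordinates the SFT and RL objectives by maximizing a reward-gap signal.
📊 Data and experiments: Evaluated on three model scales and five reasoning benchmarks, BRIDGE achieves over three points of average absolute improvement over existing baselines and more stable training.
⭐ Contributions: Proposes BRIDGE, a scalable mechanism in which SFT supervises RL, and shows that selective knowledge transfer into the RL objective improves performance on reasoning tasks.
Abstract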
Supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) are two widely used post-training paradigms for improving the reasoning ability of large language models (LLMs). Recent methods attempt to integrate SFT and RLVR in a single stage by reweighting or scheduling their objectives. However, such coupling can be counterproductive because supervised updates are not uniformly beneficial for reward optimization, which can diminish reward gains. To address this, we propose \textsc{BRIDGE}, a scalable framework in which SFT learns to supervise RL by selectively transferring knowledge that improves reward optimization. Specifically, \textsc{BRIDGE} employs two nested optimization loops during meta-training: the inner loop updates base model parameters using a fused SFT--RL gradient. Concurrently, the outer loop updates a lightweight low-rank adapter (LoRA) to coordinate the two objectives by maximizing a reward-gap signal, defined as the reward of joint SFT--RL training over an RL-only baseline. Across three model scales and five reasoning benchmarks, \textsc{BRIDGE} consistently outperforms two-stage cold start, naive mixing, and representative single-stage integration baselines, yielding over three points average absolute improvement and more stable training dynamics.
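🎯 Motivation: Person re-identification (ReID) models are sensitive to long-tail nuisances such as rare viewpoints, occlusions, and complex backgrounds, but existing generative augmentation is mostly open-loop and cannot verify whether generated samples improve discriminability.
❓ Problem: Design a closed-loop generative augmentation framework that learns an instruction policy for generation, addressing the limited usefulness of samples produced by open-loop pipelines.
🔍 Analysis: In existing methods the generation conditions are mostly set heuristically, with no feedback on whether the generated samples help, so improvements are poorly targeted.
🛠️ Method: Proposes ReasonAug, which pairs a frozen generator with a Semantic Reasoning Agent that produces structured edit instructions through hierarchical planning, and introduces MAGR and SAE to shape rewards and exploration, balancing identity preservation against nuisance diversity.
📊 Data and experiments: Experiments on Market-1501 and MSMT17 show that closed-loop optimization improves augmentation and achieves state-of-the-art performance.
⭐ Contributions: Closes the augmentation loop, proposes the MAGR and SAE mechanisms to improve generation quality, and aligns generation directly with the needs of the ReID model, yielding more discriminative training data.
Abstract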
Person re-identification (ReID) models are sensitive to long-tail nuisances (e.g., rare viewpoints, occlusions, complex backgrounds), yet current generative augmentation is largely open-loop: prompts/conditions are sampled heuristically without verifying whether the synthesized samples improve ReID discriminability. We introduce ReasonAug, a closed-loop framework that learns an image-conditioned instruction policy for a frozen generator, turning augmentation into a sequential decision problem over instruction tokens. A Semantic Reasoning Agent (SRA) performs hierarchical planning from global semantics to identity-critical local cues, producing structured edit instructions whose utility is verified by downstream ReID feedback. To make closed-loop optimization reliable, we propose Metric-Aligned Gated Reward (MAGR), which converts metric-learning objectives into a dense reward while gating task shaping by identity preservation to prevent reward hacking, and Structure-Aware Entropy (SAE), which allocates exploration per token to lock identity-critical cues while diversifying nuisance factors. Experiments on Market-1501 and MSMT17 demonstrate state-of-the-art performance, confirming that closing the augmentation loop and learning what to generate yield more discriminative training data than open-loop alternatives.
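🎯 Motivation: Semi-supervised referring expression segmentation suffers from limited annotation and unreliable pseudo-labels, and needs more precise pixel-level language grounding.
❓ Problem: Introduces a self-evolving framework that casts pseudo-label construction as a learnable decision process, improving pseudo-label quality and easing the shortage of supervision.
🔍 Analysis: A multimodal large language model is used to extract semantic and spatial priors, combined with a hierarchical segmentation network, to address unstable multimodal signals during pseudo-label selection.
🛠️ Method: Proposes a reinforced pseudo-label selection mechanism that rewards high-utility pixel-level supervision based on multimodal priors and model predictions, jointly optimizing the segmentation model and the pseudo-labels in a loop.
📊 Data and experiments: Experiments on RefCOCO, RefCOCO+, and RefCOCOg validate the method's effectiveness and generalization.
⭐ Contributions: Designs a self-evolving semi-supervised pseudo-labeling framework that improves referring expression segmentation accuracy.
Abstract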
Semi-supervised referring expression segmentation (SS-RES) aims to achieve precise pixel-level language grounding under limited annotation, yet suffers from limited supervision and unreliable pseudo-labels when exploiting unlabeled image–text pairs. In this work, we propose Learning to Label, a reinforced self-evolving framework (L2L) that casts pseudo-label construction as a learnable decision-making process. To build foundational understanding, we leverage a multimodal large language model to extract semantic–spatial priors, which are instantiated as initial soft segmentation proposals and elevated—together with textual cues—into learnable guidance signals that condition a hierarchical segmentation network. To ensure stable learning, a reinforced pseudo-label selection is further formulated as an exploratory decision process that adaptively rewards high-utility pixel-level supervision based on multimodal priors and model predictions. This reinforced self-evolving loop enables joint optimization of the segmentation model and pseudo-labels, progressively enhancing label reliability under sparse supervision. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg datasets demonstrate improvements over existing methods, validating its effectiveness and generalization.
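🎯 Motivation: Large language models are strong at generating complex reasoning paths but remain weak at verifying their own answers, revealing an asymmetry between generation and self-verification.
❓ Problem: Studies how improving self-verification can in turn strengthen generation, and how the two capabilities can be optimized together.
🔍 Analysis: During training, improving generation does not noticeably improve self-verification, whereas learning to self-verify does improve generation and yields more efficient and accurate reasoning traces.
🛠️ Method: Proposes a multi-task reinforcement learning framework that optimizes generation and self-verification as two independent but complementary objectives through joint training.
📊 Data and experiments: Extensive experiments across benchmarks and models show gains in both generation and verification over generation-only training.
⭐ Contributions: Reveals the asymmetry between generation and self-verification and proposes a generation training framework that integrates self-verification, improving reasoning quality and efficiency.
Abstract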
Recent large language models (LLMs) achieve strong performance in generating promising reasoning paths for complex tasks. However, despite powerful generation ability, LLMs remain weak at verifying their own answers, revealing a persistent capability asymmetry between generation and self-verification. In this work, we conduct an in-depth investigation of this asymmetry throughout training evolution and show that, even on the same task, improving generation does not lead to corresponding improvements in self-verification. Interestingly, we find that the reverse direction of this asymmetry behaves differently: learning to self-verify can effectively improve generation performance, achieving accuracy comparable to standard generation training while yielding more efficient and effective reasoning traces. Building on this observation, we further explore integrating self-verification into generation training by formulating a multi-task reinforcement learning framework, where generation and self-verification are optimized as two independent but complementary objectives. Extensive experiments across benchmarks and models demonstrate performance gains over generation-only training in both generation and verification capabilities.
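🎯 Motivation: Existing diffusion large language models (dLLMs) exhibit a trade-off between accuracy and parallelism: raising the number of tokens generated per forward pass (TPF) lowers task accuracy. The speed and quality of pretrained models need to be optimized to break this limit.
❓ Problem: Optimize the speed–quality frontier of pretrained models with a reinforcement learning framework, aiming to find sampling trajectories that are both highly parallelizable and accurate rather than forcing aggressive decoding on all trajectories.
🔍 Analysis: Conventional block-wise dLLMs sacrifice task accuracy under aggressive decoding. Some trajectories can balance parallelism and task performance, but targeted optimization methods have been missing.
🛠️ Method: Proposes LightningRL, which builds on Group Relative Policy Optimization (GRPO) and adds per-reward decoupled normalization, negative log-likelihood regularization on correct trajectories, and dynamic sampling with TPF-aware filtering to improve training efficiency.
📊 Data and experiments: On math and code tasks, LightningRL maintains competitive accuracy while raising parallelism to an average TPF of 7.3, and up to 11.10 on MBPP.
⭐ Contributions: Eases the accuracy–parallelism trade-off of block-wise diffusion language models with a reinforcement learning method that improves the speed–quality frontier.
Abstract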
Diffusion Large Language Models (dLLMs) enable parallel token generation, and their block-wise variants have attracted significant attention. However, existing dLLMs usually exhibit an accuracy–parallelism trade-off, where raising tokens per forward (TPF) via aggressive parallel decoding often degrades task accuracy. To address this, we suggest developing a post-training approach to directly optimize the speed–quality frontier of pre-trained dLLMs. Conceptually, we do not require the model to decode aggressively along all sampling trajectories, but rather to find several highly parallelizable ones that can yield correct results. To this end, we resort to a reinforcement learning paradigm, i.e., LightningRL, to optimize rewards regarding both the final accuracy and inference parallelism. LightningRL follows the Group Relative Policy Optimization (GRPO) framework, with further improvements for dLLMs: 1) stabilized training via per-reward decoupled normalization, 2) token-level negative log-likelihood (NLL) loss on correct trajectories for regularization, and 3) improved training efficiency through dynamic sampling with TPF-aware filtering. Across maths and code tasks, LightningRL consistently advances the Pareto frontier, maintaining competitive accuracy while increasing parallelism to an average TPF of 7.3 (up to 11.10 on MBPP).
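One reading of "per-reward decoupled normalization" is that each reward component is standardized within the sampled group before the components are combined, so that a reward on a larger numeric scale (such as TPF) does not swamp a 0/1 accuracy reward. The sketch below follows that reading. It is an assumption about the mechanism, not LightningRL's implementation, and the unweighted sum is an illustrative choice.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_normalize(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style standardization of one reward component within a group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def decoupled_advantages(accuracy: Sequence[float], parallelism: Sequence[float]) -> List[float]:
    """Normalize each reward component separately, then add them (assumed combination)."""
    acc = group_normalize(accuracy)
    par = group_normalize(parallelism)
    return [a + p for a, p in zip(acc, par)]


# Hypothetical group of four sampled trajectories for one prompt.
accuracy_rewards = [1.0, 0.0, 1.0, 0.0]      # correct / incorrect
parallelism_rewards = [2.0, 8.0, 6.0, 4.0]   # tokens per forward pass
advantages = decoupled_advantages(accuracy_rewards, parallelism_rewards)
best = max(range(len(advantages)), key=lambda i: advantages[i])
```

In this example the third trajectory (index 2), which is correct and fairly parallel, receives the largest advantage, whereas summing the raw rewards would have favored the incorrect but most parallel second trajectory.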
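🎯 Motivation: Existing video anomaly detection methods depend on large-scale annotations or expert-designed priors, which limits their ability to acquire anomaly knowledge with minimal human intervention.
❓ Problem: Steer the model's output with a linguistically expressed experience prior, without any parameter updates, reducing reliance on large-scale manual annotation.
🔍 Analysis: Combines semantic and multimodal reasoning to improve anomaly detection both across scenarios and within specific scenarios, in line with human risk preferences.
🛠️ Method: Proposes Linguistic Relative Policy Optimization (LRPO), which distills group-relative semantic advantages into general experience and scenario experience, and refines reasoning trajectories with an anomaly alignment reward.
📊 Data and experiments: Extensive experiments on XD-Violence, UCF-Crime, and UBNormal show that LRPO significantly outperforms existing methods under tuning-free settings.
⭐ Contributions: Proposes LRPO, which combines linguistic experience generation with reward-guided optimization to achieve tuning-free video anomaly detection.
Abstract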
Video anomaly detection (VAD) with multimodal large language models has shown strong potential, yet most existing methods still depend on large-scale annotations or expert-designed priors, limiting their ability to acquire anomaly knowledge with as little human intervention as possible. To address this, we propose Linguistic Relative Policy Optimization (LRPO), which distills group-relative semantic advantages from multiple reasoning trajectories into a linguistically expressed anomaly experience prior, and adapts the model by injecting this prior into the context to steer its output distribution without any parameter updates. LRPO builds two complementary experience representations: general experience captures transferable anomaly preferences across scenarios, while scenario experience models context-dependent anomaly rules for targeted refinement. To further improve the learned experience, we introduce an anomaly alignment reward that guides trajectory optimization to match human risk preferences and reinforce temporally grounded reasoning. Extensive experiments on XD-Violence, UCF-Crime, and UBNormal demonstrate that LRPO significantly outperforms existing state-of-the-art methods under tuning-free settings.
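🎯 Motivation: Most vision-language manipulation research targets rigid robotic arms, whose fixed morphology adapts poorly to cluttered or confined spaces. Soft robotic arms are a promising alternative because they deform, but they bring challenges such as unreliable proprioception and distributed low-level actuation.
❓ Problem: Introduces the ManiSoft benchmark, whose simulator couples realistic soft-body dynamics with contact-rich interaction through an elastic force constraint, to study the challenges specific to soft arms in vision-language manipulation.
🔍 Analysis: Benchmarking shows that current policies do relatively well in clean scenes but degrade substantially under randomization. Failures stem mainly from inaccurate visual estimation of proprioceptive state and limited exploitation of deformability.
🛠️ Method: Expert trajectories are produced by a high-level planner that decomposes each task into waypoints, followed by a low-level reinforcement learning policy that generates torque commands to track them.
📊 Data and experiments: ManiSoft provides 6,300 diverse scenes with expert trajectories for policy training and evaluation, and defines four tasks that highlight different aspects of deformable control, from basic end-effector coordination to obstacle avoidance.
⭐ Contributions: Presents ManiSoft, a benchmark for vision-language manipulation with soft arms, and shows experimentally where existing policies fail under randomization, helping bridge research on rigid and soft arms.
Abstract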
Most existing vision-language manipulation research targets rigid robotic arms, whose fixed morphology limits adaptability in cluttered or confined spaces. Soft robotic arms offer an appealing alternative due to their deformability, but confront challenges such as unreliable proprioception and distributed low-level actuation. To investigate these challenges, we introduce ManiSoft, a benchmark for vision-language manipulation with soft arms. ManiSoft features a tailored simulator that couples realistic soft-body dynamics with contact-rich interactions via an elastic force constraint. On this basis, ManiSoft defines four tasks, each highlighting distinct aspects of deformable control, from basic end-effector coordination to obstacle avoidance. To support policy training and evaluation, ManiSoft includes an automated pipeline that generates 6,300 diverse scenes and corresponding expert trajectories. To produce high-quality trajectories at scale, we first employ a high-level planner to decompose each task into a sequence of waypoints, followed by a low-level reinforcement learning policy that generates torque commands to track waypoints. Benchmarking three representative policy models shows relatively promising results in clean scenes but a substantial performance drop under randomization. Visualization analysis indicates that failures stem primarily from inaccurate visual estimation of proprioceptive state and limited exploitation of deformability for adaptive obstacle avoidance. We expect ManiSoft to serve as a valuable testbed, bridging the gap between rigid and soft arms in the context of vision-language manipulation.
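🎯 Motivation: Diffusion models are expressive policy representations for reinforcement learning, but their iterative generative process incurs substantial training and inference overhead. A more efficient policy representation is needed.
❓ Problem: Addresses the high computational cost of diffusion policies in reinforcement learning by representing policies with few-step flow-based generative models.
🔍 Analysis: Experiments on MuJoCo and the DeepMind Control Suite show that MeanFlow policies considerably reduce training and inference time while maintaining performance.
🛠️ Method: Proposes Mean Flow Policy Optimization (MFPO), which optimizes MeanFlow policies under the maximum entropy RL framework via soft policy iteration and addresses two challenges: action likelihood evaluation and soft policy improvement.
📊 Data and experiments: Evaluated on multiple benchmark tasks from MuJoCo and the DeepMind Control Suite, measuring both efficiency and performance.
⭐ Contributions: Proposes the MFPO framework, which considerably reduces training and inference cost while matching or exceeding current diffusion-based baselines, giving a new option for efficient reinforcement learning.
Abstract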
Diffusion models have recently emerged as expressive policy representations for online reinforcement learning (RL). However, their iterative generative processes introduce substantial training and inference overhead. To overcome this limitation, we propose to represent policies using MeanFlow models, a class of few-step flow-based generative models, to improve training and inference efficiency over diffusion-based RL approaches. To promote exploration, we optimize MeanFlow policies under the maximum entropy RL framework via soft policy iteration, and address two key challenges specific to MeanFlow policies: action likelihood evaluation and soft policy improvement. Experiments on MuJoCo and DeepMind Control Suite benchmarks demonstrate that our method, Mean Flow Policy Optimization (MFPO), achieves performance comparable to or exceeding current diffusion-based baselines while considerably reducing training and inference time.
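🎯 Motivation: Large language and vision-language models rely on agentic memory at inference time, but existing approaches use heuristic retrieval or expensive LLM-based reranking and do not learn how to compose memory elements.
❓ Problem: Proposes MemDecoder, a learned framework that selects agentic memory elements adaptively and efficiently, addressing the weakness of existing methods in memory composition.
🔍 Analysis: Current agentic memory selection methods perform poorly on complex tasks, owing to high computational cost and a lack of task-aware optimization.
🛠️ Method: A lightweight Transformer encoder-decoder treats memory composition as autoregressive index decoding over a retrieved candidate set, and is trained with supervised fine-tuning and reinforcement learning.
📊 Data and experiments: On visual question answering, mathematical reasoning, and scientific question answering benchmarks, MemDecoder outperforms prior agentic memory selection methods, supporting both its architecture and its learning algorithm.
⭐ Contributions: Proposes an efficient method for composing agentic memory, built on index decoding and a ranking-aware variant of GRPO, that improves reasoning quality and efficiency.
Abstract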
Agentic memory—conditioning large language and vision–language models on past cases, external knowledge, or meta‑experiences—has become a key mechanism for improving inference‑time reasoning. However, existing approaches largely rely on heuristic retrieval or expensive LLM‑based reranking, and do not explicitly learn how to compose memory for a given query. To address these limitations, we propose MemDecoder, a learned framework for adaptive agentic memory selection. MemDecoder formulates memory composition as an autoregressive index decoding problem over a retrieved candidate set, using a lightweight Transformer encoder–decoder to generate an ordered sequence of memory elements. This design enables efficient, task‑aware few‑shot reasoning without generating textual demonstrations. MemDecoder can be trained via supervised fine‑tuning and reinforcement learning with verifiable rewards. We further introduce a ranking‑aware variant of Group Relative Policy Optimization that exploits pairwise comparisons within response groups to provide richer learning signals. Experiments across visual question answering, mathematical reasoning, and scientific question answering benchmarks show that MemDecoder consistently outperforms prior agentic memory selection methods, demonstrating the benefits of the architectural design and learning algorithm of MemDecoder.
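🎯 Motivation: Language agents on long-horizon tasks must make dozens of sequential decisions, and training them with reinforcement learning remains difficult. Imprecise credit assignment and low sample efficiency are the two obstacles to address.
❓ Problem: To counter the loss of learning signal caused by credit misattribution and scarce successful samples, proposes a policy learning framework that segments trajectories at milestones to sharpen credit assignment.
🔍 Analysis: Correct early actions are penalized because of later failures, and successful trajectories are so scarce that sampling is inefficient and the learning signal is almost entirely lost.
🛠️ Method: Proposes BEACON, which partitions trajectories at milestone boundaries, applies temporal reward shaping within segments to credit partial progress, and estimates advantages at two scales so that distant failures do not corrupt the evaluation of local actions.
📊 Data and experiments: On ALFWorld, WebShop, and ScienceWorld, BEACON consistently outperforms GRPO and GiGPO. On long-horizon ALFWorld tasks it reaches a 92.9% success rate against 53.5% for GRPO, and raises effective sample utilization from 23.7% to 82.0%.
⭐ Contributions: Establishes milestone-anchored credit assignment as an effective paradigm for training long-horizon language agents. Code is included in the supplementary materials and will be publicly released.
Abstract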
While long-horizon agentic tasks require language agents to perform dozens of sequential decisions, training such agents with reinforcement learning remains challenging. We identify two root causes: credit misattribution, where correct early actions are penalized due to terminal failures, and sample inefficiency, where scarce successful trajectories result in near-total loss of learning signal. We introduce a milestone-guided policy learning framework, BEACON, that leverages the compositional structure of long-horizon tasks to ensure precise credit assignment. BEACON partitions trajectories at milestone boundaries, applies temporal reward shaping within segments to credit partial progress, and estimates advantages at dual scales to prevent distant failures from corrupting the evaluation of local actions. On ALFWorld, WebShop, and ScienceWorld, BEACON consistently outperforms GRPO and GiGPO. Notably, on long-horizon ALFWorld tasks, BEACON achieves 92.9\% success rate, nearly doubling GRPO's 53.5\%, while improving effective sample utilization from 23.7\% to 82.0\%. These results establish milestone-anchored credit assignment as an effective paradigm for training long-horizon language agents. Code is in supplementary materials and will be publicly released.
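The idea of partitioning at milestone boundaries and crediting partial progress within each segment can be sketched as follows. This is a toy illustration under stated assumptions: milestone flags are given per step, a segment that ends at a milestone earns credit, and credit decays exponentially with distance to the milestone. The helper names and the `gamma` decay are hypothetical, and the dual-scale advantage estimation described in the abstract is not covered.

```python
def split_at_milestones(milestone_flags):
    """Return (start, end, reached) triples; each closed segment ends at a milestone step."""
    segments, start = [], 0
    for i, hit in enumerate(milestone_flags):
        if hit:
            segments.append((start, i, True))
            start = i + 1
    if start < len(milestone_flags):
        # Trailing steps that never reached another milestone.
        segments.append((start, len(milestone_flags) - 1, False))
    return segments


def shaped_step_rewards(milestone_flags, gamma=0.9):
    """Credit every step of a completed segment, decayed by its distance to the milestone."""
    rewards = [0.0] * len(milestone_flags)
    for start, end, reached in split_at_milestones(milestone_flags):
        if not reached:
            continue  # unfinished tail gets no shaped credit (assumption)
        for t in range(start, end + 1):
            rewards[t] = gamma ** (end - t)
    return rewards


# A trajectory that hits two milestones and then fails: early steps keep their credit.
flags = [False, True, False, False, True, False, False]
step_rewards = shaped_step_rewards(flags, gamma=0.5)
```

In this sketch the terminal failure only affects the last, unfinished segment, which is the intuition behind avoiding credit misattribution for correct early actions.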
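🎯 Motivation: Early exit neural networks work well in supervised learning but remain largely unexplored in deep reinforcement learning (DRL), where better computational efficiency is needed.
❓ Problem: How to use early exits in DRL to save computation without significantly hurting policy performance.
🔍 Analysis: Conventional DRL models are inefficient at inference; the computation an input needs depends on its complexity, so part of the computation may be redundant.
🛠️ Method: Proposes BEXA (Budgeted EXit Actor), an actor-critic architecture that adds early exit branches to the actor network and uses a constrained value-based criterion to decide when to exit, so the policy adjusts its computation to input complexity.
📊 Data and experiments: Evaluated on MuJoCo tasks with SAC and TD3; inference efficiency improves substantially with minimal or no loss in performance.
⭐ Contributions: Shows that early exits are a promising direction for improving computational efficiency in DRL while keeping performance largely intact.
Abstract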
Early exit neural networks, which adapt computation to input complexity, have proven effective in supervised learning but remain largely unexplored in deep reinforcement learning (DRL). In this paper, we propose the use of Budgeted EXit Actor (BEXA), which is a novel actor-critic architecture that integrates early exit branches into the actor network. These branches are trained via the underlying DRL method and use a constrained value-based criterion to decide when to exit, allowing the policy to dynamically adjust its computation. BEXA is general, easy to tune and compatible with any off-policy actor-critic method. We evaluate BEXA using different DRL methods such as SAC and TD3 on a suite of MuJoCo tasks. Our results demonstrate a substantial improvement in inference efficiency with minimal or no loss in performance. These findings highlight early exits as a promising direction for improving computational efficiency in DRL.
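🎯 Motivation: Regularization works well for reinforcement learning in discrete action domains, but in continuous action domains KL-entropy-regularized methods have not surpassed a strong entropy-only-regularized method.
❓ Problem: Proposes an improved reinforcement learning algorithm for continuous action domains that closes the performance gap of existing KL-entropy-regularized methods.
🔍 Analysis: Theory and experiments show that, in continuous action domains, bounding the actor's log-probability terms improves performance, and that this bounding is closely related to Advantage Learning.
🛠️ Method: Designs Mirror Descent Actor Critic (MDAC) and bounds the actor's log-probability terms in the critic's loss function so that the regularization remains effective.
📊 Data and experiments: On standard continuous-action benchmarks, explores different bounding functions and shows that MDAC outperforms strong non-regularized and entropy-only-regularized methods when the bounding functions are chosen appropriately.
⭐ Contributions: Proposes MDAC for continuous action domains, validates the bounding mechanism theoretically and empirically, and offers a new view on combining Advantage Learning with regularization.
Abstract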
Regularization is a core component of recent Reinforcement Learning (RL) algorithms. Mirror Descent Value Iteration (MDVI) uses both Kullback-Leibler divergence and entropy as regularizers in its value and policy updates. Despite its empirical success in discrete action domains and strong theoretical guarantees, the performance of KL-entropy-regularized methods does not surpass that of a strong entropy-only-regularized method in continuous action domains. In this study, we propose Mirror Descent Actor Critic (MDAC) as an actor-critic style instantiation of MDVI for continuous action domains, and show that its empirical performance is significantly boosted by bounding the actor's log-probability terms in the critic's loss function, compared to a non-bounded naive instantiation. Further, we relate MDAC to Advantage Learning by recalling that the actor's log-probability is equal to the regularized advantage function in tabular cases, and theoretically discuss when and why bounding the advantage terms is validated and beneficial. We also empirically explore effective choices for the bounding functions, and show that MDAC performs better than strong non-regularized and entropy-only-regularized methods with an appropriate choice of the bounding functions.
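🎯 Motivation: Vision-language-action models face a trade-off in action chunk length between global foresight and local precision, and a method that achieves both is needed.
❓ Problem: Proposes a mixture of horizons (MoH) strategy that exploits long-horizon foresight and short-horizon fine-grained control at the same time.
🔍 Analysis: Longer action chunks improve global foresight but degrade local precision, while shorter ones do the opposite. Existing methods struggle to balance the two.
🛠️ Method: Rearranges the action chunk into several segments with different horizons, processes them in parallel with a shared action transformer, and fuses the outputs with a light linear gate, enabling dynamic inference at low overhead.
📊 Data and experiments: Extensive experiments on flow-based and one-step regression policies show consistent and significant gains in simulation and on real-world tasks. Under the mixed-task setting, π0.5 with MoH sets a new state of the art on LIBERO with a 99% average success rate after only 30k training iterations.
⭐ Contributions: Proposes MoH, which combines long-term foresight with short-term precision, works as a plug-and-play module with minimal overhead, and supports dynamic inference with 2.5× higher throughput than baselines while preserving performance.
Abstract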
Vision-language-action models exhibit an inherent trade-off in action chunk length (``horizon''): longer horizons improve global foresight but degrade fine-grained local control, while shorter ones yield the opposite. To mitigate the trade-off, we propose a $\textbf{mixture of horizons (MoH)}$ strategy. In brief, MoH rearranges the action chunk into several segments with different horizons, processes them in parallel with a shared action transformer, and fuses outputs with a light linear gate. It offers three appealing benefits. i) Long-term foresight and short-term precision are jointly exploited within a single model. ii) MoH is plug-and-play for full-attention action modules with minimal training or inference overhead. iii) MoH enables dynamic inference with adaptive horizons, which selects stable actions through cross-horizon consensus, achieving 2.5$\times$ higher throughput than baselines while preserving superior performance. Extensive experiments over flow-based and one-step regression policies demonstrate that MoH yields consistent and significant gains on both simulations and real-world tasks. Notably, under mixed-task setting, $\pi_{0.5}$ with MoH reaches a new state-of-the-art with 99\% average success rate on LIBERO after only $30k$ training iterations.
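The fusion step of a mixture of horizons can be illustrated with a minimal sketch. It assumes that each horizon head predicts actions for the first `h` steps of the chunk, that a step is fused only across the horizons that cover it, and that the gate is a softmax over one scalar logit per horizon. Actions are scalars for brevity, and `fuse_horizons` and its argument layout are hypothetical, not the paper's module.

```python
import math


def fuse_horizons(predictions, gate_logits):
    """Fuse per-horizon action predictions into one chunk.

    predictions: {horizon: [action at step t for t in range(horizon)]}
    gate_logits: {horizon: scalar logit} (assumed gate parameterization)
    """
    max_h = max(predictions)
    fused = []
    for t in range(max_h):
        active = [h for h in predictions if t < h]  # horizons that cover step t
        weights = [math.exp(gate_logits[h]) for h in active]
        total = sum(weights)
        fused.append(sum(w * predictions[h][t] for w, h in zip(weights, active)) / total)
    return fused


# Two heads: a short horizon (2 steps) and a long horizon (4 steps), equal gate logits.
chunk = fuse_horizons({2: [1.0, 1.0], 4: [3.0, 3.0, 3.0, 3.0]}, {2: 0.0, 4: 0.0})
```

With equal logits the first two steps average both heads, while the last two steps fall back to the long-horizon head alone.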
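🎯 Motivation: Model routing picks a suitable language model for each query and can significantly reduce inference cost while maintaining high accuracy. Existing routers, however, adapt poorly to new models or changing budget constraints.
❓ Problem: Proposes SCOPE, a scalable and controllable routing framework, to address the fixed model pool and limited dynamic decision-making of current routers.
🔍 Analysis: Traditional routers choose from a fixed set of models and cannot flexibly adjust the cost-performance trade-off for new models or new problems.
🛠️ Method: Trained with reinforcement learning, SCOPE predicts a model's cost and performance by retrieving how it behaves on similar problems, turning routing into a dynamic decision problem in which users control the accuracy-cost trade-off.
📊 Data and experiments: SCOPE boosts accuracy by up to 25.7% when performance is the priority and cuts cost by up to 95.1% when efficiency matters most, adapting flexibly to user needs.
⭐ Contributions: Extends model routing from a fixed selection to a dynamic decision, supports unseen models, and improves the balance between performance and cost.
Abstract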
Model routing chooses which language model to use for each query. By sending easy queries to cheaper models and hard queries to stronger ones, it can significantly reduce inference cost while maintaining high accuracy. However, most existing routers treat this as a fixed choice among a small set of models, which makes them hard to adapt to new models or changing budget constraints. In this paper, we propose SCOPE (Scalable and Controllable Outcome Performance Estimator), a routing framework that goes beyond model selection by predicting their cost and performance. Trained with reinforcement learning, SCOPE makes reasoning-based predictions by retrieving how models behave on similar problems, rather than relying on fixed model names, enabling it to work with new, unseen models. Moreover, by explicitly predicting how accurate and how expensive a model will be, it turns routing into a dynamic decision problem, allowing users to easily control the trade-off between accuracy and cost. Experiments show that SCOPE is more than just a cost-saving tool. It flexibly adapts to user needs: it can boost accuracy by up to **25.7\%** when performance is the priority, or cut costs by up to **95.1\%** when efficiency matters most.
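Once accuracy and cost are predicted explicitly, the user-controlled trade-off reduces to a simple decision rule. The sketch below assumes a linear utility, predicted accuracy minus `cost_weight` times predicted cost; the utility form, the `route` function, and the model names are illustrative assumptions and not SCOPE's actual interface.

```python
def route(predictions, cost_weight):
    """Pick the model with the best predicted accuracy-cost utility.

    predictions: {model_name: (predicted_accuracy, predicted_cost)}
    cost_weight: user-chosen penalty on cost; 0 means accuracy only.
    """
    return max(predictions, key=lambda m: predictions[m][0] - cost_weight * predictions[m][1])


# Hypothetical predictions for one query (cost in arbitrary units).
predicted = {"small-model": (0.70, 0.1), "large-model": (0.90, 1.0)}
accuracy_first = route(predicted, cost_weight=0.0)   # selects "large-model"
budget_aware = route(predicted, cost_weight=0.5)     # selects "small-model"
```

Because the rule only consumes predictions, a model unseen during training can be added to `predicted` without changing the router itself.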
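🎯 Motivation: Existing robotic manipulation policies handle different manipulation phases together and do not separate coarse relocation from fine interaction; this conflation limits precision and efficiency.
❓ Problem: Proposes a two-phase architecture that decouples moving from operating to improve the precision and learning efficiency of manipulation, in line with human motor patterns.
🔍 Analysis: Human motion shows a clear phase structure between coarse positioning and contact-critical interaction, which can be identified explicitly from contextual cues.
🛠️ Method: A vision-language-action framework in which a learnable phase selector routes between two expert policies, with phase labels generated automatically by an MLLM-based pipeline.
📊 Data and experiments: On the RoboTwin2 benchmark the method reaches a 68.9% average success rate, 24% above the monolithic π0 baseline; it matches or exceeds models trained on 10× more data and reaches peak performance in 40% fewer training steps.
⭐ Contributions: Proposes a structured, phase-decoupled manipulation framework that improves precision and efficiency and offers an efficient learning strategy for high-precision tasks.
Abstract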
We present Move-Then-Operate, a vision-language-action framework that explicitly decouples robotic manipulation into two distinct behavioral phases: coarse relocation (move) and contact-critical interaction (operate). Unlike monolithic policies that conflate these heterogeneous regimes, our architecture employs a dual-expert policy routed by a learnable phase selector, introducing a structural inductive bias that isolates phase-specific dynamics. Phase labels are automatically generated via an MLLM-based pipeline conditioned on lightweight contextual cues such as end-effector velocity and subtask decomposition to ensure alignment with human motor patterns. Evaluated on the RoboTwin2 benchmark, our method achieves an average success rate of $68.9\%$, outperforming the monolithic $\pi_0$ baseline by +$24\%$. It matches or exceeds models trained on $10\times$ more data and reaches peak performance in $40\%$ fewer training steps, demonstrating that architectural disentanglement of move and operate phases is a highly effective and efficient strategy for mastering high-precision manipulation.
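🎯 Motivation: RL post-training with GRPO works well on individual reasoning tasks but does not deliver stable performance across multiple tasks, which hurts reliability in deployment.
❓ Problem: Addresses imbalanced task optimization in multi-task GRPO and the signal distortion caused by zero gradients, aiming for balanced progress and reliable performance across tasks.
🔍 Analysis: In a straightforward multi-task adaptation of GRPO, some tasks dominate optimization while others stagnate; tasks also differ in how often prompts yield zero advantages, which worsens the imbalance.
🛠️ Method: Proposes MT-GRPO, which dynamically adapts task weights to optimize worst-task performance and introduces a ratio-preserving sampler so that task-wise policy gradients reflect the adapted weights.
📊 Data and experiments: In 3-task and 9-task settings, MT-GRPO improves worst-task accuracy by 16–28% and 6% (absolute) over standard GRPO and DAPO, respectively, and needs 50% fewer training steps to reach 50% worst-task accuracy in the 3-task setting.
⭐ Contributions: Enables reliable multi-task LLM reasoning with the MT-GRPO algorithm, which outperforms baselines on worst-task accuracy and training efficiency.
Abstract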
RL-based post-training with GRPO is widely used to improve large language models on individual reasoning tasks. However, real-world deployment requires reliable performance across diverse tasks. A straightforward multi-task adaptation of GRPO often leads to imbalanced outcomes, with some tasks dominating optimization while others stagnate. Moreover, tasks can vary widely in how frequently prompts yield zero advantages (and thus zero gradients), which further distorts their effective contribution to the optimization signal. To address these issues, we propose a novel Multi-Task GRPO (MT-GRPO) algorithm that (i) dynamically adapts task weights to explicitly optimize worst-task performance and promote balanced progress across tasks, and (ii) introduces a ratio-preserving sampler to ensure task-wise policy gradients reflect the adapted weights. Experiments on both 3-task and 9-task settings show that MT-GRPO consistently outperforms baselines in worst-task accuracy. In particular, MT-GRPO achieves 16–28\% and 6\% absolute improvement on worst-task performance over standard GRPO and DAPO, respectively, while maintaining competitive average accuracy. Moreover, MT-GRPO requires 50\% fewer training steps to reach 50\% worst-task accuracy in the 3-task setting, demonstrating substantially improved efficiency in achieving reliable performance across tasks.
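Adapting task weights toward the worst-performing task can be sketched with a multiplicative-weights update. This is a common device for worst-case objectives and is used here purely as an assumption; the abstract does not state MT-GRPO's update rule. The function name, the `step_size` parameter, and the task names are hypothetical, and the ratio-preserving sampler is not covered.

```python
import math


def update_task_weights(weights, task_accuracy, step_size=1.0):
    """Shift weight toward tasks with low accuracy, then renormalize.

    weights: {task: current weight}, summing to 1
    task_accuracy: {task: recent accuracy in [0, 1]}
    """
    scaled = {t: w * math.exp(-step_size * task_accuracy[t]) for t, w in weights.items()}
    total = sum(scaled.values())
    return {t: s / total for t, s in scaled.items()}


# Three hypothetical tasks starting from uniform weights.
weights = {"math": 1 / 3, "code": 1 / 3, "logic": 1 / 3}
accuracy = {"math": 0.8, "code": 0.5, "logic": 0.2}
new_weights = update_task_weights(weights, accuracy)
```

After one update the weakest task (`logic`) carries the largest weight, so a weighted objective would devote more of the optimization signal to it.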
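🎯 Motivation: Visual outcomes are increasingly central to multimodal large language models, so reliable and fine-grained verification is essential for scaling generalist foundation models.
❓ Problem: Studies multimodal meta-verification, focusing on how to incorporate verifier-generated rationales into the training of multimodal verifiers to make verification more reliable and interpretable.
🔍 Analysis: Symbolic verifier outputs (e.g., bounding boxes) outperform textual explanations as meta-verification rationales, and decoupling the reinforcement learning objectives for binary judgment and meta-verification works better than joint optimization.
🛠️ Method: Trains OmniVerifier-M1, a verifier that uses symbolic meta-verification and decoupled reinforcement learning objectives for robust verification and fine-grained error localization, and that supports dynamic region-level self-correction.
📊 Data and experiments: Experiments indicate that OmniVerifier-M1 provides robust, interpretable, and fine-grained verification on multimodal verification tasks.
⭐ Contributions: Proposes an interpretable, fine-grained multimodal verification framework that supports safer and more controllable foundation model deployment.
Abstract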
Visual outcomes are increasingly central to multimodal large language models, making reliable and fine-grained verification essential for scaling generalist foundation models. In this work, we investigate ***multimodal meta-verification***, which leverages verifier-generated rationales rather than decision-only signals, and explore how to effectively incorporate meta-verification feedback into multimodal verifier training. We identify two key findings. First, symbolic verifier outputs (e.g., bounding boxes) outperform textual explanations as meta-verification rationales, enabling efficient rule-based reinforcement learning rewards while avoiding reliance on model-based rewards from auxiliary judge models. Second, decoupling reinforcement learning objectives for binary judgment and meta-verification substantially outperforms joint reward optimization, due to intrinsic differences in output structure and learning dynamics. Based on these insights, we train **OmniVerifier-M1**, a generalist visual verifier leveraging symbolic meta-verification and decoupled reinforcement learning. OmniVerifier-M1 provides robust verification and fine-grained error localization, and further enables **M1-TTS**, a verifier-driven agentic generation system achieving dynamic region-level self-correction. This approach paves the way for more reliable, interpretable, and fine-grained multimodal verification, supporting safer and more controllable foundation model deployment.
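🎯 Motivation: Reinforcement learning with verifiable rewards improves LLM reasoning, but sparse terminal rewards make it sample-inefficient. Prior work adds natural language critiques yet does not tie the critique to the reward it is supposed to improve.
❓ Problem: Treats natural language critique generation as a bilevel problem, optimizing the critique for its effect on the actor's verified reward to improve model performance and sample efficiency.
🔍 Analysis: When critique generation is fixed or auxiliary, correct-sounding feedback may fail to raise verified reward; the critic and the actor need to be coupled more tightly.
🛠️ Method: Proposes Bilevel Natural Language Actor-Critic (Bi-NAC), which formalizes the coupling as a Stackelberg bilevel program and jointly trains a critic to generate reward-improving feedback and an actor to exploit it.
📊 Data and experiments: On MATH-500, MBPP, and GPQA, Bi-NAC improves sample and parameter efficiency, letting smaller models beat larger baselines (for example, the 2B model reaches 46.6% on MATH-500 against 41.4% for the 3B GRPO baseline).
⭐ Contributions: Provides an efficient way to align critique generation with policy optimization through a bilevel formulation, improving performance on complex reasoning tasks and the use of model capacity.
Abstract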
Reinforcement learning with verifiable rewards can improve LLM reasoning, but learning is sample-inefficient under sparse terminal rewards. Prior work mitigates this by adding natural language critiques, yet it typically treats critique generation as fixed or auxiliary, so correct-sounding feedback may not translate into higher verified reward. We argue that natural language actor-critic for reasoning is inherently bilevel: the usefulness of the critique is defined by its downstream effect on the actor after adaptation. We formalize this coupling as a Stackelberg bilevel program and derive Bilevel Natural Language Actor-Critic (Bi-NAC), which jointly trains a critic to generate reward-improving feedback and an actor to exploit it. Across reasoning benchmarks, Bi-NAC improves sample and parameter efficiency over RL baselines and fixed-critic feedback methods. We perform experiments on MATH-500, MBPP, and GPQA demonstrating that Bi-NAC significantly enhances parameter and sample efficiency, enabling smaller models to outperform larger baselines. Specifically, our 2B model consistently outperforms the larger 3B GRPO baseline across all tasks (e.g., 46.6% vs. 41.4% on MATH-500), while our 6B model surpasses the 7B GRPO baseline (e.g., 49.3% vs. 43.6% on GPQA). These results show that aligning actor and critic via bilevel formulation provides a robust and efficient alternative for solving complex reasoning tasks.
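🎯 Motivation: Multimodal emotion reasoning infers emotions and textual explanations from visual, acoustic, and linguistic signals, but current emotion-oriented models perceive multiple modalities unreliably: they underuse multimodal cues and hallucinate across modalities.
❓ Problem: Proposes a reinforcement learning framework that explicitly optimizes multimodal perception, addressing both the underuse of multimodal cues and cross-modal hallucination.
🔍 Analysis: Existing models underutilize multimodal information in their reasoning trajectories and often produce modality-specific statements that are hallucinated from other modalities.
🛠️ Method: Proposes OPPO, which combines an Omni-Perception Reward, decomposing ground-truth reasoning into fine-grained cues and rewarding trajectories that recover them, with an Omni-Perception Loss that applies a KL penalty to suppress cross-modal hallucination.
📊 Data and experiments: Introduces MEP-Bench, a diagnostic benchmark that quantifies utilization and faithfulness. OPPO achieves state-of-the-art performance on MER-UniBench and substantially improves both scores on MEP-Bench.
⭐ Contributions: Proposes a reinforcement learning framework that explicitly optimizes multimodal perception for emotion reasoning, and builds a diagnostic benchmark to support evaluation in this area.
Abstract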
Recent Omni-MLLMs are driving a paradigm shift in multimodal emotion recognition from label-only prediction toward *Multimodal Emotion Reasoning* (MER), where models output both emotions and textual explanations grounded in visual, acoustic, and linguistic signals. However, we show that current emotion-oriented Omni-MLLMs still lack *reliable omni-modal perception*: they (i) underutilize multimodal cues in their reasoning trajectories and (ii) exhibit unfaithful behavior, often hallucinating modality-specific statements from other modalities. Building on these insights, we propose **OPPO** (**O**mni-**P**erception **P**olicy **O**ptimization), a reinforcement learning framework that explicitly optimizes multimodal perception. First, an Omni-Perception Reward decomposes ground-truth reasoning into fine-grained visual, acoustic, and emotion cues and rewards trajectories that semantically recover these cues. Second, an Omni-Perception Loss compares the policy under full and unimodally masked inputs, applying a KL penalty only to modality-specific evidence tokens to suppress cross-modal hallucination. We further introduce *MEP-Bench*, a diagnostic benchmark that quantifies *utilization* and *faithfulness*. Experiments show that OPPO achieves state-of-the-art performance on MER-UniBench and substantially improves utilization and faithfulness scores on MEP-Bench, highlighting the importance of sufficient and faithful omni perception for multimodal emotion reasoning. The code is provided in the Supplementary Materials.
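🎯 Motivation: Tool calling is central to how modern large language model agents extend their knowledge and skills, but its effectiveness and efficiency have not been studied systematically.
❓ Problem: Studies how sensitive tool-calling evaluation is and how computationally efficient reinforcement learning training for tool calling can be.
🔍 Analysis: Seemingly minor, often undocumented implementation choices (random seed, system prompt, multi-turn template construction, and others) significantly affect evaluation results and make leaderboards unreliable. Standard reinforcement learning suffers from sparse learning signal and costly optimization.
🛠️ Method: Proposes two techniques that make RL-based tool-calling training more efficient, substantially reducing wall-clock training time without degrading performance.
📊 Data and experiments: Systematically analyzes multi-turn tool-calling evaluation pipelines and verifies the efficiency gains of the proposed techniques experimentally.
⭐ Contributions: Reveals the sensitivity of tool-calling evaluation and the need for standardization, and improves the efficiency of RL training for tool calling.
Abstract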
Tool-calling is a central component of modern large language model (LLM) agents, equipping them with skills beyond their parametric knowledge. This paper studies tool-calling along two complementary axes: \textbf{effectiveness}, i.e., how this capability is \textit{measured}, and \textbf{efficiency}, i.e., how it is \textit{learned}. On effectiveness, we systematically analyze tool-calling evaluation pipelines and show that results can be highly sensitive to seemingly minor, often undocumented implementation choices, including the random seed, system prompt, multi-turn template construction, and how prior interaction/reasoning history is carried forward. These choices can lead to substantial differences in reported performance, especially in multi-turn settings, where leaderboard rankings are unreliable without rigorous standardization. On efficiency, we examine standard reinforcement learning (RL) for tool-calling and identify two sources of computational waste: (i) during rollouts, many prompts produce no learning signal, and (ii) during policy updates, optimization incurs high computational cost. Guided by these findings, we introduce two techniques that accelerate RL-based tool-calling training, achieving substantial wall-clock speedup without degrading performance.
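🎯 Motivation: Tool-integrated reinforcement learning (TIRL) lets large language models perform multi-step reasoning by interacting with external tools. Existing GRPO-based approaches such as Search-R1, however, suffer from training collapse, which hinders wider use.
❓ Problem: Identifies and analyzes Lazy Likelihood Displacement (LLD), a reduction or stagnation of response likelihood, as the core mechanism behind training collapse, and proposes a regularizer to stabilize training.
🔍 Analysis: LLD triggers a self-reinforcing loop (the LLD Death Spiral) in which declining likelihood leads to low-confidence responses and inflated gradients, ending in collapse. Experiments reveal a three-phase trajectory: early stagnation, steady decay, and accelerated collapse.
🛠️ Method: Proposes the LLDS regularizer, which activates only when a response's likelihood decreases and regularizes only the tokens responsible, stabilizing training with minimal interference.
📊 Data and experiments: Validated on seven benchmarks with Qwen2.5-3B and Qwen2.5-7B, with relative improvements of 45.2% and 37.1% over vanilla GRPO training, respectively.
⭐ Contributions: Identifies and systematically analyzes LLD, a previously overlooked bottleneck, and proposes LLDS to stabilize TIRL training, providing a practical path toward scalable tool-integrated reinforcement learning.
Abstract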
Tool-integrated (TI) reinforcement learning (RL) enables large language models (LLMs) to perform multi-step reasoning by interacting with external tools such as search engines and retrievers. Group Relative Policy Optimization (GRPO), exemplified by the recent Search-R1, offers fast convergence and a value-free formulation that makes it appealing for this setting, yet consistently suffers from training collapse. We identify Lazy Likelihood Displacement (LLD), a systematic reduction or stagnation in the likelihood of both correct and incorrect responses, as the core mechanism driving this failure. LLD emerges early and triggers a self-reinforcing LLD Death Spiral, where declining likelihood leads to low-confidence responses, inflating gradients, and ultimately causing collapse. We empirically characterize this process across models on a Search-R1-style, search-integrated question answering task, revealing a consistent three-phase trajectory: early stagnation, steady decay, and accelerated collapse. To address this, we propose a likelihood-preserving regularization, LLDS, that activates only when a response action’s likelihood decreases, and regularizes only the tokens responsible. This fine-grained structure mitigates LLD with minimal interference. Our method stabilizes training, prevents gradient explosion, and yields substantial performance improvements across seven benchmarks, including relative improvements of +45.2% on Qwen2.5-3B and +37.1% on Qwen2.5-7B over vanilla GRPO training. Our results establish LLD as a previously overlooked bottleneck in GRPO-based TIRL and provide a practical path toward stable, scalable training of tool-integrated RL.
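The two-level gating described for LLDS (activate only when the response's likelihood drops, then penalize only the tokens responsible) can be sketched as follows. The penalty form, a sum of per-token log-probability decreases, is an assumption made for illustration, as is the function name `llds_penalty`; the abstract does not give the exact loss.

```python
def llds_penalty(old_logprobs, new_logprobs):
    """Likelihood-preserving penalty for one response (illustrative form).

    old_logprobs / new_logprobs: per-token log-probabilities of the same
    response under the previous and the current policy.
    """
    if sum(new_logprobs) >= sum(old_logprobs):
        return 0.0  # response likelihood did not decrease: regularizer stays inactive
    # Penalize only the tokens whose log-probability went down.
    return sum(o - n for o, n in zip(old_logprobs, new_logprobs) if n < o)


# Response likelihood drops (-4.0 -> -6.0): only the 2nd and 3rd tokens are penalized.
active = llds_penalty([-1.0, -2.0, -1.0], [-0.5, -4.0, -1.5])
# Response likelihood rises (-4.0 -> -3.5): no penalty, even though one token dropped.
inactive = llds_penalty([-1.0, -2.0, -1.0], [-0.5, -2.5, -0.5])
```

The response-level gate is what keeps interference minimal in this sketch: tokens that lose probability inside a response whose overall likelihood improved are left alone.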
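🎯 Motivation: Sparse binary rewards make reinforcement learning inefficient and unstable; overcoming this would improve the reasoning ability of large language models.
❓ Problem: Existing methods penalize deviations from a reference policy indiscriminately, which can suppress gains. This work resolves the issue by decoupling optimization direction from update magnitude.
🔍 Analysis: Conventional constraints hold the model back when it tries to outperform the reference policy, and offer no mechanism to speed up correction when it underperforms, so convergence is inefficient.
🛠️ Method: Proposes One-Way Policy Optimization (OWPO), which separates update direction from magnitude through asymmetric reweighting: it accelerates alignment when the policy lags behind the reference, locks in gains when the policy surpasses it, and iteratively updates the reference policy to produce a "ratchet effect" that consolidates improvements.
📊 Data and experiments: Experiments show that OWPO outperforms strong baselines, including DAPO, OPD, and MOPD, and evolves continuously without an external reference model.
⭐ Contributions: Proposes OWPO, which breaks the bottleneck of fixed priors and enables continuous self-evolution with stable performance gains.
Abstract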
Reinforcement Learning with Verifiable Rewards (RLVR) has become a promising paradigm for scaling reasoning capabilities of Large Language Models (LLMs). However, the sparsity of binary verifier rewards often leads to low efficiency and optimization instability. To stabilize training, existing methods typically impose token-level constraints relative to a reference policy. We identify that such constraints penalize deviations indiscriminately; this can flip verifier-determined direction when the policy attempts to outperform the reference, thereby suppressing gains. To resolve this, we propose **One-Way Policy Optimization (OWPO)**, a method based on the principle of decoupling optimization direction from update magnitude. In OWPO, the verifier dictates the update direction, while the reference policy serves only to adjust the magnitude. Specifically, OWPO applies asymmetric reweighting: it performs **Accelerated Alignment** for Inferior deviations (where the policy lags behind the reference) and **Gain Locking** for Superior deviations (where the policy surpasses the reference). Furthermore, by incorporating iterative reference updates, OWPO creates a ``Ratchet Effect'' that continuously consolidates gains. Experimental results demonstrate that OWPO outperforms strong baselines, including DAPO, OPD, and MOPD, breaking the bottleneck of fixed priors to enable continuous self-evolution without reliance on external reference models.
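🎯 Motivation: Domain adaptation usually takes two training stages, supervised fine-tuning followed by reinforcement learning, but the initial supervised fine-tuning can constrain exploration and adds training cost.
❓ Problem: Proposes direct reinforcement learning without supervised fine-tuning to avoid the inefficiency of multi-stage training and the distribution contraction it causes.
🔍 Analysis: Pilot studies show that, without prior supervised fine-tuning, reinforcement learning struggles to acquire off-policy domain knowledge and behaviors from scratch.
🛠️ Method: Designs OnePO, a one-stage policy optimization method with an Adaptive Objective Evolution mechanism for rapid knowledge injection and a Teacher Retirement mechanism that prevents anchoring to off-policy references.
📊 Data and experiments: With Qwen3-8B-Base and only 20K samples, OnePO scores 67.2 on HealthBench and is competitive on other benchmarks.
⭐ Contributions: Proposes an SFT-free paradigm for domain adaptation that removes the multi-stage pipeline and shows that a single reinforcement learning stage can produce high-performance domain experts.
Abstract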
Domain adaptation transforms general-purpose LLMs into specialized experts for specific domains or tasks. This process typically follows a two-stage recipe: first, Supervised Fine-Tuning (SFT) to inject domain knowledge or induce specific behaviors (e.g., reasoning patterns), followed by Reinforcement Learning (RL) for self-improvement. However, *does RL truly require pre-SFT as a cold-start phase?* We argue that pre-SFT is inherently problematic: (1) it indiscriminately reinforces knowledge and behaviors from references regardless of whether the LLM has already acquired them, leading to distribution contraction that constrains subsequent exploration; (2) it introduces substantial overhead in multi-stage training and data curation. While our pilot studies reveal that, without pre-SFT, RL struggles to acquire off-policy knowledge from scratch, we bridge this gap with **One-stage Policy Optimization (OnePO)**. OnePO is an SFT-free paradigm that enables LLMs to selectively internalize off-policy knowledge and behaviors directly during RL evolution. Crucially, we design an **Adaptive Objective Evolution** mechanism for rapid knowledge injection and a **Teacher Retirement** mechanism that prevents off-policy anchoring. Experiments demonstrate that OnePO successfully transforms the Qwen3-8B-Base model into a high-performance medical LLM in one RL stage, achieving competitive performance on HealthBench (67.2) and other benchmarks using only 20K samples. This highlights that SFT-free RL can efficiently cultivate domain experts without the need for traditional multi-stage pipelines.
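🎯 Motivation: Conventional reinforcement learning for visual generation uses sample-wise rewards, which invites reward hacking, reduces image diversity, and introduces visual anomalies.
❓ Problem: Fine-tunes generative models with distribution-wise rewards to counter mode collapse and improve alignment with the real data distribution.
🔍 Analysis: Optimizing samples independently drives them all in the same direction and hurts diversity; a distribution-wise reward takes a global view and mitigates this.
🛠️ Method: Introduces a subset-replace strategy that computes distribution-wise rewards efficiently, and applies RL to optimize post-hoc model merging coefficients, which may mitigate train-inference inconsistency.
📊 Data and experiments: The method significantly improves FID-50K across base models, from 8.30 to 5.77 for SiT and from 3.74 to 3.52 for EDM2, while improving perceptual quality and preserving diversity.
⭐ Contributions: Develops a distribution-wise reward framework that addresses mode collapse in visual generation, improves quality and diversity, and provides a scalable optimization strategy.
Abstract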
Conventional reinforcement learning strategies for visual generation typically employ sample-wise reward functions, yet this practice frequently results in reward hacking that degrades image diversity and introduces visual anomalies. To address these limitations, we present a novel framework that finetunes generative models using distribution-wise rewards, ensuring better alignment with real-world data distributions. Unlike rewards that evaluate samples individually, distribution-wise reward accounts for the data distribution of the samples, mitigating the mode collapse problem that occurs when all samples optimize towards the same direction independently. To overcome the prohibitive computational cost of estimating these rewards, we introduce a subset-replace strategy that efficiently provides reward signals by updating only a small subset of a generated reference set. Additionally, we apply RL to optimize post-hoc model merging coefficients, potentially mitigating the train-inference inconsistency caused by introducing stochastic differential equation (SDE) in regular RL practices. Extensive experiments show our approach significantly improves FID-50K across various base models, from 8.30 to 5.77 for SiT and from 3.74 to 3.52 for EDM2. Qualitative evaluation also confirms that our method enhances perceptual quality while preserving sample diversity.
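🎯 Motivation: RL-based multimodal large language models perform poorly on camouflaged object detection (COD), because objects blend into the background and make multi-object matching and localization hard.
❓ Problem: Address multi-object matching, the harmful effect of low-quality samples, and the mislocalization of visual distractors.
🔍 Analysis: In COD the target is hard to separate from the background, and low-quality samples and distractors with similar textures make this harder.
🛠️ Method: PMSPO, built on progressive matching and semantic-aware policy optimization: a Sinkhorn multi-object matching IoU reward, Positive Learning Gain Filtering (PLGF) to curate high-quality samples, and semantic contrastive reward rules derived from deep visual features to calibrate target-background semantics.
📊 Data and experiments: On COD benchmarks, the method achieves state-of-the-art results among RL methods across all evaluation metrics.
⭐ Contributions: Combines curriculum learning with semantic-aware policy optimization for COD, improving multi-object matching and target localization.
Abstract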
Reinforcement learning-based Multimodal Large Language Models (MLLMs) provide new perspectives for visual grounding, yet face significant challenges in Camouflaged Object Detection (COD) where objects blend seamlessly with backgrounds. This stems primarily from: difficulties in multi-object matching, the detrimental effects of low-quality samples, and erroneously localizing visual distractors with similar textures to true objects. We propose Progressive Matching and Semantic-aware Policy Optimization (PMSPO), a curriculum learning-based framework that employs Sinkhorn multi-object matching IoU reward during training for multi-object alignment, utilizes Positive Learning Gain Filtering (PLGF) to curate high-quality samples, and transforms deep visual features into semantic contrastive reward rules to calibrate target background semantics. Experiments on COD benchmarks demonstrate that PMSPO achieves state-of-the-art (SOTA) performance among reinforcement learning methods across all evaluation metrics.
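To make the idea of a Sinkhorn multi-object matching IoU reward concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not PMSPO's actual reward: the function names, the uniform marginals, the temperature `eps`, and the choice to score a match as transported IoU mass are all hypothetical.

```python
import math

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def sinkhorn_iou_reward(pred, gt, eps=0.05, iters=200):
    """Soft-match predictions to ground truth; return the IoU mass carried by the plan.

    Assumed design: uniform marginals (1/n per prediction, 1/m per target),
    so missing or surplus boxes dilute the reward.
    """
    n, m = len(pred), len(gt)
    if n == 0 or m == 0:
        return 0.0
    sim = [[iou(p, g) for g in gt] for p in pred]
    kernel = [[math.exp(s / eps) for s in row] for row in sim]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternate row and column scaling (Sinkhorn iterations).
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return sum(u[i] * kernel[i][j] * v[j] * sim[i][j]
               for i in range(n) for j in range(m))
```

Because the plan is computed over the full IoU matrix, the reward does not depend on the order in which boxes are predicted, which is the property a multi-object matching reward needs.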
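🎯 Motivation: In Programming-by-Example (PBE), current large language models (LLMs) struggle to capture the intent behind the examples, so synthesized programs satisfy the examples only partially or miss the target entirely.
❓ Problem: Propose a process-supervised RL method that improves how well LLMs understand example intent and handle complex PBE tasks.
🔍 Analysis: On complex PBE tasks, LLMs have trouble inferring the logic the examples encode, and the synthesis steps receive no fine-grained feedback.
🛠️ Method: Build a PBE process-supervision dataset via reasoning-tree construction, train a process reward model with preference learning, and optimize the model with PPO under a difficulty-based curriculum.
📊 Data and experiments: On representative PBE benchmarks, the method reaches an average pass rate of 56.61%, outperforming the state-of-the-art baseline by 8.73%.
⭐ Contributions: A framework for building process reward models for PBE whose fine-grained supervision improves LLM performance, together with evidence that curriculum learning helps.
Abstract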
Programming-by-Example (PBE), as a typical few-shot inductive reasoning paradigm, aims to synthesize corresponding algorithms from a set of input-output examples. Although Large Language Models (LLMs) have demonstrated strong program synthesis potential, they still remain ineffective when handling complex PBE tasks. Specifically, LLMs often struggle to accurately grasp the underlying intent of examples, resulting in synthesized programs that either partially satisfy the examples or completely deviate from the target. To address these limitations, we introduce a process-supervised reinforcement learning method that provides fine-grained feedback during the synthesis process, improving the ability of LLMs to capture the intended behavior of provided examples. Firstly, we develop a reasoning tree construction method that is used to build a PBE process supervision dataset. Subsequently, we train a process reward model through preference learning to evaluate the effectiveness of reasoning steps. Finally, we introduce a curriculum learning strategy based on the difficulty of PBE tasks, using Proximal Policy Optimization (PPO) to optimize the model. Experimental results on representative PBE benchmarks show that our approach achieves an average pass rate of 56.61\%, significantly outperforming the state-of-the-art baseline by 8.73\%.
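Training a reward model through preference learning is commonly done with a pairwise Bradley-Terry loss. The sketch below shows that standard loss on scalar step scores; whether the paper uses exactly this form is not stated in the abstract, and the function name is hypothetical.

```python
import math

def preference_loss(score_preferred, score_rejected):
    """Pairwise loss -log(sigmoid(s_preferred - s_rejected)), computed stably."""
    margin = score_preferred - score_rejected
    # -log(sigmoid(margin)) equals softplus(-margin) = log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss is small when the preferred reasoning step already scores higher and grows roughly linearly when the ranking is wrong.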
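🎯 Motivation: The rapid rise of synthetic media makes deepfake detection critical for online safety and trust, yet detection models are held back by the scarcity of large, high-quality datasets.
❓ Problem: Multimodal large language models perform poorly on deepfake detection, and their reasoning is often misaligned with the visual evidence or hallucinated.
🔍 Analysis: The explanations produced by multimodal reasoning models drift away from the image content, which limits their reliability and interpretability for deepfake detection.
🛠️ Method: Paragraph-level Relative Policy Optimization (PRPO), an RL algorithm that aligns the model's reasoning with image content at the paragraph level.
📊 Data and experiments: A reasoning-annotated deepfake detection dataset is built; PRPO improves detection accuracy by a wide margin and reaches a reasoning score of 4.55/5.0, and ablations show it outperforms GRPO.
⭐ Contributions: The PRPO algorithm, which grounds multimodal reasoning in visual evidence, plus a reasoning-annotated dataset, improving both the accuracy and the interpretability of deepfake detection.
Abstract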
The rapid rise of synthetic media has made deepfake detection a critical challenge for online safety and trust. Progress remains constrained by the scarcity of large, high-quality datasets. Although multimodal large language models (LLMs) exhibit strong reasoning capabilities, their performance on deepfake detection is poor, often producing explanations that are misaligned with visual evidence or hallucinatory. To address this limitation, we introduce a reasoning-annotated dataset for deepfake detection and propose Paragraph-level Relative Policy Optimization (PRPO), a reinforcement learning algorithm that aligns LLM reasoning with image content at the paragraph level. Experiments show that PRPO improves detection accuracy by a wide margin and achieves the highest reasoning score of 4.55/5.0. Ablation studies further demonstrate that PRPO significantly outperforms GRPO under test-time conditions. These results underscore the importance of grounding multimodal reasoning in visual evidence to enable more reliable and interpretable deepfake detection.
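🎯 Motivation: RL has strengthened LLM agents on complex tasks, but a single policy network allocates parameters unevenly: simple tasks take most of the capacity and leave too little for complex ones.
❓ Problem: Explore Mixture-of-Experts (MoE) policy networks so that capacity is reserved for complex tasks, while fixing the fragmented routing of traditional MoE.
🔍 Analysis: Traditional MoE routes per token, which scatters phase-consistent patterns across experts and undermines expert specialization.
🛠️ Method: PA-MoE, with a lightweight phase router that learns latent phase boundaries directly from the RL objective without predefined phase categories, and temporally consistent expert assignment that preserves phase-specific expertise.
📊 Data and experiments: Experiments demonstrate the effectiveness of PA-MoE, which reduces the dominance of simple tasks and strengthens performance on complex ones.
⭐ Contributions: The PA-MoE architecture, which replaces token-level routing with phase-level assignment and improves performance on complex tasks; code is released for reproduction.
Abstract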
Reinforcement learning (RL) has equipped LLM agents with a strong ability to solve complex tasks. However, existing RL methods normally use a single policy network, causing simplicity bias where simple tasks occupy most parameters and dominate gradient updates, leaving insufficient capacity for complex tasks. A plausible remedy could be employing the Mixture-of-Experts (MoE) architecture in the policy network, as MoE allows different parameters (experts) to specialize in different tasks, preventing simple tasks from dominating all parameters. However, a key limitation of traditional MoE is its token-level routing, where the router assigns each token to specialized experts, which fragments phase-consistent patterns into scattered expert assignments and thus undermines expert specialization. In this paper, we propose Phase-Aware Mixture of Experts (PA-MoE). It first features a lightweight phase router that learns latent phase boundaries directly from the RL objective without pre-defining phase categories. Then, the phase router allocates temporally consistent assignments to the same expert, allowing experts to preserve phase-specific expertise. Experimental results demonstrate the effectiveness of our proposed PA-MoE. Code is available at https://anonymous.4open.science/r/PA-MoE-576C/.
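🎯 Motivation: Study how post-training improves a base model on sequence prediction, and which properties of the base model limit that improvement.
❓ Problem: Give a theoretical account of when policy gradient methods reach optimal performance in post-training, and analyze the barrier to generalizing beyond the base model's support.
🔍 Analysis: A property of the base model called the Likelihood Quantile (LQ) governs post-training performance; as sequence length grows, going beyond the base model's support can require exponentially many reward queries.
🛠️ Method: Post-training with a process reward model avoids the curse of dimensionality in the sequence length; the analysis also covers SGD and policy gradient with adaptive learning rates.
📊 Data and experiments: Under a standard $\gamma$-margin assumption, policy gradient achieves the target likelihood with a near-optimal number of reward queries and mistakes; the results are theoretical, and the abstract reports no experiments.
⭐ Contributions: An analysis of policy gradient with outcome and process rewards that characterizes the base-model barriers in post-training and gives a theoretical basis for efficient post-training.
Abstract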
We study post-training linear autoregressive models with outcome and process rewards. Given a context $\boldsymbol{x}$, the model must predict the response $\boldsymbol{y} \in \mathcal{Y}^N$, a sequence of length $N$ that satisfies a standard $\gamma$ margin assumption extended to sequences. We prove that on test samples where the base model achieves a non-trivial likelihood $\alpha$, a variant of policy gradient (PG) can achieve likelihood $1 - \varepsilon$ with an essentially minimax optimal number of reward queries $\tilde{\mathcal{O}}((\alpha^{-1} + \varepsilon^{-1})/\gamma^2)$. However, a barrier arises for going beyond the support of the base model. We prove that the overall expected error after post-training with outcome rewards is governed by a property of the base model we call the *Likelihood Quantile* (LQ), and that variants of PG, while minimax optimal, may require a number of reward queries exponential in $N$ to go beyond this support, regardless of the pre-training algorithm. To overcome this barrier, we study post-training with a process reward model, and demonstrate how PG variants in this setting avoid the curse of dimensionality in $N$ via dependence on a token-level LQ. Along the way, we prove that under the margin condition, SGD with adaptive learning rate (LR) achieves a near optimal test error for statistical learning, and PG with adaptive LR achieves a near optimal number of mistakes for online learning while being computationally efficient whenever possible, both of which may be of independent interest.
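🎯 Motivation: Poster generation must combine visual aesthetics with information hierarchy, but existing models fall short on precise text rendering and editability.
❓ Problem: Extend poster generation from static one-shot prediction to an agentic workflow of iterative refinement, supplying the self-correction that existing methods lack.
🔍 Analysis: Text-to-image models have advanced visual synthesis, but the structured generation and multi-stage refinement that professional design requires remain unmet.
🛠️ Method: The PosterAgent framework, trained with Stage-Aware Reinforcement Learning (SARL), which decouples optimization into draft-specific and refinement-specific phases so that credit is assigned precisely to initial drafting and to each refinement action.
📊 Data and experiments: Extensive experiments show PosterAgent significantly outperforms strong baselines, indicating the potential of agentic systems in graphic design.
⭐ Contributions: Reformulates poster generation as an agentic workflow and proposes SARL to support multi-turn refinement, improving the balance between visual quality and text presentation.
Abstract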
Poster generation is a complex task demanding a harmonious integration of visual aesthetics and information hierarchy. While recent text-to-image models have advanced visual synthesis, they remain non-editable and struggle with precise text rendering. Conversely, existing layout-generation methods offer structure but typically rely on static, one-shot predictions, lacking the mechanism for self-correction essential to professional design. Inspired by the iterative workflow of human designers, we introduce PosterAgent, a novel framework that reformulates poster creation as an agentic workflow involving initial drafting followed by iterative refinement. To effectively train this multi-turn capability, we propose Stage-Aware Reinforcement Learning (SARL), which decouples the optimization into draft-specific and refinement-specific phases, ensuring precise credit assignment for both initial drafting and incremental refinement actions. Extensive experiments demonstrate that PosterAgent significantly outperforms strong baselines, validating the potential of agentic systems in graphic design.
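🎯 Motivation: Post-training flow matching models for text-to-image generation suffers from inaccurate advantage attribution, which limits existing optimization methods.
❓ Problem: Aggregate consecutive timesteps into coherent "chunks" and move policy optimization from the step level to the chunk level to reduce the impact of inaccurate advantage attribution.
🔍 Analysis: Step-level optimization is inaccurate at fine granularity, which directly hurts generation quality and preference alignment.
🛠️ Method: GCPO (Group Chunking Policy Optimization), a chunk-level RL method that optimizes the policy over temporally aggregated chunks.
📊 Data and experiments: On standard text-to-image benchmarks and preference alignment tasks, GCPO achieves up to 43% additional gains over GRPO.
⭐ Contributions: The first chunk-level policy optimization method for post-training flow matching, improving generation quality and preference alignment.
Abstract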
Recent progress in post-training flow matching for text-to-image (T2I) generation with Group Relative Policy Optimization (GRPO) has demonstrated strong potential. However, it is hindered by a critical limitation: inaccurate advantage attribution. In this work, we argue that aggregating consecutive timesteps into a coherent `chunk' and shifting the policy optimization paradigm from GRPO's step level to the chunk level can effectively mitigate the negative impact of this issue. Building on this insight, we propose Group Chunking Policy Optimization (GCPO), the first chunk-level reinforcement learning approach for post-training flow matching. Extensive experiments demonstrate that GCPO achieves superior performance on both standard T2I benchmarks and preference alignment, with up to $43\%$ additional gains over GRPO, highlighting the promise of chunk-level policy optimization.
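One plausible reading of "chunk-level" optimization is to form a single importance ratio per chunk of consecutive timesteps and apply a PPO-style clip to it. The sketch below does that in plain Python. The abstract does not specify how GCPO forms or clips its chunk ratio, so the summed log-ratio, the fixed `chunk_size`, and the function names are assumptions.

```python
import math

def chunk_ratios(new_logp, old_logp, chunk_size):
    """One importance ratio per chunk: exp of the summed per-step log-ratio."""
    ratios = []
    for start in range(0, len(new_logp), chunk_size):
        stop = start + chunk_size
        delta = sum(n - o for n, o in zip(new_logp[start:stop], old_logp[start:stop]))
        ratios.append(math.exp(delta))
    return ratios

def clipped_chunk_objective(new_logp, old_logp, advantage, chunk_size, clip=0.2):
    """Mean clipped surrogate over chunks (assumes a non-empty trajectory)."""
    terms = []
    for r in chunk_ratios(new_logp, old_logp, chunk_size):
        clipped = max(1.0 - clip, min(1.0 + clip, r))
        terms.append(min(r * advantage, clipped * advantage))
    return sum(terms) / len(terms)
```

With `chunk_size=1` this reduces to the familiar step-level clipped surrogate, which makes the step-versus-chunk comparison easy to run.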
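🎯 Motivation: Standard RL does little to improve reasoning in LoopLMs, because it assigns credit only to the final latent state, which does not match the model's multi-step latent reasoning.
❓ Problem: Propose RLTT, a framework that distributes reward across the full latent reasoning trajectory, addressing the overly sparse credit assignment of existing methods.
🔍 Analysis: GRPO does not capture the multi-step latent computation inside a LoopLM, which limits its gains on benchmarks.
🛠️ Method: RLTT provides dense, trajectory-level credit assignment without external verifiers, and can replace GRPO directly with negligible overhead.
📊 Data and experiments: Experiments use Ouro-2.6B-Thinking, evaluated on the mathematical reasoning benchmarks MATH-500, AIME24, and BeyondAIME as well as non-mathematical reasoning benchmarks.
⭐ Contributions: RLTT substantially improves accuracy on mathematical reasoning and transfers well to other domains, offering a new direction for RL in looped language models.
Abstract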
Looped Language Models (LoopLMs) perform multi-step latent reasoning prior to token generation and outperform conventional LLMs on reasoning benchmarks at smaller parameter budgets. However, attempts to further improve LoopLM reasoning with reinforcement learning have failed—standard objectives such as Group Relative Policy Optimization (GRPO) only assign credit to the final latent state, creating a fundamental mismatch with the model's internal computation. To resolve this, we introduce **RLTT (Reward Latent Thought Trajectories)**, a reinforcement learning framework which distributes reward across the full latent reasoning trajectory. RLTT provides dense, trajectory-level credit assignment without relying on external verifiers and can directly replace GRPO with negligible overhead. Across extensive experiments with Ouro-2.6B-Thinking under identical training and inference conditions, **RLTT yields substantial improvements over GRPO on challenging mathematical reasoning benchmarks, improving accuracy by +14.4\% on MATH-500, +16.6\% on AIME24, and +10.0\% on BeyondAIME**. Despite being trained exclusively on mathematics, RLTT also transfers effectively to non-mathematical reasoning benchmarks, demonstrating the effectiveness of trajectory-level credit assignment for reinforcement learning in LoopLMs.
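🎯 Motivation: Flow matching models excel at text-to-image generation, but their RL pipelines suffer from sample inefficiency and prompt overfitting, which limit performance and generalization.
❓ Problem: Address the performance loss caused by insufficient generation diversity and prompt overfitting by refining prompts inside the RL loop.
🔍 Analysis: Current models collapse on prompts that are semantically equivalent but stylistically different, and low generation diversity makes RL sample-inefficient.
🛠️ Method: The PromptRL framework adds a trainable language model as a prompt refinement agent inside the flow-based RL loop, which improves generation and reshapes the optimization dynamics.
📊 Data and experiments: State-of-the-art results on GenEval, OCR accuracy, and PickScore; on image editing, the method reaches performance comparable to multi-stage-trained models such as ReasonNet with far fewer rollouts.
⭐ Contributions: PromptRL improves prompt adaptation and optimization efficiency, reaches higher performance with fewer rollouts, and shows how much prompt design matters in RL for image generation.
Abstract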
Flow matching models (FMs) have revolutionized text-to-image (T2I) generation, with reinforcement learning (RL) serving as a critical post-training strategy for alignment with reward objectives. In this research, we show that current RL pipelines for FMs suffer from two underappreciated yet important limitations: sample inefficiency due to insufficient generation diversity, and pronounced prompt overfitting, where models memorize specific training formulations and exhibit dramatic performance collapse when evaluated on semantically equivalent but stylistically varied prompts. We present \textbf{PromptRL} (\textbf{P}rompt \textbf{M}atters in \textbf{RL} for Flow-Based Image Generation), a framework that incorporates language models (LMs) as trainable prompt refinement agents directly within the flow-based RL optimization loop. This design yields two complementary benefits: rapid development of sophisticated prompt rewriting capabilities and, critically, a synergistic training regime that reshapes the optimization dynamics. PromptRL achieves state-of-the-art performance across multiple benchmarks, obtaining scores of 0.97 on GenEval, 0.98 on OCR accuracy, and 24.05 on PickScore. Furthermore, we validate the effectiveness of our RL approach on large-scale image editing models, improving the EditReward of FLUX.1-Kontext from 1.19 to 1.43 with only 0.06 million rollouts, surpassing Gemini 2.5 Flash Image (also known as Nano Banana), which scores 1.37, and achieving comparable performance with ReasonNet (1.44), which relied on fine-grained data annotations along with a complex multi-stage training. Our extensive experiments empirically demonstrate that PromptRL consistently achieves higher performance ceilings while requiring over 2$\times$ fewer rollouts compared to naive flow-only RL.
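🎯 Motivation: Current LLM agents follow instructions passively and handle multi-turn interaction inefficiently, which falls short of real-world user needs.
❓ Problem: Balance task performance against user satisfaction in multi-turn interaction, making agents more proactive without making them burdensome.
🔍 Analysis: Passive agents cannot adapt to user intent, while over-relying on user feedback lowers satisfaction; a framework has to balance the two.
🛠️ Method: The BAO framework combines behavior enhancement, which strengthens proactive reasoning and information gathering, with behavior regularization, which suppresses inefficient or redundant interactions and aligns agent behavior with user expectations.
📊 Data and experiments: Evaluated on the UserRL benchmark suite against RL baselines and frontier agents, where it shows superior multi-turn performance.
⭐ Contributions: A general RL framework for training proactive multi-turn agents that improves both user satisfaction and task completion in complex scenarios.
Abstract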
Proactive large language model (LLM) agents aim to actively plan, query, and interact over multiple turns, enabling efficient task completion beyond passive instruction following and making them essential for real-world, user-centric applications. Agentic reinforcement learning (RL) has recently emerged as a promising solution for training such agents in multi-turn settings, allowing interaction strategies to be learned from feedback. However, existing pipelines face a critical challenge in balancing task performance with user engagement, as passive agents cannot efficiently adapt to users' intentions while overuse of human feedback reduces their satisfaction. To address this trade-off, we propose BAO, an agentic RL framework that combines behavior enhancement to enrich proactive reasoning and information-gathering capabilities with behavior regularization to suppress inefficient or redundant interactions and align agent behavior with user expectations. We evaluate BAO on multiple tasks from the UserRL benchmark suite, and demonstrate that it substantially outperforms RL baselines under controlled comparisons, while achieving comparable or even superior performance to frontier LLM agents, highlighting its effectiveness for training proactive, user-aligned LLM agents in complex multi-turn scenarios.
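🎯 Motivation: GRPO-style RL algorithms rely on heuristic trust-region approximations, which makes optimization brittle; in particular, they fail to regulate samples whose importance ratios fall outside the clipping range.
❓ Problem: Enforce trust-region constraints directly, making optimization more stable and more interpretable.
🔍 Analysis: Global importance-ratio clipping and group-wise normalization do not handle outlier samples well, so the optimization is unreliable.
🛠️ Method: The QUATRO framework derives a stable objective from an exact trust-region formulation, giving explicit control over policy updates and intrinsic entropy regulation.
📊 Data and experiments: On several mathematical reasoning benchmarks, QUATRO trains stably under high policy staleness and high learning rates.
⭐ Contributions: A directly constrained trust-region optimization scheme that is stable and interpretable, improving the robustness and efficiency of RL-based LLM fine-tuning.
Abstract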
GRPO-style reinforcement learning (RL)-based LLM fine-tuning algorithms have recently gained popularity. Relying on heuristic trust-region approximations, however, they can lead to brittle optimization behavior, as global importance-ratio clipping and group-wise normalization fail to regulate samples whose importance ratios fall outside the clipping range. We propose Query-Adaptive Trust-Region policy Optimization (QUATRO), which directly enforces trust-region constraints through a principled optimization. This yields a clear and interpretable objective that enables explicit control over policy updates and stable, entropy-controlled optimization, with a stabilizer term arising intrinsically from the exact trust-region formulation. Empirically verified on diverse mathematical reasoning benchmarks, QUATRO shows stable training under increased policy staleness and aggressive learning rates, maintaining well-controlled entropy throughout training.
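🎯 Motivation: Transformer-based visual trackers use handcrafted heatmaps as a spatial prior, but this heuristic supervision is misaligned with evaluation metrics such as IoU and AUC.
❓ Problem: Propose RELO, an RL framework that treats target localization as a decision-making problem, aligning the training objective with the evaluation criteria and removing the limits imposed by handcrafted heatmaps.
🔍 Analysis: Handcrafted heatmap supervision is only a surrogate objective; its gap from the real evaluation criteria caps model performance.
🛠️ Method: RELO performs sequence-level RL with an instantaneous IoU reward and a sequence-level AUC reward to optimize localization, with no handcrafted heatmaps.
📊 Data and experiments: On LaSOT$_\mathrm{ext}$, RELO reaches 57.5% AUC without template updates, a new state of the art.
⭐ Contributions: The RELO RL tracking framework, which improves performance by optimizing the evaluation metrics directly; code and models will be released.
Abstract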
Existing one-stream Transformer-based visual trackers localize targets by training a classification head with a handcrafted spatial prior encoded as a heatmap. However, this heuristic supervision merely serves as a surrogate objective, which misaligns with evaluation metrics such as IoU and AUC. To address this limitation, we propose RELO, a reinforcement-learning tracking framework that formulates target localization as a decision-making problem within the Transformer-based tracking paradigm. Unlike prior-driven localization learning, RELO performs sequence-level reinforcement learning to optimize localization behavior using both instantaneous IoU and sequence-level AUC rewards, better aligning the training objective with real evaluation criteria. As a result, RELO not only eliminates the need for handcrafted heatmaps, but also achieves superior performance. For instance, RELO attains 57.5\% AUC on LaSOT$_\mathrm{ext}$ without template updates, establishing a new state-of-the-art performance. Code and models will be made available.
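The sequence-level AUC that tracking benchmarks report is commonly the area under the success plot: the fraction of frames whose IoU exceeds a threshold, averaged over thresholds in [0, 1]. The sketch below computes that quantity and mixes it into per-frame IoU rewards. The 21-threshold grid follows common benchmark practice; the mixing weight `auc_weight` and both function names are hypothetical, and RELO's actual reward shaping may differ.

```python
def success_auc(ious, num_thresholds=21):
    """Area under the success plot for one sequence of per-frame IoUs."""
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(1 for x in ious if x > t) / len(ious) for t in thresholds]
    return sum(rates) / len(rates)

def sequence_rewards(ious, auc_weight=0.5):
    """Per-frame IoU reward plus a shared sequence-level AUC bonus (assumed mix)."""
    bonus = success_auc(ious)
    return [x + auc_weight * bonus for x in ious]
```

The per-frame term gives immediate feedback on each localization decision, while the shared bonus ties every frame to the metric the tracker is finally scored on.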
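🎯 Motivation: The security of current LLM watermarking is overstated by existing evaluations, so the work studies worst-case robustness and vulnerabilities.
❓ Problem: Formalize an adaptive robustness radius using a KL-divergence ball, expose how fragile watermarks are under adaptive attacks, and challenge the limits of conventional evaluation.
🔍 Analysis: Theory shows that optimizing the attack context and model parameters significantly shrinks the approximated robustness radius, leaving watermarks highly vulnerable to paraphrase attacks.
🛠️ Method: RLCracker, an RL-based adaptive attack that learns from a limited set of watermarked samples and removes watermarks while preserving semantic fidelity.
📊 Data and experiments: On 1,500-token texts, a 3B model trained on only 100 short samples achieves 98.5% removal success, far above GPT-4o's 6.75%, and the attack generalizes across model sizes and watermarking schemes.
⭐ Contributions: The adaptive robustness radius and RLCracker expose a key weakness of LLM watermarks and provide an adversarial evaluation framework that can drive more secure watermarking in practice.
Abstract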
Large language model (LLM) watermarking has shown promise in detecting AI-generated content and mitigating misuse, with prior work claiming robustness against paraphrasing and text editing. In this paper, we argue that existing evaluations are not sufficiently adversarial, obscuring critical vulnerabilities and overstating the security. To address this, we introduce the adaptive robustness radius, a formal metric that quantifies the worst-case resilience of watermarks against adaptive adversaries. By lifting the paraphrase space into a KL-divergence ball, we approximate this radius and theoretically demonstrate that optimizing the attack context and model parameters can significantly reduce the approximated radius, making watermarks highly vulnerable to paraphrase attacks. Leveraging this insight, we propose RLCracker, a reinforcement learning (RL)–based adaptive attack that erases watermarks while preserving semantic fidelity. RLCracker requires only limited watermarked examples and zero access to the detector. Despite weak supervision, it empowers a 3B model to achieve 98.5\% removal success with minimal semantic shift on 1,500-token Unigram-marked texts after training on only 100 short samples. This performance dramatically exceeds 6.75\% by GPT-4o and generalizes across five model sizes over ten watermarking schemes.
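🎯 Motivation: Gaussian-policy methods for continuous control become unstable when gradients are noisy and policy updates must be conservative.
❓ Problem: Redesign the policy network around a discretized categorical action representation to make policy optimization more robust and better performing.
🔍 Analysis: Standard implementations use relatively shallow MLP policies, which makes optimization brittle and leaves more expressive policies unexploited.
🛠️ Method: Discretized categorical actors that represent each action dimension as a distribution over bins, combined with regularized actor networks.
📊 Data and experiments: Across diverse continuous-control benchmarks, the new actor design yields consistent gains and state-of-the-art performance.
⭐ Contributions: Replacing the standard Gaussian actor with a discretized, regularized actor addresses brittle optimization and gives stable, strong results.
Abstract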
On-policy deep reinforcement learning remains a dominant paradigm for continuous control, yet standard implementations rely on Gaussian actors and relatively shallow MLP policies, often leading to brittle optimization when gradients are noisy and policy updates must be conservative. In this paper, we revisit policy representation as a first-class design choice for on-policy optimization. We study discretized categorical actors that represent each action dimension with a distribution over bins, yielding a policy objective that resembles a cross-entropy loss. Building on architectural advances from supervised learning, we further propose regularized actor networks, while keeping critic design fixed. Our results show that simply replacing the standard actor network with our discretized regularized actor yields consistent gains and achieves state-of-the-art performance across diverse continuous-control benchmarks.
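A discretized categorical actor can be sketched in a few lines: each action dimension gets its own softmax over bins, an action is decoded from the sampled bin's center, and the joint log-probability is the sum over dimensions. This is a minimal illustration, assuming independent dimensions, uniform bins on [-1, 1], and hypothetical function names; it omits the regularized network that would produce the logits.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over one action dimension's bin logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def bin_centers(num_bins, low=-1.0, high=1.0):
    """Centers of num_bins equal-width bins covering [low, high]."""
    width = (high - low) / num_bins
    return [low + (k + 0.5) * width for k in range(num_bins)]

def sample_action(logits_per_dim, rng, low=-1.0, high=1.0):
    """Sample one bin per dimension; return (action, joint log-probability)."""
    action, log_prob = [], 0.0
    for logits in logits_per_dim:
        probs = softmax(logits)
        k = rng.choices(range(len(probs)), weights=probs)[0]
        action.append(bin_centers(len(probs), low, high)[k])
        log_prob += math.log(probs[k])  # dimensions are treated as independent
    return action, log_prob
```

Since the log-probability is a sum of log-softmax terms, the policy-gradient loss takes the cross-entropy-like form the abstract mentions.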
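🎯 Motivation: Natural policy gradients improve optimization by accounting for the geometry of distribution space, but their computational cost limits practical use.
❓ Problem: Avoid explicitly building and inverting the Fisher matrix, lowering the cost of computing natural policy gradients.
🔍 Analysis: Fisher-matrix-based methods are expensive, which makes them hard to apply efficiently on complex tasks.
🛠️ Method: Randomized Advantage Transformation (RAT) computes the regularized natural gradient inside ordinary policy-gradient backpropagation using randomized block Kaczmarz iterations, with no dedicated solvers or architecture-specific approximations.
📊 Data and experiments: On continuous and visual control tasks, RAT matches or exceeds established natural-gradient methods while staying simple to implement and compatible with various architectures.
⭐ Contributions: The RAT method, with convergence guarantees and empirical validation across a range of control tasks.
Abstract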
Natural policy gradients improve optimization by accounting for the geometry of distribution space, but their practical use is limited by the cost of estimating and inverting the Fisher matrix. We present Randomized Advantage Transformation (RAT), a method for estimating Tikhonov-regularized natural policy gradients via direct backpropagation. By applying the Woodbury formula, we reformulate the regularized natural gradient as vanilla policy gradients with a transformed advantage. RAT computes this transformation efficiently via randomized block Kaczmarz iterations on on-policy mini-batches, avoiding explicit Fisher construction, conjugate-gradient solvers, and architecture-specific approximations. We provide convergence guarantees for RAT and demonstrate empirically that it matches or exceeds established natural-gradient methods across continuous and visual control benchmarks, while remaining simple to implement and compatible with various architectures.
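The reformulation in the abstract rests on a push-through identity, a consequence of the Woodbury formula. With per-sample score vectors stacked in $S$, $F = S^\top S / n$ and $g = S^\top A / n$, the identity gives $(F + \lambda I)^{-1} g = S^\top \tilde{A} / n$ where $\tilde{A} = (S S^\top / n + \lambda I)^{-1} A$. The sketch below checks this numerically in plain Python. It solves the sample-space system with direct Gaussian elimination rather than the randomized block Kaczmarz iterations RAT uses, and the empirical-Fisher setup and all names are assumptions made for illustration.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def natural_gradient_direct(scores, adv, lam):
    """Parameter-space solve: (S^T S / n + lam I)^{-1} (S^T A / n)."""
    n, d = len(scores), len(scores[0])
    fisher = [[sum(s[i] * s[j] for s in scores) / n + (lam if i == j else 0.0)
               for j in range(d)] for i in range(d)]
    grad = [sum(s[i] * a for s, a in zip(scores, adv)) / n for i in range(d)]
    return solve(fisher, grad)

def natural_gradient_via_advantage(scores, adv, lam):
    """Vanilla policy gradient S^T A~ / n with A~ = (S S^T / n + lam I)^{-1} A."""
    n, d = len(scores), len(scores[0])
    gram = [[sum(x * y for x, y in zip(scores[i], scores[j])) / n
             + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    transformed = solve(gram, adv)
    return [sum(s[i] * a for s, a in zip(scores, transformed)) / n for i in range(d)]
```

The second function never forms a parameter-sized matrix: the only linear system is batch-sized, which is what makes a transformed-advantage formulation attractive for large networks.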
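🎯 Motivation: Existing policy optimization relies on heuristic clipping, which indiscriminately truncates high-return but high-divergence updates and limits performance.
❓ Problem: Explicitly constrain the variance of the policy ratio as a more principled trust-region approximation that avoids binary hard clipping.
🔍 Analysis: Acting as a soft constraint, the method keeps critical gradient signals from novel discoveries while down-weighting stale off-policy data and still allowing its reuse.
🛠️ Method: R$^2$VPO (Ratio-Variance Regularized Policy Optimization) enforces the ratio-variance constraint through a primal-dual optimization framework.
📊 Data and experiments: Experiments cover 7 LLM scales and 10 robotic control tasks, with substantial gains on mathematical reasoning (most pronounced on smaller models), better sample efficiency, and stronger results than PPO in continuous control, particularly with sparse rewards.
⭐ Contributions: A policy optimization method based on ratio-variance regularization that gives a principled foundation for stable, data-efficient policy optimization and performs well across a broad range of tasks.
Abstract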
Standard on-policy reinforcement learning relies on heuristic clipping to enforce trust regions, but this mechanism imposes a severe cost by indiscriminately truncating high-return yet high-divergence updates. We demonstrate that explicitly constraining the *policy ratio **variance*** provides a principled local approximation to trust-region constraints, eliminating the need for binary hard clipping. By acting as a distributional ''soft brake'', this approach preserves critical gradient signals from novel discoveries while naturally down-weighting and enabling the reuse of stale, off-policy data. We introduce **R$^2$VPO** (Ratio-Variance Regularized Policy Optimization), which implements this constraint via a primal–dual optimization framework. Extensive evaluations across $7$ LLM scales, spanning both fast and slow reasoning paradigms, and $10$ robotic control tasks demonstrate the generality of the proposed approach. R$^2$VPO achieves substantial performance gains on mathematical reasoning benchmarks, with particularly pronounced improvements on smaller models, while significantly improving sample efficiency. Furthermore, it consistently outperforms PPO baselines in continuous control domains, particularly in sparse-reward and dynamic environments. Together, these findings establish ratio-variance regularization as a principled foundation for stable and data-efficient policy optimization.
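A primal-dual treatment of a ratio-variance constraint can be sketched as a Lagrangian with a non-negative multiplier that is raised whenever the variance exceeds its budget. The code below is a generic illustration of that pattern, not R$^2$VPO's actual objective: the variance budget, the step size, and the function names are assumptions.

```python
def ratio_variance(ratios):
    """Population variance of the importance ratios in a batch."""
    mean = sum(ratios) / len(ratios)
    return sum((r - mean) ** 2 for r in ratios) / len(ratios)

def lagrangian(ratios, advantages, lam, budget):
    """Surrogate return minus lam * (Var[ratio] - budget); the policy maximizes this."""
    surrogate = sum(r * a for r, a in zip(ratios, advantages)) / len(ratios)
    return surrogate - lam * (ratio_variance(ratios) - budget)

def dual_step(lam, ratios, budget, step_size=0.1):
    """Projected ascent on the multiplier: grow lam when the constraint is violated."""
    return max(0.0, lam + step_size * (ratio_variance(ratios) - budget))
```

Unlike a hard clip, the penalty scales smoothly with how far the batch of ratios has spread, so a single high-ratio, high-advantage sample still contributes gradient.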
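🎯 Motivation: Few-shot time series forecasting suffers from scarce training data and overfitting, which calls for more effective data augmentation.
❓ Problem: Design a framework that learns where and how to augment the data, reducing overfitting and improving generalization.
🔍 Analysis: Prediction diversity across a zoo of forecasting models identifies the samples most prone to overfitting; these samples guide the augmentation policy and become the focus of training.
🛠️ Method: The RL-based ReAugment framework learns transformation policies with a model-zoo-guided reward, steering the transformed data toward overfit-prone regions.
📊 Data and experiments: Experiments across forecasting architectures, in both few-shot and standard time series settings, confirm the method's effectiveness.
⭐ Contributions: ReAugment, a framework combining data augmentation with RL that reduces overfitting in few-shot time series forecasting.
Abstract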
Few-shot time series forecasting is fundamentally challenged by the scarcity of high-quality training data and the risk of severe overfitting. To address this issue, we propose ReAugment, a reinforcement learning (RL) framework that explicitly learns where and how to augment time series data. ReAugment maintains a zoo of forecasting models and measures prediction diversity across them to identify training samples that are most prone to overfitting. These samples serve as anchor points and are used as inputs to the data augmentation process. We then employ an RL approach to learn transformation policies, using a model zoo-guided reward function to bias the transformed data to overfit-prone regions of the training distribution that are most beneficial for generalization. A key advantage of the RL formulation is that it avoids backpropagating gradients through the forecasting models, thereby mitigating gradient vanishing. Experiments across diverse forecasting architectures demonstrate the effectiveness of ReAugment in both few-shot and standard time series forecasting.
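The anchor-selection step can be illustrated by scoring each training sample with the variance of the model zoo's predictions and keeping the top-scoring samples. This is a sketch under assumptions: the abstract does not define the diversity measure, so variance across models and the "higher diversity means more overfit-prone" direction are guesses, as are the function names.

```python
def prediction_diversity(zoo_predictions):
    """Variance across models for each sample; zoo_predictions[m][i] is model m on sample i."""
    num_models = len(zoo_predictions)
    diversity = []
    for i in range(len(zoo_predictions[0])):
        values = [zoo_predictions[m][i] for m in range(num_models)]
        mean = sum(values) / num_models
        diversity.append(sum((v - mean) ** 2 for v in values) / num_models)
    return diversity

def select_anchors(zoo_predictions, k):
    """Indices of the k samples on which the zoo disagrees most (assumed criterion)."""
    diversity = prediction_diversity(zoo_predictions)
    order = sorted(range(len(diversity)), key=lambda i: diversity[i], reverse=True)
    return order[:k]
```

The selected indices would then serve as inputs to the learned augmentation policy, with the remaining samples left untouched.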
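🎯 Motivation: Deep generative models struggle to learn the full data distribution in low-data, imbalanced tabular settings, which makes it hard to generate useful training data.
❓ Problem: Following recent theoretical analysis, have the generator prioritize the conditional distribution P(y|X) over the full joint distribution to gain data efficiency when data is limited.
🔍 Analysis: Learning the full distribution can be overkill and does not necessarily help downstream models; preserving the predictive signal is the more effective strategy.
🛠️ Method: ReTabSyn, a reinforcement learning pipeline that gives the synthesizer direct feedback on feature correlation preservation during training, strengthening the predictive signal.
📊 Data and experiments: Across benchmarks with small sample sizes, class imbalance, and distribution shift, ReTabSyn consistently outperforms state-of-the-art baselines.
⭐ Contributions: A tabular data generation method with reinforced feedback that prioritizes predictive signal and downstream utility, and that extends to other generation controls such as expert-specified constraints.
Abstract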
Deep generative models can help with data scarcity and privacy by producing synthetic training data, but they struggle in low-data, imbalanced tabular settings to fully learn the complex data distribution. We argue that striving for the full joint distribution could be overkill; for greater data efficiency, models should prioritize learning the conditional distribution $P(y\mid \bm{X})$, as suggested by recent theoretical analysis. Therefore, we overcome this limitation with \textbf{ReTabSyn}, a \textbf{Re}inforced \textbf{Tab}ular \textbf{Syn}thesis pipeline that provides direct feedback on feature correlation preservation during synthesizer training. This objective encourages the generator to prioritize the most useful predictive signals when training data is limited, thereby strengthening downstream model utility. We empirically fine-tune a language model-based generator using this approach, and across benchmarks with small sample sizes, class imbalance, and distribution shift, ReTabSyn consistently outperforms state-of-the-art baselines. Moreover, our approach can be readily extended to control various aspects of synthetic tabular data, such as applying expert-specified constraints on generated observations.
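One simple way to express "feedback on feature correlation preservation" is a reward that penalizes the gap between each feature's correlation with the target in the real data and in the synthetic data. The sketch below uses Pearson correlation on numeric columns. ReTabSyn's actual reward is not given in the abstract, so this formulation and the function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric columns (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def correlation_preservation_reward(real_cols, real_y, syn_cols, syn_y):
    """Negative mean absolute gap in feature-target correlation; 0 is the best score."""
    gaps = [abs(pearson(r, real_y) - pearson(s, syn_y))
            for r, s in zip(real_cols, syn_cols)]
    return -sum(gaps) / len(gaps)
```

A reward of this shape only looks at how features relate to the label, which matches the stated goal of prioritizing $P(y\mid X)$ over the full joint distribution.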
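🎯 Motivation: Analyze the role of the critic in actor-critic under entropy regularization, where it can reduce update variance and speed up convergence.
❓ Problem: Determine how an accurate critic estimate improves the stability and convergence rate of actor updates in entropy-regularized, finite, discounted environments.
🔍 Analysis: An exact critic used as a baseline is a genuine variance-reduction method; when the critic's error is sufficiently small, the variance reduction and fast convergence are preserved.
🛠️ Method: Learn the critic first and keep it up to date after each actor update, so that the critic stays accurate throughout training.
📊 Data and experiments: The abstract names no datasets or experiments; the results are stated as theoretical analysis of how critic error affects convergence.
⭐ Contributions: Establishes the central role of the critic in actor-critic methods with theoretical support, showing that variance reduction and improved sample complexity are achievable under entropy regularization.
Abstract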
In this paper, we study the role of the critic in actor-critic for entropy-regularized, finite, discounted environments. We establish that, when the critic is exact, using the latter as a baseline is an actual variance-reduction method. In this case, actor-critic with stochastic gradients matches the sample complexity of deterministic policy gradient, reaching an $\epsilon$-optimal regularized value with $\tilde{O}(\log(1/\epsilon))$ samples. In practice, the critic is learned alongside the actor: the variance of the actor update is then influenced by the critic's variance and bias. Specifically, when the critic has a sufficiently small error, the variance reduction and rapid convergence are preserved. This suggests learning the critic first, keeping it up to date after each actor update, underscoring the pivotal role of accurate critic estimation in actor-critic methods.
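The variance-reduction effect of an exact critic used as a baseline can be seen in a toy case. The sketch below enumerates a two-armed bandit with a softmax policy and computes the exact mean and variance of the score-function gradient estimator, with and without the value as baseline. It is an unregularized, single-state illustration chosen for clarity, not the entropy-regularized setting the paper analyzes, and the function name is hypothetical.

```python
def estimator_stats(p, rewards, baseline):
    """Exact mean and variance of (r_a - baseline) * d/dtheta log pi(a).

    pi picks arm 0 with probability p; theta is the logit of arm 0,
    so the score is (1 - p) for arm 0 and (-p) for arm 1.
    """
    probs = [p, 1.0 - p]
    score = [1.0 - p, -p]
    samples = [(rewards[a] - baseline) * score[a] for a in range(2)]
    mean = sum(pr * s for pr, s in zip(probs, samples))
    var = sum(pr * (s - mean) ** 2 for pr, s in zip(probs, samples))
    return mean, var
```

With `p = 0.5` and rewards `[1.0, 0.8]`, the value is `0.9`; subtracting it leaves the expected gradient unchanged while shrinking the estimator's variance, which is the behaviour the abstract attributes to an exact critic.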
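🎯 Motivation: In non-parametric Meta-RL settings, return scales differ widely across tasks, which causes severe gradient interference.
❓ Problem: Existing categorical methods normalize return scales but are held back by rigid discretization and quantization errors.
🔍 Analysis: The spread of return scales and the difficulty of aligning gradients are the core reasons tasks are optimized unevenly, which calls for modeling from a distributional view.
🛠️ Method: The Reflect-then-Correct (RTC) framework aligns task distributions geometrically using Sinkhorn divergence, and balances optimization with a recursive error model and adaptive importance weights.
📊 Data and experiments: On the Meta-World ML-10 and ML-45 benchmarks, RTC outperforms existing baselines.
⭐ Contributions: A framework combining distributional alignment with recursive error correction, with theoretical and empirical support for handling the statistical bias that arises during optimization.
Abstract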
Meta-Reinforcement Learning (Meta-RL) faces significant challenges in non-parametric settings, where vastly different return scales across diverse tasks cause severe gradient interference. Existing categorical solutions attempt to normalize these scales but often fail due to rigid discretization and quantization errors. To address this, we propose Reflect-then-Correct (RTC), a framework that models meta-values using Sinkhorn divergence. By treating distributions as adaptive floating particles, RTC achieves a geometry-aware alignment of distinct meta-task structures. However, while Sinkhorn updates harmonize gradients, they introduce statistical bias via sampling estimation. RTC overcomes this by ''reflecting'' on the temporal accumulation of Bellman inconsistencies through a recursive error model and ''correcting'' the optimization via adaptive importance weights that prioritize transitions critical for accuracy. We provide theoretical guarantees for this reweighting strategy and demonstrate that RTC outperforms existing baselines on the challenging Meta-World ML-10 and ML-45 benchmarks.
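🎯 Research motivation: Medical vision-language models (VLMs) for 3D computed tomography (CT) analysis are optimized toward objectives that diverge from clinical standards, which leads to serious clinical errors. Existing reinforcement learning frameworks favor linguistic fluency over factual accuracy.
❓ Problem addressed: Proposes CABS, a structured system that decomposes radiology reports into verifiable semantic units, together with a new optimization framework that addresses the neglect of medical facts in standard reinforcement learning.
🔍 Key observations: Standard reinforcement learning exhibits a "mechanistic divergence": rewards based on surface similarity push the model to optimize wording rather than medical facts.
🛠️ Main method: Develops the TIF-GRPO framework, which brings control theory into policy optimization. Clinical reasoning is treated as a pseudo-temporal trajectory of abnormality discovery, and an integral feedback loop regulates anatomy-aware rewards to suppress persistent omissions and evaluation hallucinations.
📊 Data and experiments: Experiments on 3D CT datasets show significant gains in abnormality detection and clinical accuracy, indicating finer-grained regulation of medical VLMs.
⭐ Main contributions: Proposes the CABS decomposition framework and an integral-feedback reward mechanism spanning the clinical reasoning process, establishing a new optimization paradigm for medical VLMs that improves abnormality detection and clinical faithfulness.
View full abstract (Abstract)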
The advancement of Medical Vision-Language Models (VLMs) for 3D Computed Tomography (CT) analysis is hindered by a misalignment between optimization objectives and clinical rigor. Current Reinforcement Learning (RL) paradigms rely on lexical proxy signals that induce ``\textbf{evaluation hallucinations}'', where models prioritize linguistic fluency over factual accuracy, leading to fatal clinical errors. To bridge this gap, we introduce the \textbf{Clinical Abnormality Benchmarking Substrate (CABS)}, a structured system that decomposes radiology reports into verifiable semantic units. Using CABS, we identify a ``\textbf{mechanistic divergence}'' in standard RL, where surface-similarity rewards drive policy gradients to bypass medical facts. We therefore propose \textbf{Trajectory-Integral Feedback GRPO (TIF-GRPO)}, a novel framework integrating control-theoretic principles into policy optimization. By formulating clinical reasoning as a pseudo-temporal trajectory for anomaly discovery, TIF-GRPO regulates anatomy-aware rewards via an integral feedback loop that penalizes persistent omissions as cumulative state errors and suppresses hallucinations as excessive control effort. Experiments on 3D CT benchmarks demonstrate that our approach significantly enhances abnormality detection and clinical faithfulness, establishing a new paradigm for fine-grained regulation in medical VLMs.
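🎯 Research motivation: For adapting foundation models to ever-changing downstream tasks, existing work focuses on data replay, model expansion, and similar techniques, while the central role of the learning paradigm is understudied.
❓ Problem addressed: Compares how two post-training paradigms, supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), retain knowledge during continual learning, targeting catastrophic forgetting and loss of generalization.
🔍 Key observations: Experiments show that SFT causes severe catastrophic forgetting, whereas RFT inherently preserves prior knowledge and achieves performance comparable to multi-task training. RFT also improves general knowledge on standard benchmarks.
🛠️ Main method: Proposes a rollout-based instance filtering algorithm for RFT (RIF-RFT) that improves training efficiency by selecting learnable samples, and shows that RFT's selective update mechanism is key to knowledge stability.
📊 Data and experiments: Experiments on multimodal downstream tasks use Qwen2.5-VL-7B-Instruct as the base model and jointly evaluate catastrophic forgetting, knowledge retention, and general capability.
⭐ Main contributions: Reveals the inherent stability of reinforcement fine-tuning in continual post-training and proposes an efficient sample filtering algorithm (RIF-RFT), providing a robust paradigm for continual learning in multimodal models.
View full abstract (Abstract)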
Continual post-training (CPT) is a popular and effective technique for adapting foundation models like multimodal large language models to ever-evolving downstream tasks. While existing research primarily focuses on methods like data replay, model expansion, or parameter regularization, the fundamental role of the learning paradigm remains largely unexplored. This paper presents a comparative analysis of two core post-training paradigms: supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), investigating their respective impacts on knowledge retention during CPT. Our experiments are conducted across multiple multimodal tasks, utilizing Qwen2.5-VL-7B-Instruct as the base model. The investigation yields two significant findings: (1) When continuously learning on downstream tasks, SFT leads to catastrophic forgetting of previously learned tasks. In contrast, RFT inherently preserves prior knowledge and achieves performance comparable to multi-task training. (2) RFT successfully protects and even enhances the model's general knowledge on standard benchmarks, while SFT degrades general model capabilities severely. Further analysis reveals that this stability is not primarily due to explicit mechanisms like KL penalty or chain-of-thought reasoning. We investigate RFT's learning dynamics and find that its selective update mechanism inherently prevents interference with established knowledge. Based on this insight, we propose a rollout-based instance filtering algorithm (RIF-RFT) that enhances the training efficiency of RFT by focusing on learnable samples. Our comprehensive study demonstrates the superiority of RFT as a robust paradigm for continual post-training.
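🎯 Research motivation: FHIR is the dominant standard for healthcare data interoperability, but answering complex clinical questions over FHIR requires multi-step reasoning and resource aggregation, and existing tool-augmented LLM agents handle resource selection and access constraints poorly.
❓ Problem addressed: Models reasoning over FHIR data as a sequential decision-making problem on a structured graph, targeting wrong resource selection and constraint violations in multi-step reasoning.
🔍 Key observations: Existing methods reach only 50% answer correctness on the benchmark, exposing the limits of prompt-based closed models in multi-step reasoning and data-integrity constraints.
🛠️ Main method: Proposes a multi-turn CodeAct agent that is post-trained with reinforcement learning, using custom tools and an LLM Judge that provides execution-grounded reward feedback.
📊 Data and experiments: Evaluated on the FHIR-AgentBench benchmark, where an RL post-trained Qwen 2.5-7B model raises correctness from 50% to 64%.
⭐ Main contributions: Provides a complete post-training pipeline covering environment building, tool integration, model training, and evaluation, which markedly improves multi-turn reasoning over FHIR-based structured clinical graphs.
View full abstract (Abstract)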
Fast Healthcare Interoperability Resources (FHIR) is the dominant standard for interoperable exchange of healthcare data. In FHIR, electronic health records form a directed graph of resources. Answering clinically meaningful questions over FHIR requires agents to perform multi-step reasoning, filtering, and aggregation across multiple resource types. Prior work shows that even tool-augmented LLM agents (retrieval, code execution, multi-turn planning) often select the wrong resources or violate traversal constraints. We study this problem in the context of FHIR-AgentBench, a benchmark for realistic question answering over real-world hospital data, and frame reasoning on FHIR as a sequential decision-making problem over a queryable structured graph. We implement a multi-turn CodeAct agent and post-train it with reinforcement learning using a custom harness and tools. An LLM Judge provides execution-grounded rewards. Compared to prompt-based, closed-model baselines, RL post-training improves performance while enforcing data-integrity constraints. Empirically, our approach improves answer correctness from 50% (o4-mini) to 64% on FHIR-AgentBench using a smaller and cheaper Qwen 2.5-7B model. We present an end-to-end post-training pipeline (environment building, harness construction, model training and custom evaluation) that reliably improves multi-turn reasoning over structured clinical graphs.
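🎯 Research motivation: Reinforcement learning struggles with large combinatorial action spaces, which limits its use in many real-world settings.
❓ Problem addressed: Proposes a new framework based on discrete diffusion models that handles complex combinatorial action spaces effectively and improves the performance and stability of policy learning.
🔍 Key observations: Experiments reveal an important trade-off between sample efficiency and training stability: FKL converges faster initially, whereas RKL offers greater training stability and higher final performance.
🛠️ Main method: Uses policy mirror descent to define a regularized target distribution, casts the policy update as a distribution matching problem, and trains the diffusion model to replicate this stable target.
📊 Data and experiments: Evaluated on several challenging combinatorial benchmarks, including DNA sequence generation, RL with macro-actions, and multi-agent systems, confirming the method's efficiency and performance.
⭐ Main contributions: A new framework for discrete diffusion policies that provides stable and efficient policy optimization and markedly improves reinforcement learning in complex action spaces.
View full abstract (Abstract)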
Reinforcement learning (RL) struggles to scale to large, combinatorial action spaces common in many real-world problems. This paper introduces a novel framework for training discrete diffusion models as highly effective policies in these complex settings. Our key innovation is an efficient online training process that ensures stable and effective policy improvement. By leveraging policy mirror descent (PMD) to define an ideal, regularized target policy distribution, we frame the policy update as a distributional matching problem, training the expressive diffusion model to replicate this stable target. This decoupled approach stabilizes learning and significantly enhances training performance. Our method achieves state-of-the-art results and superior sample efficiency across a diverse set of challenging combinatorial benchmarks, including DNA sequence generation, RL with macro-actions, and multi-agent systems. Experiments demonstrate that our diffusion policies attain superior performance compared to other baselines. Crucially, our extensive empirical analysis reveals a key trade-off: FKL demonstrates superior sample efficiency and faster initial convergence, whereas RKL ensures greater training stability and higher asymptotic performance.
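The policy update described above can be illustrated with a small tabular sketch. It assumes the standard KL-regularized mirror descent target, proportional to the old policy times `exp(Q / temperature)`, and reads FKL and RKL as forward and reverse KL between that target and the model; the abstract does not spell out either detail, and all names and numbers here are hypothetical.

```python
import math


def pmd_target(old_probs, q_values, temperature):
    """KL-regularized mirror descent target: proportional to pi_old * exp(Q / temperature)."""
    weights = [p * math.exp(q / temperature) for p, q in zip(old_probs, q_values)]
    z = sum(weights)
    return [w / z for w in weights]


def kl(p, q):
    """KL(p || q) for two categorical distributions with full support in q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


old_policy = [0.25, 0.25, 0.25, 0.25]
q_values = [1.0, 0.0, 0.0, -1.0]   # hypothetical action values
target = pmd_target(old_policy, q_values, temperature=1.0)

model = [0.4, 0.2, 0.2, 0.2]       # stand-in for the diffusion policy's distribution
forward_kl = kl(target, model)     # expectation under the target (mass-covering)
reverse_kl = kl(model, target)     # expectation under the model (mode-seeking)
```

In the paper the model is a discrete diffusion policy over a combinatorial action space, so neither divergence would be computed by explicit enumeration as it is here.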
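🎯 Research motivation: Combining reinforcement-learning-based policy optimization with verifiable rewards is an important way to improve the reasoning ability of large language models.
❓ Problem addressed: Analyzes the dynamics, loss structure, and success-rate amplification mechanism of GRPO under verifiable (binary) rewards.
🔍 Key observations: Mean+variance calibration of the rewards induces a contrastive loss whose contrastive samples are synthetic data generated by the previous policy, and the optimal policy's success rate depends on the reward statistics and the regularization strength.
🛠️ Main method: Studies different regularization schemes (mirror regularization, reference-policy regularization, and their combination) and derives a recurrence that characterizes the fixed point.
📊 Data and experiments: Theoretical analysis and experiments confirm that the policy's success rate always exceeds that of the reference policy, demonstrating that GRPO amplifies the probability of success.
⭐ Main contributions: Characterizes the policy dynamics and loss structure of GRPO under verifiable rewards, clarifies its success-rate amplification mechanism, and gives theoretical convergence results.
View full abstract (Abstract)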
Group Relative Policy Optimization (GRPO) was introduced recently and used to train DeepSeek-R1 for promoting reasoning in LLMs under verifiable (binary) rewards. We show that the mean+variance calibration of these rewards induces a contrastive loss in which the contrastive samples are synthetic data drawn from the previous policy. While GRPO was originally paired with clipping to keep updates near the old policy, we analyze variants that differ in reward normalization (mean-only vs. mean+variance) and in how they regularize updates using KL divergence: either penalizing divergence from the previous model (\emph{mirror}), penalizing divergence from a fixed reference model $\pi_{\mathrm{ref}}$, or combining both forms of regularization. For each, the optimal policy $\pi_n$ admits an explicit form in terms of the binary reward and the first and second order statistics of the reward under $\pi_{n-1}$, as well as the policies $\pi_{n-1}$ and $\pi_{\mathrm{ref}}$. Iterating results in a sequence $\{\pi_n\}$ whose \emph{probability of success (PoS)} obeys a simple recurrence that converges to a fixed point determined by the reference PoS and the regularization strength. We further show that this fixed point exceeds the reference, demonstrating that GRPO amplifies the policy's probability of success.
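A small sketch of the reward calibration discussed above may help. It standardizes binary rewards within one group of sampled responses, with a switch between the mean-only and mean+variance variants; the `eps` term and its placement outside the square root are assumptions, since implementations differ on this detail.

```python
import math


def grpo_advantages(rewards, eps=1e-6, use_std=True):
    """Group-relative advantages: center by the group mean, optionally scale by the group std."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    scale = math.sqrt(var) + eps if use_std else 1.0
    return [(r - mean) / scale for r in rewards]


# One prompt, eight sampled responses, two verified correct (success rate p = 0.25).
group = [1, 0, 0, 1, 0, 0, 0, 0]
adv_mean_var = grpo_advantages(group)                  # mean+variance calibration
adv_mean_only = grpo_advantages(group, use_std=False)  # mean-only calibration
```

With binary rewards and group success rate p, a correct response gets roughly (1 - p) / sqrt(p(1 - p)) and an incorrect one roughly -p / sqrt(p(1 - p)), so the weighting depends only on the first two reward statistics under the previous policy.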
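🎯 Research motivation: Reinforcement learning post-training has recently brought large gains for large language models on long chain-of-thought reasoning, but the high inference cost of such models motivates distillation into smaller ones.
❓ Problem addressed: Most existing knowledge distillation methods are built for supervised fine-tuning and rely on fixed teacher traces or KL-divergence regularization; combined with reinforcement learning, they tend to suffer from distribution mismatch and objective interference.
🔍 Key observations: Teacher supervision may not match the student's evolving rollout distribution, and the KL regularizer competes with reward maximization, so the losses must be balanced carefully.
🛠️ Main method: Proposes RL-aware Distillation (RLAD), which uses selective imitation to guide the student during reinforcement learning, imitating the teacher only when doing so benefits the policy update. Its core module, Trust Region Ratio Distillation (TRRD), uses a PPO/GRPO-style likelihood-ratio objective anchored to a teacher--old-policy mixture to obtain advantage-aware, trust-region-bounded distillation on student rollouts.
📊 Data and experiments: On logic reasoning and math benchmarks, RLAD consistently outperforms offline distillation, standard GRPO, and KL-based on-policy teacher-student distillation.
⭐ Main contributions: Proposes the RLAD and TRRD framework, which addresses distribution mismatch and objective interference in RL-based knowledge distillation, improves the student's reasoning performance, and balances exploration and imitation.
View full abstract (Abstract)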
Reinforcement learning (RL) post-training has recently driven major gains in long chain-of-thought reasoning large language models (LLMs), but the high inference cost of such models motivates distillation into smaller students. Most existing knowledge distillation (KD) methods are designed for supervised fine-tuning (SFT), relying on fixed teacher traces or teacher-student Kullback–Leibler (KL) divergence based regularization. When combined with RL, these approaches often suffer from distribution mismatch and objective interference: teacher supervision may not align with the student’s evolving rollout distribution, and the KL regularizer can compete with reward maximization and require careful loss balancing. To address these issues, we propose \emph{RL-aware distillation} (RLAD), which performs selective imitation during RL, guiding the student toward the teacher only when it improves the current policy update. Our core component, Trust Region Ratio Distillation (TRRD), replaces the teacher-student KL regularizer with a PPO/GRPO-style likelihood-ratio objective anchored to a teacher--old-policy mixture, yielding advantage-aware, trust-region-bounded distillation on student rollouts and naturally balancing exploration, exploitation, and imitation. Across diverse logic reasoning and math benchmarks, RLAD consistently outperforms offline distillation, standard GRPO, and KL-based on-policy teacher–student knowledge distillation.
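🎯 Research motivation: As large language models advance, the shift from conversational chatbots to general agents is accelerating, yet balancing empathetic communication with budget-aware decision-making remains an open problem.
❓ Problem addressed: Existing methods fail to capture these complex strategic trade-offs, so the paper proposes a new reinforcement learning framework that optimizes the balance between utility and cost in task-oriented dialogue.
🔍 Key observations: Conventional methods do not adequately account for user personalization or for global constraints in multi-level decision-making, which limits task completion efficiency and cost control in real scenarios.
🛠️ Main method: Proposes the InteractCS-RL framework, which comprises a user-centric interaction framework and Cost-aware Multi-turn Policy Optimization (CMPO). CMPO balances utility and cost dynamically through a hybrid advantage estimation strategy and a PID-Lagrangian cost controller.
📊 Data and experiments: Validated on customized business scenarios and on tool-agent-user interaction benchmarks, where it is robust across domains and significantly outperforms the other baselines.
⭐ Main contributions: Reframes task-oriented dialogue as a multi-granularity reinforcement learning process and designs a policy optimization method that combines generative process credits with global constraints, offering an effective approach to the utility-cost balance for real-world service agents.
View full abstract (Abstract)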
The rapid evolution of Large Language Models (LLMs) has accelerated the transition from conversational chatbots to general agents. However, effectively balancing empathetic communication with budget-aware decision-making remains an open challenge. Since existing methods fail to capture these complex strategic trade-offs, we propose InteractCS-RL, a framework that reframes task-oriented dialogue as a multi-granularity reinforcement learning process. Specifically, we first establish a User-centric Interaction Framework to provide a high-fidelity training gym, enabling agents to dynamically explore diverse strategies with persona-driven users. Then, we introduce Cost-aware Multi-turn Policy Optimization (CMPO) with a hybrid advantage estimation strategy. By integrating generative process credits and employing a PID-Lagrangian cost controller, CMPO effectively guides the policy to explore the Pareto boundary between user reward and global cost constraints. Extensive experiments on customized real business scenarios demonstrate that InteractCS-RL significantly outperforms other baselines across three evaluation dimensions. Further evaluation on tool-agent-user interaction benchmarks verifies InteractCS-RL's robustness across diverse domains.
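The PID-Lagrangian cost controller mentioned above can be sketched as follows. This is a generic version in the style of PID Lagrangian methods from constrained RL, not CMPO's actual controller: the class name, gains, cost limit, and the choice to clamp the integral and derivative terms at zero are all assumptions.

```python
class PIDLagrangian:
    """Tracks a non-negative cost multiplier with proportional, integral, and derivative terms."""

    def __init__(self, kp, ki, kd, cost_limit):
        self.kp = kp
        self.ki = ki
        self.kd = kd
        self.cost_limit = cost_limit
        self.integral = 0.0
        self.prev_cost = None

    def update(self, cost):
        """Return the multiplier for the next policy update given the latest measured cost."""
        error = cost - self.cost_limit
        # Integral of constraint violation, kept non-negative.
        self.integral = max(0.0, self.integral + error)
        # Derivative term reacts only to rising cost.
        if self.prev_cost is None:
            derivative = 0.0
        else:
            derivative = max(0.0, cost - self.prev_cost)
        self.prev_cost = cost
        raw = self.kp * error + self.ki * self.integral + self.kd * derivative
        return max(0.0, raw)


controller = PIDLagrangian(kp=0.5, ki=0.1, kd=0.2, cost_limit=1.0)
episode_costs = [2.0, 1.5, 1.0, 0.5]  # hypothetical per-iteration average costs
multipliers = [controller.update(c) for c in episode_costs]
```

A typical use would subtract `multiplier * cost_advantage` from the reward advantage before the policy update, so the penalty grows while the budget is exceeded and decays once the cost falls below the limit.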
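🎯 Research motivation: Existing reparameterization policy gradient methods are sample-efficient, but they under-utilize computationally expensive dynamics Jacobians and suffer from training instability.
❓ Problem addressed: Improves training efficiency and stability by designing a principled sample reuse mechanism and by addressing the instability that naive reuse in reparameterization methods can cause.
🔍 Key observations: Sample reuse can offset the high cost of computing dynamics Jacobians, but unprincipled attempts may make training even less stable.
🛠️ Main method: Proposes RPO, which shows that under sample reuse the reparameterization policy gradient optimizes a PPO-style surrogate objective via backpropagation through time, and combines a clipped policy gradient with explicit KL-divergence regularization to ensure stability.
📊 Data and experiments: Validated on a variety of challenging tasks, where RPO is highly sample-efficient, more stable than current algorithms, and matches or exceeds state-of-the-art performance.
⭐ Main contributions: A unified framework that combines sample reuse with policy optimization and resolves the instability of reparameterized training through clipping and regularization.
View full abstract (Abstract)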
By leveraging differentiable dynamics, Reparameterization Policy Gradient (RPG) achieves high sample efficiency. However, current approaches are hindered by two critical limitations: the under-utilization of computationally expensive dynamics Jacobians and inherent training instability. While sample reuse offers a remedy for under-utilization, no prior principled framework exists, and naive attempts risk exacerbating instability. To address these challenges, we propose Reparameterization Proximal Policy Optimization (RPO). We first establish that under sample reuse, RPG naturally optimizes a PPO-style surrogate objective via Backpropagation Through Time, providing a unified framework for both on- and off-policy updates. To further ensure stability, RPO integrates a clipped policy gradient mechanism tailored for RPG and employs explicit Kullback-Leibler divergence regularization. Experimental results demonstrate that RPO maintains superior sample efficiency and consistently outperforms or achieves state-of-the-art performance across diverse tasks.
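🎯 Research motivation: Reinforcement Learning with Verifiable Rewards (RLVR) improves the reasoning of large language models but limits generation diversity. Methods such as Negative Sample Reinforcement (NSR) partly address this, but they may suppress the semantic distributions shared between positive and negative responses.
❓ Problem addressed: Proposes a new residual reinforcement learning method (ResRL) that decouples the semantic distributions of positive and negative responses, improving reasoning while preserving generation diversity.
🔍 Key observations: Negative sample reinforcement can cause interference between semantic distributions and thereby reduce diversity. The paper analyzes negative-positive head-gradient interference theoretically and links it to Lazy Likelihood Displacement (LLD) to guide advantage reweighting.
🛠️ Main method: Projects negative-sample hidden representations onto an SVD-based low-rank positive subspace and uses the projection residuals to modulate negative gradients, improving reasoning performance and generation quality.
📊 Data and experiments: Tested on twelve benchmarks spanning mathematics, code, agent tasks, and function calling. On mathematical reasoning, ResRL surpasses NSR by 9.4% in Avg@16 and 7.0% in Pass@128.
⭐ Main contributions: Proposes ResRL, which decouples positive and negative semantic distributions, improves reasoning and generation diversity, outperforms strong baselines on average, and releases code.
View full abstract (Abstract)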
Reinforcement Learning with Verifiable Rewards (RLVR) enhances reasoning of Large Language Models (LLMs) but usually exhibits limited generation diversity due to the over-incentivization of positive rewards. Although methods like Negative Sample Reinforcement (NSR) mitigate this issue by upweighting penalty from negative samples, they may suppress the semantic distributions shared between positive and negative responses. To boost reasoning ability without losing diversity, this paper proposes negative sample projection Residual Reinforcement Learning (ResRL) that decouples similar semantic distributions among positive and negative responses. We theoretically link Lazy Likelihood Displacement (LLD) to negative-positive head-gradient interference and derive a single-forward proxy that upper-bounds representation alignment to guide conservative advantage reweighting. ResRL then projects negative-token hidden representations onto an SVD-based low-rank positive subspace and uses projection residuals to modulate negative gradients, improving reasoning while preserving diversity and outperforming strong baselines on average across twelve benchmarks spanning Mathematics, Code, Agent Tasks, and Function Calling. Notably, ResRL surpasses NSR on mathematical reasoning by 9.4\% in Avg@16 and 7.0\% in Pass@128. Code is available at https://anonymous.4open.science/r/ResRL.
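🎯 Research motivation: Large language models perform well on complex reasoning tasks, but training with reinforcement learning under verifiable rewards is extremely resource-intensive, so better data and compute efficiency is needed.
❓ Problem addressed: Targets the sample complexity and compute cost of RLVR training, seeking to reduce the computational burden while preserving reasoning performance.
🔍 Key observations: Establishes a theoretical lower bound on the number of samples needed to unlock reasoning capabilities, and verifies experimentally that a small number of training samples suffices for strong reasoning performance.
🛠️ Main method: Proposes Dynamic One-Shot Policy Refinement (DoPR), which selects a single informative sample per batch for the policy update, guided by reward volatility and exploration-driven acquisition, greatly reducing compute overhead.
📊 Data and experiments: Experiments show that DoPR reduces rollout overhead by nearly an order of magnitude while keeping reasoning accuracy competitive.
⭐ Main contributions: A scalable, resource-efficient RL post-training approach that makes large language models more accessible for reasoning-intensive tasks.
View full abstract (Abstract)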
Large language models (LLMs) have exhibited remarkable performance on complex reasoning tasks, with reinforcement learning under verifiable rewards (RLVR) emerging as a principled framework for aligning model behavior with reasoning chains. Despite its promise, RLVR remains prohibitively resource-intensive, requiring extensive reward signals and incurring substantial rollout costs during training. In this work, we revisit the fundamental question of data and compute efficiency in RLVR. We first establish a theoretical lower bound on the sample complexity required to unlock reasoning capabilities, and empirically validate that strong performance can be achieved with a surprisingly small number of training instances. To tackle the computational burden, we propose Dynamic One-Shot Policy Refinement (DoPR), an uncertainty-aware RL strategy that dynamically selects a single informative training sample per batch for policy updates, guided by reward volatility and exploration-driven acquisition. DoPR reduces rollout overhead by nearly an order of magnitude while preserving competitive reasoning accuracy, offering a scalable and resource-efficient solution for LLM post-training. This approach offers a practical path toward more efficient and accessible RL-based training for reasoning-intensive LLM applications.
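🎯 Research motivation: Turning human intent into parametric CAD models is a major challenge at the conceptual design stage, where inputs are unstructured and multimodal (for example, hand-drawn sketches and textual descriptions).
❓ Problem addressed: Proposes a unified method that maps multi-level human intent directly to executable code, without assuming that a target CAD model already exists.
🔍 Key observations: Current methods handle heterogeneous inputs such as hand-drawn sketches and textual descriptions poorly and struggle to guarantee the geometric and topological consistency of parametric CAD models.
🛠️ Main method: Designs a two-stage framework: Cooperative Multi-Task Alignment to bridge the representational gap between heterogeneous inputs, and Spatial-Aware Reinforcement Learning to enforce geometric and topological consistency.
📊 Data and experiments: Builds the large-scale HiCAD dataset of hand-drawn sketches, textual descriptions, and parametric CAD code, and shows through multi-task experiments that the method is clearly superior for high-fidelity CAD generation.
⭐ Main contributions: Proposes a unified framework for converting complex inputs directly into parametric CAD, builds the HiCAD dataset, and significantly outperforms existing baselines in geometric consistency and high-fidelity generation.
View full abstract (Abstract)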
Parametric CAD modeling from human intent remains challenging, particularly during the conceptual design stage, where design goals are expressed through incomplete and unstructured modalities (e.g., hand-drawn sketches and textual descriptions). In this work, we rethink the human intent-to-CAD pipeline and propose a unified method that directly maps multi-level human intents to executable codes, without assuming the prior existence of target CAD models. To support our study, we construct HiCAD, the first large-scale dataset aligning hand-drawn sketches, textual descriptions, and parametric CAD codes. Based on this, we introduce HiCAD, a two-stage framework comprising Cooperative Multi-Task Alignment to bridge the representational gap between heterogeneous inputs, and Spatial-Aware Reinforcement Learning to enforce geometric and topological consistency. Extensive experiments demonstrate that our method significantly outperforms existing baselines across multiple tasks, validating its effectiveness and robustness in transforming heterogeneous human intents into high-fidelity parametric CAD models.
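🎯 Research motivation: Diffusion models for visual tasks have intractable likelihoods, which hinders the development of efficient reinforcement learning methods.
❓ Problem addressed: Systematically analyzes the RL design space and examines how likelihood estimation drives algorithmic performance.
🔍 Key observations: An ELBO-based likelihood estimator markedly improves optimization efficiency and stability, and its effect on performance dominates that of the specific loss function.
🛠️ Main method: Disentangles the objective, the likelihood estimator, and the rollout sampling scheme, and verifies the advantage of an ELBO likelihood estimate computed only from the final generated sample.
📊 Data and experiments: Tested on multiple reward benchmarks with SD 3.5 Medium and compared against existing methods, with a large gain in GenEval score and better compute efficiency.
⭐ Main contributions: Establishes the importance of likelihood estimation, provides a more efficient optimization pipeline, and improves performance across tasks over several existing methods.
View full abstract (Abstract)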
Reinforcement learning has been widely applied to diffusion and flow models for visual tasks such as text-to-image generation. However, these tasks remain challenging because diffusion models have intractable likelihoods, which creates a barrier for directly applying popular policy-gradient type methods. Existing approaches primarily focus on crafting new objectives built on already heavily engineered LLM objectives, using ad hoc estimators for likelihood, without a thorough investigation into how such estimation affects overall algorithmic performance. In this work, we provide a systematic analysis of the RL design space by disentangling three factors: i) policy-gradient objectives, ii) likelihood estimators, and iii) rollout sampling schemes. We show that adopting an evidence lower bound (ELBO) based model likelihood estimator, computed only from the final generated sample, is the dominant factor enabling effective, efficient, and stable RL optimization, outweighing the impact of the specific policy-gradient loss functional. We validate our findings across multiple reward benchmarks using SD 3.5 Medium, and observe consistent trends across all tasks. Our method improves the GenEval score from $0.24$ to $0.95$ in $90$ GPU hours, which is $4.6\times$ more efficient than FlowGRPO and $2\times$ more efficient than the SOTA method DiffusionNFT without reward hacking.
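🎯 Research motivation: Reinforcement learning is a key method for fine-tuning large models, but the dominant algorithm, PPO, is structurally ill-suited to large vocabularies: its ratio clipping mechanism hurts update efficiency and stability.
❓ Problem addressed: Targets PPO's poor fit for large vocabularies by proposing a more precise constraint on policy updates, addressing over-penalization of low-probability tokens and under-constraint of high-probability tokens.
🔍 Key observations: PPO's ratio clipping relies on a noisy single-sample Monte Carlo estimate, which yields suboptimal learning dynamics and harms training efficiency and stability.
🛠️ Main method: Proposes Divergence Proximal Policy Optimization (DPPO), which replaces heuristic clipping with a direct estimate of policy divergence (such as total variation or KL divergence) and introduces efficient Binary and Top-K approximations to reduce memory overhead.
📊 Data and experiments: Extensive experiments show that DPPO is more stable and efficient than existing methods, providing a more robust basis for RL fine-tuning of large language models.
⭐ Main contributions: Redesigns the policy update constraint for LLMs, proposes DPPO and its memory-saving approximations, and markedly improves the efficiency and stability of RL fine-tuning.
View full abstract (Abstract)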
Reinforcement learning (RL) has become a cornerstone for fine-tuning Large Language Models (LLMs), with Proximal Policy Optimization (PPO) serving as the de facto standard algorithm. Despite its ubiquity, we argue that the core ratio clipping mechanism in PPO is structurally ill-suited for the large vocabularies inherent to LLMs. PPO constrains policy updates based on the probability ratio of sampled tokens, which serves as a noisy single-sample Monte Carlo estimate of the true policy divergence. This creates a sub-optimal learning dynamic: updates to low-probability tokens are aggressively over-penalized, while potentially catastrophic shifts in high-probability tokens are under-constrained, leading to training inefficiency and instability. To address this, we propose Divergence Proximal Policy Optimization (DPPO), which substitutes heuristic clipping with a more principled constraint based on a direct estimate of policy divergence (e.g., Total Variation or KL). To avoid a huge memory footprint, we introduce the efficient Binary and Top-K approximations to capture the essential divergence with negligible overhead. Extensive empirical evaluations demonstrate that DPPO achieves superior stability and efficiency compared to existing methods, offering a more robust foundation for RL-based LLM fine-tuning.
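The mismatch between the sampled-token ratio and the actual policy divergence can be seen in a three-token toy vocabulary. In the sketch below, `binary_tv` collapses the vocabulary to "sampled token versus everything else", which is only one plausible reading of the Binary approximation; the distributions, the clip range of 0.2, and all names are assumptions for illustration.

```python
def total_variation(p, q):
    """Total variation distance between two categorical distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def binary_tv(p, q, token):
    """TV after collapsing the vocabulary to {token, everything else}."""
    return abs(p[token] - q[token])


def outside_clip_range(ratio, clip=0.2):
    """True when a PPO-style ratio falls outside [1 - clip, 1 + clip]."""
    return ratio < 1.0 - clip or ratio > 1.0 + clip


# Case A: a rare token doubles in probability, yet the policy barely moves.
old_a, new_a = [0.98, 0.01, 0.01], [0.97, 0.02, 0.01]
ratio_a = new_a[1] / old_a[1]
tv_a = total_variation(old_a, new_a)

# Case B: a dominant token shifts a lot while its ratio stays inside the clip range.
old_b, new_b = [0.90, 0.05, 0.05], [0.75, 0.20, 0.05]
ratio_b = new_b[0] / old_b[0]
tv_b = total_variation(old_b, new_b)

clipped_a = outside_clip_range(ratio_a)
clipped_b = outside_clip_range(ratio_b)
binary_a = binary_tv(old_a, new_a, token=1)
binary_b = binary_tv(old_b, new_b, token=0)
```

Case A would be clipped even though the total variation is only about 0.01, while case B passes the ratio test with a total variation of about 0.15, which matches the over-penalized and under-constrained behaviors the abstract describes.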
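🎯 Research motivation: Retrosynthesis prediction is a core task in chemical synthesis, but existing models mostly rely on static pattern matching, lack logical decision-making ability, and are not interpretable.
❓ Problem addressed: Builds a framework that combines logical reasoning with interpretability to improve retrosynthesis prediction while overcoming the black-box nature of existing models.
🔍 Key observations: Current models struggle to reason with chemical logic, so their predictions lack transparency and chemical actionability.
🛠️ Main method: Proposes the Retro-Expert framework, which combines the reasoning strengths of specialized models and large language models via reinforcement learning. It has three components: specialized models construct the decision space, the LLM performs interpretable reasoning, and reinforcement learning optimizes the interpretable decision policy.
📊 Data and experiments: Across multiple metrics, Retro-Expert outperforms both standalone LLMs and specialized models, and it produces expert-level explanations consistent with chemical logic.
⭐ Main contributions: Designs a retrosynthesis framework that integrates collaborative reasoning, markedly improves prediction performance and interpretability, and offers a new paradigm at the intersection of AI and chemistry.
View full abstract (Abstract)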
Retrosynthesis prediction aims to infer the reactant molecule based on a given product molecule, which is a fundamental task in chemical synthesis. However, existing models rely on a static pattern-matching paradigm, which limits their ability to perform effective logical decision-making and leads to a black-box process. To address this, we propose Retro-Expert, an interpretable retrosynthesis framework that performs collaborative reasoning by combining the complementary reasoning strengths of Large Language Models and specialized models via reinforcement learning. It outputs natural language explanations grounded in chemical logic through three components: (1) specialized models analyze the product to construct a high-quality chemical decision space, (2) LLM-driven critical reasoning generates predictions and a corresponding interpretable reasoning path, and (3) reinforcement learning optimizes the interpretable decision policy. Experiments show that Retro-Expert not only surpasses both LLM-based and specialized models across different metrics but also provides expert-aligned explanations that bridge the gap between AI predictions and actionable chemical insights.
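🎯 Research motivation: Conventional policy gradient methods are sample-inefficient. Prior work has made progress on reusing past gradients, but the reuse of past trajectories lacks theoretical analysis.
❓ Problem addressed: Asks how past off-policy trajectory data can be reused effectively to significantly accelerate the convergence of policy gradient methods.
🔍 Key observations: Existing policy gradient algorithms for continuous control need many samples and do not fully exploit sampling information across iterations, which limits convergence speed.
🛠️ Main method: Designs the RT-PG algorithm, which uses a power mean-corrected multiple importance weighting estimator to combine on-policy and off-policy data from the most recent iterations, improving sample efficiency and convergence speed.
📊 Data and experiments: Experiments on benchmarks confirm the effectiveness of RT-PG, which outperforms baselines with state-of-the-art convergence rates.
⭐ Main contributions: Provides the first theoretical analysis showing that reusing off-policy trajectories accelerates PG methods; designs RT-PG, whose rate of $\widetilde{\mathcal{O}}(\epsilon^{-1})$ is the best known in the literature; and confirms its advantage experimentally.
View full abstract (Abstract)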
*Policy gradient* (PG) methods are a class of effective *reinforcement learning* algorithms, particularly when dealing with continuous control problems. They rely on fresh *on-policy* data, making them sample-inefficient and requiring $\mathcal{O}(\epsilon^{-2})$ trajectories to reach an $\epsilon$-approximate stationary point. A common strategy to improve efficiency is to *reuse* information from past iterations, such as previous *gradients* or *trajectories*, leading to *off-policy* PG methods. While gradient reuse has received substantial attention, leading to improved rates up to $\mathcal{O}(\epsilon^{-3/2})$, the reuse of past trajectories, although intuitive, remains largely unexplored from a theoretical perspective. In this work, we provide the first rigorous theoretical evidence that reusing past off-policy trajectories can significantly accelerate PG convergence. We propose RT-PG (Reusing Trajectories - Policy Gradient), a novel algorithm that leverages a *power mean*-corrected multiple importance weighting estimator to effectively combine on-policy and off-policy data coming from the most recent $\omega$ iterations. Through a novel analysis, we prove that RT-PG achieves a sample complexity of $\widetilde{\mathcal{O}}(\epsilon^{-2}\omega^{-1})$. When reusing *all* available past trajectories, this leads to a rate of $\widetilde{\mathcal{O}}(\epsilon^{-1})$, the best known one in the literature for PG methods. We further validate our approach empirically, demonstrating its effectiveness against baselines with state-of-the-art rates.
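As a rough picture of how data from the most recent $\omega$ iterations can be pooled, the sketch below applies plain multiple importance weighting with the balance heuristic to samples from three behavior distributions. It does not include the power mean correction that RT-PG relies on, and it uses 1-D Gaussians as stand-ins for trajectory distributions; every name and constant is an assumption.

```python
import math
import random


def normal_pdf(x, mean, std=1.0):
    """Density of a 1-D Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def balance_heuristic_estimate(samples_per_policy, behavior_means, target_mean, f):
    """Estimate E_target[f] by weighting every sample against the mixture of all behaviors."""
    total = sum(len(s) for s in samples_per_policy)
    acc = 0.0
    for samples in samples_per_policy:
        for x in samples:
            mixture = sum(
                len(s) / total * normal_pdf(x, m)
                for s, m in zip(samples_per_policy, behavior_means)
            )
            acc += normal_pdf(x, target_mean) / mixture * f(x)
    return acc / total


rng = random.Random(0)
behavior_means = [-0.5, 0.0, 0.5]  # three past "policies" (omega = 3)
samples = [[rng.gauss(m, 1.0) for _ in range(4000)] for m in behavior_means]

# Target is the current policy N(0, 1); the true value of E[x^2] is 1.
estimate = balance_heuristic_estimate(samples, behavior_means, 0.0, lambda x: x * x)
```

Because each weight divides by the mixture density rather than by a single behavior density, the weights stay bounded here (at most 3, since the target is one of the three mixture components), which is the usual motivation for multiple importance weighting over per-policy importance sampling.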
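🎯 Research motivation: Conventional actor-critic (AC) methods lack interpretability, and existing explainable RL models make little use of state attributions and do not distinguish the heterogeneous impact of state dimensions on the reward.
❓ Problem addressed: Designs an explainable RL algorithm based on state attributions that improves efficiency, stability, and interpretability together.
🔍 Key observations: Existing methods treat all state features equally and ignore how differently individual dimensions contribute to the reward, which limits training.
🛠️ Main method: Proposes the RSA2C algorithm, which uses RKHS-SHAP attributions and a kernel weighting mechanism to modulate Actor gradients and Advantage Critic targets, and uses sparsified dictionaries within a two-timescale architecture.
📊 Data and experiments: Experiments on three continuous-control environments confirm strong efficiency, stability, and interpretability.
⭐ Main contributions: Develops an attribution-aware, kernelized, two-timescale AC algorithm, derives a global convergence bound under state perturbations, and demonstrates a new route to explainable reinforcement learning.
View full abstract (Abstract)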
Actor-critic (AC) methods are a cornerstone of reinforcement learning (RL) but offer limited interpretability. Current explainable RL methods seldom use *state attributions* to assist training. Rather, they treat all state features equally, thereby neglecting the heterogeneous impacts of individual state dimensions on the reward. We propose *RKHS-SHAP-based Advanced Actor-Critic (RSA2C)*, an attribution-aware, kernelized, two-timescale AC algorithm, including Actor, Value Critic, and Advantage Critic. The Actor is instantiated in a vector-valued reproducing kernel Hilbert space (RKHS) with a Mahalanobis-weighted operator-valued kernel, while the Value Critic and Advantage Critic reside in scalar RKHSs. These RKHS-enhanced components use sparsified dictionaries: the Value Critic maintains its own dictionary, while the Actor and Advantage Critic share one. State attributions, computed from the Value Critic via RKHS-SHAP (kernel mean embedding for on-manifold and conditional mean embedding for off-manifold expectations), are converted into Mahalanobis-gated weights that modulate Actor gradients and Advantage Critic targets. We derive a global, non-asymptotic convergence bound under *state perturbations*, showing stability through the perturbation-error term and efficiency through the convergence-error term. Empirical results on three continuous-control environments show that RSA2C achieves efficiency, stability, and interpretability.
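🎯 Research motivation: The chain-of-thought ability of large reasoning models has improved markedly, but reasoning chains are structurally redundant, which raises compute cost without a clear gain in correctness.
❓ Problem addressed: Existing length penalties mostly apply uniform pressure to all tokens, cannot tell redundant parts from valuable reasoning, and may compress useful reasoning.
🔍 Key observations: Inefficiency concentrates in high-probability segments with low marginal utility, and the paper characterizes the suboptimality of such segments theoretically.
🛠️ Main method: Proposes SLAT (Segment-Level Adaptive Trimming), a reinforcement learning framework that adaptively suppresses redundant segments according to the trade-off between correctness and length.
📊 Data and experiments: On standard benchmarks, SLAT reduces reasoning length by 50% while keeping accuracy close to that of the uncompressed baselines.
⭐ Main contributions: Shows theoretically that segment-aware trimming can improve reasoning efficiency and proposes SLAT, which achieves a better accuracy-efficiency trade-off.
View full abstract (Abstract)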
Recent advances in Large Reasoning Models have significantly improved chain-of-thought (CoT) capabilities via reinforcement learning (RL). However, generated reasoning chains frequently suffer from structural redundancy (i.e., \emph{overthinking}), incurring high computational overhead without improving answer correctness. Existing mitigation strategies typically rely on token-uniform length penalties, which provide coarse, segment-agnostic pressure toward shorter outputs and can inadvertently suppress useful reasoning alongside redundancy. To address this, we demonstrate that inefficiency concentrates in high-probability segments with low marginal utility. We derive a theoretical characterization of segment suboptimality under the correctness-length trade-off objective and propose \textsc{SLAT} (Segment-Level Adaptive Trimming), an RL framework that selectively suppresses redundant segments based on this criterion. Empirical results on standard benchmarks indicate that \textsc{SLAT} establishes a superior accuracy-efficiency Pareto frontier, reducing reasoning length by 50\% relative to uncompressed baselines while maintaining competitive accuracy. Overall, our results suggest that theoretically grounded, segment-aware trimming is a promising direction for efficient CoT reasoning in large language models.
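🎯 Research motivation: Self-reflection in human learning turns sparse feedback into actionable guidance, but its potential for large language models remains underexplored.
❓ Problem addressed: Asks how self-reflection can be used to optimize large language models so that sparse terminal supervision is converted into dense learning signals.
🔍 Key observations: Existing methods typically depend on external critics, reward models, or larger teacher models, and these dependencies add complexity and compute cost.
🛠️ Main method: Proposes the Self-Reflective Policy Optimization (SRPO) framework, in which the model analyzes its own completed trajectories, distills errors into reflection patches, and uses reflection-conditioned rollouts as high-quality on-policy distillation targets.
📊 Data and experiments: Evaluated on mathematical reasoning and long-horizon agentic benchmarks, where SRPO reaches state-of-the-art performance on several tasks: 73.3% on AIME’24 with only 8% of the training FLOPs required by scaled supervised fine-tuning, and improved success rates on WebShop (64.7%), ALFWorld (76.8%), and SWE-Bench-Lite (31.2%).
⭐ Main contributions: A self-reflective optimization method that needs no external assistance, markedly improves data efficiency and long-horizon performance, and broadens the optimization paradigm for large language models.
View full abstract (Abstract)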
Self-reflection is a powerful mechanism for credit assignment in human learning, converting sparse outcome feedback into actionable guidance. However, its potential for post-training Large Language Models (LLMs) remains underexplored. We propose Self-Reflective Policy Optimization (SRPO), a framework that internalizes this capability. SRPO enables LLMs to analyze their own completed trajectories, synthesize errors into concise "reflection patches," and use these reflection-conditioned rollouts as high-quality, on-policy distillation targets. This process effectively transforms sparse terminal supervision into dense, token-level learning signals without requiring external critics, separate reward models, or larger teacher models. We demonstrate that SRPO achieves state-of-the-art performance across mathematical reasoning and long-horizon agentic benchmarks with exceptional data efficiency. Using a Qwen3-8B base model, SRPO attains 73.3\% on AIME’24 using only 8\% (0.08$\times$) of the training FLOPs required by scaled supervised fine-tuning, while significantly improving success rates on WebShop (64.7\%), ALFWorld (76.8\%), and SWE-Bench-Lite (31.2\%).
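🎯 Research motivation: Diffusion policies perform well in robotic manipulation, but iterative denoising causes high inference latency, which limits control frequency in real-time closed-loop systems.
❓ Problem addressed: Existing acceleration methods trade action quality against latency and struggle to achieve both.
🔍 Key observations: Accelerating a diffusion policy requires initial actions that are high-quality, close to the target action distribution, and temporally consistent.
🛠️ Main method: Proposes the STEP mechanism, which generates high-quality warm-start actions through lightweight spatiotemporal consistency prediction and adds a velocity-aware perturbation injection mechanism that adaptively modulates actuation excitation.
📊 Data and experiments: Extensive evaluation on nine simulated benchmarks and two real-world tasks shows that STEP beats existing methods in inference latency and success rate.
⭐ Main contributions: A prediction mechanism with a theoretical convergence guarantee that balances latency and performance and markedly improves the efficiency and precision of robotic manipulation.
View full abstract (Abstract)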
Diffusion policies have recently emerged as a powerful paradigm for visuomotor control in robotic manipulation due to their ability to model the distribution of action sequences and capture multimodality. However, iterative denoising leads to substantial inference latency, limiting control frequency in real-time closed-loop systems. Existing acceleration methods either reduce sampling steps, bypass diffusion through direct prediction, or reuse past actions, but often struggle to jointly preserve action quality and achieve consistently low latency. In this work, we propose **STEP**, a lightweight spatiotemporal consistency prediction mechanism to construct high-quality warm-start actions that are both distributionally close to the target action and temporally consistent, without compromising the generative capability of the original diffusion policy. Then, we propose a velocity-aware perturbation injection mechanism that adaptively modulates actuation excitation based on temporal action variation to prevent execution stall, especially for real-world tasks. We further provide a theoretical analysis showing that the proposed prediction induces a locally contractive mapping, ensuring convergence of action errors during diffusion refinement. We conduct extensive evaluations on nine simulated benchmarks and two real-world tasks. Notably, STEP with 2 steps can achieve an average 21.6\% and 27.5\% higher success rate than BRIDGER and DDIM on the RoboMimic benchmark and real-world tasks, respectively. These results demonstrate that STEP consistently advances the Pareto frontier of inference latency and success rate over existing methods.
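🎯 Motivation: Generative control policies (GCPs) are effective parameterizations for robot learning, but how to fine-tune them efficiently, especially in terms of sample efficiency, remains debated.
❓ Problem: Proposes a sample-efficient algorithm for fine-tuning GCPs with reinforcement learning that optimizes the full generative process rather than only steering the initial noise distribution or learning residual corrections.
🔍 Analysis: Fine-tuning the whole generative policy lets the policy move beyond the action distribution of the pre-trained base policy, which improves performance and sample efficiency in complex scenarios.
🛠️ Method: Proposes Off-policy Generative Policy Optimization (OGPO), which maintains off-policy critic networks and a modified PPO objective to maximize data reuse, using the critics as the terminal reward propagated through the full generative process.
📊 Data & experiments: Experiments cover multi-task manipulation, high-precision insertion, and dexterous control. OGPO lifts poorly initialized behavior cloning policies to near full task success without expert data and reaches state-of-the-art performance with very little task-specific hyperparameter tuning.
⭐ Contributions: Shows the advantage of fully fine-tuning GCPs in multi-task and complex manipulation settings, proposes practical implementation details, and identifies the source of the gains: learning beyond the base policy's action distribution.
Abstract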
Generative control policies (GCPs), such as diffusion- and flow-based control policies, have emerged as effective parameterizations for robot learning. Yet there remains substantial debate over how to fine-tune them sample-efficiently via reinforcement learning. A prevailing view holds that fine-tuning all GCP steps is unnecessary, motivating approaches that fine-tune only a subset of the generative process: either steering the initial noise distribution or learning residual corrections on top of a frozen base policy. In this work, we introduce Off-policy Generative Policy Optimization (OGPO), a sample-efficient algorithm for fine-tuning GCPs that maintains off-policy critic networks to maximize data reuse and propagates policy gradients through the full generative process of the policy via a modified PPO objective, using critics as the terminal reward. OGPO achieves state-of-the-art performance on manipulation tasks spanning multi-task settings, high-precision insertion, and dexterous control. To our knowledge, it is also the only method that can \emph{fine-tune poorly-initialized behavior cloning policies to near full task-success with no expert data in the online replay buffer}, and does so with \emph{little task-specific hyperparameter tuning}. We perform extensive empirical investigations on OGPO, finding that its superior performance and sample efficiency lie in its ability to learn beyond the action distribution of the pre-trained base policy, and propose practical implementation details that further boost performance for more complex scenarios.
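🎯 Motivation: Reinforcement learning has recently made notable progress by exploiting the multimodality and exploration capability of diffusion models, but existing methods balance exploration and exploitation poorly, which hurts convergence speed and policy diversity.
❓ Problem: Weighted policy optimization methods favor exploration and under-use Q-value information, so the policy converges slowly. Gradient-based methods over-exploit the Q-function gradient, so the policy loses diversity and tends to collapse into a single mode.
🔍 Analysis: Weighted methods explore well early in training but exploit inefficiently; gradient-based methods fully exploit the Q-function gradient but lack diversity, which limits policy performance.
🛠️ Method: Proposes CGPO, which embeds a training-free guidance technique into the denoising process of the diffusion policy, uses the critic network to steer action generation toward high-value regions, and uses the guided actions as regression targets.
📊 Data & experiments: Validated on 5 MuJoCo locomotion tasks, where CGPO leads existing diffusion-based RL methods. It is also applied to Franka robot arm grasping, reported as the first successful use of a diffusion policy in real-world RL.
⭐ Contributions: Proposes CGPO, reports the first successful use of a diffusion policy in real-world reinforcement learning, improves the exploration-exploitation balance and training efficiency, and achieves the best results on several tasks.
Abstract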
Recent advances in reinforcement learning (RL) have achieved great success by leveraging the multimodality and exploration capability of diffusion policies. Among these approaches, one representative branch focuses on weighted policy optimization. This design gives the diffusion model better exploration capability, particularly at the beginning of training, but suffers from low exploitation of Q-value information, resulting in slow policy convergence. Another branch focuses on gradient-based policy optimization, which sufficiently exploits the gradient of the Q function yet tends to collapse into a unimodal policy with low diversity. To address this issue, we propose CGPO, \textbf{C}ritic-\textbf{G}uided diffusion \textbf{P}olicy \textbf{O}ptimization, which effectively balances exploration and exploitation with a training-free guidance technique integrated into the denoising process of the diffusion policy. Concretely, CGPO steers action generation toward high-value regions defined by the critic network and uses the guided actions as regression objectives. In this manner, CGPO reduces the time required to obtain high-quality actions and improves final performance with a better exploration-exploitation balance. We validate the effectiveness of CGPO on 5 MuJoCo locomotion tasks, where CGPO achieves state-of-the-art performance compared with existing diffusion-based RL methods. Notably, CGPO is the first to successfully incorporate a diffusion policy into real-world RL, with superior performance on Franka robot arm grasping tasks.
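🎯 Motivation: Hierarchical reinforcement learning (RL) holds great promise for long-horizon decision-making, but existing methods have not yet benefited from large-scale training.
❓ Problem: Addresses the poor scalability of existing hierarchical RL methods in high-throughput environments.
🔍 Analysis: The low throughput of existing methods limits their performance on large-scale data and complex tasks.
🛠️ Method: Proposes Scalable Option Learning (SOL), a highly scalable hierarchical policy gradient algorithm that substantially raises training throughput.
📊 Data & experiments: Trains hierarchical agents on 30 billion frames of NetHack experience and validates SOL's general applicability and performance on MiniHack and Mujoco environments.
⭐ Contributions: Achieves roughly 35x higher throughput than existing hierarchical methods, surpasses flat agents, and shows positive scaling trends in large-scale environments.
Abstract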
Hierarchical reinforcement learning (RL) has the potential to enable effective decision-making over long timescales. Existing approaches, while promising, have yet to realize the benefits of large-scale training. In this work, we identify and solve several key challenges in scaling online hierarchical RL to high-throughput environments. We propose Scalable Option Learning (SOL), a highly scalable hierarchical policy gradient algorithm which achieves a ~35x higher throughput compared to existing hierarchical methods. To demonstrate SOL's performance and scalability, we train hierarchical agents using 30 billion frames of experience on the complex game of NetHack, significantly surpassing flat agents and demonstrating positive scaling trends. We also validate SOL on MiniHack and Mujoco environments, showcasing its general applicability.
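🎯 Motivation: Learning-based whole-body control has advanced, but most methods must be trained separately for each robot and generalize poorly, so a general control policy that fits many humanoid designs is valuable.
❓ Problem: Studies how a single training run can produce a control policy that works across different humanoid robots, with strong robustness and generalization across embodiments.
🔍 Analysis: With diverse cross-embodiment training, the policy transfers zero-shot to previously unseen robots, showing that a universal controller generalizes well in physical consistency and dynamic adaptation.
🛠️ Method: Proposes the XHugWBC training framework, which combines physics-consistent morphological randomization, semantically aligned observation and action spaces, and policy architectures that model morphological and dynamical properties.
📊 Data & experiments: Experiments on twelve simulated humanoids and seven real-world robots confirm strong generalization and robustness, including zero-shot adaptation to multiple robots.
⭐ Contributions: Delivers a scalable, general whole-body control framework for humanoids; proposes the key techniques for cross-embodiment training; and validates generalization and transfer in simulation and on real robots.
Abstract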
Learning-based whole-body controllers have become a key driver for humanoid robots, yet most existing approaches require robot-specific training. In this paper, we study the problem of cross-embodiment humanoid control and show that a single policy can robustly generalize across a wide range of humanoid robot designs with one-time training. We introduce XHugWBC, a novel cross-embodiment training framework that enables generalist humanoid control through: (1) physics-consistent morphological randomization, (2) semantically aligned observation and action spaces across diverse humanoid robots, and (3) effective policy architectures modeling morphological and dynamical properties. XHugWBC is not tied to any specific robot. Instead, it internalizes a broad distribution of morphological and dynamical characteristics during training. By learning motion priors from diverse randomized embodiments, the policy acquires a strong structural bias that supports zero-shot transfer to previously unseen robots. Experiments on twelve simulated humanoids and seven real-world robots demonstrate the strong generalization and robustness of the resulting universal controller.
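🎯 Motivation: In deep reinforcement learning, simply enlarging actor-critic networks destabilizes training and saturates performance, which limits model scaling.
❓ Problem: Proposes ScaleMoE, a scalable RL architecture that introduces Mixture-of-Experts modules to overcome the scaling bottleneck of monolithic architectures.
🔍 Analysis: Monolithic architectures such as SimBa and BRC improve with carefully designed inductive biases up to a certain size, but gains plateau as parameters grow further.
🛠️ Method: ScaleMoE integrates Mixture-of-Experts modules into the actor and critic of existing continuous control algorithms and uses two gating schemes: aggregation of expert outputs and feature-level fusion.
📊 Data & experiments: Experiments on the DeepMind Control Suite, MetaWorld, and HumanoidBench show that increasing the number of experts (up to 64) yields substantial gains and outperforms monolithic models with more parameters.
⭐ Contributions: ScaleMoE offers an efficient way to scale deep RL and improves performance and scalability in continuous control.
Abstract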
Scaling networks remains a bottleneck in deep reinforcement learning (RL): simply enlarging actor–critic networks destabilizes training and soon saturates performance. Although recent monolithic architectures such as SimBa and BRC have shown that carefully designed inductive biases can enable positive scaling up to a certain size, their improvements plateau once model parameters grow further. This work introduces ScaleMoE, a scalable RL architecture that integrates Mixture-of-Experts (MoE) modules into both the actor and critic of modern continuous control algorithms. Two complementary gating schemes are studied: output-level aggregation of per-expert policies and Q-functions, and feature-level fusion of expert representations before a shared head. We instantiate ScaleMoE on two representative monolithic RL baselines: the single-task method SimBa and the multi-task method BRC. Experiments across the DeepMind Control Suite, MetaWorld, and HumanoidBench show that progressively increasing the number of experts (up to 64) yields substantial improvements in returns, significantly outperforming monolithic networks of comparable or even greater parameter counts. Results demonstrate that ScaleMoE provides an efficient and effective scaling axis for deep RL in continuous control.
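The two gating schemes named in the abstract can be contrasted in a toy critic. This is a minimal sketch, not the ScaleMoE implementation: the linear experts, the `tanh` features, and every function and parameter name below are assumptions chosen for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(w, x):
    return sum(a * b for a, b in zip(w, x))

def output_level_q(x, gate_w, expert_w):
    """Each expert produces its own Q-value; the gate mixes the Q-values."""
    gates = softmax([dot(w, x) for w in gate_w])
    return sum(g * dot(w, x) for g, w in zip(gates, expert_w))

def feature_level_q(x, gate_w, expert_feats, head_w):
    """Each expert produces a feature vector; the gate mixes features,
    and a single shared head maps the fused features to one Q-value."""
    gates = softmax([dot(w, x) for w in gate_w])
    feats = [[math.tanh(dot(row, x)) for row in W] for W in expert_feats]
    fused = [sum(g * f[d] for g, f in zip(gates, feats))
             for d in range(len(head_w))]
    return dot(head_w, fused)

x = [1.0, 2.0]                      # toy state-action input
uniform_gate = [[0.0, 0.0], [0.0, 0.0]]
q_out = output_level_q(x, uniform_gate, [[1.0, 0.0], [0.0, 1.0]])
q_feat = feature_level_q(x, uniform_gate, [[[1.0, 0.0]], [[0.0, 0.0]]], [2.0])
```

With a uniform gate, the output-level variant averages the per-expert Q-values, while the feature-level variant averages expert features before the shared head is applied.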
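🎯 Motivation: Large language model agents are limited by context length on long-horizon tasks, and existing frameworks mostly rely on manually defined context engineering pipelines, which are inefficient.
❓ Problem: Proposes a framework in which the agent actively manages its working context, addressing context redundancy and task decomposition in long-horizon tasks.
🔍 Analysis: Existing agent frameworks usually manage context through multi-agent setups or post-hoc summarization and cannot proactively adapt the process to task needs.
🛠️ Method: Introduces Context Folding, which lets the agent branch and fold its context dynamically, and FoldPO, a reinforcement learning framework with specific process rewards that encourage learning context management.
📊 Data & experiments: On complex long-horizon tasks, the method matches baseline performance with a smaller context and significantly outperforms models restricted to the same context size.
⭐ Contributions: Proposes a new context management mechanism with automated task decomposition, improving efficiency on long-horizon tasks while reducing context usage.
Abstract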
Large language model (LLM) agents are fundamentally constrained by context length on long-horizon tasks. Existing agent frameworks usually rely on manually defined context engineering pipelines, such as multi-agent or post-hoc summary. We introduce Context Folding, a framework that empowers agents to actively manage their working context. An agent can procedurally branch into a sub-trajectory to handle a subtask and then fold it upon completion, collapsing the intermediate steps while retaining a concise summary of the outcome. To make this behavior learnable, we propose FoldPO, an end-to-end reinforcement learning framework with specific process rewards to encourage effective task decomposition and context management. On complex long-horizon tasks, our agent matches the performance of baselines while using an active context up to 10x smaller, and significantly outperforms models constrained to the same context size.
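The branch-then-fold behavior described above can be pictured as a small data structure. This is a hypothetical sketch under simple assumptions (messages are plain strings, branches nest like a stack); the class and method names are illustrative and do not come from the paper.

```python
class FoldingContext:
    """Working context that can branch into a sub-trajectory and fold it."""

    def __init__(self):
        self.messages = []
        self._branch_starts = []  # stack of indices where open branches began

    def append(self, message):
        self.messages.append(message)

    def branch(self, subtask):
        # Remember where the sub-trajectory starts, then record the subtask.
        self._branch_starts.append(len(self.messages))
        self.messages.append(f"[branch] {subtask}")

    def fold(self, summary):
        # Drop every intermediate step of the innermost branch and keep
        # only a concise summary of its outcome.
        start = self._branch_starts.pop()
        del self.messages[start:]
        self.messages.append(f"[folded] {summary}")

ctx = FoldingContext()
ctx.append("task: fix the failing test")
ctx.branch("locate the bug")
for step in ("open parser.py", "read traceback", "inspect tokenizer"):
    ctx.append(step)
ctx.fold("bug is in the parser's handling of empty input")
```

After the fold, the active context holds two entries instead of six, which is the kind of reduction that the process rewards in FoldPO are meant to encourage.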
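🎯 Motivation: Vision-language models are prone to visual perception errors and hallucinations in complex reasoning tasks, which lowers answer accuracy. Existing reinforcement learning methods with verifiable rewards help but waste sampling budget and suffer from sparse rewards.
❓ Problem: Proposes a framework that uses mutual information to screen out failing trajectories early, at the visual stage, to allocate budget better. It also provides a separate mutual-information reward so that perception failures and reasoning failures are no longer conflated.
🔍 Analysis: Most of the sampling budget is spent on trajectories doomed to fail because of visual description errors. Sparse rewards cannot tell whether an error comes from perception or from reasoning.
🛠️ Method: Designs MIRL, a decoupled framework that uses mutual information as a cheap pre-screening signal to fork high-potential trajectories and optimizes visual perception independently through decoupled training.
📊 Data & experiments: Validated on six vision-language reasoning benchmarks, reaching 70.22% average accuracy while pre-screening reduces complete trajectory sampling by 25%.
⭐ Contributions: Proposes mutual-information pre-screening and decoupled training to improve vision-language model performance, resource efficiency, and answer accuracy.
Abstract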
Vision-Language Models (VLMs) frequently suffer from visual perception errors and hallucinations that compromise answer accuracy in complex reasoning tasks. Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising solution by optimizing policies using answer correctness signals. Despite their effectiveness, prevailing RLVR methods face two critical limitations. First, much of the sampling budget is wasted on trajectories doomed to fail due to early visual description errors. Second, sparse rewards cannot distinguish whether failures stem from visual perception or reasoning stages. We introduce MIRL, a decoupled framework that addresses both limitations by leveraging mutual information (MI) between generated descriptions and visual inputs as a cheap pre-screening signal. This enables intelligent budget allocation toward high-potential trajectories via forking, while decoupled training provides independent MI-based rewards for visual perception optimization, resolving reward blindness. Experiments on six vision-language reasoning benchmarks demonstrate that MIRL achieves 70.22\% average accuracy and successfully surpasses the performance of sampling 16 complete trajectories using only 10 pre-samples with top-6 selection (25\% fewer complete trajectories). Our code is available at: https://anonymous.4open.science/r/mirl-main/.
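🎯 Motivation: Vision-language models (VLMs) show strong multimodal reasoning, but reinforcement learning fine-tuning (RLFT) is too costly for broad use. Existing methods do not exploit an intrinsic property of multimodal models: dynamic text-vision alignment.
❓ Problem: Explores how to turn text-vision alignment into a training signal that makes RLFT more efficient.
🔍 Analysis: Analyzes how the model plans to attend, actually attends, and ideally should attend during reasoning, and derives two lightweight metrics: Predictive View Accuracy (PVA) for sample difficulty and Reasoning View Accuracy (RVA) for chain-of-thought quality.
🛠️ Method: Proposes FOCUS-RL, which uses the alignment signals for an automated data curriculum and dense reasoning supervision and can be integrated into any VLM.
📊 Data & experiments: Validated on six benchmarks and multiple VLMs; FOCUS-RL converges 2.5x to 4x faster than vanilla GRPO and improves accuracy by 4.4 on average.
⭐ Contributions: Turns text-vision alignment into a training signal and builds the FOCUS-RL framework, which accelerates RLFT for VLMs and improves performance.
Abstract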
Although Reinforcement Learning Fine-Tuning (RLFT) applied to Vision-Language Models (VLMs) substantially enhances multimodal reasoning capabilities, its prohibitive training cost limits broad adoption. Surprisingly, most existing methods simply port Large Language Model (LLM) RLFT techniques to VLMs, while ignoring an intrinsic property of multimodal models: their dynamic text–vision alignment. We ask a new question: Can this intrinsic alignment be turned into a training signal that makes VLM RLFT more efficient? We analyze how a VLM plans to attend, actually attends, and ideally should attend during reasoning, and derive two lightweight metrics from these patterns. Predictive View Accuracy (PVA) estimates sample difficulty, and Reasoning View Accuracy (RVA) reflects the quality of chain-of-thought (CoT) reasoning. These alignment signals enable automated data curriculum and dense reasoning supervision. We introduce FOCUS-RL, a plug-and-play framework that can be seamlessly integrated into any VLM and dramatically boosts RLFT training efficiency. FOCUS-RL achieves 2.5x–4x faster convergence over vanilla GRPO and consistent accuracy gains (+4.4 on average) across six different benchmarks and multiple VLM families.
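🎯 Motivation: Existing reinforcement learning methods for LLM reasoning optimize the policy at the level of individual tokens or whole sequences, which does not match the natural step-wise structure of reasoning and leads to poor credit assignment and unstable training.
❓ Problem: Proposes a reinforcement learning paradigm that uses coherent reasoning steps, rather than tokens or sequences, as the basic unit of policy update, targeting credit assignment and training stability in multimodal reasoning tasks.
🔍 Analysis: Existing methods ignore the step-wise nature of reasoning, so they fail to capture semantic boundaries efficiently in complex tasks, hurting accuracy and training consistency.
🛠️ Method: Designs Segment-Aligned Policy Optimization (SAPO), which introduces a step-wise Markov decision process abstraction over reasoning steps, with segment-level value estimation, advantage computation, and importance sampling aligned with reasoning boundaries.
📊 Data & experiments: On representative reasoning benchmarks, SAPO outperforms token-level and sequence-level optimization in accuracy, training stability, and value estimation consistency.
⭐ Contributions: Proposes SAPO, an RL update scheme matched to the structure of reasoning, validates it on complex reasoning tasks, and commits to releasing code and models for reproducibility.
Abstract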
Existing reinforcement learning approaches for Large Language Models typically perform policy optimization at the granularity of individual tokens or entire response sequences. However, such formulations often misalign with the natural step-wise structure of reasoning processes, leading to suboptimal credit assignment and unstable training in multi-modal reasoning tasks. To bridge this gap, we propose Segment-Aligned Policy Optimization (SAPO), a novel reinforcement learning paradigm that treats coherent reasoning steps, rather than tokens or full sequences, as the fundamental units of policy update. SAPO introduces a step-wise Markov decision process abstraction over reasoning segments, accompanied by segment-level value estimation, advantage computation, and importance sampling mechanisms that are semantically aligned with reasoning boundaries. Experiments on representative reasoning benchmarks demonstrate that SAPO consistently outperforms token-level and sequence-level policy optimization methods, achieving significant accuracy improvements while exhibiting better training stability and value estimation consistency. Our work underscores the importance of aligning reinforcement learning updates with the intrinsic structure of reasoning, paving the way for more efficient and semantically grounded policy optimization in complex reasoning tasks. Code and models will be released to ensure full reproducibility.
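🎯 Motivation: Reinforcement learning improves LLM performance on reasoning tasks, especially mathematics, but the gains usually come with reduced outcome diversity.
❓ Problem: Concentrating probability mass on a few solutions suppresses diversity, so a method is needed that improves reasoning while preserving diverse solutions.
🔍 Analysis: Analyzes the contribution of a single trajectory to language model diversity within a distribution perturbation framework and proves a monotonicity property: rarer trajectories contribute more to global diversity.
🛠️ Method: Proposes a set-level diversity objective based on kernel similarity and uses leave-one-out marginal contributions as a plug-in advantage shaping term for policy optimization.
📊 Data & experiments: Extensive experiments across model scales show gains over strong baselines in both Pass@1 and Pass@K on multiple benchmarks.
⭐ Contributions: Proposes a diversity-preserving optimization method for LLM reasoning, backed by theory and extensive experiments.
Abstract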
Reinforcement learning with verifiable rewards has shown notable effectiveness in enhancing the reasoning performance of large language models (LLMs), especially in mathematics tasks. However, such improvements often come with reduced outcome diversity, where the model concentrates probability mass on a narrow set of solutions. Motivated by diminishing-returns principles, we introduce a set-level diversity objective defined over sampled trajectories using kernelized similarity. Our approach derives a leave-one-out marginal contribution for each sampled trajectory and integrates this objective as a plug-in advantage shaping term for policy optimization. We further investigate the contribution of a single trajectory to language model diversity within a distribution perturbation framework. This analysis theoretically confirms a monotonicity property, proving that rarer trajectories yield consistently higher marginal contributions to the global diversity. Extensive experiments across a range of model scales demonstrate the effectiveness of our proposed algorithm, consistently outperforming strong baselines in both Pass@1 and Pass@K across various benchmarks.
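The leave-one-out idea can be illustrated with a toy set-level diversity score. This sketch is an assumption-laden stand-in, not the paper's objective: diversity is taken as the sum of pairwise dissimilarities, the kernel is Jaccard similarity over whitespace tokens, and the shaping coefficient `lam` is hypothetical.

```python
def jaccard(a, b):
    """Stand-in kernel: token-set overlap between two sampled responses."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

def diversity(samples, kernel=jaccard):
    """Toy set-level diversity: sum of pairwise dissimilarities."""
    n = len(samples)
    return sum(1.0 - kernel(samples[i], samples[j])
               for i in range(n) for j in range(i + 1, n))

def loo_contributions(samples, kernel=jaccard):
    """Marginal contribution of each sample: D(S) - D(S without sample i)."""
    total = diversity(samples, kernel)
    return [total - diversity(samples[:i] + samples[i + 1:], kernel)
            for i in range(len(samples))]

def shaped_advantages(advantages, samples, lam=0.1):
    """Add the centered diversity contribution to each advantage."""
    contrib = loo_contributions(samples)
    centre = sum(contrib) / len(contrib)
    return [a + lam * (c - centre) for a, c in zip(advantages, contrib)]

samples = ["x = 2 so y = 4", "x = 2 so y = 4", "factor the quadratic first"]
contrib = loo_contributions(samples)
shaped = shaped_advantages([1.0, 1.0, 1.0], samples)
```

In this example the two duplicate solutions each contribute less than the distinct third one, so the third trajectory receives the larger shaped advantage, in line with the monotonicity property stated in the abstract.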
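🎯 Motivation: Reinforcement Learning with Verifiable Rewards (RLVR) commonly optimizes large reasoning models with policy-gradient methods, but with few generations per prompt the mean reward is estimated poorly, which inflates gradient variance.
❓ Problem: Focuses on reducing the variance of the gradient estimate through a better baseline design, to improve the stability and efficiency of RLVR training.
🔍 Analysis: Standard methods center rewards with the empirical mean of each prompt in the batch, but in the low-generation regime this per-prompt mean is inaccurate.
🛠️ Method: Proposes a baseline built on shrinkage estimators that combine per-prompt and across-prompt means, replacing the per-prompt empirical mean and lowering the variance of the gradient estimator.
📊 Data & experiments: Across RLVR settings, the shrinkage baseline consistently outperforms the standard empirical-mean baseline, lowering gradient update variance and improving training stability.
⭐ Contributions: Introduces a shrinkage baseline that needs no extra hyperparameters or computation, giving existing RLVR algorithms a drop-in tool with theoretical and empirical support.
Abstract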
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for post-training large reasoning models (LRMs) using policy-gradient methods such as GRPO. To stabilize training, these methods typically center trajectory rewards by subtracting the empirical mean for each prompt. Statistically, this centering acts as a control variate (or baseline), reducing the variance of the policy-gradient estimator. Typically, the mean reward is estimated using per-prompt empirical averages for each prompt in a batch. Drawing inspiration from Stein’s paradox, we propose using \emph{shrinkage estimators} that combine \emph{per-prompt} and \emph{across-prompt} means to improve the overall per-prompt mean estimation accuracy---particularly in the low-generation regime typical of RLVR. Theoretically, we construct a shrinkage-based baseline that provably yields lower-variance policy-gradient estimators across algorithms. Our proposed baseline serves as a drop-in replacement for existing per-prompt mean baselines, requiring no additional hyper-parameters or computation. Empirically, shrinkage baselines consistently outperform standard empirical-mean baselines, leading to lower-variance gradient updates and improved training stability.
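To make the shrinkage idea concrete, here is a minimal sketch that pulls each per-prompt mean toward the across-prompt mean. The abstract does not give the estimator's form, so the positive-part James-Stein coefficient below is an assumption taken from the classical statistics literature, as are the function name and the equal-group-size requirement.

```python
from statistics import mean, variance

def shrinkage_baselines(rewards_per_prompt):
    """Return (baselines, shrink) for equally sized groups of rewards.

    shrink = 1 keeps the per-prompt means; shrink = 0 uses the grand mean.
    """
    k = len(rewards_per_prompt)            # number of prompts
    n = len(rewards_per_prompt[0])         # generations per prompt (assumed equal)
    prompt_means = [mean(r) for r in rewards_per_prompt]
    grand_mean = mean(prompt_means)
    # Estimated noise variance of one per-prompt mean (pooled across prompts).
    noise = mean(variance(r) for r in rewards_per_prompt) / n
    spread = sum((m - grand_mean) ** 2 for m in prompt_means)
    shrink = 1.0
    if k >= 4 and spread > 0.0:
        # Positive-part James-Stein factor for shrinking toward the grand mean.
        shrink = max(0.0, 1.0 - (k - 3) * noise / spread)
    baselines = [grand_mean + shrink * (m - grand_mean) for m in prompt_means]
    return baselines, shrink

rewards = [
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
]
baselines, shrink = shrinkage_baselines(rewards)
advantages = [[r - b for r in group] for group, b in zip(rewards, baselines)]
```

Each baseline lands between the prompt's own mean and the grand mean; the noisier the per-prompt means are relative to their spread, the harder they are pulled together.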
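🎯 Motivation: Simultaneous speech translation must translate source speech in real time under non-monotonic dependencies; traditional methods rely on language-specific, suboptimal aligned data that is hard to collect.
❓ Problem: Removes the dependence on word-level aligned data by training on sentence-level alignments to reach low-latency, high-quality translation.
🔍 Analysis: Existing methods that use automatically aligned speech data are language-dependent and suboptimal, and the model must balance translation accuracy against latency.
🛠️ Method: Proposes Hibiki-Zero, which is first trained with supervision on sentence-level aligned data and then uses reinforcement learning to optimize latency.
📊 Data & experiments: On five X-to-English translation tasks, experiments report translation accuracy, latency, and cross-lingual transfer; a 15-hour multilingual speech translation benchmark is released.
⭐ Contributions: Achieves multilingual real-time speech translation without word-level alignments, reaching state-of-the-art accuracy, latency, voice naturalness, and language transfer, and releases models and an evaluation benchmark.
Abstract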
Simultaneous speech translation is the task of translating source speech into a target language in real-time. Given that the dependencies between source and target words are non-monotonic (e.g. the word order can change between German and English), this means learning to jointly align and translate. This task has been traditionally tackled through supervised training on aligned data, and as collecting such data is challenging, this relies on synthetic data with automatic alignment. The latter relies on heuristics that are language-specific and suboptimal. We instead propose Hibiki-Zero, a model for simultaneous speech translation trained without word-level alignments between source and target speech. To do so, we train on sentence-level aligned data so that the model learns to perform speech translation but with high latency. We then introduce a novel reinforcement learning strategy relying on GRPO to optimize the translation latency of the model while retaining its translation capabilities. After supervised and post-training, Hibiki-Zero performs multilingual simultaneous translation with state-of-the-art translation accuracy, latency, voice transfer and naturalness across five X-to-English tasks. Moreover, we demonstrate that our model can be easily finetuned to support another language as input with less than 1000h of speech data. We provide examples ([hibiki-zero-s2st.github.io](https://hibiki-zero-s2st.github.io)) as well as models and release a benchmark containing 15h of multilingual data for speech translation evaluation.
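🎯 Motivation: Reinforcement learning improves LLM capabilities, but in Mixture-of-Experts (MoE) models the routing mechanism tends to destabilize training and can even cause collapse.
❓ Problem: Analyzes the difference in routing behavior between training and inference and addresses training-inference consistency in MoE models to improve stability.
🔍 Analysis: Routing behavior differs notably between training and inference; this inconsistency increases the policy KL divergence and can trigger training collapse.
🛠️ Method: Proposes Rollout Routing Replay (R3), which replays the routing distributions from inference during training to reduce the KL divergence and remove the behavioral gap without hurting training efficiency.
📊 Data & experiments: Validated across multiple settings; R3 stabilizes training, prevents collapse, and outperforms several strong baselines.
⭐ Contributions: Addresses RL instability in MoE models by aligning training and inference routing, offering a new approach to RL stability that is broadly compatible with other methods.
Abstract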
Reinforcement learning (RL) has emerged as a crucial approach for enhancing the capabilities of large language models. However, in Mixture-of-Experts (MoE) models, the routing mechanism often introduces instability, even leading to catastrophic RL training collapse. We analyze the training-inference consistency of MoE models and identify a notable discrepancy in routing behaviors between the two phases. To address this issue, we propose \textbf{Rollout Routing Replay (R3)}, a novel and effective method that records routing distributions from the inference engine and replays them during training. R3 significantly reduces training-inference policy KL divergence and mitigates extreme discrepancies without compromising training speed. Extensive experiments on various settings confirm that R3 succeeds in stabilizing RL training, preventing collapse and outperforming strong baselines. R3 is orthogonal to most policy optimization algorithm improvements, allowing it to be used in conjunction with them. We believe this work can offer a new solution for stabilizing RL in MoE models.
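A toy router shows why replaying rollout routing removes the training-inference mismatch. This is a sketch of one plausible reading of the abstract: the expert selection recorded at rollout time is reused during training, while gate weights are recomputed from the training-time logits. The function names and this split between selection and weighting are assumptions, not the released implementation.

```python
import math

def top_k(logits, k):
    """Indices of the k largest router logits."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

def gate_weights(logits, experts):
    """Softmax over the selected experts only."""
    m = max(logits[i] for i in experts)
    exps = {i: math.exp(logits[i] - m) for i in experts}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def route(train_logits, k, replayed_experts=None):
    """Select experts from training logits, or reuse the rollout's selection."""
    if replayed_experts is None:
        experts = top_k(train_logits, k)
    else:
        experts = replayed_experts
    return gate_weights(train_logits, experts)

# Small numerical drift between engines flips the top-2 selection.
rollout_logits = [2.0, 1.9, 0.1, -1.0]
train_logits = [1.9, 2.0, 2.05, -1.0]
recorded = top_k(rollout_logits, k=2)          # stored with the rollout
without_replay = route(train_logits, k=2)
with_replay = route(train_logits, k=2, replayed_experts=recorded)
```

Without replay the training pass activates a different expert set than the one that generated the tokens; with replay the same experts are used in both phases.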
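🎯 Motivation: PPO is sensitive in continuous control tasks: training dynamics depend heavily on the neural networks that approximate the policy and value functions, so a stabilization method is needed.
❓ Problem: Reduces instability and performance fluctuation during training by stabilizing the actor-critic geometry.
🔍 Analysis: Theoretical analysis shows that existing methods fall short in one-step bootstrapping error, directional alignment of action updates, and occupancy mass over high-novelty regions.
🛠️ Method: Proposes SPPO, which combines a CKA-based constraint on critic representations, a no-flip regularizer on actor updates, and KDE-driven advantage shaping.
📊 Data & experiments: On standard continuous control benchmarks, SPPO outperforms PPO and several stabilization variants; ablations and training-dynamics analyses support the conclusions.
⭐ Contributions: Develops SPPO, which delivers consistent gains, provides an effective stabilization mechanism, and quantifies the individual and complementary effects of each component.
Abstract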
Proximal Policy Optimization (PPO) is widely used in continuous-control tasks, yet its performance is often highly sensitive to training dynamics when neural networks approximate the policy and value functions. This paper introduces SPPO, a drop-in augmentation that preserves PPO’s clipped objective and network architecture while stabilizing actor-critic geometry via three mechanisms: (i) a CKA-based constraint on critic representations, (ii) a no-flip regularizer on actor updates, and (iii) KDE-driven advantage shaping. Theoretical analysis shows that these mechanisms tighten bounds on one-step bootstrapping error, improve expected directional alignment of action updates, and ensure non-decreasing occupancy mass over high-novelty regions. Experiments on standard continuous-control benchmarks demonstrate consistent gains over PPO and recent PPO stabilization methods. Ablation studies further quantify the contribution and complementary effects of each component. Additional training-dynamics analyses indicate that SPPO reduces instability and oscillations in both actor and critic updates, improving training stability and final performance.
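🎯 Motivation: Diffusion large language models (dLLMs) are severely unstable when trained with Group Relative Policy Optimization (GRPO), which limits the use of reinforcement learning to improve reasoning.
❓ Problem: Analyzes and addresses the noise in GRPO training caused by finite-sample estimates, reducing gradient spikes and policy drift.
🔍 Analysis: Theory and experiments show that noisy importance ratios cause gradient spikes and policy drift, forming a self-reinforcing instability loop that further increases estimation variance.
🛠️ Method: Proposes StableDRL, which uses unconditional clipping to suppress gradient spikes and self-normalization to keep gradients within the convex hull of per-sample updates, and extends to block-wise diffusion models through a staircase attention mechanism.
📊 Data & experiments: Improves on the prior best full-attention baseline by 6% on MATH500 and on block-diffusion baselines by 25.6% on AIME.
⭐ Contributions: Enables stable full-parameter reinforcement learning for dLLMs for the first time, with a theoretically grounded, state-of-the-art framework.
Abstract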
Diffusion Large Language Models (dLLMs) often exhibit severe instability during Group Relative Policy Optimization (GRPO) training, limiting the effectiveness of reinforcement learning for improving reasoning capabilities. In dLLMs, the importance ratios used by GRPO are derived from finite-sample estimates rather than exact likelihoods, making them inherently noisy. In this paper, we show that GRPO is highly sensitive to this noise, which drives training instability. Through theoretical analysis and empirical evidence, we identify a self-reinforcing instability loop in which noisy importance ratios induce gradient spikes and policy drift, further amplifying future importance ratio estimation variance. To address this issue, we propose StableDRL, a novel reinforcement learning framework for dLLMs. StableDRL stabilizes training via (i) unconditional clipping to suppress outlier-induced gradient spikes, and (ii) self-normalization to constrain gradients within the convex hull of per-sample updates. We further extend StableDRL to block-wise diffusion models via a staircase attention mechanism. StableDRL is the first method that enables stable, full-parameter reinforcement learning for dLLMs. It achieves state-of-the-art performance, outperforming prior best full-attention baselines by 6% on MATH500 and block-diffusion baselines by 25.6% on AIME.
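The two stabilizers can be sketched on toy numbers. This is an illustrative reading of the abstract rather than the StableDRL code: "unconditional clipping" is taken to mean clipping every ratio regardless of the advantage sign, and "self-normalization" to mean dividing by the sum of clipped ratios; the clipping bounds and function names are assumptions.

```python
def stable_weights(ratios, low=0.8, high=1.2):
    """Clip every importance ratio, then normalize so the weights sum to 1."""
    clipped = [min(max(r, low), high) for r in ratios]
    total = sum(clipped)
    return [c / total for c in clipped]

def aggregate(per_sample_grads, ratios):
    """Weighted average of per-sample gradients (a convex combination)."""
    weights = stable_weights(ratios)
    dim = len(per_sample_grads[0])
    return [sum(w * g[d] for w, g in zip(weights, per_sample_grads))
            for d in range(dim)]

ratios = [0.9, 1.1, 50.0]                 # the last estimate is a noisy outlier
grads = [[1.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
weights = stable_weights(ratios)
update = aggregate(grads, ratios)
```

Because the weights are positive and sum to one, the aggregated update cannot leave the convex hull of the per-sample gradients, and the outlier ratio of 50 ends up with a weight of 0.375 rather than dominating the step.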
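🎯 Motivation: Policies learned by continuous actor-critic methods often oscillate at high frequency and are hard to deploy. Existing methods enforce smoothness by regularizing the policy output directly, which does not address the root cause.
❓ Problem: Identifies the root cause of policy non-smoothness and stabilizes learning through the differential geometry of the critic, thereby improving policy smoothness.
🔍 Analysis: Proves that policy sensitivity is governed by the ratio of the Q-function's mixed partial derivative (noise sensitivity) to its action-space curvature (signal distinctness).
🛠️ Method: Proposes PAVE, a critic-side framework that treats the critic as a scalar field and regularizes its Q-gradient field, minimizing Q-gradient volatility to stabilize the learning signal while preserving local curvature.
📊 Data & experiments: Without modifying the actor, PAVE achieves smoothness and robustness comparable to policy-side regularization methods while maintaining competitive task performance.
⭐ Contributions: Shows in theory and practice that the critic's differential geometry controls policy smoothness and proposes PAVE, an efficient critic-side regularization framework.
Abstract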
Policies learned via continuous actor-critic methods often exhibit erratic, high-frequency oscillations, making them unsuitable for physical deployment. Current approaches attempt to enforce smoothness by directly regularizing the policy's output. We argue that this approach treats the symptom rather than the cause. In this work, we theoretically establish that policy non-smoothness is fundamentally governed by the differential geometry of the critic. By applying implicit differentiation to the actor-critic objective, we prove that the sensitivity of the optimal policy is bounded by the ratio of the Q-function's mixed-partial derivative (noise sensitivity) to its action-space curvature (signal distinctness). To empirically validate this theoretical insight, we introduce PAVE (Policy-Aware Value-field Equalization), a critic-centric regularization framework that treats the critic as a scalar field and stabilizes its induced action-gradient field. PAVE rectifies the learning signal by minimizing the Q-gradient volatility while preserving local curvature. Experimental results demonstrate that PAVE achieves smoothness and robustness comparable to policy-side smoothness regularization methods, while maintaining competitive task performance, without modifying the actor.
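The implicit-differentiation step mentioned in the abstract can be written out as follows. This is a standard implicit-function-theorem sketch under stated assumptions (a deterministic greedy action $a^*(s)$, a twice-differentiable $Q$, and a negative-definite action Hessian at $a^*$); the paper's exact statement and constants may differ.

```latex
% First-order optimality of the greedy action a^*(s):
\nabla_a Q\bigl(s, a^*(s)\bigr) = 0 .

% Differentiating with respect to s (chain rule):
\nabla^2_{as} Q + \nabla^2_{aa} Q \,\frac{\partial a^*}{\partial s} = 0
\quad\Longrightarrow\quad
\frac{\partial a^*}{\partial s} = -\bigl(\nabla^2_{aa} Q\bigr)^{-1} \nabla^2_{as} Q .

% Taking operator norms, with \nabla^2_{aa} Q negative definite at a^*:
\left\lVert \frac{\partial a^*}{\partial s} \right\rVert
\;\le\;
\frac{\lVert \nabla^2_{as} Q \rVert}{\lambda_{\min}\!\bigl(-\nabla^2_{aa} Q\bigr)} .
```

The numerator is the mixed partial derivative (how much the action gradient moves when the state is perturbed) and the denominator is the action-space curvature, which matches the ratio described in the abstract.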
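🎯 Motivation: Existing code reasoning methods ignore supervision of intermediate states and rely on final outputs, which leads to inconsistent reasoning and reward hacking.
❓ Problem: Addresses inconsistency and accuracy in code reasoning by supervising intermediate execution states and applying structured reinforcement learning.
🔍 Analysis: Existing methods cannot easily verify the correctness of the reasoning process and fail to use the intermediate state information available during execution.
🛠️ Method: Proposes StepCodeReasoner, which inserts print-based execution-trace anchors to supervise intermediate states and uses a bi-level reinforcement learning algorithm for cross-trajectory comparison and intermediate-stage reward assignment.
📊 Data & experiments: Achieves SOTA results on CRUXEval, LiveCodeBench, and REval, improving reasoning and generation accuracy over CodeReasoner-7B and GPT-4o.
⭐ Contributions: Introduces explicit execution modeling into code reasoning and proposes a bi-level reinforcement learning algorithm, improving both code reasoning and code generation.
Abstract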
Existing code reasoning methods primarily supervise final code outputs and ignore intermediate states, often leading to reward hacking where correct answers are obtained through inconsistent reasoning. We propose StepCodeReasoner, a framework that introduces explicit intermediate execution-state supervision. By automatically inserting structured print-based execution-trace anchors into code, the model is trained to predict runtime states at each step, transforming code reasoning into a verifiable, stepwise execution modeling problem. Building on this execution-aware method, we introduce Bi-Level GRPO, a reinforcement learning algorithm for structured credit assignment at two levels: inter-trajectory, comparing alternative execution paths, and intra-trajectory, rewarding intermediate accuracy based on its impact on downstream correctness. Extensive experiments demonstrate that StepCodeReasoner achieves SOTA performance in code reasoning. In particular, our 7B model achieves 91.1\% on CRUXEval and 86.5\% on LiveCodeBench, outperforming the CodeReasoner-7B baseline (86.0\% and 77.7\%) and GPT-4o (85.6\% and 75.1\%). Furthermore, on the execution-trace benchmark REval, our model scores 82.9\%, outperforming baseline CodeReasoner-7B (72.3\%), its 14B counterpart (81.1\%), and GPT-4o (77.3\%). Our approach also improves code generation performance, demonstrating that explicit execution modeling enhances both code reasoning and code generation.
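🎯 Motivation: Large language model (LLM) agents rely on external tools such as search engines to solve complex multi-step problems, but their rollouts are structurally heterogeneous, producing markedly different behaviors and reward distributions.
❓ Problem: Existing policy gradient methods use a single global baseline, and the resulting cross-stratum bias distorts credit assignment and harms exploration.
🔍 Analysis: Differences in the number, placement, and outcomes of tool calls create large structural differences, so dissimilar behaviors are compared unfairly.
🛠️ Method: Proposes Stratified GRPO, whose core is Stratified Advantage Normalization (SAN): trajectories are partitioned into homogeneous strata by structural properties, advantages are computed locally within each stratum, and the result is linearly blended with the global estimate for robustness.
📊 Data & experiments: On factual QA and deep-research agent benchmarks, Stratified GRPO improves training reward, stability, and search policy quality over GRPO, by up to 12.6 points.
⭐ Contributions: Proposes a stratified policy gradient framework that removes cross-stratum bias to handle structural heterogeneity, establishing stratification as a principled and effective approach in RL for LLMs.
Abstract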
Large language model (LLM) agents increasingly rely on external tools such as search engines to solve complex, multi-step problems, yet their rollouts are structurally heterogeneous: variations in tool-call number, placement, and outcomes induce distinct behaviors and reward distributions. As a result, policy gradient methods with a single global baseline suffer from *cross-stratum bias*, an "apples-to-oranges" comparison that distorts credit assignment and impedes exploration. To address this issue, we propose *Stratified GRPO*. Its core component, *Stratified Advantage Normalization* (SAN), partitions trajectories into homogeneous strata based on structural properties and computes advantages locally within each stratum, ensuring comparisons only among true peers. We show that SAN eliminates cross-stratum bias, yields conditionally unbiased unit-variance estimates within strata, and preserves the global unbiasedness and unit-variance properties of standard normalization, resulting in a more reliable learning signal. To improve robustness in finite-sample regimes, we further linearly blend SAN with the global estimator. Experiments on factual QA and deep-research agent benchmarks demonstrate that Stratified GRPO consistently outperforms GRPO by up to 12.6 points, achieving higher training rewards, improved training stability, and more effective search policies. These results establish stratification as a principled remedy for structural heterogeneity in RL for LLM search agents.
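A small sketch makes Stratified Advantage Normalization concrete. It follows the description in the abstract (normalize within strata, then blend linearly with the global estimate), but the stratum key (number of tool calls), the blend weight `alpha`, and the `eps` stabilizer are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev

def normalize(values, eps=1e-8):
    """Standard advantage normalization: zero mean, roughly unit variance."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / (sd + eps) for v in values]

def stratified_advantages(rewards, strata, alpha=1.0):
    """alpha = 1 gives pure SAN; alpha = 0 gives global normalization."""
    global_adv = normalize(rewards)
    groups = defaultdict(list)
    for i, key in enumerate(strata):
        groups[key].append(i)
    san = [0.0] * len(rewards)
    for indices in groups.values():
        local = normalize([rewards[i] for i in indices])
        for i, a in zip(indices, local):
            san[i] = a
    return [alpha * s + (1.0 - alpha) * g for s, g in zip(san, global_adv)]

rewards = [0.0, 1.0, 0.2, 0.8]
tool_calls = [0, 0, 2, 2]        # hypothetical stratum key: tool calls per rollout
san_only = stratified_advantages(rewards, tool_calls, alpha=1.0)
blended = stratified_advantages(rewards, tool_calls, alpha=0.5)
```

Under pure SAN, the rollout with reward 0.8 is judged only against the other two-tool-call rollout, so it receives the same advantage as the reward-1.0 rollout in the zero-tool-call stratum; the blended version reintroduces part of the global comparison.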
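🎯 Motivation: Reinforcement learning from verifiable rewards has become an important paradigm for improving LLM reasoning, but it involves a trade-off between optimization stability and learning efficiency.
❓ Problem: Resolves the tension between sequence-level optimization, which under-uses local signals, and token-level optimization, which brings high variance and instability, to improve gradient utilization and sample efficiency.
🔍 Analysis: Token-level importance weighting supports fine-grained credit assignment but its high variance destabilizes parameter updates; sequence-level optimization is more stable but cannot fully exploit local signals.
🛠️ Method: Proposes TGPO, which uses sequence anchors to stabilize token-wise updates and a trust-based information gate to adaptively modulate token-level signals, retaining and reweighting gradients from imperfect trajectories.
📊 Data & experiments: Experiments on seven mathematical reasoning datasets and multiple model scales show that TGPO consistently improves learning efficiency and overall performance.
⭐ Contributions: Combines the strengths of sequence-level and token-level optimization in the TGPO policy optimization framework and validates its gains across datasets.
Abstract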
Reinforcement learning from verifiable rewards (RLVR) has become an important paradigm for enhancing the reasoning capabilities of large language models, while it also involves a persistent tradeoff between optimization stability and learning efficiency. Token-level importance weighting supports fine-grained credit assignment, but it often introduces high variance and unstable parameter updates, whereas sequence-level optimization provides more stable learning dynamics while failing to fully exploit informative local signals. We introduce **T**rust-**G**ated **P**olicy **O**ptimization (TGPO), an efficient policy optimization framework that integrates two complementary mechanisms, namely *sequence anchors* and *information gates*. TGPO aligns token-wise updates with a stable sequence-level reference, which reduces the influence of extreme local likelihood fluctuations on the gradient, and a trust-based information gate adaptively modulates the contribution of token-level signals. By retaining and reweighting gradients from imperfect trajectories rather than excluding them, TGPO improves gradient utilization and sample efficiency while maintaining stable optimization behavior. Empirical results across seven mathematical reasoning datasets and multiple model scales show that TGPO consistently enhances learning efficiency and overall performance in outcome-supervised reinforcement learning settings.
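🎯 Motivation: Reinforcement learning (RL) and supervised fine-tuning (SFT) are the two main ways to improve large language models, but they trade efficiency against capability retention: RL retains capabilities better but is costly, while SFT is efficient but prone to forgetting.
❓ Problem addressed: Proposes Trajectory-Mixed Supervision (TMS) to address supervision mismatch in SFT, reducing the divergence between the model's policy and static labels and improving retention.
🔍 Analysis: SFT is prone to mode collapse and forgetting, which stem from the divergence between the policy and the label data; experiments show that policy-label divergence strongly affects performance.
🛠️ Method: Builds a dynamic curriculum that generates supervision data from the model's own historical checkpoints, with no reward function or verifier; lowering policy-label divergence keeps performance stable.
📊 Data and experiments: Experiments on the MATH and GSM8K reasoning tasks and several instruction-following benchmarks show that TMS balances accuracy and retention efficiently and clearly outperforms standard and iterative SFT.
⭐ Main contributions: A reward-free dynamic supervision framework, TMS, that narrows the gap between SFT and RL, plus a mechanistic analysis showing that policy-label divergence drift predicts forgetting.
Abstract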
Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) are the two dominant paradigms for enhancing Large Language Model (LLM) performance on downstream tasks. While RL generally preserves broader model capabilities (retention) better than SFT, it comes with significant costs: complex reward engineering, instability, and expensive on-policy sampling. In contrast, SFT is efficient but brittle, often suffering from catastrophic forgetting due to **Supervision Mismatch**: the divergence between the model's evolving policy and static training labels. We address this trade-off with **Trajectory-Mixed Supervision (TMS)**, a reward-free framework that approximates the on-policy benefits of RL by creating a dynamic curriculum from the model's own historical checkpoints. TMS minimizes *Policy-Label Divergence (PLD)*, preventing the mode collapse that drives forgetting in standard SFT. Experiments across reasoning (MATH, GSM8K) and instruction-following benchmarks demonstrate that TMS effectively shifts the accuracy-retention Pareto frontier. While RL remains the gold standard for retention, TMS significantly outperforms standard and iterative SFT, bridging the gap to RL without requiring reward models or verifiers. Mechanistic analysis confirms that PLD drift accurately predicts forgetting, and that TMS successfully mitigates this drift.
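🎯 Motivation: Off-policy reinforcement learning is vulnerable to overestimation bias rooted in total value uncertainty, yet existing methods neglect the aleatoric component of that uncertainty.
❓ Problem addressed: Suppressing the transient but highly destructive surge of overestimation bias caused by the aleatoric component (the Aleatoric Impulse), so that learning does not get locked into suboptimal policies.
🔍 Analysis: Although transient, the Aleatoric Impulse can derail learning entirely and prevent convergence to the optimal solution.
🛠️ Method: Aleatoric Impulse Damping (AID) disentangles value uncertainty into epistemic and aleatoric components and recombines them with adaptive weights; the critic uses the result to build a pessimistic lower confidence bound that suppresses the bias, while the actor uses a symmetric upper confidence bound to drive efficient exploration.
📊 Data and experiments: Evaluated on the high-dimensional Gym-MuJoCo and DeepMind Control Suite benchmarks; integrated into Distributional Soft Actor-Critic as DSAC-AID, the method reaches state-of-the-art final performance.
⭐ Main contributions: First identification of the Aleatoric Impulse and its damaging effect on the learning trajectory; a first mechanism that improves RL through uncertainty decomposition and adaptive reweighting; and evidence that AID is effective across a range of high-dimensional tasks.
Abstract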
Off-policy reinforcement learning is vulnerable to overestimation bias, which is rooted in the total value uncertainty. However, existing methods typically misaddress this by targeting the epistemic component, neglecting the aleatoric component. We identify for the first time that this oversight fails to contain a massive bias surge, termed the **Aleatoric Impulse**. Although transient, this impulse fundamentally derails the learning trajectory, permanently locking the agent into suboptimal policies. To counteract this, we propose **A**leatoric **I**mpulse **D**amping **(AID)**, the first mechanism that models total value uncertainty by disentangling the return variance into epistemic and aleatoric components, followed by their adaptive weighted recombination. Leveraging this derived uncertainty, the critic constructs a pessimistic lower confidence bound to surgically suppress the impulse. Complementing this, the actor utilizes a symmetrical upper confidence bound to drive optimistic exploration, ensuring that the necessary pessimism does not compromise exploration efficiency. We integrate this mechanism into the Distributional Soft Actor-Critic algorithm to establish **DSAC-AID**. Extensive experiments on the high-dimensional Gym-MuJoCo and DeepMind Control Suite benchmarks demonstrate that it achieves state-of-the-art results in final performance.
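🎯 Motivation: Current text generation models produce incoherent long-range output and lack continuity in semantic expression.
❓ Problem addressed: Models text generation as a continuous-time latent dynamical process, avoiding the information discontinuities of discrete sequence generation.
🔍 Analysis: Examines how discrete generation methods fall short in semantic development and contextual consistency, and connects discrete token sequences to continuous semantic evolution theoretically.
🛠️ Method: Introduces a continuous latent dynamics model based on a neural ODE and optimizes it with reinforcement learning, combining task rewards with knowledge distillation from a pre-trained language model.
📊 Data and experiments: Experiments against discrete baselines show better generation coherence and long-context performance.
⭐ Main contributions: A new theoretical framework that combines text generation with continuous dynamical modeling, improving fluency and controllability in long-text generation.
Abstract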
We propose to model text generation as a continuous-time latent dynamical process, where token generation is formulated as a Markov Decision Process whose internal state evolves via a neural ODE. This formulation bridges discrete token sequences and continuous semantic evolution, providing a theoretically grounded approach for coherent long-range generation. The framework is optimized via reinforcement learning, maximizing a composite objective that integrates task-specific rewards with knowledge distillation from a powerful pre-trained language model. Experiments demonstrate that our method, Continuous-Time Latent Language Model (CT-LLM), outperforms discrete baselines in generation coherence and long-context performance, offering a new paradigm for fluid and controllable language generation.
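🎯 Motivation: Current large reasoning models underperform on spatial reasoning; their geometric and semantic consistency needs to improve without relying on externally annotated data as in conventional supervision.
❓ Problem addressed: A self-supervised RL framework that improves spatial reasoning, letting models achieve logical consistency and self-correction without labels.
🔍 Analysis: Many spatial reasoning capabilities are already present in pre-trained models but are not properly aligned; existing methods rely on supervised fine-tuning and overlook internal logical coherence.
🛠️ Method: Designs consistency verifiers (reward functions that check geometric and semantic consistency) and optimizes them with the OT-GRPO strategy, strengthening the model's internal chain of thought in a self-supervised setting.
📊 Data and experiments: Without annotated data, the method approaches the accuracy of models trained with ground-truth supervision and generalizes similarly across diverse tasks and domains.
⭐ Main contributions: A new label-free training method that uses consistency principles to improve large reasoning models on spatial reasoning and extend their cross-domain generalization.
Abstract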
Current Large Reasoning Models (LRMs) exhibit remarkable general capabilities but significantly underperform in spatial reasoning tasks. Existing approaches treat this gap as a knowledge deficit, relying on supervised fine-tuning (SFT) to ingest labeled data from external vision sources or synthetic engines. In contrast, we argue that for many tasks, spatial capabilities are already present in pre-trained LRMs but require alignment through principles of internal logical coherence. In this work, we propose a self-supervised reinforcement learning (RL) framework that targets the internal Chain-of-Thought (CoT) process without requiring ground-truth annotations. By formalizing the notion of consistency verifiers—reward functions that check for geometric and semantic consistency under transformations like flipping or swapping the order of objects in the question—and optimizing them via our new OT-GRPO strategy, a minimal-consistency matching variant of group relative policy optimization, we demonstrate that models can self-correct their spatial logic. Our results show that this label-free consistency training approaches the accuracy of models trained with ground-truth supervision and achieves similar generalization across diverse tasks and domains.
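A consistency verifier of the kind described, one that rewards agreement under swapping the order of objects in the question, might look like the sketch below. The relation vocabulary, the function name, and the 0/1 reward scale are illustrative assumptions; the abstract does not specify them.

```python
# Hypothetical relation vocabulary: each spatial relation and its inverse.
OPPOSITE = {
    "left": "right",
    "right": "left",
    "above": "below",
    "below": "above",
}


def swap_consistency_reward(answer_ab, answer_ba):
    """Label-free reward for a pair of answers to swapped questions.

    answer_ab: model answer to "Where is A relative to B?"
    answer_ba: model answer to "Where is B relative to A?"
    Returns 1.0 when the two answers are mutual inverses, else 0.0.
    No ground-truth label is consulted.
    """
    a = answer_ab.strip().lower()
    b = answer_ba.strip().lower()
    return 1.0 if OPPOSITE.get(a) == b else 0.0
```

The check needs only the model's own two answers, which is what makes the training signal self-supervised: "left" and "right" are consistent with each other whether or not either is correct.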
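🎯 Motivation: RL for large language models often collapses on long-horizon tasks because gradient variance explodes; a baseline is usually introduced for advantage computation to mitigate this.
❓ Problem addressed: Traditional value models are hard to optimize, standard group-based baselines overlook sequence heterogeneity, and classic optimal baseline theory neglects token heterogeneity and requires prohibitive gradient-based computation.
🔍 Analysis: Existing approaches such as group baselines handle heterogeneity across token sequences poorly, and large group sizes add compute cost and waste resources.
🛠️ Method: Derives the Optimal Token Baseline (OTB) from first principles and proposes the Logit-Gradient Proxy, which approximates the gradient norm efficiently from forward-pass probabilities alone.
📊 Data and experiments: On single-turn and tool-integrated reasoning tasks, the method trains stably with the group size reduced to $N=4$ and cuts token consumption by over 65% compared with large groups.
⭐ Main contributions: An optimal baseline that accounts for token heterogeneity, substantially lowering compute cost and improving training stability and resource efficiency for LLMs on long-horizon tasks.
Abstract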
Reinforcement Learning for Large Language Models (LLMs) often suffers from training collapse in long-horizon tasks due to exploding gradient variance. To mitigate this, a baseline is commonly introduced for advantage computation; however, traditional value models remain difficult to optimize, and standard group-based baselines overlook sequence heterogeneity. Although classic optimal baseline theory can achieve global variance reduction, it neglects token heterogeneity and requires prohibitive gradient-based computation. In this work, we derive the Optimal Token Baseline (OTB) from first principles, proving that gradient updates should be weighted inversely to their cumulative gradient norm. To ensure efficiency, we propose the Logit-Gradient Proxy that approximates the gradient norm using only forward-pass probabilities. Our method achieves training stability and matches the performance of large group sizes ($N=32$) with only $N=4$, reducing token consumption by over 65\% across single-turn and tool-integrated reasoning tasks.
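To make the idea of a forward-pass-only gradient-norm proxy concrete: for a softmax policy, the gradient of $\log p_a$ with respect to the logits is $e_a - p$, whose squared norm is $1 - 2p_a + \sum_j p_j^2$ and needs no backward pass. The sketch below computes that quantity and plugs it into the classic norm-weighted baseline $b = \sum_i w_i R_i / \sum_i w_i$. It is an illustration under these assumptions, not the paper's OTB formula, and the function names are hypothetical.

```python
def logit_grad_sq_norm(probs, token):
    """Squared L2 norm of d log p[token] / d logits for a softmax policy.

    The gradient is (one_hot(token) - probs), so the squared norm equals
    1 - 2 * p[token] + sum_j p[j]^2, computable from forward probabilities.
    """
    return 1.0 - 2.0 * probs[token] + sum(p * p for p in probs)


def norm_weighted_baseline(returns, sq_norms, eps=1e-8):
    """Classic variance-reducing baseline: returns averaged with weights
    equal to the (proxy) squared gradient norms of each sample."""
    numerator = sum(w * r for w, r in zip(sq_norms, returns))
    return numerator / (sum(sq_norms) + eps)
```

A token sampled with probability close to 1 has a near-zero proxy norm, so it contributes little to the baseline, while low-probability tokens dominate it.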
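🎯 Motivation: LLMs deployed as autonomous agents for multi-turn decision-making use fixed cognitive patterns and cannot follow the step-to-step changes in cognitive demand, which makes them inefficient.
❓ Problem addressed: A framework that adapts cognitive depth dynamically, letting the agent move between instinctive responses and strategic planning as the task requires, to improve efficiency and performance on long-horizon tasks.
🔍 Analysis: Demands differ markedly between steps: some need deep thinking and others only routine execution, and a fixed cognitive pattern cannot match them.
🛠️ Method: CogRouter defines four hierarchical cognitive levels grounded in ACT-R theory and combines cognition-aware supervised fine-tuning with cognition-aware policy optimization, trained with confidence-aware advantage reweighting, to adjust cognitive depth at each step.
📊 Data and experiments: On ALFWorld and ScienceWorld, CogRouter achieves state-of-the-art performance with superior efficiency.
⭐ Main contributions: CogRouter, a theory-grounded framework with two-stage training that matches cognitive depth to the demands of each step in long-horizon tasks, improving both efficiency and performance.
Abstract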
Large language models (LLMs) are increasingly deployed as autonomous agents for multi-turn decision-making tasks. However, current agents typically rely on fixed cognitive patterns: non-thinking models generate immediate responses, while thinking models engage in deep reasoning uniformly. This rigidity is inefficient for long-horizon tasks, where cognitive demands vary significantly from step to step, with some requiring strategic planning and others only routine execution. In this paper, we introduce CogRouter, a framework that trains agents to dynamically adapt cognitive depth at each step. Grounded in ACT-R theory, we design four hierarchical cognitive levels ranging from instinctive responses to strategic planning. Our two-stage training approach includes Cognition-aware Supervised Fine-tuning (CogSFT) to instill stable level-specific patterns, and Cognition-aware Policy Optimization (CoPO) for step-level credit assignment via confidence-aware advantage reweighting. The key insight is that appropriate cognitive depth should maximize the confidence of the resulting action. Experiments on ALFWorld and ScienceWorld demonstrate that CogRouter achieves state-of-the-art performance with superior efficiency.
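🎯 Motivation: Existing Vision-Language-Action (VLA) models rely on explicit chain-of-thought (CoT) reasoning, which is effective but limited by high compute cost and error propagation in multi-step tasks.
❓ Problem addressed: Introduces a latent reasoning framework that reduces reliance on explicit text generation while improving the efficiency and accuracy of reasoning.
🔍 Analysis: Latent reasoning trajectories are susceptible to noise and can be misaligned with downstream objectives, which degrades performance.
🛠️ Method: An RL-based denoising mechanism that models latent state generation as a sequential decision process, plus an early-exit strategy that terminates reasoning adaptively based on state confidence to balance depth and efficiency.
📊 Data and experiments: On several embodied decision benchmarks, the method significantly reduces inference latency while showing higher stability and success rates.
⭐ Main contributions: The AVA-VLA latent reasoning framework, with RL-based denoising and early exit, which improves reasoning efficiency and performance on vision-language-action tasks.
Abstract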
Existing Vision-Language-Action (VLA) models predominantly rely on explicit Chain-of-Thought (CoT) reasoning to bridge perception and action. While effective, this paradigm suffers from high computational costs and error propagation in multi-step tasks. In this paper, we propose Adaptive Variable Alignment VLA (AVA-VLA), a novel Latent Reasoning VLA framework that models reasoning as a sequence of unobservable latent variables, bypassing the need for explicit text generation. However, latent trajectories are inherently susceptible to noise interference and misalignment with downstream objectives. To address this, we introduce a Reinforcement Learning-based Denoising mechanism that treats latent state generation as a sequential decision process, optimizing reasoning trajectories via task-level rewards. Furthermore, we incorporate an Early-Exit Strategy that adaptively terminates reasoning based on state confidence, enabling a dynamic trade-off between depth and efficiency. Extensive experiments on embodied decision benchmarks demonstrate that AVA-VLA significantly reduces inference latency while achieving superior stability and success rates compared to full-reasoning baselines.
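🎯 Motivation: RL excels at complex decision-making, but opaque models limit its use in safety-critical domains. Explainable AI targets this problem, yet most work on deep RL is limited to post-hoc explanation or to imitating black-box agents, and lacks support for continuous action spaces.
❓ Problem addressed: A framework for deep RL, particularly in continuous action spaces, that provides intrinsic interpretability while preserving performance.
🔍 Analysis: Existing methods usually depend on pre-trained black-box models, are restricted to discrete action spaces, scale poorly to more complex environments, and lack transparency.
🛠️ Method: ProtoSAC, a prototype-based deep RL framework that combines a prototype-based action-generation mechanism with Soft Actor-Critic; actions are generated from a set of prototypes, each with a Gaussian action distribution, giving intrinsic interpretability.
📊 Data and experiments: Across several continuous-action environments, ProtoSAC matches the performance of the original SAC while offering enhanced interpretability.
⭐ Main contributions: ProtoSAC brings prototype mechanisms to continuous-action RL, offering intrinsic interpretability through a transparent decision process while remaining competitive.
Abstract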
Reinforcement learning (RL) has achieved remarkable success across complex decision-making tasks, especially with the advent of deep neural networks. However, the resulting models are often opaque, making their deployment in safety-critical domains challenging. Explainable AI aims to address this issue, but most specific efforts for deep RL remain limited either to post-hoc explanation methods or to imitation learning and distillation procedures. These latter approaches rely on pre-trained black-box agents and are typically restricted to environments with discrete action spaces, limiting their scalability and interpretability. In this paper, we introduce ProtoSAC, a novel deep RL architecture that integrates a prototype-based actor into the Soft Actor-Critic (SAC) algorithm, enabling intrinsic interpretability in continuous action spaces. Our method learns a set of prototypes that represent interpretable state clusters, each associated with a Gaussian action distribution. Actions are generated as a similarity-weighted mixture over these prototypes, providing transparent decision-making without sacrificing performance. We evaluate ProtoSAC on continuous action-space environments and show that it matches the performance of the original SAC while offering enhanced interpretability.
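The similarity-weighted mixture over prototypes can be illustrated with a short sketch. The similarity function (a softmax over negative squared distances), the `temperature` parameter, and all names here are assumptions chosen for the example; the abstract does not state how ProtoSAC measures similarity.

```python
import math
import random


def prototype_weights(state, prototypes, temperature=1.0):
    """Softmax over negative squared distances between state and prototypes."""
    scores = [
        -sum((s - p) ** 2 for s, p in zip(state, proto)) / temperature
        for proto in prototypes
    ]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(state, prototypes, means, stds, rng=random):
    """Sample from the mixture: pick a prototype by weight, then draw
    from that prototype's Gaussian action distribution."""
    weights = prototype_weights(state, prototypes)
    k = rng.choices(range(len(prototypes)), weights=weights)[0]
    action = [rng.gauss(m, s) for m, s in zip(means[k], stds[k])]
    return action, weights
```

The returned weights double as the explanation: they state which learned prototype states the current state resembles and by how much.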
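🎯 Motivation: Existing vision-language models struggle with medical images, where the visual evidence that informs decisions is typically sparse; multimodal medical reasoning needs to improve.
❓ Problem addressed: No unified RL framework exists for active visual token pruning together with medical multimodal reasoning, which limits reasoning efficiency and performance.
🔍 Analysis: Pruning visual tokens outside the grounding region markedly improves medical reasoning, indicating that visual token pruning matters for medical image reasoning.
🛠️ Method: ViToS, a dual-stream RL framework that handles token pruning and question answering together and resolves the coupled policy learning problem with cross-feedback sequential optimization.
📊 Data and experiments: On seven medical benchmarks, visual tokens are reduced to 77% of the original sequence length, with relative performance of 108.27% on Lingshu-7B and 104.16% on HuatuoGPT-Vision-7B.
⭐ Main contributions: An efficient paradigm for medical multimodal reasoning that improves both performance and inference speed.
Abstract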
Vision-language models (VLMs) combined with reinforcement learning (RL) have driven remarkable progress in multimodal reasoning, yet they still struggle with medical images, which typically offer extremely sparse visual evidence to inform clinical decision-making. We observe that pruning visual tokens outside the grounding region greatly enhances medical reasoning. However, a unified RL framework for active visual token pruning (VTP) and medical multimodal reasoning has not yet been established. Here, we propose a dual-stream RL framework, ViToS, to perform both token pruning and question answering. ViToS trains one policy model with two task branches, where one focuses on grounding while the other conducts token-sparse reasoning after VTP. Furthermore, we solve the coupled policy learning problem by introducing cross-feedback sequential optimization, avoiding gradient conflict and facilitating convergence of the shared policy model. Evaluated on seven medical benchmarks, our method reduces visual tokens to 77\% of the original sequence length while achieving a 108.27\% relative performance on Lingshu-7B and 104.16\% relative performance on HuatuoGPT-Vision-7B. Overall, ViToS delivers superior performance and inference speedup, establishing an efficient paradigm for medical multimodal reasoning.
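🎯 Motivation: SFT is computationally efficient but usually generalizes worse than RL, mainly because RL uses on-policy data; a method is needed to close this gap.
❓ Problem addressed: An on-policy SFT framework that improves generalization while staying efficient, offering an alternative in domains where RL is infeasible.
🔍 Analysis: Distribution Discriminant Theory (DDT) explains and quantifies the alignment between data and the model-induced distribution, clarifying how off-distribution data affects SFT generalization.
🛠️ Method: Two techniques: (1) In-Distribution Finetuning (IDFT), a loss-level method that improves SFT generalization; (2) Hinted Decoding, a data-level technique that re-aligns the training corpus to the model's distribution.
📊 Data and experiments: Experiments show generalization on par with prominent offline RL algorithms such as DPO and SimPO while keeping the training efficiency of SFT.
⭐ Main contributions: Introduces DDT, designs an efficient on-policy SFT framework, and plans to open-source the code and data.
Abstract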
Supervised fine-tuning (SFT) is computationally efficient but often yields inferior generalization compared to reinforcement learning (RL). This gap is primarily driven by RL’s use of on-policy data. We propose a framework to bridge this chasm by enabling On-Policy SFT. We first present ***Distribution Discriminant Theory (DDT)***, which explains and quantifies the alignment between data and the model-induced distribution. Leveraging DDT, we introduce two complementary techniques: (i) ***In-Distribution Finetuning (IDFT)***, a loss-level method to enhance generalization ability of SFT, and (ii) ***Hinted Decoding***, a data-level technique that can re-align the training corpus to the model’s distribution. Extensive experiments demonstrate that our framework achieves generalization performance on par with prominent offline RL algorithms, including DPO and SimPO, while maintaining the efficiency of an SFT pipeline. The proposed framework thus offers a practical alternative in domains where RL is infeasible. We will open-source the code and data on GitHub.
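🎯 Motivation: AI-assisted research is becoming an important tool, and one of its core abilities is proposing a research plan for a research goal, yet how to improve the quality of generated plans is understudied.
❓ Problem addressed: How to use the goals and grading rubrics found in existing research papers to train language models, via RL, to generate higher-quality research plans.
🔍 Analysis: A human study on machine learning research goals finds that experts prefer the finetuned model's plans over those of the initial model and of another frontier model.
🛠️ Method: Automatically extracts research goals and goal-specific grading rubrics from papers, then trains the plan-generation policy with RL against a frozen grader model.
📊 Data and experiments: Goals from several domains, including machine learning and medical papers, are used for quantitative evaluation and cross-domain generalization tests; the human study spans 225 expert hours.
⭐ Main contributions: A scalable training recipe shown to improve research-plan generation and cross-domain generalization, as a step toward general AI co-scientists.
Abstract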
AI co-scientists are emerging as a useful tool for human researchers, with a crucial ability being proposing a research plan for a given research goal. In this work, we study how to train language models that generate better research plans by leveraging the vast corpus of existing research papers. To collect diverse training data, we automatically extract research goals and goal-specific grading rubrics from papers across domains. We then train models for research plan generation via reinforcement learning, with a frozen copy of the initial policy acting as the grader, using the rubrics to evaluate plans generated by the training policy. To validate this approach, we conduct a human study for machine learning research goals spanning 225 expert hours. The experts prefer plans generated by our finetuned Qwen3-30B-A3B model over the initial model for 70% of goals, and over Grok-4-Thinking for 59.6% of goals. To assess generality, we also extend our approach to goals from medical papers and recent arXiv preprints, evaluating with a jury of frontier models. Our finetuning yields 12-22% relative improvements and significant cross-domain generalization, proving effective even in problem settings like medical research where execution feedback is infeasible. Overall, we demonstrate the potential of a scalable training recipe as a step towards improving general AI co-scientists.
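🎯 Motivation: Aligning large models with RL has brought notable progress on complex reasoning, but at the cost of expensive on-policy rollouts and limited exploration of diverse reasoning paths.
❓ Problem addressed: How to cut compute cost while increasing the diversity of reasoning paths and inference efficiency.
🔍 Analysis: Conventional methods are constrained by high compute cost and little exploration of reasoning paths, which makes policy optimization inefficient and limits generalization.
🛠️ Method: TreePO treats sequence generation as a tree-structured search, combining a dynamic tree sampling policy with fixed-length segment decoding; it adds branches where local uncertainty is high and reduces compute by sharing common prefixes and pruning low-value paths early.
📊 Data and experiments: On a range of reasoning benchmarks, TreePO improves performance while saving 22% to 43% of GPU hours, and reduces sampling compute by up to 40% at trajectory level and 35% at token level.
⭐ Main contributions: A segment-wise sampling algorithm, tree-based advantage estimation, and a dynamic fallback strategy, pointing to a way of scaling RL-based post-training with fewer samples and less compute while improving inference efficiency.
Abstract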
Recent advancements in aligning large language models via reinforcement learning have achieved remarkable gains in solving complex reasoning problems, but at the cost of expensive on-policy rollouts and limited exploration of diverse reasoning paths. In this work, we introduce TreePO, involving a self-guided rollout algorithm that views sequence generation as a tree-structured searching process. Composed of a dynamic tree sampling policy and fixed-length segment decoding, TreePO leverages local uncertainty to warrant additional branches. By amortizing computation across common prefixes and pruning low-value paths early, TreePO essentially reduces the per-update compute burden while preserving or enhancing exploration diversity. Key contributions include: (1) a segment-wise sampling algorithm that alleviates the KV cache burden through contiguous segments and spawns new branches along with an early-stop mechanism; (2) a tree-based segment-level advantage estimation that considers both global and local proximal policy optimization; and (3) analysis on the effectiveness of probability and quality-driven dynamic divergence and fallback strategy. We empirically validate the performance gain of TreePO on a set of reasoning benchmarks and GPU-hour savings of 22% up to 43% from the sampling design for the trained models, while showing up to 40% reduction at trajectory-level and 35% at token-level sampling compute for the existing models. While offering a free lunch of inference efficiency, TreePO reveals a practical path toward scaling RL-based post-training with fewer samples and less compute.
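🎯 Motivation: Policy gradient methods for LLMs suffer from policy mismatch caused by implementation divergences, which leads to training collapse on long-horizon tasks. Classical trust region error bounds break down for long sequences, so tighter bounds are needed.
❓ Problem addressed: New error bounds (the Pinsker-Marginal bound and the Mixed bound) that tighten the error attributable to policy mismatch and give practically meaningful guarantees for long-sequence tasks.
🔍 Analysis: Classical trust region bounds grow as $O(T^2)$ with sequence length, so on long-horizon tasks they cannot meaningfully bound the gap between the surrogate objective and the true objective.
🛠️ Method: Trust Region Masking (TRM) masks entire sequences that violate the trust region constraint, stabilizing training and supporting a theoretical monotonic improvement guarantee.
📊 Data and experiments: Experiments show that TRM improves training stability on long-horizon tasks and reduces policy mismatch compared with token-level methods such as PPO clipping.
⭐ Main contributions: A sequence-level trust region control method that yields the first non-vacuous monotonic improvement guarantees, with improved stability for long-horizon LLM-RL.
Abstract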
Policy gradient methods for Large Language Models (LLMs) optimize a policy $\pi_\theta$ via a surrogate objective computed from samples of a rollout policy $\pi_{\text{roll}}$. However, modern LLM-RL pipelines suffer from unavoidable implementation divergences—such as backend discrepancies, Mixture-of-Experts routing discontinuities, and distributed training staleness. These factors cause an off-policy mismatch ($\pi_{\text{roll}} \neq \pi_\theta$), leading to approximation errors between the surrogate and true objectives, often precipitating training collapse. We demonstrate that classical trust region bounds on this error scale as $O(T^2)$ with sequence length $T$, rendering them vacuous for long-horizon tasks. To address this, we derive two tighter bounds: a *Pinsker-Marginal* bound scaling as $O(T^{3/2})$ and a *Mixed* bound scaling as $O(T)$. Crucially, both bounds depend on $\mathcal{D}_{\text{KL}}^{\max}$—the maximum token-level KL divergence across the sequence. As this is a *sequence-level* quantity, it cannot be controlled by token-independent methods like PPO clipping. We propose Trust Region Masking (TRM), which masks entire sequences that violate the trust region. TRM theoretically provides the first non-vacuous monotonic improvement guarantees and empirically improves training stability for long-horizon LLM-RL.
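Sequence-level masking on the maximum token-level KL can be sketched as below. This is a minimal illustration under stated assumptions: the KL direction (rollout policy relative to training policy), the threshold name `delta`, and averaging the loss over kept sequences are choices made for the example, not details confirmed by the abstract.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions over one vocabulary."""
    return sum(
        pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0
    )


def sequence_mask(rollout_dists, policy_dists, delta):
    """Return 1.0 if every token's KL stays within delta, else 0.0.

    rollout_dists / policy_dists: one next-token distribution per position,
    from the rollout policy and the training policy respectively.
    """
    max_kl = max(
        kl_divergence(p, q) for p, q in zip(rollout_dists, policy_dists)
    )
    return 1.0 if max_kl <= delta else 0.0


def masked_loss(per_sequence_losses, masks):
    """Average the surrogate loss over sequences inside the trust region."""
    kept = sum(masks)
    if kept == 0:
        return 0.0
    return sum(l * m for l, m in zip(per_sequence_losses, masks)) / kept
```

Because the mask depends on the worst token in the sequence, a single badly mismatched position removes the whole sequence from the update, which per-token clipping cannot do.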
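🎯 Motivation: Massively parallel simulation is gaining attention in RL, but existing methods usually use simple Gaussian policy parameterizations that struggle with complex control problems. Diffusion models offer a more expressive policy class, but most work targets offline or off-policy training.
❓ Problem addressed: Whether diffusion policies can be trained effectively in massively parallel, on-policy RL, where the rapidly changing data distribution makes complex policies hard to train stably.
🔍 Analysis: Existing diffusion-based RL methods have difficulty training stably under fast-changing data distributions, especially in massively parallel settings.
🛠️ Method: Trust-region Diffusion Policies (TruDi) integrate a trust-region optimization rule that enforces a KL-divergence constraint over the entire diffusion trajectory, keeping training stable.
📊 Data and experiments: Evaluated on 4 massively parallel RL benchmarks totaling 73 tasks, TruDi matches or beats strong baselines on standard tasks and shows clear gains on challenging humanoid control tasks.
⭐ Main contributions: TruDi, a diffusion policy framework for massively parallel on-policy RL, which sets a strong new baseline for complex control tasks.
Abstract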
Reinforcement learning with massively parallel simulations has become an emerging trend; however, most existing approaches still rely on simple Gaussian policy parameterizations. Diffusion models provide a more expressive policy class and have shown strong performance on challenging control problems, yet most diffusion-based RL methods are designed for offline or off-policy training. In this work, we ask whether diffusion policies can be trained effectively in the massively parallel, on-policy regime. To this end, we introduce Trust-region Diffusion Policies (TruDi), which enables diffusion policies for on-policy RL with massively parallel simulations. This setting is particularly challenging because the data distribution changes quickly across updates, making stable training with complex policies difficult. TruDi addresses this by integrating a trust-region optimization rule to enforce a KL-divergence constraint over the entire diffusion trajectory, rather than only at the final denoising step. Empirically, we evaluate TruDi on a diverse set of 4 massively parallel RL benchmarks comprising a total of 73 tasks. Across these tasks, TruDi consistently outperforms or is on-par with strong baselines on standard tasks and achieves clear gains on more challenging humanoid control tasks, establishing a strong new baseline for massively parallel on-policy RL.
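🎯 Motivation: LLMs are prone to hallucination and untruthful answers on factual questions, especially when a task needs information outside their parametric knowledge, which challenges their truthfulness.
❓ Problem addressed: Methods that optimize for accuracy can amplify hallucinations, while those that encourage abstention become overly conservative. The work optimizes truthfulness directly, balancing accuracy and uncertainty.
🔍 Analysis: Supervised fine-tuning or RL with a binary reward struggles to balance factual correctness and uncertainty, which hurts truthfulness.
🛠️ Method: TruthRL, an RL framework implemented with GRPO and a ternary reward that distinguishes correct answers, hallucinations, and abstentions, reducing hallucinations and improving truthfulness.
📊 Data and experiments: On four knowledge-intensive benchmarks, hallucinations fall markedly (e.g., 43.5% to 19.4%) and truthfulness rises (e.g., 5.3% to 37.2%). Experiments cover several backbone models and include ablations and tests on hallucination-baiting questions.
⭐ Main contributions: A truthfulness-driven learning objective that improves both accuracy and truthfulness, strengthens the model's ability to recognize its knowledge boundary, and makes it more robust to hallucination-baiting questions.
Abstract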
While large language models (LLMs) have demonstrated strong performance on factoid question answering, they are still prone to hallucination and untruthful responses, particularly when tasks demand information outside their parametric knowledge. Indeed, truthfulness requires more than accuracy: models must also recognize uncertainty and abstain when unsure to avoid hallucinations. This presents a fundamental challenge for existing methods: approaches that optimize for accuracy often amplify hallucinations, while those that encourage abstention can become overly conservative, sacrificing correct answers. Both extremes ultimately compromise truthfulness. In this work, we present TruthRL, a general reinforcement learning (RL) framework that directly optimizes the truthfulness of LLMs. Specifically, we implement TruthRL using GRPO with a simple yet effective ternary reward that distinguishes correct answers, hallucinations, and abstentions. It incentivizes models to reduce hallucinations not only by providing correct responses, but also by enabling abstention when uncertain, thereby improving truthfulness. Extensive experiments across four knowledge-intensive benchmarks show that TruthRL significantly reduces hallucinations (e.g., 43.5\% $\rightarrow$ 19.4\%) and improves truthfulness (e.g., 5.3\% $\rightarrow$ 37.2\%), with consistent gains across various backbone models (e.g., Qwen, Llama). In-depth ablation study demonstrates that vanilla accuracy-driven methods such as supervised fine-tuning or RL with a binary reward struggle to balance factual correctness and uncertainty, whereas the truthfulness-driven TruthRL achieves strong performance in both accuracy and truthfulness, underscoring the importance of learning objective design for developing truthful LLMs. Moreover, we find the improvement of TruthRL arises from enhancing the capability of LLMs to recognize their knowledge boundary, hence avoiding being overly conservative as the baselines are. Further analysis validates our method across multiple evaluation judges, and confirms that TruthRL is robust to hallucination-baiting questions.
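A ternary reward combined with group-relative advantages, as in GRPO, could be sketched as follows. The specific reward values (+1, 0, -1) and the outcome labels are assumptions made for illustration; the abstract only states that the reward distinguishes correct answers, hallucinations, and abstentions.

```python
from statistics import mean, pstdev

# Assumed reward values; the abstract does not give the exact numbers.
REWARD = {"correct": 1.0, "abstain": 0.0, "hallucination": -1.0}


def ternary_reward(outcome):
    """Map a judged outcome label to its scalar reward."""
    return REWARD[outcome]


def group_advantages(outcomes, eps=1e-8):
    """GRPO-style advantages: rewards standardized within one prompt's group."""
    rewards = [ternary_reward(o) for o in outcomes]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Under a binary reward, an abstention and a hallucination would receive the same advantage. With three levels, an abstention ranks strictly above a hallucination in the same group, so declining to answer is preferred to guessing wrongly but still ranks below a correct answer.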
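🎯 Motivation: In UAV autopilot systems, scheduling of navigation and flight control is critical to flight performance, but the split architecture of existing platforms limits system-wide observability and scheduling optimization.
❓ Problem addressed: Existing scheduling methods cannot handle the implicit, cross-coupled, and accumulated factors in navigation and flight control, making autonomous flight performance hard to optimize.
🔍 Analysis: In conventional architectures, navigation and flight control run on separate hardware, which prevents holistic optimization, and model-based or heuristic scheduling cannot adapt to the complex interactions.
🛠️ Method: UAV$^2$ integrates navigation and flight control onto a single computing platform, formulates scheduling as a partially observable Markov decision process, and learns scheduling policies from runtime feedback via RL.
📊 Data and experiments: Trained and evaluated in a hardware-in-the-loop simulation environment; the learned policy consistently outperforms fixed-rate scheduling in flight robustness and tracking performance.
⭐ Main contributions: A unified framework with RL-based scheduling for UAV autopilots that improves flight performance and addresses key limitations of the conventional split architecture.
Abstract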
Unmanned aerial vehicle (UAV) autopilot systems typically comprise navigation and flight-control modules, and their effective scheduling is critical to achieving high flight performance. However, most existing UAV platforms adopt a split architecture in which navigation and flight control are deployed on separate hardware devices. This separation restricts system-wide observability and prevents holistic scheduling and optimization across the entire autopilot pipeline. Moreover, autonomous flight performance emerges from implicit, cross-coupled, and accumulated interactions among multiple factors, rendering traditional model-based or heuristic scheduling approaches ineffective. To address these challenges, we propose UAV$^2$, a unified and adaptive scheduling framework for UAV autopilot systems with reinforcement learning, targeting flight performance optimization. UAV$^2$ integrates navigation and flight control onto a single onboard computing platform and operating system, formulates the scheduling problem as a partially observable Markov decision process, and learns scheduling policies from runtime execution feedback. The proposed approach is trained and evaluated in a hardware-in-the-loop simulation environment. Experimental results demonstrate that the learned scheduling policy consistently outperforms fixed-rate scheduling strategies in terms of flight robustness and tracking performance.
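🎯 Motivation: Giving LLMs the ability to express uncertainty, to mitigate the hallucinations that restrict their high-stakes applications.
❓ Problem addressed: Existing RL paradigms suffer from advantage bias and static uncertainty rewards, leading to excessive conservatism or overconfidence and hurting reliability.
🔍 Analysis: Reward hacking and overconfidence in current methods trace back to how uncertainty rewards are designed and how advantages are computed.
🛠️ Method: UCPO uses Ternary Advantage Decoupling to separate and independently normalize deterministic and uncertain rollouts, and adjusts uncertainty reward weights dynamically according to model evolution and instance difficulty.
📊 Data and experiments: Experiments on mathematical reasoning and general tasks show that UCPO resolves the reward imbalance and improves reliability and calibration.
⭐ Main contributions: The uncertainty-aware policy optimization framework UCPO, which removes advantage bias in existing RL paradigms and improves model reliability for high-stakes use.
Abstract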
The key to building trustworthy Large Language Models (LLMs) lies in endowing them with inherent uncertainty expression capabilities to mitigate the hallucinations that restrict their high-stakes applications. However, existing RL paradigms such as GRPO often suffer from Advantage Bias due to binary decision spaces and static uncertainty rewards, inducing either excessive conservatism or overconfidence. To tackle this challenge, this paper unveils the root causes of reward hacking and overconfidence in current RL paradigms incorporating uncertainty-based rewards, based on which we propose the UnCertainty-Aware Policy Optimization (UCPO) framework. UCPO employs Ternary Advantage Decoupling to separate and independently normalize deterministic and uncertain rollouts, thereby eliminating advantage bias. Furthermore, a Dynamic Uncertainty Reward Adjustment mechanism is introduced to calibrate uncertainty weights in real-time according to model evolution and instance difficulty. Experimental results in mathematical reasoning and general tasks demonstrate that UCPO effectively resolves the reward imbalance, significantly enhancing model reliability and calibration beyond their knowledge boundaries.
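🎯 Motivation: Uniform Discrete Diffusion (UDM) is promising for discrete generative modeling, but its integration with RL is largely unexplored.
❓ Problem addressed: Applying GRPO directly to UDM gives unstable training and marginal performance; the method must be adapted to stabilize optimization and improve results.
🔍 Analysis: Treating intermediate predicted samples as actions does not match the training objective, and the resulting training trajectories are inconsistent with the pretraining probability paths.
🛠️ Method: UDM-GRPO treats the final clean sample as the action for optimization, reconstructs training trajectories with the forward process, and adds Reduction-Step and CFG-Free training strategies for efficiency.
📊 Data and experiments: Across multiple T2I tasks the base model improves significantly: GenEval accuracy reaches 96%, and OCR benchmark accuracy rises from 4% to 57%.
⭐ Main contributions: The first framework to integrate UDM with RL, stable and efficient, with state-of-the-art text-to-image results and large OCR accuracy gains.
Abstract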
Uniform Discrete Diffusion (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains largely unexplored. We observe that naively adapting GRPO to UDM leads to unstable training and marginal performance. To address this, we propose UDM-GRPO, the first framework that integrates UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample, rather than an intermediate predicted sample, as the action provides more accurate and stable optimization signals; and (ii) adopting the forward process to reconstruct the training trajectories helps the model learn probability paths that are more consistent with pretraining. For efficiency, we introduce Reduction-Step and CFG-Free training strategies. UDM-GRPO significantly improves the performance of the base model across multiple T2I tasks. Notably, GenEval accuracy improves from $69\%$ to $96\%$ and PickScore increases from $20.46$ to $23.81$, achieving state-of-the-art performance in both continuous and discrete settings. On the OCR benchmark, accuracy improves from $4\%$ to $57\%$, further validating the effectiveness and generalization capability of our method.
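🎯 Motivation: RL with diffusion models shows strong potential, but existing approaches such as DPO variants simplify trajectory likelihoods in a way that leads to suboptimal alignment.
❓ Problem addressed: Overcoming the mismatch between trajectory likelihoods and final-state probabilities to refine the diffusion model's score function more effectively.
🔍 Analysis: Existing methods handle trajectory likelihoods through simplifying assumptions and fail to capture the true distribution under the optimal score function, which limits performance.
🛠️ Method: A two-stage framework: first learn a value-distribution function to estimate returns of short trajectory segments, then apply VRPO to refine the score function.
📊 Data and experiments: Experiments on large-scale diffusion models validate the theoretical analysis and show stable, consistent improvements over prior methods.
⭐ Main contributions: Proves that, under sufficient model capacity, the result is equivalent to training on the tilted distribution, and proposes a value-function-based framework that improves RL for diffusion models.
Abstract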
Reinforcement learning with diffusion models has shown strong potential, but existing approaches such as variants of Direct Preference Optimization (DPO) often rely on an inaccurate simplification: they equate trajectory likelihoods with final-state probabilities. This mismatch leads to suboptimal alignment. We address this limitation with a principled framework that leverages the optimal value function as the return for short trajectory segments. Our approach follows a two-stage procedure: (i) learning a value-distribution function to estimate segment-level returns, and (ii) applying our VRPO to refine the score function. We prove that, under sufficient model capacity, the resulting model is equivalent to training a diffusion process on the tilted distribution proportional to $p(x)\exp(\eta r(x))$. Experiments on large-scale diffusion models validate our analysis and show stable and consistent improvements over prior methods.
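The tilted distribution proportional to $p(x)\exp(\eta r(x))$ can be made concrete with samples: if $x_1, \dots, x_n$ are drawn from $p$, self-normalized weights proportional to $\exp(\eta r(x_i))$ approximate expectations under the tilted distribution. The sketch below illustrates that target only; it is not the VRPO training procedure, and the function names are hypothetical.

```python
import math


def tilted_weights(rewards, eta):
    """Self-normalized weights proportional to exp(eta * r) for samples from p."""
    top = max(eta * r for r in rewards)  # subtract the max for stability
    unnormalized = [math.exp(eta * r - top) for r in rewards]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def tilted_expectation(values, rewards, eta):
    """Estimate E[f(x)] under the tilted distribution from samples of p."""
    weights = tilted_weights(rewards, eta)
    return sum(w * v for w, v in zip(weights, values))
```

Setting `eta` to zero recovers the base distribution (uniform weights), and increasing it concentrates mass on high-reward samples, which is the trade-off the parameter $\eta$ controls.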
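🎯 Motivation: Because of its on-policy nature (training and inference state distributions match), reinforcement learning has become a post-training paradigm for Temporal Video Grounding (TVG), but existing GRPO-based methods are limited by sparse reward signals and high computational cost.
❓ Problem addressed: Proposes Video-OPD, an efficient post-training framework that uses a reverse KL divergence objective to turn sparse rewards into dense, step-wise supervision and thereby improve training efficiency.
🔍 Analysis: Conventional RL methods rely only on sparse reward signals, so training is inefficient and prone to distribution shift; directly optimizing trajectories sampled from the current policy reduces that shift.
🛠️ Method: Building on Video-OPD, introduces Teacher-Validated Disagreement Focusing (TVDF), a training curriculum that prioritizes trajectories that are both teacher-reliable and highly informative, combined with the reverse KL objective for fine-grained learning.
📊 Data and experiments: Experiments on TVG benchmarks show that Video-OPD converges faster and costs less compute than existing methods.
⭐ Main contributions: Shows that post-training via on-policy distillation is a viable alternative to conventional reinforcement learning, and proposes an efficient training curriculum that improves the framework's practicality and efficiency.
Abstract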
Reinforcement learning has emerged as a principled post-training paradigm for Temporal Video Grounding (TVG) due to its on-policy optimization, yet existing GRPO-based methods remain fundamentally constrained by sparse reward signals and substantial computational overhead. We propose Video-OPD, an efficient post-training framework for TVG inspired by recent advances in on-policy distillation. Video-OPD optimizes trajectories sampled directly from the current policy, thereby preserving alignment between training and inference distributions, while a frontier teacher supplies dense, token-level supervision via a reverse KL divergence objective. This formulation preserves the on-policy property critical for mitigating distributional shift, while converting sparse, episode-level feedback into fine-grained, step-wise learning signals. Building on Video-OPD, we introduce Teacher-Validated Disagreement Focusing (TVDF), a lightweight training curriculum that iteratively prioritizes trajectories that are both teacher-reliable and maximally informative for the student, thereby improving training efficiency. Empirical results demonstrate that Video-OPD consistently outperforms GRPO while achieving substantially faster convergence and lower computational cost, establishing on-policy distillation as an effective alternative to conventional reinforcement learning for TVG.
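The dense, token-level signal described above can be sketched as a per-position reverse KL divergence, KL(student || teacher), evaluated on a sequence sampled from the student. This is a minimal illustration under assumed toy distributions; the function names are hypothetical and the paper's exact loss is not given in the abstract.

```python
import math

def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) over the vocabulary at one token position."""
    # Terms with zero student probability contribute nothing to the sum.
    return sum(s * math.log(s / t)
               for s, t in zip(student_probs, teacher_probs) if s > 0.0)

def sequence_reverse_kl(student_seq, teacher_seq):
    """Per-token reverse KL values and their mean over the sequence."""
    per_token = [reverse_kl(s, t) for s, t in zip(student_seq, teacher_seq)]
    return per_token, sum(per_token) / len(per_token)

# Toy next-token distributions over a 3-word vocabulary (assumed values).
student = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
teacher = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
per_token, loss = sequence_reverse_kl(student, teacher)
```

Every position contributes its own term, so the student receives feedback at each token instead of a single episode-level reward.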
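🎯 Motivation: Group Relative Policy Optimization (GRPO) is effective, but on complex reasoning tasks extended deliberation tends to cause inefficiency and overthinking, and complicates the trade-off between accuracy and rollout efficiency.
❓ Problem addressed: Address the inefficiency caused by prolonged reasoning in GRPO by improving the supervision signal so that redundant reasoning is avoided while accuracy is maintained.
🔍 Analysis: Supervision based only on the final answer cannot precisely control when to continue or stop, and global length penalties are hard to calibrate and may truncate useful reasoning.
🛠️ Method: Weakly Supervised GRPO (WS-GRPO) derives prefix-level continue/stop signals from outcome correctness and uses them to guide reasoning, avoiding redundant computation while maintaining accuracy.
📊 Data and experiments: On several reasoning benchmarks, WS-GRPO substantially reduces rollout length while remaining on par with GRPO baselines in accuracy.
⭐ Main contributions: A new weakly supervised method that makes reasoning more efficient; guidance signals over partial reasoning trajectories that replace hard-to-calibrate global penalties; theoretical analysis and experiments supporting its feasibility and advantages.
Abstract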
Group Relative Policy Optimization (GRPO) is effective for training language models on complex reasoning. However, since the objective is defined relative to a group of sampled trajectories, extended deliberation can create more chances to realize relative gains, leading to inefficient reasoning and overthinking, and complicating the trade-off between correctness and rollout efficiency. Controlling this behavior is difficult in practice for two reasons: (i) length penalties are hard to calibrate, because longer rollouts may reflect harder problems that require longer reasoning, so penalizing tokens risks truncating useful reasoning along with redundant continuation; and (ii) supervision that directly indicates when to continue or stop is typically unavailable beyond final answer correctness. We propose Weakly Supervised GRPO (WS-GRPO), which improves rollout efficiency by converting terminal rewards into correctness-aware guidance over partial trajectories. Unlike global length penalties that are hard to calibrate, WS-GRPO trains a preference model from outcome-only correctness to produce prefix-level signals that indicate when additional continuation is beneficial. Thus, WS-GRPO supplies outcome-derived continue/stop guidance, reducing redundant deliberation while maintaining accuracy. We provide theoretical results and empirically show on reasoning benchmarks that WS-GRPO substantially reduces rollout length while remaining competitive with GRPO baselines.
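For readers unfamiliar with why the GRPO objective is "relative to a group", here is a minimal sketch of the standard group-relative advantage: each rollout's reward is centered and scaled by the statistics of its own group. The reward values are toy assumptions, the population standard deviation is one possible choice of scale, and the sketch covers plain GRPO only, not the WS-GRPO preference model.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Center and scale the rewards of one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts for one prompt with binary correctness rewards (assumed values).
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_relative_advantages(rewards)
```

Because only terminal rewards enter this computation, nothing in it indicates where inside a rollout further reasoning stopped being useful, which is the gap the prefix-level signals are meant to fill.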
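🎯 Motivation: Existing vision-language models lack effective self-correction on complex reasoning tasks, and existing RL methods struggle to learn from the sparse signals available.
❓ Problem addressed: Proposes Octopus, a rollout-augmentation framework that generates dense self-correction supervision to improve sample efficiency and stabilize RL optimization.
🔍 Analysis: Self-correction behaviors are rare during RL, so learning signals are sparse and the model's self-correction ability is hard to optimize.
🛠️ Method: A rollout-augmentation method that recombines existing rollouts into new supervision signals, plus a two-stage RL strategy that separates self-correction from direct-reasoning signals to remove conflicts.
📊 Data and experiments: Octopus-8B outperforms the best baseline across 7 benchmarks by 1.0 points, while per-step training time drops to 0.72× that of the baseline.
⭐ Main contributions: A self-correction method combining rollout augmentation with a two-stage strategy, which markedly improves the reasoning efficiency and performance of open-source vision-language models.
Abstract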
Self-correction is essential for solving complex reasoning problems in vision–language models (VLMs), yet existing reinforcement learning (RL) methods struggle to learn it. Effective self-correction behaviors emerge only rarely during RL, making learning signals sparse. To address this challenge, we propose c**o**rre**ct**i**o**n-s**p**ecific rollo**u**t**s** (**Octopus**), a rollout-augmentation framework that synthesizes dense self-correction supervision by recombining existing rollouts without computational overhead. This rollout augmentation simultaneously improves sample efficiency and stabilizes RL optimization. Furthermore, we introduce a two-stage RL training strategy that disentangles self-correction and direct reasoning, avoiding signal conflicts and enabling both behaviors to be learned effectively. Building on this, we introduce $\texttt{Octopus-8B}$, an advanced reasoning VLM with controllable self-correction capabilities. It achieves SoTA performance among open-source VLMs across 7 benchmarks, outperforming the best RLVR baseline by 1.0 points while requiring only $0.72\times$ training time per step.
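🎯 Motivation: When GRPO trains Chain-of-Thought reasoning, thought-level advantage estimation without a value function tends to have high variance. Tree-style branching is commonly used in practice to reduce that variance, but its theoretical basis and necessity remain unclear.
❓ Problem addressed: Examine how tree-style branching affects the variance of thought-level advantage estimates in GRPO, and explain why it works and whether it may be necessary.
🔍 Analysis: The multivariate delta method reveals that the sampling dimensions affect variance asymmetrically. Increasing the number of sampled thoughts leaves a variance floor, whereas increasing the number of answers per thought decreases variance monotonically, driving it toward zero.
🛠️ Method: In a minimal tree-style branching setting, theoretical analysis and experiments show the key role of answer-level branching in reducing variance.
📊 Data and experiments: Experiments span math and multiple vision domains with different model architectures and sizes, and show that answer-level branching improves optimization stability, training efficiency, and final performance.
⭐ Main contributions: Argues that tree-style branching is a potentially necessary mechanism for accurate thought-level advantage estimation, with theoretical support and extensive experiments demonstrating its practical value.
Abstract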
Group Relative Policy Optimization (GRPO) trains Chain-of-Thought reasoning with verifiable rewards, but estimating thought-level advantages without value functions often suffers from high variance. Although tree-style branching is used in practice to reduce the variance, it lacks a theoretical explanation of why it works and whether it is important or even potentially necessary. We study thought-level advantage estimation in GRPO from a variance perspective under a minimal tree-style setting where multiple answers are sampled for each thought. Using the multivariate delta method, we reveal an asymmetry in how different sampling dimensions affect variance. Increasing the number of sampled thoughts ($K$) leaves a strictly positive variance floor, whereas increasing the number of answers per thought ($M$) induces a monotonic decrease in variance, asymptotically driving it to zero. This implies that accurate thought-level advantage estimation is impossible through scaling thought sampling alone, making branching a potentially necessary mechanism rather than a heuristic. Experiments further provide empirical evidence for both the effectiveness and necessity of answer-level branching, demonstrating improved optimization stability, training efficiency, and final performance not only in math but also across a broad range of vision domains and under different model architectures and sizes.
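The asymmetry between $K$ and $M$ can be illustrated with a simplified toy model. This is an assumption made for illustration, not the paper's delta-method derivation: each thought $j$ has a fixed success probability $p_j$, its value is estimated from $M$ independent Bernoulli answers, and the advantage is the unnormalized difference $\hat v_i - \frac{1}{K}\sum_j \hat v_j$. The sampling-noise variance of that advantage then has a closed form.

```python
def advantage_noise_variance(probs, i, M):
    """Variance of A_i = v_hat_i - mean_j(v_hat_j) under the toy model.

    probs[j] is the success probability of thought j, and v_hat_j is the
    mean of M independent Bernoulli(probs[j]) answer rewards.
    """
    K = len(probs)
    var = [p * (1.0 - p) / M for p in probs]      # Var(v_hat_j)
    own = (1.0 - 1.0 / K) ** 2 * var[i]           # weight (1 - 1/K) on v_hat_i
    others = sum(v for j, v in enumerate(var) if j != i) / K ** 2
    return own + others

# Equal success probability 0.5 for every thought (assumed toy setting).
few_thoughts = advantage_noise_variance([0.5] * 4, 0, M=1)
many_thoughts = advantage_noise_variance([0.5] * 1000, 0, M=1)
many_answers = advantage_noise_variance([0.5] * 4, 0, M=64)
```

In this toy setting the variance equals $\frac{p(1-p)}{M}\cdot\frac{K-1}{K}$: sampling more thoughts only pushes it toward the floor $p(1-p)/M$, while sampling more answers per thought shrinks it toward zero.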
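🎯 Motivation: Tokenization is a hardcoded compression step in the training pipeline of large language models (LLMs). As architectures become increasingly end-to-end, hand-designed tokenization looks limiting and calls for improvement so it can be integrated into the model.
❓ Problem addressed: Existing methods draw or learn token boundaries with heuristics or continuous relaxations, with limited accuracy and theoretical guarantees. This work explores using reinforcement learning to optimize discrete token boundaries so as to directly minimize model loss.
🔍 Analysis: Variance in token-boundary learning significantly hampers practical use; RL techniques such as time discounting reduce that variance enough to make the approach practicable.
🛠️ Method: Score function estimates, combined with RL techniques such as time discounting, optimize discrete token boundaries with tighter theoretical guarantees and low variance during training.
📊 Data and experiments: Experiments at the 100-million-parameter scale show, both quantitatively and qualitatively, that the method outperforms existing straight-through estimates in tokenization quality.
⭐ Main contributions: A new method that uses reinforcement learning to optimize discrete token boundaries, improving tokenization performance while retaining theoretical guarantees and advancing end-to-end learned tokenization.
Abstract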
Tokenization is a hardcoded compression step which remains in the training pipeline of Large Language Models (LLMs), despite a general trend towards architectures becoming increasingly end-to-end. Prior work has shown promising results at scale in bringing this compression step inside the LLMs' architecture with heuristics to draw token boundaries, and has also attempted to learn these token boundaries with straight-through estimates, which treat the problem of drawing discrete token boundaries as a continuous one. We show that these token boundaries can instead be learned using score function estimates, which have tighter theoretical guarantees due to directly optimizing the problem of drawing discrete token boundaries to minimize loss. We observe that techniques from reinforcement learning, such as time discounting, are necessary to reduce the variance of this score function sufficiently to make it practicable. We demonstrate that the resultant method outperforms previously proposed straight-through estimates, both qualitatively and quantitatively, at the $100$ million parameter scale.
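A score function (REINFORCE) estimate for a single binary boundary decision can be sketched as follows. The boundary is drawn as $b \sim \mathrm{Bernoulli}(\sigma(\theta))$, and the estimate $(\ell(b) - c)\,\nabla_\theta \log P(b)$ is unbiased for $\nabla_\theta \mathbb{E}[\ell(b)]$ for any constant baseline $c$. The toy loss, the baseline, and all function names are assumptions for illustration. The `discounted_returns` helper shows generic time discounting from RL; how the paper applies discounting is not specified in the abstract.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score_function_grad(theta, loss, b, baseline=0.0):
    """Single-sample estimate of d/dtheta E[loss(b)], b ~ Bernoulli(sigmoid(theta))."""
    p = sigmoid(theta)
    # d/dtheta log P(b) equals (b - p) for a Bernoulli with logit theta.
    return (loss(b) - baseline) * (b - p)

def expected_estimate(theta, loss, baseline=0.0):
    """Average the estimator over both outcomes of b, weighted by probability."""
    p = sigmoid(theta)
    return (p * score_function_grad(theta, loss, 1, baseline)
            + (1.0 - p) * score_function_grad(theta, loss, 0, baseline))

def exact_grad(theta, loss):
    """Closed-form gradient of E[loss(b)] with respect to theta."""
    p = sigmoid(theta)
    return (loss(1) - loss(0)) * p * (1.0 - p)

def discounted_returns(losses, gamma):
    """Discounted loss-to-go: G_t = l_t + gamma * G_{t+1}."""
    out, g = [], 0.0
    for l in reversed(losses):
        g = l + gamma * g
        out.append(g)
    return out[::-1]

def toy_loss(b):
    # Hypothetical downstream loss: placing the boundary (b = 1) is cheaper.
    return 2.0 if b == 1 else 3.0
```

Averaged over both outcomes, the estimator matches the exact gradient with or without a baseline; the baseline and the discounting only change its variance, which is the quantity the abstract says must be reduced for the method to be practicable.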
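🎯 Motivation: Small models with latent recursion have recently performed well on complex reasoning tasks, yet recursively applied layers still lag behind non-recursive models of the same depth, which raises the question of when recursion actually helps.
❓ Problem addressed: Analyze how latent reasoning improves model performance, avoid ineffective compute steps, and make the recursive process more efficient.
🔍 Analysis: Not every recursive step contributes effectively to depth; the authors show that latent reasoning can be formalized as a classifier-free guidance and policy improvement algorithm.
🛠️ Method: Borrow training schemes from reinforcement learning and diffusion models to improve latent reasoning models and reduce ineffective recursive steps.
📊 Data and experiments: With the Tiny Recursive Model as the testbed, the modified method reduces the number of forward passes by 18× while maintaining performance.
⭐ Main contributions: A policy-improvement view that explains latent recursion; improved training schemes that cut dead compute; and the first application of these methods to recursive reasoning, with validation of their effectiveness.
Abstract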
Recently, small models with latent recursion have obtained promising results on complex reasoning tasks. These results are typically explained by the theory that such recursion increases a network’s depth, allowing it to compactly emulate the capacity of larger models. However, the performance of recursively added layers remains behind the capabilities of one-pass models with the same feed-forward depth. This means that in the looped version, not every recursive step effectively contributes to depth. This raises the question: when and why does latent reasoning improve performance, and when does it result in dead compute? In our work, we answer this question by analyzing the algorithms that latent reasoning implements. We show that latent reasoning can be formalized as a classifier-free guidance and policy improvement algorithm. Building on these insights, we propose to use training schemes from RL and diffusion methods for latent reasoning models. Using the Tiny Recursive Model as our testbed, we show that with our modifications we can avoid dead compute steps and reduce the total number of forward passes by 18× while maintaining performance. Broadly speaking, we show how a policy improvement perspective on recursive steps can explain model behavior and provide insights for further improvements.
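🎯 Motivation: Diffusion language models (DLMs) perform well at text generation, but improving their reasoning ability, especially with reinforcement learning, remains to be explored.
❓ Problem addressed: Current DLMs are limited in their support for any-order decoding, and existing algorithms estimate sampling-trajectory likelihoods inaccurately.
🔍 Analysis: An empirical study of widely used DLMs shows that any-order decoding is not universally supported in practice, so likelihood estimators must be designed for each decoding capability.
🛠️ Method: The d2 framework includes d2-AnyOrder, which obtains the exact trajectory likelihood in a single model pass for DLMs that support any-order decoding, and d2-StepMerge, which trades compute for approximation accuracy for DLMs that do not.
📊 Data and experiments: On logical reasoning tasks (Countdown and Sudoku) and math reasoning benchmarks (GSM8K and MATH500), d2 outperforms widely used RL baselines.
⭐ Main contributions: Proposes d2, a reasoning framework tailored to how diffusion language models decode, and sets a new state of the art for DLMs on logical and math reasoning tasks.
Abstract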
While diffusion language models (DLMs) have achieved competitive performance in text generation, improving their reasoning ability with reinforcement learning remains an active research area. Here, we introduce d2, a reasoning framework tailored for masked DLMs. Central to our framework is a new policy gradient algorithm that relies on accurate estimates of the sampling trajectory likelihoods. Our likelihood estimator, d2-AnyOrder, achieves exact trajectory likelihood with a single model pass for DLMs that support a sampling algorithm called any-order decoding. Through an empirical study of widely used DLMs, we show that any-order decoding is not universally supported in practice. Consequently, for DLMs that do not naturally support any-order decoding, we propose another estimator, d2-StepMerge, which, unlike d2-AnyOrder, only approximates the trajectory likelihood. d2-StepMerge trades off compute for approximation accuracy in an analytically tractable manner. Empirically, d2 significantly outperforms widely-used RL baselines when applied to popular DLMs, and sets a new state-of-the-art performance for DLMs on logical reasoning tasks (Countdown and Sudoku) and math reasoning benchmarks (GSM8K and MATH500).
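To clarify what a "sampling trajectory likelihood" refers to for a masked DLM, the sketch below assumes the usual factorization in which the tokens unmasked at a step are drawn independently given the partially masked sequence at that step. The trajectory log-likelihood is then a sum of log-probabilities over all steps. The step probabilities are toy values and the function name is hypothetical; this is not the d2-AnyOrder or d2-StepMerge estimator.

```python
import math

def trajectory_log_likelihood(steps):
    """Sum of log-probabilities over all tokens revealed along a trajectory.

    steps[t] holds the model probabilities of the tokens unmasked at step t,
    each conditioned on the partially masked sequence at that step.
    """
    return sum(math.log(p) for step in steps for p in step)

# Three unmasking steps revealing 2, 1, and 2 tokens (assumed toy probabilities).
steps = [[0.9, 0.5], [0.8], [0.6, 0.7]]
logp = trajectory_log_likelihood(steps)
```

Evaluated naively, each step's probabilities come from a separate forward pass on a different partially masked input, so the cost grows with the number of steps; this is the cost that a single-pass exact estimator avoids.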
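🎯 Motivation: Visually grounded Chain-of-Thought (CoT) is a promising way to enhance fine-grained perception in multimodal large language models (MLLMs), but its efficacy at inference time remains under-examined.
❓ Problem addressed: Mandating explicit visual grounding during inference can degrade performance and distract the model from its primary task of answer prediction.
🔍 Analysis: Explicit visual grounding interferes with the reasoning process, whereas internalizing localization capability into textual reasoning avoids this negative effect.
🛠️ Method: iVGR, a new RL framework, uses dual-stream training with a consistency reward to transfer localization capability into the textual reasoning process, so explicit grounding is no longer needed at inference.
📊 Data and experiments: Experiments with Qwen2.5-VL and Qwen3-VL on multiple fine-grained benchmarks show that the method significantly outperforms existing baselines and can still support tool-assisted inference workflows.
⭐ Main contributions: Introduces the idea of internalizing visually grounded reasoning, significantly improves multimodal reasoning performance over existing methods, and keeps inference flexible.
Abstract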
While visually grounded Chain-of-Thought (CoT) has emerged as a promising paradigm to enhance fine-grained perception in multimodal large language models (MLLMs), its efficacy during the inference phase remains under-scrutinized. In this work, we empirically find that mandating the explicit object boxes in visually grounded CoT during inference often degrades performance compared to standard textual CoT---which reasons without explicit visual grounding. We hypothesize that the visual localization capability can be internalized into the textual CoT and that the mandatory explicit grounding imposes unnecessary task interference, which detracts from the model's primary focus on answer prediction. To address this problem, we propose Internalizing Visually Grounded Reasoning (**iVGR**), a novel reinforcement learning framework that transfers localization capabilities into the textual reasoning process. We employ a dual-stream training strategy, where a textual stream is aligned with a high-quality (visually) grounded stream via a proposed consistency reward, enabling the model to localize accurately without explicit grounding during inference. Extensive experiments on Qwen2.5-VL and Qwen3-VL demonstrate that our method significantly outperforms existing baselines on fine-grained benchmarks, while maintaining the flexibility to support tool-assisted inference workflows.