[PR] GDPO

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization [link]

Summary:

To enable LMs to respond accurately and align with human preferences across diverse scenarios, this work studies RL with multiple rewards as guidance signals for policy optimization. The paper focuses on GRPO, which replaces PPO's value-function-based advantage estimation with group-relative advantages computed from multiple sampled rollouts, thereby avoiding the need to train and load a separate value model. However, GRPO is not tailored to multi-reward optimization: when distinct reward components are summed and normalized jointly, different reward combinations can collapse to identical advantage values, reducing signal resolution and causing unstable training dynamics in multi-reward settings. To address this limitation, the authors propose GDPO, which decouples group-wise normalization across individual rewards and then applies batch-wise advantage normalization to maintain a stable numerical range that does not grow with the number of rewards, thus improving update stability. GDPO is evaluated against GRPO and GRPO w/o std on multi-reward experiments covering tool-calling, mathematical reasoning, and coding reasoning tasks. The experiments show that GRPO exhibits reward-signal “collapse” and partial training failures in multi-reward settings, whereas GDPO preserves more distinct advantage groups and achieves more stable optimization. The authors also investigate whether adjusting reward weights can force the model to prioritize more challenging reward components (e.g., correctness over easier format rewards), and they report that simple reweighting alone does not fully resolve the dominance of easier rewards in GRPO-style training. Across downstream tasks, GDPO consistently outperforms the GRPO baselines on both correctness and constraint-adherence metrics, demonstrating its effectiveness for stable and expressive multi-reward RL optimization of LMs.
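
The core mechanical difference, as described above, can be sketched in a few lines of NumPy. This is a minimal sketch under assumed conventions (G rollouts per prompt, K reward components, B prompt groups per batch); the function names and the exact placement of the batch-wise normalization step are my assumptions, not the authors' reference implementation.

    import numpy as np

    EPS = 1e-8

    def grpo_advantages(rewards):
        # GRPO-style: sum the K reward components per rollout, then normalize
        # the summed reward within the group of G rollouts.
        # rewards: (G, K) array for one prompt's group of rollouts.
        total = np.asarray(rewards, dtype=float).sum(axis=1)         # (G,)
        return (total - total.mean()) / (total.std() + EPS)          # (G,)

    def gdpo_advantages(batch_rewards):
        # GDPO-style sketch: normalize each reward component within its own
        # group first (decoupled group-wise normalization), combine the
        # per-component scores, then normalize across the whole batch so the
        # advantage range stays stable regardless of the number of rewards K.
        # batch_rewards: (B, G, K) array for B prompt groups in the batch.
        r = np.asarray(batch_rewards, dtype=float)
        group_mean = r.mean(axis=1, keepdims=True)                    # per group, per component
        group_std = r.std(axis=1, keepdims=True)
        per_component = (r - group_mean) / (group_std + EPS)          # (B, G, K)
        combined = per_component.sum(axis=2)                          # (B, G)
        return (combined - combined.mean()) / (combined.std() + EPS)

    # example: B=2 prompts, G=4 rollouts each, K=3 reward components
    batch = np.random.rand(2, 4, 3)
    advantages = gdpo_advantages(batch)    # shape (2, 4)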


Relation to prior work

RLHF has become the standard pipeline for aligning LMs with human preferences, evolving from single scalar rewards that capture aggregated helpfulness to multi-reward setups targeting several objectives at once. PPO dominated early RLHF implementations, using a learned value function for generalized advantage estimation and clipped surrogate objectives to stabilize policy updates against scalar rewards. Group-based methods such as GRPO, DAPO, and Reinforce++ remove PPO's value-model overhead by computing group-relative advantages from multiple rollouts. These have been extended to multi-reward RL by summing heterogeneous components prior to normalization, as in ToolRL and DeepSeek-v3.2, but this risks "reward collapse", where distinct reward combinations map to identical advantages and the learning signal is distorted in conflicting or multi-objective settings.
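
For reference, the group-relative advantage in the original GRPO formulation normalizes each rollout's scalar reward against its group of G rollouts (standard definition, restated here rather than quoted from this paper):

    \hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}

When r_i is already a sum of heterogeneous components, the normalization only sees the total, which is where the collapse discussed above enters.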


Strengths

  1. GDPO decouples normalization. GRPO maps distinct multi-reward combinations such as (0,1), (0,2), and (1,2) to identical advantages after summing and group normalization; the collapsed advantages erode policy-gradient resolution (a numeric illustration follows this list). By decoupling group-wise normalization per reward component before batch-wise advantage normalization, GDPO preserves these fine-grained distinctions, enabling more expressive and stable multi-reward policy updates.
  2. Reward-weighting ablation. The tool-calling ablation shows that upweighting the correctness reward (from 1.0 to 2.0 or 5.0) under GRPO fails to overcome easy-reward dominance: format accuracy plateaus near 100% while correctness stagnates around 50-60%. In contrast, GDPO reaches higher correctness (~75%) with better format adherence, demonstrating that decoupled normalization reduces reliance on heuristic weighting for balanced multi-objective alignment.
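
To make point 1 concrete, here is a quick numeric check (my own illustration, not code from the paper): reading each pair as the summed rewards of a two-rollout group, GRPO's group normalization sends all three distinct pairs to the same advantage vector.

    import numpy as np

    def group_norm(rewards, eps=1e-8):
        # Group-relative normalization of summed rewards (GRPO-style).
        r = np.asarray(rewards, dtype=float)
        return (r - r.mean()) / (r.std() + eps)

    for group in [(0, 1), (0, 2), (1, 2)]:
        print(group, "->", np.round(group_norm(group), 3))
    # (0, 1) -> [-1.  1.]
    # (0, 2) -> [-1.  1.]
    # (1, 2) -> [-1.  1.]

All magnitude information is erased by the group statistics; under GDPO's decoupled scheme each component is normalized within its own group before combination, so (per the paper's claim) more of these distinctions survive into the advantages.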

Future Work

Future work could explore GDPO's application to broader downstream alignment tasks beyond tool-calling, math, and coding, such as instruction-following, safety alignment, and long-context reasoning, where GDPO could replace existing RL-based methods to exploit its stability in high-reward-count settings. Integrating GDPO with additional reward signals would test its scalability while enabling more comprehensive preference modeling, potentially through learned reward-weighting schemes that adapt during training to balance objectives.


— TC Chen, 14 Jan. 2026