[PR] RPT
Reinforcement Pre-Training [link]
Summary:
The paper introduces reinforcement pre-training (RPT), an approach that integrates reinforcement learning with reasoning-based objectives. Unlike conventional pre-training, which focuses solely on next-token prediction over large text corpora, this approach emphasizes reasoning ability rather than surface-level token matching. Traditional next-token prediction faces two main limitations: (1) it requires massive text datasets, and (2) the training objective encourages pattern imitation instead of genuine reasoning, leading models to generate text without understanding the underlying logic. To address these issues, the authors propose a next-token reasoning task within a reinforcement pre-training framework: the model reasons explicitly before predicting each next token and is rewarded when the prediction is verifiably correct, rather than being optimized for likelihood alone. They evaluate the method on language modeling, reinforcement fine-tuning, zero-shot performance on end tasks, and reasoning pattern analysis. Experimental results demonstrate that the reinforcement pre-trained models outperform standard next-token prediction models, even when compared to baselines with more parameters. Furthermore, the analysis shows that the proposed approach mainly elicits inferential reasoning processes, distinguishing it from the structured problem-solving exhibited by conventionally pre-trained models.
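The training signal behind next-token reasoning can be illustrated with a short sketch. The paper describes a prefix-matching reward over byte sequences; the version below is a simplified, assumed rendering of that check, and the function name, boundary representation, and example values are illustrative rather than the authors' code.

```python
def prefix_match_reward(predicted: str, ground_truth: str,
                        valid_boundaries: set[int]) -> float:
    """Return 1.0 if the prediction is an exact byte-level prefix of the
    ground-truth continuation and ends on a valid token boundary, else 0.0.

    `valid_boundaries` holds the cumulative byte lengths of the ground-truth
    tokens (an assumed representation for this sketch).
    """
    pred_bytes = predicted.encode("utf-8")
    gt_bytes = ground_truth.encode("utf-8")
    if gt_bytes.startswith(pred_bytes) and len(pred_bytes) in valid_boundaries:
        return 1.0
    return 0.0


# Illustrative usage: ground-truth continuation "theorem holds", tokenized
# (hypothetically) as ["theorem", " holds"], giving boundaries {7, 13}.
boundaries = {7, 13}
print(prefix_match_reward("theorem", "theorem holds", boundaries))  # 1.0
print(prefix_match_reward("theo", "theorem holds", boundaries))     # 0.0 (not a token boundary)
print(prefix_match_reward("lemma", "theorem holds", boundaries))    # 0.0 (wrong prefix)
```

Because the reward is verifiable against the corpus itself, no human labels or learned reward model are needed, which is what makes the objective usable at pre-training scale.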
Relation to prior work
The prevailing paradigm for training LLMs relies on scaling up pre-training with substantially larger datasets to improve next-token prediction performance. In this framework, pre-training focuses on linguistic fluency and text generation quality, while reasoning capabilities are typically introduced in the post-training stage through methods such as instruction tuning or reinforcement learning. In contrast, RPT integrates these two stages, language modeling and reasoning enhancement, into a unified training process. Related research such as Reinforcement Learning from Human Feedback (RLHF) has demonstrated that reinforcement-based optimization can effectively align models with human preferences during post-training. This success highlights the potential of reinforcement learning paradigms; RPT extends the idea by applying reinforcement objectives directly in the pre-training stage rather than solely for alignment.
Strengths
- Pre-training task design. The proposed RPT introduces a new pre-training task for language modeling. Given an input context from the training corpus, the base model first generates a chain-of-thought (CoT) reasoning sequence and only then predicts the next token, rather than emitting the token directly from the context. This setup encourages self-critique and self-correction behaviors, reformulating next-token prediction as a reasoning problem. As a result, the model develops a deeper understanding of the implicit knowledge in its training data during pre-training, rather than deferring the emergence of reasoning to post-training. The authors employ the OmniMATH dataset, which emphasizes reasoning skills but still contains many tokens that can be predicted without reasoning. To address this, they compute the entropy of a proxy model's next-token distribution at each position and discard low-entropy (easily predictable) positions, so that training focuses on the challenging, informative positions where explicit reasoning actually helps (a minimal sketch of this filtering step appears after this list).
- Performance across tasks. The authors evaluate RPT across three settings: language modeling, reinforcement fine-tuning, and zero-shot performance on end tasks. In each case, they compare RPT with both standard next-token prediction models and larger-parameter Qwen baselines. The results show that the smaller RPT-based Qwen model outperforms not only standard next-token predictors but also larger models. Particularly in zero-shot evaluations on MMLU-Pro and SuperGPQA benchmarks—which cover diverse domains and reasoning-intensive questions—the RPT method exhibits notable performance gains, demonstrating its strong reasoning generalization beyond mathematical tasks.
- Enhanced interpretability. To understand the internal reasoning behavior of RPT, the authors analyze the next-token reasoning patterns it produces. Following prior studies, they categorize reasoning chains into six types: transition, reflection, breakdown, hypothesis, divergent thinking, and deduction. The findings show that RPT encourages inferential processes rather than the structured, step-by-step problem-solving typical of conventionally pre-trained models. This indicates that reinforcement-driven reasoning objectives promote more flexible and human-like reasoning patterns in language models (a simple keyword-tally sketch of this analysis follows the list).
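As referenced in the first strength above, the sketch below illustrates the entropy-based filtering idea: a proxy model scores each position, and only high-entropy (hard-to-predict) positions are kept for next-token reasoning. The proxy model name (gpt2), the threshold value, and the function name are assumptions for illustration, not the paper's actual configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Proxy model choice and entropy threshold are illustrative assumptions.
PROXY_MODEL = "gpt2"
ENTROPY_THRESHOLD = 2.0  # in nats; positions below this are treated as "easy"

tokenizer = AutoTokenizer.from_pretrained(PROXY_MODEL)
model = AutoModelForCausalLM.from_pretrained(PROXY_MODEL).eval()


@torch.no_grad()
def hard_positions(text: str) -> list[int]:
    """Return indices of token positions whose proxy-model predictive entropy
    exceeds the threshold, i.e. positions kept for next-token reasoning."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    logits = model(ids).logits[0]                       # (seq_len, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(-1)    # (seq_len,)
    # entropy[t] is the proxy model's uncertainty about the token at t + 1
    return [t + 1 for t in range(ids.shape[1] - 1)
            if entropy[t].item() > ENTROPY_THRESHOLD]


print(hard_positions("Let x satisfy x^2 - 5x + 6 = 0. Then x equals"))
```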
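The reasoning-pattern analysis in the last strength can similarly be approximated with keyword statistics over generated CoT traces, as sketched below. The keyword lexicons are illustrative guesses for the six categories, not the exact lists used by the paper or the prior work it follows.

```python
from collections import Counter

# Assumed keyword lexicons for the six pattern types.
PATTERN_KEYWORDS = {
    "transition": ["alternatively", "instead", "on the other hand"],
    "reflection": ["wait", "double-check", "re-examine"],
    "breakdown": ["step by step", "break this down", "first,"],
    "hypothesis": ["suppose", "assume", "what if"],
    "divergent thinking": ["another possibility", "other options"],
    "deduction": ["therefore", "thus", "it follows that"],
}


def pattern_counts(cot_traces: list[str]) -> Counter:
    """Count how many traces contain at least one keyword of each pattern."""
    counts = Counter()
    for trace in cot_traces:
        lower = trace.lower()
        for pattern, keywords in PATTERN_KEYWORDS.items():
            if any(k in lower for k in keywords):
                counts[pattern] += 1
    return counts


traces = ["Wait, let me double-check the sign. Therefore the next token is '-'."]
print(pattern_counts(traces))  # Counter({'reflection': 1, 'deduction': 1})
```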
Future Work
This study introduces a new pre-training paradigm; however, the current experiments are limited to mathematical reasoning tasks. Future research could extend RPT to more diverse domains and larger, heterogeneous datasets to evaluate its generalization capability. Additionally, it would be valuable to investigate how RPT interacts with different post-training methods and whether similar challenges arise during the post-training phase. Exploring optimal combinations of RPT with reinforcement-based or instruction-tuning strategies could further enhance both reasoning and alignment performance.
Datasets
- OmniMATH. This dataset contains 4,428 competition-level mathematical problems and solutions.
- MMLU-Pro. A comprehensive multi-task understanding benchmark evaluating LLM capabilities across various domains.
- SuperGPQA. A large-scale benchmark of graduate-level reasoning questions spanning 285 disciplines.
— TC Chen, 5 Jan. 2026