Shanghai Jiao Tong Researchers Propose OctoThinker for Reinforcement Learning-Scalable LLM Development

Introduction: Reinforcement Learning Progress through Chain-of-Thought Prompting

LLMs have made substantial progress on complex reasoning tasks by combining chain-of-thought (CoT) prompting with large-scale reinforcement learning (RL). Models like DeepSeek-R1-Zero have shown strong reasoning capabilities by applying RL directly to base models. Similarly, methods such as SimpleRL and Open-Reasoner-Zero show improvements in smaller models like the Qwen series. However, this success has not carried over consistently across base model families. In particular, R1-Zero-style training has proven difficult to apply to base models such as the Llama series, which raises a fundamental question: what underlying factors cause different base models to behave inconsistently during reinforcement learning?

Limitations of RL Scaling on Llama Models

Large-scale RL has driven advances in models like OpenAI’s o1 and o3 and DeepSeek’s R1 on competition-level mathematics problems, motivating the exploration of RL on smaller models with fewer than 100B parameters. However, these efforts are largely limited to the Qwen model family, and replicating the results on families such as Llama is difficult. The lack of transparency in pre-training pipelines has made it hard to understand how pre-training influences RL scaling. This has prompted unconventional studies, which found that one-shot prompting improves reasoning in Qwen but offers little benefit in Llama. Efforts to curate high-quality mathematical pre-training corpora through projects like OpenWebMath, MathPile, InfiMM-Web-Math, and FineMath have made progress, but these corpora remain limited in scale, at under 100B tokens.

Exploring Mid-Training with Stable-then-Decay Strategy

Researchers from Shanghai Jiao Tong University investigate how mid-training strategies shape RL dynamics, focusing on Qwen and Llama. The study presents several insights. First, high-quality mathematical corpora such as MegaMath-Web-Pro boost both base model and RL outcomes. Second, using QA-style data, especially data with long CoT reasoning, further enhances RL results. Third, long CoT introduces verbosity and instability in RL training. Lastly, scaling up mid-training results in stronger downstream RL performance. The researchers introduce a two-stage mid-training strategy called Stable-then-Decay, in which base models are first trained on 200B tokens and then on 20B tokens across three CoT-focused branches (long, short, and hybrid), resulting in OctoThinker models that show strong RL compatibility.
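The two-stage token budget can be illustrated with a small schedule function. This is only a sketch under stated assumptions: the article gives the 200B and 20B token budgets and the branch names, while the constant-then-cosine learning-rate shape, the `PEAK_LR` and `FINAL_LR` values, and the function name `stable_then_decay_lr` are hypothetical choices made here for illustration and are not taken from the paper.

```python
import math

# Token budgets reported in the article.
STABLE_TOKENS = 200e9   # stage 1: shared stable stage
DECAY_TOKENS = 20e9     # stage 2: per-branch decay stage

# Hypothetical values, chosen only to make the sketch concrete.
PEAK_LR = 3e-5
FINAL_LR = 3e-6

# Branch names mentioned in the article; each branch would run its own
# decay stage starting from the same stable-stage checkpoint.
BRANCHES = ("long", "short", "hybrid")


def stable_then_decay_lr(tokens_seen: float) -> float:
    """Return an illustrative learning rate after `tokens_seen` tokens.

    Assumption: constant rate during the stable stage, then cosine decay
    to FINAL_LR over the decay stage. The article does not specify this.
    """
    if tokens_seen <= STABLE_TOKENS:
        return PEAK_LR
    # Fraction of the decay stage completed, clamped to [0, 1].
    progress = min((tokens_seen - STABLE_TOKENS) / DECAY_TOKENS, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine


if __name__ == "__main__":
    for tokens in (0, 100e9, 200e9, 210e9, 220e9):
        print(f"{tokens / 1e9:>5.0f}B tokens -> lr {stable_then_decay_lr(tokens):.2e}")
```

One consequence of this layout is that the expensive 200B-token stage is paid once, and each of the three branches adds only a 20B-token continuation.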

RL Configuration and Benchmark Evaluation

The researchers use the MATH8K dataset for RL training prompts. The configuration includes a global training batch size of 128, 16 rollout responses per query, and a PPO mini-batch size of 64, with experiments conducted on the Llama-3.2-3B-Base and Qwen2.5-3B-Base models. For evaluation, few-shot prompting is used for base language models and zero-shot prompting for RL-tuned models across indicator tasks, including GSM8K, MATH500, OlympiadBench, and AMC23. During RL training, Qwen models produce progressively longer responses that remain reasonable throughout, whereas Llama behaves abnormally, with average response lengths escalating to 4,096 tokens. Evaluation further reveals that RL-tuned Qwen2.5-3B improves across benchmarks, while Llama-3.2-3B shows only marginal gains.
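To make the batch arithmetic concrete, the sketch below derives per-step quantities from the reported settings and adds a simple response-length check. The class and function names (`RLConfig`, `length_report`) are hypothetical, not from the paper or any library. Two assumptions are made and marked in the comments: the PPO mini-batch size is counted in prompts, and 4,096 tokens is treated as a response-length cap, which the article does not state explicitly.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RLConfig:
    train_batch_size: int = 128      # prompts per RL step (from the article)
    rollouts_per_prompt: int = 16    # responses sampled per query (from the article)
    ppo_mini_batch_size: int = 64    # from the article
    max_response_tokens: int = 4096  # assumption: treated here as a length cap

    @property
    def responses_per_step(self) -> int:
        """Total responses generated in one RL step."""
        return self.train_batch_size * self.rollouts_per_prompt

    @property
    def updates_per_step(self) -> int:
        """Policy updates per step, assuming mini-batches are counted in prompts."""
        return self.train_batch_size // self.ppo_mini_batch_size


def length_report(lengths: list[int], cfg: RLConfig) -> tuple[float, float]:
    """Return (average length, fraction of responses at or above the cap)."""
    at_cap = sum(1 for n in lengths if n >= cfg.max_response_tokens)
    return mean(lengths), at_cap / len(lengths)


if __name__ == "__main__":
    cfg = RLConfig()
    print(cfg.responses_per_step, cfg.updates_per_step)
    # Made-up lengths, for illustration only.
    print(length_report([4096, 4096, 2048, 1024], cfg))
```

Tracking the fraction of responses at the cap alongside the average is one way to tell healthy length growth from the runaway behavior described for Llama, since an average pinned near the cap implies that most responses are being cut off.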

OctoThinker Outperforms Llama in RL Compatibility

Each OctoThinker branch demonstrates a 10% to 20% improvement over the original Llama base model and consistent gains over the stable-stage model across all sizes when evaluated on 13 mathematical benchmarks. The OctoThinker-Zero families reveal diverse thinking behaviors during RL scaling, with strong performance from the OctoThinker-Long variant. When three 3B-scale base models are compared during RL training, OctoThinker-Long-3B outperforms the original Llama-3.2-3B model and reaches performance parity with Qwen2.5-3B, a model known for strong reasoning capabilities and extensive pre-training. The hybrid and short branches show slightly lower performance, especially on challenging benchmarks.

Conclusion and Future Work: Toward RL-Ready Foundation Models

This paper investigates why base models such as Llama and Qwen exhibit divergent behaviors during RL for reasoning, showing that mid-training plays a major role in RL scalability. The two-stage mid-training strategy transforms Llama into a foundation model better suited for RL, resulting in OctoThinker models. Future research directions include:

  • Curating higher-quality mathematical corpora to improve mid-training.
  • Creating RL-friendly base models using open recipes without distillation from long CoT reasoning models.
  • Separating the QA format and content to understand their contributions individually.
  • Expanding the OctoThinker family with new branches, such as tool-integrated reasoning.

Check out the Paper, Hugging Face Page and GitHub Page. All credit for this research goes to the researchers of this project.


Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.
