ByteDance Introduces Seed-Prover: An Advanced Formal Reasoning System for Automated Mathematical Theorem Proving

LLMs have shown notable improvements in mathematical reasoning by reasoning at length in natural language, yielding performance gains on benchmarks such as MATH and AIME. However, reinforcement learning (RL) for training these models faces a challenge: verifying the correctness of natural language proofs is very difficult and requires careful manual checking of each reasoning step. This limits the use of RL for training mathematical theorem-proving models. Formal languages like Lean offer automatic correctness verification, but current LLM-based formal provers have limitations of their own. Step-level provers generate code incrementally but require special scaffolding and lack high-level reasoning capabilities.

The ByteDance Seed Team introduces Seed-Prover, a lemma-style whole-proof reasoning model. It refines proofs iteratively using Lean feedback, previously established lemmas, and self-summarization. Seed-Prover employs three specialized test-time inference strategies that enable both deep and broad reasoning for solving IMO-level contest problems. Its primary innovation is adopting lemma-style proving as its core method, placing lemmas at the center of the reasoning process rather than relying on traditional step-by-step or whole-proof generation. The paper also introduces Seed-Geometry, a complementary geometry reasoning engine that compensates for Lean’s limited support for geometry.
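
The refinement loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Seed-Prover's actual implementation: `generate` stands in for the LLM, `check` stands in for a call to the Lean compiler, and the names `ProverState` and `refine` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class ProverState:
    """Context carried between rounds (hypothetical structure)."""
    lemmas: List[str] = field(default_factory=list)  # lemmas the checker accepted
    summary: str = ""                                # running notes on failed attempts


# generate(statement, state, feedback) -> (candidate_lemmas, main_proof)
Generator = Callable[[str, ProverState, str], Tuple[List[str], str]]
# check(code) -> (accepted, feedback_message)
Checker = Callable[[str], Tuple[bool, str]]


def refine(statement: str, generate: Generator, check: Checker,
           max_rounds: int = 8) -> Tuple[Optional[str], ProverState]:
    """Iteratively refine a proof using checker feedback and a lemma pool."""
    state = ProverState()
    feedback = ""
    for _ in range(max_rounds):
        lemmas, proof = generate(statement, state, feedback)
        # Keep every lemma the checker accepts, even if the main proof fails,
        # so later rounds can build on it.
        for lemma in lemmas:
            accepted, _ = check(lemma)
            if accepted and lemma not in state.lemmas:
                state.lemmas.append(lemma)
        accepted, feedback = check(proof)
        if accepted:
            return proof, state
        # Crude stand-in for self-summarization: record why the attempt failed.
        state.summary += f"attempt failed: {feedback}\n"
    return None, state
```

The point of the sketch is the data flow: verified lemmas persist across rounds while failed main proofs are discarded, which is what distinguishes a lemma-centered loop from regenerating a whole proof from scratch each time.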

Seed-Prover is trained to interact with Lean through multi-stage, multi-task RL based on VAPO. The training dataset combines open-source datasets with in-house formal problems, uses a proposer to create simpler variants of difficult tasks, and excludes overly simple problems, meaning those with proof rates above 25%. Seed-Geometry’s backend supports large-scale problem generation, identifying over 230 million unique problems across seven days with an eightfold improvement in search efficiency. Separate policy and value models are trained, though extensive testing shows that value models may reduce performance due to estimation errors. As a result, step-by-step generation with beam search is adopted in distributed setups.
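
Two of the mechanisms in this paragraph are easy to illustrate: the proof-rate filter and beam search. The following is a generic sketch, not the authors' code. The function names, the `proof_rates` mapping, and the `expand`, `score`, and `is_goal` callbacks are all assumptions made for illustration; in particular, how candidates are scored in Seed-Geometry is not specified here.

```python
import heapq
from typing import Callable, Dict, Iterable, List, Optional, TypeVar

S = TypeVar("S")


def keep_hard_problems(proof_rates: Dict[str, float],
                       max_rate: float = 0.25) -> List[str]:
    """Drop problems whose measured proof rate is above max_rate."""
    return [name for name, rate in proof_rates.items() if rate <= max_rate]


def beam_search(start: S,
                expand: Callable[[S], Iterable[S]],
                score: Callable[[S], float],
                is_goal: Callable[[S], bool],
                beam_width: int = 4,
                max_steps: int = 16) -> Optional[S]:
    """Generic beam search: keep only the best beam_width states per step."""
    beam: List[S] = [start]
    for _ in range(max_steps):
        for state in beam:
            if is_goal(state):
                return state
        # Expand every state in the beam by one step.
        candidates = [nxt for state in beam for nxt in expand(state)]
        if not candidates:
            return None
        # Prune to the highest-scoring candidates (duplicates are not removed).
        beam = heapq.nlargest(beam_width, candidates, key=score)
    return next((state for state in beam if is_goal(state)), None)
```

For example, `keep_hard_problems({"a": 0.9, "b": 0.1})` keeps only `"b"`. In the beam search, a larger `beam_width` trades more computation per step for a lower chance of pruning the state that leads to a solution.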

Seed-Prover achieves state-of-the-art results across multiple mathematical benchmarks. For IMO 2025, the system fully solves 5 out of 6 problems, with Seed-Geometry instantly solving Problem 2 and Seed-Prover deriving proofs for the other four solved problems using various inference settings. On past IMO problems, it proved 121 out of 155 tasks, a 78.1% success rate across all difficulty levels. The breakdown by difficulty is 47 out of 55 easy problems, 47 out of 56 medium problems, and 27 out of 44 hard problems. Subject-specific results include 72 out of 85 in algebra, 42 out of 55 in number theory, and 7 out of 14 in combinatorics.
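
As a quick arithmetic check, the difficulty breakdown can be converted to percentages and summed to confirm that it matches the headline figure of 121 out of 155. The counts below are copied from the paragraph above; the variable and function names are only illustrative.

```python
# Solved / total counts on past IMO problems, as quoted in the article.
by_difficulty = {"easy": (47, 55), "medium": (47, 56), "hard": (27, 44)}


def success_rate(solved: int, total: int) -> float:
    """Return the success rate as a percentage."""
    return 100.0 * solved / total


solved_all = sum(solved for solved, _ in by_difficulty.values())
total_all = sum(total for _, total in by_difficulty.values())
overall = success_rate(solved_all, total_all)

for level, (solved, total) in by_difficulty.items():
    print(f"{level}: {solved}/{total} = {success_rate(solved, total):.1f}%")
print(f"overall: {solved_all}/{total_all} = {overall:.1f}%")
```

The per-level rates make the difficulty gradient explicit: the easy and medium tiers sit in the mid-80s, while the hard tier drops to roughly 61%.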

On MiniF2F, Seed-Prover achieves a 99.6% proof rate on both the validation and test sets under medium settings, solving difficult problems such as IMO 1990 P3. On PutnamBench, the number of solved problems rises from 201 to 331 out of 657 when moving from light to medium inference settings, a significant jump over previous systems on undergraduate-level mathematics. On CombiBench, Seed-Prover solves 30 out of 100 combinatorics problems, outperforming existing methods but revealing continued challenges in combinatorial reasoning. On MiniCTX-v2, it reaches 81.8% success, showing strong generalization beyond competition problems and outperforming the o4-mini baseline’s 44.3% at Pass@8.
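
For readers unfamiliar with the Pass@8 metric: Pass@k is the probability that at least one of k sampled attempts solves a problem. The article does not say how the baseline figure was computed, so the following is only the commonly used unbiased estimator, given n sampled attempts of which c are correct; the function name is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n samples with c correct ones.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    a random subset of k samples contains no correct attempt.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 whenever
    # every size-k subset must contain a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark-level Pass@k score is then the average of this per-problem estimate over all problems in the benchmark.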

In conclusion, ByteDance Seed presents Seed-Geometry and Seed-Prover, two formal reasoning systems that build on the capabilities of LLMs. Seed-Geometry provides accelerated verification and enhanced search mechanisms, while Seed-Prover uses iterative refinement and complex test-time inference strategies. Solving 5 out of 6 problems at IMO 2025 shows the practical efficacy of these methods in tackling elite mathematical competitions. Formal languages like Lean provide rapid proof verification that is more cost-effective than human experts and more reliable than LLM-based judges. Future research will focus on combining formal systems with LLMs to address open conjectures.

Check out the Paper and GitHub Page.

Sajjad Ansari is a final year undergraduate from IIT Kharagpur. As a Tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.