Current LLM-based software engineering agents rely heavily on human-curated data (e.g., GitHub issues, pull requests) and environments (e.g., test suites), which limits their path to superintelligence. The paper introduces Self-play SWE-RL (SSR), a reinforcement learning framework that trains a single LLM agent in a self-play loop under minimal assumptions: it only requires access to sandboxed real-world code repositories (source code plus dependencies), with no human-labeled issues and no pre-existing tests.
Core Idea: Self-Play Loop
A single LLM acts in two roles (sharing parameters, trained jointly via RL):
- Bug Injector (Challenger): Autonomously discovers how to run tests, then injects realistic bugs by modifying code and weakening tests to hide failures.
- Bug Solver: Repairs the bug using only the reversed test-weakening patch as a formal specification (no natural language description).
- Bugs are defined via a structured "bug artifact" (test script, parser, code patch, test-weakening patch); a minimal sketch of this structure follows the list.
- Failed repairs generate higher-order bugs (bugs on top of bugs), enabling layered, increasingly complex learning.
- Bug injection strategies include code removal, history-aware reversions from git logs (most effective), and direct modifications.
- Rewards balance incentives: injector rewarded for creating challenging but solvable bugs; solver for successful repairs.
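A minimal sketch of what such a bug artifact might look like as a data structure (the field names and the naive diff-reversal helper are illustrative assumptions, not the paper's exact schema):

```python
from dataclasses import dataclass


def reverse_unified_diff(patch: str) -> str:
    """Naively flip added/removed lines of a unified diff (hunk headers are
    left untouched; for illustration only)."""
    flipped = []
    for line in patch.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            flipped.append("-" + line[1:])
        elif line.startswith("-") and not line.startswith("---"):
            flipped.append("+" + line[1:])
        else:
            flipped.append(line)
    return "\n".join(flipped)


@dataclass
class BugArtifact:
    """Hypothetical container for one injected bug."""
    test_script: str           # command the injector discovered for running the test suite
    output_parser: str         # code the injector wrote to map raw test output to pass/fail
    code_patch: str            # diff that introduces the bug into the source
    test_weakening_patch: str  # diff that relaxes the tests so the bug initially goes unnoticed

    def solver_spec(self) -> str:
        # The solver sees only the reversed test-weakening patch: the
        # re-strengthened tests act as a formal, verifiable specification.
        return reverse_unified_diff(self.test_weakening_patch)
```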
The agent must work out everything agentically, including how to execute tests, parse their outputs, and validate fixes, grounded in real codebases.
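Putting the pieces together, one self-play round could be wired up roughly as follows; the agent's `inject_bug`, `solve`, and `parse_test_output` methods and the reward shaping are assumptions for illustration, not the paper's actual interface:

```python
import subprocess


def run_tests(repo_dir: str, test_script: str) -> str:
    """Run the injector-discovered test command inside the sandboxed repository
    and return its raw output (sandboxing details omitted)."""
    result = subprocess.run(
        test_script, shell=True, cwd=repo_dir,
        capture_output=True, text=True, timeout=600,
    )
    return result.stdout + result.stderr


def self_play_round(agent, repo_dir: str, n_attempts: int = 4):
    """One challenger/solver round; `agent` is the single shared LLM policy
    acting in both roles (the methods below are assumed, not the paper's API)."""
    # 1. Challenger turn: figure out how to run the tests, inject a bug,
    #    and weaken the tests so the bug initially goes undetected.
    artifact = agent.inject_bug(repo_dir)                        # hypothetical method
    # 2.-3. Solver turns: attempt repairs given only the re-strengthened tests
    #    as spec, then validate by running the original tests and parsing the
    #    output with the injector-written parser.
    solved_flags = []
    for _ in range(n_attempts):
        agent.solve(repo_dir, artifact.solver_spec())            # hypothetical method
        raw_output = run_tests(repo_dir, artifact.test_script)
        solved_flags.append(
            agent.parse_test_output(artifact.output_parser, raw_output)  # hypothetical
        )
        # (resetting the repo between attempts is omitted here)
    # 4. Rewards (placeholder shaping, not the paper's exact scheme): the solver
    #    is paid per verified repair; the injector is paid for bugs that are
    #    solvable but not trivial.
    solver_rewards = [1.0 if s else 0.0 for s in solved_flags]
    solve_rate = sum(solver_rewards) / n_attempts
    injector_reward = 1.0 if 0.0 < solve_rate < 1.0 else 0.0
    return injector_reward, solver_rewards
```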
Key Contributions
- First self-play RL paradigm for software agents, inspired by AlphaZero but applied to code.
- Minimal human dependency: No need for issues, tests, or language-specific tools.
- Novel bug specification via test patches (formal, verifiable) instead of natural-language descriptions.
- Theoretical analysis of the challenger-solver game, showing an optimal solve rate of roughly 20% (see the illustrative construction after this list).
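To build intuition for how such an optimum can arise (a hypothetical construction, not necessarily the paper's exact reward design): suppose the injector is rewarded only when exactly one of $n$ independent solver rollouts repairs its bug, and each rollout succeeds with probability $p$. The expected injector reward is then

$$\mathbb{E}[r_{\text{injector}}] = n\,p\,(1-p)^{n-1},$$

which is maximized at $p = 1/n$; with $n = 5$ rollouts, the injector's best response is to target a solve rate of exactly 20%.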
Results
- Trained from a 32B Code World Model checkpoint that had not undergone any prior RL.
- Evaluated on SWE-bench Verified (500 tasks) and SWE-Bench Pro (731 tasks).
- SSR shows steady self-improvement and outperforms a human-data RL baseline throughout training.
- Gains: +10.4 points on Verified, +7.8 on Pro.
- Ablations confirm: joint self-play beats training the roles in isolation; history-aware injection is critical; diverse bugs are essential.
Discussion and Limitations
- Demonstrates that autonomous curriculum generation from real repositories can surpass human-curated data.
- Limitations: relies on unit tests; no natural-language issue synthesis; potential instability if the injector becomes too adversarial.
- Future work: multi-step tasks, better seeding, long-horizon RL.
Overall, SSR is an early but promising step toward superintelligent agents that learn indefinitely by generating their own challenges in real software environments, potentially enabling systems that create novel software autonomously.