Training AI Co-Scientists Using Rubric Rewards
Core Idea
AI "co-scientists" (LLM-based assistants) can help researchers by generating detailed research plans from given goals and constraints. However, current models often produce plans that violate implicit requirements due to the open-ended nature of scientific planning and the lack of fast, cheap feedback (unlike code execution). This paper proposes a scalable, unsupervised training method using reinforcement learning (RL) with automatically extracted "rubric rewards" from existing scientific papers, enabling models to self-improve plan quality without human labeling or experiment execution.
Key Contributions
- ResearchPlanGen Dataset: A large, diverse corpus automatically extracted from scientific papers across domains (primarily ML, extended to medicine and recent arXiv preprints). Each example pairs a research goal (aims and constraints) with a goal-specific grading rubric (structured criteria for evaluating plans).
- Rubric-Based RL Training: Finetune LLMs (e.g., Qwen3-30B-A3B) via RL (see the sketch after this list) where:
  - The training policy generates plans.
  - A frozen copy of the initial model acts as a verifier/grader, scoring plans with privileged access to the rubric and general guidelines.
  - This creates a generator-verifier gap that enables stable self-improvement without external supervision.
- No Need for Execution Feedback: Unlike domains with simulators, this works in settings (e.g., medical research) where verifying plans via real experiments is infeasible.
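The self-grading loop is simple enough to show in a few lines. Below is a minimal Python sketch under stated assumptions: the `policy.generate` and `frozen_grader.score` helpers, the prompt wording, and the 1-5 scale are illustrative, not the paper's actual interfaces. The key point it captures is that the policy sees only the goal, while the frozen grader scores the plan against the privileged rubric and the mean score becomes the RL reward.

```python
# Minimal sketch of the rubric-based self-grading reward (illustrative names,
# not the paper's actual API). The policy never sees the rubric; a frozen copy
# of the initial model grades the plan against it.

def rubric_reward(policy, frozen_grader, goal, rubric, guidelines):
    """Compute a scalar reward for one research goal."""
    # 1. The training policy generates a plan from the goal/constraints only.
    plan = policy.generate(goal)

    # 2. The frozen grader scores the plan criterion by criterion, using the
    #    rubric and general grading guidelines as privileged context.
    scores = []
    for criterion in rubric:
        prompt = (
            f"{guidelines}\n\nResearch goal:\n{goal}\n\n"
            f"Proposed plan:\n{plan}\n\n"
            f"Criterion:\n{criterion}\n\nScore this plan from 1 to 5:"
        )
        scores.append(frozen_grader.score(prompt))  # assumed to return an int in 1..5

    # 3. Reward = mean rubric score, rescaled to [0, 1] for the RL update.
    return (sum(scores) / len(scores) - 1) / 4
```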
Methods
- Data Extraction: Use an LLM to parse each paper and extract its research goal plus domain-specific rubric criteria (e.g., novelty, feasibility, rigor); a rough sketch of this step appears after this list.
- Training Setup: Self-grading RL in which the rubric is privileged information given only to the grader; rewards are derived from the grader's rubric scores.
- Models: Tested on open-source bases like Qwen3; scalable to larger models.
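As a rough illustration of the data-extraction step, the sketch below shows one way a (goal, rubric) pair could be pulled from a paper with an LLM. The schema, the prompt, and the `llm.complete` helper are assumptions for illustration, not the released ResearchPlanGen format or pipeline.

```python
# Illustrative extraction of a (goal, rubric) pair from a paper's text.
# Field names, the prompt, and the llm.complete() helper are assumptions.

from dataclasses import dataclass, field
import json

@dataclass
class PlanGenExample:
    paper_id: str                  # identifier of the source paper
    goal: str                      # research aim and constraints shown to the policy
    rubric: list[str] = field(default_factory=list)  # goal-specific grading criteria

EXTRACTION_PROMPT = """Read the paper below and return JSON with two keys:
- "goal": the research aim and key constraints, without revealing the paper's method
- "rubric": 5-10 criteria a strong plan should satisfy (e.g., novelty, feasibility, rigor)

Paper:
{paper_text}"""

def extract_example(llm, paper_id: str, paper_text: str) -> PlanGenExample:
    raw = llm.complete(EXTRACTION_PROMPT.format(paper_text=paper_text))
    parsed = json.loads(raw)       # assumes the LLM returns valid JSON
    return PlanGenExample(paper_id=paper_id,
                          goal=parsed["goal"],
                          rubric=parsed["rubric"])
```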
Results
- Human Evaluation (ML Domain): 225 hours of review by expert machine-learning researchers. The finetuned model was preferred over the baseline in 70% of cases, and experts approved 84% of the auto-extracted rubrics.
- Automated Evaluation: 12–22% relative improvements in rubric scores across ML papers, medical papers, and recent arXiv preprints.
- Cross-Domain Generalization: Strong transfer to unseen domains (e.g., medicine) without domain-specific training.
- Ablations: Confirm rubric access and self-grading are critical for gains.
Discussion and Implications
- Demonstrates a fully automated, scalable recipe for training better AI research assistants using the vast existing literature—no costly human feedback loops needed.
- Potential step toward general AI co-scientists that reliably brainstorm or outline executable research.
- Limitations: Relies on the quality of auto-extracted rubrics; may inherit biases from source papers; evaluated mostly via proxies (rubric scores and human preferences), not real-world execution success.
- Future: Expand to more domains, multi-step planning, or integration with execution tools.
Overall, the work provides a promising unsupervised path to stronger open-ended scientific reasoning in LLMs, and the dataset is released for community use.