AI "co-scientists" (LLM-based assistants) can help researchers by generating detailed research plans from given goals and constraints. However, current models often produce plans that violate implicit requirements due to the open-ended nature of scientific planning and the lack of fast, cheap feedback (unlike code execution). This paper proposes a scalable, unsupervised training method using reinforcement learning (RL) with automatically extracted "rubric rewards" from existing scientific papers, enabling models to self-improve plan quality without human labeling or experiment execution.
Authors: Shashwat Goel et al.
Published: December 2025
Core Idea:
AI "co-scientists" (LLM-based assistants) can help researchers by generating detailed research plans from given goals and constraints. However, current models often produce plans that violate implicit requirements due to the open-ended nature of scientific planning and the lack of fast, cheap feedback (unlike code execution). This paper proposes a scalable, unsupervised training method using reinforcement learning (RL) with automatically extracted "rubric rewards" from existing scientific papers, enabling models to self-improve plan quality without human labeling or experiment execution.
Key Contributions
ResearchPlanGen Dataset: A large, diverse corpus automatically extracted from scientific papers across domains (primarily ML, extended to medicine and recent arXiv preprints). Each entry pairs a research goal (aims and constraints) with a goal-specific grading rubric (structured criteria for evaluating plans).
Rubric-Based RL Training: Finetune LLMs (e.g., Qwen3-30B-A3B) via RL where:
The training policy generates plans.
A frozen copy of the initial model acts as a "verifier/grader," scoring plans using privileged rubric access and general guidelines.
This creates a generator-verifier gap that enables stable self-improvement without external supervision (see the reward sketch after this list).
No Need for Execution Feedback: Unlike domains with simulators, this works in settings (e.g., medical research) where verifying plans via real experiments is infeasible.
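The generator-verifier setup reduces to a reward function in which a frozen grader model scores each sampled plan against the paper-derived rubric. Below is a minimal Python sketch of that reward signal; the dataclass, prompt wording, 0-10 score scale, and weighted-average aggregation are illustrative assumptions rather than the paper's exact implementation, and `frozen_grader` stands in for any call into the frozen copy of the initial model.

```python
# Minimal sketch of rubric-based self-grading rewards (names and prompt are assumptions).
import json
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricCriterion:
    description: str   # e.g. "The plan specifies a held-out evaluation protocol"
    weight: float = 1.0


def build_grading_prompt(goal: str, plan: str, rubric: List[RubricCriterion]) -> str:
    """The grader sees the rubric; the plan generator never does (privileged information)."""
    criteria = "\n".join(f"{i + 1}. {c.description}" for i, c in enumerate(rubric))
    return (
        "Grade the following research plan against the rubric.\n\n"
        f"Research goal:\n{goal}\n\nPlan:\n{plan}\n\nRubric:\n{criteria}\n\n"
        'Reply with JSON: {"scores": [<0-10 per criterion>]}'
    )


def rubric_reward(
    goal: str,
    plan: str,
    rubric: List[RubricCriterion],
    frozen_grader: Callable[[str], str],  # frozen copy of the initial model
) -> float:
    """Score one generated plan; the scalar in [0, 1] serves as the RL reward."""
    raw = frozen_grader(build_grading_prompt(goal, plan, rubric))
    try:
        scores = json.loads(re.search(r"\{.*\}", raw, re.DOTALL).group(0))["scores"]
    except (AttributeError, KeyError, ValueError):
        return 0.0  # unparseable grader output gets zero reward
    total_weight = sum(c.weight for c in rubric)
    return sum(c.weight * min(max(s, 0), 10) / 10.0
               for c, s in zip(rubric, scores)) / total_weight
```

In training, this scalar would feed a standard RL update (e.g., a GRPO-style policy-gradient step) on the plan generator while the grader's weights stay fixed, preserving the generator-verifier gap.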
Methods
Data Extraction: Use LLMs to parse papers and extract goals plus domain-specific rubrics (e.g., novelty, feasibility, rigor); see the extraction sketch after this list.
Training Setup: Self-grading RL with rubrics as privileged information to the grader; rewards based on rubric scores.
Models: Tested on open-source bases like Qwen3; scalable to larger models.
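As a companion to the reward sketch above, here is a minimal sketch of how goals and rubrics might be pulled from a paper with an extractor LLM. The prompt wording, JSON schema, and `ResearchPlanExample` record are assumptions for illustration, not the dataset's actual extraction pipeline.

```python
# Minimal sketch of goal/rubric extraction from a paper (schema and prompt are assumptions).
import json
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ResearchPlanExample:
    goal: str                                         # aims + constraints stated by the paper
    rubric: List[str] = field(default_factory=list)   # goal-specific grading criteria


EXTRACTION_PROMPT = (
    "Read the paper text below. Extract (1) the research goal, phrased as aims and "
    "constraints an assistant could plan against, and (2) 5-10 goal-specific rubric "
    "criteria covering aspects such as novelty, feasibility, and rigor.\n"
    'Return JSON: {"goal": str, "rubric": [str, ...]}\n\nPaper:\n'
)


def extract_example(paper_text: str,
                    extractor_llm: Callable[[str], str]) -> ResearchPlanExample:
    """Turn one paper into a (goal, rubric) training example."""
    parsed = json.loads(extractor_llm(EXTRACTION_PROMPT + paper_text))
    return ResearchPlanExample(goal=parsed["goal"], rubric=parsed["rubric"])
```

Each extracted example then supplies the goal shown to the plan generator and the rubric handed, as privileged information, to the frozen grader in the reward function sketched earlier.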
Results
Human Evaluation (ML Domain): 225 hours of review by expert machine learning researchers. The finetuned model was preferred over the baseline in 70% of cases, and experts approved 84% of the auto-extracted rubrics.
Automated Evaluation: 12–22% relative improvements in rubric scores across ML, medical papers, and new arXiv preprints.
Cross-Domain Generalization: Strong transfer to unseen domains (e.g., medicine) without domain-specific training.
Ablations: Confirm rubric access and self-grading are critical for gains.
Discussion and Implications
Demonstrates a fully automated, scalable recipe for training better AI research assistants using the vast existing literature—no costly human feedback loops needed.
Potential step toward general AI co-scientists that can reliably brainstorm ideas or outline executable research plans.
Limitations: Relies on quality of auto-extracted rubrics; may inherit biases from source papers; evaluated mostly via proxies (rubrics/human prefs), not real-world execution success.
Future: Expand to more domains and multi-step planning, or integrate with execution tools.
Overall, the work provides a promising unsupervised path to improving open-ended scientific reasoning in LLMs; the dataset is released for community use.