
Training AI Co-Scientists Using Rubric Rewards

Authors: Goel, Shashwat, et al.
Published: December 2025

Core Idea

AI "co-scientists" (LLM-based assistants) can help researchers by generating detailed research plans from given goals and constraints. However, current models often produce plans that violate implicit requirements due to the open-ended nature of scientific planning and the lack of fast, cheap feedback (unlike code execution). This paper proposes a scalable, unsupervised training method using reinforcement learning (RL) with automatically extracted "rubric rewards" from existing scientific papers, enabling models to self-improve plan quality without human labeling or experiment execution.

 

Key Contributions

  • ResearchPlanGen Dataset: A large, diverse corpus automatically extracted from scientific papers across domains (primarily ML, but extended to medicine and recent arXiv preprints). Includes research goals (aims/constraints) and goal-specific grading rubrics (structured criteria for evaluating plans).
  • Rubric-Based RL Training: Finetunes LLMs (e.g., Qwen3-30B-A3B) via RL (see the sketch after this list), where:
      • The training policy generates plans.
      • A frozen copy of the initial model acts as a "verifier/grader," scoring plans using privileged rubric access and general guidelines.
      • This creates a generator-verifier gap for stable self-improvement without external supervision.
  • No Need for Execution Feedback: Unlike domains with simulators, this works in settings (e.g., medical research) where verifying plans via real experiments is infeasible.
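
A hypothetical sketch of the self-grading RL step (the interfaces `policy.generate`, `frozen_grader.grade`, and `rl_update` are assumptions, not the paper's code): the trainable policy sees only the research goal, while a frozen copy of the initial model grades the plan with privileged access to the goal-specific rubric, and the aggregated score becomes the RL reward.

```python
# Hypothetical self-grading RL step under the assumptions stated above.
def self_grading_step(policy, frozen_grader, batch, rl_update):
    rewards = []
    for example in batch:
        # 1. Policy generates a research plan from the goal (aims + constraints).
        plan = policy.generate(example["goal"])

        # 2. Frozen grader scores the plan against the privileged rubric,
        #    returning one score in [0, 1] per criterion.
        scores = frozen_grader.grade(goal=example["goal"],
                                     plan=plan,
                                     rubric=example["rubric"])

        # 3. Aggregate criterion scores into a scalar reward.
        rewards.append(sum(scores) / len(scores))

    # 4. Any standard policy-gradient update can consume these rewards;
    #    the grader's weights stay frozen, preserving the generator-verifier gap.
    rl_update(policy, batch, rewards)
```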

 

Methods

  • Data Extraction: Use LLMs to parse papers and extract goals plus domain-specific rubrics (e.g., novelty, feasibility, rigor); see the sketch after this list.
  • Training Setup: Self-grading RL with rubrics provided as privileged information to the grader; rewards are based on rubric scores.
  • Models: Tested on open-source bases like Qwen3; scalable to larger models.
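
A sketch of how the extraction step might look (the prompt wording and the `call_llm` helper are assumptions, not the paper's actual pipeline): an LLM reads a paper and returns the research goal plus a goal-specific grading rubric as JSON.

```python
import json

# Hypothetical extraction prompt; the paper's actual wording may differ.
EXTRACTION_PROMPT = """Read the paper below and return JSON with two keys:
  "goal": the research aims and constraints, stated without revealing the paper's solution;
  "rubric": a list of criteria a strong plan for this goal should satisfy
            (e.g., novelty, feasibility, methodological rigor).

Paper:
{paper_text}
"""

def extract_goal_and_rubric(paper_text: str, call_llm) -> dict:
    """`call_llm` is an assumed helper: any function mapping a prompt string
    to the model's text reply (e.g., a thin wrapper around an API client)."""
    reply = call_llm(EXTRACTION_PROMPT.format(paper_text=paper_text))
    return json.loads(reply)  # expected shape: {"goal": str, "rubric": [str, ...]}
```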

 

Results

  • Human Evaluation (ML Domain): 225 hours of expert review by machine learning researchers. The finetuned model was preferred over the baseline in 70% of cases, and experts approved 84% of the auto-extracted rubrics.
  • Automated Evaluation: 12–22% relative improvements in rubric scores across ML, medical papers, and new arXiv preprints.
  • Cross-Domain Generalization: Strong transfer to unseen domains (e.g., medicine) without domain-specific training.
  • Ablations: Confirm rubric access and self-grading are critical for gains.


Discussion and Implications

  • Demonstrates a fully automated, scalable recipe for training better AI research assistants using the vast existing literature—no costly human feedback loops needed.
  • Potential step toward general AI co-scientists that reliably brainstorm or outline executable research.
  • Limitations: Relies on quality of auto-extracted rubrics; may inherit biases from source papers; evaluated mostly via proxies (rubrics/human prefs), not real-world execution success.
  • Future: Expand to more domains, multi-step planning, or integration with execution tools.

 

Overall, the work offers a promising unsupervised path to stronger open-ended scientific reasoning in LLMs, and the authors release the dataset for community use.

DOI: 10.48550/arXiv.2512.23707