The AI Scientist is a framework designed to automate the entire research lifecycle. It enables large language models (LLMs) to independently generate new research ideas, write code, execute experiments, visualize results, and write up the findings as full scientific papers. It also includes a simulated review process for evaluating those papers.
While the AI Scientist currently focuses on machine learning research, the same approach could be applied to other fields, such as biology or physics.
The project is open source and available here, so you can run the AI Scientist paper generation experiments, get LLM-generated paper reviews, or make your own templates. The code is optimized for NVIDIA GPUs using CUDA and PyTorch, but it can be adapted to other GPU architectures.
Why does writing papers matter?
- Knowledge sharing: papers have long been the main way to share research discoveries within the scientific community.
- Clarity and understanding: they provide a structured and clear way to present research findings.
- Versatility: they can include natural language, plots and code, making them versatile tools for describing various types of studies and discoveries.
Scientific discovery has traditionally depended on human creativity, intuition, and thorough research. An AI system, such as the AI Scientist, capable of independently formulating hypotheses, designing experiments, analyzing data, and generating new theories without human intervention could greatly ease the workload of the research community.
Methodology
The framework has three main phases: (I) Idea Generation, (II) Experiment Iteration, and (III) Paper Write-Up (see the next picture).
It begins by generating and assessing potential research directions. Subsequently, it designs experiments, automates their execution through code generation, and analyzes the results using both quantitative metrics and visual representations. A comprehensive LaTeX report is produced, interpreting findings and contextualizing them within the broader research landscape. The system concludes by generating a peer review-style assessment, offering insights for project refinement or guiding future exploration.
The framework is guided by human-written prompts (see the pictures below) and encompasses the following stages:
I. Idea Generation
- Iterative brainstorming: the AI Scientist begins with a template, which can be a simple prompt or a detailed research question. Starting from this template, it uses an LLM to propose a variety of new research ideas. The LLM is given access to the current archive of ideas and their review scores, and conditions each new proposal on that information.
- Idea evaluation: these ideas are evaluated and refined by the LLMs through multiple rounds of chain-of-thought and self-reflection, to ensure their quality and relevance.
- Novelty assessment: finally, these ideas are filtered using the Semantic Scholar API and web access to ensure they are not too similar to existing literature.
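The three steps above can be read as a single loop: propose an idea from the archive, refine it over a few reflection rounds, then keep it only if it passes a novelty check. The sketch below illustrates that control flow only. The names `Idea`, `generate_ideas`, `propose`, `reflect`, and `is_novel` are hypothetical placeholders rather than the project's actual API, and the default of three reflection rounds is an assumption. In the real system the proposal and reflection steps are LLM calls and the novelty check queries Semantic Scholar.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Idea:
    title: str
    description: str
    score: Optional[float] = None  # review score, if the idea has one


def generate_ideas(
    propose: Callable[[List[Idea]], Idea],  # hypothetical LLM call: archive -> new idea
    reflect: Callable[[Idea], Idea],        # hypothetical LLM call: one self-reflection round
    is_novel: Callable[[Idea], bool],       # hypothetical literature check
    seed_ideas: List[Idea],
    num_ideas: int,
    num_reflections: int = 3,               # assumed value, not taken from the source
) -> List[Idea]:
    """Propose, refine, and filter ideas; return those that pass the novelty check."""
    archive = list(seed_ideas)
    accepted: List[Idea] = []
    for _ in range(num_ideas):
        idea = propose(archive)             # the proposer sees everything generated so far
        for _ in range(num_reflections):
            idea = reflect(idea)
        # Design choice in this sketch: archive every idea, novel or not,
        # so that later proposals can avoid repeating it.
        archive.append(idea)
        if is_novel(idea):
            accepted.append(idea)
    return accepted
```

Passing the three steps in as callables keeps the loop independent of any particular LLM client or search API, so each piece can be swapped or stubbed out separately.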
II. Experiment Iteration
- Planning and execution: given an idea and a template, the model uses Aider (an LLM-based coding assistant) to plan a list of experiments and then executes them in order. If an experiment fails or times out, Aider attempts to fix the code and re-run the experiment up to four times.
- Recording results: after each experiment, Aider records the results in the style of an experimental journal. Currently, this process is text-based, but future versions may include data visualizations or other modalities.
- Visualization and documentation: after completing the experiments, Aider edits a plotting script to create figures for the research paper using Python. The AI Scientist documents what each plot represents, ensuring that the saved figures and experimental notes provide all the necessary information for writing the paper. Aider maintains a history of its actions.
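The run, fix, and re-run behaviour described above can be sketched as a bounded retry loop that also writes a journal entry per experiment. This is a minimal illustration under stated assumptions: `RunResult`, `run_with_fixes`, `run_plan`, and the `run` and `fix` callables are hypothetical names, with `fix` standing in for the coding assistant. The sketch also reads "up to four times" as one initial run followed by at most four fix-and-re-run attempts, which is one possible interpretation of the text.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

MAX_FIXES = 4  # assumed reading of "up to four times": four repair attempts after the first run


@dataclass
class RunResult:
    ok: bool
    output: str  # metrics on success; error message or timeout notice on failure


def run_with_fixes(
    run: Callable[[], RunResult],
    fix: Callable[[str], None],  # hypothetical stand-in for the coding assistant
    max_fixes: int = MAX_FIXES,
) -> Tuple[RunResult, int]:
    """Run once, then repair and re-run until success or the fix budget is spent."""
    result = run()
    fixes = 0
    while not result.ok and fixes < max_fixes:
        fix(result.output)  # hand the failure message to the fixer
        fixes += 1
        result = run()
    return result, fixes


def run_plan(
    plan: List[Tuple[str, Callable[[], RunResult]]],
    fix: Callable[[str], None],
) -> List[str]:
    """Execute experiments in order and return journal-style text notes."""
    notes: List[str] = []
    for name, run in plan:
        result, fixes = run_with_fixes(run, fix)
        status = "succeeded" if result.ok else "abandoned"
        notes.append(f"{name}: {status} after {fixes} fix(es): {result.output}")
    return notes
```

Capping the number of repairs keeps a persistently broken experiment from stalling the rest of the plan, and the text notes give the later write-up stage a record of what ran and what was dropped.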
III. Paper Write-Up
- Scientific paper writing: the AI Scientist creates a machine learning conference paper in LaTeX. It uses Aider to write each section (introduction, background, methods, results, conclusion) based on notes and figures. Aider also finds the most relevant sources and adds citations. Once the LaTeX template is refined and filled with the relevant data, it is fed into a LaTeX compiler to create the final document.
- Automated paper review: the paper is then reviewed by an LLM reviewer agent. The authors built a GPT-4o-based agent that performs these reviews in accordance with the NeurIPS reviewer guidelines.
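The section-by-section writing step can be pictured as filling a LaTeX skeleton one section at a time, with each call seeing the notes and the draft written so far. The sketch below is illustrative only: `write_paper` and `write_section` are hypothetical names, `write_section` stands in for the Aider-driven LLM call, and the preamble is a bare `article` class rather than the conference template the project actually uses. Citation search and the final compilation step are left out.

```python
from typing import Callable, List

# Section order as listed in the text above
SECTIONS: List[str] = ["Introduction", "Background", "Methods", "Results", "Conclusion"]


def write_paper(
    title: str,
    notes: str,
    write_section: Callable[[str, str, str], str],  # hypothetical: (section, notes, draft so far) -> LaTeX body
) -> str:
    """Assemble a LaTeX document by generating each section in turn."""
    parts = [
        r"\documentclass{article}",  # placeholder preamble, not the real template
        r"\title{" + title + "}",
        r"\begin{document}",
        r"\maketitle",
    ]
    draft = ""
    for name in SECTIONS:
        body = write_section(name, notes, draft)  # later sections can build on earlier ones
        chunk = "\\section{" + name + "}\n" + body
        parts.append(chunk)
        draft += chunk + "\n"
    parts.append(r"\end{document}")
    return "\n".join(parts)
```

The returned string could then be written to a `.tex` file and passed to a LaTeX compiler to produce the final document.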
Some representative prompts
The pictures below showcase some representative prompts used by the research team for the AI Scientist. The full list of prompts can be found on the GitHub repository.
I. Idea generation system prompt (sets the overall guidelines and objectives for the AI's behavior and operations) and idea generation prompt (provides specific tasks or queries for the AI to generate ideas).
II. Designing experiments
III. Paper writing & reviewing
These prompts correspond to the final stage of the AI Scientist.
Experiments
The AI Scientist was evaluated on 3 templates (diffusion modeling, transformer-based language modeling, and learning dynamics) across different publicly available LLMs: Claude Sonnet 3.5, GPT-4o, DeepSeek Coder, and Llama-3.1 405b.
For each run, the authors started with 1-2 basic seed ideas (such as adjusting the learning rate or batch size) and had the system generate another 50 new ideas. A batch of around 50 ideas can be generated in 12 hours using 8 NVIDIA H100 GPUs.
Each idea was developed into a full research paper at a cost of under $15 per paper. The next table contains 10 research papers generated by the AI Scientist, along with their scores from the automated reviewer based on NeurIPS reviewer guidelines.
Type | Paper Title | Score |
---|---|---|
2D Diffusion | DualScale Diffusion: Adaptive Feature Balancing for Low-Dimensional Generative Models | 5 |
2D Diffusion | Multi-scale Grid Noise Adaptation: Enhancing Diffusion Models For Low-dimensional Data | 4 |
2D Diffusion | GAN-Enhanced Diffusion: Boosting Sample Quality and Diversity | 3 |
2D Diffusion | DualDiff: Enhancing Mode Capture in Low-dimensional Diffusion Models via Dual-expert Denoising | 5 |
NanoGPT | StyleFusion: Adaptive Multi-style Generation in Character-Level Language Models | 5 |
NanoGPT | Adaptive Learning Rates for Transformers via Q-Learning | 3 |
Grokking | Unlocking Grokking: A Comparative Study of Weight Initialization Strategies in Transformer Models | 5 |
Grokking | Grokking Accelerated: Layer-wise Learning Rates for Transformer Generalization | 4 |
Grokking | Grokking Through Compression: Unveiling Sudden Generalization via Minimal Description Length | 3 |
Grokking | Accelerating Mathematical Insight: Boosting Grokking Through Strategic Data Augmentation | 5 |
The NeurIPS conference uses a scoring system for paper reviews, but the specific range and meanings of the scores can vary slightly from year to year. Generally, the scores range from 1 to 10, with higher scores indicating better quality. An overall score of 5 (Borderline Accept) signifies a technically solid paper.
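To put the scores in the table above in context, the short script below computes the mean score per template, the overall mean, and how many of the ten papers reach the Borderline Accept score of 5. The data are copied from the table; the function name `mean_by_template` is just an illustrative choice.

```python
from collections import defaultdict
from statistics import mean

# (template, automated reviewer score) pairs copied from the table above
SCORES = [
    ("2D Diffusion", 5), ("2D Diffusion", 4), ("2D Diffusion", 3), ("2D Diffusion", 5),
    ("NanoGPT", 5), ("NanoGPT", 3),
    ("Grokking", 5), ("Grokking", 4), ("Grokking", 3), ("Grokking", 5),
]


def mean_by_template(scores):
    """Group scores by template and return the mean score of each group."""
    grouped = defaultdict(list)
    for template, score in scores:
        grouped[template].append(score)
    return {template: mean(values) for template, values in grouped.items()}


summary = mean_by_template(SCORES)
overall = mean(score for _, score in SCORES)
at_threshold = sum(score >= 5 for _, score in SCORES)

print(summary)       # 2D Diffusion: 4.25, NanoGPT: 4, Grokking: 4.25
print(overall)       # 4.2
print(at_threshold)  # 5
```

On this sample, half of the listed papers reach the borderline-accept score and the rest fall one or two points below it.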
Conclusion
The AI Scientist demonstrates that an LLM-driven system can take a research idea from proposal to reviewed paper, which has broad implications for how research is done. By automating the time-consuming and repetitive tasks involved in scientific research, it could significantly accelerate the pace of discovery. It could also help explore research areas that are currently beyond human capabilities.
Future research should focus on integrating vision functionalities and human feedback.
Read more:
- Paper on arXiv: “The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery”
- GitHub repository