UC Berkeley has recently released Koala, a ChatGPT-style dialogue chatbot that is significantly smaller yet remains competitive in performance.
According to the research findings, Koala generates useful responses to a wide variety of user queries, is often preferred over Alpaca, and is rated at least as good as ChatGPT in over half of the cases.
The results show that smaller models can come close to the performance of larger models if they are trained on carefully curated data.

The team is encouraging the community to focus on curating high-quality datasets to create smaller, yet more proficient and safer models, instead of simply scaling up the size of the existing systems.
They also note that Koala, being a research prototype, is not suitable for commercial use due to its limitations in terms of content, safety, and reliability.
The main distinctions between Koala and other notable existing models are summarized in the figure below:

Datasets and training
Koala has been trained by fine-tuning Meta’s LLaMA on freely available data gathered from the web, with a specific focus on conversational data.

The supervised fine-tuning was performed with two main categories of datasets:
1. ChatGPT distillation data consisting of:
- Data coming from ShareGPT: 30K examples selected from 60K dialogues
- The Human ChatGPT Comparison Corpus (HC3): ~60K human answers and ~27K ChatGPT answers to ~24K questions
2. Open source data consisting of:
- The Open Instruction Generalist (OIG) datasets from LAION
- The datasets used to train Stanford Alpaca
- Anthropic HH (human preference data on helpfulness and harmlessness)
- OpenAI WebGPT
- OpenAI summarization
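To make the data preparation concrete, here is a minimal sketch of how a multi-turn dialogue could be flattened into a single supervised fine-tuning example. The `"from"`/`"value"` fields mirror the common ShareGPT export format, and the role tags are assumptions for illustration; this is not Koala's actual preprocessing pipeline.

```python
# Hedged sketch: flattening a multi-turn dialogue into one SFT example.
# Field names follow the common ShareGPT export format; role tags are
# hypothetical, not Koala's actual preprocessing.
ROLE_TAGS = {"human": "USER:", "gpt": "ASSISTANT:"}


def dialogue_to_example(conversation):
    """Concatenate turns into one training string and record which character
    spans belong to the assistant, so the loss can be applied only there."""
    text, response_spans = "", []
    for turn in conversation:
        tag = ROLE_TAGS[turn["from"]]
        start = len(text)
        text += f"{tag} {turn['value']}\n"
        if turn["from"] == "gpt":
            response_spans.append((start + len(tag) + 1, len(text)))
    return {"text": text, "response_spans": response_spans}


example = dialogue_to_example([
    {"from": "human", "value": "What is a koala?"},
    {"from": "gpt", "value": "A koala is a tree-dwelling marsupial native to Australia."},
])
print(example["text"])
print(example["response_spans"])
```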
The model was implemented with JAX/Flax using EasyLM, the research team's own framework.
Koala was trained for 2 epochs on a single NVIDIA DGX server with 8 A100 GPUs, a process that took 6 hours. On public cloud computing platforms, such a training run costs less than $100 with preemptible instances.
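For readers unfamiliar with the JAX/Flax setup, below is a minimal, illustrative fine-tuning step written with Flax and Optax. It is not EasyLM or the actual Koala training code; the tiny model stands in for LLaMA, and the hyperparameters are placeholders. It only shows the overall shape of a masked next-token-prediction update.

```python
# Illustrative JAX/Flax/Optax fine-tuning step (NOT EasyLM / Koala code).
import jax
import jax.numpy as jnp
import flax.linen as nn
import optax


class TinyLM(nn.Module):
    """A toy causal LM standing in for LLaMA (illustration only)."""
    vocab_size: int = 32000
    hidden: int = 128

    @nn.compact
    def __call__(self, tokens):
        x = nn.Embed(self.vocab_size, self.hidden)(tokens)
        x = nn.Dense(self.hidden)(nn.relu(x))
        return nn.Dense(self.vocab_size)(x)  # next-token logits


model = TinyLM()
optimizer = optax.adamw(learning_rate=2e-5)  # placeholder hyperparameter


def loss_fn(params, tokens, loss_mask):
    # Predict token t+1 from tokens up to t; the mask keeps the loss on
    # response tokens only (prompt/user tokens contribute no loss).
    logits = model.apply(params, tokens[:, :-1])
    per_token = optax.softmax_cross_entropy_with_integer_labels(
        logits, tokens[:, 1:])
    mask = loss_mask[:, 1:]
    return (per_token * mask).sum() / jnp.maximum(mask.sum(), 1.0)


@jax.jit
def train_step(params, opt_state, tokens, loss_mask):
    loss, grads = jax.value_and_grad(loss_fn)(params, tokens, loss_mask)
    updates, opt_state = optimizer.update(grads, opt_state, params)
    return optax.apply_updates(params, updates), opt_state, loss


# Smoke test with dummy data.
tokens = jnp.zeros((4, 128), dtype=jnp.int32)
mask = jnp.ones((4, 128), dtype=jnp.float32)
params = model.init(jax.random.PRNGKey(0), tokens)
opt_state = optimizer.init(params)
params, opt_state, loss = train_step(params, opt_state, tokens, mask)
```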
Evaluation
To evaluate Koala’s performance, the team conducted a blind pairwise comparison, presenting a set of test prompts to approximately 100 evaluators on the Amazon Mechanical Turk platform.
Each evaluator was shown the input prompt and the outputs of two different models, and was then asked to judge which output was better or whether they were of equal quality.
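For illustration, here is a small sketch of how such blind pairwise judgments could be aggregated into win/tie/loss rates. The record format and model labels are hypothetical, not the team's actual Mechanical Turk data or analysis script.

```python
# Hedged sketch: aggregating blind pairwise judgments into win/tie/loss rates.
from collections import Counter


def pairwise_rates(judgments, model_a, model_b):
    """judgments: dicts like {"a": "koala", "b": "chatgpt", "winner": "a"|"b"|"tie"}."""
    counts = Counter()
    for j in judgments:
        if {j["a"], j["b"]} != {model_a, model_b}:
            continue  # skip comparisons of other model pairs
        if j["winner"] == "tie":
            counts["tie"] += 1
        elif j[j["winner"]] == model_a:
            counts[model_a] += 1
        else:
            counts[model_b] += 1
    total = sum(counts.values()) or 1
    return {k: counts[k] / total for k in (model_a, "tie", model_b)}


print(pairwise_rates(
    [{"a": "koala", "b": "chatgpt", "winner": "a"},
     {"a": "chatgpt", "b": "koala", "winner": "tie"}],
    "koala", "chatgpt"))
```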
The figure below shows the performance of two model variants, Koala-Distill and Koala-All, compared to Alpaca and ChatGPT:
- Koala-Distill was trained only on the ChatGPT distillation data
- Koala-All was trained on all of the data, i.e. the distillation data plus the open-source datasets

The study suggests that small models like Koala can capture much of the performance of closed-source LLMs if they are trained on carefully sourced, high-quality datasets.
The experiment also found that training on open-source data in addition to the distillation data (Koala-All) did not result in significant improvements.
However, the quality and diversity of the ChatGPT dialogues were found to be crucial in building strong dialogue models.
Conclusion and future research
The results achieved by Koala show that small language models can be trained more quickly and with fewer computational resources than larger models.
This makes them more accessible to researchers and developers who may not have access to high-performance computing resources.
Koala has limitations such as generating confident but inaccurate information, inheriting biases and stereotypes from training data, lacking common sense knowledge, and having limited understanding of context and nuances.
These limitations need to be addressed in future research.
Learn more:
- The blog post: “Koala: A Dialogue Model for Academic Research” (on BAIR)
- The web demo of Koala