The Allen Institute for AI (AI2) has launched OLMo, a series of state-of-the-art Open Language Models that offer full access to the model weights, inference code, training data, training code, and evaluation code for each model.
All code, weights, and intermediate checkpoints are released under the Apache 2.0 License, so users are free to use, modify, and redistribute them, including for commercial purposes.
To get started with OLMo, follow the steps on Hugging Face or in the GitHub repository.
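If you just want to try the model, the Hugging Face `transformers` API is the quickest path. The snippet below is a minimal sketch: the `allenai/OLMo-7B` model id and native OLMo support in recent `transformers` releases are assumptions (the original release shipped a separate `hf_olmo` integration), so check the model card for the exact loading instructions.

```python
# Minimal sketch: load OLMo and generate text with Hugging Face transformers.
# Assumes the "allenai/OLMo-7B" checkpoint and a transformers version with
# native OLMo support; older releases may require the hf_olmo package.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "allenai/OLMo-7B"  # assumed model id; see the model card
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Language modeling is ", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```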
Model architecture
OLMo has a decoder-only transformer architecture with several improvements for better performance:
- Bias-free: eliminates bias terms from the architecture to improve training stability, following LLaMA and PaLM.
- Simplified normalization: applies layer norm without learnable parameters.
- Efficient activation function: replaces ReLU with the SwiGLU activation function.
- Improved position encoding: uses Rotary Position Embeddings (RoPE) instead of absolute positional embeddings to capture word order.
- Enlarged vocabulary: adopts a modified version of the GPT-NeoX-20B tokenizer, with additional tokens for masking personally identifiable information.
Bias terms are extra learnable offsets added to the outputs of neural network layers; they can help a model fit the data, but OLMo omits them to make training more stable.
Layer norm normalizes the outputs of neural network layers to zero mean and unit variance, which helps the model train faster and more reliably; OLMo applies it without the usual learnable scale and shift parameters.
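To make these choices concrete, here is an illustrative PyTorch sketch of a single MLP block that combines three of them: bias-free linear layers, non-parametric layer norm, and the SwiGLU activation. It is a simplified reading of the description above, not OLMo's actual implementation, and the class and dimension names are invented for the example.

```python
# Illustrative sketch (not the official OLMo code) of an MLP block with
# bias-free projections, non-parametric layer norm, and SwiGLU, in PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUBlock(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # Bias-free projections (bias=False), as in LLaMA and PaLM.
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Non-parametric layer norm: zero mean, unit variance, no learnable
        # scale or shift.
        h = F.layer_norm(x, x.shape[-1:])
        # SwiGLU: a SiLU-gated linear unit instead of a plain ReLU MLP.
        h = F.silu(self.w_gate(h)) * self.w_up(h)
        return x + self.w_down(h)  # residual connection

x = torch.randn(2, 8, 512)              # (batch, sequence, d_model)
print(SwiGLUBlock(512, 2048)(x).shape)  # torch.Size([2, 8, 512])
```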
The team provides OLMo 1B and 7B versions, with a 65B version coming soon.
| Size | Layers | Hidden Size | Attention Heads | Tokens Trained |
|---|---|---|---|---|
| 1B | 16 | 2048 | 16 | 2T |
| 7B | 32 | 4096 | 32 | 2.46T |
| 65B (coming soon) | 80 | 8192 | 64 | – |
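As a sanity check on these configurations, a textbook back-of-envelope formula for decoder-only transformers lands close to the nominal sizes. The vocabulary size below is an assumption (roughly the padded GPT-NeoX-20B tokenizer size), and the formula ignores OLMo-specific details such as the exact MLP width.

```python
# Rough parameter estimate: params ≈ 12 * layers * hidden^2 for the
# transformer blocks, plus vocab * hidden for the embedding table.
# The vocabulary size is an assumed value, not OLMo's exact figure.
def approx_params(layers: int, hidden: int, vocab: int = 50_304) -> float:
    body = 12 * layers * hidden ** 2   # attention + MLP blocks
    embeddings = vocab * hidden        # token embedding matrix
    return (body + embeddings) / 1e9   # in billions

print(f"1B-class: ~{approx_params(16, 2048):.2f}B parameters")  # ~0.91B
print(f"7B-class: ~{approx_params(32, 4096):.2f}B parameters")  # ~6.65B
```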
Training
OLMo models are trained on AI2’s Dolma dataset, an open corpus of three trillion tokens drawn from web content, academic papers, code, books, and encyclopedic sources. The framework also provides the code used to produce this pretraining data.
| Source | Doc Type | UTF-8 bytes (GB) | Documents (millions) | GPT-NeoX tokens (billions) |
|---|---|---|---|---|
| Common Crawl | web pages | 9,022 | 3,370 | 2,006 |
| The Stack | code | 1,043 | 210 | 342 |
| C4 | web pages | 790 | 364 | 174 |
| Reddit | social media | 339 | 377 | 80 |
| peS2o | STEM papers | 268 | 38.8 | 57 |
| Project Gutenberg | books | 20.4 | 0.056 | 5.2 |
| Wikipedia, Wikibooks | encyclopedic | 16.2 | 6.2 | 3.7 |
| Total | | 11,519 | 4,367 | 2,668 |
Dolma is built with a pipeline of (1) language filtering, (2) quality filtering, (3) content filtering, (4) deduplication, (5) multi-source mixing, and (6) tokenization.
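The sketch below is a toy Python illustration of what stages (1)–(4) of such a pipeline look like; the thresholds, blocklist, and document schema are made up, and the actual Dolma toolkit released alongside the dataset is a separate, far more sophisticated project.

```python
# Toy illustration of data-cleaning stages (1)-(4); not the Dolma toolkit.
import hashlib

BLOCKLIST = {"lorem ipsum"}  # placeholder content-filter terms

def clean_corpus(documents):
    """Yield documents that pass language, quality, content, and dedup filters."""
    seen_hashes = set()
    for doc in documents:
        text = doc["text"]
        if doc.get("lang") != "en":                          # (1) language filtering
            continue
        if len(text.split()) < 50:                           # (2) quality filtering
            continue
        if any(term in text.lower() for term in BLOCKLIST):  # (3) content filtering
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:                            # (4) exact deduplication
            continue
        seen_hashes.add(digest)
        yield doc

# Stages (5) and (6) would then mix the cleaned sources in fixed proportions
# and tokenize them with the model's tokenizer.
docs = [{"text": "example document " * 60, "lang": "en"}] * 2
print(len(list(clean_corpus(docs))))  # -> 1 (the exact duplicate is dropped)
```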
The training was conducted on two hardware platforms: LUMI (AMD MI250X GPUs) and MosaicML (NVIDIA A100 GPUs). Both platforms produced models with near-identical performance.
Evaluation
To support a comprehensive evaluation of OLMo’s capabilities, the framework includes a robust evaluation suite built on AI2’s Catwalk project and the Paloma perplexity benchmark. It also releases more than 500 intermediate checkpoints per model (snapshots of the model’s weights), captured every 1,000 training steps.
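Because intermediate checkpoints are released alongside the final weights, earlier training states can be inspected directly. A hedged sketch, assuming the checkpoints are exposed as Hugging Face revisions with step-based names (verify the exact revision names on the model card):

```python
# Load an intermediate OLMo checkpoint by revision. The revision name
# "step1000-tokens4B" is an assumption for illustration; the model card
# lists the revisions that actually exist.
from transformers import AutoModelForCausalLM

checkpoint = AutoModelForCausalLM.from_pretrained(
    "allenai/OLMo-7B",
    revision="step1000-tokens4B",
)
```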
OLMo-7B was compared with other open and partially open LLMs: TII’s Falcon-7B, Meta’s LLaMA-7B and LLaMA2-7B, MosaicML’s MPT-7B, EleutherAI’s Pythia-6.9B, and Together’s RPJ-INCITE-7B (see the table below).
These models were all evaluated zero-shot, without any task-specific fine-tuning.
The table below shows that OLMo-7B is competitive with the other 7B-class models and outperforms them on several of the evaluation tasks.
| 7B Models | ARC challenge | ARC easy | BoolQ | COPA | HellaSwag | OpenBookQA | PIQA | SciQ | Winogrande | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Falcon | 47.5 | 70.4 | 74.6 | 86.0 | 75.9 | 53.0 | 78.5 | 93.9 | 68.9 | 72.1 |
| LLaMA | 44.5 | 57.0 | 73.1 | 85.0 | 74.5 | 49.8 | 76.3 | 89.5 | 68.2 | 68.7 |
| LLaMA2 | 39.8 | 57.7 | 73.5 | 87.0 | 74.5 | 48.4 | 76.4 | 90.8 | 67.3 | 68.4 |
| MPT | 46.5 | 70.5 | 74.2 | 85.0 | 77.6 | 48.6 | 77.3 | 93.7 | 69.9 | 71.5 |
| Pythia | 44.2 | 61.9 | 61.1 | 84.0 | 63.8 | 45.0 | 75.1 | 91.1 | 62.0 | 65.4 |
| RPJ-INCITE | 42.8 | 68.4 | 68.6 | 88.0 | 70.3 | 49.4 | 76.0 | 92.9 | 64.7 | 69.0 |
| OLMo-7B | 48.5 | 65.4 | 73.4 | 90.0 | 76.4 | 50.4 | 78.4 | 93.8 | 67.9 | 71.6 |

The nine evaluation tasks are:
- ARC Challenge: the subset of ARC (a multiple-choice question answering benchmark that requires scientific reasoning) containing only the hardest questions
- ARC-Easy: a subset of ARC that contains only the easiest questions
- BoolQ: the Boolean Questions, which is a binary question answering task that requires reading a short passage
- COPA: the Choice of Plausible Alternatives, which is a causal reasoning task that requires choosing the most plausible cause or effect of a given situation
- HellaSwag: a commonsense reasoning task that requires choosing the most appropriate ending for a given story context
- OpenBookQA: an open-domain question answering task that requires using both a small set of facts and general knowledge
- PIQA: the Physical Interaction Question Answering, which is a question answering task that requires reasoning about physical interactions in everyday scenarios
- SciQ: a multiple-choice question answering task that requires understanding science texts
- Winogrande: a pronoun disambiguation task that requires choosing the correct referent for a given pronoun in a sentence
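All of these benchmarks are multiple-choice, and zero-shot evaluation typically reduces to comparing the model’s likelihood of each candidate answer. The sketch below shows that recipe with the model and tokenizer from the earlier snippet; it is a simplification, not the Catwalk code used for the reported numbers, which handles tokenization and length normalization more carefully.

```python
# Zero-shot multiple-choice scoring: score each candidate completion by its
# log-likelihood under the model and pick the highest-scoring one.
import torch

def pick_answer(model, tokenizer, question: str, choices: list[str]) -> int:
    scores = []
    prompt_len = tokenizer(question, return_tensors="pt").input_ids.shape[1]
    for choice in choices:
        # Assumes the prompt prefix tokenizes identically with and without
        # the appended answer (a simplification).
        full_ids = tokenizer(question + " " + choice, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(full_ids).logits
        answer_ids = full_ids[0, prompt_len:]
        # Log-probabilities of the answer tokens, taken from the positions
        # just before each of them.
        log_probs = torch.log_softmax(logits[0, prompt_len - 1:-1], dim=-1)
        scores.append(log_probs.gather(-1, answer_ids.unsqueeze(-1)).sum().item())
    return max(range(len(choices)), key=lambda i: scores[i])

# best = pick_answer(model, tokenizer, "The sky is", ["blue.", "a mammal."])
```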
Beyond its core capabilities, OLMo can leverage the Open Instruct framework to learn from natural language instructions and adapt to human feedback, enabling a more interactive, user-guided training process.
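A common building block of such adaptation is supervised instruction tuning: the prompt and response are concatenated, and the loss is computed only on the response tokens. The sketch below illustrates that masking step; the chat template is invented for the example and is not Open Instruct’s exact format.

```python
# Minimal sketch of building one instruction-tuning example: mask the prompt
# tokens out of the loss (label -100) so only the response is learned.
# The <|user|>/<|assistant|> template here is illustrative, not Open Instruct's.
def build_example(tokenizer, instruction: str, response: str):
    prompt = f"<|user|>\n{instruction}\n<|assistant|>\n"
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full = tokenizer(prompt + response, return_tensors="pt")
    labels = full.input_ids.clone()
    labels[:, :prompt_len] = -100  # ignore prompt tokens in the loss
    return {"input_ids": full.input_ids, "labels": labels}

# loss = model(**build_example(tokenizer, "Summarize OLMo.", "OLMo is ...")).loss
```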
Conclusion
OLMo is a truly open-source model. It addresses the problem that many popular AI models today are “black boxes,” trained with undisclosed methods and datasets, which can have ethical, social, and environmental implications.
This level of openness lets users copy, study, improve, and build upon the model, and collectively advance the science of language models.
Learn more:
- Paper on arXiv: OLMo: Accelerating the Science of Language Models
- Release announcement
- Technical blog
- Project page
- Repositories:
  - Core repository (training, inference, fine-tuning)
  - Evaluation code
  - Further fine-tuning code
- Hugging Face (model card, files and versions, community)