OLMo, a truly open source language model by AI2

The Allen Institute for AI (AI2) has launched OLMo, a series of state-of-the-art Open Language Models that offer full access to the model weights, inference code, training data, training code, and evaluation code for each model.

All code, weights, and intermediate checkpoints are made available under the Apache 2.0 License. Users are free to use, modify, and redistribute the provided resources, even for commercial purposes.

To get started with OLMo, follow the steps provided on Hugging Face or the GitHub repository.
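The exact setup steps live in the Hugging Face model card and the GitHub repository; as a rough sketch, loading a converted checkpoint through the `transformers` library might look like the following. The model id `allenai/OLMo-7B-hf` is an assumption here (check the model card for current ids; the original `allenai/OLMo-7B` checkpoints may instead require `trust_remote_code=True` and the `ai2-olmo` package):

```python
def generate(prompt: str, model_id: str = "allenai/OLMo-7B-hf") -> str:
    """Complete `prompt` with an OLMo checkpoint from the Hugging Face Hub.

    The import is done lazily so the sketch can be read (and the function
    defined) without `transformers` installed.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    return tokenizer.decode(output[0], skip_special_tokens=True)


# Usage (downloads several GB of weights on first run):
# print(generate("Language models are"))
```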

Model architecture

OLMo has a decoder-only transformer architecture with several improvements for better performance:

  1. Bias-free: following LLaMA and PaLM, bias terms are removed from the architecture to improve training stability.
  2. Simplified normalization: layer norm is applied without learnable parameters (no gain or bias).
  3. Efficient activation function: the ReLU activation is replaced with SwiGLU.
  4. Improved positional encoding: Rotary Position Embeddings (RoPE) replace absolute positional embeddings to capture word order in the text.
  5. Enlarged vocabulary: a modified version of the GPT-NeoX-20B tokenizer is adopted, with additional tokens for masking personally identifiable information.

Bias terms, extra numbers that are added to the outputs of the neural network layers, can help the model learn better. However, OLMo models exclude them to increase the stability of the training process.
Layer norm is a technique that normalizes the outputs of the neural network layers, ensuring a zero mean and unit variance. This helps the model learn faster and better.
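As an illustrative sketch (not OLMo's actual code), two of these choices can be written in a few lines of NumPy: a parameter-free layer norm with no learned gain or bias, and the SwiGLU activation used in place of ReLU. The toy shapes and random weights are assumptions for demonstration only:

```python
import numpy as np


def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize the last axis to zero mean and unit variance.

    The "non-parametric" variant: no learned scale (gain) or bias terms.
    """
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def swiglu(x: np.ndarray, W: np.ndarray, V: np.ndarray) -> np.ndarray:
    """SwiGLU: Swish(x @ W) gates a second linear branch (x @ V)."""
    gate = x @ W
    swish = gate * (1.0 / (1.0 + np.exp(-gate)))  # Swish = z * sigmoid(z)
    return swish * (x @ V)


rng = np.random.default_rng(0)
h = rng.normal(size=(2, 8))      # a toy batch of hidden states
normed = layer_norm(h)
print(np.allclose(normed.mean(-1), 0.0, atol=1e-6))  # True: zero mean
```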

The team provides OLMo 1B and 7B versions, with a 65B version coming soon.

| Size | Layers | Hidden Size | Attention Heads | Tokens Trained |
| --- | --- | --- | --- | --- |
| 1B | 16 | 2048 | 16 | 2T |
| 7B | 32 | 4096 | 32 | 2.46T |
| 65B (coming soon) | 80 | 8192 | 64 | — |
OLMo model sizes and the maximum number of tokens trained to (source: paper)

Training

OLMo models are trained on the AI2’s Dolma dataset, a 3 trillion token open corpus which contains various sources of web content, academic papers, code, books, and encyclopedias. The framework also provides the code used to generate this comprehensive pretraining data.

| Source | Doc Type | UTF-8 bytes (GB) | Documents (millions) | GPT-NeoX tokens (billions) |
| --- | --- | --- | --- | --- |
| Common Crawl | web pages | 9,022 | 3,370 | 2,006 |
| The Stack | code | 1,043 | 210 | 342 |
| C4 | web pages | 790 | 364 | 174 |
| Reddit | social media | 339 | 377 | 80 |
| peS2o | STEM papers | 268 | 38.8 | 57 |
| Project Gutenberg | books | 20.4 | 0.056 | 5.2 |
| Wikipedia, Wikibooks | encyclopedic | 16.2 | 6.2 | 3.7 |
| **Total** | | 11,519 | 4,367 | 2,668 |
Composition of Dolma (source: paper)

Dolma is built by using a pipeline of (1) language filtering, (2) quality filtering, (3) content filtering, (4) deduplication, (5) multi-source mixing, and (6) tokenization.
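AI2 releases the actual Dolma tooling; purely as a toy sketch, two of these stages (quality filtering and exact deduplication) could look like this, with the word-count heuristic and hash-based dedup as stand-ins for the real filters:

```python
import hashlib


def quality_filter(docs: list[str], min_words: int = 5) -> list[str]:
    """Keep documents with at least `min_words` words (stand-in heuristic)."""
    return [d for d in docs if len(d.split()) >= min_words]


def deduplicate(docs: list[str]) -> list[str]:
    """Drop exact duplicates via a hash of whitespace/case-normalized text."""
    seen: set[str] = set()
    unique = []
    for d in docs:
        key = hashlib.md5(" ".join(d.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(d)
    return unique


corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",  # near-identical copy
    "too short",
]
cleaned = deduplicate(quality_filter(corpus))
print(len(cleaned))  # 1
```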

The training was conducted on two hardware platforms: LUMI (AMD MI250X GPUs) and MosaicML (NVIDIA A100 GPUs). Both platforms produced models with near-identical performance.

Evaluation

To ensure a comprehensive evaluation of OLMo’s capabilities, the framework includes a robust evaluation suite built on AI2’s Catwalk project and the Paloma perplexity benchmark. The release also provides over 500 model checkpoints (snapshots of the model’s state), captured every 1,000 steps throughout training.

OLMo-7B was compared with other open and partially open LLMs: TII’s Falcon-7B, Meta’s LLaMA-7B and LLaMA2-7B, MosaicML’s MPT-7B, EleutherAI’s Pythia-6.9B, and Together’s RPJ-INCITE-7B (see the next table).

These models were all tested without any extra training for the specific tasks (zero-shot evaluation).
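Zero-shot multiple-choice evaluation is typically done by having the model score each candidate answer’s log-likelihood and picking the highest (often length-normalized). The sketch below shows that ranking logic; the `token_logprob` function is a stand-in for a real language model, not the actual Catwalk implementation:

```python
import math


def token_logprob(context: str, token: str) -> float:
    """Stand-in scorer: a real harness would query the LM here."""
    # Pretend the model slightly prefers tokens that appear in the context.
    return math.log(0.5 if token in context else 0.1)


def score_choice(question: str, choice: str) -> float:
    """Sum per-token log-probabilities of the choice given the question."""
    return sum(token_logprob(question, tok) for tok in choice.split())


def predict(question: str, choices: list[str]) -> str:
    """Pick the choice with the highest length-normalized log-likelihood."""
    return max(choices, key=lambda c: score_choice(question, c) / len(c.split()))


q = "Water freezes at what temperature? It freezes at zero"
print(predict(q, ["zero degrees", "hot degrees"]))  # zero degrees
```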

The table below shows that OLMo performs better than other 7B models on many evaluation tasks.

| 7B Models | ARC challenge | ARC easy | BoolQ | COPA | HellaSwag | OpenBookQA | PIQA | SciQ | Winogrande | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Falcon | 47.5 | 70.4 | 74.6 | 86.0 | 75.9 | 53.0 | 78.5 | 93.9 | 68.9 | 72.1 |
| LLaMA | 44.5 | 57.0 | 73.1 | 85.0 | 74.5 | 49.8 | 76.3 | 89.5 | 68.2 | 68.7 |
| LLaMA2 | 39.8 | 57.7 | 73.5 | 87.0 | 74.5 | 48.4 | 76.4 | 90.8 | 67.3 | 68.4 |
| MPT | 46.5 | 70.5 | 74.2 | 85.0 | 77.6 | 48.6 | 77.3 | 93.7 | 69.9 | 71.5 |
| Pythia | 44.2 | 61.9 | 61.1 | 84.0 | 63.8 | 45.0 | 75.1 | 91.1 | 62.0 | 65.4 |
| RPJ-INCITE | 42.8 | 68.4 | 68.6 | 88.0 | 70.3 | 49.4 | 76.0 | 92.9 | 64.7 | 69.0 |
| OLMo-7B | 48.5 | 65.4 | 73.4 | 90.0 | 76.4 | 50.4 | 78.4 | 93.8 | 67.9 | 71.6 |
Zero-shot evaluation of OLMo-7B (results for the 2.46T token checkpoint) and 6 other publicly available comparable model checkpoints on 9 core tasks (source: paper)
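As a quick arithmetic check of the table’s average column, recomputing OLMo-7B’s mean over the 9 tasks confirms the reported 71.6:

```python
# OLMo-7B's per-task scores from the table above, in column order.
olmo_scores = [48.5, 65.4, 73.4, 90.0, 76.4, 50.4, 78.4, 93.8, 67.9]
avg = sum(olmo_scores) / len(olmo_scores)
print(round(avg, 1))  # 71.6
```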

The 9 key evaluation tasks are:

  1. ARC Challenge: a subset of ARC (a multiple-choice question answering task that requires scientific reasoning) containing only the most difficult questions
  2. ARC-Easy: a subset of ARC that contains only the easiest questions
  3. BoolQ: the Boolean Questions, which is a binary question answering task that requires reading a short passage
  4. COPA: the Choice of Plausible Alternatives, which is a causal reasoning task that requires choosing the most plausible cause or effect of a given situation
  5. HellaSwag: a commonsense reasoning task that requires choosing the most appropriate ending for a given story context
  6. OpenBookQA: an open-domain question answering task that requires using both a small set of facts and general knowledge
  7. PIQA: the Physical Interaction Question Answering, which is a question answering task that requires reasoning about physical interactions in everyday scenarios
  8. SciQ: a multiple-choice question answering task that requires understanding science texts
  9. Winogrande: a pronoun disambiguation task that requires choosing the correct referent for a given pronoun in a sentence

Beyond its core capabilities, OLMo can leverage the Open Instruct framework to learn from natural language instructions and adapt to human feedback. This enables a more interactive and user-guided training process.

Conclusion

OLMo is a truly open source model that addresses a common problem: many popular AI models today are “black boxes,” trained with undisclosed methods and datasets, which can have ethical, social, and environmental implications.

This level of openness enables users to copy, study, improve, and build upon the model, and to collectively advance the science of language models.
