OLMo, a truly open source language model by AI2

The Allen Institute for AI (AI2) has launched OLMo, a series of state-of-the-art Open Language Models that offer full access to the model weights, inference code, training data, training code, and evaluation code for each model.

All code, weights, and intermediate checkpoints are made available under the Apache 2.0 License. Users are free to use, modify, and redistribute the provided resources, even for commercial purposes.

To get started with OLMo, follow the steps provided on Hugging Face or the GitHub repository.
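The exact setup steps live in the Hugging Face model card and the GitHub repository; as a rough sketch, loading a converted checkpoint through the `transformers` library might look like the following. The model id `allenai/OLMo-7B-hf` is an assumption here (check the model card for current ids; the original `allenai/OLMo-7B` checkpoints may instead require `trust_remote_code=True` and the `ai2-olmo` package):

```python
def generate(prompt: str, model_id: str = "allenai/OLMo-7B-hf") -> str:
    """Complete `prompt` with an OLMo checkpoint from the Hugging Face Hub.

    The import is done lazily so the sketch can be read (and the function
    defined) without `transformers` installed.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    return tokenizer.decode(output[0], skip_special_tokens=True)


# Usage (downloads several GB of weights on first run):
# print(generate("Language models are"))
```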

Model architecture

OLMo has a decoder-only transformer architecture with several improvements for better performance:

  1. Bias-free: following LLaMA and PaLM, bias terms are removed from the architecture to improve training stability.
  2. Simplified normalization: layer norm is applied without learnable parameters (no gain or bias).
  3. Efficient activation function: the ReLU activation is replaced with SwiGLU.
  4. Improved positional encoding: Rotary Position Embeddings (RoPE) replace absolute positional embeddings to capture word order in the text.
  5. Enlarged vocabulary: a modified version of the GPT-NeoX-20B tokenizer is adopted, with additional tokens for masking personally identifiable information.

Bias terms, extra numbers that are added to the outputs of the neural network layers, can help the model learn better. However, OLMo models exclude them to increase the stability of the training process.
Layer norm is a technique that normalizes the outputs of the neural network layers, ensuring a zero mean and unit variance. This helps the model learn faster and better.
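As an illustrative sketch (not OLMo's actual code), two of these choices can be written in a few lines of NumPy: a parameter-free layer norm with no learned gain or bias, and the SwiGLU activation used in place of ReLU. The toy shapes and random weights are assumptions for demonstration only:

```python
import numpy as np


def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize the last axis to zero mean and unit variance.

    The "non-parametric" variant: no learned scale (gain) or bias terms.
    """
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def swiglu(x: np.ndarray, W: np.ndarray, V: np.ndarray) -> np.ndarray:
    """SwiGLU: Swish(x @ W) gates a second linear branch (x @ V)."""
    gate = x @ W
    swish = gate * (1.0 / (1.0 + np.exp(-gate)))  # Swish = z * sigmoid(z)
    return swish * (x @ V)


rng = np.random.default_rng(0)
h = rng.normal(size=(2, 8))      # a toy batch of hidden states
normed = layer_norm(h)
print(np.allclose(normed.mean(-1), 0.0, atol=1e-6))  # True: zero mean
```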

The team provides OLMo 1B and 7B versions, with a 65B version coming soon.

| Size | Layers | Hidden Size | Attention Heads | Tokens Trained |
| --- | --- | --- | --- | --- |
| 1B | 16 | 2048 | 16 | 2T |
| 7B | 32 | 4096 | 32 | 2.46T |
| 65B (coming soon) | 80 | 8192 | 64 | — |
OLMo model sizes and the maximum number of tokens trained to (source: paper)

Training

OLMo models are trained on the AI2’s Dolma dataset, a 3 trillion token open corpus which contains various sources of web content, academic papers, code, books, and encyclopedias. The framework also provides the code used to generate this comprehensive pretraining data.

| Source | Doc Type | UTF-8 bytes (GB) | Documents (millions) | GPT-NeoX tokens (billions) |
| --- | --- | --- | --- | --- |
| Common Crawl | web pages | 9,022 | 3,370 | 2,006 |
| The Stack | code | 1,043 | 210 | 342 |
| C4 | web pages | 790 | 364 | 174 |
| Reddit | social media | 339 | 377 | 80 |
| peS2o | STEM papers | 268 | 38.8 | 57 |
| Project Gutenberg | books | 20.4 | 0.056 | 5.2 |
| Wikipedia, Wikibooks | encyclopedic | 16.2 | 6.2 | 3.7 |
| **Total** | | 11,519 | 4,367 | 2,668 |
Composition of Dolma (source: paper)

Dolma is built by using a pipeline of (1) language filtering, (2) quality filtering, (3) content filtering, (4) deduplication, (5) multi-source mixing, and (6) tokenization.
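AI2 releases the actual Dolma tooling; purely as a toy sketch, two of these stages (quality filtering and exact deduplication) could look like this, with the word-count heuristic and hash-based dedup as stand-ins for the real filters:

```python
import hashlib


def quality_filter(docs: list[str], min_words: int = 5) -> list[str]:
    """Keep documents with at least `min_words` words (stand-in heuristic)."""
    return [d for d in docs if len(d.split()) >= min_words]


def deduplicate(docs: list[str]) -> list[str]:
    """Drop exact duplicates via a hash of whitespace/case-normalized text."""
    seen: set[str] = set()
    unique = []
    for d in docs:
        key = hashlib.md5(" ".join(d.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(d)
    return unique


corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",  # near-identical copy
    "too short",
]
cleaned = deduplicate(quality_filter(corpus))
print(len(cleaned))  # 1
```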

The training was conducted on two hardware platforms: LUMI (AMD MI250X GPUs) and MosaicML (NVIDIA A100 GPUs). Both platforms produced models with near-identical performance.

Evaluation

To ensure a comprehensive evaluation of OLMo’s capabilities, the framework includes a robust evaluation suite built on AI2’s Catwalk project and the Paloma perplexity benchmark. The release also provides over 500 model checkpoints (snapshots of the model’s state), captured every 1,000 steps throughout training.

OLMo-7B was compared with other open and partially open LLMs: TII’s Falcon-7B, Meta’s LLaMA-7B and LLaMA2-7B, MosaicML’s MPT-7B, EleutherAI’s Pythia-6.9B, and Together’s RPJ-INCITE-7B (see the next table).

These models were all tested without any extra training for the specific tasks (zero-shot evaluation).
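Zero-shot multiple-choice evaluation is typically done by having the model score each candidate answer’s log-likelihood and picking the highest (often length-normalized). The sketch below shows that ranking logic; the `token_logprob` function is a stand-in for a real language model, not the actual Catwalk implementation:

```python
import math


def token_logprob(context: str, token: str) -> float:
    """Stand-in scorer: a real harness would query the LM here."""
    # Pretend the model slightly prefers tokens that appear in the context.
    return math.log(0.5 if token in context else 0.1)


def score_choice(question: str, choice: str) -> float:
    """Sum per-token log-probabilities of the choice given the question."""
    return sum(token_logprob(question, tok) for tok in choice.split())


def predict(question: str, choices: list[str]) -> str:
    """Pick the choice with the highest length-normalized log-likelihood."""
    return max(choices, key=lambda c: score_choice(question, c) / len(c.split()))


q = "Water freezes at what temperature? It freezes at zero"
print(predict(q, ["zero degrees", "hot degrees"]))  # zero degrees
```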

The table below shows that OLMo performs better than other 7B models on many evaluation tasks.

| 7B Models | ARC challenge | ARC easy | BoolQ | COPA | HellaSwag | OpenBookQA | PIQA | SciQ | Winogrande | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Falcon | 47.5 | 70.4 | 74.6 | 86.0 | 75.9 | 53.0 | 78.5 | 93.9 | 68.9 | 72.1 |
| LLaMA | 44.5 | 57.0 | 73.1 | 85.0 | 74.5 | 49.8 | 76.3 | 89.5 | 68.2 | 68.7 |
| LLaMA2 | 39.8 | 57.7 | 73.5 | 87.0 | 74.5 | 48.4 | 76.4 | 90.8 | 67.3 | 68.4 |
| MPT | 46.5 | 70.5 | 74.2 | 85.0 | 77.6 | 48.6 | 77.3 | 93.7 | 69.9 | 71.5 |
| Pythia | 44.2 | 61.9 | 61.1 | 84.0 | 63.8 | 45.0 | 75.1 | 91.1 | 62.0 | 65.4 |
| RPJ-INCITE | 42.8 | 68.4 | 68.6 | 88.0 | 70.3 | 49.4 | 76.0 | 92.9 | 64.7 | 69.0 |
| OLMo-7B | 48.5 | 65.4 | 73.4 | 90.0 | 76.4 | 50.4 | 78.4 | 93.8 | 67.9 | 71.6 |
Zero-shot evaluation of OLMo-7B (results for the 2.46T token checkpoint) and 6 other publicly available comparable model checkpoints on 9 core tasks (source: paper)
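As a quick arithmetic check of the table’s average column, recomputing OLMo-7B’s mean over the 9 tasks confirms the reported 71.6:

```python
# OLMo-7B's per-task scores from the table above, in column order.
olmo_scores = [48.5, 65.4, 73.4, 90.0, 76.4, 50.4, 78.4, 93.8, 67.9]
avg = sum(olmo_scores) / len(olmo_scores)
print(round(avg, 1))  # 71.6
```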

The 9 key evaluation tasks are:

  1. ARC Challenge: a subset of ARC (a multiple-choice question answering task that requires scientific reasoning) containing only the most difficult questions
  2. ARC-Easy: a subset of ARC that contains only the easiest questions
  3. BoolQ: the Boolean Questions, which is a binary question answering task that requires reading a short passage
  4. COPA: the Choice of Plausible Alternatives, which is a causal reasoning task that requires choosing the most plausible cause or effect of a given situation
  5. HellaSwag: a commonsense reasoning task that requires choosing the most appropriate ending for a given story context
  6. OpenBookQA: an open-domain question answering task that requires using both a small set of facts and general knowledge
  7. PIQA: the Physical Interaction Question Answering, which is a question answering task that requires reasoning about physical interactions in everyday scenarios
  8. SciQ: a multiple-choice question answering task that requires understanding science texts
  9. Winogrande: a pronoun disambiguation task that requires choosing the correct referent for a given pronoun in a sentence

Beyond its core capabilities, OLMo can leverage the Open Instruct framework to learn from natural language instructions and adapt to human feedback. This enables a more interactive and user-guided training process.

Conclusion

OLMo is a truly open source model that addresses a common problem: many popular AI models today are “black boxes,” trained with undisclosed methods and datasets, which can have ethical, social, and environmental implications.

This level of openness enables users to copy, study, improve, and build upon the model, and to collectively advance the science of language models.
