OpenELM: open-source efficient language models from Apple

OpenELM is a family of open-source language models developed by Apple and designed to run on-device. It is compatible with Apple’s MLX framework, ensuring seamless integration with Apple devices such as iPhones and iPads.

The models employ layer-wise scaling, a resource allocation method that improves accuracy while keeping computational requirements low.

Although OpenELM is optimized for Apple hardware, the models are open source and could potentially be adapted for use on other platforms.

You can access the source code, pre-trained model weights, and training recipes in the CoreNet repository. The models are available on Hugging Face.

Unlike previous releases that often provide only model weights and inference code, OpenELM provides the entire framework, including data preparation, training, fine-tuning, and evaluation procedures, alongside multiple pre-trained checkpoints and training logs. The team also released code to convert the models to the MLX library format for use on Apple devices.
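As a quick illustration of the Hugging Face route, here is a minimal sketch that loads an OpenELM checkpoint with the transformers library. It assumes the apple/OpenELM-1_1B checkpoint and the Llama 2 tokenizer (a gated repository that requires access approval); adjust the identifiers to whatever checkpoint you use.

```python
# Minimal sketch: loading an OpenELM checkpoint from Hugging Face.
# Assumes the apple/OpenELM-1_1B checkpoint and the (gated) Llama 2 tokenizer;
# swap in whatever model/tokenizer identifiers you actually have access to.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "apple/OpenELM-1_1B"
tokenizer_id = "meta-llama/Llama-2-7b-hf"  # OpenELM reuses the Llama tokenizer

tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("Once upon a time there was", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```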

OpenELM architecture

OpenELM is built on a decoder-only transformer architecture with several notable enhancements. Keeping pace with recent advances in LLMs, OpenELM adopts grouped query attention (GQA) in place of traditional multi-head attention (MHA), and replaces the standard feed-forward network (FFN) with the SwiGLU FFN variant, a gated formulation that improves how the network processes information (see the picture below).

Main components of the transformer model (source)
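To make the FFN change concrete, below is a minimal PyTorch sketch of a SwiGLU feed-forward block: the input is projected twice, one projection is gated with SiLU, and the product is projected back to the model dimension. The dimensions are illustrative, not OpenELM’s actual configuration.

```python
# Minimal sketch of a SwiGLU feed-forward block (dimensions are illustrative,
# not the ones used in OpenELM).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ffn, bias=False)  # gating branch
        self.w_up = nn.Linear(d_model, d_ffn, bias=False)    # value branch
        self.w_down = nn.Linear(d_ffn, d_model, bias=False)  # projection back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: SiLU(x W_gate) * (x W_up), then project back to d_model
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 16, 512)           # (batch, sequence, d_model)
print(SwiGLUFFN(512, 1376)(x).shape)  # torch.Size([2, 16, 512])
```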

Further refinements improve the model’s architecture:

  1. eliminate learnable bias parameters in all linear layers to simplify the model and potentially reduce training time
  2. apply pre-normalization with RMSNorm for stable training, and rotary positional embeddings (RoPE) to encode token positions (a minimal RMSNorm sketch follows this list)
  3. use flash attention to speed up the computation of attention scores
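As referenced in point 2, here is a minimal RMSNorm sketch, assuming the standard formulation (scale by the root mean square over the feature dimension, with a learnable gain and no bias); the actual implementation in CoreNet may differ in details.

```python
# Minimal sketch of RMSNorm as used for pre-normalization.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable scale, no bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root mean square over the feature dimension,
        # without subtracting the mean (unlike LayerNorm).
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight
```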

Layer-wise scaling strategy

Transformer-based LLMs typically use the same number of attention heads and the same feed-forward network dimension in every layer. While this uniformity simplifies the architecture, it is not the most efficient way to distribute parameters across the model.

OpenELM uses a layer-wise scaling strategy to optimize how parameters are allocated within each transformer layer. Layers closer to the input use smaller dimensions in their attention and feed-forward modules, and these dimensions gradually widen in layers closer to the output (see the picture below).

Block-wise scaling efficiently allocates parameters and operations across blocks (source)
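As an illustration of the idea, the sketch below linearly interpolates scaling factors between the first and last layer and derives a head count and FFN width per layer. The α/β ranges and dimensions are made-up assumptions for illustration, not OpenELM’s actual hyperparameters.

```python
# Illustrative sketch of layer-wise scaling: attention heads and FFN widths
# grow linearly from the first to the last transformer layer. The ranges and
# dimensions below are made-up examples, not OpenELM's actual hyperparameters.
def layer_wise_scaling(num_layers: int, d_model: int, head_dim: int,
                       alpha: tuple = (0.5, 1.0),   # attention scaling range
                       beta: tuple = (0.5, 4.0)):   # FFN scaling range
    configs = []
    for i in range(num_layers):
        t = i / (num_layers - 1)                    # 0 at input, 1 at output
        a = alpha[0] + (alpha[1] - alpha[0]) * t
        b = beta[0] + (beta[1] - beta[0]) * t
        num_heads = max(1, int(a * d_model / head_dim))
        d_ffn = int(b * d_model)
        configs.append({"layer": i, "num_heads": num_heads, "d_ffn": d_ffn})
    return configs

for cfg in layer_wise_scaling(num_layers=8, d_model=512, head_dim=64):
    print(cfg)
```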

Training

The OpenELM models were trained for 350k iterations (training steps) using CoreNet. They use the same tokenizer as the Llama models, and the text data was filtered and tokenized on the fly rather than pre-tokenized.
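A rough sketch of what on-the-fly filtering and tokenization can look like is given below, using streaming from the Hugging Face datasets library; this is not the CoreNet data pipeline, and the dataset name, filtering threshold, and tokenizer are placeholders.

```python
# Rough sketch of on-the-fly filtering + tokenization (not the CoreNet
# pipeline). The dataset name and filtering threshold are placeholders.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
stream = load_dataset("allenai/c4", "en", split="train", streaming=True)

def keep(example):
    # e.g. drop very short documents before tokenizing
    return len(example["text"]) > 200

for example in stream.filter(keep).take(4):
    ids = tokenizer(example["text"], truncation=True, max_length=2048)["input_ids"]
    print(len(ids), "tokens")
```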

The pretraining process leveraged publicly available datasets, totaling around 1.8T tokens:

| Source | Subset | Tokens |
| --- | --- | --- |
| RefinedWeb | – | 665 B |
| RedPajama | Github | 59 B |
| | Books | 26 B |
| | ArXiv | 28 B |
| | Wikipedia | 24 B |
| | StackExchange | 20 B |
| | C4 | 175 B |
| PILE | – | 207 B |
| Dolma | The Stack | 411 B |
| | Reddit | 89 B |
| | PeS2o | 70 B |
| | Project Gutenberg | 6 B |
| | Wikipedia + Wikibooks | 4.3 B |

Datasets used for pre-training OpenELM (source: paper)

Evaluation

OpenELM outperforms existing LLMs of comparable size pretrained on publicly available datasets. The average accuracy was computed across multiple tasks covering reasoning, knowledge understanding, and misinformation & bias.

| Model | Model size | Pretraining tokens | Average acc. (in %) |
| --- | --- | --- | --- |
| OPT | 1.3B | 0.2T | 41.49 |
| Pythia | 1.4B | 0.3T | 41.83 |
| MobiLlama | 1.3B | 1.3T | 43.55 |
| OLMo | 1.2B | 3.0T | 43.57 |
| OpenELM | 1.1B | 1.5T | 45.93 |
OpenELM vs. public LLMs (source: paper)

Notably, OpenELM achieves better performance than existing open-source LLMs trained on public datasets. For instance, OpenELM, with its 1.1B parameters, outperforms the 1.2B-parameter OLMo by 2.36 percentage points while requiring half as many pretraining tokens.

OpenELM was also compared with other widely used LLMs on a range of evaluation benchmarks. The results are shown in the tables below, where the highest accuracy scores are marked in bold and models trained on smaller datasets are shown in gray.

(source: paper)
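For readers who want to run a comparable evaluation themselves, the sketch below uses the LM Evaluation Harness Python API; the task selection, tokenizer choice, and API details are assumptions that may need adjusting to your harness version.

```python
# Rough sketch of a zero-shot evaluation with the LM Evaluation Harness.
# Task names, tokenizer, and API details are assumptions and may differ
# across harness versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=apple/OpenELM-1_1B,trust_remote_code=True,"
               "tokenizer=meta-llama/Llama-2-7b-hf",
    tasks=["arc_easy", "arc_challenge", "hellaswag", "piqa", "winogrande"],
    batch_size=8,
)
print(results["results"])
```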

Conclusion

OpenELM sets a new standard in AI research by offering a comprehensive and fully open framework.

The models are trained on publicly available datasets and released without built-in safety guarantees, so they can produce harmful, inaccurate, or biased outputs. It is therefore important for users and developers to conduct extensive safety testing and implement appropriate filtering mechanisms for their applications.
