MosaicML launches MPT-30B: a new open-source model that outperforms GPT-3

MosaicML, a company that provides a platform for training and deploying large language models (LLMs), has recently released its second open-source foundation model called MPT-30B. The model is part of the MosaicML Foundation Series and comes after the smaller MPT-7B model that was launched in May 2023.

MPT-30B is an open-source LLM, released under the commercially usable Apache 2.0 license, and was trained on NVIDIA H100 GPUs.

Image source: MosaicML

MPT-30B Family

MPT-30B is a 30-billion-parameter transformer model that was trained on a mixture of text and code data, with a context window of 8,000 tokens.

It significantly improves on the previous MPT-7B model, with more parameters, a longer context window, and better output quality. It also comes with two fine-tuned variants.

The figure below shows a comparison between MPT-7B and MPT-30B.

The MPT-30B model significantly improves over the MPT-7B model

According to MosaicML, the new model surpasses the quality of the original GPT-3 model and is competitive with other open-source models such as LLaMa-30B and Falcon-40B.

MPT-30B outperforms GPT-3 in six out of the nine metrics
MPT vs. LLaMa vs. Falcon models.
Left: Comparing models with 7 billion parameters. Right: Comparing models with 30 to 40 billion parameters.

MPT-30B also comes with two fine-tuned variants: MPT-30B-Instruct and MPT-30B-Chat.

  • MPT-30B-Instruct is specialized for single-turn instruction following, such as generating code snippets, summaries, or translations (a minimal prompting sketch follows this list).
  • MPT-30B-Chat is designed for multi-turn conversations, such as chatbots or interactive storytelling. It is a research artifact and is not licensed for commercial use.

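For illustration, here is a minimal sketch of how the Instruct variant might be prompted. The Dolly/Alpaca-style template below is an assumption based on earlier MPT-Instruct releases; check the model card on the HuggingFace Hub for the exact format.

```python
# Hypothetical prompt template for MPT-30B-Instruct, assuming the
# Dolly/Alpaca-style format used by earlier MPT-Instruct models.
INSTRUCT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

prompt = INSTRUCT_TEMPLATE.format(
    instruction="Summarize the following paragraph in one sentence: ..."
)
# `prompt` is then passed to the model for generation
# (see the loading sketch later in this post).
```
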
Main advantages over other LLMs

MPT-30B has several advantages over other LLMs, such as:

  1. It was trained with an 8k-token context window, four times longer than that of GPT-3, LLaMa-30B, and Falcon-40B. This allows the model to handle the longer sequences of text or code that are common in enterprise applications such as banking or healthcare.
  2. It supports Attention with Linear Biases (ALiBi), which replaces positional embeddings with a penalty on attention scores that grows linearly with token distance. Because the bias depends only on relative distance, the model can extrapolate to contexts even longer than its training window (see the sketch after this list).
  3. It uses FlashAttention, an IO-aware exact attention algorithm that tiles the computation to minimize reads and writes to GPU memory. This cuts the memory footprint and speeds up the attention layers, making the model more efficient for both training and inference on GPUs.
  4. It was trained on a large amount of code data from various programming languages, such as Python, Java, C#, and SQL. This gives the model strong coding abilities and enables it to generate or understand code snippets for various tasks.

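To make the ALiBi idea in point 2 concrete, here is a minimal, framework-agnostic sketch of how a linear distance bias can be added to attention scores. It illustrates the general technique rather than MosaicML's implementation; the slope schedule follows the original ALiBi paper.

```python
import numpy as np

def alibi_slopes(num_heads: int) -> np.ndarray:
    # Head-specific slopes form a geometric sequence, as in the ALiBi paper:
    # for n heads they are 2^(-8/n), 2^(-16/n), ..., 2^(-8)
    # (exact for power-of-two head counts).
    start = 2 ** (-8.0 / num_heads)
    return np.array([start ** (i + 1) for i in range(num_heads)])

def alibi_bias(num_heads: int, seq_len: int) -> np.ndarray:
    # The bias grows linearly with the distance between query position i
    # and key position j, so more distant tokens are penalized more strongly.
    positions = np.arange(seq_len)
    distance = positions[None, :] - positions[:, None]    # (seq, seq): j - i
    distance = np.minimum(distance, 0)                    # zero bias for future positions
                                                          # (causal masking is applied separately)
    slopes = alibi_slopes(num_heads)                      # (heads,)
    return slopes[:, None, None] * distance[None, :, :]   # (heads, seq, seq)

# The bias is simply added to the raw attention scores before the softmax:
#   scores = q @ k.T / sqrt(d_head) + alibi_bias(num_heads, seq_len)
# Because it depends only on relative distance, the same scheme keeps working
# when inference-time sequences are longer than the training window.
```
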
Training times and costs

MPT-30B was trained on NVIDIA’s latest-generation H100 GPUs, which are now available for MosaicML customers. The H100 GPUs offer 2.4 times more throughput per GPU than the A100 GPUs, which were used to train MPT-7B. This makes MPT-30B one of the first publicly known LLMs trained on H100 GPUs.

The table below shows the times and costs to pretrain MPT-30B from scratch on 1 trillion tokens.

It takes about 28.3 days and $871,000 to pretrain MPT-30B on A100 GPUs, and about 11.6 days and $714,000 to pretrain MPT-30B on H100 GPUs.

Times and costs to pretrain MPT-30B from scratch on 1 trillion tokens

The next table shows the times and costs to finetune MPT-30B on 1 billion tokens. Finetuning means continuing to train an existing model on a smaller, more specific dataset to improve its performance on a particular task or domain.

It takes about 21.8 hours and $871 to finetune MPT-30B on A100 GPUs, and about 8.9 hours and $714 to finetune MPT-30B on H100 GPUs.

Times and costs to finetune MPT-30B on 1 billion tokens

This suggests that MosaicML enables a fast and low-cost approach for training and deploying customized LLMs using their platform.

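As a quick sanity check on these figures, the snippet below derives the implied speedups and savings directly from the numbers quoted above; it uses no MosaicML APIs, only the published times and costs.

```python
# Pretraining and finetuning figures quoted in this post.
pretrain = {"A100": (28.3, 871_000), "H100": (11.6, 714_000)}  # (days, USD)
finetune = {"A100": (21.8, 871),     "H100": (8.9, 714)}       # (hours, USD)

for name, table in [("pretrain", pretrain), ("finetune", finetune)]:
    (t_a, c_a), (t_h, c_h) = table["A100"], table["H100"]
    print(f"{name}: H100s are ~{t_a / t_h:.1f}x faster and ~{(1 - c_h / c_a) * 100:.0f}% cheaper")

# pretrain: H100s are ~2.4x faster and ~18% cheaper
# finetune: H100s are ~2.4x faster and ~18% cheaper
```

The ~2.4x wall-clock speedup is consistent with the per-GPU throughput gain MosaicML reports for the H100.
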
How you can use MPT-30B

You can use MPT-30B for various generative AI applications by downloading it from the HuggingFace Hub. The weights load in PyTorch via the HuggingFace transformers library, and you can choose between the base, instruct, and chat variants depending on your needs.

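As a minimal sketch of the HuggingFace route, the snippet below loads the base model with transformers. The repository name, the trust_remote_code requirement, the max_seq_len override, and the tokenizer choice follow the public MPT model cards, but check the card for your variant before relying on the exact details.

```python
import transformers

# MPT ships custom modeling code, so trust_remote_code=True is required.
name = "mosaicml/mpt-30b"

config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
# Optional: extend the context window beyond the 8k training length
# (ALiBi allows extrapolation at inference time).
config.max_seq_len = 16384

# Note: the 30B checkpoint needs well over 60 GB of memory in 16-bit precision;
# use multiple GPUs (e.g. via accelerate/device_map) or a smaller MPT model to experiment.
model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    torch_dtype="auto",  # load in the checkpoint's native precision
    trust_remote_code=True,
)

# The MPT models reuse the EleutherAI GPT-NeoX-20B tokenizer.
tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

inputs = tokenizer("Here is a Python function that reverses a string:\n", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
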
You can also use MPT-30B via the MosaicML Platform, which helps you train and deploy your own models with your own data in several ways: MosaicML Training, MosaicML Inference: Starter Edition, or MosaicML Inference: Enterprise Edition.

The MosaicML platform helps you train and use large AI models, such as MPT-30B, for different purposes (see the figure below).

The MosaicML platform architecture

The MosaicML platform has three main parts:

  1. Client Interfaces: Consists of a Python API, a command line interface, and a web console for management.
  2. Control Plane: Orchestrates the platform, scheduling and monitoring your training jobs across compute resources, and keeps track of configurations, logs, and other metadata.
  3. Compute Plane: Runs distributed training jobs on your chosen cloud provider, inside a Virtual Private Cloud (VPC) so that your data stays private.

The MosaicML platform enables you to fine-tune MPT-30B and serve it for inference quickly and at low cost. If you are interested in building your own generative AI applications, MPT-30B is a good place to start.

Conclusion

MPT-30B is a powerful tool for generative AI applications.

The model also comes with two fine-tuned variants, MPT-30B-Instruct and MPT-30B-Chat. All of the models are open-source: MPT-30B-Instruct carries a commercially usable license (CC-By-SA-3.0), while MPT-30B-Chat is licensed for non-commercial use only (CC-By-NC-SA-4.0).
