Microsoft Phi-3-mini, a small language model on your phone


Microsoft has introduced the Phi-3 family of small language models, starting with Phi-3-mini. With only 3.8B parameters, this model is compact enough to be deployed on a phone, yet it was trained on an extensive dataset of 3.3T tokens.

The upcoming release includes the Phi-3-small (7B parameters) and Phi-3-medium (14B parameters) models.

The Phi-3 models are small language models (SLMs) designed to be both powerful and affordable. They outperform language models of the same size and larger on benchmarks covering language understanding, logical reasoning, coding, and mathematics. For example, Phi-3-mini surpasses models double its size, while Phi-3-small and Phi-3-medium exceed the capabilities of significantly larger models, such as GPT-3.5.

You can interact with the Phi-3 family directly on the Azure AI Playground, or build with and customize Phi-3 for your scenarios using Azure AI Studio. You can also find it on the HuggingChat playground and easily run it on your device with Ollama. It is offered as a microservice through NVIDIA NIM, featuring a standard API for flexible deployment, and it has been optimized for NVIDIA GPUs.
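Running the model locally with Ollama amounts to pulling the model (`ollama pull phi3`) and querying the local REST endpoint. A minimal sketch, assuming an Ollama server is running on its default port (the endpoint and payload fields follow Ollama's `/api/generate` API):

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(prompt: str, model: str = "phi3") -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str) -> str:
    """Send a prompt to the locally running Ollama server and return the response text."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    req = request.Request(OLLAMA_URL, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Usage (requires `ollama serve` and `ollama pull phi3` beforehand):
# print(generate("Explain what a small language model is in one sentence."))
```

Since everything runs against localhost, the model answers entirely on-device, with no data leaving the machine.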

Due to its small size, Phi-3-mini can be quantized to 4 bits, after which it occupies approximately 1.8GB of memory. It has been shown to run completely offline on an iPhone 14 with an A16 Bionic chip, generating more than 12 tokens per second.
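The memory figure follows directly from the parameter count: at 4 bits (half a byte) per weight, 3.8B parameters occupy roughly 1.8 GiB. A quick back-of-the-envelope check:

```python
params = 3.8e9          # Phi-3-mini parameter count
bytes_per_param = 0.5   # 4-bit quantization = half a byte per weight

total_bytes = params * bytes_per_param
gib = total_bytes / 2**30   # binary gigabytes (GiB)

print(f"{gib:.2f} GiB")  # ~1.77 GiB, consistent with the reported ~1.8GB
```

Real deployments add a little overhead for activations and the KV cache, but the weights dominate at this scale.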

The model

The Phi-3-mini model (3.8B parameters) uses a transformer decoder architecture and is available in two context-length variants: 4K and 128K tokens. It represents an advance over Phi-2 (2.7B parameters), which Microsoft released in December 2023.

The upcoming Phi-3-small model (7B parameters) leverages the tiktoken tokenizer (for better multilingual tokenization) with a vocabulary of 100,352 tokens and a default context length of 8K tokens. This allows the model to process and generate longer, more complex text passages, enhancing its performance on a variety of language tasks.
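A larger vocabulary directly inflates the embedding table. As an illustration of the cost (the 4,096 embedding width here is an assumption for the 7B model, not a figure from this post):

```python
vocab_size = 100_352   # Phi-3-small tokenizer vocabulary
hidden_dim = 4_096     # assumed embedding width; illustrative only

embed_params = vocab_size * hidden_dim
print(f"{embed_params / 1e6:.0f}M embedding parameters")  # ~411M
```

Under that assumption, the embedding table alone would account for a few percent of a 7B-parameter model, one reason small models tend to keep vocabularies modest.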

Training methodology

Phi-3-mini-4K-Instruct and Phi-3-mini-128K-Instruct were each trained for 7 days on 3.3T tokens using 512 H100-80G GPUs. Both then went through advanced fine-tuning techniques to align with human preferences and safety standards.
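These figures imply a substantial but tractable compute budget per model; a rough estimate of GPU-hours and per-GPU training throughput:

```python
gpus = 512       # H100-80G GPUs per model
days = 7         # training duration
tokens = 3.3e12  # training tokens

gpu_hours = gpus * days * 24
tokens_per_gpu_per_sec = tokens / (gpus * days * 24 * 3600)

print(f"{gpu_hours:,.0f} GPU-hours")                       # 86,016 GPU-hours
print(f"{tokens_per_gpu_per_sec:,.0f} tokens/s per GPU")   # ~10,657
```

Roughly 86K GPU-hours is orders of magnitude less than what frontier-scale models require, which is part of the "affordable" pitch for SLMs.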

The pre-training process followed two distinct and consecutive stages:

  • In the first stage, the models were primarily exposed to a vast collection of web sources. This data helped the models develop general knowledge and language comprehension.
  • In the second stage, the models were trained on a more rigorously filtered subset of the first stage's web data, combined with additional synthetic data, to improve their logical reasoning and specialized abilities.

After these two stages, the models underwent additional training, including supervised instruction fine-tuning and preference tuning, to enhance their alignment and safety.
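Preference tuning of this kind is often implemented with direct preference optimization (DPO); this post does not spell out the exact recipe, so the following is only a schematic of the DPO loss on a single preference pair:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses.

    logp_* are summed token log-probabilities under the policy being tuned;
    ref_logp_* are the same quantities under a frozen reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# When the policy prefers the chosen answer more than the reference does,
# the margin is positive and the loss is small.
loss = dpo_loss(logp_chosen=-10.0, logp_rejected=-14.0,
                ref_logp_chosen=-12.0, ref_logp_rejected=-12.0)
print(f"{loss:.3f}")  # ~0.513
```

Minimizing this loss pushes the model to assign relatively higher probability to preferred responses without a separate reward model.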

The training dataset of 3.3 trillion tokens is a meticulously curated mix of quality-filtered public documents, select educational materials, code, and synthetic data newly generated by LLMs. Specifically, the team filtered the web data to target the appropriate degree of knowledge and retained more of the web pages likely to strengthen the models' reasoning abilities. Rather than indiscriminately feeding vast amounts of data into the model, the emphasis was placed on building reasoning capability rather than merely accumulating a vast repository of facts.


Phi-3 achieves results comparable to larger models such as Gemma-7B, Mistral-7B-v0.1, Mixtral-8x7B, Llama-3-8B-Instruct, GPT-3.5, and Claude-3 on various benchmarks, such as AGIEval (a benchmark built from human exams) and MMLU (Massive Multitask Language Understanding). The model competes with its larger counterparts in areas such as language understanding, coding, and reasoning. Examples of its performance are illustrated in the image below.

Phi-3 models outperform language models of the same and larger sizes on key benchmarks (data source: Phi-3 Microsoft blog)

In particular, Phi-3-mini outperforms models twice its size, while the Phi-3-small and Phi-3-medium models perform better than significantly larger models, such as GPT-3.5.


The Phi-3-mini model matches larger models in language comprehension and reasoning, but its capacity to store extensive factual knowledge is limited, as evidenced by its low performance on TriviaQA. This shortcoming can be mitigated by pairing the model with a search engine. Another limitation, its largely English-centric coverage, can be addressed by expanding its language support.

Challenges like inaccuracies, biases, and safety risks can be mitigated by carefully choosing the training data and constantly improving the model.


Phi-3-mini incorporates extensive safety measures, including post-training adjustments, red-team reviews, and automated tests to minimize risks across various harm categories. This comprehensive process led to a marked reduction in the model's potential for generating harmful responses (see the picture below).

Comparison of harmful-response rates generated by Phi-3-mini before and after safety alignment (source: Phi-3 technical report)


SLMs, such as Phi-3, have the potential to significantly change our interaction with technology. They are designed to be more efficient and accessible, allowing for seamless AI integration into various aspects of daily life, from personal assistants to advanced computational tools.
