Google introduces Gemma, a family of lightweight, open language models built from the same research and technology used to create the Gemini models. Gemma is among the first open-weights LLMs available for both commercial and research use.
The team offers two model sizes: 7B parameters (for powerful GPUs and TPUs) and 2B parameters (optimized for CPU and on-device applications). You can try the 7B model demo or install it locally from the repository.
Find more resources and information about Gemma on the official website.
For each model size, the team releases both a pre-trained base variant and an instruction-tuned variant, as well as an open-source codebase for easy integration and deployment.
Gemma outperforms other open models of similar size on academic benchmarks for language understanding, reasoning, and safety.
Why choose Gemma?
Unlike LLMs such as GPT-4 and ChatGPT, which are accessible only through their APIs, Gemma lets users download the model weights and the tokenizer and run them in the cloud or locally. This gives users more flexibility and control over the model, and the ability to fine-tune it for specific domains or tasks.
Gemma focuses on text-to-text generation, offering a powerful tool for general-purpose language tasks, whereas Gemini targets a broader range of problems with its multimodal capabilities.
Model architecture
The Gemma models are based on the transformer decoder, with several improvements:
- Multi-query attention (used in Gemma 2B) shares a single key-value head across all query heads, which shrinks the KV cache and speeds up inference; the 7B model uses standard multi-head attention (see the sketch after this list).
- RoPE embeddings replace absolute positional embeddings with a rotary encoding applied to the queries and keys at each layer; in addition, the input and output embeddings are shared, which reduces the model size.
- GeGLU activations replace the standard ReLU non-linearity, increasing the expressiveness of the feed-forward layers.
- Normalizer location: RMSNorm is applied to both the input and the output of each transformer sub-layer, instead of only one of them, improving the stability and convergence of the model.
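To make the multi-query idea concrete, here is a minimal, self-contained sketch of causal multi-query attention in JAX. It is illustrative only, not Gemma's actual implementation; the function name, weight shapes, and parameters (`num_heads`, `w_q`, etc.) are assumptions for this example.

```python
# Minimal sketch of causal multi-query attention (MQA) in JAX.
# Illustrative only -- not Gemma's implementation; names and shapes
# here are assumptions for the example.
import jax
import jax.numpy as jnp

def multi_query_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq, d_model). Many query heads share ONE key/value head,
    which shrinks the KV cache and speeds up autoregressive decoding."""
    seq, d_model = x.shape
    head_dim = d_model // num_heads
    q = (x @ w_q).reshape(seq, num_heads, head_dim)  # per-head queries
    k = x @ w_k                                      # (seq, head_dim), shared
    v = x @ w_v                                      # (seq, head_dim), shared
    scores = jnp.einsum("qhd,kd->hqk", q, k) / jnp.sqrt(head_dim)
    causal = jnp.tril(jnp.ones((seq, seq), dtype=bool))  # no attending ahead
    scores = jnp.where(causal, scores, -1e30)
    probs = jax.nn.softmax(scores, axis=-1)
    out = jnp.einsum("hqk,kd->qhd", probs, v).reshape(seq, d_model)
    return out @ w_o

# Tiny smoke test with random weights.
key = jax.random.PRNGKey(0)
ks = jax.random.split(key, 5)
seq, d_model, num_heads = 8, 16, 4
head_dim = d_model // num_heads
x = jax.random.normal(ks[0], (seq, d_model))
w_q = jax.random.normal(ks[1], (d_model, d_model)) * 0.02
w_k = jax.random.normal(ks[2], (d_model, head_dim)) * 0.02
w_v = jax.random.normal(ks[3], (d_model, head_dim)) * 0.02
w_o = jax.random.normal(ks[4], (d_model, d_model)) * 0.02
print(multi_query_attention(x, w_q, w_k, w_v, w_o, num_heads).shape)  # (8, 16)
```

Note how `w_k` and `w_v` project to a single `head_dim`-wide head rather than one per query head; that single shared key-value pair is the entire difference from multi-head attention.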
Training
Gemma 7B was trained on 4096 TPUv5e chips and Gemma 2B on 512 TPUv5e chips. Gemma models specialize in text-based tasks and prioritize scalability and efficiency.
Pretraining: Gemma 2B and 7B were trained on 2T and 6T tokens respectively. The training data is primarily English-language text drawn from web documents, mathematical content, and code.
Fine-tuning: The authors fine-tune the Gemma models in two stages: supervised fine-tuning (SFT) on a mix of synthetic and human-generated data, and reinforcement learning from human feedback (RLHF), which uses human preference data and high-quality prompts.
They show that both stages are essential for improving the quality and relevance of the model outputs.
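For intuition, here is a minimal sketch of the SFT objective: next-token cross-entropy computed over the response tokens only, with the prompt tokens masked out. This is a generic illustration, not Gemma's training code; `response_mask` and the shapes are hypothetical.

```python
# Minimal sketch of the supervised fine-tuning (SFT) objective.
# Generic illustration, not Gemma's training code; names and shapes
# are hypothetical.
import jax
import jax.numpy as jnp

def sft_loss(logits, targets, response_mask):
    """logits: (seq, vocab) model outputs; targets: (seq,) next-token ids;
    response_mask: (seq,) 1.0 where the token belongs to the response."""
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    # Log-likelihood of each target token under the model.
    token_ll = jnp.take_along_axis(log_probs, targets[:, None], axis=-1)[:, 0]
    # Average negative log-likelihood over response tokens only,
    # so the model is not penalized for the prompt it was given.
    return -(token_ll * response_mask).sum() / response_mask.sum()
```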
Evaluation
Gemma was evaluated across diverse domains, using both automated academic benchmarks and human assessment.
How to use Gemma?
Gemma is easy to use and integrate with various tools and frameworks.
- Install Gemma using Python 3.9 or higher and JAX for CPU, GPU, or TPU.
- Download the model checkpoints and the tokenizer from the Hugging Face Hub, and extract them to a local directory.
- Run the model using the provided examples and tutorials, such as the sampling script, the fine-tuning tutorial, and the GSM8K evaluation; a minimal loading-and-generation sketch follows this list. Gemma models can run on any device that supports JAX, not just desktop or laptop computers. You can also use the playground on the NVIDIA NGC catalog (coming soon).
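As one concrete route, the sketch below loads an instruction-tuned Gemma checkpoint from the Hugging Face Hub with the `transformers` library and generates a short reply. The checkpoint name `google/gemma-2b-it` and the generation settings are assumptions for this example, and the checkpoints are gated, so you may need to accept the license and authenticate with the Hub first.

```python
# Minimal sketch: load an instruction-tuned Gemma checkpoint from the
# Hugging Face Hub and generate text. The checkpoint name and settings
# are assumptions for this example; the repo is gated, so accept the
# license and authenticate before downloading.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b-it"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Explain multi-query attention in one sentence."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For the JAX-native path, follow the sampling script in the official repository instead.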
Visit the GitHub repository for the code and instructions. You can also try the demo on Hugging Chat.
Conclusion
Gemma is a promising and exciting development in the field of LLMs and AI. It opens up new possibilities for developers and researchers to explore and experiment with open-source LLMs and to create innovative applications.
Google plans to continue improving and expanding Gemma, as well as releasing more models and features in the future.