Democratizing AI: Google releases Gemma, a free and open-source language model


Google introduces Gemma, a new open, lightweight language model built on the same research and technology used for the Gemini models. Gemma is among the first open-weights LLMs available for both commercial and research use.

The team offers two model sizes: 7B parameters (for powerful GPUs and TPUs) and 2B parameters (optimized for CPU and on-device applications). You can try the 7B model demo or install it locally from the repository.

Find more resources and information about Gemma on the official website.

In addition to the base models, the team also provides pre-trained and instruction-tuned variants for each model size, as well as an open-source codebase for easy integration and deployment.

Gemma outperforms other open models of similar size on academic benchmarks for language understanding, reasoning, and safety (see the figure below).

Language understanding and generation performance of Gemma 7B across different capabilities compared to similarly sized open models (source: technical report)

Why choose Gemma?

Unlike proprietary LLMs such as GPT-4 and ChatGPT, which are only accessible through their APIs, Gemma lets users download the model weights and the tokenizer and run them in the cloud or locally. This gives users more flexibility and control over the model, including the ability to fine-tune it for specific domains or tasks.

Gemma prioritizes text generation tasks, offering a powerful tool for general usage, whereas Gemini tackles a broader range of complexities with its multimodal capabilities.

Model architecture

The Gemma models are based on the transformer decoder, with several improvements:

  1. Multi-query attention (used in Gemma 2B) shares a single key-value head across all query heads, reducing memory use and speeding up inference.
  2. RoPE embeddings use a rotational encoding scheme to represent the position of each token, rather than learned absolute position embeddings; Gemma also shares embeddings across its input and output layers to reduce model size.
  3. GeGLU activations replace the standard ReLU activation function with a gated linear unit, increasing the expressiveness of the model.
  4. Normalizer location: normalization is applied to both the input and the output of each transformer sub-layer, instead of only one of them, improving the stability and convergence of training.
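To make the second item concrete, rotary position embeddings can be sketched in a few lines of NumPy. This is a simplified illustration, not Gemma's actual implementation: each pair of features in a token's vector is rotated by an angle that depends on the token's position.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, dim).

    Each (x1, x2) feature pair is rotated by an angle proportional to
    the token position, with a different frequency per pair."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) * 2.0 / dim)   # per-pair frequencies
    angles = positions[:, None] * freqs[None, :]     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Standard 2D rotation applied to each feature pair.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

x = np.random.randn(4, 8)
out = rope(x, np.arange(4))
```

Because rotation preserves lengths, the transformed vectors keep the same norm as the originals, and the token at position 0 is left unchanged; only the relative angles between positions carry positional information.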
Key model parameters, 2B compared to 7B (data source: technical report):

  d_model: 2048 (2B) / 3072 (7B). The width of the model's hidden representations, i.e. how much information the model carries at each position.
  Layers: 18 / 28. The number of decoder layers in the architecture; more layers allow the model to learn more complex relationships between tokens.
  Feedforward hidden dims: 32768 / 49152. The size of the hidden layer in each feedforward block; a larger hidden layer can store more information.
  Num heads: 8 / 16. The number of attention heads in the multi-head attention mechanism; more heads let the model attend to different parts of the input sequence in parallel.
  Num KV heads: 1 / 16. The number of heads used for the key and value projections; with a single KV head (multi-query attention), all query heads share one key-value pair.
  Head size: 256 / 256. The dimensionality of the output from each attention head.
  Vocab size: 256128 / 256128. The number of tokens (words or sub-words) the model is trained to understand and generate; a larger vocabulary covers a wider range of words.
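As a back-of-envelope check, these configuration values roughly reproduce the models' parameter counts. The sketch below assumes the "feedforward hidden dims" figure counts the two GeGLU input projections together (so each projection is half that width) and ignores normalization weights, which are comparatively tiny.

```python
def count_params(d_model, layers, ff_dims, heads, kv_heads, head_size, vocab):
    """Rough transformer parameter count from configuration values."""
    embed = vocab * d_model                   # token embedding matrix
    q = d_model * heads * head_size           # query projection
    kv = 2 * d_model * kv_heads * head_size   # key and value projections
    o = heads * head_size * d_model           # attention output projection
    ffn = 3 * d_model * (ff_dims // 2)        # GeGLU: gate, up, and down
    return embed + layers * (q + kv + o + ffn)

gemma_2b = count_params(2048, 18, 32768, 8, 1, 256, 256128)   # ≈ 2.5e9
gemma_7b = count_params(3072, 28, 49152, 16, 16, 256, 256128) # ≈ 8.5e9
```

Under these assumptions the estimates come out at roughly 2.5B and 8.5B parameters, in line with the two model sizes.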


Gemma 7B was trained on 4096 TPUv5e chips and Gemma 2B on 512 TPUv5e chips. Gemma models specialize in text-based tasks and prioritize scalability and efficiency.

Pretraining: Gemma 2B and 7B are trained on massive datasets consisting of 2T and 6T tokens respectively. The training data primarily comprises English-language text extracted from web documents, mathematical content, and code sources.

Fine-tuning: The authors fine-tune the Gemma models using two methods: supervised fine-tuning (SFT), which uses synthetic and human-generated data, and reinforcement learning from human feedback (RLHF), which uses human feedback and high-quality prompts.

They show that both methods are essential for enhancing the quality and relevance of the model outputs.
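The SFT stage boils down to a cross-entropy loss computed over response tokens only, with prompt tokens masked out. The NumPy sketch below illustrates that idea; it is a simplified illustration, not Gemma's actual training code.

```python
import numpy as np

def sft_loss(logits, targets, response_mask):
    """Mean cross-entropy over response tokens only.

    In supervised fine-tuning, prompt tokens are typically excluded
    from the loss so the model is only trained to produce responses."""
    shifted = logits - logits.max(axis=-1, keepdims=True)    # stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_ll = log_probs[np.arange(len(targets)), targets]   # per-token log-lik
    return -(token_ll * response_mask).sum() / response_mask.sum()

logits = np.array([[2.0, 0.0, 0.0],   # prompt token (masked out)
                   [0.0, 3.0, 0.0],   # response token
                   [0.0, 0.0, 3.0]])  # response token
targets = np.array([0, 1, 2])
mask = np.array([0.0, 1.0, 1.0])      # loss counts only the response
loss = sft_loss(logits, targets, mask)
```

Because the prompt position is masked, changing its logits has no effect on the loss; only the quality of the predicted response tokens matters.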


Gemma was evaluated across diverse domains, employing both automated benchmarks and human assessment. The following tables show the results.

Win rate of Gemma models versus Mistral 7B v0.2 Instruct with 95% confidence intervals (source: technical report)
Gemma’s academic benchmark results, compared to Mistral and LLaMA-2 (source: technical report)

How to use Gemma?

Gemma is easy to use and integrate with various tools and frameworks.

  1. Install Gemma using Python 3.9 or higher and JAX for CPU, GPU, or TPU.
  2. Download the model checkpoints and the tokenizer from the Hugging Face Hub, and extract them to a local directory.
  3. Run the model using the provided examples and tutorials, such as the sampling script, the fine-tuning tutorial, and the GSM8K evaluation. Gemma models can run on any device that supports JAX, not just desktop or laptop computers. You can also use the playground on the NVIDIA NGC catalog (coming soon).
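The sampling script mentioned in step 3 ultimately does something like the following at each decoding step: scale the model's logits by a temperature, convert them to probabilities, and draw the next token. This is a generic sketch of that loop body, not the repository's actual code.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Sample a token id from a logits vector, as a decoder-only LM's
    sampling loop would do at each step (simplified sketch)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    if temperature == 0:                  # temperature 0 = greedy decoding
        return int(np.argmax(logits))
    scaled = logits / temperature
    scaled -= scaled.max()                # subtract max for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([2.0, 1.0, 0.1])
sample_next_token(logits, temperature=0)  # -> 0 (greedy picks the max logit)
```

Lower temperatures concentrate probability on the highest-scoring tokens; higher temperatures flatten the distribution and make sampling more diverse.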

Visit the GitHub repository for the code and instructions. You can also try the demo on Hugging Chat.


Gemma is a promising and exciting development in the field of LLMs and AI. It opens up new possibilities for developers and researchers to explore and experiment with open-source LLMs and to create innovative applications.

Google plans to continue improving and expanding Gemma, as well as releasing more models and features in the future.
