Introducing Lemur-70B and Lemur-70B-Chat: the open-source models harmonizing text and code for powerful language agents

Lemur and Lemur-Chat are openly accessible language models optimized for both natural language and coding capabilities. They are the foundation for versatile language agents that can communicate with humans, think, plan, and interact with the environment.

Lemur and Lemur-Chat outperform other open-source models like Llama-2-70B-Chat and CodeLlama-34B-Instruct and achieve a level of performance comparable to GPT-3.5-turbo, thus narrowing the gap between open-source and proprietary models (see the picture below).

Lemur-70B and Lemur-70B-Chat are open and state-of-the-art foundation models for language agents (source: blog post)

You can try the Lemur-Chat model (on Hugging Face), where it can answer your questions about code or text.

The training procedure

The following image illustrates how Lemur and Lemur-Chat models were trained.

Overview of the training procedure (source: paper)

It consisted of two main steps:

Pre-training. The researchers built Lemur by training the base Llama-2-70B model on source codes from GitHub that have licenses allowing free use and modification (The Stack) and a huge dataset of text with 90 billion tokens and 10 times more text than code (including Refined-Web, Redpajama, CommonCrawl, Wikipedia, Books, ArXiv, StackExchange, and DM Mathematics). This helped to boost the coding skills of the model while keeping its natural language abilities.

Fine-tuning. They fine-tuned Lemur on about 300K examples of text and code tasks to make Lemur-Chat, a model that can follow user commands. Lemur-chat can generate executable and semantically correct code snippets or natural language explanations based on the user’s input.

The two-step training method resulted in language models that perform better than all other open-source models on many different text and code tasks.

Lemur and Lemur-Chat perform almost as well as commercial models, such as GPT-3.5-Turbo, on tasks that require interaction with the environment and respond to feedback. These tasks include tool usage, exploration, and self-debugging. Lemur and Lemur-Chat show strong capabilities to adapt to feedback from the environment or humans, and to learn from their own mistakes.

Evaluation results

The research team evaluated Lemur on two types of tasks:

  • Language and code tasks: These tasks require the model to understand and generate both natural language and code, such as answering questions, writing programs, or translating between languages and code.
  • Interactive agent tasks: These tasks require the model to interact with tools, environments, or humans, such as using web search, learning from feedback, fixing errors, or exploring unknown environments (see the next picture).
The models were tested in the context of multi-turn interactive scenarios to measure their ability to adjust in realistic situations (source: paper)

A graphical representation of the performance of Lemur-Chat and other open-source pre-trained language models on different foundational and interactive agent capabilities can be seen in the next picture.

Comparison of the foundational and agent capabilities between Lemur and other models (source: blog post)

The left chart shows the average scores of Lemur-Chat, Llama-2-70B-Chat, and CodeLlama-34B-Instruct on 8 language and code datasets, such as Massive Multitask Language Understanding (MMLU), BIG-Bench Hard (BBH), Grade School Math 8K (GSM8K), Human-Labeled Evaluation Set (HumanEval), and Semantic Parsing and Text-to-SQL Challenge (Spider). These datasets measure the models’ abilities to understand and generate both natural language and code. The chart shows that Lemur-Chat outperforms the other models on most of the datasets, demonstrating its balanced proficiency in both domains.

The right chart compares the performance of Lemur-Chat, Llama-2-70B-Chat, CodeLlama-34B-Instruct, and GPT-3.5-Turbo on 6 agent benchmarks (that test the models’ skills in reasoning, following instructions, using tools, adapting to feedback, and interacting with the environment). The chart shows that Lemur-Chat significantly outperforms the other open-source models on most of the agent benchmarks, narrowing the gap with GPT-3.5-Turbo.

Lemur-70B and Lemur-70B-Chat outperformed Llama-2-70B and Llama-2-70B-Chat by 4.3% and 14.8%, respectively. They also narrowed the gap with GPT-3.5-Turbo on some of the benchmarks, especially those that involve agent capabilities.


Lemur-70B and Lemur-70B-Chat are open-source models with strong language and coding skills, narrowing the divide with closed-source alternatives. They can function as the foundational components for language-based systems across a range of applications, including customer service chatbots, virtual assistants, and even robotic interfaces.

Learn more:

Other popular posts