Gorilla: a Large Language Model connected to over 1,600 APIs

Gorilla is a fine-tuned LLaMA-based model that can interact with more than 1,600 Application Programming Interfaces (APIs) sourced from HuggingFace, TorchHub, and TensorHub. Gorilla outperforms GPT-4 at writing API calls from user prompts, showing higher accuracy and lower hallucination rates.

Proposed by a research team from UC Berkeley and Microsoft Research, the new approach improves the performance of Large Language Models (LLMs) for real-world tasks.

The model incorporates a document retriever that finds the most suitable and up-to-date API documentation in an API database. The retrieved document is then concatenated with the user prompt, helping Gorilla generate accurate API calls.
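
To make the retrieval step concrete, here is a minimal sketch of how a retrieved document can be concatenated with the user prompt. The `retriever` object, its `search` method, and the reference-prompt wording are illustrative assumptions, not Gorilla's actual interface.

```python
# Minimal sketch of retrieval-augmented prompting. The `retriever` object
# and its `search` method are hypothetical placeholders for illustration.

def build_prompt(user_query: str, retriever, k: int = 1) -> str:
    # Fetch the most relevant, up-to-date API documentation from the database.
    docs = retriever.search(user_query, top_k=k)
    reference = "\n".join(docs)
    # Concatenate the retrieved documentation with the user instruction so the
    # model can ground its generated API call in current documentation.
    return f"{user_query}\nUse this API documentation for reference: {reference}"
```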

The team also introduced APIBench, a comprehensive dataset used to evaluate the model. APIBench is publicly available on GitHub and consists of over 1,600 machine learning APIs retrieved from HuggingFace, TorchHub, and TensorHub.

Gorilla surpassed GPT-4 in writing API calls when tested with different baselines and retrieval methods. The model demonstrated a remarkable capacity to adapt to changes in API documentation during testing.

The approach opens up new possibilities for LLMs to access a much larger space of knowledge, tools and resources via APIs.

The Gorilla system (see the figure below) has two pipelines: training (top half) and inference (bottom half).

Gorilla’s architecture: the training and inference pipelines

Gorilla was fine-tuned from the base LLaMA-7B model on instruction data from APIBench, a comprehensive benchmark with more than 11,000 {instruction, API} pairs generated with self-instruct from the API dataset.
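
For illustration, a single APIBench-style training pair might look like the following; the field names and values are assumptions, not the dataset's actual schema.

```python
# Hypothetical {instruction, API} pair in the style of APIBench; the field
# names and example values are illustrative, not the dataset's actual schema.
training_pair = {
    "instruction": "Help me find an API to classify images of plants.",
    "api_call": (
        "model = torch.hub.load('pytorch/vision', "
        "'resnet50', pretrained=True)"
    ),
}
```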

The model was trained and tested in two ways: with an information retriever (retrieval mode) and without one (zero-shot mode).

  • In retrieval mode, Gorilla uses the information retriever to find the best-matching API documentation and concatenates it with the prompt before generating the API call.
  • In zero-shot mode, Gorilla generates the API call directly from the prompt, without any retrieved documentation (see the sketch below).
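
A minimal sketch of the two modes, reusing the hypothetical `build_prompt` helper from the earlier sketch; `model.generate` is likewise an illustrative stand-in, not the real serving interface.

```python
# Sketch of Gorilla's two inference modes; `model.generate` is an
# illustrative stand-in, not the actual serving interface.

def generate_api_call(user_query: str, model, retriever=None) -> str:
    if retriever is not None:
        # Retrieval mode: ground the prompt in retrieved API documentation.
        prompt = build_prompt(user_query, retriever)
    else:
        # Zero-shot mode: the model must recall the right API unaided.
        prompt = user_query
    return model.generate(prompt)
```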

Evaluation

The authors evaluated Gorilla on a dataset of {natural language prompt, API call} pairs, using four retrieval settings and several baselines (e.g., GPT-4 and Claude).

They measured the model's accuracy, its ability to adapt to changes in API documentation, and its ability to reason under constraints.
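
As a rough picture of the scoring, the sketch below tallies accuracy and hallucination over {prompt, reference API call} test pairs. The paper decides correctness via AST sub-tree matching against the API database; the exact string comparison here is a deliberate simplification for illustration.

```python
# Simplified evaluation loop over {prompt, reference API call} pairs. The
# paper uses AST sub-tree matching against the API database; exact string
# comparison below is a crude stand-in for illustration.

def evaluate(model, test_pairs, known_api_calls):
    correct = hallucinated = 0
    for prompt, reference in test_pairs:
        prediction = model.generate(prompt)
        if prediction == reference:
            correct += 1                      # functionally correct call
        elif prediction not in known_api_calls:
            hallucinated += 1                 # call to a non-existent API
    n = len(test_pairs)
    return correct / n, hallucinated / n
```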

Gorilla outperformed state-of-the-art language models in a zero-shot setting, and the oracle retriever improved the model’s performance significantly.

Example API calls generated by GPT-4, Claude, and Gorilla

The figure above shows example API calls generated for the prompt “Help me find an API to convert the spoken language in a recorded audio to text using Torch Hub”.

GPT-4 proposes a non-existent model, Claude chooses the wrong library, and Gorilla correctly identifies the task and proposes a fully qualified API call.

In the graph below we can observe the accuracy and hallucination rate of different models in four scenarios: zero-shot (without a retriever) and with retrievers (BM25, GPT-Index, and oracle). Higher positions on the graph indicate higher accuracy, while positions further to the left indicate less hallucination.

Accuracy vs hallucination of different models in 4 settings: zero-shot and with 3 retrievers

We observe that zero-shot Gorilla (without a document retriever) outperforms GPT-3.5, LLaMA, GPT-4, and Claude.

Conclusion and further research

Gorilla is a model that generates correct API calls, trained on three large collections of machine learning APIs: HuggingFace, TorchHub, and TensorHub. It surpasses existing LLMs such as GPT-4 at API call generation.

It can also adapt to changes in API documentation at test time by using the document retriever to fetch the most up-to-date documentation.

The large corpus of APIs (APIBench) is available for the community to use in future research, and it may be expanded to include more domains and functionalities.
