Gorilla: a Large Language Model connected to over 1,600 APIs

Gorilla is a fine-tuned LLaMA-based model that can interact with more than 1,600 Application Programming Interfaces (APIs) sourced from HuggingFace, TorchHub, and TensorHub. Gorilla outperforms GPT-4 at writing API calls from user prompts, showing higher accuracy and lower hallucination rates.

Proposed by a research team from UC Berkeley and Microsoft Research, the new approach improves the performance of Large Language Models (LLMs) for real-world tasks.

The model incorporates a document retriever that finds the most suitable and up-to-date API documentation in an API database. The retrieved document is then concatenated with the user prompt, helping Gorilla generate accurate API calls.
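
To make the retrieval step concrete, here is a minimal sketch of how a retrieved document can be concatenated with the user prompt. The `retriever` object, its `search` method, and the reference-prompt wording are illustrative assumptions, not Gorilla's actual interface.

```python
# Minimal sketch of retrieval-augmented prompting. The `retriever` object
# and its `search` method are hypothetical placeholders for illustration.

def build_prompt(user_query: str, retriever, k: int = 1) -> str:
    # Fetch the most relevant, up-to-date API documentation from the database.
    docs = retriever.search(user_query, top_k=k)
    reference = "\n".join(docs)
    # Concatenate the retrieved documentation with the user instruction so the
    # model can ground its generated API call in current documentation.
    return f"{user_query}\nUse this API documentation for reference: {reference}"
```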

The team also introduced APIBench, a comprehensive dataset used to evaluate the model. APIBench is publicly available on GitHub and consists of over 1,600 machine learning APIs retrieved from HuggingFace, TorchHub, and TensorHub.

Gorilla surpassed GPT-4 in writing API calls when tested with different baselines and retrieval methods. The model demonstrated a remarkable capacity to adapt to changes in API documentation during testing.

The approach opens up new possibilities for LLMs to access a much larger space of knowledge, tools and resources via APIs.

The Gorilla system (see the figure below) has two pipelines: training (top half) and inference (bottom half).

Gorilla’s architecture: the training and inference pipelines

Gorilla was fine-tuned from the base LLaMA-7B model on instruction data from APIBench, a comprehensive benchmark with more than 11,000 {instruction, API} pairs generated with self-instruct from the API dataset.
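
For illustration, a single APIBench-style training pair might look like the following; the field names and values are assumptions, not the dataset's actual schema.

```python
# Hypothetical {instruction, API} pair in the style of APIBench; the field
# names and example values are illustrative, not the dataset's actual schema.
training_pair = {
    "instruction": "Help me find an API to classify images of plants.",
    "api_call": (
        "model = torch.hub.load('pytorch/vision', "
        "'resnet50', pretrained=True)"
    ),
}
```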

The model was trained and tested in two ways: with an information retriever (retrieval mode) and without one (zero-shot mode).

  • In retrieval mode, Gorilla uses the information retriever to find the best-matching API documentation and concatenates it with the prompt before generating the API call.
  • In zero-shot mode, Gorilla generates the API call directly from the prompt, without any retrieved documentation (see the sketch below).
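
A minimal sketch of the two modes, reusing the hypothetical `build_prompt` helper from the earlier sketch; `model.generate` is likewise an illustrative stand-in, not the real serving interface.

```python
# Sketch of Gorilla's two inference modes; `model.generate` is an
# illustrative stand-in, not the actual serving interface.

def generate_api_call(user_query: str, model, retriever=None) -> str:
    if retriever is not None:
        # Retrieval mode: ground the prompt in retrieved API documentation.
        prompt = build_prompt(user_query, retriever)
    else:
        # Zero-shot mode: the model must recall the right API unaided.
        prompt = user_query
    return model.generate(prompt)
```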

Evaluation

The authors evaluated Gorilla on a dataset of {natural language prompt, API call} pairs, using four retrieval settings and several baselines (e.g., GPT-4 and Claude).

They measured the model's accuracy, its ability to adapt to changes in API documentation, and its ability to reason under constraints.
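
As a rough picture of the scoring, the sketch below tallies accuracy and hallucination over {prompt, reference API call} test pairs. The paper decides correctness via AST sub-tree matching against the API database; the exact string comparison here is a deliberate simplification for illustration.

```python
# Simplified evaluation loop over {prompt, reference API call} pairs. The
# paper uses AST sub-tree matching against the API database; exact string
# comparison below is a crude stand-in for illustration.

def evaluate(model, test_pairs, known_api_calls):
    correct = hallucinated = 0
    for prompt, reference in test_pairs:
        prediction = model.generate(prompt)
        if prediction == reference:
            correct += 1                      # functionally correct call
        elif prediction not in known_api_calls:
            hallucinated += 1                 # call to a non-existent API
    n = len(test_pairs)
    return correct / n, hallucinated / n
```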

Gorilla outperformed state-of-the-art language models in a zero-shot setting, and the oracle retriever improved the model’s performance significantly.

Example API calls generated by GPT-4, Claude, and Gorilla

The figure above shows example API calls generated for the prompt “Help me find an API to convert the spoken language in a recorded audio to text using Torch Hub”.

GPT-4 proposes a non-existent model, Claude chooses the wrong library, and Gorilla correctly identifies the task and proposes a fully qualified API call.

In the graph below we can observe the accuracy and hallucination rate of different models in four scenarios: zero-shot (without a retriever) and with retrievers (BM25, GPT-Index, and oracle). Higher positions on the graph indicate higher accuracy, while positions further to the left indicate less hallucination.

Accuracy vs hallucination of different models in 4 settings: zero-shot and with 3 retrievers

We observe that zero-shot Gorilla (without a document retriever) outperforms GPT-3.5, LLaMA, GPT-4, and Claude.

Conclusion and further research

Gorilla is a model that generates correct API calls, trained on three large collections of machine learning APIs: HuggingFace, TorchHub, and TensorHub. It surpasses existing LLMs such as GPT-4 at API call generation.

It can also adapt to changes in API documentation at test time by using the document retriever to fetch the most up-to-date documentation.

The large corpus of APIs (APIBench) is available for the community to use in future research, and it may be expanded to include more domains and functionalities.
