Recursive summarization: store and retrieve long-term dialogue memory in large language models

A new research paper shows that large language models (LLMs) can be equipped with long-term dialogue memory by recursively summarizing the conversation, i.e., updating a running summary of the dialogue at each step.

The authors of the paper are from the Chinese Academy of Sciences, Beijing, the National University of Defense Technology, China, and the University of Sydney. They propose a method to improve the long-term dialogue memory of LLMs using recursive summarization.

Their method first prompts the LLM to summarize small dialogue contexts. It then recursively produces new summaries by combining the previous summary with the subsequent context. With the latest summary at hand, the LLM can keep track of the whole conversation and generate consistent responses.
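The recursive procedure above can be sketched in Python. Everything here is a hypothetical illustration: `fake_llm` is a deterministic stand-in for a real chat-completion call (the paper uses ChatGPT and text-davinci-003), and the prompt wording is an assumption, not the paper's exact template.

```python
# Sketch of the recursive-summarization loop. `fake_llm` is a deterministic
# placeholder for a real chat-completion call (e.g. gpt-3.5-turbo); it just
# extracts the "User:" lines from the prompt so the example runs offline.

def fake_llm(prompt: str) -> str:
    return " ".join(l for l in prompt.splitlines() if l.startswith("User:"))

def update_memory(memory: str, turns: list[str], llm=fake_llm) -> str:
    """Fold the newest dialogue turns into the running summary."""
    prompt = (f"Previous summary:\n{memory}\n\n"
              "New dialogue turns:\n" + "\n".join(turns) + "\n\n"
              "Updated concise summary:")
    return llm(prompt)

def generate_response(memory: str, turns: list[str], llm=fake_llm) -> str:
    """Produce the next reply from the latest summary plus the current context."""
    prompt = (f"Conversation summary:\n{memory}\n\n"
              "Current dialogue:\n" + "\n".join(turns) + "\n\n"
              "Assistant reply:")
    return llm(prompt)

# Each session's turns are folded into the memory before the next session,
# so the prompt size stays bounded no matter how long the dialogue grows.
memory = ""
sessions = [["User: I love hiking.", "Bot: Nice!"],
            ["User: Any trail tips?", "Bot: Try local hills."]]
for turns in sessions:
    reply = generate_response(memory, turns)
    memory = update_memory(memory, turns)
```

With the toy `fake_llm`, the final `memory` ends up containing the user turns from both sessions, which mirrors the key property: information from earlier sessions survives into later prompts via the summary.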

The evaluation results show that the method enables LLMs to handle long and complex dialogues spanning multiple sessions and topics.

LLMs often struggle to maintain a long-term memory of the conversation, which can lead to inconsistent or incoherent responses. This is mainly because LLMs have a fixed-size context window, so earlier parts of a long conversation eventually fall outside the input the model can attend to.

ChatGPT generates an inconsistent response in a long-term conversation. (Source: paper)


The workflow of the proposed method is shown in the following picture:

The recursive summarization method (Source: paper)
  • C_t is the dialogue context at step t, defined as the concatenation of the last t dialogue turns.
  • M_s is the memory information, which stores multiple natural language sentences abstracted from previous utterances.
  • R_t is the response generated at step t.

In short, the method summarizes the dialogue history and then includes the latest summary in the prompt from which the LLM generates its response.

The research team used two large language models: ChatGPT (gpt-3.5-turbo-0301) and text-davinci-003. In the proposed pipeline, the LLM serves two functions:

  1. Memory management: create and update summaries of the important information from the long-term dialogue.
  2. Response generation: produce a suitable reply based on the latest summary and the current context.

Experiments and evaluation

The method was evaluated on the Multi-Session Chat (MSC) dataset, which is a collection of long conversations between humans. The research team used this dataset to compare their method with other methods and see how well they can generate responses that are consistent and relevant to the conversation history.

The team evaluated the dialogue generation performance using DISTINCT-1/2, F1 and BLEU-1/2.
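As a rough illustration of what these metrics measure (not the paper's exact evaluation script), DISTINCT-n and token-level F1 can be computed like this; the whitespace tokenization is a simplifying assumption:

```python
# Toy implementations of two automatic metrics: DISTINCT-n measures response
# diversity as the fraction of unique n-grams, and F1 measures token overlap
# between a predicted response and a reference response.
from collections import Counter

def distinct_n(tokens: list[str], n: int) -> float:
    """Fraction of n-grams in the token list that are unique."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def token_f1(pred: list[str], ref: list[str]) -> float:
    """Harmonic mean of token-level precision and recall."""
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(pred), overlap / len(ref)
    return 2 * p * r / (p + r)

pred = "i like hiking in the hills".split()
ref = "i love hiking in the mountains".split()
# 4 of 6 predicted tokens appear in the reference, so F1 = 2/3.
```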

They compared their method, labeled Predicted Memory, with three baseline ways of feeding the conversation history to the LLM:

  • All Context, which uses the whole conversation history (previous sessions plus the current session) as the input to the LLM.
  • Part Context, which uses only the current session as input, so the LLM cannot see anything said in previous sessions.
  • Gold Memory, which uses a human-written summary of the conversation history as input, giving the LLM a short and concise version of what has been said before.
  • Predicted Memory, the proposed recursive summarization method, which uses the model-generated summary as input.
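The four conditions differ only in what is fed to the LLM. A minimal sketch, where the variable names and prompt layout are assumptions rather than the paper's exact format:

```python
# How the four input conditions differ. `history` holds turns from earlier
# sessions, `current` holds the ongoing session, and `memory` is a summary
# string (human-written for Gold Memory, model-generated for Predicted Memory).

def build_input(history: list[str], current: list[str],
                memory: str, mode: str) -> str:
    if mode == "all_context":        # previous sessions + current session
        return "\n".join(history + current)
    if mode == "part_context":       # current session only
        return "\n".join(current)
    if mode in ("gold_memory", "predicted_memory"):
        return f"Summary: {memory}\n" + "\n".join(current)
    raise ValueError(f"unknown mode: {mode}")

history = ["User: I moved to Denver.", "Bot: Congrats!"]
current = ["User: Any weekend ideas?"]
memory = "The user recently moved to Denver."
```

Note that under `part_context` the earlier fact about Denver is simply unavailable to the model, while both memory modes deliver it in a few words instead of replaying the full history.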

The main results are shown in the table below. The researchers used a fixed (not fine-tuned) ChatGPT to generate responses with each method on the MSC test set, and measured performance with automatic metrics.

The method with the highest score is in bold, and the method with the second highest score is underlined. The proposed method is labeled “Predicted Memory”.

The dialogue generation performance of fixed ChatGPT when using different methods: All Context, Part Context, Gold Memory, and Predicted Memory (Source: paper)

The main findings are:

ChatGPT keeps its performance stable as the conversation goes on, because the recursively summarized history never exceeds the model's maximum input length of 4,096 tokens.

ChatGPT performs better with the full conversation history (All Context) than with only the current session (Part Context), since earlier sessions provide information the model needs to respond consistently.

Gold memory, the human-written summary of the conversation history, does not perform as well as the predicted memory.

The predicted memory method achieves the best performance on most metrics. The summarized memory captures long-range connections in the dialogue and provides a compact, easy-to-process input for ChatGPT.


Recursive summarization enables LLMs to keep track of the important information from earlier parts of the conversation and to generate responses that are consistent with the conversation history and relevant to the current topic.

This method can help chatbots to maintain a coherent and engaging conversation with humans over multiple sessions, and to avoid forgetting or contradicting what has been said before.

This method can also be applied to other tasks that involve long-context modeling, such as story generation.

Learn more:

Research paper: “Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models” (on arXiv)
