DiagrammerGPT generates better diagrams using LLMs

DiagrammerGPT is a new framework that uses large language models (LLMs) to generate diagrams from text with a high degree of accuracy and flexibility.

It can create diagrams across a wide range of topics and output formats, offering more precise object layouts and clearer text labels than existing text-to-image (T2I) models.

DiagrammerGPT was developed by a research team from UNC Chapel Hill. You can find more information and examples on the project page.

A diagram is a simplified or symbolic drawing that uses visual elements, such as objects, text labels, arrows, and lines, to explain information in a clear and concise way. Existing T2I models, such as DALL-E 3, struggle to generate diagrams because they cannot position objects correctly or produce legible labels.

Innovative features of DiagrammerGPT

DiagrammerGPT gives finer control over all aspects of a diagram's visual elements, including their position, size, shape, color, orientation, connections, and labels.

These are the innovative features that enable DiagrammerGPT to produce accurate diagrams:

  • It uses a two-stage approach to generate diagrams: first, it creates a diagram plan (a sketch of such a plan follows this list) and then it renders the diagram image using a diagram generator.
  • It leverages the layout guidance features of advanced LLMs, which allow the LLMs to control the position, size, shape, color, and orientation of the visual elements, as well as the connections and labels between them.
  • It introduces a new diagram dataset, AI2D-Caption.
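
The structure below is a minimal, illustrative sketch of what a diagram plan might contain. The field names (`entities`, `relationships`, `layout`) and the example content are assumptions made for clarity, not the paper's exact format.

```python
# Illustrative sketch of a diagram plan for a simple "water cycle" diagram.
# Field names and structure are assumptions for clarity, not the paper's exact schema.
diagram_plan = {
    "entities": [
        {"id": "e0", "type": "object", "description": "sun"},
        {"id": "e1", "type": "object", "description": "ocean"},
        {"id": "e2", "type": "text label", "description": "evaporation"},
    ],
    "relationships": [
        {"from": "e1", "to": "e0", "relation": "arrow", "label": "e2"},
    ],
    "layout": {
        # Bounding boxes as (x, y, width, height) in normalized [0, 1] coordinates.
        "e0": (0.70, 0.05, 0.25, 0.25),
        "e1": (0.05, 0.70, 0.90, 0.25),
        "e2": (0.40, 0.40, 0.20, 0.10),
    },
}
```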

The model

DiagrammerGPT is a two-stage text-to-diagram generation framework:

I. Diagram planning. In this stage, DiagrammerGPT uses a GPT-4 model, called the planner, to generate diagram plans from text prompts. Another GPT-4 model, called the auditor, checks the diagram plans for errors and inconsistencies and gives feedback to the planner. The planner and the auditor work together in a feedback loop to refine the diagram plans until they align with the input prompts.

The first stage of DiagrammerGPT: diagram planning (source: paper)
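
The pseudocode below is a minimal sketch of how this planner/auditor loop could be wired up. The `call_llm` helper, the prompt wording, and the stopping condition are hypothetical placeholders, not the paper's actual prompts or code.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around a GPT-4 API call; replace with a real client."""
    raise NotImplementedError

def plan_diagram(text_prompt: str, max_rounds: int = 3) -> str:
    """Sketch of the stage-1 planner/auditor feedback loop (assumed control flow)."""
    plan = call_llm(f"Create a diagram plan (entities, relationships, layout) for: {text_prompt}")
    for _ in range(max_rounds):
        feedback = call_llm(
            f"Check this diagram plan against the prompt '{text_prompt}' "
            f"and list any errors or inconsistencies:\n{plan}"
        )
        if "no issues" in feedback.lower():  # assumed stopping condition
            break
        plan = call_llm(f"Revise the diagram plan using this feedback:\n{feedback}\n\nPlan:\n{plan}")
    return plan
```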

II. Diagram generation. In this stage, DiagrammerGPT creates the diagram following the diagram plan. The model uses two key components: (1) DiagramGLIGEN, which converts the diagram plan into a visual image, and (2) a text label rendering module, which draws clear and readable text labels on the diagram.

The second stage of DiagrammerGPT: diagram generation (source: paper)
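
As a rough illustration of the second component, the text label rendering module can be thought of as drawing each planned label directly onto the generated image, instead of asking the diffusion model to synthesize readable text. The snippet below is a simplified sketch using Pillow; the function name, the label/layout inputs, and the box format are assumptions, not the paper's implementation.

```python
from PIL import Image, ImageDraw, ImageFont

def render_text_labels(image: Image.Image, labels: dict[str, str], layout: dict[str, tuple]) -> Image.Image:
    """Draw each planned text label onto the diagram image (simplified sketch).

    `labels` maps entity ids to label strings; `layout` maps entity ids to
    (x, y, w, h) boxes in normalized coordinates, as in the plan sketch above.
    """
    draw = ImageDraw.Draw(image)
    font = ImageFont.load_default()
    width, height = image.size
    for entity_id, text in labels.items():
        x, y, _, _ = layout[entity_id]
        draw.text((x * width, y * height), text, fill="black", font=font)
    return image
```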

DiagramGLIGEN is based on the GLIGEN architecture, which adds gated self-attention layers to the Stable Diffusion v1.4 model for layout grounding. Unlike the original GLIGEN model, which is trained on natural images and grounds only objects, DiagramGLIGEN is specialized for diagrams: it is trained on the AI2D-Caption diagram dataset, which provides captions and annotations for many types of diagrams.
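
For readers unfamiliar with GLIGEN, a gated self-attention layer lets the otherwise frozen diffusion backbone attend to extra grounding tokens, here the layout entries from the diagram plan, with the result scaled by a learnable gate that starts at zero. The module below is a simplified sketch of that mechanism, not the actual DiagramGLIGEN code.

```python
import torch
import torch.nn as nn

class GatedSelfAttention(nn.Module):
    """Simplified sketch of GLIGEN-style gated self-attention.

    Visual tokens attend jointly to themselves and to grounding tokens
    (e.g., encoded layout boxes); the result is added back through a
    learnable gate initialized to zero, so training starts from the
    unmodified Stable Diffusion behavior.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # gate starts closed

    def forward(self, visual_tokens: torch.Tensor, grounding_tokens: torch.Tensor) -> torch.Tensor:
        # Attend over the concatenation of visual and grounding tokens,
        # but keep only the outputs at the visual positions.
        tokens = torch.cat([visual_tokens, grounding_tokens], dim=1)
        attended, _ = self.attn(tokens, tokens, tokens)
        update = attended[:, : visual_tokens.shape[1], :]
        return visual_tokens + torch.tanh(self.gate) * update
```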

Evaluation

To evaluate the performance of the model, the team introduces the AI2D-Caption dataset, which is built on top of the AI2D dataset and provides dense annotations for each diagram (e.g., object descriptions and text label-object linkages). This dataset enables DiagrammerGPT to learn from diverse and rich examples of diagrams and their descriptions.
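
To give a sense of what such dense annotations look like, here is a hypothetical record for a single diagram; the field names and content are illustrative assumptions rather than the dataset's actual schema.

```python
# Hypothetical annotation record illustrating the kind of dense labels
# described for AI2D-Caption (field names are assumptions, not the real schema).
annotation = {
    "caption": "A diagram of the life cycle of a frog.",
    "objects": [
        {"id": "o0", "description": "frog eggs", "box": (0.10, 0.10, 0.20, 0.15)},
        {"id": "o1", "description": "tadpole", "box": (0.55, 0.15, 0.20, 0.15)},
        {"id": "o2", "description": "adult frog", "box": (0.35, 0.65, 0.25, 0.20)},
    ],
    "text_labels": [
        {"text": "eggs", "linked_object": "o0"},
        {"text": "tadpole", "linked_object": "o1"},
    ],
    "relationships": [
        {"from": "o0", "to": "o1", "type": "arrow"},
        {"from": "o1", "to": "o2", "type": "arrow"},
    ],
}
```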

The team also provides a comprehensive analysis of DiagrammerGPT, covering open-domain diagram generation, vector graphic diagrams on different platforms, human-in-the-loop diagram plan editing, and multimodal planner/auditor LLMs (e.g., GPT-4Vision).

DiagrammerGPT was compared with three other models (Stable Diffusion v1.4, VPGen, and AutomaTikZ) in generating diagrams from text using various evaluation methods. The results of this comparison are presented in the table below.

DiagrammerGPT outperforms other models on all metrics. We can also see that the other models improve their performance when they are fine-tuned on a diagram dataset.

Comparison of DiagrammerGPT to existing text-to-image generation baseline models (source: project page)

Below you can see the results of the human evaluation. The table shows how well DiagrammerGPT and Stable Diffusion v1.4 align the images and texts and capture the object relationships.

Human evaluation of pairwise preferences between DiagrammerGPT and Stable Diffusion v1.4 (source: project page)

DiagrammerGPT was preferred over Stable Diffusion v1.4 on both criteria: image-text alignment (36% vs. 20%) and object relationships (48% vs. 30%).

Conclusion

DiagrammerGPT is a new framework that can generate diagrams from text prompts using LLMs, outperforming the existing text-to-image (T2I) models.

It consists of a planner LLM that generates a diagram plan, an auditor LLM that checks and refines the diagram plan, and a diagram generator module that draws the diagram image.

DiagrammerGPT can create many kinds of diagrams, such as flowcharts, mind maps, and electrical circuits, and can output them in different formats, including vector graphics.
