TinyChart, a powerful AI that understands charts

May 6, 2024

TinyChart is an open-source multimodal large language model specifically designed for chart understanding. With a relatively small size (3B parameters), it can perform various analytical tasks, such as answering questions based on chart content, generating textual chart descriptions, and converting visual chart data into a tabular format.

The model was proposed by a research team from Renmin University of China and Alibaba Group.

TinyChart achieves state-of-the-art performance on various chart understanding benchmarks, such as ChartQA and Chart-to-Text. It also delivers higher inference throughput than models with significantly more parameters, such as ChartLlama and ChartAst. This is accomplished through two approaches: the Program-of-Thoughts (PoT) learning strategy, which makes numerical computation easier to learn, and the Visual Token Merging module, which shortens the long visual token sequences produced by high-resolution images.

TinyChart-3B outperforms several 13B MLLMs on a variety of chart understanding benchmarks (left), while achieving larger inference throughput (right) (source: paper)

How to use TinyChart

TinyChart is open-source and currently available in its GitHub repository, where you can find the model, the inference code, and the script for setting up a local demo. You can also try its demo on Hugging Face.

The team intends to release the following in the near future:

Online demo on ModelScope
Evaluation code
Training data and code

Related work on chart understanding

A model truly understands a chart when it can extract the information within (data, relationships, and trends) and perform specific tasks, such as question-answering (QA), summarization, and re-rendering (transforming the chart’s content or visual style into alternative formats, such as tables, textual descriptions, or another chart type).

Early approaches relied on existing tools like Optical Character Recognition (OCR) and component detectors to convert the chart into a table or text format. After extracting the information, they employed language models to analyze it and complete the desired tasks. Unfortunately, this approach was prone to error propagation. Any mistakes made in early stages could be carried through the entire process, leading to unreliable results.

Recent advancements in chart understanding have shifted to a holistic strategy, employing “end-to-end methods” powered by multimodal large language models (MLLMs). However, despite their better performance, MLLMs require a significant number of parameters and substantial computational power, making them unsuitable for use with limited resources.

TinyChart – the model

TinyChart addresses these challenges by introducing two key innovations:

the Program-of-Thoughts (PoT) learning strategy: the model is trained to generate Python programs for numerical calculations, thereby reducing the burden of learning complex numerical computations. The arithmetic itself is then carried out by executing the generated program rather than by the model.
the Visual Token Merging module: it progressively merges similar visual tokens, significantly reducing the length of the visual feature sequences produced by the vision transformer.
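
To make the Program-of-Thoughts idea more concrete, here is a minimal sketch of what a PoT-style answer could look like and how it might be executed. The question, the chart values, and the helper name run_pot_answer are hypothetical and are not taken from the paper; the sketch only illustrates that the model writes a short Python program and the interpreter does the arithmetic.

```python
# Hypothetical question: "What is the difference between the highest and
# the lowest value in the chart?"
# Hypothetical program text, written the way a PoT-trained model might answer.
pot_answer = """
values = [42.5, 37.0, 58.25, 49.0]  # values the model read off the chart
highest = max(values)
lowest = min(values)
answer = highest - lowest
"""


def run_pot_answer(program: str):
    """Execute a generated program and return its `answer` variable."""
    namespace = {}
    # exec runs arbitrary code: only use it on trusted or sandboxed programs.
    exec(program, namespace)
    return namespace["answer"]


result = run_pot_answer(pot_answer)
print(result)
```

In this setup, a wrong final number can come from misreading the chart values but not from faulty arithmetic, which is the part PoT takes off the model's shoulders.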

TinyChart follows the typical MLLM design with three components: a vision transformer encoder, a vision-language connector, and an LLM (see the next picture).

The Visual Token Merging process is shown in the next figure. Inside the vision transformer layers that turn the chart image into visual features, the model applies Visual Token Merging to shorten the feature sequence. It divides the tokens (the feature vectors of image patches) into two sets, finds for each token in one set its most similar token in the other set, and keeps only the top r most similar connections (r=2 in this example). Each pair of connected tokens is merged into a single token (3&4 and 6&7 in the figure).

(a) Vision transformer layer with Visual Token Merging. (b) Process of the Visual Token Merging (source: paper)
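
The merging step described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function names, the alternating split into two sets, the use of cosine similarity, and merging by simple averaging are all assumptions made for the example, and tokens are assumed to be non-zero vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norms = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norms


def merge_tokens(tokens, r):
    """Merge the r most similar cross-set token pairs (hypothetical helper)."""
    # Assumption: split the sequence into two sets by alternating positions.
    set_a, set_b = tokens[0::2], tokens[1::2]

    # For every token in set A, find its most similar partner in set B.
    edges = []
    for i, u in enumerate(set_a):
        sims = [cosine(u, v) for v in set_b]
        j = max(range(len(set_b)), key=sims.__getitem__)
        edges.append((sims[j], i, j))

    # Keep only the r strongest connections.
    edges.sort(reverse=True)
    merged = {i: j for _, i, j in edges[:r]}

    # Assumption: a merged token is the average of the tokens in its group.
    groups = [[v] for v in set_b]
    for i, j in merged.items():
        groups[j].append(set_a[i])
    merged_b = [[sum(col) / len(g) for col in zip(*g)] for g in groups]

    # Unmerged A tokens are kept as they are (original order is not preserved).
    kept_a = [list(u) for i, u in enumerate(set_a) if i not in merged]
    return kept_a + merged_b


tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
merged = merge_tokens(tokens, r=2)
print(len(tokens), "->", len(merged))  # 6 tokens in, 4 tokens out
```

Each call removes r tokens, so applying the step in several layers shortens the sequence progressively while keeping the tokens that have no close match.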

Implementation details

TinyChart builds on TinyLLaVA, using SigLIP as the vision encoder and Phi-2 as the language model. The vision encoder's initial input resolution is 384×384; it has been increased to 512×512 and 768×768 to capture more detail.
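
A quick calculation shows why higher resolutions make the visual sequence length a concern. The sketch below assumes a vision transformer that cuts the image into 14×14-pixel patches and ignores any leftover border pixels; the patch size is an assumption and is not stated in this article, so the exact counts may differ for the real encoder.

```python
PATCH_SIZE = 14  # assumption: 14x14-pixel patches, not stated in the article


def num_visual_tokens(resolution: int, patch_size: int = PATCH_SIZE) -> int:
    """Number of patch tokens for a square image of the given side length."""
    per_side = resolution // patch_size  # leftover border pixels are ignored
    return per_side * per_side


for res in (384, 512, 768):
    n = num_visual_tokens(res)
    # Self-attention compares every token with every other token: n * n pairs.
    print(f"{res}x{res}: {n} tokens, {n * n:,} attention pairs per layer")
```

Under this assumption, doubling the side length from 384 to 768 multiplies the token count by four and the pairwise attention work by sixteen, which is the cost that token merging is meant to reduce.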

The team applied multitask learning, a training strategy in which the model is taught to handle multiple tasks simultaneously. It’s like studying different subjects at the same time to improve the overall learning efficiency.

The composition of the training dataset is shown in the following table.

Dataset	Benchmark	Samples
Chart question answering
ChartQA	✓	28,299
ChartQA-PoT	✓	140,584
PlotQA		157,070
DVQA		200,000
OpenCQA		5,407
Chart-to-text generation
Pew	✓	7,892
Statista	✓	29,589
OpenCQA		5,407
VisText		11,171
ChartSumm		75,255
Chart2Text-8k		7,862
Chart-to-table generation
ChartQA	✓	19,373
PlotQA		190,720
Chart2Text-8k		8,305
DVQA		300,000
Statista		29,589
Chart instruction following
ChartLlama		148,398
Total		1,364,921

Datasets used for training TinyChart (source: paper)
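
To see how the samples are distributed across the four task groups, the table can be summed per task. The sketch below only re-enters the numbers from the table above; the variable names are arbitrary.

```python
# Sample counts copied from the table above, grouped by task.
training_data = {
    "Chart question answering": {
        "ChartQA": 28_299, "ChartQA-PoT": 140_584, "PlotQA": 157_070,
        "DVQA": 200_000, "OpenCQA": 5_407,
    },
    "Chart-to-text generation": {
        "Pew": 7_892, "Statista": 29_589, "OpenCQA": 5_407,
        "VisText": 11_171, "ChartSumm": 75_255, "Chart2Text-8k": 7_862,
    },
    "Chart-to-table generation": {
        "ChartQA": 19_373, "PlotQA": 190_720, "Chart2Text-8k": 8_305,
        "DVQA": 300_000, "Statista": 29_589,
    },
    "Chart instruction following": {"ChartLlama": 148_398},
}

subtotals = {task: sum(d.values()) for task, d in training_data.items()}
total = sum(subtotals.values())

for task, subtotal in subtotals.items():
    print(f"{task}: {subtotal:,} samples ({subtotal / total:.1%})")
print(f"Total: {total:,}")
```

The per-task sums add up to the total reported in the table, and they show that question answering and chart-to-table extraction account for most of the training samples.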

To support PoT learning on chart understanding, the team constructed the ChartQA-PoT dataset, which is derived from the training subset of the ChartQA dataset. ChartQA-PoT contains 140,584 (question, PoT answer) pairs. Each PoT answer consists of multiple lines of Python code. They employed two approaches for constructing (question, PoT answer) pairs: Template-based PoT and GPT-based PoT.
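
As a rough illustration of the template-based approach, the sketch below fills one question template and one matching code template with values from a chart's data table. The templates, the helper name build_pairs, and the chart data are all made up for this example; the article does not list the templates the authors used, and the GPT-based approach is not shown.

```python
from itertools import combinations

# Hypothetical templates: one natural-language question and the program
# that answers it. Placeholders are filled from the chart's data table.
QUESTION_TEMPLATE = "What is the difference between {x} and {y}?"
CODE_TEMPLATE = (
    "values = {values!r}\n"
    "answer = values[{x!r}] - values[{y!r}]\n"
)


def build_pairs(table: dict) -> list:
    """Create (question, PoT answer) pairs for every pair of table entries."""
    pairs = []
    for x, y in combinations(table, 2):
        question = QUESTION_TEMPLATE.format(x=x, y=y)
        program = CODE_TEMPLATE.format(values=table, x=x, y=y)
        pairs.append((question, program))
    return pairs


chart_table = {"2021": 120, "2022": 150, "2023": 135}  # made-up chart data
pairs = build_pairs(chart_table)

question, program = pairs[0]
print(question)
print(program)
```

Because both the question and the program come from the same template and the same table, every generated pair is consistent by construction, which makes this kind of data cheap to produce at scale.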

The entire training process takes 3 days on 32 Tesla V100 GPUs, each with 32 GB of VRAM.

Evaluation

Extensive experiments show that TinyChart achieves state-of-the-art performance on a variety of chart understanding benchmarks, including ChartQA, Chart-to-Text, Chart-to-Table, OpenCQA, and ChartX. Remarkably, it outperforms several MLLMs with up to 13 billion parameters, such as ChartLlama and ChartAst, and even the closed-source general-purpose MLLM GPT-4V on ChartQA.

Beyond its accuracy, TinyChart is also efficient. Due to its smaller scale and optimized vision encoding, it achieves higher inference throughput. This translates to quicker processing times and the potential for real-time applications.

Case study

The pictures below showcase TinyChart performing various tasks such as chart question answering, chart redrawing, and chart-to-table extraction.

Case studies on ChartQA. TinyChart@768 is compared with ChartLlama (source: paper)

Examples of chart redrawing, showing the image generated after executing the Python code produced by the model. The failure case is marked with a red bounding box (source: paper)

Examples of chart-to-table extraction with TinyChart@768. The wrong values produced by the model are marked in red (source: paper)

These examples illustrate TinyChart’s effectiveness on these tasks, along with some of its remaining errors, despite the model having only 3B parameters.

Conclusion

TinyChart is a model specifically designed for chart understanding. Its ability to quickly understand and draw conclusions from charts can accelerate the pace of research and foster new discoveries.

In the business world, TinyChart allows for the rapid interpretation of market trends, financial reports, and consumer data, all of which are often encapsulated in complex charts. In academia, researchers can leverage TinyChart to analyze data from various fields, ranging from sciences to the humanities.

By making the code and model publicly available, its developers have provided a valuable resource to the open research community.