IBM’s Docling converts PDFs into other digital formats

December 2, 2024

Docling is an open-source, easy-to-use Python library that converts PDF documents into machine-readable formats such as JSON or Markdown. It runs efficiently on consumer-grade hardware, and its API lets you integrate and customize it for a variety of document conversion tasks.

Docling has the following AI capabilities:

Recognizes page layouts, reading order, table structures and figures.
Extracts metadata like titles, authors, and references.
Applies optional OCR for scanned PDFs.

Docling is flexible about hardware. It runs entirely locally on CPUs and on accelerators such as GPUs and MPS (Apple's Metal Performance Shaders).

Docling is customizable. The pipeline is modular, allowing for easy extension with custom models or features.

It integrates easily with LlamaIndex and LangChain for building RAG (retrieval-augmented generation) applications on top of PDF content. Docling converts your PDFs into a structured format that these frameworks can ingest, so together they turn a pile of documents into an assistant that can answer questions about them.
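To make the hand-off to a RAG framework more concrete, here is a minimal sketch of one common preparation step: splitting converted Markdown into heading-based chunks before indexing. It uses only the Python standard library and is not the LlamaIndex or LangChain API; the function name `split_markdown_by_heading` and the sample text are hypothetical, chosen for illustration.

```python
import re


def split_markdown_by_heading(markdown: str) -> list[dict]:
    """Split Markdown text into chunks, one per heading section (hypothetical helper)."""
    chunks = []
    title, lines = "", []

    def flush():
        # Store the current section only if it has a title or non-blank text.
        if title or any(line.strip() for line in lines):
            chunks.append({"title": title, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()  # a new heading closes the previous section
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks


# Illustrative input standing in for Markdown produced by a PDF converter.
sample = """## Introduction
Docling converts PDFs.

## Method
It parses pages, then runs models.
"""

chunks = split_markdown_by_heading(sample)
for chunk in chunks:
    print(chunk["title"], "->", chunk["text"])
```

Each chunk keeps its section title, which a retrieval step could later use as context. Real frameworks offer their own document loaders and splitters, which may chunk differently.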

Here is an example of the DocLayNet paper from arXiv, converted into Markdown format by Docling:

An example of Docling’s output in Markdown: left PDF, right rendered Markdown (source: Technical Report)

Why convert PDFs into machine-readable formats?

PDFs are widely used for sharing documents due to their fixed layout and compatibility across devices. However, they are primarily designed for human readability, making them less suitable for computational tasks. Converting PDFs into machine-readable formats is important for many reasons, including:

Automation: Many business workflows rely on automated processing of digital documents. PDFs can become a bottleneck since they require manual intervention for extracting information.
Training machine learning models: Machine learning models require structured data for training and prediction tasks. PDFs, often containing unstructured text, need to be converted into formats that machines can easily process.

Converting PDFs into machine-readable formats bridges the gap between human-readable documents and the needs of modern digital workflows.

Core technology

PDF backends: Docling offers multiple backend options for PDF parsing. The default backend is an open-source custom PDF parser called docling-parse, which is built on the qpdf library. There is also a backup PDF backend based on pypdfium for cases where specific font encoding issues arise.
A layout analysis model: An object detector based on RT-DETR, trained on DocLayNet and additional proprietary datasets, that identifies the bounding boxes and classes of elements in page images.
TableFormer: A vision-transformer-based model that recognizes complex table structures.
OCR: Incorporates EasyOCR for text recognition in scanned PDFs or embedded images.

Pipeline

Docling has a linear pipeline of operations that processes each document sequentially. It consists of three main steps, in this order:

Parse the PDF to extract programmatic text tokens (the string content and its coordinates on the page), while also rendering a bitmap image of each page to support subsequent operations.
Process each page through the AI models to extract layout elements and table structures.
Aggregate results across pages and apply post-processing to organize metadata, identify the document language, determine the reading order, and assemble a structured document.

Docling’s default processing pipeline (source: Technical Report)
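The three-step flow (parse, process each page, aggregate) can be sketched in plain Python. This is an illustrative toy, not Docling's implementation: the `Page` class, the function names, the uppercase-means-title rule, and the top-left coordinate origin are all assumptions made so the example is self-contained.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    number: int
    tokens: list  # (text, (x, y)) pairs; y grows downward (assumed convention)
    elements: list = field(default_factory=list)


def parse_pdf(raw_pages):
    """Step 1: turn raw per-page token lists into Page objects."""
    return [Page(number=i + 1, tokens=tokens) for i, tokens in enumerate(raw_pages)]


def run_models(page):
    """Step 2: stand-in for the layout model; labels each token as an element."""
    page.elements = [
        {"label": "title" if text.isupper() else "text", "text": text, "x": x, "y": y}
        for text, (x, y) in page.tokens
    ]
    return page


def assemble(pages):
    """Step 3: aggregate pages and order elements top-to-bottom within each page."""
    ordered = []
    for page in sorted(pages, key=lambda p: p.number):
        for element in sorted(page.elements, key=lambda e: (e["y"], e["x"])):
            ordered.append({"page": page.number, **element})
    return {"num_pages": len(pages), "elements": ordered}


def convert(raw_pages):
    """Run the three steps strictly in sequence."""
    pages = parse_pdf(raw_pages)
    pages = [run_models(page) for page in pages]
    return assemble(pages)


# Toy input: two pages of already-extracted tokens (illustrative values).
raw_pages = [
    [("Body text on page one.", (72.0, 140.0)), ("DOCLING", (72.0, 80.0))],
    [("More text on page two.", (72.0, 90.0))],
]
document = convert(raw_pages)
for element in document["elements"]:
    print(element["page"], element["label"], element["text"])
```

The point of the sketch is the ordering of the stages: page-level models only run once tokens exist, and reading order is decided last, when all pages are available.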

Evaluation

Docling was tested on three papers from arXiv and two IBM Redbooks, 225 pages in total. The runtime characteristics are presented in the next table.

Runtime characteristics with the standard model pipeline and settings (source: Technical Report)

While GPU acceleration was not fully tested, the system already performs well on CPUs.
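If you want to take similar measurements on your own documents, a minimal timing harness might look like the sketch below. It assumes throughput is expressed in pages per second; `measure_throughput` and `fake_convert` are hypothetical names, and the dummy converter only stands in for a real conversion call. No figures from the Technical Report are reproduced here.

```python
import time


def measure_throughput(convert, documents):
    """Time a conversion callable over (source, page_count) pairs.

    `convert` is any callable taking a source; `documents` is a list of
    (source, page_count) tuples. Both are assumptions of this sketch.
    """
    start = time.perf_counter()
    total_pages = 0
    for source, page_count in documents:
        convert(source)
        total_pages += page_count
    elapsed = time.perf_counter() - start
    return {
        "pages": total_pages,
        "seconds": elapsed,
        # Guard against a zero interval on very fast dummy workloads.
        "pages_per_second": total_pages / elapsed if elapsed > 0 else float("inf"),
    }


def fake_convert(source):
    """Dummy stand-in for a real PDF conversion; does trivial work."""
    return source.upper()


stats = measure_throughput(fake_convert, [("a.pdf", 10), ("b.pdf", 20)])
print(f"{stats['pages']} pages in {stats['seconds']:.6f} s")
```

Swapping `fake_convert` for a real conversion function gives wall-clock numbers for your own hardware, which is what matters when comparing CPU and accelerator runs.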

How to use

Docling is free and open source under the MIT license. You can install the package directly from PyPI. For detailed guidance and examples, check the GitHub repository. Below is a simple usage example from the Technical Report to help you get started:

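from docling.document_converter import DocumentConverter

# Note: convert_single() and render_as_markdown() are the names used in the
# Technical Report (Docling v1). In Docling v2 the equivalent calls are
# converter.convert(source) and result.document.export_to_markdown().
source = "https://arxiv.org/pdf/2206.01062"  # PDF path or URL
converter = DocumentConverter()
result = converter.convert_single(source)
print(result.render_as_markdown())  # output: "## DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis [...]"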

You can customize the pipeline and runtime settings to enable/disable specific features (such as OCR or table structure recognition), set limits on input document size, and define the CPU thread allocation.
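As one concrete example of a runtime setting, the Technical Report describes fixing the CPU thread budget through the OMP_NUM_THREADS environment variable. A minimal sketch, assuming a budget of 4 threads (an arbitrary value for illustration); the variable typically has to be set before the numerical libraries are imported for it to take effect:

```python
import os

# Set the thread budget before importing libraries that read it at import time.
# "4" is an arbitrary illustrative value, not a recommendation.
os.environ["OMP_NUM_THREADS"] = "4"

thread_budget = int(os.environ["OMP_NUM_THREADS"])
print(f"Thread budget: {thread_budget}")
```

The same effect can be had by exporting the variable in the shell before launching Python. Options for toggling OCR or table structure recognition are set through Docling's own pipeline options, which are documented in the GitHub repository and are not shown here.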

Conclusion

Docling is a powerful PDF conversion tool, offering a balance of quality, performance, and flexibility. It is open-source, fast and easy to use. In the future, the team plans to extend Docling with additional models, such as a figure-classifier, an equation-recognition model, and a code-recognition model.