Thousands of free and open audiobooks using synthetic speech from Project Gutenberg, Microsoft, and MIT

September 19, 2023

A research team from Project Gutenberg, Microsoft, and MIT has developed a system that can generate thousands of free and open audiobooks from the Project Gutenberg e-book collection. The system combines recent advances in neural text-to-speech technology with scalable computing to create human-quality audiobooks.

The goal of this project is to enable book lovers everywhere to enjoy literature more easily and freely by providing them with high-quality audiobooks. The system automatically selects the relevant text from each e-book and can process thousands of books in parallel using a scalable machine learning framework.

You can listen to the audiobooks in the Project Gutenberg Open Audiobook Collection.

The pipeline

The pipeline creates audiobooks for thousands of free e-books from Project Gutenberg. These e-books have different formats, such as PDF, EPUB, or HTML, but the pipeline focuses on the HTML format.

1. Clean up the text. The e-books differ in style and content. Some contain text that is not relevant to audio listeners, such as page numbers, footnotes, pictures, or annotations. The researchers removed these elements because they would disrupt the audiobooks’ continuity.

They applied a mix of automatic and hand-crafted HTML features to extract the essential aspects of each e-book’s HTML code and create a high-quality subset of e-books.
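As a rough illustration of this cleanup step, the sketch below strips elements that would interrupt narration and keeps the remaining text. It is not the researchers' code: the tag and class names (pagenum, footnote, caption, and so on) are assumptions chosen for the example, and it assumes reasonably well-nested HTML.

```python
from html.parser import HTMLParser

# Hypothetical choices for illustration only: these tags and class names are
# assumptions, not the rules the researchers used.
SKIP_TAGS = {"script", "style", "table", "sup"}
SKIP_CLASSES = {"pagenum", "footnote", "caption"}
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class AudioTextExtractor(HTMLParser):
    """Collects readable text while skipping elements unsuited to narration."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements have no end tag, so they never nest
        classes = set((dict(attrs).get("class") or "").split())
        if self.skip_depth or tag in SKIP_TAGS or classes & SKIP_CLASSES:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def extract_audio_text(html):
    """Returns the narratable text of an HTML fragment on a single line."""
    parser = AudioTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

Calling extract_audio_text on a chapter's HTML returns plain text with page markers, footnote bodies, and images dropped. A real collection would likely need rules tuned to each group of similarly formatted books.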

2. Cluster and normalize the text. These features were used to cluster the e-books based on their similarity and structure (see the figure below). For example, one cluster might contain e-books that have a table of contents at the beginning; another cluster might include e-books that have footnotes at the end, and so on.

The team used these clusters to build a rule-based HTML normalizer that could convert the most common types of e-books into a standard format that was easily parsed by computers.

t-SNE (t-distributed stochastic neighbor embedding) representation of clustered e-books. Colored areas represent uniformly formatted clusters of books. (Source: paper)
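To make the idea of grouping books by HTML structure concrete, here is a minimal sketch. It is a simplified stand-in, not the method from the paper: it uses raw tag counts as features and a greedy cosine-similarity grouping, and the 0.8 threshold is an arbitrary assumption.

```python
import math
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how often each HTML tag opens in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def tag_features(html):
    """Returns a Counter mapping tag name to number of occurrences."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return counter.counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def greedy_cluster(books, threshold=0.8):
    """Assigns each book to the first cluster whose seed book it resembles."""
    clusters = []  # list of (seed_features, [book names])
    for name, html in books.items():
        feats = tag_features(html)
        for seed, members in clusters:
            if cosine(seed, feats) >= threshold:
                members.append(name)
                break
        else:
            # No existing cluster is similar enough, so start a new one.
            clusters.append((feats, [name]))
    return [members for _, members in clusters]
```

Each resulting group could then get its own normalization rules, which is the role the rule-based HTML normalizer plays in the description above.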

3. Text-to-speech. Once an e-book is parsed, the system extracts a stream of plain text from it and feeds that text to a neural text-to-speech model, which generates natural and expressive voices. The system can then customize the voice’s speed, style, emotion, and identity based on the user’s preferences or a sample audio clip.
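One common way to pass settings such as speaking rate to a text-to-speech engine is SSML, the W3C Speech Synthesis Markup Language. The article does not say how the researchers specify these settings, so the sketch below is only an assumption about what such a request might look like. The function name and the default voice name are illustrative.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, voice="en-US-JennyNeural", rate="medium", lang="en-US"):
    """Wraps plain text in a minimal SSML document with a speaking rate.

    The voice name is an illustrative assumption; available voices depend
    on the text-to-speech service being used.
    """
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>"
        # escape() protects characters such as & and < in the book text.
        f"<prosody rate={quoteattr(rate)}>{escape(text)}</prosody>"
        "</voice></speak>"
    )
```

For example, build_ssml("Call me Ishmael.", rate="slow") yields a document that asks the engine to read the sentence slowly. Style and emotion controls tend to be vendor-specific extensions and are left out here.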

In short, the system first breaks each e-book down into individual chapters. Each chapter then passes through three steps: text normalization, speech synthesis, and audio post-processing.
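The chapter-by-chapter flow can be sketched as follows. This is a toy outline under stated assumptions, not the actual pipeline: the chapter-heading pattern and abbreviation list are invented for the example, and synthesize and post_process are placeholders that return dummy numbers instead of calling a real speech model.

```python
import re

# Assumption: chapters begin with a line such as "CHAPTER I" or "Chapter 2".
CHAPTER_HEADING = re.compile(r"^chapter[ \t]+\w+.*$", re.IGNORECASE | re.MULTILINE)


def split_chapters(text):
    """Splits a book into (heading, body) pairs at chapter headings."""
    matches = list(CHAPTER_HEADING.finditer(text))
    chapters = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chapters.append((m.group().strip(), text[m.end():end].strip()))
    return chapters


def normalize_text(body):
    """Step 1: collapse whitespace and expand a few abbreviations."""
    body = " ".join(body.split())
    for short, full in (("Mr.", "Mister"), ("Dr.", "Doctor")):
        body = body.replace(short, full)
    return body


def synthesize(text):
    """Step 2: placeholder for a text-to-speech call (returns dummy samples)."""
    return [len(word) for word in text.split()]


def post_process(samples):
    """Step 3: placeholder post-processing that pads both ends with silence."""
    return [0] + samples + [0]


def make_audiobook(text):
    """Runs every chapter through the three steps, keyed by chapter heading."""
    return {
        heading: post_process(synthesize(normalize_text(body)))
        for heading, body in split_chapters(text)
    }
```

Because each chapter is processed independently, the chapters of many books can be handled in parallel, which is what makes the approach suitable for a distributed framework.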

The system uses SynapseML, a tool that helps to create and manage large-scale machine learning pipelines.

Potential impact of the system

The system has the potential to revolutionize the way audiobooks are made, because narration can be generated automatically and for thousands of books at once.

This could have a significant impact on the publishing industry and on how people consume books. Listeners can enjoy books while traveling, working out, or doing other activities.