Word-As-Image is a novel and creative approach to semantic typography, in which the letters of a word are transformed into images that reflect the word’s meaning.
For example, the word “jazz” can be illustrated by making the letter “J” look like a saxophone, or the word “fire” can be illustrated by making the letters look like flames.
This is a difficult task because it requires understanding the meaning of the word and finding a creative way to express that meaning visually. The research team used a pretrained Stable Diffusion model and produced simple, concise, black-and-white designs that convey the meaning clearly.
Word-As-Image can be used to make catchy and interesting headlines for websites and blogs, infographics, custom fonts for branding and marketing materials, or educational and fun materials that are easy to read and understand.
The following video shows some examples of word-as-image illustrations for different words and fonts.
Source: project page
The figure below shows how the method works to create a word-as-image illustration. The ultimate goal is to create an artistic representation that effectively conveys the meaning of the word while maintaining a balance between visual aesthetics and readability.
It takes an input letter (“S”) and a concept (“Surfing”) and tries to change the shape of the letter to match the concept.
The letter is represented by a set of control points (P) that define its outline (𝑙𝑖). These points are moved to reshape the letter toward an image related to the concept (“Surfing”).
The algorithm deforms the letter by iteratively optimizing the new point positions (Pˆ) of the deformed letter (𝑙ˆi). At each iteration, the set (Pˆ) is fed into a differentiable rasterizer (DiffVG), which outputs the rasterized deformed letter (𝑙ˆi).
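This iterative point optimization can be sketched in a few lines. The snippet below is a toy stand-in, not the authors’ implementation: the real method backpropagates a diffusion-based loss through the DiffVG rasterizer, whereas here a simple point-to-target squared-distance loss (with made-up target positions) keeps the example runnable with the standard library alone.

```python
# Toy sketch of the iterative outline optimization (illustrative only).
# Word-As-Image backpropagates an SDS loss through DiffVG; here we
# substitute a simple point-to-target loss so the loop is self-contained.

def step(points, targets, lr=0.1):
    """One gradient step moving control points toward target positions.

    The gradient of the squared-distance loss 0.5*||p - t||^2 w.r.t. p
    is (p - t), so the update is p <- p - lr * (p - t).
    """
    return [
        (px - lr * (px - tx), py - lr * (py - ty))
        for (px, py), (tx, ty) in zip(points, targets)
    ]

def optimize(points, targets, iters=200, lr=0.1):
    """Repeatedly nudge the control points; returns the optimized set."""
    for _ in range(iters):
        points = step(points, targets, lr)
    return points

# Original letter outline (P) and hypothetical concept-driven targets.
P = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
T = [(0.2, 0.1), (0.9, 0.3), (1.1, 1.2)]
P_opt = optimize(P, T)
```

In the actual method the “targets” never exist explicitly; the gradient signal comes from the diffusion model at every iteration, but the update loop has the same shape.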
The rasterized deformed letter (𝑙ˆi) is then passed into a pretrained, frozen Stable Diffusion model that was trained to generate images from text. Conditioned on a concept prompt such as “jazz”, the model favors imagery like a saxophone or a musical note, and its predictions are used to guide the letter shape deformation.
It drives the letter shape to convey the semantic concept using the SDS (score distillation sampling) loss, which measures how well the rasterized deformed letter matches the concept according to the diffusion model.
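The SDS loss follows the score distillation idea introduced in DreamFusion; a sketch of its gradient, with symbols as commonly used in that line of work (z is the latent of the augmented rasterized letter, y the concept prompt, ε the injected noise, ε_φ the frozen UNet’s noise prediction, and w(t) a timestep weighting), looks like:

```latex
\nabla_{\hat{P}}\,\mathcal{L}_{SDS}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\epsilon_\phi(z_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial z}{\partial \hat{P}}
    \right]
```

Intuitively, the frozen diffusion model “scores” how concept-like the current rasterization is, and that score is pushed back through the rasterizer to the control points Pˆ.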
The input letter (𝑙𝑖) and the deformed letter (𝑙ˆi) are also passed through a low-pass filter (LPF), which helps preserve the overall tone of the font style as well as the local letter shape.
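A minimal sketch of this tone-preservation idea, under two assumptions that are mine rather than the paper’s: a simple box blur stands in for the low-pass filter, and images are plain 2D lists of grayscale values in [0, 1].

```python
# Sketch of a tone-preservation comparison: blur both images, then
# measure how far apart the blurred versions are. (Box blur is an
# assumed stand-in for the paper's low-pass filter.)

def box_blur(img, k=1):
    """Average each pixel with its (2k+1)x(2k+1) neighborhood."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [
                img[yy][xx]
                for yy in range(max(0, y - k), min(h, y + k + 1))
                for xx in range(max(0, x - k), min(w, x + k + 1))
            ]
            out[y][x] = sum(vals) / len(vals)
    return out

def tone_loss(img_a, img_b, k=1):
    """Squared L2 distance between the low-pass filtered images."""
    a, b = box_blur(img_a, k), box_blur(img_b, k)
    return sum(
        (pa - pb) ** 2
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )

# Toy 3x3 "rasterizations" of an original and a deformed letter.
letter = [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
deformed = [[0.0, 0.5, 1.0], [0.5, 1.0, 1.0], [1.0, 1.0, 0.5]]
loss = tone_loss(letter, deformed)
```

Because both images are blurred first, small local edits barely change the loss, while large shifts in the distribution of dark and light areas are penalized.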
The sets (P) and (Pˆ) are passed through a Delaunay triangulation operator (D) to preserve the initial shape of the letter.
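The intuition behind the triangulation term can be illustrated with a small angle-comparison sketch. This is an as-conformal-as-possible-style penalty in spirit, not the paper’s exact formulation: it compares the interior angles of one corresponding triangle from the original and deformed triangulations.

```python
import math

# Sketch of an ACAP-style penalty: conformal maps preserve angles, so
# penalize angle deviations between original and deformed triangles.
# (Exact formulation differs in the paper; this shows the idea.)

def angles(tri):
    """Interior angles (radians) of a triangle given as three (x, y) points."""
    out = []
    for i in range(3):
        a, b, c = tri[i], tri[(i + 1) % 3], tri[(i + 2) % 3]
        v1 = (b[0] - a[0], b[1] - a[1])
        v2 = (c[0] - a[0], c[1] - a[1])
        dot = v1[0] * v2[0] + v1[1] * v2[1]
        out.append(math.acos(dot / (math.hypot(*v1) * math.hypot(*v2))))
    return out

def acap_penalty(tri_orig, tri_def):
    """Sum of squared angle deviations between the two triangles."""
    return sum((a - b) ** 2 for a, b in zip(angles(tri_orig), angles(tri_def)))

tri = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
scaled = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]   # uniform scale: angles kept
sheared = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]  # shear: angles change
```

Uniform scaling is conformal, so `acap_penalty(tri, scaled)` is essentially zero, while the sheared triangle is penalized; this is how the term lets letters grow and bend without losing their underlying structure.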
Stable Diffusion’s two neural networks, the Contrastive Language-Image Pre-Training (CLIP) text encoder and the denoising UNet, are used together to create word-as-image illustrations: CLIP encodes the concept prompt, and the UNet’s predictions drive the letter shape toward it, while the tone and shape-preservation terms keep the font style and the letter readable.
It repeats these steps many times until it finds a good balance between making the shape of the letter look like something related to the meaning of the word and keeping its readability and style.
The final output of the algorithm is the deformed letter (𝑙ˆi), which is an image that matches the meaning of the input word.
In short, the Word-As-Image framework takes a word, thinks about its meaning, changes the shape of letters to match that meaning, and turns it into an artistic picture.
The aim of the loss functions is to transform the letter shape to match the meaning of the word. The model uses three loss functions:
- SDS loss measures how well the shape of the letter conveys the meaning of the word.
- Tone loss measures how much the local tone (the distribution of dark and light areas) of the deformed letter deviates from the original, comparing low-pass filtered versions of both.
- As-conformal-as-possible constraint measures how much the angles and distances between the points that make up the letter change from their original values.
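Putting the three terms together, the overall objective has the familiar weighted-sum form (the λ weights below are generic tuning knobs, not values taken from the paper):

```latex
\mathcal{L} \;=\; \mathcal{L}_{SDS}
  \;+\; \lambda_{tone}\,\mathcal{L}_{tone}
  \;+\; \lambda_{acap}\,\mathcal{L}_{acap}
```

The first term pulls the letter toward the concept, and the other two pull it back toward the original font and structure; the balance between them is what keeps the result both expressive and legible.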
The authors defined three objectives for an effective word-as-image illustration. It should visually capture the given semantic concept, maintain readability, and preserve the original font’s characteristics.
To evaluate the performance of their method, the authors randomly selected 50 words from five common concept classes: animals, fruits, plants, sports, and professions. They then applied their method to each word in four different fonts: Quicksand, Bell MT, Noteworthy-Bold, and HobeauxRococeaux-Sherman.
The authors found that their method was able to successfully convey the given semantic concept in most cases while still remaining legible. In addition, their method was able to capture the font characteristics of the original word (see the pictures below).
Quantitative evaluation: To validate their results, the authors conducted a perceptual study with 40 participants. The participants were shown isolated letter illustrations from the Word-As-Image results, and asked to identify the concept, letter, and font that best matched the illustration.
The results of the perceptual study showed that the participants were able to correctly identify the concept and letter in most cases. In addition, the participants were able to correctly identify the font in 51% of cases, which is significantly better than random chance (25%).
Comparison: The authors compared their method to other approaches for generating word-as-image illustrations: Stable Diffusion (SD), SDEdit, DALL·E 2, DALL·E 2 with a letter-specific prompt, CLIPDraw, and Word-As-Image (see the picture below). The figure also shows the input letters used by some of the methods, such as SDEdit, CLIPDraw, and Word-As-Image.
The new method generated word-as-image illustrations that are more visually appealing and legible than the results of DALL·E 2 and CLIPDraw, and it also preserved the font characteristics of the input word.
Word-As-Image for semantic typography is a promising approach for turning words into images that represent their meanings. It can be used in conjunction with human input to create even more creative and interesting word-as-image illustrations.