SpaText: Spatio-Textual Representation for Controllable Image Generation

Meta AI (FAIR), The Hebrew University of Jerusalem, Reichman University

CVPR 2023

SpaText: a new method for text-to-image generation using open-vocabulary scene control.

Abstract

Recent text-to-image diffusion models are able to generate convincing results of unprecedented quality. However, it is nearly impossible to control the shapes of different regions/objects or their layout in a fine-grained fashion. Previous attempts to provide such controls were hindered by their reliance on a fixed set of labels. To this end, we present SpaText - a new method for text-to-image generation using open-vocabulary scene control. In addition to a global text prompt that describes the entire scene, the user provides a segmentation map where each region of interest is annotated by a free-form natural language description. Due to the lack of large-scale datasets that have a detailed textual description for each region in the image, we choose to leverage the current large-scale text-to-image datasets and base our approach on a novel CLIP-based spatio-textual representation, and show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based. In addition, we show how to extend the classifier-free guidance method in diffusion models to the multi-conditional case and present an alternative accelerated inference algorithm. Finally, we offer several automatic evaluation metrics and use them, in addition to FID scores and a user study, to evaluate our method and show that it achieves state-of-the-art results on image generation with free-form textual scene control.

Video

Method

We aim to provide the user with more fine-grained control over the generated image. In addition to a single global text prompt, the user also provides a segmentation map in which the content of each segment of interest is described using a local free-form text prompt.

However, current large-scale text-to-image datasets cannot be used for this task because they do not contain local text descriptions for each segment in the images. Hence, we need to develop a way to extract the objects in the image along with their textual description. To this end, we opt to use a pre-trained panoptic segmentation model along with a CLIP model.
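
To make this extraction step concrete, here is a minimal sketch in PyTorch-style Python; panoptic_model, clip_image_encode, and preprocess_segment are placeholders for the pretrained segmentation model, the CLIP image encoder, and the segment cropping scheme, not the exact pipeline from the paper.

import random
import torch

@torch.no_grad()
def extract_segment_embeddings(image, panoptic_model, clip_image_encode,
                               preprocess_segment, k=3):
    """Return up to k (mask, CLIP image embedding) pairs for one training image.

    panoptic_model:     assumed to return a list of boolean [H, W] masks,
                        one per detected segment
    clip_image_encode:  maps a preprocessed segment crop to its CLIP image embedding
    preprocess_segment: crops/masks the segment before encoding (placeholder)
    """
    masks = panoptic_model(image)                      # one mask per segment
    chosen = random.sample(masks, min(k, len(masks)))  # K random segments per image
    pairs = []
    for mask in chosen:
        crop = preprocess_segment(image, mask)         # isolate the segment
        pairs.append((mask, clip_image_encode(crop)))  # embed it with CLIP
    return pairs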


During training (left): given a training image x, we extract K random segments, pre-process them, and compute their CLIP image embeddings. We then stack these embeddings according to the shapes of their segments to form the spatio-textual representation ST. During inference (right): we embed the local prompts in the CLIP text embedding space, convert them to the CLIP image embedding space using the prior model P, and finally stack them according to the shapes of the input masks to form the spatio-textual representation ST.
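
A minimal sketch of how such a spatio-textual tensor can be assembled is shown below. It assumes pixels outside every segment are left as zeros and that overlapping segments simply overwrite one another; clip_text_encode and prior are placeholders for the actual encoders.

import torch

def build_spatio_textual(masks, embeddings, height, width, dim):
    """Stack per-segment CLIP embeddings into ST of shape [dim, H, W].

    masks:      K boolean tensors of shape [H, W]
    embeddings: K CLIP image embeddings of shape [dim]
    Pixels outside every segment stay zero; if segments overlap,
    later segments overwrite earlier ones (an assumption of this sketch).
    """
    st = torch.zeros(dim, height, width)
    for mask, emb in zip(masks, embeddings):
        st[:, mask] = emb.unsqueeze(1)   # broadcast the embedding over its mask
    return st

def build_st_for_inference(masks, local_prompts, clip_text_encode, prior,
                           height, width, dim):
    """At inference there are no ground-truth segments: each local prompt is
    embedded with the CLIP text encoder, mapped to the CLIP image embedding
    space by the prior model P, and stacked exactly as during training."""
    embeddings = [prior(clip_text_encode(p)) for p in local_prompts]
    return build_spatio_textual(masks, embeddings, height, width, dim)

The resulting ST has the same spatial resolution as the input masks, so it can be fed to the diffusion model as an additional spatial conditioning signal alongside the global prompt.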

Results

Examples

The following images were generated by our method: each pair consists of (i) an input global text prompt (top left, black) and a spatio-textual representation describing each segment with a free-form text prompt (left, colored text and sketches), and (ii) the corresponding generated image (right). (The colors are for illustration purposes only and do not affect the actual inputs.)

Mask Sensitivity

During our experiments, we noticed that the model generates images that correspond to the implicit masks in the spatio-textual representation, but not perfectly. Nevertheless, we argue that this characteristic can be beneficial. For example, given a general animal shape mask (first image), the model is able to generate a diverse set of results driven by the different local prompts. It changes the body type according to the local prompt while leaving the overall posture of the character intact.


We also demonstrate this characteristic on a Rorschach test mask: given a general Rorschach mask (first image), the model is able to generate a diverse set of results driven by the different local prompts. It changes fine details according to the local prompt while leaving the overall shape intact.



These results are visualized in the following AI art video:

Multi-Scale Control

The extension of classifier-free guidance to the multi-conditional case allows fine-grained control over the input conditions. Given the same inputs (left), we can use different scales for each condition. In the following example, if we put all the weight on the local scene (1), the generated image contains a horse with the correct color and posture, but not at the beach. Conversely, if we place all the weight on the global text (5), we get an image of a beach with no horse in it. The in-between results correspond to a mix of conditions - in (4) we get a gray donkey, in (2) the beach contains no water, and in (3) we get a brown horse at the beach on a sunny day.
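
The paper derives its own multi-conditional extension of classifier-free guidance (together with an accelerated inference variant); the sketch below shows only the simplest additive way to attach a separate guidance scale to each condition, which already reproduces the qualitative behavior above: zeroing a scale removes that condition's influence. eps_model is a placeholder for the denoiser.

import torch

def multi_scale_guidance(eps_model, x_t, t, conds, null_conds, scales):
    """Give each condition its own guidance scale (a sketch of the idea only).

    conds:      list of conditions, e.g. [global_text_embedding, spatio_textual_ST]
    null_conds: the corresponding null (dropped) conditions
    scales:     one guidance scale per condition, e.g. [s_global, s_local]
    eps_model(x_t, t, conditions) is assumed to predict the noise for any
    combination of real and null conditions.
    """
    eps_uncond = eps_model(x_t, t, null_conds)       # fully unconditional prediction
    eps_hat = eps_uncond
    for i, (cond, scale) in enumerate(zip(conds, scales)):
        only_i = list(null_conds)
        only_i[i] = cond                             # keep only condition i
        eps_i = eps_model(x_t, t, only_i)
        eps_hat = eps_hat + scale * (eps_i - eps_uncond)
    return eps_hat

With scales = [s_global, s_local], putting all the weight on the local condition corresponds to case (1) above, and putting it all on the global text corresponds to case (5). Note that this naive form costs one extra forward pass per condition at every denoising step (the paper also presents an alternative accelerated inference algorithm).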

BibTeX

If you find this research useful, please cite the following:

@InProceedings{Avrahami_2023_CVPR,
    author    = {Avrahami, Omri and Hayes, Thomas and Gafni, Oran and Gupta, Sonal and Taigman, Yaniv and Parikh, Devi and Lischinski, Dani and Fried, Ohad and Yin, Xi},
    title     = {SpaText: Spatio-Textual Representation for Controllable Image Generation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {18370-18380}
}