The Chosen One: Consistent Characters in Text-to-Image Diffusion Models

1Google Research, 2The Hebrew University of Jerusalem, 3Tel Aviv University, 4Reichman University

SIGGRAPH 2024

The Chosen One - given a text prompt describing a character, our method distills a representation that enables consistent depiction of the same character in novel contexts.

Abstract

Recent advances in text-to-image generation models have unlocked vast potential for visual creativity. However, users of these models struggle to generate consistent characters, a crucial aspect for numerous real-world applications such as story visualization, game development, asset design, advertising, and more. Current methods typically rely on multiple pre-existing images of the target character or involve labor-intensive manual processes. In this work, we propose a fully automated solution for consistent character generation, with the sole input being a text prompt. We introduce an iterative procedure that, at each stage, identifies a coherent set of images sharing a similar identity and extracts a more consistent identity from this set. Our quantitative analysis demonstrates that our method strikes a better balance between prompt alignment and identity consistency compared to the baseline methods, and these findings are reinforced by a user study. Finally, we showcase several practical applications of our approach.

Video

Method

Our fully automated solution to the task of consistent character generation is based on the assumption that a sufficiently large set of images generated for a given prompt will contain groups of images with shared characteristics. Given such a cluster, one can extract a representation that captures the "common ground" among its images. Repeating the process with this representation increases the consistency among the generated images, while remaining faithful to the original input prompt.

We start by generating a gallery of images based on the provided text prompt, and embed them in a Euclidean space using a pre-trained feature extractor. Next, we cluster these embeddings, and choose the most cohesive cluster to serve as the input for a personalization method that attempts to extract a consistent identity. We then use the resulting model to generate the next gallery of images, which should exhibit more consistency, while still depicting the input prompt. This process is repeated iteratively until convergence.
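To make the loop above concrete, here is a minimal sketch of the iterative procedure in Python. The gallery generation, feature extraction, and personalization steps are passed in as callables, since they depend on the specific text-to-image model, feature extractor, and personalization method used; the cluster count, gallery size, and convergence threshold below are illustrative assumptions rather than values taken from the paper.

```python
# Minimal sketch of the iterative consistency loop, assuming the caller
# supplies the model-specific pieces (generation, embedding, personalization).
import numpy as np
from sklearn.cluster import KMeans


def most_cohesive_cluster(embeddings: np.ndarray, n_clusters: int = 5):
    """Cluster the image embeddings with k-means and return the member indices
    of the cluster whose points lie closest to their centroid, plus that mean
    distance (a simple cohesion score)."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(embeddings)
    best_members, best_cohesion = None, np.inf
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        if len(members) < 2:
            continue
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        if dists.mean() < best_cohesion:
            best_members, best_cohesion = members, dists.mean()
    return best_members, best_cohesion


def extract_consistent_identity(prompt, model, generate_gallery, embed_images,
                                personalize, gallery_size=64, max_iters=10, tol=0.1):
    """Iteratively refine `model` until the most cohesive cluster of its
    generations is tight enough, i.e. the identity has converged."""
    for _ in range(max_iters):
        images = generate_gallery(model, prompt, gallery_size)    # text-to-image sampling
        embeddings = embed_images(images)                         # pre-trained feature extractor
        members, cohesion = most_cohesive_cluster(embeddings)
        model = personalize(model, [images[i] for i in members])  # fit identity to the chosen cluster
        if cohesion < tol:                                        # assumed convergence criterion
            break
    return model
```

In this sketch the cohesion of the chosen cluster doubles as the stopping criterion: once the generated gallery already clusters tightly, further personalization rounds are unlikely to change the identity much.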


Results

Consistent Characters Examples

Using our method, we can generate a consistent character in novel scenes. Note that our method works across different character types (e.g., humans, animals) and styles (e.g., photorealistic, renderings).

"A photo of a 50 years old man with curly hair"



"A portrait of a man with a mustache and a hat, fauvism"



"A rendering of a cute albino porcupine, cozy indoor lighting"



"a 3D animation of a happy pig"



"a sticker of a ginger cat"


Life Story

Using our method, we can generate a consistent life story of the same character, depicting them at different stages of life.

"a photo of a man with short black hair"


Story Illustration

Our method can be used for story illustration. For example, we can illustrate the following story:

"This is a story about Jasper, a cute mink with a brown jacket and red pants. Jasper started his day by jogging on the beach, and afterwards, he enjoyed a coffee meetup with a friend in the heart of New York City. As the day drew to a close, he settled into his cozy apartment to review a paper"

Local Text-Driven Image Editing

Our method can be integrated with Blended Latent Diffusion for the task of consistent local text-driven image editing:
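As a rough illustration of how such an integration works, the sketch below shows the per-step latent blending rule of Blended Latent Diffusion: the personalized model denoises the latent of the edited image, and the region outside the user-provided mask is overwritten with a correspondingly noised version of the original image's latent. The denoising and noising callables are placeholders for the actual diffusion model and its noise schedule, not the implementation used in the paper.

```python
# Sketch of one blended denoising step; `denoise_step` and `add_noise` are
# assumed to come from the personalized diffusion model and its scheduler.
import torch


def blended_denoising_step(z_fg_t, z_src, mask, t, denoise_step, add_noise):
    """One edit step: denoise the foreground latent, then paste back the
    (re-noised) source latent outside the mask so the background is preserved.

    z_fg_t       -- current latent of the edited image at timestep t
    z_src        -- clean latent of the original image
    mask         -- 1 inside the region to edit, 0 elsewhere (latent resolution)
    denoise_step -- callable: one reverse-diffusion step with the personalized model
    add_noise    -- callable: forward-diffuses z_src to a given timestep
    """
    z_fg_prev = denoise_step(z_fg_t, t)   # predict the less-noisy foreground latent
    z_bg_prev = add_noise(z_src, t - 1)   # source latent at the matching noise level
    return mask * z_fg_prev + (1 - mask) * z_bg_prev
```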

Additional Pose Control

Our method can be integrated with ControlNet for the task of consistent pose-driven image generation:
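The sketch below shows one plausible way to wire this up with the diffusers library, assuming the consistent identity has been distilled into LoRA weights. The model identifiers, LoRA path, and pose image are illustrative placeholders rather than the exact setup used in the paper.

```python
# Hedged example: a pose-conditioned pipeline combined with hypothetical
# character LoRA weights produced by the consistency procedure.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # illustrative base model
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")
pipe.load_lora_weights("path/to/character_lora")  # hypothetical character weights

pose_image = load_image("pose_skeleton.png")      # precomputed OpenPose skeleton
image = pipe(
    "a sticker of a ginger cat, waving",
    image=pose_image,
    num_inference_steps=30,
).images[0]
image.save("posed_character.png")
```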

BibTeX

If you find this research useful, please cite the following:

@inproceedings{avrahami2024chosen,
  author = {Avrahami, Omri and Hertz, Amir and Vinker, Yael and Arar, Moab and Fruchter, Shlomi and Fried, Ohad and Cohen-Or, Daniel and Lischinski, Dani},
  title = {The Chosen One: Consistent Characters in Text-to-Image Diffusion Models},
  year = {2024},
  isbn = {9798400705250},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3641519.3657430},
  doi = {10.1145/3641519.3657430},
  abstract = {Recent advances in text-to-image generation models have unlocked vast potential for visual creativity. However, the users that use these models struggle with the generation of consistent characters, a crucial aspect for numerous real-world applications such as story visualization, game development, asset design, advertising, and more. Current methods typically rely on multiple pre-existing images of the target character or involve labor-intensive manual processes. In this work, we propose a fully automated solution for consistent character generation, with the sole input being a text prompt. We introduce an iterative procedure that, at each stage, identifies a coherent set of images sharing a similar identity and extracts a more consistent identity from this set. Our quantitative analysis demonstrates that our method strikes a better balance between prompt alignment and identity consistency compared to the baseline methods, and these findings are reinforced by a user study. To conclude, we showcase several practical applications of our approach.},
  booktitle = {ACM SIGGRAPH 2024 Conference Papers},
  articleno = {26},
  numpages = {12},
  keywords = {Consistent characters generation},
  location = {Denver, CO, USA},
  series = {SIGGRAPH '24}
}