Recent advances in text-to-image generation models have unlocked vast potential for visual creativity. However, these models struggle with generation of consistent characters, a crucial aspect for numerous real-world applications such as story visualization, game development asset design, advertising, and more. Current methods typically rely on multiple pre-existing images of the target character or involve labor-intensive manual processes. In this work, we propose a fully automated solution for consistent character generation, with the sole input being a text prompt. We introduce an iterative procedure that, at each stage, identifies a coherent set of images sharing a similar identity and extracts a more consistent identity from this set. Our quantitative analysis demonstrates that our method strikes a better balance between prompt alignment and identity consistency compared to the baseline methods, and these findings are reinforced by a user study. To conclude, we showcase several practical applications of our approach.
Our fully-automated solution to the task of consistent character generation is based on the assumption that a sufficiently large set of generated images, for a certain prompt, will contain groups of images with shared characteristics. Given such a cluster, one can extract a representation that captures the "common ground" among its images. Repeating the process with this representation, we can increase the consistency among the generated images, while still remaining faithful to the original input prompt.
We start by generating a gallery of images based on the provided text prompt, and embed them in a Euclidean space using a pre-trained feature extractor. Next, we cluster these embeddings, and choose the most cohesive cluster to serve as the input for a personalization method that attempts to extract a consistent identity. We then use the resulting model to generate the next gallery of images, which should exhibit more consistency, while still depicting the input prompt. This process is repeated iteratively until convergence.
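Below is a minimal sketch of this iterative loop. The generator, feature extractor, and personalization step are placeholders (generate_gallery, embed_images, and personalize are hypothetical helpers standing in for whichever text-to-image model, pre-trained feature extractor, and personalization method are used); the clustering and the cohesion criterion follow the description above.

```python
# Sketch of the iterative identity-extraction loop described above.
import numpy as np
from sklearn.cluster import KMeans

def most_cohesive_cluster(embeddings, n_clusters=5):
    """Cluster the embeddings and return the indices of the cluster whose
    members are, on average, closest to their centroid (i.e., the most
    cohesive cluster)."""
    kmeans = KMeans(n_clusters=n_clusters, n_init="auto").fit(embeddings)
    best_indices, best_score = None, np.inf
    for c in range(n_clusters):
        idx = np.where(kmeans.labels_ == c)[0]
        if len(idx) < 2:
            continue
        dists = np.linalg.norm(embeddings[idx] - kmeans.cluster_centers_[c], axis=1)
        if dists.mean() < best_score:
            best_score, best_indices = dists.mean(), idx
    return best_indices

def extract_consistent_identity(prompt, model, n_iters=5, gallery_size=64):
    """Iteratively refine a personalized model until its generations converge
    to a consistent identity for the given prompt."""
    for _ in range(n_iters):
        images = generate_gallery(model, prompt, gallery_size)  # hypothetical: text-to-image sampling
        embeddings = embed_images(images)                       # hypothetical: pre-trained feature extractor
        idx = most_cohesive_cluster(embeddings)
        model = personalize(model, [images[i] for i in idx])    # hypothetical: personalization on the chosen set
        # In practice, the loop stops when consecutive galleries are
        # sufficiently similar (convergence) rather than after a fixed count.
    return model
```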
Using our method, we can generate a consistent character in novel scenes. Note that our method works across different character types (e.g., humans, animals) and styles (e.g., photorealistic, renderings).
"A photo of a 50 years old man with curly hair"
"in the park"
"reading a book"
"at the beach"
"holding an avocado"
"A portrait of a man with a mustache and a hat, fauvism"
"in the park"
"reading a book"
"at the beach"
"holding an avocado"
"A rendering of a cute albino porcupine, cozy indoor lighting"
"in the park"
"reading a book"
"at the beach"
"holding an avocado"
"a 3D animation of a happy pig"
"in the park"
"reading a book"
"at the beach"
"holding an avocado"
"a sticker of a ginger cat"
"in the park"
"reading a book"
"at the beach"
"holding an avocado"
Using our method, we can generate a consistent life story of the same character, depicting different stages of life.
"a photo of a man with short black hair"
"as a baby"
"as a small child"
"as a teenager"
"with his first girlfriend"
"before the prom"
"as a soldier"
"in the college campus"
"sitting in a lecture"
"playing football"
"drinking a beer"
"studying in his room"
"happy with his accepted paper"
"giving a talk in a conference"
"graduating from college"
"a profile picture"
"working in a coffee shop"
"in his wedding"
"with his small child"
"as a 50 years old man"
"as a 70 years old man"
"a watercolor painting"
"a pencil sketch"
"a rendered avatar"
"a 2D animation"
"a graffiti"
Our method can also be used for story illustration. For example, we can illustrate the following story:
"This is a story about Jasper, a cute mink with a brown jacket and red pants. Jasper started his day by jogging on the beach, and afterwards, he enjoyed a coffee meetup with a friend in the heart of New York City. As the day drew to a close, he settled into his cozy apartment to review a paper"
Scene 1
Scene 2
Scene 3
Scene 4
Our method can be integrated with Blended Latent Diffusion to achieve consistent local text-driven image editing (see the sketch after the examples below):
Input image + mask
"sitting"
"jumping"
"wearing sunglasses"
Our method can be integrated with ControlNet to achieve consistent pose-driven image generation (see the sketch after the examples below):
Input pose 1
Result 1
Input pose 2
Result 2
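The following is a minimal sketch of pose-driven generation with the extracted identity, assuming a diffusers ControlNet (OpenPose) pipeline; as above, the LoRA weight path and the "<character>" token are assumptions about how the identity is stored.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Load an OpenPose ControlNet and attach the personalized identity weights.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/consistent_character_lora")  # hypothetical weights

pose_image = Image.open("pose_1.png")  # an OpenPose skeleton image

# Generate the consistent character in the pose specified by the control image.
result = pipe(
    prompt="a photo of <character>",  # assumed identity token
    image=pose_image,
).images[0]
result.save("result_1.png")
```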