Blended Latent Diffusion

The Hebrew University of Jerusalem, Reichman University

SIGGRAPH 2023

Blended Latent Diffusion teaser.

Given an input image and a mask, Blended Latent Diffusion modifies the masked area according to a guiding text prompt, without affecting the unmasked regions

Abstract

The tremendous progress in neural image generation, coupled with the emergence of seemingly omnipotent vision-language models, has finally enabled text-based interfaces for creating and editing images. Handling generic images requires a diverse underlying generative model, hence the latest works utilize diffusion models, which were shown to surpass GANs in terms of diversity. One major drawback of diffusion models, however, is their relatively slow inference time. In this paper, we present an accelerated solution to the task of local text-driven editing of generic images, where the desired edits are confined to a user-provided mask. Our solution leverages a recent text-to-image Latent Diffusion Model (LDM), which speeds up diffusion by operating in a lower-dimensional latent space. We first convert the LDM into a local image editor by incorporating Blended Diffusion into it. Next, we propose an optimization-based solution for the inherent inability of this LDM to accurately reconstruct images. Finally, we address the scenario of performing local edits using thin masks. We evaluate our method against the available baselines both qualitatively and quantitatively and demonstrate that in addition to being faster, our method achieves better precision than the baselines while mitigating some of their artifacts.

Video

Method

Blended Latent Diffusion addresses the task of local text-driven editing of generic images that was introduced in the Blended Diffusion paper. Blended Diffusion suffers from slow inference (obtaining a good result takes about 25 minutes on a single GPU) and from pixel-level artifacts.

To address these issues, we propose to incorporate Blended Diffusion into the text-to-image Latent Diffusion Model (LDM). To do so, we operate in the latent space and repeatedly blend the foreground and background parts of the latent as the diffusion progresses, as sketched below:
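The core blending loop can be summarized by the following minimal sketch. The encoder/decoder, denoising step, and noising function are placeholder interfaces (not the official implementation or any specific library's API), shown only to illustrate where the blending happens:

import torch
import torch.nn.functional as F

@torch.no_grad()
def blended_latent_edit(image, mask, prompt,
                        encode, decode, denoise_step, add_noise,
                        timesteps):
    """Minimal sketch of the latent blending loop.

    `encode`/`decode` stand for the LDM's VAE, `denoise_step` for one
    text-guided denoising step of the diffusion model, and `add_noise`
    for noising a latent to a given timestep -- all placeholder
    interfaces, not a specific library's API.
    """
    z_init = encode(image)                      # source image latent
    # Downsample the binary mask to the latent resolution (N, 1, h, w).
    m = F.interpolate(mask, size=z_init.shape[-2:], mode="nearest")

    z_t = torch.randn_like(z_init)              # start the edit from noise
    for t in timesteps:                         # e.g. T-1, ..., 0
        # Text-guided denoising of the current (foreground) latent.
        z_fg = denoise_step(z_t, t, prompt)
        # Noise the original latent to the same noise level.
        z_bg = add_noise(z_init, t)
        # Blend: edited content inside the mask, original outside it.
        z_t = z_fg * m + z_bg * (1 - m)

    return decode(z_t)                          # back to pixel space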


Operating in the latent space indeed yields fast inference; however, it suffers from imperfect reconstruction of the unmasked area and cannot handle thin masks. For more details on how we address these problems, please read the paper.
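As a rough illustration of the reconstruction issue, one can picture the optimization-based fix as a short per-image fine-tuning of the VAE decoder, pulling the unmasked region toward the original pixels while keeping the masked region close to the initially decoded edit. The sketch below uses assumed interfaces and hyper-parameters; it is not the paper's exact procedure and does not cover the thin-mask case:

import torch

def reconstruct_background(decoder, z_edit, source_image, mask,
                           steps=100, lr=1e-4):
    """Hypothetical sketch of per-image decoder fine-tuning for
    background reconstruction (interfaces, loss weights, and step count
    are assumptions)."""
    target_edit = decoder(z_edit).detach()      # initially decoded edit
    opt = torch.optim.Adam(decoder.parameters(), lr=lr)

    for _ in range(steps):
        out = decoder(z_edit)
        # Outside the mask: match the original source pixels.
        bg_loss = (((out - source_image) * (1 - mask)) ** 2).mean()
        # Inside the mask: stay close to the edited content.
        fg_loss = (((out - target_edit) * mask) ** 2).mean()
        loss = bg_loss + fg_loss

        opt.zero_grad()
        loss.backward()
        opt.step()

    # Decode once more with the adapted decoder weights.
    return decoder(z_edit)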

Applications

Background Replacement

Given a source image and a mask of the background, Blended Latent Diffusion is able to replace the background according to the text description. Note that the famous landmarks are not meant to accurately appear in the new background but serve as an inspiration for the image completion.

Adding a New Object

Given a source image and a mask of an area to edit, Blended Latent Diffusion is able to add a new object in the masked area seamlessly.

Object Editing

Given a source image and a mask over an existing object, Blended Latent Diffusion is able to alter the object seamlessly.

Text Generation

Blended Latent Diffusion is able to generate plausible text within the masked area.

Multiple Results

Because of the one-to-many nature of our problem, multiple predictions are needed for each input. Blended Latent Diffusion is able to produce diverse results for the same input.


Input prompt: graffiti with the text "no free lunch"

Input prompt: "stones"

Scribble Editing

A user-provided scribble can be used as a guide. Specifically, the user can scribble a rough shape on a background image, provide a mask (covering the scribble) to indicate the area that is allowed to change, and provide a text prompt. Blended Latent Diffusion then transforms the scribble into a natural object while attempting to match the prompt.


Input prompt: "paint splashes"

BibTeX

If you find this research useful, please cite the following:

@article{avrahami2023blendedlatent,
  author     = {Avrahami, Omri and Fried, Ohad and Lischinski, Dani},
  title      = {Blended Latent Diffusion},
  year       = {2023},
  issue_date = {August 2023},
  publisher  = {Association for Computing Machinery},
  address    = {New York, NY, USA},
  volume     = {42},
  number     = {4},
  issn       = {0730-0301},
  url        = {https://doi.org/10.1145/3592450},
  doi        = {10.1145/3592450},
  journal    = {ACM Trans. Graph.},
  month      = {jul},
  articleno  = {149},
  numpages   = {11},
  keywords   = {zero-shot text-driven local image editing}
}

@InProceedings{Avrahami_2022_CVPR,
  author    = {Avrahami, Omri and Lischinski, Dani and Fried, Ohad},
  title     = {Blended Diffusion for Text-Driven Editing of Natural Images},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2022},
  pages     = {18208-18218}
}