DiffUHaul: A Training-Free Method for Object Dragging in Images

1NVIDIA Research, 2The Hebrew University of Jerusalem, 3Tel Aviv University, 4Reichman University
*Indicates Equal Advising

SIGGRAPH Asia 2024

DiffUHaul: given an image containing an object, our method seamlessly relocates it within the scene.

Abstract

Text-to-image diffusion models have proven effective for solving many image editing tasks. However, the seemingly straightforward task of seamlessly relocating objects within a scene remains surprisingly challenging. Existing methods addressing this problem often struggle to function reliably in real-world scenarios due to a lack of spatial reasoning. In this work, we propose a training-free method, dubbed DiffUHaul, that harnesses the spatial understanding of a localized text-to-image model for the object dragging task. Blindly manipulating the layout inputs of the localized model tends to cause low editing performance due to the intrinsic entanglement of object representations in the model. To this end, we first apply attention masking in each denoising step to make the generation more disentangled across different objects and adopt the self-attention sharing mechanism to preserve the high-level object appearance. Furthermore, we propose a new diffusion anchoring technique: in the early denoising steps, we interpolate the attention features between source and target images to smoothly fuse new layouts with the original appearance; in the later denoising steps, we pass the localized features from the source images to the interpolated images to retain fine-grained object details. To adapt DiffUHaul to real-image editing, we apply DDPM self-attention bucketing, which better reconstructs real images with the localized model. Finally, we introduce an automated evaluation pipeline for this task and showcase the efficacy of our method. Our results are reinforced through a user preference study.

Video

Method

Recently, the community has developed several localized text-to-image models that add spatial controllability to text-to-image generation. A natural question is whether the localized understanding of the 2D pixel world in such models can be harnessed for the task of object dragging. Hence, we examine the disentanglement properties of such models and propose a series of modifications that allow them to serve as a backbone for drag-and-drop movement of objects within an image. Specifically, we use the recently introduced BlobGEN model and demonstrate that its spatial understanding enables significantly more robust object dragging without requiring fine-tuning or training.
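In a blob-grounded model such as BlobGEN, the drag itself amounts to a simple edit of the layout input: the object's blob is translated to the new location while its text description and remaining parameters stay fixed. The snippet below is a minimal sketch of this operation in Python; the five-parameter ellipse blob (center, semi-axes, orientation) and the Blob/drag_blob names are illustrative assumptions rather than the exact BlobGEN interface.

from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Blob:
    cx: float      # ellipse center x, in normalized image coordinates [0, 1]
    cy: float      # ellipse center y
    a: float       # semi-axis lengths
    b: float
    theta: float   # orientation angle
    caption: str   # per-object text description

def drag_blob(blob: Blob, dx: float, dy: float) -> Blob:
    """Relocate an object by translating its blob center; the text
    description and all other layout parameters are left unchanged."""
    return replace(blob,
                   cx=min(max(blob.cx + dx, 0.0), 1.0),
                   cy=min(max(blob.cy + dy, 0.0), 1.0))

# Example: move a vase 30% of the image width to the right.
source_blob = Blob(cx=0.25, cy=0.6, a=0.10, b=0.15, theta=0.0, caption="a blue ceramic vase")
target_blob = drag_blob(source_blob, dx=0.3, dy=0.0)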

In pursuit of our solution, we begin by revealing an entanglement problem in localized text-to-image models, whereby the prompt-based localized controls of different image regions interfere with each other. We trace the root cause to the commonly used Gated Self-Attention layers, where each individual layout embedding is free to attend to all the visual features. We propose an inference-time masking-based solution, named gated self-attention masking, and show that improving the model's disentanglement leads to better object dragging performance.
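To make the masking concrete, below is a minimal single-head sketch of gated self-attention with per-object masking in PyTorch. The tensor shapes, the explicit projection matrices, and the exact masking rule (each layout embedding attending only to the visual patches inside its own blob, and vice versa) are illustrative assumptions; the actual BlobGEN layers are multi-head and parameterized differently.

import torch

def masked_gated_self_attention(visual, layout, blob_masks, Wq, Wk, Wv, gamma):
    """Single-head sketch of gated self-attention with per-object masking.

    visual:     (B, N, C) image patch features (N = H*W)
    layout:     (B, M, C) per-object layout/grounding embeddings
    blob_masks: (B, M, N) boolean, True where object m covers patch n
    Wq/Wk/Wv:   (C, C) projection matrices
    gamma:      learnable scalar tensor gating the residual update
    """
    B, N, C = visual.shape
    M = layout.shape[1]

    x = torch.cat([visual, layout], dim=1)                 # (B, N+M, C)
    q, k, v = x @ Wq, x @ Wk, x @ Wv

    # Allowed-attention pattern over the concatenated token sequence:
    # layout token m may only see the patches inside its blob, and patch n
    # may only see the layout token(s) whose blob covers it.
    allow = torch.ones(B, N + M, N + M, dtype=torch.bool, device=x.device)
    allow[:, N:, :N] = blob_masks
    allow[:, :N, N:] = blob_masks.transpose(1, 2)

    scores = (q @ k.transpose(1, 2)) / C ** 0.5            # (B, N+M, N+M)
    scores = scores.masked_fill(~allow, float("-inf"))
    out = scores.softmax(dim=-1) @ v                       # (B, N+M, C)

    # Only the visual tokens receive the tanh-gated residual update.
    return visual + torch.tanh(gamma) * out[:, :N]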

Next, specifically for the object dragging task, we first adopt the commonly used self-attention sharing mechanism to preserve the high-level object appearance. To better transfer fine-grained object details from the source image to the target image, and to better harness the spatial understanding of the model, we propose a novel soft anchoring mechanism: in the early denoising steps, which control the object shape and scene layout, we interpolate the self-attention features of the source image with those of the target image using a coefficient tied to the diffusion time step. This promotes a smooth fusion between the target layout and the source appearance. Then, in the later denoising steps, which control the fine-grained visual appearance, we update the interpolated attention features from the corresponding features in the source image via nearest-neighbor copying.
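A minimal sketch of this soft anchoring schedule is given below. The switch point between the two phases, the linear interpolation schedule, and the use of cosine similarity for the nearest-neighbor lookup are illustrative assumptions rather than the exact configuration of our method.

import torch
import torch.nn.functional as F

def soft_anchoring(src_feats, tgt_feats, t, t_switch=0.5):
    """Combine source and target self-attention features at diffusion time t.

    src_feats, tgt_feats: (B, N, C) self-attention features from the source
                          and target denoising passes at the same time step.
    t:        normalized diffusion time in [0, 1], with 1 at the start of denoising.
    t_switch: assumed threshold separating the early (layout) steps from the
              late (appearance) steps.
    """
    if t > t_switch:
        # Early steps: interpolate with a time-dependent coefficient so the
        # target layout and the source appearance are fused smoothly.
        alpha = (t - t_switch) / (1.0 - t_switch)          # decays over the early phase
        return alpha * tgt_feats + (1.0 - alpha) * src_feats
    # Late steps: replace each target feature with its nearest neighbor among
    # the source features (cosine similarity) to retain fine-grained details.
    sim = F.normalize(tgt_feats, dim=-1) @ F.normalize(src_feats, dim=-1).transpose(1, 2)
    nn_idx = sim.argmax(dim=-1)                            # (B, N) index into source tokens
    return torch.gather(src_feats, 1,
                        nn_idx.unsqueeze(-1).expand(-1, -1, src_feats.shape[-1]))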


Results


BibTeX

If you find this research useful, please cite the following:

@inproceedings{avrahami2024diffuhaul,
  author = {Avrahami, Omri and Gal, Rinon and Chechik, Gal and Fried, Ohad and Lischinski, Dani and Vahdat, Arash and Nie, Weili},
  title = {DiffUHaul: A Training-Free Method for Object Dragging in Images},
  year = {2024},
  isbn = {9798400711312},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3680528.3687590},
  doi = {10.1145/3680528.3687590},
  booktitle = {SIGGRAPH Asia 2024 Conference Papers},
  articleno = {38},
  numpages = {12},
  keywords = {Object Dragging, Image Editing},
  location = {Tokyo, Japan},
  series = {SA '24}
}