Diffusion models have revolutionized the field of content synthesis and editing. Recent models have replaced the traditional UNet architecture with the Diffusion Transformer (DiT) and employ flow matching for improved training and sampling. However, they exhibit limited generation diversity. In this work, we leverage this limitation to perform consistent image edits via selective injection of attention features. The main challenge is that, unlike UNet-based models, DiT lacks a coarse-to-fine synthesis structure, making it unclear in which layers to perform the injection. Therefore, we propose an automatic method to identify "vital layers" within DiT, crucial for image formation, and demonstrate how these layers facilitate a range of controlled stable edits, from non-rigid modifications to object addition, using the same mechanism. Next, to enable real-image editing, we introduce an improved image inversion method for flow models. Finally, we evaluate our approach through qualitative and quantitative comparisons, along with a user study, and demonstrate its effectiveness across multiple applications.
Specifically, we explore image editing via parallel generation, where features from the generative trajectory of the source (reference) image are injected into the trajectory of the edited image. Such an approach has proven effective for convolutional UNet-based diffusion models, where the roles of the different attention layers are well understood. However, no such understanding has yet emerged for DiT: it does not exhibit the fine-coarse-fine structure of the UNet, hence it is not clear which layers should be tampered with to achieve the desired editing behavior.
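Concretely, parallel generation means denoising the source and edited images side by side and, at selected layers, letting the edited trajectory attend to the source trajectory's features. Below is a minimal PyTorch sketch of such shared attention; the function name, tensor layout, and the choice to extend (rather than replace) the keys and values are illustrative assumptions, not the exact injection mechanism:

```python
import torch
import torch.nn.functional as F

def injected_attention(q_edit, k_edit, v_edit, k_src, v_src):
    """Extended attention: the edited trajectory's queries attend both to
    their own tokens and to the source trajectory's tokens."""
    # Tensors are (batch, heads, tokens, head_dim); concatenate the source
    # keys/values along the token axis (dim=-2).
    k = torch.cat([k_edit, k_src], dim=-2)
    v = torch.cat([v_edit, v_src], dim=-2)
    return F.scaled_dot_product_attention(q_edit, k, v)

# Toy usage with random features standing in for real DiT activations:
b, h, n, d = 1, 8, 64, 64
q, k, v = (torch.randn(b, h, n, d) for _ in range(3))
k_src, v_src = torch.randn(b, h, n, d), torch.randn(b, h, n, d)
out = injected_attention(q, k, v, k_src, v_src)  # shape (1, 8, 64, 64)
```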
To address this gap, we analyze the importance of the different components in the DiT architecture, in order to determine the subset of layers into which features should be injected while editing. More specifically, we introduce an automatic method for detecting a set of vital layers --- layers that are essential for image formation --- by measuring the deviation in image content that results from bypassing each layer. We show that there is no simple relationship between the vitality of a layer and its position in the architecture, i.e., the vital layers are spread across the transformer.
Fig. 1. (Left) Text-to-image DiT models consist of consecutive layers connected through residual connections. Each layer implements a multimodal diffusion transformer block that processes a combined sequence of text and image embeddings. (Right) For each DiT layer, we perform an ablation by bypassing the layer using its residual connection. We then compare the result generated by the ablated model with that of the complete model using a perceptual similarity metric.
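In code, this per-layer ablation might look like the sketch below; `model.blocks` and `generate` are hypothetical placeholders for a DiT text-to-image pipeline, LPIPS is one possible choice of perceptual metric, and in practice the deviation would be averaged over several prompts and seeds:

```python
import contextlib
import lpips  # perceptual similarity metric (one possible choice)

@contextlib.contextmanager
def bypassed(block):
    """Temporarily turn a DiT block into an identity mapping, so that only
    its residual connection carries the signal."""
    original = block.forward
    block.forward = lambda hidden, *args, **kwargs: hidden
    try:
        yield
    finally:
        block.forward = original

def rank_layers(model, generate, prompt, seed=0):
    """Ablate each block in turn and measure how far the output image
    deviates from the full model's output."""
    perceptual = lpips.LPIPS(net='alex')
    reference = generate(model, prompt, seed)  # full model, fixed seed
    scores = {}
    for idx, block in enumerate(model.blocks):
        with bypassed(block):
            ablated = generate(model, prompt, seed)
        # Images are assumed to be NCHW tensors in [-1, 1], as LPIPS expects.
        scores[idx] = perceptual(reference, ablated).item()
    # Layers whose removal changes the image the most are the vital ones.
    return sorted(scores, key=scores.get, reverse=True)
```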
Our method can be used for various types of image editing:
It can also be used for text-related editing tasks:
We also offer a latent nudging technique to enable real image editing:
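For intuition, here is a hedged sketch of what inversion with latent nudging could look like, assuming a rectified-flow convention where t runs from 0 (data) to 1 (noise); `model.velocity` and the nudging factor are illustrative assumptions rather than the exact implementation:

```python
import torch

@torch.no_grad()
def invert(model, x0, prompt_emb, num_steps=50, nudge=1.15):
    """Euler inversion of the flow ODE, starting from a slightly scaled
    ("nudged") clean latent; the scale factor is a tunable hyperparameter."""
    z = x0 * nudge  # latent nudging: small multiplicative perturbation
    ts = torch.linspace(0.0, 1.0, num_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        v = model.velocity(z, t_cur, prompt_emb)  # hypothetical API
        z = z + (t_next - t_cur) * v  # Euler step toward noise (t: 0 -> 1)
    return z  # inverted latent; editing then runs the usual sampling loop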