Diffusion models have revolutionized the field of content synthesis and editing. Recent models have replaced the traditional UNet architecture with the Diffusion Transformer (DiT) and employ flow matching for improved training and sampling. However, they exhibit limited generation diversity. In this work, we leverage this limitation to perform consistent image edits via selective injection of attention features. The main challenge is that, unlike UNet-based models, DiT lacks a coarse-to-fine synthesis structure, making it unclear in which layers to perform the injection. Therefore, we propose an automatic method to identify "vital layers" within DiT, crucial for image formation, and demonstrate how these layers facilitate a range of controlled stable edits, from non-rigid modifications to object addition, using the same mechanism. Next, to enable real-image editing, we introduce an improved image inversion method for flow models. Finally, we evaluate our approach through qualitative and quantitative comparisons, along with a user study, and demonstrate its effectiveness across multiple applications.
Specifically, we explore image editing via parallel generation, where features from the generative trajectory of the source (reference) image are injected into the trajectory of the edited image. Such an approach has proven effective for convolutional UNet-based diffusion models, where the roles of the different attention layers are well understood. However, no such understanding has yet emerged for DiT: it does not exhibit the fine-coarse-fine structure of the UNet, hence it is not clear which layers should be tampered with to achieve the desired editing behavior.
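Concretely, parallel generation means denoising the source and edited images side by side and, at selected layers, letting the edited trajectory attend to the source trajectory's features. Below is a minimal PyTorch sketch of such shared attention; the function name, tensor layout, and the choice to extend (rather than replace) the keys and values are illustrative assumptions, not the exact injection mechanism:

```python
import torch
import torch.nn.functional as F

def injected_attention(q_edit, k_edit, v_edit, k_src, v_src):
    """Extended attention: the edited trajectory's queries attend both to
    their own tokens and to the source trajectory's tokens."""
    # Tensors are (batch, heads, tokens, head_dim); concatenate the source
    # keys/values along the token axis (dim=-2).
    k = torch.cat([k_edit, k_src], dim=-2)
    v = torch.cat([v_edit, v_src], dim=-2)
    return F.scaled_dot_product_attention(q_edit, k, v)

# Toy usage with random features standing in for real DiT activations:
b, h, n, d = 1, 8, 64, 64
q, k, v = (torch.randn(b, h, n, d) for _ in range(3))
k_src, v_src = torch.randn(b, h, n, d), torch.randn(b, h, n, d)
out = injected_attention(q, k, v, k_src, v_src)  # shape (1, 8, 64, 64)
```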
To address this gap, we analyze the importance of the different components in the DiT architecture, in order to determine the subset of layers into which features should be injected while editing. More specifically, we introduce an automatic method for detecting a set of vital layers --- layers that are essential for image formation --- by measuring the deviation in image content that results from bypassing each layer. We show that there is no simple relationship between the vitality of a layer and its position in the architecture, i.e., the vital layers are spread across the transformer.
Fig. 1. (Left) Text-to-image DiT models consist of consecutive layers connected through residual connections. Each layer implements a multimodal diffusion transformer block that processes a combined sequence of text and image embeddings. (Right) For each DiT layer, we perform an ablation by bypassing the layer using its residual connection. We then compare the result generated by the ablated model with that of the complete model using a perceptual similarity metric.
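In code, this per-layer ablation might look like the sketch below; `model.blocks` and `generate` are hypothetical placeholders for a DiT text-to-image pipeline, LPIPS is one possible choice of perceptual metric, and in practice the deviation would be averaged over several prompts and seeds:

```python
import contextlib
import lpips  # perceptual similarity metric (one possible choice)

@contextlib.contextmanager
def bypassed(block):
    """Temporarily turn a DiT block into an identity mapping, so that only
    its residual connection carries the signal."""
    original = block.forward
    block.forward = lambda hidden, *args, **kwargs: hidden
    try:
        yield
    finally:
        block.forward = original

def rank_layers(model, generate, prompt, seed=0):
    """Ablate each block in turn and measure how far the output image
    deviates from the full model's output."""
    perceptual = lpips.LPIPS(net='alex')
    reference = generate(model, prompt, seed)  # full model, fixed seed
    scores = {}
    for idx, block in enumerate(model.blocks):
        with bypassed(block):
            ablated = generate(model, prompt, seed)
        # Images are assumed to be NCHW tensors in [-1, 1], as LPIPS expects.
        scores[idx] = perceptual(reference, ablated).item()
    # Layers whose removal changes the image the most are the vital ones.
    return sorted(scores, key=scores.get, reverse=True)
```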
Our method can be used for various types of image editing:
It can also be used for text-related editing tasks:
We also offer a latent nudging technique to enable real image editing:
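For intuition, here is a hedged sketch of what inversion with latent nudging could look like, assuming a rectified-flow convention where t runs from 0 (data) to 1 (noise); `model.velocity` and the nudging factor are illustrative assumptions rather than the exact implementation:

```python
import torch

@torch.no_grad()
def invert(model, x0, prompt_emb, num_steps=50, nudge=1.15):
    """Euler inversion of the flow ODE, starting from a slightly scaled
    ("nudged") clean latent; the scale factor is a tunable hyperparameter."""
    z = x0 * nudge  # latent nudging: small multiplicative perturbation
    ts = torch.linspace(0.0, 1.0, num_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        v = model.velocity(z, t_cur, prompt_emb)  # hypothetical API
        z = z + (t_next - t_cur) * v  # Euler step toward noise (t: 0 -> 1)
    return z  # inverted latent; editing then runs the usual sampling loop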