SAEdit: Token-Level Control for Continuous Image Editing via Sparse Autoencoder

1Tel-Aviv University     2Google DeepMind


We train a Sparse AutoEncoder (SAE) to lift the text embeddings into a higher-dimensional space, where we identify disentangled semantic directions (e.g. for laughing). These directions can then be applied to specific tokens within the input of a text-to-image model to facilitate continuous image editing.

Abstract

Large-scale text-to-image diffusion models have become the backbone of modern image editing, yet text prompts alone do not offer adequate control over the editing process. Two properties are especially desirable: disentanglement, where changing one attribute does not unintentionally alter others, and continuous control, where the strength of an edit can be smoothly adjusted. We introduce a method for disentangled and continuous editing through token-level manipulation of text embeddings. The edits are applied by manipulating the embeddings along carefully chosen directions, which control the strength of the target attribute. To identify such directions, we employ a Sparse Autoencoder (SAE), whose sparse latent space exposes semantically isolated dimensions. Our method operates directly on text embeddings without modifying the diffusion process, making it model-agnostic and broadly applicable to various image synthesis backbones. Experiments show that it enables intuitive and efficient manipulations with continuous control across diverse attributes and domains.

Editing Results on Flux-dev

Training a Sparse Autoencoder

  • First we train a Sparse Autoencoder (SAE) to lift the text embeddings into a higher-dimensional space.
  • The SAE is trained in an unsupervised manner, using a combination of reconstruction and sparsity losses.
  • The sparse latent space exposes semantically isolated dimensions, which we identify as disentangled directions (a minimal training sketch follows this list).
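As a concrete illustration, the PyTorch sketch below shows one way such an SAE could look: a linear encoder lifts each token embedding into a wider, non-negative latent, a linear decoder maps it back, and the loss combines reconstruction with an L1 sparsity penalty. The layer widths, the sparsity weight, and the train_step helper are illustrative assumptions, not the paper's exact configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SparseAutoencoder(nn.Module):
        """Lifts token embeddings (d_model) into a wider, sparse latent space (d_sae)."""
        def __init__(self, d_model: int, d_sae: int):
            super().__init__()
            self.encoder = nn.Linear(d_model, d_sae)
            self.decoder = nn.Linear(d_sae, d_model)

        def encode(self, x):
            # ReLU keeps the latent non-negative; the L1 term below drives it sparse.
            return F.relu(self.encoder(x))

        def forward(self, x):
            z = self.encode(x)
            return self.decoder(z), z

    # Hypothetical sizes and weight; `text_embeddings` is a (batch, seq_len, d_model)
    # tensor of per-token embeddings collected from the text encoder.
    d_model, d_sae, sparsity_weight = 768, 8192, 1e-3
    sae = SparseAutoencoder(d_model, d_sae)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

    def train_step(text_embeddings):
        recon, z = sae(text_embeddings)
        recon_loss = F.mse_loss(recon, text_embeddings)   # reconstruction loss
        sparsity_loss = z.abs().mean()                    # L1 sparsity loss
        loss = recon_loss + sparsity_weight * sparsity_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()

Because training only needs text embeddings and these two losses, no images or attribute labels are required.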

Finding an edit direction

  • Given a pair of text embeddings corresponding to the source and target prompts, we pass them through the SAE encoder to obtain their sparse representations.
  • We apply max pooling to each latent vector and compare the resulting pooled vectors to construct a mask that highlights the most significant entries.
  • Using this mask, we extract the relevant components from the pooled representation of the target prompt, producing the direction \(d\).
  • Repeating this process over \(N\) source–target pairs yields a set of directions \(\{d_i\}_{i=1}^N\). The final edit direction is obtained by applying PCA to \(\{d_i\}_{i=1}^N\) (see the sketch after this list).
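The sketch below, reusing the SparseAutoencoder from above, shows how this procedure could be realized. The top-k masking rule, the sign correction, and taking the first principal component are illustrative assumptions about how the mask and the PCA step are implemented, not the paper's exact recipe.

    import torch

    @torch.no_grad()
    def find_edit_direction(sae, source_embs, target_embs, top_k=32):
        """source_embs / target_embs: lists of (seq_len, d_model) tensors for the
        N source-target prompt pairs. Returns an edit direction in the SAE latent space."""
        directions = []
        for src, tgt in zip(source_embs, target_embs):
            z_src = sae.encode(src).max(dim=0).values   # max pooling over tokens
            z_tgt = sae.encode(tgt).max(dim=0).values
            diff = z_tgt - z_src
            mask = torch.zeros_like(diff)               # mask of the most significant entries
            mask[diff.topk(top_k).indices] = 1.0
            directions.append(mask * z_tgt)             # keep only the masked target components
        D = torch.stack(directions)                     # (N, d_sae)
        # PCA: first right-singular vector of the centered direction matrix.
        D_centered = D - D.mean(dim=0, keepdim=True)
        _, _, Vh = torch.linalg.svd(D_centered, full_matrices=False)
        d = Vh[0]
        # Orient d to agree with the mean of the individual directions.
        if torch.dot(d, D.mean(dim=0)) < 0:
            d = -d
        return d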

Applying the edit direction

  • During inference, we modify the latent representation of the token embedding we aim to edit.
  • The modification is done by adding the edit direction obtained in the previous stage.
  • Specifically, we apply \(z' = z + \alpha d\), where \(\alpha\) controls the strength of the edit (see the sketch below).
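Continuing the sketch above, applying the edit reduces to a few lines: encode the chosen token, shift its sparse latent along \(d\), and decode back before the embedding is passed to the text-to-image model. The function and argument names here are hypothetical.

    import torch

    @torch.no_grad()
    def apply_edit(sae, prompt_embs, token_index, direction, alpha):
        """prompt_embs: (seq_len, d_model) token embeddings of the prompt.
        token_index: position of the token to edit (e.g. the token 'man').
        direction: edit direction d in the SAE latent space; alpha: edit strength."""
        edited = prompt_embs.clone()
        z = sae.encode(prompt_embs[token_index])      # lift the token into the sparse space
        z_edited = z + alpha * direction              # z' = z + alpha * d
        edited[token_index] = sae.decoder(z_edited)   # map back to the embedding space
        return edited                                 # passed to the text-to-image model as usual

Sweeping \(\alpha\) from zero upward gives the continuous control described above, and since only the text embedding changes, the diffusion process itself is left untouched.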

More Results

  • By adding the directions found by our method to a specific token (e.g., 'man'), we can steer the image generation toward different attributes, such as 'smile', 'old', 'angry', and 'surprised'.
  • In every column, we apply the same direction \(d\) to different prompts, demonstrating the generalization ability of our method.

BibTeX


    @misc{kamenetsky2025saedittokenlevelcontrolcontinuous,
      title={SAEdit: Token-level control for continuous image editing via Sparse AutoEncoder},
      author={Ronen Kamenetsky and Sara Dorfman and Daniel Garibi and Roni Paiss and Or Patashnik and Daniel Cohen-Or},
      year={2025},
      eprint={2510.05081},
      archivePrefix={arXiv},
      primaryClass={cs.GR},
      url={https://arxiv.org/abs/2510.05081},
    }