
# Conditioning

## Overview

The purpose of *<mark style="color:blue;">**conditioning**</mark>* is to *<mark style="color:red;">**steer the noise predictor so that the predicted noise gives us what we want**</mark>* once it is subtracted from the image.

## Text conditioning

{% hint style="info" %}
Text-to-image
{% endhint %}

#### Overview of how a text prompt is processed and fed into the noise predictor

<figure><img src="/files/qSpOPVpyZLtHabqUJAav" alt=""><figcaption><p>How the text prompt is processed and fed into the noise predictor to steer image generation</p></figcaption></figure>

#### [Tokenizer](#tokenizer-1)

The <mark style="color:green;">**tokenizer**</mark> first converts each word in the prompt to a number called a <mark style="color:green;">**token**</mark>.

#### Token

Each token is then converted to a 768-value vector called an <mark style="color:green;">**embedding**</mark>.

#### [Embedding](#embedding-1)

The embeddings are then processed by the <mark style="color:green;">**text transformer**</mark> and are ready to be consumed by the noise predictor.

#### Example

You can inspect the tokens and embeddings of any prompt with this [notebook](https://colab.research.google.com/github/sagiodev/stablediffusion_webui/blob/master/Stable_Diffusion_tokenizer_and_embedding_SDA.ipynb).

### Tokenizer

<figure><img src="/files/BSLQEJfxEMrKdX2l5ys6" alt=""><figcaption><p>Tokenizer</p></figcaption></figure>

The text prompt is first **tokenized** by a [CLIP tokenizer](https://huggingface.co/docs/transformers/model_doc/clip) (see [CLIP](/wiki/ai-techniques/stable-diffusion/clip.md) for more detail). *<mark style="color:green;">**Tokenization**</mark>* is the computer's way of understanding words. Humans can read words, but computers can only work with numbers. This is the reason the words in a text prompt are first converted to numbers.

*<mark style="color:red;">**A tokenizer can only tokenize words it has seen during training.**</mark>* For example, the CLIP model knows "dream" and "beach" but not "dreambeach". The tokenizer would break the word "dreambeach" into two tokens, "dream" and "beach". *<mark style="color:red;">**So one word does not always mean one token.**</mark>*

*<mark style="color:red;">**The Stable Diffusion model is limited to 75 tokens in a prompt. (This is not the same as 75 words.)**</mark>*
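As a rough illustration of why one word can become several tokens, here is a minimal sketch of sub-word tokenization. It is not the CLIP tokenizer: the real one uses byte-pair encoding with a learned vocabulary, whereas `TOY_VOCAB`, `tokenize_word`, and `tokenize` are hypothetical names built around a hand-written six-entry vocabulary and greedy longest-match splitting.

```python
# Toy vocabulary (hypothetical); the real CLIP vocabulary is learned and far larger.
TOY_VOCAB = {"a": 0, "on": 1, "the": 2, "dream": 3, "beach": 4, "sunset": 5}
MAX_PROMPT_TOKENS = 75  # prompt token limit stated in the text above


def tokenize_word(word, vocab):
    """Split one word into known pieces by greedy longest-prefix matching."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            # No prefix of the remainder is in the vocabulary.
            raise ValueError(f"cannot tokenize {word[start:]!r} with this toy vocabulary")
    return pieces


def tokenize(prompt, vocab=TOY_VOCAB, limit=MAX_PROMPT_TOKENS):
    """Lower-case the prompt, split on whitespace, and map pieces to token ids."""
    pieces = [p for word in prompt.lower().split() for p in tokenize_word(word, vocab)]
    return [vocab[p] for p in pieces[:limit]]  # anything past the limit is dropped


print(tokenize_word("dreambeach", TOY_VOCAB))
print(tokenize("a dreambeach sunset"))
```

In this sketch the three-word prompt yields four tokens, because "dreambeach" is not in the toy vocabulary and is split into "dream" and "beach".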

### Embedding

<figure><img src="/files/3Vfnxmx7FNBrvNzfjjhA" alt=""><figcaption><p>Embedding</p></figcaption></figure>

Stable Diffusion v1 uses OpenAI's [ViT-L/14](https://github.com/CompVis/stable-diffusion) CLIP model. An embedding is a 768-value vector, and each token has its own unique embedding vector. The embeddings are fixed by the CLIP model, which learned them during training.

#### Why do we need embedding?

Embeddings are needed because *<mark style="color:red;">**some words are closely related to each other**</mark>*, and we want to take advantage of this information. For example, the embeddings of "man", "gentleman", and "guy" are nearly identical because the words can be used interchangeably. Monet, Manet, and Degas all painted in impressionist styles but in different ways, so their names have close but not identical embeddings.
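One common way to make "close" embeddings concrete is cosine similarity. The sketch below uses made-up 4-value vectors purely for illustration; `TOY_EMBEDDINGS` and its numbers are assumptions, not values taken from CLIP, whose embeddings have 768 values.

```python
import math

# Hypothetical 4-value vectors for illustration only; real CLIP embeddings have 768 values.
TOY_EMBEDDINGS = {
    "man":       [0.90, 0.10, 0.05, 0.20],
    "gentleman": [0.88, 0.12, 0.07, 0.22],
    "guy":       [0.91, 0.09, 0.04, 0.18],
    "beach":     [0.05, 0.95, 0.30, 0.02],
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Compare "man" against a few other words in the toy table.
for word in ("gentleman", "guy", "beach"):
    score = cosine_similarity(TOY_EMBEDDINGS["man"], TOY_EMBEDDINGS[word])
    print(f"man vs {word}: {score:.3f}")
```

With these toy numbers, the interchangeable words score close to 1 while the unrelated word scores much lower, which is the kind of structure the noise predictor can exploit.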

An embedding can also trigger a style with a keyword. Researchers have shown that finding the proper embeddings can trigger arbitrary objects and styles, a fine-tuning technique called [textual inversion](https://textual-inversion.github.io).

### Feeding embeddings to the noise predictor

<figure><img src="/files/d4sim53wo2AQekiC8ZvC" alt=""><figcaption><p>From embeddings to the noise predictor</p></figcaption></figure>

The embedding needs to be further processed by the *<mark style="color:red;">**text transformer**</mark>* before being fed into the noise predictor. The transformer is like a universal adapter for conditioning. In this case, its input is <mark style="color:red;">**text embedding vectors**</mark>, but <mark style="color:purple;">**it could**</mark> just as well be something else, such as *<mark style="color:purple;">**class labels, images**</mark>*, or [depth maps](/wiki/ai-techniques/stable-diffusion/depth-maps.md). The transformer not only further processes the data but also *<mark style="color:red;">**provides a mechanism to include different conditioning modalities**</mark>*.

### Cross-attention

The output of the text transformer is used *<mark style="color:red;">**multiple times**</mark>* by the noise predictor throughout the U-Net. The U-Net consumes it through a *<mark style="color:red;">**cross-attention mechanism**</mark>*. <mark style="color:green;">**That's where the prompt meets the image.**</mark>

Take the prompt "A man with blue eyes" as an example. Stable Diffusion pairs the two words "blue" and "eyes" together (<mark style="color:red;">**self-attention within the prompt**</mark>) so that it generates a man with blue eyes rather than a man with a blue shirt. It then uses this information to <mark style="color:red;">**steer the reverse diffusion towards images containing blue eyes**</mark> (*<mark style="color:blue;">**cross-attention between the prompt and the image**</mark>*).
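The cross-attention step described above can be sketched as scaled dot-product attention, where the queries come from image positions and the keys and values come from the prompt tokens. This is a simplified single-head illustration and not the Stable Diffusion implementation: in a real U-Net the queries, keys, and values are produced by learned projection layers, which are omitted here, and all names and numbers below are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query (image position) attends over all keys/values (prompt tokens)."""
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this image position to every prompt token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)
        # Weighted mix of the token values, one number per value dimension.
        mixed = [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))]
        weights.append(w)
        outputs.append(mixed)
    return outputs, weights


# Toy data: 3 prompt tokens and 2 image positions, 2 values each (all made up).
text_keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
text_values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
image_queries = [[2.0, 0.0], [0.0, 2.0]]

outputs, weights = cross_attention(image_queries, text_keys, text_values)
for position, row in enumerate(weights):
    print(f"image position {position} attends to tokens with weights {row}")
```

Each row of weights sums to 1, so every image position receives a blend of the prompt tokens, weighted by how well its query matches each token's key.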

## Extension

### Hypernetwork

Hypernetwork, *<mark style="color:green;">**a technique to fine-tune Stable Diffusion models**</mark>*, <mark style="color:red;">**hijacks the cross-attention network to insert styles**</mark>. <mark style="color:blue;">**LoRA models modify the weights of the cross-attention module to change styles**</mark>. *<mark style="color:orange;">**The fact that modifying this module alone can fine-tune a Stable Diffusion model tells you how important this module is.**</mark>*
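To give a feel for what "modifying the weights" means in the LoRA case, the sketch below adds a low-rank update to a weight matrix: the effective weight is the original matrix plus a scaled product of two small matrices. The 3x3 weight, the rank-1 factors, the `scale` parameter, and the function names are illustrative assumptions, not values or APIs from any real model or library.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_update(weight, lora_b, lora_a, scale=1.0):
    """Return weight + scale * (lora_b @ lora_a) without changing the original."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Hypothetical frozen 3x3 cross-attention weight (identity, for readability).
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Rank-1 factors: a 3x1 matrix and a 1x3 matrix, 6 numbers instead of 9.
lora_b = [[1.0], [0.0], [2.0]]
lora_a = [[0.5, 0.0, 0.0]]

adapted = lora_update(weight, lora_b, lora_a)
for row in adapted:
    print(row)
```

The base weight stays untouched and only the two small factors need to be stored, which is why a style can be shipped as a small add-on file.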

### Other conditionings

The text prompt is not the only way a Stable Diffusion model can be conditioned. Both a text prompt and a depth image are used to condition the [depth-to-image model](/wiki/ai-techniques/stable-diffusion/depth-maps.md).

[ControlNet](/wiki/ai-techniques/stable-diffusion/controlnet.md) conditions the noise predictor with [detected outlines](/wiki/ai-techniques/stable-diffusion/controlnet.md#edge-detection), [human poses](/wiki/ai-techniques/stable-diffusion/controlnet.md#human-pose-detection), etc., and achieves excellent control over image generation.

## Credit

{% embed url="<https://stable-diffusion-art.com/how-stable-diffusion-work/#Conditioning>" %}

