# Stable Diffusion workflow

## Overview

Let's go through some examples of what happens under the hood.

## Text-to-image

We give Stable Diffusion a text prompt, and it returns an image.

### Step1

Stable Diffusion generates a *<mark style="color:red;">**random tensor in the latent space**</mark>*. We *<mark style="color:purple;">**control this tensor by setting the**</mark>* [*<mark style="color:purple;">**seed**</mark>*](https://aisuko.gitbook.io/wiki/ai-techniques/the-important-parameters-for-stunning-ai-image#seed) *<mark style="color:purple;">**of the random number generator.**</mark>* If we <mark style="color:blue;">**set the seed to a certain value**</mark>, we will <mark style="color:blue;">**always get the same random tensor**</mark>. *<mark style="color:red;">**This is the image in latent space, but it is all noise for now.**</mark>*

<figure><img src="https://3515747285-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fqm1WfU7McQ1hgBrDXi90%2Fuploads%2Fgit-blob-cefce4d9df4519152884bb88301f5e1fad2adcc5%2Fimage.png?alt=media" alt="" width="339"><figcaption><p>A random tensor is generated in latent space</p></figcaption></figure>
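
The effect of the seed can be sketched in a few lines of Python. This is only an illustration using the standard `random` module, not the generator Stable Diffusion itself uses; the helper name `random_latent` is made up for this example, and the 4x64x64 shape is the latent size mentioned in the steps on this page.

```python
import random

LATENT_SHAPE = (4, 64, 64)  # channels x height x width

def random_latent(seed, shape=LATENT_SHAPE):
    """Return a flat list of standard-normal samples, one per latent value.

    Hypothetical helper: a private generator is created from the seed,
    so the seed alone determines the output.
    """
    rng = random.Random(seed)
    size = shape[0] * shape[1] * shape[2]
    return [rng.gauss(0.0, 1.0) for _ in range(size)]

a = random_latent(seed=42)
b = random_latent(seed=42)
c = random_latent(seed=43)

print(len(a))   # 16384 values = 4 * 64 * 64
print(a == b)   # True: same seed, same tensor
print(a == c)   # False: different seed, different tensor
```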

### Step2

The noise predictor U-Net takes the latent noisy image and text prompt as input and predicts the noise, also in latent space (a 4x64x64 tensor).

<figure><img src="https://3515747285-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fqm1WfU7McQ1hgBrDXi90%2Fuploads%2Fgit-blob-a678c9986292cb78f2a3ac163bd87902fc39c026%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

### Step3

Subtract the latent noise from the latent image. This becomes our **new latent image**.

<figure><img src="https://3515747285-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fqm1WfU7McQ1hgBrDXi90%2Fuploads%2Fgit-blob-60cbf24e389922a670da69d837d474e46f554dac%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

{% hint style="info" %}
Steps 2 and 3 are repeated for a <mark style="color:red;">**certain number of sampling steps**</mark>, for example 20 times.
{% endhint %}
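
The loop formed by Steps 2 and 3 can be sketched as follows. This is a toy under loud assumptions: `toy_noise_predictor` is a hypothetical stand-in for the U-Net (the real one is a neural network conditioned on the text prompt), and the "prompt" is replaced by a known clean latent so that the result can be checked. The rule for how much noise to remove per step is also an assumption made for the sketch.

```python
import random

def toy_noise_predictor(latent, clean_target):
    """Hypothetical stand-in for the U-Net: the noise it 'sees' in the latent."""
    return [x - t for x, t in zip(latent, clean_target)]

def denoise(latent, clean_target, steps=20):
    """Repeat 'predict noise' and 'subtract noise' for a number of sampling steps."""
    for i in range(steps):
        noise = toy_noise_predictor(latent, clean_target)   # Step 2: predict the noise
        fraction = 1.0 / (steps - i)                         # share of the noise removed now
        latent = [x - fraction * n for x, n in zip(latent, noise)]  # Step 3: subtract it
    return latent

rng = random.Random(0)
target = [rng.uniform(-1.0, 1.0) for _ in range(16)]  # pretend clean latent
start = [rng.gauss(0.0, 1.0) for _ in range(16)]      # pure noise, as in Step 1

result = denoise(start, target, steps=20)
error = max(abs(r - t) for r, t in zip(result, target))
print(error < 1e-9)  # the toy loop ends at the clean latent
```

In this toy the last step removes all remaining noise, so the loop always lands on the target. A real sampler relies on the predictor's estimate and on the noise schedule instead.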

### Step4

Finally, [the VAE decoder converts the latent image back to pixel space](https://aisuko.gitbook.io/wiki/ai-techniques/vae#variational-autoencoder). This is the image you get after running Stable Diffusion.

<figure><img src="https://3515747285-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fqm1WfU7McQ1hgBrDXi90%2Fuploads%2Fgit-blob-c750e4fa6d2a365b32a4544c3d07b660e0b2e67e%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>
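
To get a feel for what the decoder does to the tensor shape, here is a small shape calculation. It assumes an upscaling factor of 8 and 3 RGB output channels, which matches Stable Diffusion v1 but is not stated on this page; the helper names are made up for the example.

```python
def pixel_shape(latent_shape, scale=8, image_channels=3):
    """Shape of the decoded image for a latent of shape (channels, height, width).

    scale=8 and image_channels=3 are assumptions (Stable Diffusion v1 values).
    """
    _, height, width = latent_shape
    return (image_channels, height * scale, width * scale)

def count(shape):
    """Number of values in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total

latent = (4, 64, 64)
image = pixel_shape(latent)
print(image)                          # (3, 512, 512)
print(count(image) // count(latent))  # 48: the image holds 48x more values than the latent
```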

#### The process of the image evolves in each sampling step

<figure><img src="https://i0.wp.com/stable-diffusion-art.com/wp-content/uploads/2022/12/cat_euler_15.gif?resize=512%2C512&#x26;ssl=1" alt=""><figcaption><p>Image at each sampling step</p></figcaption></figure>

## Noise schedule

The image changes from noisy to clean. *<mark style="color:red;">**Is this because the noise predictor does not work well in the initial steps?**</mark>* This <mark style="color:red;">**is only partly true**</mark>. The real reason is that we try to reach an expected noise level at each sampling step. This is called the *<mark style="color:blue;">**noise schedule**</mark>*. Here is an example.

<figure><img src="https://3515747285-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fqm1WfU7McQ1hgBrDXi90%2Fuploads%2Fgit-blob-1e684b9f4a0b1c6f4929e9b63942379deb24424c%2Fimage.png?alt=media" alt=""><figcaption><p>A noise schedule for 15 sampling steps.</p></figcaption></figure>

<mark style="color:green;">**The noise schedule is something we define**</mark>. We can choose to *<mark style="color:blue;">**subtract the same amount of noise at each step**</mark>*. <mark style="color:blue;">**Or**</mark> we can *<mark style="color:blue;">**subtract more in the beginning**</mark>*, <mark style="color:purple;">**as in the schedule above**</mark>. *<mark style="color:orange;">**The sampler subtracts just enough noise in each step to reach the expected noise in the next step**</mark>*. That's what you see in the [step-by-step image](#the-process-of-the-image-evolves-in-each-sampling-step).
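
The two choices described above can be compared numerically. The sketch below is illustrative only: the noise level is normalized to run from 1 (pure noise) to 0 (clean), and the quadratic formula in `front_loaded_schedule` is an assumption picked for simplicity, not the formula of any particular sampler.

```python
def linear_schedule(steps):
    """Expected noise level before each step and after the last one: equal drops."""
    return [1.0 - i / steps for i in range(steps + 1)]

def front_loaded_schedule(steps):
    """Hypothetical quadratic schedule: larger drops early, smaller drops late."""
    return [(1.0 - i / steps) ** 2 for i in range(steps + 1)]

def drops(schedule):
    """Amount of noise the sampler has to remove at each step."""
    return [a - b for a, b in zip(schedule, schedule[1:])]

steps = 15
linear = linear_schedule(steps)
front = front_loaded_schedule(steps)

print([round(d, 3) for d in drops(linear)[:3]])  # constant drop per step
print([round(d, 3) for d in drops(front)[:3]])   # biggest drops come first
```

Both schedules start at 1 and end at 0; they differ only in how the total is split across the 15 steps.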

## Image-to-image

Image-to-image was first proposed in [SDEdit](https://arxiv.org/abs/2108.01073). SDEdit can be applied to any diffusion model, so image-to-image is also available for Stable Diffusion (a latent diffusion model).

<mark style="color:red;">**An input image**</mark> and <mark style="color:red;">**a text prompt**</mark> *<mark style="color:blue;">**are supplied as the input**</mark>* in image-to-image. The generated image is conditioned on both the input image and the text prompt. For example, using this amateur drawing and the prompt "photo of perfect green apple with stem, water droplets, dramatic lighting" as inputs, image-to-image can turn it into a professional drawing:

<figure><img src="https://3515747285-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fqm1WfU7McQ1hgBrDXi90%2Fuploads%2Fgit-blob-8d4c40120696e5788c88c54a581024263538523c%2Fimage.png?alt=media" alt=""><figcaption><p>Image-to-image</p></figcaption></figure>

### Step1

The input image is encoded to latent space.

<figure><img src="https://3515747285-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fqm1WfU7McQ1hgBrDXi90%2Fuploads%2Fgit-blob-e7e02361a44aebe3b2964f20579823b5effaa414%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

### Step2

Noise is added to the latent image. [Denoising strength](https://aisuko.gitbook.io/wiki/ai-techniques/stable-diffusion/denoising-strength) controls *<mark style="color:green;">**how much noise is added**</mark>*. If it is 0, no noise is added. *<mark style="color:red;">**If it is 1, the maximum amount of noise**</mark>* is added so that *<mark style="color:red;">**the latent image becomes a complete random tensor**</mark>*.

<figure><img src="https://3515747285-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fqm1WfU7McQ1hgBrDXi90%2Fuploads%2Fgit-blob-cf57910cef2b8b7dbc0cb2107ff0c509394df6e9%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>
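
The two extremes of denoising strength can be illustrated with a simple blend. This is a sketch under an explicit assumption: a linear mix of the encoded image and Gaussian noise is used here for readability, whereas real samplers scale the image and the noise according to the noise schedule. `add_noise` is a made-up helper name.

```python
import random

def add_noise(latent, strength, seed=0):
    """Blend an encoded image with Gaussian noise (hypothetical linear blend).

    strength = 0 returns the latent unchanged; strength = 1 returns pure noise.
    Returns both the noisy latent and the noise that was used.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    noisy = [(1.0 - strength) * x + strength * n for x, n in zip(latent, noise)]
    return noisy, noise

encoded = [0.25, -0.5, 1.0, 0.0]          # pretend latent of the input image

unchanged, _ = add_noise(encoded, 0.0)
random_only, noise = add_noise(encoded, 1.0)
print(unchanged == encoded)   # True: no noise added
print(random_only == noise)   # True: nothing of the input image is left
```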

### Step3

The [*<mark style="color:green;">**noise predictor U-Net**</mark>*](https://aisuko.gitbook.io/wiki/ai-techniques/diffusion-in-image#noise-predictor) *<mark style="color:green;">**takes the latent noisy image**</mark>* and *<mark style="color:green;">**text prompt**</mark>* *<mark style="color:red;">**as input**</mark>* and *<mark style="color:red;">**predicts the noise in latent space (a 4x64x64 tensor)**</mark>*.

<figure><img src="https://3515747285-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fqm1WfU7McQ1hgBrDXi90%2Fuploads%2Fgit-blob-a678c9986292cb78f2a3ac163bd87902fc39c026%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

### Step4

Subtract the latent noise from the latent image. This becomes our new latent image.

<figure><img src="https://3515747285-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fqm1WfU7McQ1hgBrDXi90%2Fuploads%2Fgit-blob-60cbf24e389922a670da69d837d474e46f554dac%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

Steps 3 and 4 are repeated for a certain number of sampling steps, for example 20 times.

### Step5

Finally, the VAE decoder converts the latent image back to pixel space. This is the image you get after running image-to-image.

<figure><img src="https://3515747285-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fqm1WfU7McQ1hgBrDXi90%2Fuploads%2Fgit-blob-88e503d8c65dbf7927d8a26d22fe0164feac4140%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

All *<mark style="color:blue;">**image-to-image does is set the initial latent image**</mark>* with *<mark style="color:red;">**a bit of noise**</mark>* and *<mark style="color:red;">**a bit of the input image**</mark>*. Setting *<mark style="color:green;">**denoising strength to 1 is equivalent to text-to-image**</mark>* because the *<mark style="color:yellow;">**initial latent image is entirely random noise**</mark>*.
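
Some implementations, for example the image-to-image pipeline in Hugging Face `diffusers`, also use the denoising strength to decide how many of the sampling steps actually run: a partly noised image enters the schedule partway through instead of at the start. The sketch below shows that bookkeeping; `img2img_plan` is a made-up name, and the exact rounding rule is an assumption about such implementations rather than something stated on this page.

```python
def img2img_plan(num_steps, strength):
    """Return (steps_skipped, steps_run) for a given denoising strength.

    Assumed rule: run the last int(num_steps * strength) steps of the schedule.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    steps_run = min(int(num_steps * strength), num_steps)
    return num_steps - steps_run, steps_run

for strength in (0.0, 0.5, 0.75, 1.0):
    print(strength, img2img_plan(20, strength))
# 0.0 (20, 0)
# 0.5 (10, 10)
# 0.75 (5, 15)
# 1.0 (0, 20)
```

With a strength of 1 no step is skipped and the starting latent is pure noise, which is the text-to-image case.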

## Inpainting

Inpainting is really just <mark style="color:red;">**a particular case of image-to-image**</mark>. <mark style="color:blue;">**Noise is added to the parts of the image you want to inpaint**</mark>. *<mark style="color:green;">**The amount of noise is similarly controlled by**</mark>* [*<mark style="color:green;">**denoising strength**</mark>*](https://aisuko.gitbook.io/wiki/ai-techniques/stable-diffusion/denoising-strength).
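
A minimal sketch of "noise only where the mask is set" is shown below. It reuses the simplifying assumption of a linear blend between image and noise; the function name and the flat-list representation of the latent and the mask are choices made for this example.

```python
import random

def noise_masked_region(latent, mask, strength, seed=0):
    """Add noise only where mask is 1 (region to inpaint); keep the rest as is.

    Hypothetical helper using a linear blend; real samplers follow the noise schedule.
    """
    rng = random.Random(seed)
    out = []
    for value, selected in zip(latent, mask):
        noise = rng.gauss(0.0, 1.0)
        noised = (1.0 - strength) * value + strength * noise
        out.append(noised if selected else value)
    return out

latent = [0.5] * 8
mask = [1, 1, 0, 0, 1, 0, 0, 1]   # 1 = inpaint here, 0 = keep

result = noise_masked_region(latent, mask, strength=0.8)
kept = [r == x for r, x, m in zip(result, latent, mask) if not m]
print(all(kept))  # True: unmasked values are untouched
```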

## Depth-to-image

[*<mark style="color:blue;">**Depth-to-image**</mark>*](https://aisuko.gitbook.io/wiki/ai-techniques/stable-diffusion/depth-maps) *<mark style="color:blue;">**is an enhancement to image-to-image**</mark>*; <mark style="color:red;">**it generates new images with additional conditioning**</mark> using a [depth map](https://aisuko.gitbook.io/wiki/ai-techniques/stable-diffusion/depth-maps).

### Step1

The input image is encoded into latent space.

<figure><img src="https://3515747285-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fqm1WfU7McQ1hgBrDXi90%2Fuploads%2Fgit-blob-88d5fdd3df86987ab9be8f0bc898b4833c41cf53%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

### Step2

[MiDaS](https://github.com/isl-org/MiDaS) (an AI depth model) estimates the depth map from the input image.

<figure><img src="https://3515747285-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fqm1WfU7McQ1hgBrDXi90%2Fuploads%2Fgit-blob-63899f04a34ca75adccca20e2678b83f9b5a76e0%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

### Step3

Noise is added to the latent image. [Denoising strength](https://aisuko.gitbook.io/wiki/ai-techniques/stable-diffusion/denoising-strength) controls how much noise is added. If the denoising strength is 0, no noise is added. If the denoising strength is 1, the maximum noise is added so that the latent image becomes a random tensor.

<figure><img src="https://3515747285-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fqm1WfU7McQ1hgBrDXi90%2Fuploads%2Fgit-blob-cf57910cef2b8b7dbc0cb2107ff0c509394df6e9%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

### Step4

The noise predictor estimates the noise in the latent image, *<mark style="color:red;">**conditioned on the text prompt and the depth map**</mark>*.

<figure><img src="https://3515747285-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fqm1WfU7McQ1hgBrDXi90%2Fuploads%2Fgit-blob-39cbd8e1bad42eea6c6b2be694a433e5a618473c%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>
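
One way to feed a depth map to the noise predictor is to shrink it to the latent resolution and append it to the latent as an extra channel. The sketch below assumes that approach, with average pooling and a factor of 8 between image and latent resolution; none of these details are given on this page, and the helper names are made up. Tiny sizes are used so the example stays readable.

```python
def downsample(depth, factor):
    """Average-pool a 2D depth map by `factor` in both directions."""
    rows, cols = len(depth) // factor, len(depth[0]) // factor
    out = []
    for i in range(rows):
        row = []
        for j in range(cols):
            block = [depth[i * factor + di][j * factor + dj]
                     for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

def add_depth_channel(latent, depth, factor=8):
    """Return a new tensor: the latent channels plus one depth channel."""
    small = downsample(depth, factor)
    if len(small) != len(latent[0]) or len(small[0]) != len(latent[0][0]):
        raise ValueError("depth map does not match the latent resolution")
    return latent + [small]

latent = [[[0.0] * 2 for _ in range(2)] for _ in range(4)]  # toy 4x2x2 latent
depth = [[0.5] * 16 for _ in range(16)]                     # toy 16x16 depth map

unet_input = add_depth_channel(latent, depth)
print(len(unet_input), len(unet_input[0]), len(unet_input[0][0]))  # 5 2 2
```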

### Step5

Subtract the latent noise from the latent image. This becomes our <mark style="color:red;">**new latent image**</mark>.

<figure><img src="https://3515747285-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fqm1WfU7McQ1hgBrDXi90%2Fuploads%2Fgit-blob-60cbf24e389922a670da69d837d474e46f554dac%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

Steps 4 and 5 are repeated for the number of sampling steps.

### Step6

The VAE decoder converts the latent image back to pixel space. This is the final image from depth-to-image.

<figure><img src="https://3515747285-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fqm1WfU7McQ1hgBrDXi90%2Fuploads%2FxZyRS6LrfOc64UwgUhHi%2Fimage.png?alt=media&#x26;token=7eea4850-cdc0-40d8-ab23-7a409640bdaa" alt=""><figcaption></figcaption></figure>

## Credit

{% embed url="https://stable-diffusion-art.com/how-stable-diffusion-work/#Stable_Diffusion_step-by-step" %}
