# Stable Diffusion workflow

## Overview

Let's go through some examples of what happens under the hood.

## Text-to-image

We give Stable Diffusion a text prompt, and it returns an image.

### Step1

Stable Diffusion generates a *random tensor in the latent space*. We **control this tensor by setting the seed of the random number generator**. *If we* **set the seed to a certain value**, we will **always get the same random tensor**.
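The effect of the seed can be sketched in a few lines of plain Python. This is only an illustration: the helper name `random_latent` is hypothetical, nested lists stand in for a real tensor, and an actual pipeline would use its framework's own random number generator.

```python
import random

LATENT_SHAPE = (4, 64, 64)  # channels, height, width of the latent tensor

def random_latent(seed, shape=LATENT_SHAPE):
    """Return Gaussian noise of the given shape as nested lists."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    channels, height, width = shape
    return [[[rng.gauss(0.0, 1.0) for _ in range(width)]
             for _ in range(height)]
            for _ in range(channels)]

a = random_latent(seed=42)
b = random_latent(seed=42)
c = random_latent(seed=43)

print(a == b)  # compares two tensors built from the same seed
print(a == c)  # compares tensors built from different seeds
```

Reusing a seed reproduces the starting noise exactly, which is why the same seed, prompt, and settings lead to the same image.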

**This is your image in latent space. But it is all noise for now.**

### Step2

The noise predictor U-Net takes the latent noisy image and text prompt as input and predicts the noise, also in latent space (a 4x64x64 tensor).

### Step3

Subtract the latent noise from the latent image. This becomes our **new latent image**.

Steps 2 and 3 are repeated for a **certain number of sampling steps**, for example 20 times.
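Steps 1 to 3 form a simple loop, sketched below. The noise predictor here is a stand-in, not a U-Net: `fake_noise_predictor` is a hypothetical placeholder that ignores the prompt and treats 20% of every value as noise, and a flat list stands in for the 4x64x64 latent tensor. Only the control flow is meant to match the description.

```python
import random

LATENT_SIZE = 4 * 64 * 64  # flattened 4x64x64 latent

def fake_noise_predictor(latent, prompt):
    """Placeholder for the U-Net: pretends 20% of each value is noise.

    A real predictor would also use the text prompt; this one ignores it.
    """
    return [0.2 * x for x in latent]

def sample(prompt, seed, steps=20):
    # Step 1: start from a seeded random latent.
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in range(LATENT_SIZE)]

    for _ in range(steps):
        # Step 2: predict the noise in latent space.
        noise = fake_noise_predictor(latent, prompt)
        # Step 3: subtract it to get the new latent image.
        latent = [x - n for x, n in zip(latent, noise)]

    return latent  # Step 4 would decode this with the VAE

result = sample("a photo of a cat", seed=42)
print(len(result))
```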

### Step4

Finally, the VAE decoder converts the latent image back to pixel space. This is the image you get after running Stable Diffusion.

#### How the image evolves in each sampling step

## Noise schedule

The image changes from noisy to clean. *Is that because the noise predictor does not work well in the initial steps?* This **is only partly true**. The real reason is that we try to reach an expected noise level at each sampling step. This is called the **noise schedule**. *Here is an example.*

**The noise schedule is something we define**. We can choose to *subtract the same amount of noise at each step*. **Or** we can **subtract more in the beginning**, **like above**.
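The two choices can be compared with a small sketch. The function names and the quadratic curve (`power=2.0`) are assumptions made for illustration; real samplers define their schedules differently, but the idea of a target noise level per step is the same.

```python
def linear_schedule(steps, start=1.0):
    """Expected noise level after each step, removing the same amount each time."""
    return [start * (1 - (i + 1) / steps) for i in range(steps)]

def front_loaded_schedule(steps, start=1.0, power=2.0):
    """Expected noise level after each step, removing more noise early on."""
    return [start * (1 - (i + 1) / steps) ** power for i in range(steps)]

def amounts_removed(schedule, start=1.0):
    """Noise that has to be subtracted at each step to hit the schedule."""
    levels = [start] + schedule
    return [levels[i] - levels[i + 1] for i in range(len(schedule))]

steps = 20
linear = amounts_removed(linear_schedule(steps))
front = amounts_removed(front_loaded_schedule(steps))

print("linear, first vs last step:", round(linear[0], 4), round(linear[-1], 4))
print("front-loaded, first vs last step:", round(front[0], 4), round(front[-1], 4))
```

Both schedules end at zero noise; they differ only in how the total is split across the steps.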

**The sampler subtracts just enough noise in each step to reach the expected noise in the next step**. *That's what you see in the step-by-step image.*

## Image-to-image

Image-to-image was first proposed in SDEdit. SDEdit can be applied to any diffusion model, so we also have image-to-image for Stable Diffusion (a latent diffusion model).

**An input image** and **a text prompt** *are supplied as the input* in image-to-image. The generated image is conditioned on both the input image and the text prompt. For example, given an amateur drawing and the prompt "photo of perfect green apple with stem, water droplets, dramatic lighting" as inputs, image-to-image can turn it into a professional drawing.

### Step1

The input image is encoded to latent space.

### Step2

Noise is added to the latent image. Denoising strength controls *how much noise is added*. If it is 0, no noise is added. **If it is 1, the maximum amount of noise** *is added so that* **the latent image becomes a complete random tensor**.

### Step3

The **noise predictor U-Net** *takes the latent noisy image* and **text prompt** **as input** *and* **predicts the noise in latent space (a 4x64x64 tensor)**.

### Step4

Subtract the latent noise from the latent image. This becomes our new latent image.

Steps 3 and 4 are repeated for a certain number of sampling steps, for example, 20 times.

### Step5

Finally, the VAE decoder converts the latent image back to pixel space. This is the image you get after running image-to-image.
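The role of denoising strength in Step2 can be sketched as a blend between the encoded image and random noise. This is a deliberate simplification: the linear blend and the name `add_noise` are assumptions for illustration, and real samplers add noise according to their noise schedule rather than by a straight mix.

```python
import random

def add_noise(image_latent, strength, seed):
    """Blend an encoded image with Gaussian noise.

    strength = 0 returns the image latent unchanged;
    strength = 1 returns pure random noise.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in image_latent]
    return [(1.0 - strength) * x + strength * n
            for x, n in zip(image_latent, noise)]

# A tiny stand-in for an encoded input image (a real latent is 4x64x64).
image_latent = [0.5, -0.25, 1.0, 0.0]

untouched = add_noise(image_latent, strength=0.0, seed=7)
pure_noise = add_noise(image_latent, strength=1.0, seed=7)

print(untouched == image_latent)  # checks that strength 0 keeps the input
print(pure_noise)                 # nothing of the input image remains
```

At strength 1 the starting latent no longer depends on the input image at all, so the rest of the loop behaves like text-to-image with the same seed.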

All *image-to-image does is to set the initial latent image* with **a bit of noise** *and* **a bit of input image**. *Setting* **denoising strength to 1 is equivalent to text-to-image** *because the* **initial latent image is entirely random noise**.

## Inpainting

Inpainting is really just **a particular case of image-to-image**. **Noise is added to the parts of the image you want to inpaint.** **The amount of noise is similarly controlled by** *denoising strength*.
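A minimal sketch of that idea follows, assuming a binary mask where 1 marks a region to inpaint. The function name `add_noise_masked` and the linear blend are hypothetical simplifications, not how any particular implementation does it.

```python
import random

def add_noise_masked(image_latent, mask, strength, seed):
    """Add noise only where mask is 1; leave other values untouched."""
    if len(image_latent) != len(mask):
        raise ValueError("mask must have the same length as the latent")
    rng = random.Random(seed)
    result = []
    for value, masked in zip(image_latent, mask):
        noise = rng.gauss(0.0, 1.0)
        noisy = (1.0 - strength) * value + strength * noise
        result.append(noisy if masked else value)
    return result

latent = [0.5] * 8               # stand-in for an encoded image
mask = [1, 1, 0, 0, 1, 0, 0, 1]  # 1 = inpaint this position

noisy = add_noise_masked(latent, mask, strength=0.75, seed=3)
for before, after, masked in zip(latent, noisy, mask):
    print(masked, before, round(after, 3))
```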

## Depth-to-image

**Depth-to-image** *is an enhancement to image-to-image*; **it generates new images with additional conditioning** using a depth map.

### Step1

The input image is encoded into latent space.

### Step2

MiDaS (an AI depth model) estimates the depth map from the input image.

### Step3

Noise is added to the latent image. Denoising strength controls how much noise is added. If the denoising strength is 0, no noise is added. If the denoising strength is 1, the maximum noise is added so that the latent image becomes a random tensor.

### Step4

The noise predictor estimates the noise in latent space, *conditioned on the text prompt and the depth map*.
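The text does not say how the depth map reaches the noise predictor. One plausible arrangement, shown here purely as an assumption, is to resize the depth map to the latent resolution and stack it onto the latent as an extra input channel. The helper name `stack_depth_channel` is hypothetical.

```python
def stack_depth_channel(latent, depth_map):
    """Append a depth map as an extra channel to a (channels, H, W) latent.

    The depth map must already be resized to the latent's height and width.
    Returns a new list; the original latent is not modified.
    """
    height, width = len(latent[0]), len(latent[0][0])
    if len(depth_map) != height or any(len(row) != width for row in depth_map):
        raise ValueError("depth map must match the latent's height and width")
    return latent + [depth_map]

# Stand-ins: a 4x64x64 latent of zeros and a flat 64x64 depth map.
latent = [[[0.0] * 64 for _ in range(64)] for _ in range(4)]
depth_map = [[1.0] * 64 for _ in range(64)]

predictor_input = stack_depth_channel(latent, depth_map)
print(len(predictor_input), len(predictor_input[0]), len(predictor_input[0][0]))
```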

### Step5

Subtract the latent noise from the latent image. This becomes our **new latent image**.

Steps 4 and 5 are repeated for the number of sampling steps.

### Step6

The VAE decoder decodes the latent image. Now you get the final image from depth-to-image.

## Credit
