👷Stable Diffusion workflow

Overview

Let's go through some examples of what happens under the hood.

Text-to-image

We give Stable Diffusion a text prompt, and it returns an image.

Step1

Stable Diffusion generates a random tensor in the latent space. We control this tensor by setting the seed of the random number generator. If we set the seed to a certain value, we will always get the same random tensor. This is your image in latent space. But it is all noise for now.
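To make the role of the seed concrete, here is a minimal sketch using NumPy rather than Stable Diffusion's own code. The function name `initial_latent` and the default 4x64x64 shape are assumptions chosen for illustration.

```python
import numpy as np

def initial_latent(seed: int, shape=(4, 64, 64)) -> np.ndarray:
    """Draw a Gaussian noise tensor in latent space from a seeded generator."""
    rng = np.random.default_rng(seed)  # the seed fully determines the tensor
    return rng.standard_normal(shape).astype(np.float32)

a = initial_latent(42)
b = initial_latent(42)   # same seed: identical tensor
c = initial_latent(43)   # different seed: different tensor

print(np.array_equal(a, b), np.array_equal(a, c))
```

The same seed always reproduces the same starting noise, which is why fixing the seed (along with the prompt and settings) reproduces the same image.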

Step2

The noise predictor U-Net takes the latent noisy image and text prompt as input and predicts the noise, also in latent space (a 4x64x64 tensor).

Step3

Subtract the latent noise from the latent image. This becomes our new latent image.

Steps 2 and 3 are repeated for a certain number of sampling steps, for example, 20 times.
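The loop formed by Steps 1 to 3 can be sketched as follows. This is a toy under stated assumptions, not the real sampler: `toy_noise_predictor` is a hypothetical stand-in for the U-Net, and `target` stands in for the clean latent image that the prompt describes. A real sampler scales each subtraction according to its noise schedule; here the predicted noise is simply removed in equal parts.

```python
import numpy as np

def toy_noise_predictor(latent, target):
    """Hypothetical stand-in for the U-Net noise predictor.

    It treats everything that differs from `target` as noise. `target` plays
    the role of the clean latent image the prompt describes.
    """
    return latent - target

def sample(target, seed=0, sampling_steps=20):
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal(target.shape)   # Step 1: random latent tensor
    for step in range(sampling_steps):
        predicted_noise = toy_noise_predictor(latent, target)  # Step 2
        remaining = sampling_steps - step
        # Step 3: subtract a share of the predicted noise (all of it on the last step)
        latent = latent - predicted_noise / remaining
    return latent

# A made-up "clean" latent: a bright square in the first channel.
target = np.zeros((4, 64, 64))
target[0, 16:48, 16:48] = 1.0

result = sample(target)
print(np.allclose(result, target))
```

Because the toy predictor is exact, the loop lands on `target` after the final step. A trained U-Net only estimates the noise, which is one reason several sampling steps are needed.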

Step4

Finally, the VAE decoder converts the latent image back to pixel space. This is the image you get after running Stable Diffusion.
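To see what moving between latent space and pixel space means for tensor sizes, here is a small shape calculation. It assumes the Stable Diffusion v1 setup, where the VAE scales each spatial dimension by a factor of 8 and the output is a 3-channel RGB image. The helper `pixel_shape` is hypothetical and does no decoding.

```python
def pixel_shape(latent_shape, scale=8, image_channels=3):
    """Shape of the decoded image for a given latent shape (assumed 8x VAE)."""
    _, height, width = latent_shape
    return (image_channels, height * scale, width * scale)

def count(shape):
    """Number of values in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total

latent = (4, 64, 64)
image = pixel_shape(latent)
ratio = count(image) // count(latent)

print(image, ratio)
```

Under these assumptions the latent image holds 48 times fewer values than the pixel image, which is why running the sampling loop in latent space is cheaper.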

The image evolves at each sampling step.

Noise schedule

The image changes from noisy to clean. You might wonder whether this is because the noise predictor does not work well in the initial steps. This is only partly true. The real reason is that we try to reach an expected noise level at each sampling step. This is called the noise schedule. Here is an example.

The noise schedule is something we define. We can choose to subtract the same amount of noise at each step. Or we can subtract more in the beginning, like above. The sampler subtracts just enough noise in each step to reach the expected noise in the next step. That's what you see in the step-by-step image.
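The two choices described here can be written down as lists of expected noise levels. The sketch below is illustrative only: the function names, the normalized noise range from 1 to 0, and the quadratic shape of the front-loaded schedule are assumptions, not the schedules any particular sampler ships with.

```python
def linear_schedule(steps, start=1.0):
    """Expected noise level that drops by the same amount at every step."""
    return [start * (1 - i / steps) for i in range(steps + 1)]

def front_loaded_schedule(steps, start=1.0, power=2.0):
    """Expected noise level that drops faster in the early steps."""
    return [start * (1 - i / steps) ** power for i in range(steps + 1)]

def noise_removed_per_step(schedule):
    """How much noise the sampler must subtract to reach the next level."""
    return [now - nxt for now, nxt in zip(schedule, schedule[1:])]

lin = linear_schedule(5)
front = front_loaded_schedule(5)
removed_lin = noise_removed_per_step(lin)
removed_front = noise_removed_per_step(front)

print([round(x, 2) for x in removed_lin])
print([round(x, 2) for x in removed_front])
```

Each entry of `noise_removed_per_step` is the amount the sampler subtracts in that step so that the image arrives at the expected noise level of the next step.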

Image-to-image

Image-to-image was first proposed in the SDEdit method. SDEdit can be applied to any diffusion model, so image-to-image is also available for Stable Diffusion (a latent diffusion model).

An input image and a text prompt are supplied as the inputs to image-to-image. The generated image is conditioned on both the input image and the text prompt. For example, using this amateur drawing and the prompt "photo of perfect green apple with stem, water droplets, dramatic lighting" as inputs, image-to-image can turn it into a professional drawing:

Step1

The input image is encoded to latent space.

Step2

Noise is added to the latent image. Denoising strength controls how much noise is added. If it is 0, no noise is added. If it is 1, the maximum amount of noise is added so that the latent image becomes a completely random tensor.
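The effect of denoising strength can be sketched with a variance-preserving mix of the latent image and Gaussian noise. Mapping the strength directly to the noise weight is an assumption made for illustration. Real implementations derive the noise level from the sampler's noise schedule, and `add_noise` is a hypothetical helper.

```python
import numpy as np

def add_noise(latent, strength, seed=0):
    """Mix a latent image with Gaussian noise according to denoising strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(latent.shape)
    # Assumed mapping: strength is used directly as the noise variance.
    return np.sqrt(1.0 - strength) * latent + np.sqrt(strength) * noise

image_a = np.full((4, 64, 64), 0.5)   # two made-up encoded input images
image_b = np.full((4, 64, 64), -0.5)

untouched = add_noise(image_a, strength=0.0)
from_a = add_noise(image_a, strength=1.0)
from_b = add_noise(image_b, strength=1.0)

print(np.array_equal(untouched, image_a))  # strength 0 leaves the latent as is
print(np.array_equal(from_a, from_b))      # strength 1 ignores the input image
```

At strength 1 the result depends only on the seed, not on the input image, which matches the description of a completely random tensor.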

Step3

The noise predictor U-Net takes the latent noisy image and text prompt as input and predicts the noise in latent space (a 4x64x64 tensor).

Step4

Subtract the latent noise from the latent image. This becomes our new latent image.

Steps 3 and 4 are repeated for a certain number of sampling steps, for example, 20 times.

Step5

Finally, the VAE decoder converts the latent image back to pixel space. This is the image you get after running image-to-image.

All image-to-image does is set the initial latent image to a mix of noise and the input image. Setting denoising strength to 1 is equivalent to text-to-image because the initial latent image is entirely random noise.

Inpainting

Inpainting is really just a particular case of image-to-image. Noise is added to the parts of the image you want to inpaint. The amount of noise is similarly controlled by denoising strength.
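Restricting the noise to a region can be sketched with a boolean mask over the latent image. As before, this is a simplified illustration: `add_noise_in_mask` is a hypothetical helper, the mask is assumed to be already at latent resolution, and using the strength directly as the noise weight is an assumption.

```python
import numpy as np

def add_noise_in_mask(latent, mask, strength, seed=0):
    """Add noise only where `mask` is True; keep the rest of the latent intact."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(latent.shape)
    noised = np.sqrt(1.0 - strength) * latent + np.sqrt(strength) * noise
    # mask has shape (H, W) and broadcasts across the latent channels
    return np.where(mask, noised, latent)

latent = np.random.default_rng(1).standard_normal((4, 64, 64))
mask = np.zeros((64, 64), dtype=bool)
mask[16:48, 16:48] = True             # region to inpaint

result = add_noise_in_mask(latent, mask, strength=0.75)

print(np.array_equal(result[:, ~mask], latent[:, ~mask]))  # outside: unchanged
print(np.allclose(result[:, mask], latent[:, mask]))       # inside: noised
```

Only the masked region is regenerated during sampling, while the area outside the mask keeps the content of the input image.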

Depth-to-image

Depth-to-image is an enhancement to image-to-image; it generates new images with additional conditioning using a depth map.

Step1

The input image is encoded into the latent space.

Step2

MiDaS (an AI depth model) estimates the depth map from the input image.

Step3

Noise is added to the latent image. Denoising strength controls how much noise is added. If the denoising strength is 0, no noise is added. If the denoising strength is 1, the maximum noise is added so that the latent image becomes a random tensor.