Stable Diffusion workflow
Let's go through some examples of what happens under the hood.
We give Stable Diffusion a text prompt, and it returns an image.
Step 1. Stable Diffusion generates a random tensor in the latent space. We control this tensor by setting the seed of the random number generator: a given seed always produces the same random tensor. This tensor is your image in latent space, but for now it is all noise.
Step 2. The noise predictor U-Net takes the noisy latent image and the text prompt as input and predicts the noise, also in latent space (a 4x64x64 tensor).
Step 3. Subtract the predicted latent noise from the latent image. This becomes our new latent image.
Steps 2 and 3 are repeated for a certain number of sampling steps, for example 20 times.
Step 4. Finally, the VAE decoder converts the latent image back to pixel space. This is the image you get after running Stable Diffusion.
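The loop described above can be sketched in a few lines of plain Python. This is only a toy under stated assumptions: `random_latent`, `toy_noise_predictor`, and `sample` are hypothetical names, the latent is a flat list rather than a real tensor, and the "predictor" is a placeholder that ignores the prompt. In Stable Diffusion the predictor is a trained U-Net, and a VAE decoder would turn the final latent into pixels.

```python
import random

LATENT_SHAPE = (4, 64, 64)  # channels, height, width, as quoted in the text


def random_latent(seed, shape=LATENT_SHAPE):
    """Return Gaussian noise as a flat list; the seed fully determines it."""
    rng = random.Random(seed)
    size = shape[0] * shape[1] * shape[2]
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


def toy_noise_predictor(latent, prompt, step, total_steps):
    """Hypothetical stand-in for the U-Net noise predictor.

    It ignores the prompt and simply treats 1/remaining of the current
    latent as noise. A real predictor is a trained neural network.
    """
    remaining = total_steps - step
    return [value / remaining for value in latent]


def sample(prompt, seed, steps=20):
    """Run the sampling loop: predict noise, subtract it, repeat."""
    latent = random_latent(seed)  # start from seeded noise
    for step in range(steps):
        noise = toy_noise_predictor(latent, prompt, step, steps)
        latent = [x - n for x, n in zip(latent, noise)]  # new latent image
    return latent  # a VAE decoder would convert this to pixel space


first = sample("photo of a green apple", seed=42)
second = sample("photo of a green apple", seed=42)
print(first == second)  # the same seed reproduces the same result
```

Because the only source of randomness is the seeded generator, running `sample` twice with the same seed returns identical latents, which is the behaviour the seed setting is meant to give.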
The image changes from noisy to clean. You might wonder whether this is because the noise predictor does not work well in the initial steps. That is only partly true. The real reason is that we try to reach an expected noise level at each sampling step. This is called the noise schedule. Here is an example.
The noise schedule is something we define. We can choose to subtract the same amount of noise at each step. Or we can subtract more in the beginning, like above. The sampler subtracts just enough noise in each step to reach the expected noise in the next step. That's what you see in the step-by-step image.
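To make the two options concrete, here is a minimal sketch of a schedule that removes the same amount of noise at every step and one that removes more at the beginning. The function names and the quadratic shape are assumptions chosen for illustration; they are not the schedules of any particular sampler.

```python
def linear_schedule(steps, start=1.0):
    """Expected noise level after each step, dropping by a constant amount."""
    return [start * (1 - i / steps) for i in range(steps + 1)]


def front_loaded_schedule(steps, start=1.0):
    """Expected noise level that drops faster in the early steps.

    The quadratic decay is an illustrative assumption.
    """
    return [start * (1 - i / steps) ** 2 for i in range(steps + 1)]


def noise_removed_per_step(schedule):
    """How much noise the sampler must subtract to reach the next level."""
    return [schedule[i] - schedule[i + 1] for i in range(len(schedule) - 1)]


print(noise_removed_per_step(linear_schedule(4)))
# [0.25, 0.25, 0.25, 0.25]
print(noise_removed_per_step(front_loaded_schedule(4)))
# [0.4375, 0.3125, 0.1875, 0.0625]
```

Both schedules start at the same noise level and end at zero; they differ only in how the total is split across steps.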
Image-to-image was first proposed in SDEdit. SDEdit can be applied to any diffusion model, so image-to-image is also available for Stable Diffusion (a latent diffusion model).
Image-to-image takes an input image and a text prompt as input. The generated image is conditioned on both. For example, using this amateur drawing and the prompt "photo of perfect green apple with stem, water droplets, dramatic lighting" as inputs, image-to-image can turn it into a professional drawing:
Step 1. The input image is encoded to latent space.
Step 2. Noise is added to the latent image. Denoising strength controls how much noise is added. If it is 0, no noise is added. If it is 1, the maximum amount of noise is added, so the latent image becomes a completely random tensor.
Step 3. The noise predictor U-Net takes the noisy latent image and the text prompt as input and predicts the noise in latent space (a 4x64x64 tensor).
Step 4. Subtract the predicted latent noise from the latent image. This becomes our new latent image.
Steps 3 and 4 are repeated for a certain number of sampling steps, for example 20 times.
Step 5. Finally, the VAE decoder converts the latent image back to pixel space. This is the image you get after running image-to-image.
All image-to-image does is set the initial latent image to a bit of noise and a bit of the input image. Setting denoising strength to 1 is equivalent to text-to-image because the initial latent image is then entirely random noise.
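One way to picture the starting point is as a blend of the encoded image and random noise, weighted by the denoising strength. The sketch below assumes a simple linear blend and a hypothetical `noisy_start` helper. Real implementations add noise according to the noise schedule rather than by linear mixing, so treat this only as an illustration of the two extremes.

```python
import random


def noisy_start(image_latent, strength, seed):
    """Blend an encoded image with seeded Gaussian noise.

    strength = 0 returns the image latent unchanged;
    strength = 1 returns pure noise, as in text-to-image.
    The linear blend is an assumption made for illustration.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in image_latent]
    return [(1 - strength) * x + strength * n
            for x, n in zip(image_latent, noise)]


image_latent = [0.2, -0.7, 1.1, 0.0]  # tiny stand-in for an encoded image
print(noisy_start(image_latent, strength=0.0, seed=1) == image_latent)  # True
```

The resulting latent would then go through the same predict-and-subtract loop as text-to-image.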
Inpainting is really just a particular case of image-to-image. Noise is added only to the parts of the image you want to inpaint. The amount of noise is similarly controlled by denoising strength.
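The masked version of the same idea can be sketched as follows. The helper name `add_noise_to_masked`, the flat-list latent, and the linear blend are all assumptions for illustration. The point is only that positions outside the mask keep their original values.

```python
import random


def add_noise_to_masked(latent, mask, strength, seed):
    """Add noise only where mask is truthy (the region to inpaint).

    Unmasked positions are returned unchanged. The linear blend is a
    simplification, not how a real sampler injects noise.
    """
    rng = random.Random(seed)
    result = []
    for value, masked in zip(latent, mask):
        noise = rng.gauss(0.0, 1.0)
        if masked:
            result.append((1 - strength) * value + strength * noise)
        else:
            result.append(value)
    return result


latent = [0.5, -0.5, 1.0, 2.0]
mask = [1, 0, 0, 1]  # 1 marks the region to regenerate
noisy = add_noise_to_masked(latent, mask, strength=1.0, seed=3)
print(noisy[1] == latent[1] and noisy[2] == latent[2])  # True: unmasked kept
```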
Depth-to-image is an enhancement to image-to-image; it generates new images with additional conditioning using a depth map.
Step 1. The input image is encoded into the latent space.
Step 2. MiDaS (an AI depth model) estimates the depth map from the input image.
Step 3. Noise is added to the latent image. Denoising strength controls how much noise is added. If the denoising strength is 0, no noise is added. If the denoising strength is 1, the maximum noise is added, so the latent image becomes a random tensor.
Step 4. The noise predictor estimates the noise in latent space, conditioned on the text prompt and the depth map.
Step 5. Subtract the predicted latent noise from the latent image. This becomes our new latent image.
Steps 4 and 5 are repeated for the number of sampling steps.
Step 6. The VAE decoder decodes the latent image. Now you get the final image from depth-to-image.
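How the depth map reaches the noise predictor is not spelled out above. One common approach, assumed here, is to rescale the depth map and append it to the latent as an extra channel, so the predictor sees five channels instead of four. The helper names and the tiny 2x2 latent below are hypothetical and only meant to show the shape of the idea.

```python
def normalize_depth(depth_map):
    """Rescale a 2D depth map (list of rows) to the range [-1, 1]."""
    flat = [d for row in depth_map for d in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant depth map carries no information; map it to zeros.
        return [[0.0 for _ in row] for row in depth_map]
    return [[2.0 * (d - lo) / (hi - lo) - 1.0 for d in row]
            for row in depth_map]


def add_depth_channel(latent, depth_map):
    """Append the normalized depth map as one more channel of the latent.

    latent is a list of channels, each a list of rows. The depth map must
    already match the latent's height and width.
    """
    height, width = len(latent[0]), len(latent[0][0])
    if len(depth_map) != height or any(len(row) != width for row in depth_map):
        raise ValueError("depth map must match the latent resolution")
    return latent + [normalize_depth(depth_map)]


latent = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(4)]  # 4 channels of 2x2
depth = [[0.0, 5.0], [10.0, 10.0]]
conditioned = add_depth_channel(latent, depth)
print(len(conditioned))  # 5
print(conditioned[-1])   # [[-1.0, 0.0], [1.0, 1.0]]
```

In a real pipeline the depth map would first be resized to the latent resolution (64x64 in the example used in this article) before being attached.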