Stable Diffusion model

Overview

Diffusion models that operate directly in image space are slow because the image space is enormous. For example, one 512x512 image with three color channels (red, green, and blue) is a 786,432-dimensional space: that many values must be specified for ONE image. Latent diffusion models were introduced to solve this problem.

Latent diffusion model

Stable Diffusion is designed to solve the speed problem.

Stable Diffusion is a latent diffusion model. Instead of operating in the high-dimensional image space, it first compresses the image into the latent space. The latent space is 48 times smaller, so it reaps the benefits of crunching a lot fewer numbers. That's why it's a lot faster.
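The factor of 48 can be checked with a few lines of arithmetic. This is only a sanity check, assuming the 4x64x64 latent size that corresponds to a 512x512 image; the variable names are illustrative.

```python
# Number of values needed to describe one image in pixel space vs. latent space.
pixel_dims = 512 * 512 * 3   # width x height x RGB channels
latent_dims = 4 * 64 * 64    # latent channels x latent height x latent width

compression = pixel_dims // latent_dims

print(pixel_dims)    # 786432
print(latent_dims)   # 16384
print(compression)   # 48
```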

Image resolution

The image resolution is reflected in the size of the latent image tensor. The size of the latent image is 4x64x64 for 512x512 images only. It is 4x96x64 for a 768x512 portrait image (768 pixels tall, 512 pixels wide). That's why it takes longer and more VRAM to generate a larger image.
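The relationship between image size and latent size can be sketched as a small helper. The downscale factor of 8 is inferred from the sizes quoted above (512 / 64 = 8), and the function names `latent_shape` and `latent_size` are hypothetical, not part of any library.

```python
def latent_shape(width, height, channels=4, downscale=8):
    """Return (channels, latent_height, latent_width) for an image size in pixels."""
    if width % downscale or height % downscale:
        raise ValueError("width and height must be multiples of the downscale factor")
    return (channels, height // downscale, width // downscale)


def latent_size(width, height):
    """Total number of values in the latent tensor."""
    c, h, w = latent_shape(width, height)
    return c * h * w


print(latent_shape(512, 512))  # (4, 64, 64)
print(latent_shape(512, 768))  # (4, 96, 64)

# The portrait latent holds 1.5x as many values as the square one,
# which is one reason larger images need more time and memory.
print(latent_size(512, 768) / latent_size(512, 512))  # 1.5
```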

Since Stable Diffusion v1 is fine-tuned on 512x512 images, generating images larger than 512x512 could result in duplicate objects, for example the infamous two heads. If you must generate a larger image, keep at least one side at 512 pixels and use an AI upscaler for higher resolution.

The high dimensionality of images is largely an artifact of the pixel representation. Natural images can be readily compressed into the much smaller latent space with little loss of information. This idea is known as the manifold hypothesis in machine learning.

Why is latent space possible?

A VAE (variational autoencoder) can compress an image into a much smaller latent space without losing much information. The reason is that natural images are not random. They are highly regular.
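A toy experiment can illustrate why regularity matters. The sketch below is not a VAE: it uses plain 8x8 block averaging as a crude stand-in for an encoder and block repetition as a stand-in for a decoder, and all function names are hypothetical. It only shows that a regular image survives heavy compression far better than random pixels do.

```python
import numpy as np


def compress(image, factor=8):
    # Average each factor x factor block into one value (a crude "encoder").
    h, w = image.shape
    return image.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))


def reconstruct(small, factor=8):
    # Repeat each value back into a factor x factor block (a crude "decoder").
    return small.repeat(factor, axis=0).repeat(factor, axis=1)


def reconstruction_error(image):
    # Mean squared error between the image and its compressed-then-restored copy.
    return float(np.mean((image - reconstruct(compress(image))) ** 2))


rng = np.random.default_rng(0)
ramp = np.linspace(0.0, 1.0, 512)
smooth = np.add.outer(ramp, ramp) / 2.0  # regular image: a gentle gradient
noise = rng.random((512, 512))           # irregular image: independent random pixels

print(reconstruction_error(smooth) < reconstruction_error(noise))  # True
```

The error for the smooth image is small but not zero, so even this regular image loses some fine detail when compressed.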

Here's how latent reverse diffusion in Stable Diffusion works.

  1. A random latent space matrix is generated.

  2. The noise predictor estimates the noise of the latent matrix.

  3. The estimated noise is then subtracted from the latent matrix.

  4. Steps 2 and 3 are repeated for a specified number of sampling steps.

  5. The decoder of VAE converts the latent matrix to the final image.
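The five steps above can be sketched as a short loop. This is a minimal illustration of the control flow only: `toy_noise_predictor` and `toy_decoder` are hypothetical stand-ins for the trained noise predictor and the VAE decoder, and a real sampler also scales the subtracted noise according to a noise schedule, which is omitted here.

```python
import numpy as np


def toy_noise_predictor(latent, step):
    # Hypothetical stand-in for the trained noise predictor.
    # It simply treats a fixed fraction of the current latent as noise.
    return 0.1 * latent


def toy_decoder(latent):
    # Hypothetical stand-in for the VAE decoder:
    # upsample each latent value 8x in height and width, keep 3 channels.
    upsampled = latent.repeat(8, axis=1).repeat(8, axis=2)
    return upsampled[:3]


def reverse_diffusion(shape=(4, 64, 64), sampling_steps=20, seed=0):
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal(shape)            # step 1: random latent matrix
    for step in range(sampling_steps):             # step 4: repeat steps 2 and 3
        noise = toy_noise_predictor(latent, step)  # step 2: estimate the noise
        latent = latent - noise                    # step 3: subtract it
    return toy_decoder(latent)                     # step 5: decode to an image


image = reverse_diffusion()
print(image.shape)  # (3, 512, 512)
```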

What is a VAE file?

VAE files are used in Stable Diffusion v1 to improve eyes and faces. They are the decoder of the autoencoder we just talked about. By further fine-tuning the decoder, the model can paint finer details.

However, compressing an image into the latent space does lose some information: the original VAE does not recover the fine details. Instead, the VAE decoder is responsible for painting fine details.
