Diffusion in images
The math of these models looks very much like diffusion in physics, which is why they are called diffusion models.
Stable Diffusion is a text-to-image latent diffusion model. It is called a latent diffusion model because it works with a lower-dimensional representation of the image instead of the actual pixel space, which makes it more memory-efficient.
An encoder compresses the image into this smaller latent representation, and a decoder converts the compressed representation back into an image.
For text-to-image models, you also need a tokenizer and a text encoder to generate text embeddings from the prompt.
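As a rough sketch, these components can be loaded with the Hugging Face diffusers and transformers libraries. The checkpoint name and the tensor shapes below are assumptions for illustration, not the only possible choices.

```python
# Rough sketch of the main Stable Diffusion components, using the Hugging Face
# diffusers and transformers libraries (the checkpoint name is an assumption).
import torch
from diffusers import AutoencoderKL
from transformers import CLIPTokenizer, CLIPTextModel

model_id = "CompVis/stable-diffusion-v1-4"  # assumed checkpoint for illustration

# VAE: the encoder compresses images into latents, the decoder maps latents back
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")

# Tokenizer + text encoder turn a prompt into text embeddings
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")

prompt = "a photo of a cat"
tokens = tokenizer(prompt, padding="max_length",
                   max_length=tokenizer.model_max_length,
                   return_tensors="pt")

with torch.no_grad():
    text_embeddings = text_encoder(tokens.input_ids)[0]   # shape (1, 77, 768)

    # Encode an image (a random tensor as a stand-in) into latent space and back
    image = torch.randn(1, 3, 512, 512)
    latents = vae.encode(image).latent_dist.sample()       # shape (1, 4, 64, 64)
    reconstructed = vae.decode(latents).sample              # back to pixel space
```

Note how the latent tensor is much smaller than the pixel-space image, which is where the memory savings come from.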
Stable Diffusion is a text-to-image deep-learning model.
The forward diffusion process adds noise to a training image, gradually turning it into an unrecognizable noise image. The forward process will turn any cat or dog image into a noise image; eventually, you won't be able to tell whether it was initially a dog or a cat.
It's like a drop of ink falling into a glass of water. The ink diffuses through the water, and after a few minutes it is randomly distributed throughout the glass. You can no longer tell whether it initially fell at the center or near the rim.
Example of forward diffusion of a cat image
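Here is a minimal sketch of what the forward process looks like in code. The linear beta schedule and step count are illustrative assumptions, not Stable Diffusion's exact settings.

```python
# Minimal sketch of forward diffusion: mixing an image with Gaussian noise
# according to a noise schedule (all numbers are illustrative assumptions).
import torch

num_steps = 1000
betas = torch.linspace(1e-4, 0.02, num_steps)        # noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal fraction

def add_noise(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Return a noisy version of x0 at step t (closed-form forward process)."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

cat_image = torch.rand(3, 64, 64)              # stand-in for a real cat photo
slightly_noisy = add_noise(cat_image, 50)       # still recognizable
almost_pure_noise = add_noise(cat_image, 999)   # original content is gone
```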
The main idea of reverse diffusion is that, starting from a noisy, meaningless image, the process recovers a cat or a dog image.
For reverse diffusion in latent space, please see here.
Every diffusion process has two parts (see the equation after this list):
Drift or directed motion
Random motion
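In the stochastic differential equation view of diffusion (standard score-based notation, added here only for illustration), these two parts show up as a drift term and a random Brownian-motion term:

```latex
% Drift plus random motion, in standard diffusion-SDE notation
% (illustrative; this equation is not part of the original note).
dx_t = \underbrace{f(x_t, t)\,dt}_{\text{drift (directed motion)}}
     + \underbrace{g(t)\,dW_t}_{\text{random motion}}
```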
The reverse diffusion drifts towards either cat or dog images, but nothing in between. That's why the result can be either a cat or a dog.
How can we reverse the diffusion? I agree this is a million-dollar question.
To reverse the diffusion, we need to know how much noise was added to an image. The answer is to use a noise predictor.
The noise predictor is a neural network model trained to predict the noise added, and it is a U-Net model.
Here is the training process for the noise predictor (a code sketch follows the list):
Pick a training image, like a photo of a cat.
Generate a random noise image
Corrupt the training image by adding this noise image up to a certain number of steps
Teach the noise predictor to tell us how much noise was added. This is done by tuning its weights and showing it the correct answer.
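A minimal training-loop sketch of these steps is shown below, using a small U-Net from the diffusers library. The model size, noise schedule, and learning rate are illustrative assumptions, not the exact Stable Diffusion recipe.

```python
# Minimal training-loop sketch for the noise predictor (a small U-Net).
# Model size, schedule, and hyperparameters are illustrative assumptions.
import torch
import torch.nn.functional as F
from diffusers import UNet2DModel

num_steps = 1000
betas = torch.linspace(1e-4, 0.02, num_steps)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

model = UNet2DModel(sample_size=64, in_channels=3, out_channels=3)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def training_step(clean_images: torch.Tensor) -> torch.Tensor:
    batch = clean_images.shape[0]
    # 1) pick a random noise level (step) for each training image
    t = torch.randint(0, num_steps, (batch,))
    # 2) generate random noise and corrupt the images with it
    noise = torch.randn_like(clean_images)
    a_bar = alphas_cumprod[t].view(batch, 1, 1, 1)
    noisy_images = a_bar.sqrt() * clean_images + (1 - a_bar).sqrt() * noise
    # 3) teach the model to predict the noise that was added
    predicted_noise = model(noisy_images, t).sample
    loss = F.mse_loss(predicted_noise, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss

# Usage: call training_step(batch_of_cat_and_dog_images) in a loop over the dataset.
```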
In the above picture, noise is sequentially added at each step. The noise predictor estimates the total noise added up to each step.
After training, we have a noise predictor capable of estimating the noise added to an image. Here is how it is used to generate an image (a sampling sketch follows the list):
Generate a completely random image and ask the noise predictor to estimate the noise in it
Subtract the estimated noise from the image
Repeat this process a few times
We will get an image of either a cat or a dog.
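Here is a minimal, DDPM-style sketch of that sampling loop, reusing the noise predictor trained above. The update rule is the standard DDPM step; the schedule constants are illustrative assumptions.

```python
# Minimal reverse-diffusion (DDPM-style) sampling sketch. 'model' is assumed to
# be a trained noise predictor like the U-Net from the previous snippet.
import torch

num_steps = 1000
betas = torch.linspace(1e-4, 0.02, num_steps)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample(model, shape=(1, 3, 64, 64)):
    x = torch.randn(shape)                       # start from a completely random image
    for t in reversed(range(num_steps)):
        # ask the noise predictor to estimate the noise in the current image
        predicted_noise = model(x, torch.tensor([t])).sample
        # subtract the (scaled) estimated noise: standard DDPM update step
        x = (x - (1 - alphas[t]) / (1 - alphas_cumprod[t]).sqrt() * predicted_noise) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)   # re-add a little randomness
    return x   # after the loop, this should resemble a cat or a dog

# image = sample(model)
```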
There is no control over whether a cat or a dog image is generated (this is unconditioned generation). For more detail, see conditioning.