ControlNet
Last updated
ControlNet is a neural network model for controlling Stable Diffusion models. You can use ControlNet along with any Stable Diffusion model.
Stable Diffusion models support text-to-image generation. They use the text prompt as conditioning to steer image generation so that the generated images match the prompt.
ControlNet adds one more conditioning in addition to the text prompt. The extra conditioning can take many forms in ControlNet.
ControlNet works by attaching trainable network modules to various parts of the U-Net (noise predictor) of the Stable Diffusion model. The weights of the Stable Diffusion model are locked so that they are unchanged during training. Only the attached modules are modified during training.
According to the diagram from the paper, the weights that connect the attached network module to the locked model are initially all zero. The module therefore contributes nothing at the start of training, which lets the new model take advantage of the trained and locked model.
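The effect of a zero-initialized connection can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the actual ControlNet architecture: `locked_block`, `control_branch`, and `zero_weight` are hypothetical names, and a single scalar stands in for the zero-initialized connecting layer.

```python
# Toy, pure-Python illustration; not the real ControlNet architecture.

def locked_block(x):
    """Stand-in for a frozen Stable Diffusion U-Net block (never trained)."""
    return [2.0 * v + 1.0 for v in x]


def control_branch(x, control):
    """Stand-in for the trainable attached module; mixes input and control map."""
    return [v + c for v, c in zip(x, control)]


def controlled_block(x, control, zero_weight):
    """Locked output plus the attached module's output scaled by `zero_weight`.

    `zero_weight` plays the role of the zero-initialized connection (assumption).
    """
    base = locked_block(x)
    extra = control_branch(x, control)
    return [b + zero_weight * e for b, e in zip(base, extra)]


x = [0.5, -1.0, 3.0]
control = [1.0, 1.0, 0.0]

# At the start of training the connection is zero, so the attached module
# contributes nothing and the output equals the locked model's output.
initial = controlled_block(x, control, zero_weight=0.0)
assert initial == locked_block(x)

# Once training moves the weight away from zero, the control map has an effect.
trained = controlled_block(x, control, zero_weight=0.1)
assert trained != locked_block(x)
```

Starting from zero means training begins from the locked model's existing behavior instead of from a randomly perturbed one.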
During training, two conditionings are supplied along with each training image:
The text prompt
The control map, such as OpenPose keypoints or Canny edges
The ControlNet model learns to generate images based on these two inputs.
Each control method is trained independently.
Edge detection
Human pose detection
ControlNet takes an additional input image and detects its outlines using the Canny edge detector. An image containing the detected edges is then saved as a control map. It is fed into the ControlNet model as an extra conditioning in addition to the text prompt.
The process of extracting specific information (edges in this case) from the input image is called annotation (in the paper, Adding Conditional Control to Text-to-Image Diffusion Models) or preprocessing (in the ControlNet extension).
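To make the preprocessing step concrete, here is a minimal sketch that turns a grayscale image into a binary edge map. It is a simplified stand-in, not the Canny detector itself: real Canny also smooths the image, thins edges with non-maximum suppression, and applies hysteresis thresholding, whereas this version only thresholds the gradient magnitude. The function name `edge_control_map` and the threshold value are assumptions made for illustration.

```python
def edge_control_map(image, threshold):
    """Return a binary edge map (1 = edge) from a 2D grayscale image.

    Simplified stand-in for Canny: thresholds the magnitude of
    central-difference gradients; border pixels are left as 0.
    """
    height, width = len(image), len(image[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = image[y][x + 1] - image[y][x - 1]  # horizontal change
            gy = image[y + 1][x] - image[y - 1][x]  # vertical change
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                edges[y][x] = 1
    return edges


# A 5x6 test image: dark left half, bright right half.
image = [[0, 0, 0, 255, 255, 255] for _ in range(5)]
control_map = edge_control_map(image, threshold=100)

# Print the control map: '#' marks detected edge pixels.
for row in control_map:
    print("".join("#" if v else "." for v in row))
```

The resulting map keeps only the boundary between the dark and bright regions, which is the kind of information passed on as the extra conditioning.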
Edge detection is not the only way an image can be preprocessed. OpenPose is a fast human keypoint detection model that can extract human poses, such as the positions of the hands, legs, and head.
In this ControlNet workflow, key points are extracted from the input image using OpenPose and saved as a control map containing the positions of key points. It is then fed to Stable Diffusion as an extra conditioning together with the text prompt. Images are generated based on these two conditionings.
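The shape of such a keypoint control map can be sketched as follows. This does not run OpenPose; the `keypoints` dictionary holds hypothetical positions of the kind a pose detector might return, and `pose_control_map` is an illustrative helper, not part of any library.

```python
# Hypothetical keypoints as (x, y) pixel positions; not real OpenPose output.
keypoints = {
    "head": (4, 0),
    "left_hand": (1, 2),
    "right_hand": (7, 2),
    "left_foot": (3, 5),
    "right_foot": (5, 5),
}


def pose_control_map(keypoints, width, height):
    """Render keypoints onto a blank canvas; every other pixel stays 0."""
    canvas = [[0] * width for _ in range(height)]
    for x, y in keypoints.values():
        if 0 <= x < width and 0 <= y < height:
            canvas[y][x] = 1
    return canvas


control_map = pose_control_map(keypoints, width=9, height=6)

# Print the control map: 'o' marks a keypoint.
for row in control_map:
    print("".join("o" if v else "." for v in row))
```

Each body part is reduced to a single point, so the map records where a foot is but not which way it points. Everything else about the scene is left for the text prompt to decide.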
The Canny edge detector extracts the edges of the subject and background alike. It tends to translate the scene more faithfully. For instance, the outline and hairstyle are preserved in the edge detection example above.
OpenPose (which reminds me of Xbox Kinect) only detects human key points such as the positions of the head, arms, etc. The image generation is more liberal but follows the original pose. For example, the woman jumping up with her left foot pointing sideways differs from both the original image and the Canny edge example, because OpenPose's keypoint detection does not specify the orientation of the feet.
Stability AI, the creator of Stable Diffusion, released a depth-to-image model. It shares a lot of similarities with ControlNet, but there are important differences.
Both are Stable Diffusion models
Both use two conditionings (a preprocessed image and text prompt)
Both use MiDaS to estimate the depth map
The depth-to-image model is a v2 model, whereas ControlNet can be used with any v1 or v2 model. Because ControlNet works with any v1 model, it opens up depth conditioning not only to the v1.5 base model but also to the thousands of special models released by the community.
ControlNet is more versatile: besides depth, it can condition on edge detection, pose detection, and so on.
ControlNet's depth map has a higher resolution than depth-to-image's