# ControlNet

## Overview

ControlNet is a neural network model for controlling [Stable Diffusion](https://aisuko.gitbook.io/wiki/ai-techniques/stable-diffusion) models. You can use ControlNet *<mark style="color:red;">**along with**</mark>* any Stable Diffusion model.

Stable Diffusion models support text-to-image generation. They use a text prompt as the conditioning that steers image generation, so the generated images match the prompt.

ControlNet adds one more conditioning on top of the text prompt. This extra conditioning can take many forms.

## How does ControlNet work?

ControlNet works by attaching trainable network modules to various parts of the U-Net (noise predictor) of the Stable Diffusion model. The weights of the Stable Diffusion model are locked so that they stay unchanged during training. Only the attached modules are modified during training.

<figure><img src="https://3515747285-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fqm1WfU7McQ1hgBrDXi90%2Fuploads%2Fgit-blob-3a1e15e5fde500b86695a128cccf3f7dfc9c78d5%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

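As the diagram from the [paper](https://arxiv.org/abs/2302.05543) shows, the attached module connects to the locked model through "zero convolution" layers whose weights are all zero initially. At the start of training the attached module therefore contributes nothing, so the new model behaves exactly like the trained and locked model and can take full advantage of it.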
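The effect of the zero initialization can be illustrated with a toy sketch. It loosely follows the paper's formulation (locked output plus a trainable branch wrapped by two zero-initialized connection layers), but it uses small vectors and plain matrices instead of feature maps and convolutions. All names and numbers here are hypothetical and chosen only for illustration.

```python
# Toy illustration: vectors stand in for feature maps, matrices for 1x1 convolutions.
# Every name and value here is hypothetical.

def linear(weights, x):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def add(a, b):
    """Element-wise sum of two vectors."""
    return [p + q for p, q in zip(a, b)]

def zeros(n):
    """An n x n matrix of zeros, standing in for a zero-initialized layer."""
    return [[0.0] * n for _ in range(n)]

# Stands in for a frozen (locked) Stable Diffusion U-Net block.
LOCKED_W = [[0.5, -1.0], [2.0, 0.25]]

def locked_block(x):
    return linear(LOCKED_W, x)

def controlled_block(x, control, copy_w, zero_in, zero_out):
    """Locked output plus the attached branch's contribution."""
    branch_input = add(x, linear(zero_in, control))   # control enters via zero_in
    branch_output = linear(copy_w, branch_input)      # trainable copy of the block
    return add(locked_block(x), linear(zero_out, branch_output))

x = [1.0, 2.0]
control = [3.0, -4.0]
copy_w = [row[:] for row in LOCKED_W]  # the copy starts from the locked weights

# At initialization both connection layers are zero: the branch adds nothing.
at_init = controlled_block(x, control, copy_w, zeros(2), zeros(2))
print(at_init == locked_block(x))

# After some (hypothetical) training the connection layers are non-zero,
# so the control signal now changes the output.
trained_in = [[0.1, 0.0], [0.0, 0.1]]
trained_out = [[0.2, 0.0], [0.0, 0.2]]
after_training = controlled_block(x, control, copy_w, trained_in, trained_out)
print(after_training != at_init)
```

Because the locked weights never change, the only way the output can move away from the original model's output is through the attached branch.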

During training, two conditionings are supplied along with each training image.

* The text prompt
* **The control map**
  * OpenPose
  * Canny edges
  * etc

The ControlNet model learns to generate images based on these two inputs.

Each control method is trained independently.

## Two ways to use ControlNet

* Edge detection
* Human pose detection

### Edge detection

ControlNet takes an additional input image and detects its outlines using the <mark style="color:red;">**Canny edge detector**</mark>. An image containing the detected edges is then saved as a <mark style="color:red;">**control map. It is fed into the ControlNet model as an extra conditioning in addition to the text prompt.**</mark>

<figure><img src="https://3515747285-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fqm1WfU7McQ1hgBrDXi90%2Fuploads%2Fgit-blob-5484017d4ab7feb5f6f896543aa4103f481e7775%2Fimage.png?alt=media" alt=""><figcaption><p>Stable Diffusion ControlNet with Canny edge conditioning. Source: stable diffusion art</p></figcaption></figure>

The process of extracting specific information (edges in this case) from the input image is called <mark style="color:red;">**annotation**</mark> (in [Adding Conditional Control to Text-to-image Diffusion Models](https://arxiv.org/abs/2302.05543)) or <mark style="color:red;">**preprocessing**</mark> (in the ControlNet extension).
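To make the idea of a preprocessor concrete, here is a minimal sketch that turns a grayscale image into a binary control map. It is deliberately not the Canny algorithm: it only thresholds a simple gradient and skips Canny's smoothing, non-maximum suppression, and hysteresis steps. In practice a library routine such as OpenCV's `cv2.Canny` would normally be used instead. The function name `edge_map` and the tiny input image are assumptions made for illustration.

```python
def edge_map(image, threshold):
    """Return a binary control map: 255 where the local gradient is strong, else 0.

    `image` is a grayscale image given as a list of rows of intensities (0-255).
    This is a simplified gradient threshold, not a full Canny detector.
    """
    height, width = len(image), len(image[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Central differences approximate the horizontal and vertical gradient.
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            if abs(gx) + abs(gy) >= threshold:
                edges[y][x] = 255
    return edges

# Hypothetical 6x6 input: dark left half, bright right half.
image = [[0, 0, 0, 255, 255, 255] for _ in range(6)]
control_map = edge_map(image, threshold=100)

# Show the control map: '#' marks an edge pixel, '.' marks background.
for row in control_map:
    print("".join("#" if value else "." for value in row))
```

The resulting map keeps only where the brightness changes, which is the kind of outline information that is passed on as the extra conditioning.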

### Human pose detection

*<mark style="color:red;">**Edge detection is not the only way an image can be preprocessed.**</mark>* [OpenPose](https://github.com/CMU-Perceptual-Computing-Lab/openpose) is a fast human keypoint detection model that can extract human poses, such as the positions of the hands, legs, and head:

<figure><img src="https://3515747285-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fqm1WfU7McQ1hgBrDXi90%2Fuploads%2Fgit-blob-f12e8ac9ce144f2410bf620699d38cf2070d483b%2Fimage.png?alt=media" alt="" width="300"><figcaption><p>Input image annotated with human pose detection using Openpose.</p></figcaption></figure>

#### ControlNet workflow using OpenPose

<figure><img src="https://3515747285-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fqm1WfU7McQ1hgBrDXi90%2Fuploads%2Fgit-blob-01f5160ea9c1f5cf9379b51612b9da66a4daa5fd%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

In this ControlNet workflow, <mark style="color:purple;">**key points are extracted from the input image using OpenPose**</mark> and saved as a control map containing the positions of key points. It is then fed to Stable Diffusion as an <mark style="color:purple;">**extra conditioning**</mark> together with the text prompt. Images are generated based on these two conditionings.
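A keypoint control map can be sketched as a blank canvas with only the detected positions marked. The keypoint names and coordinates below are hypothetical stand-ins for what a pose detector might return, and `pose_control_map` is an illustrative helper, not part of OpenPose. Real OpenPose renderings typically also draw limbs between keypoints; this sketch keeps only the points.

```python
# Hypothetical keypoints in (x, y) pixel coordinates, as a pose detector might return them.
keypoints = {
    "head": (4, 1),
    "left_hand": (1, 4),
    "right_hand": (7, 4),
    "left_foot": (3, 8),
    "right_foot": (5, 8),
}

def pose_control_map(keypoints, width, height):
    """Render keypoints onto a blank canvas; every other pixel stays 0."""
    canvas = [[0] * width for _ in range(height)]
    for name, (x, y) in keypoints.items():
        # Skip keypoints that fall outside the canvas.
        if 0 <= x < width and 0 <= y < height:
            canvas[y][x] = 255
    return canvas

control_map = pose_control_map(keypoints, width=9, height=10)

# Count how many pixels actually carry information.
marked = sum(value == 255 for row in control_map for value in row)
print(f"{marked} of {9 * 10} pixels are constrained")
```

Only a handful of pixels carry information, so details such as clothing, hair, and background are left to the text prompt and the model.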

### The difference between using Canny edge detection and OpenPose

The <mark style="color:green;">**Canny edge detector**</mark> <mark style="color:red;">**extracts**</mark> the *<mark style="color:red;">**edges of the subject**</mark>* and *<mark style="color:red;">**background alike**</mark>*. It tends to translate the scene *<mark style="color:red;">**more faithfully.**</mark>* For instance, the outline and hairstyle are preserved in the edge detection pictures above.

OpenPose (it reminds me of Xbox Kinect) only detects human key points such as the positions of the head, arms, etc. The image generation is *<mark style="color:red;">**more liberal**</mark>* but follows the original pose. For example, the woman jumping up with the left foot pointing sideways is <mark style="color:red;">**different from the original image and the one in the Canny edge example**</mark> <mark style="color:purple;">**because**</mark> OpenPose's keypoint detection does not specify the orientation of the feet.

## Difference between the Stable Diffusion depth model and ControlNet

Stability AI, the creator of Stable Diffusion, released a depth-to-image model. It shares a lot of similarities with ControlNet, but there are important differences.

### Similarities

* Both are Stable Diffusion models
* Both use two conditionings (a preprocessed image and text prompt)
* Both use MiDaS to estimate the depth map

### Differences

* The depth-to-image model is a v2 model, whereas ControlNet can be used with any v1 or v2 model. Support for **any** v1 model matters: it opens up depth conditioning not only to the v1.5 base model but also to the thousands of special models released by the community.
* ControlNet is more versatile: besides depth, it can condition on edge detection, pose detection, and so on.
* ControlNet's depth map has a higher resolution than depth-to-image's.

## Credit

{% embed url="<https://stable-diffusion-art.com/controlnet/#What_is_ControlNet>" %}
