πŸ“Ό Stable Diffusion v1 vs v2

Model difference

Stable Diffusion v2 uses OpenCLIP for text embedding, while Stable Diffusion v1 uses OpenAI's CLIP ViT-L/14. The reasons for this change are:

  β€’ OpenCLIP is up to five times larger. A larger text encoder model improves image quality.

  β€’ More transparency. OpenAI's CLIP models are open source, but the models were trained with proprietary data.
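
To see the encoder difference concretely, here is a minimal sketch using the Hugging Face diffusers library that loads both pipelines and reports the size of their bundled text encoders. The model IDs and the inspection code are illustrative assumptions, not something from the original comparison.

```python
# Minimal sketch: compare the text encoders bundled with SD v1 and SD v2.
# Assumes diffusers and transformers are installed and the Hugging Face
# model IDs below are reachable (they are not stated in the text above).
import torch
from diffusers import StableDiffusionPipeline

def describe_text_encoder(model_id: str) -> None:
    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    enc = pipe.text_encoder
    n_params = sum(p.numel() for p in enc.parameters())
    print(model_id)
    print(f"  encoder class : {enc.__class__.__name__}")
    print(f"  hidden size   : {enc.config.hidden_size}")
    print(f"  parameters    : {n_params / 1e6:.0f}M")

# v1 ships OpenAI's CLIP ViT-L/14 text encoder; v2 ships an OpenCLIP-based one,
# which is noticeably larger.
describe_text_encoder("CompVis/stable-diffusion-v1-4")
describe_text_encoder("stabilityai/stable-diffusion-2-1")
```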

Training data difference

Stable Diffusion v1.4 is trained with

  β€’ 237k steps at resolution 256x256 on the laion2B-en dataset.

  β€’ 194k steps at resolution 512x512 on laion-high-resolution.

  β€’ 225k steps at resolution 512x512 on "laion-aesthetics v2 5+", with 10% dropping of text conditioning.

Stable Diffusion v2 is trained with

  • 550k steps at the resolution 256x256 on a subset of LAION-5B filtered for explicit pornographic material, using the LAION-NSFW classifier with punsafe=0.1 and an aesthetic score >= 4.5.

  • 850k steps at the resolution 512x512 on the same dataset on images with resolution >= 512x512.

  • 150k steps using a v-objective on the same dataset.

  • Resumed for another 140k steps on 768x768 images.

Stable Diffusion v2.1 is fine-tuned on v2.0

  • additional 55k steps on the same dataset (with punsafe=0.1)

  • another 155k extra steps with punsafe=0.98

So basically, they all but turned off the NSFW filter in the final training steps: punsafe=0.98 keeps nearly every image.
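
To make the punsafe values concrete: punsafe is a threshold on the LAION-NSFW classifier's predicted probability that an image is unsafe, so punsafe=0.1 keeps only images the classifier considers very likely safe, while punsafe=0.98 drops only near-certain NSFW images. Here is a hypothetical filtering sketch; the record fields and thresholds are illustrative, not the actual LAION schema or Stability AI's pipeline.

```python
# Hypothetical sketch of how a punsafe threshold filters training images.
# The record fields (p_unsafe, aesthetic) are made up for illustration.
records = [
    {"url": "a.jpg", "p_unsafe": 0.02, "aesthetic": 5.1},
    {"url": "b.jpg", "p_unsafe": 0.40, "aesthetic": 6.0},
    {"url": "c.jpg", "p_unsafe": 0.95, "aesthetic": 4.8},
]

def keep(record: dict, punsafe: float, min_aesthetic: float = 4.5) -> bool:
    # Keep an image if the classifier's unsafe probability is at or below the
    # punsafe threshold and its aesthetic score is high enough.
    return record["p_unsafe"] <= punsafe and record["aesthetic"] >= min_aesthetic

print([r["url"] for r in records if keep(r, punsafe=0.1)])   # strict filter: ['a.jpg']
print([r["url"] for r in records if keep(r, punsafe=0.98)])  # relaxed filter: all three kept
```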

Outcome difference

Users generally find it harder to control styles and generate celebrities with Stable Diffusion v2. Although Stability AI did not explicitly filter out artist and celebrity names, their effect is much weaker in v2. This is likely due to the difference in training data: OpenAI's proprietary data may contain more artwork and celebrity photos, and it was probably highly curated so that everything and everyone looks fine and pretty.
