Stable Diffusion v1 vs v2
Model difference
Stable Diffusion v2 uses OpenCLIP for text embedding, while Stable Diffusion v1 uses OpenAI's CLIP ViT-L/14; a short sketch comparing the two encoders follows the list below. The reasons for this change are:
OpenCLIP is up to five times larger.
A larger text encoder model improves image quality
More transparency
OpenAI's CLIP models are open-source, but they were trained with proprietary data.
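To see the difference in practice, here is a minimal sketch that loads both text encoders and compares the embeddings they produce. It assumes the transformers and open_clip_torch packages are installed; the prompt is just an illustrative example.

```python
# Minimal sketch: compare the v1 and v2 text encoders for the same prompt.
import torch
from transformers import CLIPTextModel, CLIPTokenizer
import open_clip

# Stable Diffusion v1: OpenAI CLIP ViT-L/14 text encoder (768-dim per-token embeddings)
v1_tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
v1_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Stable Diffusion v2: OpenCLIP ViT-H/14 trained on LAION data (1024-dim embeddings)
v2_model, _, _ = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k"
)
v2_tokenizer = open_clip.get_tokenizer("ViT-H-14")

prompt = "a photograph of an astronaut riding a horse"  # illustrative prompt

with torch.no_grad():
    v1_tokens = v1_tokenizer(
        prompt, padding="max_length", max_length=77, truncation=True, return_tensors="pt"
    )
    v1_emb = v1_encoder(**v1_tokens).last_hidden_state  # shape: (1, 77, 768)

    v2_tokens = v2_tokenizer([prompt])
    v2_emb = v2_model.encode_text(v2_tokens)             # pooled, shape: (1, 1024)

print("v1 per-token embedding:", v1_emb.shape)
print("v2 pooled embedding:   ", v2_emb.shape)
```

The wider embedding (1024 vs 768 dimensions) and the larger encoder are what give v2's conditioning more capacity, at the cost of prompts behaving differently than they did in v1.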
Training data difference
Stable Diffusion v1.4 is trained with
237k steps at resolution 256×256 on laion2B-en dataset.
194k steps at resolution 512×512 on laion-high-resolution.
225k steps at 512×512 on "laion-aesthetics v2 5+" with 10% dropping of text conditioning.
Stable Diffusion v2 is trained with
550k steps at resolution 256×256 on a subset of LAION-5B filtered for explicit pornographic material, using the LAION-NSFW classifier with punsafe=0.1 and an aesthetic score >= 4.5.
850k steps at resolution 512×512 on the same dataset, on images with resolution >= 512×512.
150k steps using a v-objective on the same dataset.
Resumed for another 140k steps on 768×768 images.
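To make the punsafe and aesthetic-score filtering concrete, here is a hypothetical sketch of that kind of metadata filtering with pandas. The column names ("punsafe", "aesthetic") and the file path are assumptions for illustration, not the actual LAION-5B metadata schema.

```python
# Hypothetical sketch of filtering LAION-style metadata by NSFW score and aesthetics.
import pandas as pd

def filter_laion_metadata(parquet_path: str) -> pd.DataFrame:
    """Keep rows the NSFW classifier rates as safe and that score high enough aesthetically."""
    df = pd.read_parquet(parquet_path)
    # Interpreting punsafe=0.1 as: keep images whose predicted unsafe probability is below 0.1.
    mask = (df["punsafe"] < 0.1) & (df["aesthetic"] >= 4.5)
    return df[mask]

# Example usage (path is hypothetical):
# subset = filter_laion_metadata("laion5b-shard-0000.parquet")
# print(f"kept {len(subset)} rows from the shard")
```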
Stable Diffusion v2.1 is fine-tuned on v2.0 with
an additional 55k steps on the same dataset (with punsafe=0.1)
another 155k extra steps with punsafe=0.98
So basically, they turned off the NSFW filter in the last training steps.
Outcome difference
Users generally find it harder to use Stable Diffusion v2 to control styles and generate celebrities. Although Stability AI did not explicitly filter out artist and celebrity names, their effects are much weaker in v2. This is likely due to the difference in the text encoders' training data: OpenAI's proprietary data may contain more artwork and celebrity photos, and it was probably highly curated so that everything and everyone looks fine and pretty.
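If you want to check this effect yourself, the sketch below uses the diffusers library to run the same artist-style prompt through both model versions with the same seed. The model IDs, prompt, and seed are illustrative choices, not an official benchmark, and results will vary with hardware and scheduler settings.

```python
# Sketch: generate the same style prompt with v1.5 and v2.1 to compare how
# strongly each responds to an artist name.
import torch
from diffusers import StableDiffusionPipeline

prompt = "a portrait in the style of Greg Rutkowski"  # illustrative artist-style prompt
seed = 42

for name, model_id in [
    ("v1.5", "runwayml/stable-diffusion-v1-5"),
    ("v2.1", "stabilityai/stable-diffusion-2-1"),
]:
    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to("cuda")
    generator = torch.Generator("cuda").manual_seed(seed)
    image = pipe(prompt, generator=generator).images[0]
    image.save(f"style_test_{name}.png")
```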