Stable Diffusion v1 vs v2
Stable Diffusion v2 uses OpenCLIP for text embedding. Stable Diffusion v1 uses OpenAI's CLIP ViT-L/14 for text embedding. The reasons for this change are:
A larger text encoder model improves image quality.
OpenAI's CLIP models are open-source, but they were trained with proprietary data.
Stable Diffusion v1.4 is trained with:
237k steps at resolution 256×256 on the laion2B-en dataset.
194k steps at resolution 512×512 on laion-high-resolution.
225k steps at 512×512 on laion-aesthetics v2 5+, with 10% dropping of text conditioning.
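The "10% dropping of text conditioning" is what later enables classifier-free guidance: some captions are randomly blanked during training so the model also learns an unconditional mode. A minimal sketch of that dropout step (the function name and the empty-string convention are illustrative assumptions, not Stability's actual code):

```python
import random

# Illustrative sketch: with probability drop_prob (10% in the v1.4 recipe),
# replace the training caption with the empty string so the model learns
# an unconditional mode alongside the text-conditioned one.
def maybe_drop_caption(caption: str, drop_prob: float = 0.1, rng=random) -> str:
    """Return the caption, or "" with probability `drop_prob`."""
    return "" if rng.random() < drop_prob else caption

# Toy usage: roughly 10% of captions come back empty.
rng = random.Random(0)
captions = [maybe_drop_caption("a photo of a cat", 0.1, rng) for _ in range(10_000)]
dropped = sum(c == "" for c in captions)
print(dropped / len(captions))  # close to 0.1
```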
Stable Diffusion v2 is trained with:
550k steps at resolution 256×256 on a subset of LAION-5B filtered for explicit pornographic material, using the LAION-NSFW classifier with punsafe=0.1 and an aesthetic score >= 4.5.
850k steps at resolution 512×512 on the same dataset, on images with resolution >= 512×512.
150k steps using a v-objective on the same dataset.
Resumed for another 140k steps on 768×768 images.
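The v-objective here refers to v-prediction (Salimans & Ho's progressive-distillation parameterization): instead of predicting the noise ε, the network predicts a "velocity" v = α_t·ε − σ_t·x₀. A small NumPy sketch of the target and its inverse, assuming the usual variance-preserving convention α_t² + σ_t² = 1:

```python
import numpy as np

# Sketch of the v-prediction target used in Stable Diffusion v2's later
# training stages. Given a noisy sample x_t = alpha_t * x0 + sigma_t * eps,
# the network is trained to predict v = alpha_t * eps - sigma_t * x0.
def v_target(x0, eps, alpha_t, sigma_t):
    return alpha_t * eps - sigma_t * x0

def x0_from_v(x_t, v, alpha_t, sigma_t):
    # x0 is recoverable from a v prediction: x0 = alpha_t * x_t - sigma_t * v
    # (exact when alpha_t**2 + sigma_t**2 == 1).
    return alpha_t * x_t - sigma_t * v

rng = np.random.default_rng(0)
x0, eps = rng.normal(size=4), rng.normal(size=4)
alpha_t, sigma_t = 0.8, 0.6  # alpha_t**2 + sigma_t**2 == 1
x_t = alpha_t * x0 + sigma_t * eps
v = v_target(x0, eps, alpha_t, sigma_t)
# Sanity check: recovering x0 from (x_t, v) matches the original sample.
print(np.allclose(x0_from_v(x_t, v, alpha_t, sigma_t), x0))  # True
```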
Stable Diffusion v2.1 is fine-tuned on v2.0 with an additional 55k steps on the same dataset (with punsafe=0.1), and another 155k extra steps with punsafe=0.98.
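The punsafe value is a keep threshold on the NSFW classifier's predicted probability that an image is unsafe: punsafe=0.1 drops anything with more than a 10% unsafe probability, while punsafe=0.98 keeps nearly everything. A toy sketch of such a metadata filter (field names and the `keep` helper are illustrative assumptions, not Stability's actual pipeline):

```python
# Illustrative filter over LAION-style per-image metadata: keep an image
# only if its predicted unsafe probability is below the punsafe threshold
# and its aesthetic score is at least min_aesthetic (4.5 in the v2 recipe).
def keep(sample: dict, punsafe: float = 0.1, min_aesthetic: float = 4.5) -> bool:
    return sample["punsafe"] < punsafe and sample["aesthetic"] >= min_aesthetic

samples = [
    {"punsafe": 0.02, "aesthetic": 5.1},  # kept
    {"punsafe": 0.40, "aesthetic": 6.0},  # dropped: too likely unsafe
    {"punsafe": 0.05, "aesthetic": 3.9},  # dropped: aesthetic score too low
]
kept = [s for s in samples if keep(s)]
print(len(kept))  # 1
```

Raising punsafe to 0.98, as in the final v2.1 steps, makes the first test pass for almost every image, which is the "turning off the filter" effect described below.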
So basically, they mostly turned off the NSFW filter (raising punsafe from 0.1 to 0.98) in the last training steps.
Users generally find it harder to use Stable Diffusion v2 to control styles and generate celebrities. Although Stability AI did not explicitly filter out artist and celebrity names, their effects are much weaker in v2. This is likely due to the difference in training data: OpenAI's proprietary data may contain more artwork and celebrity photos, and it is probably highly filtered so that everything and everyone looks fine and pretty.