How Stable Diffusion Works: A Quick Summary

To guide the reverse diffusion process towards a specific image, the Stable Diffusion model uses conditioning. For text-to-image generation, the conditioning signal is the text prompt. The process is as follows:

  1. The model begins with a text prompt. A tokenizer converts the text into a sequence of tokens.
  2. A text encoder (an embedding model) then transforms the tokens into vector embeddings that capture the semantic meaning of the text (see the first sketch after this list).
  3. These embeddings are fed into the noise predictor, so that the predicted noise is informed by the text prompt.
  4. The model uses a mechanism called cross-attention, in which the noise predictor pays different amounts of attention to different parts of the text embeddings while predicting the noise (see the second sketch below). This is what lets the generated image closely match the text prompt.
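
For steps 1 and 2, here is a minimal sketch using the Hugging Face transformers library. Stable Diffusion v1 uses the text encoder of the CLIP ViT-L/14 model loaded here; the prompt is an arbitrary example:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Load the tokenizer and text encoder used by Stable Diffusion v1.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a photograph of an astronaut riding a horse"

# Step 1: the tokenizer converts the text into a fixed-length sequence of token ids.
tokens = tokenizer(prompt, padding="max_length",
                   max_length=tokenizer.model_max_length,
                   return_tensors="pt")

# Step 2: the text encoder maps the token ids to embedding vectors.
with torch.no_grad():
    embeddings = text_encoder(tokens.input_ids).last_hidden_state

print(embeddings.shape)  # torch.Size([1, 77, 768]): 77 tokens, one 768-dim vector each
```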

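Step 4 can be illustrated with a simplified single-head cross-attention module. Names and sizes below are illustrative only; the actual U-Net uses multi-head attention at several resolutions:

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head cross-attention sketch: image queries attend to text keys/values."""

    def __init__(self, latent_dim: int, text_dim: int):
        super().__init__()
        # Learned projections map image features to queries and
        # text embeddings to keys and values.
        self.to_q = nn.Linear(latent_dim, latent_dim, bias=False)
        self.to_k = nn.Linear(text_dim, latent_dim, bias=False)
        self.to_v = nn.Linear(text_dim, latent_dim, bias=False)

    def forward(self, latent_features: torch.Tensor,
                text_embeddings: torch.Tensor) -> torch.Tensor:
        # latent_features: (batch, num_pixels, latent_dim)
        # text_embeddings: (batch, num_tokens, text_dim)
        q = self.to_q(latent_features)
        k = self.to_k(text_embeddings)
        v = self.to_v(text_embeddings)
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        weights = scores.softmax(dim=-1)  # per pixel: how much attention each token gets
        return weights @ v                # text information mixed into the image features

# Usage with illustrative shapes: 64x64 latent positions, 77 prompt tokens.
attn = CrossAttention(latent_dim=320, text_dim=768)
out = attn(torch.randn(1, 64 * 64, 320), torch.randn(1, 77, 768))
print(out.shape)  # torch.Size([1, 4096, 320])
```
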
The CFG value relates to guidance, of which there are two forms for diffusion models. Classifier guidance uses a pre-trained image classifier to steer the reverse diffusion process towards images that the classifier recognizes. Classifier-free guidance (CFG), on the other hand, does not use a pre-trained classifier. Instead, the model is trained to predict noise both with and without the text conditioning, and at sampling time the two predictions are combined. The CFG value (the guidance scale) is a user-set parameter, not something learned during training: it controls how strongly the combined prediction is pushed towards the conditional one, i.e. how closely the image follows the prompt.
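
Concretely, at each denoising step the noise predictor is run twice, once with the text embeddings and once with an "empty" unconditional embedding, and the two predictions are combined. A minimal sketch (the function and argument names are mine):

```python
import torch

def apply_cfg(noise_cond: torch.Tensor,
              noise_uncond: torch.Tensor,
              cfg_scale: float) -> torch.Tensor:
    # Standard classifier-free guidance combination:
    # cfg_scale = 1 reproduces the conditional prediction alone;
    # larger values push the result harder towards the prompt.
    return noise_uncond + cfg_scale * (noise_cond - noise_uncond)
```

Typical interfaces default to a CFG value around 7-8; very high values follow the prompt more literally at the cost of image quality.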

The Stable Diffusion model has gone through two major versions, v1 and v2. The differences include the text encoder (v2 replaced CLIP with the OpenCLIP encoder), the training data (v2 was trained on a more heavily filtered dataset), and the supported resolution (v2 added a 768×768 model alongside 512×512), all of which affect how the same prompt renders.
