Visualising the effect of guidance scale and inference steps – 2500 images

The definition you always hear of classifier free guidance (aka CFG aka guidance scale) is “it controls how closely the model follows your prompt when generating the image”. But there’s a bit more to it than that.

In diffusion, the model starts with noise, and progressively refines it over several inference steps, transforming the image toward something matching the prompt, according to the model’s training data.

Intuitively then, you might expect the CFG and inference step levers to control two separate things. If you prompt for “a tomato” and set CFG very low, you might imagine the model would ignore the prompt entirely and generate something unrelated, maybe a suspension bridge. As CFG increases, perhaps you’d expect the image to gradually become more “tomato-like”.

But CFG and steps are actually more closely related than that. If CFG is low, it doesn’t mean “Ignore the prompt. Make something completely different”. It appears to be more like the model will revert to its priors – things in the prompt that have more representation in its training data appear to be be more emphasised.

For example, see the grid below. This shows 2500 images generated in a grid search with guidance scale from 1 to 50 (top to bottom), and number of inference steps from 1 to 50 (left to right). These were generated using Jeneration Lab with Stable Diffusion 1.5, using the same seed (405905731) and prompt (cyberpunk neon city).

You can click any image to see a larger version of each image – it might take a second to load the new image.

There’s a few interesting things to point out in the grid.

1) Nothing at steps = 1

Note in the entire left side of the grid, all images are just brown – no matter what the scale, we don’t get anything resembling an image.

2) There’s no cyberpunk at scale = 1

When guidance scale is 1, we see a city, as we asked for, but it’s not in the cyberpunk aesthetic that the prompt specified. There’s no hot pink, no electric blue, no purple, no neon. This is likely because there were fewer images tagged with “cyberpunk”, so the scale isn’t strong enough to push the model very far in that direction through latent space. That aspect of the prompt doesn’t become fully established until around scale 4-5.

But at higher scales, all natural colours disappear and the cyberpunk aesthetic dominates completely.

3) At higher scale, images are simpler

Note how the lower scale images towards the top are detailed images of a city street, we have cars, street signs, balconies, road markings… light glistens from the tarmac. At higher scale towards the bottom, we get simpler, stylised, Tron-like cityscapes.

4) Diagonal colour bands

You might have also noticed that we have these diagonal bands of colour going from top left to bottom right. It’s easier to see if zoomed out a bit:

This is happening because when the scale is higher, the model takes more steps to arrive at certain features in the image. For example, take a look at scale=5/steps=3:

Notice there’s a vertical green band in the centre, and another to the left of it. At scale=30, it takes the model 7 steps to arrive at this feature:

At scale=50, we don’t see it until steps 12 (although there’s a bluey-green one at 11 steps)

If you click the images and follow other diagonal bands down the grid, you’ll see similar things happening with different features.

Incidentally, if you stay on scale=3 and increase the steps, you’ll see that those green bands – which became glowing neon buildings at high scale – become neon street signs at lower scale:

It appears from this experiment that CFG and steps don’t push the model toward completely different images. Instead, they seem to affect how quickly different concepts and features form during generation. Also, according to how the rest of the image has developed by that point, these features are then incorporated into the image in different ways.

There’s more to CFG than simply “makes the model follow the prompt more closely”!

Warren Davies

Visualising the effect of guidance scale and inference steps – 2500 images

1) Nothing at steps = 1

2) There’s no cyberpunk at scale = 1

3) At higher scale, images are simpler

4) Diagonal colour bands

Leave a Reply Cancel reply