Make Any Diffusion Model 3x Faster with Auto Caching (Pro)

This tutorial demonstrates how to use the pruna_pro package to optimize any diffusers pipeline. We use the stable-diffusion-v1-4 model as an example, although the tutorial also applies to other popular models, such as SD-XL, FLUX, and Hunyuan Video.

1. Loading the Stable Diffusion Model

First, load the pre-trained model and move it to the GPU.

[ ]:
import torch
from diffusers import StableDiffusionPipeline

# Define the model ID
model_id = "CompVis/stable-diffusion-v1-4"

# Load the pre-trained model
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
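
The loading step is the only part that changes if you want to apply the same recipe to another supported architecture, such as FLUX. The sketch below is illustrative: it assumes you have accepted the FLUX.1-dev license on the Hugging Face Hub and have enough GPU memory, and the model ID and dtype are example choices rather than part of this tutorial.

[ ]:
import torch
from diffusers import FluxPipeline

# Illustrative alternative: load a FLUX pipeline instead of Stable Diffusion.
# The rest of the tutorial (SmashConfig, smash, inference) stays the same.
flux_pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # example model ID; requires accepting the license on the Hub
    torch_dtype=torch.bfloat16,
)
flux_pipe = flux_pipe.to("cuda")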

2. Initializing the Smash Config

Next, initialize the smash config (we use our proprietary auto caching algorithm). The speed factor controls the target latency of the smashed model: a speed factor of 0.5 results in a latency of approximately 0.5x that of the original model (roughly a 2x speedup), while 0.33 targets roughly a 3x speedup. Lower values are faster but incur more quality loss.

[ ]:
from pruna_pro import SmashConfig

# Initialize the SmashConfig
smash_config = SmashConfig()
smash_config['cacher'] = 'auto'
smash_config['auto_speed_factor'] = 0.5  # roughly a 2x speedup; lower is faster but loses more quality
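
To target the roughly 3x speedup from the title instead, lower the speed factor accordingly. This is only an illustration; the actual speedup and quality impact depend on your model and hardware.

[ ]:
# Illustrative alternative: target roughly a 3x speedup (lower = faster, but more quality loss)
smash_config['auto_speed_factor'] = 0.33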

3. Smashing the Model

Now, smash the model using your Pruna Pro token. This only takes a few seconds.

[ ]:
from pruna_pro import smash

smashed_model = smash(
    model=pipe,
    token='<your_pruna_token>',
    smash_config=smash_config,
)

4. Running the Model

Finally, run the smashed model to generate an image with accelerated inference.

[ ]:
# Define the prompt
prompt = "a fruit basket"

# Generate and display the image
smashed_model(prompt).images[0]
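
If you want reproducible outputs or a quick latency check, the snippet below passes a few standard diffusers generation arguments and times the call. This is a sketch: it assumes the smashed pipeline forwards the usual diffusers keyword arguments, the output file name is just an example, and the measured numbers depend on your GPU, resolution, and step count.

[ ]:
import time

# Reproducible generation plus a rough latency measurement (numbers vary with hardware)
generator = torch.Generator("cuda").manual_seed(42)

torch.cuda.synchronize()
start = time.perf_counter()
image = smashed_model(
    prompt,
    num_inference_steps=50,
    guidance_scale=7.5,
    generator=generator,
).images[0]
torch.cuda.synchronize()
print(f"Latency: {time.perf_counter() - start:.2f}s")

image.save("fruit_basket.png")  # example output path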

Wrap Up

Congratulations! You have successfully smashed a diffusion model! You can now use the pruna_pro package to optimize any diffusion model. Adjust steps 1, 2, and 4 to fit your use case. In particular, experiment with the auto_speed_factor to explore the trade-off between latency and quality and find the best configuration for your application.