Turbocharge Text-to-Video Generation (Pro)

This tutorial demonstrates how to use the pruna_pro package to optimize a video generation pipeline. We will use the Wan2.1-T2V 1.3B model as an example.

1. Loading the Wan Text-to-Video Model

First, load your video generation model.

[ ]:
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
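# The VAE is loaded in float32 for numerical stability, while the rest
# of the pipeline runs in bfloat16.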
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16).to("cuda")
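
Optionally, you can confirm that the pipeline landed on the GPU and see how much memory it occupies. A quick sanity check using standard torch.cuda utilities:

[ ]:
# Sanity check: confirm the GPU is visible and inspect current memory usage.
print(torch.cuda.get_device_name(0))
print(f"{torch.cuda.memory_allocated() / 1e9:.1f} GB allocated")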

2. Initializing the Smash Config

Next, initialize the smash_config.

[ ]:
from pruna_pro import SmashConfig, smash

# Initialize the SmashConfig
smash_config = SmashConfig({
    "auto": {
        "cache_mode": "taylor",
        "speed_factor": 0.42
    },
    "torch_compile": {},
})
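
The auto group lets pruna_pro pick compatible optimizations for the pipeline, here combining step caching (cache_mode "taylor") with torch_compile. Our reading is that speed_factor controls how aggressively steps are approximated, with lower values trading more quality for speed; treat these semantics as an assumption and consult the pruna_pro documentation for the exact meaning. As a hypothetical illustration, a more conservative configuration might look like:

[ ]:
# Hypothetical, more conservative settings: a higher speed_factor is
# assumed to cache less aggressively, preserving more quality at the
# cost of a smaller speed-up (check the pruna_pro docs for exact semantics).
conservative_config = SmashConfig({
    "auto": {
        "cache_mode": "taylor",
        "speed_factor": 0.6,
    },
    "torch_compile": {},
})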

3. Smashing the Model

Now, you can smash the model; this will take about a minute. Don’t forget to replace the token with the one provided by PrunaAI.

[ ]:
# Smash the pipe
smashed_pipe = smash(
    model=pipe,
    token="<your_pruna_token>",
    smash_config=smash_config,
)
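
Hard-coding credentials in notebooks makes them easy to leak. As a minimal sketch, you could read the token from an environment variable instead (PRUNA_TOKEN is a hypothetical variable name of our choosing, not something the pruna_pro API requires):

[ ]:
import os

# Read the Pruna token from the environment instead of hard-coding it.
# "PRUNA_TOKEN" is a hypothetical environment variable name.
smashed_pipe = smash(
    model=pipe,
    token=os.environ["PRUNA_TOKEN"],
    smash_config=smash_config,
)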

4. Running the Model

Finally, run the smashed pipeline to generate a video with accelerated inference.

[ ]:
# Warm-up: the first call is slow because torch_compile traces and
# compiles the model; subsequent calls reuse the compiled graph.
output = smashed_pipe(
    prompt="A cat walks on the grass, realistic",
    negative_prompt="Bright tones, overexposed, static, blurred details.",
    height=480,
    width=480,
    num_frames=81,
    guidance_scale=5.0,
    num_inference_steps=50,
)
[ ]:
output = smashed_pipe(
    prompt="A cat walks on the grass, realistic",
    negative_prompt="Bright tones, overexposed, static, blurred details.",
    height=480,
    width=480,
    num_frames=81,
    guidance_scale=5.0,
    num_inference_steps=50,
).frames[0]
export_to_video(output, "smashed_output.mp4", fps=15)
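
To quantify the speed-up, you can time a run after the warm-up. A minimal sketch using the standard library's time.perf_counter; torch.cuda.synchronize ensures all queued GPU work has finished before the timer stops:

[ ]:
import time

torch.cuda.synchronize()
start = time.perf_counter()
frames = smashed_pipe(
    prompt="A cat walks on the grass, realistic",
    negative_prompt="Bright tones, overexposed, static, blurred details.",
    height=480,
    width=480,
    num_frames=81,
    guidance_scale=5.0,
    num_inference_steps=50,
).frames[0]
torch.cuda.synchronize()
print(f"Video generated in {time.perf_counter() - start:.1f} s")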

Wrap Up

Congratulations! You have successfully smashed a text-to-video model. You can now use the pruna_pro package to optimize any custom video generation pipeline. The only parts you need to modify are steps 1 and 4 to fit your use case.