Compress and Evaluate Video Generation Models
| Component | Details |
|---|---|
| Goal | Showcase a standard workflow for optimizing and evaluating a video generation model |
| Model | Wan-AI/Wan2.1-T2V-1.3B-Diffusers |
| Dataset | LAION256 |
| Device | 1 x H100 (80GB) |
| Optimization Algorithms | compiler(torch_compile), kernel(flash_attn3) |
| Evaluation Metrics | total_time, latency, throughput, co2_emissions, energy_consumed |
Getting Started
To install the required dependencies, you can run the following command:
[ ]:
%pip install pruna
%pip install ftfy imageio imageio-ffmpeg
For more information about how to install Pruna, please refer to the Installation page.
Then, we will set the device to the best available option to get the most out of the optimization. In this case, we recommend using a GPU.
[2]:
import torch

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)
1. Load the Model
First, we load the original model with the diffusers library and check that it fits into memory. In this example, we will use a light model that is compatible with most consumer-grade GPUs, Wan-AI/Wan2.1-T2V-1.3B.
Pruna works at least as well with larger models, like the 14B version of Wan 2.1 or HunyuanVideo. We use a smaller model here simply because it is a good starting point, so feel free to use any text-to-video model available on Hugging Face.
[ ]:
from diffusers import AutoencoderKLWan, WanPipeline

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
vae = AutoencoderKLWan.from_pretrained(
    model_id, subfolder="vae", torch_dtype=torch.float32
)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16).to(
    device
)
Once we have loaded the pipeline, we can run some inference and check the output. The standard prompt structure for a video is Subject + Subject Action + Scene, which can become more complex as we add descriptions and details like the lighting, point of view, or visual style to achieve specific and refined results.
Remember that you can improve the quality of the video by increasing the number of frames, the number of inference steps, and the guidance scale, but this will also increase the time and resources required to generate the video.
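For instance, a more detailed prompt following that structure might look like the sketch below; the wording and the quality settings are purely illustrative, not recommendations.
[ ]:
# Illustrative only: a richer prompt following Subject + Action + Scene,
# extended with point of view, lighting, and visual style details.
detailed_prompt = (
    "A golden retriever runs along the shoreline of a sunny beach at golden hour, "
    "low-angle tracking shot, soft warm lighting, cinematic, realistic."
)
# Example of heavier settings: higher values generally improve quality
# at the cost of generation time and memory.
quality_settings = {"num_frames": 81, "num_inference_steps": 30, "guidance_scale": 5.0}
You could then pass these to the pipeline call below, for example as pipe(prompt=detailed_prompt, **quality_settings, ...).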
[ ]:
from diffusers.utils import export_to_video

prompt = "A dog runs on the beach, realistic."
negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"  # noqa: E501

with torch.no_grad():
    output = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        height=480,
        width=832,
        num_frames=33,
        guidance_scale=3.0,
        num_inference_steps=15,
        generator=torch.Generator(device=device).manual_seed(42),
    ).frames[0]

export_to_video(output, "base_video.mp4", fps=15)
As we can see, the model has generated a nice short video based on our prompt.
2. Define the SmashConfig
Now that we have correctly loaded and tested our base model, let’s continue by defining the SmashConfig to customize the optimizations we want to apply when smashing.
Keep in mind that not all optimization algorithms are available for all models; you can learn about their requirements and compatibility in the Algorithms Overview.
For this optimization, we will use the torch_compile compiler to speed up inference and the flash_attn3 kernel for faster attention.
Let’s define the SmashConfig object.
[5]:
from pruna import SmashConfig
smash_config = SmashConfig(device=device)
# Configure the compiler
smash_config["compiler"] = "torch_compile"
smash_config["torch_compile_target"] = "module_list"
# Configure the kernel
smash_config["kernel"] = "flash_attn3"
3. Smash the Model
Next, we need to apply our defined SmashConfig by smashing our model. The smash function will be in charge of this, so we just need to pass the model and the smash_config. To evaluate and compare the models in the upcoming sections, we will make a deep copy of the base model.
Time to smash! This will take around 20 seconds, depending on the configuration.
[ ]:
import copy

from pruna import smash

copy_pipe = copy.deepcopy(pipe).to("cpu")

smashed_pipe = smash(
    model=pipe,
    smash_config=smash_config,
)
We now have an optimized, smashed model, so let’s check how it performs using the previous prompt.
Keep in mind that if you are using torch_compile as the compiler, you can expect the first inference call to take longer than subsequent ones because of the compilation warmup.
[ ]:
with torch.no_grad():
    output = smashed_pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        height=480,
        width=832,
        num_frames=33,
        guidance_scale=3.0,
        num_inference_steps=15,
        generator=torch.Generator(device=device).manual_seed(42),
    ).frames[0]

export_to_video(output, "smashed_video.mp4", fps=15)
As we can observe, the smashed model also generates a short video similar to the one produced by the original model.
If you notice a significant difference, it might be due to the model, the configuration, or the hardware. We encourage you to retry the optimization process or try out different configurations and models to find the best fit for your use case, and feel free to reach out to us on Discord if you have any questions or feedback.
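If you want a quick, informal look at the warmup effect mentioned above, you can time a couple of consecutive calls yourself. This is only a rough sketch: the first compiled call already ran in the cell above, so run a loop like this right after smashing to see the warmup cost, and rely on the EvaluationAgent in the next section for more reliable numbers.
[ ]:
import time

# Rough wall-clock timing of consecutive calls. When run immediately after
# smash(), the first call also includes the torch_compile warmup.
for run in range(2):
    start = time.perf_counter()
    with torch.no_grad():
        smashed_pipe(
            prompt=prompt,
            negative_prompt=negative_prompt,
            height=480,
            width=832,
            num_frames=33,
            guidance_scale=3.0,
            num_inference_steps=15,
        )
    print(f"Run {run + 1}: {time.perf_counter() - start:.1f} s")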
4. Evaluate the Smashed Model
Now that we have our smashed model, the key question is how much it has improved with our optimization. To find out, we can evaluate its performance using the EvaluationAgent. In this case, we will include metrics like total_time, latency, throughput, co2_emissions, and energy_consumed.
A complete list of the available metrics can be found in Evaluation.
[ ]:
from pruna import PrunaModel
from pruna.data.pruna_datamodule import PrunaDataModule
from pruna.evaluation.evaluation_agent import EvaluationAgent
from pruna.evaluation.metrics import (
    CO2EmissionsMetric,
    EnergyConsumedMetric,
    LatencyMetric,
    ThroughputMetric,
    TotalTimeMetric,
)
from pruna.evaluation.task import Task

# Define the metrics. Increase the number of iterations and
# warmup iterations to get a more accurate result.
metrics = [
    TotalTimeMetric(n_iterations=3, n_warmup_iterations=1),
    LatencyMetric(n_iterations=3, n_warmup_iterations=1),
    ThroughputMetric(n_iterations=3, n_warmup_iterations=1),
    CO2EmissionsMetric(n_iterations=3, n_warmup_iterations=1),
    EnergyConsumedMetric(n_iterations=3, n_warmup_iterations=1),
]

# Define the datamodule
datamodule = PrunaDataModule.from_string("LAION256")
datamodule.limit_datasets(10)

# Define the task and the evaluation agent
task = Task(metrics, datamodule=datamodule, device=device)
eval_agent = EvaluationAgent(task)
[ ]:
# Evaluate the smashed model and offload it to CPU
smashed_pipe.move_to_device(device)
smashed_model_results = eval_agent.evaluate(smashed_pipe)
smashed_pipe.move_to_device("cpu")
[ ]:
# Evaluate the base model and offload it to CPU
base_pipe = PrunaModel(model=copy_pipe)
base_pipe.move_to_device(device)
base_model_results = eval_agent.evaluate(base_pipe)
base_pipe.move_to_device("cpu")
Let’s visualize and compare the evaluation results of the base and smashed models.
[15]:
from IPython.display import Markdown, display  # noqa


# Calculate the percentage difference for each metric
def calculate_percentage_diff(original, optimized):  # noqa
    return ((optimized - original) / original) * 100


# Calculate differences and prepare the table data
table_data = []
for base_metric_result in base_model_results:
    for smashed_metric_result in smashed_model_results:
        if base_metric_result.name == smashed_metric_result.name:
            diff = calculate_percentage_diff(
                base_metric_result.result, smashed_metric_result.result
            )
            table_data.append(
                {
                    "Metric": base_metric_result.name,
                    "Base Model": f"{base_metric_result.result:.7f}",
                    "Compressed Model": f"{smashed_metric_result.result:.7f}",
                    "Relative Difference": f"{diff:+.2f}%",
                }
            )
            break

# Create and display the markdown table manually
markdown_table = "| Metric | Base Model | Compressed Model | Relative Difference |\n"
markdown_table += "|--------|----------|-----------|------------|\n"
for row in table_data:
    metric = [m for m in metrics if m.metric_name == row["Metric"]][0]
    unit = metric.metric_units if hasattr(metric, "metric_units") else ""
    markdown_table += f"| {row['Metric']} | {row['Base Model']} {unit} | {row['Compressed Model']} {unit} | {row['Relative Difference']} |\n"  # noqa: E501

display(Markdown(markdown_table))
| Metric | Base Model | Compressed Model | Relative Difference |
|---|---|---|---|
| total_time | 460992.1875000 ms | 265793.1718750 ms | -42.34% |
| latency | 153664.0625000 ms/num_iterations | 88597.7239583 ms/num_iterations | -42.34% |
| throughput | 0.0000065 num_iterations/ms | 0.0000113 num_iterations/ms | +73.44% |
| co2_emissions | 0.0031181 kgCO2e | 0.0018072 kgCO2e | -42.04% |
| energy_consumed | 0.0556424 kWh | 0.0322483 kWh | -42.04% |
As we can see, the smashed model is much more efficient than the base model. It runs roughly 1.7x faster both overall and per iteration (a 42% reduction in total time and latency) and has a correspondingly higher throughput. Moreover, the energy consumption and CO₂ emissions were also reduced by about 42%, meaning that the compressed model is not only faster but also more environmentally friendly, consuming less electricity and leaving a smaller carbon footprint. These results are consistent with expectations, as both the compiler and the kernel are designed to improve the model’s runtime performance.
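As a small worked check of that speedup figure, you can derive the factor directly from the evaluation results. This is a minimal sketch, assuming the total runtime metric is reported under the name total_time, as in the table above.
[ ]:
# Derive the overall speedup factor from the evaluation results.
base_time = next(r.result for r in base_model_results if r.name == "total_time")
smashed_time = next(r.result for r in smashed_model_results if r.name == "total_time")
print(f"Overall speedup: {base_time / smashed_time:.2f}x")  # about 1.73x for the run above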
Finally, we can save the optimized model to disk or share it with others. Note that some optimizations, such as torch_compile, are device dependent and will be re-applied when loading the model on a different device.
[ ]:
# Save the model to disk
smashed_pipe.save_pretrained("Wan2.1-T2V-1.3B-smashed")
# Load the model from disk
# smashed_pipe = PrunaModel.from_pretrained("Wan2.1-T2V-1.3B-smashed/")
# Save the model to HuggingFace
# smashed_pipe.push_to_hub("PrunaAI/Wan2.1-T2V-1.3B-smashed")
Conclusions
In this tutorial, we have gone over the standard workflow for optimizing and evaluating a text-to-video model.
We started by loading the base model and defining the SmashConfig with the desired optimization algorithms and parameters. Then we smashed the base model to obtain an optimized version, and we verified the performance improvement by running an evaluation with the EvaluationAgent.
The results show that we can significantly increase the inference speed and reduce the energy consumption, while maintaining a high level of output quality. This makes it easy to explore trade-offs and iterate on configurations to find the best optimization strategy for your specific use case.
Check out our other tutorials for more examples of how to optimize and evaluate image generation models or LLMs.