Compress and Evaluate Video Generation Models
| Component | Details |
|---|---|
| Goal | Showcase a standard workflow for optimizing and evaluating a video generation model |
| Model | Wan-AI/Wan2.1-T2V-1.3B-Diffusers |
| Dataset | LAION256 |
| Device | 1 x H100 (80GB) |
| Optimization Algorithms | compiler(torch_compile), kernel(flash_attn3) |
| Evaluation Metrics | total_time, latency, throughput, co2_emissions, energy_consumed |
Getting Started
To install the required dependencies, you can run the following command:
[ ]:
%pip install pruna
%pip install ftfy imageio imageio-ffmpeg
For more information about how to install Pruna, please refer to the Installation page.
Then, we will set the device to the best available option to get the most out of the optimization. In this case, we recommend using a GPU.
[2]:
import torch

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)
1. Load the Model
First, we load the original model with the diffusers library and check that it fits into memory. In this example, we will use a light model that is compatible with most consumer-grade GPUs, Wan-AI/Wan2.1-T2V-1.3B.
Pruna works at least as well with larger models, like the 14B version of Wan 2.1 or HunyuanVideo. We use a smaller model here simply because it is a good starting point, so feel free to use any text-to-video model available on Hugging Face.
[ ]:
from diffusers import AutoencoderKLWan, WanPipeline

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
vae = AutoencoderKLWan.from_pretrained(
    model_id, subfolder="vae", torch_dtype=torch.float32
)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16).to(
    device
)
Once we have loaded the pipeline, we can run some inference and check the output. The standard prompt structure for a video is Subject + Subject Action + Scene, which can become more complex as we add descriptions and details like the lighting, point of view, or visual style to achieve specific and refined results.
Remember that you can improve the quality of the video by increasing the number of frames, the number of inference steps, and the guidance scale, but this will also increase the time and resources required to generate the video.
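For instance, a more detailed prompt following that structure might look like the sketch below; the wording and the quality settings are purely illustrative, not recommendations.
[ ]:
# Illustrative only: a richer prompt following Subject + Action + Scene,
# extended with point of view, lighting, and visual style details.
detailed_prompt = (
    "A golden retriever runs along the shoreline of a sunny beach at golden hour, "
    "low-angle tracking shot, soft warm lighting, cinematic, realistic."
)
# Example of heavier settings: higher values generally improve quality
# at the cost of generation time and memory.
quality_settings = {"num_frames": 81, "num_inference_steps": 30, "guidance_scale": 5.0}
You could then pass these to the pipeline call below, for example as pipe(prompt=detailed_prompt, **quality_settings, ...).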
[ ]:
from diffusers.utils import export_to_video

prompt = "A dog runs on the beach, realistic."
negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"  # noqa: E501

with torch.no_grad():
    output = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        height=480,
        width=832,
        num_frames=33,
        guidance_scale=3.0,
        num_inference_steps=15,
        generator=torch.Generator(device=device).manual_seed(42),
    ).frames[0]

export_to_video(output, "base_video.mp4", fps=15)
As we can see, the model has generated a nice short video based on our prompt.
2. Define the SmashConfig
Now that we have correctly loaded and tested our base model, let’s continue by defining the SmashConfig to customize the optimizations we want to apply when smashing.
Keep in mind that not all optimization algorithms are available for all models; you can learn about their requirements and compatibility in the Algorithms Overview.
For this optimization, we will use the torch_compile compiler to speed up inference and the flash_attn3 kernel for faster attention.
Let’s define the SmashConfig object.
[5]:
from pruna import SmashConfig
smash_config = SmashConfig(device=device)
# Configure the compiler
smash_config["compiler"] = "torch_compile"
smash_config["torch_compile_target"] = "module_list"
# Configure the kernel
smash_config["kernel"] = "flash_attn3"
3. Smash the Model
Next, we need to apply our defined SmashConfig by smashing our model. The smash function will be in charge of this, so we just need to pass the model and the smash_config. To evaluate and compare the models in the upcoming sections, we will make a deep copy of the base model.
Time to smash! This will take around 20 seconds, depending on the configuration.
[ ]:
import copy

from pruna import smash

copy_pipe = copy.deepcopy(pipe).to("cpu")

smashed_pipe = smash(
    model=pipe,
    smash_config=smash_config,
)
We now have an optimized, smashed model, so let’s check how it performs using the previous prompt.
Keep in mind that if you are using torch_compile as the compiler, you can expect the first inference call to take longer than subsequent ones because of the compilation warmup.
[ ]:
with torch.no_grad():
    output = smashed_pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        height=480,
        width=832,
        num_frames=33,
        guidance_scale=3.0,
        num_inference_steps=15,
        generator=torch.Generator(device=device).manual_seed(42),
    ).frames[0]

export_to_video(output, "smashed_video.mp4", fps=15)
As we can observe, the smashed model also generates a short video similar to the one produced by the original model.
If you notice a significant difference, it might be due to the model, the configuration, or the hardware. We encourage you to retry the optimization process or try out different configurations and models to find the best fit for your use case, and feel free to reach out to us on Discord if you have any questions or feedback.
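If you want a quick, informal look at the warmup effect mentioned above, you can time a couple of consecutive calls yourself. This is only a rough sketch: the first compiled call already ran in the cell above, so run a loop like this right after smashing to see the warmup cost, and rely on the EvaluationAgent in the next section for more reliable numbers.
[ ]:
import time

# Rough wall-clock timing of consecutive calls. When run immediately after
# smash(), the first call also includes the torch_compile warmup.
for run in range(2):
    start = time.perf_counter()
    with torch.no_grad():
        smashed_pipe(
            prompt=prompt,
            negative_prompt=negative_prompt,
            height=480,
            width=832,
            num_frames=33,
            guidance_scale=3.0,
            num_inference_steps=15,
        )
    print(f"Run {run + 1}: {time.perf_counter() - start:.1f} s")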
4. Evaluate the Smashed Model
Now that we have our smashed model, the key question is how much it has improved with our optimization. To find out, we can evaluate its performance using the EvaluationAgent. In this case, we will include metrics like total_time, latency, throughput, co2_emissions, and energy_consumed.
A complete list of the available metrics can be found in Evaluation.
[ ]:
from pruna import PrunaModel
from pruna.data.pruna_datamodule import PrunaDataModule
from pruna.evaluation.evaluation_agent import EvaluationAgent
from pruna.evaluation.metrics import (
    CO2EmissionsMetric,
    EnergyConsumedMetric,
    LatencyMetric,
    ThroughputMetric,
    TotalTimeMetric,
)
from pruna.evaluation.task import Task

# Define the metrics. Increase the number of iterations and
# warmup iterations to get a more accurate result.
metrics = [
    TotalTimeMetric(n_iterations=3, n_warmup_iterations=1),
    LatencyMetric(n_iterations=3, n_warmup_iterations=1),
    ThroughputMetric(n_iterations=3, n_warmup_iterations=1),
    CO2EmissionsMetric(n_iterations=3, n_warmup_iterations=1),
    EnergyConsumedMetric(n_iterations=3, n_warmup_iterations=1),
]

# Define the datamodule
datamodule = PrunaDataModule.from_string("LAION256")
datamodule.limit_datasets(10)

# Define the task and the evaluation agent
task = Task(metrics, datamodule=datamodule, device=device)
eval_agent = EvaluationAgent(task)
[ ]:
# Evaluate the smashed model and offload it to CPU
smashed_pipe.move_to_device(device)
smashed_model_results = eval_agent.evaluate(smashed_pipe)
smashed_pipe.move_to_device("cpu")
[ ]:
# Evaluate the base model and offload it to CPU
base_pipe = PrunaModel(model=copy_pipe)
base_pipe.move_to_device(device)
base_model_results = eval_agent.evaluate(base_pipe)
base_pipe.move_to_device("cpu")
Let’s visualize and compare the evaluation results of the base and smashed models.
[15]:
from IPython.display import Markdown, display  # noqa


# Calculate the percentage difference for each metric
def calculate_percentage_diff(original, optimized):  # noqa
    return ((optimized - original) / original) * 100


# Calculate differences and prepare the table data
table_data = []
for base_metric_result in base_model_results:
    for smashed_metric_result in smashed_model_results:
        if base_metric_result.name == smashed_metric_result.name:
            diff = calculate_percentage_diff(
                base_metric_result.result, smashed_metric_result.result
            )
            table_data.append(
                {
                    "Metric": base_metric_result.name,
                    "Base Model": f"{base_metric_result.result:.7f}",
                    "Compressed Model": f"{smashed_metric_result.result:.7f}",
                    "Relative Difference": f"{diff:+.2f}%",
                }
            )
            break

# Create and display the markdown table manually
markdown_table = "| Metric | Base Model | Compressed Model | Relative Difference |\n"
markdown_table += "|--------|----------|-----------|------------|\n"
for row in table_data:
    metric = [m for m in metrics if m.metric_name == row["Metric"]][0]
    unit = metric.metric_units if hasattr(metric, "metric_units") else ""
    markdown_table += f"| {row['Metric']} | {row['Base Model']} {unit} | {row['Compressed Model']} {unit} | {row['Relative Difference']} |\n"  # noqa: E501

display(Markdown(markdown_table))
| Metric | Base Model | Compressed Model | Relative Difference |
|---|---|---|---|
| total_time | 460992.1875000 ms | 265793.1718750 ms | -42.34% |
| latency | 153664.0625000 ms/num_iterations | 88597.7239583 ms/num_iterations | -42.34% |
| throughput | 0.0000065 num_iterations/ms | 0.0000113 num_iterations/ms | +73.44% |
| co2_emissions | 0.0031181 kgCO2e | 0.0018072 kgCO2e | -42.04% |
| energy_consumed | 0.0556424 kWh | 0.0322483 kWh | -42.04% |
As we can see, the smashed model is much more efficient than the base model. It runs roughly 1.7x faster both overall and per iteration (a 42% reduction in total time and latency) and has a correspondingly higher throughput. Moreover, the energy consumption and CO₂ emissions were also reduced by about 42%, meaning that the compressed model is not only faster but also more environmentally friendly, consuming less electricity and leaving a smaller carbon footprint. These results are consistent with expectations, as both the compiler and the kernel are designed to improve the model’s runtime performance.
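As a small worked check of that speedup figure, you can derive the factor directly from the evaluation results. This is a minimal sketch, assuming the total runtime metric is reported under the name total_time, as in the table above.
[ ]:
# Derive the overall speedup factor from the evaluation results.
base_time = next(r.result for r in base_model_results if r.name == "total_time")
smashed_time = next(r.result for r in smashed_model_results if r.name == "total_time")
print(f"Overall speedup: {base_time / smashed_time:.2f}x")  # about 1.73x for the run above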
Finally, we can save the optimized model to disk or share it with others. Note that some optimizations, such as torch_compile, are device dependent and will be re-applied when loading the model on a different device.
[ ]:
# Save the model to disk
smashed_pipe.save_pretrained("Wan2.1-T2V-1.3B-smashed")
# Load the model from disk
# smashed_pipe = PrunaModel.from_pretrained("Wan2.1-T2V-1.3B-smashed/")
# Save the model to HuggingFace
# smashed_pipe.push_to_hub("PrunaAI/Wan2.1-T2V-1.3B-smashed")
Conclusions
In this tutorial, we have gone over the standard workflow for optimizing and evaluating a text-to-video model.
We started by loading the base model and defining the SmashConfig with the desired optimization algorithms and parameters. Then we smashed the base model to obtain an optimized version, and we verified the performance improvement by running an evaluation with the EvaluationAgent.
The results show that we can significantly increase the inference speed and reduce the energy consumption, while maintaining a high level of output quality. This makes it easy to explore trade-offs and iterate on configurations to find the best optimization strategy for your specific use case.
Check out our other tutorials for more examples of how to optimize and evaluate image generation models or LLMs.