Evaluation Metrics
The pruna package provides helpful evaluation tools to assess your models. In this section, we’ll introduce the evaluation metrics you can use with the package.
Evaluation helps you understand how compression affects your models across different dimensions - from output quality to resource requirements. This knowledge is essential for making informed decisions about which compression techniques work best for your specific needs.
Quick Tutorial
Before we start, here’s a simple example showing how to evaluate your models using pruna.
The rest of this guide provides more detailed explanations of each component and additional features available for model evaluation.
import copy
from diffusers import StableDiffusionPipeline
from pruna import smash, SmashConfig
from pruna.data.pruna_datamodule import PrunaDataModule
from pruna.evaluation.evaluation_agent import EvaluationAgent
from pruna.evaluation.task import Task
# Set up the smash config
smash_config = SmashConfig()
smash_config['cacher'] = 'deepcache'
# Load the base model
model_path = "CompVis/stable-diffusion-v1-4"
pipe = StableDiffusionPipeline.from_pretrained(model_path)
# Smash the model
copy_pipe = copy.deepcopy(pipe)
smashed_pipe = smash(copy_pipe, smash_config)
# Define the task and the evaluation agent
metrics = ['clip_score', 'psnr']
task = Task(metrics, datamodule=PrunaDataModule.from_string('LAION256'))
eval_agent = EvaluationAgent(task)
# Evaluate the base model; all models need to be wrapped in a PrunaModel before being passed to the EvaluationAgent
first_results = eval_agent.evaluate(pipe)
print(first_results)
# Evaluate smashed model
smashed_results = eval_agent.evaluate(smashed_pipe)
print(smashed_results)
# Base model result output
{'clip_score_y_x': 28.0828}
# Smashed model result output
{'clip_score_y_x': 28.4500, 'psnr_pairwise_y_gt': 18.7465}
Evaluation Framework
The evaluation framework in pruna consists of several key components:
Task
Processes user requests and converts them into a set of metrics. The Task accepts metrics in three ways:
- As a plain text request from predefined options (e.g., image_generation_quality)
- As a list of metric names (e.g., ['clip_score', 'psnr']; see Metrics below)
- As a list of metric instances
In addition to metrics, Task requires a PrunaDataModule to perform the evaluation.
- class pruna.evaluation.task.Task(request, datamodule, device='cuda')
Processes user requests and converts them into a format that the evaluation module can handle.
- Parameters:
request (str | List[str | BaseMetric]) – The user request.
datamodule (PrunaDataModule) – The data module to use for the evaluation.
device (str | torch.device) – The device to use for the evaluation.
Currently, Task supports the following plain-text requests:
- image_generation_quality: Creates metrics for evaluating image generation models (clip_score, pairwise_clip_score, psnr)
from pruna.evaluation.task import Task
from pruna.data.pruna_datamodule import PrunaDataModule
task = Task("image_generation_quality", datamodule=PrunaDataModule.from_string('LAION256'))
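The same Task can also be built from an explicit list of metric names, as in the Quick Tutorial above:
from pruna.evaluation.task import Task
from pruna.data.pruna_datamodule import PrunaDataModule
# Equivalent request expressed as a list of metric names
task = Task(['clip_score', 'psnr'], datamodule=PrunaDataModule.from_string('LAION256'))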
EvaluationAgent
The main entry point for evaluating models. The EvaluationAgent:
- Takes a Task object that defines what metrics to use
- Provides methods to evaluate any model
- Handles the evaluation process, including separating metrics by execution strategy
- Runs inference on the model to generate predictions
- Caches predictions to avoid redundant computations
- Passes ground truth data and predictions to the appropriate metrics
- Collects and returns results from all metrics
- class pruna.evaluation.evaluation_agent.EvaluationAgent(task)
Entry point for evaluating a model.
- Parameters:
task (Task) – Configuration object that defines how to evaluate the model.
from pruna.evaluation.task import Task
from pruna.data.pruna_datamodule import PrunaDataModule
data_module = PrunaDataModule.from_string('LAION256')
data_module.limit_datasets(10)
task = Task("image_generation_quality", datamodule=data_module)
from pruna.evaluation.evaluation_agent import EvaluationAgent
eval_agent = EvaluationAgent(task)
For a full example of running an evaluation, see the Quick Tutorial above.
Metrics
Metrics help quantify different aspects of model performance, from output quality to resource requirements. The pruna package includes metrics for both quality assessment and resource utilization.
When using the EvaluationAgent, all metrics are executed automatically as part of the evaluation pipeline. The agent handles model inference, data preparation, and passing the appropriate inputs to each metric, eliminating the need to run metrics individually.
Metrics can operate in both single-model and pairwise modes:
- In single-model mode, each evaluation produces independent scores for the model being evaluated.
- In pairwise mode, metrics compare a subsequent model against the first model evaluated by the agent. Usually, this is used to compare the base model (first model) with its smashed version (subsequent model). The first model's outputs are cached and used as a reference point for all following evaluations. The pairwise comparison produces a single score that quantifies the relationship (e.g., similarity or difference) between the two models (see the sketch below).
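A minimal sketch of the pairwise flow, reusing the objects from the Quick Tutorial above; the evaluation order determines which model serves as the reference:
# The first model evaluated by the agent becomes the pairwise reference
first_results = eval_agent.evaluate(pipe)            # outputs are cached as the reference
# Subsequent evaluations are compared against the cached reference outputs
smashed_results = eval_agent.evaluate(smashed_pipe)  # pairwise scores appear only here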
Our metrics fall into two implementation categories that work differently under the hood:
Base Metrics
Simple metrics that compute values directly from inputs without maintaining state across batches. Examples include:
- Model architecture metrics
- Energy consumption metrics
- Memory usage metrics
elapsed_time
Measures inference time, latency, and throughput.
- Evaluation on CPU:
Yes.
- Required:
A PrunaModel object that defines the model to evaluate. A DataLoader object that defines the dataloader to evaluate the model on.
- Parameters:
- n_iterations: Number of inference iterations to measure (default 100).
- n_warmup_iterations: Number of warmup iterations before measurement (default 10).
- device: Device to run inference on (default "cuda").
- timing_type: Type of timing to use ("sync" or "async", default "sync").
gpu_memory
Measures peak GPU memory usage during model loading and execution.
- Evaluation on CPU:
No.
- Required:
A path to the PrunaModel to evaluate. A DataLoader object that defines the dataloader to evaluate the model on. The model class used to load the model from the path.
- Parameters:
- mode: Memory measurement mode ("disk", "inference", or "training").
- gpu_indices: List of GPU indices to monitor (default: all available GPUs).
energy
Measures energy consumption in kilowatt-hours (kWh) and CO2 emissions in kilograms (kg).
- Evaluation on CPU:
Yes.
- Required:
A PrunaModel object that defines the model to evaluate. A DataLoader object that defines the dataloader to evaluate the model on.
- Parameters:
- n_iterations: Number of inference iterations to measure (default 100).
- n_warmup_iterations: Number of warmup iterations before measurement (default 10).
- device: Device to run inference on (default "cuda").
model_architecture
Measures the number of parameters and MACs (multiply-accumulate operations) in the model.
- Evaluation on CPU:
Yes.
- Required:
A PrunaModel object that defines the model to evaluate. A DataLoader object that defines the dataloader to evaluate the model on.
- Parameters:
- device: Device to evaluate the model on (default "cuda").
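The base metrics above can be combined into a single evaluation request. A short sketch, assuming the identifiers in the headings above (elapsed_time, gpu_memory, energy, model_architecture) are accepted as metric names in a Task request:
from pruna.evaluation.task import Task
from pruna.data.pruna_datamodule import PrunaDataModule
# Assumption: these identifiers are valid metric names for a Task request
task = Task(
    ['elapsed_time', 'energy', 'model_architecture'],
    datamodule=PrunaDataModule.from_string('LAION256'),
)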
Stateful Metrics
Metrics that maintain internal state and accumulate information across multiple batches. These are typically used for quality assessment.
Most of our stateful metrics are implemented using the TorchMetricsWrapper, which adapts metrics from the TorchMetrics library to work within our evaluation framework. This allows us to leverage the robust implementations provided by TorchMetrics while maintaining a consistent API.
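To illustrate the stateful pattern that the wrapped TorchMetrics follow, here is a minimal sketch using torchmetrics directly (independent of pruna): state is accumulated with update() across batches and reduced once with compute().
import torch
from torchmetrics.image import PeakSignalNoiseRatio
psnr = PeakSignalNoiseRatio()
for _ in range(3):                    # stand-in for batches from a dataloader
    preds = torch.rand(4, 3, 64, 64)  # generated images
    target = torch.rand(4, 3, 64, 64) # reference images
    psnr.update(preds, target)        # accumulate state across batches
print(psnr.compute())                 # reduce the accumulated state to a single score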
clip_score
Measures the similarity between images and text using the CLIP model.
- Evaluation on CPU:
Yes.
- Required:
Inputs, ground truth and predictions.
- Parameters:
Accepts all parameters from the TorchMetrics CLIPScore implementation.
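For reference, the underlying TorchMetrics CLIPScore can be exercised on its own roughly as follows (a sketch, not the pruna wrapper; the checkpoint name shown is the TorchMetrics default):
import torch
from torchmetrics.multimodal.clip_score import CLIPScore
metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
images = torch.randint(0, 255, (2, 3, 224, 224), dtype=torch.uint8)  # dummy uint8 RGB images
prompts = ["a photo of a cat", "a photo of a dog"]
print(metric(images, prompts))  # higher means closer image-text alignment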
pairwise_clip_score
Measures the similarity between images of the first and subsequent models using the CLIP model.
- Evaluation on CPU:
Yes.
- Required:
Inputs, ground truth and predictions.
- Parameters:
Accepts all parameters from the TorchMetrics CLIPScore implementation.
accuracy
Measures the proportion of correct predictions in classification tasks.
- Evaluation on CPU:
Yes.
- Required:
Inputs, ground truth and predictions. TorchMetrics requires a ‘task’ parameter to be set to ‘binary’, ‘multiclass’, or ‘multilabel’. Each task type may have additional specific requirements - please refer to the TorchMetrics documentation for details.
- Parameters:
Accepts all parameters from the TorchMetrics Accuracy implementation (task, num_classes, threshold, etc.).
precision
Measures the proportion of positive identifications that were actually correct.
- Evaluation on CPU:
Yes.
- Required:
Inputs, ground truth and predictions. TorchMetrics requires a ‘task’ parameter to be set to ‘binary’, ‘multiclass’, or ‘multilabel’. Each task type may have additional specific requirements - please refer to the TorchMetrics documentation for details.
- Parameters:
Accepts all parameters from the TorchMetrics Precision implementation (task, num_classes, threshold, etc.).
recall
Measures the proportion of actual positives that were identified correctly.
- Evaluation on CPU:
Yes.
- Required:
Inputs, ground truth and predictions. TorchMetrics requires a ‘task’ parameter to be set to ‘binary’, ‘multiclass’, or ‘multilabel’. Each task type may have additional specific requirements - please refer to the TorchMetrics documentation for details.
- Parameters:
Accepts all parameters from the TorchMetrics Recall implementation (task, num_classes, threshold, etc.).
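To illustrate the mandatory 'task' parameter, here is how the underlying TorchMetrics classification metrics are typically instantiated (a sketch with dummy tensors, independent of pruna):
import torch
from torchmetrics.classification import Accuracy, Precision, Recall
preds = torch.tensor([0, 2, 1, 1])   # predicted class indices
target = torch.tensor([0, 1, 1, 1])  # ground-truth class indices
# 'task' must be 'binary', 'multiclass', or 'multilabel'; multiclass also needs num_classes
accuracy = Accuracy(task='multiclass', num_classes=3)
precision = Precision(task='multiclass', num_classes=3)
recall = Recall(task='multiclass', num_classes=3)
print(accuracy(preds, target), precision(preds, target), recall(preds, target))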
perplexity
Measures how well a probability model predicts a text sample.
- Evaluation on CPU:
Yes.
- Required:
Inputs, ground truth and predictions.
- Parameters:
Accepts all parameters from the TorchMetrics Perplexity implementation.
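For orientation, the underlying TorchMetrics Perplexity expects per-token scores and target token ids; a minimal sketch with random tensors:
import torch
from torchmetrics.text import Perplexity
perplexity = Perplexity(ignore_index=-100)  # positions marked with -100 are ignored
logits = torch.rand(2, 8, 100)              # (batch, sequence length, vocabulary size)
targets = torch.randint(0, 100, (2, 8))     # target token ids
print(perplexity(logits, targets))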
fid
Measures the similarity between generated and real image distributions using the Fréchet distance between Gaussian distributions fitted to the Inception embeddings of the generated and real images.
FID compares the distribution of real and generated images in a high-dimensional feature space. Since it estimates mean and covariance statistics, smaller sample sizes can introduce high variance, making the metric less stable. Large-scale evaluations often use tens of thousands of images, but for practical use, smaller sample sizes may still provide a reasonable approximation.
Computation Considerations
When generating images and computing FID on thousands to tens of thousands of samples, the process can take multiple hours to several days, even on a high-end GPU like an A100 or RTX 4090. On mid-range GPUs like a 3060 or 4060, it can take significantly longer. A rough approximation using a few thousand images may still take several hours, even with strong hardware.
- Evaluation on CPU:
No (impractical due to the high computational cost)
- Required:
Inputs, ground truth and predictions.
- Parameters:
Accepts all parameters from the TorchMetrics FrechetInceptionDistance implementation (feature extraction parameters, etc.).
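As a reference for the computation described above, the underlying TorchMetrics FrechetInceptionDistance accumulates statistics for real and generated images separately before computing the score. A small sketch with dummy uint8 images (note that this metric may require the torch-fidelity package to be installed):
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
fid = FrechetInceptionDistance(feature=2048)  # dimensionality of the Inception feature layer
real = torch.randint(0, 255, (16, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 255, (16, 3, 299, 299), dtype=torch.uint8)
fid.update(real, real=True)   # accumulate statistics of the real distribution
fid.update(fake, real=False)  # accumulate statistics of the generated distribution
print(fid.compute())          # lower is better; unstable for very small sample sizes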
psnr
Measures the peak signal-to-noise ratio (PSNR) between two images.
- Evaluation on CPU:
Yes.
- Required:
Inputs, ground truth and predictions.
- Parameters:
Accepts all parameters from the TorchMetrics PSNR implementation.
ssim
Measures the structural similarity index (SSIM) between two images.
- Evaluation on CPU:
Yes.
- Required:
Inputs, ground truth and predictions.
- Parameters:
Accepts all parameters from the TorchMetrics SSIM implementation.
lpips
Measures the Learned Perceptual Image Patch Similarity (LPIPS) between two images.
- Evaluation on CPU:
Yes.
- Required:
Inputs, ground truth and predictions.
- Parameters:
Accepts all parameters from the TorchMetrics LPIPS implementation.
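For reference, the underlying TorchMetrics implementations of psnr, ssim, and lpips can be exercised directly. A short sketch with random images in [0, 1] (LPIPS downloads pretrained network weights on first use):
import torch
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity
preds = torch.rand(4, 3, 64, 64)   # generated images in [0, 1]
target = torch.rand(4, 3, 64, 64)  # reference images in [0, 1]
psnr = PeakSignalNoiseRatio()
ssim = StructuralSimilarityIndexMeasure()
lpips = LearnedPerceptualImagePatchSimilarity(net_type='squeeze', normalize=True)  # normalize=True for [0, 1] inputs
print(psnr(preds, target))   # higher is better
print(ssim(preds, target))   # higher is better
print(lpips(preds, target))  # lower is better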