Compress and Evaluate Large Language Models
| Component | Details |
|---|---|
| Goal | Show a standard workflow for optimizing and evaluating a large language model |
| Model | HuggingFaceTB/SmolLM2-360M-Instruct |
| Dataset | SmolSmolTalk |
| Device | 1 x RTX A5000 (24GB VRAM) |
| Optimization Algorithms | quantizer (hqq), compiler (torch_compile) |
| Evaluation Metrics | perplexity, throughput, total time, energy consumption |
Getting Started
To install the dependencies, run the following command:
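[ ]:
# Pruna is distributed on PyPI as `pruna`; installing it also pulls in transformers and torch.
%pip install pruna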
Let’s also set the device to the best available option so we get the most out of the optimization process.
[ ]:
import torch

# Prefer CUDA, then Apple Silicon (MPS), then fall back to CPU
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
1. Load the model
Before we can optimize the model, we need to make sure that we can load the model and tokenizer correctly and that they fit in memory. For this example, we will use a small LLM, HuggingFaceTB/SmolLM2-360M-Instruct, but feel free to use any text-generation model on Hugging Face.
Although Pruna works at least as well with much larger models, like Qwen or LLaMA, a small model is a good starting point for showing and testing the steps of the optimization process.
[ ]:
from transformers import pipeline
model_name = "HuggingFaceTB/SmolLM2-360M-Instruct"
pipe = pipeline(
task="text-generation",
model=model_name,
)
Now that we’ve loaded the model and tokenizer, let’s see if we can run some inference with them. To make this easy, we will use the transformers library’s pipeline.__call__ method and pass in a list of messages.
[ ]:
from transformers import pipeline
messages = [
{"role": "user", "content": "Who are you?"},
]
pipe(messages, max_new_tokens=100)
As we can see, the model generates a response to the user’s question, which is cut off after the allowed max_new_tokens.
2. Define the SmashConfig
Now that we know the model is working, let’s continue with the optimization process and define the SmashConfig, which we will use later to optimize the model.
Not all optimization algorithms are available for all models, but you can learn more about the different optimization algorithms and their requirements in the Algorithms Overview section of the documentation.
For the current optimization, we will be using the `hqq quantizer <https://docs.pruna.ai/en/stable/compression.html#hqq>`__ and the `torch_compile compiler <https://docs.pruna.ai/en/stable/compression.html#torch-compile>`__. We will update some parameters for these algorithms, setting hqq_weight_bits to 4, hqq_compute_dtype to torch.bfloat16, torch_compile_fullgraph to True, torch_compile_dynamic to True, and torch_compile_mode to max-autotune.
This is just one of many possible configurations and serves here as an example.
Let’s define the SmashConfig object.
[ ]:
from pruna import SmashConfig
smash_config = SmashConfig(device=device)
smash_config["quantizer"] = "hqq"
smash_config["hqq_weight_bits"] = 4
smash_config["hqq_compute_dtype"] = "torch.bfloat16"
# We also use `torch_compile` as our compiler; note that compilation adds some one-time overhead on the first inference.
smash_config["compiler"] = "torch_compile"
smash_config["torch_compile_fullgraph"] = True
smash_config["torch_compile_dynamic"] = True
smash_config["torch_compile_mode"] = "max-autotune"
3. Smash the model
Now that we have defined the SmashConfig object, we can smash the model. We will use the smash function, passing the model and the smash_config to it. We also make a deep copy of the model to avoid modifying the original.
Let’s smash the model, which should take around 20 seconds for this configuration.
[ ]:
import copy
from pruna import smash
# Keep an untouched copy of the original model on CPU so we can evaluate it later as a baseline
copy_model = copy.deepcopy(pipe.model).to("cpu")
smashed_model = smash(
model=pipe.model,
smash_config=smash_config,
)
Now we’ve optimized the model. Let’s check that everything still works as expected and run some inference with the optimized model. In the cell below we simply call the pipeline again; alternatively, you can encode the prompt with the tokenizer yourself and pass the input_ids to the PrunaModel.generate method, which also accepts additional parameters such as max_new_tokens.
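A minimal sketch of that generate-based route (assuming smashed_model.generate forwards the usual transformers keyword arguments, as a plain model.generate call would):
[ ]:
# Sketch of the generate-based route described above; the keyword-forwarding behaviour
# of smashed_model.generate is an assumption here.
inputs = pipe.tokenizer("Who are you?", return_tensors="pt").to(device)
output_ids = smashed_model.generate(**inputs, max_new_tokens=100)
print(pipe.tokenizer.decode(output_ids[0], skip_special_tokens=True))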
If you are using torch_compile as your compiler, you can expect the first inference call (the compilation warmup) to take noticeably longer than subsequent calls.
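To see the warmup effect for yourself, a quick illustrative timing sketch (timings depend entirely on your hardware):
[ ]:
import time

# Illustrative timing: compare two consecutive calls. If compilation has not been
# triggered yet, the first call includes the torch_compile warmup and is much slower.
for label in ("first", "second"):
    start = time.perf_counter()
    pipe(messages, max_new_tokens=20)
    print(f"{label} timed call: {time.perf_counter() - start:.2f} s")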
[ ]:
prompt = "Who are you?"
messages = [{"role": "user", "content": prompt}]
pipe(messages, max_new_tokens=256)
As we can see, the optimized model generates a response similar to that of the original model.
If you notice a significant difference, it can have several causes: the model, the configuration, the hardware, and so on. As optimization can be non-deterministic, we encourage you to retry the optimization process or try different configurations and models to find the best fit for your use case. Feel free to reach out to us on Discord if you have any questions or feedback.
4. Evaluate the smashed model
Now that we have optimized the model, we can evaluate its performance using the EvaluationAgent. We will do so with a few performance metrics, the total time, throughput, and energy consumption, as well as a stateful metric, the perplexity. An overview of the different metrics can be found in our documentation.
Let’s define the EvaluationAgent object and start the evaluation process. Note that we use the datamodule.limit_datasets(100) method to limit the dataset to 100 samples, purely for the sake of time. Additionally, we set n_iterations and n_warmup_iterations so that we only measure the model’s performance once it is running smoothly.
The evaluation can take anywhere from a couple of minutes to a couple of hours to complete, depending on your hardware, the number of samples in the dataset, and the configuration of the model. In our case it should only take a couple of minutes.
[ ]:
from pruna import PrunaModel
from pruna.data.pruna_datamodule import PrunaDataModule
from pruna.evaluation.evaluation_agent import EvaluationAgent
from pruna.evaluation.metrics import (
EnergyConsumedMetric,
ThroughputMetric,
TorchMetricWrapper,
TotalTimeMetric,
)
from pruna.evaluation.task import Task
# Define the metrics
metrics = [
EnergyConsumedMetric(n_iterations=50, n_warmup_iterations=5),
ThroughputMetric(n_iterations=50, n_warmup_iterations=5),
TotalTimeMetric(n_iterations=50, n_warmup_iterations=5),
TorchMetricWrapper("perplexity", call_type="single"),
]
# Define the datamodule
# Reuse the EOS token for padding so the datamodule can batch sequences
pipe.tokenizer.pad_token = pipe.tokenizer.eos_token
datamodule = PrunaDataModule.from_string("SmolSmolTalk", tokenizer=pipe.tokenizer)
datamodule.limit_datasets(100)
# Define the task and evaluation agent
task = Task(metrics, datamodule=datamodule, device=device)
eval_agent = EvaluationAgent(task)
# Update the model args to make sure the right generation arguments are passed after compilation
inference_args = {"max_new_tokens": 250}
# Evaluate smashed model and offload it to CPU
smashed_model.move_to_device(device)
smashed_model.inference_handler.model_args.update(inference_args)
smashed_model_results = eval_agent.evaluate(smashed_model)
smashed_model.move_to_device("cpu")
# Evaluate base model and offload it to CPU
base_model = PrunaModel(model=copy_model)
base_model.move_to_device(device)
base_model.inference_handler.model_args.update(inference_args)
base_model_results = eval_agent.evaluate(base_model)
base_model.move_to_device("cpu")
Now we can see the results of the evaluation and compare the performance of the original and the optimized model.
[ ]:
from IPython.display import Markdown, display # noqa
# Calculate percentage differences for each metric
def calculate_percentage_diff(original, optimized): # noqa
return ((optimized - original) / original) * 100
# Calculate differences and prepare table data
table_data = []
for base_metric_result, smashed_metric_result in zip(base_model_results, smashed_model_results):
diff = calculate_percentage_diff(base_metric_result.result, smashed_metric_result.result)
table_data.append(
{
"Metric": base_metric_result.name,
"Base Model": f"{base_metric_result.result:.4f}",
"Compressed Model": f"{smashed_metric_result.result:.4f}",
"Relative Difference": f"{diff:+.2f}%",
}
)
# Create and display markdown table manually
markdown_table = "| Metric | Base Model | Compressed Model | Relative Difference |\n"
markdown_table += "|--------|----------|-----------|------------|\n"
for row in table_data:
metric_obj = [metric for metric in metrics if metric.metric_name == row["Metric"]][0]
unit = f" {metric_obj.metric_units}" if hasattr(metric_obj, "metric_units") else ""
markdown_table += f"| {row['Metric']} | {row['Base Model']} {unit} | {row['Compressed Model']} {unit} | {row['Relative Difference']} |\n" # noqa: E501
display(Markdown(markdown_table))
As we can see, the optimized model achieves roughly 4x the throughput and consumes only about a fifth of the energy of the base model, while losing only a little quality according to the perplexity metric, which is expected given the nature of the optimization. Now we can start to compare, iterate, and see which optimizations work best for our models, given the metrics we are interested in.
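For example, a follow-up iteration could test the hqq quantizer on its own, without the compiler, to see how much each algorithm contributes; a minimal sketch of such a configuration (smash a fresh copy of the base model with it and evaluate exactly as above):
[ ]:
# Illustrative follow-up configuration: quantization only, no compiler.
# Smash a fresh copy of the base model with it and re-run eval_agent.evaluate as before.
alt_config = SmashConfig(device=device)
alt_config["quantizer"] = "hqq"
alt_config["hqq_weight_bits"] = 4
alt_config["hqq_compute_dtype"] = "torch.bfloat16"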
We can now save the optimized model to disk and share it with others. Note that some optimizations, such as torch_compile, are device-dependent and will be re-applied when loading the model on a different device.
[ ]:
smashed_model.save_pretrained("smashed_model")
smashed_model.save_to_hub("smashed_model")
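To load the saved model back later, a hedged sketch, assuming PrunaModel.from_pretrained is the loading entry point (check the Pruna documentation for the exact loading API):
[ ]:
# Hedged sketch: reload the saved model from disk. The loading entry point used here is
# an assumption; device-dependent steps such as torch_compile are re-applied on load.
reloaded_model = PrunaModel.from_pretrained("smashed_model")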
Wrap up
In this tutorial, we have shown a standard workflow for optimizing and evaluating a large language model. We used the SmashConfig object to define the optimization algorithms, the PrunaDataModule to load the dataset, the Task object to combine the metrics and the data, and the EvaluationAgent to evaluate the performance of the optimized model.
We have shown how to optimize the model using the smash function and how to evaluate the optimized model using the EvaluationAgent, proving that we can make the model faster, more energy efficient, and lighter on memory while losing only a small amount of accuracy.