Compress and Evaluate Large Language Models
| Component | Details |
|---|---|
| Goal | Show a standard workflow for optimizing and evaluating a large language model |
| Model | HuggingFaceTB/SmolLM2-360M-Instruct |
| Dataset | SmolSmolTalk |
| Device | 1 x RTX A5000 (24GB VRAM) |
| Optimization Algorithms | quantizer (hqq), compiler (torch_compile) |
| Evaluation Metrics | perplexity, throughput, total time, energy consumption |
Getting Started
To install the dependencies, run the following command:
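[ ]:
# Pruna is distributed on PyPI as `pruna`; installing it also pulls in transformers and torch.
%pip install pruna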
Let’s also set the device to the best available option so we get the most out of the optimization process.
[ ]:
import torch

# Prefer CUDA, then Apple Silicon (MPS), then fall back to CPU
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
1. Load the model
Before we can optimize the model, we need to make sure that we can load the model and tokenizer correctly and that they fit in memory. For this example, we will use a small LLM, HuggingFaceTB/SmolLM2-360M-Instruct, but feel free to use any text-generation model on Hugging Face.
Although Pruna works at least as well with much larger models, like Qwen or LLaMA, a small model is a good starting point for showing and testing the steps of the optimization process.
[ ]:
from transformers import pipeline
model_name = "HuggingFaceTB/SmolLM2-360M-Instruct"
pipe = pipeline(
task="text-generation",
model=model_name,
)
Now that we’ve loaded the model and tokenizer, let’s see if we can run some inference with them. To make this easy, we will use the transformers library’s pipeline.__call__ method and pass in a list of messages.
[ ]:
from transformers import pipeline
messages = [
{"role": "user", "content": "Who are you?"},
]
pipe(messages, max_new_tokens=100)
As we can see, the model generates a response to the user’s question, which is cut off after the allowed max_new_tokens.
2. Define the SmashConfig
Now that we know the model is working, let’s continue with the optimization process and define the SmashConfig, which we will use later to optimize the model.
Not all optimization algorithms are available for all models, but you can learn more about the different optimization algorithms and their requirements in the Algorithms Overview section of the documentation.
For the current optimization, we will be using the `hqq quantizer <https://docs.pruna.ai/en/stable/compression.html#hqq>`__ and the `torch_compile compiler <https://docs.pruna.ai/en/stable/compression.html#torch-compile>`__. We will update some parameters for these algorithms, setting hqq_weight_bits to 4, hqq_compute_dtype to torch.bfloat16, torch_compile_fullgraph to True, torch_compile_dynamic to True, and torch_compile_mode to max-autotune.
This is just one of many possible configurations and serves here as an example.
Let’s define the SmashConfig object.
[ ]:
from pruna import SmashConfig
smash_config = SmashConfig(device=device)
smash_config["quantizer"] = "hqq"
smash_config["hqq_weight_bits"] = 4
smash_config["hqq_compute_dtype"] = "torch.bfloat16"
# We also use `torch_compile` as our compiler; note that compilation adds some one-time overhead on the first inference.
smash_config["compiler"] = "torch_compile"
smash_config["torch_compile_fullgraph"] = True
smash_config["torch_compile_dynamic"] = True
smash_config["torch_compile_mode"] = "max-autotune"
3. Smash the model
Now that we have defined the SmashConfig object, we can smash the model. We will use the smash function, passing the model and the smash_config to it. We also make a deep copy of the model to avoid modifying the original.
Let’s smash the model, which should take around 20 seconds for this configuration.
[ ]:
import copy
from pruna import smash
# Keep an untouched copy of the original model on CPU so we can evaluate it later as a baseline
copy_model = copy.deepcopy(pipe.model).to("cpu")
smashed_model = smash(
model=pipe.model,
smash_config=smash_config,
)
Now we’ve optimized the model. Let’s check that everything still works as expected and run some inference with the optimized model. In the cell below we simply call the pipeline again; alternatively, you can encode the prompt with the tokenizer yourself and pass the input_ids to the PrunaModel.generate method, which also accepts additional parameters such as max_new_tokens.
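A minimal sketch of that generate-based route (assuming smashed_model.generate forwards the usual transformers keyword arguments, as a plain model.generate call would):
[ ]:
# Sketch of the generate-based route described above; the keyword-forwarding behaviour
# of smashed_model.generate is an assumption here.
inputs = pipe.tokenizer("Who are you?", return_tensors="pt").to(device)
output_ids = smashed_model.generate(**inputs, max_new_tokens=100)
print(pipe.tokenizer.decode(output_ids[0], skip_special_tokens=True))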
If you are using torch_compile as your compiler, you can expect the first inference call (the compilation warmup) to take noticeably longer than subsequent calls.
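To see the warmup effect for yourself, a quick illustrative timing sketch (timings depend entirely on your hardware):
[ ]:
import time

# Illustrative timing: compare two consecutive calls. If compilation has not been
# triggered yet, the first call includes the torch_compile warmup and is much slower.
for label in ("first", "second"):
    start = time.perf_counter()
    pipe(messages, max_new_tokens=20)
    print(f"{label} timed call: {time.perf_counter() - start:.2f} s")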
[ ]:
prompt = "Who are you?"
messages = [{"role": "user", "content": prompt}]
pipe(messages, max_new_tokens=256)
As we can see, the optimized model generates a response similar to that of the original model.
If you notice a significant difference, it can have several causes: the model, the configuration, the hardware, and so on. As optimization can be non-deterministic, we encourage you to retry the optimization process or try different configurations and models to find the best fit for your use case. Feel free to reach out to us on Discord if you have any questions or feedback.
4. Evaluate the smashed model
Now that we have optimized the model, we can evaluate its performance using the EvaluationAgent. We will do so with a few performance metrics, the total time, throughput, and energy consumption, as well as a stateful metric, the perplexity. An overview of the different metrics can be found in our documentation.
Let’s define the EvaluationAgent object and start the evaluation process. Note that we use the datamodule.limit_datasets(100) method to limit the dataset to 100 samples, purely for the sake of time. Additionally, we set n_iterations and n_warmup_iterations so that we only measure the model’s performance once it is running smoothly.
The evaluation can take anywhere from a couple of minutes to a couple of hours to complete, depending on your hardware, the number of samples in the dataset, and the configuration of the model. In our case it should only take a couple of minutes.
[ ]:
from pruna import PrunaModel
from pruna.data.pruna_datamodule import PrunaDataModule
from pruna.evaluation.evaluation_agent import EvaluationAgent
from pruna.evaluation.metrics import (
EnergyConsumedMetric,
ThroughputMetric,
TorchMetricWrapper,
TotalTimeMetric,
)
from pruna.evaluation.task import Task
# Define the metrics
metrics = [
EnergyConsumedMetric(n_iterations=50, n_warmup_iterations=5),
ThroughputMetric(n_iterations=50, n_warmup_iterations=5),
TotalTimeMetric(n_iterations=50, n_warmup_iterations=5),
TorchMetricWrapper("perplexity", call_type="single"),
]
# Define the datamodule
# Reuse the EOS token for padding so the datamodule can batch sequences
pipe.tokenizer.pad_token = pipe.tokenizer.eos_token
datamodule = PrunaDataModule.from_string("SmolSmolTalk", tokenizer=pipe.tokenizer)
datamodule.limit_datasets(100)
# Define the task and evaluation agent
task = Task(metrics, datamodule=datamodule, device=device)
eval_agent = EvaluationAgent(task)
# Update the model args to make sure the right generation arguments are passed after compilation
inference_args = {"max_new_tokens": 250}
# Evaluate smashed model and offload it to CPU
smashed_model.move_to_device(device)
smashed_model.inference_handler.model_args.update(inference_args)
smashed_model_results = eval_agent.evaluate(smashed_model)
smashed_model.move_to_device("cpu")
# Evaluate base model and offload it to CPU
base_model = PrunaModel(model=copy_model)
base_model.move_to_device(device)
base_model.inference_handler.model_args.update(inference_args)
base_model_results = eval_agent.evaluate(base_model)
base_model.move_to_device("cpu")
Now we can see the results of the evaluation and compare the performance of the original and the optimized model.
[ ]:
from IPython.display import Markdown, display # noqa
# Calculate percentage differences for each metric
def calculate_percentage_diff(original, optimized): # noqa
return ((optimized - original) / original) * 100
# Calculate differences and prepare table data
table_data = []
for base_metric_result, smashed_metric_result in zip(base_model_results, smashed_model_results):
diff = calculate_percentage_diff(base_metric_result.result, smashed_metric_result.result)
table_data.append(
{
"Metric": base_metric_result.name,
"Base Model": f"{base_metric_result.result:.4f}",
"Compressed Model": f"{smashed_metric_result.result:.4f}",
"Relative Difference": f"{diff:+.2f}%",
}
)
# Create and display markdown table manually
markdown_table = "| Metric | Base Model | Compressed Model | Relative Difference |\n"
markdown_table += "|--------|----------|-----------|------------|\n"
for row in table_data:
metric_obj = [metric for metric in metrics if metric.metric_name == row["Metric"]][0]
unit = f" {metric_obj.metric_units}" if hasattr(metric_obj, "metric_units") else ""
markdown_table += f"| {row['Metric']} | {row['Base Model']} {unit} | {row['Compressed Model']} {unit} | {row['Relative Difference']} |\n" # noqa: E501
display(Markdown(markdown_table))
As we can see, the optimized model achieves roughly 4x the throughput and consumes only about a fifth of the energy of the base model, while losing only a little quality according to the perplexity metric, which is expected given the nature of the optimization. Now we can start to compare, iterate, and see which optimizations work best for our models, given the metrics we are interested in.
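For example, a follow-up iteration could test the hqq quantizer on its own, without the compiler, to see how much each algorithm contributes; a minimal sketch of such a configuration (smash a fresh copy of the base model with it and evaluate exactly as above):
[ ]:
# Illustrative follow-up configuration: quantization only, no compiler.
# Smash a fresh copy of the base model with it and re-run eval_agent.evaluate as before.
alt_config = SmashConfig(device=device)
alt_config["quantizer"] = "hqq"
alt_config["hqq_weight_bits"] = 4
alt_config["hqq_compute_dtype"] = "torch.bfloat16"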
We can now save the optimized model to disk and share it with others. Note that some optimizations, such as torch_compile, are device-dependent and will be re-applied when loading the model on a different device.
[ ]:
smashed_model.save_pretrained("smashed_model")
smashed_model.save_to_hub("smashed_model")
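To load the saved model back later, a hedged sketch, assuming PrunaModel.from_pretrained is the loading entry point (check the Pruna documentation for the exact loading API):
[ ]:
# Hedged sketch: reload the saved model from disk. The loading entry point used here is
# an assumption; device-dependent steps such as torch_compile are re-applied on load.
reloaded_model = PrunaModel.from_pretrained("smashed_model")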
Wrap up
In this tutorial, we have shown a standard workflow for optimizing and evaluating a large language model. We used the SmashConfig object to define the optimization algorithms, the PrunaDataModule to load the dataset, the Task object to combine the metrics and the data, and the EvaluationAgent to evaluate the performance of the optimized model.
We have shown how to optimize the model using the smash function and how to evaluate the optimized model using the EvaluationAgent, proving that we can make the model faster, more energy efficient, and lighter on memory while losing only a small amount of accuracy.