Compress and Evaluate Reasoning Large Language Models
| Component | Details |
|---|---|
| Goal | Showcase a standard workflow for optimizing and evaluating a reasoning Large Language Model |
| Model | Qwen/Qwen3-1.7B |
| Dataset | zwhe99/DeepMath-103K |
| Device | 1 x H100 (80GB) |
| Optimization Algorithms | quantizer(hqq), compiler(torch_compile) |
| Evaluation Metrics | total_time, throughput, perplexity, energy_consumed |
Getting Started
To install the required dependencies, you can run the following command:
[ ]:
%pip install pruna
For more information about how to install Pruna, please refer to the Installation page.
Then, we will set the device to the best available option to get the most out of the optimization. For this tutorial, we recommend running on a GPU.
[1]:
import torch
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
1. Load the Model
First, we will load the original model and tokenizer using the transformers library. In our case, we will use one of the smaller versions of Qwen3, Qwen/Qwen3-1.7B, as a starting point. However, Pruna works at least as well with larger models, so feel free to use a bigger version of Qwen3 or any other reasoning model available on Hugging Face.
[ ]:
from transformers import pipeline

model_name = "Qwen/Qwen3-1.7B"
pipe = pipeline(
    "text-generation",
    model_name,
)
Once we’ve loaded the model and tokenizer, we can generate a response and parse it to extract the reasoning steps.
[5]:
import copy
import re

# Helper function to parse the thinking content
def parse_thinking_content(messages):  # noqa: D103
    messages = copy.deepcopy(messages)
    for message in messages:
        if message["role"] == "assistant" and (
            m := re.match(r"<think>\n(.+)</think>\n\n", message["content"], flags=re.DOTALL)
        ):
            message["content"] = message["content"][len(m.group(0)) :]
            if thinking_content := m.group(1).strip():
                message["reasoning_content"] = thinking_content
    return messages

# Run the model
messages = [
    {
        "role": "user",
        "content": "Give me a short introduction to large language model.",
    },
]
messages = pipe(messages, max_new_tokens=32768)[0]["generated_text"]
parse_thinking_content(messages)
[5]:
[{'role': 'user',
'content': 'Give me a short introduction to large language model.'},
{'role': 'assistant',
'content': 'Large language models (LLMs) are AI systems designed to understand, generate, and interact with human language. They are trained on massive datasets of text, enabling them to grasp complex patterns and produce coherent, context-aware responses. These models, often based on transformer architecture, excel in tasks like translation, writing, and answering questions. While they offer remarkable capabilities, they also face challenges such as data bias and the need for continuous refinement. LLMs are revolutionizing industries by enhancing productivity and innovation in areas like customer service, content creation, and research.',
'reasoning_content': 'Okay, the user wants a short introduction to large language models. Let me start by defining what they are. Large language models (LLMs) are AI systems trained on vast amounts of text data. I should mention their key features like natural language understanding and generation.\n\nWait, I need to make sure it\'s concise. Maybe start with a definition, then talk about their training, the components like transformer architecture, and their applications. Also, mention that they\'re used in various fields like customer service, content creation, etc. Oh, and maybe touch on their limitations, like data bias or lack of real-world knowledge. But since it\'s a short intro, maybe keep it positive and highlight the benefits first.\n\nLet me check if I\'m missing anything. The user might be a student or someone new to AI. They need a clear, straightforward explanation without too much jargon. Make sure to explain key terms like "training data" and "transformer architecture" in simple terms. Avoid technical details that might confuse them. Alright, structure it as a brief overview with key points.'}]
2. Define the SmashConfig
Now that our base model is loaded and tested, we can specify the SmashConfig to customize the optimizations applied during smashing.
Not every optimization algorithm works with every model. You can learn about the requirements and compatibility in the Algorithms Overview.
In this example, we will enable hqq quantization to reduce the memory footprint of the model and torch_compile compilation to speed up inference.
[ ]:
from pruna import SmashConfig
smash_config = SmashConfig()
# Configure the quantizer
smash_config["quantizer"] = "hqq"
smash_config["hqq_weight_bits"] = 8
smash_config["hqq_compute_dtype"] = "torch.bfloat16"
# Configure the compiler
smash_config["compiler"] = "torch_compile"
smash_config["torch_compile_fullgraph"] = True
smash_config["torch_compile_dynamic"] = True
3. Smash the Model
Now that we have our SmashConfig defined, it’s time to apply it to our base model. We’ll call the smash function with the base model and our SmashConfig.
Ready to smash? This operation typically takes around 20 seconds, depending on the configuration.
[ ]:
from pruna import smash

# Keep an unmodified copy of the base model on CPU so we can compare against it later.
copy_model = copy.deepcopy(pipe.model).to("cpu")

smashed_model = smash(
    model=pipe.model,
    smash_config=smash_config,
)
Great! Now we have our optimized, smashed model. Let’s check how it works by running some inference.
Note that if you are using torch_compile as the compiler, the first inference call acts as a warmup and will take noticeably longer than subsequent calls.
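If you want the timings of the next cell to reflect steady-state speed, you can trigger the compilation up front with a short throwaway generation. This is just an optional sketch that reuses the pipe object from above; the prompt and token budget are arbitrary illustration values.
[ ]:
# Optional warmup: the first call after compilation triggers torch_compile's
# graph capture, so we run a short, throwaway generation before benchmarking.
# The prompt and max_new_tokens here are arbitrary choices for illustration.
_ = pipe(
    [{"role": "user", "content": "Warmup."}],
    max_new_tokens=16,
)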
[8]:
messages = [
    {
        "role": "user",
        "content": "Give me a short introduction to large language models.",
    },
]
messages = pipe(messages, max_new_tokens=32768)[0]["generated_text"]
parse_thinking_content(messages)
[8]:
[{'role': 'user',
'content': 'Give me a short introduction to large language models.'},
{'role': 'assistant',
'content': "Large language models (LLMs) are advanced AI systems designed to understand and generate human-like text. They learn from vast amounts of data using deep learning techniques, enabling them to produce coherent and contextually relevant responses. These models excel in tasks like language translation, content creation, and customer service chatbots. While they're powerful, they're not infallible and rely on data quality. Their integration into daily life has transformed how we interact with technology, making tasks faster and more efficient.",
'reasoning_content': "Okay, the user wants a short introduction to large language models. Let me start by defining what they are. Large language models are AI systems that can understand and generate human-like text. I should mention their training with vast amounts of data and their use in various applications like chatbots and content creation.\n\nWait, maybe I should break it down into key points: what they are, how they work, and their applications. Keep it concise. Also, the user might be a beginner, so avoid technical jargon. Maybe mention neural networks and deep learning. Oh, and the difference between models like GPT and others. But since it's a short intro, maybe just a few examples. \n\nI need to ensure the explanation is clear but not too detailed. Maybe start with a simple definition, then talk about their training, then their uses. Also, note that they're not just data processors but have some level of understanding. Make sure it's friendly and easy to grasp. Let me check the key points again: definition, training, applications, and maybe a sentence on their limitations. But since it's a short intro, maybe keep it to the basics. Alright, time to put it all together in a few sentences."}]
As we can see, the model still generates a similar response, complete with a thinking process.
If you notice a significant difference, it might be due to the model, the configuration, the hardware, etc. Since optimization can be non-deterministic, we encourage you to retry the smashing step or to experiment with different configurations and models until you find the best fit for your use case (one possible variation is sketched below). And feel free to reach out to us on Discord if you have any questions or feedback.
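As a purely illustrative variant, the sketch below keeps the same quantizer/compiler pair but lowers the weight precision. Whether 4-bit weights are a good trade-off (or even supported in your Pruna version) depends on your model and hardware, so treat it as a starting point rather than a recommendation.
[ ]:
# Hypothetical alternative SmashConfig: same algorithms as above, but with
# 4-bit hqq weights for more aggressive compression.
# (Assumes 4 is an accepted value for `hqq_weight_bits` in your Pruna version.)
alt_config = SmashConfig()
alt_config["quantizer"] = "hqq"
alt_config["hqq_weight_bits"] = 4
alt_config["hqq_compute_dtype"] = "torch.bfloat16"
alt_config["compiler"] = "torch_compile"
alt_config["torch_compile_fullgraph"] = True
alt_config["torch_compile_dynamic"] = True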
4. Evaluate the Smashed Model
As our smashed model is working, we can evaluate how much the optimization has improved it. For this, we run a performance evaluation using the EvaluationAgent and zwhe99/DeepMath-103K as our custom reasoning dataset. In this case, we will also include metrics like total_time, perplexity, throughput and energy_consumed.
A complete list of the available metrics can be found in Evaluation.
[ ]:
from datasets import load_dataset

from pruna import PrunaModel
from pruna.data.pruna_datamodule import PrunaDataModule
from pruna.data.utils import split_train_into_train_val_test
from pruna.evaluation.evaluation_agent import EvaluationAgent
from pruna.evaluation.metrics import (
    EnergyConsumedMetric,
    ThroughputMetric,
    TorchMetricWrapper,
    TotalTimeMetric,
)
from pruna.evaluation.task import Task

# Define the metrics. Increment the number of iterations
# and warmup iterations to get a more accurate result.
metrics = [
    TotalTimeMetric(n_iterations=50, n_warmup_iterations=5),
    ThroughputMetric(n_iterations=50, n_warmup_iterations=5),
    TorchMetricWrapper("perplexity", call_type="single"),
    EnergyConsumedMetric(n_iterations=50, n_warmup_iterations=5),
]

# Load the dataset and split it into train, validation and test
train_ds = load_dataset("zwhe99/DeepMath-103K", split="train")
train_ds = train_ds.rename_column(
    "question", "text"
)  # Rename the column to match the `text_generation_collate` function
train_ds, val_ds, test_ds = split_train_into_train_val_test(train_ds, seed=42)

# (Optional) Use the eos_token as the pad_token
pipe.tokenizer.pad_token = pipe.tokenizer.eos_token

# Create the data module. Increment the `max_seq_len` to match the
# `max_new_tokens` of the model for a more accurate evaluation.
datamodule = PrunaDataModule.from_datasets(
    datasets=(train_ds, val_ds, test_ds),
    collate_fn="text_generation_collate",
    tokenizer=pipe.tokenizer,
    collate_fn_args={"max_seq_len": 512},
    dataloader_args={"batch_size": 16, "num_workers": 4},
)
datamodule.limit_datasets(100)

# Define the task and the evaluation agent
task = Task(metrics, datamodule=datamodule, device=device)
eval_agent = EvaluationAgent(task)

# (Optional) Define specific inference arguments for benchmarking.
inference_args = {
    "max_new_tokens": 512,  # Increment the `max_new_tokens` for a more accurate evaluation.
}
[ ]:
# Evaluate the smashed model and offload it to CPU
smashed_model.move_to_device(device)
smashed_model.inference_handler.model_args.update(inference_args)
smashed_model_results = eval_agent.evaluate(smashed_model)
smashed_model.move_to_device("cpu")
[ ]:
# Evaluate the base model and offload it to CPU
base_pipe = PrunaModel(model=copy_model)
base_pipe.move_to_device(device)
base_pipe.inference_handler.model_args.update(inference_args)
base_model_results = eval_agent.evaluate(base_pipe)
base_pipe.move_to_device("cpu")
Now we can see the results of the evaluation and compare the performance of the original and the optimized model.
[13]:
from IPython.display import Markdown, display  # noqa

# Calculate percentage differences for each metric
def calculate_percentage_diff(original, optimized):  # noqa
    return ((optimized - original) / original) * 100

# Calculate differences and prepare table data
table_data = []
for base_metric_result, smashed_metric_result in zip(base_model_results, smashed_model_results):
    diff = calculate_percentage_diff(base_metric_result.result, smashed_metric_result.result)
    table_data.append(
        {
            "Metric": base_metric_result.name,
            "Base Model": f"{base_metric_result.result:.4f}",
            "Compressed Model": f"{smashed_metric_result.result:.4f}",
            "Relative Difference": f"{diff:+.2f}%",
        }
    )

# Create and display the markdown table manually
markdown_table = "| Metric | Base Model | Compressed Model | Relative Difference |\n"
markdown_table += "|--------|----------|-----------|------------|\n"
for row in table_data:
    # Look up the metric object to append its units (if any) to the values
    metric_obj = [metric for metric in metrics if metric.metric_name == row["Metric"]][0]
    unit = f" {metric_obj.metric_units}" if hasattr(metric_obj, "metric_units") else ""
    markdown_table += f"| {row['Metric']} | {row['Base Model']}{unit} | {row['Compressed Model']}{unit} | {row['Relative Difference']} |\n"  # noqa: E501
display(Markdown(markdown_table))
| Metric | Base Model | Compressed Model | Relative Difference |
|---|---|---|---|
| perplexity | 3.3330 | 2.8230 | -15.30% |
| total_time | 42390.9036 ms | 6869.6069 ms | -83.79% |
| throughput | 0.0189 num_iterations/ms | 0.1165 num_iterations/ms | +517.08% |
| energy_consumed | 0.0059 kWh | 0.0011 kWh | -81.92% |
As expected, we can observe a significant improvement. The compressed model is more than 6× faster (total time dropped by about 84%) and delivers over 6× the throughput. Even better, we didn’t lose quality (remember, lower perplexity means better results, and it went down as well), and energy consumption dropped by about 82%. This really is the best-case scenario!
With these results, we can save the optimized model to disk or share it with others:
[ ]:
# Save the model to disk
smashed_model.save_pretrained("Qwen3-1.7B-smashed")
# Load the model from disk
# smashed_model = PrunaModel.from_pretrained("Qwen3-1.7B-smashed/")
# Save the model to HuggingFace
# smashed_model.push_to_hub("PrunaAI/Qwen3-1.7B-smashed")
Conclusions
In this tutorial, we have seen how to optimize and evaluate a reasoning Large Language Model using Pruna: how to use the SmashConfig to customize the optimizations applied during smashing, and how to evaluate the performance of the optimized model using the EvaluationAgent.
The results show that by compressing the model and combining different algorithms, we can achieve a significant improvement in performance without losing accuracy.
Check out our other tutorials for more examples of how to optimize and evaluate image/video generation models or other LLMs.