Compress and Evaluate Reasoning Large Language Models
| Component | Details |
|---|---|
| Goal | Showcase a standard workflow for optimizing and evaluating a reasoning Large Language Model |
| Model | Qwen/Qwen3-1.7B |
| Dataset | zwhe99/DeepMath-103K |
| Device | 1 x H100 (80GB) |
| Optimization Algorithms | quantizer(hqq), compiler(torch_compile) |
| Evaluation Metrics | total_time, throughput, perplexity, energy_consumed |
Getting Started
To install the required dependencies, you can run the following command:
[ ]:
%pip install pruna
For more information about how to install Pruna, please refer to the Installation page.
Then, we will set the device to the best available option to get the most out of the optimization. For this tutorial, we recommend running on a GPU.
[1]:
import torch
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
1. Load the Model
First, we will load the original model and tokenizer using the transformers library. In our case, we will use one of the smaller versions of Qwen3, Qwen/Qwen3-1.7B, as a starting point. However, Pruna works at least as well with larger models, so feel free to use a bigger version of Qwen3 or any other reasoning model available on Hugging Face.
[ ]:
from transformers import pipeline

model_name = "Qwen/Qwen3-1.7B"
pipe = pipeline(
    "text-generation",
    model_name,
)
Once we’ve loaded the model and tokenizer, we can generate a response and parse it to extract the reasoning steps.
[5]:
import copy
import re

# Helper function to parse the thinking content
def parse_thinking_content(messages):  # noqa: D103
    messages = copy.deepcopy(messages)
    for message in messages:
        if message["role"] == "assistant" and (
            m := re.match(r"<think>\n(.+)</think>\n\n", message["content"], flags=re.DOTALL)
        ):
            message["content"] = message["content"][len(m.group(0)) :]
            if thinking_content := m.group(1).strip():
                message["reasoning_content"] = thinking_content
    return messages

# Run the model
messages = [
    {
        "role": "user",
        "content": "Give me a short introduction to large language model.",
    },
]
messages = pipe(messages, max_new_tokens=32768)[0]["generated_text"]
parse_thinking_content(messages)
[5]:
[{'role': 'user',
'content': 'Give me a short introduction to large language model.'},
{'role': 'assistant',
'content': 'Large language models (LLMs) are AI systems designed to understand, generate, and interact with human language. They are trained on massive datasets of text, enabling them to grasp complex patterns and produce coherent, context-aware responses. These models, often based on transformer architecture, excel in tasks like translation, writing, and answering questions. While they offer remarkable capabilities, they also face challenges such as data bias and the need for continuous refinement. LLMs are revolutionizing industries by enhancing productivity and innovation in areas like customer service, content creation, and research.',
'reasoning_content': 'Okay, the user wants a short introduction to large language models. Let me start by defining what they are. Large language models (LLMs) are AI systems trained on vast amounts of text data. I should mention their key features like natural language understanding and generation.\n\nWait, I need to make sure it\'s concise. Maybe start with a definition, then talk about their training, the components like transformer architecture, and their applications. Also, mention that they\'re used in various fields like customer service, content creation, etc. Oh, and maybe touch on their limitations, like data bias or lack of real-world knowledge. But since it\'s a short intro, maybe keep it positive and highlight the benefits first.\n\nLet me check if I\'m missing anything. The user might be a student or someone new to AI. They need a clear, straightforward explanation without too much jargon. Make sure to explain key terms like "training data" and "transformer architecture" in simple terms. Avoid technical details that might confuse them. Alright, structure it as a brief overview with key points.'}]
2. Define the SmashConfig
Now that our base model is loaded and tested, we can specify the SmashConfig to customize the optimizations applied during smashing.
Not every optimization algorithm works with every model. You can learn about the requirements and compatibility in the Algorithms Overview.
In this example, we will enable hqq quantization to reduce the memory footprint of the model and torch_compile compilation to speed up inference.
[ ]:
from pruna import SmashConfig
smash_config = SmashConfig()
# Configure the quantizer
smash_config["quantizer"] = "hqq"
smash_config["hqq_weight_bits"] = 8
smash_config["hqq_compute_dtype"] = "torch.bfloat16"
# Configure the compiler
smash_config["compiler"] = "torch_compile"
smash_config["torch_compile_fullgraph"] = True
smash_config["torch_compile_dynamic"] = True
3. Smash the Model
Now that we have our SmashConfig defined, it’s time to apply it to our base model. We’ll call the smash function with the base model and our SmashConfig.
Ready to smash? This operation typically takes around 20 seconds, depending on the configuration.
[ ]:
from pruna import smash

# Keep an unmodified copy of the base model on CPU so we can compare against it later.
copy_model = copy.deepcopy(pipe.model).to("cpu")

smashed_model = smash(
    model=pipe.model,
    smash_config=smash_config,
)
Great! Now we have our optimized, smashed model. Let’s check how it works by running some inference.
Note that if you are using torch_compile as the compiler, the first inference call acts as a warmup and will take noticeably longer than subsequent calls.
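If you want the timings of the next cell to reflect steady-state speed, you can trigger the compilation up front with a short throwaway generation. This is just an optional sketch that reuses the pipe object from above; the prompt and token budget are arbitrary illustration values.
[ ]:
# Optional warmup: the first call after compilation triggers torch_compile's
# graph capture, so we run a short, throwaway generation before benchmarking.
# The prompt and max_new_tokens here are arbitrary choices for illustration.
_ = pipe(
    [{"role": "user", "content": "Warmup."}],
    max_new_tokens=16,
)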
[8]:
messages = [
    {
        "role": "user",
        "content": "Give me a short introduction to large language models.",
    },
]
messages = pipe(messages, max_new_tokens=32768)[0]["generated_text"]
parse_thinking_content(messages)
[8]:
[{'role': 'user',
'content': 'Give me a short introduction to large language models.'},
{'role': 'assistant',
'content': "Large language models (LLMs) are advanced AI systems designed to understand and generate human-like text. They learn from vast amounts of data using deep learning techniques, enabling them to produce coherent and contextually relevant responses. These models excel in tasks like language translation, content creation, and customer service chatbots. While they're powerful, they're not infallible and rely on data quality. Their integration into daily life has transformed how we interact with technology, making tasks faster and more efficient.",
'reasoning_content': "Okay, the user wants a short introduction to large language models. Let me start by defining what they are. Large language models are AI systems that can understand and generate human-like text. I should mention their training with vast amounts of data and their use in various applications like chatbots and content creation.\n\nWait, maybe I should break it down into key points: what they are, how they work, and their applications. Keep it concise. Also, the user might be a beginner, so avoid technical jargon. Maybe mention neural networks and deep learning. Oh, and the difference between models like GPT and others. But since it's a short intro, maybe just a few examples. \n\nI need to ensure the explanation is clear but not too detailed. Maybe start with a simple definition, then talk about their training, then their uses. Also, note that they're not just data processors but have some level of understanding. Make sure it's friendly and easy to grasp. Let me check the key points again: definition, training, applications, and maybe a sentence on their limitations. But since it's a short intro, maybe keep it to the basics. Alright, time to put it all together in a few sentences."}]
As we can see, the model still generates a similar response, complete with a thinking process.
If you notice a significant difference, it might be due to the model, the configuration, the hardware, etc. Since optimization can be non-deterministic, we encourage you to retry the smashing step or to experiment with different configurations and models until you find the best fit for your use case (one possible variation is sketched below). And feel free to reach out to us on Discord if you have any questions or feedback.
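As a purely illustrative variant, the sketch below keeps the same quantizer/compiler pair but lowers the weight precision. Whether 4-bit weights are a good trade-off (or even supported in your Pruna version) depends on your model and hardware, so treat it as a starting point rather than a recommendation.
[ ]:
# Hypothetical alternative SmashConfig: same algorithms as above, but with
# 4-bit hqq weights for more aggressive compression.
# (Assumes 4 is an accepted value for `hqq_weight_bits` in your Pruna version.)
alt_config = SmashConfig()
alt_config["quantizer"] = "hqq"
alt_config["hqq_weight_bits"] = 4
alt_config["hqq_compute_dtype"] = "torch.bfloat16"
alt_config["compiler"] = "torch_compile"
alt_config["torch_compile_fullgraph"] = True
alt_config["torch_compile_dynamic"] = True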
4. Evaluate the Smashed Model
As our smashed model is working, we can evaluate how much the optimization has improved it. For this, we run a performance evaluation using the EvaluationAgent and zwhe99/DeepMath-103K as our custom reasoning dataset. In this case, we will also include metrics like total_time, perplexity, throughput and energy_consumed.
A complete list of the available metrics can be found in Evaluation.
[ ]:
from datasets import load_dataset

from pruna import PrunaModel
from pruna.data.pruna_datamodule import PrunaDataModule
from pruna.data.utils import split_train_into_train_val_test
from pruna.evaluation.evaluation_agent import EvaluationAgent
from pruna.evaluation.metrics import (
    EnergyConsumedMetric,
    ThroughputMetric,
    TorchMetricWrapper,
    TotalTimeMetric,
)
from pruna.evaluation.task import Task

# Define the metrics. Increment the number of iterations
# and warmup iterations to get a more accurate result.
metrics = [
    TotalTimeMetric(n_iterations=50, n_warmup_iterations=5),
    ThroughputMetric(n_iterations=50, n_warmup_iterations=5),
    TorchMetricWrapper("perplexity", call_type="single"),
    EnergyConsumedMetric(n_iterations=50, n_warmup_iterations=5),
]

# Load the dataset and split it into train, validation and test
train_ds = load_dataset("zwhe99/DeepMath-103K", split="train")
train_ds = train_ds.rename_column(
    "question", "text"
)  # Rename the column to match the `text_generation_collate` function
train_ds, val_ds, test_ds = split_train_into_train_val_test(train_ds, seed=42)

# (Optional) Use the eos_token as the pad_token
pipe.tokenizer.pad_token = pipe.tokenizer.eos_token

# Create the data module. Increment the `max_seq_len` to match the
# `max_new_tokens` of the model for a more accurate evaluation.
datamodule = PrunaDataModule.from_datasets(
    datasets=(train_ds, val_ds, test_ds),
    collate_fn="text_generation_collate",
    tokenizer=pipe.tokenizer,
    collate_fn_args={"max_seq_len": 512},
    dataloader_args={"batch_size": 16, "num_workers": 4},
)
datamodule.limit_datasets(100)

# Define the task and the evaluation agent
task = Task(metrics, datamodule=datamodule, device=device)
eval_agent = EvaluationAgent(task)

# (Optional) Define specific inference arguments for benchmarking.
inference_args = {
    "max_new_tokens": 512,  # Increment the `max_new_tokens` for a more accurate evaluation.
}
[ ]:
# Evaluate the smashed model and offload it to CPU
smashed_model.move_to_device(device)
smashed_model.inference_handler.model_args.update(inference_args)
smashed_model_results = eval_agent.evaluate(smashed_model)
smashed_model.move_to_device("cpu")
[ ]:
# Evaluate the base model and offload it to CPU
base_pipe = PrunaModel(model=copy_model)
base_pipe.move_to_device(device)
base_pipe.inference_handler.model_args.update(inference_args)
base_model_results = eval_agent.evaluate(base_pipe)
base_pipe.move_to_device("cpu")
Now we can see the results of the evaluation and compare the performance of the original and the optimized model.
[13]:
from IPython.display import Markdown, display  # noqa

# Calculate percentage differences for each metric
def calculate_percentage_diff(original, optimized):  # noqa
    return ((optimized - original) / original) * 100

# Calculate differences and prepare table data
table_data = []
for base_metric_result, smashed_metric_result in zip(base_model_results, smashed_model_results):
    diff = calculate_percentage_diff(base_metric_result.result, smashed_metric_result.result)
    table_data.append(
        {
            "Metric": base_metric_result.name,
            "Base Model": f"{base_metric_result.result:.4f}",
            "Compressed Model": f"{smashed_metric_result.result:.4f}",
            "Relative Difference": f"{diff:+.2f}%",
        }
    )

# Create and display the markdown table manually
markdown_table = "| Metric | Base Model | Compressed Model | Relative Difference |\n"
markdown_table += "|--------|----------|-----------|------------|\n"
for row in table_data:
    # Look up the metric object to append its units (if any) to the values
    metric_obj = [metric for metric in metrics if metric.metric_name == row["Metric"]][0]
    unit = f" {metric_obj.metric_units}" if hasattr(metric_obj, "metric_units") else ""
    markdown_table += f"| {row['Metric']} | {row['Base Model']}{unit} | {row['Compressed Model']}{unit} | {row['Relative Difference']} |\n"  # noqa: E501
display(Markdown(markdown_table))
| Metric | Base Model | Compressed Model | Relative Difference |
|---|---|---|---|
| perplexity | 3.3330 | 2.8230 | -15.30% |
| total_time | 42390.9036 ms | 6869.6069 ms | -83.79% |
| throughput | 0.0189 num_iterations/ms | 0.1165 num_iterations/ms | +517.08% |
| energy_consumed | 0.0059 kWh | 0.0011 kWh | -81.92% |
As expected, we can observe a significant improvement. The compressed model is more than 6× faster (total time dropped by about 84%) and delivers over 6× the throughput. Even better, we didn’t lose quality (remember, lower perplexity means better results, and it went down as well), and energy consumption dropped by about 82%. This really is the best-case scenario!
With these results, we can save the optimized model to disk or share it with others:
[ ]:
# Save the model to disk
smashed_model.save_pretrained("Qwen3-1.7B-smashed")
# Load the model from disk
# smashed_model = PrunaModel.from_pretrained("Qwen3-1.7B-smashed/")
# Save the model to HuggingFace
# smashed_model.push_to_hub("PrunaAI/Qwen3-1.7B-smashed")
Conclusions
In this tutorial, we have seen how to optimize and evaluate a reasoning Large Language Model using Pruna: how to use the SmashConfig to customize the optimizations applied during smashing, and how to evaluate the performance of the optimized model using the EvaluationAgent.
The results show that by compressing the model and combining different algorithms, we can achieve a significant improvement in performance without losing accuracy.
Check out our other tutorials for more examples of how to optimize and evaluate image/video generation models or other LLMs.