FAQ

General Questions

Do I need a GPU to use Pruna?

While some algorithms require a GPU, many can run on CPU. The package clearly indicates GPU requirements through the run_on_cpu and run_on_cuda flags for each algorithm. You can check our algorithm documentation for specific hardware requirements.

Do I need Linux to use Pruna?

Pruna is officially supported on Linux. While it can be used on macOS and Windows, compatibility depends on the specific algorithms you plan to use, as some may have platform-specific dependencies. If you don’t have access to a Linux environment, you can still easily experiment with Pruna using cloud platforms like Google Colab.

How do I evaluate if the compression worked well?

Pruna provides a comprehensive evaluation framework through the EvaluationAgent and Task classes. You can evaluate models using:

  1. Pre-defined task metrics (e.g., image_generation_quality)

  2. Custom metric lists (e.g., ["clip_score", "psnr"])

  3. Your own custom metrics

The framework helps you compare the original and compressed models across quality, speed, and resource usage metrics.
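
A minimal sketch of such an evaluation (the import paths, the PrunaDataModule helper, and the metric names follow our reading of the evaluation docs and may differ between Pruna versions; smashed_model stands in for a model returned by smash):

from pruna.evaluation.evaluation_agent import EvaluationAgent
from pruna.evaluation.task import Task
from pruna.data.pruna_datamodule import PrunaDataModule

# Build a task from a pre-defined metric set; a custom metric list such as
# ["clip_score", "psnr"] can be passed instead.
datamodule = PrunaDataModule.from_string("LAION256")
task = Task("image_generation_quality", datamodule=datamodule)

# Evaluate the smashed model; `smashed_model` is a placeholder for your own model.
agent = EvaluationAgent(task)
results = agent.evaluate(smashed_model)
print(results)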

Can I combine multiple compression algorithms?

Yes! Pruna supports combining compatible algorithms through the SmashConfig. Each algorithm specifies its compatible combinations in the compatible_algorithms property.
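
As a minimal sketch, a cacher and a compiler can be set on the same SmashConfig (the algorithm names below are illustrative; check the algorithm docs for valid, compatible combinations):

from pruna import SmashConfig, smash

smash_config = SmashConfig()
smash_config["cacher"] = "deepcache"      # illustrative algorithm names; verify
smash_config["compiler"] = "stable_fast"  # compatibility in the algorithm docs

# `base_model` is a placeholder for your original model, e.g. a diffusers pipeline.
smashed_model = smash(model=base_model, smash_config=smash_config)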

What datasets should I use for calibration?

Some algorithms require calibration data for optimal results. You can a) use Pruna’s built-in datasets (e.g., LAION256 for image models) or b) provide your own custom dataset through SmashConfig. Be mindful that the calibration dataset should be representative of your actual usage patterns for best results.
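
A rough sketch of both options (the add_data call and the formats it accepts follow our reading of the dataset docs and may vary between versions):

from pruna import SmashConfig

smash_config = SmashConfig()

# a) Use a built-in calibration dataset.
smash_config.add_data("LAION256")

# b) Or provide your own data instead; see the dataset documentation for the
#    formats add_data accepts, e.g.:
# smash_config.add_data(my_custom_dataset)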

Why should I use Pruna instead of individual compression packages?

Pruna offers several advantages over using individual compression packages:

  • Unified interface: a consistent API across all compression methods, so you do not need to install and learn multiple package interfaces, which reduces integration complexity.

  • Automatic compatibility management: Pruna handles algorithm compatibility checks and version requirements for you.

  • Comprehensive evaluation: the evaluation framework works seamlessly with all compression methods, making it easy to compare and validate results.

  • Smart defaults: compression methods ship with defaults optimized through extensive testing, delivering strong out-of-the-box performance while still allowing customization.

  • Simplified workflow: a single smash call applies multiple compression techniques at once, and Pruna handles the complexities of saving and loading compressed models for you.

  • Active maintenance: the Pruna team keeps the package compatible with the latest versions of the underlying packages and resolves issues promptly.

How big are the improvements?

The size of the improvements depends on your own pipeline as well as the configuration you choose. Gains of 2-10x are common, sometimes more and sometimes less. You can check out our public models on Hugging Face, where we report benchmark results for the compressed models.

I compiled my model but my inference is slower than before.

This can happen! With some compilation algorithms, the first inference call is slower than usual because it triggers compilation and warm-up. Call the model again after this initial inference and it will be lightning fast.
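
A quick way to observe this (a sketch; `smashed_model`, `sample_input`, and the inference call are placeholders for your own pipeline):

import time

def timed_call(model, inputs):
    start = time.perf_counter()
    model(inputs)  # replace with your pipeline's actual inference call
    return time.perf_counter() - start

# The first call includes compilation/warm-up and is expected to be slow;
# subsequent calls run at full speed.
print(f"first call:  {timed_call(smashed_model, sample_input):.2f}s")
print(f"second call: {timed_call(smashed_model, sample_input):.2f}s")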

I am comparing the smashed model to the original model but there is no speed improvement.

Some algorithms, in particular compilation algorithms, modify the model in place. Make sure to take a deepcopy of your original model before smashing if you want to compare the performance.
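
For example (a minimal sketch; `model` and `smash_config` are placeholders for your own model and configuration):

import copy

from pruna import smash

# Keep an untouched copy before smashing, since compilation can modify `model` in place.
original_model = copy.deepcopy(model)
smashed_model = smash(model=model, smash_config=smash_config)

# Benchmark `original_model` against `smashed_model`, not `model` against `smashed_model`.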

Does the model quality change?

The quality of the smashed model depends on your own pipeline as well as the configuration you choose. Some configs do not change quality, while others can slightly vary the output (usually to make the model even faster and smaller). We have put a lot of work into making the package adapt efficiency algorithms in a way that minimizes their combined impact on model output.

For the public Hugging Face models, how is the model efficiency evaluated?

These results were obtained with the configuration described in model/smash_config.json and are measured after a hardware warmup. The smashed model is directly compared to the original base model. Efficiency results may vary in other settings (e.g. other hardware, image size, batch size, …). We recommend running the models directly under your use-case conditions to find out whether the smashed model benefits you.

What is the model format of the smashed model?

The format in which Pruna saves the model depends on the algorithm or combination of algorithms used. We try to save the model in non-pickled formats wherever possible.

For the public Hugging Face models, what is the “first” metric?

Results mentioning “first” are obtained after the first run of the model. The first run might take more memory or be slower than the subsequent runs due to CUDA overheads.

How can I serve Pruna models in production?

Pruna models can be served in production using any inference framework that supports PyTorch. We have tutorials for Triton Inference Server, Replicate, and ComfyUI. We will be adding even more tutorials soon!

Pruna Pro Questions

I’m getting a “403: Exception: Token is not a valid Pruna token” error. What should I do?

You might encounter this error when trying to use Pruna Pro:

403: Exception: Token is not a valid Pruna token. Please make sure you are passing the correct token.

This typically means there’s an issue with the token or the version of Pruna you’re using.

1. Double-Check the Token

  • Make sure you’re using the correct token in your configuration.

  • If you’re unsure, contact us and send your token privately (we’ll only use it to check it on our side and handle it securely).

2. Check Your Installed Pruna Version

New tokens are only supported from version 0.2.2 and above.

To check which versions you have installed, run:

# On Windows
pip freeze

# On macOS/Linux
pip freeze | grep pruna

You should see something like:

pruna==0.2.6
pruna_pro==0.2.6

3. Update to the Latest Version

If needed, update by running:

pip install pruna_pro==0.2.6

Then re-run pip freeze to confirm that you’re using the right version.
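
If you prefer checking from Python, the standard library works on every platform:

from importlib.metadata import PackageNotFoundError, version

for package in ("pruna", "pruna_pro"):
    try:
        print(package, version(package))
    except PackageNotFoundError:
        print(package, "is not installed")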

4. Still Seeing the Error?

If the issue persists after updating:

  • Please send us the full error message or traceback.

  • Include the token (privately) if possible so we can verify it.

  • We’ll take it from there and get you unstuck.

How does Pruna Pro track runtime hours?

Pruna Pro tracks “runtime” hours, similar to how GPU hours are billed: the total time that an optimized deep learning model is active and available in memory to perform its tasks. Usage is metered per minute and charged per hour.

How can I track my usage?

Each time you smash a model or load a smashed model (via PrunaModel.load_model()), a message appears in your terminal: “You have used XYZ hours this month.”

How do I access my billing information?

You can access your billing information, view invoices, or manage your subscription by visiting the Stripe Customer Portal.

You can log in using the email you used when you purchased Pruna Pro or Enterprise.

How can I get an AI Efficiency training?

The “AI Efficiency” Fundamental is a two-day training we designed initially for AI Teams at Stellantis and BPCE. The course (including the lecture, exercises, and notebooks) is fully open-source (see the GitHub repo), and we’re offering a trainer-led service to equip up to 12 developers, engineers, or researchers with the skills to build, compress, evaluate, and deploy efficient AI models.

Conditions:

  • 2 days or 4x 1/2 days

  • Remote or onsite (Paris or Munich)

  • 12 participants max

  • 50/50 split between lectures and hands-on exercises

Every participant receives an AI Efficiency Fundamentals certificate upon completion.

Contact our Sales Team for more information.

How can I get a Model Benchmark?

We’re often asked: “What can Pruna do on the XYZ model?” And the answer depends on your goal. Are you exploring possibilities or validating for real-world use?

We offer two clear paths:

🟡 Simple Overview: If you’re just looking to get a ballpark view, something like “what can you do on Model X?”, we will either:

  • Share existing benchmark numbers, if we’ve already run the model.

  • Run a quick optimization pass on our side using our optimization agent.

It’s free and suitable for early discovery. You’ll get quick signals using open-source models, general datasets, and no custom tuning.

🟢 Full Benchmark: the go-to path when inference optimization is critical enough to justify time and budget. We replicate your production setup, test multiple strategies, and show if there’s real performance to gain and ROI to capture.

It all starts with an intake: the Benchmark Request Document, where we collect:

  • Your context: needed to avoid wrong assumptions and to align on the success criteria

  • Your technical environment: hosting provider, hardware, serving framework

  • Your inference setup: latency targets, batch size, evaluation metrics, custom logic

You can fill in a request at bench.pruna.ai.

vLLM Integration Questions

Why is vLLM so popular for optimizing LLMs?

vLLM has become one of the most widely adopted inference engines because it delivers strong performance out of the box. When you load a model from Hugging Face into vLLM, it automatically applies two key improvements:

  • A custom transformer architecture implementation

  • A compilation feature that is opted in by default

What happens when I combine Pruna with vLLM?

Pruna adds an additional speed-up on top of vLLM’s optimizations:

  • +20% with Pruna Open-Source

  • +50% with Pruna Pro

This acceleration is independent of the model size: whether you’re running a 1B parameter model or a 70B one, you should expect a measurable speed-up.

Importantly, this benefit is also independent of vLLM’s optional features (paged attention, continuous batching, prefill chunking, etc.), meaning any extra serving optimizations you enable in vLLM will stack with Pruna’s acceleration.

For a smooth start, Pruna provides ready-to-use notebooks and tutorials. You can refer to the official Pruna documentation for technical details.

Why not just use vLLM quantizers directly?

Our quantizers differ from vLLM’s implementation: we guarantee that quantization provides a speed-up, not just smaller weights. On top of that, Pruna’s quantizers offer broader stability and compatibility: where vLLM quantizers may fail, Pruna keeps running.

Examples:

  • If you try HQQ on Llama-3-8B, vLLM throws an error due to the model size and kernel. With Pruna, it runs without issue.

  • If you want to use bitsandbytes, in vLLM you’re locked to batch_size = 1. With Pruna, switching to another quantizer only takes 3 lines of code, as sketched below.
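
A sketch of such a switch (the quantizer identifier below is illustrative, as is `base_model`; see the quantizer docs for the exact names available in your version):

from pruna import SmashConfig, smash

smash_config = SmashConfig()
smash_config["quantizer"] = "hqq"  # change this string to switch quantizers

smashed_model = smash(model=base_model, smash_config=smash_config)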

Which quantizer should I use?

We recommend HIGGS as the default quantizer: it gives the best balance between throughput and latency efficiency.

For more advanced setups, you can use a dispatcher to dynamically route requests within the same model, but optimized in different ways depending on the use case:

  • HIGGS → when throughput is critical

  • HQQ → when latency is critical

(Note: This dispatcher concept isn’t specific to Pruna or vLLM, but rather a general observation.)
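
A minimal sketch of the idea, with hypothetical request fields and model variants (this is not a Pruna or vLLM API):

def dispatch(request: dict, throughput_model, latency_model):
    # Route each request to the variant optimized for what it needs most.
    if request.get("latency_critical", False):
        return latency_model   # e.g. an HQQ-quantized variant
    return throughput_model    # e.g. a HIGGS-quantized variant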

What else should I know about Pruna and vLLM?

Here are some additional important points:

  • We’re continuously exploring new optimization options; expect more updates soon.

  • Unlike vLLM, Pruna supports DiffusionLLMs, delivering 3–5× speed-ups compared to base models. You can try these on Replicate today, and we’re happy to help with configuration.