FAQ
Do I need a GPU to use Pruna?
While some algorithms require a GPU, many can run on CPU.
The package clearly indicates GPU requirements through the run_on_cpu and run_on_cuda flags for each algorithm.
You can check our algorithm documentation for specific hardware requirements.
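If you want to check for a GPU programmatically before picking GPU-only algorithms, a minimal sketch using plain PyTorch (not a Pruna-specific API) could look like this:

```python
import torch

# Detect whether a CUDA-capable GPU is visible to PyTorch. The per-algorithm
# run_on_cpu / run_on_cuda requirements are listed in the algorithm documentation.
if torch.cuda.is_available():
    print(f"GPU detected: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected - restrict your configuration to CPU-capable algorithms.")
```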
Do I need Linux to use Pruna?
Pruna is officially supported on Linux. While it can be used on macOS and Windows, compatibility depends on the specific algorithms you plan to use, as some may have platform-specific dependencies. If you don’t have access to a Linux environment, you can still easily experiment with Pruna using cloud platforms like Google Colab.
How do I evaluate if the compression worked well?
Pruna provides a comprehensive evaluation framework through the EvaluationAgent and Task classes. You can evaluate models using:
1. Pre-defined task metrics (e.g., image_generation_quality)
2. Custom metric lists (e.g., ["clip_score", "psnr"])
3. Your own custom metrics
The framework helps you compare the original and compressed models across quality, speed, and resource usage metrics.
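As a rough sketch of how such an evaluation could be wired up (the import paths, constructor arguments, and dataset name below are illustrative and may differ in your Pruna version; check the evaluation documentation for the exact API):

```python
# Illustrative sketch - exact module paths and signatures may differ.
from pruna.data.pruna_datamodule import PrunaDataModule
from pruna.evaluation.evaluation_agent import EvaluationAgent
from pruna.evaluation.task import Task

# Evaluation data: a built-in dataset is assumed here for illustration.
datamodule = PrunaDataModule.from_string("LAION256")

# Either a pre-defined request such as "image_generation_quality"
# or a custom metric list can be passed to the task.
task = Task(["clip_score", "psnr"], datamodule=datamodule)
agent = EvaluationAgent(task)

# smashed_model is the model returned by smash(); run the same task on the
# original model to compare quality, speed, and resource usage.
results = agent.evaluate(smashed_model)
print(results)
```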
Can I combine multiple compression algorithms?
Yes! Pruna supports combining compatible algorithms through the SmashConfig.
Each algorithm specifies its compatible combinations in the compatible_algorithms property.
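For example, a config that stacks a caching algorithm with a compiler might look like the sketch below (the algorithm names are illustrative; pick combinations that your model type supports and that are listed as compatible):

```python
from pruna import SmashConfig, smash

# Configure two compatible algorithms in one SmashConfig; Pruna validates
# the combination before applying it to the model.
smash_config = SmashConfig()
smash_config["cacher"] = "deepcache"      # illustrative algorithm name
smash_config["compiler"] = "stable_fast"  # illustrative algorithm name

smashed_model = smash(model=base_model, smash_config=smash_config)
```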
What datasets should I use for calibration?
Some algorithms require calibration data for optimal results.
You can a) use Pruna’s built-in datasets (e.g., LAION256 for image models) or b) provide your own custom dataset through SmashConfig.
Be mindful that the calibration dataset should be representative of your actual usage patterns for best results.
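As a small sketch of both options, assuming a SmashConfig.add_data helper (the exact method name and signature for custom datasets may differ, so check the SmashConfig documentation):

```python
from pruna import SmashConfig

smash_config = SmashConfig()

# a) use a built-in calibration dataset by name, e.g. for image models ...
smash_config.add_data("LAION256")

# b) ... or provide your own dataset that reflects your usage patterns
#    (see the SmashConfig documentation for the supported custom formats).
```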
Why should I use Pruna instead of individual compression packages?
While you can use individual packages directly, Pruna offers several key advantages:
1. Unified interface: a consistent API across all compression methods, eliminating the need to install and learn multiple package interfaces and reducing integration complexity.
2. Automatic compatibility management: Pruna handles algorithm compatibility checks and version requirements for you.
3. Comprehensive evaluation: the evaluation framework works seamlessly with all compression methods, making it easy to compare and validate results.
4. Smart defaults: compression methods ship with defaults optimized through extensive testing, delivering strong out-of-the-box performance while still allowing customization.
5. Simplified workflow: a single-call smash function applies multiple compression techniques at once.
6. Consistent saving and loading: Pruna handles the complexities of saving and loading compressed models for you.
7. Active maintenance: the Pruna team keeps the package compatible with the latest versions of the underlying packages and resolves issues promptly.
How big are the improvements?
The size of the improvements depends on your own pipeline as well as the configuration you choose. Gains of 2-10x are common, sometimes more and sometimes less. You can check out our public models on Hugging Face, where we report benchmark results for the compressed models.
I compiled my model but my inference is slower than before.
What you are experiencing can happen! With some compilation algorithms, the first inference call is slower than usual because the model is compiled during that first run. Just call it again after the initial inference and it will be lightning fast.
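A minimal timing sketch, assuming your smashed model is callable with an example_input (adjust the call to your pipeline's actual interface):

```python
import time

# The first call triggers compilation and is expected to be slow.
smashed_model(example_input)

# Time only the subsequent, already-compiled calls.
start = time.perf_counter()
smashed_model(example_input)
print(f"Warm inference took {time.perf_counter() - start:.3f}s")
```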
I am comparing the smashed model to the original model but there is no speed improvement.
Some algorithms, in particular compilation algorithms, modify the model in-place. Make sure to take a deepcopy of your original model before smashing if you want to compare the performance.
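A minimal sketch, assuming the model fits in memory twice:

```python
import copy

from pruna import smash

# Keep an untouched copy, because some algorithms modify the model in-place.
original_model = copy.deepcopy(model)

smashed_model = smash(model=model, smash_config=smash_config)

# ... now benchmark original_model against smashed_model
```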
Does the model quality change?
The quality of the smashed model depends on your own pipeline as well as the configuration you choose. Some configs do not change quality, while others can slightly vary the output (usually to make the model even faster and smaller). We put a lot of work into having the package adapt efficiency algorithms in a way that minimizes their combined impact on model output.
For the public Hugging Face models, how is the model efficiency evaluated?
These results were obtained with the configuration described in model/smash_config.json and are measured after a hardware warmup. The smashed model is compared directly to the original base model. Efficiency results may vary in other settings (e.g., other hardware, image size, batch size, …). We recommend running the model directly under your use-case conditions to find out whether the smashed model benefits you.
What is the model format of the smashed model?
The format in which Pruna saves the model depends on the algorithm or combination of algorithms used. We try to save the model in non-pickled formats wherever possible.
For the public Hugging Face models, what is the “first” metric?
Results mentioning “first” are obtained after the first run of the model. The first run might take more memory or be slower than subsequent runs due to CUDA overheads.
How can I serve Pruna models in production?
Pruna models can be served in production using any inference framework that supports PyTorch. We have tutorials for Triton Inference Server, Replicate, and ComfyUI. We will be adding even more tutorials soon!