FAQ
How big are the improvements?
The size of the improvements depends on your own pipeline as well as the configuration you choose. Gains of 2-10x are common, sometimes more, sometimes less. You can check out our public models on Hugging Face, where we report benchmark results for the compressed models.
I compiled my model but my inference is slower than before.
This is expected with some compilation methods: the first inference call includes one-time compilation work, so it is slower than usual. Just call the model again after the initial inference and it will be lightning fast.
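As a minimal sketch of why you should warm up before timing, the snippet below simulates a compile-on-first-call model (the `lazily_compiled` decorator and `infer` function are hypothetical stand-ins, not Pruna APIs):

```python
import time
from functools import wraps

def lazily_compiled(fn):
    """Simulate a compile-on-first-call model: the first invocation
    pays a one-time 'compilation' cost, later calls are fast."""
    state = {"compiled": False}

    @wraps(fn)
    def wrapper(*args, **kwargs):
        if not state["compiled"]:
            time.sleep(0.2)  # stand-in for the real one-time compilation work
            state["compiled"] = True
        return fn(*args, **kwargs)

    return wrapper

@lazily_compiled
def infer(x):
    return x * 2  # stand-in for model inference

start = time.perf_counter()
infer(1)
first = time.perf_counter() - start

start = time.perf_counter()
infer(1)
second = time.perf_counter() - start

print(f"first call:  {first:.3f}s")   # includes the one-time cost
print(f"second call: {second:.3f}s")  # fast once compiled
```

The same pattern applies to the real model: discard the first call before drawing any speed conclusions.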
I am comparing the smashed model to the original model but there is no speed improvement.
Some methods, in particular compilation methods, modify the model in-place. Make a deep copy (`copy.deepcopy`) of your original model before smashing if you want to compare the performance, otherwise you end up benchmarking the smashed model against itself.
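A minimal sketch of the pitfall, using a hypothetical `DummyModel` and an `smash_in_place` stand-in for an in-place compilation method (not the real Pruna API):

```python
import copy

class DummyModel:
    def __init__(self):
        self.layers = ["linear", "relu"]

def smash_in_place(model):
    # Stand-in for a compilation method that mutates the model in-place.
    model.layers = ["fused_linear_relu"]
    return model

model = DummyModel()
original = copy.deepcopy(model)   # keep an untouched copy for comparison
smashed = smash_in_place(model)

print(original.layers)  # the deep copy is preserved
print(smashed.layers)   # the in-place method changed `model` itself
```

Without the `deepcopy`, `original` and `smashed` would reference the same mutated object, and any before/after comparison would show no difference.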
Does the model quality change?
The quality of the smashed model depends on your own pipeline as well as the configuration you choose. Some configs do not change quality, while others can slightly vary the output (usually to make the model even faster and smaller). We put a lot of work into having the package adapt efficiency methods in a way that minimizes their combined impact on model output.
For the public Hugging Face models, how is the model efficiency evaluated?
These results were obtained with the configuration described in model/smash_config.json and are measured after a hardware warmup. The smashed model is directly compared to the original base model. Efficiency results may vary in other settings (e.g. other hardware, image size, batch size, …). We recommend running the models directly in your use-case conditions to see whether the smashed model benefits you.
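A minimal benchmarking sketch for your own use-case conditions: discard a few warmup runs, then report the median of the timed runs (the `model_call` function is a hypothetical stand-in for your model's inference call):

```python
import statistics
import time

def benchmark(fn, *args, warmup=3, runs=10):
    """Time fn under your own conditions: discard warmup runs
    (hardware/compilation warmup), then return the median latency."""
    for _ in range(warmup):
        fn(*args)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

def model_call(x):
    # Stand-in for model inference; replace with your pipeline call.
    return sum(i * x for i in range(10_000))

median_latency = benchmark(model_call, 2)
print(f"median latency: {median_latency:.6f}s")
```

Run the same benchmark on the original and the smashed model (with identical inputs, batch size, and hardware) to get a comparison that reflects your deployment, not ours.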
What is the model format of the smashed model?
We use a custom Pruna model format based on pickle to make models compatible with the compression methods.
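Pruna's exact format is internal, but as a generic illustration of how a pickle-based format round-trips a Python object:

```python
import pickle

# Any picklable Python object round-trips like this; a model checkpoint
# is conceptually the same, just larger. This dict is a toy stand-in.
model = {"weights": [0.1, 0.2], "config": {"quantized": True}}

with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

with open("model.pkl", "rb") as f:
    # Only unpickle files from sources you trust: pickle can execute
    # arbitrary code during loading.
    loaded = pickle.load(f)

print(loaded == model)  # the object survives the round-trip intact
```

The trust caveat is the standard one for any pickle-based format, including model files downloaded from third parties.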
For the public Hugging Face models, what is the “first” metric?
Results mentioning “first” are obtained after the first run of the model. The first run might take more memory or be slower than the subsequent runs due to CUDA overheads.
What happens if I exceed my Pruna token limit?
Contact us! Reach out to us either on Discord or via email at support@pruna.ai to increase your limit.
I lost my Pruna token… What now?
It’s okay, we got you! 💜 Just reach out to us on Discord or via email at support@pruna.ai and we will help you out.