Compression Methods
At its core, the pruna package is a toolbox of compression methods. In this section, we will introduce you to all the methods you can currently apply with the package.
Pruna wouldn’t be possible without the amazing work of the authors behind these methods. 💜 We’re really grateful for their contributions and encourage you to check out their repositories!
There are various types of compression methods we support, three of which are currently available in the public version: Compilation, Quantization, and Pruning.
Compilation
Compilation methods optimize the model for specific hardware. Supported methods include:
diffusers2: Contributors
- Time:
A few minutes.
- Compilation on CPU:
No.
- Quality:
Same as the original model.
- Required:
None.
- Hyperparameters:
None.
ctranslate: Contributors
- Time:
A few minutes.
- Compilation on CPU:
Yes.
- Quality:
Same as the original model.
- Required:
tokenizer.
- Hyperparameters:
comp_ctranslate_weight_bits: Set the weight quantization bits to 8 or 16 (default 16).
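The hyperparameter names above double as configuration keys. As a minimal sketch of how this might be wired up (assuming a SmashConfig/smash-style interface; the "compilers" selector key and the add_tokenizer helper are illustrative assumptions, while comp_ctranslate_weight_bits comes from the table above):

```python
# Minimal sketch, assuming pruna exposes SmashConfig and smash at the top level.
from pruna import SmashConfig, smash
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
tokenizer = AutoTokenizer.from_pretrained("t5-small")

smash_config = SmashConfig()
smash_config["compilers"] = ["ctranslate"]        # hypothetical selector key
smash_config["comp_ctranslate_weight_bits"] = 8   # hyperparameter documented above
smash_config.add_tokenizer(tokenizer)             # ctranslate requires a tokenizer

smashed_model = smash(model=model, smash_config=smash_config)
```

The cgenerate, cwhisper, and ifw entries below follow the same pattern, with a processor in place of the tokenizer where noted.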
cgenerate: Contributors
- Time:
A few minutes.
- Compilation on CPU:
Yes.
- Quality:
Equivalent to the original model.
- Required:
tokenizer.
- Hyperparameters:
comp_cgenerate_weight_bits: Set the weight quantization bits to 8 or 16 (default 16).
cwhisper: Contributors
- Time:
A few minutes.
- Compilation on CPU:
Yes.
- Quality:
Equivalent to the original model.
- Required:
processor.
- Hyperparameters:
comp_cwhisper_weight_bits: Set the weight quantization bits to 8 or 16 (default 16).
ifw: Contributors
- Time:
A few minutes.
- Compilation on CPU:
Yes.
- Quality:
Comparable to the original model.
- Required:
processor.
- Hyperparameters:
comp_ifw_weight_bits: Set the weight quantization bits to 16 or 32 (default 16).
ws2t: Contributors
- Time:
A few minutes.
- Compilation on CPU:
No.
- Quality:
Comparable to the original model.
- Required:
processor.
- Hyperparameters:
comp_ws2t_int8: Whether to use int8 or not (default False).
step_caching: Contributors, Paper, Citation
- Time:
A few minutes.
- Compilation on CPU:
No.
- Quality:
Very close to the original model.
- Required:
None.
- Hyperparameters:
comp_step_caching_interval: Set the cache interval to None, 1, 2, 3, 4, or 5 (default 3). If None, the caching interval is determined automatically based on num_inference_steps. If 1, caching is disabled. If 2, 3, 4, or 5, caching uses the given fixed interval.
comp_step_caching_agressive: Whether to use aggressive caching or not (default False).
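To make the caching idea concrete, here is a toy sketch of a sampling loop with step caching (purely illustrative; denoiser_block and the update rule are stand-ins, not pruna's implementation):

```python
import torch

def denoiser_block(x: torch.Tensor, step: int) -> torch.Tensor:
    # Stand-in for the expensive part of a diffusion model's forward pass.
    return torch.tanh(x + 0.1 * step)

def sample(x: torch.Tensor, num_steps: int = 30, interval: int = 3) -> torch.Tensor:
    # `interval` plays the role of comp_step_caching_interval: the expensive
    # block is recomputed every `interval` steps and its output is reused in
    # between, trading a small amount of quality for speed.
    cache = None
    for step in range(num_steps):
        if cache is None or step % interval == 0:
            cache = denoiser_block(x, step)  # full computation
        x = 0.9 * x + 0.1 * cache            # cheap update reusing cached features
    return x

out = sample(torch.randn(4))
```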
flux-caching: proprietary
- Time:
A few minutes.
- Compilation on CPU:
Yes.
- Quality:
Very close to the original model.
- Required:
None.
- Hyperparameters:
comp_flux_caching_cache_interval: Set the cache interval to 1, 2, 3, 4, or 5 (default 2). How many model steps to skip in a row; higher is faster but reduces quality.
comp_flux_caching_start_step: Set the start step to 0, 1, 2, 3, 4, or 5 (default 2). How many steps to wait before starting to cache.
comp_flux_caching_compile: Whether to additionally compile the model for an extra speed-up or not (default True).
comp_flux_caching_save_model: Whether to save the model after compilation or not (default False). Set to False if you want to use the model for inference only.
x-fast: proprietary, based on diffusers2
- Time:
A few minutes.
- Compilation on CPU:
No.
- Quality:
Not specified.
- Required:
None.
- Hyperparameters:
comp_x-fast_xformers: Whether to activate xformers or not (default True).
torch_compile: Contributors
- Time:
A few minutes.
- Compilation on CPU:
Yes.
- Quality:
Not specified.
- Required:
None.
- Hyperparameters:
comp_torch_compile_mode: Set the mode to “default”, “reduce-overhead”, “max-autotune”, or “max-autotune-no-cudagraphs” (default “max-autotune”).
comp_torch_compile_backend: Set the backend to “inductor”, “cudagraphs”, “ipex”, “onnxrt”, “tensorrt”, “tvm”, or “openvino” (default “inductor”).
comp_torch_compile_fullgraph: Whether to compile the full graph or not (default False).
comp_torch_compile_dynamic: Set dynamic compilation to None (determined automatically), True, or False (default None).
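These hyperparameters map directly onto the arguments of PyTorch's torch.compile. For reference, this is roughly what the method applies under the hood (a sketch against the plain PyTorch API, not pruna's wrapper):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))

# mode, backend, fullgraph, and dynamic mirror the comp_torch_compile_*
# hyperparameters; dynamic=None lets PyTorch decide on dynamic shapes.
compiled_model = torch.compile(
    model,
    mode="max-autotune",
    backend="inductor",
    fullgraph=False,
    dynamic=None,
)

out = compiled_model(torch.randn(8, 64))  # first call triggers compilation
```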
onediff: Contributors, Citation
- Time:
A few minutes.
- Compilation on CPU:
No.
- Quality:
Same as the original model.
- Required:
None.
- Hyperparameters:
None.
Quantization
Quantization methods reduce the precision of the model’s weights and activations, substantially reducing the memory they require at the cost of some quality loss. Supported methods include:
torch_dynamic: Contributors
- Time:
A few minutes.
- Quantization on CPU:
Yes.
- Quality:
Not specified.
- Required:
None.
- Hyperparameters:
quant_torch_dynamic_weight_bits: Set the weight quantization bits to quint8 or qint8 (default qint8).
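This corresponds to PyTorch's dynamic quantization API, which quantizes weights ahead of time and activations on the fly. A minimal example against plain PyTorch:

```python
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# dtype mirrors quant_torch_dynamic_weight_bits (torch.qint8 or torch.quint8).
qmodel = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

out = qmodel(torch.randn(1, 128))  # runs on CPU
```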
torch_static: Contributors
- Time:
A few minutes.
- Quantization on CPU:
Yes.
- Quality:
Not specified.
- Required:
dataset.
- Hyperparameters:
quant_torch_static_weight_bits: Set the weight quantization bits to quint8 or qint8 (default qint8).
quant_torch_static_act_bits: Set the activation quantization bits to quint8 or qint8 (default qint8).
quant_torch_static_qscheme: Set the quantization scheme to per_tensor_symmetric or per_tensor_affine (default per_tensor_affine).
quant_torch_static_qobserver: Set the quantization observer to MinMaxObserver, MovingAverageMinMaxObserver, PerChannelMinMaxObserver, or HistogramObserver (default MinMaxObserver).
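Static quantization also quantizes activations, which is why a calibration dataset is required. Here is a sketch of the underlying PyTorch eager-mode workflow that the qscheme and observer options correspond to (plain PyTorch, not pruna's wrapper):

```python
import torch
import torch.nn as nn
from torch.ao.quantization import (
    DeQuantStub, MinMaxObserver, QConfig, QuantStub, convert, prepare,
)

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()      # quantizes the float input
        self.fc = nn.Linear(32, 10)
        self.dequant = DeQuantStub()  # returns a float output

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = TinyNet().eval()
# The activation/weight settings mirror quant_torch_static_act_bits,
# quant_torch_static_weight_bits, _qscheme, and _qobserver.
model.qconfig = QConfig(
    activation=MinMaxObserver.with_args(dtype=torch.quint8,
                                        qscheme=torch.per_tensor_affine),
    weight=MinMaxObserver.with_args(dtype=torch.qint8,
                                    qscheme=torch.per_tensor_symmetric),
)

prepared = prepare(model)
for _ in range(8):                    # calibration over representative data
    prepared(torch.randn(4, 32))

quantized = convert(prepared)
out = quantized(torch.randn(1, 32))
```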
llm-int8: Contributors
- Time:
A few minutes.
- Quantization on CPU:
Yes.
- Quality:
Lower than the original model; 4-bit quantization degrades quality more than 8-bit.
- Required:
None.
- Hyperparameters:
quant_llm-int8_weight_bits: Set the weight quantization bits to 4 or 8 (default 8).
quant_llm-int8_double_quant: Whether to use double quantization or not (default False).
quant_llm-int8_enable_fp32_cpu_offload: Whether to enable fp32 CPU offload or not (default False).
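llm-int8 is the bitsandbytes scheme; in the Hugging Face ecosystem the same options are exposed through BitsAndBytesConfig. A sketch of the equivalent direct usage (the model name is just an example):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit loading; llm_int8_enable_fp32_cpu_offload mirrors
# quant_llm-int8_enable_fp32_cpu_offload. For the 4-bit path,
# load_in_4bit=True and bnb_4bit_use_double_quant correspond to
# quant_llm-int8_weight_bits=4 and quant_llm-int8_double_quant.
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=False,
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",              # small example model
    quantization_config=bnb_config,
    device_map="auto",
)
```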
gptq: Contributors
- Time:
30 minutes to a day depending on the size of the model.
- Quantization on CPU:
Yes.
- Quality:
Lower than the original model; quality degrades as the bit width decreases (2 bits is worse than 3, 3 worse than 4, and 4 worse than 8).
- Required:
tokenizer, dataset.
- Hyperparameters:
quant_gptq_weight_bits: Set the weight quantization bits to 2, 4, or 8 (default 8).
quant_gptq_use_exllama: Whether to use exllama or not (default True).
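Because GPTQ calibrates on real text, both a tokenizer and a dataset are needed. The transformers integration exposes the same knobs directly; a sketch, using an example model and the built-in "c4" calibration set:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

# bits and use_exllama mirror quant_gptq_weight_bits and quant_gptq_use_exllama;
# the tokenizer and dataset are the required inputs listed above.
gptq_config = GPTQConfig(bits=8, dataset="c4", tokenizer=tokenizer, use_exllama=True)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",
    quantization_config=gptq_config,
    device_map="auto",
)
```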
awq: Contributors
- Time:
30 minutes to a day depending on the size of the model.
- Quantization on CPU:
Yes.
- Quality:
Not specified.
- Required:
tokenizer, dataset.
- Hyperparameters:
quant_awq_group_size: Set the group size to 8, 16, 32, 64, or 128 (default 128).
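For reference, the standalone AutoAWQ library exposes the same group-size knob. A sketch of its quantization call (the model name and the other quant_config values are examples):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "facebook/opt-125m"  # example model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# q_group_size mirrors quant_awq_group_size; calibration data is pulled in by
# the library, which is why a tokenizer and dataset are required above.
model.quantize(
    tokenizer,
    quant_config={"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"},
)
```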
hqq: Contributors, Article, Citation
- Time:
A few minutes.
- Quantization on CPU:
Yes.
- Quality:
Not specified.
- Required:
None.
- Hyperparameters:
quant_hqq_weight_bits: Set the weight quantization bits to 2, 4, or 8 (default 8).
quant_hqq_group_size: Set the group size to 8, 16, 32, 64, or 128 (default 64).
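The hqq library applies this per linear layer. A sketch of its documented interface (nbits and group_size mirror the hyperparameters above; exact signatures may vary between hqq versions):

```python
import torch
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

layer = nn.Linear(64, 64)

# nbits / group_size mirror quant_hqq_weight_bits and quant_hqq_group_size.
quant_config = BaseQuantizeConfig(nbits=8, group_size=64)
hqq_layer = HQQLinear(layer, quant_config=quant_config,
                      compute_dtype=torch.float16, device="cuda")

out = hqq_layer(torch.randn(1, 64, dtype=torch.float16, device="cuda"))
```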
half: Contributors
- Time:
A few minutes.
- Quantization on CPU:
Yes.
- Quality:
Not specified.
- Required:
None.
- Hyperparameters:
None.
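Since this method has no hyperparameters, it amounts to casting the model to float16; in plain PyTorch terms:

```python
import torch
import torch.nn as nn

model = nn.Linear(64, 10)
model = model.half()  # cast all parameters and buffers to float16

out = model(torch.randn(1, 64).half())
```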
quanto: Contributors
- Time:
A few minutes.
- Quantization on CPU:
Yes.
- Quality:
Not specified.
- Required:
dataset (only when activating calibration).
- Hyperparameters:
quant_quanto_weight_bits: Set the weight quantization bits to qint2, qint4, qint8, or qfloat8 (default qfloat8).
quant_quanto_calibrate: Whether to activate calibration or not (default True).
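This wraps Hugging Face's quanto library (now distributed as optimum-quanto). A sketch of the direct usage, including the calibration pass that quant_quanto_calibrate toggles (assuming the optimum-quanto API; with the older standalone package, import from quanto instead):

```python
import torch
import torch.nn as nn
from optimum.quanto import Calibration, freeze, qfloat8, quantize

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))

# weights=qfloat8 mirrors quant_quanto_weight_bits (qint2/qint4/qint8 also exist).
quantize(model, weights=qfloat8, activations=qfloat8)

# Calibration over representative data, as toggled by quant_quanto_calibrate;
# this records activation ranges, which is why a dataset is required then.
with Calibration():
    model(torch.randn(16, 64))

freeze(model)  # materialize the quantized weights
out = model(torch.randn(1, 64))
```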
Pruning
Pruning methods remove less important weights or connections from the model, shrinking it at the cost of some quality loss. Supported methods include:
torch-unstructured: Contributors
- Time:
A few minutes.
- Pruning on CPU:
Yes.
- Quality:
Not specified.
- Required:
None.
- Hyperparameters:
prune_torch-unstructured_pruning_method: Set the pruning method to random or l1 (default l1).
prune_torch-unstructured_amount: Set the pruning amount between 0.0 and 1.0 (default 0.5).
prune_torch-unstructured_dim: Dimension along which to prune (default None).
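These options map onto PyTorch's torch.nn.utils.prune utilities. For reference, the l1 method with amount=0.5 looks like this in plain PyTorch:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Linear(64, 10)

# l1_unstructured mirrors prune_torch-unstructured_pruning_method="l1" and
# amount mirrors prune_torch-unstructured_amount; use prune.random_unstructured
# for the "random" method.
prune.l1_unstructured(model, name="weight", amount=0.5)
prune.remove(model, "weight")  # make the pruning permanent

sparsity = (model.weight == 0).float().mean().item()
print(f"weight sparsity: {sparsity:.2f}")  # ~0.50
```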
Factorization
Coming to the public version soon!