Compression Methods
At its core, the pruna package is a toolbox of compression methods. This section introduces all the methods you can currently apply with the package.
Pruna wouldn’t be possible without the amazing work of the authors behind these methods. 💜 We’re really grateful for their contributions and encourage you to check out their repositories!
We support several types of compression methods; two of them are currently available in the public version: Compilation and Quantization.
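To make the method descriptions below concrete, here is a minimal sketch of how a method is typically selected and applied. It assumes pruna's `SmashConfig`/`smash` interface; the exact key for selecting a method may differ between versions, while the per-method hyperparameter names are the ones documented in this section. Methods that list a tokenizer, processor, or dataset as required additionally need those attached to the config.

```python
from transformers import AutoModelForCausalLM

from pruna import SmashConfig, smash  # assumed import path; may vary by version

# Any supported model works here; the model id is purely illustrative.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

smash_config = SmashConfig()
# The key used to select a method is an assumption; the comp_*/quant_*
# hyperparameter names are the ones documented below.
smash_config["compiler"] = "torch_compile"
smash_config["comp_torch_compile_mode"] = "max-autotune"

# Compress the model; the returned object is used like the original.
smashed_model = smash(model=model, smash_config=smash_config)
```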
Compilation
Compilation optimizes the model for specific hardware. Supported methods include:
diffusers2: Contributors
- Time:
A few minutes.
- Compilation on CPU:
No.
- Quality:
Same as the original model.
- Required:
None.
- Hyperparameters:
None.
ctranslate: Contributors
- Time:
A few minutes.
- Compilation on CPU:
Yes.
- Quality:
Same as the original model.
- Required:
tokenizer.
- Hyperparameters:
comp_ctranslate_weight_bits: Set the weight quantization bits to 8 or 16 (default 16).
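The weight-bits option corresponds to CTranslate2's `compute_type`, which this method is presumably backed by. For reference, a sketch using the `ctranslate2` library directly (not pruna's wrapper), assuming the model has already been converted with `ct2-transformers-converter`:

```python
import ctranslate2
import transformers

# compute_type="int8" corresponds to 8-bit weights, "float16" to 16-bit.
translator = ctranslate2.Translator("ct2_model_dir", compute_type="int8")
# Illustrative model id; use the tokenizer of the converted model.
tokenizer = transformers.AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")

tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello world"))
result = translator.translate_batch([tokens])[0]
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(result.hypotheses[0])))
```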
cgenerate: Contributors
- Time:
A few minutes.
- Compilation on CPU:
Yes.
- Quality:
Equivalent to the original model.
- Required:
tokenizer.
- Hyperparameters:
comp_cgenerate_weight_bits: Set the weight quantization bits to 8 or 16 (default 16).
cwhisper: Contributors
- Time:
A few minutes.
- Compilation on CPU:
Yes.
- Quality:
Equivalent to the original model.
- Required:
processor.
- Hyperparameters:
comp_cwhisper_weight_bits: Set the weight quantization bits to 8 or 16 (default 16).
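The naming suggests a CTranslate2 backend for Whisper; the standalone faster-whisper library exposes the same engine. A sketch with faster-whisper (independent of pruna):

```python
from faster_whisper import WhisperModel

# compute_type="float16" corresponds to 16-bit weights, "int8" to 8-bit.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe("audio.mp3")  # path is illustrative
for segment in segments:
    print(segment.text)
```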
ifw: Contributors
- Time:
A few minutes.
- Compilation on CPU:
Yes.
- Quality:
Comparable to the original model.
- Required:
processor.
- Hyperparameters:
comp_ifw_weight_bits: Set the weight quantization bits to 16 or 32 (default 16).
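The 16- vs. 32-bit setting corresponds to loading the underlying Whisper model in fp16 vs. fp32. A comparable setup with the plain transformers ASR pipeline (not pruna's wrapper; model id and audio path illustrative):

```python
import torch
from transformers import pipeline

# fp16 weights (weight_bits=16); use torch.float32 for the 32-bit variant.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device=0,
)
print(asr("audio.mp3", chunk_length_s=30, batch_size=8)["text"])
```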
ws2t: Contributors
- Time:
A few minutes.
- Compilation on CPU:
No.
- Quality:
Comparable to the original model.
- Required:
processor.
- Hyperparameters:
comp_ws2t_int8: Whether to use int8 or not (default False).
step_caching: Contributors, Paper, Citation
- Time:
A few minutes.
- Compilation on CPU:
No.
- Quality:
Very close to the original model.
- Required:
None.
- Hyperparameters:
comp_step_caching_interval: Set the cache interval to 2, 3, 4, or 5 (default 3).
comp_step_caching_agressive: Whether to use aggressive caching or not (default False).
x-fast: proprietary, based on diffusers2
- Time:
A few minutes.
- Compilation on CPU:
No.
- Quality:
Not specified.
- Required:
None.
- Hyperparameters:
comp_x-fast_xformers: Whether to activate xformers or not (default True).
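For the xformers flag specifically, the corresponding switch on a plain diffusers pipeline is the memory-efficient attention toggle, which this hyperparameter presumably controls:

```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
# Replaces the default attention with xformers' memory-efficient kernels.
pipe.enable_xformers_memory_efficient_attention()
```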
torch_compile: Contributors
- Time:
A few minutes.
- Compilation on CPU:
Yes.
- Quality:
Not specified.
- Required:
None.
- Hyperparameters:
comp_torch_compile_mode: Set the mode to “default”, “reduce-overhead”, “max-autotune”, or “max-autotune-no-cudagraphs” (default “max-autotune”).
comp_torch_compile_backend: Set the backend to “inductor”, “cudagraphs”, “ipex”, “onnxrt”, “tensorrt”, “tvm”, or “openvino” (default “inductor”).
comp_torch_compile_fullgraph: Whether to compile the full graph or not (default False).
comp_torch_compile_dynamic: Set dynamic compilation to None (determined automatically), True, or False (default None).
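These hyperparameters mirror the arguments of PyTorch's own `torch.compile`, so the equivalent direct call (PyTorch 2.x) is:

```python
import torch

model = torch.nn.Linear(128, 128)  # stands in for any torch.nn.Module

compiled = torch.compile(
    model,
    mode="max-autotune",  # or "default", "reduce-overhead", ...
    backend="inductor",   # or "cudagraphs", "onnxrt", ...
    fullgraph=False,
    dynamic=None,         # None lets PyTorch decide from the input shapes
)
out = compiled(torch.randn(4, 128))  # the first call triggers compilation
```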
onediff: Contributors, Citation
- Time:
A few minutes.
- Compilation on CPU:
No.
- Quality:
Same as the original model.
- Required:
None.
- Hyperparameters:
None.
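For reference, onediff's own entry point for diffusers pipelines is `compile_pipe` (import path as in the onediff repository; this is a standalone sketch, not pruna's wrapper):

```python
from diffusers import StableDiffusionPipeline

from onediffx import compile_pipe  # from the onediff repository

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"
).to("cuda")
pipe = compile_pipe(pipe)  # compiles the pipeline's compute graph
image = pipe("a photo of an astronaut").images[0]
```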
Quantization
Quantization methods reduce the precision of a model’s weights and activations, substantially lowering its memory footprint at the cost of some loss in quality. Supported methods include:
torch_dynamic: Contributors
- Time:
A few minutes.
- Quantization on CPU:
Yes.
- Quality:
Not specified.
- Required:
None.
- Hyperparameters:
quant_torch_dynamic_weight_bits: Set the weight quantization bits to 2, 4, or 8 (default 8).
quant_torch_dynamic_weight_type: Set the weight type to qint or qfloat (default qint).
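This is PyTorch's dynamic quantization; a minimal standalone example (plain PyTorch exposes `torch.qint8` rather than a free choice of bit width):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10)
)

# Weights of the listed layer types are stored as int8; activations are
# quantized on the fly at inference time, hence "dynamic".
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
out = quantized(torch.randn(1, 128))
```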
torch_static: Contributors
- Time:
A few minutes.
- Quantization on CPU:
Yes.
- Quality:
Not specified.
- Required:
dataset.
- Hyperparameters:
quant_torch_static_weight_bits: Set the weight quantization bits to 2, 4, or 8 (default 8).
quant_torch_static_weight_type: Set the weight type to qint or qfloat (default qint).
quant_torch_static_act_bits: Set the activation quantization bits to 2, 4, or 8 (default 8).
quant_torch_static_act_type: Set the activation type to qint or qfloat (default qint).
quant_torch_static_qscheme: Set the quantization scheme to per_tensor_symmetric, per_channel_symmetric, per_tensor_affine, or per_channel_affine (default per_tensor_affine).
quant_torch_static_qobserver: Set the quantization observer to MinMaxObserver, MovingAverageMinMaxObserver, PerChannelMinMaxObserver, MovingAveragePerChannelMinMaxObserver, or HistogramObserver (default MinMaxObserver).
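Static quantization needs the dataset because the observers (the quant_torch_static_qobserver options above) must see representative activations before conversion. The eager-mode PyTorch flow looks like this:

```python
import torch

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()      # float -> int8 entry
        self.fc = torch.nn.Linear(128, 10)
        self.dequant = torch.ao.quantization.DeQuantStub()  # int8 -> float exit

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = TinyModel().eval()
# The qconfig bundles weight/activation observers (MinMaxObserver by default).
model.qconfig = torch.ao.quantization.get_default_qconfig("fbgemm")
prepared = torch.ao.quantization.prepare(model)

# Calibration: run representative samples so the observers record ranges.
for _ in range(8):
    prepared(torch.randn(1, 128))

quantized = torch.ao.quantization.convert(prepared)
```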
llm-int8: Contributors
- Time:
A few minutes.
- Quantization on CPU:
Yes.
- Quality:
Lower than the original model; 4-bit quantization degrades quality more than 8-bit.
- Required:
None.
- Hyperparameters:
quant_llm-int8_weight_bits: Set the weight quantization bits to 4 or 8 (default 8).
quant_llm-int8_double_quant: Whether to use double quantization or not (default False).
quant_llm-int8_enable_fp32_cpu_offload: Whether to enable fp32 CPU offload or not (default False).
quant_llm-int8_has_fp16_weight: Whether the model has fp16 weights or not (default False).
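These options line up with the flags of transformers' bitsandbytes integration, which this method is apparently backed by. Loading a model with the corresponding settings directly looks like this (model id illustrative; `bnb_4bit_use_double_quant` is the 4-bit counterpart of the double-quant option):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,                      # load_in_4bit=True for the 4-bit path
    llm_int8_enable_fp32_cpu_offload=False,
    llm_int8_has_fp16_weight=False,
)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    quantization_config=bnb_config,
    device_map="auto",
)
```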
gptq: Contributors
- Time:
30 minutes to a day, depending on the size of the model.
- Quantization on CPU:
Yes.
- Quality:
Lower than the original model; quality degrades with bit width (2-bit worse than 3-bit, which is worse than 4-bit, which is worse than 8-bit).
- Required:
tokenizer, dataset.
- Hyperparameters:
quant_gptq_weight_bits: Set the weight quantization bits to 2, 4, or 8 (default 8).
quant_gptq_use_exllama: Whether to use exllama or not (default True).
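A comparable quantization run through transformers' GPTQ integration, which likewise needs a tokenizer and a calibration dataset (exllama kernels, toggled here by `quant_gptq_use_exllama`, are enabled there by default):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-350m"  # illustrative id
tokenizer = AutoTokenizer.from_pretrained(model_id)

# "c4" is used as the calibration dataset; quantization runs on load.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=gptq_config, device_map="auto"
)
```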
awq: Contributors
- Time:
30 minutes to a day, depending on the size of the model.
- Quantization on CPU:
Yes.
- Quality:
Not specified.
- Required:
tokenizer, dataset.
- Hyperparameters:
quant_awq_group_size: Set the group size to 8, 16, 32, 64, or 128 (default 128).
quant_awq_zero_point: Whether to use zero point or not (default True).
quant_awq_version: Set the version to gemm or gemv (default gemm).
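These hyperparameters map onto the `quant_config` keys of the AutoAWQ library (`w_bit`, `q_group_size`, `zero_point`, `version`); a standalone sketch with autoawq (model id illustrative):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "facebook/opt-350m"
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)  # runs AWQ calibration
model.save_quantized("opt-350m-awq")
```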
hqq: Contributors, Article, Citation
- Time:
A few minutes.
- Quantization on CPU:
Yes.
- Quality:
Not specified.
- Required:
None.
- Hyperparameters:
quant_hqq_weight_bits: Set the weight quantization bits to 2, 4, 8, or 16 (default 8).
quant_hqq_group_size: Set the group size to 8, 16, 32, 64, or 128 (default 64).
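A sketch with the HQQ authors' own library, assuming its `HQQModelForCausalLM`/`BaseQuantizeConfig` interface (names per the hqq repository and subject to change between releases):

```python
from hqq.core.quantize import BaseQuantizeConfig
from hqq.engine.hf import HQQModelForCausalLM

model = HQQModelForCausalLM.from_pretrained("facebook/opt-350m")  # illustrative id

# nbits and group_size correspond to quant_hqq_weight_bits / quant_hqq_group_size.
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
model.quantize_model(quant_config=quant_config)
```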
half: Contributors
- Time:
A few minutes.
- Quantization on CPU:
Yes.
- Quality:
Not specified.
- Required:
None.
- Hyperparameters:
None.
quanto: Contributors
- Time:
A few minutes.
- Quantization on CPU:
Yes.
- Quality:
Not specified.
- Required:
dataset (only when activating calibration).
- Hyperparameters:
quant_quanto_weight_bits: Set the weight quantization bits to 2, 4, or 8 (default 8).
quant_quanto_weight_type: Set the weight type to qint or qfloat (default qint).
quant_quanto_act_bits: Set the activation quantization bits to None (no quantization of activations), 2, 4, or 8 (default None).
quant_quanto_act_type: Set the activation type to qint or qfloat (default qint).
quant_quanto_calibrate: Whether to activate calibration or not (default True).
quant_quanto_calibration_samples: Set the number of calibration samples to 1, 64, 128, or 256 (default 64).
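With the quanto library itself (distributed as optimum-quanto), the weight/activation types and the calibration pass look like this:

```python
import torch
from optimum.quanto import Calibration, freeze, qint8, quantize

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10)
)

# Quantize weights and (optionally) activations to the chosen qint/qfloat types.
quantize(model, weights=qint8, activations=qint8)

# When activations are quantized, representative data must pass through the
# model, which is why a dataset is required when calibration is activated.
with Calibration():
    model(torch.randn(8, 128))

freeze(model)  # materialize the quantized weights
```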
Pruning
Coming to the public version soon!
Factorization
Coming to the public version soon!