Compression Methods

At its core, the pruna package is a toolbox of compression methods. In this section, we will introduce you to all the methods we currently support.

Pruna wouldn’t be possible without the amazing work of the authors behind these methods. 💜 We’re really grateful for their contributions and encourage you to check out their repositories!

There are two types of compression methods we currently support: Compilation and Quantization.
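Each method is selected and tuned through a SmashConfig passed to smash(). As a minimal, hedged sketch (the exact configuration keys and function signatures depend on your pruna version, so treat the key names here as assumptions and check the quickstart docs), choosing a method and setting one of the hyperparameters listed below looks roughly like:

    from pruna import SmashConfig, smash  # assuming the top-level pruna API
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("gpt2")  # illustrative model

    smash_config = SmashConfig()
    smash_config["compilers"] = ["torch_compile"]             # method selection key: an assumption
    smash_config["comp_torch_compile_mode"] = "max-autotune"  # hyperparameter from this page
    smashed_model = smash(model=model, smash_config=smash_config)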

Compilation

Compilation optimizes the model for specific hardware. Supported methods include:

diffusers2: Contributors

Time: A few minutes.
Compilation on CPU: No.
Quality: Same as the original model.
Required: None.
Hyperparameters: None.

ctranslate: Contributors

Time: A few minutes.
Compilation on CPU: Yes.
Quality: Same as the original model.
Required: tokenizer.
Hyperparameters:
  comp_ctranslate_weight_bits: Set the weight quantization bits to 8 or 16 (default 16).
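As the names suggest, ctranslate and its siblings cgenerate and cwhisper appear to build on the CTranslate2 engine, whose compute types correspond to the weight-bits options here. As a standalone sketch of an 8-bit conversion in CTranslate2 itself (not pruna's wrapper; the model choice is illustrative):

    # Standalone CTranslate2 sketch: weight_bits=8 maps to an int8 compute
    # type, weight_bits=16 to float16.
    import ctranslate2
    from ctranslate2.converters import TransformersConverter

    # Convert a Hugging Face translation model to CTranslate2 format with
    # 8-bit weights (illustrative model choice).
    converter = TransformersConverter("facebook/m2m100_418M")
    converter.convert("m2m100_ct2", quantization="int8")

    translator = ctranslate2.Translator("m2m100_ct2")  # runs on CPU by default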

cgenerate: Contributors

Time: A few minutes.
Compilation on CPU: Yes.
Quality: Equivalent to the original model.
Required: tokenizer.
Hyperparameters:
  comp_cgenerate_weight_bits: Set the weight quantization bits to 8 or 16 (default 16).

cwhisper: Contributors

Time: A few minutes.
Compilation on CPU: Yes.
Quality: Equivalent to the original model.
Required: processor.
Hyperparameters:
  comp_cwhisper_weight_bits: Set the weight quantization bits to 8 or 16 (default 16).
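Outside pruna, the faster-whisper package exposes the same CTranslate2 Whisper engine this method appears to target; its compute_type mirrors the weight-bits choice here (sketch with an illustrative audio file, not pruna's own API):

    # Standalone faster-whisper sketch: compute_type="int8" ~ weight_bits=8,
    # "float16" ~ weight_bits=16.
    from faster_whisper import WhisperModel

    model = WhisperModel("small", device="cpu", compute_type="int8")
    segments, info = model.transcribe("audio.wav")  # hypothetical audio file
    for segment in segments:
        print(segment.start, segment.end, segment.text)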

ifw: Contributors

Time:

A few minutes.

Compilation on CPU:

Yes.

Quality:

Comparable to the original model.

Required:

processor.

Hyperparameters:

comp_ifw_weight_bits: Set the weight quantization bits to 16 or 32 (default 16).
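The 16/32-bit choice corresponds to running the model in float16 vs. float32. As a hedged sketch of the idea with a plain transformers ASR pipeline (the model name and the exact mapping are assumptions for illustration):

    # Half-precision ASR pipeline sketch: torch_dtype=torch.float16 plays the
    # role of weight_bits=16; use torch.float32 for weight_bits=32.
    import torch
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-small",  # illustrative model choice
        torch_dtype=torch.float16,     # on GPU, additionally pass device=0
    )
    print(asr("audio.wav")["text"])    # hypothetical audio file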

ws2t: Contributors

Time: A few minutes.
Compilation on CPU: No.
Quality: Comparable to the original model.
Required: processor.
Hyperparameters:
  comp_ws2t_int8: Whether to use int8 weights or not (default False).

step_caching: Contributors, Paper, Citation

Time: A few minutes.
Compilation on CPU: No.
Quality: Very close to the original model.
Required: None.
Hyperparameters:
  comp_step_caching_interval: Set the cache interval to 2, 3, 4, or 5 (default 3).
  comp_step_caching_agressive: Whether to use aggressive caching or not (default False).
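Step caching reuses cached intermediate U-Net features on some denoising steps instead of recomputing them; a larger interval (and aggressive caching) trades more quality for more speed. For a rough illustration of the technique, the standalone DeepCache project implements the same idea for diffusers pipelines (this is DeepCache's API, not pruna's internals, and the pipeline choice is illustrative):

    # DeepCache sketch: cache UNet features and refresh them every 3 steps,
    # analogous to comp_step_caching_interval=3.
    from diffusers import StableDiffusionPipeline
    from DeepCache import DeepCacheSDHelper

    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
    helper = DeepCacheSDHelper(pipe=pipe)
    helper.set_params(cache_interval=3, cache_branch_id=0)
    helper.enable()
    image = pipe("a photo of an astronaut").images[0]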

x-fast: proprietary, based on diffusers2

Time: A few minutes.
Compilation on CPU: No.
Quality: Not specified.
Required: None.
Hyperparameters:
  comp_x-fast_xformers: Whether to activate xformers or not (default True).
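The xformers flag presumably toggles the memory-efficient attention that diffusers exposes; for reference, the equivalent direct call on a diffusers pipeline is:

    # Enabling xFormers attention on a diffusers pipeline (illustrative model).
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
    pipe.enable_xformers_memory_efficient_attention()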

torch_compile: Contributors

Time: A few minutes.
Compilation on CPU: Yes.
Quality: Not specified.
Required: None.
Hyperparameters:
  comp_torch_compile_mode: Set the mode to "default", "reduce-overhead", "max-autotune", or "max-autotune-no-cudagraphs" (default "max-autotune").
  comp_torch_compile_backend: Set the backend to "inductor", "cudagraphs", "ipex", "onnxrt", "tensorrt", "tvm", or "openvino" (default "inductor").
  comp_torch_compile_fullgraph: Whether to compile the full graph or not (default False).
  comp_torch_compile_dynamic: Set dynamic compilation to None (determined automatically), True, or False (default None).
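These hyperparameters map one-to-one onto the arguments of torch.compile; for reference, the equivalent direct call in plain PyTorch (stand-in module for illustration):

    import torch

    model = torch.nn.Linear(8, 8)   # stand-in module for illustration
    compiled = torch.compile(
        model,
        mode="max-autotune",        # comp_torch_compile_mode
        backend="inductor",         # comp_torch_compile_backend
        fullgraph=False,            # comp_torch_compile_fullgraph
        dynamic=None,               # comp_torch_compile_dynamic
    )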

onediff: Contributors, Citation

Time: A few minutes.
Compilation on CPU: No.
Quality: Same as the original model.
Required: None.
Hyperparameters: None.
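onediff accelerates diffusion pipelines by compiling their compute graphs with a OneFlow backend. A hedged sketch of the standalone onediff API (the import path and pipeline choice are assumptions based on the onediff README, not pruna's internals; requires a CUDA GPU):

    # Standalone onediff sketch: compile the UNet of a diffusers pipeline.
    import torch
    from diffusers import StableDiffusionPipeline
    from onediff.infer_compiler import oneflow_compile

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    pipe.unet = oneflow_compile(pipe.unet)  # subsequent calls run the compiled UNet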

Quantization

Quantization methods reduce the precision of the model's weights and activations, substantially shrinking its memory footprint at the cost of some quality loss. Supported methods include:

torch_dynamic: Contributors

Time: A few minutes.
Quantization on CPU: Yes.
Quality: Not specified.
Required: None.
Hyperparameters:
  quant_torch_dynamic_weight_bits: Set the weight quantization bits to 2, 4, or 8 (default 8).
  quant_torch_dynamic_weight_type: Set the weight type to qint or qfloat (default qint).
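This corresponds to PyTorch's dynamic quantization, which quantizes weights ahead of time and activations on the fly at inference. In plain PyTorch (stand-in model for illustration):

    # Plain-PyTorch dynamic quantization, the technique this method exposes.
    import torch

    model = torch.nn.Sequential(torch.nn.Linear(16, 16))  # stand-in model
    quantized = torch.quantization.quantize_dynamic(
        model,
        {torch.nn.Linear},      # module types to quantize
        dtype=torch.qint8,      # weight_bits=8, weight_type=qint
    )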

torch_static: Contributors

Time: A few minutes.
Quantization on CPU: Yes.
Quality: Not specified.
Required: dataset.
Hyperparameters:
  quant_torch_static_weight_bits: Set the weight quantization bits to 2, 4, or 8 (default 8).
  quant_torch_static_weight_type: Set the weight type to qint or qfloat (default qint).
  quant_torch_static_act_bits: Set the activation quantization bits to 2, 4, or 8 (default 8).
  quant_torch_static_act_type: Set the activation type to qint or qfloat (default qint).
  quant_torch_static_qscheme: Set the quantization scheme to per_tensor_symmetric, per_channel_symmetric, per_tensor_affine, or per_channel_affine (default per_tensor_affine).
  quant_torch_static_qobserver: Set the quantization observer to MinMaxObserver, MovingAverageMinMaxObserver, PerChannelMinMaxObserver, MovingAveragePerChannelMinMaxObserver, or HistogramObserver (default MinMaxObserver).
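These knobs come straight from PyTorch's eager-mode static quantization workflow: observers collect activation statistics on a calibration dataset, and the qscheme decides symmetric vs. affine and per-tensor vs. per-channel mapping. A self-contained plain-PyTorch sketch of that workflow (not pruna's wrapper; the tiny model and dummy calibration data are illustrative):

    import torch
    import torch.nn as nn
    from torch.ao.quantization import (
        QConfig, MinMaxObserver, QuantStub, DeQuantStub, prepare, convert,
    )

    class TinyNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.quant, self.dequant = QuantStub(), DeQuantStub()
            self.fc = nn.Linear(16, 4)

        def forward(self, x):
            return self.dequant(self.fc(self.quant(x)))

    model = TinyNet().eval()
    # MinMaxObserver and per_tensor_affine match the defaults listed above.
    model.qconfig = QConfig(
        activation=MinMaxObserver.with_args(dtype=torch.quint8,
                                            qscheme=torch.per_tensor_affine),
        weight=MinMaxObserver.with_args(dtype=torch.qint8,
                                        qscheme=torch.per_tensor_symmetric),
    )
    prepared = prepare(model)
    prepared(torch.randn(8, 16))   # calibration pass over a (dummy) dataset
    quantized = convert(prepared)  # int8 weights and activations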

llm-int8: Contributors

Time: A few minutes.
Quantization on CPU: Yes.
Quality: Lower than the original model, with 4-bit quantization degrading quality more than 8-bit.
Required: None.
Hyperparameters:
  quant_llm-int8_weight_bits: Set the weight quantization bits to 4 or 8 (default 8).
  quant_llm-int8_double_quant: Whether to use double quantization or not (default False).
  quant_llm-int8_enable_fp32_cpu_offload: Whether to enable fp32 CPU offload or not (default False).
  quant_llm-int8_has_fp16_weight: Whether the model has fp16 weights or not (default False).
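These hyperparameters mirror the fields of the bitsandbytes integration in transformers; a sketch of the equivalent direct usage (the model choice is illustrative):

    # bitsandbytes LLM.int8() via transformers; field names line up with the
    # pruna hyperparameters above.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_8bit=True,                     # quant_llm-int8_weight_bits=8
        llm_int8_enable_fp32_cpu_offload=False,
        llm_int8_has_fp16_weight=False,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "facebook/opt-350m", quantization_config=bnb_config
    )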

gptq: Contributors

Time: 30 minutes to a day, depending on the size of the model.
Quantization on CPU: Yes.
Quality: Lower than the original model; quality degrades as the bit width shrinks (2-bit worse than 3-bit, which is worse than 4-bit, which is worse than 8-bit).
Required: tokenizer, dataset.
Hyperparameters:
  quant_gptq_weight_bits: Set the weight quantization bits to 2, 4, or 8 (default 8).
  quant_gptq_use_exllama: Whether to use exllama or not (default True).
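GPTQ needs the tokenizer and a calibration dataset because it re-fits each layer's quantized weights against real activations. For reference, the transformers GPTQ integration exposes the same options (illustrative model and dataset choice):

    # GPTQ via transformers; bits and the exllama kernels correspond to the
    # two pruna hyperparameters above.
    from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

    model_id = "facebook/opt-350m"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, device_map="auto", quantization_config=gptq_config
    )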

awq: Contributors

Time: 30 minutes to a day, depending on the size of the model.
Quantization on CPU: Yes.
Quality: Not specified.
Required: tokenizer, dataset.
Hyperparameters:
  quant_awq_group_size: Set the group size to 8, 16, 32, 64, or 128 (default 128).
  quant_awq_zero_point: Whether to use zero point or not (default True).
  quant_awq_version: Set the version to gemm or gemv (default gemm).
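As a hedged sketch, the standalone AutoAWQ package exposes matching knobs (group size, zero point, GEMM/GEMV kernel); this is AutoAWQ's documented API, not pruna's, and the model choice is illustrative:

    # AutoAWQ sketch: quant_config keys line up with the pruna hyperparameters.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model_path = "facebook/opt-350m"
    quant_config = {"zero_point": True, "q_group_size": 128,
                    "w_bit": 4, "version": "GEMM"}

    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model.quantize(tokenizer, quant_config=quant_config)  # runs AWQ calibration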

hqq: Contributors, Article, Citation

Time: A few minutes.
Quantization on CPU: Yes.
Quality: Not specified.
Required: None.
Hyperparameters:
  quant_hqq_weight_bits: Set the weight quantization bits to 2, 4, 8, or 16 (default 8).
  quant_hqq_group_size: Set the group size to 8, 16, 32, 64, or 128 (default 64).
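HQQ needs no calibration data, which is why nothing is required here. One way to try the same technique directly is the HQQ integration in recent transformers versions (a hedged sketch; the model choice is illustrative):

    # HQQ via transformers: nbits and group_size correspond to the two pruna
    # hyperparameters above (values shown are this page's defaults).
    from transformers import AutoModelForCausalLM, HqqConfig

    quant_config = HqqConfig(nbits=8, group_size=64)
    model = AutoModelForCausalLM.from_pretrained(
        "facebook/opt-350m", quantization_config=quant_config
    )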

half: Contributors

Time: A few minutes.
Quantization on CPU: Yes.
Quality: Not specified.
Required: None.
Hyperparameters: None.
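half simply casts the model's weights from 32-bit to 16-bit floats; in plain PyTorch this is a one-liner (the model choice below is illustrative):

    # Casting a model to fp16 halves its memory footprint.
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model = model.half()  # fp32 -> fp16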

quanto: Contributors

Time: A few minutes.
Quantization on CPU: Yes.
Quality: Not specified.
Required: dataset (only when activating calibration).
Hyperparameters:
  quant_quanto_weight_bits: Set the weight quantization bits to 2, 4, or 8 (default 8).
  quant_quanto_weight_type: Set the weight type to qint or qfloat (default qint).
  quant_quanto_act_bits: Set the activation quantization bits to None (no quantization of activations), 2, 4, or 8 (default None).
  quant_quanto_act_type: Set the activation type to qint or qfloat (default qint).
  quant_quanto_calibrate: Whether to activate calibration or not (default True).
  quant_quanto_calibration_samples: Set the number of calibration samples to 1, 64, 128, or 256 (default 64).
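For illustration, a minimal sketch with the quanto library directly (presumably what this method wraps; shown here via its optimum-quanto packaging, with an illustrative model choice):

    # quanto sketch: qint8 weights, activations left unquantized, matching
    # quant_quanto_weight_bits=8, weight_type=qint, act_bits=None.
    from optimum.quanto import quantize, freeze, qint8
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("gpt2")
    quantize(model, weights=qint8)  # activations=None by default
    freeze(model)                   # materialize the quantized weights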

Pruning

Coming Soon!

Factorization

Coming Soon!