Compression Methods

At its core, the pruna package is a toolbox of compression methods. In this section, we will introduce you to all the methods we currently support.

Pruna wouldn’t be possible without the amazing work of the authors behind these methods. 💜 We’re really grateful for their contributions and encourage you to check out their repositories!

We currently support two types of compression methods, Compilation and Quantization; Pruning and Factorization are coming soon.
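
Every method is tuned through the hyperparameters listed below, prefixed with comp_ for compilation and quant_ for quantization. As a loose, hypothetical sketch of how a configuration might come together (the SmashConfig/smash interface shown here is an assumption for illustration, not pruna's documented API; consult the user guide for the real interface):

```python
# Hypothetical sketch only: the config keys mirror the hyperparameters
# documented in this section, but the SmashConfig/smash interface is assumed.
from pruna import SmashConfig, smash  # assumed import path

smash_config = SmashConfig()
smash_config["compilers"] = ["torch_compile"]             # pick a compilation method
smash_config["comp_torch_compile_mode"] = "max-autotune"
smash_config["quantizers"] = ["hqq"]                      # optionally add a quantizer
smash_config["quant_hqq_weight_bits"] = 8

# `model` is any supported model loaded beforehand.
smashed_model = smash(model=model, smash_config=smash_config)
```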

Compilation

Compilation optimizes the model for specific hardware. Supported methods include:

diffusers2: Contributors

Time:

A few minutes.

Quality:

Same as the original model.

Required:

None.

Hyperparameters:

None.

ctranslate: Contributors

Time:

A few minutes.

Quality:

Same as the original model.

Required:

tokenizer.

Hyperparameters:

comp_ctranslate_weight_bits: Set the weight quantization bits to 8 or 16 (default 16).

cgenerate: Contributors

Time:

A few minutes.

Quality:

Equivalent to the original model.

Required:

tokenizer.

Hyperparameters:

comp_cgenerate_weight_bits: Set the weight quantization bits to 8 or 16 (default 16).

cwhisper: Contributors

Time:

A few minutes.

Quality:

Equivalent to the original model.

Required:

processor.

Hyperparameters:

comp_cwhisper_weight_bits: Set the weight quantization bits to 8 or 16 (default 16).
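
ctranslate, cgenerate, and cwhisper are all backed by CTranslate2, which converts a model to its own optimized format and can quantize the weights during conversion. A minimal sketch of the underlying CTranslate2 workflow for a generation model (model name illustrative; Translator and models.Whisper play the same role for the translation and Whisper variants):

```python
import ctranslate2
from ctranslate2.converters import TransformersConverter
from transformers import AutoTokenizer

# Convert a Hugging Face checkpoint to the CTranslate2 format;
# quantization="int8" roughly corresponds to weight_bits=8,
# while omitting it keeps the 16-bit default.
TransformersConverter("gpt2").convert("gpt2-ct2", quantization="int8")

tokenizer = AutoTokenizer.from_pretrained("gpt2")
generator = ctranslate2.Generator("gpt2-ct2")

# CTranslate2 works on token strings rather than ids.
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello"))
results = generator.generate_batch([tokens], max_length=32)
print(tokenizer.convert_tokens_to_string(results[0].sequences[0]))
```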

ifw: Contributors

Time:

A few minutes.

Quality:

Comparable to the original model.

Required:

processor.

Hyperparameters:

comp_ifw_weight_bits: Set the weight quantization bits to 16 or 32 (default 16).
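
ifw (Insanely Fast Whisper) builds on the transformers ASR pipeline, combining reduced precision with chunked, batched inference. A rough sketch of the pattern it is based on (model name and audio path illustrative; weight_bits=16 corresponds to torch.float16 here, 32 to torch.float32):

```python
import torch
from transformers import pipeline

# Half-precision Whisper with chunked, batched decoding.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,  # weight_bits=16
    device="cuda:0",
)
result = asr("audio.wav", chunk_length_s=30, batch_size=24, return_timestamps=True)
print(result["text"])
```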

ws2t: Contributors

Time:

A few minutes.

Quality:

Comparable to the original model.

Required:

processor.

Hyperparameters:

comp_ws2t_int8: Whether to use int8 or not (default False).

step_caching: Contributors, Paper, Citation

Time:

A few minutes.

Quality:

Very close to the original model.

Required:

None.

Hyperparameters:

comp_step_caching_interval: Set the cache interval to 2, 3, 4, or 5 (default 3).

comp_step_caching_agressive: Whether to use aggressive caching or not (default False).
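
Step caching exploits how little consecutive diffusion steps differ: the expensive intermediate features are recomputed only every interval-th step and reused in between. A toy illustration of the idea (stand-in functions, not pruna's implementation):

```python
def compute_features(x, step):
    # Stand-in for the expensive part of the network (e.g. deep U-Net blocks).
    return x * 0.9

def cheap_update(x, features, step):
    # Stand-in for the cheap layers that consume the cached features.
    return x + 0.1 * features

def denoise(x, steps, interval=3):
    cached = None
    for step in range(steps):
        # Recompute the expensive features only every `interval` steps;
        # aggressive caching would reuse them even more liberally.
        if step % interval == 0 or cached is None:
            cached = compute_features(x, step)
        x = cheap_update(x, cached, step)
    return x

print(denoise(1.0, steps=12, interval=3))
```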

x-fast: proprietary, based on diffusers2

Time:

A few minutes.

Quality:

Not specified.

Required:

None.

Hyperparameters:

comp_x-fast_xformers: Whether to activate xformers or not (default True).
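
x-fast itself is proprietary, but the xformers toggle presumably maps to the memory-efficient attention that diffusers exposes directly. For reference, enabling it by hand looks like this (requires the xformers package):

```python
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
# Swap the attention implementation for xformers' memory-efficient kernels.
pipe.enable_xformers_memory_efficient_attention()
```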

torch_compile: Contributors

Time:

A few minutes.

Quality:

Not specified.

Required:

None.

Hyperparameters:

comp_torch_compile_mode: Set the mode to “default”, “reduce-overhead”, “max-autotune”, or “max-autotune-no-cudagraphs” (default “max-autotune”).

comp_torch_compile_backend: Set the backend to “inductor”, “cudagraphs”, “ipex”, “onnxrt”, “tensorrt”, “tvm”, or “openvino” (default “inductor”).

comp_torch_compile_fullgraph: Whether to compile the full graph or not (default False).

comp_torch_compile_dynamic: Set the dynamic compilation to None (determined automatically), True, or False (default None).
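
These hyperparameters correspond one-to-one to the arguments of torch.compile; the defaults above are equivalent to:

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU())

compiled_model = torch.compile(
    model,
    mode="max-autotune",  # comp_torch_compile_mode
    backend="inductor",   # comp_torch_compile_backend
    fullgraph=False,      # comp_torch_compile_fullgraph
    dynamic=None,         # comp_torch_compile_dynamic (None = decide automatically)
)
out = compiled_model(torch.randn(2, 16))  # first call triggers compilation
```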

onediff: Contributors, Citation

Time:

A few minutes.

Quality:

Same as the original model.

Required:

None.

Hyperparameters:

None.
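
onediff speeds up diffusion pipelines by compiling their compute-heavy modules with OneFlow. Its standalone usage looks roughly like the following (the import path may vary across onediff versions):

```python
from diffusers import StableDiffusionPipeline
from onediff.infer_compiler import oneflow_compile  # version-dependent import

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")
# Compile the UNet, the main compute bottleneck of the pipeline.
pipe.unet = oneflow_compile(pipe.unet)
image = pipe("a photo of an astronaut").images[0]
```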

Quantization

Quantization methods reduce the precision of the model’s weights and activations, shrinking its memory footprint at the cost of some quality loss. Supported methods include:

torch_dynamic: Contributors

Time:

A few minutes.

Quality:

Not specified.

Required:

None.

Hyperparameters:

quant_torch_dynamic_weight_bits: Set the weight quantization bits to 2, 4, or 8 (default 8).

quant_torch_dynamic_weight_type: Set the weight type to qint or qfloat (default qint).
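
Dynamic quantization quantizes the weights ahead of time and the activations on the fly at inference, which is why no dataset is needed. The qint/8-bit default corresponds to this plain PyTorch call:

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU()).eval()

quantized_model = torch.ao.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},  # layer types to quantize
    dtype=torch.qint8,  # weight_type=qint with weight_bits=8
)
out = quantized_model(torch.randn(2, 16))
```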

torch_static: Contributors

Time:

A few minutes.

Quality:

Not specified.

Required:

dataset.

Hyperparameters:

quant_torch_static_weight_bits: Set the weight quantization bits to 2, 4, or 8 (default 8).

quant_torch_static_weight_type: Set the weight type to qint or qfloat (default qint).

quant_torch_static_act_bits: Set the activation quantization bits to 2, 4, or 8 (default 8).

quant_torch_static_act_type: Set the activation type to qint or qfloat (default qint).

quant_torch_static_qscheme: Set the quantization scheme to per_tensor_symmetric, per_channel_symmetric, per_tensor_affine, or per_channel_affine (default per_tensor_affine).

quant_torch_static_qobserver: Set the quantization observer to MinMaxObserver, MovingAverageMinMaxObserver, PerChannelMinMaxObserver, MovingAveragePerChannelMinMaxObserver, or HistogramObserver (default MinMaxObserver).
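
Static quantization also quantizes activations ahead of time, which requires a calibration pass so the observers can record activation ranges; that is why a dataset is required here. In plain PyTorch the recipe looks roughly like this (tiny toy model and random calibration data for illustration):

```python
import torch
from torch.ao.quantization import (
    DeQuantStub, MinMaxObserver, QConfig, QuantStub, convert, prepare,
)

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()      # quantizes incoming float activations
        self.fc = torch.nn.Linear(16, 16)
        self.dequant = DeQuantStub()  # returns float outputs

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = TinyModel().eval()
# Observer and scheme choices mirror the qobserver/qscheme hyperparameters.
model.qconfig = QConfig(
    activation=MinMaxObserver.with_args(dtype=torch.quint8, qscheme=torch.per_tensor_affine),
    weight=MinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_tensor_symmetric),
)
prepared = prepare(model)
for _ in range(8):                    # calibration pass to record activation ranges
    prepared(torch.randn(4, 16))
quantized_model = convert(prepared)
```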

llm-int8: Contributors

Time:

A few minutes.

Quality:

Lower than the original model; 4-bit quantization loses more quality than 8-bit.

Required:

None.

Hyperparameters:

quant_llm-int8_weight_bits: Set the weight quantization bits to 4 or 8 (default 8).

quant_llm-int8_double_quant: Whether to use double quantization or not (default False).

quant_llm-int8_enable_fp32_cpu_offload: Whether to enable fp32 CPU offload or not (default False).

quant_llm-int8_has_fp16_weight: Whether the model has fp16 weights or not (default False).
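
llm-int8 is implemented by bitsandbytes; through transformers, the equivalent standalone configuration is (model name illustrative, requires a GPU and the bitsandbytes package):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit weights; the flags below mirror the hyperparameters above.
# 4-bit would use load_in_4bit=True plus bnb_4bit_use_double_quant
# for double quantization.
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=False,
    llm_int8_has_fp16_weight=False,
)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b", quantization_config=bnb_config
)
```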

gptq: Contributors

Time:

30 minutes to a day depending on the size of the model.

Quality:

Lower than the original model; quality degrades with fewer bits (2-bit is worse than 3-bit, which is worse than 4-bit, which is worse than 8-bit).

Required:

tokenizer, dataset.

Hyperparameters:

quant_gptq_weight_bits: Set the weight quantization bits to 2, 4, or 8 (default 8).

quant_gptq_use_exllama: Whether to use exllama or not (default True).
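
GPTQ quantizes the model layer by layer against a calibration dataset, which explains both the tokenizer/dataset requirement and the long runtime on large models. Via transformers and optimum, the standalone equivalent is (model name illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
# bits and use_exllama mirror the hyperparameters above; "c4" is one of
# the built-in calibration datasets.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer, use_exllama=True)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m", quantization_config=gptq_config
)
```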

awq: Contributors

Time:

30 minutes to a day depending on the size of the model.

Quality:

Not specified.

Required:

tokenizer, dataset.

Hyperparameters:

quant_awq_group_size: Set the group size to 8, 16, 32, 64, or 128 (default 128).

quant_awq_zero_point: Whether to use zero point or not (default True).

quant_awq_version: Set the version to gemm or gemv (default gemm).
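
These hyperparameters mirror the quantization config of the AutoAWQ library that implements the method. A standalone sketch (model path illustrative, requires the autoawq package):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained("facebook/opt-125m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

# w_bit, q_group_size, zero_point, and version map directly to the
# hyperparameters above.
quant_config = {"w_bit": 4, "q_group_size": 128, "zero_point": True, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized("opt-125m-awq")
```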

hqq: Contributors, Article, Citation

Time:

A few minutes.

Quality:

Not specified.

Required:

None.

Hyperparameters:

quant_hqq_weight_bits: Set the weight quantization bits to 2, 4, 8, or 16 (default 8).

quant_hqq_group_size: Set the group size to 8, 16, 32, 64, or 128 (default 64).
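
HQQ is calibration-free, which is why it needs no dataset and finishes in minutes. In the hqq library the two hyperparameters appear as follows (API may vary across hqq versions; model name illustrative):

```python
from hqq.core.quantize import BaseQuantizeConfig
from hqq.engine.hf import HQQModelForCausalLM

# nbits and group_size correspond to quant_hqq_weight_bits and
# quant_hqq_group_size.
quant_config = BaseQuantizeConfig(nbits=8, group_size=64)
model = HQQModelForCausalLM.from_pretrained("facebook/opt-125m")
model.quantize_model(quant_config=quant_config)
```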

half: Contributors

Time:

A few minutes.

Quality:

Not specified.

Required:

None.

Hyperparameters:

None.
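
half simply casts the model to 16-bit floating point, a one-liner in PyTorch:

```python
import torch

model = torch.nn.Linear(16, 16)
# Cast all parameters and buffers to float16, halving memory use.
model = model.half()
```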

quanto: Contributors

Time:

A few minutes.

Quality:

Not specified.

Required:

dataset (only when activating calibration).

Hyperparameters:

quant_quanto_weight_bits: Set the weight quantization bits to 2, 4, or 8 (default 8).

quant_quanto_weight_type: Set the weight type to qint or qfloat (default qint).

quant_quanto_act_bits: Set the activation quantization bits to None (no quantization of activations), 2, 4, or 8 (default None).

quant_quanto_act_type: Set the activation type to qint or qfloat (default qint).

quant_quanto_calibrate: Whether to activate calibration or not (default True).

quant_quanto_calibration_samples: Set the number of calibration samples to 1, 64, 128, or 256 (default 64).
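
quanto (now distributed as optimum-quanto) only needs the dataset when activations are quantized and calibration is active, as noted above. A standalone sketch of that flow with a toy model and random calibration data (import paths differ slightly between the older quanto and newer optimum.quanto packages):

```python
import torch
from optimum.quanto import Calibration, freeze, qint8, quantize

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU())
calibration_samples = [torch.randn(4, 16) for _ in range(8)]  # placeholder data

# Quantize weights and (here) also activations to int8.
quantize(model, weights=qint8, activations=qint8)

# Calibration records activation ranges; only needed when activations
# are quantized and quant_quanto_calibrate is True.
with Calibration():
    for batch in calibration_samples:
        model(batch)

freeze(model)  # replace float weights with their quantized counterparts
```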

Pruning

Coming Soon!

Factorization

Coming Soon!