Compression Methods
At its core, the pruna package is a toolbox of compression methods. In this section, we will introduce you to all the methods you can currently apply with the package.
Pruna wouldn’t be possible without the amazing work of the authors behind these methods. 💜 We’re really grateful for their contributions and encourage you to check out their repositories!
The package supports several types of compression methods: Caching, Batching, Distillation, Compilation, Quantization, Pruning, and Recovery.
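Every method below is selected and tuned through a SmashConfig and then applied with smash. As a minimal, hedged sketch of that workflow — the group keys such as "quantizers" and "compilers" and the exact smash() signature are assumptions inferred from the hyperparameter prefixes used on this page; only the hyperparameter names themselves come from this section:

```python
# Minimal sketch of the smashing workflow; names marked "assumed" are not from this page.
from transformers import AutoModelForCausalLM

from pruna import SmashConfig, smash

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

smash_config = SmashConfig()
smash_config["quantizers"] = ["half"]          # assumed group key; method listed below
smash_config["compilers"] = ["torch_compile"]  # methods from different groups can be combined
smashed_model = smash(model=model, smash_config=smash_config)  # assumed signature
```

The per-method hyperparameters listed below are set on the same SmashConfig object, using the exact names given for each method.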
Caching
Caching is a technique that stores intermediate results of computations so they can be reused in subsequent operations. It is particularly useful for reducing the inference time of machine learning models, since previously computed results do not have to be recomputed.
step_caching: Contributors, Paper, Citation
- Time:
A few minutes.
- Caching on CPU:
No.
- Quality:
Very close to the original model.
- Required:
None.
- Hyperparameters:
cache_step_caching_interval: Set the cache interval to 1, 2, 3, 4, 5, 6, or 7 (default 3). If set to 1, caching is disabled; otherwise, caching is applied at the given fixed interval, where a higher interval is faster but reduces quality.
flux-caching: proprietary
- Time:
A few minutes.
- Caching on CPU:
Yes.
- Quality:
Very close to the original model.
- Required:
None.
- Hyperparameters:
cache_flux_caching_cache_interval: Set the cache interval to 1, 2, 3, 4 or 5 (default 2). How many model steps to skip in a row. Higher is faster, but reduces quality.
cache_flux_caching_start_step: Set the start step to 0, 1, 2, 3, 4 or 5 (default 2). How many steps to wait before starting to cache.
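As an illustration, here is a minimal sketch of how flux-caching could be configured on a Flux pipeline. The "cachers" group key, the pipeline loading, and the smash() call are assumptions; only the hyperparameter names come from the description above.

```python
# Hypothetical sketch: flux-caching on a Flux pipeline (group key assumed).
import torch
from diffusers import FluxPipeline

from pruna import SmashConfig, smash

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16)

smash_config = SmashConfig()
smash_config["cachers"] = ["flux-caching"]             # assumed group key and identifier
smash_config["cache_flux_caching_cache_interval"] = 2  # skip 2 model steps in a row
smash_config["cache_flux_caching_start_step"] = 2      # wait 2 steps before caching
smashed_pipe = smash(model=pipe, smash_config=smash_config)
```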
Batching
Batching is a technique used to group multiple inputs together to be processed simultaneously, improving computational efficiency and reducing overall processing time.
ifw: Contributors
- Time:
A few minutes.
- Batching on CPU:
Yes.
- Quality:
Comparable to the original model.
- Required:
processor.
- Hyperparameters:
batch_ifw_weight_bits: Set the weight quantization bits to 16 or 32 (default 16).
ws2t: Contributors
- Time:
A few minutes.
- Batching on CPU:
No.
- Quality:
Comparable to the original model.
- Required:
processor.
- Hyperparameters:
batch_ws2t_int8: Whether to use int8 or not (default False).
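Below is a hedged sketch of configuring ws2t for a Whisper model. The "batchers" group key and the add_processor() helper are assumptions; the method requires a processor and exposes only the batch_ws2t_int8 hyperparameter listed above.

```python
# Hypothetical sketch: ws2t batching for Whisper (group key and helper assumed).
from transformers import WhisperForConditionalGeneration, WhisperProcessor

from pruna import SmashConfig, smash

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
processor = WhisperProcessor.from_pretrained("openai/whisper-small")

smash_config = SmashConfig()
smash_config["batchers"] = ["ws2t"]      # assumed group key
smash_config["batch_ws2t_int8"] = False  # keep full precision (default)
smash_config.add_processor(processor)    # assumed helper for the required processor
smashed_model = smash(model=model, smash_config=smash_config)
```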
Distillation
Distillation is a technique used to create a smaller and faster version of a pre-trained model by training it on a smaller dataset or with a different loss function. It can also produce versions of the model that need less computational effort (fewer steps) to generate the same output. If you do not pass a number of steps, the model defaults to the number of steps it was distilled to.
hyper: Paper
- Time:
A few seconds.
- Distillation on CPU:
Yes.
- Quality:
Not specified.
- Required:
None.
- Hyperparameters:
distill_hyper_agressive: When set to True, the model is distilled to even fewer steps.
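As a sketch, hyper could be applied to a Stable Diffusion pipeline as follows. The "distillers" group key, the pipeline loading, and the smash() call are assumptions; the hyperparameter name comes from above.

```python
# Hypothetical sketch: hyper distillation of a Stable Diffusion pipeline.
import torch
from diffusers import StableDiffusionPipeline

from pruna import SmashConfig, smash

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)

smash_config = SmashConfig()
smash_config["distillers"] = ["hyper"]          # assumed group key
smash_config["distill_hyper_agressive"] = True  # distill to even fewer steps
smashed_pipe = smash(model=pipe, smash_config=smash_config)

# Calling the smashed pipeline without num_inference_steps uses the step count
# the model was distilled to.
```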
Compilation
Compilation optimizes the model for specific hardware. Supported methods include:
diffusers2: Contributors
- Time:
A few minutes.
- Compilation on CPU:
No.
- Quality:
Same as the original model.
- Required:
None.
- Hyperparameters:
None.
ctranslate: Contributors
- Time:
A few minutes.
- Compilation on CPU:
Yes.
- Quality:
Same as the original model.
- Required:
tokenizer.
- Hyperparameters:
comp_ctranslate_weight_bits: Set the weight quantization bits to 8 or 16 (default 16).
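For example, ctranslate could be configured as sketched below. The "compilers" group key and the add_tokenizer() helper are assumptions; the method requires a tokenizer and exposes the comp_ctranslate_weight_bits hyperparameter listed above.

```python
# Hypothetical sketch: ctranslate compilation of a text-generation model.
from transformers import AutoModelForCausalLM, AutoTokenizer

from pruna import SmashConfig, smash

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

smash_config = SmashConfig()
smash_config["compilers"] = ["ctranslate"]       # assumed group key
smash_config["comp_ctranslate_weight_bits"] = 8  # default is 16
smash_config.add_tokenizer(tokenizer)            # assumed helper for the required tokenizer
smashed_model = smash(model=model, smash_config=smash_config)
```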
cgenerate: Contributors
- Time:
A few minutes.
- Compilation on CPU:
Yes.
- Quality:
Equivalent to the original model.
- Required:
tokenizer.
- Hyperparameters:
comp_cgenerate_weight_bits: Set the weight quantization bits to 8 or 16 (default 16).
cwhisper: Contributors
- Time:
A few minutes.
- Compilation on CPU:
Yes.
- Quality:
Equivalent to the original model.
- Required:
processor.
- Hyperparameters:
comp_cwhisper_weight_bits: Set the weight quantization bits to 8 or 16 (default 16).
ipex_llm: Contributors
- Time:
A few minutes.
- Compilation on CPU:
Yes.
- Quality:
Equivalent to the original model.
- Required:
None.
- Hyperparameters:
comp_ipex_llm_weight_bits: Set the weight quantization bits to 8 or 4 (default 8).
x-fast: proprietary, based on diffusers2
- Time:
A few minutes.
- Compilation on CPU:
No.
- Quality:
Not specified.
- Required:
None.
- Hyperparameters:
comp_x-fast_xformers: Whether to activate xformers or not (default True).
torch_compile: Contributors
- Time:
A few minutes.
- Compilation on CPU:
Yes.
- Quality:
Not specified.
- Required:
None.
- Hyperparameters:
comp_torch_compile_mode: Set the mode to “default”, “reduce-overhead”, “max-autotune”, “max-autotune-no-cudagraphs” (default “max-autotune”).
comp_torch_compile_backend: Set the backend to “inductor”, “cudagraphs”, “ipex”, “onnxrt”, “tensorrt”, “tvm”, “openvino” (default “inductor”).
comp_torch_compile_fullgraph: Whether to compile the full graph or not (default False).
comp_torch_compile_dynamic: Set the dynamic compilation to None (determined automatically), True or False (default None).
onediff: Contributors, Citation
- Time:
A few minutes.
- Compilation on CPU:
No.
- Quality:
Same as the original model.
- Required:
None.
- Hyperparameters:
None.
Quantization
Quantization methods reduce the precision of the model’s weights and activations, greatly reducing the memory required at the cost of some quality loss. Supported methods include:
torch_dynamic: Contributors
- Time:
A few minutes.
- Quantization on CPU:
Yes.
- Quality:
Not specified.
- Required:
None.
- Hyperparameters:
quant_torch_dynamic_weight_bits: Set the weight quantization bits to quint8 or qint8 (default qint8).
torch_static: Contributors
- Time:
A few minutes.
- Quantization on CPU:
Yes.
- Quality:
Not specified.
- Required:
dataset.
- Hyperparameters:
quant_torch_static_weight_bits: Set the weight quantization bits to quint8 or qint8 (default qint8).
quant_torch_static_act_bits: Set the activation quantization bits to quint8 or qint8 (default qint8).
quant_torch_static_qscheme: Set the quantization scheme to per_tensor_symmetric or per_tensor_affine (default per_tensor_affine).
quant_torch_static_qobserver: Set the quantization observer to MinMaxObserver, MovingAverageMinMaxObserver, PerChannelMinMaxObserver, or HistogramObserver (default MinMaxObserver).
llm-int8: Contributors
- Time:
A few minutes.
- Quantization on CPU:
Yes.
- Quality:
Lower than the original model; 4-bit quality is worse than 8-bit.
- Required:
None.
- Hyperparameters:
quant_llm-int8_weight_bits: Set the weight quantization bits to 4 or 8 (default 8).
quant_llm-int8_double_quant: Whether to use double quantization or not (default False).
quant_llm-int8_enable_fp32_cpu_offload: Whether to enable fp32 CPU offload or not (default False).
gptq: Contributors
- Time:
30 minutes to a day depending on the size of the model.
- Quantization on CPU:
Yes.
- Quality:
Lower than the original model; the fewer bits, the lower the quality (2-bit worse than 3-bit, worse than 4-bit, worse than 8-bit).
- Required:
tokenizer, dataset.
- Hyperparameters:
quant_gptq_weight_bits: Set the weight quantization bits to 2, 4, or 8 (default 8).
quant_gptq_use_exllama: Whether to use exllama or not (default True).
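Since gptq needs both a tokenizer and a calibration dataset, a configuration could look like the sketch below. The "quantizers" group key and the add_tokenizer()/add_data() helpers (including the dataset name) are assumptions; the hyperparameter names come from above.

```python
# Hypothetical sketch: GPTQ quantization with its required tokenizer and dataset.
from transformers import AutoModelForCausalLM, AutoTokenizer

from pruna import SmashConfig, smash

model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")

smash_config = SmashConfig()
smash_config["quantizers"] = ["gptq"]          # assumed group key
smash_config["quant_gptq_weight_bits"] = 4     # fewer bits means lower quality
smash_config["quant_gptq_use_exllama"] = True
smash_config.add_tokenizer(tokenizer)          # assumed helper (required: tokenizer)
smash_config.add_data("WikiText")              # assumed helper and dataset name (required: dataset)
smashed_model = smash(model=model, smash_config=smash_config)
```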
awq: Contributors
- Time:
30 minutes to a day depending on the size of the model.
- Quantization on CPU:
Yes.
- Quality:
Not specified.
- Required:
tokenizer, dataset.
- Hyperparameters:
quant_awq_group_size: Set the group size to 8, 16, 32, 64, or 128 (default 128).
hqq: Contributors, Article, Citation
- Time:
A few minutes.
- Quantization on CPU:
Yes.
- Quality:
Not specified.
- Required:
None.
- Hyperparameters:
quant_hqq_weight_bits: Set the weight quantization bits to 2, 4, or 8 (default 8).
quant_hqq_group_size: Set the group size to 8, 16, 32, 64, or 128 (default 64).
half: Contributors
- Time:
A few minutes.
- Quantization on CPU:
Yes.
- Quality:
Not specified.
- Required:
None.
- Hyperparameters:
None.
quanto: Contributors
- Time:
A few minutes.
- Quantization on CPU:
Yes.
- Quality:
Not specified.
- Required:
dataset (only when activating calibration).
- Hyperparameters:
quant_quanto_weight_bits: Set the weight quantization bits to qint2, qint4, qint8, or qfloat8 (default qfloat8).
quant_quanto_calibrate: Whether to activate calibration or not (default True).
torchao_autoquant: Contributors
- Time:
A few seconds.
- Quantization on CPU:
Yes.
- Quality:
Not specified.
- Required:
None.
- Hyperparameters:
quant_torchao_autoquant_compile: Whether to compile the model or not (default True).
Pruning
Pruning removes less important weights or connections from a model to reduce its size and computation. Supported methods include:
torch-unstructured: Contributors
- Time:
A few minutes.
- Pruning on CPU:
Yes.
- Quality:
Not specified.
- Required:
None.
- Hyperparameters:
prune_torch-unstructured_pruning_method: Set the pruning method to random or l1 (default l1).
prune_torch-unstructured_amount: Set the pruning amount between 0.0 and 1.0 (default 0.5).
prune_torch-unstructured_dim: Set the dimension along which to prune (default None).
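Here is a hedged sketch of unstructured pruning on a vision model; the "pruners" group key and the smash() call are assumptions, and the hyperparameter names come from above.

```python
# Hypothetical sketch: unstructured L1 pruning of a torchvision model.
from torchvision.models import resnet18

from pruna import SmashConfig, smash

model = resnet18(weights="DEFAULT")

smash_config = SmashConfig()
smash_config["pruners"] = ["torch-unstructured"]                # assumed group key
smash_config["prune_torch-unstructured_pruning_method"] = "l1"  # magnitude-based pruning
smash_config["prune_torch-unstructured_amount"] = 0.3           # prune 30% of the weights
smashed_model = smash(model=model, smash_config=smash_config)
```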
Recovery
Recovery methods restore a model’s quality after smashing methods have been applied more aggressively than would otherwise be acceptable. They allow smashing methods to be pushed to their limits without the quality loss this would normally imply. Supported methods include:
llm-lora: Contributors
- Time:
30 minutes to a day depending on the dataset, model size and hyperparameters.
- Recovery on CPU:
Yes.
- Quality:
Better than the smashed model without recovery.
- Required:
tokenizer, dataset.
- Hyperparameters:
recov_llm-lora_r: Set the LoRA rank to 4, 8, 16, 32, 64 or 128 (default 8).
recov_llm-lora_alpha_r_ratio: Set the alpha/rank ratio to 0.5, 1.0 or 2.0 (default 2.0).
recov_llm-lora_target_modules: Set the target modules to “all-linear” or [“q_proj”, “v_proj”] (default “all-linear”).
recov_llm-lora_batch_size: Set the batch size (default 1).
recov_llm-lora_gradient_accumulation_steps: Set the gradient accumulation steps (default 1).
recov_llm-lora_num_epochs: Set the number of epochs (default 1.0).
recov_llm-lora_learning_rate: Set the learning rate (default 2e-4).
recov_llm-lora_report_to: Set the logging backend to “none”, “wandb” or “tensorboard” (default “none”).
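To illustrate, the sketch below combines an aggressive 4-bit llm-int8 quantization with llm-lora recovery. The "quantizers"/"recoverers" group keys and the add_tokenizer()/add_data() helpers (including the dataset name) are assumptions; all hyperparameter names come from this section.

```python
# Hypothetical sketch: aggressive quantization followed by llm-lora recovery.
from transformers import AutoModelForCausalLM, AutoTokenizer

from pruna import SmashConfig, smash

model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")

smash_config = SmashConfig()
smash_config["quantizers"] = ["llm-int8"]       # assumed group key
smash_config["quant_llm-int8_weight_bits"] = 4  # aggressive setting, hurts quality
smash_config["recoverers"] = ["llm-lora"]       # assumed group key
smash_config["recov_llm-lora_r"] = 16
smash_config["recov_llm-lora_alpha_r_ratio"] = 2.0
smash_config["recov_llm-lora_num_epochs"] = 1.0
smash_config.add_tokenizer(tokenizer)           # assumed helper (required: tokenizer)
smash_config.add_data("WikiText")               # assumed helper and dataset name (required: dataset)
smashed_model = smash(model=model, smash_config=smash_config)
```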