Algorithms Overview

At its core, the pruna package is a framework of compression algorithms. Its consistent interface makes it straightforward to apply and combine diverse compression algorithms. In this section, we introduce all the algorithms you can currently apply with the package. Algorithms marked with “(Pro)” are only available in the pruna_pro package.

pruna wouldn’t be possible without the amazing work of the authors behind these algorithms. 💜 We’re really grateful for their contributions and encourage you to check out their repositories!
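
To give a feel for that interface, here is a minimal sketch of how an algorithm from this overview is typically applied. The checkpoint name and the chosen algorithms are placeholders; the group keys (quantizer, compiler, ...) follow the category names used throughout this page.

```python
from transformers import AutoModelForCausalLM
from pruna import SmashConfig, smash

# Load the model you want to compress (placeholder checkpoint).
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# Pick at most one algorithm per group and set its hyperparameters by name.
smash_config = SmashConfig()
smash_config["quantizer"] = "half"
smash_config["compiler"] = "torch_compile"

# smash() returns a compressed model that keeps the original inference interface.
smashed_model = smash(model=model, smash_config=smash_config)
```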

Batchers

Batching groups multiple inputs together to be processed simultaneously, improving computational efficiency and reducing overall processing time.

ifw

Insanely Fast Whisper is an optimized version of Whisper models that significantly speeds up transcription. It achieves lower latency and higher throughput through low-level code optimizations and efficient batching, making real-time speech recognition more practical.

References: GitHub.
Can be applied on: GPU.
Required: Tokenizer, Processor.
Compatible with: half.

| Parameter | Default | Options | Description |
|---|---|---|---|
| ifw_weight_bits | 16 | 16 or 32 | Sets the number of bits to use for weight quantization. |
| ifw_batch_size | 16 | 1, 2, 4, 8, 16, 32 or 64 | The batch size to use for inference. Higher is faster but needs more memory. |
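
As a rough sketch, configuring ifw for a Whisper checkpoint could look like the following; the model id is a placeholder, and the add_tokenizer/add_processor helpers are assumed to be how your pruna version attaches the required tokenizer and processor.

```python
from transformers import AutoModelForSpeechSeq2Seq
from pruna import SmashConfig, smash

# Placeholder Whisper checkpoint.
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v3")

smash_config = SmashConfig()
smash_config["batcher"] = "ifw"
smash_config["ifw_weight_bits"] = 16   # 16 or 32
smash_config["ifw_batch_size"] = 16    # higher is faster but needs more memory
# ifw requires a tokenizer and a processor (helper names assumed).
smash_config.add_tokenizer("openai/whisper-large-v3")
smash_config.add_processor("openai/whisper-large-v3")

smashed_model = smash(model=model, smash_config=smash_config)
```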

whisper_s2t

WhisperS2T is an optimized speech-to-text pipeline built for Whisper models.

References: GitHub.
Can be applied on: GPU.
Required: Tokenizer, Processor.
Compatible with: c_translate, c_generate, c_whisper, half.

| Parameter | Default | Options | Description |
|---|---|---|---|
| whisper_s2t_int8 | False | True, False | Whether to quantize to int8 for inference. |
| whisper_s2t_batch_size | 16 | 1, 2, 4, 8, 16, 32 or 64 | The batch size to use for inference. Higher is faster but needs more memory. |

Cachers

Caching is a technique used to store intermediate results of computations to speed up subsequent operations, particularly useful in reducing inference time for machine learning models by reusing previously computed results.

deepcache

DeepCache accelerates inference by leveraging the U-Net blocks of diffusion pipelines to reuse high-level features.

References: GitHub, Paper.
Can be applied on: CPU, GPU.
Required: None.
Compatible with: stable_fast, torch_compile, half, hqq_diffusers, diffusers_int8.

| Parameter | Default | Options | Description |
|---|---|---|---|
| deepcache_interval | 2 | 1, 2, 3, 4 or 5 | Interval at which to cache; 1 disables caching. Higher is faster but might affect quality. |
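
A minimal sketch of pairing deepcache with torch_compile on a diffusion pipeline; the pipeline checkpoint and interval value are placeholders.

```python
from diffusers import StableDiffusionPipeline
from pruna import SmashConfig, smash

# Placeholder diffusion pipeline with a U-Net backbone.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

smash_config = SmashConfig()
smash_config["cacher"] = "deepcache"
smash_config["deepcache_interval"] = 3      # higher is faster, may affect quality
smash_config["compiler"] = "torch_compile"  # deepcache is compatible with torch_compile

smashed_pipe = smash(model=pipe, smash_config=smash_config)
```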

adaptive (Pro)

Adaptive caching adjusts caching dynamically for each prompt, determining the optimal inference steps to reuse cached outputs.

References: None.
Can be applied on: CPU, GPU.
Required: None.
Compatible with: hyper, torch_compile, stable_fast, hqq_diffusers, diffusers_int8.

| Parameter | Default | Options | Description |
|---|---|---|---|
| adaptive_threshold | 0.01 | Range 0.001 to 0.2 | How much the difference between the current and previous latent can be before caching. Higher is faster, but reduces quality. |
| adaptive_max_skip_steps | 4 | 1, 2, 3, 4 or 5 | How many steps are allowed to be skipped in a row. Higher is faster, but reduces quality. |

auto (Pro)

Given a speed_factor (e.g., 0.5 to halve latency), auto caching determines the optimal caching schedule to achieve the desired latency reduction.

References: None.
Can be applied on: CPU, GPU.
Required: None.
Compatible with: hyper, torch_compile, stable_fast, hqq_diffusers, diffusers_int8.

| Parameter | Default | Options | Description |
|---|---|---|---|
| auto_speed_factor | 0.5 | Range 0.0 to 1.0 | Controls inference latency. Lower values yield faster inference but may compromise quality. |
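
A hedged sketch of enabling auto caching (a pruna_pro algorithm; the import is assumed to mirror the pruna interface, and the pipeline checkpoint is a placeholder):

```python
from diffusers import StableDiffusionPipeline
from pruna_pro import SmashConfig, smash  # assumed to mirror the pruna interface

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")  # placeholder

smash_config = SmashConfig()
smash_config["cacher"] = "auto"
smash_config["auto_speed_factor"] = 0.5  # aim to roughly halve latency

smashed_pipe = smash(model=pipe, smash_config=smash_config)
```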

flux_caching (Pro)

Flux caching works similarly to periodic caching, but stores outputs of the transformer blocks instead of the output of the whole backbone.

References: None.
Can be applied on: CPU, GPU.
Required: None.
Compatible with: hyper, torch_compile, stable_fast, diffusers_int8.

| Parameter | Default | Options | Description |
|---|---|---|---|
| flux_caching_cache_interval | 2 | 1, 2, 3, 4, 5, 6 or 7 | How many model steps to skip in a row. Higher is faster, but reduces quality. |
| flux_caching_start_step | 2 | 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10 | How many steps to wait before starting to cache. |

periodic (Pro)

After a configurable start_step, periodic caching computes the output of the backbone (can be a UNet or a Transformer) every cache_interval steps and reuses this cached output for the remaining steps.

References: None.
Can be applied on: CPU, GPU.
Required: None.
Compatible with: hyper, torch_compile, stable_fast, hqq_diffusers, diffusers_int8.

| Parameter | Default | Options | Description |
|---|---|---|---|
| periodic_cache_interval | 2 | 1, 2, 3, 4, 5, 6 or 7 | How many model steps to skip in a row. Higher is faster, but reduces quality. |
| periodic_start_step | 2 | 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10 | How many steps to wait before starting to cache. |

Compilers

Compilation optimizes the model for specific hardware.

c_generate

CGenerate employs a custom runtime that leverages optimizations like weight quantization, layer fusion, and batch reordering to boost performance and reduce memory usage on both CPUs and GPUs for Causal LM models.

References: GitHub.
Can be applied on: GPU.
Required: Tokenizer.
Compatible with: whisper_s2t, half.

| Parameter | Default | Options | Description |
|---|---|---|---|
| c_generate_weight_bits | 16 | 8 or 16 | Sets the number of bits to use for weight quantization. |

c_translate

CTranslate employs a custom runtime that leverages optimizations like weight quantization, layer fusion, and batch reordering to boost performance and reduce memory usage on both CPUs and GPUs for Causal LM models used for Translation.

References: GitHub.
Can be applied on: GPU.
Required: Tokenizer.
Compatible with: whisper_s2t, half.

| Parameter | Default | Options | Description |
|---|---|---|---|
| c_translate_weight_bits | 16 | 8 or 16 | Sets the number of bits to use for weight quantization. |

c_whisper

CWhisper employs a custom runtime that leverages optimizations like weight quantization, layer fusion, and batch reordering to boost performance and reduce memory usage on both CPUs and GPUs for Whisper models.

References: GitHub.
Can be applied on: GPU.
Required: Processor.
Compatible with: whisper_s2t, half.

| Parameter | Default | Options | Description |
|---|---|---|---|
| c_whisper_weight_bits | 16 | 8 or 16 | Sets the number of bits to use for weight quantization. |

onediff

OneDiff achieves acceleration by converting diffusion model modules into optimized static graphs via PyTorch module compilation. This process fuses operations, applies low-level GPU kernel optimizations, and supports dynamic input shapes without the overhead of re-compilation.

References: GitHub.
Can be applied on: GPU.
Required: None.
Compatible with: half.
Required install: pip install pruna[onediff] or pip install pruna[full].

stable_fast

Stable-fast is an optimization framework for Image-Gen models. It accelerates inference by fusing key operations into optimized kernels and converting diffusion pipelines into efficient TorchScript graphs.

References: GitHub.
Can be applied on: GPU.
Required: None.
Compatible with: deepcache, half.
Required install: pip install pruna[stable-fast] or pip install pruna[stable-fast-cu11] --extra-index-url https://prunaai.pythonanywhere.com/ or pip install pruna[full].

torch_compile

torch_compile optimizes a given model or function using various backends and is compatible with any model containing PyTorch modules.

References: GitHub.
Can be applied on: CPU, GPU.
Required: None.
Compatible with: half, deepcache.

| Parameter | Default | Options | Description |
|---|---|---|---|
| torch_compile_mode | default | default, reduce-overhead, max-autotune, max-autotune-no-cudagraphs | Compilation mode. |
| torch_compile_backend | inductor | inductor, cudagraphs, onnxrt, tvm, openvino, openxla | Compilation backend. |
| torch_compile_fullgraph | True | True, False | Whether to compile the full input graph instead of discovering compilable subgraphs. |
| torch_compile_dynamic | None | None, True, False | Whether to use dynamic shape tracing. |
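
For instance, a sketch of compiling an arbitrary PyTorch model with a non-default mode; the model and parameter values are placeholders.

```python
from torchvision.models import resnet18
from pruna import SmashConfig, smash

# Any model built from PyTorch modules works; this one is a placeholder.
model = resnet18(weights=None)

smash_config = SmashConfig()
smash_config["compiler"] = "torch_compile"
smash_config["torch_compile_mode"] = "max-autotune"
smash_config["torch_compile_backend"] = "inductor"
smash_config["torch_compile_fullgraph"] = True

smashed_model = smash(model=model, smash_config=smash_config)
```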

x_fast (Pro)

Based on stable_fast, this compiler reduces inference latency for any model using a combination of xformers, triton, cudnn, and torch tracing.

References: None.
Can be applied on: GPU.
Required: None.
Compatible with: quanto, half, text_to_text_lora, text_to_image_lora.
Required install: pip install pruna[stable-fast] or pip install pruna[stable-fast-cu11] or pip install pruna[full].

| Parameter | Default | Options | Description |
|---|---|---|---|
| x_fast_xformers | True | True, False | Whether to use xformers for faster inference. |

ipex_llm (Pro)

This compiler leverages advanced graph optimizations, quantization, and kernel fusion techniques to accelerate PyTorch-based LLM inference on Intel CPUs.

References: GitHub.
Can be applied on: CPU.
Required: None.
Compatible with: half.
Required install: pip install pruna_pro[intel] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/cpu/cn/ or pip install pruna_pro[full] --extra-index-url https://prunaai.pythonanywhere.com/ --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/cpu/cn/.

| Parameter | Default | Options | Description |
|---|---|---|---|
| ipex_llm_weight_bits | 8 | 8 or 4 | The number of bits to use for weight quantization. |

Distillers

Distillation trains a smaller, simpler model to mimic a larger, more complex model.

hyper (Pro)

Hyper-SD is a distillation framework that segments the diffusion process into time-step groups to preserve and reformulate the ODE trajectory. By integrating human feedback and score distillation, it enables near-lossless performance with drastically fewer inference steps.

References: Paper.
Can be applied on: CPU, GPU.
Required: None.
Compatible with: half, diffusers_int8, deepcache, auto, adaptive, flux_caching, periodic, torch_compile, stable_fast.

| Parameter | Default | Options | Description |
|---|---|---|---|
| hyper_agressive | False | True, False | When set to True, the model is distilled to even fewer steps. |
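
A hedged sketch of enabling hyper (a pruna_pro algorithm; the import is assumed to mirror the pruna interface, and the pipeline checkpoint is a placeholder):

```python
from diffusers import StableDiffusionXLPipeline
from pruna_pro import SmashConfig, smash  # assumed to mirror the pruna interface

pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")  # placeholder

smash_config = SmashConfig()
smash_config["distiller"] = "hyper"
smash_config["hyper_agressive"] = False  # True distills to even fewer steps

smashed_pipe = smash(model=pipe, smash_config=smash_config)
```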

Pruners

Pruning removes less important or redundant connections and neurons from a model, resulting in a sparser, more efficient network.

torch_structured

Structured pruning removes entire units like neurons, channels, or filters from a network, leading to a more compact and computationally efficient model while preserving a regular structure that standard hardware can easily optimize.

References: GitHub.
Can be applied on: CPU, GPU.
Required: Dataset.
Compatible with: half.

| Parameter | Default | Options | Description |
|---|---|---|---|
| torch_structured_type | MagnitudeImportance | RandomImportance, MagnitudeImportance, LAMPImportance, TaylorImportance, HessianImportance | Importance criterion for pruning. |
| torch_structured_calibration_samples | 64 | Range 1 to 256 | Number of calibration samples for importance computation. |
| torch_structured_prune_head_dims | False | True, False | Whether to prune head dimensions. |
| torch_structured_prune_num_heads | False | True, False | Whether to prune the number of heads. |
| torch_structured_global_pruning | False | True, False | Whether to perform global pruning. |
| torch_structured_sparsity | 0.1 | Range 0.0 to 1.0 | Sparsity level up to which to prune. |
| torch_structured_head_sparsity | 0.0 | Range 0.0 to 1.0 | Sparsity level up to which to prune heads. |
| torch_structured_it_steps | 1 | Range 1 to 10 | Number of iterations for pruning. |
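
A minimal sketch of structured pruning with a calibration dataset; the model is a placeholder, and the add_data helper and dataset name are assumptions about how your pruna version attaches the required dataset.

```python
from torchvision.models import resnet18
from pruna import SmashConfig, smash

model = resnet18(weights=None)  # placeholder model

smash_config = SmashConfig()
smash_config["pruner"] = "torch_structured"
smash_config["torch_structured_type"] = "MagnitudeImportance"
smash_config["torch_structured_sparsity"] = 0.1
smash_config["torch_structured_calibration_samples"] = 64
# torch_structured requires a dataset for importance computation (helper and name assumed).
smash_config.add_data("ImageNet")

smashed_model = smash(model=model, smash_config=smash_config)
```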

torch_unstructured

Unstructured pruning sets individual weights to 0 based on criteria such as magnitude, resulting in sparse weight matrices that retain the overall model architecture but may require specialized sparse computation support to fully exploit the efficiency gains.

References: GitHub.
Can be applied on: CPU, GPU.
Required: None.
Compatible with: half.

| Parameter | Default | Options | Description |
|---|---|---|---|
| torch_unstructured_pruning_method | l1 | random, l1 | Pruning method to use. |
| torch_unstructured_sparsity | 0.1 | Range 0.0 to 1.0 | Sparsity level up to which to prune. |

Quantizers

Quantization reduces the precision of the model’s weights and activations, making them much smaller in terms of memory required.

half

Converting model parameters to half precision (FP16) reduces memory usage and can accelerate computations on GPUs that support it.

References: GitHub.
Can be applied on: CPU, GPU.
Required: None.
Compatible with: ifw, whisper_s2t, deepcache, c_translate, c_generate, c_whisper, stable_fast, onediff, torch_compile, torch_structured, torch_unstructured.

hqq

Half-Quadratic Quantization (HQQ) leverages fast, robust optimization techniques for on-the-fly quantization, eliminating the need for calibration data.

References: GitHub, Article.
Can be applied on: GPU.
Required: None.
Compatible with: None.

| Parameter | Default | Options | Description |
|---|---|---|---|
| hqq_weight_bits | 8 | 2, 4 or 8 | Number of bits to use for quantization. |
| hqq_group_size | 64 | 8, 16, 32, 64 or 128 | Group size for quantization. |
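
A minimal sketch of 4-bit HQQ quantization of an LLM; the checkpoint is a placeholder.

```python
from transformers import AutoModelForCausalLM
from pruna import SmashConfig, smash

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # placeholder LLM

smash_config = SmashConfig()
smash_config["quantizer"] = "hqq"
smash_config["hqq_weight_bits"] = 4    # 2, 4 or 8
smash_config["hqq_group_size"] = 64

smashed_model = smash(model=model, smash_config=smash_config)
```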

hqq_diffusers

Half-Quadratic Quantization (HQQ) leverages fast, robust optimization techniques for on-the-fly quantization, eliminating the need for calibration data and making it applicable to any model. This algorithm is specifically adapted for diffusers models.

References: GitHub, Article.
Can be applied on: GPU.
Required: None.
Compatible with: deepcache.

| Parameter | Default | Options | Description |
|---|---|---|---|
| hqq_diffusers_weight_bits | 8 | 2, 4 or 8 | Number of bits to use for quantization. |
| hqq_diffusers_group_size | 64 | 8, 16, 32, 64 or 128 | Group size for quantization. |
| hqq_diffusers_backend | torchao_int4 | gemlite, bitblas, torchao_int4 or marlin | Backend to use for quantization. |

awq

Activation-aware Weight Quantization (AWQ) selectively quantizes model weights using a calibration dataset, preserving the small fraction of weights that is most important for maintaining LLM performance. This minimizes quantization loss, allowing models to operate at 4-bit precision without significantly sacrificing accuracy.

References: GitHub.
Can be applied on: GPU.
Required: Dataset.
Compatible with: None.
Required install: pip install pruna[autoawq] or pip install pruna[full].

| Parameter | Default | Options | Description |
|---|---|---|---|
| awq_group_size | 128 | 8, 16, 32, 64 or 128 | Group size for quantization. |

diffusers_int8

BitsAndBytes offers a simple method to quantize models to 8-bit or 4-bit precision. The 8-bit mode blends outlier fp16 values with int8 non-outliers to mitigate performance degradation, while 4-bit quantization further compresses the model and is often used with QLoRA for fine-tuning. This algorithm is specifically adapted for diffusers models.

References: GitHub.
Can be applied on: GPU.
Required: None.
Compatible with: deepcache.

| Parameter | Default | Options | Description |
|---|---|---|---|
| diffusers_int8_weight_bits | 8 | 4 or 8 | Number of bits to use for quantization. |
| diffusers_int8_double_quant | False | True, False | Whether to enable double quantization. |
| diffusers_int8_enable_fp32_cpu_offload | False | True, False | Whether to enable fp32 CPU offload. |
| diffusers_int8_quant_type | fp4 | fp4, nf4 | Quantization type to use. |
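
Because diffusers_int8 is compatible with deepcache, quantization and caching can be stacked in one config. A sketch; the pipeline checkpoint and values are placeholders.

```python
from diffusers import StableDiffusionXLPipeline
from pruna import SmashConfig, smash

pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")  # placeholder

smash_config = SmashConfig()
smash_config["quantizer"] = "diffusers_int8"
smash_config["diffusers_int8_weight_bits"] = 4
smash_config["diffusers_int8_quant_type"] = "nf4"
smash_config["cacher"] = "deepcache"  # quantization and caching can be combined

smashed_pipe = smash(model=pipe, smash_config=smash_config)
```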

gptq

GPTQ is a post-training quantization technique that independently quantizes each row of the weight matrix to minimize error. The weights are quantized to int4, stored as int32, and then dequantized on the fly to fp16 during inference, resulting in nearly 4x memory savings and faster performance due to custom kernels that take advantage of the lower precision.

References: GitHub.
Can be applied on: GPU.
Required: Tokenizer, Dataset.
Compatible with: None.

| Parameter | Default | Options | Description |
|---|---|---|---|
| gptq_weight_bits | 8 | 2, 4 or 8 | Sets the number of bits to use for weight quantization. |
| gptq_use_exllama | True | True, False | Whether to use exllama for quantization. |
| gptq_group_size | 128 | 64, 128 or 256 | Group size for quantization. |
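
Since gptq needs a tokenizer and a calibration dataset, a sketch could look like the following; the checkpoint, dataset name, and the add_tokenizer/add_data helpers are assumptions.

```python
from transformers import AutoModelForCausalLM
from pruna import SmashConfig, smash

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # placeholder LLM

smash_config = SmashConfig()
smash_config["quantizer"] = "gptq"
smash_config["gptq_weight_bits"] = 4
smash_config["gptq_group_size"] = 128
# gptq requires a tokenizer and a calibration dataset (helpers and dataset name assumed).
smash_config.add_tokenizer("facebook/opt-125m")
smash_config.add_data("WikiText")

smashed_model = smash(model=model, smash_config=smash_config)
```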

llm_int8

BitsAndBytes offers a simple method to quantize models to 8-bit or 4-bit precision. The 8-bit mode blends outlier fp16 values with int8 non-outliers to mitigate performance degradation, while 4-bit quantization further compresses the model and is often used with QLoRA for fine-tuning.

References: GitHub.
Can be applied on: GPU.
Required: None.
Compatible with: None.

| Parameter | Default | Options | Description |
|---|---|---|---|
| llm_int8_weight_bits | 8 | 4 or 8 | Sets the number of bits to use for weight quantization. |
| llm_int8_double_quant | False | True, False | Whether to enable double quantization. |
| llm_int8_enable_fp32_cpu_offload | False | True, False | Whether to enable fp32 CPU offload. |
| llm_int8_quant_type | fp4 | fp4, nf4 | Quantization type to use. |

quanto

With Quanto, models with int8/float8 weights and float8 activations maintain nearly full-precision accuracy. Lower bit quantization is also supported. When only weights are quantized and optimized kernels are available, inference latency remains comparable, and device memory usage is roughly reduced in proportion to the bitwidth ratio.

References: GitHub.
Can be applied on: GPU.
Required: None.
Compatible with: None.

| Parameter | Default | Options | Description |
|---|---|---|---|
| quanto_weight_bits | qfloat8 | qint2, qint4, qint8 or qfloat8 | Tensor type to use for quantization. |
| quanto_calibrate | True | True, False | Whether to calibrate the model. |

torch_dynamic

This technique converts model weights to lower precision (typically int8) dynamically at runtime, reducing model size and improving inference speed with minimal impact on accuracy and without the need for calibration data.

References: GitHub.
Can be applied on: CPU, GPU.
Required: None.
Compatible with: None.

| Parameter | Default | Options | Description |
|---|---|---|---|
| torch_dynamic_weight_bits | qint8 | quint8 or qint8 | Tensor type to use for quantization. |

torch_static

In static quantization, both weights and activations are pre-converted to lower precision (e.g., int8) using a calibration process on representative data, which typically yields greater efficiency gains but requires additional steps during model preparation.

References: GitHub.
Can be applied on: CPU, GPU.
Required: Dataset.
Compatible with: None.

| Parameter | Default | Options | Description |
|---|---|---|---|
| torch_static_weight_bits | qint8 | quint8 or qint8 | Tensor type to use for weight quantization. |
| torch_static_act_bits | qint8 | quint8 or qint8 | Tensor type to use for activation quantization. |
| torch_static_qscheme | per_tensor_affine | per_tensor_symmetric, per_tensor_affine | Quantization scheme to use. |
| torch_static_qobserver | MinMaxObserver | MinMaxObserver, MovingAverageMinMaxObserver, PerChannelMinMaxObserver, HistogramObserver | Observer to use for quantization. |

torchao_autoquant (Pro)

This algorithm compiles, quantizes and sparsifies weights, gradients, and activations for inference. This algorithm is specifically adapted for Image-Gen models.

References: GitHub.
Can be applied on: CPU, GPU.
Required: None.
Compatible with: None.

| Parameter | Default | Options | Description |
|---|---|---|---|
| torchao_autoquant_compile | True | True, False | Whether to compile the model after quantization. |

higgs (Pro)

HIGGS is a zero-shot quantization method that uses Hadamard preprocessing to transform weights and then selects MSE-optimal quantization grids.

References: Paper.
Can be applied on: GPU.
Required: None.
Compatible with: torch_compile, torch_unstructured.
Required install: pip install pruna_pro[higgs] --extra-index-url https://prunaai.pythonanywhere.com/ or pip install pruna_pro[higgs-cu11] --extra-index-url https://prunaai.pythonanywhere.com/ or pip install pruna_pro[full] --extra-index-url https://prunaai.pythonanywhere.com/ --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/cpu/cn/.

| Parameter | Default | Options | Description |
|---|---|---|---|
| higgs_weight_bits | 4 | 2, 3 or 4 | The number of bits to use for weight quantization. |
| higgs_p | 2 | 1 or 2 | The number of groups to use for weight quantization. |
| higgs_group_size | 256 | 64, 128 or 256 | The size of each group. |
| higgs_hadamard_size | 1024 | 512, 1024 or 2048 | The size of the Hadamard matrix. |
| higgs_example_batch_size | 1 | 1, 2, 4, 8 or 16 | The batch size used when running inference. Keep this equal to your inference batch size to take advantage of faster CUDA kernels. |

Recoverers

Recovery (experimental) restores the performance of a model after compression.

text_to_text_perp (Pro)

This recoverer is a general-purpose PERP recoverer for text-to-text models, using norm, head, and bias finetuning and optionally HuggingFace’s LoRA.

References: GitHub, Paper.
Can be applied on: CPU, GPU.
Required: Tokenizer, Dataset.
Compatible with: half, quanto, torch_dynamic, llm_int8, torch_compile, x_fast.

| Parameter | Default | Options | Description |
|---|---|---|---|
| text_to_text_perp_lora_r | 8 | 4, 8, 16, 32, 64 or 128 | Rank of the LoRA layers. |
| text_to_text_perp_lora_alpha_r_ratio | 2.0 | 0.5, 1.0 or 2.0 | Alpha/rank ratio of the LoRA layers. |
| text_to_text_perp_lora_target_modules | None | None, all-linear | Target modules for the LoRA layers. |
| text_to_text_perp_batch_size | 1 | Range 1 to 4096 | Batch size for finetuning. |
| text_to_text_perp_gradient_accumulation_steps | 1 | Range 1 to 1024 | Number of gradient accumulation steps for finetuning. |
| text_to_text_perp_num_epochs | 1.0 | Range 0.0 to 4096.0 | Number of epochs for finetuning. |
| text_to_text_perp_learning_rate | 0.0002 | Range 0.0 to 1.0 | Learning rate for finetuning. |
| text_to_text_perp_report_to | none | none, wandb, tensorboard | Where to report the finetuning results. |
| text_to_text_perp_optimizer | AdamW8bit | AdamW, AdamW8bit, PagedAdamW8bit | Which optimizer to use for finetuning. |
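
A hedged sketch of recovering a quantized LLM with text_to_text_perp (a pruna_pro algorithm; the import is assumed to mirror the pruna interface, the checkpoint and dataset name are placeholders, and the add_tokenizer/add_data helpers are assumptions):

```python
from transformers import AutoModelForCausalLM
from pruna_pro import SmashConfig, smash  # assumed to mirror the pruna interface

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # placeholder LLM

smash_config = SmashConfig()
smash_config["quantizer"] = "llm_int8"           # compress first
smash_config["recoverer"] = "text_to_text_perp"  # then recover quality via PERP finetuning
smash_config["text_to_text_perp_lora_r"] = 8
smash_config["text_to_text_perp_num_epochs"] = 1.0
# The recoverer requires a tokenizer and a finetuning dataset (helpers and name assumed).
smash_config.add_tokenizer("facebook/opt-125m")
smash_config.add_data("WikiText")

smashed_model = smash(model=model, smash_config=smash_config)
```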

text_to_text_inplace_perp (Pro)

This is the same as text_to_text_perp, but without the LoRA layers, which add extra computation and thus slow down inference of the final model.

References: GitHub, Paper.
Can be applied on: CPU, GPU.
Required: Tokenizer, Dataset.
Compatible with: half, quanto, torch_dynamic, llm_int8, torch_compile, x_fast.

| Parameter | Default | Options | Description |
|---|---|---|---|
| text_to_text_inplace_perp_batch_size | 1 | Range 1 to 4096 | Batch size for finetuning. |
| text_to_text_inplace_perp_gradient_accumulation_steps | 1 | Range 1 to 1024 | Number of gradient accumulation steps for finetuning. |
| text_to_text_inplace_perp_num_epochs | 1.0 | Range 0.0 to 4096.0 | Number of epochs for finetuning. |
| text_to_text_inplace_perp_learning_rate | 0.0002 | Range 0.0 to 1.0 | Learning rate for finetuning. |
| text_to_text_inplace_perp_report_to | none | none, wandb, tensorboard | Where to report the finetuning results. |
| text_to_text_inplace_perp_optimizer | AdamW8bit | AdamW, AdamW8bit, PagedAdamW8bit | Which optimizer to use for finetuning. |

text_to_image_perp (Pro)

This recoverer is a general-purpose PERP recoverer for text-to-image models, using norm, head, and bias finetuning and optionally HuggingFace’s LoRA.

References: GitHub, Paper.
Can be applied on: GPU.
Required: Dataset.
Compatible with: quanto, torch_dynamic, diffusers_int8, deepcache, flux_caching, torch_compile, x_fast.

| Parameter | Default | Options | Description |
|---|---|---|---|
| text_to_image_perp_lora_r | 4 | 4, 8, 16, 32, 64 or 128 | Rank of the LoRA layers. |
| text_to_image_perp_lora_alpha_r_ratio | 1.0 | 0.5, 1.0 or 2.0 | Alpha/rank ratio of the LoRA layers. |
| text_to_image_perp_batch_size | 0 | Range 0 to 4096 | Batch size for finetuning. |
| text_to_image_perp_gradient_accumulation_steps | 1 | Range 1 to 1024 | Number of gradient accumulation steps for finetuning. |
| text_to_image_perp_num_epochs | 1.0 | Range 0.0 to 4096.0 | Number of epochs for finetuning. |
| text_to_image_perp_learning_rate | 1e-05 | Range 0.0 to 1.0 | Learning rate for finetuning. |
| text_to_image_perp_use_cpu_offloading | True | True, False | Whether to use CPU offloading for finetuning. |
| text_to_image_perp_optimizer | AdamW8bit | AdamW8bit, AdamW, Adam | Which optimizer to use for finetuning. |

text_to_image_inplace_perp (Pro)

This is the same as text_to_image_perp, but without the LoRA layers, which add extra computation and thus slow down inference of the final model.

References: GitHub, Paper.
Can be applied on: GPU.
Required: Dataset.
Compatible with: quanto, torch_dynamic, diffusers_int8, deepcache, flux_caching, torch_compile, x_fast.

| Parameter | Default | Options | Description |
|---|---|---|---|
| text_to_image_inplace_perp_batch_size | 0 | Range 0 to 4096 | Batch size for finetuning. |
| text_to_image_inplace_perp_gradient_accumulation_steps | 1 | Range 1 to 1024 | Number of gradient accumulation steps for finetuning. |
| text_to_image_inplace_perp_num_epochs | 1.0 | Range 0.0 to 4096.0 | Number of epochs for finetuning. |
| text_to_image_inplace_perp_learning_rate | 1e-05 | Range 0.0 to 1.0 | Learning rate for finetuning. |
| text_to_image_inplace_perp_use_cpu_offloading | True | True, False | Whether to use CPU offloading for finetuning. |
| text_to_image_inplace_perp_optimizer | AdamW8bit | AdamW8bit, AdamW, Adam | Which optimizer to use for finetuning. |