Algorithms Overview

At its core, the pruna package is a framework of compression algorithms. By offering a consistent interface, it simplifies the integration of diverse compression algorithms. In this section, we will introduce you to all the algorithms you can currently apply with the package. Algorithms marked with “(Pro)” are only available in the pruna_pro package.

pruna wouldn’t be possible without the amazing work of the authors behind these algorithms. 💜 We’re really grateful for their contributions and encourage you to check out their repositories!

Batchers

Batching groups multiple inputs together to be processed simultaneously, improving computational efficiency and reducing overall processing time.
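
As a minimal sketch of the idea (not the code used by any batcher listed here), the helper below splits a list of inputs into fixed-size batches. The name `make_batches` and the clip names are hypothetical and only serve the illustration.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def make_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Hypothetical inputs; a real pipeline would pass waveforms or file paths.
clips = ["clip_0", "clip_1", "clip_2", "clip_3", "clip_4"]
batches = list(make_batches(clips, batch_size=2))

# Five inputs now need three model calls instead of five.
print(len(batches), batches)
```

The last batch may be smaller than the others, which is one reason to pick a batch size that matches the inference workload.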

ifw

Insanely Fast Whisper is an optimized version of Whisper models that significantly speeds up transcription. It achieves lower latency and higher throughput through low-level code optimizations and efficient batching, making real-time speech recognition more practical. Note: IFW prepares the model for inference with the batch size specified in the smash config. Make sure to set the batch size to a value that corresponds to your inference requirements.

References: GitHub.
Can be applied on: CUDA.
Required: Tokenizer, Processor.
Compatible with: half.

| Parameter | Default | Options | Description |
| --- | --- | --- | --- |
| ifw_weight_bits | 16 | 16 or 32 | Sets the number of bits to use for weight quantization. |

whisper_s2t

WhisperS2T is an optimized speech-to-text pipeline built for Whisper models. Note: WS2T prepares the model for inference with the batch size specified in the smash config. Make sure to set the batch size to a value that corresponds to your inference requirements.

References: GitHub.
Can be applied on: CUDA.
Required: Tokenizer, Processor.
Compatible with: c_generate, c_translate, c_whisper, half.

| Parameter | Default | Options | Description |
| --- | --- | --- | --- |
| whisper_s2t_int8 | False | True, False | Whether to quantize to int8 for inference. |

Cachers

Caching is a technique used to store intermediate results of computations to speed up subsequent operations, particularly useful in reducing inference time for machine learning models by reusing previously computed results.

deepcache

DeepCache accelerates inference by leveraging the U-Net blocks of diffusion pipelines to reuse high-level features.

References: GitHub, Paper.
Can be applied on: CPU, CUDA, Accelerate distributed.
Required: None.

| Parameter | Default | Options | Description |
| --- | --- | --- | --- |
| deepcache_interval | 2 | 1, 2, 3, 4 or 5 | Interval at which to cache (1 disables caching). Higher is faster but might affect quality. |

fastercache

FasterCache is a method that speeds up inference in diffusion transformers by:

- Reusing attention states between successive inference steps, due to high similarity between them.
- Skipping unconditional branch prediction used in classifier-free guidance by revealing redundancies between unconditional and conditional branch outputs for the same timestep, and therefore approximating the unconditional branch output using the conditional branch output.

This implementation reduces the number of tunable parameters by setting pipeline-specific parameters according to https://github.com/huggingface/diffusers/pull/9562.

References: GitHub, Paper.
Can be applied on: CPU, CUDA, Accelerate distributed.
Required: None.
Compatible with: diffusers_int8, hqq_diffusers.

| Parameter | Default | Options | Description |
| --- | --- | --- | --- |
| fastercache_interval | 2 | 1, 2, 3, 4 or 5 | Interval at which to cache spatial attention blocks (1 disables caching). Higher is faster but might degrade quality. |
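
The second idea in the FasterCache description, approximating the unconditional branch from the conditional one, can be illustrated with a toy sketch. This is a deliberately simplified stand-in, not the approximation from the paper or from diffusers: it assumes scalar outputs and simply reuses the last observed offset between the two branches. The class `UncondBranchCache` and the two branch functions are hypothetical.

```python
from typing import Callable, Optional


def guide(cond: float, uncond: float, scale: float) -> float:
    """Classifier-free guidance: move the output away from the unconditional one."""
    return uncond + scale * (cond - uncond)


class UncondBranchCache:
    """Runs the unconditional branch every `interval` steps and estimates it otherwise."""

    def __init__(self, interval: int) -> None:
        self.interval = interval
        self.offset: Optional[float] = None  # cached (uncond - cond)
        self.uncond_calls = 0

    def uncond(self, step: int, cond: float, run_uncond: Callable[[int], float]) -> float:
        if self.offset is None or step % self.interval == 0:
            value = run_uncond(step)  # expensive call
            self.uncond_calls += 1
            self.offset = value - cond
            return value
        # Skipped step: estimate the unconditional output from the conditional one.
        return cond + self.offset


# Toy branches (assumption): the two outputs differ by a constant.
def cond_branch(step: int) -> float:
    return 2.0 + 0.5 * step


def uncond_branch(step: int) -> float:
    return cond_branch(step) - 0.5


cache = UncondBranchCache(interval=2)
outputs = []
for step in range(4):
    cond = cond_branch(step)
    uncond = cache.uncond(step, cond, uncond_branch)
    outputs.append(guide(cond, uncond, scale=3.0))

print(cache.uncond_calls, outputs)
```

In this toy setup the estimate is exact because the offset between branches never changes; in a real model the offset drifts, which is why a larger interval trades quality for speed.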

fora

FORA reuses the outputs of the transformer blocks for N steps before recomputing them. Unlike the official implementation, this implementation exposes a start step parameter that makes it possible to obtain higher fidelity to the base model.

References: Paper.
Can be applied on: CPU, CUDA, Accelerate distributed.
Required: None.

| Parameter | Default | Options | Description |
| --- | --- | --- | --- |
| fora_interval | 2 | 1, 2, 3, 4 or 5 | Interval at which the outputs are computed. Higher is faster, but reduces quality. |
| fora_start_step | 2 | 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10 | How many steps to wait before starting to cache. |
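
To make the interaction between the interval and the start step concrete, here is a small sketch of a compute/reuse schedule. The exact alignment of recomputation steps is an assumption made for illustration and may differ from what pruna does internally; `fora_schedule` is a hypothetical helper.

```python
from typing import List


def fora_schedule(num_steps: int, interval: int, start_step: int) -> List[bool]:
    """Return one flag per step: True = recompute the blocks, False = reuse the cache."""
    schedule = []
    for step in range(num_steps):
        if step < start_step:
            # Warm-up phase: always compute, nothing is cached yet.
            schedule.append(True)
        else:
            # Caching phase (assumed alignment): recompute every `interval` steps.
            schedule.append((step - start_step) % interval == 0)
    return schedule


schedule = fora_schedule(num_steps=8, interval=2, start_step=2)
print(schedule)
print(sum(schedule), "of", len(schedule), "steps run the transformer blocks")
```

With an interval of 1 every step recomputes, so caching is effectively off. A larger start step keeps more of the early steps exact.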

pab

Pyramid Attention Broadcast (PAB) is a method that speeds up inference in diffusion models by systematically skipping attention computations between successive inference steps and reusing cached attention states. This implementation reduces the number of tunable parameters by setting pipeline specific parameters according to https://github.com/huggingface/diffusers/pull/9562.

References: Paper, HuggingFace.
Can be applied on: CPU, CUDA, Accelerate distributed.
Required: None.
Compatible with: diffusers_int8, hqq_diffusers.

| Parameter | Default | Options | Description |
| --- | --- | --- | --- |
| pab_interval | 2 | 1, 2, 3, 4 or 5 | Interval at which to cache spatial attention blocks (1 disables caching). Higher is faster but might degrade quality. |

Compilers

Compilation optimizes the model for specific hardware.

c_generate

CGenerate employs a custom runtime that leverages optimizations like weight quantization, layer fusion, and batch reordering to boost performance and reduce memory usage on both CPUs and GPUs for Causal LM models.

References: GitHub.
Can be applied on: CUDA.
Required: Tokenizer.
Compatible with: half, whisper_s2t.

| Parameter | Default | Options | Description |
| --- | --- | --- | --- |
| c_generate_weight_bits | 16 | 8 or 16 | Sets the number of bits to use for weight quantization. |

c_translate

CTranslate employs a custom runtime that leverages optimizations like weight quantization, layer fusion, and batch reordering to boost performance and reduce memory usage on both CPUs and GPUs for Causal LM models used for Translation.

References: GitHub.
Can be applied on: CUDA.
Required: Tokenizer.
Compatible with: half, whisper_s2t.

| Parameter | Default | Options | Description |
| --- | --- | --- | --- |
| c_translate_weight_bits | 16 | 8 or 16 | Sets the number of bits to use for weight quantization. |

c_whisper

CWhisper employs a custom runtime that leverages optimizations like weight quantization, layer fusion, and batch reordering to boost performance and reduce memory usage on both CPUs and GPUs for Whisper models.

References: GitHub.
Can be applied on: CUDA.
Required: Processor.
Compatible with: half, whisper_s2t.

| Parameter | Default | Options | Description |
| --- | --- | --- | --- |
| c_whisper_weight_bits | 16 | 8 or 16 | Sets the number of bits to use for weight quantization. |

stable_fast

Stable-fast is an optimization framework for Image-Gen models. It accelerates inference by fusing key operations into optimized kernels and converting diffusion pipelines into efficient TorchScript graphs.

References: GitHub.
Can be applied on: CUDA.
Required: None.
Compatible with: deepcache, fora, half.
Required install: pip install pruna[stable-fast] or pip install pruna[full].

torch_compile

Optimizes a given model or function using one of several backends. It is compatible with any model containing PyTorch modules.

References: GitHub.
Can be applied on: CPU, CUDA.
Required: None.

| Parameter | Default | Options | Description |
| --- | --- | --- | --- |
| torch_compile_mode | default | default, reduce-overhead, max-autotune, max-autotune-no-cudagraphs | Compilation mode. |
| torch_compile_backend | inductor | inductor, cudagraphs, onnxrt, tvm, openvino, openxla | Compilation backend. |
| torch_compile_fullgraph | False | True, False | Whether to discover compilable subgraphs or compile the full input graph. |
| torch_compile_dynamic | None | None, True, False | Whether to use dynamic shape tracing or not. |
| torch_compile_max_kv_cache_size | 400 | 100, 200, 400, 512, 800, 1600, 3200, 6400, 12800, 25600, 51200 or 102400 | The maximum number of new tokens to generate, for LLMs. |
| torch_compile_seqlen_manual_cuda_graph | 100 | 100, 200, 400, 512, 800, 1600, 3200, 6400, 12800, 25600, 51200 or 102400 | The sequence length to use for manual CUDA graph capture, for LLMs. We recommend using a smaller value than max_kv_cache_size to avoid CUDA graph capture overhead. |
| torch_compile_make_portable | False | True, False | Whether to make the compiled model portable, which significantly reduces the warmup time of the model on a different machine. |
| torch_compile_target | model | model or module_list | Whether to compile the model itself or the module list. Compiling the model itself has a longer warmup and could fail in case of graph breaks, but could lead to a slightly faster compiled model. Compiling the module list has a shorter warmup and is more robust to graph breaks, but could be slightly slower. |
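
Because several of these parameters accept only a fixed set of values, it can help to check a set of settings before use. The sketch below transcribes a few of the option lists from the table into a plain dictionary and reports values that are not listed. It is a standalone illustration, not part of the pruna API; `TORCH_COMPILE_OPTIONS` and `invalid_settings` are hypothetical names.

```python
from typing import Any, Dict, List

# Allowed values copied from the table above (only a subset of the parameters).
TORCH_COMPILE_OPTIONS: Dict[str, set] = {
    "torch_compile_mode": {
        "default", "reduce-overhead", "max-autotune", "max-autotune-no-cudagraphs",
    },
    "torch_compile_backend": {
        "inductor", "cudagraphs", "onnxrt", "tvm", "openvino", "openxla",
    },
    "torch_compile_fullgraph": {True, False},
    "torch_compile_dynamic": {None, True, False},
    "torch_compile_target": {"model", "module_list"},
}


def invalid_settings(settings: Dict[str, Any]) -> List[str]:
    """Return one message per setting that is unknown or not a documented option."""
    problems = []
    for name, value in settings.items():
        allowed = TORCH_COMPILE_OPTIONS.get(name)
        if allowed is None:
            problems.append(f"{name}: unknown parameter")
        elif value not in allowed:
            problems.append(f"{name}: {value!r} is not a documented option")
    return problems


good = {"torch_compile_mode": "max-autotune", "torch_compile_target": "module_list"}
bad = {"torch_compile_backend": "tensorrt"}

print(invalid_settings(good))
print(invalid_settings(bad))
```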

Factorizers

Factorization batches several small matrix multiplications into one large fused operation or splits matrix operations into smaller ones.

qkv_diffusers

QKV factorizing fuses the QKV matrices of the denoiser model into a single matrix, reducing the number of operations. In the attention layer, the q, k, and v signals can then be computed all at once: the matrix multiplication involves a larger matrix, but it is one operation instead of three.

References: BFL, GitHub.
Can be applied on: CPU, CUDA, Accelerate distributed.
Required: None.
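
The equivalence behind QKV fusion can be checked with a few lines of plain Python. This is a toy sketch with small integer matrices, not the diffusers implementation: concatenating the three projection matrices column-wise and multiplying once gives the same q, k, and v as three separate multiplications. All names below are illustrative.

```python
from typing import List

Matrix = List[List[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def concat_columns(*matrices: Matrix) -> Matrix:
    """Place matrices with the same row count side by side."""
    return [[value for m in matrices for value in m[i]] for i in range(len(matrices[0]))]


# Toy input: 2 tokens with 2 features each (integers keep the comparison exact).
x = [[1, 2], [3, 4]]
w_q = [[1, 0], [0, 1]]
w_k = [[2, 1], [1, 2]]
w_v = [[0, 3], [3, 0]]

# Three separate projections.
q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)

# One fused projection, followed by a split of the output columns.
fused = matmul(x, concat_columns(w_q, w_k, w_v))
q_fused = [row[0:2] for row in fused]
k_fused = [row[2:4] for row in fused]
v_fused = [row[4:6] for row in fused]

print(q_fused == q and k_fused == k and v_fused == v)
```

The arithmetic is identical in both cases; the benefit on real hardware comes from launching one larger operation instead of three smaller ones.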

Kernels

Kernels are compact, highly optimized routines designed to run as fast and as efficiently as possible on a given piece of hardware.

flash_attn3

Flash Attention 3 is a fast and memory-efficient attention mechanism. It uses a combination of tiling, streaming and fusing to speed up attention computations.

References: GitHub, Kernel Hub.
Can be applied on: CUDA, Accelerate distributed.
Required: None.
Compatible with: fora, torch_compile, torchao.

Pruners

Pruning removes less important or redundant connections and neurons from a model, resulting in a sparser, more efficient network.

torch_structured

Structured pruning removes entire units like neurons, channels, or filters from a network, leading to a more compact and computationally efficient model while preserving a regular structure that standard hardware can easily optimize.

References: GitHub.
Can be applied on: CPU, CUDA.
Required: Dataset.
Compatible with: half, hqq, torch_compile, torchao.

| Parameter | Default | Options | Description |
| --- | --- | --- | --- |
| torch_structured_type | MagnitudeImportance | RandomImportance, MagnitudeImportance, LAMPImportance, TaylorImportance, HessianImportance | Importance criterion for pruning. |
| torch_structured_calibration_samples | 64 | Range 1 to 256 | Number of calibration samples for importance computation. |
| torch_structured_prune_head_dims | False | True, False | Whether to prune head dimensions. |
| torch_structured_prune_num_heads | False | True, False | Whether to prune number of heads. |
| torch_structured_global_pruning | False | True, False | Whether to perform global pruning. |
| torch_structured_sparsity | 0.1 | Range 0.0 to 1.0 | Sparsity level up to which to prune. |
| torch_structured_head_sparsity | 0.0 | Range 0.0 to 1.0 | Sparsity level up to which to prune heads. |
| torch_structured_it_steps | 1 | Range 1 to 10 | Number of iterations for pruning. |
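
As a rough illustration of structured pruning with a magnitude criterion, the sketch below treats each row of a weight matrix as one output neuron and removes the rows with the smallest L1 norm. It is a toy version under that assumption, not the algorithm pruna runs; `prune_neurons` is a hypothetical helper.

```python
from typing import List


def prune_neurons(weights: List[List[float]], sparsity: float) -> List[List[float]]:
    """Drop the fraction `sparsity` of rows (neurons) with the lowest L1 norm."""
    num_to_remove = int(len(weights) * sparsity)
    importance = [sum(abs(w) for w in row) for row in weights]
    # Row indices ordered from least to most important.
    order = sorted(range(len(weights)), key=lambda i: importance[i])
    removed = set(order[:num_to_remove])
    return [row for i, row in enumerate(weights) if i not in removed]


layer = [
    [0.9, -0.8, 0.7],     # strong neuron
    [0.01, 0.02, -0.01],  # nearly inactive
    [0.5, 0.5, -0.5],
    [-0.1, 0.0, 0.1],     # weak neuron
]
pruned = prune_neurons(layer, sparsity=0.5)
print(len(layer), "->", len(pruned), "neurons")
```

The result is a smaller dense matrix, which is why standard hardware benefits directly. In a real network the matching input columns of the next layer have to be removed as well.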

torch_unstructured

Unstructured pruning sets individual weights to 0 based on criteria such as magnitude, resulting in sparse weight matrices that retain the overall model architecture but may require specialized sparse computation support to fully exploit the efficiency gains.

References: GitHub.
Can be applied on: CPU, CUDA.
Required: None.
Compatible with: half.

| Parameter | Default | Options | Description |
| --- | --- | --- | --- |
| torch_unstructured_pruning_method | l1 | random, l1 | Pruning method to use. |
| torch_unstructured_sparsity | 0.1 | Range 0.0 to 1.0 | Sparsity level up to which to prune. |
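
The l1 method can be pictured as follows: rank all weights by absolute value and zero out the smallest ones until the requested sparsity is reached. This is a minimal sketch on nested lists rather than the PyTorch implementation; `prune_by_magnitude` is a hypothetical name.

```python
from typing import List


def prune_by_magnitude(weights: List[List[float]], sparsity: float) -> List[List[float]]:
    """Set the fraction `sparsity` of entries with the smallest magnitude to zero."""
    positions = [
        (abs(w), r, c) for r, row in enumerate(weights) for c, w in enumerate(row)
    ]
    num_to_zero = int(len(positions) * sparsity)
    pruned = [list(row) for row in weights]  # copy, the input stays untouched
    for _, r, c in sorted(positions)[:num_to_zero]:
        pruned[r][c] = 0.0
    return pruned


weights = [[0.9, -0.05, 0.3], [0.02, -0.7, 0.1]]
sparse = prune_by_magnitude(weights, sparsity=0.5)
print(sparse)
```

The matrix keeps its shape; only the number of non-zero entries changes, so speedups depend on sparse computation support.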

Quantizers

Quantization reduces the precision of the model’s weights and activations, making them much smaller in terms of memory required.
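
A minimal sketch of what reducing precision means in practice, assuming plain symmetric int8 rounding with one scale per tensor (the quantizers below use more refined schemes). The function names are illustrative.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using a single scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.003, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))

print(codes, scale)
print(restored, worst_error)
```

Each value is stored in one byte instead of two or four, at the cost of a rounding error of at most half the scale.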

gptq

GPTQ is a post-training quantization technique that independently quantizes each row of the weight matrix to minimize error. The weights are quantized to int4, stored as int32, and then dequantized on the fly to fp16 during inference, resulting in nearly 4x memory savings and faster performance due to custom kernels that take advantage of the lower precision.

References: GitHub.
Can be applied on: CUDA.
Required: Tokenizer, Dataset.
Compatible with: torch_compile.
Required install: You must first install the base package with pip install pruna before installing the GPTQ extension with pip install pruna[gptq] --extra-index-url https://prunaai.pythonanywhere.com/.

| Parameter | Default | Options | Description |
| --- | --- | --- | --- |
| gptq_weight_bits | 4 | 2, 3, 4 or 8 | Sets the number of bits to use for weight quantization. |
| gptq_use_exllama | False | True, False | Whether to use exllama for quantization. |
| gptq_group_size | 128 | 64, 128 or 256 | Group size for quantization. |
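
The phrase "quantized to int4, stored as int32" refers to packing eight 4-bit values into one 32-bit word. The sketch below shows one possible packing order (lowest nibble first). That order is an assumption for illustration, since real GPTQ kernels define their own layout.

```python
from typing import List


def pack_int4(values: List[int]) -> int:
    """Pack eight unsigned 4-bit values into a single 32-bit integer."""
    if len(values) != 8:
        raise ValueError("expected exactly 8 values")
    word = 0
    for i, v in enumerate(values):
        if not 0 <= v <= 15:
            raise ValueError("each value must fit in 4 bits")
        word |= v << (4 * i)  # value i occupies bits 4*i .. 4*i+3
    return word


def unpack_int4(word: int) -> List[int]:
    """Recover the eight 4-bit values from a packed 32-bit integer."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


codes = [1, 2, 3, 4, 5, 6, 7, 8]
word = pack_int4(codes)
print(hex(word))
print(unpack_int4(word))
```

Eight fp16 weights occupy 16 bytes, while the packed word occupies 4. The savings are "nearly" 4x rather than exactly 4x because quantization metadata such as per-group scales has to be stored too.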

half

Converting model parameters to half precision (FP16) reduces memory usage and can accelerate computations on GPUs that support it.

References: GitHub.
Can be applied on: CPU, CUDA, Accelerate distributed.
Required: None.
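
The effect of FP16 on storage and precision can be seen without any ML library, because Python's `struct` module supports the IEEE 754 half-precision format (`"e"`). This sketch only round-trips single numbers; it does not convert a model.

```python
import struct


def to_half_and_back(value: float) -> float:
    """Round a Python float to half precision and convert it back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


# Storage per value: 2 bytes in FP16 versus 4 bytes in FP32.
print(struct.calcsize("<e"), struct.calcsize("<f"))

# Values that are not exactly representable are rounded to the nearest FP16 number.
print(to_half_and_back(0.1))

# 65504 is the largest finite FP16 value and survives the round trip.
print(to_half_and_back(65504.0))
```

The rounding error is small for typical weight magnitudes, which is why half precision usually halves memory use without a visible change in outputs.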

hqq

Half-Quadratic Quantization (HQQ) leverages fast, robust optimization techniques for on-the-fly quantization, eliminating the need for calibration data.

References: GitHub, Article.
Can be applied on: CUDA.
Required: None.
Compatible with: torch_compile, torch_structured.

| Parameter | Default | Options | Description |
| --- | --- | --- | --- |
| hqq_weight_bits | 8 | 2, 4 or 8 | Number of bits to use for quantization. |
| hqq_group_size | 64 | 8, 16, 32, 64 or 128 | Group size for quantization. |
| hqq_compute_dtype | torch.float16 | torch.bfloat16, torch.float16 | Compute dtype for quantization. |
| hqq_use_torchao_kernels | True | True, False | Whether to use the torchao int4 kernels for inference. |
| hqq_force_hf_implementation | False | True, False | Whether or not to bypass the HQQ quantization and use the generic HF quantization. |
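
To show why the group size matters, the following sketch quantizes a short weight vector with plain min/max rounding, once with a single group and once with two groups. This is not the HQQ solver, which optimizes the quantization parameters differently; it only illustrates that each group gets its own scale and offset, so an outlier affects fewer neighbours when groups are smaller.

```python
from typing import List


def quantize_groups(values: List[float], bits: int, group_size: int) -> List[float]:
    """Quantize each group with its own min/max range and return the restored values."""
    levels = 2 ** bits - 1
    restored = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels if hi > lo else 1.0
        for v in group:
            code = round((v - lo) / scale)  # integer code in [0, levels]
            restored.append(lo + code * scale)
    return restored


def total_error(original: List[float], restored: List[float]) -> float:
    return sum(abs(a - b) for a, b in zip(original, restored))


# Seven small weights and one outlier at the end.
weights = [0.10, 0.12, 0.11, 0.13, 0.10, 0.14, 0.12, 8.0]

error_one_group = total_error(weights, quantize_groups(weights, bits=4, group_size=8))
error_two_groups = total_error(weights, quantize_groups(weights, bits=4, group_size=4))

print(error_one_group, error_two_groups)
```

Smaller groups lower the error but require storing more scales and offsets, so the group size is a trade-off between accuracy and memory.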

hqq_diffusers

Half-Quadratic Quantization (HQQ) leverages fast, robust optimization techniques for on-the-fly quantization, eliminating the need for calibration data and making it applicable to any model. This algorithm is specifically adapted for diffusers models.

References: GitHub, Article.
Can be applied on: CUDA.
Required: None.

| Parameter | Default | Options | Description |
| --- | --- | --- | --- |
| hqq_diffusers_weight_bits | 8 | 2, 4 or 8 | Number of bits to use for quantization. |
| hqq_diffusers_group_size | 64 | 8, 16, 32, 64 or 128 | Group size for quantization. |
| hqq_diffusers_backend | torchao_int4 | gemlite, bitblas, torchao_int4 or marlin | Backend to use for quantization. |

diffusers_int8

BitsAndBytes offers a simple method to quantize models to 8-bit or 4-bit precision. The 8-bit mode blends outlier fp16 values with int8 non-outliers to mitigate performance degradation, while 4-bit quantization further compresses the model and is often used with QLoRA for fine-tuning. This algorithm is specifically adapted for diffusers models.

References: GitHub.
Can be applied on: CUDA, Accelerate distributed.
Required: None.

| Parameter | Default | Options | Description |
| --- | --- | --- | --- |
| diffusers_int8_weight_bits | 8 | 4 or 8 | Number of bits to use for quantization. |
| diffusers_int8_double_quant | False | True, False | Whether to enable double quantization. |
| diffusers_int8_enable_fp32_cpu_offload | False | True, False | Whether to enable fp32 cpu offload. |
| diffusers_int8_quant_type | fp4 | fp4, nf4 | Quantization type to use. |
| diffusers_int8_target_modules | None | Unconstrained | Precise choices of which modules to quantize, e.g. {include: ['transformer.*']} to quantize only the transformer in a diffusion pipeline. See the Target Modules documentation for more details. |

llm_int8

BitsAndBytes offers a simple method to quantize models to 8-bit or 4-bit precision. The 8-bit mode blends outlier fp16 values with int8 non-outliers to mitigate performance degradation, while 4-bit quantization further compresses the model and is often used with QLoRA for fine-tuning.

References: GitHub.
Can be applied on: CUDA, Accelerate distributed.
Required: None.
Compatible with: torch_compile.

| Parameter | Default | Options | Description |
| --- | --- | --- | --- |
| llm_int8_weight_bits | 8 | 4 or 8 | Sets the number of bits to use for weight quantization. |
| llm_int8_double_quant | False | True, False | Whether to enable double quantization. |
| llm_int8_enable_fp32_cpu_offload | False | True, False | Whether to enable fp32 cpu offload. |
| llm_int8_quant_type | fp4 | fp4, nf4 | Quantization type to use. |
| llm_int8_target_modules | None | Unconstrained | Precise choices of which modules to quantize, e.g. {include: ['transformer.*']} to quantize only the transformer in a diffusion pipeline. See the Target Modules documentation for more details. |

awq

Activation-aware Weight Quantization (AWQ) is a technique for quantizing the weights of large language models that uses a small calibration dataset. The AWQ algorithm uses the calibration data to derive scaling factors that reduce the dynamic range of the weights while minimizing accuracy loss on the most salient weight values.

References: GitHub.
Can be applied on: CUDA.
Required: Tokenizer, Dataset.
Compatible with: None.

| Parameter | Default | Options | Description |
| --- | --- | --- | --- |
| awq_quant_scheme | W4A16 | W4A16, W4A16_ASYM | Quantization scheme to use. Use symmetric quantization to avoid decompression issues. |

quanto

With Quanto, models with int8/float8 weights and float8 activations maintain nearly full-precision accuracy. Lower bit quantization is also supported. When only weights are quantized and optimized kernels are available, inference latency remains comparable, and device memory usage is roughly reduced in proportion to the bitwidth ratio.

References: GitHub.
Can be applied on: CUDA.
Required: None.
Compatible with: deepcache, qkv_diffusers.

| Parameter | Default | Options | Description |
| --- | --- | --- | --- |
| quanto_weight_bits | qfloat8 | qint2, qint4, qint8 or qfloat8 | Tensor type to use for quantization. |
| quanto_calibrate | True | True, False | Whether to calibrate the model. |
| quanto_target_modules | None | Unconstrained | Precise choices of which modules to quantize, e.g. {include: ['transformer.*']} to quantize only the transformer in a diffusion pipeline. See the Target Modules documentation for more details. |

torch_dynamic

This technique converts model weights to lower precision (typically int8) dynamically at runtime, reducing model size and improving inference speed with minimal impact on accuracy and without the need for calibration data.

References: GitHub.
Can be applied on: CPU, CUDA.
Required: None.
Compatible with: None.

| Parameter | Default | Options | Description |
| --- | --- | --- | --- |
| torch_dynamic_weight_bits | qint8 | quint8 or qint8 | Tensor type to use for quantization. |

torchao

This replaces each nn.Linear in place with a low-precision Tensor subclass via torchao.quantization.quantize_. It uses per-channel uniform affine (“linear”) quantization for weights (e.g. symmetric int8 or int4) and dynamic per-tensor affine quantization for activations (8-bit at runtime). When combined with torch.compile, this can yield substantial inference speedups over the full-precision model.

References: GitHub.
Can be applied on: CPU, CUDA, Accelerate distributed.
Required: None.

| Parameter | Default | Options | Description |
| --- | --- | --- | --- |
| torchao_quant_type | int8dq | int4dq, int4wo, int8dq, int8wo, fp8wo, fp8dq, fp8dqrow | Quantization type: the prefix selects the data format (int4/int8/fp8); wo quantizes only the weights (activations remain in full precision); dq fully quantizes and dequantizes both weights and activations; dqrow also does full quantize-dequantize but computes a separate scale for each row. |
| torchao_excluded_modules | none | none, norm, embedding, norm+embedding | Which types of modules to omit when applying quantization. |