Algorithms Overview

At its core, the pruna package is a framework of compression algorithms. Its consistent interface makes it straightforward to apply and combine diverse compression algorithms. In this section, we introduce all the algorithms you can currently apply with the package. Algorithms marked with “(Pro)” are only available in the pruna_pro package.

pruna wouldn’t be possible without the amazing work of the authors behind these algorithms. 💜 We’re really grateful for their contributions and encourage you to check out their repositories!
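
To give a feel for that interface, here is a minimal sketch of how an algorithm from this overview is typically applied. The checkpoint name and the chosen algorithms are placeholders; the group keys (quantizer, compiler, ...) follow the category names used throughout this page.

```python
from transformers import AutoModelForCausalLM
from pruna import SmashConfig, smash

# Load the model you want to compress (placeholder checkpoint).
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# Pick at most one algorithm per group and set its hyperparameters by name.
smash_config = SmashConfig()
smash_config["quantizer"] = "half"
smash_config["compiler"] = "torch_compile"

# smash() returns a compressed model that keeps the original inference interface.
smashed_model = smash(model=model, smash_config=smash_config)
```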

Batchers

Batching groups multiple inputs together to be processed simultaneously, improving computational efficiency and reducing overall processing time.

ifw

Insanely Fast Whisper is an optimized version of Whisper models that significantly speeds up transcription. It achieves lower latency and higher throughput through low-level code optimizations and efficient batching, making real-time speech recognition more practical.

References: GitHub.
Can be applied on: GPU.
Required: Tokenizer, Processor.
Compatible with: half.

| Parameter | Default | Options | Description |
|---|---|---|---|
| ifw_weight_bits | 16 | 16 or 32 | Sets the number of bits to use for weight quantization. |
| ifw_batch_size | 16 | 1, 2, 4, 8, 16, 32 or 64 | The batch size to use for inference. Higher is faster but needs more memory. |
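
As a rough sketch, configuring ifw for a Whisper checkpoint could look like the following; the model id is a placeholder, and the add_tokenizer/add_processor helpers are assumed to be how your pruna version attaches the required tokenizer and processor.

```python
from transformers import AutoModelForSpeechSeq2Seq
from pruna import SmashConfig, smash

# Placeholder Whisper checkpoint.
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v3")

smash_config = SmashConfig()
smash_config["batcher"] = "ifw"
smash_config["ifw_weight_bits"] = 16   # 16 or 32
smash_config["ifw_batch_size"] = 16    # higher is faster but needs more memory
# ifw requires a tokenizer and a processor (helper names assumed).
smash_config.add_tokenizer("openai/whisper-large-v3")
smash_config.add_processor("openai/whisper-large-v3")

smashed_model = smash(model=model, smash_config=smash_config)
```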

whisper_s2t

WhisperS2T is an optimized speech-to-text pipeline built for Whisper models.

References: GitHub.
Can be applied on: GPU.
Required: Tokenizer, Processor.
Compatible with: c_translate, c_generate, c_whisper, half.

| Parameter | Default | Options | Description |
|---|---|---|---|
| whisper_s2t_int8 | False | True, False | Whether to quantize to int8 for inference. |
| whisper_s2t_batch_size | 16 | 1, 2, 4, 8, 16, 32 or 64 | The batch size to use for inference. Higher is faster but needs more memory. |

Cachers

Caching is a technique used to store intermediate results of computations to speed up subsequent operations, particularly useful in reducing inference time for machine learning models by reusing previously computed results.

deepcache

DeepCache accelerates inference by leveraging the U-Net blocks of diffusion pipelines to reuse high-level features.

References: GitHub, Paper.
Can be applied on: CPU, GPU.
Required: None.
Compatible with: stable_fast, torch_compile, half, hqq_diffusers, diffusers_int8.

| Parameter | Default | Options | Description |
|---|---|---|---|
| deepcache_interval | 2 | 1, 2, 3, 4 or 5 | Interval at which to cache; 1 disables caching. Higher is faster but might affect quality. |
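
A minimal sketch of pairing deepcache with torch_compile on a diffusion pipeline; the pipeline checkpoint and interval value are placeholders.

```python
from diffusers import StableDiffusionPipeline
from pruna import SmashConfig, smash

# Placeholder diffusion pipeline with a U-Net backbone.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

smash_config = SmashConfig()
smash_config["cacher"] = "deepcache"
smash_config["deepcache_interval"] = 3      # higher is faster, may affect quality
smash_config["compiler"] = "torch_compile"  # deepcache is compatible with torch_compile

smashed_pipe = smash(model=pipe, smash_config=smash_config)
```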

adaptive (Pro)

Adaptive caching adjusts caching dynamically for each prompt, determining the optimal inference steps to reuse cached outputs.

References: None.
Can be applied on: CPU, GPU.
Required: None.
Compatible with: hyper, torch_compile, stable_fast, hqq_diffusers, diffusers_int8.

| Parameter | Default | Options | Description |
|---|---|---|---|
| adaptive_threshold | 0.01 | Range 0.001 to 0.2 | How much the difference between the current and previous latent can be before caching. Higher is faster, but reduces quality. |
| adaptive_max_skip_steps | 4 | 1, 2, 3, 4 or 5 | How many steps are allowed to be skipped in a row. Higher is faster, but reduces quality. |

auto (Pro)

Given a speed_factor (e.g., 0.5 to halve latency), auto caching determines the optimal caching schedule to achieve the desired latency reduction.

References: None.
Can be applied on: CPU, GPU.
Required: None.
Compatible with: hyper, torch_compile, stable_fast, hqq_diffusers, diffusers_int8.

| Parameter | Default | Options | Description |
|---|---|---|---|
| auto_speed_factor | 0.5 | Range 0.0 to 1.0 | Controls inference latency. Lower values yield faster inference but may compromise quality. |
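
A hedged sketch of enabling auto caching (a pruna_pro algorithm; the import is assumed to mirror the pruna interface, and the pipeline checkpoint is a placeholder):

```python
from diffusers import StableDiffusionPipeline
from pruna_pro import SmashConfig, smash  # assumed to mirror the pruna interface

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")  # placeholder

smash_config = SmashConfig()
smash_config["cacher"] = "auto"
smash_config["auto_speed_factor"] = 0.5  # aim to roughly halve latency

smashed_pipe = smash(model=pipe, smash_config=smash_config)
```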

flux_caching (Pro)

Flux caching works similarly to periodic caching, but stores outputs of the transformer blocks instead of the output of the whole backbone.

References: None.
Can be applied on: CPU, GPU.
Required: None.
Compatible with: hyper, torch_compile, stable_fast, diffusers_int8.

| Parameter | Default | Options | Description |
|---|---|---|---|
| flux_caching_cache_interval | 2 | 1, 2, 3, 4, 5, 6 or 7 | How many model steps to skip in a row. Higher is faster, but reduces quality. |
| flux_caching_start_step | 2 | 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10 | How many steps to wait before starting to cache. |

periodic (Pro)

After a configurable start_step, periodic caching computes the output of the backbone (can be a UNet or a Transformer) every cache_interval steps and reuses this cached output for the remaining steps.

References: None.
Can be applied on: CPU, GPU.
Required: None.
Compatible with: hyper, torch_compile, stable_fast, hqq_diffusers, diffusers_int8.

| Parameter | Default | Options | Description |
|---|---|---|---|
| periodic_cache_interval | 2 | 1, 2, 3, 4, 5, 6 or 7 | How many model steps to skip in a row. Higher is faster, but reduces quality. |
| periodic_start_step | 2 | 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10 | How many steps to wait before starting to cache. |

Compilers

Compilation optimizes the model for specific hardware.

c_generate

CGenerate employs a custom runtime that leverages optimizations like weight quantization, layer fusion, and batch reordering to boost performance and reduce memory usage on both CPUs and GPUs for Causal LM models.

References: GitHub.
Can be applied on: GPU.
Required: Tokenizer.
Compatible with: whisper_s2t, half.

| Parameter | Default | Options | Description |
|---|---|---|---|
| c_generate_weight_bits | 16 | 8 or 16 | Sets the number of bits to use for weight quantization. |

c_translate

CTranslate employs a custom runtime that leverages optimizations like weight quantization, layer fusion, and batch reordering to boost performance and reduce memory usage on both CPUs and GPUs for Causal LM models used for Translation.

References: GitHub.
Can be applied on: GPU.
Required: Tokenizer.
Compatible with: whisper_s2t, half.

| Parameter | Default | Options | Description |
|---|---|---|---|
| c_translate_weight_bits | 16 | 8 or 16 | Sets the number of bits to use for weight quantization. |

c_whisper

CWhisper employs a custom runtime that leverages optimizations like weight quantization, layer fusion, and batch reordering to boost performance and reduce memory usage on both CPUs and GPUs for Whisper models.

References: GitHub.
Can be applied on: GPU.
Required: Processor.
Compatible with: whisper_s2t, half.

| Parameter | Default | Options | Description |
|---|---|---|---|
| c_whisper_weight_bits | 16 | 8 or 16 | Sets the number of bits to use for weight quantization. |

onediff

OneDiff achieves acceleration by converting diffusion model modules into optimized static graphs via PyTorch module compilation. This process fuses operations, applies low-level GPU kernel optimizations, and supports dynamic input shapes without the overhead of re-compilation.

References: GitHub.
Can be applied on: GPU.
Required: None.
Compatible with: half.
Required install: pip install pruna[onediff] or pip install pruna[full].

stable_fast

Stable-fast is an optimization framework for Image-Gen models. It accelerates inference by fusing key operations into optimized kernels and converting diffusion pipelines into efficient TorchScript graphs.

References: GitHub.
Can be applied on: GPU.
Required: None.
Compatible with: deepcache, half.
Required install: pip install pruna[stable-fast] or pip install pruna[stable-fast-cu11] --extra-index-url https://prunaai.pythonanywhere.com/ or pip install pruna[full].

torch_compile

torch_compile optimizes a given model or function using various backends and is compatible with any model containing PyTorch modules.

References: GitHub.
Can be applied on: CPU, GPU.
Required: None.
Compatible with: half, deepcache.

| Parameter | Default | Options | Description |
|---|---|---|---|
| torch_compile_mode | default | default, reduce-overhead, max-autotune, max-autotune-no-cudagraphs | Compilation mode. |
| torch_compile_backend | inductor | inductor, cudagraphs, onnxrt, tvm, openvino, openxla | Compilation backend. |
| torch_compile_fullgraph | True | True, False | Whether to compile the full input graph instead of discovering compilable subgraphs. |
| torch_compile_dynamic | None | None, True, False | Whether to use dynamic shape tracing. |
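
For instance, a sketch of compiling an arbitrary PyTorch model with a non-default mode; the model and parameter values are placeholders.

```python
from torchvision.models import resnet18
from pruna import SmashConfig, smash

# Any model built from PyTorch modules works; this one is a placeholder.
model = resnet18(weights=None)

smash_config = SmashConfig()
smash_config["compiler"] = "torch_compile"
smash_config["torch_compile_mode"] = "max-autotune"
smash_config["torch_compile_backend"] = "inductor"
smash_config["torch_compile_fullgraph"] = True

smashed_model = smash(model=model, smash_config=smash_config)
```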

x_fast (Pro)

Based on stable_fast, this compiler reduces inference latency for any model using a combination of xformers, triton, cudnn, and torch tracing.

References: None.
Can be applied on: GPU.
Required: None.
Compatible with: quanto, half, text_to_text_lora, text_to_image_lora.
Required install: pip install pruna[stable-fast] or pip install pruna[stable-fast-cu11] or pip install pruna[full].

| Parameter | Default | Options | Description |
|---|---|---|---|
| x_fast_xformers | True | True, False | Whether to use xformers for faster inference. |

ipex_llm (Pro)

This compiler leverages advanced graph optimizations, quantization, and kernel fusion techniques to accelerate PyTorch-based LLM inference on Intel CPUs.

References: GitHub.
Can be applied on: CPU.
Required: None.
Compatible with: half.
Required install: pip install pruna_pro[intel] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/cpu/cn/ or pip install pruna_pro[full] --extra-index-url https://prunaai.pythonanywhere.com/ --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/cpu/cn/.

| Parameter | Default | Options | Description |
|---|---|---|---|
| ipex_llm_weight_bits | 8 | 8 or 4 | The number of bits to use for weight quantization. |

Distillers

Distillation trains a smaller, simpler model to mimic a larger, more complex model.

hyper (Pro)

Hyper-SD is a distillation framework that segments the diffusion process into time-step groups to preserve and reformulate the ODE trajectory. By integrating human feedback and score distillation, it enables near-lossless performance with drastically fewer inference steps.

References: Paper.
Can be applied on: CPU, GPU.
Required: None.
Compatible with: half, diffusers_int8, deepcache, auto, adaptive, flux_caching, periodic, torch_compile, stable_fast.

| Parameter | Default | Options | Description |
|---|---|---|---|
| hyper_agressive | False | True, False | When set to True, the model is distilled to even fewer steps. |
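
A hedged sketch of enabling hyper (a pruna_pro algorithm; the import is assumed to mirror the pruna interface, and the pipeline checkpoint is a placeholder):

```python
from diffusers import StableDiffusionXLPipeline
from pruna_pro import SmashConfig, smash  # assumed to mirror the pruna interface

pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")  # placeholder

smash_config = SmashConfig()
smash_config["distiller"] = "hyper"
smash_config["hyper_agressive"] = False  # True distills to even fewer steps

smashed_pipe = smash(model=pipe, smash_config=smash_config)
```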

Pruners

Pruning removes less important or redundant connections and neurons from a model, resulting in a sparser, more efficient network.

torch_structured

Structured pruning removes entire units like neurons, channels, or filters from a network, leading to a more compact and computationally efficient model while preserving a regular structure that standard hardware can easily optimize.

References: GitHub.
Can be applied on: CPU, GPU.
Required: Dataset.
Compatible with: half.

| Parameter | Default | Options | Description |
|---|---|---|---|
| torch_structured_type | MagnitudeImportance | RandomImportance, MagnitudeImportance, LAMPImportance, TaylorImportance, HessianImportance | Importance criterion for pruning. |
| torch_structured_calibration_samples | 64 | Range 1 to 256 | Number of calibration samples for importance computation. |
| torch_structured_prune_head_dims | False | True, False | Whether to prune head dimensions. |
| torch_structured_prune_num_heads | False | True, False | Whether to prune the number of heads. |
| torch_structured_global_pruning | False | True, False | Whether to perform global pruning. |
| torch_structured_sparsity | 0.1 | Range 0.0 to 1.0 | Sparsity level up to which to prune. |
| torch_structured_head_sparsity | 0.0 | Range 0.0 to 1.0 | Sparsity level up to which to prune heads. |
| torch_structured_it_steps | 1 | Range 1 to 10 | Number of iterations for pruning. |
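
A minimal sketch of structured pruning with a calibration dataset; the model is a placeholder, and the add_data helper and dataset name are assumptions about how your pruna version attaches the required dataset.

```python
from torchvision.models import resnet18
from pruna import SmashConfig, smash

model = resnet18(weights=None)  # placeholder model

smash_config = SmashConfig()
smash_config["pruner"] = "torch_structured"
smash_config["torch_structured_type"] = "MagnitudeImportance"
smash_config["torch_structured_sparsity"] = 0.1
smash_config["torch_structured_calibration_samples"] = 64
# torch_structured requires a dataset for importance computation (helper and name assumed).
smash_config.add_data("ImageNet")

smashed_model = smash(model=model, smash_config=smash_config)
```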

torch_unstructured

Unstructured pruning sets individual weights to 0 based on criteria such as magnitude, resulting in sparse weight matrices that retain the overall model architecture but may require specialized sparse computation support to fully exploit the efficiency gains.

References: GitHub.
Can be applied on: CPU, GPU.
Required: None.
Compatible with: half.

| Parameter | Default | Options | Description |
|---|---|---|---|
| torch_unstructured_pruning_method | l1 | random, l1 | Pruning method to use. |
| torch_unstructured_sparsity | 0.1 | Range 0.0 to 1.0 | Sparsity level up to which to prune. |

Quantizers

Quantization reduces the precision of the model’s weights and activations, making them much smaller in terms of memory required.

half

Converting model parameters to half precision (FP16) reduces memory usage and can accelerate computations on GPUs that support it.

References: GitHub.
Can be applied on: CPU, GPU.
Required: None.
Compatible with: ifw, whisper_s2t, deepcache, c_translate, c_generate, c_whisper, stable_fast, onediff, torch_compile, torch_structured, torch_unstructured.

hqq

Half-Quadratic Quantization (HQQ) leverages fast, robust optimization techniques for on-the-fly quantization, eliminating the need for calibration data.

References: GitHub, Article.
Can be applied on: GPU.
Required: None.
Compatible with: None.

| Parameter | Default | Options | Description |
|---|---|---|---|
| hqq_weight_bits | 8 | 2, 4 or 8 | Number of bits to use for quantization. |
| hqq_group_size | 64 | 8, 16, 32, 64 or 128 | Group size for quantization. |
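
A minimal sketch of 4-bit HQQ quantization of an LLM; the checkpoint is a placeholder.

```python
from transformers import AutoModelForCausalLM
from pruna import SmashConfig, smash

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # placeholder LLM

smash_config = SmashConfig()
smash_config["quantizer"] = "hqq"
smash_config["hqq_weight_bits"] = 4    # 2, 4 or 8
smash_config["hqq_group_size"] = 64

smashed_model = smash(model=model, smash_config=smash_config)
```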

hqq_diffusers

Half-Quadratic Quantization (HQQ) leverages fast, robust optimization techniques for on-the-fly quantization, eliminating the need for calibration data and making it applicable to any model. This algorithm is specifically adapted for diffusers models.

References: GitHub, Article.
Can be applied on: GPU.
Required: None.
Compatible with: deepcache.

| Parameter | Default | Options | Description |
|---|---|---|---|
| hqq_diffusers_weight_bits | 8 | 2, 4 or 8 | Number of bits to use for quantization. |
| hqq_diffusers_group_size | 64 | 8, 16, 32, 64 or 128 | Group size for quantization. |
| hqq_diffusers_backend | torchao_int4 | gemlite, bitblas, torchao_int4 or marlin | Backend to use for quantization. |

awq

Activation-aware Weight Quantization (AWQ) selectively quantizes model weights using a calibration dataset, preserving the small fraction of weights that is most important for maintaining LLM performance. This minimizes quantization loss, allowing models to operate at 4-bit precision without significantly sacrificing accuracy.

References: GitHub.
Can be applied on: GPU.
Required: Dataset.
Compatible with: None.
Required install: pip install pruna[autoawq] or pip install pruna[full].

| Parameter | Default | Options | Description |
|---|---|---|---|
| awq_group_size | 128 | 8, 16, 32, 64 or 128 | Group size for quantization. |

diffusers_int8

BitsAndBytes offers a simple method to quantize models to 8-bit or 4-bit precision. The 8-bit mode blends outlier fp16 values with int8 non-outliers to mitigate performance degradation, while 4-bit quantization further compresses the model and is often used with QLoRA for fine-tuning. This algorithm is specifically adapted for diffusers models.

References: GitHub.
Can be applied on: GPU.
Required: None.
Compatible with: deepcache.

| Parameter | Default | Options | Description |
|---|---|---|---|
| diffusers_int8_weight_bits | 8 | 4 or 8 | Number of bits to use for quantization. |
| diffusers_int8_double_quant | False | True, False | Whether to enable double quantization. |
| diffusers_int8_enable_fp32_cpu_offload | False | True, False | Whether to enable fp32 CPU offload. |
| diffusers_int8_quant_type | fp4 | fp4, nf4 | Quantization type to use. |
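
Because diffusers_int8 is compatible with deepcache, quantization and caching can be stacked in one config. A sketch; the pipeline checkpoint and values are placeholders.

```python
from diffusers import StableDiffusionXLPipeline
from pruna import SmashConfig, smash

pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")  # placeholder

smash_config = SmashConfig()
smash_config["quantizer"] = "diffusers_int8"
smash_config["diffusers_int8_weight_bits"] = 4
smash_config["diffusers_int8_quant_type"] = "nf4"
smash_config["cacher"] = "deepcache"  # quantization and caching can be combined

smashed_pipe = smash(model=pipe, smash_config=smash_config)
```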

gptq

GPTQ is a post-training quantization technique that independently quantizes each row of the weight matrix to minimize error. The weights are quantized to int4, stored as int32, and then dequantized on the fly to fp16 during inference, resulting in nearly 4x memory savings and faster performance due to custom kernels that take advantage of the lower precision.

References: GitHub.
Can be applied on: GPU.
Required: Tokenizer, Dataset.
Compatible with: None.

| Parameter | Default | Options | Description |
|---|---|---|---|
| gptq_weight_bits | 8 | 2, 4 or 8 | Sets the number of bits to use for weight quantization. |
| gptq_use_exllama | True | True, False | Whether to use exllama for quantization. |
| gptq_group_size | 128 | 64, 128 or 256 | Group size for quantization. |
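
Since gptq needs a tokenizer and a calibration dataset, a sketch could look like the following; the checkpoint, dataset name, and the add_tokenizer/add_data helpers are assumptions.

```python
from transformers import AutoModelForCausalLM
from pruna import SmashConfig, smash

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # placeholder LLM

smash_config = SmashConfig()
smash_config["quantizer"] = "gptq"
smash_config["gptq_weight_bits"] = 4
smash_config["gptq_group_size"] = 128
# gptq requires a tokenizer and a calibration dataset (helpers and dataset name assumed).
smash_config.add_tokenizer("facebook/opt-125m")
smash_config.add_data("WikiText")

smashed_model = smash(model=model, smash_config=smash_config)
```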

llm_int8

BitsAndBytes offers a simple method to quantize models to 8-bit or 4-bit precision. The 8-bit mode blends outlier fp16 values with int8 non-outliers to mitigate performance degradation, while 4-bit quantization further compresses the model and is often used with QLoRA for fine-tuning.

References: GitHub.
Can be applied on: GPU.
Required: None.
Compatible with: None.

| Parameter | Default | Options | Description |
|---|---|---|---|
| llm_int8_weight_bits | 8 | 4 or 8 | Sets the number of bits to use for weight quantization. |
| llm_int8_double_quant | False | True, False | Whether to enable double quantization. |
| llm_int8_enable_fp32_cpu_offload | False | True, False | Whether to enable fp32 CPU offload. |
| llm_int8_quant_type | fp4 | fp4, nf4 | Quantization type to use. |

quanto

With Quanto, models with int8/float8 weights and float8 activations maintain nearly full-precision accuracy. Lower bit quantization is also supported. When only weights are quantized and optimized kernels are available, inference latency remains comparable, and device memory usage is roughly reduced in proportion to the bitwidth ratio.

References: GitHub.
Can be applied on: GPU.
Required: None.
Compatible with: None.

| Parameter | Default | Options | Description |
|---|---|---|---|
| quanto_weight_bits | qfloat8 | qint2, qint4, qint8 or qfloat8 | Tensor type to use for quantization. |
| quanto_calibrate | True | True, False | Whether to calibrate the model. |

torch_dynamic

This technique converts model weights to lower precision (typically int8) dynamically at runtime, reducing model size and improving inference speed with minimal impact on accuracy and without the need for calibration data.

References: GitHub.
Can be applied on: CPU, GPU.
Required: None.
Compatible with: None.

| Parameter | Default | Options | Description |
|---|---|---|---|
| torch_dynamic_weight_bits | qint8 | quint8 or qint8 | Tensor type to use for quantization. |

torch_static

In static quantization, both weights and activations are pre-converted to lower precision (e.g., int8) using a calibration process on representative data, which typically yields greater efficiency gains but requires additional steps during model preparation.

References: GitHub.
Can be applied on: CPU, GPU.
Required: Dataset.
Compatible with: None.

| Parameter | Default | Options | Description |
|---|---|---|---|
| torch_static_weight_bits | qint8 | quint8 or qint8 | Tensor type to use for weight quantization. |
| torch_static_act_bits | qint8 | quint8 or qint8 | Tensor type to use for activation quantization. |
| torch_static_qscheme | per_tensor_affine | per_tensor_symmetric, per_tensor_affine | Quantization scheme to use. |
| torch_static_qobserver | MinMaxObserver | MinMaxObserver, MovingAverageMinMaxObserver, PerChannelMinMaxObserver, HistogramObserver | Observer to use for quantization. |

torchao_autoquant (Pro)

This algorithm compiles, quantizes and sparsifies weights, gradients, and activations for inference. This algorithm is specifically adapted for Image-Gen models.

References: GitHub.
Can be applied on: CPU, GPU.
Required: None.
Compatible with: None.

| Parameter | Default | Options | Description |
|---|---|---|---|
| torchao_autoquant_compile | True | True, False | Whether to compile the model after quantization. |

higgs (Pro)

HIGGS is a zero-shot quantization method that uses Hadamard preprocessing to transform weights and then selects MSE-optimal quantization grids.

References: Paper.
Can be applied on: GPU.
Required: None.
Compatible with: torch_compile, torch_unstructured.
Required install: pip install pruna_pro[higgs] --extra-index-url https://prunaai.pythonanywhere.com/ or pip install pruna_pro[higgs-cu11] --extra-index-url https://prunaai.pythonanywhere.com/ or pip install pruna_pro[full] --extra-index-url https://prunaai.pythonanywhere.com/ --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/cpu/cn/.

| Parameter | Default | Options | Description |
|---|---|---|---|
| higgs_weight_bits | 4 | 2, 3 or 4 | The number of bits to use for weight quantization. |
| higgs_p | 2 | 1 or 2 | The number of groups to use for weight quantization. |
| higgs_group_size | 256 | 64, 128 or 256 | The size of each group. |
| higgs_hadamard_size | 1024 | 512, 1024 or 2048 | The size of the Hadamard matrix. |
| higgs_example_batch_size | 1 | 1, 2, 4, 8 or 16 | The batch size used when running inference. Keep this equal to your inference batch size to take advantage of faster CUDA kernels. |

Recoverers

Recovery (experimental) restores the performance of a model after compression.

text_to_text_perp (Pro)

This recoverer is a general-purpose PERP recoverer for text-to-text models, using norm, head, and bias finetuning and optionally HuggingFace’s LoRA.

References: GitHub, Paper.
Can be applied on: CPU, GPU.
Required: Tokenizer, Dataset.
Compatible with: half, quanto, torch_dynamic, llm_int8, torch_compile, x_fast.

| Parameter | Default | Options | Description |
|---|---|---|---|
| text_to_text_perp_lora_r | 8 | 4, 8, 16, 32, 64 or 128 | Rank of the LoRA layers. |
| text_to_text_perp_lora_alpha_r_ratio | 2.0 | 0.5, 1.0 or 2.0 | Alpha/rank ratio of the LoRA layers. |
| text_to_text_perp_lora_target_modules | None | None, all-linear | Target modules for the LoRA layers. |
| text_to_text_perp_batch_size | 1 | Range 1 to 4096 | Batch size for finetuning. |
| text_to_text_perp_gradient_accumulation_steps | 1 | Range 1 to 1024 | Number of gradient accumulation steps for finetuning. |
| text_to_text_perp_num_epochs | 1.0 | Range 0.0 to 4096.0 | Number of epochs for finetuning. |
| text_to_text_perp_learning_rate | 0.0002 | Range 0.0 to 1.0 | Learning rate for finetuning. |
| text_to_text_perp_report_to | none | none, wandb, tensorboard | Where to report the finetuning results. |
| text_to_text_perp_optimizer | AdamW8bit | AdamW, AdamW8bit, PagedAdamW8bit | Which optimizer to use for finetuning. |
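
A hedged sketch of recovering a quantized LLM with text_to_text_perp (a pruna_pro algorithm; the import is assumed to mirror the pruna interface, the checkpoint and dataset name are placeholders, and the add_tokenizer/add_data helpers are assumptions):

```python
from transformers import AutoModelForCausalLM
from pruna_pro import SmashConfig, smash  # assumed to mirror the pruna interface

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # placeholder LLM

smash_config = SmashConfig()
smash_config["quantizer"] = "llm_int8"           # compress first
smash_config["recoverer"] = "text_to_text_perp"  # then recover quality via PERP finetuning
smash_config["text_to_text_perp_lora_r"] = 8
smash_config["text_to_text_perp_num_epochs"] = 1.0
# The recoverer requires a tokenizer and a finetuning dataset (helpers and name assumed).
smash_config.add_tokenizer("facebook/opt-125m")
smash_config.add_data("WikiText")

smashed_model = smash(model=model, smash_config=smash_config)
```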

text_to_text_inplace_perp (Pro)

This is the same as text_to_text_perp, but without the LoRA layers, which add extra computation and thus slow down inference of the final model.

References: GitHub, Paper.
Can be applied on: CPU, GPU.
Required: Tokenizer, Dataset.
Compatible with: half, quanto, torch_dynamic, llm_int8, torch_compile, x_fast.

| Parameter | Default | Options | Description |
|---|---|---|---|
| text_to_text_inplace_perp_batch_size | 1 | Range 1 to 4096 | Batch size for finetuning. |
| text_to_text_inplace_perp_gradient_accumulation_steps | 1 | Range 1 to 1024 | Number of gradient accumulation steps for finetuning. |
| text_to_text_inplace_perp_num_epochs | 1.0 | Range 0.0 to 4096.0 | Number of epochs for finetuning. |
| text_to_text_inplace_perp_learning_rate | 0.0002 | Range 0.0 to 1.0 | Learning rate for finetuning. |
| text_to_text_inplace_perp_report_to | none | none, wandb, tensorboard | Where to report the finetuning results. |
| text_to_text_inplace_perp_optimizer | AdamW8bit | AdamW, AdamW8bit, PagedAdamW8bit | Which optimizer to use for finetuning. |

text_to_image_perp (Pro)

This recoverer is a general-purpose PERP recoverer for text-to-image models, using norm, head, and bias finetuning and optionally HuggingFace’s LoRA.

References: GitHub, Paper.
Can be applied on: GPU.
Required: Dataset.
Compatible with: quanto, torch_dynamic, diffusers_int8, deepcache, flux_caching, torch_compile, x_fast.

| Parameter | Default | Options | Description |
|---|---|---|---|
| text_to_image_perp_lora_r | 4 | 4, 8, 16, 32, 64 or 128 | Rank of the LoRA layers. |
| text_to_image_perp_lora_alpha_r_ratio | 1.0 | 0.5, 1.0 or 2.0 | Alpha/rank ratio of the LoRA layers. |
| text_to_image_perp_batch_size | 0 | Range 0 to 4096 | Batch size for finetuning. |
| text_to_image_perp_gradient_accumulation_steps | 1 | Range 1 to 1024 | Number of gradient accumulation steps for finetuning. |
| text_to_image_perp_num_epochs | 1.0 | Range 0.0 to 4096.0 | Number of epochs for finetuning. |
| text_to_image_perp_learning_rate | 1e-05 | Range 0.0 to 1.0 | Learning rate for finetuning. |
| text_to_image_perp_use_cpu_offloading | True | True, False | Whether to use CPU offloading for finetuning. |
| text_to_image_perp_optimizer | AdamW8bit | AdamW8bit, AdamW, Adam | Which optimizer to use for finetuning. |

text_to_image_inplace_perp (Pro)

This is the same as text_to_image_perp, but without the LoRA layers, which add extra computation and thus slow down inference of the final model.

References: GitHub, Paper.
Can be applied on: GPU.
Required: Dataset.
Compatible with: quanto, torch_dynamic, diffusers_int8, deepcache, flux_caching, torch_compile, x_fast.

| Parameter | Default | Options | Description |
|---|---|---|---|
| text_to_image_inplace_perp_batch_size | 0 | Range 0 to 4096 | Batch size for finetuning. |
| text_to_image_inplace_perp_gradient_accumulation_steps | 1 | Range 1 to 1024 | Number of gradient accumulation steps for finetuning. |
| text_to_image_inplace_perp_num_epochs | 1.0 | Range 0.0 to 4096.0 | Number of epochs for finetuning. |
| text_to_image_inplace_perp_learning_rate | 1e-05 | Range 0.0 to 1.0 | Learning rate for finetuning. |
| text_to_image_inplace_perp_use_cpu_offloading | True | True, False | Whether to use CPU offloading for finetuning. |
| text_to_image_inplace_perp_optimizer | AdamW8bit | AdamW8bit, AdamW, Adam | Which optimizer to use for finetuning. |