SmashConfig User Manual
SmashConfig is an essential tool in Pruna for configuring parameters to optimize your models. This manual explains how to define and use SmashConfig.
Defining a SmashConfig
Define a SmashConfig using the following code:
from pruna.algorithms.SmashConfig import SmashConfig
smash_config = SmashConfig()
After creating a SmashConfig, you can set the parameters for optimization:
smash_config['task'] = 'text_image_generation'
smash_config['compilers'] = ['diffusers2']
Passing a SmashConfig to the Smash Function
Pass a SmashConfig to the smash function as follows:
from pruna.smash import smash
smashed_model = smash(
    model=pipe,
    api_key='<your-api-key>',  # Replace <your-api-key> with your actual API key
    smash_config=smash_config,
    dataloader=None,  # Optional
)
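Assuming the smashed model keeps the original pipeline's call interface (a reasonable expectation for a drop-in replacement, but verify for your model type), you can then use it as you would the original model:
# Illustrative usage for a text-to-image pipeline; the prompt and the
# .images attribute assume a diffusers-style pipeline interface.
image = smashed_model('a photo of a red bicycle').images[0]
image.save('bicycle.png')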
SmashConfig Parameters
Tasks
The task parameter specifies the type of model you want to optimize. Supported tasks include:
image_classification: Optimize image classification models.
image_instance_segmentation: Optimize instance segmentation models.
image_keypoint_detection: Optimize keypoint detection models.
image_object_detection: Optimize object detection models.
image_semantic_segmentation: Optimize semantic segmentation models.
image_image_generation: Optimize image-to-image generation models.
image_image_inpainting: Optimize image inpainting models.
image_image_control: Optimize image control models.
image_video_generation: Optimize image-to-video generation models.
text_image_generation: Optimize text-to-image generation models.
text_video_generation: Optimize text-to-video generation models.
text_text_generation: Optimize text generation models.
text_text_translation: Optimize text translation models.
text+image_image_generation: Optimize text-and-image-to-image generation models.
audio_text_transcription: Optimize audio-to-text transcription models.
Automatic ML Compression Search
The automatic ML compression search optimally compresses your model using a combination of the compilation, quantization, pruning, and factorization methods supported for your model. To set it up, enter the following:
smash_config['task'] = '<your-task>'  # Replace <your-task> with one of the tasks above
smash_config['pruners'] = None
smash_config['factorizers'] = None
smash_config['quantizers'] = None
smash_config['compilers'] = None
Additionally, you can specify a target metric that you would like the compressed model to optimize. The metric can be any of the following:
- memory_disk_first: Optimize the model for memory usage on disk when loading for the first time.
- memory_disk: Optimize the model for memory usage on disk.
- memory_inference_first: Optimize the model for memory usage during inference for the first time.
- memory_inference: Optimize the model for memory usage during inference.
- token_generation_latency_sync: Optimize the model for token generation latency in synchronous mode.
- token_generation_latency_async: Optimize the model for token generation latency in asynchronous mode.
- token_generation_throughput_sync: Optimize the model for token generation throughput in synchronous mode.
- token_generation_throughput_async: Optimize the model for token generation throughput in asynchronous mode.
- inference_latency_sync: Optimize the model for inference latency in synchronous mode.
- inference_latency_async: Optimize the model for inference latency in asynchronous mode.
- inference_throughput_sync: Optimize the model for inference throughput in synchronous mode.
- inference_throughput_async: Optimize the model for inference throughput in asynchronous mode.
- inference_CO2_emissions: Optimize the model for inference CO2 emissions.
- inference_energy_consumption: Optimize the model for inference energy consumption.
In this case, enter the following:
smash_config['task'] = '<your-task>'  # Replace <your-task> with one of the tasks above
smash_config['pruners'] = None
smash_config['factorizers'] = None
smash_config['quantizers'] = None
smash_config['compilers'] = None
smash_config['target_metric'] = '<your-target-metric>'  # Replace <your-target-metric> with one of the metrics above
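Putting it all together, here is a minimal end-to-end sketch of an automatic compression search for a text generation model. The checkpoint facebook/opt-125m mirrors the examples later in this manual and is a placeholder for your own model:
from transformers import AutoModelForCausalLM
from pruna.algorithms.SmashConfig import SmashConfig
from pruna.smash import smash

# Placeholder model; substitute your own.
model = AutoModelForCausalLM.from_pretrained('facebook/opt-125m')

smash_config = SmashConfig()
smash_config['task'] = 'text_text_generation'
smash_config['pruners'] = None
smash_config['factorizers'] = None
smash_config['quantizers'] = None
smash_config['compilers'] = None
smash_config['target_metric'] = 'inference_latency_sync'

smashed_model = smash(
    model=model,
    api_key='<your-api-key>',  # Replace <your-api-key> with your actual API key
    smash_config=smash_config,
)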
Compression Methods
There are currently two types of optimization methods: compilation and quantization. Pruning and factorization are coming soon (see the end of this manual).
Compilation Methods
Compilation methods optimize the model for specific hardware. Supported methods include:
- all:
Time: 30 minutes.
Quality: Similar to the original model.
Required Argument:
device: 'cpu' or 'cuda'. e.g.
smash_config['device'] = 'cuda'
Optional Argument: None.
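For example, a minimal configuration that selects the all compiler might look like this (the task is illustrative; use the one matching your model):
smash_config['task'] = 'text_text_generation'  # illustrative task
smash_config['compilers'] = ['all']
smash_config['device'] = 'cuda'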
- diffusers:
Time: 1 hour.
Quality: Same as the original model.
Required Argument: None.
Optional Argument: None.
- diffusers2:
Time: A few minutes.
Quality: Same as the original model.
Required Argument: None.
Optional Argument:
save_dir: Working directory during compilation. e.g.
smash_config['save_dir'] = '/tmp/'
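Continuing the text-to-image example from the beginning of this manual, a diffusers2 configuration could look like this:
smash_config['task'] = 'text_image_generation'
smash_config['compilers'] = ['diffusers2']
smash_config['save_dir'] = '/tmp/'  # optional working directory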
- c_translation:
Time: A few minutes.
Quality: Same as the original model.
Required Argument:
tokenizer: Associated tokenizer. e.g.
smash_config['tokenizer'] = AutoTokenizer.from_pretrained('facebook/opt-125m')
Optional Argument:
weight_quantization_bits: 8 or 16 bits (default 16). e.g.
smash_config['weight_quantization_bits'] = 8
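Put together, a c_translation configuration could look like the following sketch (the tokenizer checkpoint mirrors the e.g. above and is a placeholder):
from transformers import AutoTokenizer

smash_config['task'] = 'text_text_translation'
smash_config['compilers'] = ['c_translation']
smash_config['tokenizer'] = AutoTokenizer.from_pretrained('facebook/opt-125m')  # placeholder checkpoint
smash_config['weight_quantization_bits'] = 8  # optional; defaults to 16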
- c_generation:
Time: A few minutes.
Quality: Equivalent to the original model.
Required Argument:
tokenizer: The tokenizer associated with your generation model.
Optional Argument:
weight_quantization_bits: Specify 8 or 16 bits (16 by default).
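A corresponding sketch for c_generation, assuming a tokenizer loaded as in the c_translation example (the checkpoint is a placeholder):
from transformers import AutoTokenizer

smash_config['task'] = 'text_text_generation'
smash_config['compilers'] = ['c_generation']
smash_config['tokenizer'] = AutoTokenizer.from_pretrained('facebook/opt-125m')  # placeholder checkpoint
smash_config['weight_quantization_bits'] = 16  # optional; 16 is the default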
- c_whisper:
Time: A few minutes.
Quality: Same as the original model.
Required Argument:
processor: The processor for your whisper model.
Optional Argument:
weight_quantization_bits: Choose between 8 or 16 bits (16 if unspecified). e.g.
smash_config['weight_quantization_bits'] = 8
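For example, a c_whisper configuration for a transcription model might look like this (the processor checkpoint follows the examples below and is a placeholder):
from transformers import AutoProcessor

smash_config['task'] = 'audio_text_transcription'
smash_config['compilers'] = ['c_whisper']
smash_config['processor'] = AutoProcessor.from_pretrained('openai/whisper-large-v3')  # placeholder checkpoint
smash_config['weight_quantization_bits'] = 8  # optional; defaults to 16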
- ifw:
Time: A few minutes.
Quality: Comparable to the original model.
Required Arguments:
processor: Processor for your whisper model. e.g.
smash_config['processor'] = AutoProcessor.from_pretrained('openai/whisper-large-v3')
device: Target hardware ('cpu' or 'cuda'). e.g.
smash_config['device'] = 'cuda'
Optional Argument: None.
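Put together, an ifw configuration could look like this sketch (the checkpoint is a placeholder):
from transformers import AutoProcessor

smash_config['task'] = 'audio_text_transcription'
smash_config['compilers'] = ['ifw']
smash_config['processor'] = AutoProcessor.from_pretrained('openai/whisper-large-v3')  # placeholder checkpoint
smash_config['device'] = 'cuda'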
- ws2t:
Time: A few minutes.
Quality: Maintains original model performance.
Required Argument:
processor: Processor for your whisper model. e.g.
smash_config['processor'] = AutoProcessor.from_pretrained('openai/whisper-large-v3')
Optional Argument: None.
- step_caching:
Time: A few minutes.
Quality: Very close to the original model.
Required Argument: None.
Optional Argument: None.
- tiling:
Time: A few minutes.
Quality: Not specified.
Required Argument: None.
Optional Argument: None.
- x-fast:
Time: A few minutes.
Quality: Not specified.
Required Argument: None.
Optional Argument:
fn_to_compile: The function to compile. e.g.
smash_config['fn_to_compile'] = 'forward'
save_dir: The working directory during compilation. e.g.
smash_config['save_dir'] = '/tmp/'
- torch_compile:
Time: A few minutes.
Quality: Not specified.
Required Argument: None.
Optional Argument:
cache_dir: The directory to cache the compiled model. e.g.
smash_config['cache_dir'] = '/tmp'
fullgraph: Whether to compile the full graph. e.g.
smash_config['fullgraph'] = True
dynamic: Whether to compile the model dynamically. e.g.
smash_config['dynamic'] = True
mode: The mode to use. e.g.
smash_config['mode'] = 'max-autotune'
backend: The backend to use. e.g.
smash_config['backend'] = 'inductor'
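As a sketch, a torch_compile configuration that sets every optional argument might look like this (the values are illustrative, not recommendations):
smash_config['compilers'] = ['torch_compile']
smash_config['cache_dir'] = '/tmp'
smash_config['fullgraph'] = True
smash_config['dynamic'] = True
smash_config['mode'] = 'max-autotune'
smash_config['backend'] = 'inductor'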
Quantization Methods
Quantization methods reduce the precision of the model's weights and activations, which substantially reduces the memory required at the cost of some quality loss. Supported methods include:
- torch_dynamic:
Time: A few minutes.
Quality: Not specified.
Required Argument: None.
Optional Argument: None.
- torch_static:
Time: A few minutes.
Quality: Not specified.
Required Argument: None.
Optional Argument: None.
- llm-int8:
Time: A few minutes.
Quality: Lower than the original model; 4-bit quantization is worse than 8-bit.
Required Argument:
weight_quantization_bits: 4 or 8 bits. e.g.
smash_config['weight_quantization_bits'] = 8
Optional Argument: None.
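For instance, a minimal llm-int8 setup for a text generation model (the task is illustrative):
smash_config['task'] = 'text_text_generation'  # illustrative task
smash_config['quantizers'] = ['llm-int8']
smash_config['weight_quantization_bits'] = 8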
- gptq:
Time: 30 minutes to a day depending on the size of the model.
Quality: Lower than the original model; quality degrades with fewer bits (2-bit is worse than 3-bit, which is worse than 4-bit, which is worse than 8-bit).
Required Argument:
weight_quantization_bits: 2, 3, 4, or 8 bits. e.g.
smash_config['weight_quantization_bits'] = 4
Optional Argument: None.
- awq:
Time: 30 minutes to a day depending on the size of the model.
Quality: Not specified.
Required Argument: None.
Optional Argument: None.
- hqq:
Time: A few minutes.
Quality: Not specified.
Required Argument:
weight_quantization_bits: 2, 3, 4, or 8 bits. e.g.
smash_config['weight_quantization_bits'] = 4
Optional Argument: None.
- auto-gptq:
Time: 30 minutes to a day depending on the size of the model.
Quality: Not specified.
Required Argument:
weight_quantization_bits: 2, 3, 4, or 8 bits. e.g.
smash_config['weight_quantization_bits'] = 4
Optional Argument: None.
- lit-llm-int8:
Time: A few minutes.
Quality: Not specified.
Required Argument:
weight_quantization_bits: 4 or 8 bits. e.g.
smash_config['weight_quantization_bits'] = 8
Optional Argument: None.
- half:
Time: A few minutes.
Quality: Not specified.
Required Argument: None.
Optional Argument: None.
- quanto:
Time: A few minutes.
Quality: Not specified.
Required Argument:
weight_quantization_bits: The quanto data type for weights. e.g.
smash_config['weight_quantization_bits'] = qint8  # qint8 from the quanto package
activation_quantization_bits: The quanto data type for activations. e.g.
smash_config['activation_quantization_bits'] = qint8
Optional Argument: None.
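Combining both arguments, a quanto sketch could look like this (the import path is an assumption based on the quanto library; adjust to the version you have installed):
from optimum.quanto import qint8  # assumed import for the qint8 data type

smash_config['quantizers'] = ['quanto']
smash_config['weight_quantization_bits'] = qint8
smash_config['activation_quantization_bits'] = qint8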
Pruning
Coming Soon!
Factorization
Coming Soon!