SmashConfig User Manual
SmashConfig is an essential tool in Pruna for configuring parameters to optimize your models. This manual explains how to define and use SmashConfig.
Defining a SmashConfig
Define a SmashConfig using the following code:
from pruna.algorithms.SmashConfig import SmashConfig
smash_config = SmashConfig()
After creating a SmashConfig, you can set the parameters for optimization:
smash_config['task'] = 'text_image_generation'
smash_config['compilers'] = ['diffusers2']
Passing a SmashConfig to the Smash Function
Pass a SmashConfig to the smash function as follows:
from pruna.smash import smash
smashed_model = smash(
    model=pipe,
    api_key='<your-api-key>',  # Replace <your-api-key> with your actual API key
    smash_config=smash_config,
    dataloader=None,  # Optional
)
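Assuming the smashed model keeps the original pipeline's call interface (a reasonable expectation for a drop-in replacement, but verify for your model type), you can then use it as you would the original model:
# Illustrative usage for a text-to-image pipeline; the prompt and the
# .images attribute assume a diffusers-style pipeline interface.
image = smashed_model('a photo of a red bicycle').images[0]
image.save('bicycle.png')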
SmashConfig Parameters
Tasks
The task parameter specifies the type of model you want to optimize. Supported tasks include:
image_classification: Optimize image classification models.
image_instance_segmentation: Optimize instance segmentation models.
image_keypoint_detection: Optimize keypoint detection models.
image_object_detection: Optimize object detection models.
image_semantic_segmentation: Optimize semantic segmentation models.
image_image_generation: Optimize image-to-image generation models.
image_image_inpainting: Optimize image inpainting models.
image_image_control: Optimize image control models.
image_video_generation: Optimize image-to-video generation models.
text_image_generation: Optimize text-to-image generation models.
text_video_generation: Optimize text-to-video generation models.
text_text_generation: Optimize text generation models.
text_text_translation: Optimize text translation models.
text+image_image_generation: Optimize text-and-image-to-image generation models.
audio_text_transcription: Optimize audio-to-text transcription models.
Automatic ML Compression Search
The automatic ML compression search optimally compresses your model using a combination of the compilation, quantization, pruning, and factorization methods supported for your model. To set it up, enter the following:
smash_config['task'] = '<your-task>'  # Replace <your-task> with one of the tasks above
smash_config['pruners'] = None
smash_config['factorizers'] = None
smash_config['quantizers'] = None
smash_config['compilers'] = None
Additionally, you can specify a target metric that you would like the compressed model to optimize. The metric can be any of the following:
- memory_disk_first: Optimize the model for memory usage on disk when loading for the first time.
- memory_disk: Optimize the model for memory usage on disk.
- memory_inference_first: Optimize the model for memory usage during inference for the first time.
- memory_inference: Optimize the model for memory usage during inference.
- token_generation_latency_sync: Optimize the model for token generation latency in synchronous mode.
- token_generation_latency_async: Optimize the model for token generation latency in asynchronous mode.
- token_generation_throughput_sync: Optimize the model for token generation throughput in synchronous mode.
- token_generation_throughput_async: Optimize the model for token generation throughput in asynchronous mode.
- inference_latency_sync: Optimize the model for inference latency in synchronous mode.
- inference_latency_async: Optimize the model for inference latency in asynchronous mode.
- inference_throughput_sync: Optimize the model for inference throughput in synchronous mode.
- inference_throughput_async: Optimize the model for inference throughput in asynchronous mode.
- inference_CO2_emissions: Optimize the model for inference CO2 emissions.
- inference_energy_consumption: Optimize the model for inference energy consumption.
In this case, enter the following:
smash_config['task'] = '<your-task>'  # Replace <your-task> with one of the tasks above
smash_config['pruners'] = None
smash_config['factorizers'] = None
smash_config['quantizers'] = None
smash_config['compilers'] = None
smash_config['target_metric'] = '<your-target-metric>'  # Replace <your-target-metric> with one of the metrics above
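Putting it all together, here is a minimal end-to-end sketch of an automatic compression search for a text generation model. The checkpoint facebook/opt-125m mirrors the examples later in this manual and is a placeholder for your own model:
from transformers import AutoModelForCausalLM
from pruna.algorithms.SmashConfig import SmashConfig
from pruna.smash import smash

# Placeholder model; substitute your own.
model = AutoModelForCausalLM.from_pretrained('facebook/opt-125m')

smash_config = SmashConfig()
smash_config['task'] = 'text_text_generation'
smash_config['pruners'] = None
smash_config['factorizers'] = None
smash_config['quantizers'] = None
smash_config['compilers'] = None
smash_config['target_metric'] = 'inference_latency_sync'

smashed_model = smash(
    model=model,
    api_key='<your-api-key>',  # Replace <your-api-key> with your actual API key
    smash_config=smash_config,
)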
Compression Methods
There are currently two types of optimization methods: compilation and quantization. Pruning and factorization are coming soon (see the end of this manual).
Compilation Methods
Compilation methods optimize the model for specific hardware. Supported methods include:
- all:
Time: 30 minutes.
Quality: Similar to the original model.
Required Argument:
device: 'cpu' or 'cuda'. e.g.
smash_config['device'] = 'cuda'
Optional Argument: None.
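For example, a minimal configuration that selects the all compiler might look like this (the task is illustrative; use the one matching your model):
smash_config['task'] = 'text_text_generation'  # illustrative task
smash_config['compilers'] = ['all']
smash_config['device'] = 'cuda'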
- diffusers:
Time: 1 hour.
Quality: Same as the original model.
Required Argument: None.
Optional Argument: None.
- diffusers2:
Time: A few minutes.
Quality: Same as the original model.
Required Argument: None.
Optional Argument:
save_dir: Working directory during compilation. e.g.
smash_config['save_dir'] = '/tmp/'
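Continuing the text-to-image example from the beginning of this manual, a diffusers2 configuration could look like this:
smash_config['task'] = 'text_image_generation'
smash_config['compilers'] = ['diffusers2']
smash_config['save_dir'] = '/tmp/'  # optional working directory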
- c_translation:
Time: A few minutes.
Quality: Same as the original model.
Required Argument:
tokenizer: Associated tokenizer. e.g.
smash_config['tokenizer'] = AutoTokenizer.from_pretrained('facebook/opt-125m')
Optional Argument:
weight_quantization_bits: 8 or 16 bits (default 16). e.g.
smash_config['weight_quantization_bits'] = 8
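Put together, a c_translation configuration could look like the following sketch (the tokenizer checkpoint mirrors the e.g. above and is a placeholder):
from transformers import AutoTokenizer

smash_config['task'] = 'text_text_translation'
smash_config['compilers'] = ['c_translation']
smash_config['tokenizer'] = AutoTokenizer.from_pretrained('facebook/opt-125m')  # placeholder checkpoint
smash_config['weight_quantization_bits'] = 8  # optional; defaults to 16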
- c_generation:
Time: A few minutes.
Quality: Equivalent to the original model.
Required Argument:
tokenizer: The tokenizer associated with your generation model.
Optional Argument:
weight_quantization_bits: Specify 8 or 16 bits (16 by default).
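A corresponding sketch for c_generation, assuming a tokenizer loaded as in the c_translation example (the checkpoint is a placeholder):
from transformers import AutoTokenizer

smash_config['task'] = 'text_text_generation'
smash_config['compilers'] = ['c_generation']
smash_config['tokenizer'] = AutoTokenizer.from_pretrained('facebook/opt-125m')  # placeholder checkpoint
smash_config['weight_quantization_bits'] = 16  # optional; 16 is the default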
- c_whisper:
Time: A few minutes.
Quality: Same as the original model.
Required Argument:
processor: The processor for your whisper model.
Optional Argument:
weight_quantization_bits: Choose between 8 or 16 bits (16 if unspecified). e.g.
smash_config['weight_quantization_bits'] = 8
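For example, a c_whisper configuration for a transcription model might look like this (the processor checkpoint follows the examples below and is a placeholder):
from transformers import AutoProcessor

smash_config['task'] = 'audio_text_transcription'
smash_config['compilers'] = ['c_whisper']
smash_config['processor'] = AutoProcessor.from_pretrained('openai/whisper-large-v3')  # placeholder checkpoint
smash_config['weight_quantization_bits'] = 8  # optional; defaults to 16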
- ifw:
Time: A few minutes.
Quality: Comparable to the original model.
Required Arguments:
processor: Processor for your whisper model. e.g.
smash_config['processor'] = AutoProcessor.from_pretrained('openai/whisper-large-v3')
device: Target hardware ('cpu' or 'cuda'). e.g.
smash_config['device'] = 'cuda'
Optional Argument: None.
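Put together, an ifw configuration could look like this sketch (the checkpoint is a placeholder):
from transformers import AutoProcessor

smash_config['task'] = 'audio_text_transcription'
smash_config['compilers'] = ['ifw']
smash_config['processor'] = AutoProcessor.from_pretrained('openai/whisper-large-v3')  # placeholder checkpoint
smash_config['device'] = 'cuda'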
- ws2t:
Time: A few minutes.
Quality: Maintains original model performance.
Required Argument:
processor: Processor for your whisper model. e.g.
smash_config['processor'] = AutoProcessor.from_pretrained('openai/whisper-large-v3')
Optional Argument: None.
- step_caching:
Time: A few minutes.
Quality: Very close to the original model.
Required Argument: None.
Optional Argument: None.
- tiling:
Time: A few minutes.
Quality: Not specified.
Required Argument: None.
Optional Argument: None.
- x-fast:
Time: A few minutes.
Quality: Not specified.
Required Argument: None.
Optional Argument:
fn_to_compile: The function to compile. e.g.
smash_config['fn_to_compile'] = 'forward'
save_dir: The working directory during compilation. e.g.
smash_config['save_dir'] = '/tmp/'
- torch_compile:
Time: A few minutes.
Quality: Not specified.
Required Argument: None.
Optional Argument:
cache_dir: The directory to cache the compiled model. e.g.
smash_config['cache_dir'] = '/tmp'
fullgraph: Whether to compile the full graph. e.g.
smash_config['fullgraph'] = True
dynamic: Whether to compile the model dynamically. e.g.
smash_config['dynamic'] = True
mode: The mode to use. e.g.
smash_config['mode'] = 'max-autotune'
backend: The backend to use. e.g.
smash_config['backend'] = 'inductor'
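As a sketch, a torch_compile configuration that sets every optional argument might look like this (the values are illustrative, not recommendations):
smash_config['compilers'] = ['torch_compile']
smash_config['cache_dir'] = '/tmp'
smash_config['fullgraph'] = True
smash_config['dynamic'] = True
smash_config['mode'] = 'max-autotune'
smash_config['backend'] = 'inductor'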
Quantization Methods
Quantization methods reduce the precision of the model's weights and activations, which substantially reduces the memory required at the cost of some quality loss. Supported methods include:
- torch_dynamic:
Time: A few minutes.
Quality: Not specified.
Required Argument: None.
Optional Argument: None.
- torch_static:
Time: A few minutes.
Quality: Not specified.
Required Argument: None.
Optional Argument: None.
- llm-int8:
Time: A few minutes.
Quality: Lower than the original model; 4-bit quantization is worse than 8-bit.
Required Argument:
weight_quantization_bits: 4 or 8 bits. e.g.
smash_config['weight_quantization_bits'] = 8
Optional Argument: None.
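For instance, a minimal llm-int8 setup for a text generation model (the task is illustrative):
smash_config['task'] = 'text_text_generation'  # illustrative task
smash_config['quantizers'] = ['llm-int8']
smash_config['weight_quantization_bits'] = 8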
- gptq:
Time: 30 minutes to a day depending on the size of the model.
Quality: Lower than the original model; quality degrades with fewer bits (2-bit is worse than 3-bit, which is worse than 4-bit, which is worse than 8-bit).
Required Argument:
weight_quantization_bits: 2, 3, 4, or 8 bits. e.g.
smash_config['weight_quantization_bits'] = 4
Optional Argument: None.
- awq:
Time: 30 minutes to a day depending on the size of the model.
Quality: Not specified.
Required Argument: None.
Optional Argument: None.
- hqq:
Time: A few minutes.
Quality: Not specified.
Required Argument:
weight_quantization_bits: 2, 3, 4, or 8 bits. e.g.
smash_config['weight_quantization_bits'] = 4
Optional Argument: None.
- auto-gptq:
Time: 30 minutes to a day depending on the size of the model.
Quality: Not specified.
Required Argument:
weight_quantization_bits: 2, 3, 4, or 8 bits. e.g.
smash_config['weight_quantization_bits'] = 4
Optional Argument: None.
- lit-llm-int8:
Time: A few minutes.
Quality: Not specified.
Required Argument:
weight_quantization_bits: 4 or 8 bits. e.g.
smash_config['weight_quantization_bits'] = 8
Optional Argument: None.
- half:
Time: A few minutes.
Quality: Not specified.
Required Argument: None.
Optional Argument: None.
- quanto:
Time: A few minutes.
Quality: Not specified.
Required Argument:
weight_quantization_bits: The quanto data type for weights. e.g.
smash_config['weight_quantization_bits'] = qint8  # qint8 from the quanto package
activation_quantization_bits: The quanto data type for activations. e.g.
smash_config['activation_quantization_bits'] = qint8
Optional Argument: None.
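Combining both arguments, a quanto sketch could look like this (the import path is an assumption based on the quanto library; adjust to the version you have installed):
from optimum.quanto import qint8  # assumed import for the qint8 data type

smash_config['quantizers'] = ['quanto']
smash_config['weight_quantization_bits'] = qint8
smash_config['activation_quantization_bits'] = qint8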
Pruning
Coming Soon!
Factorization
Coming Soon!