SmashConfig User Manual
=======================

SmashConfig is the central tool in Pruna for configuring the parameters that optimize your models. This manual explains how to define a SmashConfig and how to use it.

Defining a SmashConfig
----------------------

Define a SmashConfig using the following code:

.. code-block:: python

    from pruna.algorithms.SmashConfig import SmashConfig

    smash_config = SmashConfig()

After creating a SmashConfig, you can set the parameters for optimization:

.. code-block:: python

    smash_config['task'] = 'text_image_generation'
    smash_config['compilers'] = ['diffusers2']

Passing a SmashConfig to the Smash Function
-------------------------------------------

Pass a SmashConfig to the ``smash`` function as follows:

.. code-block:: python

    from pruna.smash import smash

    smashed_model = smash(
        model=pipe,
        api_key='',  # Replace with your actual API key
        smash_config=smash_config,
        dataloader=None,  # Optional
    )

SmashConfig Parameters
----------------------

Tasks
^^^^^

The ``task`` parameter specifies the type of model you want to optimize. Supported tasks include:

- **image_classification**: Optimize image classification models.
- **image_instance_segmentation**: Optimize instance segmentation models.
- **image_keypoint_detection**: Optimize keypoint detection models.
- **image_object_detection**: Optimize object detection models.
- **image_semantic_segmentation**: Optimize semantic segmentation models.
- **image_image_generation**: Optimize image generation models.
- **image_image_inpainting**: Optimize image inpainting models.
- **image_image_control**: Optimize image control models.
- **image_video_generation**: Optimize video generation models.
- **text_image_generation**: Optimize text-to-image generation models.
- **text_video_generation**: Optimize text-to-video generation models.
- **text_text_generation**: Optimize text generation models.
- **text_text_translation**: Optimize text translation models.
- **text+image_image_generation**: Optimize text-and-image-to-image generation models.
- **audio_text_transcription**: Optimize audio-to-text transcription models.

Automatic ML Compression Search
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The automatic ML compression search optimally compresses your model using a combination of the compilation, quantization, pruning, and factorization methods that are supported for your model. To enable it, set all method groups to ``None``:

.. code-block:: python

    smash_config['task'] = ''  # Replace with the task you want to optimize for
    smash_config['pruners'] = None
    smash_config['factorizers'] = None
    smash_config['quantizers'] = None
    smash_config['compilers'] = None

Additionally, you can specify a target metric that the compressed model should optimize. The metric can be any of the following:

- **memory_disk_first**: Optimize the model for memory usage on disk when loading for the first time.
- **memory_disk**: Optimize the model for memory usage on disk.
- **memory_inference_first**: Optimize the model for memory usage during the first inference.
- **memory_inference**: Optimize the model for memory usage during inference.
- **token_generation_latency_sync**: Optimize the model for token generation latency in synchronous mode.
- **token_generation_latency_async**: Optimize the model for token generation latency in asynchronous mode.
- **token_generation_throughput_sync**: Optimize the model for token generation throughput in synchronous mode.
- **token_generation_throughput_async**: Optimize the model for token generation throughput in asynchronous mode.
- **inference_latency_sync**: Optimize the model for inference latency in synchronous mode.
- **inference_latency_async**: Optimize the model for inference latency in asynchronous mode.
- **inference_throughput_sync**: Optimize the model for inference throughput in synchronous mode.
- **inference_throughput_async**: Optimize the model for inference throughput in asynchronous mode.
- **inference_CO2_emissions**: Optimize the model for CO2 emissions during inference.
- **inference_energy_consumption**: Optimize the model for energy consumption during inference.

In this case, set the following:

.. code-block:: python

    smash_config['task'] = ''  # Replace with the task you want to optimize for
    smash_config['pruners'] = None
    smash_config['factorizers'] = None
    smash_config['quantizers'] = None
    smash_config['compilers'] = None
    smash_config['target_metric'] = ''  # Replace with the target metric you want to optimize

Compression Methods
^^^^^^^^^^^^^^^^^^^

Two types of optimization methods are currently available: compilation and quantization.

Compilation Methods
^^^^^^^^^^^^^^^^^^^

Compilation methods optimize the model for specific hardware. Supported methods include:

- **all**:

  - Time: 30 minutes.
  - Quality: Similar to the original model.
  - Required Argument: ``device``, 'cpu' or 'cuda', e.g. ``smash_config['device'] = 'cuda'``.
  - Optional Argument: None.

- **diffusers**:

  - Time: 1 hour.
  - Quality: Same as the original model.
  - Required Argument: None.
  - Optional Argument: None.

- **diffusers2**:

  - Time: A few minutes.
  - Quality: Same as the original model.
  - Required Argument: None.
  - Optional Argument: ``save_dir``, the working directory during compilation, e.g. ``smash_config['save_dir'] = '/tmp/'``.

- **c_translation**:

  - Time: A few minutes.
  - Quality: Same as the original model.
  - Required Argument: ``tokenizer``, the associated tokenizer, e.g. ``smash_config['tokenizer'] = AutoTokenizer.from_pretrained('facebook/opt-125m')``.
  - Optional Argument: ``weight_quantization_bits``, 8 or 16 bits (default 16), e.g. ``smash_config['weight_quantization_bits'] = 8``.

- **c_generation**:

  - Time: A few minutes.
  - Quality: Equivalent to the original model.
  - Required Argument: ``tokenizer``, the tokenizer associated with your generation model.
  - Optional Argument: ``weight_quantization_bits``, 8 or 16 bits (16 by default).

- **c_whisper**:

  - Time: A few minutes.
  - Quality: Same as the original model.
  - Required Argument: ``processor``, the processor for your Whisper model.
  - Optional Argument: ``weight_quantization_bits``, 8 or 16 bits (16 if unspecified), e.g. ``smash_config['weight_quantization_bits'] = 8``.

- **ifw**:

  - Time: A few minutes.
  - Quality: Comparable to the original model.
  - Required Arguments:

    - ``processor``: The processor for your Whisper model, e.g. ``smash_config['processor'] = AutoProcessor.from_pretrained('openai/whisper-large-v3')``.
    - ``device``: Target hardware, 'cpu' or 'cuda', e.g. ``smash_config['device'] = 'cuda'``.

  - Optional Argument: None.

- **ws2t**:

  - Time: A few minutes.
  - Quality: Maintains original model performance.
  - Required Argument: ``processor``, the processor for your Whisper model, e.g. ``smash_config['processor'] = AutoProcessor.from_pretrained('openai/whisper-large-v3')``.
  - Optional Argument: None.

- **step_caching**:

  - Time: A few minutes.
  - Quality: Very close to the original model.
  - Required Argument: None.
  - Optional Argument: None.

- **tiling**:

  - Time: A few minutes.
  - Quality: Not specified.
  - Required Argument: None.
  - Optional Argument: None.

- **x-fast**:

  - Time: A few minutes.
  - Quality: Not specified.
  - Required Argument: None.
  - Optional Arguments:

    - ``fn_to_compile``: The function to compile, e.g. ``smash_config['fn_to_compile'] = 'forward'``.
    - ``save_dir``: The working directory during compilation, e.g. ``smash_config['save_dir'] = '/tmp'``.

- **torch_compile**:

  - Time: A few minutes.
  - Quality: Not specified.
  - Required Argument: None.
  - Optional Arguments:

    - ``cache_dir``: The directory in which to cache the compiled model, e.g. ``smash_config['cache_dir'] = '/tmp'``.
    - ``fullgraph``: Whether to compile the full graph, e.g. ``smash_config['fullgraph'] = True``.
    - ``dynamic``: Whether to compile with dynamic shapes, e.g. ``smash_config['dynamic'] = True``.
    - ``mode``: The compilation mode to use, e.g. ``smash_config['mode'] = 'max-autotune'``.
    - ``backend``: The backend to use, e.g. ``smash_config['backend'] = 'inductor'``.

Quantization Methods
^^^^^^^^^^^^^^^^^^^^

Quantization methods reduce the precision of the model's weights and activations, greatly reducing the memory required at the cost of some quality loss. Supported methods include:

- **torch_dynamic**:

  - Time: A few minutes.
  - Quality: Not specified.
  - Required Argument: None.
  - Optional Argument: None.

- **torch_static**:

  - Time: A few minutes.
  - Quality: Not specified.
  - Required Argument: None.
  - Optional Argument: None.

- **llm-int8**:

  - Time: A few minutes.
  - Quality: Lower than the original model; 4-bit is worse than 8-bit.
  - Required Argument: ``weight_quantization_bits``, 4 or 8 bits, e.g. ``smash_config['weight_quantization_bits'] = 8``.
  - Optional Argument: None.

- **gptq**:

  - Time: 30 minutes to a day, depending on the size of the model.
  - Quality: Lower than the original model; quality decreases with fewer bits (2-bit worse than 3-bit, worse than 4-bit, worse than 8-bit).
  - Required Argument: ``weight_quantization_bits``, 2, 3, 4, or 8 bits, e.g. ``smash_config['weight_quantization_bits'] = 4``.
  - Optional Argument: None.

- **awq**:

  - Time: 30 minutes to a day, depending on the size of the model.
  - Quality: Not specified.
  - Required Argument: None.
  - Optional Argument: None.

- **hqq**:

  - Time: A few minutes.
  - Quality: Not specified.
  - Required Argument: ``weight_quantization_bits``, 2, 3, 4, or 8 bits, e.g. ``smash_config['weight_quantization_bits'] = 4``.
  - Optional Argument: None.

- **auto-gptq**:

  - Time: 30 minutes to a day, depending on the size of the model.
  - Quality: Not specified.
  - Required Argument: ``weight_quantization_bits``, 2, 3, 4, or 8 bits, e.g. ``smash_config['weight_quantization_bits'] = 4``.
  - Optional Argument: None.

- **lit-llm-int8**:

  - Time: A few minutes.
  - Quality: Not specified.
  - Required Argument: ``weight_quantization_bits``, 4 or 8 bits, e.g. ``smash_config['weight_quantization_bits'] = 8``.
  - Optional Argument: None.

- **half**:

  - Time: A few minutes.
  - Quality: Not specified.
  - Required Argument: None.
  - Optional Argument: None.

- **quanto**:

  - Time: A few minutes.
  - Quality: Not specified.
  - Required Arguments:

    - ``weight_quantization_bits``: e.g. ``smash_config['weight_quantization_bits'] = qint8``.
    - ``activation_quantization_bits``: e.g. ``smash_config['activation_quantization_bits'] = qint8``.

  - Optional Argument: None.

Pruning
^^^^^^^

Coming Soon!

Factorization
^^^^^^^^^^^^^

Coming Soon!
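Because SmashConfig is set through plain dictionary-style assignment, a typo in a task or compiler name may only surface once ``smash`` is called. The sketch below shows one way to catch such mistakes earlier. It is illustrative only and not part of Pruna's API: the ``ValidatedConfig`` class is hypothetical, and the name sets are abbreviated stand-ins for the full task and compiler lists above.

.. code-block:: python

    # Illustrative only: a dict-like wrapper that checks SmashConfig-style
    # keys at assignment time. Not part of Pruna's API.

    SUPPORTED_TASKS = {
        'text_image_generation', 'text_text_generation',
        'audio_text_transcription',  # abbreviated; see the full task list above
    }
    SUPPORTED_COMPILERS = {'diffusers', 'diffusers2', 'torch_compile', 'x-fast'}


    class ValidatedConfig(dict):
        """Reject unknown task and compiler names as soon as they are set."""

        def __setitem__(self, key, value):
            if key == 'task' and value not in SUPPORTED_TASKS:
                raise ValueError(f'unsupported task: {value!r}')
            if key == 'compilers' and value is not None:
                unknown = set(value) - SUPPORTED_COMPILERS
                if unknown:
                    raise ValueError(f'unsupported compilers: {sorted(unknown)}')
            super().__setitem__(key, value)


    config = ValidatedConfig()
    config['task'] = 'text_image_generation'
    config['compilers'] = ['diffusers2']
    config['quantizers'] = None  # None selects automatic search, as described above

A wrapper like this can then be passed wherever a plain dictionary of settings is accepted, while a misspelled name such as ``'text_img_generation'`` fails immediately with a ``ValueError`` instead of at smash time.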