SmashConfig User Manual
=======================

SmashConfig is the central tool in Pruna for configuring the parameters that optimize your models. This manual explains how to define a SmashConfig and how to use it.

Defining a SmashConfig
----------------------

Define a SmashConfig using the following code:

.. code-block:: python

    from pruna.algorithms.SmashConfig import SmashConfig

    smash_config = SmashConfig()

After creating a SmashConfig, you can set the parameters for optimization:

.. code-block:: python

    smash_config['task'] = 'text_image_generation'
    smash_config['compilers'] = ['diffusers2']

Passing a SmashConfig to the Smash Function
-------------------------------------------

Pass a SmashConfig to the ``smash`` function as follows:

.. code-block:: python

    from pruna.smash import smash

    smashed_model = smash(
        model=pipe,
        api_key='',  # Replace with your actual API key
        smash_config=smash_config,
        dataloader=None,  # Optional
    )

SmashConfig Parameters
----------------------

Tasks
^^^^^

The ``task`` parameter specifies the type of model you want to optimize. Supported tasks include:

- **image_classification**: Optimize image classification models.
- **image_instance_segmentation**: Optimize instance segmentation models.
- **image_keypoint_detection**: Optimize keypoint detection models.
- **image_object_detection**: Optimize object detection models.
- **image_semantic_segmentation**: Optimize semantic segmentation models.
- **image_image_generation**: Optimize image generation models.
- **image_image_inpainting**: Optimize image inpainting models.
- **image_image_control**: Optimize image control models.
- **image_video_generation**: Optimize video generation models.
- **text_image_generation**: Optimize text-to-image generation models.
- **text_video_generation**: Optimize text-to-video generation models.
- **text_text_generation**: Optimize text generation models.
- **text_text_translation**: Optimize text translation models.
- **text+image_image_generation**: Optimize text-and-image-to-image generation models.
- **audio_text_transcription**: Optimize audio-to-text transcription models.

Automatic ML Compression Search
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The automatic ML compression search optimally compresses your model using a combination of the compilation, quantization, pruning, and factorization methods that are supported for your model. To enable it, set all method groups to ``None``:

.. code-block:: python

    smash_config['task'] = ''  # Replace with the task you want to optimize for
    smash_config['pruners'] = None
    smash_config['factorizers'] = None
    smash_config['quantizers'] = None
    smash_config['compilers'] = None

Additionally, you can specify a target metric that the compressed model should optimize. The metric can be any of the following:

- **memory_disk_first**: Optimize the model for memory usage on disk when loading for the first time.
- **memory_disk**: Optimize the model for memory usage on disk.
- **memory_inference_first**: Optimize the model for memory usage during the first inference.
- **memory_inference**: Optimize the model for memory usage during inference.
- **token_generation_latency_sync**: Optimize the model for token generation latency in synchronous mode.
- **token_generation_latency_async**: Optimize the model for token generation latency in asynchronous mode.
- **token_generation_throughput_sync**: Optimize the model for token generation throughput in synchronous mode.
- **token_generation_throughput_async**: Optimize the model for token generation throughput in asynchronous mode.
- **inference_latency_sync**: Optimize the model for inference latency in synchronous mode.
- **inference_latency_async**: Optimize the model for inference latency in asynchronous mode.
- **inference_throughput_sync**: Optimize the model for inference throughput in synchronous mode.
- **inference_throughput_async**: Optimize the model for inference throughput in asynchronous mode.
- **inference_CO2_emissions**: Optimize the model for CO2 emissions during inference.
- **inference_energy_consumption**: Optimize the model for energy consumption during inference.

In this case, set the following:

.. code-block:: python

    smash_config['task'] = ''  # Replace with the task you want to optimize for
    smash_config['pruners'] = None
    smash_config['factorizers'] = None
    smash_config['quantizers'] = None
    smash_config['compilers'] = None
    smash_config['target_metric'] = ''  # Replace with the target metric you want to optimize

Compression Methods
^^^^^^^^^^^^^^^^^^^

Two types of optimization methods are currently available: compilation and quantization.

Compilation Methods
^^^^^^^^^^^^^^^^^^^

Compilation methods optimize the model for specific hardware. Supported methods include:

- **all**:

  - Time: 30 minutes.
  - Quality: Similar to the original model.
  - Required Argument: ``device``, 'cpu' or 'cuda', e.g. ``smash_config['device'] = 'cuda'``.
  - Optional Argument: None.

- **diffusers**:

  - Time: 1 hour.
  - Quality: Same as the original model.
  - Required Argument: None.
  - Optional Argument: None.

- **diffusers2**:

  - Time: A few minutes.
  - Quality: Same as the original model.
  - Required Argument: None.
  - Optional Argument: ``save_dir``, the working directory during compilation, e.g. ``smash_config['save_dir'] = '/tmp/'``.

- **c_translation**:

  - Time: A few minutes.
  - Quality: Same as the original model.
  - Required Argument: ``tokenizer``, the associated tokenizer, e.g. ``smash_config['tokenizer'] = AutoTokenizer.from_pretrained('facebook/opt-125m')``.
  - Optional Argument: ``weight_quantization_bits``, 8 or 16 bits (default 16), e.g. ``smash_config['weight_quantization_bits'] = 8``.

- **c_generation**:

  - Time: A few minutes.
  - Quality: Equivalent to the original model.
  - Required Argument: ``tokenizer``, the tokenizer associated with your generation model.
  - Optional Argument: ``weight_quantization_bits``, 8 or 16 bits (16 by default).

- **c_whisper**:

  - Time: A few minutes.
  - Quality: Same as the original model.
  - Required Argument: ``processor``, the processor for your Whisper model.
  - Optional Argument: ``weight_quantization_bits``, 8 or 16 bits (16 if unspecified), e.g. ``smash_config['weight_quantization_bits'] = 8``.

- **ifw**:

  - Time: A few minutes.
  - Quality: Comparable to the original model.
  - Required Arguments:

    - ``processor``: The processor for your Whisper model, e.g. ``smash_config['processor'] = AutoProcessor.from_pretrained('openai/whisper-large-v3')``.
    - ``device``: Target hardware, 'cpu' or 'cuda', e.g. ``smash_config['device'] = 'cuda'``.

  - Optional Argument: None.

- **ws2t**:

  - Time: A few minutes.
  - Quality: Maintains original model performance.
  - Required Argument: ``processor``, the processor for your Whisper model, e.g. ``smash_config['processor'] = AutoProcessor.from_pretrained('openai/whisper-large-v3')``.
  - Optional Argument: None.

- **step_caching**:

  - Time: A few minutes.
  - Quality: Very close to the original model.
  - Required Argument: None.
  - Optional Argument: None.

- **tiling**:

  - Time: A few minutes.
  - Quality: Not specified.
  - Required Argument: None.
  - Optional Argument: None.

- **x-fast**:

  - Time: A few minutes.
  - Quality: Not specified.
  - Required Argument: None.
  - Optional Arguments:

    - ``fn_to_compile``: The function to compile, e.g. ``smash_config['fn_to_compile'] = 'forward'``.
    - ``save_dir``: The working directory during compilation, e.g. ``smash_config['save_dir'] = '/tmp'``.

- **torch_compile**:

  - Time: A few minutes.
  - Quality: Not specified.
  - Required Argument: None.
  - Optional Arguments:

    - ``cache_dir``: The directory in which to cache the compiled model, e.g. ``smash_config['cache_dir'] = '/tmp'``.
    - ``fullgraph``: Whether to compile the full graph, e.g. ``smash_config['fullgraph'] = True``.
    - ``dynamic``: Whether to compile with dynamic shapes, e.g. ``smash_config['dynamic'] = True``.
    - ``mode``: The compilation mode to use, e.g. ``smash_config['mode'] = 'max-autotune'``.
    - ``backend``: The backend to use, e.g. ``smash_config['backend'] = 'inductor'``.

Quantization Methods
^^^^^^^^^^^^^^^^^^^^

Quantization methods reduce the precision of the model's weights and activations, greatly reducing the memory required at the cost of some quality loss. Supported methods include:

- **torch_dynamic**:

  - Time: A few minutes.
  - Quality: Not specified.
  - Required Argument: None.
  - Optional Argument: None.

- **torch_static**:

  - Time: A few minutes.
  - Quality: Not specified.
  - Required Argument: None.
  - Optional Argument: None.

- **llm-int8**:

  - Time: A few minutes.
  - Quality: Lower than the original model; 4-bit is worse than 8-bit.
  - Required Argument: ``weight_quantization_bits``, 4 or 8 bits, e.g. ``smash_config['weight_quantization_bits'] = 8``.
  - Optional Argument: None.

- **gptq**:

  - Time: 30 minutes to a day, depending on the size of the model.
  - Quality: Lower than the original model; quality decreases with fewer bits (2-bit worse than 3-bit, worse than 4-bit, worse than 8-bit).
  - Required Argument: ``weight_quantization_bits``, 2, 3, 4, or 8 bits, e.g. ``smash_config['weight_quantization_bits'] = 4``.
  - Optional Argument: None.

- **awq**:

  - Time: 30 minutes to a day, depending on the size of the model.
  - Quality: Not specified.
  - Required Argument: None.
  - Optional Argument: None.

- **hqq**:

  - Time: A few minutes.
  - Quality: Not specified.
  - Required Argument: ``weight_quantization_bits``, 2, 3, 4, or 8 bits, e.g. ``smash_config['weight_quantization_bits'] = 4``.
  - Optional Argument: None.

- **auto-gptq**:

  - Time: 30 minutes to a day, depending on the size of the model.
  - Quality: Not specified.
  - Required Argument: ``weight_quantization_bits``, 2, 3, 4, or 8 bits, e.g. ``smash_config['weight_quantization_bits'] = 4``.
  - Optional Argument: None.

- **lit-llm-int8**:

  - Time: A few minutes.
  - Quality: Not specified.
  - Required Argument: ``weight_quantization_bits``, 4 or 8 bits, e.g. ``smash_config['weight_quantization_bits'] = 8``.
  - Optional Argument: None.

- **half**:

  - Time: A few minutes.
  - Quality: Not specified.
  - Required Argument: None.
  - Optional Argument: None.

- **quanto**:

  - Time: A few minutes.
  - Quality: Not specified.
  - Required Arguments:

    - ``weight_quantization_bits``: e.g. ``smash_config['weight_quantization_bits'] = qint8``.
    - ``activation_quantization_bits``: e.g. ``smash_config['activation_quantization_bits'] = qint8``.

  - Optional Argument: None.

Pruning
^^^^^^^

Coming Soon!

Factorization
^^^^^^^^^^^^^

Coming Soon!
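Because SmashConfig is set through plain dictionary-style assignment, a typo in a task or compiler name may only surface once ``smash`` is called. The sketch below shows one way to catch such mistakes earlier. It is illustrative only and not part of Pruna's API: the ``ValidatedConfig`` class is hypothetical, and the name sets are abbreviated stand-ins for the full task and compiler lists above.

.. code-block:: python

    # Illustrative only: a dict-like wrapper that checks SmashConfig-style
    # keys at assignment time. Not part of Pruna's API.

    SUPPORTED_TASKS = {
        'text_image_generation', 'text_text_generation',
        'audio_text_transcription',  # abbreviated; see the full task list above
    }
    SUPPORTED_COMPILERS = {'diffusers', 'diffusers2', 'torch_compile', 'x-fast'}


    class ValidatedConfig(dict):
        """Reject unknown task and compiler names as soon as they are set."""

        def __setitem__(self, key, value):
            if key == 'task' and value not in SUPPORTED_TASKS:
                raise ValueError(f'unsupported task: {value!r}')
            if key == 'compilers' and value is not None:
                unknown = set(value) - SUPPORTED_COMPILERS
                if unknown:
                    raise ValueError(f'unsupported compilers: {sorted(unknown)}')
            super().__setitem__(key, value)


    config = ValidatedConfig()
    config['task'] = 'text_image_generation'
    config['compilers'] = ['diffusers2']
    config['quantizers'] = None  # None selects automatic search, as described above

A wrapper like this can then be passed wherever a plain dictionary of settings is accepted, while a misspelled name such as ``'text_img_generation'`` fails immediately with a ``ValueError`` instead of at smash time.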