SmashConfig User Manual

SmashConfig is an essential tool in Pruna for configuring parameters to optimize your models. This manual explains how to define and use SmashConfig.

Defining a SmashConfig

Define a SmashConfig using the following code:

from pruna.algorithms.SmashConfig import SmashConfig
smash_config = SmashConfig()

After creating a SmashConfig, you can set the parameters for optimization:

smash_config['task'] = 'text_image_generation'
smash_config['compilers'] = ['diffusers2']

Passing a SmashConfig to the Smash Function

Pass a SmashConfig to the smash function as follows:

from pruna.smash import smash

smashed_model = smash(
    model=pipe,
    api_key='<your-api-key>',  # Replace <your-api-key> with your actual API key
    smash_config=smash_config,
    dataloader=None,  # Optional
)
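
Putting the two steps together, a minimal end-to-end sketch looks like the following. The diffusers pipeline and model ID are illustrative assumptions; substitute the model you actually want to optimize and your own API key:

```python
from diffusers import DiffusionPipeline

from pruna.algorithms.SmashConfig import SmashConfig
from pruna.smash import smash

# Load the model to optimize (any text-to-image diffusers pipeline; this
# model ID is only an example).
pipe = DiffusionPipeline.from_pretrained('runwayml/stable-diffusion-v1-5')

# Configure the optimization: the task type and the compiler to apply.
smash_config = SmashConfig()
smash_config['task'] = 'text_image_generation'
smash_config['compilers'] = ['diffusers2']

# Smash the model; the returned object is used in place of `pipe`.
smashed_model = smash(
    model=pipe,
    api_key='<your-api-key>',
    smash_config=smash_config,
)
```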

SmashConfig Parameters

Tasks

The task parameter specifies the type of model you want to optimize. Supported tasks include:

  • image_classification: Optimize image classification models.

  • image_instance_segmentation: Optimize instance segmentation models.

  • image_keypoint_detection: Optimize keypoint detection models.

  • image_object_detection: Optimize object detection models.

  • image_semantic_segmentation: Optimize semantic segmentation models.

  • image_image_generation: Optimize image generation models.

  • image_image_inpainting: Optimize image inpainting models.

  • image_image_control: Optimize image control models.

  • image_video_generation: Optimize video generation models.

  • text_image_generation: Optimize text-to-image generation models.

  • text_video_generation: Optimize text-to-video generation models.

  • text_text_generation: Optimize text generation models.

  • text_text_translation: Optimize text translation models.

  • text+image_image_generation: Optimize image generation models conditioned on both text and image inputs.

  • audio_text_transcription: Optimize audio-to-text transcription models.

Compression Methods

There are two types of optimization methods: Compilation and Quantization.

Compilation Methods

Compilation methods optimize the model for specific hardware. Supported methods include:

  • all:
    • Time: 30 minutes.

    • Quality: Similar to the original model.

    • Required Argument:

      • device: 'cpu' or 'cuda'. e.g. smash_config['device'] = 'cuda'

    • Optional Argument: None.

  • diffusers:
    • Time: 1 hour.

    • Quality: Same as the original model.

    • Required Argument: None.

    • Optional Argument: None.

  • diffusers2:
    • Time: A few minutes.

    • Quality: Same as the original model.

    • Required Argument: None.

    • Optional Argument:

      • save_dir: Working directory during compilation. e.g. smash_config['save_dir'] = '/tmp/'

  • c_translation:
    • Time: A few minutes.

    • Quality: Same as the original model.

    • Required Argument:

      • tokenizer: Associated tokenizer. e.g. smash_config['tokenizer'] = AutoTokenizer.from_pretrained('facebook/opt-125m')

    • Optional Argument:

      • weight_quantization_bits: 8 or 16 bits (default 16). e.g. smash_config['weight_quantization_bits'] = 8

  • c_generation:
    • Time: A few minutes.

    • Quality: Equivalent to the original model.

    • Required Argument:

      • tokenizer: The tokenizer associated with your generation model.

    • Optional Argument:

      • weight_quantization_bits: Specify 8 or 16 bits (16 by default).

  • c_whisper:
    • Time: A few minutes.

    • Quality: Same as the original model.

    • Required Argument:

      • processor: The processor for your whisper model.

    • Optional Argument:

      • weight_quantization_bits: Choose between 8 or 16 bits (16 if unspecified). e.g. smash_config['weight_quantization_bits'] = 8

  • ifw:
    • Time: A few minutes.

    • Quality: Comparable to the original model.

    • Required Arguments:

      • processor: Processor for your whisper model. e.g. smash_config['processor'] = AutoProcessor.from_pretrained('openai/whisper-large-v3')

      • device: Target hardware ('cpu' or 'cuda'). e.g. smash_config['device'] = 'cuda'

    • Optional Argument: None.

  • ws2t:
    • Time: A few minutes.

    • Quality: Maintains original model performance.

    • Required Argument:

      • processor: Processor for your whisper model. e.g. smash_config['processor'] = AutoProcessor.from_pretrained('openai/whisper-large-v3')

    • Optional Argument: None.

  • step_caching:
    • Time: A few minutes.

    • Quality: Very close to the original model.

    • Required Argument: None.

    • Optional Argument: None.

  • tiling:
    • Time: A few minutes.

    • Quality: Not specified.

    • Required Argument: None.

    • Optional Argument: None.

  • x-fast:
    • Time: A few minutes.

    • Quality: Not specified.

    • Required Argument: None.

    • Optional Argument:

      • fn_to_compile: The function to compile. e.g. smash_config['fn_to_compile'] = 'forward'

      • save_dir: The working directory during compilation. e.g. smash_config['save_dir'] = '/tmp'

  • torch_compile:
    • Time: A few minutes.

    • Quality: Not specified.

    • Required Argument: None.

    • Optional Argument:

      • cache_dir: The directory to cache the compiled model. e.g. smash_config['cache_dir'] = '/tmp'

      • fullgraph: Whether to compile the full graph. e.g. smash_config['fullgraph'] = True

      • dynamic: Whether to compile the model dynamically. e.g. smash_config['dynamic'] = True

      • mode: The mode to use. e.g. smash_config['mode'] = 'max-autotune'

      • backend: The backend to use. e.g. smash_config['backend'] = 'inductor'
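
As an illustration, the torch_compile options above can be combined in a single configuration. This is a sketch; the option values shown are illustrative choices, not recommended defaults:

```python
from pruna.algorithms.SmashConfig import SmashConfig

smash_config = SmashConfig()
smash_config['task'] = 'text_image_generation'
smash_config['compilers'] = ['torch_compile']

# Optional torch_compile arguments (each may be omitted).
smash_config['cache_dir'] = '/tmp'      # where the compiled model is cached
smash_config['fullgraph'] = True        # compile the full graph
smash_config['dynamic'] = False         # do not compile dynamically
smash_config['mode'] = 'max-autotune'   # compilation mode
smash_config['backend'] = 'inductor'    # compilation backend
```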

Quantization Methods

Quantization methods reduce the precision of the model’s weights and activations, making the model require much less memory at the cost of some quality loss. Supported methods include:

  • torch_dynamic:
    • Time: A few minutes.

    • Quality: Not specified.

    • Required Argument: None.

    • Optional Argument: None.

  • torch_static:
    • Time: A few minutes.

    • Quality: Not specified.

    • Required Argument: None.

    • Optional Argument: None.

  • llm-int8:
    • Time: A few minutes.

    • Quality: Lower than the original model; 4-bit quantization degrades quality more than 8-bit.

    • Required Argument:

      • weight_quantization_bits: 4 or 8 bits. e.g. smash_config['weight_quantization_bits'] = 8

    • Optional Argument: None.

  • gptq:
    • Time: 30 minutes to a day depending on the size of the model.

    • Quality: Lower than the original model; quality degrades as the bit width decreases (2 bits worse than 3, 3 worse than 4, 4 worse than 8).

    • Required Argument:

      • weight_quantization_bits: 2, 3, 4, or 8 bits. e.g. smash_config['weight_quantization_bits'] = 4

    • Optional Argument: None.

  • awq:
    • Time: 30 minutes to a day depending on the size of the model.

    • Quality: Not specified.

    • Required Argument: None.

    • Optional Argument: None.

  • hqq:
    • Time: A few minutes.

    • Quality: Not specified.

    • Required Argument:

      • weight_quantization_bits: 2, 3, 4, or 8 bits. e.g. smash_config['weight_quantization_bits'] = 4

    • Optional Argument: None.

  • auto-gptq:
    • Time: 30 minutes to a day depending on the size of the model.

    • Quality: Not specified.

    • Required Argument:

      • weight_quantization_bits: 2, 3, 4, or 8 bits. e.g. smash_config['weight_quantization_bits'] = 4

    • Optional Argument: None.

  • lit-llm-int8:
    • Time: A few minutes.

    • Quality: Not specified.

    • Required Argument:

      • weight_quantization_bits: 4 or 8 bits. e.g. smash_config['weight_quantization_bits'] = 8

    • Optional Argument: None.

  • half:
    • Time: A few minutes.

    • Quality: Not specified.

    • Required Argument: None.

    • Optional Argument: None.

  • quanto:
    • Time: A few minutes.

    • Quality: Not specified.

    • Required Argument:

      • weight_quantization_bits: Precision type for the weights. e.g. smash_config['weight_quantization_bits'] = qint8

      • activation_quantization_bits: Precision type for the activations. e.g. smash_config['activation_quantization_bits'] = qint8

    • Optional Argument: None.
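
The supported weight_quantization_bits values above differ per method. As a quick cross-check before configuring, they can be summarized in a plain lookup table. This helper is illustrative only and not part of the Pruna API:

```python
# Supported weight_quantization_bits per quantization method, as listed above.
SUPPORTED_BITS = {
    'llm-int8': {4, 8},
    'gptq': {2, 3, 4, 8},
    'hqq': {2, 3, 4, 8},
    'auto-gptq': {2, 3, 4, 8},
    'lit-llm-int8': {4, 8},
}

def check_bits(method: str, bits: int) -> bool:
    """Return True if `bits` is a valid weight_quantization_bits for `method`."""
    return bits in SUPPORTED_BITS.get(method, set())

print(check_bits('gptq', 3))      # True: gptq accepts 2, 3, 4, or 8 bits
print(check_bits('llm-int8', 3))  # False: llm-int8 accepts only 4 or 8 bits
```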

Pruning

Coming Soon!

Factorization

Coming Soon!