Smashing Large Language Models
===============================

This tutorial demonstrates how to use the `pruna` package to optimize any custom large language model. We will use the `facebook/opt-125m` model as an example.

Loading the LLM
---------------

First, load your LLM and prepare an example input.

.. code-block:: python

    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    model_id = "facebook/opt-125m"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto").cuda()

    text = "The 45th president of the United States of America is"
    ins = tokenizer(text, return_tensors="pt").to('cuda')

Initializing the Smash Config
-----------------------------

Next, initialize the `smash_config`.

.. code-block:: python

    from pruna_engine.SmashConfig import SmashConfig

    # Initialize the SmashConfig
    smash_config = SmashConfig()
    smash_config['task'] = 'text_text_generation'
    smash_config['quantizers'] = ['gptq']
    smash_config['tokenizer_name'] = tokenizer
    smash_config['weight_quantization_bits'] = 4

Smashing the Model
------------------

Now, smash the model.

.. code-block:: python

    from pruna.smash import smash

    # Smash the model
    smashed_model = smash(
        model=model,
        dataloader="WikiText_128",
        api_key='',  # replace with your actual API key
        smash_config=smash_config,
    )

Don't forget to replace the `api_key` with the one provided by PrunaAI.

Running the Model
-----------------

Finally, run the model to generate text.

.. code-block:: python

    # Generate from the example input and display the decoded result
    output = smashed_model.generate(**ins)
    print(tokenizer.decode(output[0], skip_special_tokens=True))

Wrap Up
---------

Congratulations! You have successfully smashed a large language model. You can now use the `pruna` package to optimize any custom LLM. The only parts you need to modify are step 1 and step 4 to fit your use case. Additionally, you can use the `llm-int8` quantizer to quantize the model, or the `transformers-fast` compiler to compile it for an inference speedup.
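
As a rough sketch of what such an alternative configuration might look like: the snippet below reuses the same `SmashConfig` keys as in step 2 and swaps in the `llm-int8` quantizer. The `compilers` key used to select `transformers-fast` is an assumption for illustration; check the `pruna` documentation for the exact option name.

.. code-block:: python

    from pruna_engine.SmashConfig import SmashConfig

    # Alternative configuration sketch: quantize with 'llm-int8' and
    # compile with 'transformers-fast' for faster inference.
    # NOTE: the 'compilers' key is assumed here, not confirmed by this tutorial.
    smash_config = SmashConfig()
    smash_config['task'] = 'text_text_generation'
    smash_config['quantizers'] = ['llm-int8']
    smash_config['compilers'] = ['transformers-fast']
    smash_config['tokenizer_name'] = tokenizer

The rest of the workflow, smashing the model and running generation, stays the same as in steps 3 and 4 above.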