Making your LLMs 4x smaller

This tutorial demonstrates how to use the pruna package to optimize any custom large language model. We will use the facebook/opt-125m model as an example. The optimization applied here is GPTQ quantization, which typically stores weights in 4 bits instead of 16 and is where the roughly 4x size reduction in the title comes from. Do not forget to install the transformers version of the pruna package before running this tutorial.
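
If pruna is not installed yet, a typical install from a notebook looks like the sketch below; the exact package extras needed for transformers support may differ between pruna versions, so check the installation documentation for your version.

[ ]:
# Install pruna from PyPI (run once; exact extras depend on your setup and pruna version)
!pip install pruna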

1. Loading the LLM

First, load the LLM and its tokenizer, move the model to the GPU, and prepare an example prompt.

[ ]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-125m"
# Load the model in its native precision and move it to the GPU
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto").cuda()

text = "The 45th president of the United States of America is"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Tokenize the example prompt and move the inputs to the GPU
ins = tokenizer(text, return_tensors="pt").to('cuda')
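
To see where the "4x smaller" headline comes from, you can record the size of the unsmashed model before quantizing it. This is a minimal sketch using only standard PyTorch, assuming the parameters dominate the model's memory footprint.

[ ]:
# Rough size of the original weights, as a baseline for the "4x smaller" claim
orig_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"Original model weights: {orig_bytes / 1e6:.1f} MB")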

2. Initialize the SmashConfig

Next, initialize the SmashConfig. GPTQ is a calibration-based quantizer, so in addition to selecting it, the config needs a tokenizer and a small calibration dataset.

[ ]:
from pruna import SmashConfig

# Initialize the SmashConfig
smash_config = SmashConfig()
# The tokenizer and dataset are used as calibration data for GPTQ
smash_config.add_tokenizer(model_id)
smash_config.add_data("WikiText_128")
# Select GPTQ as the quantization algorithm
smash_config['quantizers'] = ['gptq']
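
Before smashing, it can be useful to double-check what the configuration contains. A minimal sketch that simply prints the config object; the exact fields shown depend on your pruna version.

[ ]:
# Inspect the configuration before smashing; the printed fields vary by pruna version
print(smash_config)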

3. Smashing the model

Now, smash the model. This can take up to 8 minutes on a T4 GPU. Don’t forget to replace the token with the one provided by PrunaAI.

[ ]:
from pruna import smash

# Smash the model
smashed_model = smash(
    model=model,
    token='<your_token>',  # replace <your_token> with your actual token, or set to None if you do not have one yet
    smash_config=smash_config,
)
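
If you plan to reuse the smashed model later, you can save it to disk at this point. This is a minimal sketch assuming your pruna version exposes a save_pretrained method on the returned model (check the documentation for your version); the directory name is just an example.

[ ]:
# Persist the smashed model so it can be reloaded later.
# Assumes save_pretrained is available on the smashed model in your pruna version;
# "smashed-opt-125m" is an arbitrary example directory.
smashed_model.save_pretrained("smashed-opt-125m")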

4. Running the Model

Finally, run the smashed model to generate text from the example prompt.

[ ]:
# Display the result
tokenizer.batch_decode(smashed_model.generate(**ins))
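
If you want a rough sense of inference speed, you can time a single generation pass. This is a minimal sketch using only the standard library, assuming generate forwards standard transformers generation arguments; max_new_tokens=50 is an arbitrary example value.

[ ]:
import time

# Rough single-run timing of generation on the smashed model
start = time.perf_counter()
output_ids = smashed_model.generate(**ins, max_new_tokens=50)
elapsed = time.perf_counter() - start

print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
print(f"Generation took {elapsed:.2f} seconds")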

Wrap Up

Congratulations! You have successfully smashed an LLM. You can now use the pruna package to optimize any LLM; to adapt this tutorial to your own model, you only need to modify steps 1 and 4.