Making your LLMs 4x smaller
This tutorial demonstrates how to use the pruna package to optimize any custom large language model. We will use the facebook/opt-125m model as an example. Do not forget to install the transformers version of the pruna package before running this tutorial.
1. Loading the LLM
First, load the LLM and its tokenizer, and prepare an example prompt.
[ ]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model in its native precision and move it to the GPU
model_id = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto").cuda()

# Prepare an example prompt and move the tokenized inputs to the GPU
text = "The 45th president of the United States of America is"
tokenizer = AutoTokenizer.from_pretrained(model_id)
ins = tokenizer(text, return_tensors="pt").to("cuda")
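Before smashing, you can optionally generate with the unmodified model once. This is not part of the original tutorial, just a small sketch that reuses the objects defined above so you have a baseline output to compare against after quantization.
[ ]:
# Optional baseline: generate with the un-smashed model for later comparison
baseline_output = model.generate(**ins, max_new_tokens=20)
print(tokenizer.batch_decode(baseline_output))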
2. Initialize the SmashConfig
Next, initialize the SmashConfig. GPTQ quantization requires a tokenizer and a calibration dataset, so we add both to the configuration before selecting the quantizer.
[ ]:
from pruna import SmashConfig

# Initialize the SmashConfig
smash_config = SmashConfig()
smash_config.add_tokenizer(model_id)   # tokenizer used to prepare the calibration samples
smash_config.add_data("WikiText_128")  # calibration dataset for GPTQ
smash_config['quantizers'] = ['gptq']  # select GPTQ as the quantization algorithm
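If you want to double-check the settings before the (potentially long) smashing step, printing the configuration is a quick sanity check. This optional snippet only inspects the object created above.
[ ]:
# Optional: inspect the configuration before smashing
print(smash_config)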
3. Smashing the model
Now, smash the model. This can take up to 8 minutes on a T4 GPU. Don’t forget to replace the token with the one provided by PrunaAI.
[ ]:
from pruna import smash

# Smash the model
smashed_model = smash(
    model=model,
    token='<your_token>',  # replace <your_token> with your actual token, or set to None if you do not have one yet
    smash_config=smash_config,
)
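If you plan to reuse the compressed model later, recent pruna versions expose a save_pretrained method on the returned model. The directory name below is a placeholder and the exact saving API may differ in your version, so treat this as an assumption and check the pruna documentation.
[ ]:
# Optional (API may vary across pruna versions): persist the smashed model for later reuse
smashed_model.save_pretrained("opt-125m-smashed/")  # hypothetical output directory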
4. Running the Model
Finally, run the smashed model to generate text from the example prompt.
[ ]:
# Display the result
tokenizer.batch_decode(smashed_model.generate(**ins))
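To get a rough idea of the smashed model's speed, you can time a generation call with plain PyTorch utilities. This is only a sketch: for a fair comparison you would warm up first, average over several runs, and time the un-smashed model the same way before smashing.
[ ]:
import time
import torch

# Rough latency measurement for the smashed model
torch.cuda.synchronize()
start = time.perf_counter()
_ = smashed_model.generate(**ins)
torch.cuda.synchronize()
print(f"Generation took {time.perf_counter() - start:.2f} s")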
Wrap Up
Congratulations! You have successfully smashed an LLM. You can now use the pruna
package to optimize any LLM. The only parts that you should modify are step 1 and step 4 to fit your use case.