Making your LLMs 4x smaller
This tutorial demonstrates how to use the pruna package to optimize any custom large language model (LLM). We use the facebook/opt-125m model as an example. Install the transformer version of the pruna package before running this tutorial.
1. Loading the LLM
First, load your LLM.
[ ]:
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load the model onto the GPU (requires a CUDA device)
model_id = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto").cuda()
# Tokenize an example prompt and move the input tensors to the same device
text = "The 45th president of the United States of America is"
tokenizer = AutoTokenizer.from_pretrained(model_id)
ins = tokenizer(text, return_tensors="pt").to('cuda')
2. Initializing the SmashConfig
Next, initialize the SmashConfig.
[ ]:
from pruna import SmashConfig
# Initialize the SmashConfig
smash_config = SmashConfig()
smash_config['quantizers'] = ['gptq']
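To see where a figure like "4x smaller" can come from, here is a rough back-of-the-envelope sketch. It assumes 16-bit baseline weights and 4-bit quantized weights; the tutorial does not state the bit widths, so both numbers are assumptions, and the parameter count is only the rough value implied by the model name. The helper `weight_bytes` is hypothetical and not part of pruna.

```python
def weight_bytes(num_params: int, bits_per_weight: int) -> float:
    """Approximate storage needed for the weights alone, in bytes."""
    return num_params * bits_per_weight / 8


# Rough parameter count implied by the name opt-125m (assumption)
NUM_PARAMS = 125_000_000

# Assumed bit widths: 16-bit baseline, 4-bit after quantization
baseline_mb = weight_bytes(NUM_PARAMS, 16) / 1e6
quantized_mb = weight_bytes(NUM_PARAMS, 4) / 1e6
ratio = baseline_mb / quantized_mb

print(f"16-bit weights: {baseline_mb:.1f} MB")
print(f"4-bit weights:  {quantized_mb:.1f} MB")
print(f"reduction:      {ratio:.0f}x")
```

In practice the reduction is usually somewhat below this ideal ratio, because quantized checkpoints also store scales and zero points, and some layers may stay in higher precision.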
3. Smashing the model
Now, smash the model. This can take up to 8 minutes on a T4 GPU. Remember to replace the placeholder token with the one provided by PrunaAI.
[ ]:
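from pruna import smash
# Smash the model (keyword names are assumed; check them against your installed pruna version)
smashed_model = smash(
    model=model,
    token='<your_token>',  # replace <your_token> with your actual token, or set to None if you do not have one yet
    smash_config=smash_config,
)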
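Rather than pasting the token into the notebook, you may prefer to read it from the environment. The sketch below is one way to do that; the variable name `PRUNA_TOKEN` and the helper `load_token` are hypothetical and not defined by pruna. It returns None when the variable is unset, which matches the "set to None if you do not have one yet" case.

```python
import os
from typing import Optional


def load_token(var_name: str = "PRUNA_TOKEN") -> Optional[str]:
    """Return the token stored in the environment, or None if unset or blank.

    The variable name is an assumption made for this example.
    """
    value = os.environ.get(var_name, "").strip()
    return value or None


# Pass this value as the token argument when smashing the model
token = load_token()
```

Keeping the token out of the notebook avoids committing it to version control by accident.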
4. Running the model
Finally, run the model to generate the text.
[ ]:
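# Display the result (assumes the smashed model still exposes generate())
print(tokenizer.decode(smashed_model.generate(**ins)[0], skip_special_tokens=True))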
Wrap Up
Congratulations! You have successfully smashed an LLM. You can now use the pruna package to optimize any LLM. Only steps 1 and 4 need to change to fit your use case.