Smashing Large Language Models
===============================

This tutorial demonstrates how to use the `pruna` package to optimize any custom large language model. We will use the `facebook/opt-125m` model as an example.

Loading the LLM
---------------

First, load your LLM and prepare an example input.

.. code-block:: python

    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    model_id = "facebook/opt-125m"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto").cuda()

    text = "The 45th president of the United States of America is"
    ins = tokenizer(text, return_tensors="pt").to('cuda')

Initializing the Smash Config
-----------------------------

Next, initialize the `smash_config`.

.. code-block:: python

    from pruna_engine.SmashConfig import SmashConfig

    # Initialize the SmashConfig
    smash_config = SmashConfig()
    smash_config['task'] = 'text_text_generation'
    smash_config['quantizers'] = ['gptq']
    smash_config['tokenizer_name'] = tokenizer
    smash_config['weight_quantization_bits'] = 4

Smashing the Model
------------------

Now, smash the model.

.. code-block:: python

    from pruna.smash import smash

    # Smash the model
    smashed_model = smash(
        model=model,
        dataloader="WikiText_128",
        api_key='',  # replace with your actual API key
        smash_config=smash_config,
    )

Don't forget to replace the `api_key` with the one provided by PrunaAI.

Running the Model
-----------------

Finally, run the model to generate text.

.. code-block:: python

    # Generate from the example input and display the decoded result
    output = smashed_model.generate(**ins)
    print(tokenizer.decode(output[0], skip_special_tokens=True))

Wrap Up
---------

Congratulations! You have successfully smashed a large language model. You can now use the `pruna` package to optimize any custom LLM. The only parts you need to modify are step 1 and step 4 to fit your use case. Additionally, you can use the `llm-int8` quantizer to quantize the model, or the `transformers-fast` compiler to compile it for an inference speedup.
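
As a rough sketch of what such an alternative configuration might look like: the snippet below reuses the same `SmashConfig` keys as in step 2 and swaps in the `llm-int8` quantizer. The `compilers` key used to select `transformers-fast` is an assumption for illustration; check the `pruna` documentation for the exact option name.

.. code-block:: python

    from pruna_engine.SmashConfig import SmashConfig

    # Alternative configuration sketch: quantize with 'llm-int8' and
    # compile with 'transformers-fast' for faster inference.
    # NOTE: the 'compilers' key is assumed here, not confirmed by this tutorial.
    smash_config = SmashConfig()
    smash_config['task'] = 'text_text_generation'
    smash_config['quantizers'] = ['llm-int8']
    smash_config['compilers'] = ['transformers-fast']
    smash_config['tokenizer_name'] = tokenizer

The rest of the workflow, smashing the model and running generation, stays the same as in steps 3 and 4 above.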