Optimize and Deploy AI Models with Pruna and Hugging Face
Objective: Demonstrate an end-to-end workflow for optimizing the Efficient-Large-Model/Sana_600M_512px_diffusers diffusion model with Pruna and deploying it to the Hugging Face Hub.
Model: Efficient-Large-Model/Sana_600M_512px_diffusers
Dataset: data-is-better-together/open-image-preferences-v1-binarized
To follow along, ensure that you have the Pruna SDK installed along with all required third-party libraries. Running this tutorial in a clean virtual environment is recommended for a smooth setup.
[ ]:
<a target="_blank" href="https://colab.research.google.com/github/PrunaAI/pruna/blob/v0.2.11/docs/tutorials/deploying_sana_tutorial.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
[ ]:
!pip install pruna
[ ]:
!pip install datasets huggingface_hub gradio
You will need to log in to the Hugging Face Hub to access the model weights. We also need to select the best available device for running the notebook. Run the cells below to do both.
[ ]:
from huggingface_hub import notebook_login
notebook_login()
[5]:
import torch
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
1. Smash Configuration
To optimize the model effectively, we first define a SmashConfig that specifies which optimization algorithms to apply. For detailed options and parameter explanations, refer to the SmashConfig guide.
In this tutorial, we will:
Use the torchao quantizer to reduce memory usage during inference.
Save the model to disk and optionally upload the smashed model to the Hugging Face Hub for easy access.
[ ]:
import torch
from diffusers import SanaPipeline
from pruna import PrunaModel, SmashConfig, smash
# 1. Define the model ID
model_id = "Efficient-Large-Model/Sana_600M_512px_diffusers"
# 2. Load the pre-trained model
pipe = SanaPipeline.from_pretrained(model_id, variant="fp16", torch_dtype=torch.float16)
pipe = pipe.to(device)
# 3. Configure Pruna smash
smash_config = SmashConfig()
smash_config["quantizer"] = "torchao"
# 4. Smash (optimize) the model
smashed_pipe = smash(model=pipe, smash_config=smash_config)
# 5. Save the model to disk
smashed_pipe.save_pretrained("Sana_600M_512px_diffusers-smashed")
print("✅ Smashed Sana model uploaded successfully to Hugging Face Hub.")
# 6. Push the smashed pipeline to Hugging Face Hub using save_to_hub
# smashed_pipe.save_to_hub("AINovice2005/Sana_600M_512px_diffusers-smashed")
Multiple distributions found for package optimum. Picked distribution: optimum-quanto
/teamspace/studios/this_studio/.venv/lib/python3.11/site-packages/torch/cuda/__init__.py:61: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
INFO - Using best available device: 'cuda'
INFO - Starting quantizer torchao...
INFO - quantizer torchao was applied successfully.
✅ Smashed Sana model saved to disk.
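Before moving on, it can be helpful to run a quick smoke test on the smashed pipeline and get a rough idea of its peak GPU memory use. The sketch below is optional; the prompt is an arbitrary example and the memory readout only applies on CUDA devices.
[ ]:
# Optional smoke test of the smashed pipeline
import torch

if device == "cuda":
    torch.cuda.reset_peak_memory_stats()

# Any text prompt works here; this one is just an example
image = smashed_pipe(
    "A watercolor painting of a lighthouse at sunset",
    num_inference_steps=25,
    guidance_scale=7.5,
).images[0]
image.save("smashed_sample.png")

if device == "cuda":
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Peak GPU memory during generation: {peak_gb:.2f} GB")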
2. Load and Collate Dataset
In this step, we will load the dataset required for optimizing and evaluating the model. This dataset will provide the input data needed to assess the model’s performance after applying optimization techniques such as quantization.
We will use the data-is-better-together/open-image-preferences-v1-binarized dataset (https://huggingface.co/datasets/data-is-better-together/open-image-preferences-v1-binarized), which contains binarized user image preferences and prompts for image generation tasks. Correctly loading and collating the dataset ensures that the input is properly prepared, enabling smooth evaluation.
[ ]:
from datasets import Image, load_dataset
from pruna.data.pruna_datamodule import PrunaDataModule
from pruna.data.utils import split_train_into_train_val_test
# Load dataset
dataset = load_dataset("data-is-better-together/open-image-preferences-v1-binarized")["train"]
dataset = dataset.rename_column("chosen", "image")
dataset = dataset.rename_column("prompt", "text")
dataset = dataset.cast_column("image", Image())
# Split train into train/val/test
train_ds, val_ds, test_ds = split_train_into_train_val_test(dataset, seed=42)
# Initialize PrunaDataModule
datamodule = PrunaDataModule.from_datasets(
    datasets=(train_ds, val_ds, test_ds),
    collate_fn="image_generation_collate",
    collate_fn_args={"img_size": 512},
)
# Limit datasets to 5 samples each for quick testing
datamodule.limit_datasets(5)
INFO - Loaded only training, splitting train 80/10/10 into train, validation and test...
INFO - Testing compatibility with image_generation_collate...
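To double-check that the collation worked, you can peek at a single batch. This is a minimal sketch; it assumes PrunaDataModule exposes a Lightning-style test_dataloader(), and the first element of each batch is expected to hold the text prompts passed to the pipeline.
[ ]:
# Optional: inspect one collated batch (assumes a Lightning-style test_dataloader())
batch = next(iter(datamodule.test_dataloader()))
# The first element of the batch is expected to contain the text prompts
print(batch[0])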
3. Evaluate the Model
Now that the model and dataset are set up, we can evaluate the model using Pruna's EvaluationAgent. Evaluation measures how the model performs on the given dataset and produces metrics that quantify the impact of the optimization configuration. Below we evaluate the smashed pipeline; a baseline sketch for the unoptimized model follows the results.
[ ]:
# Import required modules from Pruna
from pruna.data.pruna_datamodule import PrunaDataModule
from pruna.engine.utils import move_to_device
from pruna.evaluation.evaluation_agent import EvaluationAgent
from pruna.evaluation.metrics import (
    LatencyMetric,
    TotalTimeMetric,
)
from pruna.evaluation.task import Task
# Load the smashed (optimized) model pipeline from disk
smashed_pipe = PrunaModel.from_pretrained("Sana_600M_512px_diffusers-smashed")
# Define evaluation metrics (example: total time and latency)
metrics = [
    TotalTimeMetric(n_iterations=1, n_warmup_iterations=1),
    LatencyMetric(n_iterations=1, n_warmup_iterations=1),
]
# Define the evaluation task with metrics and datamodule
# (datamodule and device are defined in the cells above)
task = Task(metrics, datamodule=datamodule, device=device)
# Initialize the evaluation agent
eval_agent = EvaluationAgent(task)
# Move smashed model to evaluation device (GPU or CPU)
move_to_device(smashed_pipe, device)
# Evaluate the smashed model pipeline using the evaluation agent
smashed_model_results = eval_agent.evaluate(smashed_pipe)
# Print results for verification
print(smashed_model_results)
INFO - Using best available device: 'cuda'
WARNING - Argument cache_dir not found in config file. Skipping...
WARNING - Model and SmashConfig have different devices. Model: cuda, SmashConfig: cuda:0. Casting model to cuda:0.If this is not desired, please use SmashConfig(device='cuda').
INFO - Starting quantizer torchao...
INFO - quantizer torchao was applied successfully.
INFO - Using best available device: 'cuda'
INFO - Using best available device: 'cuda'
INFO - Using provided list of metric instances.
INFO - Using best available device: 'cuda'
INFO - Evaluating a smashed model.
INFO - Detected diffusers model. Using DiffuserHandler with fixed seed.
- The first element of the batch is passed as input.
- The generated outputs are expected to have .images attribute.
Pipelines loaded with `dtype=torch.float16` cannot run with `cpu` device. It is not recommended to move them to `cpu` as running them will fail. Please make sure to use an accelerator to run the pipeline in inference, due to the lack of support for`float16` operations on this device in PyTorch. Please, remove the `torch_dtype=torch.float16` argument, or use another device for inference.
INFO - Evaluating stateful metrics.
INFO - Evaluating isolated inference metrics.
Pipelines loaded with `dtype=torch.float16` cannot run with `cpu` device. It is not recommended to move them to `cpu` as running them will fail. Please make sure to use an accelerator to run the pipeline in inference, due to the lack of support for`float16` operations on this device in PyTorch. Please, remove the `torch_dtype=torch.float16` argument, or use another device for inference.
[MetricResult(name='total_time', params={'n_iterations': 1, 'n_warmup_iterations': 1, 'device': 'cuda', 'timing_type': 'sync', 'batch_size': 1}, result=12479.54296875), MetricResult(name='latency', params={'n_iterations': 1, 'n_warmup_iterations': 1, 'device': 'cuda', 'timing_type': 'sync', 'batch_size': 1}, result=12479.54296875)]
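The results above cover only the smashed pipeline. To obtain a baseline for comparison, you can run the same task on the unoptimized model. The sketch below is one way to do this; it assumes a plain diffusers pipeline can be wrapped in PrunaModel for evaluation and that enough GPU memory is free to load the base model a second time. Adjust it to your setup and Pruna version.
[ ]:
# Optional baseline: evaluate the unoptimized pipeline with the same task
# (assumes enough free GPU memory to load the base model a second time)
base_pipe = SanaPipeline.from_pretrained(model_id, variant="fp16", torch_dtype=torch.float16)
base_pipe = base_pipe.to(device)

# Wrap the plain pipeline so the EvaluationAgent can handle it (assumed API)
base_model = PrunaModel(model=base_pipe)
base_model_results = eval_agent.evaluate(base_model)
print(base_model_results)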
4. Gradio Demo
Once the model has been optimized, we can deploy the smashed model using Gradio to create an interactive demo. This allows anyone to test the model’s capabilities directly in their browser.
In this section, we will:
Show how to deploy the optimized model with a Gradio demo, including hosting it as a Hugging Face Space (a Space-ready sketch follows the demo output below)
Discuss considerations such as handling queuing, especially if multiple users access the demo simultaneously
Highlight best practices for integrating Gradio demos in your Hugging Face Space to ensure a smooth and responsive user experience
Creating a Gradio demo not only showcases your optimized model effectively but also enables easy sharing and real-world testing by the community.
[ ]:
import gradio as gr
from pruna import PrunaModel
# Load PrunaModel
pipe = PrunaModel.from_pretrained("Sana_600M_512px_diffusers-smashed")
# Inference function
def generate_image(prompt):
    """Generate an image from a given text prompt."""
    result = pipe(prompt, num_inference_steps=25, guidance_scale=7.5)
    return result.images[0]
# Create Gradio interface with queueing enabled
demo = gr.Interface(
    fn=generate_image,
    inputs=gr.Textbox(lines=2, placeholder="Enter your prompt here...", label="Prompt"),
    outputs=gr.Image(type="pil"),
    title="Sana Smashed Text-to-Image Demo",
    description="Generate high-quality images using the smashed Sana diffusion model optimized with Pruna.",
    allow_flagging="never",
)
# Enable queueing to handle multiple users
demo.queue()
# Launch the app
if __name__ == "__main__":
    demo.launch(server_port=7861, share=True)
INFO - Using best available device: 'cuda'
WARNING - Argument cache_dir not found in config file. Skipping...
WARNING - Model and SmashConfig have different devices. Model: cuda, SmashConfig: cuda:0. Casting model to cuda:0.If this is not desired, please use SmashConfig(device='cuda').
INFO - Starting quantizer torchao...
INFO - quantizer torchao was applied successfully.
/teamspace/studios/this_studio/.venv/lib/python3.11/site-packages/gradio/interface.py:415: UserWarning: The `allow_flagging` parameter in `Interface` is deprecated. Use `flagging_mode` instead.
warnings.warn(
/teamspace/studios/this_studio/.venv/lib/python3.11/site-packages/websockets/legacy/__init__.py:6: DeprecationWarning: websockets.legacy is deprecated; see https://websockets.readthedocs.io/en/stable/howto/upgrade.html for upgrade instructions
warnings.warn( # deprecated in 14.0 - 2024-11-09
/teamspace/studios/this_studio/.venv/lib/python3.11/site-packages/uvicorn/protocols/websockets/websockets_impl.py:17: DeprecationWarning: websockets.server.WebSocketServerProtocol is deprecated
from websockets.server import WebSocketServerProtocol
* Running on local URL: http://127.0.0.1:7861
* Running on public URL: https://c6b0d9514c5fa7415e.gradio.live
This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
/teamspace/studios/this_studio/.venv/lib/python3.11/site-packages/gradio/routes.py:1341: DeprecationWarning: 'HTTP_422_UNPROCESSABLE_ENTITY' is deprecated. Use 'HTTP_422_UNPROCESSABLE_CONTENT' instead.
return await queue_join_helper(body, request, username)
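To host the demo as a Hugging Face Space instead of a temporary share link, the script above can be adapted into an app.py (the log output also mentions `gradio deploy` as a shortcut). The sketch below is one possible layout; it assumes the smashed model was pushed to the Hub with save_to_hub under the repo id from section 1, that save_to_hub uploads the same files save_pretrained writes, and that the Space runs on GPU hardware with pruna, diffusers, torch, and gradio listed in its requirements.txt.
[ ]:
# Sketch of an app.py for a Hugging Face Space (GPU hardware recommended)
import gradio as gr
from huggingface_hub import snapshot_download
from pruna import PrunaModel

# Download the smashed model files from the Hub (assumes they were pushed with save_to_hub)
local_dir = snapshot_download(repo_id="AINovice2005/Sana_600M_512px_diffusers-smashed")
pipe = PrunaModel.from_pretrained(local_dir)

def generate_image(prompt):
    """Generate an image from a text prompt."""
    return pipe(prompt, num_inference_steps=25, guidance_scale=7.5).images[0]

demo = gr.Interface(
    fn=generate_image,
    inputs=gr.Textbox(lines=2, label="Prompt"),
    outputs=gr.Image(type="pil"),
    title="Sana Smashed Text-to-Image Demo",
)

demo.queue(max_size=20)  # bound the queue when several users hit the demo at once
demo.launch()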
Conclusions
In this tutorial, we have covered the end-to-end workflow for optimizing and evaluating a text-to-image diffusion model using Pruna.
We began by loading the Sana base model and defining a SmashConfig with the desired optimization algorithms and parameters. We then smashed the base model to obtain an optimized version and measured its performance with the EvaluationAgent.
After optimization, we demonstrated how to deploy the smashed model as an interactive Gradio demo that anyone can try directly in their browser. This end-to-end approach makes it easy to explore trade-offs, iterate on optimization configurations, and deploy robust, production-ready text-to-image models.
Check out our other tutorials for more examples of optimizing and evaluating large language models and text-to-video models with Pruna.