P-Video-Avatar end-to-end
Walk through still → talking-head video with Pruna’s p-video-avatar (speech-driven lip sync from a single image), then extend the same scenario to more languages.
Goal: Build start frames with p-image, run p-video-avatar for a primary clip, optionally replace built-in TTS with your own audio, generate localized variants, and compare runtime and rough cost in one table.
How this maps to the model: the video step accepts an ``image`` URL (your first frame). The strings you use in p-image are your still prompts; ``voice_script``, ``voice_prompt`` (how it is said), and ``video_prompt`` (face, body, background motion) come from the config cell below. Same mental model as the P-Video-Avatar docs—this notebook composes those fields from SCENARIO, LANGUAGE_SCRIPTS, VOICE_PROMPTS, and VIDEO_PROMPTS.
You will:
Generate start frames per variant with p-image, or set ``USE_P_IMAGE_START_FRAME = False`` and reuse ``DIRECT_IMAGE_URL`` for a quicker debug loop.
Render the main English avatar clip (``MAIN_VARIANT``).
Optionally re-run with ``AUDIO_OVERRIDE_URL`` so timing follows uploaded audio.
Render French and Spanish variants (same pattern; localized script + voice).
Review all runs in a summary table.
Setup: set ``REPLICATE_API_TOKEN`` with access to the Pruna Replicate deployments used below. Run the first cell to install replicate and pandas.
[1]:
%pip install -q replicate pandas
Runtime setup
Install dependencies above, then run imports. The token must allow ``prunaai/p-image`` and ``prunaai/p-video-avatar`` predictions on Replicate.
[2]:
# Imports and auth: Replicate Python SDK reads REPLICATE_API_TOKEN from the environment.
import os
import time
import base64
import urllib.request
from dataclasses import dataclass
from typing import Any
import pandas as pd
import replicate
from IPython.display import HTML, display
REPLICATE_API_TOKEN = os.getenv("REPLICATE_API_TOKEN", "")
if not REPLICATE_API_TOKEN:
    raise ValueError("Set REPLICATE_API_TOKEN before running this notebook.")
Configure the scenario
Edit the next cell to change copy, voices, persona profiles, and which variants run. Defaults: English landscape (16:9), French portrait (9:16), Spanish portrait (9:16).
What to adjust:
``SCENARIO`` — shared labels, default ``video_prompt``, ``resolution``, and ``estimated_seconds`` (used only for rough cost hints in the summary table).
``VARIANTS`` — one dict per clip; fields drive ``language``, ``voice``, ``voice_gender`` (aligned with the generated face), ``frame_format`` / aspect, ``persona_style`` (selects motion copy), ``seed``, and ``profile_key`` (identity + look).
``LANGUAGE_SCRIPTS``, ``VOICE_PROMPTS``, ``VIDEO_PROMPTS`` — keep wording consistent when you add or rename locales.
Remove extra ``VARIANTS`` rows for a cheaper run; copy an entry and change ``language`` / ``voice`` to add another market.
[3]:
# --- Scenario configuration (edit here) ---
# SCENARIO: defaults for resolution, rough duration for cost estimates, shared tone hints.
# IDENTITY_PROFILES / PROFILE_VISUAL_STYLES: feed into p-image prompts per profile_key.
# LANGUAGE_SCRIPTS / VOICE_PROMPTS / VIDEO_PROMPTS: per-language script, delivery, and motion lines.
# VARIANTS: one row per rendered clip; MAIN_VARIANT is English baseline for Step 2.
# USE_P_IMAGE_START_FRAME: False = reuse DIRECT_IMAGE_URL for every variant (cheap debug).
# AUDIO_OVERRIDE_URL: non-empty = Step 3 drives lip sync from uploaded audio instead of TTS.
SCENARIO = {
    "name": "learn_pruna_ai_generation",
    "base_script": "Welcome to Pruna AI. You will generate a still, animate it as an avatar, then compare settings for production.",
    "voice_prompt": "Warm, practical, and confident teacher tone with clear pacing.",
    "video_prompt": "Natural body language, steady camera presence, and subtle dynamic office atmosphere while explaining concrete steps.",
    "resolution": "720p",
    "estimated_seconds": 18,
}
IDENTITY_PROFILES = {
    "friendly_mentor": "Friendly AI mentor, approachable expression, medium close-up",
    "creator_coach": "Confident creator coach, energetic presentation style, medium close-up",
    "product_trainer": "Calm product trainer, clear instructional posture, medium close-up",
}
PROFILE_VISUAL_STYLES = {
    "friendly_mentor": "clean professional styling, soft key light, modern product studio, photorealistic skin detail",
    "creator_coach": "high-energy creator look, vibrant color accents, crisp edge lighting, premium camera depth",
    "product_trainer": "minimal modern wardrobe, balanced neutral palette, documentary-grade realism, sharp facial detail",
}
LANGUAGE_SCRIPTS = {
    "English (US)": (
        "Welcome to Pruna AI. Today, we will create your first polished image, animate it into an avatar, "
        "and compare quality, speed, and cost settings so you can choose the right production setup."
    ),
    "French": (
        "Bienvenue sur Pruna AI. Dans cette courte lecon, nous allons creer une image de qualite, "
        "la transformer en avatar video, puis comparer la qualite, la vitesse et le cout pour choisir la meilleure configuration."
    ),
    "Spanish": (
        "Bienvenido a Pruna AI. En esta leccion breve, vas a crear una imagen pulida, convertirla en un avatar en video "
        "y comparar calidad, velocidad y costo para elegir la mejor configuracion de produccion."
    ),
}
VOICE_PROMPTS = {
    "English (US)": "Warm product mentor tone, articulate pacing, confident and reassuring delivery.",
    "French": "Ton pedagogique et rassurant, rythme naturel, diction claire, energie positive.",
    "Spanish": "Tono cercano y didactico, ritmo claro, energia tranquila y convincente.",
}
VIDEO_PROMPTS = {
    "mentor": "Steady camera, natural hand gestures, direct eye contact, polished startup office atmosphere.",
    "coach": "Dynamic framing, energetic but controlled gestures, bright creator-studio look, confident onboarding style.",
    "trainer": "Calm and clear body language, balanced framing, modern workspace background, practical teaching delivery.",
}
VARIANTS = [
    {
        "variant_id": "main_en_us",
        "language": "English (US)",
        "voice": "Zephyr (Female)",
        "voice_gender": "female",
        "persona_style": "mentor",
        "frame_format": "horizontal",
        "profile_key": "friendly_mentor",
        "seed": 42,
    },
    {
        "variant_id": "localized_fr",
        "language": "French",
        "voice": "Kore (Female)",
        "voice_gender": "female",
        "persona_style": "coach",
        "frame_format": "vertical",
        "profile_key": "creator_coach",
        "seed": 101,
    },
    {
        "variant_id": "localized_es",
        "language": "Spanish",
        "voice": "Puck (Male)",
        "voice_gender": "male",
        "persona_style": "trainer",
        "frame_format": "vertical",
        "profile_key": "product_trainer",
        "seed": 102,
    },
]
MAIN_VARIANT = VARIANTS[0]
ADDITIONAL_VARIANTS = VARIANTS[1:]
USE_P_IMAGE_START_FRAME = True
DIRECT_IMAGE_URL = "https://huggingface.co/datasets/pruna-test/documentation-media/resolve/main/prompt_guide/p-video/026_A_group_of_adults_African_white_Asian_sit_at_a_long_table_in_a_cosy_Parisian_living_room_filled_.jpeg?download=true"
AUDIO_OVERRIDE_URL = ""
P_IMAGE_MODEL = "prunaai/p-image"
P_VIDEO_AVATAR_DEPLOYMENT = "prunaai/p-video-avatar"
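Every variant indexes several dictionaries by key, so a typo in ``language``, ``persona_style``, or ``profile_key`` would otherwise show up only as a ``KeyError`` partway through a paid run. The sketch below is one possible pre-flight check. ``find_missing_keys`` is a hypothetical helper that is not part of the notebook, and the small dictionaries are stand-ins for the real config.

```python
from typing import Any

# Hypothetical helper: each lookup is (variant field, table name, table).
Lookup = tuple[str, str, dict[str, Any]]


def find_missing_keys(variants: list[dict[str, Any]], lookups: list[Lookup]) -> list[str]:
    """Return one message per variant field whose value is absent from its table."""
    problems: list[str] = []
    for variant in variants:
        for field, table_name, table in lookups:
            value = variant.get(field)
            if value not in table:
                problems.append(
                    f"{variant.get('variant_id', '?')}: {field}={value!r} missing from {table_name}"
                )
    return problems


# Stand-in data; in the notebook you would pass VARIANTS and the real dictionaries.
demo_variants = [
    {"variant_id": "main_en_us", "language": "English (US)", "persona_style": "mentor"},
    {"variant_id": "localized_de", "language": "German", "persona_style": "mentor"},
]
demo_scripts = {"English (US)": "Welcome to Pruna AI."}
demo_video_prompts = {"mentor": "Steady camera, natural hand gestures."}

issues = find_missing_keys(
    demo_variants,
    [
        ("language", "LANGUAGE_SCRIPTS", demo_scripts),
        ("persona_style", "VIDEO_PROMPTS", demo_video_prompts),
    ],
)
print(issues)
```

With the real config you would likely also list ``("language", "VOICE_PROMPTS", VOICE_PROMPTS)`` and the two ``profile_key`` tables, then stop before Step 1 if the returned list is non-empty.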
[4]:
# Helpers: normalize Replicate outputs to a URL, fetch bytes for inline HTML video, time each deployment call.
@dataclass
class RunResult:
    scenario: str
    variant: str
    status: str
    output_url: str
    output_data_uri_preview: str
    persona_profile: str
    elapsed_seconds: float
    estimated_cost_usd: float

def extract_output_url(output: Any) -> str:
    if isinstance(output, str):
        return output
    if isinstance(output, list) and output:
        first_value = output[0]
        return str(first_value) if first_value is not None else ""
    return "" if output is None else str(output)

def is_http_url(value: str) -> bool:
    return isinstance(value, str) and value.startswith(("http://", "https://"))

def url_to_data_uri(url: str, mime_type: str) -> str:
    with urllib.request.urlopen(url) as response:
        payload = response.read()
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

def show_video_data_uri(data_uri: str) -> None:
    display(HTML(f"<video controls style='width: 100%; max-width: 860px;' src='{data_uri}'></video>"))

def run_prediction(deployment_name: str, payload: dict[str, Any]) -> tuple[str, float]:
    deployment = replicate.deployments.get(deployment_name)
    start = time.perf_counter()
    prediction = deployment.predictions.create(input=payload)
    prediction.wait()
    elapsed = time.perf_counter() - start
    return extract_output_url(prediction.output), elapsed
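To see what the normalization helper returns for different output shapes, it can be exercised offline. The sketch below repeats ``extract_output_url`` and ``is_http_url`` so that it runs on its own. ``FakeFileOutput`` is a hypothetical stand-in for SDK objects whose ``str()`` yields a URL; whether your installed SDK version returns such objects is an assumption, not something the notebook checks.

```python
from typing import Any


def extract_output_url(output: Any) -> str:
    # Same logic as the notebook helper, repeated so this block is self-contained.
    if isinstance(output, str):
        return output
    if isinstance(output, list) and output:
        first_value = output[0]
        return str(first_value) if first_value is not None else ""
    return "" if output is None else str(output)


def is_http_url(value: str) -> bool:
    return isinstance(value, str) and value.startswith(("http://", "https://"))


class FakeFileOutput:
    """Hypothetical stand-in for an SDK object that stringifies to its URL."""

    def __init__(self, url: str) -> None:
        self.url = url

    def __str__(self) -> str:
        return self.url


cases = {
    "plain string": "https://example.com/clip.mp4",
    "list of strings": ["https://example.com/clip.mp4", "https://example.com/other.mp4"],
    "file-like object": FakeFileOutput("https://example.com/clip.mp4"),
    "empty list": [],
    "none": None,
}
normalized = {label: extract_output_url(value) for label, value in cases.items()}
for label, url in normalized.items():
    print(f"{label:18} -> {url!r} (usable: {is_http_url(url)})")
```

An empty list falls through to ``str(output)`` and comes back as the string ``"[]"``, which is why the later cells gate on ``is_http_url`` rather than on a non-empty result.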
Step 1 — Start frames
Each variant gets its own still so identity, aspect ratio, and locale stay aligned.
When ``USE_P_IMAGE_START_FRAME`` is True: the notebook calls ``prunaai/p-image`` on Replicate with a composed prompt (IDENTITY_PROFILES + PROFILE_VISUAL_STYLES + language/persona hints). The returned URL is what we pass as ``image`` to p-video-avatar—there is no separate start_image_prompt API field; the prompt exists only to generate that file.
When ``USE_P_IMAGE_START_FRAME`` is False: every variant reuses ``DIRECT_IMAGE_URL`` (fast for debugging; you lose per-locale framing).
Quality: favor single subject, clear light, and explicit horizontal vs vertical intent—same habits as the P-Image and P-Video-Avatar documentation. The preview table shows the exact prompt string and script snippet per variant.
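The prompt composition can be tried without calling the API. The following is a simplified sketch rather than the notebook's exact prompt: ``compose_image_prompt`` and ``aspect_ratio_for`` are hypothetical names, the prompt template is shortened, and raising on an unknown ``frame_format`` is a stricter choice than the notebook, which treats anything other than ``horizontal`` as 9:16.

```python
from typing import Any

ASPECT_RATIOS = {"horizontal": "16:9", "vertical": "9:16"}


def aspect_ratio_for(frame_format: str) -> str:
    """Map a frame_format label to an aspect ratio, rejecting unknown labels."""
    try:
        return ASPECT_RATIOS[frame_format]
    except KeyError:
        raise ValueError(f"Unknown frame_format: {frame_format!r}") from None


def compose_image_prompt(variant: dict[str, Any], identity: str, visual_style: str) -> str:
    """Join identity, persona, language, framing, and look into one still prompt."""
    return (
        f"{identity}, {variant['persona_style']} presenter, "
        f"speaking {variant['language']}, {variant['frame_format']} framing, "
        f"{visual_style}"
    )


# Values copied from the main English variant in the config cell.
variant = {
    "variant_id": "main_en_us",
    "language": "English (US)",
    "persona_style": "mentor",
    "frame_format": "horizontal",
}
prompt = compose_image_prompt(
    variant,
    identity="Friendly AI mentor, approachable expression, medium close-up",
    visual_style="clean professional styling, soft key light, modern product studio",
)
ratio = aspect_ratio_for(variant["frame_format"])
print(ratio)
print(prompt)
```

Keeping the composition in a pure function makes it easy to review prompt wording for every variant before spending any p-image runs.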
[5]:
# Step 1: Build one still per VARIANT (or reuse DIRECT_IMAGE_URL). Output URLs become the `image` input for p-video-avatar.
start_images_by_variant: dict[str, str] = {}
start_images_data_uri: dict[str, str] = {}
prompt_rows: list[dict[str, str]] = []
if USE_P_IMAGE_START_FRAME:
    for variant in VARIANTS:
        aspect_ratio = "16:9" if variant["frame_format"] == "horizontal" else "9:16"
        profile_desc = IDENTITY_PROFILES[variant["profile_key"]]
        visual_style = PROFILE_VISUAL_STYLES[variant["profile_key"]]
        language_script = LANGUAGE_SCRIPTS[variant["language"]]
        prompt = (
            f"{profile_desc}, {variant['persona_style']} presenter for a Pruna AI onboarding lesson, "
            f"speaking {variant['language']}, {variant['voice_gender']}-voiced persona, {variant['frame_format']} framing, "
            f"{visual_style}, expressive but natural micro-expressions, cinematic contrast, clean skin texture, "
            f"high-end commercial realism, implied script context: {language_script}"
        )
        image_output = replicate.run(P_IMAGE_MODEL, input={"prompt": prompt, "aspect_ratio": aspect_ratio})
        image_url = extract_output_url(image_output)
        start_images_by_variant[variant["variant_id"]] = image_url
        start_images_data_uri[variant["variant_id"]] = url_to_data_uri(image_url, "image/jpeg")
        prompt_rows.append(
            {
                "variant_id": variant["variant_id"],
                "language": variant["language"],
                "voice": variant["voice"],
                "voice_gender": variant["voice_gender"],
                "profile_key": variant["profile_key"],
                "persona_style": variant["persona_style"],
                "frame_format": variant["frame_format"],
                "aspect_ratio": aspect_ratio,
                "generated_image_prompt": prompt,
                "voice_script_preview": language_script[:120] + "...",
            }
        )
else:
    for variant in VARIANTS:
        start_images_by_variant[variant["variant_id"]] = DIRECT_IMAGE_URL
        start_images_data_uri[variant["variant_id"]] = url_to_data_uri(DIRECT_IMAGE_URL, "image/jpeg")
        prompt_rows.append(
            {
                "variant_id": variant["variant_id"],
                "language": variant["language"],
                "voice": variant["voice"],
                "voice_gender": variant["voice_gender"],
                "profile_key": variant["profile_key"],
                "persona_style": variant["persona_style"],
                "frame_format": variant["frame_format"],
                "aspect_ratio": "16:9" if variant["frame_format"] == "horizontal" else "9:16",
                "generated_image_prompt": "external_url_used",
                "voice_script_preview": LANGUAGE_SCRIPTS[variant["language"]][:120] + "...",
            }
        )
prompt_df = pd.DataFrame(prompt_rows)
variant_df = pd.DataFrame(VARIANTS)
display(prompt_df)
display(variant_df)
| | variant_id | language | voice | voice_gender | profile_key | persona_style | frame_format | aspect_ratio | generated_image_prompt | voice_script_preview |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | main_en_us | English (US) | Zephyr (Female) | female | friendly_mentor | mentor | horizontal | 16:9 | Friendly AI mentor, approachable expression, m... | Welcome to Pruna AI. Today, we will create you... |
| 1 | localized_fr | French | Kore (Female) | female | creator_coach | coach | vertical | 9:16 | Confident creator coach, energetic presentatio... | Bienvenue sur Pruna AI. Dans cette courte leco... |
| 2 | localized_es | Spanish | Puck (Male) | male | product_trainer | trainer | vertical | 9:16 | Calm product trainer, clear instructional post... | Bienvenido a Pruna AI. En esta leccion breve, ... |
| | variant_id | language | voice | voice_gender | persona_style | frame_format | profile_key | seed |
|---|---|---|---|---|---|---|---|---|
| 0 | main_en_us | English (US) | Zephyr (Female) | female | mentor | horizontal | friendly_mentor | 42 |
| 1 | localized_fr | French | Kore (Female) | female | coach | vertical | creator_coach | 101 |
| 2 | localized_es | Spanish | Puck (Male) | male | trainer | vertical | product_trainer | 102 |
Step 2 — Main avatar clip
Calls ``prunaai/p-video-avatar`` for ``MAIN_VARIANT`` (English). Inputs include ``image`` from Step 1, ``voice_script`` / ``voice_prompt`` / ``video_prompt``, plus ``voice``, ``voice_language``, ``resolution``, and ``seed``.
Validate here first: if lip sync or motion fails, tighten the still (Step 1) and ``video_prompt`` (fixed camera, less aggressive background motion) before spending runs on other languages. Each variant's ``persona_style`` (mentor / coach / trainer) selects which ``VIDEO_PROMPTS`` line goes to the API.
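The inputs listed above can be assembled by a small pure function, which keeps Steps 2 and 4 on the same payload shape. This is a sketch: ``build_avatar_payload`` is a hypothetical helper, the field names are the ones the notebook sends, and the image URL is a placeholder.

```python
from typing import Any


def build_avatar_payload(
    variant: dict[str, Any],
    image_url: str,
    scripts: dict[str, str],
    voice_prompts: dict[str, str],
    video_prompts: dict[str, str],
    resolution: str = "720p",
) -> dict[str, Any]:
    """Compose the p-video-avatar input dict for one variant."""
    language = variant["language"]
    return {
        "image": image_url,
        "voice_script": scripts[language],
        "voice": variant["voice"],
        "voice_language": language,
        "voice_prompt": voice_prompts[language],
        # persona_style picks the motion line, e.g. "coach".
        "video_prompt": video_prompts[variant["persona_style"]],
        "resolution": resolution,
        "seed": variant["seed"],
    }


# Stand-in config with shortened strings; the notebook passes its real dictionaries.
variant = {
    "variant_id": "localized_fr",
    "language": "French",
    "voice": "Kore (Female)",
    "persona_style": "coach",
    "seed": 101,
}
payload = build_avatar_payload(
    variant,
    image_url="https://example.com/start_frame.jpg",  # placeholder URL
    scripts={"French": "Bienvenue sur Pruna AI."},
    voice_prompts={"French": "Ton pedagogique et rassurant."},
    video_prompts={"coach": "Dynamic framing, energetic but controlled gestures."},
)
print(payload)
```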
[6]:
# Step 2: Primary English clip — sanity-check this before burning runs on other locales.
results: list[RunResult] = []
main_payload = {
    "image": start_images_by_variant[MAIN_VARIANT["variant_id"]],
    "voice_script": LANGUAGE_SCRIPTS[MAIN_VARIANT["language"]],
    "voice": MAIN_VARIANT["voice"],
    "voice_language": MAIN_VARIANT["language"],
    "voice_prompt": VOICE_PROMPTS[MAIN_VARIANT["language"]],
    "video_prompt": VIDEO_PROMPTS[MAIN_VARIANT["persona_style"]],
    "resolution": SCENARIO["resolution"],
    "seed": MAIN_VARIANT["seed"],
    "disable_safety_filter": True,
    "disable_prompt_upsampling": False,
}
main_url, main_elapsed = run_prediction(P_VIDEO_AVATAR_DEPLOYMENT, main_payload)
if is_http_url(main_url):
    main_data_uri = url_to_data_uri(main_url, "video/mp4")
    main_data_uri_preview = main_data_uri[:80] + "..."
    main_status = "succeeded"
else:
    main_data_uri = ""
    main_data_uri_preview = ""
    main_status = "failed"
main_cost = SCENARIO["estimated_seconds"] * (0.025 if SCENARIO["resolution"] == "720p" else 0.045)
results.append(
    RunResult(
        scenario=SCENARIO["name"],
        variant=MAIN_VARIANT["variant_id"],
        status=main_status,
        output_url=main_url,
        output_data_uri_preview=main_data_uri_preview,
        persona_profile=MAIN_VARIANT["profile_key"],
        elapsed_seconds=main_elapsed,
        estimated_cost_usd=main_cost,
    )
)
print(
    f"Main avatar generated in {main_elapsed:.1f}s with {MAIN_VARIANT['language']} / {MAIN_VARIANT['voice']} / {MAIN_VARIANT['profile_key']}"
)
if is_http_url(main_url):
    show_video_data_uri(main_data_uri)
else:
    print("Main variant did not return a downloadable URL.")
Main avatar generated in 31.4s with English (US) / Zephyr (Female) / friendly_mentor
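The rough cost figure in the cell above is ``estimated_seconds`` multiplied by a per-second rate. The sketch below pulls that arithmetic into one place. The rates are the values hard-coded in this notebook's cells (0.025 at 720p, 0.045 otherwise) and should be read as planning hints, not verified pricing; ``estimate_cost_usd`` is a hypothetical helper.

```python
# Per-second rates as hard-coded in the notebook cells; assumptions, not verified pricing.
RATE_PER_SECOND_USD = {"720p": 0.025, "1080p": 0.045}


def estimate_cost_usd(seconds: float, resolution: str) -> float:
    """Return a rough cost hint: clip length in seconds times the per-second rate."""
    if resolution not in RATE_PER_SECOND_USD:
        raise ValueError(f"No rate configured for resolution {resolution!r}")
    return round(seconds * RATE_PER_SECOND_USD[resolution], 4)


cost_720p = estimate_cost_usd(18, "720p")
cost_1080p = estimate_cost_usd(18, "1080p")
# Three 720p clips (English, French, Spanish) at the default 18 s estimate.
total_three_clips = round(3 * cost_720p, 4)

print(f"18 s at 720p:  ${cost_720p:.2f}")
print(f"18 s at 1080p: ${cost_1080p:.2f}")
print(f"Three 720p clips: ${total_three_clips:.2f}")
```

Because ``estimated_seconds`` is a fixed guess rather than the measured clip length, the result is only as good as that guess.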
Step 3 — Optional uploaded audio
Set ``AUDIO_OVERRIDE_URL`` to a public audio file URL to drive speech timing from your recording instead of built-in TTS. The model uses ``audio`` for lip sync; ``voice_script`` may still be present for logging but timing follows your file.
Leave ``AUDIO_OVERRIDE_URL`` empty (default) to skip this step entirely.
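The on/off behaviour can be expressed as a small function that adds ``audio`` only when a URL is set. This is a sketch under the assumption, taken from the text above, that the model accepts ``audio`` alongside ``voice_script``; ``with_optional_audio`` is a hypothetical helper and the URLs are placeholders.

```python
from typing import Any


def with_optional_audio(base_payload: dict[str, Any], audio_url: str) -> dict[str, Any]:
    """Return a copy of the payload, adding `audio` only when a URL is provided."""
    payload = dict(base_payload)  # shallow copy so the TTS payload stays reusable
    if audio_url:
        if not audio_url.startswith(("http://", "https://")):
            raise ValueError("audio_url must be a public http(s) URL")
        payload["audio"] = audio_url
    return payload


base = {
    "image": "https://example.com/start_frame.jpg",  # placeholder URL
    "voice_script": "Welcome to Pruna AI.",
    "resolution": "720p",
}
tts_payload = with_optional_audio(base, "")
audio_payload = with_optional_audio(base, "https://example.com/narration.wav")  # placeholder URL

print(sorted(tts_payload))
print(sorted(audio_payload))
```

Copying the dictionary instead of mutating it lets the same base payload serve both the TTS run and the uploaded-audio run.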
[7]:
# Step 3 (optional): Same still + uploaded audio URL — lip sync follows the audio file when enabled.
if AUDIO_OVERRIDE_URL:
    audio_payload = {
        "image": start_images_by_variant[MAIN_VARIANT["variant_id"]],
        "audio": AUDIO_OVERRIDE_URL,
        "voice_script": "Ignored because audio override is present.",
        "video_prompt": "Calm and steady delivery with minimal body movement.",
        "resolution": "1080p",
        "seed": 101,
    }
    audio_url, audio_elapsed = run_prediction(P_VIDEO_AVATAR_DEPLOYMENT, audio_payload)
    # Guard against a missing output URL, mirroring Steps 2 and 4.
    if is_http_url(audio_url):
        audio_data_uri = url_to_data_uri(audio_url, "video/mp4")
        audio_data_uri_preview = audio_data_uri[:80] + "..."
        audio_status = "succeeded"
    else:
        audio_data_uri = ""
        audio_data_uri_preview = ""
        audio_status = "failed"
    audio_cost = SCENARIO["estimated_seconds"] * 0.045
    results.append(
        RunResult(
            scenario=SCENARIO["name"],
            variant="audio_override_1080p",
            status=audio_status,
            output_url=audio_url,
            output_data_uri_preview=audio_data_uri_preview,
            persona_profile=MAIN_VARIANT["profile_key"],
            elapsed_seconds=audio_elapsed,
            estimated_cost_usd=audio_cost,
        )
    )
    print(f"Audio override avatar generated in {audio_elapsed:.1f}s")
    if is_http_url(audio_url):
        show_video_data_uri(audio_data_uri)
    else:
        print("Audio override variant did not return a downloadable URL.")
else:
    print("Skipping audio override variant. Set AUDIO_OVERRIDE_URL to enable.")
Skipping audio override variant. Set AUDIO_OVERRIDE_URL to enable.
Step 4 — More languages
Iterates ``ADDITIONAL_VARIANTS`` (every row after the English ``MAIN_VARIANT``). Each call passes that variant’s ``image``, localized ``voice_script``, ``voice_language``, ``voice``, and the matching ``VOICE_PROMPTS`` / ``VIDEO_PROMPTS`` lines.
Localization: write each ``LANGUAGE_SCRIPTS`` entry in the target language—do not paste English into a foreign voice. Keep voice gender consistent with the generated face when possible.
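Both localization rules can be checked mechanically before any video run. The sketch below assumes voice labels end in ``(Female)`` or ``(Male)``, as the three voices in this notebook do; other labels are simply skipped. ``voice_label_gender`` and ``check_variant`` are hypothetical helpers.

```python
import re
from typing import Any


def voice_label_gender(voice: str) -> str:
    """Extract 'female' or 'male' from labels such as 'Kore (Female)'; '' if absent."""
    match = re.search(r"\((Female|Male)\)\s*$", voice)
    return match.group(1).lower() if match else ""


def check_variant(
    variant: dict[str, Any],
    scripts: dict[str, str],
    reference_language: str = "English (US)",
) -> list[str]:
    """Return warnings for a gender mismatch or an untranslated script."""
    warnings: list[str] = []
    label_gender = voice_label_gender(variant["voice"])
    if label_gender and label_gender != variant["voice_gender"]:
        warnings.append("voice_gender does not match the voice label")
    language = variant["language"]
    if language != reference_language and scripts.get(language) == scripts.get(reference_language):
        warnings.append("script is identical to the reference-language script")
    return warnings


# Stand-in scripts; the Spanish entry is deliberately left in English to trigger a warning.
scripts = {
    "English (US)": "Welcome to Pruna AI.",
    "French": "Bienvenue sur Pruna AI.",
    "Spanish": "Welcome to Pruna AI.",
}
ok_variant = {"language": "French", "voice": "Kore (Female)", "voice_gender": "female"}
bad_variant = {"language": "Spanish", "voice": "Puck (Male)", "voice_gender": "female"}

ok_warnings = check_variant(ok_variant, scripts)
bad_warnings = check_variant(bad_variant, scripts)
print(ok_warnings)
print(bad_warnings)
```

An exact-match comparison only catches a script that was pasted unchanged; it does not detect partially translated text.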
[8]:
# Step 4: Remaining locales — same API shape; swap script, voice, and voice_prompt per LANGUAGE_SCRIPTS / VOICE_PROMPTS.
for variant in ADDITIONAL_VARIANTS:
    payload = {
        "image": start_images_by_variant[variant["variant_id"]],
        "voice_script": LANGUAGE_SCRIPTS[variant["language"]],
        "voice": variant["voice"],
        "voice_language": variant["language"],
        "voice_prompt": VOICE_PROMPTS[variant["language"]],
        "video_prompt": VIDEO_PROMPTS[variant["persona_style"]],
        "resolution": "720p",
        "seed": variant["seed"],
    }
    output_url, elapsed = run_prediction(P_VIDEO_AVATAR_DEPLOYMENT, payload)
    if is_http_url(output_url):
        output_data_uri = url_to_data_uri(output_url, "video/mp4")
        output_data_uri_preview = output_data_uri[:80] + "..."
        status = "succeeded"
    else:
        output_data_uri = ""
        output_data_uri_preview = ""
        status = "failed"
    results.append(
        RunResult(
            scenario=SCENARIO["name"],
            variant=variant["variant_id"],
            status=status,
            output_url=output_url,
            output_data_uri_preview=output_data_uri_preview,
            persona_profile=variant["profile_key"],
            elapsed_seconds=elapsed,
            estimated_cost_usd=SCENARIO["estimated_seconds"] * 0.025,
        )
    )
    print(
        f"Generated {variant['variant_id']} in {elapsed:.1f}s with "
        f"{variant['language']} / {variant['voice']} / {variant['profile_key']} ({variant['frame_format']})"
    )
    if is_http_url(output_url):
        show_video_data_uri(output_data_uri)
    else:
        print(f"Variant {variant['variant_id']} did not return a downloadable URL.")
Generated localized_fr in 40.3s with French / Kore (Female) / creator_coach (vertical)