P-Video-Avatar end-to-end

Walk through still → talking-head video with Pruna’s p-video-avatar (speech-driven lip sync from a single image), then extend the same scenario to more languages.

Goal: Build start frames with p-image, run p-video-avatar for a primary clip, optionally replace built-in TTS with your own audio, generate localized variants, and compare runtime and rough cost in one table.

How this maps to the model: the video step accepts an ``image`` URL (your first frame). The strings you pass to p-image are your still prompts; ``voice_script`` (what is said), ``voice_prompt`` (how it is said), and ``video_prompt`` (face, body, background motion) come from the config cell below. This follows the same mental model as the P-Video-Avatar docs: the notebook composes those fields from LANGUAGE_SCRIPTS, VOICE_PROMPTS, and VIDEO_PROMPTS, while SCENARIO supplies ``resolution`` and the duration used for cost estimates.

You will:

  1. Generate start frames per variant with p-image, or set USE_P_IMAGE_START_FRAME = False and reuse ``DIRECT_IMAGE_URL`` for a quicker debug loop.

  2. Render the main English avatar clip (MAIN_VARIANT).

  3. Optionally re-run with ``AUDIO_OVERRIDE_URL`` so timing follows uploaded audio.

  4. Render French and Spanish variants (same pattern; localized script + voice).

  5. Review all runs in a summary table.

Setup: set ``REPLICATE_API_TOKEN`` with access to the Pruna Replicate deployments used below. Run the first cell to install replicate and pandas.

[1]:
%pip install -q replicate pandas
Note: you may need to restart the kernel to use updated packages.

Runtime setup

Install dependencies above, then run imports. The token must allow ``prunaai/p-image`` and ``prunaai/p-video-avatar`` predictions on Replicate.

[2]:
# Imports and auth: Replicate Python SDK reads REPLICATE_API_TOKEN from the environment.

import os
import time
import base64
import urllib.request
from dataclasses import dataclass
from typing import Any

import pandas as pd
import replicate
from IPython.display import HTML, display

REPLICATE_API_TOKEN = os.getenv("REPLICATE_API_TOKEN", "")
if not REPLICATE_API_TOKEN:
    raise ValueError("Set REPLICATE_API_TOKEN before running this notebook.")

Configure the scenario

Edit the next cell to change copy, voices, persona profiles, and which variants run. Defaults: English landscape (16:9), French portrait (9:16), Spanish portrait (9:16).

What to adjust:

  • ``SCENARIO``: shared labels, ``resolution``, and ``estimated_seconds`` (used only for rough cost hints in the summary table). Its ``base_script``, ``voice_prompt``, and ``video_prompt`` entries are reference defaults; the cells below read the per-language and per-persona dictionaries instead.

  • ``VARIANTS``: one dict per clip; fields drive ``language``, ``voice``, ``voice_gender`` (aligned with the generated face), ``frame_format`` / aspect, ``persona_style`` (selects motion copy), ``seed``, and ``profile_key`` (identity + look).

  • ``LANGUAGE_SCRIPTS``, ``VOICE_PROMPTS``, ``VIDEO_PROMPTS``: keep wording consistent when you add or rename locales.

Remove extra ``VARIANTS`` rows for a cheaper run; copy an entry and change ``language`` / ``voice`` to add another market.
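
Copying a ``VARIANTS`` entry for a new market only works if the matching dictionary keys exist. The sketch below is not part of the original notebook: ``find_config_gaps`` is a hypothetical helper, and the sample dictionaries are trimmed stand-ins that assume the same key layout as the config cell that follows.

```python
from typing import Any


def find_config_gaps(
    variants: list[dict[str, Any]],
    language_scripts: dict[str, str],
    voice_prompts: dict[str, str],
    video_prompts: dict[str, str],
    identity_profiles: dict[str, str],
) -> list[str]:
    """Return one message per dictionary lookup that a variant would fail at run time."""
    problems: list[str] = []
    for variant in variants:
        vid = variant.get("variant_id", "<missing variant_id>")
        checks = [
            ("language", language_scripts, "LANGUAGE_SCRIPTS"),
            ("language", voice_prompts, "VOICE_PROMPTS"),
            ("persona_style", video_prompts, "VIDEO_PROMPTS"),
            ("profile_key", identity_profiles, "IDENTITY_PROFILES"),
        ]
        for field, table, table_name in checks:
            if variant.get(field) not in table:
                problems.append(f"{vid}: {field}={variant.get(field)!r} has no entry in {table_name}")
    return problems


# Trimmed stand-ins for the notebook's config (assumption: same key layout).
scripts = {"English (US)": "Welcome to Pruna AI.", "French": "Bienvenue sur Pruna AI."}
voices = {"English (US)": "Warm tone.", "French": "Ton rassurant."}
motions = {"mentor": "Steady camera.", "coach": "Dynamic framing."}
profiles = {"friendly_mentor": "Friendly AI mentor", "creator_coach": "Confident creator coach"}

variants = [
    {"variant_id": "main_en_us", "language": "English (US)", "persona_style": "mentor", "profile_key": "friendly_mentor"},
    # A copied entry pointing at a locale that has no script or voice prompt yet.
    {"variant_id": "localized_de", "language": "German", "persona_style": "coach", "profile_key": "creator_coach"},
]

gaps = find_config_gaps(variants, scripts, voices, motions, profiles)
for line in gaps:
    print(line)
```

Running a check like this before Step 1 reports the missing German script and voice prompt without spending a prediction.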

[3]:
# --- Scenario configuration (edit here) ---
# SCENARIO: defaults for resolution, rough duration for cost estimates, shared tone hints.
# IDENTITY_PROFILES / PROFILE_VISUAL_STYLES: feed into p-image prompts per profile_key.
# LANGUAGE_SCRIPTS / VOICE_PROMPTS / VIDEO_PROMPTS: per-language script, delivery, and motion lines.
# VARIANTS: one row per rendered clip; MAIN_VARIANT is English baseline for Step 2.
# USE_P_IMAGE_START_FRAME: False = reuse DIRECT_IMAGE_URL for every variant (cheap debug).
# AUDIO_OVERRIDE_URL: non-empty = Step 3 drives lip sync from uploaded audio instead of TTS.

SCENARIO = {
    "name": "learn_pruna_ai_generation",
    "base_script": "Welcome to Pruna AI. You will generate a still, animate it as an avatar, then compare settings for production.",
    "voice_prompt": "Warm, practical, and confident teacher tone with clear pacing.",
    "video_prompt": "Natural body language, steady camera presence, and subtle dynamic office atmosphere while explaining concrete steps.",
    "resolution": "720p",
    "estimated_seconds": 18,
}

IDENTITY_PROFILES = {
    "friendly_mentor": "Friendly AI mentor, approachable expression, medium close-up",
    "creator_coach": "Confident creator coach, energetic presentation style, medium close-up",
    "product_trainer": "Calm product trainer, clear instructional posture, medium close-up",
}

PROFILE_VISUAL_STYLES = {
    "friendly_mentor": "clean professional styling, soft key light, modern product studio, photorealistic skin detail",
    "creator_coach": "high-energy creator look, vibrant color accents, crisp edge lighting, premium camera depth",
    "product_trainer": "minimal modern wardrobe, balanced neutral palette, documentary-grade realism, sharp facial detail",
}

LANGUAGE_SCRIPTS = {
    "English (US)": (
        "Welcome to Pruna AI. Today, we will create your first polished image, animate it into an avatar, "
        "and compare quality, speed, and cost settings so you can choose the right production setup."
    ),
    "French": (
        "Bienvenue sur Pruna AI. Dans cette courte leçon, nous allons créer une image de qualité, "
        "la transformer en avatar vidéo, puis comparer la qualité, la vitesse et le coût pour choisir la meilleure configuration."
    ),
    "Spanish": (
        "Bienvenido a Pruna AI. En esta lección breve, vas a crear una imagen pulida, convertirla en un avatar en video "
        "y comparar calidad, velocidad y costo para elegir la mejor configuración de producción."
    ),
}

VOICE_PROMPTS = {
    "English (US)": "Warm product mentor tone, articulate pacing, confident and reassuring delivery.",
    "French": "Ton pédagogique et rassurant, rythme naturel, diction claire, énergie positive.",
    "Spanish": "Tono cercano y didáctico, ritmo claro, energía tranquila y convincente.",
}

VIDEO_PROMPTS = {
    "mentor": "Steady camera, natural hand gestures, direct eye contact, polished startup office atmosphere.",
    "coach": "Dynamic framing, energetic but controlled gestures, bright creator-studio look, confident onboarding style.",
    "trainer": "Calm and clear body language, balanced framing, modern workspace background, practical teaching delivery.",
}

VARIANTS = [
    {
        "variant_id": "main_en_us",
        "language": "English (US)",
        "voice": "Zephyr (Female)",
        "voice_gender": "female",
        "persona_style": "mentor",
        "frame_format": "horizontal",
        "profile_key": "friendly_mentor",
        "seed": 42,
    },
    {
        "variant_id": "localized_fr",
        "language": "French",
        "voice": "Kore (Female)",
        "voice_gender": "female",
        "persona_style": "coach",
        "frame_format": "vertical",
        "profile_key": "creator_coach",
        "seed": 101,
    },
    {
        "variant_id": "localized_es",
        "language": "Spanish",
        "voice": "Puck (Male)",
        "voice_gender": "male",
        "persona_style": "trainer",
        "frame_format": "vertical",
        "profile_key": "product_trainer",
        "seed": 102,
    },
]

MAIN_VARIANT = VARIANTS[0]
ADDITIONAL_VARIANTS = VARIANTS[1:]

USE_P_IMAGE_START_FRAME = True
DIRECT_IMAGE_URL = "https://huggingface.co/datasets/pruna-test/documentation-media/resolve/main/prompt_guide/p-video/026_A_group_of_adults_African_white_Asian_sit_at_a_long_table_in_a_cosy_Parisian_living_room_filled_.jpeg?download=true"
AUDIO_OVERRIDE_URL = ""

P_IMAGE_MODEL = "prunaai/p-image"
P_VIDEO_AVATAR_DEPLOYMENT = "prunaai/p-video-avatar"

[4]:
# Helpers: normalize Replicate outputs to a URL, fetch bytes for inline HTML video, time each deployment call.

@dataclass
class RunResult:
    scenario: str
    variant: str
    status: str
    output_url: str
    output_data_uri_preview: str
    persona_profile: str
    elapsed_seconds: float
    estimated_cost_usd: float

def extract_output_url(output: Any) -> str:
    if isinstance(output, str):
        return output
    if isinstance(output, list) and output:
        first_value = output[0]
        return str(first_value) if first_value is not None else ""
    return "" if output is None else str(output)

def is_http_url(value: str) -> bool:
    return isinstance(value, str) and value.startswith(("http://", "https://"))

def url_to_data_uri(url: str, mime_type: str) -> str:
    with urllib.request.urlopen(url) as response:
        payload = response.read()
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

def show_video_data_uri(data_uri: str) -> None:
    display(HTML(f"<video controls style='width: 100%; max-width: 860px;' src='{data_uri}'></video>"))

def run_prediction(deployment_name: str, payload: dict[str, Any]) -> tuple[str, float]:
    deployment = replicate.deployments.get(deployment_name)
    start = time.perf_counter()
    prediction = deployment.predictions.create(input=payload)
    prediction.wait()
    elapsed = time.perf_counter() - start
    return extract_output_url(prediction.output), elapsed
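
The cells below pass a fixed MIME type (``"image/jpeg"`` or ``"video/mp4"``) to ``url_to_data_uri``. If an output ever arrives in another format, the data URI label would be wrong. A minimal sketch of a fallback, using only the standard library: ``guess_mime_type`` is a hypothetical helper, the URLs are placeholders, and the approach assumes the output URL path ends in a file extension.

```python
import mimetypes
from urllib.parse import urlparse


def guess_mime_type(url: str, default: str) -> str:
    """Guess a MIME type from the URL path's extension, ignoring any query string."""
    path = urlparse(url).path
    guessed, _encoding = mimetypes.guess_type(path)
    # Fall back to the caller's default when the path has no recognisable extension.
    return guessed or default


# Hypothetical URLs used only for illustration.
png_mime = guess_mime_type("https://example.com/frames/start.png?download=true", "image/jpeg")
mp4_mime = guess_mime_type("https://example.com/clips/avatar.mp4", "video/mp4")
fallback_mime = guess_mime_type("https://example.com/outputs/no-extension", "image/jpeg")
print(png_mime, mp4_mime, fallback_mime)
```

The result could be passed as the second argument of ``url_to_data_uri`` in place of the hard-coded string.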

Step 1 — Start frames

Each variant gets its own still so identity, aspect ratio, and locale stay aligned.

When ``USE_P_IMAGE_START_FRAME`` is True: the notebook calls ``prunaai/p-image`` on Replicate with a composed prompt (IDENTITY_PROFILES + PROFILE_VISUAL_STYLES + language/persona hints). The returned URL is what we pass as ``image`` to p-video-avatar—there is no separate start_image_prompt API field; the prompt exists only to generate that file.

When ``USE_P_IMAGE_START_FRAME`` is False: every variant reuses ``DIRECT_IMAGE_URL`` (fast for debugging; you lose per-locale framing).

Quality: favor single subject, clear light, and explicit horizontal vs vertical intent—same habits as the P-Image and P-Video-Avatar documentation. The preview table shows the exact prompt string and script snippet per variant.
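
The Step 1 cell derives the aspect ratio with a two-way conditional, so any ``frame_format`` other than ``"horizontal"`` silently becomes 9:16. A stricter lookup can be sketched as follows; ``ASPECT_RATIOS`` and ``aspect_ratio_for`` are hypothetical names, and only the two formats used in this notebook are covered.

```python
ASPECT_RATIOS = {"horizontal": "16:9", "vertical": "9:16"}


def aspect_ratio_for(frame_format: str) -> str:
    """Map a VARIANTS frame_format to an aspect ratio, rejecting unknown values."""
    try:
        return ASPECT_RATIOS[frame_format]
    except KeyError:
        allowed = ", ".join(sorted(ASPECT_RATIOS))
        raise ValueError(f"Unknown frame_format {frame_format!r}; expected one of: {allowed}") from None


print(aspect_ratio_for("horizontal"))
print(aspect_ratio_for("vertical"))
try:
    aspect_ratio_for("square")  # a typo or unsupported format fails loudly
except ValueError as err:
    print(err)
```

Failing early on a misspelled format is cheaper than discovering a wrongly framed still after the avatar run.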

[5]:
# Step 1: Build one still per VARIANT (or reuse DIRECT_IMAGE_URL). Output URLs become the `image` input for p-video-avatar.

start_images_by_variant: dict[str, str] = {}
start_images_data_uri: dict[str, str] = {}
prompt_rows: list[dict[str, str]] = []

if USE_P_IMAGE_START_FRAME:
    for variant in VARIANTS:
        aspect_ratio = "16:9" if variant["frame_format"] == "horizontal" else "9:16"
        profile_desc = IDENTITY_PROFILES[variant["profile_key"]]
        visual_style = PROFILE_VISUAL_STYLES[variant["profile_key"]]
        language_script = LANGUAGE_SCRIPTS[variant["language"]]
        prompt = (
            f"{profile_desc}, {variant['persona_style']} presenter for a Pruna AI onboarding lesson, "
            f"speaking {variant['language']}, {variant['voice_gender']}-voiced persona, {variant['frame_format']} framing, "
            f"{visual_style}, expressive but natural micro-expressions, cinematic contrast, clean skin texture, "
            f"high-end commercial realism, implied script context: {language_script}"
        )
        image_output = replicate.run(P_IMAGE_MODEL, input={"prompt": prompt, "aspect_ratio": aspect_ratio})
        image_url = extract_output_url(image_output)
        start_images_by_variant[variant["variant_id"]] = image_url
        start_images_data_uri[variant["variant_id"]] = url_to_data_uri(image_url, "image/jpeg")
        prompt_rows.append(
            {
                "variant_id": variant["variant_id"],
                "language": variant["language"],
                "voice": variant["voice"],
                "voice_gender": variant["voice_gender"],
                "profile_key": variant["profile_key"],
                "persona_style": variant["persona_style"],
                "frame_format": variant["frame_format"],
                "aspect_ratio": aspect_ratio,
                "generated_image_prompt": prompt,
                "voice_script_preview": language_script[:120] + "...",
            }
        )
else:
    for variant in VARIANTS:
        start_images_by_variant[variant["variant_id"]] = DIRECT_IMAGE_URL
        start_images_data_uri[variant["variant_id"]] = url_to_data_uri(DIRECT_IMAGE_URL, "image/jpeg")
        prompt_rows.append(
            {
                "variant_id": variant["variant_id"],
                "language": variant["language"],
                "voice": variant["voice"],
                "voice_gender": variant["voice_gender"],
                "profile_key": variant["profile_key"],
                "persona_style": variant["persona_style"],
                "frame_format": variant["frame_format"],
                "aspect_ratio": "16:9" if variant["frame_format"] == "horizontal" else "9:16",
                "generated_image_prompt": "external_url_used",
                "voice_script_preview": LANGUAGE_SCRIPTS[variant["language"]][:120] + "...",
            }
        )

prompt_df = pd.DataFrame(prompt_rows)
variant_df = pd.DataFrame(VARIANTS)

display(prompt_df)
display(variant_df)

   variant_id    language      voice            voice_gender  profile_key      persona_style  frame_format  aspect_ratio  generated_image_prompt                             voice_script_preview
0  main_en_us    English (US)  Zephyr (Female)  female        friendly_mentor  mentor         horizontal    16:9          Friendly AI mentor, approachable expression, m...  Welcome to Pruna AI. Today, we will create you...
1  localized_fr  French        Kore (Female)    female        creator_coach    coach          vertical      9:16          Confident creator coach, energetic presentatio...  Bienvenue sur Pruna AI. Dans cette courte leco...
2  localized_es  Spanish       Puck (Male)      male          product_trainer  trainer        vertical      9:16          Calm product trainer, clear instructional post...  Bienvenido a Pruna AI. En esta leccion breve, ...

   variant_id    language      voice            voice_gender  persona_style  frame_format  profile_key      seed
0  main_en_us    English (US)  Zephyr (Female)  female        mentor         horizontal    friendly_mentor  42
1  localized_fr  French        Kore (Female)    female        coach          vertical      creator_coach    101
2  localized_es  Spanish       Puck (Male)      male          trainer        vertical      product_trainer  102

Step 2 — Main avatar clip

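Calls ``prunaai/p-video-avatar`` for ``MAIN_VARIANT`` (English). Inputs include ``image`` from Step 1, ``voice_script`` / ``voice_prompt`` / ``video_prompt``, plus ``voice``, ``voice_language``, ``resolution``, and ``seed``. The cell also sets ``disable_safety_filter`` to True and ``disable_prompt_upsampling`` to False, which Step 4 leaves unset; review both flags before reusing this payload.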

Validate here first: if lip sync or motion fails, tighten the still (Step 1) and ``video_prompt`` (fixed camera, less aggressive background motion) before spending runs on other languages. The variant's ``persona_style`` (mentor / coach / trainer) selects which ``VIDEO_PROMPTS`` line goes to the API.

[6]:
# Step 2: Primary English clip — sanity-check this before burning runs on other locales.

results: list[RunResult] = []

main_payload = {
    "image": start_images_by_variant[MAIN_VARIANT["variant_id"]],
    "voice_script": LANGUAGE_SCRIPTS[MAIN_VARIANT["language"]],
    "voice": MAIN_VARIANT["voice"],
    "voice_language": MAIN_VARIANT["language"],
    "voice_prompt": VOICE_PROMPTS[MAIN_VARIANT["language"]],
    "video_prompt": VIDEO_PROMPTS[MAIN_VARIANT["persona_style"]],
    "resolution": SCENARIO["resolution"],
    "seed": MAIN_VARIANT["seed"],
    "disable_safety_filter": True,
    "disable_prompt_upsampling": False,
}

main_url, main_elapsed = run_prediction(P_VIDEO_AVATAR_DEPLOYMENT, main_payload)
if is_http_url(main_url):
    main_data_uri = url_to_data_uri(main_url, "video/mp4")
    main_data_uri_preview = main_data_uri[:80] + "..."
    main_status = "succeeded"
else:
    main_data_uri = ""
    main_data_uri_preview = ""
    main_status = "failed"

main_cost = SCENARIO["estimated_seconds"] * (0.025 if SCENARIO["resolution"] == "720p" else 0.045)
results.append(
    RunResult(
        scenario=SCENARIO["name"],
        variant=MAIN_VARIANT["variant_id"],
        status=main_status,
        output_url=main_url,
        output_data_uri_preview=main_data_uri_preview,
        persona_profile=MAIN_VARIANT["profile_key"],
        elapsed_seconds=main_elapsed,
        estimated_cost_usd=main_cost,
    )
)

print(
    f"Main avatar generated in {main_elapsed:.1f}s with {MAIN_VARIANT['language']} / {MAIN_VARIANT['voice']} / {MAIN_VARIANT['profile_key']}"
)
if is_http_url(main_url):
    show_video_data_uri(main_data_uri)
else:
    print("Main variant did not return a downloadable URL.")

Main avatar generated in 31.4s with English (US) / Zephyr (Female) / friendly_mentor

Step 3 — Optional uploaded audio

Set ``AUDIO_OVERRIDE_URL`` to a public audio file URL to drive speech timing from your recording instead of built-in TTS. The model uses ``audio`` for lip sync; ``voice_script`` may still be present for logging but timing follows your file.

Leave ``AUDIO_OVERRIDE_URL`` empty (default) to skip this step entirely.
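
The choice between built-in TTS and uploaded audio can be sketched as one function. It mirrors the payloads in this notebook: Step 2 sends ``voice``, ``voice_language``, and ``voice_prompt``, while Step 3 sends ``audio`` and omits them. ``speech_inputs`` is a hypothetical helper and the audio URL is a placeholder; whether the model accepts both sets of fields at once is not covered here.

```python
from typing import Any


def speech_inputs(
    audio_url: str,
    script: str,
    voice: str,
    voice_language: str,
    voice_prompt: str,
) -> dict[str, Any]:
    """Return the speech-related payload fields for either uploaded audio or TTS."""
    if audio_url:
        # Mirrors the Step 3 payload: the audio file drives lip sync.
        return {"audio": audio_url, "voice_script": script}
    # Mirrors the Step 2 payload: built-in TTS reads the script.
    return {
        "voice_script": script,
        "voice": voice,
        "voice_language": voice_language,
        "voice_prompt": voice_prompt,
    }


tts = speech_inputs("", "Welcome to Pruna AI.", "Zephyr (Female)", "English (US)", "Warm tone.")
override = speech_inputs(
    "https://example.com/take1.wav",  # placeholder URL
    "Welcome to Pruna AI.",
    "Zephyr (Female)",
    "English (US)",
    "Warm tone.",
)
print(sorted(tts))
print(sorted(override))
```

The returned dictionary can be merged into a payload that already holds ``image``, ``video_prompt``, ``resolution``, and ``seed``.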

[7]:
# Step 3 (optional): Same still + uploaded audio URL — lip sync follows the audio file when enabled.

if AUDIO_OVERRIDE_URL:
    audio_payload = {
        "image": start_images_by_variant[MAIN_VARIANT["variant_id"]],
        "audio": AUDIO_OVERRIDE_URL,
        "voice_script": "Ignored because audio override is present.",
        "video_prompt": "Calm and steady delivery with minimal body movement.",
        "resolution": "1080p",
        "seed": 101,
    }
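    audio_url, audio_elapsed = run_prediction(P_VIDEO_AVATAR_DEPLOYMENT, audio_payload)
    # Guard the download as Steps 2 and 4 do, so a failed run is recorded instead of raising.
    if is_http_url(audio_url):
        audio_data_uri = url_to_data_uri(audio_url, "video/mp4")
        audio_data_uri_preview = audio_data_uri[:80] + "..."
        audio_status = "succeeded"
    else:
        audio_data_uri = ""
        audio_data_uri_preview = ""
        audio_status = "failed"
    audio_cost = SCENARIO["estimated_seconds"] * 0.045  # 1080p per-second rate, as in Step 2
    results.append(
        RunResult(
            scenario=SCENARIO["name"],
            variant="audio_override_1080p",
            status=audio_status,
            output_url=audio_url,
            output_data_uri_preview=audio_data_uri_preview,
            persona_profile=MAIN_VARIANT["profile_key"],
            elapsed_seconds=audio_elapsed,
            estimated_cost_usd=audio_cost,
        )
    )
    print(f"Audio override avatar generated in {audio_elapsed:.1f}s")
    if is_http_url(audio_url):
        show_video_data_uri(audio_data_uri)
    else:
        print("Audio override variant did not return a downloadable URL.")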
else:
    print("Skipping audio override variant. Set AUDIO_OVERRIDE_URL to enable.")

Skipping audio override variant. Set AUDIO_OVERRIDE_URL to enable.

Step 4 — More languages

Iterates ``ADDITIONAL_VARIANTS`` (every row after the English ``MAIN_VARIANT``). Each call passes that variant’s ``image``, localized ``voice_script``, ``voice_language``, ``voice``, and the matching ``VOICE_PROMPTS`` / ``VIDEO_PROMPTS`` lines.

Localization: write each ``LANGUAGE_SCRIPTS`` entry in the target language—do not paste English into a foreign voice. Keep voice gender consistent with the generated face when possible.
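
The voice labels in ``VARIANTS`` carry a gender in parentheses, for example ``"Zephyr (Female)"``, so the gender advice can be checked mechanically. The sketch below assumes every label follows that ``Name (Gender)`` pattern, which holds for the three voices in this notebook but is not confirmed for the full voice list; ``voice_gender_mismatches`` is a hypothetical helper.

```python
import re
from typing import Any

VOICE_LABEL = re.compile(r"^(?P<name>.+?)\s*\((?P<gender>[^)]+)\)$")


def voice_gender_mismatches(variants: list[dict[str, Any]]) -> list[str]:
    """List variant_ids whose voice label gender differs from voice_gender."""
    mismatched: list[str] = []
    for variant in variants:
        match = VOICE_LABEL.match(variant["voice"])
        # Labels without a parenthesised gender cannot be checked, so they are flagged too.
        label_gender = match.group("gender").lower() if match else None
        if label_gender != variant["voice_gender"].lower():
            mismatched.append(variant["variant_id"])
    return mismatched


sample = [
    {"variant_id": "main_en_us", "voice": "Zephyr (Female)", "voice_gender": "female"},
    {"variant_id": "localized_es", "voice": "Puck (Male)", "voice_gender": "male"},
    # Deliberately inconsistent row for illustration.
    {"variant_id": "broken_fr", "voice": "Kore (Female)", "voice_gender": "male"},
]
flagged = voice_gender_mismatches(sample)
print(flagged)
```

Only the deliberately inconsistent row is flagged; the two rows copied from ``VARIANTS`` pass.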

[8]:
# Step 4: Remaining locales — same API shape; swap script, voice, and voice_prompt per LANGUAGE_SCRIPTS / VOICE_PROMPTS.

for variant in ADDITIONAL_VARIANTS:
    payload = {
        "image": start_images_by_variant[variant["variant_id"]],
        "voice_script": LANGUAGE_SCRIPTS[variant["language"]],
        "voice": variant["voice"],
        "voice_language": variant["language"],
        "voice_prompt": VOICE_PROMPTS[variant["language"]],
        "video_prompt": VIDEO_PROMPTS[variant["persona_style"]],
        "resolution": "720p",
        "seed": variant["seed"],
    }
    output_url, elapsed = run_prediction(P_VIDEO_AVATAR_DEPLOYMENT, payload)
    if is_http_url(output_url):
        output_data_uri = url_to_data_uri(output_url, "video/mp4")
        output_data_uri_preview = output_data_uri[:80] + "..."
        status = "succeeded"
    else:
        output_data_uri = ""
        output_data_uri_preview = ""
        status = "failed"
    results.append(
        RunResult(
            scenario=SCENARIO["name"],
            variant=variant["variant_id"],
            status=status,
            output_url=output_url,
            output_data_uri_preview=output_data_uri_preview,
            persona_profile=variant["profile_key"],
            elapsed_seconds=elapsed,
            estimated_cost_usd=SCENARIO["estimated_seconds"] * 0.025,
        )
    )
    print(
        f"Generated {variant['variant_id']} in {elapsed:.1f}s with "
        f"{variant['language']} / {variant['voice']} / {variant['profile_key']} ({variant['frame_format']})"
    )
    if is_http_url(output_url):
        show_video_data_uri(output_data_uri)
    else:
        print(f"Variant {variant['variant_id']} did not return a downloadable URL.")

Generated localized_fr in 40.3s with French / Kore (Female) / creator_coach (vertical)
Generated localized_es in 36.8s with Spanish / Puck (Male) / product_trainer (vertical)

Step 5 — Review runs

Builds one DataFrame from every ``RunResult``: variant id, run status (``succeeded`` or ``failed``), output URL, data-URI preview, persona profile, elapsed seconds, and an approximate USD cost computed from ``SCENARIO["estimated_seconds"]`` and the per-second rates set in Steps 2 to 4 (0.025 USD at 720p, 0.045 USD at 1080p).

Use this table to compare which variants cost most, spot failures quickly, and copy output URLs for review. Figures are indicative—confirm charges in your workspace billing.
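
The estimate is plain arithmetic: estimated seconds multiplied by a per-second rate. The sketch below reuses the rates hard-coded in the cells above, which are rough figures rather than confirmed pricing; ``RATE_USD_PER_SECOND`` and ``estimate_cost_usd`` are hypothetical names. Unlike the Step 2 cell, which falls back to 0.045 for any resolution other than 720p, this version raises ``KeyError`` on an unknown resolution.

```python
RATE_USD_PER_SECOND = {"720p": 0.025, "1080p": 0.045}


def estimate_cost_usd(estimated_seconds: float, resolution: str) -> float:
    """Rough cost: clip length in seconds multiplied by the per-second rate."""
    return estimated_seconds * RATE_USD_PER_SECOND[resolution]


per_clip = estimate_cost_usd(18, "720p")
total = sum(estimate_cost_usd(18, "720p") for _ in range(3))
hd_clip = estimate_cost_usd(18, "1080p")
print(f"per 720p clip: ${per_clip:.3f}")
print(f"three 720p clips: ${total:.3f}")
print(f"one 1080p clip: ${hd_clip:.3f}")
```

With the defaults (18 seconds at 720p) this gives 0.45 USD per clip and 1.35 USD for three clips, matching the recorded summary table. The estimate uses the configured duration, not the length of the rendered video.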

Optionally open inline previews where the notebook embeds returned video.

[9]:
# Step 5: Aggregate runs — estimated_cost_usd uses SCENARIO estimated_seconds × per-second rate (approximate).

df = pd.DataFrame([r.__dict__ for r in results])
if df.empty:
    print("No generations completed.")
else:
    df = df.sort_values(["scenario", "variant"]).reset_index(drop=True)
    display(df)
    print(f"\nTotal estimated cost: ${df['estimated_cost_usd'].sum():.3f}")

   scenario                   variant       status     output_url                                         output_data_uri_preview                            persona_profile  elapsed_seconds  estimated_cost_usd
0  learn_pruna_ai_generation  localized_es  succeeded  https://replicate.delivery/xezq/9wCjoullxM6zA9...  data:video/mp4;base64,AAAAIGZ0eXBpc29tAAACAGlz...  product_trainer  36.807317        0.45
1  learn_pruna_ai_generation  localized_fr  succeeded  https://replicate.delivery/xezq/6R5K25O45B6fP6...  data:video/mp4;base64,AAAAIGZ0eXBpc29tAAACAGlz...  creator_coach    40.331302        0.45
2  learn_pruna_ai_generation  main_en_us    succeeded  https://replicate.delivery/xezq/f3dm7fcOGqlyv0...  data:video/mp4;base64,AAAAIGZ0eXBpc29tAAACAGlz...  friendly_mentor  31.379131        0.45

Total estimated cost: $1.350

Next steps

  • Ship English first: finalize ``MAIN_VARIANT`` (still + video_prompt + script) before scaling spend on extra locales.

  • Align fields per variant: ``voice_script`` language, ``voice_language``, and ``voice`` must match; a mismatch between them is a common cause of poor audio.

  • Fair comparisons: hold ``seed`` constant when A/B testing prompt wording, so differences between clips come from the prompt change rather than from sampling noise.

  • Docs vs notebook: this flow calls ``prunaai/p-image`` through ``replicate.run`` and ``prunaai/p-video-avatar`` through a Replicate deployment so the notebook stays runnable. The same ``image`` + script + prompt fields apply to the Pruna HTTP API if you move to production (see the P-Video-Avatar endpoint documentation for limits and authentication).
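
The fair-comparison advice above can be sketched as: copy one base payload and vary exactly one field while ``seed`` stays fixed. ``ab_payloads`` is a hypothetical helper, and the base payload is a trimmed stand-in with a placeholder image URL.

```python
from typing import Any


def ab_payloads(base: dict[str, Any], field: str, candidates: list[str]) -> list[dict[str, Any]]:
    """Return one payload per candidate value, changing only `field`."""
    if "seed" not in base:
        raise ValueError("Set a fixed seed in the base payload before comparing prompts.")
    # Each payload is a shallow copy, so the base dictionary is left untouched.
    return [{**base, field: value} for value in candidates]


base_payload = {
    "image": "https://example.com/start-frame.jpg",  # placeholder URL
    "voice_script": "Welcome to Pruna AI.",
    "video_prompt": "Steady camera, natural hand gestures.",
    "resolution": "720p",
    "seed": 42,
}

trials = ab_payloads(
    base_payload,
    "video_prompt",
    ["Steady camera, natural hand gestures.", "Fixed camera, minimal background motion."],
)
for trial in trials:
    print(trial["seed"], "|", trial["video_prompt"])
```

Each resulting dictionary could be passed to ``run_prediction`` in turn, so the clips differ only in the field under test.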