P-Video-Avatar

P-Video-Avatar is Pruna’s performance model for speech-driven avatar video from a single image—spokesperson-style clips with strong lip sync and multilingual speech.

Designed for professional results, it combines fast turnaround, script or uploaded audio, voice and language control, and P-Image-compatible start frames, so you can keep one approved look across ads, support, education, and localized variants.

Note

Some visuals shown here are inspired by, derived from brand assets, or reminiscent of representative brands across various industries and have been adapted for demo purposes. When using P-Video-Avatar, respect the copyright of images you use as input and of the speech and video you generate.

Pricing:

| Resolution | Price |
| --- | --- |
| 720p | $0.025 per second of output video |
| 1080p | $0.045 per second of output video |
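At these rates, cost scales linearly with clip length. A minimal sketch using the per-second prices from the table above (the helper name is ours):

```python
# Per-second pricing from the table above (USD).
PRICE_PER_SECOND = {"720p": 0.025, "1080p": 0.045}

def clip_cost(seconds: float, resolution: str = "720p") -> float:
    """Estimated cost in USD for one output clip."""
    return round(seconds * PRICE_PER_SECOND[resolution], 4)

# A 60-second spokesperson clip:
print(clip_cost(60, "720p"))   # 1.5
print(clip_cost(60, "1080p"))  # 2.7
```

Iterating in 720p and rerunning only the final cut in 1080p keeps most of the spend at the lower rate.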

Tip

Test it now in the P-Video-Avatar Playground.

Prompt formula

Biggest levers:

  1. Still — face and wardrobe consistency start in P-Image; always a single subject.

  2. Script — spoken language matches the voice locale; write for the ear.

  3. voice_prompt — performance only, not dialogue.

  4. video_prompt — use a fixed camera when lip sync drifts; a calm background beats busy motion.

Fast pass

Fewer constraints—good for first renders and timing checks.

[p-image start frame]: one line — age, outfit, soft light
[voice_script]: short lines, easy for TTS
[voice_prompt]: warm / calm + pace in a few words
[video_prompt]: steady shot, small gestures

Locked-in

Explicit rules—same look and motion across runs and longer clips.

[p-image start frame]: single subject, aspect, light direction, wardrobe & set spelled out
[voice_script]: full script, localized, pauses that sound natural aloud
[voice_prompt]: role + energy + what to avoid (hype, theatrical)
[video_prompt]: fixed camera; gesture zone; background static or soft blur; no pan/zoom/handheld if sync slips
  1. Still / first frame — The API takes an uploaded image as the first frame. start_image_prompt is not a request field; generate the still with P-Image (demographic, wardrobe, lighting, lens, framing, single subject), then upload the file as image.

    • Examples (P-Image prompt style): “Professional woman in her 30s, medium close-up, office window light”, “single subject, 9:16, soft key, direct eye contact”

  2. voice_script — What the avatar says, in the target language (or use uploaded audio instead).

    • Examples: “Welcome—let’s connect your data in under two minutes.”, short lines for TTS.

  3. voice_prompt — How it is said: tone, pace, energy, role—not the words of the script.

    • Examples: “warm support specialist, calm pace”, “no sales hype, clear consonants”

  4. video_prompt — On-camera motion: face, shoulders, hands; background behavior; usually fixed camera for stable lip sync.

    • Examples: “fixed eye-level shot, small hand gestures”, “soft office blur behind subject, no zoom”
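Taken together, the four slots map directly onto request input fields; a minimal illustrative bundle (values are examples from this page, and the image value must be a previously uploaded file URL):

```python
# The four prompt slots as p-video-avatar input fields (values illustrative).
avatar_input = {
    "image": "https://api.pruna.ai/v1/files/file-abc123",  # uploaded still
    "voice_script": "Welcome. Let's connect your data in under two minutes.",
    "voice_prompt": "warm support specialist, calm pace",
    "video_prompt": "fixed eye-level shot, small hand gestures",
}
print(sorted(avatar_input))
```

Keeping the bundle as one dictionary makes it easy to swap a single slot between runs while holding the rest constant.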

| Slot | Fast pass (enough to run) | Locked-in (stronger control) |
| --- | --- | --- |
| Still prompt (P-Image → image) | One line: who, outfit, basic light; single subject implied. | Single subject explicit; age, skin, hair, wardrobe, expression, light direction, set, lens + aspect—stable identity; matches voice + locale intent. |
| voice_script | Short lines, one beat per sentence, target language. | Ear-tested for TTS; rewrite per locale (do not ship English in a foreign voice). |
| voice_prompt | Tone + pace in a few words. | Role (host, coach, support), register, what to avoid (hype, theatrical); must match the still persona. |
| video_prompt | Steady shot, small motion, simple background. | Fixed camera when lip sync drifts; repeatable gestures; background static or soft blur; no pan / zoom / handheld if the mouth or frame wobbles. |

Tip

When you need strict speaking cadence and pronunciation, prefer uploaded audio over generated TTS.

Tip

For comprehensive video prompting (motion, framing, atmosphere), see the Video Generation guide.

Key Features

P-Video-Avatar fits the same Pruna API patterns as P-Video while specializing in talking-head generation:

Script or uploaded audio

Drive speech with voice_script and built-in voices, or upload audio for exact timing and pronunciation.

Multilingual voices

Match voice_language and voice to your region; keep the same start-frame identity across locales.

Lip-sync-friendly motion

Use video_prompt to describe stable framing, gestures, and background—optimized for clear mouth motion.

720p and 1080p output

Choose resolution per asset; cost scales per second of output video (see Pricing).

P-Image-aligned start frames

Generate the image still with P-Image using the same prompt habits as the P-Image documentation.

Practical constraints

  • We recommend clips under 3 minutes for best consistency.

  • Output aspect ratio follows the input image.

  • Very long clips may show gradual consistency drift over time. This is a current diffusion-model limitation across the industry.

Horizontal and vertical strategy

P-Video-Avatar inherits its aspect ratio from the input image:

  • Horizontal output: provide a landscape start frame (for example 16:9).

  • Vertical output: provide a portrait start frame (for example 9:16).

Generate landscape and portrait starts with P-Image so the first frame matches the aspect ratio you want in the final avatar clip (for example 16:9 for web hero cuts, 9:16 for social).
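Because the clip inherits the start frame's aspect, it can help to sanity-check orientation before uploading; a minimal sketch (the helper name is ours):

```python
def orientation(width: int, height: int) -> str:
    """Classify a start frame; the generated clip inherits this orientation."""
    if width > height:
        return "horizontal"
    if height > width:
        return "vertical"
    return "square"

print(orientation(1920, 1080))  # 16:9 web hero -> horizontal
print(orientation(1080, 1920))  # 9:16 social   -> vertical
```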

Identity + voice/language alignment

To keep examples realistic and coherent, each scenario aligns:

  • identity prompt (including gender and demographic descriptor),

  • voice gender (female/male voice),

  • voice_language (target locale/language),

  • frame format (horizontal/vertical input image).

Example alignment presets:

| Use case | Identity prompt (image) | voice | voice_language | Gender alignment | Format |
| --- | --- | --- | --- | --- | --- |
| SaaS onboarding | Black woman, professional spokesperson, direct camera engagement | Zephyr (Female) | English (US) | female image + female voice | horizontal |
| EU founder update | White woman, founder-style delivery, social-first framing | Kore (Female) | French | female image + female voice | vertical |
| Product manager explainer | East Asian man, product walkthrough style, concise delivery | Puck (Male) | Spanish | male image + male voice | vertical |
| Support tutorial | Male support agent, reassuring tone, instructional style | Charon (Male) | English (UK) | male image + male voice | horizontal |
| Education short | Female educator, calm teaching posture, high clarity | Aoede (Female) | Hindi | female image + female voice | vertical |
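Presets like the rows above are easy to keep as data so identity, voice, language, and format always move together; a sketch covering two of the rows (the dictionary keys are ours):

```python
# Two alignment presets from the table above (dictionary keys are ours).
PRESETS = {
    "saas_onboarding": {
        "voice": "Zephyr (Female)",
        "voice_language": "English (US)",
        "format": "horizontal",
    },
    "eu_founder_update": {
        "voice": "Kore (Female)",
        "voice_language": "French",
        "format": "vertical",
    },
}

def voice_fields(preset: str) -> dict:
    """Fields to merge into the request input for this preset."""
    p = PRESETS[preset]
    return {"voice": p["voice"], "voice_language": p["voice_language"]}

print(voice_fields("eu_founder_update"))
```

The `format` value is not an API field; it only records which start-frame orientation to generate with P-Image.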

Examples

One tab per domain, with side-by-side cards: video (poster matches the still), voice and resolution line, and copy-ready prompts. The still text is a P-Image prompt (see Domain Use Cases there for tone and structure). Use it in P-Image to create the frame, then upload that file as image in the API. The label start_image_prompt in the copy blocks is documentation shorthand—it is not a request field.

Integration

P-Video-Avatar uses the same Pruna prediction API as P-Video. You supply a portrait or spokesperson still (often from P-Image) as image, then drive speech with voice_script and voice fields—or override with audio for exact timing.

For the full p-video-avatar request and response reference, use P-Video-Avatar in the API guides.

Tip

For more information on how to use the API, see the API Reference.

API Endpoint

Base URL: https://api.pruna.ai/v1/predictions

Authentication

-H 'apikey: YOUR_API_KEY'

Step 1: Upload your avatar source image

curl -X POST "https://api.pruna.ai/v1/files" \
  -H "apikey: YOUR_API_KEY" \
  -F "content=@/path/to/portrait.jpg"

Use the returned file URL as image in generation requests.

Step 2: Create avatar generation request

Script + built-in TTS (asynchronous)

curl -X POST 'https://api.pruna.ai/v1/predictions' \
-H 'Content-Type: application/json' \
-H 'apikey: YOUR_API_KEY' \
-H 'Model: p-video-avatar' \
-d '{
  "input": {
    "image": "https://api.pruna.ai/v1/files/file-abc123",
    "voice_script": "Hello and welcome to our product demo.",
    "voice": "Zephyr (Female)",
    "voice_language": "English (US)",
    "voice_prompt": "Warm, energetic, sales presentation tone.",
    "video_prompt": "The person speaks with subtle hand gestures and a dynamic office background.",
    "resolution": "720p"
  }
}'

Script + built-in TTS (synchronous)

curl -X POST 'https://api.pruna.ai/v1/predictions' \
-H 'Content-Type: application/json' \
-H 'apikey: YOUR_API_KEY' \
-H 'Model: p-video-avatar' \
-H 'Try-Sync: true' \
-d '{
  "input": {
    "image": "https://api.pruna.ai/v1/files/file-abc123",
    "voice_script": "Welcome to your onboarding. Let us configure your first workflow.",
    "voice": "Puck (Male)",
    "resolution": "1080p",
    "seed": 42
  }
}'

Uploaded audio override

curl -X POST 'https://api.pruna.ai/v1/predictions' \
-H 'Content-Type: application/json' \
-H 'apikey: YOUR_API_KEY' \
-H 'Model: p-video-avatar' \
-d '{
  "input": {
    "image": "https://api.pruna.ai/v1/files/file-abc123",
    "audio": "https://api.pruna.ai/v1/files/file-audio456",
    "voice_script": "This text is ignored when audio is provided.",
    "video_prompt": "Natural body-camera engagement, slight camera push-in."
  }
}'
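The same requests can be issued from Python with only the standard library. A sketch that mirrors the curl calls above — it builds the request without sending it; pass the result to `urllib.request.urlopen` with a real key to send:

```python
import json
import urllib.request

API_URL = "https://api.pruna.ai/v1/predictions"

def avatar_request(input_fields: dict, api_key: str,
                   try_sync: bool = False) -> urllib.request.Request:
    """Build a p-video-avatar prediction request (does not send it)."""
    headers = {
        "Content-Type": "application/json",
        "apikey": api_key,
        "Model": "p-video-avatar",
    }
    if try_sync:
        headers["Try-Sync"] = "true"  # synchronous variant
    body = json.dumps({"input": input_fields}).encode("utf-8")
    return urllib.request.Request(API_URL, data=body, headers=headers,
                                  method="POST")

# Script + built-in TTS; if "audio" is also present, it takes priority.
req = avatar_request(
    {
        "image": "https://api.pruna.ai/v1/files/file-abc123",
        "voice_script": "Hello and welcome to our product demo.",
        "voice": "Zephyr (Female)",
        "resolution": "720p",
    },
    api_key="YOUR_API_KEY",
)
```

To send, call `urllib.request.urlopen(req)`; for the uploaded-audio override, add an `audio` field to the input dictionary instead of relying on `voice_script`.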

Configuration

Required parameters

| Parameter | Type | Description |
| --- | --- | --- |
| image | file/string | Input image (first frame). Supports jpg, jpeg, png, webp. |

You must also provide either voice_script or audio (or both, with audio taking priority).

Optional parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| audio | file/string | — | Uploaded audio URL used to drive speech and timing. |
| voice | string | "Zephyr (Female)" | Voice used for generated speech. |
| voice_script | string | "" | Script spoken when audio is not provided. |
| voice_prompt | string | "Say the following." | Speaking style instructions (tone, pacing, emotion). |
| voice_language | string | "English (US)" | Output language for generated speech. |
| video_prompt | string | "The person is talking." | Prompt controlling body movement, framing behavior, and atmosphere. |
| resolution | string | "720p" | Output resolution. Allowed values: 720p, 1080p. |
| seed | integer | random | Random seed for reproducible generations. |
| disable_safety_filter | boolean | true | Disables prompt/image safety checks when true. |
| disable_prompt_upsampling | boolean | false | Skip prompt upsampling and pass raw prompt text to the model. |

Supported option values

  • resolution: 720p, 1080p.

  • voice_language: English (US), English (UK), Spanish, French, German, Italian, Portuguese (Brazil), Japanese, Korean, Hindi.

  • voice: Zephyr (Female), Puck (Male), Charon (Male), Kore (Female), Fenrir (Male), Leda (Female), Orus (Male), Aoede (Female), Callirrhoe (Female), Autonoe (Female), Enceladus (Male), Iapetus (Male), Umbriel (Male), Algenib (Male), Despina (Female), Erinome (Female), Laomedeia (Female), Achernar (Female), Algieba (Male), Schedar (Male), Gacrux (Female), Pulcherrima (Female), Achird (Male), Zubenelgenubi (Male), Vindemiatrix (Female), Sadachbia (Male), Sadaltager (Male), Sulafat (Female), Alnilam (Male), Rasalgethi (Male).
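The rules above — image required, voice_script or audio required, two allowed resolutions — can be checked client-side before sending; a minimal sketch (the helper name is ours):

```python
ALLOWED_RESOLUTIONS = {"720p", "1080p"}

def validate_input(inp: dict) -> list:
    """Return a list of problems; an empty list means the input looks sendable."""
    problems = []
    if "image" not in inp:
        problems.append("image is required (first frame)")
    if not (inp.get("voice_script") or inp.get("audio")):
        problems.append("provide voice_script or audio (audio takes priority)")
    if inp.get("resolution", "720p") not in ALLOWED_RESOLUTIONS:
        problems.append("resolution must be 720p or 1080p")
    return problems

print(validate_input({"image": "file-abc123", "voice_script": "Hi"}))  # []
```

This catches the common failure modes locally instead of spending a round trip on a rejected request.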

Argument recommendations

Use these patterns for consistent quality:

  • image: the only first-frame input; use P-Image to create the still, then upload the file. Example bundles on this page label the still text start_image_prompt for readability—that name is not an API parameter.

  • seed: set when you need reproducible A/B variants; change only one variable at a time.

  • audio vs voice_script: prefer audio when exact timing/pronunciation is critical; otherwise use voice_script for speed and scale.

  • voice + voice_language: choose together and align persona with your start-frame identity.

  • voice_prompt: keep to delivery style only (tone, speed, emotion), not content.

  • video_prompt: use for movement/framing/background behavior; avoid re-stating the script.

  • resolution: iterate in 720p, then rerun final assets in 1080p.

  • disable_prompt_upsampling: set true for strict prompt control and reproducibility; keep false when you want automatic prompt enhancement.

  • disable_safety_filter: keep default behavior unless you have an explicit moderated workflow for disabled filtering.
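The seed recommendation above — fix the seed, then change one variable at a time — can be enforced in code; an illustrative sketch (field values are examples from this page):

```python
# A base input with a fixed seed for reproducible A/B comparisons.
BASE = {
    "image": "https://api.pruna.ai/v1/files/file-abc123",
    "voice_script": "Welcome to your onboarding.",
    "resolution": "720p",
    "seed": 42,  # fixed so only the changed field varies between runs
}

def variant(base: dict, **changes) -> dict:
    """Copy the base input, changing exactly one field."""
    assert len(changes) == 1, "change one variable at a time"
    return {**base, **changes}

b = variant(BASE, video_prompt="fixed eye-level shot, small hand gestures")
```

Because each variant differs from the base in exactly one field, any change in output can be attributed to that field.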