P-Video-Avatar
P-Video-Avatar is Pruna’s performance model for speech-driven avatar video from a single image—spokesperson-style clips with strong lip sync and multilingual speech.
Designed for professional results, it combines fast turnaround, script or uploaded audio, voice and language control, and P-Image-compatible start frames, so you can keep one approved look across ads, support, education, and localized variants.
Note
Some visuals shown here are inspired by, derived from, or reminiscent of the brand assets of representative brands across various industries, and have been adapted for demo purposes. When using P-Video-Avatar, respect the copyright of the images you use as input and of the speech and video you generate.
Pricing:
| Resolution | Price |
|---|---|
| 720p | $0.025 per second of output video |
| 1080p | $0.045 per second of output video |
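As a rough budgeting aid, the per-second rates above can be wrapped in a small shell helper. This is a minimal sketch under stated assumptions: estimate_cost is a hypothetical name (not part of any Pruna tooling), it accepts whole seconds only, and it assumes the charge is simply rate times duration with no rounding rules or minimums, which the pricing table does not specify.

```shell
# Hypothetical helper: estimate the cost in USD of one clip.
# Rates are copied from the pricing table, stored in thousandths of a dollar
# so the arithmetic stays in integers (POSIX shell has no floating point).
estimate_cost() {
  resolution="$1"   # "720p" or "1080p"
  seconds="$2"      # whole seconds of output video
  case "$resolution" in
    720p)  rate_milli=25 ;;   # $0.025 per second
    1080p) rate_milli=45 ;;   # $0.045 per second
    *) echo "unknown resolution: $resolution" >&2; return 1 ;;
  esac
  milli=$(( seconds * rate_milli ))
  printf '%d.%03d\n' $(( milli / 1000 )) $(( milli % 1000 ))
}

estimate_cost 720p 30    # prints 0.750
estimate_cost 1080p 30   # prints 1.350
```

Comparing the two figures before a batch run makes it easier to decide which assets justify a 1080p render.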
Tip
Test it now in the P-Video-Avatar Playground.
Prompt formula
Biggest levers:
1. Still: face and wardrobe consistency start in P-Image; always use a single subject.
2. Script: the spoken language must match the voice locale; write for the ear.
3. voice_prompt: performance only, not dialogue.
4. video_prompt: use a fixed camera when lip sync drifts; a calm background beats busy motion.
Fast pass
Fewer constraints—good for first renders and timing checks.
Locked-in
Explicit rules—same look and motion across runs and longer clips.
- Still / first frame: The API takes an uploaded image (first frame). start_image_prompt is not a request field; generate the still with P-Image using a prompt that covers demographic, wardrobe, lighting, lens, and framing (single subject), then upload the file as image. Examples (P-Image prompt style): “Professional woman in her 30s, medium close-up, office window light”, “single subject, 9:16, soft key, direct eye contact”
- voice_script: What the avatar says, in the target language (or use uploaded audio instead). Examples: “Welcome. Let’s connect your data in under two minutes.”, short lines for TTS.
- voice_prompt: How it is said (tone, pace, energy, role), not the words of the script. Examples: “warm support specialist, calm pace”, “no sales hype, clear consonants”
- video_prompt: On-camera motion (face, shoulders, hands) and background behavior; usually a fixed camera for stable lip sync. Examples: “fixed eye-level shot, small hand gestures”, “soft office blur behind subject, no zoom”
| Slot | Fast pass (enough to run) | Locked-in (stronger control) |
|---|---|---|
| Still prompt (P-Image → image) | One line: who, outfit, basic light; single subject implied. | Single subject explicit; age, skin, hair, wardrobe, expression, light direction, set, lens + aspect; stable identity; matches voice + locale intent. |
| voice_script | Short lines, one beat per sentence, target language. | Ear-tested for TTS; rewrite per locale (do not ship English in a foreign voice). |
| voice_prompt | Tone + pace in a few words. | Role (host, coach, support), register, what to avoid (hype, theatrical); must match the still persona. |
| video_prompt | Steady shot, small motion, simple background. | Fixed camera when lip sync drifts; repeatable gestures; background static or soft blur; no pan / zoom / handheld if the mouth or frame wobbles. |
Tip
When you need strict speaking cadence and pronunciation, prefer uploaded audio over generated TTS.
Tip
For comprehensive video prompting (motion, framing, atmosphere), see the Video Generation guide.
Key Features
P-Video-Avatar fits the same Pruna API patterns as P-Video while specializing in talking-head generation:
- Script or uploaded audio: Drive speech with voice_script and built-in voices, or upload audio for exact timing and pronunciation.
- Multilingual voices: Match voice_language and voice to your region; keep the same start-frame identity across locales.
- Lip-sync-friendly motion: Use video_prompt to describe stable framing, gestures, and background, optimized for clear mouth motion.
- 720p and 1080p output: Choose resolution per asset; cost scales per second of output video (see Pricing).
- P-Image-aligned start frames: Generate the image still with P-Image using the same prompt habits as the P-Image documentation.
Practical constraints
We recommend clips under 3 minutes for best consistency.
Output aspect ratio follows the input image.
Very long clips may show gradual consistency drift over time. This is a current diffusion-model limitation across the industry.
Horizontal and vertical strategy
p-video-avatar inherits aspect ratio from the input image:
Horizontal output: provide a landscape start frame (for example 16:9).
Vertical output: provide a portrait start frame (for example 9:16).
Generate landscape and portrait starts with P-Image so the first frame matches the aspect ratio you want in the final avatar clip (for example 16:9 for web hero cuts, 9:16 for social).
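Because the output inherits its aspect ratio from the still, it can help to check a frame's pixel dimensions before uploading. The sketch below is illustrative only: frame_format is a hypothetical helper, and it assumes you already know the width and height (for example from your image editor or an image-inspection tool).

```shell
# Hypothetical helper: classify a start frame by its pixel dimensions.
frame_format() {
  width="$1"
  height="$2"
  if [ "$width" -gt "$height" ]; then
    echo horizontal     # e.g. 1920x1080 (16:9) for web hero cuts
  elif [ "$width" -lt "$height" ]; then
    echo vertical       # e.g. 1080x1920 (9:16) for social
  else
    echo square
  fi
}

frame_format 1920 1080   # prints horizontal
frame_format 1080 1920   # prints vertical
```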
Identity + voice/language alignment
To keep examples realistic and coherent, each scenario aligns:
- identity prompt (including gender and demographic descriptor)
- voice gender (female/male voice)
- voice_language (target locale/language)
- frame format (horizontal/vertical input image)
Example alignment presets:
| Use case | Identity prompt (image) | voice | voice_language | Gender alignment | Format |
|---|---|---|---|---|---|
| SaaS onboarding | Black woman, professional spokesperson, direct camera engagement | | | female image + female voice | horizontal |
| EU founder update | White woman, founder-style delivery, social-first framing | | | female image + female voice | vertical |
| Product manager explainer | East Asian man, product walkthrough style, concise delivery | | | male image + male voice | vertical |
| Support tutorial | Male support agent, reassuring tone, instructional style | | | male image + male voice | horizontal |
| Education short | Female educator, calm teaching posture, high clarity | | | female image + female voice | vertical |
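The gender alignment described above can be checked mechanically, since each documented voice value carries a (Female) or (Male) suffix. The following is a minimal sketch: voice_gender and voices_match are hypothetical helper names, and the check relies only on that suffix convention from the supported voice list.

```shell
# Hypothetical helper: read the gender label from a voice value
# such as "Zephyr (Female)" or "Puck (Male)".
voice_gender() {
  case "$1" in
    *"(Female)") echo female ;;
    *"(Male)")   echo male ;;
    *)           echo unknown ;;
  esac
}

# Succeeds (exit status 0) when the voice label matches the gender
# you wrote into the identity prompt for the still.
voices_match() {
  identity_gender="$1"   # "female" or "male"
  voice="$2"
  [ "$(voice_gender "$voice")" = "$identity_gender" ]
}

voices_match female "Zephyr (Female)" && echo "aligned"
voices_match female "Puck (Male)" || echo "mismatch: pick a female voice"
```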
Examples
Each example pairs a video with its voice and resolution settings and a set of copy-ready prompts. The still text is a P-Image prompt (see Domain Use Cases in the P-Image documentation for tone and structure): use it in P-Image to create the frame, then upload that file as image in the API. The label start_image_prompt in the example bundles is documentation shorthand, not a request field or JSON key. After the still prompt, each bundle lists voice_script, voice_prompt, and video_prompt.
Integration
P-Video-Avatar uses the same Pruna prediction API as P-Video. You supply a portrait or spokesperson still (often from P-Image) as image, then drive speech with voice_script and voice fields—or override with audio for exact timing.
For the full p-video-avatar request and response reference, use P-Video-Avatar in the API guides.
Tip
For more information on how to use the API, see the API Reference.
API Endpoint
Base URL: https://api.pruna.ai/v1/predictions
Authentication
-H 'apikey: YOUR_API_KEY'
Step 1: Upload your avatar source image
curl -X POST "https://api.pruna.ai/v1/files" \
-H "apikey: YOUR_API_KEY" \
-F "content=@/path/to/portrait.jpg"
Use the returned file URL as image in generation requests.
Step 2: Create avatar generation request
Script + built-in TTS (asynchronous)
curl -X POST 'https://api.pruna.ai/v1/predictions' \
  -H 'Content-Type: application/json' \
  -H 'apikey: YOUR_API_KEY' \
  -H 'Model: p-video-avatar' \
  -d '{
    "input": {
      "image": "https://api.pruna.ai/v1/files/file-abc123",
      "voice_script": "Hello and welcome to our product demo.",
      "voice": "Zephyr (Female)",
      "voice_language": "English (US)",
      "voice_prompt": "Warm, energetic, sales presentation tone.",
      "video_prompt": "The person speaks with subtle hand gestures and a dynamic office background.",
      "resolution": "720p"
    }
  }'
Script + built-in TTS (synchronous)
curl -X POST 'https://api.pruna.ai/v1/predictions' \
  -H 'Content-Type: application/json' \
  -H 'apikey: YOUR_API_KEY' \
  -H 'Model: p-video-avatar' \
  -H 'Try-Sync: true' \
  -d '{
    "input": {
      "image": "https://api.pruna.ai/v1/files/file-abc123",
      "voice_script": "Welcome to your onboarding. Let us configure your first workflow.",
      "voice": "Puck (Male)",
      "resolution": "1080p",
      "seed": 42
    }
  }'
Uploaded audio override
curl -X POST 'https://api.pruna.ai/v1/predictions' \
  -H 'Content-Type: application/json' \
  -H 'apikey: YOUR_API_KEY' \
  -H 'Model: p-video-avatar' \
  -d '{
    "input": {
      "image": "https://api.pruna.ai/v1/files/file-abc123",
      "audio": "https://api.pruna.ai/v1/files/file-audio456",
      "voice_script": "This text is ignored when audio is provided.",
      "video_prompt": "Natural body-camera engagement, slight camera push-in."
    }
  }'
Configuration
Required parameters
| Parameter | Type | Description |
|---|---|---|
| image | file/string | Input image (first frame). Supports jpg, jpeg, png, webp. |
You must also provide either voice_script or audio (or both, with audio taking priority).
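The input rule above (image is required, plus voice_script or audio, with audio winning when both are present) can be checked locally before sending a request. This is a sketch only: speech_source is a hypothetical helper, and the API's actual validation and error messages are not shown in this document.

```shell
# Hypothetical pre-flight check mirroring the documented input rule.
# Prints which field will drive speech, or fails with a message on stderr.
speech_source() {
  image="$1"
  voice_script="$2"
  audio="$3"
  if [ -z "$image" ]; then
    echo "error: image is required" >&2
    return 1
  fi
  if [ -n "$audio" ]; then
    echo audio            # audio takes priority over voice_script
  elif [ -n "$voice_script" ]; then
    echo voice_script
  else
    echo "error: provide voice_script or audio" >&2
    return 1
  fi
}

speech_source "https://api.pruna.ai/v1/files/file-abc123" "Hello and welcome." ""
# prints voice_script
```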
Optional parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| audio | file/string | | Uploaded audio URL used to drive speech and timing. |
| voice | string | | Voice used for generated speech. |
| voice_script | string | | Script spoken when audio is not provided. |
| voice_prompt | string | | Speaking style instructions (tone, pacing, emotion). |
| voice_language | string | | Output language for generated speech. |
| video_prompt | string | | Prompt controlling body movement, framing behavior, and atmosphere. |
| resolution | string | | Output resolution. Allowed values: 720p, 1080p. |
| seed | integer | random | Random seed for reproducible generations. |
| disable_safety_filter | boolean | | Disables prompt/image safety checks when true. |
| disable_prompt_upsampling | boolean | | Skip prompt upsampling and pass raw prompt text to the model. |
Supported option values
- resolution: 720p, 1080p.
- voice_language: English (US), English (UK), Spanish, French, German, Italian, Portuguese (Brazil), Japanese, Korean, Hindi.
- voice: Zephyr (Female), Puck (Male), Charon (Male), Kore (Female), Fenrir (Male), Leda (Female), Orus (Male), Aoede (Female), Callirrhoe (Female), Autonoe (Female), Enceladus (Male), Iapetus (Male), Umbriel (Male), Algenib (Male), Despina (Female), Erinome (Female), Laomedeia (Female), Achernar (Female), Algieba (Male), Schedar (Male), Gacrux (Female), Pulcherrima (Female), Achird (Male), Zubenelgenubi (Male), Vindemiatrix (Female), Sadachbia (Male), Sadaltager (Male), Sulafat (Female), Alnilam (Male), Rasalgethi (Male).
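Option values are exact strings, so a typo such as "English(US)" is easy to make. A local check against the documented lists might look like the sketch below. The helper names are hypothetical, only resolution and voice_language are covered for brevity, and it assumes the API matches these strings exactly, which this page implies but does not state.

```shell
# Hypothetical validators built from the documented option values.
is_supported_resolution() {
  case "$1" in
    720p|1080p) return 0 ;;
    *)          return 1 ;;
  esac
}

is_supported_language() {
  case "$1" in
    "English (US)"|"English (UK)"|Spanish|French|German|Italian|\
    "Portuguese (Brazil)"|Japanese|Korean|Hindi) return 0 ;;
    *) return 1 ;;
  esac
}

is_supported_language "Portuguese (Brazil)" && echo "language ok"
is_supported_resolution "4k" || echo "unsupported resolution"
```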
Argument recommendations
Use these patterns for consistent quality:
- image: the only first-frame input; use P-Image to create the still, then upload the file. Example bundles on this page label the still text start_image_prompt for readability; that name is not an API parameter.
- seed: set when you need reproducible A/B variants; change only one variable at a time.
- audio vs voice_script: prefer audio when exact timing/pronunciation is critical; otherwise use voice_script for speed and scale.
- voice + voice_language: choose together and align persona with your start-frame identity.
- voice_prompt: keep to delivery style only (tone, speed, emotion), not content.
- video_prompt: use for movement/framing/background behavior; avoid re-stating the script.
- resolution: iterate in 720p, then rerun final assets in 1080p.
- disable_prompt_upsampling: set true for strict prompt control and reproducibility; keep false when you want automatic prompt enhancement.
- disable_safety_filter: keep default behavior unless you have an explicit moderated workflow for disabled filtering.
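The "iterate in 720p, then rerun in 1080p" and "change only one variable at a time" advice can be sketched as a payload builder that holds the seed and script fixed while only the resolution changes. build_payload is a hypothetical helper, and it assumes the values contain no double quotes or backslashes, since it does no JSON escaping.

```shell
# Hypothetical helper: build a request body with a fixed seed so that
# draft and final renders differ in exactly one field (resolution).
# Assumption: arguments contain no characters that need JSON escaping.
build_payload() {
  image_url="$1"
  script="$2"
  resolution="$3"
  seed="$4"
  printf '{"input": {"image": "%s", "voice_script": "%s", "resolution": "%s", "seed": %d}}\n' \
    "$image_url" "$script" "$resolution" "$seed"
}

IMAGE_URL="https://api.pruna.ai/v1/files/file-abc123"
draft=$(build_payload "$IMAGE_URL" "Hello and welcome." 720p 42)
final=$(build_payload "$IMAGE_URL" "Hello and welcome." 1080p 42)
echo "$draft"
echo "$final"

# To send a payload, reuse the headers from the Integration examples:
# curl -X POST 'https://api.pruna.ai/v1/predictions' \
#   -H 'Content-Type: application/json' \
#   -H 'apikey: YOUR_API_KEY' \
#   -H 'Model: p-video-avatar' \
#   -d "$draft"
```

Whether an identical seed yields matching motion across resolutions is not stated on this page, so treat the 720p pass as a timing and framing check, not a pixel-level preview.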