How to Generate an AI Music Video with P-Video and MiniMax Music
Turn your lyrics into a full music video—no editing software, no timeline scrubbing. Just run the cells and watch it come together.
What we’ll do:
Generate the song — MiniMax Music 1.5 turns your lyrics into a complete track (up to 4 minutes).
Split the audio — We chop the song into short segments (about 6 seconds each) so each gets its own visual.
Plan the scenes — An LLM reads your lyrics and designs a scene for each segment: image, optional edit, and motion.
Build and merge — For each segment we generate an image, optionally edit it, animate it with P-Video (synced to the audio), then stitch everything into one video.
You’ll see the final music video at the end. Change the lyrics and style prompt to make it your own.
Models used: minimax/music-1.5, p-image, p-image-edit, p-video
Setup
We’ll need a few extra packages for audio handling (pydub) and video merging (moviepy). You’ll also need Replicate and OpenAI API keys—grab them from Replicate and OpenAI.
[1]:
%pip install replicate openai requests pydub moviepy
[2]:
import io
import json
import os
import tempfile
import requests
from IPython.display import Audio, Video, display
from replicate.client import Client
from openai import OpenAI
from pydub import AudioSegment
from moviepy import VideoFileClip, concatenate_videoclips
[3]:
token = os.environ.get("REPLICATE_API_TOKEN")
if not token:
    token = input("Replicate API token (r8_...): ").strip()
replicate = Client(api_token=token)
[4]:
openai_token = os.environ.get("OPENAI_API_KEY")
if not openai_token:
    openai_token = input("OpenAI API key (sk-...): ").strip()
openai_client = OpenAI(api_key=openai_token)
Step 1: Generate your song
Paste your lyrics into the cell below and pick a style (e.g., “Jazz, smooth, upbeat”). MiniMax Music 1.5 will generate a full song with vocals and instrumentation. The model supports up to 600 characters of lyrics and outputs up to 4 minutes of audio.
Run the cell—you’ll get a URL to the generated MP3. The next step will download and split it automatically.
[5]:
lyrics = """
Walking through the city lights
Feeling the rhythm of the night
Every step we take together
Dancing in the summer weather
"""

music_output = replicate.run(
    "minimax/music-1.5",
    input={
        "lyrics": lyrics.strip(),
        "prompt": "Jazz, smooth, upbeat",
    },
)
audio_url = music_output.url

# Download the generated MP3 so we can split it locally
resp = requests.get(audio_url)
resp.raise_for_status()
with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
    f.write(resp.content)
    audio_path = f.name
display(Audio(filename=audio_path))
Step 2: Split the audio into segments
We divide the song into ~6-second chunks. Each chunk becomes one video segment, so the visuals can change as the song progresses, and P-Video syncs each segment's motion to its audio.
After running, you'll see how many segments were created. To keep the demo fast we cap it at NUM_SEGMENTS = 3; raise the cap to cover the whole song.
[6]:
segment_duration_ms = 6000
NUM_SEGMENTS = 3  # cap for this demo; raise it to cover the whole song

# audio_path was saved in Step 1, so we can load it directly
audio = AudioSegment.from_file(audio_path)

segments = []
for i in range(0, len(audio), segment_duration_ms):
    chunk = audio[i : i + segment_duration_ms]
    if len(chunk) > 500:  # skip trailing slivers under 0.5 s
        with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
            seg_path = f.name
        chunk.export(seg_path, format="mp3")
        segments.append(seg_path)

segments = segments[:NUM_SEGMENTS]
print(f"Split into {len(segments)} segments")
Split into 3 segments
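The chunk boundaries that loop produces can be sketched with a small helper (a hypothetical `chunk_ranges`, not part of the notebook's code, assuming the same 500 ms sliver cutoff):

```python
def chunk_ranges(total_ms: int, step_ms: int, min_ms: int = 500) -> list[tuple[int, int]]:
    """Return (start, end) millisecond ranges, dropping trailing slivers under min_ms."""
    ranges = []
    for start in range(0, total_ms, step_ms):
        end = min(start + step_ms, total_ms)
        if end - start > min_ms:
            ranges.append((start, end))
    return ranges

# A 20-second song in 6-second steps: the last chunk is only 2 s but still kept.
print(chunk_ranges(20_000, 6_000))
# → [(0, 6000), (6000, 12000), (12000, 18000), (18000, 20000)]
```

A song that ends just past a boundary (say 18.3 s) would have its 300 ms sliver dropped, which is why a 3-minute track may yield one segment fewer than you'd expect.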
Step 3: Let the LLM plan your scenes
The LLM reads your lyrics and the number of segments, then designs a scene for each one. For every segment it returns:
Image prompt — What the still image should show (e.g., “city skyline at dusk” for the first verse).
Edit prompt (optional) — Refinements before animating.
Video prompt — How the scene should move.
This keeps the visuals aligned with the mood and story of the song. Run the cell to generate the scene plan—you can inspect the prompts in the output.
[7]:
response = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "You generate scene prompts for a music video. Return a valid JSON array. Each object: image_prompt, edit_prompt (optional), video_prompt. image_prompt: text-to-image. edit_prompt: optional refinement. video_prompt: motion for image-to-video.",
        },
        {
            "role": "user",
            "content": f"Create {len(segments)} scene prompts for a music video. Lyrics: {lyrics[:100]}...",
        },
    ],
)

raw = response.choices[0].message.content
# Strip a Markdown code fence if the model wrapped its JSON in one
if "```" in raw:
    raw = raw.split("```")[1]
    if raw.startswith("json"):
        raw = raw[4:]
scene_prompts = json.loads(raw.strip())
if isinstance(scene_prompts, dict):
    scene_prompts = scene_prompts.get("scenes", [scene_prompts])

# Pad or trim so there is exactly one scene per segment
while len(scene_prompts) < len(segments):
    scene_prompts.append(
        scene_prompts[-1]
        if scene_prompts
        else {"image_prompt": "abstract art", "video_prompt": "abstract motion"}
    )
scene_prompts = scene_prompts[: len(segments)]
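After parsing, `scene_prompts` should look something like the list below (the prompts here are made-up examples for illustration, not model output):

```python
example_scene_prompts = [
    {
        "image_prompt": "neon-lit city street at night, cinematic wide shot",
        "edit_prompt": "add light rain and reflections on the pavement",
        "video_prompt": "slow dolly forward through the street",
    },
    {
        "image_prompt": "couple dancing on a rooftop at golden hour",
        "video_prompt": "gentle orbit around the dancers",  # edit_prompt is optional
    },
]

# Every scene needs at least an image prompt and a video prompt;
# the main loop falls back to defaults if either is missing.
for scene in example_scene_prompts:
    assert "image_prompt" in scene and "video_prompt" in scene
```

If the LLM returns something malformed, inspect `raw` before `json.loads` to see what went wrong.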
Step 4: Generate each segment and merge
This is the main loop. For each segment we:
Generate an image with P-Image
Optionally refine it with P-Image-Edit
Upload the audio chunk to Replicate
Generate a video with P-Video (image + audio synced)
Download the segment
Then we concatenate all segments into one video. This step takes a few minutes—each segment needs an image and a video generation. When it’s done, your full music video will appear below.
[8]:
video_clips = []
for i, (seg_path, scene) in enumerate(zip(segments, scene_prompts)):
    img_prompt = scene.get("image_prompt", "abstract art")
    edit_prompt = scene.get("edit_prompt")
    vid_prompt = scene.get("video_prompt", "smooth motion")

    # 1. Generate the still image
    img_out = replicate.run(
        "prunaai/p-image", input={"prompt": img_prompt, "aspect_ratio": "16:9"}
    )
    image_url = img_out.url

    # 2. Optionally refine it
    if edit_prompt:
        edit_out = replicate.run(
            "prunaai/p-image-edit", input={"images": [image_url], "prompt": edit_prompt}
        )
        image_url = edit_out.url

    # 3. Upload the audio chunk so p-video can sync to it
    with open(seg_path, "rb") as f:
        audio_data = f.read()
    audio_file = replicate.files.create(file=io.BytesIO(audio_data), filename="segment.mp3")
    audio_replicate_url = (
        audio_file.urls.get("get")
        if hasattr(audio_file, "urls")
        else getattr(audio_file, "url", str(audio_file))
    )

    # 4. Animate the image, synced to the audio
    vid_input = {"image": image_url, "prompt": vid_prompt, "audio": audio_replicate_url}
    vid_out = replicate.run("prunaai/p-video", input=vid_input)
    video_url = vid_out.url

    # 5. Download the finished segment
    r = requests.get(video_url)
    r.raise_for_status()
    with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as f:
        f.write(r.content)
        vpath = f.name
    video_clips.append(VideoFileClip(vpath))

# Stitch all segments into one video
final = concatenate_videoclips(video_clips)
output_path = "music_video_output.mp4"
final.write_videofile(output_path, codec="libx264", audio_codec="aac")
final.close()
for c in video_clips:
    c.close()

print("Saved:", output_path)
display(Video(output_path, embed=True))
MoviePy - Building video music_video_output.mp4.
MoviePy - Writing audio in music_video_outputTEMP_MPY_wvf_snd.mp4
MoviePy - Done.
MoviePy - Writing video music_video_output.mp4
MoviePy - Done !
MoviePy - video ready music_video_output.mp4
Saved: music_video_output.mp4
Conclusion
Congratulations! You’ve just created a music video with AI: one generated track, split into segments, each animated from its own image and synced back to the audio.
To make longer or more complex videos, raise NUM_SEGMENTS to cover the whole song and spend more time refining the scene prompts.
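Before scaling up, it helps to know how many model calls a full song implies. This back-of-the-envelope helper (hypothetical, not part of the notebook) counts segments from the song length:

```python
import math

def estimate_segments(song_seconds: float, segment_seconds: float = 6.0) -> int:
    """Number of ~segment_seconds chunks needed to cover the whole song."""
    return math.ceil(song_seconds / segment_seconds)

# A 3-minute song at 6 s per segment needs 30 segments,
# i.e. 30 image generations + 30 video generations (plus any edits).
print(estimate_segments(180))  # → 30
```

Each segment costs one p-image call and one p-video call (plus an optional p-image-edit), so budget time and credits accordingly.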
You can check out other workflows or sign up for our API and get started at https://dashboard.pruna.ai/login