Open In Colab

Create a Music Video in 10 Minutes (No Editing Skills)

Turn your lyrics into a full music video—no editing software, no timeline scrubbing. Just run the cells and watch it come together.

What we’ll do:

  1. Generate the song — MiniMax Music 1.5 turns your lyrics into a complete track (up to 4 minutes).

  2. Split the audio — We chop the song into short segments (about 6 seconds each) so each gets its own visual.

  3. Plan the scenes — An LLM reads your lyrics and designs a scene for each segment: image, optional edit, and motion.

  4. Build and merge — For each segment we generate an image, optionally edit it, animate it with P-Video (synced to the audio), then stitch everything into one video.

You’ll see the final music video at the end. Change the lyrics and style prompt to make it your own.

Models used: minimax/music-1.5, p-image, p-image-edit, p-video

Setup

We’ll need a few extra packages for audio handling (pydub) and video merging (moviepy). You’ll also need Replicate and OpenAI API keys—grab them from Replicate and OpenAI.
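If you'd rather not type keys interactively, you can set them as environment variables before the cells below run. A minimal sketch (the values here are placeholders, not real tokens):

```python
import os

# Set keys for this session only; replace the placeholders with your real
# tokens, or skip this and use the input() prompts in the cells below.
os.environ.setdefault("REPLICATE_API_TOKEN", "r8_your_token_here")
os.environ.setdefault("OPENAI_API_KEY", "sk-your_key_here")
```

`setdefault` leaves the variables alone if they are already set, so this is safe to run even when the keys come from your shell.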

[1]:
%pip install replicate openai requests pydub moviepy
[2]:
import json
import os
import tempfile
import requests
from IPython.display import Audio, Video, display
from replicate.client import Client
from openai import OpenAI
from pydub import AudioSegment
from moviepy import VideoFileClip, concatenate_videoclips
[3]:
token = os.environ.get("REPLICATE_API_TOKEN")
if not token:
    token = input("Replicate API token (r8_...): ").strip()
replicate = Client(api_token=token)
[4]:
openai_token = os.environ.get("OPENAI_API_KEY")
if not openai_token:
    openai_token = input("OpenAI API key (sk-...): ").strip()
openai_client = OpenAI(api_key=openai_token)

Step 1: Generate your song

Paste your lyrics into the cell below and pick a style (e.g., “Jazz, smooth, upbeat”). MiniMax Music 1.5 will generate a full song with vocals and instrumentation. The model supports up to 600 characters of lyrics and outputs up to 4 minutes of audio.

Run the cell—you’ll get a URL to the generated MP3. The next step will download and split it automatically.
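Since the model caps lyrics at 600 characters, a small guard can fail fast before spending credits. This is just a sketch; check_lyrics is a hypothetical helper, not part of any API used here:

```python
MAX_LYRICS_CHARS = 600  # limit stated by minimax/music-1.5

def check_lyrics(lyrics: str) -> str:
    """Trim whitespace and raise early if the lyrics exceed the model's limit."""
    lyrics = lyrics.strip()
    if len(lyrics) > MAX_LYRICS_CHARS:
        raise ValueError(f"Lyrics are {len(lyrics)} chars; max is {MAX_LYRICS_CHARS}")
    return lyrics

print(len(check_lyrics("Walking through the city lights")))  # 31
```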

[5]:
lyrics = """
Walking through the city lights
Feeling the rhythm of the night
Every step we take together
Dancing in the summer weather
"""

def _extract_url(obj):
    if isinstance(obj, str):
        return obj
    if hasattr(obj, "url"):
        return obj.url
    if hasattr(obj, "content_url"):
        return obj.content_url
    if isinstance(obj, list) and obj:
        return _extract_url(obj[0])
    if isinstance(obj, dict):
        return obj.get("audio") or obj.get("video") or obj.get("output") or (list(obj.values())[0] if obj else None)
    return str(obj)

music_output = replicate.run(
    "minimax/music-1.5",
    input={
        "lyrics": lyrics.strip(),
        "prompt": "Jazz, smooth, upbeat",
    },
)
audio_url = _extract_url(music_output)
resp = requests.get(audio_url)
resp.raise_for_status()
with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
    f.write(resp.content)
    audio_path = f.name
display(Audio(filename=audio_path))

Step 2: Split the audio into segments

We divide the song into ~6-second chunks. Each chunk will become one video segment, so the visuals can change as the song progresses. P-Video can sync each segment to its audio, so the motion and music stay in time.

After running, you’ll see how many segments were created. To keep the demo fast, NUM_SEGMENTS caps the count (3 by default); raise it to cover the whole song.

[6]:
segment_duration_ms = 6000
NUM_SEGMENTS = 3

# Reuse the MP3 already downloaded in Step 1 (audio_path); no need to fetch it again.

audio = AudioSegment.from_file(audio_path)
segments = []
for i in range(0, len(audio), segment_duration_ms):
    chunk = audio[i : i + segment_duration_ms]
    if len(chunk) > 500:
        with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as sf:
            seg_path = sf.name
        chunk.export(seg_path, format="mp3")
        segments.append(seg_path)
segments = segments[:NUM_SEGMENTS]
print(f"Split into {len(segments)} segments")
Split into 3 segments
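As a sanity check, the chunk count can be predicted from the song length. A sketch mirroring the loop above (min_ms matches the 500 ms cutoff on short tails):

```python
def expected_segments(song_ms: int, segment_ms: int = 6000, min_ms: int = 500) -> int:
    """Count chunks at least min_ms long, mirroring the splitting loop."""
    full, rem = divmod(song_ms, segment_ms)
    return full + (1 if rem > min_ms else 0)

print(expected_segments(45_000))  # 45 s song -> 8 segments (7 full + one 3 s tail)
```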

Step 3: Let the LLM plan your scenes

The LLM reads your lyrics and the number of segments, then designs a scene for each one. For every segment it returns:

  • Image prompt — What the still image should show (e.g., “city skyline at dusk” for the first verse).

  • Edit prompt (optional) — Refinements before animating.

  • Video prompt — How the scene should move.

This keeps the visuals aligned with the mood and story of the song. Run the cell to generate the scene plan—you can inspect the prompts in the output.
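For reference, this is the shape the code below expects back from the LLM, plus a small validation sketch (validate_plan is a hypothetical helper, shown only to make the contract concrete):

```python
# The field names match the system prompt sent to the LLM.
example_plan = [
    {
        "image_prompt": "city skyline at dusk, neon reflections",
        "edit_prompt": "add light rain and lens flare",  # optional
        "video_prompt": "slow pan across the skyline",
    },
]

def validate_plan(plan):
    """Check each scene has the required prompts; edit_prompt may be absent."""
    return all(
        isinstance(s, dict) and "image_prompt" in s and "video_prompt" in s
        for s in plan
    )

print(validate_plan(example_plan))  # True
```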

[7]:
response = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "system",
            "content": "You generate scene prompts for a music video. Return a JSON object with a 'scenes' key holding an array. Each scene object: image_prompt, edit_prompt (optional), video_prompt. image_prompt: text-to-image. edit_prompt: optional refinement. video_prompt: motion for image-to-video.",
        },
        {
            "role": "user",
            "content": f"Create {len(segments)} scene prompts for a music video. Lyrics: {lyrics[:100]}...",
        },
    ],
)
scene_prompts = json.loads(response.choices[0].message.content)
if isinstance(scene_prompts, dict):
    scene_prompts = scene_prompts.get("scenes", [scene_prompts])
while len(scene_prompts) < len(segments):
    scene_prompts.append(
        scene_prompts[-1]
        if scene_prompts
        else {"image_prompt": "abstract art", "video_prompt": "abstract motion"}
    )
scene_prompts = scene_prompts[: len(segments)]

Step 4: Generate each segment and merge

This is the main loop. For each segment we:

  1. Generate an image with P-Image

  2. Optionally refine it with P-Image-Edit

  3. Upload the audio chunk to Replicate

  4. Generate a video with P-Video (image + audio synced)

  5. Download the segment

Then we concatenate all segments into one video. This step takes a few minutes—each segment needs an image and a video generation. When it’s done, your full music video will appear below.
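Generation calls can fail transiently (rate limits, timeouts). If you hit that, a small retry wrapper helps; this is a sketch of a generic helper (with_retries is not part of the Replicate client), which you could wrap around each replicate.run call in the loop:

```python
import time

def with_retries(fn, attempts=3, delay=2.0):
    """Call fn(); on exception, wait with exponential backoff and retry."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(delay * (2 ** attempt))
```

Usage would look like `with_retries(lambda: replicate.run("prunaai/p-image", input={...}))`.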

[ ]:
video_clips = []

for i, (seg_path, scene) in enumerate(zip(segments, scene_prompts)):
    img_prompt = scene.get("image_prompt", "abstract art")
    edit_prompt = scene.get("edit_prompt")
    vid_prompt = scene.get("video_prompt", "smooth motion")

    img_out = replicate.run(
        "prunaai/p-image", input={"prompt": img_prompt, "aspect_ratio": "16:9"}
    )
    image_url = (
        img_out
        if isinstance(img_out, str)
        else img_out[0]
        if isinstance(img_out, list)
        else str(img_out)
    )

    if edit_prompt:
        edit_out = replicate.run(
            "prunaai/p-image-edit", input={"images": [image_url], "prompt": edit_prompt}
        )
        image_url = edit_out if isinstance(edit_out, str) else edit_out[0]

    with open(seg_path, "rb") as f:
        audio_data = f.read()
    audio_file = replicate.files.create(content=audio_data, filename="segment.mp3")
    audio_replicate_url = (
        audio_file.urls.get("get")
        if hasattr(audio_file, "urls")
        else getattr(audio_file, "url", str(audio_file))
    )

    vid_input = {"image": image_url, "prompt": vid_prompt, "audio": audio_replicate_url}

    vid_out = replicate.run("prunaai/p-video", input=vid_input)
    video_url = (
        vid_out
        if isinstance(vid_out, str)
        else vid_out.get("video") or vid_out.get("output") or list(vid_out.values())[0]
    )

    r = requests.get(video_url)
    r.raise_for_status()
    with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as vf:
        vf.write(r.content)
        vpath = vf.name
    video_clips.append(VideoFileClip(vpath))

final = concatenate_videoclips(video_clips)
output_path = "music_video_output.mp4"
final.write_videofile(output_path, codec="libx264", audio_codec="aac")
final.close()
for c in video_clips:
    c.close()
print("Saved:", output_path)
display(Video(output_path, embed=True))