The Complete Veo 3.1 Prompting Guide (2026)

Oakgen Team · 10 min read

Veo 3.1 is the first AI video model where the prompt carries dialog, sound design, camera blocking, and character identity in a single pass. That changes how you write for it. Short prompts get short results. Veo rewards density, precision, and a clear separation between what the camera sees and what the microphone hears. This guide walks through the exact prompt anatomy, shows what Veo listens to, explains why its character fidelity beats every competitor, and gives you 20 copy-paste templates across dialog, character, cinematic, product, and educational video. By the end you will know when Veo is worth the credits and when a cheaper model is the right call. Open the AI Video Generator alongside this guide and test each template as you read.

What Veo 3.1 does that nothing else does

Every other top-tier video model in 2026 is silent. Sora 2 generates silent clips. Kling 3 generates silent clips. Seedance 2 generates silent clips. You then take that footage, export it, match it to a separately generated voice track, add sound effects, and pray the lip sync lines up. Veo 3.1 collapses that entire pipeline into one generation.

Native audio in Veo 3.1 means three things happen inside the same inference pass: speech with matching mouth movement, ambient sound that fits the environment, and score or music when you ask for it. All of it is temporally aligned to the pixels because the model produced them together. You do not sync audio to video with Veo — audio and video are the same output.

This is why Veo lands differently from every other model. A dialog scene in Kling is a silent person moving their mouth. A dialog scene in Veo is a performance. The model has opinions about pacing, breath, intonation, and background room tone. You direct it like you would direct a human actor, not like you would prompt an image model.

Where to run Veo 3.1 on Oakgen

Veo 3.1 is selectable from the model dropdown in the AI Video Generator. For multi-shot assembly, storyboarding, and scene stitching, open Cinema Studio — it keeps character identity consistent across cuts.

The prompt anatomy — seven elements

A Veo 3.1 prompt has seven functional slots. You do not need to fill every slot on every prompt, but you should know which slot every sentence belongs to. This is what separates a prompt that generates a generic clip from one that generates the exact scene in your head.

1. Subject. Who or what is in the frame. For characters, give build, age range, ethnicity, hair, clothing, and any distinguishing detail. For objects, give material, condition, and color. Veo uses these tokens to lock identity for the full clip.

2. Action. What the subject is doing. Use active verbs. "Walks, turns, lifts, glances, laughs." Avoid abstract verbs like "exists" or "is present" — Veo needs motion to animate.

3. Environment. Where the scene takes place. Describe surfaces, time of day, weather, props in the background. Environment tokens also shape the audio — a cafe prompt produces cafe ambience automatically.

4. Camera. Lens, height, angle, and framing. "35mm lens, eye level, medium close-up." Veo understands professional camera language better than any current model.

5. Motion. How the camera moves. "Slow dolly in, handheld sway, locked tripod, crane rising." Separate this from Action — camera motion and subject motion are two different instructions.

6. Style. The visual treatment. "Kodak Portra 400 film stock, muted color grade, anamorphic lens flare, 24fps motion blur." Style tokens cascade through the entire clip.

7. Audio and dialog. This is the slot most users skip. Veo listens to explicit audio direction. "The man says 'I told you not to come back' in a low, tired voice. Light rain hitting a tin roof in the background. No music." Quote the actual dialog. Specify tone. Declare what should not be in the audio track.
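The seven slots above can be kept apart in code so that every sentence has an obvious home. The following is a minimal sketch, not an Oakgen or Veo API: the `VeoPrompt` class, its field names, and the choice to emit slots in the order listed above are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class VeoPrompt:
    """Hypothetical container with one field per prompt slot."""
    subject: str = ""
    action: str = ""
    environment: str = ""
    camera: str = ""
    motion: str = ""
    style: str = ""
    audio: str = ""

    def render(self) -> str:
        """Join the filled slots into one prompt, skipping empty ones."""
        parts = []
        for slot in fields(self):
            text = getattr(self, slot.name).strip()
            if not text:
                continue
            # Close each slot with a period so slots do not run together.
            if not text.endswith((".", "!", "?")):
                text += "."
            parts.append(text)
        return " ".join(parts)


prompt = VeoPrompt(
    subject="A man in his late forties, close-cropped salt-and-pepper hair, faded navy work shirt",
    action="He turns slowly toward the door",
    camera="35mm lens, eye level, medium close-up",
    audio="He says, 'I told you not to come back.' Light rain on a tin roof. No music.",
)
print(prompt.render())
```

Leaving a slot empty simply drops it from the output, which matches the advice that not every slot needs filling on every prompt.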

Audio-first prompting — how to write for Veo

The biggest mistake new Veo users make is treating audio as an afterthought. They write a visual prompt, the generation comes back silent or mumbling, and they conclude the audio feature is broken. It is not broken — it is waiting for instructions.

Veo audio works best when you write it like a screenplay direction, not like a visual description. Here is the pattern that works:

"A woman in her early thirties, short auburn hair, navy knit sweater, sits at a kitchen island holding a mug. She looks directly at the camera and says, 'Hello world — I finally got this thing working.' Warm British accent, soft and slightly amused. Morning sunlight through window. Ambient city traffic faint in the background. No music."

Notice four things in that prompt. The dialog is inside quotation marks so Veo knows this is the spoken line. The voice description — "warm British accent, soft and slightly amused" — is attached to the dialog, not floating separately. The ambient audio is declared explicitly. And the music is declared as absent. Declaring absence matters. If you do not say "no music," Veo frequently adds a generic score that will fight your dialog.

What confuses the model. Three patterns reliably degrade Veo audio. First, unquoted dialog — writing "she talks about her day" gives you mumbling or silence. Always quote the actual words. Second, competing audio instructions — "upbeat music and quiet library ambience" produces mush. Pick one dominant audio layer. Third, instructions that contradict the visual — a whisper in a loud nightclub, a shout in a silent cathedral. Veo will split the difference and neither will land.

What the model responds to. Four kinds of direction land reliably: accent and tone adjectives ("warm British, raspy low, clipped New York, soft Southern"), named sound effects ("glass clinking, keyboard typing, distant thunder"), explicit music decisions ("upbeat lo-fi beat" or "no music, ambient only"), and silence as a direction ("beat of silence before she answers").
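Two of the failure patterns above, unquoted dialog and a missing music decision, are easy to catch before spending credits. The sketch below is a rough heuristic, not part of Veo or Oakgen: the function name `lint_audio`, the verb list, and the music keyword list are assumptions, and the regexes will miss edge cases.

```python
import re

# Verbs that suggest the prompt expects speech (illustrative list, not exhaustive).
SPEECH_VERBS = re.compile(
    r"\b(says?|replies|responds|asks|talks|speaks|whispers|shouts)\b", re.I
)
# A quoted span whose quotes are not inside a word, so "crow's" is not a match.
QUOTED_LINE = re.compile(r"""(?<!\w)['"](.+?)['"](?!\w)""")
# Any mention of music, whether requested or declared absent.
MUSIC_DECISION = re.compile(
    r"\b(music|score|soundtrack|lo-fi|piano|orchestral|synth)\b", re.I
)


def lint_audio(prompt: str) -> list[str]:
    """Return warnings for audio direction that is likely to degrade output."""
    warnings = []
    if SPEECH_VERBS.search(prompt) and not QUOTED_LINE.search(prompt):
        warnings.append("speech implied but no quoted dialog")
    if not MUSIC_DECISION.search(prompt):
        warnings.append("no explicit music decision")
    return warnings


good = (
    "A woman sits at a kitchen island and says, "
    "'Hello world, I finally got this thing working.' "
    "Warm British accent. Ambient city traffic. No music."
)
bad = "She talks about her day in a quiet cafe."

print(lint_audio(good))
print(lint_audio(bad))
```

The check treats "no music" as a valid decision, since declaring absence is itself a direction.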

Character fidelity — Veo's strongest suit

Character fidelity is the single clearest advantage Veo 3.1 has over the competition. A character you define in the prompt stays that character for the full eight seconds — same face, same build, same clothing, same micro-expressions. Kling 3 drifts noticeably by second six. Sora 2 can swap identity entirely on fast motion. Veo holds.

The way to unlock this is to describe the character with tokens Veo can lock onto. Abstract descriptions — "a beautiful woman, a handsome man, a young person" — produce generic output because the model picks defaults. Concrete tokens produce specific identity.

A character description that Veo uses well looks like this: "A man in his late forties, medium build, close-cropped salt-and-pepper hair, olive skin, a short trimmed beard, light crow's feet around brown eyes, wearing a faded navy work shirt over a gray tee." That is eight identity tokens. Veo will carry all eight through the clip, and a sequel prompt that repeats the same description tends to reproduce the same character.

For recurring characters across multiple clips, save your identity paragraph and paste it verbatim at the top of every prompt. This is the closest thing Veo has to a character LoRA. For longer character work with stitched scenes, Cinema Studio manages this automatically. For a deeper dive on consistency techniques see the cinematic Veo 3 walkthrough.
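Pasting the identity paragraph verbatim is easy to get wrong by hand, since a small rewording changes the tokens. A minimal sketch of the habit, assuming you build prompts in a script; `IDENTITY`, `with_identity`, and the example shots are hypothetical names and content chosen for illustration.

```python
# The saved identity paragraph, stored once and never retyped.
IDENTITY = (
    "A man in his late forties, medium build, close-cropped salt-and-pepper hair, "
    "olive skin, a short trimmed beard, light crow's feet around brown eyes, "
    "wearing a faded navy work shirt over a gray tee."
)


def with_identity(identity: str, shot: str) -> str:
    """Put the identity paragraph, unchanged, at the top of a shot prompt."""
    return f"{identity.strip()} {shot.strip()}"


shots = [
    "He unlocks a workshop door at dawn. Locked tripod. Birdsong, no music.",
    "He sits at a workbench and says, 'Let's get started.' Room tone, no music.",
]
prompts = [with_identity(IDENTITY, shot) for shot in shots]

for p in prompts:
    print(p)
```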

20 copy-paste templates

Replace the bracketed tokens with your own content. Each template is tuned for Veo 3.1 specifically and will often under-perform on other models.

Dialog scenes

1. Interview setup. "Medium close-up, 50mm lens, eye level. [SUBJECT DESCRIPTION] sits in a [ENVIRONMENT] with soft window light from camera left. They look slightly off-axis past the lens and say, '[QUOTED LINE].' [TONE DESCRIPTION] delivery. Room tone, faint HVAC hum, no music. Locked tripod."

2. Direct-address monologue. "Tight close-up, 35mm lens. [SUBJECT] looks directly into the camera and says, '[QUOTED LINE].' [TONE]. Shallow depth of field, background softly blurred [ENVIRONMENT]. Ambient [ENVIRONMENT SOUND]. No music. Handheld micro-sway."

3. Phone call, one side. "Medium shot, [SUBJECT] holding a phone to their ear, pacing slowly in [ENVIRONMENT]. They say, '[QUOTED LINE].' Pause. '[QUOTED LINE].' [TONE]. Indoor room tone. Floor creaks under their steps. No music."

4. Two-person exchange. "Over-the-shoulder from [SUBJECT A] to [SUBJECT B] seated across a [SURFACE]. A says, '[LINE A].' B replies, '[LINE B].' [TONE FOR EACH]. [ENVIRONMENT] ambient sound. No music. Static frame."

Character-driven

5. Action beat. "Wide shot, low angle. [CHARACTER DESCRIPTION] sprints across a [ENVIRONMENT] toward camera. Handheld camera retreats ahead of them. Breathing audible, footfalls on [SURFACE], distant [ENVIRONMENT SOUND]. No dialog. No music."

6. Emotional close-up. "Extreme close-up, 85mm lens, shallow depth of field. [CHARACTER] — face fills frame — eyes welling, jaw tight. A single breath. They say quietly, '[QUOTED LINE].' [SOFT TONE]. Room tone only, no music."

7. Walk-and-talk. "Tracking shot moving backward at walking pace. [CHARACTER A] and [CHARACTER B] walk side by side through [ENVIRONMENT]. A says, '[LINE A].' B responds, '[LINE B].' Natural conversational tone. [ENVIRONMENT] ambient. Light footstep sound. No music."

8. Reaction shot. "Medium close-up, 50mm lens. [CHARACTER] stands in [ENVIRONMENT] reading something on a phone. Their expression shifts from neutral to [EMOTION]. They do not speak. Ambient [SOUND]. No music. Static frame with slight handheld breathing."

Cinematic

9. Establishing shot. "Wide aerial shot, slow push-in, drone height. [LOCATION] at [TIME OF DAY], [WEATHER]. [LIGHTING DESCRIPTION]. Anamorphic widescreen, cinematic color grade, 24fps motion blur. Ambient [ENVIRONMENT SOUND], low orchestral swell, no dialog."

10. Slow reveal. "Begin on extreme close-up of [DETAIL]. Camera slowly dollies back to reveal [WIDER CONTEXT]. [LIGHTING]. [STYLE TOKENS]. Ambient [SOUND]. Low sustained string note, no dialog."

11. Push-in on character. "Static wide of [CHARACTER] standing alone in [ENVIRONMENT]. Camera dollies slowly toward them over five seconds, ending on medium close-up. Their expression [EMOTION]. Wind through [ENVIRONMENT]. No music until final second, then low drone note."

12. Wide action. "Locked wide shot, [ENVIRONMENT]. [MULTIPLE SUBJECTS] moving through the frame — [ACTION DESCRIPTION]. Golden hour backlight, long shadows, lens flare. Natural ambient sound, no dialog, no music. Cinematic anamorphic framing."

Product

13. Testimonial-style. "Medium close-up, 50mm lens, soft window light. [CUSTOMER PERSONA] holds [PRODUCT] and says, 'I have been using this for [TIMEFRAME] and [SPECIFIC BENEFIT].' Natural warm tone. Home environment ambient. No music."

14. Product demo. "Top-down shot, hands only, soft diffused light on a [SURFACE COLOR] surface. Hands [ACTION] [PRODUCT]. Clean mechanical sound of [PRODUCT SOUND]. No dialog. Subtle upbeat lo-fi beat underneath."

15. Lifestyle context. "Handheld medium shot, [CHARACTER] using [PRODUCT] in [REAL-WORLD ENVIRONMENT]. Natural daylight, documentary aesthetic, slightly desaturated color grade. Ambient [ENVIRONMENT SOUND]. No dialog. Warm indie track, low volume."

16. Hero reveal. "Slow orbit shot around [PRODUCT] on a minimal [BACKGROUND]. Dramatic key light from above, rim light from behind. Macro detail visible. Low synth hum building, no dialog, no voice-over. 24fps."

Educational

17. Explainer, presenter on camera. "Medium shot, 35mm lens, eye level. [PRESENTER DESCRIPTION] stands in [ENVIRONMENT] and says directly to camera, '[QUOTED EXPLANATION — one or two sentences].' Clear, measured, friendly tone. Room tone only. No music. Static frame."

18. Tutorial step. "Top-down shot of hands demonstrating [TASK] on a [SURFACE]. Voice-over says, '[QUOTED STEP INSTRUCTION].' Clear neutral narrator voice. Light ambient sound of the task (e.g., paper, keys, keyboard). No music."

19. Voice-over narration over b-roll. "Wide cinematic shot of [SCENE] with slow drifting camera. A narrator voice-over says, '[QUOTED LINE — one sentence].' [NARRATOR TONE — e.g., calm documentary, warm storyteller]. Ambient [ENVIRONMENT SOUND] underneath. Gentle piano bed, very low volume."

20. Data point reveal. "Static medium shot of [PRESENTER] in a clean [ENVIRONMENT]. They hold up [VISUAL PROP] and say, '[QUOTED STAT OR FACT].' Confident measured tone. Room tone. Single piano note on the reveal, otherwise no music."
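If you keep these templates in a script, the bracketed tokens can be filled programmatically. The sketch below is one way to do it with Python's standard `re` module; the function name `fill_template` and the choice to raise an error on unfilled tokens are assumptions, made so that a stray "[TONE]" never reaches the model.

```python
import re

# Matches any [BRACKETED TOKEN] and captures the text inside the brackets.
TOKEN = re.compile(r"\[([^\]]+)\]")


def fill_template(template: str, values: dict[str, str]) -> str:
    """Replace every bracketed token, failing loudly if one has no value."""
    missing = sorted({name for name in TOKEN.findall(template) if name not in values})
    if missing:
        raise KeyError(f"unfilled tokens: {missing}")
    return TOKEN.sub(lambda match: values[match.group(1)], template)


# Template 1 (interview setup), copied from the list above.
INTERVIEW = (
    "Medium close-up, 50mm lens, eye level. [SUBJECT DESCRIPTION] sits in a "
    "[ENVIRONMENT] with soft window light from camera left. They look slightly "
    "off-axis past the lens and say, '[QUOTED LINE].' [TONE DESCRIPTION] delivery. "
    "Room tone, faint HVAC hum, no music. Locked tripod."
)

print(fill_template(INTERVIEW, {
    "SUBJECT DESCRIPTION": "A woman in her early thirties, short auburn hair, navy knit sweater,",
    "ENVIRONMENT": "sunlit home office",
    "QUOTED LINE": "I did not expect it to work on the first try",
    "TONE DESCRIPTION": "Warm, slightly amused",
}))
```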

Veo 3.1 vs Kling 3 vs Seedance 2 vs Sora 2 — when to pick Veo

Every model in 2026 is good. The question is which one fits the job. This table reflects how each performs on the jobs Oakgen users bring most often.

| Capability | Veo 3.1 | Kling 3 | Seedance 2 | Sora 2 |
|---|---|---|---|---|
| Native dialog audio | Yes, synced | No | No | No |
| Character fidelity (8s) | Excellent | Good | Good | Very good |
| Camera language fluency | Excellent | Very good | Good | Very good |
| Photorealism | Excellent | Excellent | Very good | Excellent |
| Physics coherence | Very good | Good | Very good | Excellent |
| Cost per clip | Premium | Mid | Low-mid | Premium |
| Best for | Dialog, testimonials, cinema | Fast iteration, b-roll | Budget production | Surreal, physics-heavy |

Pick Veo 3.1 when dialog or synced audio is part of the shot, when character identity must hold, or when the result needs to look like professional film. Pick Kling when you are iterating fast and the clip is silent b-roll — see the Kling 3 prompting guide. Pick Seedance when budget is the constraint — see the Seedance prompting guide. For a head-to-head on everything else, the Veo vs Kling vs Wan 2026 comparison breaks down the tradeoffs clip by clip.
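The selection rule in the paragraph above can be written as a small decision function. This is a sketch of that rule only: the function name, the flag names, and the precedence (Veo requirements first, then budget, then Kling as the default) are assumptions, and Sora 2 is left out because the paragraph gives no rule for it.

```python
def pick_model(
    needs_synced_audio: bool = False,
    identity_must_hold: bool = False,
    needs_film_look: bool = False,
    budget_constrained: bool = False,
) -> str:
    """Map shot requirements to a model, following the guidance above."""
    # Any of the three Veo criteria outweighs budget in this sketch.
    if needs_synced_audio or identity_must_hold or needs_film_look:
        return "Veo 3.1"
    if budget_constrained:
        return "Seedance 2"
    # Silent b-roll and fast iteration fall through to Kling.
    return "Kling 3"


print(pick_model(needs_synced_audio=True))
print(pick_model(budget_constrained=True))
print(pick_model())
```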

Cost and duration

Veo 3.1 is premium-priced. A single eight-second clip costs meaningfully more credits than the equivalent Kling or Seedance clip. That price reflects what you are getting — a generation that replaces video model, voice model, and sound design in one pass — but it means Veo is not the right default for every shot.

Use Veo for the shots that are the story. The on-camera testimonial. The hero monologue. The reveal. The emotional beat. For transitions, b-roll, establishing shots with no dialog, and rapid-iteration drafts, run Kling or Seedance and keep your budget for the shots that carry the scene. A common workflow on Oakgen: Kling for fifteen silent b-roll clips, Veo for three dialog clips, stitched in Cinema Studio. Full credit breakdowns are on the pricing page.
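To see what the mixed workflow saves, you can total the credits for a shot plan. The per-clip numbers below are made-up placeholders, not Oakgen prices; substitute the real figures from the pricing page. The function name `plan_cost` is also an assumption.

```python
def plan_cost(plan: dict[str, int], credits_per_clip: dict[str, int]) -> int:
    """Total credits for a plan that maps model name to number of clips."""
    return sum(count * credits_per_clip[model] for model, count in plan.items())


# PLACEHOLDER values for illustration only; these are not real prices.
PLACEHOLDER_CREDITS = {"Kling 3": 10, "Veo 3.1": 40}

# The workflow described above: fifteen silent b-roll clips, three dialog clips.
mixed = plan_cost({"Kling 3": 15, "Veo 3.1": 3}, PLACEHOLDER_CREDITS)
all_veo = plan_cost({"Veo 3.1": 18}, PLACEHOLDER_CREDITS)

print(f"mixed plan: {mixed} credits, all-Veo plan: {all_veo} credits")
```

With real prices in place, the gap between the two totals is the budget freed up for the shots that carry the scene.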

Veo pays you back when you share

If you already make videos on Oakgen, the referral program credits your account when someone you refer generates their first Veo clip. Creators running a Veo-heavy pipeline tend to recoup their own credit costs within the first month.

FAQ

Does Veo 3.1 always generate audio? Only reliably if you ask for it. If your prompt contains no audio or dialog direction, Veo often produces a clip with minimal ambient sound, an unrequested generic score, or silence. Explicit audio direction is the on-switch.

How long can a Veo 3.1 clip be? Eight seconds per generation is the current maximum. For longer sequences, stitch multiple Veo clips together in Cinema Studio — it preserves character identity across cuts.
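Planning a longer sequence comes down to dividing the target runtime by the eight-second cap and rounding up. A minimal sketch, assuming hard cuts with no overlap between clips; the function name `clips_needed` is hypothetical.

```python
import math

MAX_CLIP_SECONDS = 8  # per-generation maximum stated above


def clips_needed(total_seconds: float, max_clip: int = MAX_CLIP_SECONDS) -> int:
    """Number of generations required to cover a target runtime."""
    if total_seconds <= 0:
        return 0
    return math.ceil(total_seconds / max_clip)


print(clips_needed(30))  # a 30-second spot
print(clips_needed(60))  # a one-minute piece
```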

Can Veo 3.1 match a reference voice? No. You direct the voice with descriptive tokens (accent, tone, age, texture) but you cannot clone a specific person's voice inside Veo. Pair Veo visuals with an ElevenLabs voice if you need a specific cloned voice.

Why does my character change halfway through the clip? Your identity description is too short. Add four to six more concrete tokens — specific clothing items, hair detail, skin detail, eye color, distinguishing marks. Veo holds what it can lock onto.
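A quick way to spot a thin identity description is to count its comma-separated descriptors. This is a rough proxy, not how Veo tokenizes anything: the function names and the minimum of six are assumptions, and the comma count can undercount (a shirt and a tee in one clause count once).

```python
def count_descriptors(description: str) -> int:
    """Count non-empty comma-separated descriptors in a character description."""
    return len([part for part in description.split(",") if part.strip()])


def needs_more_detail(description: str, minimum: int = 6) -> bool:
    """Flag descriptions with fewer descriptors than an assumed minimum."""
    return count_descriptors(description) < minimum


thin = "a beautiful woman"
detailed = (
    "A man in his late forties, medium build, close-cropped salt-and-pepper hair, "
    "olive skin, a short trimmed beard, light crow's feet around brown eyes, "
    "wearing a faded navy work shirt over a gray tee."
)

print(count_descriptors(thin), needs_more_detail(thin))
print(count_descriptors(detailed), needs_more_detail(detailed))
```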

Image-to-video or text-to-video — which is better for Veo? Text-to-video for flexibility and dialog scenes. Image-to-video when you need to match an exact look or character face you have already generated. For character work, generate the face as a still first, then use it as the Veo starting frame.

Open Veo 3.1 in the AI Video Generator and run Template 1 with your own character. The difference from your last silent video model will be obvious in the first clip.
