HappyHorse 1.0
Single-pass text-and-image-to-video with native audio and 7-language lip-sync, ranked #1 on the public video leaderboard.
HappyHorse 1.0 is Alibaba ATH-AI Innovation Division's flagship video model, ranked #1 on the Artificial Analysis Video Arena with 1381 Elo and a 107-point margin over #2. A 15B-parameter, 40-layer single-stream Transformer generates video and audio in one forward pass at native 1080p, with typical renders in roughly 10–38 seconds and synchronized lip-sync across 7 languages. On Oakgen it shares the unified credit pool with 30+ video models.
Capabilities at a glance
- #1 on the Artificial Analysis Video Arena — 1381 Elo, 107-pt margin over #2
- Single forward pass generates video and audio together — no bolt-on TTS
- Multilingual lip-sync in 7 languages: English, Mandarin, Cantonese, Japanese, Korean, German, French
- Native 1080p HD output — no upscaling step
- ~10-second typical generation; ~38 seconds for a full 1080p render on a single H100
- 15B-parameter, 40-layer Transformer architecture with no cross-attention
- Up to 15-second clips on paid tier (12s on lite tier)
Specs
- Starting price: $0.15 / generation
- Generation time: 10–38 seconds
- Max resolution: 1920×1080 (native 1080p)
- Inputs → outputs: text, image → video, audio
How to use HappyHorse 1.0
1. Write a prompt with motion and camera direction. HappyHorse responds well to camera vocabulary. Example: 'Slow tracking shot of a courier on a bicycle weaving through Shanghai night traffic, neon reflections, ambient city sounds with light rain'.
2. Optional: add an image input for I2V. Drop in a reference image to anchor the opening frame. HappyHorse leads the Image-to-Video arena (1401 Elo), with strong subject consistency from a single still.
3. Pick a language for lip-sync if the clip has dialogue. Choose from English, Mandarin, Cantonese, Japanese, Korean, German, or French. Lip movement is generated in the same pass as the video, with no separate dubbing step.
4. Generate. Oakgen runs HappyHorse via fal. A typical 10-second 1080p clip with audio renders in ~10–38 seconds.
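Putting the steps together for image-to-video, a request payload might look like the sketch below. Note that the `image_url` field is an assumption: the API example on this page only shows text-to-video, so verify the actual I2V parameter name against the Oakgen API reference.

```javascript
// Sketch of an image-to-video request for HappyHorse 1.0.
// ASSUMPTION: "image_url" is a hypothetical parameter name for the
// I2V reference image; it is not documented on this page.
function buildI2VRequest(prompt, imageUrl, language) {
  return {
    model: "happyhorse-1-0",
    prompt, // motion + camera direction (step 1)
    image_url: imageUrl, // anchors the opening frame (step 2)
    lip_sync_language: language, // one of the 7 supported languages (step 3)
    duration: 10,
    aspect_ratio: "16:9",
  };
}

// Step 4: send it to the same endpoint as the text-to-video example.
async function generateClip(payload) {
  const res = await fetch("https://api.oakgen.ai/v1/generate/video", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OAKGEN_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(payload),
  });
  return res.json();
}
```

Because lip-sync is generated in the same forward pass, the language choice goes in the generation request itself rather than in a separate dubbing call.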
API access
```shell
curl -X POST https://api.oakgen.ai/v1/generate/video \
  -H "Authorization: Bearer $OAKGEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "happyhorse-1-0",
    "prompt": "Slow tracking shot of a courier on a bicycle weaving through Shanghai night traffic, neon reflections, ambient city sounds with light rain",
    "duration": 10,
    "aspect_ratio": "16:9",
    "lip_sync_language": "english"
  }'
```

```javascript
const res = await fetch("https://api.oakgen.ai/v1/generate/video", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OAKGEN_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "happyhorse-1-0",
    prompt: "Slow tracking shot of a courier on a bicycle weaving through Shanghai night traffic, neon reflections, ambient city sounds with light rain",
    duration: 10,
    aspect_ratio: "16:9",
    lip_sync_language: "english",
  }),
});
const { jobId } = await res.json();
console.log("Job started:", jobId);
```

Compared to other models
HappyHorse 1.0 leads the Artificial Analysis Video Arena's no-audio tracks in both T2V (1365 vs 1270) and I2V (1401 vs 1347). Seedance 2.0 narrowly wins Image-to-Video with audio (1182 vs 1167) and accepts more input modalities (video and audio references). Pick HappyHorse for the highest-quality video; pick Seedance when you need multimodal reference control.
Veo 3 still has the tightest English dialogue lip-sync at sub-10ms latency. HappyHorse 1.0 wins on multilingual coverage (7 languages including Cantonese, Japanese, Korean) and on raw leaderboard Elo, but for English-only spoken dialogue Veo 3 remains best-in-class.
Kling 3.0 supports up to 4K output; HappyHorse 1.0 caps at native 1080p. For deliverables that need 4K, pick Kling. For everything at 1080p and under, HappyHorse generates ~30–40% faster and currently outranks Kling on the public leaderboard.
License & commercial use
Commercial use: included on Oakgen paid plans (Pro and above).
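The API access example above returns a jobId rather than a finished file, so a client typically polls for the result. The sketch below shows one way that loop might look; the `GET /v1/jobs/:id` endpoint and the `status`/`videoUrl` fields are assumptions, since this page only documents the generation call, so confirm them against the Oakgen API reference.

```javascript
// Hypothetical polling loop for a HappyHorse render job.
// ASSUMPTIONS: the "/v1/jobs/:id" endpoint and the "status" /
// "videoUrl" response fields are not documented on this page.
async function waitForClip(
  jobId,
  { intervalMs = 5000, maxTries = 24 } = {},
  fetchJob = defaultFetchJob // injectable for testing
) {
  for (let i = 0; i < maxTries; i++) {
    const job = await fetchJob(jobId);
    if (job.status === "completed") return job.videoUrl;
    if (job.status === "failed") throw new Error(`Job ${jobId} failed`);
    // Typical renders take ~10-38 s, so a few-second interval is plenty.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Job ${jobId} timed out`);
}

async function defaultFetchJob(jobId) {
  const res = await fetch(`https://api.oakgen.ai/v1/jobs/${jobId}`, {
    headers: { Authorization: `Bearer ${process.env.OAKGEN_API_KEY}` },
  });
  return res.json();
}
```

Passing the fetcher as a parameter keeps the retry logic separate from the HTTP call, which makes the loop easy to unit-test with a stub.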