HappyHorse 1.0
Single-pass text-and-image-to-video with native audio and 7-language lip-sync, ranked #1 on the public video leaderboard.
HappyHorse 1.0 is Alibaba ATH-AI Innovation Division's flagship video model, ranked #1 on the Artificial Analysis Video Arena with 1381 Elo and a 107-point margin over #2. A 15B-parameter, 40-layer single-stream Transformer generates video and audio in one forward pass at native 1080p, with typical renders in roughly 10–38 seconds and synchronized lip-sync across 7 languages. On Oakgen it shares the unified credit pool with 30+ video models.
Capabilities at a glance
- #1 on the Artificial Analysis Video Arena — 1381 Elo, 107-pt margin over #2
- Single forward pass generates video and audio together — no bolt-on TTS
- Multilingual lip-sync in 7 languages: English, Mandarin, Cantonese, Japanese, Korean, German, French
- Native 1080p HD output — no upscaling step
- ~10-second typical generation; ~38 seconds for a full 1080p render on a single H100
- 15B-parameter, 40-layer Transformer architecture with no cross-attention
- Up to 15-second clips on paid tier (12s on lite tier)
Specs
- Starting price: $0.15 / generation
- Generation time: 10–38 seconds
- Max resolution: 1920×1080 (native 1080p)
- Inputs → outputs: text, image → video, audio
How to use HappyHorse 1.0
1. Write a prompt with motion and camera direction. HappyHorse responds well to camera vocabulary. Example: 'Slow tracking shot of a courier on a bicycle weaving through Shanghai night traffic, neon reflections, ambient city sounds with light rain'.
2. Optional: add an image input for I2V. Drop in a reference image to anchor the opening frame. HappyHorse leads the Image-to-Video arena (1401 Elo), with strong subject consistency from a single still.
3. Pick a language for lip-sync if the clip has dialogue. Choose from English, Mandarin, Cantonese, Japanese, Korean, German, or French. Lip movement is generated in the same pass as the video, with no separate dubbing step.
4. Generate. Oakgen runs HappyHorse via fal. A typical 10-second 1080p clip with audio renders in ~10–38 seconds.
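Putting the steps together for image-to-video, a request payload might look like the sketch below. Note that the `image_url` field is an assumption: the API example on this page only shows text-to-video, so verify the actual I2V parameter name against the Oakgen API reference.

```javascript
// Sketch of an image-to-video request for HappyHorse 1.0.
// ASSUMPTION: "image_url" is a hypothetical parameter name for the
// I2V reference image; it is not documented on this page.
function buildI2VRequest(prompt, imageUrl, language) {
  return {
    model: "happyhorse-1-0",
    prompt, // motion + camera direction (step 1)
    image_url: imageUrl, // anchors the opening frame (step 2)
    lip_sync_language: language, // one of the 7 supported languages (step 3)
    duration: 10,
    aspect_ratio: "16:9",
  };
}

// Step 4: send it to the same endpoint as the text-to-video example.
async function generateClip(payload) {
  const res = await fetch("https://api.oakgen.ai/v1/generate/video", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OAKGEN_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(payload),
  });
  return res.json();
}
```

Because lip-sync is generated in the same forward pass, the language choice goes in the generation request itself rather than in a separate dubbing call.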
API access
```shell
curl -X POST https://api.oakgen.ai/v1/generate/video \
  -H "Authorization: Bearer $OAKGEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "happyhorse-1-0",
    "prompt": "Slow tracking shot of a courier on a bicycle weaving through Shanghai night traffic, neon reflections, ambient city sounds with light rain",
    "duration": 10,
    "aspect_ratio": "16:9",
    "lip_sync_language": "english"
  }'
```

```javascript
const res = await fetch("https://api.oakgen.ai/v1/generate/video", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OAKGEN_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "happyhorse-1-0",
    prompt: "Slow tracking shot of a courier on a bicycle weaving through Shanghai night traffic, neon reflections, ambient city sounds with light rain",
    duration: 10,
    aspect_ratio: "16:9",
    lip_sync_language: "english",
  }),
});
const { jobId } = await res.json();
console.log("Job started:", jobId);
```

Compared to other models
HappyHorse 1.0 leads the Artificial Analysis Video Arena's no-audio tracks in both T2V (1365 vs 1270) and I2V (1401 vs 1347). Seedance 2.0 narrowly wins Image-to-Video with audio (1182 vs 1167) and accepts more input modalities (video and audio references). Pick HappyHorse for the highest-quality video; pick Seedance when you need multimodal reference control.
Veo 3 still has the tightest English dialogue lip-sync at sub-10ms latency. HappyHorse 1.0 wins on multilingual coverage (7 languages including Cantonese, Japanese, Korean) and on raw leaderboard Elo, but for English-only spoken dialogue Veo 3 remains best-in-class.
Kling 3.0 supports up to 4K output; HappyHorse 1.0 caps at native 1080p. For deliverables that need 4K, pick Kling. For everything at 1080p and under, HappyHorse generates ~30–40% faster and currently outranks Kling on the public leaderboard.
License & commercial use
Commercial use: included on Oakgen paid plans (Pro and above).
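The API access example above returns a jobId rather than a finished file, so a client typically polls for the result. The sketch below shows one way that loop might look; the `GET /v1/jobs/:id` endpoint and the `status`/`videoUrl` fields are assumptions, since this page only documents the generation call, so confirm them against the Oakgen API reference.

```javascript
// Hypothetical polling loop for a HappyHorse render job.
// ASSUMPTIONS: the "/v1/jobs/:id" endpoint and the "status" /
// "videoUrl" response fields are not documented on this page.
async function waitForClip(
  jobId,
  { intervalMs = 5000, maxTries = 24 } = {},
  fetchJob = defaultFetchJob // injectable for testing
) {
  for (let i = 0; i < maxTries; i++) {
    const job = await fetchJob(jobId);
    if (job.status === "completed") return job.videoUrl;
    if (job.status === "failed") throw new Error(`Job ${jobId} failed`);
    // Typical renders take ~10-38 s, so a few-second interval is plenty.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Job ${jobId} timed out`);
}

async function defaultFetchJob(jobId) {
  const res = await fetch(`https://api.oakgen.ai/v1/jobs/${jobId}`, {
    headers: { Authorization: `Bearer ${process.env.OAKGEN_API_KEY}` },
  });
  return res.json();
}
```

Passing the fetcher as a parameter keeps the retry logic separate from the HTTP call, which makes the loop easy to unit-test with a stub.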