Inworld TTS-1.5 Max: The New #1 on Speech Arena (ElevenLabs Dethroned)

For over a year, every blind listening test ended the same way. ElevenLabs Multilingual V2 on top, everything else fighting for second. That streak is over. On May 8, 2026, Inworld's TTS-1.5 Max climbed to the #1 position on the TTS Arena leaderboard with a 1247 ELO rating -- 31 points clear of ElevenLabs, which had held the top spot since the Arena launched. In a format where human listeners compare two audio clips blind and pick the one that sounds better, a 31-point gap is not noise. It is a decisive lead.

The TTS Arena, run by Artificial Analysis, works the same way as the Video Arena and Chatbot Arena: anonymous head-to-head matchups, thousands of votes, no branding visible. Listeners hear two clips of the same text, pick a winner, and the ELO system updates. By the time Inworld TTS-1.5 Max crossed 2,000 matchups, the ranking had stabilized. ElevenLabs dropped to second. OpenAI's TTS-1 HD sits in third. The rest of the field -- PlayHT, MiniMax, Google, Amazon Polly -- is 80+ points behind the leader.

This is not a marginal improvement from a well-funded challenger. Inworld AI is a gaming-focused company known for NPC dialogue systems. Their TTS model was not even on most people's radar before April 2026. That makes the result more interesting, not less.

Try AI Text-to-Speech on Oakgen

Oakgen offers ElevenLabs, MiniMax, and other leading TTS models through a single credit pool. Generate natural speech in 30+ languages from our audio generator -- no separate subscriptions required. Free credits on signup.

What Is the TTS Arena and Why Does It Matter?

The TTS Arena is a public benchmark where real listeners -- not automated metrics -- judge speech quality in blind tests. Two audio clips from different models play back-to-back. The listener picks whichever sounds more natural. No labels, no model names, no priming. Just audio.

This matters because traditional TTS evaluation methods (MOS tests, PESQ, POLQA) have become unreliable. Models have gotten so good that automated quality scores plateau, and lab-administered Mean Opinion Score tests produce inconsistent results depending on the listener pool and testing conditions. The Arena format addresses both problems: human listeners replace the saturated automated metrics, and sheer vote volume lets the ELO system average out inconsistency between individual evaluators. Blind presentation also removes any bias from knowing which model produced which clip.
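To make the rating mechanics concrete, here is a minimal sketch of a single ELO update after one blind vote. The K-factor of 32 and the function names are our assumptions for illustration; the Arena's actual parameters and fitting procedure may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the standard Elo curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def apply_vote(winner: float, loser: float, k: float = 32.0) -> tuple[float, float]:
    """Return the new (winner, loser) ratings after a single vote.

    k is an assumed K-factor, not a documented Arena setting.
    """
    shift = k * (1.0 - expected_score(winner, loser))
    return winner + shift, loser - shift


# Two evenly rated models: the winner takes 16 points from the loser.
print(apply_vote(1200, 1200))  # (1216.0, 1184.0)

# An upset (the lower-rated model wins) moves ratings more than an expected win.
upset_winner, _ = apply_vote(1216, 1247)
expected_winner, _ = apply_vote(1247, 1216)
print(round(upset_winner - 1216, 1), round(expected_winner - 1247, 1))
```

Because surprising results move ratings more than expected ones, a single odd vote shifts a score only slightly, and its effect fades as more votes come in.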

As of May 12, 2026, the top five on the TTS Arena leaderboard:

Rank	Model	ELO Score	Change (30d)
#1	Inworld TTS-1.5 Max	1247	+58
#2	ElevenLabs Multilingual V2	1216	-12
#3	OpenAI TTS-1 HD	1184	+3
#4	PlayHT 3.0	1161	-5
#5	MiniMax Speech HD	1149	+7

Under the standard Elo formula, the 31-point gap between Inworld and ElevenLabs translates to Inworld winning roughly 54% of blind matchups when the two models are paired. That is a modest edge in any single comparison, but a consistent one across thousands of votes. For context, the gap between ElevenLabs and third-place OpenAI TTS-1 HD (32 points) is about the same magnitude -- meaning Inworld is as far ahead of ElevenLabs as ElevenLabs is ahead of OpenAI.
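For readers who want to convert rating gaps into head-to-head win rates themselves, the sketch below applies the standard Elo expected-score formula to the leaderboard figures above. It assumes the Arena's scores follow the usual 400-point logistic scale, which is our assumption rather than something the leaderboard states.

```python
# Ratings copied from the leaderboard table above (May 12, 2026).
LEADERBOARD = [
    ("Inworld TTS-1.5 Max", 1247),
    ("ElevenLabs Multilingual V2", 1216),
    ("OpenAI TTS-1 HD", 1184),
    ("PlayHT 3.0", 1161),
    ("MiniMax Speech HD", 1149),
]


def win_probability(rating_gap: float) -> float:
    """Expected win rate for the higher-rated model (standard Elo, 400-point scale)."""
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400))


leader_name, leader_rating = LEADERBOARD[0]
for name, rating in LEADERBOARD[1:]:
    gap = leader_rating - rating
    print(f"{leader_name} vs {name}: gap {gap}, "
          f"expected win rate {win_probability(gap):.1%}")
```

On this scale a 31-point gap works out to an expected win rate of about 54%, and a gap of roughly 56 to 70 points would be needed for a 58-60% win rate.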

What Makes Inworld TTS-1.5 Max Sound Different

After listening to several hundred paired clips across the Arena and running our own internal comparisons, three things stand out about Inworld TTS-1.5 Max.

Prosody That Reads the Room

The biggest tell with most TTS models is how they handle sentences that require contextual emphasis. A sentence like "I never said she stole my money" has seven different meanings depending on which word carries the stress. Most TTS models either stress everything equally (flat) or over-emphasize in a way that sounds performative.
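The seven readings are easy to enumerate. This small sketch simply marks each word in turn with asterisks; the marker style and the function name are ours, and it says nothing about how any particular TTS engine accepts emphasis hints.

```python
def stress_variants(sentence: str) -> list[str]:
    """Return one copy of the sentence per word, with that word marked as stressed."""
    words = sentence.split()
    return [
        " ".join(f"*{w}*" if i == stressed else w for i, w in enumerate(words))
        for stressed in range(len(words))
    ]


for variant in stress_variants("I never said she stole my money"):
    print(variant)
```

Reading each line aloud with the marked word stressed shows how much meaning rides on a choice the written text never spells out.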

Inworld TTS-1.5 Max picks the contextually appropriate emphasis more consistently than any model we have tested. In passages where the surrounding text makes the intended stress clear, the model gets it right at a rate that feels instinctive rather than computed. ElevenLabs does this well too -- but Inworld does it slightly more often and more naturally, particularly in longer passages where prosodic coherence across sentences matters.

Breathing and Micro-Pauses

The uncanny valley in TTS has moved. It is no longer about robotic pitch or metallic timbre -- the top models have solved those. The remaining gap between AI speech and human speech lives in the small things: when a speaker takes a breath, the barely perceptible pause before a subordinate clause, the slight deceleration before a period.

Inworld TTS-1.5 Max handles these micro-timing details better than anything else currently available. The breathing is not random white noise inserted at comma boundaries. It varies in depth and timing based on sentence length and speaking rate. The pauses between sentences are not uniform -- they expand slightly after longer sentences and compress after short ones, the way a human narrator unconsciously adjusts.

The Gaming DNA

This is where Inworld's background becomes relevant. The company has spent years building NPC dialogue systems for game studios. That means their training data and fine-tuning priorities are skewed toward expressive, character-driven speech rather than clean narration. The result is a model that handles emotional variation -- excitement, hesitation, deadpan delivery, warmth -- with less prompting than competitors.

For audiobook narration and podcast production, this translates to speech that sounds like a voice actor performing, not a text-to-speech engine reading. The difference is subtle but consistent across hours of output.

Honest Trade-offs

Inworld TTS-1.5 Max is not better at everything. Language support is currently limited to 9 languages -- a significant gap versus ElevenLabs' 29 or MiniMax's 36+. Voice cloning is not available yet. And the model is not accessible through any third-party API as of this writing, which limits integration options. If you need multilingual coverage or voice cloning, ElevenLabs remains the practical choice. Oakgen's text-to-speech tools give you access to both ElevenLabs and MiniMax under one credit balance for exactly this reason.

Inworld TTS-1.5 Max vs ElevenLabs: Head-to-Head

The Arena gives us aggregate rankings, but the real question is where each model wins and where it loses. We ran 50 matched comparisons across five content categories to break it down. The table below pairs those listening impressions with feature, latency, and pricing rows.

Category	Inworld TTS-1.5 Max	ElevenLabs Multilingual V2	Winner
Long-form narration (5+ min)	Consistent prosody, natural pacing across extended passages	Slight prosodic drift on passages over 3 minutes	Inworld
Conversational / dialogue	Expressive, character-appropriate inflection	Clean and natural but less varied emotionally	Inworld
Multilingual (non-English)	9 languages, good quality within that set	29 languages, high quality across all	ElevenLabs
Voice cloning	Not available	Instant and Professional cloning	ElevenLabs
API / integration	Limited to Inworld platform	Mature REST API, SDKs, broad platform support	ElevenLabs
Latency (first byte)	~350ms reported	200-500ms (Turbo: <200ms)	Tie
Emotional range	Wide range with minimal prompting	Good range, requires style/stability tuning	Inworld
Pricing	Not publicly disclosed	~$0.0001666/char on Oakgen	ElevenLabs (transparent)

The pattern is clear: Inworld wins on raw audio quality and expressiveness. ElevenLabs wins on ecosystem -- languages, cloning, API maturity, pricing transparency. For a creator whose primary need is English narration or dialogue and who does not need cloning, Inworld is the better-sounding model right now. For anyone building a product, working in multiple languages, or needing voice cloning, ElevenLabs' infrastructure advantage is substantial.
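Since ElevenLabs is the only one of the two with a published per-character rate, here is a rough budgeting sketch based on the approximate Oakgen figure from the table. The 500,000-character manuscript is a hypothetical size chosen for illustration, and actual billing may round or bundle credits differently.

```python
# Approximate ElevenLabs rate on Oakgen, taken from the comparison table.
ELEVENLABS_USD_PER_CHAR = 0.0001666


def estimate_cost(char_count: int, usd_per_char: float = ELEVENLABS_USD_PER_CHAR) -> float:
    """Estimate the cost in USD of synthesizing char_count characters."""
    if char_count < 0:
        raise ValueError("char_count must be non-negative")
    return round(char_count * usd_per_char, 2)


script = "Welcome back to the show. Today we are talking about text-to-speech."
print(estimate_cost(len(script)))  # cost of one short intro line
print(estimate_cost(500_000))      # hypothetical 500,000-character manuscript
```

At that rate the hypothetical manuscript comes to about $83. No equivalent estimate is possible for Inworld until its pricing is disclosed.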

For a deeper comparison of ElevenLabs against other platforms, see our ElevenLabs alternatives breakdown.

What This Means for the TTS Market in 2026

Three takeaways from the leaderboard shift.

1. The Quality Ceiling Is Still Rising

A year ago, the consensus was that TTS quality had plateaued -- that ElevenLabs had "solved" natural speech and future improvements would be incremental. Inworld just disproved that. A 31-point ELO gap in blind listening tests is not incremental. There is still meaningful headroom in prosody, micro-timing, and emotional expressiveness that the field has not fully captured.

2. Gaming Companies Have a Data Advantage

Inworld's background in NPC dialogue gave them something most TTS companies lack: massive training datasets of expressive, performance-driven speech rather than audiobook narration or news reading. Game dialogue covers whispers, shouts, sarcasm, fear, boredom, flirtation, and everything between. That diversity in training data shows up in the model's ability to handle tonal shifts without explicit prompting.

This suggests that other gaming-adjacent AI companies -- Replica Studios, Altered, Sonantic (now part of Spotify) -- may have similar advantages if they push their TTS models into general-purpose use.

3. The Integration Gap Will Decide Adoption

Being #1 on a leaderboard and being #1 in market adoption are different things. ElevenLabs' moat is not just audio quality -- it is the API, the voice cloning, the 29 languages, the thousands of apps already integrated. Inworld will need to open a public API, add voice cloning, expand language support, and build developer tooling before they can challenge ElevenLabs' market position. The audio quality earns attention. The ecosystem earns revenue.

For creators using Oakgen, the practical impact is this: we will add Inworld TTS-1.5 Max to the audio generator as soon as a stable third-party API becomes available. When that happens, the same credit balance that covers ElevenLabs and MiniMax today will cover Inworld too -- no new subscription, no migration. That is the point of a unified platform.

Who Should Care About Inworld TTS-1.5 Max Right Now

Audiobook producers -- If your workflow is English-primary and you are looking for the most human-sounding narration available, Inworld is worth testing once API access opens. The prosodic consistency over long passages is its strongest practical advantage.

Game developers -- This is Inworld's home turf. NPC dialogue, ambient narration, character interactions -- the model was built for this. If you are already using Inworld's NPC tools, the TTS upgrade is a significant quality jump.

Podcast creators -- The conversational quality and emotional range make this a strong candidate for AI-narrated segments, interview preparation clips, or supplementary audio content.

Product teams building voice interfaces -- Track the API timeline closely. If Inworld opens a competitive API with reasonable latency, the quality advantage could be worth switching for conversational AI assistants and agent-based chat interfaces.

Everyone else -- Wait. The quality lead is real, but the ecosystem is not there yet. ElevenLabs remains the practical default for production work that requires multilingual support, voice cloning, or API integration. Check Oakgen's pricing for current rates on ElevenLabs and MiniMax -- both are available today with pay-as-you-go credits.

Frequently Asked Questions

What is Inworld TTS-1.5 Max?

Inworld TTS-1.5 Max is a text-to-speech model developed by Inworld AI, a company previously known for NPC dialogue systems in gaming. As of May 2026, it holds the #1 position on the TTS Arena leaderboard with a 1247 ELO rating, surpassing ElevenLabs Multilingual V2 by 31 points in blind human listening tests. The model is notable for its natural prosody, expressive emotional range, and micro-timing details that closely mimic human speech patterns.

Is Inworld TTS better than ElevenLabs?

In raw audio quality as measured by blind listening tests, yes -- Inworld TTS-1.5 Max currently outperforms ElevenLabs Multilingual V2 on the TTS Arena by 31 ELO points, which under the standard Elo formula corresponds to winning roughly 54% of head-to-head matchups. However, ElevenLabs remains superior in language coverage (29 vs 9 languages), offers voice cloning that Inworld lacks, and has a mature API ecosystem integrated into thousands of products. Which is "better" depends on whether your priority is peak audio quality or practical features and integration breadth.

Can I use Inworld TTS-1.5 Max on Oakgen?

Not yet. As of May 12, 2026, Inworld TTS-1.5 Max does not offer a public third-party API. Oakgen will add the model to its audio generator as soon as stable API access becomes available. In the meantime, ElevenLabs Multilingual V2 and MiniMax Speech HD are both available on Oakgen under the same credit pool.

What languages does Inworld TTS-1.5 Max support?

Inworld TTS-1.5 Max currently supports 9 languages. The exact list has not been fully published, but confirmed languages include English, Spanish, French, German, Japanese, and Mandarin. This is a significant limitation compared to ElevenLabs (29 languages) and MiniMax (36+ languages). If multilingual TTS is a core requirement, ElevenLabs remains the stronger choice -- available today on Oakgen's text-to-speech tools.

How does the TTS Arena leaderboard work?

The TTS Arena, operated by Artificial Analysis, uses blind human evaluation. Listeners hear two audio clips generated from the same text by different models, with no model names or labels visible. They pick whichever clip sounds more natural. An ELO rating system (the same math used in chess rankings) updates after each vote, producing a stable ranking after several thousand matchups. The format eliminates brand bias and automated metric gaming, making it the most credible public benchmark for TTS quality in 2026.
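As a rough sanity check on why vote volume matters, the sketch below estimates the sampling noise in a head-to-head win rate and converts it into ELO points. It uses a normal approximation and assumes, purely for illustration, that all votes come from one evenly matched pairing. The real Arena spreads votes across many pairings, so treat the numbers as an order-of-magnitude guide rather than the Arena's own statistics.

```python
import math


def win_rate_margin(n_votes: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error on a win rate near 50%."""
    return z * math.sqrt(0.25 / n_votes)


def win_rate_to_elo_gap(win_rate: float) -> float:
    """Invert the standard Elo curve: win rate -> rating gap."""
    return 400 * math.log10(win_rate / (1 - win_rate))


for n in (200, 2000, 20000):
    margin = win_rate_margin(n)
    noise = win_rate_to_elo_gap(0.5 + margin)
    print(f"{n} votes: +/-{margin:.1%} win rate, ~{noise:.0f} ELO points")
```

With 2,000 votes the margin works out to roughly 2 percentage points either way, or about 15 ELO points under these assumptions, which is why gaps of a few points between neighbors on the leaderboard deserve more caution than a gap of 30 or more.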

Generate AI Speech on Oakgen

Access ElevenLabs, MiniMax, and more TTS models under one credit balance. No stacking subscriptions. Free credits on signup, no credit card required.