We ran GPT Image 2 through 25 methodical tests across five capability dimensions: text rendering, layout reasoning, photorealism, multi-image coherence, and edit fidelity. It passed 19, partially passed 4, and failed 2. The model topped LMArena at 1512 Elo the week of its 2026-04-21 launch, and the numbers below explain why — and where the cherry-picked demos stopped short of the truth. Every prompt is printed verbatim. Every score is assigned against a pre-declared rubric. Every failure is described honestly, with no softening. If you want the short version, skip to the score table. If you want to reproduce the results yourself, the prompts are copy-paste ready.
Methodology
Cherry-picked demos mislead. A launch-day highlight reel is a stack of best-of-N generations filtered down from hundreds of tries; it tells you what the model can do on a good day, not what it will do on your first attempt.
We designed this test suite to answer a different question: for each capability, what happens on the first generation of a cleanly written prompt?
Five dimensions, five prompts each. Each prompt was designed to isolate one capability. We ran one generation per prompt on GPT Image 2 via Oakgen — no retries, no best-of-N, no prompt engineering iteration. The prompt you see is the prompt we sent.
Scoring rubric:
- Pass. The output matches the prompt's explicit instructions on every load-bearing element. Minor aesthetic deviation is allowed.
- Partial. The output matches most instructions but breaks on one or more specified details (a mis-rendered word, a missing element, an instruction the model quietly ignored).
- Fail. The output misses the core intent of the prompt or breaks on multiple instructions at once.
Every prompt below runs on GPT Image 2 at 26 credits per image — about 10 cents of real cost — and returns in roughly three seconds. A full replay of all 25 tests costs 650 credits. Annual Ultimate and Creator plans include a 30-day free window, which covers the full run several times over.
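If you want to script the replay instead of using the model page, the sketch below captures the protocol: each prompt goes out exactly once and the first response is recorded. The endpoint URL, auth header, and JSON field names are placeholders we invented for illustration (Oakgen's actual API may differ); the one-generation-per-prompt rule is the only part that matters.

```python
"""Minimal reproduction harness: each prompt is sent exactly once and the
first response is kept. The endpoint URL, auth header, and JSON field names
are placeholders for illustration only, not Oakgen's documented API."""
import json
import time

import requests

API_URL = "https://example.com/api/generate"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"                      # placeholder auth scheme

# Test IDs mapped to the verbatim prompts printed in the suites below.
PROMPTS = {
    "T1": "A minimalist poster, A3 portrait, pure white background. ...",
    # ...the remaining 24 prompts, copied verbatim from this article
}


def run_once(test_id: str, prompt: str) -> dict:
    """Send a single generation request: no retries, no best-of-N."""
    started = time.time()
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": "gpt-image-2", "prompt": prompt, "n": 1},  # assumed field names
        timeout=120,
    )
    resp.raise_for_status()
    return {
        "test": test_id,
        "latency_s": round(time.time() - started, 1),
        "response": resp.json(),
    }


if __name__ == "__main__":
    results = [run_once(tid, p) for tid, p in PROMPTS.items()]
    print(json.dumps(results, indent=2))
```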
Test suite 1 — Text rendering (5 prompts)
Text rendering is where previous image models melt. Letters dissolve into glyph soup; headline spellings go wrong; small-point copy fills with pseudo-language. This suite graduates from the easiest rendering task (one large word) to the hardest (typography on a curved 3D surface).
T1. Single large word, latin alphabet.
A minimalist poster, A3 portrait, pure white background. A single word "NORTHWARD" set in 220pt condensed display serif, black, centered. No other elements, no decoration, no imagery.
Result: Pass. The letterforms are clean, kerning is plausible, no dropped or duplicated glyphs. This is the easy case and the model handles it like it should.
T2. Multi-word headline with byline.
Editorial magazine cover layout, 8.5x11 portrait. Top strip: "THE QUARTERLY" in 14pt tracked caps. Large headline: "The End of Generalist Models" in 72pt serif, three lines, left aligned, hanging from the top of the main block. Byline below: "By Ada Chen · Issue 47 · Spring 2026" in 11pt italic. Thin rule above the byline. Large empty lower two-thirds for a future image.
Result: Pass. All four text blocks render correctly, including the em-dashes and pipe separators in the byline. Line-breaks in the headline fall where a human designer would put them.
T3. Small-point body copy inside a poster.
Conference poster, 11x17 portrait. Top: "AI SAFETY WORKSHOP 2026" in 44pt sans caps. Center: 6 paragraphs of 8-point justified body copy describing a workshop schedule across two columns. Paragraphs must be readable real English, not placeholder text: topics include "risk modeling", "interpretability", "evaluation methodology", "deployment safety". Bottom: four speaker names in 10pt caps.
Result: Partial. The top headline and bottom speakers are perfect. The body copy is readable at thumbnail scale but, on zoom, about one word in twelve drifts into pseudo-text. This remains the hardest thing for any image model to do, and GPT Image 2 has pushed the line further than competitors but has not crossed it.
T4. Multilingual signage.
Street-level photograph of a Tokyo izakaya storefront at dusk. Primary sign: Japanese text "居酒屋 うみねこ" in stylized brushstroke kanji and hiragana, warm red glow. Below in English: "Umineko · Craft Sake and Small Plates · Open 5pm–11pm" in 20pt serif. Side lantern with the single kanji "酒". Wet pavement, soft bokeh of pedestrians.
Result: Pass. Both scripts render cleanly. The kanji strokes are plausible (a native reader could tell this is AI-generated but the characters are the right characters), and the English is typographically correct.
T5. Typography on curved/3D surface.
Product shot of a matte black ceramic mug on a linen surface. The mug has a wraparound text printed in white: "Patience is bitter but its fruit is sweet." The text should follow the curvature of the mug, legible where it faces the camera, softening as it wraps around to the back. Shallow depth of field, window light from the left.
Result: Pass. Text wraps correctly with perspective, softens on the occluded side, and no letters are dropped or mirrored. This is the capability other models have been failing at for two years.
Test suite 2 — Layout reasoning (5 prompts)
Layout reasoning tests whether the model treats structural instructions as load-bearing or as flavor. Earlier models treated "6-panel comic" as a vibe: they often returned five panels, or seven, or a single confused collage. These five prompts check whether structure survives.
L1. Six-panel comic layout.
A 6-panel comic strip, 2 rows of 3, black-and-white inked style. Panel 1: a woman opens a laptop, morning light. Panel 2: close-up of her frowning at the screen. Panel 3: she picks up a phone. Panel 4: she walks to a window, phone to ear. Panel 5: wide shot of a city below, she is silhouetted. Panel 6: she smiles, hanging up. Consistent character design across all panels. No text, no speech bubbles.
Result: Pass. Six panels, correct order, character consistency holds across all six frames. The transitions read as a coherent narrative.
L2. Infographic with structural callouts.
An infographic, 1200x800, flat editorial style. Center subject: a cutaway diagram of a coffee espresso machine with four labeled components. Each label connects to the component via a thin arrow and a number (1–4). Labels read: "1. Boiler", "2. Group head", "3. Portafilter", "4. Steam wand." Sans-serif captions in dark gray. Legend at bottom. Off-white background, one-color accent (burnt orange).
Result: Pass. All four arrows connect to the correct components, numbers are sequential, labels render correctly. This is the kind of task that used to require a designer.
L3. UI mockup with specified placements.
A desktop web app UI mockup, 1440x900. Left sidebar with these menu items top to bottom: "Dashboard", "Projects", "Team", "Reports", "Settings." Top nav bar with a search field and a user avatar on the right. Main content area shows a grid of 6 project cards in 2 rows of 3. A single primary action button in the top-right of the main area reads "New Project" in white text on deep blue. Light theme, neutral grays.
Result: Pass. Sidebar labels correct and in order, six cards in a 2x3 grid, button in the specified location with the correct text. This is closer to a Figma export than a hallucinated render.
L4. 3x3 grid with distinct subjects.
A 3x3 photographic grid, 9 equal-size cells, thin white borders between cells. Each cell shows a different fruit photographed on a black marble surface, overhead angle, identical lighting. Cells in order (left-to-right, top-to-bottom): apple, pear, fig, pomegranate, persimmon, kiwi, plum, peach, guava. Shadow direction and color temperature must be consistent across all nine.
Result: Partial. The grid structure is correct and eight of the nine fruits are identifiable. The persimmon cell rendered what looks more like an orange. Lighting consistency holds across the grid — the primary reason we score this as partial rather than fail.
L5. Split-screen before/after.
A split-screen image, vertical divider down the center. Left side labeled "BEFORE" in 18pt caps top-left corner: a cluttered home office desk, papers strewn, cables visible, dim window light. Right side labeled "AFTER": the same desk, same angle, same window light, but organized — papers filed, cables routed, one small plant added. The desk geometry, wall color, and lighting must match exactly.
Result: Pass. Desk geometry and wall color match across the split. Labels render correctly. The "AFTER" side shows the small plant as specified. Lighting is consistent.
Test suite 3 — Photorealism (5 prompts)
Photorealism tests texture, lighting, and anatomy. These are traditionally where diffusion models shone and autoregressive models struggled. The question is whether GPT Image 2's new approach sacrifices the photographic baseline that Flux and Midjourney perfected.
P1. Elderly candid portrait.
A candid environmental portrait of an elderly fisherman sitting on a weathered wooden dock in coastal Portugal, late afternoon golden hour. Deep skin-texture detail: wrinkles around the eyes, sun-spotted forearms, stubble. He holds a mug of coffee, steam rising. 50mm lens, f/2.8, shallow depth of field. Natural color grading, no retouching feel.
Result: Pass. Skin texture is excellent, the hands holding the mug are anatomically correct, and the golden-hour lighting feels like real window light rather than LED grading. Competitive with Flux 2 Pro on raw photorealism.
P2. Wet cobblestone, cinematic lighting.
Wide shot of a wet cobblestone alley in Lyon at night after rain. Yellow-tinted street lamp at the far end, casting long reflections across the stones. A single figure in a dark coat walks away from the camera mid-frame. No text, no visible signage. 35mm lens, cinematic color grade, slight lens flare from the lamp.
Result: Pass. Reflections on wet stone are physically coherent; the lamp's flare sits on the right lens axis; stone texture holds at the edges of the frame.
P3. Glass bottle with refraction.
Product photography of a clear glass olive oil bottle, half full, on a white seamless backdrop. Soft key light from the upper left, fill from the right. The bottle should show accurate refraction — light passing through the oil, slight caustics on the backdrop, correct specular highlights on the glass edge. Sharp label facing camera reads "OLEA" in 24pt serif caps, with "Cold-Pressed · Liguria · 2026" in 8pt below.
Result: Pass. Refraction through the oil is directionally correct, caustics fall where they should, the label text renders cleanly. This is the kind of shot that previously required a three-hour studio session.
P4. Mixed fabric textures.
Overhead flat-lay of three folded fabrics arranged in a row on a bleached oak surface: a stack of raw linen (left), a folded pile of chunky wool knit (center), a bolt of deep burgundy velvet (right). Each fabric should show its correct texture — the slub and weave of linen, the chunky cable of wool, the light-absorbing pile of velvet. Natural window light from above.
Result: Pass. Each fabric reads as itself. The velvet's light absorption is rendered correctly (very hard — most models render velvet as flat cloth). Linen slub is visible on close inspection.
P5. Hands in a specific gesture.
Close-up photograph of two hands passing a ceramic coffee cup between them, one to the other, from the giver's right hand to the receiver's left hand. Both hands clearly visible, all ten fingers rendered correctly. Warm indoor light, wooden table surface. 50mm macro lens, shallow depth of field.
Result: Fail. The receiver's left hand shows six fingers, and the grip geometry on the cup is impossible — the thumbs cross through the ceramic. This is the historical hardest case for AI image models and GPT Image 2 has not solved it.
Hands remain the canary test. Every model since SD 1.5 has claimed to have fixed them; none actually has. GPT Image 2 is better on single hands in simple gestures, but fails on two hands interacting with an object — the moment there's occlusion plus anatomy plus a physical hand-off, the model's world-model breaks.
Test suite 4 — Multi-image coherence (5 prompts)
GPT Image 2 supports 8-image batched generation with shared context — the model is meant to keep the same character, product, style, outfit, or brand identity across all eight outputs. This suite tests that claim. Each prompt returns a batch of 8.
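For replay purposes, each coherence test is a single request that asks for eight outputs rather than eight separate generations. The payload sketch below mirrors the harness in the methodology section; as there, the field names are assumptions rather than a documented API, and c1_prompt stands in for the full C1 prompt printed below.

```python
# Payload sketch for one coherence test: a single request, eight outputs,
# one shared prompt. Field names ("model", "prompt", "n") are assumptions.
c1_prompt = "Generate 8 images of the same character ..."  # full C1 text, verbatim

batch_payload = {
    "model": "gpt-image-2",
    "prompt": c1_prompt,
    "n": 8,  # one batch of eight images generated with shared context
}
```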
C1. Same character, eight scenes.
Generate 8 images of the same character — a 30s Korean woman with chin-length black hair, round tortoiseshell glasses, and a single jade earring. Scenes: 1) reading on a subway. 2) cooking at a kitchen counter. 3) laughing at a cafe. 4) walking a dog in the rain. 5) at a whiteboard in a meeting. 6) asleep on a couch. 7) looking out a window at night. 8) holding a coffee on a balcony at sunrise. Same face, same hair, same glasses, same earring across all 8.
Result: Pass. Face, hair, glasses, and earring hold across all eight images. Subtle expression and styling variation is present but the character is unambiguously the same person.
C2. Same product, eight angles.
Generate 8 images of the same product from 8 different angles: a matte black titanium wristwatch with a rectangular face, white hour markers, single thin red second hand, and a charcoal leather strap. Angles: 1) front straight-on. 2) 45° left. 3) 90° left profile. 4) 135° back-left. 5) back straight-on. 6) 135° back-right. 7) 90° right profile. 8) 45° right. Consistent product details, identical lighting, same backdrop.
Result: Pass. The watch is recognizable as the same object across eight angles; the red second-hand position shifts correctly as we rotate around it; the leather strap's stitching pattern holds.
C3. Same style, eight subjects.
Generate 8 illustrations in a single consistent style — risograph-print aesthetic, two-color palette (cobalt blue and coral pink), rough half-tone texture, slight registration misalignment, thick 2pt outlines. Subjects, one per image: a teapot, a bicycle, a potted monstera, a typewriter, a paper airplane, a pair of sneakers, a stack of books, a cactus. Style identical across all 8.
Result: Pass. Halftone grain, registration offset, and palette hold across all eight outputs. A designer could ship the whole set as a single sticker pack without retouching.
C4. Same outfit, different actions.
Generate 8 images of the same person — white t-shirt, olive green utility pants, white sneakers, black leather watch, short curly brown hair, beard. Actions: 1) running. 2) seated at a laptop. 3) walking upstairs. 4) crouching to pet a dog. 5) reaching for a book on a high shelf. 6) mid-laugh at a table. 7) stretching after a run. 8) holding a baby. Outfit identical across all 8.
Result: Partial. Outfit holds across seven of eight images: the "holding a baby" frame changed the white t-shirt to a pale blue henley. Every other frame is consistent. The character's face, beard, hair, and watch all hold.
C5. Brand consistency across assets.
Generate 8 social-media assets for a fictional brand "Umber Coffee Co." Logo: the word "UMBER" in a heavy geometric sans caps, terracotta orange on cream. Palette: terracotta, cream, charcoal. Assets: 1) Instagram square hero shot of a cortado with the logo in the corner. 2) vertical story announcing a 20% off weekend. 3) horizontal banner with a roastery interior. 4) square quote card with a coffee proverb. 5) story with a barista portrait. 6) square product shot of a bag of beans. 7) vertical menu card listing 6 drinks with prices. 8) horizontal map pin asset. Logo identical, palette identical, type voice identical across all 8.
Result: Pass. Logo stays identical across eight assets, palette holds, type voice is consistent. The menu card (asset 7) renders six drink names with prices correctly, which is also a hit on test T3's body-copy capability.
Test suite 5 — Edit fidelity (5 prompts)
Editing is the capability that separates "image generator" from "creative tool." Each prompt starts with a base generation and applies one or more edits; the question is whether the edited output preserves what was supposed to be preserved.
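Each edit test is a two-step protocol: generate a base image, then send that image back with the edit instruction. The sketch below shows a plausible shape for the second step; the endpoint, filename, and field names are assumptions made for illustration, and the instruction text is the E1 edit quoted from the test below.

```python
# Second step of an edit test: the saved base render plus one edit
# instruction. Endpoint, filename, and field names are assumptions.
import base64

import requests

EDIT_URL = "https://example.com/api/edit"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"                   # placeholder auth scheme

with open("e1_base.png", "rb") as f:       # hypothetical local copy of the base render
    base_image_b64 = base64.b64encode(f.read()).decode("ascii")

edit_payload = {
    "model": "gpt-image-2",
    "image": base_image_b64,  # assumed field: the image being edited
    "prompt": (
        "Change the background to a city street at night, wet pavement, "
        "neon signs. Keep the woman, her sweater, her expression, her pose, "
        "and her skin tone identical."
    ),
}

resp = requests.post(
    EDIT_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=edit_payload,
    timeout=120,
)
resp.raise_for_status()
```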
E1. Change background, keep subject.
Base: a woman in a red knit sweater smiling in a park, autumn leaves behind her. Edit: change the background to a city street at night, wet pavement, neon signs. Keep the woman, her sweater, her expression, her pose, and her skin tone identical.
Result: Pass. Subject is preserved pixel-for-pixel in the face and upper-body region; background is replaced cleanly; lighting on the woman is subtly re-graded to match the new scene without changing her identity.
E2. Change outfit, keep face.
Base: a man in a navy suit standing against a white wall, business headshot style. Edit: change the navy suit to a cream linen shirt, sleeves rolled, no tie. Keep the face, hair, skin tone, pose, expression, and background identical.
Result: Pass. Face lock is excellent. Background wall is unchanged. The new shirt fits the shoulder line of the pose correctly.
E3. Add element without disturbing composition.
Base: a minimalist dining table with a single empty white plate, a linen napkin, and a drinking glass. Edit: add a small arrangement of three roses in a clear bud vase to the upper-right corner of the table. Do not move or alter any existing object.
Result: Pass. Plate, napkin, and glass are unchanged (pixel-level diff confirms it). The roses are added in the correct position with plausible shadowing that matches the scene's existing light.
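The pixel-level diff mentioned above is a check anyone replaying the suite can run themselves: compare the base and edited renders and confirm the area outside the edit stays flat. A minimal sketch with PIL and NumPy, assuming both renders share the same resolution and are saved locally; the filenames and the edit-region coordinates are stand-ins, not measured values.

```python
# Pixel-level diff sketch: confirm that everything outside the edit region
# is unchanged. Filenames and the masked region are illustrative stand-ins;
# both images are assumed to share the same resolution.
import numpy as np
from PIL import Image

base = np.asarray(Image.open("e3_base.png").convert("RGB"), dtype=np.int16)
edited = np.asarray(Image.open("e3_edited.png").convert("RGB"), dtype=np.int16)

# Per-pixel absolute difference, taking the max across RGB channels.
diff = np.abs(base - edited).max(axis=-1)

# Ignore the upper-right quadrant, where the roses were added.
h, w = diff.shape
diff[: h // 2, w // 2:] = 0

# Fraction of pixels outside the edit region that moved by more than a
# small tolerance (8 levels out of 255).
changed = (diff > 8).mean()
print(f"Changed pixels outside the edit region: {changed:.2%}")
```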
E4. Remove element cleanly.
Base: a street scene with two people walking past a storefront. Edit: remove the second person entirely, including their shadow and any reflection in the storefront window. Keep the first person, the storefront, and the street exactly as they were.
Result: Partial. The person is removed and their shadow is correctly erased. The storefront window reflection, however, still shows a faint ghost — the model remembered the shape but forgot the window. A viewer looking for the bug would find it.
E5. Three-round iterative edit.
Base: a living room with a brown leather sofa, a wooden coffee table, and a bookshelf. Round 1: change the sofa color to deep green. Round 2: replace the coffee table with a round marble one. Round 3: add a floor lamp to the left of the sofa. At round 3, the sofa must still be the round-1 green, and the marble table must still be the round-2 marble.
Result: Fail. Round 1 and round 2 hold. By round 3, the sofa has drifted to a slightly different green shade and the marble table's veining pattern changed. The model does not yet lock edits across multiple rounds in the way a professional retouching workflow would require.
Single-round edits are solved. Multi-round iterative edits, where each round must treat prior rounds as canonical, are not — and GPT Image 2 is the first model to even partially hold the line across two rounds. A usable professional workflow needs round 5 to still look like round 1 plus four additions. We're not there yet.
The score summary
Every test, every score, in one view.
| Test | Capability | Score |
|---|---|---|
| T1. Single large word | Text rendering | Pass |
| T2. Headline + byline | Text rendering | Pass |
| T3. Small-point body copy | Text rendering | Partial |
| T4. Multilingual signage | Text rendering | Pass |
| T5. Typography on 3D surface | Text rendering | Pass |
| L1. 6-panel comic | Layout reasoning | Pass |
| L2. Infographic callouts | Layout reasoning | Pass |
| L3. UI mockup | Layout reasoning | Pass |
| L4. 3x3 fruit grid | Layout reasoning | Partial |
| L5. Split-screen before/after | Layout reasoning | Pass |
| P1. Elderly portrait | Photorealism | Pass |
| P2. Wet cobblestone alley | Photorealism | Pass |
| P3. Glass bottle refraction | Photorealism | Pass |
| P4. Mixed fabrics | Photorealism | Pass |
| P5. Hands passing a cup | Photorealism | Fail |
| C1. Same character × 8 | Multi-image coherence | Pass |
| C2. Same product × 8 angles | Multi-image coherence | Pass |
| C3. Same style × 8 subjects | Multi-image coherence | Pass |
| C4. Same outfit × 8 actions | Multi-image coherence | Partial |
| C5. Brand consistency × 8 | Multi-image coherence | Pass |
| E1. Change background | Edit fidelity | Pass |
| E2. Change outfit | Edit fidelity | Pass |
| E3. Add element | Edit fidelity | Pass |
| E4. Remove element | Edit fidelity | Partial |
| E5. 3-round iterative edit | Edit fidelity | Fail |
Totals: 19 Pass (76%), 4 Partial (16%), 2 Fail (8%).
By suite: Text rendering 4 pass / 1 partial / 0 fail; Layout reasoning 4/1/0; Photorealism 4/0/1; Multi-image coherence 4/1/0; Edit fidelity 3/1/1.
Where GPT Image 2 is genuinely class-leading
Text rendering and layout reasoning are the two dimensions where GPT Image 2 lapped the field. Four out of five text tests passed outright, including two that no prior model has cleared without aggressive cherry-picking — multilingual signage and wraparound text on a curved ceramic surface. For design work that involves real words in real positions — posters, magazine covers, infographics, UI mockups — this is the first model that can deliver first-pass output a designer would actually ship.
Multi-image coherence is the second big win. An 8-image batch that holds a character, a product, a style, an outfit, or a brand identity is a capability that previously required LoRA training, elaborate reference conditioning, or manual retouching across frames. GPT Image 2 does it from a single prompt. For creators building brand asset sets or character-driven content, this collapses a week of work into a single generation.
The photorealism baseline is also competitive. Four out of five photorealism tests passed at a level that matches Flux 2 Pro and Midjourney v7 — meaning GPT Image 2 is not making the old tradeoff ("you can have text OR photorealism, not both"). You get both.
Where GPT Image 2 still falls short
Hands with complex interactions remain broken. Test P5 is the clearest failure in the suite: two hands, an object, and a hand-off between them. The model produces six-fingered hands and geometrically impossible grips. If your use case involves people handling objects — product photography with a human touch, cooking shots, crafts, surgery, musical instruments — you will need to budget for generation retries and manual fixes.
Multi-round iterative editing is the other frontier. GPT Image 2 is the first model to hold a single edit cleanly, but across three rounds, details drift. For final production work, edits need to be locked — an earlier round's sofa color must still be that exact green three edits later. The model is close, but not there. Workflows that rely on iterative refinement should treat each round as a fresh generation and re-prompt with the cumulative state rather than chaining edits.
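One way to follow that advice is to rebuild the instruction set each round from the original base description plus every edit issued so far, then run it as a single generation (or a single edit against the original base) rather than chaining edits on the previous output. A sketch of the bookkeeping, using the E5 instructions as the example:

```python
# Cumulative re-prompting sketch: round N's prompt is the base description
# plus all N instructions so far, sent as one fresh request rather than as
# an edit of round N-1's output.
BASE = (
    "A living room with a brown leather sofa, a wooden coffee table, "
    "and a bookshelf."
)
EDITS = [
    "Change the sofa color to deep green.",
    "Replace the coffee table with a round marble one.",
    "Add a floor lamp to the left of the sofa.",
]

for round_number in range(1, len(EDITS) + 1):
    cumulative_prompt = BASE + " " + " ".join(EDITS[:round_number])
    print(f"Round {round_number}: {cumulative_prompt}")
    # send cumulative_prompt as a single generation (or edit of the base) here
```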
Small-point body copy also remains imperfect. The model can render two-column magazine-style body text that reads plausibly at thumbnail scale, but zoomed-in viewers will spot drift into pseudo-English on about one word in twelve. This is a genuine improvement over predecessors but not yet good enough for a print-ready proof.
For a broader view of where the model sits in the landscape, our GPT Image 2 launch coverage walks through the architecture, and the effective-use guide covers prompt patterns that play to the model's strengths and around its weaknesses.
FAQ
How was this test run? Every prompt was sent to GPT Image 2 via Oakgen's model page once — no retries, no best-of-N, no iteration. Each generation consumed 26 credits and returned in about three seconds. The prompts were written in advance of testing; scoring was done against a pre-declared rubric.
Did Oakgen use the official GPT Image 2? Yes. Oakgen routes to OpenAI's production GPT Image 2 endpoint through our standard provider layer, with no proprietary post-processing that would alter the model's outputs.
Why only 25 prompts? Small enough to run in a single session and print in full. Large enough to cover five distinct capability dimensions with meaningful per-dimension sample size. A 25-prompt suite is also reproducible — you can run the entire thing yourself in ten minutes, which is the point.
Can I reproduce these tests? Yes. Every prompt is printed verbatim, in full, in copy-paste-ready form. Run them at /models/gpt-image-2 with a fresh context window for each prompt. Your outputs will not be pixel-identical (image generation is stochastic), but your score distribution should fall within one or two points of ours on a suite of this size.
What does the full test cost? 25 prompts × 26 credits = 650 credits, plus 52 credits for the two additional rounds of the multi-round edit test (26 credits each). At Oakgen's pricing, a single run fits inside the smallest paid plan, and the 30-day free window on annual Ultimate and Creator plans covers several full runs. You can also earn credits through our referral program.
Will the scores change as the model gets updated? Yes. Image models receive weight updates, safety updates, and occasional capability updates on a monthly cadence. We'll re-run this suite at each update and publish deltas. The prompts themselves are frozen, which is the point — if T1 passes today and fails in six months, that's a regression worth documenting.
Which test should I care about most? Whichever one matches your workflow. A brand designer should care most about the text-rendering and layout-reasoning suites. A product photographer should care most about photorealism. A character illustrator should care most about multi-image coherence. A retoucher should care most about edit fidelity. The overall 76% pass rate is the headline; the per-suite numbers are where your decision actually lives.