Cursor
4.64 / 5
Fastest delivery, best architecture, and the smoothest iteration workflow.
Claude Code
4.41 / 5
Best playable result and strongest visual polish (4.75 on core quality alone), but a major time penalty.
OpenAI Codex
4.55 / 5
Solid mechanics, quick delivery, and a reliable result with less workflow polish.
TL;DR
All three platforms produced a working browser game from a single prompt. Cursor won the full benchmark once time and iteration workflow were included. Claude Code produced the best-looking and most complete game, but took 22 minutes 20 seconds. Codex delivered the most balanced result: playable, fast, mechanically strong, but less visually polished and less seamless in file handling.
The Prompt
The test used a single controlled prompt asking each platform to generate a complete browser-based arcade game called Jumping Jim. The requirements were intentionally practical: HTML, CSS, and JavaScript only; one runnable HTML file; movement, jumping, hazards, collectibles, scoring, lives, collision detection, and requestAnimationFrame-based animation.
The prompt also offered extra credit for browser audio, increasing difficulty, multiple levels or procedural generation, and mobile-friendly controls. That made it a useful test of both instruction following and creative initiative.
Single prompt excerpt
Create a complete browser based arcade game called Jumping Jim inspired by classic platformers similar in style to Pitfall.
Use HTML, CSS, and JavaScript only. The game must run by opening a single HTML file. Include player movement, platforms, hazards, collectibles, score, lives, gravity, collision detection, and smooth animation.
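For readers who want the mechanical core those requirements imply, here is a minimal sketch of a gravity-plus-collision loop driven by requestAnimationFrame. It is illustrative only: the names (GRAVITY, player, platforms) are hypothetical and not taken from any of the three generated games.

```js
// Minimal sketch of the loop the prompt requires: gravity, AABB collision,
// and requestAnimationFrame. All names are illustrative.
const GRAVITY = 0.5;
const player = { x: 50, y: 0, w: 24, h: 32, vx: 0, vy: 0, onGround: false };
const platforms = [{ x: 0, y: 300, w: 400, h: 20 }];

function overlaps(a, b) {            // axis-aligned bounding-box test
  return a.x < b.x + b.w && a.x + a.w > b.x &&
         a.y < b.y + b.h && a.y + a.h > b.y;
}

function update() {
  player.vy += GRAVITY;              // gravity accelerates the fall
  player.x += player.vx;
  player.y += player.vy;
  player.onGround = false;
  for (const p of platforms) {
    if (overlaps(player, p) && player.vy >= 0) {
      player.y = p.y - player.h;     // snap to the platform surface
      player.vy = 0;
      player.onGround = true;
    }
  }
}

function frame() {
  update();
  // rendering would happen here (canvas or DOM)
  requestAnimationFrame(frame);      // schedule the next frame
}
requestAnimationFrame(frame);
```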
Methodology
Each platform was run in a fresh session using the same prompt, with no clarification or prompt refinement. Outputs were tested locally in a desktop browser and scored across ten categories from 0 to 5. Time to completion was tracked separately as an additional 0 to 5 workflow metric. To keep the published comparison easy to read, all headline totals below are normalised to a 5-point scale.
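As a worked example, assuming the headline figures are the normalised sums of the eleven rows in the breakdown table below: Cursor's 51 of 55 available points gives 51 / 55 × 5 ≈ 4.64 overall, and dropping the time row gives its core-quality figure of 46 / 50 × 5 = 4.60.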
| Test rule | How it was applied | Why it matters |
|---|---|---|
| Identical prompt | Same game brief pasted into all three platforms | Lets the output reflect platform behaviour rather than prompt tuning |
| Single-file constraint | All code had to ship in one HTML file | Tests code organisation under realistic constraints |
| Browser test | Each result was run locally without major fixes | Separates runnable software from impressive-looking code blocks |
| Time recorded | Wall-clock time from prompt submission to complete output | Captures real workflow cost, not just final quality |
Final Rankings
| Rank | Platform | Core quality | Overall incl. time | Time taken |
|---|---|---|---|---|
| 1 | Cursor | 4.60 / 5 | 4.64 / 5 | 1 min 2 sec |
| 2 | OpenAI Codex (GPT-5.4) | 4.50 / 5 | 4.55 / 5 | 1 min 15 sec |
| 3* | Claude Code (Sonnet 4.6) | 4.75 / 5 | 4.41 / 5 | 22 min 20 sec |
*Claude Code ranked third when time was included, but first by core output quality alone.
Score Breakdown
| Category | Claude Code | Cursor | Codex GPT-5.4 |
|---|---|---|---|
| Time score | 1 | 5 | 5 |
| Functional completion | 5 | 5 | 5 |
| Code quality | 4.5 | 5 | 4 |
| Game mechanics | 5 | 4 | 5 |
| Creativity and design | 5 | 3 | 4 |
| Performance | 5 | 4 | 5 |
| Instruction following | 5 | 5 | 5 |
| Error rate | 5 | 5 | 5 |
| Completeness | 5 | 5 | 5 |
| Extra features | 5 | 5 | 4 |
| Iteration capability | 3 | 5 | 3 |
The Games
The best playable artifact
Claude Code produced the strongest visual output: layered trees, grass terrain, floating platforms, collectibles, hearts for lives, a timer, level display, audio, and mobile controls. It looked and felt like the most complete game.
Best for polished demos. Weakness: a generation time of 22 minutes 20 seconds is a serious workflow cost.
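For context on the browser-audio extra credit: a single-file game needs no sound assets, because the standard Web Audio API can synthesise effects inline. A minimal sketch of that approach (the beep helper is hypothetical, not Claude Code's actual sound code):

```js
// Sketch of asset-free browser audio via the standard Web Audio API,
// the sort of approach a single-file game would use for pickup sounds.
// Note: browsers only allow the context to start after a user gesture.
const ctx = new AudioContext();

function beep(freq = 880, duration = 0.1) {
  const osc = ctx.createOscillator();
  const gain = ctx.createGain();
  osc.frequency.value = freq;        // pitch in Hz
  gain.gain.value = 0.2;             // keep the volume polite
  osc.connect(gain).connect(ctx.destination);
  osc.start();
  osc.stop(ctx.currentTime + duration);
}
```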
The best engineering workflow
Cursor produced the best code architecture and completed in just 1 minute 2 seconds. Its output included named constants, encapsulated systems, declarative level data, coyote time, jump buffering, seeded procedural generation, and accessibility attributes.
Best for rapid prototyping and maintainability. Weakness: the visuals were basic and movement speed was hard to control.
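Two of those mechanics are worth unpacking. Coyote time gives the player a few frames of grace to jump after walking off a ledge, and jump buffering remembers a jump pressed just before landing. A minimal sketch of both, with hypothetical constants rather than Cursor's actual values:

```js
// Sketch of coyote time and jump buffering; the constants and names
// are hypothetical, not lifted from Cursor's generated file.
const COYOTE_FRAMES = 6;   // grace period after leaving a ledge
const BUFFER_FRAMES = 6;   // how early a jump press is remembered
let coyote = 0, buffer = 0;

function updateJump(player, jumpPressed) {
  coyote = player.onGround ? COYOTE_FRAMES : Math.max(0, coyote - 1);
  buffer = jumpPressed ? BUFFER_FRAMES : Math.max(0, buffer - 1);
  if (coyote > 0 && buffer > 0) {    // jump fires if both windows overlap
    player.vy = -10;                 // illustrative jump impulse
    coyote = 0;
    buffer = 0;
  }
}
```

Both techniques make controls feel forgiving without changing the level design, which is why they read as a maturity signal in generated platformer code.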
The balanced middle option
Codex delivered a playable and responsive game in 1 minute 15 seconds. The output had strong mechanics, clean collision behaviour, pine silhouettes, gems, spikes, pits, score, lives, level text, and difficulty scaling. It did not look as polished as Claude Code or feel as architecturally refined as Cursor, but it landed a strong middle position.
Best for quick, well-rounded output. Weakness: scattered magic numbers, a heavier update function, and manual copy/paste friction reduced the iteration score.
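The magic-number critique is easy to picture. A hypothetical before/after, not taken from Codex's output, shows why scattered literals hurt iteration:

```js
// Hypothetical before/after illustrating the magic-number critique.
// Before: the meaning of 0.5 and -12 lives only in the author's head.
player.vy += 0.5;
if (jump) player.vy = -12;

// After: tuning values are named, grouped, and adjustable in one place.
const PHYSICS = { GRAVITY: 0.5, JUMP_IMPULSE: -12 };
player.vy += PHYSICS.GRAVITY;
if (jump) player.vy = PHYSICS.JUMP_IMPULSE;
```

In a platformer these tuning values change constantly during playtesting, which is why the named-constant style weighs on the code quality and iteration scores.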
What The Test Revealed
01
Runnable output is now table stakes
All three platforms produced working browser games with no major manual fixes. That is a meaningful maturity signal for constrained single-file software prompts.
02
The best output and fastest workflow were different tools
Claude Code created the strongest artifact. Cursor created the strongest workflow. That distinction matters when choosing a coding assistant for real projects.
03
Architecture and aesthetics still trade off
Cursor prioritised maintainable structure over visual presentation. Claude Code pushed presentation harder. Codex stayed between the two.
04
Iteration workflow is part of quality
For vibe coding, output quality is not just the final code. File handling, update speed, and whether the assistant reduces friction all change the practical score.
Practical Recommendation
Choose Cursor when
Speed and maintainability matter
It is the best fit for rapid prototyping, iteration-heavy workflows, and code that may be extended beyond the first generated artifact.
Choose Claude Code when
Presentation matters most
It produced the most impressive playable game, making it a strong choice for demos, visual prototypes, and user-facing proof-of-concepts.
Choose Codex when
You want balance
It is a reliable middle option: fast, playable, mechanically solid, and good enough visually, though less seamless in this particular workflow.
Do not ignore
Time as a quality signal
A brilliant output that arrives too slowly can lose to a slightly less polished result that supports fast iteration and keeps momentum high.
Bottom Line
This benchmark shows that the current generation of AI coding tools can produce complete, playable browser games from a single natural-language prompt. The shortcomings were not in basic runnable code; they were in workflow velocity, polish, tuning, and maintainability.
If the question is "which tool made the best game?", the answer is Claude Code. If the question is "which tool performed best as a practical coding workflow?", the answer is Cursor. If the question is "which tool delivered the most balanced result quickly?", Codex makes a strong case.