At Google's major presentation last week, CEO Sundar Pichai brought up an unexpected milestone:
It's perhaps the most impressive, at least in some circles: a few weeks ago, it cleared "Pokémon Blue", he said from the stage in Mountain View, California.
A private individual has let Google's language model Gemini and competitor Anthropic's model Claude play "Pokémon Blue" and "Pokémon Red" respectively, both from 1996. The adventures have been broadcast in real time on the video platform Twitch.
Julian Togelius, a researcher at New York University and a visiting professor at the University of Skövde, says getting the language models to play the games is an interesting challenge.
Interpreting images
The AI plays through an emulator. It takes screenshots, which are sent to the language model for analysis along with some information from the game's RAM. The model then makes a decision, which is translated into a button press, and the process starts all over again.
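In outline, such a loop could look something like the sketch below. It is only an illustration of the setup described above, not the code actually used on the streams; the Emulator class and the query_model function are hypothetical placeholders for a real Game Boy emulator and an LLM API call.

```python
# Illustrative sketch of the screenshot -> language model -> button-press loop.
# Emulator and query_model are hypothetical stand-ins, not real libraries.

import base64
import time

BUTTONS = {"A", "B", "UP", "DOWN", "LEFT", "RIGHT", "START", "SELECT"}


class Emulator:
    """Hypothetical wrapper around a Game Boy emulator."""

    def screenshot_png(self) -> bytes:
        raise NotImplementedError  # return the current frame as PNG bytes

    def read_ram(self) -> dict:
        raise NotImplementedError  # e.g. {"map_id": 3, "player_x": 12, ...}

    def press(self, button: str) -> None:
        raise NotImplementedError  # send a single button press to the game


def query_model(image_b64: str, ram_info: dict) -> str:
    """Hypothetical LLM call: returns the name of one button to press."""
    raise NotImplementedError


def play(emulator: Emulator, steps: int = 1000) -> None:
    for _ in range(steps):
        # 1. Capture what the game currently shows.
        frame = base64.b64encode(emulator.screenshot_png()).decode()

        # 2. Add some structured state read from the game's RAM.
        ram_info = emulator.read_ram()

        # 3. Ask the language model what to do next.
        decision = query_model(frame, ram_info).strip().upper()

        # 4. Translate the decision into a button press, then repeat.
        if decision in BUTTONS:
            emulator.press(decision)
        time.sleep(0.1)  # pace the loop
```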
The challenges, when you're a large language model playing games, are first about understanding what you're seeing in front of you. What do these pixels mean? says Togelius.
The models, trained on what's available on the internet, have probably benefited greatly from the fact that a lot has been written about the Pokémon games.
That's why they're good in this case. If you did the exact same thing with a new game that no one had seen before and just asked the language models to play it, they probably wouldn't get very far.
800 hours
It took Gemini 800 hours to clear "Pokémon Blue".
Very often it seems to have lost the thread and gone off to do something weird instead. But that's probably what happens to many of us when we play games like this, too.
Gemini has had the help of a so-called AI agent, which observes the game, regularly evaluates how it's going and gives recommendations on what should happen next.
It's like having multiple threads in your head at the same time, but now and then looking up and wondering what you're actually supposed to do. What did I do last, and what should I do next?
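A rough sketch of that kind of periodic re-evaluation, purely as an illustration of the pattern Togelius describes and not the actual setup behind the streams, might look like this; evaluate_progress stands in for a hypothetical second model call that reviews recent actions and updates the current goal.

```python
# Illustrative sketch of a periodic "look up and re-evaluate" agent.
# evaluate_progress is a hypothetical stand-in for a second LLM call.

from collections import deque


def evaluate_progress(recent_actions: list[str], current_goal: str) -> str:
    """Hypothetical LLM call: given recent actions and the current goal,
    return an updated recommendation, e.g. 'Head north to the next town'."""
    raise NotImplementedError


class GoalKeeper:
    """Every `interval` steps, pause and ask: what did I just do,
    and what should I do next?"""

    def __init__(self, interval: int = 50, history: int = 200):
        self.interval = interval
        self.actions: deque[str] = deque(maxlen=history)
        self.goal = "Leave the starting town and reach the first gym"  # example
        self.step = 0

    def record(self, action: str) -> str:
        """Log one action; periodically refresh the goal. Returns the goal
        that should be included in the next prompt to the playing model."""
        self.actions.append(action)
        self.step += 1
        if self.step % self.interval == 0:
            self.goal = evaluate_progress(list(self.actions), self.goal)
        return self.goal
```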
Those who have been beaten by a chess robot or a computer-controlled bot in a game might wonder what's so special about a language model clearing a nearly 30-year-old video game. But those are completely different things, Togelius emphasizes.
Games are incredibly interesting to test AI with, because they reflect so much of human thinking, he says.
Gustav Sjöholm/TT
Facts: AI plays "Pokémon"
A person who describes themselves as a 30-year-old software developer has let the startup Anthropic's language model Claude 4 and Google's Gemini 2.5 Pro Experimental play "Pokémon Red" and "Pokémon Blue" respectively. The games were released for the Game Boy in Japan in 1996 and came to Europe in 1999. The AI models can "see what's happening, understand the game's state and make decisions, similar to how a human player would", and also do things like naming their Pokémon. Gemini cleared the game in over 800 hours, and the experiment has since started over. Claude appears to have cleared Lt. Surge's Gym, around the middle of the game. Source: Gemini_Plays_Pokemon and ClaudePlaysPokemon on Twitch.