Minecraft study exposes a problem with AI agents

News
Tuesday, 28 April 2026 at 18:00
Researchers from the University of California, Los Angeles, Amazon, and others published a study in late April 2026 showing that modern AI systems systematically fall short when applying newly discovered knowledge.
Their Minecraft-based benchmark reveals that even the most advanced models complete, on average, only 26 percent of tasks successfully.
The implication is directly relevant: AI systems that appear to reason well still struggle to turn insight into execution. That gap hits sectors like industry, education, and policymaking, where the translation from insight to action is critical.

What exactly does this study examine?

The core question in this study is simple: can AI discover something on its own—and then apply it? The researchers call this the “discovery-to-application loop.” It’s the process where:
  • a system identifies a knowledge gap
  • runs experiments
  • records insights
  • and applies those insights in a working solution
This process underpins human innovation, from science to engineering. According to the study, this coherence is largely still missing in AI.
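The loop described above can be sketched as a toy program. This is a minimal illustration of the four steps, not the study's actual code; the agent, environment, and method names are all hypothetical:

```python
# Toy sketch of the "discovery-to-application loop": identify a gap,
# experiment, record the insight, then apply it. All names are hypothetical
# illustrations, not the paper's implementation.

class ToyAgent:
    """A stand-in agent that discovers which switch powers a lamp."""

    def identify_gap(self, knowledge):
        # Step 1: spot a knowledge gap -- which switch is still untested?
        for switch in ("A", "B", "C"):
            if switch not in knowledge:
                return switch
        return None

    def run_experiment(self, environment, switch):
        # Step 2: run an experiment -- flip the switch, observe the lamp.
        return environment[switch]

    def apply(self, environment, knowledge):
        # Step 4: apply the recorded insights in a working solution.
        working = [s for s, lit in knowledge.items() if lit]
        return working[0] if working else None


def discovery_to_application(agent, environment, max_rounds=10):
    knowledge = {}
    for _ in range(max_rounds):
        gap = agent.identify_gap(knowledge)
        if gap is None:
            break
        knowledge[gap] = agent.run_experiment(environment, gap)  # step 3: record
    return agent.apply(environment, knowledge)


# Environment: only switch "B" actually lights the lamp.
print(discovery_to_application(ToyAgent(), {"A": False, "B": True, "C": False}))
```

The point of the sketch is the coherence the study says AI lacks: each step only works if the previous one fed it the right information.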

Why use Minecraft as a testbed?

The researchers use Minecraft because it offers a controlled yet complex sandbox. In the game, agents can build electrical circuits with so-called “redstone” mechanics.
This environment offers three advantages:
  • Controllable complexity: tasks get harder by tweaking parameters
  • Realistic causality: systems respond according to fixed rules
  • No prior knowledge possible: agents must truly discover how it works
Example task: switch on 64 lamps simultaneously within a tight space. That requires understanding signal loss, timing, and circuit design.
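The signal-loss constraint can be made concrete with a back-of-the-envelope calculation. In standard Minecraft behaviour, a full-strength redstone signal fades after 15 blocks of dust and must be refreshed with a repeater; the sketch below applies that rule (the function is an illustration, not something from the paper):

```python
# Redstone signal decays: a full-strength signal (15) travels at most
# 15 blocks of redstone dust before dying out, so longer wires need
# repeaters. The 15-block limit is standard Minecraft behaviour; this
# helper is a hypothetical sketch, not code from the study.

def repeaters_needed(wire_length_blocks: int) -> int:
    """Minimum repeaters to carry a signal across a straight dust run."""
    if wire_length_blocks <= 0:
        return 0
    return (wire_length_blocks - 1) // 15

print(repeaters_needed(15))  # a 15-block run just fits without a repeater
print(repeaters_needed(64))  # a 64-block run needs several refreshes
```

An agent that never discovered the 15-block limit would wire the lamps directly and watch the far ones stay dark, which is exactly the kind of hidden rule the benchmark forces models to uncover.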

How do today’s AI models perform?

The researchers test, among others:
  • GPT-5.2
  • Gemini-3-Pro
  • Claude-Opus-4.5
The result is strikingly consistent: all models hover around a 26 percent success rate.
In other words, three-quarters of tasks fail—even with access to tools, memory, and multiple attempts.

Where does it break down?

The study identifies four fundamental capabilities where AI falls short:

1. Spotting knowledge gaps

AI often doesn’t know what it doesn’t know. When researchers provide hints, performance doubles.

2. Experimentation

A dedicated “scientist agent” that runs systematic experiments boosts success further to around 64 percent.

3. Structuring knowledge

How knowledge is stored proves crucial. Structured formats work far better than free-form summaries.
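The contrast can be illustrated with a toy example. The record fields below are hypothetical; the article does not specify the study's actual schema:

```python
# Toy contrast between free-form and structured knowledge storage.
# The Insight fields are hypothetical illustrations, not the study's schema.

from dataclasses import dataclass

# Free-form: easy to write, but hard for an agent to query reliably later.
free_form = "I think redstone signals stop working after about 15 blocks or so."

@dataclass(frozen=True)
class Insight:
    subject: str       # what the insight is about
    attribute: str     # which property was measured
    value: int         # the observed value
    confidence: float  # how sure the agent is

# Structured: the same fact, but directly usable in a later build step.
structured = Insight(subject="redstone_dust", attribute="max_signal_range",
                     value=15, confidence=0.9)

print(structured.value)  # a downstream planner can read the limit directly
```

The free-form note forces a later step to re-parse vague language, while the structured record hands the limit over as a number the agent can compute with.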

4. Applying knowledge

The biggest bottleneck is still execution: actually building a working system.
The researchers conclude that this final step is especially underdeveloped.

Key takeaway: AI asks the wrong questions

A notable finding is a shift in the failure mode of top models. The hardest part isn’t solving problems—it’s formulating the right questions.
That echoes a classic insight from Albert Einstein: defining the problem correctly is often more important than solving it.
For AI, this means systems still don’t sufficiently understand where to look for new knowledge.

Industry and automation

Companies using AI for design or engineering will hit limits. AI can propose ideas but falters on complex implementation.

Education and research

AI as a “research partner” remains only moderately reliable. Students and researchers must scrutinize generated solutions.

Labor market

Human experts stay essential—especially in roles where insight meets application, such as engineering and data science.

Policy

For policymakers, this means expectations around autonomous AI need recalibrating. Fully automating complex decision-making isn’t realistic yet.

Why this research matters

This study reframes the AI capabilities debate. Until now, the focus has been on language understanding and reasoning. This research shows that:
  • reasoning ≠ application
  • knowledge ≠ understanding
  • output ≠ functionality
It underscores that real intelligence isn’t just about giving answers—it’s about building things that work.

Conclusion: AI still misses a critical link

The conclusion is clear: current AI systems lack a fundamental bridge between thinking and doing.
While models are improving at reasoning and experimentation, practical application lags behind. For now, AI is a tool—not an autonomous problem-solver.
For the Netherlands, that means using AI strategically: supportive where it helps, but always with humans in charge for complex tasks.