Chart Of The Week: How AI Is Learning To Stay On The Job

Yesterday, we talked about how Claude is starting to behave less like software and more like a coworker.

People aren’t just prompting Claude Code and waiting for an answer anymore. They’re leaving it running and coming back to significant progress. It doesn’t need to be constantly monitored. If something breaks, Claude Code fixes itself and keeps going.

That experience feels new.

And it turns out that researchers have been tracking exactly how new it is.


THIS CURVE CHANGES EVERYTHING

This week’s chart comes from METR, a research group that measures how long different AI models can reliably work on real software engineering tasks without human intervention.

To be clear, these aren’t abstract benchmark scores. They’re real software engineering tasks, measured by how long each one takes a skilled human:

Image: metr.org


This chart shows each model’s time horizon: the length of task, in human time, that the model can complete with about a 50% success rate.

In plain English, it shows how long you can reasonably expect an AI system to keep working on a problem before it gets lost, stuck or needs help.
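To make the metric concrete, here is a minimal sketch of how a 50% time horizon could be estimated. The data is entirely made up for illustration, and the fitting approach (a logistic curve in log task length, fit by brute-force grid search) is an assumption of mine, not METR's actual methodology:

```python
import math

# Hypothetical (task_length_minutes, succeeded) records for one model.
# Task length here means how long the task takes a skilled human.
results = [
    (1, True), (2, True), (4, True), (8, True),
    (15, True), (15, False), (30, True), (30, False),
    (60, False), (60, True), (120, False), (240, False),
]

def time_horizon_50(results):
    """Estimate the task length at which success probability crosses 50%,
    by fitting p(success) = sigmoid(a - b * log(length)) via grid search."""
    xs = [math.log(t) for t, _ in results]
    ys = [1.0 if ok else 0.0 for _, ok in results]

    def nll(a, b):
        # Negative log-likelihood of the logistic model on the records.
        total = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a - b * x)))
            p = min(max(p, 1e-9), 1.0 - 1e-9)
            total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
        return total

    best = None
    for a10 in range(-50, 51):          # a in [-5.0, 5.0]
        for b10 in range(1, 51):        # b in (0.0, 5.0]
            a, b = a10 / 10, b10 / 10
            score = nll(a, b)
            if best is None or score < best[0]:
                best = (score, a, b)
    _, a, b = best
    # p = 0.5 exactly when a - b*x = 0, i.e. log(length) = a / b.
    return math.exp(a / b)

horizon = time_horizon_50(results)
print(f"Estimated 50% time horizon: {horizon:.0f} minutes")
```

With this toy data the crossover lands somewhere in the tens of minutes; the point is just that the metric is a single number summarizing where success and failure balance out.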

As you can see, for years that number barely moved.

GPT-2 and GPT-3 could handle tasks lasting seconds, while GPT-3.5 and GPT-4 pushed into minutes. That was useful — and often impressive — but it still meant babysitting every step.

Over the past year, the curve started bending sharply upward. That’s because models released in 2024 and 2025 don’t just answer questions.

They persist.

Claude Opus 4.5 is now measured in hours, and OpenAI’s latest coding-focused models aren’t far behind.

Here’s how I explained this evolution to my team.

In 2023, the question was: Can my AI write a Bob Dylan inspired song?

In 2024, the bar moved higher: Can my AI outthink my lawyer on a narrow problem?

By 2026, the question has changed again: Can my AI work on a complex task all afternoon and coordinate with other agents while it does?

This difference between minutes and hours of persistence is about to change how people relate to AI. So far, we’ve had to babysit it. But now that AI can persist much longer, we can start supervising it instead.

Once that happens, usage will go from a few times a day to all day. And one assistant will become multiple agents running in parallel.

This is what I mean when I say that AI will soon start acting like coworkers you can delegate work to.

It means people will move from being individual contributors to managers of intelligent systems.

And as execution keeps getting cheaper, it means human oversight and judgement will become even more valuable.


HERE’S MY TAKE

People using tools like Claude Code have been genuinely surprised by how different the experience feels.

That change comes from the way the capabilities we talked about yesterday are finally stacking on top of each other. It started with broad knowledge and stronger reasoning. Now we’ve added iteration: the ability to test, notice what broke, revise, and keep working without someone standing over its shoulder.

That’s what today’s chart is measuring.

This chart also explains why memory is suddenly such a big deal.

You don’t need to run these systems locally. But you do need enough memory and context for multiple agents to coordinate, hand work off and stay aligned over time.

Which raises a new question.

What happens when persistent AI systems are connected to the real world and allowed to run while we’re not watching?

We’re starting to get a glimpse of that too.

And tomorrow, I’ll show you where it leads.

