Commissioned, Curated and Published by Russ. Researched and written with AI.
What’s New
| Date | Change |
|---|---|
| 2026-03-06 | Initial publication |
You’ve Already Crossed One Threshold
If you use Claude or ChatGPT regularly, and you’ve also tried Claude Code or Codex, you’ve probably noticed they feel different. Not just better – categorically different. Something changed, and it’s hard to articulate exactly what.
Here’s the frame: you’ve crossed a threshold. You’ve moved from one tier of AI to another. And most engineers who’ve made that jump haven’t fully registered what changed or why it matters.
That gap in understanding is going to matter more as the next threshold arrives. There are three distinct tiers of AI capability deployment – not quantitatively different, but qualitatively. The jump from Tier 1 to Tier 2 isn’t “the model got smarter.” It’s a categorical shift in what the system is. The jump from Tier 2 to Tier 3 is another one.
Let’s map it out.
Tier 1: What LLMs Actually Are
You know this, so I’ll be brief.
A language model – Claude.ai, ChatGPT, Gemini – is a stateless text transformation machine. You put text in. It produces text out. Then it forgets the exchange ever happened (unless you scroll back up and shove it back in as context).
What it cannot do: take any action. It cannot read a file unless you paste it. It cannot run code unless you paste the output back. It cannot send an email, modify a database, call an API, or touch the world in any way whatsoever. It is, in the most literal sense, read-only.
The conversation is the product. You take the output – the paragraph, the code snippet, the explanation – and you go do something with it. You are the executor. The model is the advisor.
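That division of labour is visible in code. A minimal sketch of the Tier 1 pattern, with `fake_model` standing in for a real LLM API call: the model is a pure function of its input, so any "memory" is just a transcript the caller maintains and resends every turn.

```python
def fake_model(messages):
    # Stand-in for a real LLM API call. It has no state of its own;
    # everything it "knows" arrives in `messages` on every single call.
    return f"reply to: {messages[-1]['content']} (context: {len(messages)} messages)"

# The caller, not the model, is the memory.
history = []

def chat(user_text):
    history.append({"role": "user", "content": user_text})
    reply = fake_model(history)  # the full transcript is resent every turn
    history.append({"role": "assistant", "content": reply})
    return reply

chat("What is a monad?")
chat("Explain that more simply.")  # only works because turn 1 was resent
```

Scroll-back-and-shove-it-in-as-context, as a data structure.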
This is enormously useful. It’s also been transformative. But it’s one specific thing, and that thing has limits baked into its architecture.
The moment you understand what those limits are, you understand why what came next was such a big deal.
The Inflection: When Models Got Tools
The key insight, which arrived quietly and then everywhere at once, was this: what if instead of producing text and stopping, the model could take an action based on that text, observe the result, and continue?
Tool use. Function calling. Computer use. The framing varies, but the idea is the same: give the model access to the outside world, and let it act.
The moment a language model got access to bash – actual bash, the shell, on a real filesystem – it stopped being a chatbot. This sounds dramatic but it’s accurate. The class of things it can do expanded from “none” to “essentially anything you can script.” File I/O, network requests, running processes, reading test output, modifying configuration files. All of it.
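"Access to bash" is often just a thin wrapper the model can invoke as a tool. A sketch of that wrapper (the function name and return shape are illustrative, not any vendor's actual API):

```python
import subprocess

def bash_tool(command, timeout=30):
    """Run a shell command and return what the model needs to observe:
    stdout, stderr, and the exit code. This one wrapper unlocks file I/O,
    network requests, running tests -- anything you can script."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return {
        "stdout": result.stdout,
        "stderr": result.stderr,
        "exit_code": result.returncode,
    }

# The model emits a command; the harness executes it and feeds the result back.
obs = bash_tool("echo hello && ls /nonexistent")
```

Both the success (`hello` on stdout) and the failure (a non-zero exit code) go back into the model's context, which is what lets it react to errors rather than just emit text.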
This is not “a better ChatGPT.” It is a different category of system. The output is no longer paragraphs – it’s diffs, terminal state, test results, committed code. You don’t read the output and decide what to do with it. The agent reads the output and decides what to do with it.
Andrej Karpathy noted something important here: coding agents “basically didn’t work before December [2024] and basically work since.” That December inflection was real. The models crossed some capability threshold around that time – instruction following, multi-step planning, recovery from errors – and suddenly the loop closed. The thing actually worked.
Claude Code went from interesting toy to production tool in roughly six months. That’s the shape of the curve.
Tier 2: What Coding Agents Are (and Aren’t)
Let’s be precise about what a coding agent actually is, because the framing matters.
Claude Code, Codex, Cursor, Cline – these are Tier 2 systems. They share a common architecture: a capable language model, a set of tools (file read, file write, bash execution, sometimes web search), and a loop. The model plans, uses a tool, observes the result, plans again, uses another tool. Repeat until done or stuck.
The key phrase is multi-step within a session. Claude Code can write a test, run it, see it fail, read the stack trace, identify the bug, edit the code, run the test again, and keep going – all without you touching the keyboard. None of those intermediate steps involve you. You started it. You’ll review the output. But the iterations in between are autonomous.
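That loop is the whole Tier 2 architecture. A sketch with everything stubbed out – the "codebase" is a dict, `run_tests` is a fake tool, and `plan_fix` stands in for the model – but the control flow is the real thing:

```python
# A fake codebase with one off-by-one bug the "agent" will find and fix.
code = {"add.py": "def add(a, b):\n    return a + b + 1\n"}

def run_tests():
    # Tool: execute the code under test and report pass/fail plus output.
    ns = {}
    exec(code["add.py"], ns)
    try:
        assert ns["add"](2, 2) == 4
        return {"passed": True, "output": "1 passed"}
    except AssertionError:
        return {"passed": False, "output": "FAIL: add(2, 2) == 5, expected 4"}

def plan_fix(observation):
    # Stand-in for the model: read the failure, propose an edit.
    if "== 5" in observation:
        code["add.py"] = "def add(a, b):\n    return a + b\n"

# The Tier 2 loop: act, observe, plan, act again -- no human in between.
for step in range(5):
    result = run_tests()
    if result["passed"]:
        break
    plan_fix(result["output"])
```

Swap the stubs for a real model and real tools and you have a coding agent; the loop itself doesn't change.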
Watch it in action and you feel it immediately: this is not a chatbot. This is an agent.
The Carlini example makes this concrete. Nicholas Carlini, a security researcher at Google DeepMind, ran 16 parallel Claude Opus agents for four hours, building a 100,000-line C compiler, written in Rust, from scratch. Human-initiated, environment-scoped, but operating at a scale that looks like something else. That’s the upper bound of where Tier 2 is right now.
The Cline incident is the other end of the story. Cline, the VS Code coding agent, was the vector for a supply chain attack precisely because of its Tier 2 capabilities. An attacker modified a package that Cline might install, adding instructions in a comment that Cline would read and execute. The attack worked because Cline has real bash access and real file write capability. The same properties that make it useful made it exploitable. Tier 2 has blast radius. Not unlimited, but real.
Why the bounded environment matters
Coding agents went mainstream before general agents for one reason: the environment was safe enough to let go.
Your codebase is a sandbox. Git exists. Mistakes are reversible. The blast radius of a wrong decision – a bad edit, a failed test run, an accidentally deleted file – is small and recoverable. You can watch it work, review the diff, revert if needed.
This is why coding agents are the right middle step. They gave LLMs agency inside constraints that humans were comfortable with. The tools are real but the scope is limited. You control what it can see and touch. That’s not an accident – it’s the design that made adoption possible.
What Tier 2 Can’t Do
Here’s where it gets interesting. Because coding agents are impressive, and that impressiveness can obscure some hard architectural limits.
They forget everything when the session ends.
When you close Claude Code, it’s gone. Not the files – the files persist. But the agent’s understanding of what it did, why it made the decisions it made, what it was planning to do next – all of that is lost. The next session starts cold. You’re the memory. You paste context back in. You explain what it already did.
This is not a model intelligence problem. The models are smart enough to remember. It’s an architectural constraint of the current deployment pattern: sessions don’t persist, and there’s no long-term memory store being maintained across them.
They can’t initiate.
A coding agent cannot decide, at 3am, to check the CI status and fix the flaky test. It cannot monitor a repository for new issues and start working on them. It cannot wake up and do things. It runs when you run it. Full stop.
They can’t span environments.
A coding agent in your codebase cannot, without you explicitly setting it up, go check your email for a related message, look up the bug in Jira, post a status update to Slack, and then come back and write the fix. It operates in the environment you give it. Bridging environments requires human coordination.
They can’t track goals across time.
“Check on this tomorrow” is not something a coding agent can do. It has no concept of tomorrow. No scheduler. No goal persistence. The goals exist inside your head, not inside the system.
These aren’t criticisms. They’re features of a safe, bounded system. But they are real constraints, and they define the ceiling of Tier 2.
Tier 3: What Autonomous Agents Actually Require
Tier 3 is not “a much better coding agent.” It’s a different architectural pattern. Here’s what it actually requires:
Persistent memory. Not just context within a session – memory that survives across sessions, gets updated with new information, and can be queried at the start of any new task. Short-term memory (within a context window) is largely solved. Long-term memory at scale – indexing, retrieval, relevance ranking across months of accumulated context – is not.
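A long-term memory store can start very simply: an append-only log on disk plus a retrieval function, queried at the start of every task. A sketch under that assumption – keyword matching stands in for real relevance ranking, which in practice means embeddings and vector search:

```python
import json
import time
from pathlib import Path

MEMORY_FILE = Path("agent_memory.jsonl")  # survives across sessions

def remember(text, tags=()):
    # Append-only: every session adds entries, none are lost at shutdown.
    entry = {"ts": time.time(), "text": text, "tags": list(tags)}
    with MEMORY_FILE.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def recall(query, limit=5):
    # Naive retrieval: substring match, newest first. The hard, unsolved
    # part of Tier 3 is making this good across months of accumulated context.
    entries = [json.loads(line) for line in MEMORY_FILE.open()]
    hits = [e for e in entries if query.lower() in e["text"].lower()]
    return sorted(hits, key=lambda e: e["ts"], reverse=True)[:limit]

remember("Refactored auth module; tokens now expire after 15 minutes", tags=["auth"])
remember("CI flake in test_login traced to clock skew", tags=["ci"])
results = recall("auth")
```

The shape – write during the session, read at the start of the next – is the point; everything else is retrieval quality.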
Cross-session goal tracking. The system needs to know that it was working on something, where it got to, what the next step is, and pick that up without being told. This requires goal state that persists outside the model’s context – a separate store that gets read at session start and written at session end.
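Goal persistence follows the same pattern: a small state file read at session start and written at session end. A sketch of that read/write cycle (the file name and fields are illustrative):

```python
import json
from pathlib import Path

GOALS_FILE = Path("agent_goals.json")

def load_goals():
    # Session start: pick up where the last session left off.
    if GOALS_FILE.exists():
        return json.loads(GOALS_FILE.read_text())
    return []

def save_goals(goals):
    # Session end: persist goal state so the next session can resume it.
    GOALS_FILE.write_text(json.dumps(goals, indent=2))

def next_goal(goals):
    return next((g for g in goals if g["status"] != "done"), None)

# Session 1: start a task, get partway through, persist.
goals = load_goals()
goals.append({"goal": "migrate DB schema", "status": "in_progress",
              "next_step": "backfill the new column"})
save_goals(goals)

# Session 2: a cold start that isn't actually cold.
resumed = next_goal(load_goals())
```

The goal state lives outside the model entirely; the model just reads it, acts, and writes it back.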
Self-initiation. The system needs to be able to wake itself up. Scheduled triggers, event listeners, webhooks, cron-style scheduling – whatever the mechanism, the agent needs to be able to start without a human starting it. This is a trust and safety question as much as a technical one.
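Mechanically, self-initiation is mundane: a trigger fires and a session starts with nobody watching. A sketch using the stdlib `sched` module, with the wake-up action stubbed – in a real deployment it would load goals, load memory, and kick off the agent loop:

```python
import sched
import time

scheduler = sched.scheduler(time.time, time.sleep)
events_fired = []

def wake_up(task):
    # Stub for "start an agent session". The interesting part of Tier 3
    # isn't this function -- it's deciding what it is allowed to do.
    events_fired.append(task)

# Two triggers a fraction of a second apart; scheduler.run() blocks until done.
# A real system would use cron, webhooks, or event listeners instead.
scheduler.enter(0.05, 1, wake_up, argument=("check CI status",))
scheduler.enter(0.10, 1, wake_up, argument=("fetch morning news",))
scheduler.run()
```

The hard part isn't the timer. It's that once this exists, the human is no longer the gate on every action – which is why the next requirement matters.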
Trust architecture. This is the piece that doesn’t get talked about enough. An autonomous agent operating with real capabilities – bash, browser, email, APIs – needs constraint systems. What can it do without asking? What requires approval? How do you define the boundaries? How do you audit what it did? Trust architecture is the engineering discipline of Tier 3, and it doesn’t really exist yet as a mature field.
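One concrete shape for a trust architecture is a policy gate that every proposed action passes through, sorting it into allow / ask / deny and logging everything for audit. A sketch – the policy rules here are invented for illustration:

```python
import fnmatch

# Example policy, checked top to bottom. First matching rule wins.
POLICY = [
    ("deny",  "rm -rf *"),     # never, even with human approval
    ("allow", "git status*"),  # safe and read-only: no approval needed
    ("allow", "pytest*"),
    ("ask",   "git push*"),    # real side effects: human checkpoint
    ("ask",   "*"),            # default: anything unlisted needs approval
]

audit_log = []

def gate(command):
    for verdict, pattern in POLICY:
        if fnmatch.fnmatch(command, pattern):
            audit_log.append({"command": command, "verdict": verdict})
            return verdict

verdicts = [gate("pytest tests/"), gate("git push origin main"), gate("rm -rf /")]
```

"What can it do without asking? What requires approval? How do you audit what it did?" are, in this framing, the `allow` rules, the `ask` rules, and the log.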
Multi-environment operation. Reading news, writing a file, posting to Telegram, checking email, updating a database – a Tier 3 agent doesn’t operate in one environment. It operates across all of them, with tools for each.
What does this actually look like? An agent that wakes up at 7am, fetches the morning news, synthesises a briefing, posts it to a Telegram channel, monitors for replies, updates a blog post based on feedback, and goes back to sleep. Not a pipeline of separate scripts. One agent with persistent goals and cross-session memory, operating continuously.
This is not science fiction. It is running right now. A small number of engineers – the people Karpathy called the “Claw” movement – are living in Tier 3 already. They’re running agents 24/7, agents that monitor things and send messages and take actions without being asked. The infrastructure is rough. The trust architecture is DIY. But the thing exists.
Why the Gap Is Closing Faster Than You Think
Go back to Karpathy’s observation: coding agents basically didn’t work before December 2024 and basically work since.
That’s a six-month step function. Models crossed a capability threshold – instruction following, multi-step planning, error recovery, consistent tool use – and suddenly the loop closed. The compound of model capability plus better tooling plus people actually using it in anger produced something qualitatively different from what existed before.
The December 2024 inflection for Tier 2 was visible in hindsight. Claude Code became a production tool. Cursor crossed a million users. GitHub Copilot grew up. The Ampcode manifesto (“The Coding Agent Is Dead” – meaning the simple autocomplete model is dead, replaced by true agents) landed because it was describing something people were already experiencing.
The same curve is happening for Tier 3, right now, in early 2026.
The infrastructure is improving fast: better long-term memory stores, better agent orchestration frameworks, better tooling for multi-agent systems. The models are capable enough – the gap to general autonomy was never really model intelligence. It was memory, scheduling, and trust architecture. Those are engineering problems, not research problems. Engineering problems get solved faster.
Claude Code went from toy to production in six months. If the Tier 2 curve is the template, we’re looking at general autonomous agents moving from “exists but rough” to “engineers are using this in production” in a similar timeframe. We’re probably already in the “exists but rough” phase. The “production tool” phase is not far.
The engineers who are paying attention are already building the foundations: memory systems, trust frameworks, scheduling infrastructure. The engineers who aren’t paying attention are still optimising their system prompts.
What This Means for How You Build
Understanding which tier you’re in matters, because the skills required at each tier are different. Not unrelated – they build on each other – but different in emphasis.
Tier 1 requires prompt engineering. How you frame the question, structure the context, specify the output format – this is the craft. It’s valuable. It’s learnable. It transfers directly to higher tiers.
Tier 2 requires environment setup and scaffolding. Which tools does the agent have access to? What can it see and touch? How do you structure the working directory? How do you give it context at session start? How do you review its output? The quality of a Tier 2 deployment depends on the environment as much as the prompt. Setting up Claude Code well – CLAUDE.md, clean project structure, clear conventions – is a real engineering task.
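Much of "setting up Claude Code well" is just writing the environment down. A sketch of the kind of CLAUDE.md that pays off – the contents are invented for illustration, not a template from any official documentation:

```markdown
# CLAUDE.md

## Project
Payment reconciliation service. Python 3.12, FastAPI, Postgres.

## Commands
- Run tests: `pytest -q` (must pass before any commit)
- Lint: `ruff check .`

## Conventions
- All money values are integer cents, never floats.
- New endpoints need a test in `tests/api/` before review.

## Boundaries
- Never edit files under `migrations/` by hand.
- Ask before adding dependencies.
```

The file does at session start what you'd otherwise do by hand every time: give the agent the context, conventions, and limits it can't remember on its own.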
Tier 3 requires trust architecture, memory design, and constraint systems. What can the agent do autonomously? What requires a human checkpoint? How do you store and retrieve context across sessions? How do you audit the agent’s actions? How do you define the blast radius and enforce it? These are new questions. The answers aren’t obvious yet. The field is being invented in real time.
The framing errors at each tier are predictable. Tier 1 thinking applied to Tier 2 produces agents that are over-supervised – humans approving every small decision, negating the autonomy that makes Tier 2 valuable. Tier 2 thinking applied to Tier 3 produces agents that are under-constrained – no memory architecture, no trust boundaries, no audit trail, and eventually something breaks badly.
Knowing which tier you’re in lets you apply the right frame. It lets you build the right things.
Closing: Know Where You Are
The engineers who understand which tier they’re in are building the right things.
They’re not trying to solve Tier 3 problems by writing better prompts. They’re not building Tier 2 scaffolding for a system that needs Tier 3 trust architecture. They understand that each threshold is a qualitative shift – not an upgrade, a category change – and they’re building accordingly.
The engineers who don’t understand are still optimising prompts for a system that stopped being a chatbot a year ago. They’re treating Claude Code like a smart autocomplete. They’re missing the point of what they already have access to, and they’re not prepared for what’s coming.
You’ve already crossed the first threshold. You’re in Tier 2 territory. The next one is coming, faster than it looks from here.
The question is whether you see it.