Commissioned, Curated and Published by Russ. Researched and written with AI.


What’s New This Week

Published 23 March 2026. Google DeepMind released “A Subgoal-driven Framework for Improving Long-Horizon LLM Agents” (arXiv:2603.19685) on HuggingFace daily papers today. This post covers the paper’s core findings and what they mean for engineers building agentic systems.


Changelog

Date         Summary
23 Mar 2026  Initial publication covering the DeepMind subgoal paper.

There’s a failure mode engineers building multi-step agents hit quickly: the agent starts a long task, takes some sensible-looking steps, and then gets stuck. Not crashed. Just… cycling. Clicking the same button. Re-running the same search. Making progress that isn’t.

This isn’t a prompt engineering problem. It’s a structural one. And a paper from Google DeepMind published this week puts numbers on exactly how bad it is, then proposes a two-part fix.

The failure modes are predictable

“Long-horizon” is researcher shorthand for tasks that require many sequential steps to complete – “find all open GitHub issues tagged critical, summarise each one, and draft a Slack message to the engineering lead” is a long-horizon task. So is “book a flight from London to New York for next Thursday under $800.”

These tasks fail in consistent ways:

Mid-task drift. The agent loses track of the original goal as new information accumulates. Early constraints get buried in context. The agent starts optimising for what it sees now, not what it was asked to do.

Non-productive loops. The agent identifies an action it can take, takes it, finds it doesn’t resolve the problem, and takes it again. It has no mechanism for recognising it’s stuck.

Locally reasonable, globally wrong. Individual steps look fine. The trajectory as a whole is wrong. The agent makes decisions that make sense given its immediate context but move it away from the actual goal.

Sparse reward blindness. For agents trained with RL, success or failure only becomes clear at the end of a long sequence. The agent can’t identify which of its 40 steps was the one that mattered.

The DeepMind team measured how often agents actually hit these failure modes on WebArena-Lite, a web navigation benchmark. Agents using Gemini-2.5-Pro out-of-the-box showed “mid-task stuck” behaviour in nearly 50% of evaluation trajectories. After supervised fine-tuning on human demonstrations, smaller open models like Gemma-12B-SFT still failed to make progress in over 30% of cases.

These aren’t edge cases. They’re the median behaviour of current production-grade agents on realistic tasks.

What subgoal decomposition actually does

The intuition is straightforward: instead of holding one large goal in mind over a long action sequence, break the task into explicit intermediate checkpoints, verify each one before proceeding, and maintain awareness of where you are in the hierarchy.

That description sounds like chain-of-thought. It isn’t. The distinction matters.

Chain-of-thought is reasoning about what to do next. Subgoal decomposition is maintaining a structured plan that the agent actively checks its progress against. One produces thinking. The other produces an execution state the agent can be held accountable to.

The difference shows up when things go wrong. A chain-of-thought agent drifting in a loop has nothing to snap it back. An agent with explicit subgoals has a checkpoint: is this subgoal complete? If not, what’s blocking it? If yes, advance to the next one.

It’s also different from simple task decomposition, where you break a job into sub-tasks at the start and hand them out. Subgoal decomposition is dynamic – the plan can be revised at execution time when the environment changes. A subgoal that assumed X was true can be updated when X turns out to be false.

What the DeepMind paper actually proposes

The paper (Taiyi Wang, Sian Gooding, Florian Hartmann, Oriana Riva, Edward Grefenstette, all Google DeepMind) makes two concrete contributions.

First: a real-time subgoal planning framework. During execution, a proprietary model generates and maintains a subgoal decomposition alongside the action sequence. The agent doesn’t just act – it continuously asks what the current intermediate goal is and whether it’s been reached. When new information arrives that invalidates the current subgoal, the plan is updated. This runs at inference time, no retraining required.

Applied to Gemini on WebArena-Lite, this alone produces roughly a 10 percentage point absolute improvement in task success rate. That’s a large lift for something that’s purely a prompting and scaffolding change.
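In outline, that inference-time scaffold could look something like the loop below. To be clear, `model`, `take_action`, and `observe` are placeholder callables and the prompts are made up for illustration; this is the shape of the pattern, not the paper's implementation:

```python
def run_with_subgoals(task: str, model, take_action, observe, max_steps: int = 40):
    """Execute `task` while maintaining an explicit subgoal plan.

    `model` is any callable prompt -> text (standing in for a real LLM call);
    `take_action` executes an action string in the environment, and
    `observe` turns its result into text.
    """
    # Generate the initial decomposition once, before acting.
    plan = model(f"Break this task into ordered subgoals:\n{task}").splitlines()
    current = 0
    for _ in range(max_steps):
        if current >= len(plan):
            return "done"  # all subgoals complete
        action = model(f"Task: {task}\nCurrent subgoal: {plan[current]}\nNext action?")
        observation = observe(take_action(action))
        verdict = model(
            f"Subgoal: {plan[current]}\nObservation: {observation}\n"
            "Answer COMPLETE, BLOCKED, or CONTINUE."
        ).strip()
        if verdict == "COMPLETE":
            current += 1  # checkpoint reached: advance
        elif verdict == "BLOCKED":
            # New information invalidated the plan: regenerate what's left.
            plan = plan[:current] + model(
                f"Replan the remaining subgoals for: {task}\nGiven: {observation}"
            ).splitlines()
    return "out_of_steps"
```

Note that the subgoal check costs one extra model call per step; that overhead is the price of catching the BLOCKED case before it turns into a loop.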

Second: MiRA (Milestoning your Reinforcement Learning Enhanced Agent). This addresses the RL training problem. Sparse end-of-task rewards mean the agent can’t learn which intermediate steps contributed to success or failure. MiRA introduces dense, milestone-based reward signals: the agent gets feedback at each subgoal completion, not just at the end.
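A toy illustration of the difference between sparse and milestone-based reward assignment; the reward values and the milestone predicates here are made up for the example, not MiRA's actual formulation:

```python
def milestone_rewards(trajectory, milestones, task_succeeded: bool,
                      milestone_bonus: float = 0.3, final_reward: float = 1.0):
    """Assign dense per-step rewards from subgoal completions.

    `milestones` is an ordered list of predicates, one per subgoal; each
    fires once, at the first step whose state satisfies it. A sparse-reward
    setup would instead put all credit on the final step.
    """
    rewards = [0.0] * len(trajectory)
    next_milestone = 0
    for i, state in enumerate(trajectory):
        if next_milestone < len(milestones) and milestones[next_milestone](state):
            rewards[i] += milestone_bonus  # credit the step that hit the milestone
            next_milestone += 1
    if task_succeeded:
        rewards[-1] += final_reward  # terminal reward, as in the sparse setup
    return rewards

# Example: a 5-step trajectory where steps 1 and 3 complete subgoals.
traj = ["s0", "s1", "s2", "s3", "s4"]
hit = [lambda s: s == "s1", lambda s: s == "s3"]
print(milestone_rewards(traj, hit, task_succeeded=True))
# -> [0.0, 0.3, 0.0, 0.3, 1.0]
```

The training-signal difference is visible in the output: instead of one number at the end of a 40-step sequence, the policy gets attributable credit at each checkpoint.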

The MiRA results supply the headline number. Applied to the open-source Gemma3-12B model, success rate on WebArena-Lite goes from 6.4% to 43.0%. For reference, GPT-4-Turbo sits at 17.6% on the same benchmark and GPT-4o at 13.9%. The previous best open-model result, WebRL, was 38.4%.

An open 12B model outperforming GPT-4o by 29 percentage points on web navigation tasks, because of better training signal structure, is a significant result.

What this means for pipeline design

If you’re building agentic pipelines today, the practical implications are:

Add explicit subgoal tracking to any task over 10 steps. This doesn’t require a research implementation. A system prompt that instructs the agent to maintain a numbered subgoal list, mark each one complete before advancing, and update the list when context changes will capture most of the benefit. The overhead is minimal – a few hundred tokens per step.

Structure success criteria at the subgoal level, not just the task level. “Did we complete the whole task?” is a terrible evaluation signal for debugging. “Which subgoal did we fail at, and why?” is actionable. This applies to both your monitoring and your RL training if you’re fine-tuning.

Treat context window management as a subgoal concern. Long-horizon tasks accumulate context. Subgoal completion is the natural moment to summarise and compress – once a subgoal is done, the detailed action-observation trace for that subgoal can be replaced with a compact summary. The HiAgent paper (a related line of work) shows a 3.8x reduction in average steps required when agents compress context at subgoal boundaries.

Plan for subgoal revision, not just subgoal completion. The real world changes. A subgoal that assumed a page would load a certain way will sometimes be wrong. Build in explicit replanning logic rather than assuming the initial decomposition survives contact with the environment.
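The compress-at-boundary idea can be sketched in a few lines. The history format and the `summarise` callable here are assumptions for illustration, not an API from any of the papers:

```python
def compress_at_boundary(history, completed_subgoal: str, summarise):
    """Replace the detailed trace for a finished subgoal with one summary entry.

    `history` is a list of dicts like {"subgoal": ..., "role": ..., "text": ...};
    `summarise` is any callable text -> text (a placeholder for an LLM call).
    """
    kept, finished_texts = [], []
    for entry in history:
        if entry["subgoal"] == completed_subgoal:
            finished_texts.append(entry["text"])
        else:
            kept.append(entry)
    summary = {"subgoal": completed_subgoal, "role": "summary",
               "text": summarise("\n".join(finished_texts))}
    # The completed subgoal precedes the in-progress one, so its summary goes first.
    return [summary] + kept

history = [
    {"subgoal": "login", "role": "action", "text": "clicked Sign in"},
    {"subgoal": "login", "role": "observation", "text": "dashboard loaded"},
    {"subgoal": "search", "role": "action", "text": "typed query into search box"},
]
compact = compress_at_boundary(history, "login", summarise=lambda t: "Logged in OK.")
print(len(compact))  # -> 2: one summary entry plus the in-progress subgoal's trace
```

Calling this once per completed subgoal keeps context growth proportional to the number of subgoals rather than the number of raw steps, which is where the step-count reductions reported in that line of work come from.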

Current tooling support

LangGraph has the most natural support for this pattern. Its graph-based execution model lets you model subgoals as nodes with conditional edges – you can route execution based on subgoal completion checks, and the state object persists the current subgoal explicitly. It’s not automatic, but the primitives are there and the implementation is straightforward.

CrewAI models tasks at the agent level, which maps loosely to subgoal decomposition when you assign agents to specific sub-tasks. The downside is it’s more static – the task decomposition happens at setup time, not dynamically at execution time. Mid-task replanning requires custom work.

AutoGen (now at 0.4) is more conversational and flexible but offers less structure by default. You can implement subgoal decomposition through conversation patterns between a planner agent and executor agents, but you’re building more from scratch.

None of these frameworks implement milestone-based RL training. MiRA’s contribution is to the training loop, not the inference scaffold. If you’re fine-tuning agents, you’d need to implement the milestone reward structure yourself – it’s not packaged anywhere as an off-the-shelf training framework yet.

When it’s overkill

Subgoal decomposition adds overhead. A few hundred extra tokens per step adds up over a long trajectory, and the planning logic adds latency. For short tasks – under five or six steps, clear success criteria, controlled environment – it’s unnecessary complexity.

It’s also less useful when the task is inherently sequential and linear with no meaningful decision points. If every step is obvious given the previous one, a subgoal hierarchy doesn’t help. The value is at branch points: moments where the agent has to choose between multiple plausible next actions based on whether a prior subgoal is actually complete.

The other limitation the paper doesn’t fully address: subgoal decomposition assumes the task is decomposable into sequential milestones. Some long-horizon tasks aren’t – they require reasoning about the whole trajectory simultaneously, and breaking them into chunks loses information needed to make the right decisions. Open-ended creative tasks, adversarial environments, tasks with deep interdependencies between steps: these don’t always fit a hierarchical model cleanly.

The 43.0% success rate on WebArena-Lite is impressive. It also means the best current approach still fails 57% of the time on realistic web tasks. Subgoal decomposition is a meaningful step forward. It’s not a solved problem.

The right frame is: subgoal structure makes failure modes visible and recoverable. That’s worth a lot. It’s not the same as making long-horizon tasks reliably work.


Paper: A Subgoal-driven Framework for Improving Long-Horizon LLM Agents – Wang et al., Google DeepMind, arXiv:2603.19685 (March 2026).