Measuring AI Ability to Complete Long Tasks: Insights 2025

“Measuring AI ability to complete long tasks” opens the discussion about how far AI systems can go — not just in solving one problem, but in executing multi-step, time-intensive workflows. As an AI and tech leader, I see this metric as one of the most meaningful indicators of real-world impact, beyond benchmarks. A recent study by METR (March 2025) lays the foundation for how we should think about the horizon of AI capability — and what it implies for productivity, risk, governance and the future of work.

What the Study Found

The METR blog post explains how the researchers measured the “task-completion time horizon”: the length of tasks (measured by how long they take humans) that AI agents can complete with a given success probability (e.g., 50%).
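
As a rough sketch of the idea (not METR's actual methodology or code), you can picture the horizon as a curve fit: record whether an agent succeeded on each task, regress success against how long the task takes a human, and read off the task length at which predicted success falls to 50%. The task times and outcomes below are invented purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical runs: how long each task takes a skilled human (in minutes),
# paired with whether the agent completed it. Illustrative numbers only.
human_minutes = np.array([2, 4, 8, 15, 30, 60, 120, 240, 480, 960])
agent_succeeded = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0])

# Fit success probability against log task length (longer tasks are harder).
X = np.log(human_minutes).reshape(-1, 1)
model = LogisticRegression().fit(X, agent_succeeded)

# The 50% horizon is where the fitted probability crosses 0.5, i.e. where
# coef * log(minutes) + intercept = 0.
log_horizon = -model.intercept_[0] / model.coef_[0][0]
print(f"Estimated 50% time horizon: ~{np.exp(log_horizon):.0f} human-minutes")
```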

Key findings:

  • Over the past six years, the time horizon for tasks that frontier AI agents can complete (with ~50% reliability) has been doubling approximately every 7 months.
  • Currently, models show near-100% success on tasks that take humans under ~4 minutes, but less than ~10% success on tasks that take humans more than ~4 hours.
  • Extrapolating the trend suggests that within this decade, generalist AI agents may complete week-long tasks (i.e., work that currently takes humans days or weeks) with meaningful probability; a rough back-of-the-envelope version of this extrapolation follows the list below.
  • The authors emphasise that although benchmark scores are improving rapidly, the translation to everyday workflow automation is still constrained by the difficulty of stringing together longer sequences of actions.
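
To make the doubling rate concrete, here is a back-of-the-envelope version of that extrapolation. The one-hour starting horizon is an assumption made for illustration (the findings above quote success rates at ~4 minutes and ~4 hours, not a single current horizon); the point is the shape of the curve, not an exact date.

```python
import math

doubling_months = 7          # reported doubling period of the 50% horizon
current_horizon_hours = 1.0  # assumed starting point, for illustration only
target_hours = 40.0          # a "week-long" task, counted as one human work-week

doublings = math.log2(target_hours / current_horizon_hours)
months = doublings * doubling_months
print(f"{doublings:.1f} doublings ≈ {months:.0f} months ≈ {months / 12:.1f} years")
# ~5.3 doublings, roughly three years, which is what places week-long
# tasks inside this decade if the trend continues.
```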

Why This Metric Matters

From a leadership and organisational perspective, this “task-length horizon” metric is important for several reasons:

  1. Real-world relevance: It captures not just whether an AI can answer a single question, but whether it can carry out a meaningful chunk of work (multiple steps, dependencies, time).
  2. Forecasting power: Because the trend appears exponential and relatively consistent, it gives us a way to anticipate when AI will cross key thresholds (e.g., handling day-long tasks) rather than only relying on model-size milestones.
  3. Strategic planning: If you know that AI may soon be able to do what humans currently take hours or days to do, you can begin shifting how you organise work, invest in training, and adjust risk/quality-controls accordingly.
  4. Risk and governance: Longer tasks often involve greater complexity, coordination, decision-points and hidden risks. Knowing where AI still struggles helps design safe oversight regimes.

Implications for Software Development & Knowledge Work

As someone in the AI/tech leadership space, the study suggests several actionable implications:

  • Shift from micro-tasks to workflow design: Many current AI tools handle single steps well (code snippet generation, summarisation, simple tests). The frontier is chaining these into end-to-end workflows (e.g., feature development, code review, deployment).
  • Quality & trust become critical: When tasks span hours or days, even small errors compound (a quick calculation after this list shows how fast). You’ll need stronger review frameworks, and AI-generated outputs will require human validation, especially for critical work.
  • Invest in tooling and pipelines: To exploit AI’s growing ability for longer tasks, your infrastructure (DevOps, data pipelines, team workflows) must support orchestration of multi-step automation—not just isolated tools.
  • Re-imagine roles and skills: As AI takes on more of the “task-span”, human roles may shift toward supervision, exception-handling, strategy, and meta-workflow design rather than step-execution.
  • Manage adoption timing: The exponential trend means capabilities will increase rapidly. Late adopters risk being surprised; early adopters should experiment now with complex workflows to prepare for scale.
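
To see why small errors compound over longer spans of work, consider a purely hypothetical workflow in which every individual step is highly reliable but any single failure derails the run; both the step count and the per-step reliability below are assumptions chosen for illustration.

```python
# Hypothetical: a 50-step workflow, each step 98% reliable, no recovery.
per_step_success = 0.98
steps = 50

end_to_end = per_step_success ** steps
print(f"End-to-end success without checkpoints: {end_to_end:.0%}")  # ~36%
```

A chain of individually impressive steps can still fail most of the time end to end, which is why review checkpoints and recovery paths matter more as the task span grows.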

Leadership Considerations & Risks

  • Over-hype vs. readiness: Even though the trend is strong, the study emphasises that many models still cannot reliably complete even several-hour tasks. Adopting AI for mission-critical multi-day workflows too early may backfire.
  • Complexity of chaining steps: The boundary between a single task and a multi-step workflow is where many failures occur. Organisations must map dependencies, error-recovery paths, and human-in-the-loop checkpoints; a minimal sketch of such a checkpointed workflow follows this list.
  • Governance and oversight: As AI handles longer tasks, the scope for error grows. Risk controls must expand from “did it produce the right output?” to “did it follow the correct process, handle edge cases, and maintain security and compliance?”
  • Forecasting for disruption: If the doubling trend continues, you may have a shorter-than-expected runway to rethink how teams work. This may affect organisation design, talent investment, and tech stack decisions sooner than previously assumed.
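
As one minimal sketch of what mapping dependencies and human-in-the-loop checkpoints can look like in practice, the pattern below lets each step declare whether its output needs human sign-off before the workflow continues. The step names and the approval function are hypothetical and not tied to any particular orchestration framework.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    run: Callable[[str], str]   # takes the previous output, returns a new one
    needs_review: bool = False  # human-in-the-loop checkpoint?

def human_approves(step_name: str, output: str) -> bool:
    # Placeholder: in practice this would route to a reviewer (ticket, UI, chat).
    answer = input(f"Approve output of '{step_name}'? [y/N] ")
    return answer.strip().lower() == "y"

def run_workflow(steps: list[Step], initial: str) -> str:
    output = initial
    for step in steps:
        output = step.run(output)
        if step.needs_review and not human_approves(step.name, output):
            raise RuntimeError(f"Workflow halted at checkpoint '{step.name}'")
    return output

# Illustrative chain, with checkpoints on the steps where an unnoticed
# error would be most costly.
workflow = [
    Step("draft_change", lambda spec: f"patch for: {spec}"),
    Step("run_tests", lambda patch: f"{patch} (tests passed)", needs_review=True),
    Step("prepare_release", lambda tested: f"release candidate from {tested}", needs_review=True),
]
# result = run_workflow(workflow, "add retry logic to the payment client")
```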

The Big Picture

The study by METR provides a compelling lens to view AI progress: time-span of tasks rather than just benchmark scores or model size. For tech leaders, this matters because it aligns more closely with workflow impact. If AI can reliably complete longer tasks, it shifts from being an assistive tool to being a collaborator or even an autonomous actor in some settings.

As we move into 2026 and beyond, measuring and tracking this capacity will be key—not just in labs, but inside your organisation. The question is no longer if AI will handle multi-day workstreams, but when and how safely. And your job as a leader will be to ensure your team, infrastructure and governance are ready for that shift.

TL;DR

A recent METR study shows AI agents’ ability to complete longer tasks is doubling every ~7 months. While current models still struggle with tasks that take humans hours, the trend suggests week-long workflows may soon be feasible. For leaders, this means focusing less on single-step automation and more on designing robust workflows, investing in oversight, building tooling that chains tasks, and preparing for accelerated change.

Reference

“Measuring AI Ability to Complete Long Tasks”, METR, 19 March 2025.

What does the METR study suggest about AI's ability to complete long tasks?

The study indicates that AI agents’ ability to complete longer tasks is doubling approximately every 7 months, with the potential for generalist AI to reliably handle week-long tasks within this decade.

Why is the 'task-length horizon' metric important for organizations?

This metric captures the real-world relevance of AI by assessing its capability to perform meaningful work involving multiple steps and dependencies, aiding in strategic planning and risk management.

What implications does the study have for software development and knowledge work?

Organizations should shift focus from micro-tasks to designing end-to-end workflows, invest in tooling for multi-step automation, and prepare for changes in human roles towards supervision and strategy as AI capabilities grow.

