DRAFT PROPOSAL · This is a proposal, not a finished product. Read, challenge, and improve it. · Join the discussion →

See what your agents actually deliver

DORA for pipelines. AURA for agents.
Four metrics. Built on OpenTelemetry.

The problem

Your agents are black boxes

DORA gave software delivery four clear metrics. AI agents have no equivalent: every team invents its own, making it impossible to benchmark or improve.

01

No standard metrics

Is the agent delivering what was asked? How would you know? Every team tracks something different.

02

Task ≠ Value

Measuring tool calls is like measuring commits instead of deployments. Activity isn't delivery.

03

No quality gradient

Binary pass/fail doesn't capture reality. Two agents can "complete" a feature — one nails it, the other barely passes.

04

Recovery is invisible

When an agent fails mid-task, does it self-correct or spiral in a loop burning tokens? You can't tell until you check the bill.

Key insight

Measure deliverables, not tasks

DORA measures deployments, not commits. AURA applies the same principle: measure the unit of value delivered against a specification.

A task is a commit.
A deliverable is a deployment.

An agent might complete 50 tasks in pursuit of one thing a user cares about. That one thing, verified against a spec, is what AURA measures.

Deliverable ← unit of measurement for AURA
├── Spec ← what was requested (acceptance criteria)
├── Plan ← how the agent approaches it
├── Tasks ← individual steps
│   ├── LLM calls
│   ├── Tool calls
│   └── Reasoning steps
└── Outcome
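For illustration only, this hierarchy maps onto a small data model. Below is a minimal Python sketch; the class and field names echo the terminology defined later in the specification, but none of it is a required implementation.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Spec:                        # what was requested (acceptance criteria)
    spec_id: str
    requirements: list[str]

@dataclass
class Task:                        # an individual step the agent takes
    description: str
    llm_calls: int = 0
    tool_calls: int = 0

@dataclass
class Deliverable:                 # the unit of measurement for AURA
    change_id: str
    spec: Spec
    plan: Optional[str] = None     # how the agent approaches it
    tasks: list[Task] = field(default_factory=list)
    outcome: Optional[str] = None  # accepted or failed, verified against the spec
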
Philosophy

What AURA believes

01

Measure deliverables, not tasks

Tool calls are commits. Deliverables are deployments. Measure value.

02

Speed, stability, and quality correlate

The best agents are fast, reliable, and accurate. Optimizing one at the expense of others hides deeper problems.

03

The spec is the source of truth

Without a spec, you can't measure quality. AURA makes specs explicit.

04

Metrics drive conversations

Not leaderboards. Not punishments. Conversations about where to improve.

05

Build on existing standards

OTel GenAI semantic conventions. Your infrastructure already works.

06

Start simple, go deep

Four headline metrics tell the story. Supporting metrics tell you why.

A natural question

Why not just use DORA?

DORA is excellent — if you're shipping software, you should be tracking it. AURA doesn't replace DORA. It fills a gap DORA was never designed to cover.

DORA measures how well your pipeline delivers code to production. It assumes a human wrote the code, a CI system tested it, and a deployment shipped it. AI agents don't have pipelines. They receive a request, reason, call tools, and produce output. The failure modes aren't broken builds — they're hallucinations, spec mismatches, and infinite loops. The quality question isn't "did it deploy safely?" but "did it do what was asked?"

If your agents write code that flows through a DORA-measured pipeline, the two complement each other perfectly: DORA measures the pipeline. AURA measures the agent.

The four metrics

Throughput · Stability

Like DORA, AURA captures the tension between speed and stability.

01 · Throughput
Feature Frequency
Deliverables accepted per time period. Measures productivity at the level humans care about.
Elite: ≥3/day
High: ≥1/day
Medium: ≥1/week
Low: <1/week
02 · Throughput
Feature Lead Time
Spec received to verified deliverable. Includes planning, execution, validation, and rework.
Elite: <1 hour
High: <4 hours
Medium: <1 day
Low: ≥1 day
03 · Stability
Human Intervention Rate
Percentage of deliverables that required human takeover or were abandoned without completion.
Elite: <5%
High: <10%
Medium: <15%
Low: ≥15%
04 · Stability
Recovery Efficiency
When things fail mid-delivery, how efficiently does the agent self-correct? Ratio of recovery overhead to total time.
Elite: <5% overhead
High: <10%
Medium: <20%
Low: ≥20%
Side by side

DORA → AURA

Same structure. Different domain.

Aspect       | DORA                 | AURA
Domain       | Software delivery    | AI agent performance
Unit         | Deployment           | Deliverable (against spec)
Throughput 1 | Deployment Frequency | Feature Frequency
Throughput 2 | Lead Time            | Feature Lead Time
Stability 1  | Change Failure Rate  | Human Intervention Rate
Stability 2  | MTTR                 | Recovery Efficiency
Data source  | CI/CD pipelines      | OpenTelemetry traces
Specification

AURA Specification v0.1.0

DRAFT · MIT · GitHub →

AURA is a metrics framework that measures the reliability and performance of AI agent output. Just as DORA measures software delivery performance through four key metrics, AURA measures agent performance through four: Feature Frequency, Feature Lead Time, Human Intervention Rate, and Recovery Efficiency.

Terminology

Deliverable
A unit of agent work verified against a specification. One deliverable = one measurable outcome. Examples: a feature implementation, a bug fix, a refactoring task.
Spec
The acceptance criteria for a deliverable. Can come from any source — an OpenSpec change folder, a Jira ticket, a markdown file, a plain text prompt. The spec defines what "done" means.
Phase
A stage in the deliverable lifecycle. The canonical phases are: propose, specs, design, tasks, apply, verify, archive. Not all phases are required for every deliverable.
Recovery
A rework cycle triggered by a failure during execution. When verification fails and the agent retries, that retry is a recovery attempt.

The Four Metrics

2.1 Feature Frequency

The number of deliverables accepted per unit time.

count(deliverables where status = "accepted") / time_period
Tier   | Threshold
Elite  | ≥3/day
High   | ≥1/day
Medium | ≥1/week
Low    | <1/week

2.2 Feature Lead Time

Wall-clock time from spec received to deliverable accepted. Includes planning, execution, and verification.

accepted_at - started_at
Tier   | Threshold
Elite  | <1 hour (3,600s)
High   | <4 hours (14,400s)
Medium | <1 day (86,400s)
Low    | ≥1 day

Includes human review time. To isolate agent time, subtract human wait phases from the total.

2.3 Human Intervention Rate

Percentage of deliverables that required human takeover or were abandoned without completion.

count(human_takeover or abandoned) / count(deliverables) × 100
Tier   | Threshold
Elite  | <5%
High   | <10%
Medium | <15%
Low    | ≥15%

A deliverable that required agent self-correction (recovery) but completed without human involvement is NOT counted. Only deliverables where a human stepped in or the work was abandoned are counted.

2.4 Recovery Efficiency

The proportion of total effort spent on rework and retries.

recovery_time / total_time × 100
Tier   | Threshold
Elite  | <5%
High   | <10%
Medium | <20%
Low    | ≥20%
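
To make the four formulas concrete, here is a minimal Python sketch that computes them from a list of deliverable records. The record fields (status, accepted_at, recovery_seconds, and so on) are assumptions for illustration; any store of completed deliverables works.

def aura_headline_metrics(deliverables, period_days=7):
    """Sketch: compute the four AURA headline metrics from deliverable records."""
    accepted = [d for d in deliverables if d["status"] == "accepted"]
    intervened = [d for d in deliverables
                  if d.get("human_takeover") or d["status"] == "abandoned"]

    # 2.1 Feature Frequency: accepted deliverables per day
    feature_frequency = len(accepted) / period_days

    # 2.2 Feature Lead Time: spec received to deliverable accepted, in seconds
    lead_times = [(d["accepted_at"] - d["started_at"]).total_seconds()
                  for d in accepted]

    # 2.3 Human Intervention Rate: takeovers or abandonments over all deliverables
    intervention_rate = 100 * len(intervened) / len(deliverables)

    # 2.4 Recovery Efficiency: rework time as a share of total time
    recovery_overhead = (100 * sum(d.get("recovery_seconds", 0) for d in deliverables)
                         / sum(d["total_seconds"] for d in deliverables))

    return feature_frequency, lead_times, intervention_rate, recovery_overhead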

Deliverable Lifecycle

A deliverable progresses through phases. Not all phases are required — a minimal deliverable has: propose → apply → archive. Phases can repeat (apply → verify → apply → verify → archive).

propose → specs → design → tasks → apply → verify → archive

Phase Definitions

Phase   | Description                                  | Data Captured
propose | Deliverable is identified and scoped         | Start timestamp, deliverable ID, description
specs   | Requirements and acceptance criteria defined | Start/end timestamps, requirements count
design  | Solution approach is planned                 | Start/end timestamps, design artifacts
tasks   | Work is broken into discrete tasks           | Start/end timestamps, task count
apply   | Agent executes the work                      | Start/end timestamps, tool calls, files changed
verify  | Output is validated against the spec         | Start/end timestamps, test results, conformance
archive | Deliverable is finalized and recorded        | End timestamp, final metrics

Deliverable States

State     | Description
planning  | In propose, specs, design, or tasks phase
executing | In apply phase
verifying | In verify phase
completed | Archived successfully
failed    | Abandoned or below conformance threshold

Failure Taxonomy

When a deliverable fails, it should be classified by failure type:

Failure Type          | Description
spec_misunderstanding | Agent misinterpreted requirements
hallucination         | Agent produced fabricated output
infinite_loop         | Agent got stuck in a retry cycle
tool_failure          | External tool returned an error the agent couldn't recover from
constraint_violation  | Output exceeded specified boundaries
incomplete            | Agent stopped before finishing all requirements
regression            | Agent's fix broke something that previously worked

Data Model

AURA defines three JSON Schema documents (JSON Schema draft 2020-12): a metrics output record, a status record for in-progress deliverables, and lightweight event records.

The final record emitted when a deliverable is completed or failed. This is the primary AURA output format.

{ "schema_version": "0.1.0", "change_id": "add-dark-mode", "started_at": "2026-02-26T10:00:00Z", "completed_at": "2026-02-26T10:45:00Z", "status": "completed", "description": "Add dark mode toggle to settings page", "metrics": { "resolution_latency_seconds": 2700, "phase_durations": { "propose": 60, "specs": 120, "design": 180, "tasks": 120, "apply": 1800, "verify": 300 }, "tool_calls": { "file_edit": 24, "bash": 15, "total": 87 }, "apply_iterations": 2, "recovery_attempts": 1, "tasks_completed": 12, "tasks_total": 12, "conformance": { "functional": 1.0, "correctness": 0.95, "constraints": 1.0, "iteration_penalty": 0.85, "overall": 0.97 }, "deliverable_failed": false, "failure_type": null }, "spec_source": { "framework": "openspec", "spec_id": "changes/add-dark-mode", "requirements_count": 8 }, "agent": { "name": "claude-code", "model": "claude-sonnet-4-20250514", "framework": "claude-code" }, "sessions": ["session-d4e5f6"] }

Required fields: schema_version, change_id, started_at, completed_at, status, metrics

Tracks the current state of an in-progress deliverable. Updated as the agent progresses through phases.

{ "change_id": "add-search-feature", "status": "executing", "started_at": "2026-02-26T09:00:00Z", "updated_at": "2026-02-26T09:35:00Z", "complexity": "moderate", "description": "Add full-text search to the product catalog", "spec_source": { "framework": "openspec", "spec_id": "changes/add-search-feature", "requirements_count": 6 }, "phases": { "propose": { "started_at": "...", "completed_at": "..." }, "apply": { "started_at": "...", "completed_at": null } }, "current_phase": "apply", "tasks_total": 8, "tasks_completed": 4, "tool_calls": { "file_edit": 8, "file_read": 12, "bash": 3 }, "tool_calls_total": 28, "recovery_attempts": 0, "apply_iterations": 1 }

Required fields: change_id, status, started_at

Lightweight event records for streaming and real-time collection. Emitted at phase transitions and significant moments during execution.

{ "event_type": "phase_start", "timestamp": "2026-02-26T10:05:00Z", "change_id": "add-dark-mode", "phase": "apply", "data": { "task_index": 3 } }

Event types: phase_start, phase_end, tool_call, recovery, deliverable_start, deliverable_end

Required fields: event_type, timestamp, change_id
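
A minimal emitter for these events could be as simple as writing one JSON object per line. The sketch below is illustrative; the function name and transport are not part of the specification.

import json
import sys
from datetime import datetime, timezone

def emit_event(event_type, change_id, phase=None, data=None, out=sys.stdout):
    """Sketch: emit a single AURA event as a JSON line."""
    event = {
        "event_type": event_type,          # phase_start, phase_end, tool_call, ...
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "change_id": change_id,
    }
    if phase is not None:
        event["phase"] = phase
    if data:
        event["data"] = data
    out.write(json.dumps(event) + "\n")

emit_event("phase_start", "add-dark-mode", phase="apply", data={"task_index": 3})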

Schema Validation

# Node
npx ajv validate -s schemas/latest/metrics-output.schema.json -d my-output.json

# Python
python -m jsonschema -i my-output.json schemas/latest/metrics-output.schema.json
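
The same check can be run programmatically. A minimal sketch with the Python jsonschema library, assuming the schema and output files sit at the paths used above:

import json
from jsonschema import validate  # pip install jsonschema

with open("schemas/latest/metrics-output.schema.json") as f:
    schema = json.load(f)
with open("my-output.json") as f:
    record = json.load(f)

validate(instance=record, schema=schema)  # raises ValidationError if invalid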

OpenTelemetry Integration

AURA extends OTel GenAI semantic conventions with the aura.* namespace. All attributes use this prefix to avoid collision with existing OTel conventions.

Span Hierarchy

aura.deliverable (root span)
├── aura.deliverable.plan
├── aura.deliverable.execute
│   ├── aura.task
│   │   ├── gen_ai.chat (OTel GenAI standard)
│   │   └── aura.tool.call
│   └── aura.recovery.attempt
├── aura.deliverable.validate
└── aura.deliverable.accept

Semantic Attributes

Attribute                    | Type   | Description
aura.deliverable.id          | string | Unique deliverable identifier
aura.deliverable.type        | string | feature, bugfix, refactor, chore
aura.deliverable.status      | string | planning, executing, verifying, completed, failed
aura.deliverable.complexity  | string | trivial, simple, moderate, complex
aura.spec.framework          | string | openspec, jira, github-issue, markdown, prompt
aura.spec.id                 | string | Spec identifier
aura.spec.requirements_count | int    | Number of requirements extracted
aura.phase.name              | string | Current phase name
aura.phase.iteration         | int    | Phase iteration count
aura.failure.type            | string | Failure classification
aura.recovery.attempt        | int    | Current recovery attempt number
aura.recovery.total          | int    | Total recovery attempts
aura.agent.name              | string | Agent identifier
aura.agent.model             | string | Model used
aura.agent.framework         | string | Agent framework
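
As a sketch of what emitting these spans could look like with the standard OpenTelemetry Python SDK (the span and attribute names come from the tables above; the tracer setup and surrounding workflow are illustrative):

from opentelemetry import trace

tracer = trace.get_tracer("aura")

# Root span for one deliverable, annotated with aura.* attributes
with tracer.start_as_current_span("aura.deliverable") as deliverable:
    deliverable.set_attribute("aura.deliverable.id", "add-dark-mode")
    deliverable.set_attribute("aura.deliverable.type", "feature")
    deliverable.set_attribute("aura.spec.framework", "openspec")
    deliverable.set_attribute("aura.spec.requirements_count", 8)

    with tracer.start_as_current_span("aura.deliverable.execute"):
        with tracer.start_as_current_span("aura.task") as task:
            task.set_attribute("aura.phase.name", "apply")
            # gen_ai.chat and aura.tool.call spans would nest here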

Metric Instruments

Instrument                 | Type      | Unit          | Description
aura.deliverables.count    | Counter   | {deliverable} | Total deliverables processed
aura.deliverables.accepted | Counter   | {deliverable} | Deliverables accepted
aura.deliverables.failed   | Counter   | {deliverable} | Deliverables failed
aura.resolution_latency    | Histogram | s             | Resolution latency distribution
aura.recovery.attempts     | Histogram | {attempt}     | Recovery attempts per deliverable
aura.tool_calls.count      | Counter   | {call}        | Total tool calls
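
Declared with the OpenTelemetry metrics API, the instruments might look like the following sketch (meter provider and exporter wiring omitted):

from opentelemetry import metrics

meter = metrics.get_meter("aura")

deliverables_count = meter.create_counter(
    "aura.deliverables.count", unit="{deliverable}",
    description="Total deliverables processed")
resolution_latency = meter.create_histogram(
    "aura.resolution_latency", unit="s",
    description="Resolution latency distribution")

# Recorded when a deliverable is archived
deliverables_count.add(1, {"aura.deliverable.type": "feature"})
resolution_latency.record(2700, {"aura.deliverable.complexity": "moderate"})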

Spec Framework Compatibility

AURA is spec-framework agnostic. Any system that defines acceptance criteria for agent work can be an AURA spec source.

Source       | Spec =          | Requirements =          | Deliverable Boundary
OpenSpec     | Change folder   | Delta spec requirements | propose → archive
Jira         | Ticket          | Acceptance criteria     | Created → Done
GitHub Issue | Issue body      | Checklist items         | Opened → closed
Markdown     | The file        | Bullet points           | Created → verified
User prompt  | The prompt text | Implicit (single req)   | Prompt → accepted

Supporting Metrics

The four headline metrics tell you what's happening. The supporting metrics tell you why. Every supporting metric feeds at least one headline metric — when a headline goes red, you drill into the supporting metrics to find the cause.

Token Usage

Feeds → Recovery Efficiency, Feature Frequency

Total input tokens, output tokens, cost per deliverable. If an agent uses 50k tokens on a deliverable that should take 10k, the extra 40k is recovery overhead. It also contextualises Feature Frequency — are you shipping more because the agent is efficient, or because it's brute-forcing with expensive models? Token cost per deliverable is the unit economics metric teams will care about most once they're past the "does it work" phase.

Tool Call Count

Feeds → Recovery Efficiency, Feature Lead Time

Counts by type: Write, Edit, Bash, Read, etc. A healthy deliverable has a predictable ratio of reads to writes. If you see 40 Read calls and 2 Writes, the agent spent most of its time searching. If you see 15 Write calls and 12 Edit calls on the same files, it's rewriting its own work. The pattern tells you where the agent is struggling.

Phase Duration

Feeds → Feature Lead Time

Time spent in each phase: propose, specs, design, tasks, apply, verify. This decomposes Feature Lead Time into its parts. A 4-hour deliverable where 3.5 hours was in apply is very different from one where 2 hours was human review during verify. This is where you find the bottleneck — is the agent slow, or is the human slow to review?

Apply Iterations

Feeds → Recovery Efficiency

How many times the apply phase ran before acceptance. First-time-right deliverables have zero recovery overhead. Tracking the count over time tells you if your agent is getting better or worse at completing work without rework.

Recovery Attempts

Feeds → Recovery Efficiency

Count of rework cycles within an apply iteration — different from apply iterations. An apply iteration is "the agent stopped, human said try again." A recovery attempt is "the agent hit an error and self-corrected within a single run." High recovery attempts with eventual success means the agent is resilient but inefficient. High recovery attempts with failure means it's thrashing.

Failure Type Distribution

Feeds → Human Intervention Rate

Breakdown across: spec_misunderstanding, hallucination, infinite_loop, tool_failure, constraint_violation, incomplete, regression. Categorises why humans had to intervene or deliverables were abandoned. A 12% intervention rate means very different things if it's all tool failures (infrastructure problem) versus all hallucinations (model problem). This is the metric that tells you where to invest.

Human Intervention Count

Feeds → Human Intervention Rate, Recovery Efficiency

Times a human had to step in during execution. An agent that completes every deliverable but needs 5 human nudges per run isn't really autonomous. This metric tracks the journey toward full autonomy.

Complexity Distribution

Feeds → Feature Frequency, Feature Lead Time

Trivial/simple/moderate/complex breakdown of deliverables. Contextualises Feature Frequency and Feature Lead Time. Shipping 10 trivial deliverables/day is not the same as shipping 2 complex ones. Without this, you can game throughput by splitting work into tiny pieces.

The Relationship Map

Every supporting metric feeds at least one headline metric, and most feed two. Look at the four headlines on a dashboard — when one goes red, drill into the supporting metrics to find the cause.

Token Usage ──────────┐
Tool Call Count ──────┼──→ Recovery Efficiency
Recovery Attempts ────┘

Phase Duration ───────────→ Feature Lead Time

Failure Type ───────────┐
Human Interventions ────┴──→ Human Intervention Rate

Complexity Distribution ──→ Feature Frequency

Performance Tiers

Summary of all performance tiers across the four metrics. Tier classification uses the most recent rolling window — recommended default is 7 days or 20 deliverables, whichever comes first.

Metric                  | Elite   | High     | Medium  | Low
Feature Frequency       | ≥3/day  | ≥1/day   | ≥1/week | <1/week
Feature Lead Time       | <1 hour | <4 hours | <1 day  | ≥1 day
Human Intervention Rate | <5%     | <10%     | <15%    | ≥15%
Recovery Efficiency     | <5%     | <10%     | <20%    | ≥20%
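
Tier classification itself is mechanical once a metric has been computed over the window. A minimal sketch for one metric (the function is illustrative, not part of the spec):

def classify_intervention_rate(rate_pct: float) -> str:
    """Sketch: map a Human Intervention Rate (%) onto the tiers above."""
    if rate_pct < 5:
        return "Elite"
    if rate_pct < 10:
        return "High"
    if rate_pct < 15:
        return "Medium"
    return "Low"

# Evaluated over the most recent rolling window:
# 7 days or 20 deliverables, whichever comes first.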
Get involved

This is a proposal — help shape it

AURA is an early draft, not a finished product. The specification, metrics, and data model are open for discussion. Read it, challenge it, and help make it useful.

Join the discussion → View on GitHub
Architecture

Built on OpenTelemetry

AURA extends OTel's GenAI semantic conventions. Your existing infrastructure, exporters, and backends all work.

YOUR CODE: LangChain · CrewAI · AutoGen · PydanticAI · custom
AURA SDK: Deliverable lifecycle · Failure classification · Recovery tracking
OTEL SDK: Traces · Metrics · Events
OTEL COLLECTOR: AURA processor · Spanmetrics · Export to any backend
BACKEND: Grafana · Datadog · Honeycomb · Jaeger · any OTLP-compatible
Bring your own spec framework

Works with any spec framework

AURA measures how well agents deliver against specs. It doesn't care how those specs are written. Use whatever fits your workflow.

RECOMMENDED

OpenSpec

Spec-driven development for AI coding assistants. Each change gets a structured folder — proposal, requirements, design, tasks. AURA reads the requirements and measures conformance. The example in this repo uses OpenSpec.

ALTERNATIVE

spec-kit by GitHub

A thorough toolkit for spec-driven development with phase gates and structured templates. Heavier than OpenSpec, and well suited for enterprise teams that need more process.

ALTERNATIVE

Kiro, Linear, Jira, Notion…

A spec doesn't need a framework. A Jira ticket with acceptance criteria, a Notion doc with requirements, or even a plain Markdown file — all work.

THE PRINCIPLE

More explicit spec → more useful score

A vague prompt gives you a binary signal. A structured spec with 5 explicit requirements gives you a precise conformance gradient. AURA rewards specificity.

Bring your own agent

Plugins

Integrate AURA into your agent toolchain using one of the available plugins.

Plugin

Claude Code

Instruments Claude Code agent sessions with AURA metrics. Tracks deliverables, phases, and tool calls automatically via Claude Code hooks, emitting OpenTelemetry spans to your collector.

aura-metrics-claude-code-plugin →
Origin

Where AURA came from

AURA grew out of work on AgentEx — the idea that the developer experience problems we spent a decade solving for humans (slow feedback loops, environment drift, deployment-gated testing) hit AI agents even harder. A human waiting three minutes for a deploy can context-switch. An agent just burns tokens doing nothing. The same fixes accelerate both, but for agents the shift is qualitative, not just quantitative.

I'm Eamonn Faherty. I've been working on improving the developer experience for agents to improve the developer experience for the humans using them. Part of that portfolio is Local Web Services — a tool that reads your AWS CDK cloud assembly and recreates your entire application locally, giving agents (and humans) a tight inner loop measured in seconds instead of minutes.

While building it, I kept asking the same question: how do I know the agents are actually getting better? I needed a way to measure throughput, stability, and quality — something like DORA, but for agent deliverables. That's how AURA started. You can read more about the AgentEx concept in the original article.

Get involved

Help define the standard

AURA is a draft proposal. The goal is to establish a shared standard for measuring AI agent performance. Your input shapes what it becomes.