DRAFT PROPOSAL · This is a proposal, not a finished product. Read, challenge, and improve it. · Join the discussion →

See what your agents actually deliver

DORA for pipelines. AURA for agents.
Four metrics. Built on OpenTelemetry.

The problem

Your agents are black boxes

DORA gave software delivery four clear metrics. AI agents have no equivalent: every team invents its own, making it impossible to benchmark or improve.

01

No standard metrics

Is the agent delivering what was asked? How would you know? Every team tracks something different.

02

Task ≠ Value

Measuring tool calls is like measuring commits instead of deployments. Activity isn't delivery.

03

No quality gradient

Binary pass/fail doesn't capture reality. Two agents can "complete" a feature — one nails it, the other barely passes.

04

Recovery is invisible

When an agent fails mid-task, does it self-correct or spiral in a loop burning tokens? You can't tell until you check the bill.

Key insight

Measure deliverables, not tasks

DORA measures deployments, not commits. AURA applies the same principle: measure the unit of value delivered against a specification.

A task is a commit.
A deliverable is a deployment.

An agent might complete 50 tasks in pursuit of one thing a user cares about. That one thing, verified against a spec, is what AURA measures.

Deliverable ← unit of measurement for AURA
├── Spec ← what was requested (acceptance criteria)
├── Plan ← how the agent approaches it
├── Tasks ← individual steps
│   ├── LLM calls
│   ├── Tool calls
│   └── Reasoning steps
└── Outcome
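For illustration only, this hierarchy maps onto a small data model. Below is a minimal Python sketch; the class and field names echo the terminology defined later in the specification, but none of it is a required implementation.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Spec:                        # what was requested (acceptance criteria)
    spec_id: str
    requirements: list[str]

@dataclass
class Task:                        # an individual step the agent takes
    description: str
    llm_calls: int = 0
    tool_calls: int = 0

@dataclass
class Deliverable:                 # the unit of measurement for AURA
    change_id: str
    spec: Spec
    plan: Optional[str] = None     # how the agent approaches it
    tasks: list[Task] = field(default_factory=list)
    outcome: Optional[str] = None  # accepted or failed, verified against the spec
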
Philosophy

What AURA believes

01

Measure deliverables, not tasks

Tool calls are commits. Deliverables are deployments. Measure value.

02

Speed, stability, and quality correlate

The best agents are fast, reliable, and accurate. Optimizing one at the expense of others hides deeper problems.

03

The spec is the source of truth

Without a spec, you can't measure quality. AURA makes specs explicit.

04

Metrics drive conversations

Not leaderboards. Not punishments. Conversations about where to improve.

05

Build on existing standards

OTel GenAI semantic conventions. Your infrastructure already works.

06

Start simple, go deep

Four headline metrics tell the story. Supporting metrics tell you why.

A natural question

Why not just use DORA?

DORA is excellent — if you're shipping software, you should be tracking it. AURA doesn't replace DORA. It fills a gap DORA was never designed to cover.

DORA measures how well your pipeline delivers code to production. It assumes a human wrote the code, a CI system tested it, and a deployment shipped it. AI agents don't have pipelines. They receive a request, reason, call tools, and produce output. The failure modes aren't broken builds — they're hallucinations, spec mismatches, and infinite loops. The quality question isn't "did it deploy safely?" but "did it do what was asked?"

If your agents write code that flows through a DORA-measured pipeline, the two complement each other perfectly: DORA measures the pipeline. AURA measures the agent.

The four metrics

Throughput · Stability

Like DORA, AURA captures the tension between speed and stability.

01 · Throughput
Feature Frequency
Deliverables accepted per time period. Measures productivity at the level humans care about.
Elite: ≥3/day
High: ≥1/day
Medium: ≥1/week
Low: <1/week
02 · Throughput
Feature Lead Time
Spec received to verified deliverable. Includes planning, execution, validation, and rework.
Elite: <1 hour
High: <4 hours
Medium: <1 day
Low: ≥1 day
03 · Stability
Human Intervention Rate
Percentage of deliverables that required human takeover or were abandoned without completion.
Elite: <5%
High: <10%
Medium: <15%
Low: ≥15%
04 · Stability
Recovery Efficiency
When things fail mid-delivery, how efficiently does the agent self-correct? Ratio of recovery overhead to total time.
Elite: <5% overhead
High: <10%
Medium: <20%
Low: ≥20%
Side by side

DORA → AURA

Same structure. Different domain.

Aspect       | DORA                 | AURA
Domain       | Software delivery    | AI agent performance
Unit         | Deployment           | Deliverable (against spec)
Throughput 1 | Deployment Frequency | Feature Frequency
Throughput 2 | Lead Time            | Feature Lead Time
Stability 1  | Change Failure Rate  | Human Intervention Rate
Stability 2  | MTTR                 | Recovery Efficiency
Data source  | CI/CD pipelines      | OpenTelemetry traces
Specification

AURA Specification v0.1.0

DRAFT · MIT · GitHub →

AURA is a metrics framework that measures the reliability and performance of AI agent output. Just as DORA measures software delivery performance through four key metrics, AURA measures agent performance through four: Feature Frequency, Feature Lead Time, Human Intervention Rate, and Recovery Efficiency.

Terminology

Deliverable
A unit of agent work verified against a specification. One deliverable = one measurable outcome. Examples: a feature implementation, a bug fix, a refactoring task.
Spec
The acceptance criteria for a deliverable. Can come from any source — an OpenSpec change folder, a Jira ticket, a markdown file, a plain text prompt. The spec defines what "done" means.
Phase
A stage in the deliverable lifecycle. The canonical phases are: propose, specs, design, tasks, apply, verify, archive. Not all phases are required for every deliverable.
Recovery
A rework cycle triggered by a failure during execution. When verification fails and the agent retries, that retry is a recovery attempt.

The Four Metrics

2.1 Feature Frequency

The number of deliverables accepted per unit time.

count(deliverables where status = "accepted") / time_period
Tier   | Threshold
Elite  | ≥3/day
High   | ≥1/day
Medium | ≥1/week
Low    | <1/week

2.2 Feature Lead Time

Wall-clock time from spec received to deliverable accepted. Includes planning, execution, and verification.

accepted_at - started_at
Tier   | Threshold
Elite  | <1 hour (3,600s)
High   | <4 hours (14,400s)
Medium | <1 day (86,400s)
Low    | ≥1 day

Includes human review time. To isolate agent time, subtract human wait phases from the total.

2.3 Human Intervention Rate

Percentage of deliverables that required human takeover or were abandoned without completion.

count(human_takeover or abandoned) / count(deliverables) × 100
Tier   | Threshold
Elite  | <5%
High   | <10%
Medium | <15%
Low    | ≥15%

A deliverable that required agent self-correction (recovery) but completed without human involvement is NOT counted. Only deliverables where a human stepped in or the work was abandoned are counted.

2.4 Recovery Efficiency

The proportion of total effort spent on rework and retries.

recovery_time / total_time × 100
Tier   | Threshold
Elite  | <5%
High   | <10%
Medium | <20%
Low    | ≥20%
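
To make the four formulas concrete, here is a minimal Python sketch that computes them from a list of deliverable records. The record fields (status, accepted_at, recovery_seconds, and so on) are assumptions for illustration; any store of completed deliverables works.

def aura_headline_metrics(deliverables, period_days=7):
    """Sketch: compute the four AURA headline metrics from deliverable records."""
    accepted = [d for d in deliverables if d["status"] == "accepted"]
    intervened = [d for d in deliverables
                  if d.get("human_takeover") or d["status"] == "abandoned"]

    # 2.1 Feature Frequency: accepted deliverables per day
    feature_frequency = len(accepted) / period_days

    # 2.2 Feature Lead Time: spec received to deliverable accepted, in seconds
    lead_times = [(d["accepted_at"] - d["started_at"]).total_seconds()
                  for d in accepted]

    # 2.3 Human Intervention Rate: takeovers or abandonments over all deliverables
    intervention_rate = 100 * len(intervened) / len(deliverables)

    # 2.4 Recovery Efficiency: rework time as a share of total time
    recovery_overhead = (100 * sum(d.get("recovery_seconds", 0) for d in deliverables)
                         / sum(d["total_seconds"] for d in deliverables))

    return feature_frequency, lead_times, intervention_rate, recovery_overhead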

Deliverable Lifecycle

A deliverable progresses through phases. Not all phases are required — a minimal deliverable has: propose → apply → archive. Phases can repeat (apply → verify → apply → verify → archive).

propose → specs → design → tasks → apply → verify → archive

Phase Definitions

Phase   | Description                                  | Data Captured
propose | Deliverable is identified and scoped         | Start timestamp, deliverable ID, description
specs   | Requirements and acceptance criteria defined | Start/end timestamps, requirements count
design  | Solution approach is planned                 | Start/end timestamps, design artifacts
tasks   | Work is broken into discrete tasks           | Start/end timestamps, task count
apply   | Agent executes the work                      | Start/end timestamps, tool calls, files changed
verify  | Output is validated against the spec         | Start/end timestamps, test results, conformance
archive | Deliverable is finalized and recorded        | End timestamp, final metrics

Deliverable States

State     | Description
planning  | In propose, specs, design, or tasks phase
executing | In apply phase
verifying | In verify phase
completed | Archived successfully
failed    | Abandoned or below conformance threshold

Failure Taxonomy

When a deliverable fails, it should be classified by failure type:

Failure Type          | Description
spec_misunderstanding | Agent misinterpreted requirements
hallucination         | Agent produced fabricated output
infinite_loop         | Agent got stuck in a retry cycle
tool_failure          | External tool returned an error the agent couldn't recover from
constraint_violation  | Output exceeded specified boundaries
incomplete            | Agent stopped before finishing all requirements
regression            | Agent's fix broke something that previously worked

Data Model

AURA defines three JSON Schema documents (JSON Schema draft 2020-12): a metrics output record, a status record for in-progress deliverables, and lightweight event records.

The final record emitted when a deliverable is completed or failed. This is the primary AURA output format.

{ "schema_version": "0.1.0", "change_id": "add-dark-mode", "started_at": "2026-02-26T10:00:00Z", "completed_at": "2026-02-26T10:45:00Z", "status": "completed", "description": "Add dark mode toggle to settings page", "metrics": { "resolution_latency_seconds": 2700, "phase_durations": { "propose": 60, "specs": 120, "design": 180, "tasks": 120, "apply": 1800, "verify": 300 }, "tool_calls": { "file_edit": 24, "bash": 15, "total": 87 }, "apply_iterations": 2, "recovery_attempts": 1, "tasks_completed": 12, "tasks_total": 12, "conformance": { "functional": 1.0, "correctness": 0.95, "constraints": 1.0, "iteration_penalty": 0.85, "overall": 0.97 }, "deliverable_failed": false, "failure_type": null }, "spec_source": { "framework": "openspec", "spec_id": "changes/add-dark-mode", "requirements_count": 8 }, "agent": { "name": "claude-code", "model": "claude-sonnet-4-20250514", "framework": "claude-code" }, "sessions": ["session-d4e5f6"] }

Required fields: schema_version, change_id, started_at, completed_at, status, metrics

Tracks the current state of an in-progress deliverable. Updated as the agent progresses through phases.

{ "change_id": "add-search-feature", "status": "executing", "started_at": "2026-02-26T09:00:00Z", "updated_at": "2026-02-26T09:35:00Z", "complexity": "moderate", "description": "Add full-text search to the product catalog", "spec_source": { "framework": "openspec", "spec_id": "changes/add-search-feature", "requirements_count": 6 }, "phases": { "propose": { "started_at": "...", "completed_at": "..." }, "apply": { "started_at": "...", "completed_at": null } }, "current_phase": "apply", "tasks_total": 8, "tasks_completed": 4, "tool_calls": { "file_edit": 8, "file_read": 12, "bash": 3 }, "tool_calls_total": 28, "recovery_attempts": 0, "apply_iterations": 1 }

Required fields: change_id, status, started_at

Lightweight event records for streaming and real-time collection. Emitted at phase transitions and significant moments during execution.

{ "event_type": "phase_start", "timestamp": "2026-02-26T10:05:00Z", "change_id": "add-dark-mode", "phase": "apply", "data": { "task_index": 3 } }

Event types: phase_start, phase_end, tool_call, recovery, deliverable_start, deliverable_end

Required fields: event_type, timestamp, change_id
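
A minimal emitter for these events could be as simple as writing one JSON object per line. The sketch below is illustrative; the function name and transport are not part of the specification.

import json
import sys
from datetime import datetime, timezone

def emit_event(event_type, change_id, phase=None, data=None, out=sys.stdout):
    """Sketch: emit a single AURA event as a JSON line."""
    event = {
        "event_type": event_type,          # phase_start, phase_end, tool_call, ...
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "change_id": change_id,
    }
    if phase is not None:
        event["phase"] = phase
    if data:
        event["data"] = data
    out.write(json.dumps(event) + "\n")

emit_event("phase_start", "add-dark-mode", phase="apply", data={"task_index": 3})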

Schema Validation

# Node
npx ajv validate -s schemas/latest/metrics-output.schema.json -d my-output.json

# Python
python -m jsonschema -i my-output.json schemas/latest/metrics-output.schema.json
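
The same check can be run programmatically. A minimal sketch with the Python jsonschema library, assuming the schema and output files sit at the paths used above:

import json
from jsonschema import validate  # pip install jsonschema

with open("schemas/latest/metrics-output.schema.json") as f:
    schema = json.load(f)
with open("my-output.json") as f:
    record = json.load(f)

validate(instance=record, schema=schema)  # raises ValidationError if invalid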

OpenTelemetry Integration

AURA extends OTel GenAI semantic conventions with the aura.* namespace. All attributes use this prefix to avoid collision with existing OTel conventions.

Span Hierarchy

aura.deliverable (root span)
├── aura.deliverable.plan
├── aura.deliverable.execute
│   ├── aura.task
│   │   ├── gen_ai.chat (OTel GenAI standard)
│   │   └── aura.tool.call
│   └── aura.recovery.attempt
├── aura.deliverable.validate
└── aura.deliverable.accept

Semantic Attributes

Attribute                    | Type   | Description
aura.deliverable.id          | string | Unique deliverable identifier
aura.deliverable.type        | string | feature, bugfix, refactor, chore
aura.deliverable.status      | string | planning, executing, verifying, completed, failed
aura.deliverable.complexity  | string | trivial, simple, moderate, complex
aura.spec.framework          | string | openspec, jira, github-issue, markdown, prompt
aura.spec.id                 | string | Spec identifier
aura.spec.requirements_count | int    | Number of requirements extracted
aura.phase.name              | string | Current phase name
aura.phase.iteration         | int    | Phase iteration count
aura.failure.type            | string | Failure classification
aura.recovery.attempt        | int    | Current recovery attempt number
aura.recovery.total          | int    | Total recovery attempts
aura.agent.name              | string | Agent identifier
aura.agent.model             | string | Model used
aura.agent.framework         | string | Agent framework
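
As a sketch of what emitting these spans could look like with the standard OpenTelemetry Python SDK (the span and attribute names come from the tables above; the tracer setup and surrounding workflow are illustrative):

from opentelemetry import trace

tracer = trace.get_tracer("aura")

# Root span for one deliverable, annotated with aura.* attributes
with tracer.start_as_current_span("aura.deliverable") as deliverable:
    deliverable.set_attribute("aura.deliverable.id", "add-dark-mode")
    deliverable.set_attribute("aura.deliverable.type", "feature")
    deliverable.set_attribute("aura.spec.framework", "openspec")
    deliverable.set_attribute("aura.spec.requirements_count", 8)

    with tracer.start_as_current_span("aura.deliverable.execute"):
        with tracer.start_as_current_span("aura.task") as task:
            task.set_attribute("aura.phase.name", "apply")
            # gen_ai.chat and aura.tool.call spans would nest here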

Metric Instruments

Instrument                 | Type      | Unit          | Description
aura.deliverables.count    | Counter   | {deliverable} | Total deliverables processed
aura.deliverables.accepted | Counter   | {deliverable} | Deliverables accepted
aura.deliverables.failed   | Counter   | {deliverable} | Deliverables failed
aura.resolution_latency    | Histogram | s             | Resolution latency distribution
aura.recovery.attempts     | Histogram | {attempt}     | Recovery attempts per deliverable
aura.tool_calls.count      | Counter   | {call}        | Total tool calls
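
Declared with the OpenTelemetry metrics API, the instruments might look like the following sketch (meter provider and exporter wiring omitted):

from opentelemetry import metrics

meter = metrics.get_meter("aura")

deliverables_count = meter.create_counter(
    "aura.deliverables.count", unit="{deliverable}",
    description="Total deliverables processed")
resolution_latency = meter.create_histogram(
    "aura.resolution_latency", unit="s",
    description="Resolution latency distribution")

# Recorded when a deliverable is archived
deliverables_count.add(1, {"aura.deliverable.type": "feature"})
resolution_latency.record(2700, {"aura.deliverable.complexity": "moderate"})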

Spec Framework Compatibility

AURA is spec-framework agnostic. Any system that defines acceptance criteria for agent work can be an AURA spec source.

Source       | Spec =          | Requirements =          | Deliverable Boundary
OpenSpec     | Change folder   | Delta spec requirements | propose → archive
Jira         | Ticket          | Acceptance criteria     | Created → Done
GitHub Issue | Issue body      | Checklist items         | Opened → closed
Markdown     | The file        | Bullet points           | Created → verified
User prompt  | The prompt text | Implicit (single req)   | Prompt → accepted

Supporting Metrics

The four headline metrics tell you what's happening. The supporting metrics tell you why. Every supporting metric feeds at least one headline metric — when a headline goes red, you drill into the supporting metrics to find the cause.

Token Usage

Feeds → Recovery Efficiency, Feature Frequency

Total input tokens, output tokens, cost per deliverable. If an agent uses 50k tokens on a deliverable that should take 10k, the extra 40k is recovery overhead. It also contextualises Feature Frequency — are you shipping more because the agent is efficient, or because it's brute-forcing with expensive models? Token cost per deliverable is the unit economics metric teams will care about most once they're past the "does it work" phase.

Tool Call Count

Feeds → Recovery Efficiency, Feature Lead Time

Counts by type: Write, Edit, Bash, Read, etc. A healthy deliverable has a predictable ratio of reads to writes. If you see 40 Read calls and 2 Writes, the agent spent most of its time searching. If you see 15 Write calls and 12 Edit calls on the same files, it's rewriting its own work. The pattern tells you where the agent is struggling.

Phase Duration

Feeds → Feature Lead Time

Time spent in each phase: propose, specs, design, tasks, apply, verify. This decomposes Feature Lead Time into its parts. A 4-hour deliverable where 3.5 hours was in apply is very different from one where 2 hours was human review during verify. This is where you find the bottleneck — is the agent slow, or is the human slow to review?

Apply Iterations

Feeds → Recovery Efficiency

How many times the apply phase ran before acceptance. First-time-right deliverables have zero recovery overhead. Tracking the count over time tells you if your agent is getting better or worse at completing work without rework.

Recovery Attempts

Feeds → Recovery Efficiency

Count of rework cycles within an apply iteration — different from apply iterations. An apply iteration is "the agent stopped, human said try again." A recovery attempt is "the agent hit an error and self-corrected within a single run." High recovery attempts with eventual success means the agent is resilient but inefficient. High recovery attempts with failure means it's thrashing.

Failure Type Distribution

Feeds → Human Intervention Rate

Breakdown across: spec_misunderstanding, hallucination, infinite_loop, tool_failure, constraint_violation, incomplete, regression. Categorises why humans had to intervene or deliverables were abandoned. A 12% intervention rate means very different things if it's all tool failures (infrastructure problem) versus all hallucinations (model problem). This is the metric that tells you where to invest.

Human Intervention Count

Feeds → Human Intervention Rate, Recovery Efficiency

Times a human had to step in during execution. An agent that completes every deliverable but needs 5 human nudges per run isn't really autonomous. This metric tracks the journey toward full autonomy.

Complexity Distribution

Feeds → Feature Frequency, Feature Lead Time

Trivial/simple/moderate/complex breakdown of deliverables. Contextualises Feature Frequency and Feature Lead Time. Shipping 10 trivial deliverables/day is not the same as shipping 2 complex ones. Without this, you can game throughput by splitting work into tiny pieces.

The Relationship Map

Every supporting metric feeds at least one headline metric, and most feed two. Look at the four headlines on a dashboard — when one goes red, drill into the supporting metrics to find the cause.

Token Usage ──────────┐
Tool Call Count ──────┼──→ Recovery Efficiency
Recovery Attempts ────┘

Phase Duration ───────────→ Feature Lead Time

Failure Type ───────────┐
Human Interventions ────┴──→ Human Intervention Rate

Complexity Distribution ──→ Feature Frequency

Performance Tiers

Summary of all performance tiers across the four metrics. Tier classification uses the most recent rolling window — recommended default is 7 days or 20 deliverables, whichever comes first.

Metric                  | Elite   | High     | Medium  | Low
Feature Frequency       | ≥3/day  | ≥1/day   | ≥1/week | <1/week
Feature Lead Time       | <1 hour | <4 hours | <1 day  | ≥1 day
Human Intervention Rate | <5%     | <10%     | <15%    | ≥15%
Recovery Efficiency     | <5%     | <10%     | <20%    | ≥20%
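
Tier classification itself is mechanical once a metric has been computed over the window. A minimal sketch for one metric (the function is illustrative, not part of the spec):

def classify_intervention_rate(rate_pct: float) -> str:
    """Sketch: map a Human Intervention Rate (%) onto the tiers above."""
    if rate_pct < 5:
        return "Elite"
    if rate_pct < 10:
        return "High"
    if rate_pct < 15:
        return "Medium"
    return "Low"

# Evaluated over the most recent rolling window:
# 7 days or 20 deliverables, whichever comes first.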
Get involved

This is a proposal — help shape it

AURA is an early draft, not a finished product. The specification, metrics, and data model are open for discussion. Read it, challenge it, and help make it useful.

Join the discussion → View on GitHub
Architecture

Built on OpenTelemetry

AURA extends OTel's GenAI semantic conventions. Your existing infrastructure, exporters, and backends all work.

YOUR CODE: LangChain · CrewAI · AutoGen · PydanticAI · custom
AURA SDK: Deliverable lifecycle · Failure classification · Recovery tracking
OTEL SDK: Traces · Metrics · Events
OTEL COLLECTOR: AURA processor · Spanmetrics · Export to any backend
BACKEND: Grafana · Datadog · Honeycomb · Jaeger · any OTLP-compatible
Bring your own spec framework

Works with any spec framework

AURA measures how well agents deliver against specs. It doesn't care how those specs are written. Use whatever fits your workflow.

RECOMMENDED

OpenSpec

Spec-driven development for AI coding assistants. Each change gets a structured folder — proposal, requirements, design, tasks. AURA reads the requirements and measures conformance. The example in this repo uses OpenSpec.

ALTERNATIVE

spec-kit by GitHub

A thorough toolkit for spec-driven development with phase gates and structured templates. Heavier than OpenSpec, and well suited for enterprise teams that need more process.

ALTERNATIVE

Kiro, Linear, Jira, Notion…

A spec doesn't need a framework. A Jira ticket with acceptance criteria, a Notion doc with requirements, or even a plain Markdown file — all work.

THE PRINCIPLE

More explicit spec → more useful score

A vague prompt gives you a binary signal. A structured spec with 5 explicit requirements gives you a precise conformance gradient. AURA rewards specificity.

Bring your own agent

Plugins

Integrate AURA into your agent toolchain using one of the available plugins.

Plugin

Claude Code

Instruments Claude Code agent sessions with AURA metrics. Tracks deliverables, phases, and tool calls automatically via Claude Code hooks, emitting OpenTelemetry spans to your collector.

aura-metrics-claude-code-plugin →
Origin

Where AURA came from

AURA grew out of work on AgentEx — the idea that the developer experience problems we spent a decade solving for humans (slow feedback loops, environment drift, deployment-gated testing) hit AI agents even harder. A human waiting three minutes for a deploy can context-switch. An agent just burns tokens doing nothing. The same fixes accelerate both, but for agents the shift is qualitative, not just quantitative.

I'm Eamonn Faherty. I've been working on improving the developer experience for agents to improve the developer experience for the humans using them. Part of that portfolio is Local Web Services — a tool that reads your AWS CDK cloud assembly and recreates your entire application locally, giving agents (and humans) a tight inner loop measured in seconds instead of minutes.

While building it, I kept asking the same question: how do I know the agents are actually getting better? I needed a way to measure throughput, stability, and quality — something like DORA, but for agent deliverables. That's how AURA started. You can read more about the AgentEx concept in the original article.

Get involved

Help define the standard

AURA is a draft proposal. The goal is to establish a shared standard for measuring AI agent performance. Your input shapes what it becomes.