DORA for pipelines. AURA for agents.
Four metrics. Built on OpenTelemetry.
DORA gave software delivery four clear metrics. AI agents have nothing. Every team invents their own, making it impossible to benchmark or improve.
Is the agent delivering what was asked? How would you know? Every team tracks something different.
Measuring tool calls is like measuring commits instead of deployments. Activity isn't delivery.
Binary pass/fail doesn't capture reality. Two agents can "complete" a feature — one nails it, the other barely passes.
When an agent fails mid-task, does it self-correct or spiral in a loop burning tokens? You can't tell until you check the bill.
DORA measures deployments, not commits. AURA applies the same principle: measure the unit of value delivered against a specification.
An agent might complete 50 tasks in pursuit of one thing a user cares about. That one thing, verified against a spec, is what AURA measures.
Tool calls are commits. Deliverables are deployments. Measure value.
The best agents are fast, reliable, and accurate. Optimizing one at the expense of others hides deeper problems.
Without a spec, you can't measure quality. AURA makes specs explicit.
Not leaderboards. Not punishments. Conversations about where to improve.
OTel GenAI semantic conventions. Your infrastructure already works.
Four headline metrics tell the story. Supporting metrics tell you why.
DORA is excellent — if you're shipping software, you should be tracking it. AURA doesn't replace DORA. It fills a gap DORA was never designed to cover.
DORA measures how well your pipeline delivers code to production. It assumes a human wrote the code, a CI system tested it, and a deployment shipped it. AI agents don't have pipelines. They receive a request, reason, call tools, and produce output. The failure modes aren't broken builds — they're hallucinations, spec mismatches, and infinite loops. The quality question isn't "did it deploy safely?" but "did it do what was asked?"
If your agents write code that flows through a DORA-measured pipeline, the two complement each other perfectly: DORA measures the pipeline. AURA measures the agent.
Like DORA, AURA captures the tension between speed and stability.
Same structure. Different domain.
| Aspect | DORA | AURA |
|---|---|---|
| Domain | Software delivery | AI agent performance |
| Unit | Deployment | Deliverable (against spec) |
| Throughput 1 | Deployment Frequency | Feature Frequency |
| Throughput 2 | Lead Time | Feature Lead Time |
| Stability 1 | Change Failure Rate | Human Intervention Rate |
| Stability 2 | MTTR | Recovery Efficiency |
| Data source | CI/CD pipelines | OpenTelemetry traces |
AURA is a metrics framework that measures the reliability and performance of AI agent output. Just as DORA measures software delivery performance through four key metrics, AURA measures agent performance through four: Feature Frequency, Feature Lead Time, Human Intervention Rate, and Recovery Efficiency.
Each deliverable moves through up to seven phases: propose, specs, design, tasks, apply, verify, archive. Not all phases are required for every deliverable.

Feature Frequency is the number of deliverables accepted per unit time.
| Tier | Threshold |
|---|---|
| Elite | ≥3/day |
| High | ≥1/day |
| Medium | ≥1/week |
| Low | <1/week |
Feature Lead Time is the wall-clock time from spec received to deliverable accepted. It includes planning, execution, and verification.
| Tier | Threshold |
|---|---|
| Elite | <1 hour (3,600s) |
| High | <4 hours (14,400s) |
| Medium | <1 day (86,400s) |
| Low | ≥1 day |
Feature Lead Time includes human review time. To isolate agent time, subtract human wait phases from the total.
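A minimal sketch of that subtraction, assuming ISO-8601 timestamps and a list of human-wait spans (the function names and the wait-span shape are illustrative, not part of the spec):

```python
from datetime import datetime

def feature_lead_time(started_at: str, completed_at: str) -> float:
    """Wall-clock seconds from spec received to deliverable accepted."""
    start = datetime.fromisoformat(started_at)
    end = datetime.fromisoformat(completed_at)
    return (end - start).total_seconds()

def agent_lead_time(started_at: str, completed_at: str,
                    human_waits: list[tuple[str, str]]) -> float:
    """Lead time with human review/wait spans subtracted out."""
    waited = sum(
        (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds()
        for start, end in human_waits
    )
    return feature_lead_time(started_at, completed_at) - waited
```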
Human Intervention Rate is the percentage of deliverables that required human takeover or were abandoned without completion.
| Tier | Threshold |
|---|---|
| Elite | <5% |
| High | <10% |
| Medium | <15% |
| Low | ≥15% |
A deliverable that required agent self-correction (recovery) but completed without human involvement is NOT counted. Only deliverables where a human stepped in or the work was abandoned are counted.
Recovery Efficiency is the proportion of total effort spent on rework and retries.
| Tier | Threshold |
|---|---|
| Elite | <5% |
| High | <10% |
| Medium | <20% |
| Low | ≥20% |
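Under the same caveat (illustrative record fields, not the schema), the remaining three headline metrics fall out of a window of deliverable records:

```python
def headline_rates(records: list[dict], window_days: float) -> dict:
    """Feature Frequency, Human Intervention Rate, and Recovery
    Efficiency over one rolling window of deliverable records."""
    accepted = [r for r in records if r["status"] == "completed"]
    # Human takeover or abandonment counts against intervention;
    # pure agent self-correction does not.
    intervened = [r for r in records
                  if r.get("human_takeover") or r["status"] == "failed"]
    total_effort = sum(r.get("total_seconds", 0) for r in records)
    rework = sum(r.get("rework_seconds", 0) for r in records)
    return {
        "feature_frequency_per_day": len(accepted) / window_days,
        "human_intervention_rate": len(intervened) / len(records) if records else 0.0,
        "recovery_efficiency": rework / total_effort if total_effort else 0.0,
    }
```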
A deliverable progresses through phases. Not all phases are required — a minimal deliverable is just start → apply → archive. Phases can repeat (apply → verify → apply → verify → archive).
| Phase | Description | Data Captured |
|---|---|---|
| propose | Deliverable is identified and scoped | Start timestamp, deliverable ID, description |
| specs | Requirements and acceptance criteria defined | Start/end timestamps, requirements count |
| design | Solution approach is planned | Start/end timestamps, design artifacts |
| tasks | Work is broken into discrete tasks | Start/end timestamps, task count |
| apply | Agent executes the work | Start/end timestamps, tool calls, files changed |
| verify | Output is validated against the spec | Start/end timestamps, test results, conformance |
| archive | Deliverable is finalized and recorded | End timestamp, final metrics |
| State | Description |
|---|---|
| planning | In propose, specs, design, or tasks phase |
| executing | In apply phase |
| verifying | In verify phase |
| completed | Archived successfully |
| failed | Abandoned or below conformance threshold |
When a deliverable fails, it should be classified by failure type:
| Failure Type | Description |
|---|---|
| spec_misunderstanding | Agent misinterpreted requirements |
| hallucination | Agent produced fabricated output |
| infinite_loop | Agent got stuck in a retry cycle |
| tool_failure | External tool returned an error the agent couldn't recover from |
| constraint_violation | Output exceeded specified boundaries |
| incomplete | Agent stopped before finishing all requirements |
| regression | Agent's fix broke something that previously worked |
AURA defines three JSON Schema documents (JSON Schema draft 2020-12).
The final record emitted when a deliverable is completed or failed. This is the primary AURA output format.
Required fields: schema_version, change_id, started_at, completed_at, status, metrics
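A hypothetical record carrying just the required fields might look like the following (the shape of `metrics` is illustrative; the schema is authoritative):

```json
{
  "schema_version": "1.0",
  "change_id": "add-user-auth",
  "started_at": "2025-06-01T09:14:00Z",
  "completed_at": "2025-06-01T10:02:31Z",
  "status": "completed",
  "metrics": {
    "lead_time_seconds": 2911,
    "apply_iterations": 1,
    "recovery_attempts": 0
  }
}
```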
Tracks the current state of an in-progress deliverable. Updated as the agent progresses through phases.
Required fields: change_id, status, started_at
Lightweight event records for streaming and real-time collection. Emitted at phase transitions and significant moments during execution.
Event types: phase_start, phase_end, tool_call, recovery, deliverable_start, deliverable_end
Required fields: event_type, timestamp, change_id
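A phase transition, for example, might stream as (fields beyond the three required ones are illustrative):

```json
{
  "event_type": "phase_end",
  "timestamp": "2025-06-01T09:40:12Z",
  "change_id": "add-user-auth",
  "phase": "apply"
}
```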
AURA extends OTel GenAI semantic conventions with the aura.* namespace. All attributes use this prefix to avoid collision with existing OTel conventions.
| Attribute | Type | Description |
|---|---|---|
| aura.deliverable.id | string | Unique deliverable identifier |
| aura.deliverable.type | string | feature, bugfix, refactor, chore |
| aura.deliverable.status | string | planning, executing, verifying, completed, failed |
| aura.deliverable.complexity | string | trivial, simple, moderate, complex |
| aura.spec.framework | string | openspec, jira, github-issue, markdown, prompt |
| aura.spec.id | string | Spec identifier |
| aura.spec.requirements_count | int | Number of requirements extracted |
| aura.phase.name | string | Current phase name |
| aura.phase.iteration | int | Phase iteration count |
| aura.failure.type | string | Failure classification |
| aura.recovery.attempt | int | Current recovery attempt number |
| aura.recovery.total | int | Total recovery attempts |
| aura.agent.name | string | Agent identifier |
| aura.agent.model | string | Model used |
| aura.agent.framework | string | Agent framework |
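A minimal sketch of attaching these attributes to a span with the OpenTelemetry Python SDK (the span name and attribute values are illustrative):

```python
from opentelemetry import trace

tracer = trace.get_tracer("aura-example")

# One apply-phase run, wrapped in a span carrying AURA attributes.
with tracer.start_as_current_span("aura.phase.apply") as span:
    span.set_attribute("aura.deliverable.id", "add-user-auth")
    span.set_attribute("aura.deliverable.type", "feature")
    span.set_attribute("aura.deliverable.status", "executing")
    span.set_attribute("aura.phase.name", "apply")
    span.set_attribute("aura.phase.iteration", 1)
    # ... the agent does the work inside this span ...
```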
| Instrument | Type | Unit | Description |
|---|---|---|---|
| aura.deliverables.count | Counter | {deliverable} | Total deliverables processed |
| aura.deliverables.accepted | Counter | {deliverable} | Deliverables accepted |
| aura.deliverables.failed | Counter | {deliverable} | Deliverables failed |
| aura.resolution_latency | Histogram | s | Resolution latency distribution |
| aura.recovery.attempts | Histogram | {attempt} | Recovery attempts per deliverable |
| aura.tool_calls.count | Counter | {call} | Total tool calls |
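Creating and recording these instruments with the OTel Python metrics API could look like this sketch (instrument names and units follow the table above):

```python
from opentelemetry import metrics

meter = metrics.get_meter("aura-example")

deliverables = meter.create_counter(
    "aura.deliverables.count", unit="{deliverable}",
    description="Total deliverables processed")
accepted = meter.create_counter(
    "aura.deliverables.accepted", unit="{deliverable}",
    description="Deliverables accepted")
latency = meter.create_histogram(
    "aura.resolution_latency", unit="s",
    description="Resolution latency distribution")

# On acceptance of one deliverable:
attrs = {"aura.deliverable.type": "feature"}
deliverables.add(1, attrs)
accepted.add(1, attrs)
latency.record(2911, attrs)  # seconds from start to acceptance
```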
AURA is spec-framework agnostic. Any system that defines acceptance criteria for agent work can be an AURA spec source.
| Source | Spec = | Requirements = | Deliverable Boundary |
|---|---|---|---|
| OpenSpec | Change folder | Delta spec requirements | propose → archive |
| Jira | Ticket | Acceptance criteria | Created → Done |
| GitHub Issue | Issue body | Checklist items | Opened → Closed |
| Markdown | The file | Bullet points | Created → verified |
| User prompt | The prompt text | Implicit (single req) | Prompt → accepted |
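As one concrete mapping, a GitHub Issue adapter might count Markdown checklist items as requirements. A sketch, not a shipped integration:

```python
import re

def github_issue_requirements(issue_body: str) -> list[str]:
    """Treat each checklist item in an issue body as one AURA
    requirement (this count feeds aura.spec.requirements_count)."""
    return re.findall(r"^\s*[-*] \[[ xX]\] +(.+)$", issue_body,
                      flags=re.MULTILINE)

body = "- [ ] login endpoint\n- [x] rate limiting"
assert github_issue_requirements(body) == ["login endpoint", "rate limiting"]
```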
The four headline metrics tell you what's happening. The supporting metrics tell you why. Every supporting metric feeds at least one headline metric — when a headline goes red, you drill into the supporting metrics to find the cause.
Total input tokens, output tokens, cost per deliverable. If an agent uses 50k tokens on a deliverable that should take 10k, the extra 40k is recovery overhead. It also contextualises Feature Frequency — are you shipping more because the agent is efficient, or because it's brute-forcing with expensive models? Token cost per deliverable is the unit economics metric teams will care about most once they're past the "does it work" phase.
Counts by type: Write, Edit, Bash, Read, etc. A healthy deliverable has a predictable ratio of reads to writes. If you see 40 Read calls and 2 Writes, the agent spent most of its time searching. If you see 15 Write calls and 12 Edit calls on the same files, it's rewriting its own work. The pattern tells you where the agent is struggling.
Time spent in each phase: propose, specs, design, tasks, apply, verify. This decomposes Feature Lead Time into its parts. A 4-hour deliverable where 3.5 hours was in apply is very different from one where 2 hours was human review during verify. This is where you find the bottleneck — is the agent slow, or is the human slow to review?
How many times the apply phase ran before acceptance. First-time-right deliverables have zero recovery overhead. Tracking the count over time tells you if your agent is getting better or worse at completing work without rework.
Count of rework cycles within an apply iteration — different from apply iterations. An apply iteration is "the agent stopped, human said try again." A recovery attempt is "the agent hit an error and self-corrected within a single run." High recovery attempts with eventual success means the agent is resilient but inefficient. High recovery attempts with failure means it's thrashing.
Breakdown across: spec_misunderstanding, hallucination, infinite_loop, tool_failure, constraint_violation, incomplete, regression. Categorises why humans had to intervene or deliverables were abandoned. A 12% intervention rate means very different things if it's all tool failures (infrastructure problem) versus all hallucinations (model problem). This is the metric that tells you where to invest.
Times a human had to step in during execution. An agent that completes every deliverable but needs 5 human nudges per run isn't really autonomous. This metric tracks the journey toward full autonomy.
Trivial/simple/moderate/complex breakdown of deliverables. Contextualises Feature Frequency and Feature Lead Time. Shipping 10 trivial deliverables/day is not the same as shipping 2 complex ones. Without this, you can game throughput by splitting work into tiny pieces.
Every supporting metric feeds at least one headline metric, and most feed two.
Summary of all performance tiers across the four metrics. Tier classification uses the most recent rolling window — recommended default is 7 days or 20 deliverables, whichever comes first.
| Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Feature Frequency | ≥3/day | ≥1/day | ≥1/week | <1/week |
| Feature Lead Time | <1 hour | <4 hours | <1 day | ≥1 day |
| Human Intervention Rate | <5% | <10% | <15% | ≥15% |
| Recovery Efficiency | <5% | <10% | <20% | ≥20% |
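A sketch of mapping a window's measured values onto these tiers (thresholds copied from the table above; the function itself is illustrative):

```python
def classify(freq_per_day: float, lead_time_s: float,
             intervention_rate: float, recovery_overhead: float) -> dict:
    """Assign Elite/High/Medium/Low tiers from the thresholds above."""
    def tier(value, elite, high, medium, lower_is_better=True):
        if lower_is_better:
            if value < elite:
                return "Elite"
            if value < high:
                return "High"
            return "Medium" if value < medium else "Low"
        if value >= elite:
            return "Elite"
        if value >= high:
            return "High"
        return "Medium" if value >= medium else "Low"

    return {
        "feature_frequency": tier(freq_per_day, 3, 1, 1 / 7,
                                  lower_is_better=False),
        "feature_lead_time": tier(lead_time_s, 3_600, 14_400, 86_400),
        "human_intervention_rate": tier(intervention_rate, 0.05, 0.10, 0.15),
        "recovery_efficiency": tier(recovery_overhead, 0.05, 0.10, 0.20),
    }
```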
AURA is an early draft, not a finished product. The specification, metrics, and data model are open for discussion. Read it, challenge it, and help make it useful.
AURA extends OTel's GenAI semantic conventions. Your existing infrastructure, exporters, and backends all work.
AURA measures how well agents deliver against specs. It doesn't care how those specs are written. Use whatever fits your workflow.
Spec-driven development for AI coding assistants. Each change gets a structured folder — proposal, requirements, design, tasks. AURA reads the requirements and measures conformance. The example in this repo uses OpenSpec.
A thorough toolkit for spec-driven development with phase gates and structured templates. Heavier than OpenSpec, well-suited for enterprise teams that need more process.
A spec doesn't need a framework. A Jira ticket with acceptance criteria, a Notion doc with requirements, or even a plain Markdown file — all work.
A vague prompt gives you a binary signal. A structured spec with 5 explicit requirements gives you a precise conformance gradient. AURA rewards specificity.
Integrate AURA into your agent toolchain using one of the available plugins.
Instruments Claude Code agent sessions with AURA metrics. Tracks deliverables, phases, and tool calls automatically via Claude Code hooks, emitting OpenTelemetry spans to your collector.
aura-metrics-claude-code-plugin →

AURA grew out of work on AgentEx — the idea that the developer experience problems we spent a decade solving for humans (slow feedback loops, environment drift, deployment-gated testing) hit AI agents even harder. A human waiting three minutes for a deploy can context-switch. An agent just burns tokens doing nothing. The same fixes accelerate both, but for agents the shift is qualitative, not just quantitative.
I'm Eamonn Faherty. I've been working on improving the developer experience for agents, which in turn improves it for the humans using them. Part of that portfolio is Local Web Services — a tool that reads your AWS CDK cloud assembly and recreates your entire application locally, giving agents (and humans) a tight inner loop measured in seconds instead of minutes.
While building it, I kept asking the same question: how do I know the agents are actually getting better? I needed a way to measure throughput, stability, and quality — something like DORA, but for agent deliverables. That's how AURA started. You can read more about the AgentEx concept in the original article.
AURA is a draft proposal. The goal is to establish a shared standard for measuring AI agent performance. Your input shapes what it becomes.