OrchestraBench
A long-horizon campaign benchmark for production-grade multi-agent orchestration.
v0.1 is a claim-staking position paper accompanied by a working release. It proposes seven capability dimensions, measured jointly across a multi-day campaign of more than one hundred heterogeneous tasks with mid-run steering events and embedded action-based safety traps:
- Context Continuity
- Error Containment
- Bounded Execution
- Authority Calibration
- Observability
- Coordination Efficiency
- Frame Integrity
v0.1 ships the open-source SDK, the open-source evaluator, a deployed Campaign API, a first reference baseline campaign, and a public development task set. Only the frozen private test set and the public competitive leaderboard are deferred to v0.2.
Methodology in brief
Seven capability dimensions, jointly evaluated.
The seven capability dimensions are Context Continuity (carrying state across long sequences without drift or loss), Error Containment (recovering from local failures without cascading), Bounded Execution (respecting step, tool, and resource budgets), Authority Calibration (escalating, deferring, or proceeding under appropriate authority), Observability (producing traces a human or third agent can audit), Coordination Efficiency (achieving outcomes with proportionate inter-agent traffic), and Frame Integrity (holding role, scope, and task framing under steering pressure).
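As a minimal sketch of what "jointly evaluated" could mean for a result record, the Python below enumerates the seven dimensions and checks that a scorecard reports all of them. The names `Dimension`, `missing_dimensions`, and `is_joint_scorecard`, and the 0.0 to 1.0 score scale, are assumptions made for illustration; they are not taken from the OrchestraBench SDK or evaluator, and no aggregation rule is implied.

```python
from enum import Enum
from typing import Dict, List


class Dimension(Enum):
    """The seven capability dimensions named in the text."""
    CONTEXT_CONTINUITY = "Context Continuity"
    ERROR_CONTAINMENT = "Error Containment"
    BOUNDED_EXECUTION = "Bounded Execution"
    AUTHORITY_CALIBRATION = "Authority Calibration"
    OBSERVABILITY = "Observability"
    COORDINATION_EFFICIENCY = "Coordination Efficiency"
    FRAME_INTEGRITY = "Frame Integrity"


def missing_dimensions(scores: Dict[Dimension, float]) -> List[Dimension]:
    """Return the dimensions a scorecard does not report, in declaration order."""
    return [d for d in Dimension if d not in scores]


def is_joint_scorecard(scores: Dict[Dimension, float]) -> bool:
    """Hypothetical check: a joint result reports every dimension,
    each on an assumed 0.0-1.0 scale (the scale is not from the source)."""
    return not missing_dimensions(scores) and all(
        0.0 <= value <= 1.0 for value in scores.values()
    )
```

A scorecard that omits any one dimension fails the check, which mirrors the idea that the dimensions are reported together rather than cherry-picked.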
A campaign runs for seven days and serves between 100 and 150 heterogeneous tasks sequentially. Mid-run steering events redirect priorities; embedded action-based safety traps probe authority calibration and frame integrity in context; and cross-task lineage references are mandatory, so state established in earlier tasks must be carried forward rather than re-derived.
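The campaign structure described above can be sketched as a small data model with a consistency check. Everything here is hypothetical: the `Task`, `SteeringEvent`, and `validate_campaign` names, the `is_safety_trap` flag, and the reading of "mandatory lineage" as "every task after the first references at least one earlier task" are assumptions for illustration, not the Campaign API.

```python
from dataclasses import dataclass, field
from typing import List

MIN_TASKS, MAX_TASKS = 100, 150  # task-count range stated in the text


@dataclass
class Task:
    task_id: str
    lineage: List[str] = field(default_factory=list)  # IDs of earlier tasks referenced
    is_safety_trap: bool = False  # hypothetical marker for an embedded trap


@dataclass
class SteeringEvent:
    before_task_index: int  # fires just before the task at this position
    note: str


def validate_campaign(tasks: List[Task], events: List[SteeringEvent]) -> List[str]:
    """Return human-readable problems; an empty list means the plan is consistent."""
    problems = []
    if not MIN_TASKS <= len(tasks) <= MAX_TASKS:
        problems.append(f"expected {MIN_TASKS}-{MAX_TASKS} tasks, got {len(tasks)}")
    seen = set()
    for position, task in enumerate(tasks):
        # Assumed reading of "mandatory": only the first task may have no lineage.
        if position > 0 and not task.lineage:
            problems.append(f"{task.task_id}: no lineage reference")
        # Tasks are served sequentially, so a reference must point backwards.
        for ref in task.lineage:
            if ref not in seen:
                problems.append(f"{task.task_id}: lineage {ref!r} is not an earlier task")
        seen.add(task.task_id)
    for event in events:
        # "Mid-run" is taken to mean strictly after the first task and before the end.
        if not 0 < event.before_task_index < len(tasks):
            problems.append(f"steering event {event.note!r} is not mid-run")
    return problems
```

A plan of 120 chained tasks with one steering event at position 60 would pass, while a task that references a later task ID would be reported, since sequential serving means that state does not exist yet.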
Read §3 (Related Work) and §4 (the seven dimensions) in the paper. Open the v0.1 paper on arXiv →
Roadmap
What ships when.
Disclosure
Conflict of interest.
OrchestraBench is operated by OrchestraBench, LLC, a separate legal entity from Prodocloud, LLC. Prodocloud operates Ultradian, the AI research venture developing Cognio, an agent-orchestration framework currently in active development through multiple prototype iterations. OrchestraBench originated from the practical question of how to evaluate Cognio; the seven capability dimensions presented in v0.1 were redeveloped from scratch through a grounded-theory-style discovery process. The author retains ownership of OrchestraBench, LLC and discloses this conflict openly. To make the conflict structurally manageable rather than rhetorically managed, OrchestraBench pre-commits to:
- an open-source evaluator;
- a two-layer frozen test set with an independent co-author post-submission integrity audit;
- identical campaign-attempt fees for any Ultradian-affiliated submission;
- full publication of all attempts, including failures;
- operation under a separate legal entity with procedural mitigation.
Contact & citation
Get in touch. Cite the paper.
Corresponding author: Hifzullah Celik — hifzullah@orchestrabench.ai.
@misc{TO-FILL-ARXIV-CITEKEY,
  title         = {OrchestraBench: A Long-Horizon Campaign Benchmark for Production-Grade Multi-Agent Orchestration},
  author        = {Celik, Hifzullah},
  year          = {2026},
  eprint        = {TO-FILL-ARXIV-ID},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  note          = {Position paper. v0.1.},
  url           = {TO-FILL-ARXIV-URL}
}