v0.1 — Paper + working release

OrchestraBench

A long-horizon campaign benchmark for production-grade multi-agent orchestration.

v0.1 is a claim-staking position paper accompanied by a working release. It proposes seven capability dimensions, measured jointly across a multi-day campaign of more than one hundred heterogeneous tasks with mid-run steering events and embedded action-based safety traps:

  • Context Continuity
  • Error Containment
  • Bounded Execution
  • Authority Calibration
  • Observability
  • Coordination Efficiency
  • Frame Integrity

v0.1 ships the open-source SDK, the open-source evaluator, a deployed Campaign API, a first reference baseline campaign, and a public development task set; only the frozen private test set and the public competitive leaderboard follow in v0.2.

v0.1 · paper + working release · operator OrchestraBench, LLC · corresponding author Hifzullah Celik, Ultradian

Methodology in brief

Seven capability dimensions, jointly evaluated.

The seven capability dimensions are Context Continuity (carrying state across long sequences without drift or loss), Error Containment (recovering from local failures without cascading), Bounded Execution (respecting step, tool, and resource budgets), Authority Calibration (escalating, deferring, or proceeding under appropriate authority), Observability (producing traces a human or third agent can audit), Coordination Efficiency (achieving outcomes with proportionate inter-agent traffic), and Frame Integrity (holding role, scope, and task framing under steering pressure).

A campaign runs for seven days and serves between one hundred and one hundred fifty heterogeneous tasks sequentially. Mid-run steering events redirect priorities; embedded action-based safety traps probe authority calibration and frame integrity in context; and cross-task lineage references are mandatory, so state established in earlier tasks must be carried forward rather than re-derived.
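
As a rough illustration of the campaign shape described above, the following Python sketch models a campaign as plain data and checks the stated constraints: seven days, 100 to 150 sequential tasks, and lineage references that point only to earlier tasks. Every name here (`Dimension`, `Task`, `Campaign`, `steering_after`, `demo_campaign`) is hypothetical and chosen for illustration; none of it is taken from the OrchestraBench SDK or Campaign API, and reading "mandatory lineage" as "must resolve to an earlier task" is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum


class Dimension(Enum):
    """The seven capability dimensions named in the text."""
    CONTEXT_CONTINUITY = "Context Continuity"
    ERROR_CONTAINMENT = "Error Containment"
    BOUNDED_EXECUTION = "Bounded Execution"
    AUTHORITY_CALIBRATION = "Authority Calibration"
    OBSERVABILITY = "Observability"
    COORDINATION_EFFICIENCY = "Coordination Efficiency"
    FRAME_INTEGRITY = "Frame Integrity"


@dataclass
class Task:
    # Hypothetical task record; field names are illustrative only.
    task_id: str
    lineage: list[str] = field(default_factory=list)  # ids of earlier tasks referenced
    is_safety_trap: bool = False


@dataclass
class Campaign:
    duration_days: int
    tasks: list[Task]
    # Ids of tasks after which a steering event fires (assumed representation).
    steering_after: set[str] = field(default_factory=set)

    def validate(self) -> list[str]:
        """Return a list of constraint violations; empty means the shape is valid."""
        problems: list[str] = []
        if self.duration_days != 7:
            problems.append("campaign must run for seven days")
        if not 100 <= len(self.tasks) <= 150:
            problems.append("campaign must serve 100 to 150 tasks")
        seen: set[str] = set()
        for task in self.tasks:  # tasks are served sequentially, in list order
            for ref in task.lineage:
                if ref not in seen:
                    problems.append(
                        f"{task.task_id}: lineage {ref!r} is not an earlier task"
                    )
            seen.add(task.task_id)
        for task_id in sorted(self.steering_after - seen):
            problems.append(f"steering event anchored to unknown task {task_id!r}")
        return problems


def demo_campaign(n_tasks: int) -> Campaign:
    """Build a toy campaign in which each task references its predecessor."""
    tasks = [Task("t0")]
    for i in range(1, n_tasks):
        tasks.append(
            Task(f"t{i}", lineage=[f"t{i - 1}"], is_safety_trap=(i % 25 == 0))
        )
    return Campaign(duration_days=7, tasks=tasks, steering_after={"t40", "t90"})


if __name__ == "__main__":
    print(demo_campaign(120).validate())  # no violations expected for this toy input
```

Treating lineage as an ordered dependency check is one simple way to express "carried forward rather than re-derived"; the actual evaluator may represent and score lineage quite differently.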

Read §3 (Related Work) and §4 (the seven dimensions) in the paper. Open the v0.1 paper on arXiv →

Roadmap

What ships when.

v0.1 · Available now
Methodology + seven capability dimensions + campaign architecture + COI disclosure with procedural pre-commitments; open-source SDK (Apache-2.0); open-source evaluator (Apache-2.0, production task generators held privately and released post-window for audit replay); deployed Campaign API; first reference baseline campaign (Cognio v0); public development task set of ~650 example tasks. (This paper + accompanying release.)

v0.2 · Forthcoming
Frozen private test set with independent co-author post-submission integrity audit; public competitive leaderboard frontend; additional L3 baselines beyond the v0.1 reference baseline.

v0.3 · Planned
Contested tier: the same seven dimensions under harsher conditions (adversarial environments, simulated human-in-the-loop, multi-modal coordination, and dynamic agent capability).

v0.4+ · Planned
Inter-framework orchestration; continuous operation across thirty-day and ninety-day campaigns.

Disclosure

Conflict of interest.

OrchestraBench is operated by OrchestraBench, LLC, a separate legal entity from Prodocloud, LLC — the operator of Ultradian, the AI research venture developing Cognio, an agent-orchestration framework currently in active development through multiple prototype iterations. OrchestraBench originated from the practical question of how to evaluate Cognio; the seven capability dimensions presented in v0.1 were redeveloped from scratch through a grounded-theory–style discovery process. The author retains ownership of OrchestraBench, LLC and discloses this conflict openly. To make it structurally manageable rather than rhetorically managed, OrchestraBench pre-commits to: an open-source evaluator; a two-layer frozen test set with independent co-author post-submission integrity audit; identical campaign-attempt fees for any Ultradian-affiliated submission; full publication of all attempts, including failures; and operation under a separate legal entity with procedural mitigation.

Read full §8 in the paper →

Contact & citation

Get in touch. Cite the paper.

Corresponding author: Hifzullah Celik, hifzullah@orchestrabench.ai

BibTeX
@misc{TO-FILL-ARXIV-CITEKEY,
  title         = {OrchestraBench: A Long-Horizon Campaign Benchmark for Production-Grade Multi-Agent Orchestration},
  author        = {Celik, Hifzullah},
  year          = {2026},
  eprint        = {TO-FILL-ARXIV-ID},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  note          = {Position paper. v0.1.},
  url           = {TO-FILL-ARXIV-URL}
}