// METHODOLOGY · PILLAR 02

Evaluator calibration.

Two observers, given the same performance, will score it differently. The discipline of calibration is what closes that gap — not as a one-time exercise, but as an operational practice.

Request demo Read the methodology

Evaluator calibration — Evaluator calibration is the continuous practice of measuring and correcting variance between evaluators so a given score means the same thing across observers, units, and cycles. Without calibration, the scoring system measures the observer. With calibration, the scoring system measures performance.

// 01 — THE PROBLEM

Why uncalibrated evaluators corrupt readiness data.

Give the same performance to two observers and you will get two scores. Sometimes the gap is small. Sometimes it is the difference between meeting the standard and failing it.

Without calibration, the variance is invisible. The aggregate score looks like a measurement of unit performance. It is actually a measurement of which observers were assigned to which units.

This is the failure mode that audit-defensible readiness systems are built to close. It is not exotic. It is universal across high-consequence domains.

// 02 — THE METRIC

Inter-rater reliability as an operational signal.

Inter-rater reliability (IRR) is the formal measure of agreement between independent evaluators scoring the same performance against the same rubric. High IRR means the rubric and the calibration practice are producing the same score for the same performance.

OCTAAR treats IRR as an operational metric, not a research artifact. It is monitored across the cycle, surfaced as a leading indicator, and used to trigger evaluator recalibration before drift becomes a finding.

// 03 — THE PRACTICE

Calibration as a continuous loop.

Periodic calibration — once a year, before the exercise season, in a classroom — is necessary and insufficient. Drift accumulates between sessions. The score that was calibrated in March means something different by November.

Continuous calibration runs inside the operational cycle. Per-evaluator variance is tracked against the calibrated baseline. Out-of-tolerance scoring is flagged and routed to a calibrating evaluator. Drift events become findings against the evaluator pool, with assigned recalibration and closure verification — the same chain of custody applied to evaluators as to units.

The observer is part of the instrument. The instrument is calibrated continuously, or it is not calibrated.

// 04 — THE OUTCOME

What a calibrated evaluator pool changes.

A calibrated evaluator pool means a finding from one cycle is comparable to a finding from the next, even with different observers. It means readiness drift detected in one battalion is comparable to readiness drift detected in another. It means the longitudinal benchmark is honest.

Without calibration, none of those things are true. With calibration, the system of record produces decision-grade data.

// READ NEXT

Calibrated rubric design

The instrument the calibrated evaluator uses.

Readiness drift detection

The unit-level signal calibration enables.

Methodology overview

All six pillars.

Glossary: inter-rater reliability

Canonical definition.

// Last updated · May 19, 2026 · OCTAAR Methodology Team

// FAQ

Direct answers.

What is acceptable inter-rater reliability?

Domain-specific. OCTAAR surfaces the metric and lets the operator define the threshold against their published task standards — the same way the rubric scale is defined by the operator, not the platform.

Does OCTAAR train evaluators?

OCTAAR surfaces who needs recalibration, on which rubric items, against which baseline. The training itself is delivered by the operator's training program — often through the LMS — with closure recorded back into OCTAAR as evidence.

How often should evaluator calibration happen?

Continuously, in the sense that per-evaluator variance is tracked every cycle. Formal recalibration sessions happen at the cadence the operator's program requires — typically pre-cycle and on drift-event trigger.

What if our evaluators are senior officers who don't want to be 'calibrated'?

Calibration is not a comment on judgment. It is a property of an instrument. Senior evaluators carry the most weight in the scoring system; calibrating them is how their weight stays defensible.

// READY

See the discipline as an operating cycle.

Request Operational Readiness Demo All resources

// OR · CONTACT THE TEAM