// METHODOLOGY · PILLAR 02
Evaluator calibration.
Two observers, given the same performance, will score it differently. The discipline of calibration is what closes that gap — not as a one-time exercise, but as an operational practice.
Evaluator calibration is the continuous practice of measuring and correcting variance between evaluators so that a given score means the same thing across observers, units, and cycles. Without calibration, the scoring system measures the observer. With calibration, it measures performance.
// 01 — THE PROBLEM
Why uncalibrated evaluators corrupt readiness data.
Give the same performance to two observers and you will get two scores. Sometimes the gap is small. Sometimes it is the difference between meeting the standard and failing it.
Without calibration, the variance is invisible. The aggregate score looks like a measurement of unit performance. It is actually a measurement of which observers were assigned to which units.
This is the failure mode that audit-defensible readiness systems are built to prevent. It is not exotic. It appears wherever human observers score performance in high-consequence domains.
// 02 — THE METRIC
Inter-rater reliability as an operational signal.
Inter-rater reliability (IRR) is the formal measure of agreement between independent evaluators scoring the same performance against the same rubric. High IRR means the rubric and the calibration practice are producing the same score for the same performance.
OCTAAR treats IRR as an operational metric, not a research artifact. It is monitored across the cycle, surfaced as a leading indicator, and used to trigger evaluator recalibration before drift becomes a finding.
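As a rough illustration of how agreement between two evaluators can be quantified, the sketch below computes Cohen's kappa, one common IRR statistic that corrects raw agreement for agreement expected by chance. The text does not say which statistic OCTAAR uses, so the choice of kappa, the function name `cohens_kappa`, and the T/P/U rating labels are all assumptions made for the example.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items (illustrative sketch)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: share of items where both raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: probability both raters pick the same category
    # if each scored independently according to their own category frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical category; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of eight performances (T/P/U are assumed labels).
a = ["T", "T", "P", "U", "T", "P", "U", "T"]
b = ["T", "P", "P", "U", "T", "P", "T", "T"]
print(round(cohens_kappa(a, b), 2))  # 0.6
```

Here the two raters agree on six of eight items (0.75 raw agreement), but 0.375 agreement would be expected by chance, so kappa is 0.6. Tracking a chance-corrected figure like this over a cycle is one way an agreement measure could serve as the kind of leading indicator described above.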
// 03 — THE PRACTICE
Calibration as a continuous loop.
Periodic calibration (once a year, before the exercise season, in a classroom) is necessary but not sufficient. Drift accumulates between sessions. The score that was calibrated in March means something different by November.
Continuous calibration runs inside the operational cycle. Per-evaluator variance is tracked against the calibrated baseline. Out-of-tolerance scoring is flagged and routed to a calibrating evaluator. Drift events become findings against the evaluator pool, with assigned recalibration and closure verification — the same chain of custody applied to evaluators as to units.
The observer is part of the instrument. The instrument is calibrated continuously, or it is not calibrated.
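The tracking-and-flagging step of that loop can be sketched in a few lines. This is a minimal illustration, not OCTAAR's implementation: it assumes each observation pairs an evaluator's score with a reference score from a calibrated baseline, and it uses mean signed deviation with a fixed tolerance as the drift test. The function name `flag_drift`, the tuple layout, and the 0.5 tolerance are all hypothetical.

```python
from statistics import mean


def flag_drift(observations, tolerance=0.5):
    """Return per-evaluator bias and the evaluators outside tolerance.

    observations: iterable of (evaluator_id, score, reference_score) tuples,
    where reference_score is the calibrated baseline for that performance
    (an assumed data shape for this sketch).
    """
    deviations = {}
    for evaluator, score, reference in observations:
        deviations.setdefault(evaluator, []).append(score - reference)

    # Mean signed deviation: positive means the evaluator scores high
    # relative to the baseline, negative means low.
    bias = {evaluator: mean(values) for evaluator, values in deviations.items()}

    # Evaluators whose average deviation exceeds the tolerance in either direction.
    flagged = sorted(e for e, b in bias.items() if abs(b) > tolerance)
    return bias, flagged


# Hypothetical observations: (evaluator, score given, calibrated reference score).
observations = [
    ("E1", 3, 3), ("E1", 4, 4),
    ("E2", 5, 3), ("E2", 4, 3),
    ("E3", 2, 3), ("E3", 3, 3),
]
bias, flagged = flag_drift(observations)
print(flagged)  # ['E2']
```

In this example E2 scores 1.5 points above the baseline on average and is flagged, while E3 sits exactly at the tolerance boundary and is not. A real system would route each flagged evaluator to recalibration and verify closure, as described above. It would probably also look at the spread of deviations, since a single mean can hide inconsistent scoring.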
// 04 — THE OUTCOME
What a calibrated evaluator pool changes.
A calibrated evaluator pool means a finding from one cycle is comparable to a finding from the next, even with different observers. It means readiness drift detected in one battalion is comparable to readiness drift detected in another. It means the longitudinal benchmark is honest.
Without calibration, none of those things are true. With calibration, the system of record produces decision-grade data.
// Last updated · · OCTAAR Methodology Team
// FAQ
Direct answers.
What is acceptable inter-rater reliability?
Does OCTAAR train evaluators?
How often should evaluator calibration happen?
What if our evaluators are senior officers who don't want to be 'calibrated'?