
Calibrate AI Grading with Cohen’s Kappa (Practical Guide)
When you point an AI grader at real student work, the first QA question is reliability: “How closely do AI scores agree with trained humans on the same essays?” The classic tool for this is Cohen’s κ (kappa)—a chance-corrected agreement statistic for two raters. In this guide you’ll learn exactly what κ measures, how to plan a defensible double-rating sample, how to compute κ (worked example + CSV), what thresholds to use in practice, and what to report for accreditation.
TL;DR: Kappa measures agreement, not accuracy. Use weighted κ for ordered rubric levels, watch for prevalence effects, and pair κ with targeted human spot checks. See our HITL workflow for operational guards.
What Kappa measures (and doesn’t)
Definition. Cohen’s κ compares the observed agreement \(p_o\) to the agreement expected by chance \(p_e\) from the raters’ own label distributions:
\[\kappa = \frac{p_o - p_e}{1 - p_e}.\]
It’s 1 for perfect agreement, 0 when agreement equals chance, and can be negative if raters systematically disagree. Educational Testing Standards
Weighted κ for rubrics. For ordered categories (e.g., 1–4 rubric levels), use weighted κ (linear or quadratic) so “off by 1” is penalized less than “off by 3.” Cohen introduced weighted κ for “partial credit” scenarios—exactly our rubric case. personality-project.org
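Concretely, for a k-level rubric the usual agreement weights are
\[w^{\text{lin}}_{ij} = 1 - \frac{|i-j|}{k-1}, \qquad w^{\text{quad}}_{ij} = 1 - \frac{(i-j)^2}{(k-1)^2},\]
and weighted κ applies the same formula after replacing \(p_o\) and \(p_e\) with \(p_o^w = \sum_{i,j} w_{ij} C_{ij}/N\) and \(p_e^w = \sum_{i,j} w_{ij}\, r_i c_j / N^2\), where \(C_{ij}\) counts essays the human rated \(i\) and the AI rated \(j\), and \(r_i, c_j\) are the row and column totals.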
Agreement ≠ accuracy. κ says how consistently AI matches a reference rater; it does not prove either party is correct. Treat κ as a calibration check, not ground truth. Educational Testing Standards
Prevalence & bias effects. With very common/rare labels, you can see “high % agreement but low κ,” or vice-versa—the famous kappa paradoxes. Inspect marginals and report both % agreement and κ. PubMed
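To see the paradox in miniature (numbers invented purely for illustration): suppose 100 essays get a binary meets/doesn’t-meet call, the human marks 95 as meeting the standard, the AI marks 94, and they agree on 91 of the 100 (90 “meets” plus 1 “doesn’t”). Then
\[p_o = 0.91, \qquad p_e = 0.95 \times 0.94 + 0.05 \times 0.06 = 0.896, \qquad \kappa = \frac{0.91 - 0.896}{1 - 0.896} \approx 0.13,\]
so 91% raw agreement collapses to κ ≈ 0.13 because nearly all of that agreement is already expected from the skewed marginals.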
Alternatives (know when to switch). For multi-rater panels and missing data, Krippendorff’s α is often preferred; it generalizes across scales and handles gaps. Use κ for two raters (AI vs instructor) and α when your design is more complex. Oxford Academic
A sampling plan for double ratings (pragmatic + defensible)
- Scope: Calibrate per assignment and rubric criterion (e.g., Thesis, Evidence, Organization), not just an overall score. Educational Testing Standards
- How many to double-rate? Formal sample-size tables depend on the number of categories, target κ, and CI width; typical guidance suggests at least 50–100 artifacts for stable estimates, more if classes are imbalanced. Sim & Wright (2005) and Bujang & Baharum (2017) provide worked planning rules; Rotondi & Donner give CI-based formulas and an R package (kappaSize). Oxford Academic
- Operational rule of thumb: Double-rate 10–20% of submissions or ≥50 artifacts, whichever is larger, stratified by section and oversampling borderline/low-confidence AI cases (a sampling sketch follows this list). This balances precision with instructor time and lines up with auditing expectations in the AERA–APA–NCME Standards (document intended use and ongoing checks). Educational Testing Standards
- Blinding: Ensure the human rater does not see the AI’s score or rationale when producing the reference rating. Educational Testing Standards
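A minimal pandas sketch of the rule of thumb above, assuming a gradebook export with section and ai_confidence columns; the column names, fractions, and confidence cutoff are illustrative assumptions to adapt.

```python
import pandas as pd

def pick_double_rating_sample(df: pd.DataFrame, frac: float = 0.15,
                              floor: int = 50, low_conf: float = 0.6,
                              seed: int = 0) -> pd.DataFrame:
    """Select essays to double-rate: at least max(frac*N, floor), with all
    low-confidence AI cases included and the remainder stratified by section."""
    target = min(max(int(round(frac * len(df))), floor), len(df))

    # Borderline / low-confidence AI scores always go to a human.
    borderline = df[df["ai_confidence"] < low_conf]

    # Fill the rest proportionally across sections.
    rest = df.drop(borderline.index)
    remainder = target - len(borderline)
    if remainder <= 0 or rest.empty:
        return borderline
    fill_frac = min(remainder / len(rest), 1.0)
    fill = rest.groupby("section").sample(frac=fill_frac, random_state=seed)
    return pd.concat([borderline, fill])
```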
Computing κ (worked example you can copy)
Below is a toy dataset (80 essays, 4-level rubric). You can download the CSV and replicate the calculation.
Download: cohens-kappa-example.csv
Confusion matrix (Human rows × AI columns)
| Human \ AI | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 1 | 11 | 0 | 0 | 0 |
| 2 | 3 | 23 | 3 | 0 |
| 3 | 0 | 5 | 26 | 2 |
| 4 | 0 | 0 | 4 | 3 |
From this table:
- Observed agreement \(p_o = 0.7875\)
- Expected agreement \(p_e = 0.3266\) (from the marginals)
- Unweighted κ = 0.6845
- Linear-weighted κ = 0.7648
- Quadratic-weighted κ = 0.8494
(Weights reduce the penalty for small disagreements—appropriate for ordered rubric levels.) Scikit-learn
Mini calculator (paste data):
Step 1 — Build confusion matrix (counts C_ij)
Step 2 — Observed: p_o = (Σ_i C_ii) / N
Step 3 — Expected: p_e = Σ_i (row_i_total/N) * (col_i_total/N)
Step 4 — Kappa: κ = (p_o - p_e) / (1 - p_e)
Weighted: use agreement weights w_ij (linear: w_ij = 1 − |i−j|/(k−1); quadratic: w_ij = 1 − (i−j)²/(k−1)²), then replace Steps 2–3 with p_o = Σ_ij w_ij * C_ij / N and p_e = Σ_ij w_ij * (row_i_total/N) * (col_j_total/N); Step 4 is unchanged.
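As a check, here is a small NumPy implementation of these steps run on the confusion matrix above; it reproduces the κ values reported in the worked example.

```python
import numpy as np

# Confusion matrix from the worked example (human rows x AI columns, levels 1-4).
C = np.array([[11,  0,  0, 0],
              [ 3, 23,  3, 0],
              [ 0,  5, 26, 2],
              [ 0,  0,  4, 3]], dtype=float)

def kappa_from_confusion(C, weighting=None):
    """Cohen's kappa from a square confusion matrix.
    weighting: None (unweighted), "linear", or "quadratic"."""
    k, n = C.shape[0], C.sum()
    i, j = np.indices((k, k))
    if weighting is None:
        w = (i == j).astype(float)                   # exact matches only
    elif weighting == "linear":
        w = 1 - np.abs(i - j) / (k - 1)
    else:                                            # "quadratic"
        w = 1 - ((i - j) / (k - 1)) ** 2
    p_o = (w * C).sum() / n                                  # Step 2 (weighted)
    p_e = (w * np.outer(C.sum(1), C.sum(0))).sum() / n**2    # Step 3 (weighted)
    return (p_o - p_e) / (1 - p_e)                           # Step 4

print(round(kappa_from_confusion(C), 4))               # 0.6845
print(round(kappa_from_confusion(C, "linear"), 4))     # 0.7648
print(round(kappa_from_confusion(C, "quadratic"), 4))  # 0.8494
```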
Tip: For production pipelines, compute per-criterion weighted κ (e.g., Thesis κ_w, Evidence κ_w) and an overall κ_w. Log model name/version + prompt/rubric hash with each batch for auditing. (Educational Testing Standards)
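A sketch of what such a per-batch log entry could look like; the field names, hashing choice, and schema are illustrative assumptions, not a required format.

```python
import datetime
import hashlib
import json
from sklearn.metrics import cohen_kappa_score

def calibration_record(batch_id, model_name, model_version, prompt_text,
                       rubric_text, labels_by_criterion):
    """labels_by_criterion: {"Thesis": (human_labels, ai_labels), ...}"""
    record = {
        "batch_id": batch_id,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model": {"name": model_name, "version": model_version},
        "prompt_sha256": hashlib.sha256(prompt_text.encode()).hexdigest(),
        "rubric_sha256": hashlib.sha256(rubric_text.encode()).hexdigest(),
        "kappa_w": {
            crit: round(cohen_kappa_score(h, a, weights="quadratic"), 3)
            for crit, (h, a) in labels_by_criterion.items()
        },
    }
    return json.dumps(record)
```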
Thresholds and decisions (what’s “good enough”?)
There is no universal cut-point, but two common references appear in the literature:
- Landis & Koch (1977): 0.41–0.60 moderate, 0.61–0.80 substantial, 0.81–1.00 almost perfect. Useful as a quick lexicon but widely criticized as arbitrary. NCBI
- McHugh (2012): tougher labels (e.g., 0.80–0.90 strong; >0.90 almost perfect). Emphasizes stakes-sensitive targets. PubMed Central
Our pragmatic policy (course-level, low/medium stakes):
- Aim for quadratic-weighted κ ≥ 0.75 before auto-releasing grades.
- If κ_w is 0.60–0.74, enable human sampling (e.g., 20%) + borderline review; show AI rationale but require instructor acceptance.
- If κ_w < 0.60, revise rubric prompts and/or model settings, retrain human raters (anchor papers), and recalibrate before deployment.
- Always publish confidence intervals for κ/κ_w in faculty QA notes. Oxford Academic
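One way to produce those intervals is a percentile bootstrap over the double-rated essays. A minimal sketch follows; for small or heavily imbalanced samples, an analytic standard error or the kappaSize planning tools may be preferable.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def kappa_with_ci(human, ai, weights="quadratic", n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile-bootstrap CI for (weighted) Cohen's kappa."""
    human, ai = np.asarray(human), np.asarray(ai)
    rng = np.random.default_rng(seed)
    n = len(human)
    boot = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample essays with replacement
        boot.append(cohen_kappa_score(human[idx], ai[idx], weights=weights))
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return cohen_kappa_score(human, ai, weights=weights), (lo, hi)
```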
Remember the paradoxes. When categories are imbalanced (many “3”s), κ can look worse than % agreement. Report both and include the confusion matrix so reviewers can judge prevalence effects. PubMed
Acting on results (revise rubric? add sampling?)
Use κ to trigger actions, not as an end in itself:
- Low κ on one criterion (e.g., Evidence): tighten that criterion’s descriptors and exemplars; add targeted few-shot examples to your AI prompt; re-anchor with human raters. Educational Testing Standards
- Low κ overall + high prevalence skew: consider weighted κ, adjust category cut-points, and ensure the AI isn’t collapsing mid-levels; add sampling until κ recovers. personality-project.org, PubMed
- κ drops after a model update: version drift—lock model versions for a term or re-calibrate when vendors ship new releases. Document per Standards. Educational Testing Standards
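A minimal sketch of that drift check, assuming you keep the last accepted κ_w as a baseline; the thresholds mirror the policy above and are meant to be adjusted, not prescriptive.

```python
from sklearn.metrics import cohen_kappa_score

def check_drift(human, ai, baseline_kappa_w, drop_tolerance=0.05):
    """Recompute quadratic-weighted kappa for the latest calibration batch and
    flag recalibration if it falls materially below the accepted baseline."""
    current = cohen_kappa_score(human, ai, weights="quadratic")
    if current < 0.60:
        return current, "recalibrate: revise rubric prompts / retrain raters"
    if current < baseline_kappa_w - drop_tolerance:
        return current, "investigate: possible model or prompt drift"
    return current, "ok"
```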
Reporting for accreditation (what to include)
Your reliability note for a program review should include:
- Intended use & stakes (course feedback vs. certification).
- Sampling design (size, stratification, blinding).
- Metric choices (unweighted & weighted κ; rationale).
- Results (confusion matrix, % agreement, κ/κ_w with 95% CIs).
- Remediation rules (what you do when κ falls).
- Audit artifacts: rubric version, model name/params, prompts, data protection terms, and a link to the HITL workflow. Educational Testing Standards
Appendix A — Code snippet (optional)
# Python, using scikit-learn. Use per-criterion arrays of human vs. AI labels.
from sklearn.metrics import cohen_kappa_score

# Unweighted:
kappa = cohen_kappa_score(human_labels, ai_labels)

# Weighted (quadratic) for ordered rubric levels:
kappa_w = cohen_kappa_score(human_labels, ai_labels, weights="quadratic")
scikit-learn’s cohen_kappa_score implements \(\kappa = (p_o - p_e)/(1 - p_e)\) and accepts weights="linear" or weights="quadratic" for ordered rubrics. Scikit-learn
Appendix B — CSV columns (template)
student_id,human_score,ai_score
1,3,3
2,2,2
3,2,3
...
Use integers for rubric levels (e.g., 1–4). Compute per-criterion κ by repeating this process for each criterion’s labels. Scikit-learn
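For example, assuming one such CSV per criterion (the file names below are hypothetical), the per-criterion computation can be as short as:

```python
import pandas as pd
from sklearn.metrics import cohen_kappa_score

# One file per rubric criterion, each in the Appendix B format.
for criterion in ["thesis", "evidence", "organization"]:
    df = pd.read_csv(f"{criterion}-scores.csv")   # hypothetical file names
    kappa_w = cohen_kappa_score(df["human_score"], df["ai_score"],
                                weights="quadratic")
    print(f"{criterion}: quadratic-weighted kappa = {kappa_w:.3f}")
```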
Ready to Transform Your Grading Process?
Experience the power of AI-driven exam grading with human oversight. Get consistent, fast, and reliable assessment results.
Try AI Grader