
Calibrate AI Grading with Cohen’s Kappa (Practical Guide)
When you point an AI grader at real student work, the first QA question is reliability: “How closely do AI scores agree with trained humans on the same essays?” The classic tool for this is Cohen’s κ (kappa)—a chance-corrected agreement statistic for two raters. In this guide you’ll learn exactly what κ measures, how to plan a defensible double-rating sample, how to compute κ (worked example + CSV), what thresholds to use in practice, and what to report for accreditation.
TL;DR: Kappa measures agreement, not accuracy. Use weighted κ for ordered rubric levels, watch for prevalence effects, and pair κ with targeted human spot checks. See our HITL workflow for operational guards.
What Kappa measures (and doesn’t)
Definition. Cohen’s κ compares the observed agreement \(p_o\) to the agreement expected by chance \(p_e\) from the raters’ own label distributions:
\[\kappa = \frac{p_o - p_e}{1 - p_e}.\]
It’s 1 for perfect agreement, 0 when agreement equals chance, and can be negative if raters systematically disagree. Educational Testing Standards
Weighted κ for rubrics. For ordered categories (e.g., 1–4 rubric levels), use weighted κ (linear or quadratic) so “off by 1” is penalized less than “off by 3.” Cohen introduced weighted κ for “partial credit” scenarios—exactly our rubric case. personality-project.org
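Concretely, for a k-level rubric the usual agreement weights are
\[w^{\text{lin}}_{ij} = 1 - \frac{|i-j|}{k-1}, \qquad w^{\text{quad}}_{ij} = 1 - \frac{(i-j)^2}{(k-1)^2},\]
and weighted κ applies the same formula after replacing \(p_o\) and \(p_e\) with \(p_o^w = \sum_{i,j} w_{ij} C_{ij}/N\) and \(p_e^w = \sum_{i,j} w_{ij}\, r_i c_j / N^2\), where \(C_{ij}\) counts essays the human rated \(i\) and the AI rated \(j\), and \(r_i, c_j\) are the row and column totals.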
Agreement ≠ accuracy. κ says how consistently AI matches a reference rater; it does not prove either party is correct. Treat κ as a calibration check, not ground truth. Educational Testing Standards
Prevalence & bias effects. With very common/rare labels, you can see “high % agreement but low κ,” or vice-versa—the famous kappa paradoxes. Inspect marginals and report both % agreement and κ. PubMed
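To see the paradox in miniature (numbers invented purely for illustration): suppose 100 essays get a binary meets/doesn’t-meet call, the human marks 95 as meeting the standard, the AI marks 94, and they agree on 91 of the 100 (90 “meets” plus 1 “doesn’t”). Then
\[p_o = 0.91, \qquad p_e = 0.95 \times 0.94 + 0.05 \times 0.06 = 0.896, \qquad \kappa = \frac{0.91 - 0.896}{1 - 0.896} \approx 0.13,\]
so 91% raw agreement collapses to κ ≈ 0.13 because nearly all of that agreement is already expected from the skewed marginals.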
Alternatives (know when to switch). For multi-rater panels and missing data, Krippendorff’s α is often preferred; it generalizes across scales and handles gaps. Use κ for two raters (AI vs instructor) and α when your design is more complex. Oxford Academic
A sampling plan for double ratings (pragmatic + defensible)
- Scope: Calibrate per assignment and rubric criterion (e.g., Thesis, Evidence, Organization), not just an overall score. Educational Testing Standards
- How many to double-rate? Formal sample-size tables depend on the number of categories, target κ, and CI width; typical guidance suggests at least 50–100 artifacts for stable estimates, more if classes are imbalanced. Sim & Wright (2005) and Bujang & Baharum (2017) provide worked planning rules; Rotondi & Donner give CI-based formulas and an R package (kappaSize). Oxford Academic
- Operational rule of thumb: Double-rate 10–20% of submissions or ≥50 artifacts, whichever is larger, stratified by section and oversampling borderline/low-confidence AI cases (a sampling sketch follows this list). This balances precision with instructor time and lines up with auditing expectations in the AERA–APA–NCME Standards (document intended use and ongoing checks). Educational Testing Standards
- Blinding: Ensure the human rater does not see the AI’s score or rationale when producing the reference rating. Educational Testing Standards
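A minimal pandas sketch of the rule of thumb above, assuming a gradebook export with section and ai_confidence columns; the column names, fractions, and confidence cutoff are illustrative assumptions to adapt.

```python
import pandas as pd

def pick_double_rating_sample(df: pd.DataFrame, frac: float = 0.15,
                              floor: int = 50, low_conf: float = 0.6,
                              seed: int = 0) -> pd.DataFrame:
    """Select essays to double-rate: at least max(frac*N, floor), with all
    low-confidence AI cases included and the remainder stratified by section."""
    target = min(max(int(round(frac * len(df))), floor), len(df))

    # Borderline / low-confidence AI scores always go to a human.
    borderline = df[df["ai_confidence"] < low_conf]

    # Fill the rest proportionally across sections.
    rest = df.drop(borderline.index)
    remainder = target - len(borderline)
    if remainder <= 0 or rest.empty:
        return borderline
    fill_frac = min(remainder / len(rest), 1.0)
    fill = rest.groupby("section").sample(frac=fill_frac, random_state=seed)
    return pd.concat([borderline, fill])
```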
Computing κ (worked example you can copy)
Below is a toy dataset (80 essays, 4-level rubric). You can download the CSV and replicate the calculation.
Download: cohens-kappa-example.csv
Confusion matrix (Human rows × AI columns)
| Human \ AI | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 1 | 11 | 0 | 0 | 0 |
| 2 | 3 | 23 | 3 | 0 |
| 3 | 0 | 5 | 26 | 2 |
| 4 | 0 | 0 | 4 | 3 |
From this table:
- Observed agreement \(p_o = 0.7875\)
- Expected agreement \(p_e = 0.3266\) (from the marginals)
- Unweighted κ = 0.6845
- Linear-weighted κ = 0.7648
- Quadratic-weighted κ = 0.8494
(Weights reduce the penalty for small disagreements—appropriate for ordered rubric levels.) Scikit-learn
Mini calculator (paste data):
Step 1 — Build confusion matrix (counts C_ij)
Step 2 — Observed: p_o = (Σ_i C_ii) / N
Step 3 — Expected: p_e = Σ_i (row_i_total/N) * (col_i_total/N)
Step 4 — Kappa: κ = (p_o - p_e) / (1 - p_e)
Weighted: use agreement weights w_ij (linear: w_ij = 1 − |i−j|/(k−1); quadratic: w_ij = 1 − (i−j)²/(k−1)²), then replace Steps 2–3 with p_o = Σ_ij w_ij * C_ij / N and p_e = Σ_ij w_ij * (row_i_total/N) * (col_j_total/N); Step 4 is unchanged.
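As a check, here is a small NumPy implementation of these steps run on the confusion matrix above; it reproduces the κ values reported in the worked example.

```python
import numpy as np

# Confusion matrix from the worked example (human rows x AI columns, levels 1-4).
C = np.array([[11,  0,  0, 0],
              [ 3, 23,  3, 0],
              [ 0,  5, 26, 2],
              [ 0,  0,  4, 3]], dtype=float)

def kappa_from_confusion(C, weighting=None):
    """Cohen's kappa from a square confusion matrix.
    weighting: None (unweighted), "linear", or "quadratic"."""
    k, n = C.shape[0], C.sum()
    i, j = np.indices((k, k))
    if weighting is None:
        w = (i == j).astype(float)                   # exact matches only
    elif weighting == "linear":
        w = 1 - np.abs(i - j) / (k - 1)
    else:                                            # "quadratic"
        w = 1 - ((i - j) / (k - 1)) ** 2
    p_o = (w * C).sum() / n                                  # Step 2 (weighted)
    p_e = (w * np.outer(C.sum(1), C.sum(0))).sum() / n**2    # Step 3 (weighted)
    return (p_o - p_e) / (1 - p_e)                           # Step 4

print(round(kappa_from_confusion(C), 4))               # 0.6845
print(round(kappa_from_confusion(C, "linear"), 4))     # 0.7648
print(round(kappa_from_confusion(C, "quadratic"), 4))  # 0.8494
```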
Tip: For production pipelines, compute per-criterion weighted κ (e.g., Thesis κ_w, Evidence κ_w) and an overall κ_w. Log model name/version + prompt/rubric hash with each batch for auditing. (Educational Testing Standards)
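A sketch of what such a per-batch log entry could look like; the field names, hashing choice, and schema are illustrative assumptions, not a required format.

```python
import datetime
import hashlib
import json
from sklearn.metrics import cohen_kappa_score

def calibration_record(batch_id, model_name, model_version, prompt_text,
                       rubric_text, labels_by_criterion):
    """labels_by_criterion: {"Thesis": (human_labels, ai_labels), ...}"""
    record = {
        "batch_id": batch_id,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model": {"name": model_name, "version": model_version},
        "prompt_sha256": hashlib.sha256(prompt_text.encode()).hexdigest(),
        "rubric_sha256": hashlib.sha256(rubric_text.encode()).hexdigest(),
        "kappa_w": {
            crit: round(cohen_kappa_score(h, a, weights="quadratic"), 3)
            for crit, (h, a) in labels_by_criterion.items()
        },
    }
    return json.dumps(record)
```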
Thresholds and decisions (what’s “good enough”?)
There is no universal cut-point, but two common references appear in the literature:
- Landis & Koch (1977): 0.41–0.60 moderate, 0.61–0.80 substantial, 0.81–1.00 almost perfect. Useful as a quick lexicon but widely criticized as arbitrary. NCBI
- McHugh (2012): tougher labels (e.g., 0.80–0.90 strong; >0.90 almost perfect). Emphasizes stakes-sensitive targets. PubMed Central
Our pragmatic policy (course-level, low/medium stakes):
- Aim for quadratic-weighted κ ≥ 0.75 before auto-releasing grades.
- If κ_w is 0.60–0.74, enable human sampling (e.g., 20%) + borderline review; show AI rationale but require instructor acceptance.
- If κ_w < 0.60, revise rubric prompts and/or model settings, retrain human raters (anchor papers), and recalibrate before deployment.
- Always publish confidence intervals for κ/κ_w in faculty QA notes. Oxford Academic
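One way to produce those intervals is a percentile bootstrap over the double-rated essays. A minimal sketch follows; for small or heavily imbalanced samples, an analytic standard error or the kappaSize planning tools may be preferable.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def kappa_with_ci(human, ai, weights="quadratic", n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile-bootstrap CI for (weighted) Cohen's kappa."""
    human, ai = np.asarray(human), np.asarray(ai)
    rng = np.random.default_rng(seed)
    n = len(human)
    boot = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample essays with replacement
        boot.append(cohen_kappa_score(human[idx], ai[idx], weights=weights))
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return cohen_kappa_score(human, ai, weights=weights), (lo, hi)
```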
Remember the paradoxes. When categories are imbalanced (many “3”s), κ can look worse than % agreement. Report both and include the confusion matrix so reviewers can judge prevalence effects. PubMed
Acting on results (revise rubric? add sampling?)
Use κ to trigger actions, not as an end in itself:
- Low κ on one criterion (e.g., Evidence): tighten that criterion’s descriptors and exemplars; add targeted few-shot examples to your AI prompt; re-anchor with human raters. Educational Testing Standards
- Low κ overall + high prevalence skew: consider weighted κ, adjust category cut-points, and ensure the AI isn’t collapsing mid-levels; add sampling until κ recovers. personality-project.org, PubMed
- κ drops after a model update: version drift—lock model versions for a term or re-calibrate when vendors ship new releases. Document per Standards. Educational Testing Standards
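A minimal sketch of that drift check, assuming you keep the last accepted κ_w as a baseline; the thresholds mirror the policy above and are meant to be adjusted, not prescriptive.

```python
from sklearn.metrics import cohen_kappa_score

def check_drift(human, ai, baseline_kappa_w, drop_tolerance=0.05):
    """Recompute quadratic-weighted kappa for the latest calibration batch and
    flag recalibration if it falls materially below the accepted baseline."""
    current = cohen_kappa_score(human, ai, weights="quadratic")
    if current < 0.60:
        return current, "recalibrate: revise rubric prompts / retrain raters"
    if current < baseline_kappa_w - drop_tolerance:
        return current, "investigate: possible model or prompt drift"
    return current, "ok"
```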
Reporting for accreditation (what to include)
Your reliability note for a program review should include:
- Intended use & stakes (course feedback vs. certification).
- Sampling design (size, stratification, blinding).
- Metric choices (unweighted & weighted κ; rationale).
- Results (confusion matrix, % agreement, κ/κ_w with 95% CIs).
- Remediation rules (what you do when κ falls).
- Audit artifacts: rubric version, model name/params, prompts, data protection terms, and a link to the HITL workflow. Educational Testing Standards
Appendix A — Code snippet (optional)
# Python, using scikit-learn. Use per-criterion arrays of human vs. AI labels.
from sklearn.metrics import cohen_kappa_score

# Unweighted:
kappa = cohen_kappa_score(human_labels, ai_labels)

# Weighted (quadratic) for ordered rubric levels:
kappa_w = cohen_kappa_score(human_labels, ai_labels, weights="quadratic")
scikit-learn’s cohen_kappa_score implements \(\kappa = (p_o - p_e)/(1 - p_e)\) and accepts weights="linear" or weights="quadratic" for ordered rubrics. Scikit-learn
Appendix B — CSV columns (template)
student_id,human_score,ai_score
1,3,3
2,2,2
3,2,3
...
Use integers for rubric levels (e.g., 1–4). Compute per-criterion κ by repeating this process for each criterion’s labels. Scikit-learn
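For example, assuming one such CSV per criterion (the file names below are hypothetical), the per-criterion computation can be as short as:

```python
import pandas as pd
from sklearn.metrics import cohen_kappa_score

# One file per rubric criterion, each in the Appendix B format.
for criterion in ["thesis", "evidence", "organization"]:
    df = pd.read_csv(f"{criterion}-scores.csv")   # hypothetical file names
    kappa_w = cohen_kappa_score(df["human_score"], df["ai_score"],
                                weights="quadratic")
    print(f"{criterion}: quadratic-weighted kappa = {kappa_w:.3f}")
```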
Ready to Transform Your Grading Process?
Experience the power of AI-driven exam grading with human oversight. Get consistent, fast, and reliable assessment results.
Try AI Grader