
AES vs AWE vs LLMs: What Works for Essay Grading in 2025
If you’ve heard conflicting claims about robot graders, grammar apps, and “AI teachers,” you’re not alone. This guide explains the three big families you’ll run into:
- AES (Automated Essay Scoring): train-once scoring engines tuned to a prompt or construct.
- AWE (Automated Writing Evaluation): tools that give formative feedback to improve drafts.
- LLM-based scoring/feedback: large language models (GPT-5, Claude 3.5, Gemini 2.5, etc.) configured with rubrics and controls.
You’ll see where each shines, where each breaks, and a decision matrix to choose pragmatically under real institutional constraints.
TL;DR: AES is narrow and stable (best for fixed prompts at scale); AWE is great for formative feedback; LLMs are flexible but require human-in-the-loop, audit logs, and policy guardrails. If you want a turnkey way to run rubric-driven, auditable grading with sampling, you can try Exam AI Grader in minutes.
Definitions & 50-year history (in one page)
- AES originated with Project Essay Grade (PEG) in the 1960s and continued with ETS e-rater, the Intelligent Essay Assessor (IEA), and others. Classic AES learns correlations between surface/linguistic features and human scores on a specific task, then scores new responses to that same task. (Digital Commons, ETS, Wake Forest University Image Database)
- AES can achieve high agreement with trained raters on appropriate tasks, but construct coverage and vulnerability to gaming are longstanding concerns (e.g., inflated scores for length or fancy vocabulary; “stumping” counter-examples). (Digital Commons, ETS)
- AWE (Automated Writing Evaluation) is about feedback: style, mechanics, organization, and revision suggestions. Meta-analyses report meaningful improvements in writing quality when AWE is used well, especially in post-secondary contexts. (SAGE Journals, Frontiers, ERIC)
- LLMs (2024–2025) changed the landscape: without task-specific training, LLMs can apply rubrics to open-domain essays with human-like scoring—good enough for low-stakes uses—but fairness, drift, and consistency remain active research areas. (ScienceDirect, ACM Digital Library)
Standards backdrop: Whether you use AES, AWE, or LLMs, your implementation should satisfy the AERA–APA–NCME Testing Standards (validity, reliability, fairness, transparency) and similar guidance (e.g., the ITC guidelines). (testingstandards.net, AERA, intestcom.org)
LLM era: what actually changed
- Generalization without prompt-specific training. Unlike classic AES (trained on hundreds or thousands of scored samples per prompt), LLMs can score from a well-written rubric in zero- or few-shot fashion. That collapses setup costs for many courses. (ACM Digital Library, ScienceDirect)
- Richer rationale + structured outputs. LLMs can return criterion-level rationales and evidence pointers in JSON, enabling audits and downstream analytics rather than just a single score; a sketch of such a payload follows this list. (See our HITL workflow: /blog/hitl-grading-workflow.)
- But: reliability and fairness vary by model, prompt, and population; exact agreement with trained human raters is still imperfect; models can drift across versions; and bias/fairness reviews are essential. (ScienceDirect, Nature, ACL Anthology)
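To make the structured-output point concrete, here is a minimal sketch of the kind of criterion-level JSON an LLM grader can be asked to return. The field names (`essay_id`, `criteria`, `evidence`, and so on) are illustrative assumptions, not a standard schema; adapt them to your rubric and logging needs.

```python
# Illustrative shape of a criterion-level grading payload (field names are
# assumptions, not a fixed standard). Storing the raw JSON enables audits
# and downstream analytics, not just a single holistic score.
import json

example_result = {
    "essay_id": "essay-0042",            # hypothetical identifier
    "rubric_version": "FYW-2025.1",      # log the exact rubric version used
    "model": "provider-model-name",      # record the model/version string
    "criteria": [
        {
            "criterion": "Thesis & focus",
            "score": 3,                              # e.g., on a 0-4 scale
            "evidence": "Paragraph 1, sentence 2",   # pointer, not a rewrite
            "rationale": "Clear, arguable claim, but the essay's structure is not previewed.",
        },
        {
            "criterion": "Use of evidence",
            "score": 2,
            "evidence": "Paragraphs 3-4",
            "rationale": "Sources are cited but not analyzed.",
        },
    ],
    "overall": {"score": 5, "max": 8, "confidence": 0.71},
}

print(json.dumps(example_result, indent=2))
```

The `confidence` field is optional but useful: the human-sampling step later in this guide oversamples low-confidence results.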
Reliability & validity (trade-offs you can’t ignore)
- AES (classic): Strong reliability within its trained domain and transparent feature sets (e.g., linguistic indicators), but a narrow construct; known adversarial cases (nonsense essays scoring highly) demand human oversight in high-stakes use. (ETS, Digital Commons)
- AWE: The goal is learning gains, not summative reliability. Meta-analyses show large positive effects on writing quality when AWE feedback complements instruction. Use rubrics and exemplars to prevent “grammar-only” tunnel vision. (SAGE Journals, Frontiers, ERIC)
- LLMs: Recent studies find fair-to-moderate agreement with expert raters on K-12 datasets and competitive reliability in higher-ed samples—when rubric instructions are explicit. Treat LLM scores as decision support unless you’ve validated them on your population. (ScienceDirect, ACM Digital Library, The Hechinger Report)
Policy lens: The Testing Standards expect you to document intended use, construct coverage, validation evidence, error analyses, and fairness checks. If you can’t show this, don’t deploy the system for high-stakes grading. (testingstandards.net, AERA)
Cost, latency, privacy, and policy—practical constraints
Cost (as of Aug 20, 2025)
Public API prices make large-scale, low-stakes LLM scoring surprisingly inexpensive:
- OpenAI (current pricing page): GPT-4o/4.1 and newer families list token-based rates; see the provider’s page for the latest. (OpenAI Platform)
- Anthropic (Claude 3.5 Sonnet): ~$3/M input tokens, $15/M output tokens. (Anthropic)
- Google (Gemini 2.5): e.g., Flash-Lite at ~$0.10/M input and $0.40/M output; Flash at roughly $0.30/$2.50. (Google AI for Developers)
Example: an ~800-word essay (~1.1k input tokens) with a ~150-token output:
- Gemini 2.5 Flash-Lite: ≈ **$0.00017** per essay (≈$0.85 per 5,000).
- Claude 3.5 Sonnet: ≈ **$0.00555** per essay (≈$27.75 per 5,000).
(Your mileage varies—model, caching, and output length matter; always check the provider’s current pricing page.) (OpenAI Platform, Anthropic, Google AI for Developers)
Classic AES has near-zero marginal cost once a model is trained, but setup costs (curating hundreds or thousands of scored essays per prompt) and maintenance (re-training when prompts change) can be substantial. (Digital Commons)
Privacy & policy
- In the U.S., FERPA governs the disclosure and handling of student education records; the Department of Education’s PTAC provides model terms, requirements, and best practices for online services—key if you use third-party AI. (Protecting Student Privacy)
- Schools must disclose how generative AI tools collect and process data, and update privacy notices accordingly (the UK guidance is a useful reference). (GOV.UK)
- Beyond privacy, ensure fairness reviews and bias monitoring; recent work highlights equity risks in zero-shot LLM scoring. (Nature, ScienceDirect)
Comparison matrix (quick scan)
Approach | What it is | Strengths | Weaknesses | Best for | Setup time | Ongoing cost | Reliability (task-fit) | Feedback quality | Policy fit |
---|---|---|---|---|---|---|---|---|---|
AES | Prompt-trained scorer (e.g., e-rater/IEA) | Fast, consistent within domain; low marginal cost | Narrow construct; re-training when prompts change; adversarial cases documented | Fixed prompts at very large scale (placement, standardized programs) | High (data curation/training) | Low | High if validated for that prompt | Limited (score + brief features) | Strong if you meet Standards & monitor validity |
AWE | Feedback engine for revision | Improves writing quality over time; scalable feedback | Not a summative grader; quality varies by tool/config | Formative cycles, writing labs, drafts | Low-Medium | Low-Medium | N/A (aim is learning) | Rich, actionable feedback | Strong for formative use with teacher oversight |
LLMs | Rubric-driven, general models | Flexible across prompts; rich rationales; structure (JSON) | Version drift; fairness; exact agreement not guaranteed; requires HITL | Course-level grading with sampling, or rapid feedback drafts | Very low (rubric + exemplars) | Low per essay | Moderate out-of-the-box; improve with calibration | High (criterion rationales) | Strong if audited + privacy terms in place |
Sources: AES/AWE history & effects; LLM reliability/fairness; Testing Standards. (Digital Commons, SAGE Journals, Frontiers, ScienceDirect, testingstandards.net)
Decision tree (choose with your constraints)
Start
├─ Is this HIGH-STAKES (placement/certification) or program accountability?
│ ├─ Yes → Do you have prompt-specific scored data (≥1k essays) and capacity to validate bias?
│ │ ├─ Yes → AES or hybrid (AES primary + human double-score sample) → Annual validity study.
│ │ └─ No → LLM with strict HITL (double-score borderlines) or postpone automation.
│ └─ No (low/medium stakes) →
│ ├─ Is your goal learning gains on drafts? → AWE or LLM-as-feedback.
│ └─ Is your goal consistent grading with transparency at course scale?
│ → LLM + rubric JSON + 10–20% human sampling + appeals workflow.
└─ Any path → Log model/version/prompts; run fairness checks; align with Testing Standards.
See our HITL template: /blog/hitl-grading-workflow.
When to pick which (by use case)
- First-year writing, multiple sections, shared rubric (low/medium stakes): LLM with criterion-by-criterion prompts + 10–20% human sampling (oversample low-confidence cases; a sampling sketch follows this list) + an appeals workflow. Faster than purely manual grading, more flexible than AES. (The Hechinger Report)
- Standardized placement writing, same prompt every term (high stakes): AES, if you can assemble and maintain a prompt-specific training set and run bias audits. Keep a human read-behind sample and run periodic validity studies per the Standards. (Digital Commons, testingstandards.net)
- Writing center / iterative drafts: AWE or an LLM as a feedback coach. Evidence for learning gains is strongest here; pair it with explicit revision plans. (SAGE Journals, Frontiers)
- Mixed modalities (short answers + essays): Combine LLM scoring for open responses with rule-based autoscoring for short answers; keep manual review for edge cases. (Taylor & Francis Online)
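A minimal sketch of the “10–20% human sampling, oversampling low-confidence predictions” step mentioned above. The `confidence` field and the 0.6 threshold are assumptions; calibrate both against your own validation data.

```python
# Select essays for human review: take every low-confidence prediction first,
# then top up with a random sample until the overall review rate is reached.
import random

def pick_review_sample(graded, rate=0.15, low_conf_threshold=0.6, seed=42):
    """graded: list of {"essay_id": str, "confidence": float} dicts.
    Returns the set of essay IDs to route to human reviewers."""
    rng = random.Random(seed)
    low_conf = [g["essay_id"] for g in graded if g["confidence"] < low_conf_threshold]
    rest = [g["essay_id"] for g in graded if g["confidence"] >= low_conf_threshold]

    # Never drop a low-confidence essay, even if that exceeds the target rate.
    target = max(len(low_conf), int(rate * len(graded)))
    top_up = rng.sample(rest, min(len(rest), target - len(low_conf)))
    return set(low_conf) | set(top_up)

# Example: six graded essays, 15% target review rate.
confidences = [0.92, 0.55, 0.81, 0.40, 0.95, 0.73]
to_review = pick_review_sample(
    [{"essay_id": f"e{i}", "confidence": c} for i, c in enumerate(confidences)]
)
print(sorted(to_review))  # the two low-confidence essays, plus any random top-up
```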
Why hybrids win in practice
A practical 2025 stack:
- AWE for drafts (students get immediate, criterion-mapped suggestions). (SAGE Journals)
- LLM pass for rubric-aligned JSON per criterion (low temperature; model/version logged). (Related prompts: /blog/ai-essay-feedback-prompts.)
- QA layer: auto-flags for plagiarism suspicions, cap rules (e.g., no identifiable thesis ⇒ score capped), and unusual-length/outlier detection.
- Human sampling: 10–20% review with oversampling of edge cases and low-confidence predictions; appeals workflow.
- Auditability: store the rubric version, prompt template, model name, parameters, and raw JSON; recompute inter-rater reliability (e.g., κ) on the sampled set, as in the sketch after this list. Follow the Testing Standards’ documentation requirements. (testingstandards.net)
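The auditability bullet above is easy to operationalize. Below is a minimal sketch, assuming scikit-learn is installed, of (a) the fields worth persisting for each machine-graded essay and (b) recomputing agreement on the human-sampled subset with quadratic-weighted kappa, a common choice for ordinal essay scores. The `audit_record` helper and its field names are illustrative, not a required schema.

```python
# Persist enough to reproduce any grade, then check machine-human agreement
# on the sampled set. scikit-learn's cohen_kappa_score supports the
# quadratic weighting typically reported for ordinal essay scores.
from datetime import datetime, timezone
from sklearn.metrics import cohen_kappa_score

def audit_record(essay_id, model, params, rubric_version, prompt_template, raw_json):
    """Build one audit-log entry (store alongside the raw model output)."""
    return {
        "essay_id": essay_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,                      # exact model/version string
        "params": params,                    # e.g., {"temperature": 0.2}
        "rubric_version": rubric_version,
        "prompt_template": prompt_template,  # or a hash of the template
        "raw_output": raw_json,
    }

# Agreement on the human-sampled subset: machine scores vs. human re-scores.
machine_scores = [3, 2, 4, 3, 1, 4, 2]
human_scores   = [3, 2, 3, 3, 2, 4, 2]
qwk = cohen_kappa_score(machine_scores, human_scores, weights="quadratic")
print(f"Quadratic-weighted kappa on the sampled set: {qwk:.2f}")
```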
Known pitfalls (and how to avoid them)
- Assuming “AI = unbiased.” Run group-level fairness checks and error analyses; LLMs can show subgroup differences without careful design and monitoring. (Nature)
- Using AES outside its training domain. Classic engines degrade when prompts drift; budget for re-training and fresh human scoring whenever tasks change. (Digital Commons)
- Letting any machine be the only grader in high-stakes decisions. Maintain human review pathways; this aligns with professional testing guidance. (testingstandards.net)
- Over-editing student voice. Keep feedback diagnostic; require students to revise rather than having the tool rewrite their prose.
- Ignoring privacy terms. Use PTAC model terms to evaluate vendors; update privacy notices when introducing generative AI services. (Protecting Student Privacy, GOV.UK)
- Forgetting the “gaming” lessons of early AES. Keep content-based checks and human sampling to deter formulaic hacks. (ETS)
Cost planning worksheet (quick math)
- Token budgeting: cost ≈ (input_tokens / 1M × input_rate) + (output_tokens / 1M × output_rate); see the sketch after this list.
- Keep outputs concise (150–250 tokens per essay is usually enough for criterion rationales).
- Consider prompt caching (provider-specific) for repeated rubric templates.
- Track latency vs. throughput: batch by course/section; store responses for audits.
- Check current prices on the provider pages (OpenAI, Anthropic, Google). (OpenAI Platform, Anthropic, Google AI for Developers)
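The token-budgeting formula above is trivial to script. A minimal sketch follows; the rates shown are the illustrative Gemini 2.5 Flash-Lite figures from the cost section and will go stale, so always substitute the provider’s current per-million-token prices.

```python
# cost ≈ (input_tokens / 1M × input_rate) + (output_tokens / 1M × output_rate)
def cost_per_essay(input_tokens, output_tokens, input_rate_per_m, output_rate_per_m):
    return (input_tokens / 1_000_000) * input_rate_per_m + (
        output_tokens / 1_000_000
    ) * output_rate_per_m

# Example from this guide: ~1,100 input tokens and ~150 output tokens per essay,
# priced at the illustrative $0.10 / $0.40 per-million-token rates.
per_essay = cost_per_essay(1_100, 150, input_rate_per_m=0.10, output_rate_per_m=0.40)
print(f"~${per_essay:.5f} per essay, ~${per_essay * 5_000:.2f} per 5,000 essays")
```

This reproduces the worked example earlier: roughly $0.00017 per essay and about $0.85 per 5,000 essays.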
Mini case: “Good enough” vs. “High-stakes ready”
Recent studies suggest LLMs can match or approach typical teacher agreement levels on many school essays, especially with clear rubrics—promising for draft feedback and course-scale grading with sampling. But papers and practitioner reports also stress that high-stakes use still demands validation, bias analysis, and human oversight. (The Hechinger Report, ScienceDirect)
AES, AWE, LLMs — side-by-side (detailed)
Dimension | AES (e-rater/IEA class) | AWE (feedback tools) | LLM graders (GPT-5/Claude/Gemini) |
---|---|---|---|
Primary outcome | Score on fixed task | Draft improvement | Score and rationale |
Data needs | High (prompt-specific) | Low-Medium | Very low (rubric few-shot) |
Change tolerance | Low (retrain on prompt change) | High | High (but monitor version drift) |
Gaming risk | Documented historically | Low (not a grader) | Lower than classic AES but still requires QA |
Fairness | Requires dedicated audits | N/A (formative) | Active research; must audit locally |
Where it shines | Large, fixed-prompt programs | Writing centers, iterative courses | Department-scale grading with HITL |
Representative evidence | AES overviews; ETS reports | AWE meta-analyses | 2024–2025 LLM studies & reports |
Compliance work | Validity docs, monitoring | Data protection & transparency | All of the left + strong audit logs |
Sources: histories & overviews; AWE effects; LLM reliability. (Digital Commons, SAGE Journals, Frontiers, The Hechinger Report, ScienceDirect)
References you can cite to reviewers
- AES history & engines: Dikli (2006) overview; ETS e-rater materials; IEA/LSA papers; ETS analysis of “stumping e-rater.” (Digital Commons, ETS, Wake Forest University Image Database)
- AWE effectiveness: Zhai et al. (2023) meta-analysis; Fleckenstein et al. (2023) review; Fan & Ma (2022) systematic review. (SAGE Journals, Frontiers, ERIC)
- LLM reliability/fairness: Pack et al. (2024); ACM 2025 LLM grading study; practitioner coverage of the Tate (AERA 2024) study; GEM workshop 2025 comparative work. (ScienceDirect, ACM Digital Library, The Hechinger Report, ACL Anthology)
- Testing & policy: AERA–APA–NCME Standards (2014); ITC guidelines for technology-based assessment. (testingstandards.net, AERA, intestcom.org)
- Privacy: U.S. DOE PTAC requirements/best practices and model terms; UK DfE guidance on AI and data protection (useful template language). (Protecting Student Privacy, GOV.UK)
Want to try the hybrid approach without building the plumbing?
Try Exam AI Grader — import your rubric, run criterion-level prompts, enable human sampling and appeals, and export an audit log. Start in minutes.