
AES vs AWE vs LLMs: What Works for Essay Grading in 2025
If you’ve heard conflicting claims about robot graders, grammar apps, and “AI teachers,” you’re not alone. This guide explains the three big families you’ll run into:
- AES (Automated Essay Scoring): train-once scoring engines tuned to a prompt or construct.
- AWE (Automated Writing Evaluation): tools that give formative feedback to improve drafts.
- LLM-based scoring/feedback: large language models (GPT-5, Claude 3.5, Gemini 2.5, etc.) configured with rubrics and controls.
You’ll see where each shines, where each breaks, and a decision matrix to choose pragmatically under real institutional constraints.
TL;DR: AES is narrow and stable (best for fixed prompts at scale); AWE is great for formative feedback; LLMs are flexible but require human-in-the-loop, audit logs, and policy guardrails. If you want a turnkey way to run rubric-driven, auditable grading with sampling, you can try Exam AI Grader in minutes.
Definitions & 50-year history (in one page)
- AES originated with Project Essay Grade (PEG) in the 1960s and continued with ETS e-rater, the Intelligent Essay Assessor (IEA), and others. Classic AES learns correlations between surface/linguistic features and human scores on a specific task, then scores new responses to that same task. (Digital Commons, ETS, Wake Forest University Image Database)
- AES can achieve high agreement with trained raters on appropriate tasks, but construct coverage and vulnerability to gaming are longstanding concerns (e.g., inflated scores for length or fancy vocabulary; “stumping” counter-examples). (Digital Commons, ETS)
- AWE (Automated Writing Evaluation) is about feedback: style, mechanics, organization, and revision suggestions. Meta-analyses report meaningful improvements in writing quality when AWE is used well, especially in post-secondary contexts. (SAGE Journals, Frontiers, ERIC)
- LLMs (2024–2025) changed the landscape: without task-specific training, LLMs can apply rubrics to open-domain essays with human-like scoring—good enough for low-stakes uses—but fairness, drift, and consistency remain active research areas. (ScienceDirect, ACM Digital Library)
Standards backdrop: Whether you use AES, AWE, or LLMs, your implementation should satisfy the AERA–APA–NCME Testing Standards (validity, reliability, fairness, transparency) and similar guidance (e.g., the ITC guidelines). (testingstandards.net, AERA, intestcom.org)
LLM era: what actually changed
- Generalization without prompt-specific training. Unlike classic AES (trained on hundreds or thousands of scored samples per prompt), LLMs can score from a well-written rubric in zero- or few-shot fashion. That collapses setup costs for many courses. (ACM Digital Library, ScienceDirect)
- Richer rationale + structured outputs. LLMs can return criterion-level rationales and evidence pointers in JSON, enabling audits and downstream analytics rather than just a single score; a sketch of such a payload follows this list. (See our HITL workflow: /blog/hitl-grading-workflow.)
- But: reliability and fairness vary by model, prompt, and population; exact agreement with trained human raters is still imperfect; models can drift across versions; and bias/fairness reviews are essential. (ScienceDirect, Nature, ACL Anthology)
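To make the structured-output point concrete, here is a minimal sketch of the kind of criterion-level JSON an LLM grader can be asked to return. The field names (`essay_id`, `criteria`, `evidence`, and so on) are illustrative assumptions, not a standard schema; adapt them to your rubric and logging needs.

```python
# Illustrative shape of a criterion-level grading payload (field names are
# assumptions, not a fixed standard). Storing the raw JSON enables audits
# and downstream analytics, not just a single holistic score.
import json

example_result = {
    "essay_id": "essay-0042",            # hypothetical identifier
    "rubric_version": "FYW-2025.1",      # log the exact rubric version used
    "model": "provider-model-name",      # record the model/version string
    "criteria": [
        {
            "criterion": "Thesis & focus",
            "score": 3,                              # e.g., on a 0-4 scale
            "evidence": "Paragraph 1, sentence 2",   # pointer, not a rewrite
            "rationale": "Clear, arguable claim, but the essay's structure is not previewed.",
        },
        {
            "criterion": "Use of evidence",
            "score": 2,
            "evidence": "Paragraphs 3-4",
            "rationale": "Sources are cited but not analyzed.",
        },
    ],
    "overall": {"score": 5, "max": 8, "confidence": 0.71},
}

print(json.dumps(example_result, indent=2))
```

The `confidence` field is optional but useful: the human-sampling step later in this guide oversamples low-confidence results.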
Reliability & validity (trade-offs you can’t ignore)
- AES (classic): Strong reliability within its trained domain and transparent feature sets (e.g., linguistic indicators), but a narrow construct; known adversarial cases (nonsense essays scoring highly) demand human oversight in high-stakes use. (ETS, Digital Commons)
- AWE: The goal is learning gains, not summative reliability. Meta-analyses show large positive effects on writing quality when AWE feedback complements instruction. Use rubrics and exemplars to prevent “grammar-only” tunnel vision. (SAGE Journals, Frontiers, ERIC)
- LLMs: Recent studies find fair-to-moderate agreement with expert raters on K-12 datasets and competitive reliability in higher-ed samples—when rubric instructions are explicit. Treat LLM scores as decision support unless you’ve validated them on your population. (ScienceDirect, ACM Digital Library, The Hechinger Report)
Policy lens: The Testing Standards expect you to document intended use, construct coverage, validation evidence, error analyses, and fairness checks. If you can’t show this, don’t deploy the system for high-stakes grading. (testingstandards.net, AERA)
Cost, latency, privacy, and policy—practical constraints
Cost (as of Aug 20, 2025)
Public API prices make large-scale, low-stakes LLM scoring surprisingly inexpensive:
- OpenAI (current pricing page): GPT-4o/4.1 and newer families list token-based rates; see the provider’s page for the latest. (OpenAI Platform)
- Anthropic (Claude 3.5 Sonnet): ~$3/M input tokens, $15/M output tokens. (Anthropic)
- Google (Gemini 2.5): e.g., Flash-Lite at ~$0.10/M input and $0.40/M output; Flash at roughly $0.30/$2.50. (Google AI for Developers)
Example: an ~800-word essay (~1.1k input tokens) with a ~150-token output:
- Gemini 2.5 Flash-Lite: ≈ **$0.00017** per essay (≈$0.85 per 5,000).
- Claude 3.5 Sonnet: ≈ **$0.00555** per essay (≈$27.75 per 5,000).
(Your mileage varies—model, caching, and output length matter; always check the provider’s current pricing page.) (OpenAI Platform, Anthropic, Google AI for Developers)
Classic AES has near-zero marginal cost once a model is trained, but setup costs (curating hundreds or thousands of scored essays per prompt) and maintenance (re-training when prompts change) can be substantial. (Digital Commons)
Privacy & policy
- In the U.S., FERPA governs the disclosure and handling of student education records; the Department of Education’s PTAC provides model terms, requirements, and best practices for online services—key if you use third-party AI. (Protecting Student Privacy)
- Schools must disclose how generative AI tools collect and process data, and update privacy notices accordingly (the UK guidance is a useful reference). (GOV.UK)
- Beyond privacy, ensure fairness reviews and bias monitoring; recent work highlights equity risks in zero-shot LLM scoring. (Nature, ScienceDirect)
Comparison matrix (quick scan)
Approach | What it is | Strengths | Weaknesses | Best for | Setup time | Ongoing cost | Reliability (task-fit) | Feedback quality | Policy fit |
---|---|---|---|---|---|---|---|---|---|
AES | Prompt-trained scorer (e.g., e-rater/IEA) | Fast, consistent within domain; low marginal cost | Narrow construct; re-training when prompts change; adversarial cases documented | Fixed prompts at very large scale (placement, standardized programs) | High (data curation/training) | Low | High if validated for that prompt | Limited (score + brief features) | Strong if you meet Standards & monitor validity |
AWE | Feedback engine for revision | Improves writing quality over time; scalable feedback | Not a summative grader; quality varies by tool/config | Formative cycles, writing labs, drafts | Low-Medium | Low-Medium | N/A (aim is learning) | Rich, actionable feedback | Strong for formative use with teacher oversight |
LLMs | Rubric-driven, general models | Flexible across prompts; rich rationales; structure (JSON) | Version drift; fairness; exact agreement not guaranteed; requires HITL | Course-level grading with sampling, or rapid feedback drafts | Very low (rubric + exemplars) | Low per essay | Moderate out-of-the-box; improve with calibration | High (criterion rationales) | Strong if audited + privacy terms in place |
Sources: AES/AWE history & effects; LLM reliability/fairness; Testing Standards. (Digital Commons, SAGE Journals, Frontiers, ScienceDirect, testingstandards.net)
Decision tree (choose with your constraints)
Start
├─ Is this HIGH-STAKES (placement/certification) or program accountability?
│ ├─ Yes → Do you have prompt-specific scored data (≥1k essays) and capacity to validate bias?
│ │ ├─ Yes → AES or hybrid (AES primary + human double-score sample) → Annual validity study.
│ │ └─ No → LLM with strict HITL (double-score borderlines) or postpone automation.
│ └─ No (low/medium stakes) →
│ ├─ Is your goal learning gains on drafts? → AWE or LLM-as-feedback.
│ └─ Is your goal consistent grading with transparency at course scale?
│ → LLM + rubric JSON + 10–20% human sampling + appeals workflow.
└─ Any path → Log model/version/prompts; run fairness checks; align with Testing Standards.
See our HITL template: /blog/hitl-grading-workflow.
When to pick which (by use case)
- First-year writing, multiple sections, shared rubric (low/medium stakes): LLM with criterion-by-criterion prompts + 10–20% human sampling (oversample low-confidence cases; a sampling sketch follows this list) + an appeals workflow. Faster than purely manual grading, more flexible than AES. (The Hechinger Report)
- Standardized placement writing, same prompt every term (high stakes): AES, if you can assemble and maintain a prompt-specific training set and run bias audits. Keep a human read-behind sample and run periodic validity studies per the Standards. (Digital Commons, testingstandards.net)
- Writing center / iterative drafts: AWE or an LLM as a feedback coach. Evidence for learning gains is strongest here; pair it with explicit revision plans. (SAGE Journals, Frontiers)
- Mixed modalities (short answers + essays): Combine LLM scoring for open responses with rule-based autoscoring for short answers; keep manual review for edge cases. (Taylor & Francis Online)
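A minimal sketch of the “10–20% human sampling, oversampling low-confidence predictions” step mentioned above. The `confidence` field and the 0.6 threshold are assumptions; calibrate both against your own validation data.

```python
# Select essays for human review: take every low-confidence prediction first,
# then top up with a random sample until the overall review rate is reached.
import random

def pick_review_sample(graded, rate=0.15, low_conf_threshold=0.6, seed=42):
    """graded: list of {"essay_id": str, "confidence": float} dicts.
    Returns the set of essay IDs to route to human reviewers."""
    rng = random.Random(seed)
    low_conf = [g["essay_id"] for g in graded if g["confidence"] < low_conf_threshold]
    rest = [g["essay_id"] for g in graded if g["confidence"] >= low_conf_threshold]

    # Never drop a low-confidence essay, even if that exceeds the target rate.
    target = max(len(low_conf), int(rate * len(graded)))
    top_up = rng.sample(rest, min(len(rest), target - len(low_conf)))
    return set(low_conf) | set(top_up)

# Example: six graded essays, 15% target review rate.
confidences = [0.92, 0.55, 0.81, 0.40, 0.95, 0.73]
to_review = pick_review_sample(
    [{"essay_id": f"e{i}", "confidence": c} for i, c in enumerate(confidences)]
)
print(sorted(to_review))  # the two low-confidence essays, plus any random top-up
```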
Why hybrids win in practice
A practical 2025 stack:
- AWE for drafts (students get immediate, criterion-mapped suggestions). (SAGE Journals)
- LLM pass for rubric-aligned JSON per criterion (low temperature; model/version logged). (Related prompts: /blog/ai-essay-feedback-prompts.)
- QA layer: auto-flags for plagiarism suspicions, cap rules (e.g., no identifiable thesis ⇒ score capped), and unusual-length/outlier detection.
- Human sampling: 10–20% review with oversampling of edge cases and low-confidence predictions; appeals workflow.
- Auditability: store the rubric version, prompt template, model name, parameters, and raw JSON; recompute inter-rater reliability (e.g., κ) on the sampled set, as in the sketch after this list. Follow the Testing Standards’ documentation requirements. (testingstandards.net)
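The auditability bullet above is easy to operationalize. Below is a minimal sketch, assuming scikit-learn is installed, of (a) the fields worth persisting for each machine-graded essay and (b) recomputing agreement on the human-sampled subset with quadratic-weighted kappa, a common choice for ordinal essay scores. The `audit_record` helper and its field names are illustrative, not a required schema.

```python
# Persist enough to reproduce any grade, then check machine-human agreement
# on the sampled set. scikit-learn's cohen_kappa_score supports the
# quadratic weighting typically reported for ordinal essay scores.
from datetime import datetime, timezone
from sklearn.metrics import cohen_kappa_score

def audit_record(essay_id, model, params, rubric_version, prompt_template, raw_json):
    """Build one audit-log entry (store alongside the raw model output)."""
    return {
        "essay_id": essay_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,                      # exact model/version string
        "params": params,                    # e.g., {"temperature": 0.2}
        "rubric_version": rubric_version,
        "prompt_template": prompt_template,  # or a hash of the template
        "raw_output": raw_json,
    }

# Agreement on the human-sampled subset: machine scores vs. human re-scores.
machine_scores = [3, 2, 4, 3, 1, 4, 2]
human_scores   = [3, 2, 3, 3, 2, 4, 2]
qwk = cohen_kappa_score(machine_scores, human_scores, weights="quadratic")
print(f"Quadratic-weighted kappa on the sampled set: {qwk:.2f}")
```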
Known pitfalls (and how to avoid them)
- Assuming “AI = unbiased.” Run group-level fairness checks and error analyses; LLMs can show subgroup differences without careful design and monitoring. (Nature)
- Using AES outside its training domain. Classic engines degrade when prompts drift; budget for re-training and fresh human scoring whenever tasks change. (Digital Commons)
- Letting any machine be the only grader in high-stakes decisions. Maintain human review pathways; this aligns with professional testing guidance. (testingstandards.net)
- Over-editing student voice. Keep feedback diagnostic; require students to revise rather than having the tool rewrite their prose.
- Ignoring privacy terms. Use PTAC model terms to evaluate vendors; update privacy notices when introducing generative AI services. (Protecting Student Privacy, GOV.UK)
- Forgetting the “gaming” lessons of early AES. Keep content-based checks and human sampling to deter formulaic hacks. (ETS)
Cost planning worksheet (quick math)
- Token budgeting: cost ≈ (input_tokens / 1M × input_rate) + (output_tokens / 1M × output_rate); see the sketch after this list.
- Keep outputs concise (150–250 tokens per essay is usually enough for criterion rationales).
- Consider prompt caching (provider-specific) for repeated rubric templates.
- Track latency vs. throughput: batch by course/section; store responses for audits.
- Check current prices on the provider pages (OpenAI, Anthropic, Google). (OpenAI Platform, Anthropic, Google AI for Developers)
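The token-budgeting formula above is trivial to script. A minimal sketch follows; the rates shown are the illustrative Gemini 2.5 Flash-Lite figures from the cost section and will go stale, so always substitute the provider’s current per-million-token prices.

```python
# cost ≈ (input_tokens / 1M × input_rate) + (output_tokens / 1M × output_rate)
def cost_per_essay(input_tokens, output_tokens, input_rate_per_m, output_rate_per_m):
    return (input_tokens / 1_000_000) * input_rate_per_m + (
        output_tokens / 1_000_000
    ) * output_rate_per_m

# Example from this guide: ~1,100 input tokens and ~150 output tokens per essay,
# priced at the illustrative $0.10 / $0.40 per-million-token rates.
per_essay = cost_per_essay(1_100, 150, input_rate_per_m=0.10, output_rate_per_m=0.40)
print(f"~${per_essay:.5f} per essay, ~${per_essay * 5_000:.2f} per 5,000 essays")
```

This reproduces the worked example earlier: roughly $0.00017 per essay and about $0.85 per 5,000 essays.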
Mini case: “Good enough” vs. “High-stakes ready”
Recent studies suggest LLMs can match or approach typical teacher agreement levels on many school essays, especially with clear rubrics—promising for draft feedback and course-scale grading with sampling. But papers and practitioner reports also stress that high-stakes use still demands validation, bias analysis, and human oversight. (The Hechinger Report, ScienceDirect)
AES, AWE, LLMs — side-by-side (detailed)
Dimension | AES (e-rater/IEA class) | AWE (feedback tools) | LLM graders (GPT-5/Claude/Gemini) |
---|---|---|---|
Primary outcome | Score on fixed task | Draft improvement | Score and rationale |
Data needs | High (prompt-specific) | Low-Medium | Very low (rubric few-shot) |
Change tolerance | Low (retrain on prompt change) | High | High (but monitor version drift) |
Gaming risk | Documented historically | Low (not a grader) | Lower than classic AES but still requires QA |
Fairness | Requires dedicated audits | N/A (formative) | Active research; must audit locally |
Where it shines | Large, fixed-prompt programs | Writing centers, iterative courses | Department-scale grading with HITL |
Representative evidence | AES overviews; ETS reports | AWE meta-analyses | 2024–2025 LLM studies & reports |
Compliance work | Validity docs, monitoring | Data protection & transparency | All of the left + strong audit logs |
Sources: histories & overviews; AWE effects; LLM reliability. (Digital Commons, SAGE Journals, Frontiers, The Hechinger Report, ScienceDirect)
References you can cite to reviewers
- AES history & engines: Dikli (2006) overview; ETS e-rater materials; IEA/LSA papers; ETS analysis of “stumping e-rater.” (Digital Commons, ETS, Wake Forest University Image Database)
- AWE effectiveness: Zhai et al. (2023) meta-analysis; Fleckenstein et al. (2023) review; Fan & Ma (2022) systematic review. (SAGE Journals, Frontiers, ERIC)
- LLM reliability/fairness: Pack et al. (2024); ACM 2025 LLM grading study; practitioner coverage of the Tate (AERA 2024) study; GEM workshop 2025 comparative work. (ScienceDirect, ACM Digital Library, The Hechinger Report, ACL Anthology)
- Testing & policy: AERA–APA–NCME Standards (2014); ITC guidelines for technology-based assessment. (testingstandards.net, AERA, intestcom.org)
- Privacy: U.S. DOE PTAC requirements/best practices and model terms; UK DfE guidance on AI and data protection (useful template language). (Protecting Student Privacy, GOV.UK)
Want to try the hybrid approach without building the plumbing?
Try Exam AI Grader — import your rubric, run criterion-level prompts, enable human sampling and appeals, and export an audit log. Start in minutes.