Automated Essay Scoring (AES) vs. AWE vs. LLMs: What Actually Works in 2025



If you’ve heard conflicting claims about robot graders, grammar apps, and “AI teachers,” you’re not alone. This guide explains the three big families you’ll run into:

  • AES (Automated Essay Scoring): train-once scoring engines tuned to a prompt or construct.
  • AWE (Automated Writing Evaluation): tools that give formative feedback to improve drafts.
  • LLM-based scoring/feedback: large language models (GPT-5, Claude 3.5, Gemini 2.5, etc.) configured with rubrics and controls.

You’ll see where each shines, where each breaks, and a decision matrix to choose pragmatically under real institutional constraints.

TL;DR: AES is narrow and stable (best for fixed prompts at scale); AWE is great for formative feedback; LLMs are flexible but require human-in-the-loop, audit logs, and policy guardrails. If you want a turnkey way to run rubric-driven, auditable grading with sampling, you can try Exam AI Grader in minutes.


Definitions & 50-year history (in one page)

  • AES originated with Project Essay Grade (PEG) in the 1960s and was later joined by ETS e-rater, Intelligent Essay Assessor (IEA), and others. Classic AES learns correlations between surface/linguistic features and human scores on a specific task, then scores new responses on that same task. (Digital Commons , ETS , Wake Forest University Image Database )

  • AES can achieve high agreement with trained raters on appropriate tasks, but construct coverage and vulnerability to gaming are longstanding concerns (e.g., inflated scores for length or fancy vocabulary; “stumping” counter-examples). (Digital Commons , ETS )

  • AWE (Automated Writing Evaluation) is about feedback (style, mechanics, organization, revision suggestions). Meta-analyses report meaningful improvements in writing quality when AWE is used well, especially in post-secondary contexts. (SAGE Journals , Frontiers , ERIC )

  • LLMs (2024–2025) changed the landscape: without task-specific training, LLMs can apply rubrics to open-domain essays with human-like scoring—good enough for low-stakes uses—but fairness, drift, and consistency remain active research areas. (ScienceDirect , ACM Digital Library )

Standards backdrop: Whether you use AES, AWE, or LLMs, your implementation should satisfy AERA–APA–NCME Testing Standards (validity, reliability, fairness, transparency) and similar guidance (e.g., ITC). (testingstandards.net , AERA , intestcom.org )


LLM era: what actually changed

  1. Generalization without prompt-specific training. Unlike classic AES (train on hundreds/thousands of scored samples per prompt), LLMs can score with a well-written rubric in zero- or few-shot fashion. That collapses setup costs for many courses. (ACM Digital Library , ScienceDirect )

  2. Richer rationale + structured outputs. LLMs can return criterion-level rationales and evidence pointers in JSON, enabling audits and downstream analytics rather than a single opaque score; see the sketch after this list. (See our HITL workflow: /blog/hitl-grading-workflow.)

  3. But: reliability and fairness vary by model, prompt, and population; exact agreement with trained human raters is still imperfect; models can drift across versions; and bias/fairness reviews are essential. (ScienceDirect , Nature , ACL Anthology )
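To make item 2 concrete, here is a minimal sketch (plain Python, standard library only) of the kind of criterion-level record you might ask an LLM to return. Every field name here (essay_id, rationale, evidence, confidence) is an illustrative assumption, not a standard schema; adapt it to your own rubric.

```python
import json

# Illustrative example of a rubric-aligned, criterion-level grading record.
# Field names are assumptions for this sketch, not a required format.
example_output = json.loads("""
{
  "essay_id": "essay-0421",
  "rubric_version": "fyw-rubric-v3",
  "model": "provider-model-name-and-version",
  "criteria": [
    {
      "name": "Thesis & argument",
      "score": 3,
      "max_score": 4,
      "rationale": "Clear thesis, but the counterargument is never addressed.",
      "evidence": ["para 1, sentence 2", "para 4"]
    },
    {
      "name": "Use of evidence",
      "score": 2,
      "max_score": 4,
      "rationale": "Two sources cited; neither is analyzed beyond summary.",
      "evidence": ["para 3"]
    }
  ],
  "overall_score": 5,
  "confidence": 0.72
}
""")

# Sanity checks before a record enters the gradebook or audit log.
assert all(0 <= c["score"] <= c["max_score"] for c in example_output["criteria"])
assert example_output["overall_score"] <= sum(c["max_score"] for c in example_output["criteria"])
```

Because the output is structured, you can validate it automatically, aggregate it for analytics, and store it verbatim in an audit log (more on that in the hybrid stack below).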


Reliability & validity (trade-offs you can’t ignore)

  • AES (classic): Strong reliability within its trained domain; transparent feature sets (e.g., linguistic indicators) but narrow construct; known adversarial cases (nonsense yet high scores) demand human oversight in high-stakes use. (ETS , Digital Commons )

  • AWE: The goal is learning gains, not summative reliability. Meta-analyses show large positive effects on writing quality when AWE feedback complements instruction. Use rubrics and exemplars to prevent “grammar-only” tunnel vision. (SAGE Journals , Frontiers , ERIC )

  • LLMs: Recent studies find fair to moderate agreement with expert raters in K-12 datasets and competitive reliability in higher-ed samples—when rubric instructions are explicit. Treat as decision support unless you’ve validated on your population. (ScienceDirect , ACM Digital Library , The Hechinger Report )

Policy lens: The Testing Standards expect you to document intended use, construct coverage, validation evidence, error analyses, and fairness checks. If you can’t show this, don’t deploy the system for high-stakes grading. (testingstandards.net , AERA )


Cost, latency, privacy, and policy—practical constraints

Cost (as of Aug 20, 2025)

Public API prices make large-scale, low-stakes LLM scoring surprisingly inexpensive:

  • OpenAI (current pricing page): GPT-4o/4.1 and newer families list token-based rates; see provider page for latest. (OpenAI Platform )
  • Anthropic (Claude 3.5 Sonnet): ~$3/M input tokens, $15/M output tokens. (Anthropic )
  • Google (Gemini 2.5): e.g., Flash-Lite at ~$0.10/M input and $0.40/M output; Flash around $0.30/M input and $2.50/M output. (Google AI for Developers )

Example: ~800-word essay (~1.1k input tokens) with ~150-token output:

  • Gemini 2.5 Flash-Lite: ≈ **$0.00017** per essay (≈ $0.85 per 5,000).
  • Claude 3.5 Sonnet: ≈ **$0.00555** per essay (≈ $27.75 per 5,000).

(Your mileage varies: model, caching, and output length matter; always check the provider’s current pricing page.) (OpenAI Platform , Anthropic , Google AI for Developers )
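If you want to sanity-check those figures or plug in your own numbers, the arithmetic is just the token-budget formula repeated in the worksheet later in this post. A minimal Python sketch, using the example rates quoted above (swap in current provider pricing before relying on it):

```python
# Per-essay cost from token counts and $/million-token rates.
# Rates below are the example figures quoted in this post, not live pricing.
def essay_cost(input_tokens, output_tokens, input_rate_per_m, output_rate_per_m):
    return (input_tokens / 1_000_000) * input_rate_per_m \
         + (output_tokens / 1_000_000) * output_rate_per_m

flash_lite = essay_cost(1_100, 150, 0.10, 0.40)   # ≈ $0.00017
sonnet     = essay_cost(1_100, 150, 3.00, 15.00)  # ≈ $0.00555

print(f"Gemini 2.5 Flash-Lite: ${flash_lite:.5f}/essay, ${flash_lite * 5000:.2f} per 5,000 essays")
print(f"Claude 3.5 Sonnet:     ${sonnet:.5f}/essay, ${sonnet * 5000:.2f} per 5,000 essays")
```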

Classic AES has near-zero marginal cost once a model is trained, but setup costs (curating hundreds/thousands of scored essays per prompt) and maintenance (re-training when prompts change) can be substantial. (Digital Commons )

Privacy & policy

  • In the U.S., FERPA governs disclosure and handling of student education records; the Department of Education’s PTAC provides model terms, requirements, and best practices for online services—key if you use third-party AI. (Protecting Student Privacy , Protecting Student Privacy )
  • Schools must disclose how generative AI tools collect/process data and update privacy notices accordingly (see UK guidance as a useful reference). (GOV.UK )
  • Beyond privacy, ensure fairness reviews and bias monitoring; recent work highlights equity risks in zero-shot LLM scoring. (Nature , ScienceDirect )

Comparison matrix (quick scan)

| Approach | What it is | Strengths | Weaknesses | Best for | Setup time | Ongoing cost | Reliability (task-fit) | Feedback quality | Policy fit |
|---|---|---|---|---|---|---|---|---|---|
| AES | Prompt-trained scorer (e.g., e-rater/IEA) | Fast, consistent within domain; low marginal cost | Narrow construct; re-training when prompts change; adversarial cases documented | Fixed prompts at very large scale (placement, standardized programs) | High (data curation/training) | Low | High if validated for that prompt | Limited (score + brief features) | Strong if you meet Standards & monitor validity |
| AWE | Feedback engine for revision | Improves writing quality over time; scalable feedback | Not a summative grader; quality varies by tool/config | Formative cycles, writing labs, drafts | Low–Medium | Low–Medium | N/A (aim is learning) | Rich, actionable feedback | Strong for formative use with teacher oversight |
| LLMs | Rubric-driven, general models | Flexible across prompts; rich rationales; structured output (JSON) | Version drift; fairness; exact agreement not guaranteed; requires HITL | Course-level grading with sampling, or rapid feedback drafts | Very low (rubric + exemplars) | Low per essay | Moderate out-of-the-box; improves with calibration | High (criterion rationales) | Strong if audited + privacy terms in place |

Sources: AES/AWE history & effects; LLM reliability/fairness; Testing Standards. (Digital Commons , SAGE Journals , Frontiers , ScienceDirect , testingstandards.net )


Decision tree (choose with your constraints)

Start
├─ Is this HIGH-STAKES (placement/certification) or program accountability?
│  ├─ Yes → Do you have prompt-specific scored data (≥1k essays) and capacity to validate bias?
│  │  ├─ Yes → AES or hybrid (AES primary + human double-score sample) → Annual validity study.
│  │  └─ No → LLM with strict HITL (double-score borderlines) or postpone automation.
│  └─ No (low/medium stakes) →
│     ├─ Is your goal learning gains on drafts? → AWE or LLM-as-feedback.
│     └─ Is your goal consistent grading with transparency at course scale?
│        → LLM + rubric JSON + 10–20% human sampling + appeals workflow.
└─ Any path → Log model/version/prompts; run fairness checks; align with Testing Standards.

See our HITL template: /blog/hitl-grading-workflow.


When to pick which (by use case)

  • First-year writing, multiple sections, shared rubric (low/medium-stakes): LLM with criterion-by-criterion prompts + 10–20% human sampling (oversample low-confidence) + appeals. Faster than pure manual, more flexible than AES. (The Hechinger Report )

  • Standardized placement writing, same prompt every term (high-stakes): AES (if you can assemble/maintain a prompt-specific training set and run bias audits). Maintain a human read-behind sample and periodic validity studies per Standards. (Digital Commons , testingstandards.net )

  • Writing center / iterative drafts: AWE or LLM as a feedback coach. Evidence for learning gains is strongest here; pair with explicit revision plans. (SAGE Journals , Frontiers )

  • Mixed modalities (short answers + essays): Combine LLM scoring for open responses with rule-based autoscoring for short answers; keep manual review for edge cases. (Taylor & Francis Online )
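As a rough illustration of that last pattern, here is a hedged Python sketch of the routing logic. The keyword rule, the confidence threshold, and the `llm_score` callable are placeholders you would replace with your own autoscorer and rubric-driven grading pipeline.

```python
def keyword_match(answer: str, required_terms: list[str]) -> bool:
    """Toy rule-based check for short answers: all required terms present."""
    text = answer.lower()
    return all(term.lower() in text for term in required_terms)

def route_response(item_type: str, answer: str, required_terms=None,
                   llm_score=None, confidence_floor=0.7):
    """Send each response to rule-based scoring, LLM scoring, or human review."""
    if item_type == "short_answer":
        # Deterministic autoscoring; anything ambiguous falls through to a human.
        if required_terms and keyword_match(answer, required_terms):
            return {"route": "auto", "score": 1}
        return {"route": "human_review", "score": None}

    # Essays: LLM pass first, then flag low-confidence results for human review.
    result = llm_score(answer)  # placeholder for your rubric-driven LLM call
    if result["confidence"] < confidence_floor:
        return {"route": "human_review", "score": result["score"]}
    return {"route": "llm", "score": result["score"]}
```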


Why hybrids win in practice

A practical 2025 stack:

  1. AWE for drafts (students get immediate, criterion-mapped suggestions). (SAGE Journals )
  2. LLM pass for rubric-aligned JSON per criterion (low temperature; model/version logged). (Related prompts: /blog/ai-essay-feedback-prompts.)
  3. QA layer: auto-flags for plagiarism suspicions, cap rules (e.g., no identifiable thesis ⇒ cap the overall score), and unusual length/outlier detection.
  4. Human sampling: 10–20% review with oversampling of edge cases and low-confidence predictions; appeals workflow.
  5. Auditability: store rubric version, prompt template, model name, parameters, and raw JSON; recompute inter-rater reliability (e.g., κ) on the sampled set. Follow Testing Standards documentation. (testingstandards.net )
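A minimal sketch of steps 4–5, assuming you keep per-essay JSON results and use scikit-learn for the agreement statistic (quadratic-weighted κ is a common choice for ordinal essay scores). The field names and the 15% sampling rate are illustrative assumptions:

```python
import json
import random
from sklearn.metrics import cohen_kappa_score

def sample_for_review(results, rate=0.15, confidence_floor=0.7, seed=42):
    """All low-confidence essays plus a random slice of the rest go to humans."""
    low_conf = [r for r in results if r["confidence"] < confidence_floor]
    rest = [r for r in results if r["confidence"] >= confidence_floor]
    random.seed(seed)
    extra = random.sample(rest, k=min(len(rest), max(1, int(rate * len(rest))))) if rest else []
    return low_conf + extra

def audit_record(result, rubric_version, prompt_template_id, model, params):
    """Everything needed to reproduce, audit, or appeal a grade."""
    return {
        "essay_id": result["essay_id"],
        "rubric_version": rubric_version,
        "prompt_template": prompt_template_id,
        "model": model,          # provider + exact model version string
        "parameters": params,    # temperature, max output tokens, etc.
        "raw_output": json.dumps(result),
    }

def sampled_agreement(human_scores, model_scores):
    """Quadratic-weighted kappa between human and model scores on the sample."""
    return cohen_kappa_score(human_scores, model_scores, weights="quadratic")
```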

Known pitfalls (and how to avoid them)

  • Assuming “AI = unbiased.” Run group-level fairness checks and error analyses; LLMs can show subgroup differences without careful design and monitoring. (Nature )

  • Using AES outside its training domain. Classic engines degrade when prompts drift; budget for re-training and fresh human scoring whenever tasks change. (Digital Commons )

  • Letting any machine be the only grader in high-stakes decisions. Maintain human review pathways; this aligns with professional testing guidance. (testingstandards.net )

  • Over-editing student voice. Keep feedback diagnostic; require students to revise rather than replacing their prose.

  • Ignoring privacy terms. Use PTAC model terms to evaluate vendors; update privacy notices when introducing generative AI services. (Protecting Student Privacy , GOV.UK )

  • Forgetting the “gaming” lessons of early AES. Keep content-based checks and human sampling to deter formulaic hacks. (ETS )


Cost planning worksheet (quick math)

  • Token budgeting: cost ≈ (input_tokens/1M × input_rate) + (output_tokens/1M × output_rate).
  • Keep outputs concise (150–250 tokens per essay for criterion rationales is usually enough).
  • Consider prompt caching (provider-specific) for repeated rubric templates.
  • Track latency vs. throughput: batch by course/section; store responses for audits.
  • Check current prices on provider pages (OpenAI, Anthropic, Google). (OpenAI Platform , Anthropic , Google AI for Developers )

Mini case: “Good enough” vs. “High-stakes ready”

Recent studies suggest LLMs can match or approach typical teacher agreement levels on many school essays, especially with clear rubrics—promising for draft feedback and course-scale grading with sampling. But papers and practitioner reports also stress that high-stakes use still demands validation, bias analysis, and human oversight. (The Hechinger Report , ScienceDirect )


AES, AWE, LLMs — side-by-side (detailed)

| Dimension | AES (e-rater/IEA class) | AWE (feedback tools) | LLM graders (GPT-5/Claude/Gemini) |
|---|---|---|---|
| Primary outcome | Score on fixed task | Draft improvement | Score and rationale |
| Data needs | High (prompt-specific) | Low–Medium | Very low (rubric few-shot) |
| Change tolerance | Low (retrain on prompt change) | High | High (but monitor version drift) |
| Gaming risk | Documented historically | Low (not a grader) | Lower than classic AES but still requires QA |
| Fairness | Requires dedicated audits | N/A (formative) | Active research; must audit locally |
| Where it shines | Large, fixed-prompt programs | Writing centers, iterative courses | Department-scale grading with HITL |
| Representative evidence | AES overviews; ETS reports | AWE meta-analyses | 2024–2025 LLM studies & reports |
| Compliance work | Validity docs, monitoring | Data protection & transparency | All of the left + strong audit logs |

Sources: histories & overviews; AWE effects; LLM reliability. (Digital Commons , SAGE Journals , Frontiers , The Hechinger Report , ScienceDirect )


References you can cite to reviewers

  • AES history & engines: Dikli (2006) overview; ETS e-rater materials; IEA/LSA papers; ETS analysis of “stumping e-rater.” (Digital Commons , ETS , Wake Forest University Image Database )
  • AWE effectiveness: Zhai et al. (2023) meta-analysis; Fleckenstein et al. (2023) review; Fan & Ma (2022) systematic review. (SAGE Journals , Frontiers , ERIC )
  • LLM reliability/fairness: Pack et al. (2024); ACM 2025 LLM grading study; practitioner coverage of Tate (AERA 2024) study; GEM workshop 2025 comparative work. (ScienceDirect , ACM Digital Library , The Hechinger Report , ACL Anthology )
  • Testing & policy: AERA–APA–NCME Standards (2014); ITC guidelines for technology-based assessment. (testingstandards.net , AERA , intestcom.org )
  • Privacy: U.S. DOE PTAC requirements/best practices and model terms; UK DfE guidance on AI and data protection (useful template language). (Protecting Student Privacy , GOV.UK )

Want to try the hybrid approach without building the plumbing?

Try Exam AI Grader — import your rubric, run criterion-level prompts, enable human sampling and appeals, and export an audit log. Start in minutes.

