
Grading STEM Explanations, Proof Sketches & Diagrams with AI
- Grade the reasoning, not just results. Use analytic criteria (Correctness, Method, Communication, Units/Notation) with explicit descriptors and exemplars.
- Partial credit should be principled. Treat each criterion as ordered levels and sum them—this aligns with well-established partial credit models in psychometrics.
- Handle symbols & diagrams via OCR + multimodal LLMs. Convert math to LaTeX when possible; let vision-capable LLMs read figures, graphs, and handwritten work. docs.mathpix.com
- Human-in-the-loop is non-negotiable for high-stakes. Follow the AERA–APA–NCME Standards: validate locally, log versions, and sample-review edge cases. testingstandards.net
- Start quickly: upload scanned responses to Exam AI Grader, attach your rubric+exemplars, and enable 10–20% human sampling with appeals.
What “good” STEM explanations look like
Across physics, math, engineering, and chemistry, high-quality open-ended responses share three things:
- Mathematical/Scientific Correctness — claims and computations are valid; assumptions are stated.
- Method & Reasoning — appropriate approach is selected, steps are justified, and alternatives considered.
- Communication — legible layout, symbolic conventions, diagrams/axes/units labeled, conclusions stated.
These align with widely cited math-teaching guidance emphasizing evidence of student thinking and explicit communication, not just answers. Taylor & Francis Online
A minimal, reusable analytic rubric (4 criteria × 4 levels)
Criterion | 3 — Exemplary | 2 — Proficient | 1 — Emerging | 0 — Incorrect/Insufficient |
---|---|---|---|---|
Correctness | All claims/values correct; assumptions explicit | Minor arithmetic slip; core claims correct | Partially correct; major gap | Incorrect or unjustified |
Method | Appropriate method; steps fully justified | Appropriate method; minor omissions | Partially appropriate; unclear justifications | Inappropriate or missing method |
Communication | Clear structure; variables defined; diagram/graph labeled; conclusion stated | Mostly clear; minor labeling/structure issues | Hard to follow; missing labels or conclusion | Illegible/disorganized |
Units/Notation | SI units consistent; symbols standard | One minor unit/notation issue | Several issues | Units absent/wrong; nonstandard symbols |
Tip: Keep criteria independent (e.g., unit mistakes don’t double-penalize correctness). When you later analyze reliability, independence makes interpretation saner. See Standards’ emphasis on construct clarity and documentation. testingstandards.net
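If you attach the rubric to prompts or logs programmatically, keep a machine-readable copy alongside the table. Here is a minimal Python sketch of the rubric above; the variable names, nesting, and the `rubric_as_text` helper are illustrative conventions, not a required schema.

```python
# Machine-readable copy of the analytic rubric above (4 criteria x 4 ordered levels).
# The structure and names here are illustrative, not a fixed schema.
RUBRIC = {
    "version": "v2.1",
    "criteria": {
        "Correctness": {
            3: "All claims/values correct; assumptions explicit",
            2: "Minor arithmetic slip; core claims correct",
            1: "Partially correct; major gap",
            0: "Incorrect or unjustified",
        },
        "Method": {
            3: "Appropriate method; steps fully justified",
            2: "Appropriate method; minor omissions",
            1: "Partially appropriate; unclear justifications",
            0: "Inappropriate or missing method",
        },
        "Communication": {
            3: "Clear structure; variables defined; diagram/graph labeled; conclusion stated",
            2: "Mostly clear; minor labeling/structure issues",
            1: "Hard to follow; missing labels or conclusion",
            0: "Illegible/disorganized",
        },
        "Units/Notation": {
            3: "SI units consistent; symbols standard",
            2: "One minor unit/notation issue",
            1: "Several issues",
            0: "Units absent/wrong; nonstandard symbols",
        },
    },
}

def rubric_as_text(rubric: dict) -> str:
    """Render the rubric as plain text for inclusion in a grading prompt."""
    lines = []
    for criterion, levels in rubric["criteria"].items():
        lines.append(criterion)
        for level in sorted(levels, reverse=True):
            lines.append(f"  {level}: {levels[level]}")
    return "\n".join(lines)
```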
Criteria that reward reasoning, method, and communication
Analytic (criterion-by-criterion) scoring outperforms one holistic score when you care about how students solved problems, not only whether they did. This mirrors best practices in math education (“elicit and use evidence of student thinking”). Taylor & Francis Online
For STEM short answers, strong evidence indicates automated scoring can work well when criteria are explicit and examples are provided. Recent overviews of automated short-answer scoring (SAS) and LLM-assisted SAS report promising agreement with human raters and improved feedback granularity. PMC, arXiv
Handling symbolic notation & diagrams (OCR + attachments)
1) Convert symbols to LaTeX whenever possible
- Use a STEM-aware OCR tool to turn photos/PDFs into LaTeX/MathML, preserving tables and equations; developer docs confirm support for printed and handwritten STEM content. This reduces hallucinations and makes evidence strings easy to show in rationales (see the request sketch after this list). docs.mathpix.com
- If your scans are messy, pre-process: crop margins, ensure high contrast, avoid shadowed photos, and scan pages flat (phone scanners with “document” mode work surprisingly well).
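To make step 1 concrete, here is a minimal sketch of uploading one scanned page to a STEM-aware OCR service. The endpoint URL, auth header, and response fields are placeholders; substitute whatever your provider (e.g., Mathpix) actually documents.

```python
# Minimal OCR sketch: send one page image, keep the returned LaTeX/MathML next to
# the original scan. Endpoint, auth header, and response fields are PLACEHOLDERS --
# replace them with your OCR provider's documented API.
import os
import requests

OCR_URL = os.environ.get("STEM_OCR_URL", "https://api.example-ocr.com/v1/convert")  # placeholder
OCR_KEY = os.environ["STEM_OCR_API_KEY"]  # assumed credential variable

def ocr_page_to_latex(image_path: str) -> dict:
    """Upload one page image and return the provider's JSON (assumed to contain LaTeX)."""
    with open(image_path, "rb") as f:
        resp = requests.post(
            OCR_URL,
            headers={"Authorization": f"Bearer {OCR_KEY}"},  # auth scheme varies by provider
            files={"file": f},
            timeout=60,
        )
    resp.raise_for_status()
    return resp.json()

# Example usage:
# result = ocr_page_to_latex("scans/S12345_page1.png")
# print(result.get("latex") or result.get("text"))  # field names depend on the provider
```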
2) Let multimodal LLMs read diagrams/handwriting
Vision-capable models can interpret graphs, lab plots, free-body diagrams, and scribbled derivations from images. Use them to extract captions/axes/units and to cross-check textual reasoning. Anthropic
Reality check: diagram and handwritten math understanding remain non-trivial research areas. The CROHME competitions summarize progress and gaps in handwritten math recognition; newer multimodal science QA datasets (e.g., ScienceQA) also highlight diagram reasoning challenges. Don’t skip human sampling. OpenAI Platform
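As a sketch of what "reading" a diagram looks like in practice, the snippet below sends one figure to a vision-capable model and asks for axes, units, and labels before any scoring happens. It uses the Anthropic Python SDK's Messages API as one example; the model id is a placeholder, and any multimodal API with image input works the same way.

```python
# Ask a vision-capable model to describe a figure (axes, units, labels) before scoring.
# Uses the Anthropic Messages API as one example; the model id is a placeholder.
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def describe_figure(image_path: str) -> str:
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode("utf-8")
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model id
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/png", "data": img_b64}},
                {"type": "text",
                 "text": "List the axes, units, labels, and any vectors or annotations in this figure. "
                         "Flag anything that is unlabeled or ambiguous."},
            ],
        }],
    )
    return msg.content[0].text
```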
Partial credit strategies—with concrete examples
“Partial credit” should be more than gut feel. Treat each criterion as ordered categories (e.g., 0→3). This aligns with the Partial Credit Model (PCM) and related polytomous IRT models, which assume ordered score steps and let you analyze item functioning and reliability over time. SpringerLink
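For reference, Masters' PCM writes the probability that person \(n\) reaches level \(x\) on criterion \(i\) (with maximum level \(m_i\)) in terms of ability \(\theta_n\) and step difficulties \(\delta_{ik}\):

\[
\Pr(X_{ni}=x \mid \theta_n) = \frac{\exp\left[\sum_{k=0}^{x}(\theta_n-\delta_{ik})\right]}{\sum_{h=0}^{m_i}\exp\left[\sum_{k=0}^{h}(\theta_n-\delta_{ik})\right]}, \qquad x = 0, 1, \dots, m_i,
\]

with the convention that the \(k=0\) term is zero. Each \(\delta_{ik}\) is the difficulty of the step from level \(k-1\) to level \(k\), which is exactly the ordered-levels structure the rubric above encodes.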
Example A (Physics): Constant-acceleration kinematics
Prompt: “A ball is thrown upward at \(v_0\). Derive \(t_\text{peak}\) and \(h_\text{max}\) ignoring drag.”
Partial-credit matrix (excerpt; C = Correctness, M = Method, U = Units/Notation; Communication is scored separately on the full write-up):
Evidence | Score |
---|---|
Correct equations of motion; derives \(t_\text{peak}=v_0/g\) and \(h_\text{max}=v_0^2/(2g)\); defines variables; units correct | C3 M3 U3 |
Correct method but algebra slip (e.g., \(v_0^2/g\)) with clear setup | C2 M3 U2 |
Chooses energy or kinematics appropriately but omits a justification step; unlabeled axes in sketch | C2 M2 U1 |
Uses wrong equation (e.g., no constant-a assumption) or no units | C0–1 M0–1 U0 |
Sum across criteria (Communication included) to produce the final score. This is transparent for students and auditable for you (and aligns with Standards’ documentation expectations). testingstandards.net
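A tiny helper makes the summing rule explicit and gives you one place to enforce rubric caps (such as the answer-only cap discussed under pitfalls below). Purely illustrative; the function name and cap rule are assumptions you should adapt to your own rubric.

```python
# Sum ordered criterion levels (0-3 each) into a final score, with an explicit
# cap for answer-only work. Illustrative helper; adapt the cap rule to your rubric.
def final_score(levels: dict[str, int], answer_only: bool = False) -> int:
    """levels maps criterion name -> level, e.g. {"Correctness": 2, "Method": 3, ...}."""
    capped = dict(levels)
    if answer_only:
        # Stated rubric rule: answer-only responses cap Correctness at 1.
        capped["Correctness"] = min(capped.get("Correctness", 0), 1)
    return sum(capped.values())

# Matches the example JSON later in this post: 2 + 3 + 2 + 2 = 9
print(final_score({"Correctness": 2, "Method": 3, "Communication": 2, "Units/Notation": 2}))
```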
Example B (Math): Proof sketch (limit of a polynomial is continuous)
Look-fors: identifies an \(\epsilon\)–\(\delta\) argument or standard continuity facts (sums and products of continuous functions are continuous); structures the argument; uses standard notation.
Partial credit notes: Give credit for valid structure even if a bound is loose; deduct for missing quantifiers or undefined symbols; separate “method” from “correctness.”
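For calibration it helps to keep a full-credit exemplar next to the look-fors. One compact example of what a top-level sketch might look like (illustrative; other valid structures exist): Let \(p(x)=\sum_{k=0}^{n} c_k x^k\) and fix \(a\in\mathbb{R}\). Constant functions and the identity \(x\mapsto x\) are continuous, and sums and products of continuous functions are continuous, so each term \(c_k x^k\), and hence \(p\), is continuous at \(a\); therefore \(\lim_{x\to a}p(x)=p(a)\). (An \(\epsilon\)–\(\delta\) route also earns full credit: bound \(|p(x)-p(a)|\le\sum_k |c_k|\,|x^k-a^k|\) and control \(|x^k-a^k|\) for \(|x-a|<1\).)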
Common STEM pitfalls (and how your rubric catches them)
- Unit errors and missing labels. Use an explicit Units/Notation criterion; align to SI guidance for symbol use/spacing to avoid ambiguous work. docs.mathpix.com
- Skipped justification steps. Reserve points for stating the theorem/assumption used (e.g., constant acceleration, conservation of energy).
- Diagram ambiguity. Require labeled axes, scales, and directed vectors; multi-image questions should reference each figure.
- Symbol reuse / undefined variables. Deduct under Communication—not under Correctness—to avoid double penalties.
- Answer-only responses. Cap Correctness if reasoning is absent (make that cap explicit in the rubric).
Workflow: from paper to scored JSON (HITL included)
- Digitize submissions. Batch-scan or photograph pages cleanly; for math-heavy work, convert to LaTeX where feasible to reduce transcription ambiguity. docs.mathpix.com
- Attach rubric + exemplars. Provide at least one full-credit and one borderline example per criterion.
- Run multimodal scoring. Use rubric-guided prompts that ask for criterion-by-criterion ratings and short rationales (see the prompt sketch after this list). Vision models can read embedded plots/tables directly. Anthropic
- Quality-assure. Auto-flag low confidence, outliers, and rubric caps; oversample these for human review.
- Log for audits. Save model name/version, prompt, rubric hash, and raw JSON per response; compute inter-rater reliability (e.g., weighted kappa for ordinal criteria) on your human-review sample. UWO Personal Websites
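Here is a compact sketch of step 3 that produces records in the shape shown in the JSON example below. The SDK, model id, and prompt wording are placeholders; in practice you would reuse the rubric text and OCR transcript from the sketches earlier in this post and validate the returned JSON before trusting it.

```python
# Rubric-guided scoring sketch: ask for criterion-by-criterion levels plus short
# rationales as strict JSON, then assemble an auditable record.
import json
import anthropic  # any LLM SDK works; Anthropic shown as one example

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-20250514"  # placeholder model id

def score_response(item_id: str, student_id: str, transcript: str, rubric_text: str) -> dict:
    prompt = (
        "You are grading a STEM short answer against an analytic rubric.\n\n"
        f"RUBRIC:\n{rubric_text}\n\n"
        f"STUDENT WORK (transcribed):\n{transcript}\n\n"
        'Return ONLY JSON of the form {"scores": [{"criterion": "...", "level": 0, "rationale": "..."}]}.'
    )
    msg = client.messages.create(model=MODEL, max_tokens=800,
                                 messages=[{"role": "user", "content": prompt}])
    parsed = json.loads(msg.content[0].text)  # in production: schema-validate and retry on failure
    return {
        "item_id": item_id,
        "student_id": student_id,
        "model": MODEL,
        "rubric_version": "v2.1",
        "scores": parsed["scores"],
        "final": sum(s["level"] for s in parsed["scores"]),
    }
```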
Example JSON (per response, per criterion)
```json
{
  "item_id": "PHYS-101-Q3",
  "student_id": "S12345",
  "model": "vision-LLM-x.y",
  "rubric_version": "v2.1",
  "scores": [
    {"criterion": "Correctness", "level": 2, "rationale": "Setup correct; algebra slip in h_max."},
    {"criterion": "Method", "level": 3, "rationale": "Selected energy method; steps justified."},
    {"criterion": "Communication", "level": 2, "rationale": "Axes labeled; conclusion terse."},
    {"criterion": "Units/Notation", "level": 2, "rationale": "One unit mismatch in intermediate step."}
  ],
  "final": 9
}
```
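One simple way (among many) to satisfy the logging step is an append-only JSONL file that stores each record together with a hash of the rubric text, so a later audit can confirm exactly which rubric was in force. Path and field names are illustrative.

```python
# Append each scored record to a JSONL audit log, plus a hash of the rubric text
# so reviews can confirm which rubric version was in force. Illustrative only.
import hashlib
import json

def log_record(record: dict, rubric_text: str, path: str = "audit_log.jsonl") -> None:
    entry = dict(record)
    entry["rubric_sha256"] = hashlib.sha256(rubric_text.encode("utf-8")).hexdigest()
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```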
Why this is valid & defensible (Standards + evidence)
- Automated short-answer scoring is mature enough to support decision-support workflows in STEM when rubrics and exemplars are explicit; recent work with LLMs shows high correlation with human scoring in authentic settings. Taylor & Francis Online, PMC, arXiv
- Partial-credit modeling (PCM/GPCM) provides a theoretical backbone for criterion levels and for monitoring item functioning over time. SpringerLink, University Digital Conservancy
- Standards (AERA–APA–NCME) require documenting intended uses, validation, fairness checks, and transparency—match your deployment to these expectations, especially for high-stakes. testingstandards.net, AERA
Worked assets you can reuse
- Annotated solution pack (PDF): For each problem type (algebraic derivation, free-body diagram, proof sketch), include annotated Exemplary/Proficient/Borderline/Insufficient responses.
- Partial-credit matrix (CSV): One row per evidence pattern; columns = criteria levels + notes.
- Calibration set: 30–50 responses spanning the score range; double-scored by two instructors.
Example: partial-credit matrix (CSV snippet)
```csv
item_id,criterion,level,descriptor,evidence
PHYS-101-Q3,Correctness,3,"All expressions and values correct, assumptions explicit","t_peak=v0/g and h_max=v0^2/(2g); states g>0 downward"
PHYS-101-Q3,Correctness,2,"Core expressions correct; minor slip","Transcription sign error but final fix noted"
PHYS-101-Q3,Method,3,"Appropriate method; steps justified","Energy or kinematics applied with all steps shown"
PHYS-101-Q3,Communication,2,"Mostly clear; minor labeling issues","Diagram arrows not oriented but axes labeled"
PHYS-101-Q3,Units/Notation,1,"Several issues","Mixed m/s and km/h in same derivation"
```
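If you keep the matrix as CSV, a few lines of Python turn it into a lookup you can attach to prompts or show in rationales. The filename is an assumption; the column names match the snippet above.

```python
# Load the partial-credit matrix into a lookup keyed by (item_id, criterion) -> level.
# Column names match the CSV snippet above; the filename is an assumption.
import csv
from collections import defaultdict

matrix = defaultdict(dict)
with open("partial_credit_matrix.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        matrix[(row["item_id"], row["criterion"])][int(row["level"])] = {
            "descriptor": row["descriptor"],
            "evidence": row["evidence"],
        }

# e.g., matrix[("PHYS-101-Q3", "Correctness")][3]["descriptor"]
```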
Practical tips for LaTeX, figures, and scans
- Prefer one page per image (no collages).
- Ensure diagrams have titles, labeled axes, units, and readable scales.
- For heavy math, a pass through a STEM-oriented OCR (e.g., to LaTeX/MathML) improves downstream scoring and makes rationales referenceable. docs.mathpix.com
Policy & fairness
- Disclosure & auditability: Log prompts, model versions, rubric versions, and rationales.
- Local validation: Compare model scores with trained raters on your population; compute weighted kappa (linear/quadratic) for ordinal criteria (see the sketch after this list). UWO Personal Websites
- Equity: Check subgroup error patterns; revisit criteria language if you see systematic differences.
- Governance: Align deployment to Standards (intended use, evidence, error analyses, fairness). testingstandards.net
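For the local-validation step above, scikit-learn's cohen_kappa_score computes weighted kappa directly; the level vectors below are made-up placeholders standing in for your double-scored sample.

```python
# Quadratic-weighted kappa between model and human levels on the double-scored sample.
from sklearn.metrics import cohen_kappa_score

model_levels = [3, 2, 2, 1, 0, 3, 2, 1]   # e.g., "Correctness" levels from the model
human_levels = [3, 2, 1, 1, 0, 3, 3, 1]   # same responses scored by a trained rater
kappa = cohen_kappa_score(model_levels, human_levels, weights="quadratic")
print(f"quadratic-weighted kappa = {kappa:.2f}")
```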
FAQ
1) Can AI grade proofs or only numeric answers? Yes—if your rubric rewards method and communication, AI can score structured proof sketches and derivations with criterion-level rationales. Keep exemplars handy and sample borderline cases for human review. Evidence from short-answer scoring studies supports rubric-guided LLM scoring as decision support. Taylor & Francis Online, arXiv
2) How do I deal with messy handwriting and symbols? Run images through a STEM-aware OCR to produce LaTeX/MathML where possible; otherwise use a vision-capable LLM and require references to labeled diagram elements/variables in the rationale. docs.mathpix.com, OpenAI Platform
3) What’s a defensible way to assign partial credit? Use ordered levels per criterion (0–3). This matches the Partial Credit Model family used in measurement theory and keeps scoring consistent across items and semesters. SpringerLink
4) How do I ensure reliability and fairness? Double-score a sample (10–20%), compute weighted kappa, and run subgroup error checks. Keep an appeals path. These steps align with testing Standards. UWO Personal Websites, testingstandards.net
5) Can AI read graphs and lab plots? Modern multimodal models accept images and can interpret charts/graphs; still, require clear axes/units and keep human sampling for ambiguous plots. Anthropic
6) Should I cap scores for “answer-only” work? Yes—state explicit caps (e.g., Correctness ≤1 if no method shown). This deters “answer dumping” and reinforces reasoning & communication.
7) Where should I start if I’m new to this? Begin with our short-answer guide for rubric prompts and JSON schemas, then try a small pilot with 1–2 assignments: /blog/ai-short-answer-grading-guide.
Ready to Transform Your Grading Process?
Experience the power of AI-driven exam grading with human oversight. Get consistent, fast, and reliable assessment results.
Try AI Grader
© 2025 AI Grader. All rights reserved.