AI in K-12 Schools: Learning in the Age of Free Outputs

Stress Test | 2026-04-09

Core pattern: AI made producing school outputs nearly free. Schools responded to the symptom - the output - not the disease: what happens to learning when producing the output no longer requires it.

Claim: When AI makes school outputs cheap without changing how students are assessed, schools drift toward compliance theater while the students who most need real skill-building pay the highest price.

AI made it easy to produce schoolwork without doing the thinking the work was supposed to require. Bans and detectors did not solve that problem, and most schools still lack the time, training, and assessment redesign needed to measure learning directly.

Evidence level: Medium | Event window: 2023-01-01 to 2026-04-09

Receipts: tracked in Methods and Sources by type: Primary documents | Official data | Independent analysis

What they did

Starting in January 2023, major US school districts banned AI tools on school networks. NYC announced its ban in January 2023, declared it a success, and reversed course four months later. Other large districts moved through similar cycles. By spring 2025, 54% of students and 53% of core subject teachers were using AI for schoolwork regardless of what their district’s policy said. The modal response across US K-12 is now: no coherent policy at all.


Why it worked (or didn’t)

Schools are built around outputs - essays, problem sets, reports. For most of education’s history, that was fine. Producing a plausible essay required the student to construct an argument. Constructing an argument built the skill. Output and learning were the same activity.

AI broke that chain. Producing the output no longer requires the cognitive work. A student who prompts AI to write the essay gets a B on the assignment and skips the part that builds the skill.

Schools responded to the output problem - bans, detection tools, honor code updates. They did not respond to the learning problem, because that requires something harder: redesigning how assessment works so that demonstrated knowledge cannot be outsourced.

The ban-only approach also collided with a structural fact: AI is on every personal phone, accessible at home and off school networks. Substack reporting on USC Understanding America Study panel data (Psychoftech, November 2025) found that students in schools with complete bans reported AI use at roughly the same rate as students in schools with no ban. This finding is directionally consistent with other usage data but is a secondary source - not independently peer-reviewed.


Mechanism evidence

AI use is near-universal and policy-independent:

  • 86% of students globally use AI in their studies; 54% weekly (Digital Education Council survey, cited in Campbell ATS meta-summary, March 2025)
  • 66% of US teens use AI for schoolwork; rate does not vary meaningfully based on school ban status - directionally consistent across sources, though the primary citation here is a secondary analysis, not a peer-reviewed study (Psychoftech / USC Understanding America Study, November 2025)
  • College Board survey (10,000+ respondents, June 2024-June 2025): high school usage grew from 79% to 84% between January and May 2025 alone

The supply-side disruption is causal, not theoretical:

The randomized controlled trial (RCT) evidence establishes this directly.

Bastani et al. (PNAS, 2025) ran a controlled study of roughly 1,000 K-12 students. Students who used AI to work math problems during the learning period performed measurably worse when tested on the same material after AI was removed. The AI improved their performance while it was present. When it was gone, the learning was gone too. (DOI: 10.1073/pnas.2422633122)

The Lodge and Loble “performance paradox” names the same pattern: AI boosts immediate performance while diminishing durable learning. The output goes up. The skill does not.

Barcaui (Social Sciences & Humanities Open, 2025) ran an RCT with 120 undergraduate business students. AI-using students scored 57.5% on a surprise retention test 45 days later. Traditional-methods students scored 68.5%. Six weeks after the work was done, the AI group had forgotten enough to fall below passing. The gap was not noise - it was 11 points on a surprise test neither group knew was coming.

The scaffold vs. crutch distinction:

Not all AI use produces the same harm. The mechanism is cognitive offloading - what work the student’s mind is doing versus what the AI is doing.

Asking AI to summarize a dense research paper so the student can quickly assess whether it is worth reading in full: the student still decides what matters, evaluates the source, and does the reading on anything that makes the cut. AI handled retrieval; the student handled judgment. Beneficial offloading.

Asking AI to write the essay: the student has a polished argument they did not construct, cannot fully explain, and cannot tell you is correct. They submitted something they do not own. When the teacher asks them to defend it, the gap shows immediately. Harmful offloading - the construction work that was supposed to happen did not happen.

The distinction is not about AI being good or bad. It is about whether the student is doing the construction work alongside the tool, or bypassing it entirely.

AI does improve surface writing quality on some metrics - grammar, sentence formulation, vocabulary enrichment. The RCT evidence establishes harm when AI replaces the cognitive work, not when it supports grammar or formatting. The scaffold-vs-crutch split is important: surface help is not the same thing as thinking help.

The good loop and the bad loop:

When a student has built genuine domain expertise, AI removes the mechanical friction - and what’s left is the thinking.

A student researching a complex topic used to spend hours on tasks that required time but not judgment: finding sources, skimming for relevance, pulling together summaries. AI handles that now. What the student is left with is the actual hard part: reading what came back, deciding what matters, asking what’s missing, noticing what conflicts, forming a view.

That is not easier. It is more demanding. And it builds more, faster - because the student is spending their cognitive effort on judgment and synthesis instead of search and retrieval.

The prerequisite is not domain expertise. It is engagement. AI can build expertise from scratch - if the student actually follows through.

Ask a question. AI scans and synthesizes with citations. Student reads the synthesis - and follows the citations. Forms a view. Asks the next question. That process builds understanding. The AI is the on-ramp, not the destination.

The “follow the citations” step is the hinge. That is what separates scaffold from crutch in this scenario. The AI provides a map; the student does the reading.

question asked -> AI synthesizes with citations -> student reads synthesis and follows citations -> understanding grows -> student asks better questions -> deeper investigation -> domain expertise builds -> student knows enough to ask harder questions -> question asked ->

What schools are for - in the AI era - is building the habit of engagement before students are handed the tool. Not the facts. The habit: follow through, question, verify, form your own view. Students who have that habit enter the good loop whether they start as novices or experts. Students who don’t have it accept the synthesis and stop.

The tool makes an engaged person more capable. That is the good loop.

When a student skips the building phase and uses AI to generate outputs from the start, the loop runs the other way. This is the bad loop - and it is self-reinforcing:

seeks efficiency -> AI produces polished output -> output quality mistaken for understanding (performance paradox) -> student stops monitoring own knowledge gaps (metacognitive laziness) -> more cognitive work outsourced to AI -> skills atrophy from disuse -> can’t evaluate AI output -> increased dependency -> diminished capacity for judgment and oversight -> economically replaceable -> seeks more efficiency ->

Each pass through the loop erodes the foundation needed to break out of it.

The Matthew Effect - which loop you enter depends on what you bring:

“Matthew Effect” is a term from sociology: advantages compound for those who already have them, and disadvantages compound for those who don’t. From the Gospel of Matthew: “to him who has, more will be given.”

In this context: students with strong metacognitive foundations enter the good loop. Students without them enter the bad loop. The tool amplifies the gap that already exists - it does not create it.

This is why the Matthew Effect and the bad loop are the same problem at two levels. Within a school, the student who starts with more knowledge uses AI to go further. The student who starts with less uses AI to skip what they needed to build. Across schools, the well-resourced school has the teacher capacity and policy infrastructure to teach the distinction. The under-resourced school does not. The compounding runs at both levels simultaneously.

Most schools have no policy at all:

  • Only 31% of public schools have any written AI policy as of December 2024 (NCES School Pulse Panel, via Child Trends)
  • Only 13% of high schools actively encourage AI use in all classes; roughly 40% ban it outright; 1 in 5 allow it with no formal policy (College Board 2025)
  • Only 34% of teachers report their school or district has policies on AI use related to academic integrity; over 80% of students report teachers have not explicitly taught them how to use AI (RAND Doss et al., spring 2025)

Enforcement-only generates compliance theater:

  • Only 9% of teens’ parents believe their teen regularly uses AI for schoolwork (Psychoftech / USC, November 2025)
  • 47% of students say cheating is easier than last year; 35% name ChatGPT specifically (Wiley survey, 2024)
  • K-12 discipline rates for AI-related plagiarism rose from 48% (2022-23) to 64% (2023-24)

Teacher planning time is structurally incompatible with the redesign required:

  • US teachers average 4.4-4.5 hours of planning time per week (NCES analysis, NCTQ / EdSurge, March 2024)
  • Singapore - a top TALIS 2024 performer - allocates 8.2 hours per week for planning; US teachers are at roughly half that
  • A standard oral examination (one AI-resistant format) runs up to 30 minutes per student; for a class of 30, that is 15 hours of assessment time per teacher per class before any prep or calibration - the sketch after this list runs that arithmetic across a full teaching load
  • No named district has allocated a documented budget specifically for released time for assessment redesign (confirmed absence across RAND, CRPE, Digital Promise, DOE sources reviewed)
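To see what those figures mean in combination, a back-of-envelope sketch - the 30-minute exam length and 4.4-hour planning figure come from the bullets above, while class size and teaching load are illustrative assumptions, not figures from the cited sources:

```python
# Oral assessment time vs. available planning time.
# ORAL_EXAM_MINUTES and PLANNING_HOURS_PER_WEEK come from the figures
# cited above; class size and teaching load are assumptions.

ORAL_EXAM_MINUTES = 30         # upper bound per student (cited above)
STUDENTS_PER_CLASS = 30        # assumed class size
CLASSES_PER_TEACHER = 5        # assumed secondary teaching load
PLANNING_HOURS_PER_WEEK = 4.4  # NCES average (cited above)

hours_per_class = ORAL_EXAM_MINUTES * STUDENTS_PER_CLASS / 60
hours_per_teacher = hours_per_class * CLASSES_PER_TEACHER
weeks_consumed = hours_per_teacher / PLANNING_HOURS_PER_WEEK

print(f"{hours_per_class:.0f} hours per class")        # 15
print(f"{hours_per_teacher:.0f} hours per teacher")    # 75
print(f"{weeks_consumed:.0f} weeks of planning time")  # ~17
```

Under those assumptions, one round of oral examinations across a full teaching load consumes roughly 17 weeks of all available planning time - close to half a school year.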

Evidence notes

  • Compliance theater (Psychoftech/USC): plausible - Substack reporting on USC Understanding America Study panel data; not independently peer-reviewed. Directionally consistent with College Board usage data.
  • Barcaui K-12 generalization: plausible - RCT used adult undergraduate business students; cognitive offloading mechanism is consistent with Bastani et al. but direct K-12 transfer is not confirmed.
  • Good loop learning pathway: plausible - consistent with beneficial cognitive offloading research; no RCT directly measuring this pathway in K-12 was located.
  • “Metacognitive laziness”: plausible - verify Gerlich 2025 (MDPI Societies, Source 6); term not confirmed in current research file.
  • “Performance paradox”: confirmed - Lodge & Loble 2026, UTS/UQ report, locally verified PDF.
  • RAND “34% of teachers”: teacher perception, not a district audit; the 31% NCES figure counts schools while the 34% RAND figure counts teacher reports, so the two are not directly comparable.
  • K-12 discipline rates (48% to 64%): plausible - artsmart.ai statistics compilation; primary source chain not independently verified.
  • Time horizon: early - AI adoption confirmed; outcome data from redesigned assessment approaches does not yet exist at K-12 scale.
  • Counterfactual: not available - no peer-reviewed RCT of a named K-12 district’s AI integration policy measuring downstream outcomes published as of April 2026.
  • Load-bearing variable: the redesign hypothesis requires teacher time and PD quality that do not currently exist at scale. If structural conditions do not change, redesign does not happen regardless of how clearly the need is documented.

Measuring Knowledge, Not Output

AI made producing outputs nearly free. That changes what assessment is for. A grade on an essay was always a proxy for something else: whether the student learned to construct an argument, evaluate evidence, form a view. When the output can be generated in seconds, the proxy breaks.

The question becomes: does the student actually know the material? Detection tools try to answer that by policing the output. They cannot. The only answer is assessment that requires the student to demonstrate understanding directly.

The policy layer that exists is narrow and mostly procedural.

Ohio requires every public district to publish a comprehensive AI policy by July 1, 2026. Tennessee’s Public Chapter 550 (spring 2024) mandated K-12 and university AI policies statewide. These are policy publication mandates, not quality standards. Whether a district’s policy addresses learning outcomes, teacher capacity, or what students are supposed to develop - the statutes do not say.

Detection does not work and cannot be the enforcement mechanism:

OpenAI shut down its own detector after it correctly identified only 26% of AI-written text while falsely flagging 9% of human writing. Turnitin’s roughly 1% claimed false positive rate rises to roughly 4% in practice at research institutions. Detection tools are also biased against non-native English writers. Multiple UC campuses declined to adopt Turnitin’s AI detection feature citing accuracy concerns.
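The base-rate arithmetic makes the failure concrete. A minimal sketch using the published OpenAI classifier figures above; the cohort size and the 50% actual-use rate are assumptions for illustration, roughly in line with the usage surveys cited earlier:

```python
# Why detector scores cannot be treated as proof: base-rate arithmetic.
# Detector figures are the published OpenAI classifier numbers cited
# above; the cohort size and AI-use rate are illustrative assumptions.

COHORT = 1000          # hypothetical student essays
AI_USE_RATE = 0.50     # assumed share actually AI-written (illustrative)
TRUE_POSITIVE = 0.26   # detector catches 26% of AI-written text
FALSE_POSITIVE = 0.09  # detector flags 9% of human-written text

ai_written = COHORT * AI_USE_RATE
human_written = COHORT - ai_written

caught = ai_written * TRUE_POSITIVE                  # 130 AI essays flagged
missed = ai_written - caught                         # 370 AI essays pass
false_accusations = human_written * FALSE_POSITIVE   # 45 honest students flagged

flagged = caught + false_accusations
share_flagged_innocent = false_accusations / flagged

print(f"missed: {missed:.0f} of {ai_written:.0f} AI essays")
print(f"innocent among flagged: {share_flagged_innocent:.0%}")  # ~26%
```

Under those assumptions, roughly one in four flagged students is innocent, while nearly three-quarters of actual AI use passes undetected - before accounting for the documented bias against non-native English writers.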

Academic integrity bodies (IACIS, university guidance documents) advise treating AI detector scores as one signal, not proof of misconduct. That is a sound standard. But it depends on institutional capacity to conduct actual investigations, which most schools lack.

The real protection is assessment redesign:

Assessment formats that require demonstrated process - revision trails, oral explanation, iterative work shown over time - cannot be outsourced to AI in the same way a one-shot take-home essay can. This is not a silver bullet. Oral assessments create documented performance disadvantages for students with speech or language differences, English learners, and students with anxiety. Any single alternative format relocates inequity rather than eliminating it. The solution is a portfolio of process-oriented formats, not a single swap.

Some vendors are trying to solve the right problem - most are not:

Two vendors dominate the conversation: Khanmigo (Khan Academy) and MagicSchool AI. Both are at scale. They take fundamentally different approaches.

Khanmigo is student-facing and takes a coaching approach. It asks guiding questions instead of delivering answers, with the intent of keeping the student doing the construction work. That is the right instinct - an attempt to build the good loop into the tool itself.

MagicSchool AI is primarily teacher-facing: rubric generation, lesson planning, differentiation. It reduces the administrative load on teachers, which is important. It does not address the measurement question. It does not ask whether the student can demonstrate what they know without the tool present.

Whether Khanmigo’s design intent translates to actual learning outcomes is a different question - covered in Where it broke.

A promising piece: the AI-generated personalized verification exam:

One approach worth watching uses AI to scale the verification problem. A student submits a paper. AI generates a personalized set of questions drawn from that specific paper - the argument, the evidence, the choices made. The student then takes a short test on their own work under controlled in-class conditions.

The key is personalization. Generic AI answers do not help if the questions are built from the student’s specific argument. The student has to know their own thinking. That is the actual test.

The teacher’s role shifts: instead of writing questions from scratch, they review the AI-generated set, flag anything off-base, and read the paper. The cognitive load does not disappear - but it moves from question generation to quality review, which is faster and still keeps the teacher engaged with the student’s work. At scale, that is a meaningful difference.

This is promising not because AI writes the questions. It is promising because it separates production from verification. The student may have had help producing the work; the follow-up check tests whether they actually understand what they turned in.

This is a plausible way to protect the learning itself; no K-12 RCT exists yet. The technology components exist. The workflow is not yet standardized.
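A minimal sketch of what that workflow could look like - every function and field name below is hypothetical, since no standardized implementation exists:

```python
# Hypothetical shape of a personalized verification workflow.
# No standard exists; every name and step here is illustrative.

from dataclasses import dataclass

@dataclass
class VerificationSet:
    student_id: str
    questions: list[str]        # drawn from the student's own submission
    teacher_approved: bool = False

def draft_questions(paper_text: str, n: int = 5) -> list[str]:
    # Placeholder for the model call. A real system would prompt an LLM
    # with the submission and ask for questions targeting its specific
    # argument, evidence, and choices.
    return [f"(question {i + 1} about this paper's argument)" for i in range(n)]

def teacher_review(questions: list[str]) -> list[str]:
    # The teacher reads the paper, drops anything off-base, and signs off.
    # Modeled as a pass-through here; in practice this is human judgment.
    return questions

def build_verification_set(student_id: str, paper_text: str) -> VerificationSet:
    drafted = draft_questions(paper_text)
    approved = teacher_review(drafted)
    return VerificationSet(student_id, approved, teacher_approved=True)

# The last step happens offline: the student answers the approved set
# in class, under controlled conditions, with no tool present.
vset = build_verification_set("student-042", "…full text of submitted paper…")
print(vset.questions)
```

The design choice that matters is the review step: the teacher stays engaged with the student’s work, but as editor of the questions rather than author.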

Remote proctoring: a partial model, not a K-12 solution:

Remote proctoring services (Pearson, Proctorio, ProctorU) handle controlled-environment testing at scale for certification exams and higher education. That model is worth understanding - but it does not transfer cleanly to mandatory K-12 use. Equity constraints are a challenge: home environments vary widely in privacy, quiet, and reliable internet. The Pearson model works because a physical testing center remains the primary offering; remote access is a convenience layer, not the foundation. Building mandatory remote proctoring into public K-12 would risk becoming another inequity amplifier, creating a two-tier system where well-resourced students test from quiet home offices and under-resourced students are disadvantaged before the test begins. Worth learning from the model; not worth copying it wholesale.


Where it broke

The ban cycle was compliance theater, not a structural response.

NYC banned AI, declared success, and reversed within months. Other large districts moved through the same cycle. The lesson most districts drew was “integrate instead of ban.” The follow-through on what integration actually requires - teacher capacity, assessment redesign, funding - has not materialized at scale.

The Matthew Effect is running at the school level, not just the student level.

The equity gap does not appear only within schools. It appears between them.

Low-poverty districts provide AI training to 67% of teachers. High-poverty districts provide it to 39% (RAND RRA956-31, fall 2024). Private schools are more than twice as likely as public schools to both permit AI use AND have governance policies for it (College Board 2025).

This is equity inversion: the students who most need the scaffold/crutch distinction explained to them are in the schools least prepared to explain it. The students most at risk of entering the bad loop are in the schools with the least teacher training, the least policy infrastructure, and the least capacity for assessment redesign.

The Matthew Effect runs in both directions here. Students with strong foundations enter the good loop. Schools with strong infrastructure enter the good loop. The compounding works the same way.

AI literacy programs are too new to measure.

Washington County, MD launched an AI literacy program in 2024-25. Georgia’s CTE AI pathway is documented. Connecticut ran a statewide pilot in January-June 2025. None of these programs has published evaluation data. The adoption curve is ahead of the evidence base.

The vendor evidence gap is the highest-profile example.

Khanmigo scaled from roughly 68,000 users in 2023-24 to 700,000+ in 2024-25. School-level outcome claims (Enid High School geometry, Mesa Public Schools grade improvements) are self-reported by school officials, have no control groups, and have not been independently verified. Khan Academy itself acknowledges no RCT has been completed. MagicSchool AI claims 40% of US public schools - with no published outcome data. Both tools have adoption curves that are far ahead of their evidence base. This is a material gap: the Socratic design intent may be right; we do not yet know if it works.

Scale check: The redesign response that the evidence points toward - sustained, content-specific, coached professional development; assessment instruments rebuilt for demonstrated performance; AI literacy integrated into core curriculum - is implemented in a small number of named schools and districts. National coverage data does not exist. The problem is universal. The response is scattered.

Concepts worth testing:

  • Personalized verification checks: generate follow-up questions from the student’s own work, then test under controlled conditions
  • Timed verification checks: after a student submits work, give a short in-class check on that specific work
  • Oral defense lite: ask 3-5 targeted follow-up questions about why the student made specific choices
  • Process portfolios: assess question formation, source selection, draft evolution, revision, and reflection, not just the final artifact
  • Process plus defense: require draft trails and revision history, then ask the student to explain key choices briefly in person
  • Revision memos: require a short explanation of what changed between drafts and why
  • Tool-removal checks: allow AI during preparation, then remove it for a short explanation or retention test
  • Error-detection tasks: ask students to diagnose weak reasoning, bad evidence, or false claims, not just produce clean outputs
  • Transfer tasks: ask students to apply what they learned to a new case they have not rehearsed
  • Scheduled no-AI drills: periodically require core work to be done without AI assistance so raw skills do not atrophy

None of these is a silver bullet. The common pattern is the point: preparation can be assisted; demonstration cannot. Different subjects will need different formats. The requirement is the same in all of them - measure retained understanding, not polished output.


Adoption Without Evidence

There is no settled market verdict. Usage is universal and growing regardless of policy. The vendor market has responded with scale - but scale is not evidence. Outcome data does not yet exist to match the adoption numbers.

The policy market is moving toward mandated policy publication (Ohio, Tennessee) and away from outright bans. “Having a policy” and “having a functional response” are different things.

Districts are not being rewarded or punished based on the quality of their AI integration. Accountability mechanisms for this are absent.

Evidence notes

  • MagicSchool AI “40% of US public schools”: vendor-reported, unverified.
  • Khanmigo school-level outcome claims (Enid, Mesa): self-reported by school officials, no control group, not independently verified. Khan Academy acknowledges no RCT completed.

Policy environment

The solution is not a mystery. Schools need to redesign assessment around process and judgment. Teachers need the time and training to do it. Students need to learn the distinction between scaffold and crutch before the bad habit sets in.

Public schools in particular are hard-pressed to deliver that. A teacher running on 4.4 hours of planning time a week - already at 60% of what their own union recommends - does not have bandwidth to redesign assessment instruments, learn new rubric formats, and calibrate oral examination procedures. The system runs on what is already there.

The policy layer is showing up. Ohio now requires every public district to publish a comprehensive AI policy by July 2026. Tennessee passed a similar mandate in 2024. That matters - it creates an accountability surface, something that can be audited. But read the laws carefully: they require a policy. They do not fund one. They do not require the policy to address learning outcomes, teacher capacity, or what students are supposed to develop. The mandate is publication, not substance.

The federal clarification from the DOE (July 2025) said existing formula grants - Title I, Title IV-A - can be used for AI integration, including teacher training. That is permission, not money. Title I dollars are already stretched across everything a district does. Permitting their use for AI training is not the same as allocating new funds for it. The check has not been written.

The result is a policy layer arriving before the support layer. States are requiring compliance with something most districts do not yet have the capacity to deliver. The gap fills with the path of least resistance: vendor adoption, detection tools, one-off workshops that do not change practice. Compliance theater at the policy level, not just the student level.

What would actually move this: funded teacher release time as a line item in district budgets; professional development designed to the Learning Policy Institute standard - sustained, coached, content-specific - not a half-day workshop; and accountability infrastructure that catches up. AP exams and state assessments that still reward AI-completable take-home work undermine everything a classroom teacher tries to redesign. The school cannot fix what the testing system rewards.

Evidence notes

  • NSF grant program: existence confirmed; scale and scope not confirmed in research session.
  • LPI 7-feature PD standard: confirmed, 2017 synthesis of 35 studies. The modal district format (one-off workshops) does not meet it.
  • Title II federal PD funding: $2.2 billion annually. Large districts spend ~$8 billion/year on PD total; only 30% of teachers improve substantially (Hechinger Report). Money alone is not the constraint - design quality is.

North Star verdict

Education is supposed to be the squeeze-reduction lever. The pathway for a kid without family wealth is: build skills, demonstrate competence, earn options. Options create security. Security creates the healthy loop.

AI is now sorting students into two tracks before they finish high school.

The good loop: students who build genuine domain expertise can use AI to go faster, catch more, and produce better work. Their skills are durable because they are not outsourceable - they are the ones capable of evaluating what the AI produces. Those students become more valuable as AI scales.

The bad loop: students who use AI to skip the construction work produce better-looking outputs with less retained learning. Their dependence on the tool grows as their underlying skills atrophy. They cannot catch the AI’s errors because they never built the knowledge needed to recognize them. Those students become more replaceable as AI scales.

Which loop a student enters depends on what they bring to the tool - metacognitive foundations, base knowledge, and the instructional context that teaches them the difference between using AI as a scaffold and using it as a crutch.

The Matthew Effect is running in both directions: within schools, students with stronger foundations enter the good loop. Across schools, well-resourced institutions are better positioned to close that gap. High-poverty districts are providing AI training to 39% of their teachers. Low-poverty districts are at 67%. The students who most need the distinction explained are in the schools least equipped to explain it.

If this plays out without intervention, the result is not that AI disrupted education equally. The result is that AI accelerated the sorting that was already happening. Education does not reduce the squeeze; it amplifies it.

The case for intervention rests on a load-bearing assumption: that judgment, domain expertise, and the ability to evaluate AI output remain AI-resistant skills long enough to be worth building deliberately. Given current AI capabilities, that is a plausible bet, not a settled finding. The redesign argument does not require the assumption to hold permanently - only long enough for today’s students to build something real. Name it as a bet before building a policy case on it.

The System Lesson: When doing the assignment stopped requiring learning it, schools needed a new answer. Most didn’t have one. The kids with the least going in paid for that gap.


Research gaps

[RESEARCH GAP: No RCT or quasi-experimental study measuring downstream academic outcomes from a named K-12 district’s AI integration policy has been published. All school-level outcome claims (Khanmigo/Enid, Khanmigo/Mesa) are self-reported, uncontrolled, and unverified.]

[RESEARCH GAP: No study has measured how much released time assessment redesign actually requires per teacher per course in K-12. The structural incompatibility is clear; the magnitude is unmeasured.]

[RESEARCH GAP: Whether AI-generated efficiency savings (rubric generation, lesson planning) are redirected toward assessment redesign or absorbed into baseline workload compression has not been studied. CRPE data suggests the latter, but no causal study confirms it.]

[RESEARCH GAP: Connecticut CCERC evaluation of the 2025 statewide pilot has not published outcome data. If and when it does, it will be the first systematic K-12 evaluation of a state-level AI integration program in the US.]

[RESEARCH GAP: No study has directly measured whether authentic assessment redesigns (process-based, context-specific, oral) close or widen achievement gaps between students from high- and low-resource backgrounds.]
