Launched: Our Second AI Tutor RCT

image of students collaborating on a digital project in a technology lab

Why we are running a second RCT on AI tutoring (and what we are testing this time)

This week we begin our second randomised controlled trial of a constrained AI tutor on Eedi: a four-arm study running with 1,525 students across 10 UK secondary schools in years 8, 9, and 10. The diagnostic engine at the heart of this trial powered our first AI Tutor study in 2025, undertaken in partnership with Google DeepMind's research team. This new trial, also with DeepMind, extends that work and approach: a slow and measured build of evidence rather than a rush to deliver and scale. We'll run it in classrooms over 12 weeks, with learning outcomes measured using Renaissance's STAR Maths assessment.

This trial involves a "constrained" AI tutor, but what does that mean? Our AI tutor is not a general-purpose chatbot that students can engage with to get help on their homework. It is designed to work in a distinctly different way: it activates only when a student answers a diagnostic check-in question incorrectly, and the conversation is bounded to the specific construct the student is learning about. This design is a deliberate response to the growing evidence base on what unconstrained AI does to learning. Bastani and colleagues (2025) found that students using an unconstrained AI tutor improved while working with it, but on post-tests without the tool they performed significantly worse than students who had no AI access at all. They concluded that generative AI without guardrails can harm learning. The risk is cognitive offloading: the tool does the work, so the student does not.

At Eedi, we instruct the AI to support the student through one specific moment of difficulty, then return them to the lesson.

The reason we can bound the conversation this tightly is the diagnostic engine beneath our testbed tool, Eedi School. Every question in our diagnostic question library has one correct answer and three incorrect ones, and each incorrect answer (a 'distractor') is mapped to a specific, named misconception. When a student picks a wrong answer, we know something precise about their thinking: not just that they got the question wrong, but why. That diagnostic signal is what we pass to the AI tutor, and it is what turns a generic model into something that can speak to a specific student about a specific misunderstanding. Eedi's diagnostic engine is the intelligence layer for maths teaching and learning, built and refined over nearly a decade of classroom use.
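To make the idea concrete, here is a minimal sketch of what a distractor-to-misconception mapping could look like. The class names, question, and misconception labels are all hypothetical illustrations, not Eedi's actual schema or content; the point is only that each wrong answer carries a named diagnosis that can be passed to a tutor.

```python
# Illustrative sketch only: names and structures are hypothetical,
# not Eedi's real data model.
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiagnosticQuestion:
    prompt: str
    correct: str                 # the single correct answer
    distractors: dict            # wrong answer -> named misconception


def diagnostic_signal(question: DiagnosticQuestion, answer: str) -> Optional[str]:
    """Return the misconception implied by a wrong answer, or None if correct."""
    if answer == question.correct:
        return None
    return question.distractors.get(answer, "unmapped answer")


q = DiagnosticQuestion(
    prompt="Simplify 3 + 4 \u00d7 2",
    correct="11",
    distractors={
        "14": "Applies operations left to right, ignoring precedence",
        "24": "Multiplies all three numbers together",
        "9":  "Treats the multiplication sign as addition",
    },
)

# A wrong answer yields a precise diagnosis, not just "incorrect".
print(diagnostic_signal(q, "14"))
# A correct answer yields no signal, so no tutor is activated.
print(diagnostic_signal(q, "11"))
```

This is what "knowing why, not just that" buys: the string returned for a wrong answer can seed the tutor's prompt, bounding the conversation to one specific misunderstanding.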

AI tutoring is a crowded marketplace, with a growing number of benchmarks and usability studies, but rigorous causal evidence on efficacy remains scarce. Our first study with Google DeepMind in 2025 took an initial step toward building that evidence base. In fact, the Stanford SCALE Initiative recently included the study in their 2026 review of AI in K-12 as one of only 20 high-quality causal studies identified from over 800 papers reviewed. We kept that initial study small by design, allowing us to focus on rigour: 165 students, five schools, seven weeks. Supervising tutors approved 74.4% of AI-drafted messages without any edits, the safety audit found zero instances of harmful content, and our Bayesian analysis attributed a 93.6% posterior probability to supervised AI tutoring producing greater knowledge transfer than human tutoring alone. We hold those findings lightly; they are signposts, not conclusions. See the study highlights here.

This second RCT is larger, longer, and asks a more specific question: how much does student-level context matter to the quality of AI tutoring?

The trial compares four conditions:

  1. Static content (business as usual). Students work through maths fluency practice questions and explainer videos pre-recorded by our learning design team. This is the baseline.
  2. AI tutor with pedagogy prompt. The AI is constrained by a detailed pedagogy prompt grounded in Eedi's approach, with access to the diagnostic question and the specific misconception the student is likely holding. A human tutor reviews, edits, or rejects every message before it reaches the student.
  3. AI tutor with pedagogy prompt plus student context. The prompt is enriched with personalisation signals to keep the student engaged and help the tutor locate the zone of proximal development faster. A human tutor reviews, edits, or rejects every message before it reaches the student.
  4. Human tutor only. A trained human tutor working without AI suggestions, as a comparison against the gold standard of one-to-one support.

The questions we’re investigating are practical:

  • Does pedagogically constrained AI tutoring hold up at scale, and translate into measurable gains on a standardised assessment?
  • Does layering rich student context meaningfully change the experience and the learning?
  • And what value does AI tutoring contribute on top of human tutoring alone?

Now, on to actually running the trial. We will share the results when they arrive in Summer 2026.

Work with us

Want to help us help a billion kids by 2030?