Why AI Learning Tools Fail, and What Would Actually Work

Apr. 14, 2026

In this article, I look at why most AI learning products fail the students they claim to help, what decades of cognitive science research say about how people actually learn, and what a genuinely effective AI-powered learning tool would look like.

What current AI learning tools actually do

Spend any time on social media and you will see the ads. An AI that explains any concept. A chatbot that tutors you like a personal teacher. Upload your textbook and ask it anything. Learn faster, study smarter, ace your exams. The technology is impressive. The research says most of it does not work, and some of it makes learning worse.

Look at what most AI learning tools do. They summarize content. They generate flashcards. They let you chat with your PDF. They create practice questions from your notes. They reorganize material into study guides. Every one of these features assumes the student already understands the material at a basic level and wants to organize, review, or deepen that understanding. These are productivity tools for people who already know how to learn.

The student who is struggling, who does not understand the concept at a fundamental level, who cannot formulate the right question to ask, who does not know what they do not know, gets almost nothing from these tools. You cannot ask good questions about material you do not understand. The chatbot waits for you to engage, and if you do not know where to start, it has nothing to offer. Khan Academy’s own analysis of Khanmigo found many students typing “idk” rather than engaging with the AI’s Socratic questioning. The design assumed a level of existing knowledge that struggling students did not have.

The marketing emphasizes ease. Learn faster. Study less. Let AI do the hard work. This framing is the opposite of what research says produces durable learning. The conditions that make learning feel easy and productive often produce worse long-term outcomes. Struggle and effort frequently produce the most durable knowledge.

The working memory bottleneck

Your long-term memory is essentially unlimited. You can store vast amounts of knowledge, skills, and experiences over a lifetime. Your working memory, the mental workspace where you actively think about things, can hold roughly four pieces of information at a time for about twenty seconds.

Everything you consciously process passes through this bottleneck. Reading a sentence, following an argument, solving a problem, making sense of a diagram. When you successfully organize new information and connect it to things you already know, it gets stored in long-term memory as organized knowledge structures you can retrieve and use later. That is learning.

When too much hits your working memory at once, nothing gets through. The information does not get organized, does not get connected to existing knowledge, and does not get stored. It bounces off. This is what is happening when a student says “this makes no sense”. The material is not inherently incomprehensible. Their working memory is overwhelmed.

Three things compete for that limited working memory space. First, the inherent complexity of the material. Some concepts require you to hold multiple pieces in mind at the same time. Second, poor instructional design. Badly organized explanations, split attention between a diagram and text on different pages, redundant information that adds clutter without adding understanding. Third, the actual mental work of learning. Making connections, building schemas, integrating new knowledge with old.

If a student’s working memory is consumed by the complexity of the material and the confusion of the presentation, there is nothing left for actual learning. The information goes in one ear and out the other. It is not that the student is not trying. There is no cognitive room for the learning to happen.

Prior knowledge is the strongest predictor of future learning

Prior knowledge is the single strongest predictor of future learning. Not intelligence, not motivation, not the quality of the teacher, not the medium of instruction. What you already know.

When you already know things about a subject, you have schemas in long-term memory that help you organize new information. Those schemas expand your working memory by letting you treat complex ideas as single units. An experienced chess player does not see 32 individual pieces. They see familiar patterns. A doctor does not process a list of isolated symptoms. They see a syndrome. Each schema condenses what would otherwise be multiple separate items into one, freeing up working memory for new learning.

A student without those schemas faces a harder task. The same material that fits comfortably in an expert’s working memory overwhelms a novice’s. This is not about intelligence. The novice’s working memory is not smaller. It is carrying heavier loads because every piece of information is separate and unconnected.

A tool that presents the same material to everyone, regardless of what they already know, will help students who have strong foundations and fail students who do not. The foundation determines whether the new information has somewhere to land.

The YouTube aha moment

Most students have experienced this. You sit through weeks of lectures on a topic, struggling to make sense of it. Then you watch a YouTube video, maybe a beautifully animated explanation, and suddenly it clicks. You get it. The natural conclusion is that the YouTube creator is a better teacher than the professor, or that visual explanations are inherently superior to lectures.

The research tells a different story.

Several things go on during those moments of sudden understanding, and the video is probably the least important. The failed classroom lectures were not wasted. They created partial knowledge structures in your brain. Incomplete, disorganized, but real. Your brain has been working on this problem, even when it did not feel like it.

The time gap between the lectures and the video introduced a spacing effect. Distributing your exposure to material over time improves retention and understanding compared to cramming it together. The gap was not wasted time. It was a learning mechanism.

You slept between the lectures and the video. Sleep is not just rest. During slow-wave sleep, your brain consolidates and reorganizes recent learning, integrating new traces with existing knowledge. Several nights of sleep after those confusing lectures were doing organizational work.

By the time you press play on the YouTube video, your brain has partially processed, consolidated, and organized the lecture material. The video encounters a mind with partially-formed schemas ready for the final integration step. It provides the last piece that completes a puzzle your brain has been assembling for weeks. The feeling is dramatic, a genuine burst of insight, but the video was the trigger, not the cause. Any adequate explanation at that moment would likely have produced the same result.

Research found that students who watched smooth, polished, passive explanations felt like they learned more than students who engaged in active, effortful learning, but actually performed worse on tests. The video’s smooth flow and elegant presentation create a subjective feeling of understanding that the student misattributes to the video’s quality. The frustrating classroom lectures that built the actual prerequisite knowledge get no credit.

This has direct relevance for AI education tools that prioritize smooth, satisfying explanations. The experience that feels best may teach the least.

What actually works

A comprehensive review of ten common study strategies rated each based on the strength of the evidence behind them. The two strategies students use most, rereading and highlighting, were rated lowest. The two strategies with the strongest evidence were ones most students rarely use: practice testing (actively recalling information from memory) and distributed practice (spacing study sessions over time rather than cramming them together).

The testing effect, the finding that actively retrieving information from memory strengthens long-term retention more than passively reviewing it, has been confirmed over and over again. In one well-known study, students who repeatedly tested themselves recalled 70% of material after one week, compared to 44% for students who repeatedly reread the same material.

The spacing effect is similarly powerful. Distributing study sessions over time rather than massing them together produces one of the largest effects among all study strategies. The optimal spacing interval depends on how long you need to remember, roughly 10-20% of the desired retention period. If you need to remember something for a year, space your reviews about one to two months apart.

A third strategy, interleaving, or mixing different problem types within a study session rather than practicing the same type repeatedly, shows strong effects for certain kinds of learning, particularly when you need to learn to tell similar things apart. Practicing mixed math problem types forces you to figure out which approach to use, not just execute a procedure you have already identified.

All three of these strategies share a common characteristic. They make learning feel harder. Students who test themselves feel less confident than students who reread. Students who space their practice feel like they are forgetting between sessions. Students who interleave feel confused by the mixing. Surveys show that 93% of undergraduates believe massed study is more effective than spaced study. They are wrong.

Any tool that optimizes for user satisfaction, for making students feel productive and confident, will steer them toward inferior learning strategies. The strategies that produce the best learning feel the worst. The strategies that feel the best produce the worst learning.

Making learning easy makes learning worse

Researchers coined the term “desirable difficulties” to describe a broad pattern. Conditions that create struggle and apparent failure during learning frequently produce the most durable knowledge.

Students who generate answers before being told the answer remember better than those who are simply given the answer, even when the generated answer is wrong. Students who practice with varied conditions perform worse during practice but better on later tests. Students who receive feedback after a delay learn more than those who receive immediate feedback. In study after study, the conditions that slow down apparent progress enhance long-term retention and the ability to apply knowledge to new situations.

A dramatic demonstration comes from research on productive failure, an approach where students attempt to solve problems they do not yet have the tools to solve, before receiving any instruction. They struggle, fail, and produce incorrect solutions. Then the teacher provides instruction that builds on their attempts. This approach consistently outperforms direct instruction for conceptual understanding. When a major university applied this method to a linear algebra course, pass rates jumped from roughly 55% to 75%.

Not all difficulty is good. Confusing instructions, irrelevant complexity, and poor scaffolding are undesirable difficulties that waste cognitive resources without triggering useful learning processes. The difficulty must force the brain to do the kind of work that builds knowledge. Retrieving, connecting, organizing, discriminating. If the difficulty adds confusion or frustration without engaging these processes, it is harmful.

A well-designed tool does not make everything easy. It makes the right things hard in the right ways, at the right times, for the right students.

Expertise reversal

One of the most consequential findings in learning science is the expertise reversal effect. Instructional methods that help beginners can actively harm advanced learners, and the reverse. Providing guidance to novices helps substantially. Removing guidance from experts helps substantially. Providing guidance to experts hurts. Removing guidance from novices hurts.

When you are a beginner, you do not have organized knowledge in your head to guide your learning. You need external structure (step-by-step examples, clear explanations, guided practice) to substitute for the internal knowledge structures you have not built yet. Without that external structure, you are forced into cognitively expensive search strategies that consume your working memory and leave nothing for actual learning.

As you develop expertise, your growing internal knowledge does the organizing work that external guidance previously handled. External guidance becomes redundant, and processing that redundant information consumes working memory resources that could be used for deeper learning. The worked example that accelerated the novice’s learning becomes unnecessary clutter for the expert.

One-size-fits-all instruction is always wrong for someone in the room. An educational tool must adapt to the learner’s current knowledge level, not present the same content to everyone. What a beginner needs and what an advanced student needs are different in kind.

Interactive simulations

Interactive simulations, the kind where you can drag sliders, adjust variables, and watch systems respond, are powerful learning tools. Over twenty years of research shows large positive effects compared to traditional instruction. In one study, students who learned about electrical circuits using a simulation outperformed students who used real laboratory equipment on conceptual tests. They were faster at building real circuits afterward, despite never having touched physical components.

The benefits depend heavily on the learner’s prior knowledge. When students have enough foundation to engage with a simulation, to form hypotheses, interpret feedback, and explore systematically, simulations are excellent tools for building deep understanding. When students lack that foundation, the simulation becomes cognitive noise. Too many things to attend to, too many variables to track, and no internal framework to organize the experience. The result is confusion.

When a simulation about energy had default settings that differed between Earth and Moon scenarios in a way that was not obvious, roughly 90% of students held the correct belief about a physics concept before using the simulation. Fewer than 20% held it afterward. The simulation taught them the wrong thing because they lacked the knowledge to interpret what they were seeing.

Simulations belong at a specific point in the learning journey. After foundational knowledge has been established, when the learner has enough internal structure to engage with the exploration. They are not a starting point for struggling students. They are a deepening tool for students who already have somewhere to put the new understanding.

Analogies

Analogies are powerful. Connect unfamiliar concepts to familiar ones, and understanding follows. Compare electricity to water flowing through pipes, and the abstract becomes concrete.

Analogies do not just illuminate the parts of a concept that map correctly onto the familiar domain. They also generate inferences about the parts that do not map. Those inferences are often wrong.

In one experiment, researchers gave students two different analogies for electricity. Students given a water-flow analogy (current as water, batteries as pumps) performed better on questions about batteries but worse on questions about resistors. Students given a moving-crowd analogy (current as people, resistors as gates) showed the opposite pattern. Each analogy improved understanding in its strong domain and degraded it elsewhere.

Even when teachers explicitly tell students that an analogy is imperfect and point out where it breaks down, the incomplete model often remains as the student’s only model of the concept. Students taught the atom-as-solar-system analogy actively resist abandoning it when they encounter quantum mechanics. The analogy was meant as a stepping stone. It became a cage.

The mitigation strategies from the research are specific. Use multiple analogies for complex concepts, each capturing different aspects. Always explicitly address where each analogy breaks down. Use bridging analogies, chains of intermediate comparisons, each one a small conceptual leap. Monitor for misconception formation, because analogies do not just risk leaving gaps in understanding. They risk filling those gaps with wrong answers that feel right.

An LLM generating unchecked analogies, choosing whatever comparison seems most intuitive, is a misconception factory. Every analogy it produces needs to come with explicit boundary conditions, and the system needs to check whether the student has over-extended the comparison.

Motivation

There is a common assumption that struggling students are simply unmotivated, that they lack discipline, grit, or work ethic. The research paints a different picture.

The most extensively validated framework for understanding motivation identifies three basic psychological needs: autonomy (a sense of choice and agency), competence (feeling capable and effective), and relatedness (feeling connected to others). When these needs are met, people are naturally motivated. When they are frustrated, motivation collapses.

Across hundreds of studies spanning over 200,000 participants, the strongest predictor of persistence is not enthusiasm or enjoyment. It is whether the person has internalized why the activity matters to them personally. External rewards and punishments showed no association with performance or persistence and correlated with decreased well-being.

For struggling students, the competence need is critical. A student who has repeatedly failed, who has never experienced success in a subject, has a genuine competence deficit that undermines motivation at its root. Giving this student more autonomy without first building competence often backfires, reinforcing the sense of helplessness. The research suggests a specific sequence. Build competence first through structured, achievable success experiences, then gradually introduce autonomy as the student develops confidence.

Gamification (points, badges, leaderboards, streaks) gets a complicated verdict. When done well, gamification elements can jumpstart engagement for students who have not yet found intrinsic motivation in a subject. When reduced to superficial mechanics layered on top of the same experience, it frequently backfires. In one study, students in a gamified course with leaderboards and badges showed less intrinsic motivation, lower satisfaction, and lower exam scores than students in a non-gamified version of the same course. Every student except the leaderboard leader voted to eliminate the gamification system.

Gamification that affirms competence (informational feedback about your progress) supports motivation. Gamification that controls behavior (do this to earn points, compete to stay on the leaderboard) tends to undermine it.

Engagement is not the same thing as learning. Optimizing for the dopamine hit of streaks and points may interfere with the deeper motivational work of building genuine competence.

Five failure modes of LLMs as educational tools

Years of research have identified five specific failure modes that undermine LLMs as educational tools.

They make things up

When AI systems generated hints for math problems in one study, 32% failed quality checks. When GPT-4 was embedded in a tutoring system and generated hints automatically, 35% were too general, incorrect, or simply gave away the answer. In systems where a human reviewed every AI message before a student saw it, error rates dropped to 0.1%. Without that review, errors were frequent enough to undermine trust and teaching quality.

They tell students what they want to hear

Research testing major AI models found sycophantic behavior, agreeing with the student rather than correcting them, in 58% of educational interactions. In nearly 15% of cases, the AI abandoned its own correct answer after the student pushed back. Once the AI started agreeing with a student’s wrong answer, it continued doing so in nearly 80% of subsequent exchanges. In one test, an AI wrongly admitted to making mistakes on 98% of questions when challenged, even when it had been correct. This is the opposite of what a good teacher does, and it reinforces the misconceptions that struggling students already hold.

They cannot track what you know

The core function of a good tutor is maintaining a mental model of what the student understands and what they do not, then adjusting instruction accordingly. Current LLMs fundamentally cannot do this. Testing revealed that LLMs generated essentially the same responses regardless of whether information about a student’s errors was included. The AI literally could not tell the difference between a student who had just made a critical error and one who had not. Without the ability to track your knowledge state over time, the AI treats every question as if it is the first one you have ever asked.

They create dependency

A large-scale study gave approximately 1,000 students access to ChatGPT for math practice. Students with AI access improved their practice scores by 48%. It looked like the tool was working brilliantly. When the AI was removed and students took an exam on their own, they scored 17% worse than students who had never used the AI at all. They had outsourced their thinking to the AI without internalizing the material, and they did not realize it was happening. Their self-reported confidence was high even as their actual learning declined. A follow-up study found the same pattern even among skilled learners. Motivation and good intentions were not enough to prevent dependency. Only system-level design constraints could.

They explain when they should question

LLMs are trained to be helpful, which in practice means they default to explaining things clearly and directly. Effective tutoring is usually the opposite. Asking questions that guide the student to figure things out themselves, withholding answers to create productive struggle, and calibrating the level of support to what the student needs at that moment. When researchers built a model specifically trained on real tutoring conversations, it significantly outperformed GPT-4 on teaching quality, because effective teaching requires a fundamentally different behavior than the “be maximally helpful” objective that LLMs are optimized for.

The difference between a tool that helps and a tool that hurts

When researchers gave students access to GPT-4 without any constraints, just the raw AI, practice performance improved by 48% and exam performance dropped by 17%. When they wrapped the same GPT-4 in pedagogical guardrails (hint sequences that escalated from vague to specific, rules preventing the AI from giving away answers, structured problem-solving scaffolds), practice performance improved by 127% and the learning harm was largely eliminated.

The difference between a tool that helps and a tool that hurts was not the AI model. It was the pedagogical architecture surrounding it. The same technology that harmed learning when unconstrained improved learning when properly constrained.

The most effective AI education systems are not the newest or most AI-forward. They are the ones with the most carefully designed pedagogical structures.

The best AI education systems are decades old

Carnegie Learning’s MATHia system, developed starting in the 1980s, does not use large language models at all. It uses symbolic AI based on a formal model of how the human mind solves math problems, tracking each step a student takes and comparing it against a cognitive model. When tested across over 18,000 students and 147 schools, students using this system nearly doubled their growth on standardized tests compared to traditional instruction.

ALEKS, based on a mathematical framework developed in 1994, maps subject domains into hundreds of atomic concepts and maintains a precise model of which concepts each student has mastered. It determines your knowledge state in 25-30 questions, identifies exactly what you are ready to learn next, and periodically reassesses to ensure genuine mastery. Students reaching 75% mastery on ALEKS were 2.7 times more likely to achieve proficiency on state assessments.

Mindspark, an Indian adaptive learning platform, uses statistical models to identify misconceptions and target instruction. When tested with students in Delhi’s urban slums, it produced significant improvements in both math and language over just four and a half months. The largest relative gains went to the lowest-achieving students. It works offline and costs approximately two dollars per student per month.

None of these systems use the kind of AI that generates marketing buzz. They use older, less glamorous technologies (symbolic reasoning, statistical models, knowledge graphs) that have been rigorously tested and validated over decades. What they share is a deep commitment to pedagogical structure. They know what the student knows, they know what the student needs to learn next, and they sequence instruction accordingly. The AI serves the pedagogy, not the other way around.

The most promising new approach is not replacing these systems with LLMs. It is combining them. One system uses LLMs to fill in natural-language dialogue within a pre-defined pedagogical structure that controls the learning flow. The structure specifies: start with an open-ended prompt, then a Socratic hint, then a more direct hint, then the answer if the student is still stuck. The LLM generates the actual words at each step, but it never controls the sequence. This constrained system achieved better tutoring outcomes than unconstrained GPT-4, where the AI rambled through 29 exchanges without the student ever reaching the answer.
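What that division of labor might look like in code is easy to sketch. The stage names and the generate_text call below are placeholders of my own, not details of any system described above; the point is that the ladder, not the model, decides what kind of utterance comes next.

```python
# Hypothetical sketch: a fixed pedagogical ladder controls the sequence;
# the LLM only generates the wording for the current rung.

HINT_LADDER = [
    "open ended prompt",   # "What do you think the first step is?"
    "socratic hint",       # a question pointing toward the relevant idea
    "direct hint",         # names the rule or step to apply
    "full answer",         # worked solution, only if the student is still stuck
]

def next_tutor_turn(stage_index, problem, student_attempt, generate_text):
    """Return the tutor's next message and the next rung of the ladder.

    generate_text(instruction, context) stands in for an LLM call; the
    instruction constrains what kind of utterance it is allowed to produce.
    """
    stage = HINT_LADDER[min(stage_index, len(HINT_LADDER) - 1)]
    message = generate_text(
        instruction=f"Write a short {stage} for this problem. "
                    "Do not reveal more than this stage allows.",
        context={"problem": problem, "student_attempt": student_attempt},
    )
    # The structure, not the model, decides that another failed attempt
    # moves one rung down the ladder.
    return message, stage_index + 1
```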

What AI is actually good at

LLMs are extraordinarily good at understanding and generating natural language. They can rephrase explanations in different ways. They can take a concept and describe it using vocabulary appropriate for different knowledge levels. They can generate varied practice questions on a topic. They can provide natural-language feedback on student responses. They can recognize when a student’s free-text answer reveals a specific misconception.

These are capabilities of communication, not capabilities of pedagogy. The AI can be an excellent voice for a well-designed system. It should not be the brain. The brain needs to be a pedagogical architecture grounded in research: diagnostic assessment, knowledge tracking, mastery-based progression, spaced retrieval practice, adaptive difficulty. The AI generates the words. The system determines what needs to be said, when, and to whom.

What a research-backed learning tool would actually look like

If you took the research seriously, all of it, including the parts that are inconvenient and counterintuitive, what would an AI-powered learning tool actually look like? It would look nothing like a chatbot.

Figure out what the student actually knows

Before any learning happens, the tool needs to know where the student stands. Not what grade they are in, not what chapter they are on, not what they say they need help with, but an accurate map of what they actually know and what they do not.

Some existing tools already do this well, determining your knowledge state in 25-30 questions and mapping exactly what you know and what you are ready to learn next. Good tutors do the same thing in their first session. The specific technique matters less than the principle: you cannot teach effectively if you do not know what the student’s starting point is.

For a tool that uses AI, this diagnostic can be richer than a traditional test. The AI can present a problem and, based on the student’s response including the specific errors they make, identify which underlying concepts are solid and which are missing. A student who answers “12” to the question “what is 3 × 5” has a different problem than a student who answers “8” (the latter likely confused multiplication with addition). The AI’s language understanding lets it parse free-text responses and identify specific misconceptions, not just score answers as correct or incorrect.

The result is a knowledge map, a picture of the student’s understanding across all the concepts in the domain, showing exactly where the gaps are. This map is the foundation everything else builds on.
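As a rough illustration, the knowledge map can be as plain as a per-concept record of estimated mastery and observed misconceptions. The sketch below is hypothetical, not the design of any system mentioned here, and the running-average update is a stand-in for something like Bayesian knowledge tracing.

```python
# Illustrative sketch of a per-student knowledge map: each concept carries
# an estimated mastery level and the specific misconceptions observed.

from dataclasses import dataclass, field

@dataclass
class ConceptState:
    mastery: float = 0.0                                   # 0 = no evidence of mastery
    attempts: int = 0
    misconceptions: list = field(default_factory=list)

@dataclass
class KnowledgeMap:
    concepts: dict = field(default_factory=dict)           # concept name -> ConceptState

    def record(self, concept, correct, misconception=None):
        state = self.concepts.setdefault(concept, ConceptState())
        state.attempts += 1
        observed = 1.0 if correct else 0.0
        # Running average as a placeholder for a real mastery model.
        state.mastery += (observed - state.mastery) / state.attempts
        if misconception:
            state.misconceptions.append(misconception)

    def gaps(self, threshold=0.8):
        """Concepts where the evidence says the student is not yet solid."""
        return [c for c, s in self.concepts.items() if s.mastery < threshold]

# The "8" answer to 3 x 5 gets recorded as evidence of a specific
# misconception, not just as a wrong answer.
student = KnowledgeMap()
student.record("single_digit_multiplication", correct=False,
               misconception="added the operands instead of multiplying")
```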

Start where the student is, not where the curriculum says they should be

The research on Teaching at the Right Level, one of the most rigorously evaluated approaches in education, demonstrates extraordinary results when instruction matches the student’s actual level rather than their grade level.

For a technology-based tool, this means the system should be willing to go backward before going forward. If a student is supposed to be learning algebra but the diagnostic reveals they do not have solid arithmetic, the system starts with arithmetic. This feels counterintuitive. It means telling a ninth-grader they need to work on fifth-grade material. The research is unambiguous. Building on a foundation that does not exist produces nothing. Building on a solid foundation, even if it means starting lower, produces rapid progress.

The AI’s role here is crucial for preserving the student’s dignity. The system does not need to announce “you are working on fifth-grade math”. It simply presents problems at the right level, with appropriate context and language for the student’s age. The AI’s natural language capability means it can frame the same mathematical concept differently for a nine-year-old and a fifteen-year-old, even if the underlying skill being practiced is identical.
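One way to implement "go backward before going forward" is a walk over a prerequisite graph. The curriculum map and mastery check below are invented for illustration; what matters is that the system, not the student, locates the deepest missing foundation.

```python
# Hypothetical prerequisite map: each concept lists what it builds on.
PREREQUISITES = {
    "linear_equations": ["fraction_arithmetic"],
    "fraction_arithmetic": ["integer_arithmetic"],
    "integer_arithmetic": [],
}

def next_teachable_concept(target, is_mastered):
    """Return the first concept on the path to `target` that the student is
    actually ready to learn, i.e. all of its prerequisites are mastered."""
    for prereq in PREREQUISITES.get(target, []):
        if not is_mastered(prereq):
            return next_teachable_concept(prereq, is_mastered)
    return target

# A ninth-grader aiming at algebra but shaky on fractions starts there,
# without the tool ever announcing a grade level.
start = next_teachable_concept("linear_equations",
                               is_mastered=lambda c: c == "integer_arithmetic")
# start == "fraction_arithmetic"
```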

Worked examples first, not simulations

When a student encounters a concept they do not know, the research is clear about what comes first: worked examples. Step-by-step demonstrations of how to solve a problem, with each step explained.

For novice learners, worked examples dramatically outperform problem-solving. The reason connects directly to the working memory bottleneck. When a novice tries to solve a problem they do not know how to solve, their working memory gets consumed by search strategies, trying different approaches, checking whether they are getting closer, managing the frustration of being stuck, leaving nothing for the actual learning. A worked example offloads that cognitive burden, freeing up working memory for understanding the structure and logic of the solution.

The tool should present a worked example, then ask the student to solve a similar problem. Then another example with a step removed (a “completion problem”), then another similar problem. Gradually, the scaffolding fades as the student develops competence. This adaptive fading, systematically removing solution steps as mastery develops, significantly outperforms the traditional approach of showing an example and then throwing students into problem sets.
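A fading schedule does not need to be elaborate. The thresholds below are illustrative assumptions, not values from the research, but they show the shape: the stronger the student's recent accuracy on a concept, the fewer solution steps get shown.

```python
def scaffold_steps_to_show(recent_accuracy, total_steps):
    """Hypothetical fading rule: scaffolding shrinks as mastery grows."""
    if recent_accuracy < 0.5:
        return total_steps          # full worked example
    if recent_accuracy < 0.75:
        return total_steps - 1      # completion problem: one step withheld
    if recent_accuracy < 0.9:
        return total_steps // 2     # roughly half the steps withheld
    return 0                        # independent problem solving
```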

The AI’s contribution here is generating varied worked examples and practice problems on demand, adapting the language and context to the student’s level, and providing specific feedback when the student makes an error. Not just “wrong, try again” but “you applied the right rule in step 2, but in step 3 you need to consider…”.

Retrieval practice as the core activity

Once the student has seen how a concept works through worked examples and can solve guided problems, the tool shifts to its most important function: structured retrieval practice.

The student is regularly asked to recall and apply what they have learned, not immediately after learning it, but after a delay. The questions are spaced over time, using algorithms that schedule reviews at optimal intervals. Different topics are interleaved, so the student has to first determine which approach applies before executing it. The questions are recall-based (requiring the student to generate an answer) rather than recognition-based (multiple choice), because recall produces larger learning effects.
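The scheduling core behind this can be small. Here is a minimal sketch that applies the 10-20% spacing heuristic mentioned earlier and interleaves topics by shuffling the due queue; the field names and the 15% default are assumptions for illustration.

```python
import random
from datetime import date, timedelta

def next_review_date(last_review, retention_goal_days, spacing_fraction=0.15):
    """Schedule the next review a fraction of the desired retention period out:
    a one-year goal lands reviews roughly one to two months apart."""
    return last_review + timedelta(days=int(retention_goal_days * spacing_fraction))

def todays_session(cards, today=None):
    """Select due items and shuffle them so topics are interleaved and the
    student must first decide which approach each question calls for."""
    today = today or date.today()
    due = [c for c in cards if c["next_review"] <= today]
    random.shuffle(due)
    return due
```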

This will feel harder than alternatives. The student will feel less confident. They will get more answers wrong during practice than they would with a less effective approach. The tool needs to be transparent about why. “This feels hard because you are actually building durable memory. Research shows this is the most effective way to learn.” It should not water down the difficulty to make the experience more comfortable.

The AI shines here in generating diverse practice questions that test the same underlying concept in different contexts, preventing the student from memorizing specific question patterns instead of learning the underlying principle. It can also provide tailored feedback on wrong answers, identifying the specific misconception behind the error and explaining exactly where the reasoning went wrong.

Interactive simulations, earned

After the diagnostic assessment confirms that a student has developed sufficient prerequisite knowledge in a topic area, the tool can introduce interactive simulations for conceptual deepening.

The simulations should use implicit scaffolding, guiding the student through environmental design rather than explicit instructions. The research shows that students given only open-ended questions (“What happens when you increase this?”) with well-designed simulations explored more, discovered more relationships, and revised their mental models more effectively than students given step-by-step instructions who explored only enough to answer each specific question.

The AI’s role is composing the right simulation experience for the right student at the right time. Based on the student’s knowledge state, the AI selects which variables to expose, what starting conditions to use, and what guiding questions to pose. It generates the narrative that walks the student through the exploration. The simulation itself, the interactive components, should be reliable, tested primitives, not arbitrary code generated on the fly.

This is a constrained-composition approach. The AI chooses and configures from a library of well-tested interactive components, similar to how a music producer works with a library of instruments and samples rather than building new instruments for every song. The AI handles the creative composition. The components guarantee reliability.
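In code, the constraint can be as blunt as validating whatever the AI proposes against a library of known components. The component names and parameters below are invented for illustration.

```python
# The AI may select and configure pre-built, tested components; it never
# generates interactive code on the fly.

SIMULATION_LIBRARY = {
    "circuit_builder":   {"exposable_variables": ["voltage", "resistance"]},
    "projectile_motion": {"exposable_variables": ["angle", "initial_speed", "gravity"]},
}

def compose_simulation(component, proposed_config, guiding_question):
    """Clamp an AI-proposed configuration to the component's allowed
    parameter space before anything reaches the student."""
    spec = SIMULATION_LIBRARY[component]
    exposed = [v for v in proposed_config.get("expose", [])
               if v in spec["exposable_variables"]]
    return {
        "component": component,
        "expose": exposed,                                   # only supported variables
        "starting_values": proposed_config.get("starting_values", {}),
        "prompt": guiding_question,                          # open-ended, not step-by-step
    }
```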

Analogies with guardrails

When the AI uses analogies to explain concepts, it should follow a specific protocol informed by the research.

Use multiple analogies for complex concepts, not just one. Each analogy captures different aspects, and the combination helps the student build a richer mental model than any single comparison could provide.

Explicitly address where each analogy breaks down. The research shows that teachers consistently skip this step, and the result is students who internalize the analogy’s limitations as if they were features of the real concept. The AI should say “this is like X in these ways, but unlike X in these ways, and here is why the difference matters”.

Monitor for over-extension. If a student’s subsequent responses suggest they have taken an analogy too far, applying features of the familiar domain that do not map onto the new concept, the AI should catch this and correct it before the misconception solidifies.

Use bridging analogies when the conceptual distance is large. Instead of one big leap from familiar to unfamiliar, chain smaller steps through intermediate cases that gradually shift from the student’s existing understanding to the target concept.
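Concretely, each analogy the system uses can carry its boundary conditions as data, so checking for over-extension is mechanical rather than left to the model's judgment. The layout and the crude keyword-cue check below are illustrative assumptions.

```python
WATER_FLOW_ANALOGY = {
    "target_concept": "electric current",
    "maps_correctly": ["current as flow around a closed loop", "battery as a pump"],
    "breaks_down": {
        # cue in the student's wording -> correction to deliver
        "used up": "current is not consumed as it passes through a component",
        "drains":  "charge does not drain out of the circuit like water",
    },
}

def check_overextension(student_statement, analogy):
    """Return corrections for any known breakdown the student's wording
    suggests they have carried over from the familiar domain."""
    text = student_statement.lower()
    return [fix for cue, fix in analogy["breaks_down"].items() if cue in text]
```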

No raw AI access

Based on the research showing that unconstrained AI access harms learning, the tool should never give students a freeform chat interface with the AI. Every interaction should be mediated through a structured pedagogical sequence.

This does not mean the AI is invisible. The student should feel like they are having a conversation. The conversation follows a designed structure: the AI asks a question, the student responds, the AI evaluates the response and determines the next move based on the pedagogical framework. If the student is struggling, the system provides a hint, vague at first, more specific if needed. If the student is succeeding, the system increases difficulty or moves to a new topic. If the student’s response reveals a misconception, the system addresses it directly.

The AI handles the language, generating natural, responsive, encouraging dialogue. The system handles the pedagogy, deciding what to say when, what to ask next, when to advance and when to review, when to offer support and when to let the student struggle productively.
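Put differently, the decision about what happens next is a small, inspectable function, and the LLM is only ever asked to phrase the move that function returns. The move names and thresholds here are assumptions, not a published design.

```python
def choose_next_move(evaluation, hint_level, mastery):
    """Map the evaluation of the student's last response to a pedagogical move."""
    if evaluation == "misconception_detected":
        return "address_misconception"
    if evaluation == "incorrect":
        return "give_hint" if hint_level < 3 else "show_worked_solution"
    # The answer was correct: advance or deepen depending on estimated mastery.
    return "advance_to_next_concept" if mastery >= 0.9 else "present_harder_variant"
```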

Constrained AI consistently outperforms unconstrained AI in education research. The constraint is the feature that makes the tool effective.

Measure what matters

The tool should build in assessment that separates the feeling of learning from actual learning.

Delayed tests. Do not just check whether the student can solve a problem immediately after learning it. Check whether they can solve it three days later, a week later, a month later. Immediate performance is a poor predictor of learning. Delayed performance is much better.

Transfer tests. Do not just check whether the student can solve the same type of problem they practiced. Check whether they can apply the concept to a new situation they have not seen before. This is the true test of understanding versus memorization.

Calibration checks. Ask students how confident they are in their answers before showing whether they are correct. Over time, a well-functioning tool should help students become better calibrated, more accurately assessing what they know and what they do not. Students who consistently feel confident on questions they get wrong are experiencing the illusion of competence that the tool should be helping them overcome.

Honest progress reporting. The tool should never tell students they are doing better than they are. If a student is scoring 45% on delayed retrieval tests for a topic, the tool should say so, clearly, and explain what it means. False encouragement is sycophancy, and sycophancy reinforces the very illusions that prevent real learning.
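Of these, calibration is the cheapest to compute automatically. A sketch, assuming each logged response records a confidence rating and whether it was correct:

```python
def calibration_gap(responses):
    """responses: list of dicts with 'confidence' in [0, 1] and 'correct' as a bool.
    Returns mean confidence minus accuracy; a persistently positive gap is the
    illusion of competence the tool should be surfacing to the student."""
    if not responses:
        return 0.0
    mean_confidence = sum(r["confidence"] for r in responses) / len(responses)
    accuracy = sum(1 for r in responses if r["correct"]) / len(responses)
    return mean_confidence - accuracy
```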

Support for teachers and facilitators

The strongest evidence in AI education does not come from AI replacing teachers. It comes from AI augmenting them. One system that gives real-time pedagogical suggestions to human tutors found that students were significantly more likely to master topics, with the effect nearly doubling when the human tutor was less experienced. The cost was $20 per tutor per year.

A comprehensive learning tool should include a dashboard for teachers, tutors, or facilitators that surfaces the information they need to intervene effectively. Which students are struggling and on what specific concepts. Which students have disengaged (stopped using the tool or showing declining performance). Which concepts the entire group is weak on, suggesting a gap in the teaching rather than individual student problems. How each student’s knowledge map has changed over time.

This component may be the highest-leverage piece of the entire system. It does not require complex AI. It requires thoughtful data presentation. It transforms the teacher from someone guessing about 30 students to someone making informed decisions based on precise knowledge-state data.
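The queries behind such a dashboard are ordinary aggregation over the same knowledge-state data the tutor already maintains. A sketch, with the data layout assumed for illustration:

```python
def struggling_students(class_data, threshold=0.5):
    """class_data: {student: {concept: mastery}}. For each struggling student,
    list the specific concepts where mastery sits below the threshold."""
    return {student: [c for c, m in concepts.items() if m < threshold]
            for student, concepts in class_data.items()
            if any(m < threshold for m in concepts.values())}

def groupwide_weak_concepts(class_data, threshold=0.5, share=0.6):
    """Concepts weak for most of the group: likely a gap in the teaching,
    not an individual student problem."""
    counts = {}
    for concepts in class_data.values():
        for concept, mastery in concepts.items():
            if mastery < threshold:
                counts[concept] = counts.get(concept, 0) + 1
    return [c for c, n in counts.items() if n / len(class_data) >= share]
```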

The integration gap

The individual components described here are not hypothetical. Diagnostic assessment engines exist and work. Mastery-based progression has extensive evidence behind it. Spaced retrieval practice is the most validated learning strategy known. Constrained AI outperforms unconstrained AI in controlled studies. Interactive simulations have two decades of evidence. Teacher dashboards are straightforward engineering.

What does not exist is a system that integrates all of these components, powered by modern AI capabilities, designed from the ground up based on learning science rather than engagement metrics, and available to the students who need it most.

Building this tool requires going against every instinct that currently drives educational technology product development. It requires building something that feels harder when every competitor promises easier. It requires measuring success by what students remember weeks later, not by how many minutes they spend on the app today. It requires telling students they need to go back to fundamentals when every other tool lets them skip ahead. It requires constraining AI when the marketing advantage lies in showcasing its freedom.


References