One provocative litany I’ve used to frame my work is: what comes after the book? Is it pictures of pages on screens? Is it videos of lectures? Why are all the answers to this question so boring? Where are the powerful ideas about how people learn, feel, and act?
I’ve been exploring memory systems as one avenue in response. But in my mind, doing something interesting with memory systems means leaving behind spaced repetition systems as we understand them today. The goal isn’t to “scale Anki.” It’s to use prototype systems to understand something new about how people learn, feel, and act—then to use those new understandings to create new kinds of systems.
On the ground, day-to-day, the biggest challenge I face is in asking good enough questions—questions whose answers can significantly change the way we understand a phenomenon. I’ve written this somewhat grumpy piece to distill some of my struggles here. We’ll sort through some boring, bad questions and try to lever our way to some more interesting ones.
Quantum Country has accumulated several million data points. Seems great! People are often quite excited when they hear that, as if accumulating a mountain of data will necessarily produce new understanding. But to extract meaning from that data, you need good questions. More perniciously: unless you have good questions in mind, you’re probably not even collecting the right data.
I’ve run dozens of analyses across many controlled trials on Quantum Country’s data. If I were a junior academic, I could probably have turned these into several papers’ worth of experimental results. Gotta juice that publication/citation count! But I haven’t published any of these studies, because I don’t think the questions they’re asking are good enough. The results I have are too parochial, too conditional on local details.
Asking good questions is hard. Part of the problem is that most papers aren’t asking good questions. If you read an unbiased sample of papers, your taste will mostly be shaped by boring questions which do little to advance the field.
The obvious questions are usually incremental. They assume the parameters of existing frameworks, then attempt to clarify some extension or variation. “Does the spacing effect manifest… for first-graders… when learning science concepts?” Such experiments can accrete understanding, but they’re quite distinct from, say, the initial experiments which uncovered the effect.
Another set of obvious questions stems from asking “what can we do with the data we already have?”, rather than “what would we really like to know, and how might we collect the data to know it?” This kind of data-centric fixation is wearyingly common around Silicon Valley types: wow, you have all this data! Let’s optimize things! Surely we can use this data to produce a more efficient review schedule? Yes, sure, but inefficiencies are not what hold back memory systems: they’re wildly efficient, even using dumb schedules! Why are schedule optimization questions the ones you want to ask? In most cases, I think the answer is “because it’s easy.”
Analyzing Quantum Country’s memory data
Here’s a bad question you can ask about Quantum Country: does it work? There are several key problems with this question. The first is: what does “work” mean? The second is: a yes/no question like this tells you little. The third is: despite the phrasing, spaced repetition is sufficiently well-supported that “yes” should be considered the null hypothesis. But understanding these failures can help us write a better question.
Here’s a better—but still bad—question you can ask about Quantum Country: what fraction of participating readers eventually end up reliably remembering all the material? The boring complaints about this question are methodological: compared to what? What does “participating readers” mean? What does “eventually” mean? What about survivorship effects? But a lack of rigor isn’t the real problem with this question. The real problem is: what would an answer even mean? If the answer were 80%, how would your understanding differ from a world in which the answer was 70%? What does an answer to this question teach us about Quantum Country, much less about how people learn/feel/act in general?
Let’s try again: across varying time periods, how much of Quantum Country’s material would a reader remember if they didn’t do the review sessions, compared to those who did? This question seems worse because it’s less precise. Yes, you’d need to nail down several elements to get a real answer. But rigor aside, this is a better question because it starts to access the dynamics of learning. It’s the first of our example questions which might teach us something generalizable.
It’s important to remember, after all (and I’m reminding myself right here!): this is the point—to learn something generalizable. We’re trying to learn something which might help us build the next system, the next category of systems. The point is not (as it usually is in tech) to produce experimental data which can show that “our product works!” on some marketing page. Fuck that. The point is insight and its downstream consequences.
So let’s double down on generalizability. Here’s an even less precise question which I feel is nevertheless much more interesting: what is the effect of a particular review event on a person’s memory? A moment before that question appears, their mind is in one state. Then they answer the question, and their mind is in another state, with durable changes which persist for weeks or months. What happened? Can we characterize that change? What parameters does the change depend on? What’s its stationarity? Essentially, can we establish a function which describes the dynamics of retrieval on memory?
I became interested in this question last summer, only to realize that my millions of data points couldn’t actually help me answer the question, since they lack variation along the necessary axes. I’ve had to artificially introduce controlled variation (e.g. random variations in scheduling) and wait for new data to accumulate. This was a painful but valuable lesson.
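As a concrete sketch of what “artificially introduce controlled variation” can mean, here’s one way to jitter scheduled intervals so the dataset gains spread along the axis you want to study. This is illustrative only; the function name and the 30% spread are my own placeholders, not Quantum Country’s actual parameters:

```python
import random

def jittered_interval(base_days, spread=0.3, rng=random):
    """Randomly perturb a scheduled review interval so that, across the
    population, we observe outcomes at many intervals rather than one.
    `spread` is the maximum fractional deviation from the base interval
    (illustrative value, not the real system's)."""
    factor = 1 + rng.uniform(-spread, spread)
    # Never schedule less than one day out.
    return max(1, round(base_days * factor))
```

With a 10-day base interval and a 30% spread, each reader gets a review somewhere between 7 and 13 days out, which is exactly the variation needed to compare outcomes across intervals later.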
The other problem with this sort of question is that it probably means constructing a model. The literature around spaced repetition is full of models, predicting e.g. probabilities of recall after various intervals. I’m extremely skeptical of these models. They might be somewhat predictive, but I don’t think they’re very explanatory. What should we understand a “probability of recall” to mean, physically? When I have a 60% chance of remembering an answer in a given moment, what’s actually happening to make my mind differ from another answer I have a 70% chance of remembering? It’s not a matter of dice in my brain. There have been various attempts to align empirical probabilistic models to theoretical frameworks of memory, but such models are fraught with “let’s estimate a probability by assuming an exponential fit and doing a logistic regression…” More predictive than explanatory. I don’t trust it.
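To make the kind of modeling I’m skeptical of concrete, here’s a minimal sketch of the standard move: assume an exponential forgetting curve governed by a hidden per-item “stability” parameter, then fit that parameter to binary remembered/forgotten outcomes by maximum likelihood. The function names and parameterization are my own shorthand, not any particular paper’s:

```python
import math

def p_recall(elapsed_days, stability):
    """The assumed exponential forgetting curve: probability of recall
    `elapsed_days` after the last review, given a hidden per-item
    'stability' parameter (in days). The exponential form itself is an
    assumption, which is exactly the point."""
    return math.exp(-elapsed_days / stability)

def log_likelihood(stability, observations):
    """How well a candidate stability explains observed outcomes.
    observations: list of (elapsed_days, remembered: bool) pairs.
    Fitting means maximizing this over `stability`."""
    ll = 0.0
    for t, remembered in observations:
        p = p_recall(t, stability)
        ll += math.log(p if remembered else 1 - p)
    return ll
```

Notice that every quantity here is conditional on the assumed functional form: the fitted “probability of recall” is only as meaningful as the exponential curve it came from, and evaluating the fit requires yet another modeling choice.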
Quantum Country has an unusual opportunity to explore the dynamics of memory with less modeling chicanery. For example, the challenge that a system like SuperMemo faces is that each user writes their own questions. So there is only one sample—ever—of a given person answering a given question for, say, the third time. At that moment, it was either “remembered” or “not remembered”. Or, OK, a “grade” of 1–5. You don’t get a nice continuous value, and there’s no way to talk about the “probability of recall” for answering that question at that time without doing some kind of curve-fitting estimation. Was it 80%? 85%? How good was your estimation? Well, you have to use another model to evaluate that, according to how well the estimate explains the subsequent data points. This is what we call the “rub some linear algebra on it” approach to understanding. Don’t get me wrong: you can produce useful systems without explanatory understanding! But it’s helpful to identify such places as potential opportunities.
On Quantum Country, everyone answers the same questions, so we have many samples for every situation. We don’t need to estimate “retrieval probabilities”: we can look at how fractions of populations shift between various buckets. For example, of the 50k people who reviewed this question five days after initially remembering it while reading the essay, how many of them remembered the answer? How does that compare to the fraction of the population which was instead asked to review the same question several weeks after their initial read? No model necessary here. Or you can think of it as a frequentist probability estimation of some hidden “retrieval strength” variable, I guess. Whatever. I think it’s a stronger foundation for understanding.
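As a sketch of what that bucket-based comparison looks like in practice (assuming hypothetical per-review log records; the field names are mine), the analysis needs nothing beyond counting:

```python
from collections import defaultdict

def retention_by_interval(reviews):
    """reviews: list of (question_id, interval_days, remembered: bool)
    tuples, one per review event. Returns, for each (question, interval)
    bucket, the fraction of the population that remembered -- a direct
    empirical frequency, no curve-fitting required."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [remembered, total]
    for qid, interval, remembered in reviews:
        bucket = counts[(qid, interval)]
        bucket[1] += 1
        if remembered:
            bucket[0] += 1
    return {k: r / n for k, (r, n) in counts.items()}
```

Comparing, say, `rates[("q1", 5)]` against `rates[("q1", 21)]` directly answers “how does a five-day interval compare to a three-week one for this question?” without positing any hidden variable.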
When someone has trouble remembering a prompt, what should we do? Yes, we can change the schedule, but what else? I’ve run a controlled trial on the retry mechanism, and it seems to help, particularly early in the learning process. But that’s quite a blunt instrument. Re-reading? Breaking the forgotten topic down into more detailed constituents? Providing alternative examples? Supplementary explanations? Or, maybe you can do nothing, and if there are enough other adjacent prompts, those will eventually support memory of the troublesome prompt. To me, these intervention questions are much more interesting than issues of schedule optimization.
Bored of memorization studies
For the sake of discussion, let’s try a different approach. Let’s get further away from well-studied SRS paradigms. Imagine that you’re forbidden to ask: “did they remember that answer?” What questions can we ask about the mnemonic medium? About reading informational texts, in general?
This is a nice lens to hold, because it’s a reminder that Michael and I didn’t conceive of the medium simply as a more easily adoptable Anki. It’s just that it’s so easy to ask questions about whether or not people can remember the answers to the questions the medium asks. So it’s easy to accidentally fixate there, even though other elements may be much more important. But that’s lazy, and it’s unlikely to produce transformative insight.
We could say: look, the spacing effect and testing effect have been studied enough. They reliably produce stable memory encodings. If you understood them better, you could probably make them work more efficiently. But they’re really quite efficient already. As far as rote memory issues are concerned, maybe the problems are sufficiently solved.
But rote memory isn’t all that interesting. Memory is a proxy for learning, which is a proxy for meaningful enablement. So what can we say about learning? In what circumstances and to what extent does reliable memory transfer to open-ended tasks in the subject? That is: if you’ve studied Quantum Country, can you explain topics in quantum computing to someone else? Can you solve (simple) problems you haven’t seen before? Can you create circuits for a purpose? Can you spot unmentioned connections to your understanding of classical computers? Get more specific: what are the authorial implications? What characteristics of prompts seem to promote this type of transfer learning, and through what mechanisms?
One of our core hypotheses for Quantum Country (still untested!) is that the mnemonic medium may have significant effects on downstream topics. That is, if you study chapter 1 via a mnemonic text, can you learn chapter 2 more rapidly? Accurately? Deeply? Can you learn topics you wouldn’t practically have been able to learn before? What are the key interactions here? Presumably some prompts matter more than others—what characterizes that? Presumably there’s a non-linear relationship between the amount of practice and the impact on downstream topics—what is it, and in what ways is it malleable? What are the upper bounds on this effect? To paint a vivid concrete picture: can we reliably enable a typical teenager to engage with graduate-level material?
What about creativity? Where do ideas come from? In Seeing What Others Don’t, Gary Klein suggests that key patterns of insight generation include noticing connections and contradictions (along with a few other factors, less relevant here). Propensity to notice connections and contradictions seems awfully dependent on what’s in one’s memory! So: can memory systems make us more insightful? Presumably some kinds of prompts help here more than others—what characterizes that? Are special synthesis-oriented prompts helpful, or is the impact more a function of solidly understanding the basics? If we designed a new “memory system” with the sole aim of downstream impact on creative work, what would it look like? Would it involve retrieval practice at all?
Screw “learning”: what about action? What sort of learning leads to downstream action in the world, as opposed to just learning for learning’s sake? How might we design environments which support the factors which produce such action? Which promote great conversation with friends?
What about behavior change? Are “salience prompts” a thing? How do we write good ones, and what’s the scope of their effect? Is there value in author-provided prompts of this kind, or must they be created by readers? Perhaps there’s some happy medium? I’ve suggested that for topics like meta-rationality, extended contact with the material may turn out to be the primary value of the medium. How would we know if that were true? If “extended contact” really is the primary goal, what fundamental “nouns” and “verbs” should we build a communications system around?
One surprising theme in Quantum Country user interviews was that the sessions had an impact on readers’ identities. Engaging with questions about quantum computing every few days over the span of months helped people start to think of themselves as “a person who studies quantum computing”, in a much more visceral way than if they’d simply read an explanatory essay on an afternoon a few months back. I don’t understand this at all! I don’t understand how to know whether it’s happening, or what’s happening, or what the implications of it are—much less how to characterize the interactions with details of the text or the medium in any more generalizable way. But despite my total inability to generate any good questions around this theme, it strikes me as fertile ground for good questions.
Most of the questions in this section aren’t stated crisply enough to actually explore in detail. Refining them to the point of actionability will require a great deal of insight—insight which may not be available without poking around at poorly-shaped versions of the questions. But asking these increasingly outlandish questions is an exercise, for me, in actively rejecting the stupendously boring questions which pervade the literature around memory systems and adjacent “learning” technologies.
All this blather about questions isn’t just idle rumination. I have a few projects about to take flight for which my questions are quite inadequate!
I’m collaborating now with an economics professor on a class in which we’re running a randomized controlled trial around mnemonic-medium-like interactions. We’ve got the core mechanics up and running, so now the question is: what, exactly, should we be measuring in the course of the class? I mean, yes, sure, we’ll record their class test scores and a lot of their review attempts. But it would be quite uninteresting to simply find that “people who use SRS get better grades in the class.” That’s the null hypothesis at this point. The goal is to generate insight. So what should we be looking for in interviews? In open-ended projects? I’m not worried about pre-registering my hypotheses or anything like that. Everything we’re doing is exploratory, meant to improve the questions we’re asking. But I do want to make sure we’re recording what we need to record to answer a wide range of questions.
Likewise, I’m excited about David Chapman’s new meta-rationality essay, which incorporates Orbit prompts to reinforce its ideas. It’s quite unlike both Quantum Country and How to write good prompts: it’s in part a persuasive essay, though it’s also an explanatory essay, introducing much more abstract tools than those in the prompt-writing guide. The feedback so far has been interesting. Something about it isn’t working. But it isn’t not working either—I think. My questions here are still quite weak. I haven’t dug into the data we have at all, but I’ll do that in the coming week.