Deep learning works extraordinarily well. And we still largely don’t know why.
Neural networks can write code, diagnose disease, translate language, and produce images indistinguishable from photographs. The machinery underneath is, in principle, completely legible: you can write down the architecture, the data, the objective, the learning rule.
And yet we have no unified scientific framework explaining why training works, what the resulting networks will do, or how to predict their behaviour from first principles. The field trains largely by intuition, folklore, trial and error, and increasing scale.
Today, a group of researchers from Berkeley, Harvard, NYU, Stanford, the Flatiron Institute, Penn, and the Astera Institute is publishing a paper that argues this, slowly but surely, is about to change.
There Will Be a Scientific Theory of Deep Learning—by Jamie Simon, Daniel Kunin, and twelve co-authors—articulates an emerging discipline they term learning mechanics, and consolidates five converging lines of evidence that a rigorous theory of deep learning is not merely desirable but already taking shape.
Imbue has been proud to support this research. We believe deterministic engineering of deep learning systems will make them easier to build openly, and harder to monopolize. To accompany the paper, we recorded an episode of Generally Intelligent in which Jamie and Dan discuss the ideas in depth.
References
There Will Be a Scientific Theory of Deep Learning (Simon, Kunin et al.)
Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer (Yang et al. 2022)
Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability (Cohen et al. 2021)
Scaling Laws for Neural Language Models (Kaplan et al. 2020)
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks (Saxe, McClelland, Ganguli 2014)
Neural Tangent Kernel: Convergence and Generalization in Neural Networks (Jacot et al. 2018)
The Platonic Representation Hypothesis (Huh et al. 2024)
Generalization in diffusion models arises from geometry-adaptive harmonic representations (Kadkhodaie, Guth et al. 2024)
On the Stepwise Nature of Self-Supervised Learning (Simon, Knutins, Fetterman, Albrecht et al. 2023)
Alternating Gradient Flows: A Theory of Feature Learning in Two-layer Neural Networks (Kunin et al. 2025)
Transcript
Daniel Kunin (Intro)
At the heart of trying to understand artificial intelligence or how deep learning works is about understanding learning, and learning is about movement. Learning is changing parameters, so it’s the model moving through some parameter space. And physics has spent centuries building up tools and ideas and thought processes on how to think about movement.
Kanjun Qiu
Hey, welcome to Generally Intelligent. I’m your host Kanjun, and I’m the CEO of Imbue. We’re an AI company whose mission is to make tech serve humans, and we do that by building open agent infrastructure tools: agents that help humans maintain control as AI capabilities grow. And today we are talking about a new perspective paper from Jamie Simon and Daniel Kunin. There are 12 other coauthors. This paper argues that there will be a scientific theory of deep learning, and it gives a vision of how this will look. This is a little bit controversial in the field, and we’ll get into that.
Jamie Simon is a deep learning theorist. He did his PhD in physics at Berkeley, advised by Mike DeWeese. He won the department’s best thesis award, and his research aims to build first-principles understanding of the learning behaviors of neural networks, often taking inspiration from ideas in physics. Jamie’s also a research fellow at Imbue. We fund this work because we believe a scientific theory of deep learning is a great tool for democratization of power.
Daniel Kunin is a postdoc at Berkeley, also hosted by Mike DeWeese and Peter Bartlett. Daniel completed his PhD at Stanford, advised by Surya Ganguli. His research integrates insights from statistics, physics, and neuroscience to study the mathematical principles of artificial and natural intelligence.
We also have here Josh Albrecht, Imbue’s co-founder and CTO. Josh is a former machine learning researcher who recently has been focused on building open source tools for agents. Josh runs about 100 agents in parallel right now, and has been shipping like 50,000 lines of code every day—sorry, week. Not day. We’re not there yet. Maybe in a year or two — which is kind of crazy, but really interesting. Josh brings the applied lens to deep learning.
Let’s get started. You guys are publishing a paper, and you make the case in this paper that there will, in fact, be a theory of deep learning. You talk about the emerging evidence for such a field, and you’re calling the field learning mechanics. Can you tell me what is learning mechanics? How would you describe it?
Jamie Simon
Learning mechanics is the term we’re proposing for essentially a fundamental, mechanistic, mathematical science of the learning and other behaviors of neural networks.
If you’ve heard of mechanistic interpretability, it often frames itself as the biology of deep learning. It approaches things with a biologist’s or systems neuroscientist’s lens, trying to pick apart anatomy, identify circuits, and connect these with a semantic interpretation of what these models are doing and how they’re thinking. And it’s mostly a qualitative science.
Learning mechanics hopes to be essentially the physics of deep learning: a first-principles, highly mathematically-grounded counterpart, more distant from the semantics and closer to things like the training process, the selection of hyperparameters, and the dynamics of how all this learning happens. A theory like this gives a solid foundation to ask lots of other questions in the world of deep learning that you might want to know the answer to.
Kanjun Qiu
Something I’m curious about is why does deep learning need such a theoretical foundation? Like, what does the “physics” of deep learning give us that the biology doesn’t?
Daniel Kunin
Having a theory of deep learning would have many practical, scientific, and safety implications. From the practical point of view, deep learning has historically been driven mostly by trial and error: seeing what works and what doesn’t, doing what amounts to grad student descent in practice. Having an actual theory for the foundations of what’s driving success in these tools would allow us to be more theory-driven, more efficient, and potentially safer. We could think about the risks that come with these technologies and design ways to mitigate those risks.
In terms of the scientific reasons, it’s just a fascinating idea to try to build an understanding of these very complex machines that are generating text and images, and actually try to understand what’s really driving that success. And then there’s the neuroscience side of things: if we really understand how artificial intelligence works, that might also provide a lens for understanding natural intelligence. So building a scientific theory of deep learning would have all these different practical, safety, and scientific implications.
Kanjun Qiu
So one thought is: if we have a physics of deep learning instead of just mechanistic interpretability, we have less grad student descent and more like engineering for deep learning systems. Less guess and check and more causal, predictive models for how we should expect a training run to pan out based on different dynamics.
Daniel Kunin
Yeah. Machine learning in general is a really interesting technology because our choices don’t directly go into the final product. When we design these systems, we’re designing a playground that iterates on itself, and the result is the model. Our design choices as engineers only implicitly affect the final product, which makes it very difficult to know how that product will behave. When we design a bridge, we’re designing the bridge itself. We can understand how our design choices affect whether the bridge is going to fall or not. Having a theory of civil engineering gives us the ability to understand the conditions under which bridges might fall. That’s very difficult when our design choices are not going directly into the final product.
So this idea of building a learning mechanics is in some sense trying to understand how those design choices in the beginning—in the setup, whether the data, the architecture, the learning, the hyperparameters of that whole process—affect the final product.
Kanjun Qiu
It’s kind of like: how do the initial conditions pan out?
Daniel Kunin
Yeah.
Jamie Simon
And to the question of why we want fundamental physics in addition to a kind of biology or systems neuroscience perspective: there are things that you can do with a quantitative science that are much harder to do with a qualitative science. And the opposite is also true. There are things you can do with a semantically aware science that you cannot do with a “dumb” science—like explaining cognition in the brain from quantum physics alone is very difficult, but explaining why neurons fire without an understanding of atomic physics is impossible.
If we’re serious about the challenge of understanding deep learning and putting together some kind of publicly available theory about this, then we want to be studying it at all levels of abstraction. And this level happens to resemble physics. There’s no dogmatic reason why there should be a physics of deep learning any more than a pharmacology of deep learning. It’s descriptive rather than prescriptive. In trying to understand these things and asking natural questions, we were finding—and people in our field have found—that asking certain types of natural questions leads you inexorably towards this first-principles, quantitative problem that’s at the root of everything. And we think that problem has an answer that’s going to look kind of like a physics in some ways and not in others. That’s what we’re trying to articulate here.
Daniel Kunin
To me, learning is really a process of movement. Learning is changing parameters. And so it’s the model moving through some parameter space. And physics has spent centuries building up tools and ideas and thought processes on how to think about movement in the physical space. So at the heart of trying to understand artificial intelligence or how deep learning works is about understanding learning, and learning is about movement.
Kanjun Qiu
That’s super interesting. I love that way of thinking about it. And one thing that’s nice about deep learning systems is that they’re completely measurable, unlike biological systems. You can learn so much more.
Jamie Simon
Oh my gosh. Yeah. Both of us—I did my PhD in the Redwood Center in Berkeley, Dan is a postdoc there, and I’m a visiting scholar there—we hang out with neuroscientists all the time and get to see firsthand how difficult the task of doing theoretical neuroscience is, because you can write down all the mathematical models you want, but you’re so limited in what you can measure about the brain.
Kanjun Qiu
I’m curious: why now? What were you noticing that led the 14 of you to get together and write this paper now? Not five years ago, not ten years from now. What are you observing?
Daniel Kunin
Well, there’s a simple answer to that, which is most of us just graduated from our PhDs. But we didn’t actually get together with the purpose of writing this paper. We got together to share ideas and talk about research. All of us think deeply about trying to understand deep learning, and we take different approaches. Many of us have either a physics background or a physics perspective, but that’s not true for everyone on the list.
Jamie kind of pulled us together as a group. We had all known each other—either we were in the same institutions or neighboring institutions. Jamie was at Berkeley, I was at Stanford. Or we were going to the same conferences. At some point, Jamie approached me and said, why don’t we create a community among these grad students who are studying the same ideas with similar perspectives but in different institutions and different labs, and try to come together to think more deeply about what we’re doing and share our ideas—not just through arXiv papers, but through conversation and dialog.
So this paper actually came out of a retreat that Jamie organized, where we all came together in the woods in the Berkshires, cooking food for a week and sharing our research ideas. At some point we realized we all had quite different perspectives on how we do research. We might not be coming together to produce a new technical contribution to the field. But we realized that as a field, we really needed to summarize the progress we’ve made and our internal intuitions about open directions and next steps. So this paper kind of came out of that.
Why now more broadly in the field, I’ll let Jamie answer that.
Jamie Simon
There are a few reasons why the present moment is particularly promising for the development of a scientific theory of deep learning. One is just that deep learning has never been more accessible, more commoditized, more highly studied. We now have a pretty good convergence on methods for training large-scale systems that are fairly reproducible and work fairly well. It was harder to do science of large models when things were still being hashed out, but now that landscape has solidified. Most of us are in this fortunate spot in our careers that offers us time to do it.
Kanjun Qiu
That’s very exciting. You’re free to work on the important problems.
Jamie Simon
Yeah.
Daniel Kunin
And I think “important” is a key word right there. We’re in a pretty unique time where the practice of these technologies is skyrocketing, and we can see all the effects they’re having in our world, both here in San Francisco and the Bay, and more broadly. The case for really trying to understand these things only grows stronger.
Kanjun Qiu
Yeah, totally.
Jamie Simon
And in our field, we’ve been watching the field of deep learning theory and the academic efforts to understand these systems push forward and grow and change and hit walls for five, six, seven years. We’ve seen a lot of things that have worked, and our assessment, getting together, was: oh yeah, we think there are serious ways in which this effort—which for a long time has been embattled, really difficult to get purchase on the problem—is actually starting to work. Things are starting to come together in different ways.
So this paper is trying to convey a tone of optimism as we organize different lines of evidence that things are actually coming together. Everything we do in this paper is trying to be descriptive, not prescriptive. We didn’t set out to write an optimistic paper. We talked about what we thought was happening and where we should go next, and found that the natural thing to do was to articulate these emerging ideas in one place and try to drum up momentum and excitement to bring them forward.
Josh Albrecht
Maybe we can get into some of the details of why you both believe it’s now possible to make this type of theory. For a while there has been a sort of reputation among practitioners, over the past 5 or 10 years of people doing deep learning, that theory is kind of useless or pointless: empirical results are really a lot further ahead than theoretical understanding of neural networks. This was not always true. In the past, we had a pretty good understanding of random forests, support vector machines, some of these other systems. But with deep learning there was this kind of bigger and almost growing gap where it felt like we were leaving theory behind. What has changed in the past year or two? What new tools are there? What gives you this optimism that we might actually be able to make really interesting progress on the theory of deep learning?
Jamie Simon
Let’s widen for a moment the idea of what theory is. Inside these large labs, there are massive science-of-scaling teams. They want to ask: how does every hyperparameter scale with every other hyperparameter? Getting these things right is really important to scaling up a large model. The company that does this better could have the better model after the next training run.
Identifying these empirical hyperparameter scaling relationships is sort of like a proto-version of theory. It’s like: okay, there’s an exponent on this log-log plot, it appears to be two, to within some error bars. Great. That’s useful. When you get a clear signal like that, you can carve off the problem and toss it to your friendly neighborhood theorist. And there’s often a nice reason why it’s two, or whatever it is.
As large-scale model training has become more and more systematized, getting things right fairly early on—instead of just trying every possible permutation of architectures and hyperparameters—has become more and more important. So it’s sort of pushed this science-of-scaling perspective, which has made clear that there are principles here.
There’s a famous example, particularly celebrated in the theory community, called Maximal Update Parameterization (µP). µP leads to a technique called μTransfer for taking certain hyperparameters, like the learning rate, of a smaller model and scaling it up to a larger model. This had breakthrough success and was used a lot around the GPT-4, GPT-5 era, and in many different forms is now kind of just baked into the way people think about scaling up models. The paper actually came out right when I joined Imbue, and you guys had me present it at Journal Club.
Kanjun Qiu
I do remember that. Yeah. Greg Yang’s paper.
Josh Albrecht
Can you give a very brief description of that paper and the technical perspective?
Jamie Simon
Absolutely. Here’s something that every practitioner training large models has experienced. You’re going in, training a large model — it’s really expensive, it takes hours or days or weeks to train the whole thing. But there are some numbers you have to set. These numbers are called hyperparameters, and they’re things like learning rate, depth, width, number of heads in your transformer, all of these things. You go and train the large model and find: that did not work as well as I expected. You think: I got the hyperparameters wrong. So what do you do? It’s too expensive to just try different combinations on the big model. So you try it on a small model, find some good hyperparameters, go to the big model—and they still don’t work very well. μTransfer is essentially a technique that identifies certain non-dimensional quantities and prescribes that they remain the same upon scaling up, in particular scaling up the width of a model, the hidden dimension. Preserving them ends up looking like scaling your learning rate with a certain exponent relative to your width, and likewise the initialization scale of each layer, so that non-dimensional quantities related to training and feature learning are preserved. This often carries good performance from smaller models to bigger models, so it can really reduce your overhead and hassle when training the bigger one.
Daniel Kunin
Yeah. So practically it means you can spend your time optimizing your hyperparameters on small models, find those optimal conditions, and then transfer them to large models.
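The transfer recipe Jamie and Dan describe can be made concrete with a small sketch. This is an illustration under assumed scaling exponents (learning rate ∝ 1/width, initialization scale ∝ 1/√width for hidden weight matrices), not the actual μTransfer implementation: the function name is made up, and the real μP prescription depends on the optimizer and layer type.

```python
import math

def mup_scale_hparams(base_width, base_lr, base_init_std, new_width):
    """Transfer hidden-layer hyperparameters from a small proxy model to a
    wider one by holding muP-style non-dimensional quantities fixed.

    Simplified sketch with assumed exponents: learning rate scales like
    1/width and initialization standard deviation like 1/sqrt(width).
    The full muP rules differ by optimizer (SGD vs. Adam) and layer type.
    """
    ratio = new_width / base_width
    return {
        "width": new_width,
        "lr": base_lr / ratio,                         # lr * width held constant
        "init_std": base_init_std / math.sqrt(ratio),  # init_std * sqrt(width) held constant
    }

# Tune hyperparameters on a width-128 proxy model, then transfer to width 512.
print(mup_scale_hparams(128, 1e-2, 0.1, 512))
```

The point of the non-dimensional framing is that the tuned optimum found on the proxy stays (approximately) the optimum at the larger width, so the expensive sweep only ever runs on the small model.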
Jamie Simon
An analogy I think is particularly apt is building a small model of a bridge before building the big one. When you go and build the Golden Gate Bridge—
Kanjun Qiu
It’s much more expensive to build the big model. You want to do it on the small one first.
Jamie Simon
For any major bridge construction project, there are many competing designs to choose between. But you can’t just go build the Golden Gate Bridge ten times and see which one stands up, which one bears the most weight.
Josh Albrecht
Unless you raise enough money.
Jamie Simon
Unless you raise enough money. So one way you do this is build a small model of the bridge and see which works best. But hang on: it’s not that easy, because how do you know the small model is informative? As you scale something up, material properties change. Just look at how ants can support so much more than their weight, but we can’t. So how do you make a small model that’s informative for the big model? You can do this if you understand things about materials science and the scaling relationships of different stress, strain, and fracture quantities as you make a model bigger. What μTransfer is essentially doing is identifying the right nondimensionalized quantities to closely examine in your small model so that they’re informative about your large model.
Kanjun Qiu
So μTransfer is a good example of how theory applies to practice and helps us train larger models more effectively and more cheaply and be able to run small-scale experiments to do that.
I want to dive into the paper itself and talk through some more examples like this. You guys state five observations that serve as evidence that a theory is emerging. One, analytically solvable settings exist. Two, insightful limits actually do reveal fundamental behavior. Three, simple equations can capture meaningful macroscopic statistics. Four, hyperparameters can be disentangled and actually understood, so it’s not all a big mess. Five, even across settings and tasks where the training setup is different, you end up seeing universal phenomena.
I want to dive into each of these, maybe one by one. But before we do, I’m curious: how did you get to these five observations? What is it about these five that made you include them in this paper? Were there other observations you didn’t include? Are these five saying something in particular? Are they comprehensive?
Daniel Kunin
I can’t completely remember how we settled on those five, but I do remember the ordering had quite a few iterations.
In terms of: are there other lines of evidence that there will be a scientific theory of deep learning? Yes. This is not a comprehensive review of all papers in deep learning theory and all different approaches. There are researchers who take other approaches that we are not reviewing in this work.
These are the approaches we focused on as the most promising examples of evidence that this theory will exist, and they all have this flavor that is physics-inspired: simple models, macroscopic variables, taking limits, disentangling hyperparameters, and universality. They’re all in the spirit of mechanics. I think that’s partly why we settled on these five.
Jamie Simon
My memory of this is: we were at this cabin in the Berkshires, it was like day three, and most of these 14 authors were there, and we were trying to articulate what our shared vision was, what was in common between our perceptions of what really matters and where the field would and should go. We started filling up tons of big whiteboards with different clusters of ideas, and we changed how we thought about these many times.
At first they were like guidelines for a young person getting started in the field to follow: look for toy models, take limits when you can, make sure you study your hyperparameters. Then we realized after a few iterations that there’s a stronger framing: hey, each of these isn’t just a recommendation. It actually carries with it some clear successes from the last ten years in deep learning theory, and is forward-looking in that it suggests important open directions and ways to think about research going forward. And they all together have this mechanics flavor — this flavor of classical or statistical mechanics where it’s like learning is about movement and you study the interaction of these components and how they all lead to learning.
Kanjun Qiu
Studying the mechanics of that movement.
Jamie Simon
Yeah.
Daniel Kunin
If we’re saying there will be a theory of deep learning, maybe a more appropriate question is: what is actually the barrier to a theory of deep learning? We have access to the learning rule, we have access to the data, we know exactly what the architecture is, we know the task. And as we said, we can measure everything and anything. So it’s actually kind of surprising that we don’t have a theory.
Josh Albrecht
Yeah. The question is: why haven’t we figured it out already?
Kanjun Qiu
So how would you answer that?
Daniel Kunin
It’s not the opacity of the problem. It’s the complexity. It’s an extremely complex, interacting, nonlinear, high-dimensional system and we’re trying to understand what’s going on. One way to think about all of these observations we put forward as evidence that there will be a theory: they’re all ways of handling that complexity. Taking that complexity and simplifying it. These five categories are success stories—clusters of research papers and research approaches—and they all, at the end of the day, offer a different way of handling that complexity, which is the barrier to a scientific theory.
Kanjun Qiu
That makes a lot of sense.
Jamie Simon
And I think, to this question of why isn’t there a theory already: the questions that were asked in decades past hadn’t yet absorbed that the training process of deep learning—this high-dimensional, complex, messy thing that depends on real-world data statistics—cannot be sidestepped and really needs to be grappled with. The classic theory, what people still often think of as statistical learning theory, has this idea that a simple, parsimonious model with high regularization that doesn’t overfit will generalize well. That’s a really beautiful, mathematically correct, self-contained theory. But at the time it was developed, we didn’t have modern deep learning. There was no way that this understanding—that this complexity can’t be reduced, there’s no easy way out—could have been worked into the DNA of that whole way of thinking. The more modern flavor of deep learning theory is less like mathematics that tries to work out everything and offer a nice simple guarantee to practitioners, and is actually much more like a diverse, rich, very scientific approach that just dives into the complexity.
Kanjun Qiu
And tries to make sense of the pieces of complexity, as opposed to trying to simplify it all down.
Jamie Simon
Yeah. And a nice thing about that approach is that we don’t need one simple answer to be making progress. Those generalization bounds from the past, essentially formalizing Occam’s razor—a simple model with higher regularization won’t overfit—that’s kind of an end-to-end theory. I think we won’t have that for deep learning for a while. But that’s fine, because diving into this whole messy process and finding bits of it that we can organize—finding structure and regularity — there’s so much wonderful stuff to do. And even organizing pockets of it is useful for people who are working in those pockets.
Kanjun Qiu
So these are five pockets of organized complexity.
Jamie Simon
Yes. And our hope and belief is that these pockets can be widened and eventually linked together into something resembling a comprehensive theory. And this will happen in the next decade or so.
Kanjun Qiu
Let’s talk about some of the pockets.
Josh Albrecht
Should we start with the first one?
Jamie Simon
Yeah.
Kanjun Qiu
The first pocket is: analytically solvable settings exist. As a theory layperson, what does that mean and what does it let you understand or organize?
Daniel Kunin
My research probably most overlaps with this pocket. Deep learning — let’s first talk about what it is. It’s an architecture, a dataset, a task, and a learning rule. Sometimes we can find simple versions of these: a simple architecture, maybe a simple learning rule, a fixed dataset, and we can see how all those pieces interact. In this section of the paper—section 2.1—we talk about two different ideas for finding these simple, analytically solvable settings. They’re both, at the end of the day, about linearizing the problem: linearizing it in terms of the data or in terms of the parameters.
Linearized in terms of the data means: deep learning is these linear followed by nonlinear transformations in sequence. If we just got rid of those nonlinear transformations and thought about training a model that’s just a sequence of linear transformations from input to output, what would that look like? Would training a deep linear network be the same as training a shallow linear network?
This setting has seen immense progress in the last 10 or 15 years, showing that they’re different: the deep linear network’s dynamics, and the solution it learns, are not the same as a shallow linear network’s. So depth really does have an effect on the learning process, and we can understand that effect directly. Generally, a deep linear network has a preference for how it breaks down the task, learning the principal directions—the singular vectors of the task—in a preferred order. And that idea of a simplicity bias, of learning some things before others, is a hallmark of more modern, more realistic deep learning.
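The deep linear story can be seen in a toy run. This is a minimal sketch in the spirit of the Saxe et al. setting, not their exact experiments: a two-layer linear network W2·W1 is trained by full-batch gradient descent to match a target map with singular values 4 and 1, and we record when each singular mode is (approximately) learned.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.diag([4.0, 1.0])                   # target map: singular values 4 > 1
W1 = 0.05 * rng.standard_normal((2, 2))   # small random initialization
W2 = 0.05 * rng.standard_normal((2, 2))
lr = 0.05

t_top = t_second = None
for step in range(2000):
    E = W2 @ W1 - A                       # residual of the end-to-end linear map
    g1, g2 = W2.T @ E, E @ W1.T           # gradients of 0.5 * ||W2 W1 - A||_F^2
    W1 -= lr * g1
    W2 -= lr * g2
    s = np.linalg.svd(W2 @ W1, compute_uv=False)  # singular values, descending
    if t_top is None and s[0] > 0.9 * 4.0:
        t_top = step                      # big singular mode roughly learned
    if t_second is None and s[1] > 0.9 * 1.0:
        t_second = step                   # small singular mode roughly learned

print(t_top, t_second)                    # larger mode crosses threshold first
```

In this run the larger singular value crosses its threshold earlier than the smaller one, the one-mode-at-a-time behavior the deep linear theory predicts; a shallow linear model trained the same way shrinks the whole residual at a single uniform rate instead.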
Josh Albrecht
That’s a really good example, because people might think: well, you simplified the problem too much—if you remove the nonlinearity, a deep fully linear network and a shallow fully linear network are mathematically kind of the same, and nothing interesting is happening. But the learning dynamics make them different. You’re saying that by splitting it up, you can see where some of the parts of deep learning are coming from: this idea of learning the simple things first and the more complicated things afterward. Even though you’ve simplified it by removing the nonlinearities and made it almost trivial—just learning one matrix or something—you’re still learning something interesting about learning and about the setup. And that maybe helps us build toward the fuller picture, even if we’ve made unrealistic assumptions here.
Kanjun Qiu
Just so I understand: linearization means you remove the nonlinearities, just weights?
Josh Albrecht
The nonlinearities.
Daniel Kunin
Yeah. To linearize a network in terms of the data, you take the nonlinear network and remove the nonlinearities at every layer. If you’re using an architecture with ReLUs, you just get rid of the ReLUs and use linear activations. The activation comes in, activation goes out, no change.
Kanjun Qiu
That makes sense. What do you learn from these linearized settings that applies to nonlinear settings? And what do you lose from the linear setting that’s no longer applicable?
Daniel Kunin
Practically, we’re definitely not suggesting you train linear networks. So you lose usefulness—the network itself is no longer a useful model—
Kanjun Qiu
It’s only usable for studying the dynamics.
Daniel Kunin
Exactly, studying this idea of how the initial conditions affect the way in which the model changes in function space. What you gain is analytic tractability. It becomes a much simpler problem to study; we can think about how those four ingredients interact. And in particular, it points to this idea that the learning process is inherently biased towards simplicity. I believe that is going to be a critical part—in fact, that idea comes up in other parts of our paper, in the other sections.
Kanjun Qiu
Yeah, in almost every section.
Daniel Kunin
And it’s going to be an essential idea behind why deep learning works so well — that it is biased towards finding really meaningful aspects of the data, meaningful signals, before less meaningful ones. By doing this, it’s biased towards a simple or parsimonious solution, even though it might look more complex. It’s actually biased towards simplicity.
Josh Albrecht
Which helps explain some of the generalization, right? Because you’re starting with simpler things, so maybe it can generalize better. One of the things I’m not sure comes through fully in the paper is this idea of: they learn simpler things first. And I think both of you have ideas about what the shape of this grand theory could look like, and intuitions about how these things will link up. So this paper isn’t just saying “we hope there will be a thing” — it seems like you each have ideas about what the actual grand theory will concretely look like, and where these lessons come from: analytical tractability, learning simple things first, a bunch of other lessons like that.
One thing I want to highlight for people listening is: really skip to the end and check out the future directions, and reach out if you’re interested. There’s a lot going on. There’s a lot of different signals, a lot of different evidence. How it all fits together is interesting—it’s still emerging, still early.
Kanjun Qiu
I’m also really curious about the “simple equations capturing meaningful macroscopic statistics” section. Can one of you explain what that bubble looks like? You talk about neural scaling laws, etc.
Jamie Simon
Yeah. This is section 2.3 in our paper: simple macroscopic laws. If you look back at how other sciences developed in paradigm-setting eras, there were often a whole bunch of disconnected empirical observations that took the form of nice mathematical relationships, or laws, that were only later explained or linked together.
There are a number of examples like this, including neural scaling laws—an empirical law that is currently driving basically everything happening in Silicon Valley. And there’s something called the edge of stability effect, which I think is really beautiful. It was first found as an empirical law. Basically: as you train a neural network and move over the loss surface, sometimes you’re in a spot where the loss surface is very smooth, and sometimes you’re in a spot where the loss surface is very steep — you’re in a narrow valley. You can measure a quantity called the sharpness; technically, this is the maximum eigenvalue of the Hessian, the second derivative matrix of the loss surface. If you look at how this evolves over time, you find something quite reproducible on large models: as you train, take steps, sharpness grows and grows—but then it levels off at a particular value that appears empirically to be two over the learning rate. This only shows up this cleanly when you’re doing full-batch gradient descent, not stochastic.
Josh Albrecht
That’s probably not a coincidence.
Jamie Simon
Not a coincidence. The scientist most associated with progressive sharpening and edge of stability is Jeremy Cohen, one of the authors on this paper. The observation he makes in that first paper is: two over the learning rate is exactly the sharpness at which classical optimization theory, the traditional branch of work from the last century, predicts instability to start.
If you have a really simple function, a quadratic valley, and the valley is too steep for your learning rate, you’ll bounce out of it. What’s the critical steepness? It’s two over the learning rate.
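The classical calculation here is easy to reproduce on a toy quadratic (our illustration, not code from the paper): run gradient descent on f(x) = s·x²/2, whose second derivative, the sharpness, is the constant s, and watch the behavior flip at s = 2/η.

```python
import numpy as np

def run_gd(sharpness, lr, x0=1.0, steps=50):
    # gradient descent on f(x) = sharpness * x^2 / 2, whose second
    # derivative (the "sharpness") is constant everywhere
    x = x0
    for _ in range(steps):
        x -= lr * sharpness * x   # f'(x) = sharpness * x
    return x

lr = 0.1
crit = 2.0 / lr                   # the critical sharpness, 2 / eta

# each step multiplies x by (1 - lr * sharpness), so the iterates shrink
# when sharpness < 2/lr and oscillate outward when sharpness > 2/lr
print(abs(run_gd(crit - 1.0, lr)))  # below the threshold: converges
print(abs(run_gd(crit + 1.0, lr)))  # above the threshold: bounces out
```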
Kanjun Qiu
Interesting.
Jamie Simon
So it’s not a coincidence. But of course, that assumes some nice structure on the loss surface that allows you to actually solve the dynamics. This is extremely reproducible in models large and small, and has big implications for your choice of learning rate and the effect that choice has on your training run and the later generalization of your model.
Kanjun Qiu
As a practitioner, how would I use this to choose a learning rate?
Jamie Simon
It’s a difficult question to answer because so many other things go into the choice of learning rate.
Kanjun Qiu
Let me ask a different question. As a practitioner, how should I intuitively think about progressive sharpening, or sharpening in general? And also why does it increase, intuitively?
Jamie Simon
So this is actually something my collaborators and I are trying to build a good theory for right now. Our intuition is twofold. One: the weight matrices of a neural network tend to align with each other in a linear algebraic way and also grow in norm over the course of training, and both of these things tend to lead to higher sharpness. That’s a qualitative explanation that needs to be built into a quantitative theory. Another reason is that there’s a unifying geometric explanation: gradient descent moves in the direction of steepest descent by definition, and if there’s a direction where the network can really fall off and speed up very fast, it sort of gets sucked towards that direction.
Kanjun Qiu
Trying to find those steepest directions.
Jamie Simon
Yeah. There’s a sense in which neural networks kind of learn to learn faster, and that probably explains progressive sharpening — but I should add a disclaimer that these are my personal hunches.
Daniel Kunin
One of the open directions in the paper is really trying to understand the connection between sharpness, edge of stability, and generalization. We have an empirically observed and repeatable experiment. We have a theory for why gradient descent dynamics can be stably at this edge of stability. But connecting that directly to performance, generalization, and feature learning is an open direction—a really interesting one to pursue.
Kanjun Qiu
Interesting. This bucket of things — simple equations capturing macroscopic behaviors — they’re often these empirically observed laws. And now this is a place where you can start to develop theory around them precisely because they’re so empirically consistent.
Jamie Simon
Wonderfully, yes. If you look at the history of chemistry, you have all of these gas laws: pressure and temperature appear to be proportionally related holding volume constant, pressure and volume are inversely correlated holding temperature constant. And you can combine these into PV = nRT. Thinking about these kinds of things lets you guess: gases are made of discrete molecules that bounce around in a kinetic fashion, that leads to a statistical mechanics view of the system. You could imagine it would have been much harder to go the other way: rather than top-down empirical laws, starting by saying “let me first write down a fundamental theory of gases at the micro level and make predictions at the macro level. Ah, I posit that there should be this quantity called pressure that scales intrinsically in a certain way, and therefore...” I mean, this is what people are trying to do with string theory right now, and it’s really hard. It’s hard to just guess the answer and make predictions without some kind of empirical grounding.
Daniel Kunin
It’s interesting that we talked about the first bucket and the third bucket because they’re really opposites. The first bucket is all bottom-up: building up from foundational principles and simple settings to try to understand phenomena. The other one is empirically top-down. A deep understanding of these things from the top-down approach would have huge practical implications. Think about neural scaling laws: being able to understand the exponent a priori. What about the data, optimizer, architecture, and so on leads to that exponent and the scaling law? Being able to predict those exponents before you actually find them would be a huge win for developing more powerful systems.
Josh Albrecht
Maybe we can dig into the second one, the insights of limits into fundamental behavior.
Daniel Kunin
Let me let Jamie talk about this, but I’ll say first that I think this is actually maybe the most important of all the directions.
Josh Albrecht
That’s why I wanted to dig into it.
Daniel Kunin
Yeah. And maybe the most physics in spirit.
Jamie Simon
Why is it the most important?
Daniel Kunin
I think it’s where we’ve made the most precise statements about realistic systems, when taking limits and thinking about these objects in the most high-dimensional sense. In that sense it’s the most important: it has flavors of a lot of the other pieces—for example, 2.1, analytically solvable settings, because by taking limits we end up getting analytic tractability. But it’s also related to section three in the sense that we’re talking about realistic systems, not toy systems. This intersection of realism and tractability, achieved by taking limits, has been a major success in deep learning theory over the last decade—less than a decade, honestly.
This probably starts with the neural tangent kernel results of Arthur Jacot, one of the authors on this paper.
Jamie Simon
I’d say it starts a little before this.
Daniel Kunin
Before that, yeah.
Kanjun Qiu
Tell us what “taking limits on a system” means.
Jamie Simon
I mean limits in the ordinary calculus sense of the term. You have some number describing your system — it could be a size parameter, or a learning rate, or some other parameter — and you take it to either infinity or to zero. And taking a limit of a variable removes it from the expression it’s in. In calculus: consider the function 5 + 1/x as x grows to infinity. Clearly the 1/x vanishes. Taking the limit gives you the coarse picture, which is: if x is big enough, you’re left with the 5.
This is one of the foundational tools—maybe the foundational tool—of statistical physics. Going back to the gas analogy: imagine I hand you a little box with 100 particles of gas bouncing around, and I say, alright Kanjun, give me a theory of this. And you’re like, how? And I’m like, I’ll tell you anything you want to know—the quantum mechanics of these particles, their interactions, whatever you want to know about these molecules. And you’d go to the whiteboard and start thinking about how 100 position and velocity variables interact with each other. You can see it’s complicated. You’re tracking a lot of stuff. But then there’s this thing that’s paradoxical and amazing: as I add more and more particles, in some sense the problem becomes easier.
Kanjun Qiu
So you treat it like one body instead of 100 particles.
Jamie Simon
Yeah. In this glass of water, or in my lungs full of air, there are more than 10^20 particles.
Josh Albrecht
That’s basically infinity.
Jamie Simon
So you can imagine that correction: 5 + 1/x — well, if x is 10^20, 5 is going to be a pretty good approximation. So it’s actually really easy to derive the ideal gas law, PV = nRT, once you realize: these are interesting quantities. When I have lots of particles, there’s a sort of relationship between pressure, volume, and temperature that becomes exact in certain limits.
Neural networks also admit descriptions like this. The simplest one is the gradient flow limit—so simple that it’s often not even discussed in the same breath as these other limits, but it is totally foundational. The gradient flow limit says: take my step size to zero. In gradient descent, your parameter update is equal to the gradient times some learning rate parameter, eta (η). Gradient flow says: take η to zero. You might say, well, I’m not going to go anywhere then. And I say: okay, take the number of time steps commensurately larger. Something that was previously discrete, the way the 100 particles or molecules of water were, ends up being a continuous system that you can treat with differential equations; here it’s the sequence of update steps that becomes continuous.
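A minimal sketch of this limit on the simplest possible loss (our toy, not code from the paper): shrink the step size η while scaling the number of steps as T/η, and discrete gradient descent on f(w) = w²/2 approaches the continuous gradient-flow solution w(T) = w₀e^(−T).

```python
import numpy as np

w0, T = 1.0, 2.0                  # initial weight, total "training time"

def gradient_descent(eta):
    # discrete gradient descent on f(w) = w^2 / 2 with step size eta,
    # taking T / eta steps so the total time eta * steps stays fixed at T
    w = w0
    for _ in range(int(T / eta)):
        w -= eta * w              # f'(w) = w
    return w

# exact solution of the gradient-flow ODE dw/dt = -w at time T
flow = w0 * np.exp(-T)

for eta in [0.5, 0.1, 0.01]:
    print(eta, abs(gradient_descent(eta) - flow))  # gap shrinks as eta -> 0
```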
This story has played out in a dizzying number of limits in deep learning theory. Some of them are practically useful. Some are merely insightful. The earlier μ transfer story about hyperparameters was derived in the theory of infinite width. There’s very little you can say about a neural network of width ten, just like there’s very little you can say about a glass of water with 100 molecules in it, because everything’s so complicated and messy and contingent on your initialization and steps. But the systems we’re actually using now are so big—that’s another answer to the “why now” question.
Kanjun Qiu
They’re basically continuous. They’re approaching continuous.
Jamie Simon
Yeah. At least in certain respects. Neural networks are doing so many things that they’re maybe not continuous in every way. But when you have depth 100 and width 10,000, which are pretty typical numbers for a large language model today, it’s not surprising that some of the math you might do about depth and width going to infinity — as long as you’re careful to scale things in the right way to preserve those non-dimensional quantities — might be insightful.
Kanjun Qiu
What are some things we’ve learned from taking these limits so far?
Josh Albrecht
What kinds of useful infinities are there? You said neural tangent kernel — the μ stuff is infinite width. We have infinite depth. Do we have infinite steps?
Kanjun Qiu
How should I think about some of these continuous systems? For example, intuitively: what’s the difference between infinite width and infinite depth?
Jamie Simon
The interesting limits split into two types: those that are realistically useful, and those that are theoretically insightful and may reveal some fundamental learning behaviors, whether or not we’re actually using that limit in practice.
Some examples of limits that are really inspired by practice: infinite data, infinite context length, infinite width, infinite number of attention heads, infinite depth.
The neural tangent kernel limit is one way of taking things to infinite width. It turns out there are two ways of scaling to infinite width. One gives a simpler system that does not do feature learning — hidden representations don’t evolve, but it’s mathematically very beautiful and can be used to answer certain questions in a compact way. Then there’s the more realistic limit, which is μP, or the feature-learning or “rich” limit — it goes by different names—and this actually does preserve feature learning, and it’s way harder to study.
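To make the lazy side of this split concrete, here is a small numpy experiment (our illustration, with a made-up input, target, and learning rate, not the paper's μP formulation): in an NTK-style parametrization f(x) = a·relu(Wx)/√n, a single gradient step moves the hidden representation less and less as the width n grows, which is the "hidden representations don't evolve" behavior.

```python
import numpy as np

rng = np.random.default_rng(0)
d, lr = 10, 1.0
x = rng.standard_normal(d) / np.sqrt(d)   # fixed unit-scale input
y = 1.0                                   # fixed target

def relative_feature_change(n, trials=10):
    # take one gradient step on a width-n ReLU network in the lazy,
    # NTK-style parametrization f(x) = a . relu(W x) / sqrt(n), and
    # measure how much the hidden representation relu(W x) moves,
    # averaged over random draws of the weights
    changes = []
    for _ in range(trials):
        W = rng.standard_normal((n, d))
        a = rng.standard_normal(n)
        h = np.maximum(W @ x, 0.0)
        f = a @ h / np.sqrt(n)
        # gradient of the loss (f - y)^2 / 2 with respect to W
        gW = np.outer((f - y) * a * (h > 0) / np.sqrt(n), x)
        h_new = np.maximum((W - lr * gW) @ x, 0.0)
        changes.append(np.linalg.norm(h_new - h) / np.linalg.norm(h))
    return np.mean(changes)

for n in [100, 10_000, 1_000_000]:
    print(n, relative_feature_change(n))  # shrinks roughly like 1 / sqrt(n)
```

In this scaling the function can still change after training, but the features freeze as n grows; the μP scaling is set up precisely so that feature movement stays order one at infinite width.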
This has now been pretty clearly converged on as the right infinite-width limit to study. It’s a guiding belief in the research community that pretty much whenever you can study something at infinite width, it’s a good idea, because finite width is just adding one-over-width corrections.
Josh Albrecht
It’s like the icing on the cake. You want the cake first, and the icing is: how do we discretize it? Like, if we could do it at 10 trillion layers—okay, maybe we don’t need 10 trillion layers, it’s just annoying to put on a GPU and it’s too big. Can we get away with 100? Can we get away with 1000? Usually, yes.
Jamie Simon
Yeah. It’s worth noting that the way computational physics solves any continuous system — how do you do finite element analysis of a metal beam that’s flexing? You don’t solve a continuous PDE because you can’t even represent a real number on your computer. You discretize it into a mesh and then the mesh flexes and you’re solving linear algebraic equations. How do you solve fluid flow? You discretize volume and track the fluid flow at all the points in the mesh. How do you solve any ODE? The simplest method is Euler’s method—it’s basically gradient descent, just discrete steps.
So in this paper we give a name, the discretization hypothesis, to a belief that has been bouncing around needing one. The idea is: maybe practical deep learning should be thought of this way, and maybe this is why scaling things up only makes things better. Of course you simulate your fluid with a finer mesh; of course you’ll get a more accurate description.
Kanjun Qiu
So deep learning is a discretization solution.
Jamie Simon
Basically. The discretization hypothesis states that essentially any practical deep learning system is a discretization of some ideal continuous system that would have performed better on multiple axes. You have finite data, finite step size, finite width and depth, and all these other things.
Kanjun Qiu
That’s a super interesting way of thinking about what a neural network is.
Josh Albrecht
It makes it so much less magical. Right now, when people think about AI, they’re like, it’s this black box, we don’t understand it. And we also don’t understand exactly how water moves around in a glass. But we know it’s not going to jump out, right? We just know that’s not happening because that’s how it works. No matter how fine we make our simulation, we’re never going to find a way to make it jump out — that’s just the nature of the system.
It’s really interesting to see deep learning in this continuous way. There are probably better ways we could do these continuous flows, better ways to set these networks up, for sure. But scaling up isn’t doing something qualitatively different in this sense — the scale is giving us better and better approximations.
Jamie Simon
Yeah. You could also imagine that new abilities become resolvable once you’ve discretized your mesh of all human text at a finer resolution.
Josh Albrecht
Because when you’re doing fluid dynamics with just two particles, you’re just not going to get a very convincing waterfall.
Jamie Simon
Yeah. You won’t see turbulence with a handful of particles. You won’t see waves.
Josh Albrecht
That’s really interesting.
Kanjun Qiu
In terms of open work in the limits area, how do you guys think about that?
Jamie Simon
We have two open questions about this.
Daniel Kunin
The discretization hypothesis.
Jamie Simon
Yeah. And I was also thinking about hyperparameters—zero hyperparameters. One open direction is: is this discretization story true? I published a paper a couple of years ago called More is Better that shows the discretization hypothesis is true for a class of models called random feature models—fairly general, but these models don’t learn features; it’s just a random feature projection that’s static, with a linear model trained on top. You take learning rate to zero, take width to infinity, take depth to infinity.
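For concreteness, a random feature model is easy to write down. This is a generic sketch of the model class on a made-up sine task, not the experiments from More is Better: the projection is random and frozen, and only the linear readout on top is trained, here in closed form with a small ridge penalty.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical toy task: fit f(x) = sin(3x) from noisy samples
X = np.linspace(-1, 1, 200)[:, None]
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(200)

def random_feature_fit(width, ridge=1e-3):
    # random feature model: a random, *static* ReLU projection; the
    # features never learn, only the linear readout is fit (ridge regression)
    W = rng.standard_normal((1, width))
    b = rng.uniform(-1, 1, width)
    Phi = np.maximum(X @ W + b, 0.0)
    w = np.linalg.solve(Phi.T @ Phi + ridge * np.eye(width), Phi.T @ y)
    return np.mean((Phi @ w - y) ** 2)

for width in [10, 100, 1000]:
    print(width, random_feature_fit(width))  # training fit improves with width
```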
In transformers, there’s more than just a width—there’s also the number of heads. Width and depth are maybe just practical necessities that actually obscure our view of the real, deeper, simpler thing the model is doing. And every time you clear away one of these things, you leave a simpler picture behind. After the limit of infinite width, people generally understand: if you’re studying something with width as a possible confounder, here are the experiments you do to make sure you’re large enough that it’s not a confounder—and that lets you see the rest of the system more clearly because you’ve factored it out.
So there’s another open direction: is there a model of deep learning where you’ve taken every possible limit and you’re left with something that has no hyperparameters and is just some platonic ideal — some alien creature with a minimal number of degrees of freedom?
Josh Albrecht
Like the ideal gas law — you’d still have some quantities about data and other things, but you wouldn’t have to do all this weird other machinery.
Jamie Simon
There’s this idea that really the complexity is in the data, not the model. The model can be kind of dumb — a comparatively small number of lines of Python to describe the architecture of a transformer. But the dataset is really where the gains come from. So as a theorist, this suggests you should make your model as simple as possible, subject to it showing a handful of behaviors you know are important, like feature learning. And then ask: how does this work on arbitrary data? Hitting hyperparameter count zero would let you ask: how does this ideal “ball of clay” accept inference from the data you put it into contact with?
Kanjun Qiu
That’s like a totally different system to study, which is really exciting.
Josh Albrecht
That’s a good description of much of section 2.2 — how we can use these limits. Maybe moving on to later sections: we already touched on 2.4, how hyperparameters can be disentangled. So talking about 2.5 and then the actual applications — what exactly did you mean by universal phenomena appearing across settings and tasks, and what kind of implications does that have?
Daniel Kunin
There’s one experiment I think is pretty interesting and insightful on this idea of universality. Specifically, a paper by one of our authors, Florentin: if you take two diffusion networks — diffusion models, where you give them random noise and an image comes out — two different networks trained on different datasets, and you give them the same random patch of noise, at some point as the models get bigger, you find they produce the exact same image. So as models scale in both data and size, there’s something universally shared among all these different models — they’re converging to similar solutions.
If this weren’t true, it would be very difficult to build a scientific theory, because every different model would require its own theory. The idea that there’s some level of universality between large architectures is very promising for the idea that we could build a scientific theory of deep learning. What we really want to understand is: what are these shared properties? What is universal among them?
Kanjun Qiu
That’s super interesting. What are some other examples of universality and how it shows up?
Daniel Kunin
A good example is the debate from a couple of years ago: are large language models stochastic parrots, or are they learning world models? Are they basically just learning to predict the next token based on correlation patterns within their corpus of data? Or are they actually understanding something about the world — why the next token would be the next token? The general consensus now is that they are actually learning deep understandings of the world. And so that would mean different large models are learning similar world models.
The idea is that maybe there is a universal understanding of the world that all these different models are converging towards, which could explain why in the diffusion network example, two different networks given the same random patch of input generated the same images — because they’ve learned the same world model of realistic images.
Kanjun Qiu
So interesting — that there might be a universal world model that is predictive of data in the world. Despite data being shown in different orders and different types of data, and models having different architectures — in order to predict things, there’s some kind of convergence.
Josh Albrecht
It comes with some pretty big asterisks. This is on a given dataset, presumably with similar data. If you had two totally separate datasets, you couldn’t possibly get the same thing unless they were from roughly the same distribution. Like, if one dataset only has red images and the other only has blue images, you’re not going to get the same thing. There has to be overlap in some way. You have to be general enough. They’re drawing from a similar distribution, or whatever.
Jamie Simon
Yes and no. There’s a paper articulating this idea — they called it the Platonic Representation Hypothesis. The idea is that there’s one true universal world model, and any dataset you might take is a projection of that world model, like the shadows in Plato’s cave. Different datasets will capture different facets of it, but as long as they’re rich enough to capture the full range of things, they’ll learn similar representations. And there are some, you know, debatable but pretty striking experiments that show you get similar representations even between vision and text models, where the datasets have no overlap — just images and their captions.
Kanjun Qiu
How is similarity between representations measured?
Jamie Simon
That is a terrific question. And such a thorny one. You can start to think about it mathematically fairly easily. Take two big models — say, two language models — feed in the same document, the same prefix, into both, propagate forward to some layer, and ask: are these two models thinking the same way at these two layers?
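One concrete metric people actually use for this comparison is linear centered kernel alignment (CKA), from Kornblith et al. (2019). Here is a minimal sketch with made-up "activations"; note that Daniel's warning below applies — a single scalar score can hide a lot in high dimensions.

```python
import numpy as np

def linear_cka(X, Y):
    # linear centered kernel alignment (Kornblith et al. 2019): compare two
    # activation matrices (samples x features) from two different models
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 64))             # "layer activations" of model 1
R, _ = np.linalg.qr(rng.standard_normal((64, 64)))  # random rotation

same = linear_cka(A, A @ R)                    # same features, rotated basis
diff = linear_cka(A, rng.standard_normal((500, 64)))  # unrelated features
print(same, diff)  # ~1 for rotated copies, much smaller for unrelated ones
```

CKA is invariant to rotations of the feature basis, which is why the rotated copy scores near 1 even though no individual neuron matches.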
Daniel Kunin
And the risk is that in high dimensions, things that are actually quite dissimilar might look very similar. The real question — the risk here — is that we might be fooling ourselves that these two things are similar.
Kanjun Qiu
But actually it’s super high-dimensional and dissimilar.
Daniel Kunin
And this is a huge challenge in deepening our understanding of universality across models. It’s also something people have been asking in neuroscience for a long time: how do you compare neural recordings from two different organisms? How do you compare the neural recordings from an organism and an artificial neural network? How to compare high-dimensional objects and ask “how similar are they, really?” is a very difficult question. A lot of progress has been made in that direction, and we expect more.
Jamie Simon
I think the methods question here is actually more important than the answer. A yes or no to this question — I think there’s some version of asking it where the answer will be yes, and some version where the answer will be no. The open direction isn’t “do they?” — it’s “what exactly do you mean by similarity?” In what way? It’s the only one of our ten open directions in this perspective paper that’s really about methods, about the metric.
My sense from kernel theory is that probably any useful metric will need to capture what functions are easily learnable from a simple model on top of these representations — for example, a linear model. Getting everything to work with large models is really a challenge.
This is one of those deep and tantalizing empirical questions that people have been wondering about for a long time. It’s starting to seem like, in some sense, the answer has got to be yes. But we don’t know how to make it precise.
Kanjun Qiu
What does the representation represent exactly? Go ahead.
Josh Albrecht
One of the things that’s actually exciting to me about this theory work more broadly is: as we take these limits and simplify things, we start to have tools — we start to clear things away — and it becomes easier to build up, create larger systems, ask better questions. And this representation question is a perfect example of the kinds of things that could have a really big impact. If you really know what a representation is, if you can really answer what similarity means, it might suggest immediately: well, we should do this in a totally other way, which could be a complete reframe — much simpler and easier. We don’t know if that will happen. It’s possible we’ll do all this theory work and find out we can make things 10% more efficient, and that’s the cap. It’s also possible we learn something where it’s like: no, we were just getting started and doing this totally wrong. And there could be a huge shift in how we think about these things.
So for me, one of the really exciting things about this work is that it potentially has these kinds of new things we can build with our understanding, and new questions we can build on top of that can really shift what we’re able to do and what kinds of questions we can even ask.
Jamie Simon
This leads into something I think about a lot: the idea that science is an edifice that builds on itself brick by brick. We’re not going to solve the puzzle with one brick. What each paper, each contribution, each project should try to do is lay down one humble but very solid thing that can support the weight of the building we’re constructing. And there’s this sense — which metaphor is it — a rising tide lifts boats and allows you to reach higher bricks, because rising tide lifts all bricks and you can reach higher fruit from standing on your boat. Dan makes fun of me for my metaphors.
Daniel Kunin
Jamie is a metaphor creator.
Kanjun Qiu
Do you have any preferred way of thinking about representation similarity?
Daniel Kunin
We think that neural networks take their data and build progressively richer representations of that data — sometimes we call those features, sometimes representations. And through that process, going layer by layer, they’re eventually able to take that final representation and maybe predict the next token. So they’re building up, layer by layer, a belief about what might be the most likely next token.
When I think about comparing representations, I’m actually more interested in: what precisely about the data, what features of the data, are they actually extracting? So the question of comparing models has a lot to do with understanding the data itself. One of our open directions is: how should we model data? What is a good model of data? And I think answering the question of how to compare models will inevitably require an understanding of how to think about the features in data.
Josh Albrecht
All of these answers — if you answer that question, it helps answer this other question. It helps you assemble these bricks. One thing I’m excited about is all the different implications of this work. Do you want to talk about some of the implications? Let’s say we make progress. We make some more of these bricks. How is this helpful to the field more generally, to other fields?
Kanjun Qiu
Or even today — are there practical things you recommend practitioners do? Either version of the question works.
Daniel Kunin
We opened this conversation talking about learning mechanics as a kind of physics underlying deep learning — in the same sense that mechanistic interpretability is a biology of deep learning. Talking about how these understandings can actually influence other fields: I think there’s going to be a symbiotic relationship between the mechanistic interpretability community and the deep learning theory or learning mechanics community — and it goes both ways.
Learning mechanics and the ideas in this paper, and the ideas I expect will come out of the open directions, will have a big impact on mechanistic interpretability. Formal definitions of what our features, representations, and circuits are — these are words used in both communities that can mean very different things. Coming to a consensus and defining these formally, from first principles, from both the bottom-up and top-down approach, will have a huge effect.
And vice versa: the mechanistic interpretability community has always brought data to the forefront of every problem they study, and that’s something the theory community of deep learning hasn’t always done. Sometimes we treat data as just X-Y pairs — input and output — and focus on the optimizer, the architecture, the nonlinearities. We get drawn toward problems where we want to use interesting hammers: stochastic differential equations, neural tangent kernels. But sometimes the data comes last. And data is probably the most important piece of this puzzle. Mechanistic interpretability has found all these interesting insights by looking at the interaction of data and architectures and neural networks. Theorists who take those insights seriously and really think about them as goalposts to understand are going to make big impact in our field. I think there’s going to be an amazing symbiotic relationship there.
Kanjun Qiu
Very generative for problems to work on. For each of you, what are your current obsessions or interests? What open question are you most interested in solving right now?
Jamie Simon
The two open directions I’m most interested in right now are Open Directions 1 and 2 from this paper: simple models of nonlinear feature learning, and theory that’s data-aware.
Kanjun Qiu
If you were to describe that as “I am so puzzled by X” —
Jamie Simon
Yeah. Okay. So here’s the deal.
Josh Albrecht
What is the open question? What are you actually puzzled by?
Jamie Simon
We have these beautiful, insightful, solvable models. There are two main workhorse models. Daniel touched on deep linear networks earlier — they have these stepwise learning dynamics, learning single directions one by one. There’s another class of models called kernel methods, specifically kernel regression, which is connected to deep learning not by killing the activation functions but by taking a certain infinite-width limit, the simplifying lazy one rather than the μP limit, and you get kernel regression out. You can also solve this, and it’s a beautiful theory of learning. It can actually learn functions that are fully nonlinear in the data, except the learning dynamics of the network are linear. So it’s like a linear function learned after a nonlinear projection. This also shows a simplicity bias, and we have a complete theory of generalization that works really well.
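For readers who haven't seen it, kernel regression itself fits in a few lines. This is a generic sketch with an RBF kernel on a made-up task, not the specific kernels from the paper: the prediction is nonlinear in the input, but the learned coefficients solve a linear system, which is what makes the theory tractable.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical 1-d task: noisy-free samples of sin(3x)
X = rng.uniform(-1, 1, (50, 1))
y = np.sin(3 * X[:, 0])

def rbf(A, B, bandwidth=0.3):
    # radial basis function kernel between two sets of points
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth**2))

# "training" is just a linear solve: (K + ridge I) alpha = y
K = rbf(X, X)
alpha = np.linalg.solve(K + 1e-3 * np.eye(50), y)

X_test = np.array([[0.5]])
pred = rbf(X_test, X) @ alpha   # nonlinear in x, linear in alpha
print(pred, np.sin(1.5))        # prediction tracks sin(3 * 0.5)
```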
Kanjun Qiu
It just doesn’t apply that well to anything realistic.
Jamie Simon
Right. We have another solvable case that comes from a simplifying assumption on a deep neural network, but again we’ve thrown away an important form of nonlinearity — in this case, the nonlinearity in the parameters. You know, we want to know what’s at the intersection of these two things. I want to write down a model that is nearly as tractable and insightful as these two, but gets the best of both worlds. I want a model whose dynamics I can solve and study and understand better than even a shallow MLP — a multi-layer perceptron, the simplest neural network. Even that, with its nonlinearity, is too complicated, too annoying, in my view, to do complete end-to-end trajectory studies. So I want something simpler than that, but that can still learn fully nonlinear functions and still has fully nonlinear dynamics.
My team and I have identified a function class of this sort. We think we know how it's related to MLPs. We can solve its dynamics in a large number of cases, and we can use it to predict things about how MLPs learn — predictions that seem to hold up pretty well. And right now I'm pretty obsessed with getting that to work out, because having a good nonlinear model of the dynamics of feature learning would immediately let you ask so many new questions where the complexity is pared down, but you know it's still capturing something important about the system you're studying. That feels like a huge unlock.
Kanjun Qiu
How about you, Daniel? Like, I’m so confused by X, or...
Daniel Kunin
Sure, yeah. This is maybe not something I’m so confused by, but maybe the arc of my research for the next year or so.
Something Jamie and I worked on together about a year or two ago: as we've talked about limits, there's a certain limit I was really interested in — the limit as we take all our parameters to the origin. If we actually set all the weights to zero, the input-output map of that network would be nothing — zero. It would never train, because the gradient with respect to each parameter is multiplied by other parameters, so at the origin every gradient vanishes. The origin is essentially a critical point of the loss landscape.
If we perturbed slightly off of that critical point, eventually with enough time, the model would learn something. What has been shown before in other settings is that the learning dynamics of networks in this vanishing initialization limit start at a saddle point at the origin and progress through a sequence of saddle-to-saddle dynamics — jumping from one saddle point to another, eventually going down toward a global minimum. In some simple models, you see this in the loss curve as plateaus followed by drops, plateaus followed by drops. Each one of these jumps from one saddle point to another is learning some aspect of the task.
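The plateaus-and-drops picture can be reproduced in a few lines. Below is a toy sketch, my own construction rather than the authors' code: a two-layer linear network trained by gradient descent from near-zero initialization on a target with well-separated singular values. The loss hovers near a plateau, drops as one mode is learned, and repeats.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
target = np.diag([4.0, 2.0, 1.0, 0.0, 0.0])   # modes of decreasing strength

scale = 1e-4                                   # start near the saddle at the origin
W1 = scale * rng.standard_normal((d, d))
W2 = scale * rng.standard_normal((d, d))

lr = 0.02
losses = []
for _ in range(6000):
    E = W2 @ W1 - target                       # residual of the product map
    losses.append(0.5 * np.sum(E**2))
    gW2 = E @ W1.T                             # gradients of 0.5 * ||W2 W1 - T||^2
    gW1 = W2.T @ E
    W2 -= lr * gW2
    W1 -= lr * gW1

# The loss curve steps down from roughly 0.5*(4^2 + 2^2 + 1^2) as each
# singular direction is learned in order of strength, one saddle at a time.
```

Each drop in `losses` corresponds to escaping one saddle point, i.e., learning one direction of the target, strongest first.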
Jamie and I worked on a paper where we tried to unify a bunch of existing saddle-to-saddle dynamics under one picture. We were training these models through gradient descent, but maybe there's actually an alternative, discrete optimization algorithm that describes the same dynamics, where each discrete step learns a feature. If we understand that algorithm, we also understand something about the features the network is going to pick up. We also looked at work from the mechanistic interpretability community around modular arithmetic and how neural networks learn Fourier features. Using our framework, we could prove why those Fourier features emerge in that setting.
That put me on this trajectory for the next year: looking at all these different algebraic tasks. Modular addition can be thought of as a task of group composition under the cyclic group. I got interested in these tasks where we know a lot about the underlying structure and know what the right features are — and trying to understand how, through training, a neural network picks up or acquires those features.
The hope here is that by really deeply understanding that process, even in tasks where you already know the solution and the right features, you’ll understand that acquisition process — and that process will transfer to settings where we don’t know what features a network is learning.
For example, take a next token prediction task. I might think about generating data from a hidden Markov model or some sequence generating process where I know the underlying probability distributions and what would be optimal — what an optimal learner would do if tasked with next token prediction. I would know the right features to do that task.
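As a concrete version of that setup, here is a small illustrative sketch (the transition and emission numbers are made up): a two-state hidden Markov model where the Bayes-optimal next-token predictor is computable exactly via the forward algorithm. We therefore know which features an ideal learner must track, namely the belief state over hidden states.

```python
import numpy as np

T = np.array([[0.9, 0.1],      # hidden-state transition matrix, T[i, j] = P(j | i)
              [0.2, 0.8]])
E = np.array([[0.8, 0.2],      # emission probabilities, E[i, t] = P(token t | state i)
              [0.3, 0.7]])

def optimal_next_token_probs(tokens):
    """Forward algorithm: P(next token | history) under the true HMM."""
    belief = np.array([0.5, 0.5])          # uniform prior over hidden states
    for tok in tokens:
        belief = belief * E[:, tok]        # condition on the observed token
        belief = belief / belief.sum()
        belief = belief @ T                # advance the hidden state one step
    return belief @ E                      # predictive distribution over tokens

probs = optimal_next_token_probs([0, 0, 1])
```

A network trained on sequences sampled from this HMM can then be compared against `probs`: the optimal solution, and the belief-state feature behind it, are known in closed form.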
Kanjun Qiu
So we should be able to, in theory, predict what features it’s acquiring.
Daniel Kunin
Yeah. Roughly. It might not be a prediction made a priori. It might just be an alternative algorithm that, if you run it on the data, would give you the same solution at the end — but this alternative algorithm would be more interpretable.
Kanjun Qiu
Interesting.
Daniel Kunin
So I think this idea of using synthetic datasets that have a lot of structure that still require nonlinear learning, and studying how neural networks learn the structure in those synthetic datasets, is a really promising direction.
Josh Albrecht
That’s a really good example. To zoom out a little bit and ask about the broader implications: you can imagine bootstrapping from modular arithmetic to, okay, what about long-form arithmetic where you carry the one — which is like a language, something you could see if someone’s explaining arithmetic — and you can imagine making it richer and richer to the point where it looks like reasonable text. You have a full model of what features the neural network is learning, how it’s actually making these decisions, how that changes at scale. We’re clearly not there today, but this is the kind of thing that seems really exciting, because once we do have that level of understanding, we might be able to talk about this stuff sensibly from a policy perspective, an engineering perspective — something that connects to real life. But until we have the tools and fundamentals for this, it’s hard to say anything at that level.
Jamie Simon
I’m never quite sure how much to make this a central motivation for what I do. I feel quite confident that some form of AI regulation and policy will be necessary. And we haven’t really tried many ways of regulating AI yet, I would say.
Things are changing very fast. But one possible future, one possible way things play out, is that having a language to talk about these things in terms other than their raw statistics — number of tokens trained on, FLOPs for forward pass — gives us tools to better describe, regulate, characterize, and have a sober public conversation about these systems.
Daniel Kunin
Being able to attribute data to the learned model — to understand the influence of what may be one set of data points or a certain corpus of texts on the final trained model, or on the learning process — would be a pretty important tool in the conversation around regulation, copyright infringement, and other things like that. Really understanding the influence of data is critical not only for deep scientific understanding, but also for the practical and regulatory frameworks that will eventually be needed.
Josh Albrecht
There’s also a safety angle. We’ve used the bridge engineering example before — I think it’s hard to make many grounded claims about safety without really understanding the system you’re talking about. If you just have a black-box understanding of bridges, it’s like, “well, it didn’t fall down yet.” We can do so much better once we really understand bridges, engines, planes, nuclear reactors.
Jamie Simon
Right. We say in the paper that there are three types of reasons to want a theory of learning mechanics. One is fundamental scientific reasons: understand intelligence, understand deep learning — there are big scientific mysteries here. One is practical reasons: there are many things we’d like to do with deep learning that we could do much better with an explanatory theory. The last is safety reasons — the setting where we would most like to have guarantees and understanding. It’s not super critical to have very reliable chatbots most of the time. But certainly, if we’re in a scenario where very powerful AI systems are making high-stakes decisions in real time — for example, running your life — we need to be able to appeal to some kind of grounded way of thinking about these things. This is also a response to the common objection: “Oh, well, these things will understand themselves before you ever understand them, so why are you trying to do this?” That’s a concern for human intellectual endeavors across the board right now — far from unique to us. But I think there’s a unique answer here: setting aside the fact that we already understand some things, and that understanding is already proving useful, isolated pockets of understanding are valuable even without the whole theory.
Kanjun Qiu
And it would be nice to use these systems to understand these dynamics in order to design better versions of these systems. We don’t only want to evolve systems that we rely on.
Jamie Simon
Yeah. Even setting aside those answers to the objection “you won’t get there before they do” — the safety and oversight case is really the one setting where, unless you trust the AIs to police themselves, you don’t want to totally hand over control. Having some kind of fundamental theory gives us a foot in the door.
Josh Albrecht
It seems hard to bootstrap into a place where you have control, oversight, safety, and understanding without this, basically.
Kanjun Qiu
On that note, for everyone listening — for those who are grad students — is there anything you’d recommend they do if they want to participate in the field, work on one of these open questions, or get involved in the community?
Josh Albrecht
Reach out to you? Cool blog post? Send you some money? What should they do?
Daniel Kunin
You could see this paper from two different angles. One angle: we basically have three claims in this paper — there will be a scientific theory; there are pieces of this theory starting to emerge; and this theory will take the form of the mechanics of the learning process. So you could think of this paper as essentially our justification of those claims. Or you could also look at it from a pedagogical point of view. Reading this paper and looking at the citations — the papers we cite and the stories we highlight — is basically 14 authors in this space describing what we would call a great intro course to understanding deep learning theory.
Kanjun Qiu
This is like the textbook for understanding deep learning theory right now, except we haven’t written a textbook yet because the field is so new. We don’t have the answers.
Josh Albrecht
Yeah, it’s the syllabus to an intro course.
Daniel Kunin
So I would say: read these different things and think, as if you’re a young researcher, which of these different approaches or methods of handling complexity appeal to you and your thought process. Then go deeper into that, reach out to the authors aligned with those approaches, and think about what open questions and open directions we posited.
Jamie Simon
Yeah. The last section of the paper is called “How to Get Involved in the Development of Learning Mechanics.” There we give a list of tenets, advice for newcomers to the field. Things like: try a few problems before going deep into one. Value scientific insight and understanding over the difficulty of the theorem you proved. Do lots of experiments, because they’re cheap and easy. I’d recommend that a young grad student look at both that section and the open directions section, and then read the rest with an eye for disentangling the stories we’re telling from the broader mass of literature, and see what reaches out to them and compels them. We’re also putting together a website — at time of recording, the URL is learningmechanics.pub.
It’s meant to be linked to in this perspective paper, and we’re hoping it will serve as a place for high-quality perspectives from experts, pedagogical materials, open questions, and maybe even a forum for discussion.
Kanjun Qiu
Thank you both so much. This is really fun, and I hope that many more people who are wonderfully curious and qualified will join the field and will actually develop a theory of deep learning.