Hey guys, I'm thrilled to be joined today by Nick Joseph, the head of pre-training at Anthropic.
To give viewers a high-level sense of what we'll be covering, we're going to start with the basics of what pre-training is and then dig into how Nick thinks about strategy, data, alignment, and infrastructure at Anthropic.
And by the end you'll hopefully have a sense for how progress in AI comes directly from advances in pre-training.
I would love to talk a little bit about your backstory and kind of how you got to this point.
Where did you work before Anthropic and what were your takeaways from those places?
Yeah.
So, let's see.
I was at Vicarious and then at OpenAI before Anthropic.
So Vicarious was originally an AGI lab, and when I joined they were making a shift to product, particularly working on robotics products.
The thing I worked on was training computer vision models for their robotics products.
It was my first job, so I think I just learned a ton about how to do machine learning models, how to write machine learning infrastructure.
And at the time, were you also thinking about a career as an academic?
At the time, a lot of people doing AI work were in PhDs.
That's what I was thinking about before I started to do a company.
How were you thinking about that in your headspace?
Yeah.
So actually, rewinding a little bit.
I think a lot of my thinking on this had come from an internship I did at GiveWell, which is a nonprofit that evaluates charities.
And some people there were like, at some point, we might have AGI.
It could be dangerous.
We should worry about these risks.
This could be a big impact on humanity.
And I was not super convinced at the time and went down the economics route and was gonna try to work on directly helping people in poverty.
That didn't work out for various reasons and ended up being like, okay, I'll at least work on AI.
Either the safety thing will turn out to be important, and I'll work on that, or it won't be, and I'll just make cool things with AI that can probably help people in poverty more.
I wasn't really coming at it from an academic standpoint.
In fact, when I switched to that, part of the appeal was that I could immediately go do stuff in AI, whereas if I wanted to work in economic policy, I'd have to wait, I don't know, six years to do a PhD and then start.
Totally.
It's a longer path.
And what did the state of AI safety work at that time even look like?
Who were the people who were thinking about that kind of stuff?
I mean, there were some folks at Vicarious thinking about this kind of thing, but it was fundamentally a robotics company.
And so how were you thinking about that at the time?
Yeah, so my sense was at the time, a lot of the AI safety discussion was kind of theoretical.
The models weren't actually that good.
They weren't really posing these dangers.
So it was a lot more philosophical.
It was like, oh, at some point, we might get AI that's really smarter than humans.
And how should we weigh this future concern?
How should we compare that to near-term things?
And I think that was actually just a less compelling argument.
I think it was an interesting one and sort of made you think a bit.
So next you went to OpenAI.
What was OpenAI like at this time?
Yeah.
So I was at OpenAI.
I was on one of the safety teams, and I ended up working on code models, actually.
Cool, nice.
When I got there, the first thing I saw was they'd fine-tuned GPT-3 to write some code, and it was really good.
I was like, okay, if you're worried about AI getting really powerful, an AI writing its own code seems like it could self-improve, and how likely is that to happen?
Yeah, totally.
So, I was doing a bunch of evaluations and studies of what contributed.
Then, after eight months, basically everyone I worked with, all the safety leads, left and invited me to go to Anthropic.
And the reason I'd joined OpenAI was that I cared about AI safety and wanted to work with them.
So then I went with them to join Anthropic pretty much right when it started.
With that, why don't we transition a bit.
These days you run the pre-training team specifically at Anthropic.
Obviously, you've been working on pre-training at Anthropic for quite a bit of time and I'm sure it's evolved over the years what that even entails and looks like.
Why don't we start by just talking a little bit about what pre-training is?
How does it even fit into the way of thinking about how AI models are developed at a place like Anthropic?
And what exactly do you guys do?
We know that one of the ingredients to making AI models better is scale.
You want to put a lot of compute in.
And if you step back and ask, okay, what's the way we could put the most compute into a model possible, we need some objective that there's just tons of data for.
And one idea here is the internet.
The internet is massive.
It's probably the biggest single source of data that's been created.
And you don't have labels.
You don't want someone to have to go in and read the entire internet and say something about it.
So you want to get labels out of the data itself.
And the idea here is we can take some text and predict the next word.
You take "the" as the first word and predict the second word; then you have "the cat" and predict the word after that.
This means you get very dense signal.
Every word is like a new example and there's a huge amount of data.
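As an illustrative aside, here is a minimal sketch of how plain text supplies its own labels for next-word prediction. It assumes whitespace-separated words stand in for the tokens a real system would use, and the function name `next_word_examples` is hypothetical, not something from the conversation.

```python
def next_word_examples(text):
    """Turn raw text into (context, next_word) training pairs.

    Assumption: words are split on whitespace; real models use learned
    tokenizers, so this is only a toy stand-in.
    """
    words = text.split()
    # Every position after the first yields one labeled example for free.
    return [(words[:i], words[i]) for i in range(1, len(words))]


examples = next_word_examples("the cat sat on the mat")
for context, target in examples:
    print(" ".join(context), "->", target)
```

A six-word sentence gives five training examples with no human labeling, which is the "dense signal" point in miniature.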
One of the findings from, like, GPT-1 and GPT-2 was that as you throw more compute at this, more data, bigger models, you get smarter models, essentially.
Totally.
That's been the central thesis of pre-training for the whole time.
There's this idea of scaling laws, which is that you can actually quantify as you put in more compute, more data, more parameters, you get a lower loss, a better prediction of the next word in a very predictable way.
I think you could somewhat foresee this from that original paper, and I think Dario did foresee this; I think many people did.
But what wasn't obvious was that, once you have that, there's this positive feedback loop where you can train a model.
You can use it to make something useful and sell that and get more money, use that to buy more compute and then use that to train a better model.
We've run that cycle over and over again over the past five years or so.
Well, in thinking about that objective to begin, I think the way I think about the state of pre-training is yeah, it seems like this next word prediction, at least from the external standpoint, seems to be the dominant way pre-training happens.
But if I rewind the clock to that era of 2017 to 2020 or 2021, or '22 even, there were all sorts of pre-training objectives people were considering.
There were these BERT and BART models that were doing masked language modeling.
It seems like this GPT series of models doing autoregressive modeling, as you described, this next word prediction, is the dominant one that won out.
Do you have any reflections on that time period?
Were you guys trying all of them and this one worked?
Or is there some first principles reason why this is the right one that should have worked?
I think the answer is it's mostly empirical.
In terms of how to think of these things, I'd be like, yeah, it's empirical.
Just try them all, see what works.
One big advantage for this autoregressive setup is that you can just sample from it to generate text afterwards in a fairly straightforward way.
That enables a product use very nicely.
One thing that you want from a setup is a loss where, as you drive down the loss, that actually is the thing you care about.
You can think of it as: if you got to perfect on language modeling, you could now write text as a human.
You can imagine you put in the title of a paper and it should spit out a novel paper, whereas I think some of the other approaches don't quite have that flavor.
Yeah, totally.
Yeah, it makes sense that, in terms of that loop you're describing of, you know, then release something that gets you revenue and you can use that to buy more compute and iterate.
This sort of gives you the most natural way to actually do that flow, because you can keep releasing new products and keep getting the revenue from that to invest in more compute and so on.
Yeah, it certainly gives you the most open-ended thing.
You can imagine you train something as a class.
You train some base thing, you fine-tune it for a bunch of particular tasks.
One approach people would use is they would do this big pre-training and then they wouldn't just open-endedly sample from it.
You'd fine-tune it on 100 specific tasks.
And that could work too. The one sort of general intuition I have is that compute is the thing that matters.
So I think if you throw enough compute at any of these objectives, you're going to get something that's probably pretty good and can be fine-tuned to other things.
And it's surprising how little these details matter compared to throwing more compute at the problem.
When you think about actually throwing more compute at the problem, there's a whole bunch of axes by which you could throw compute at it too.
If you have a specific model architecture you're training over, you can basically throw more data at that specific architecture.
For a particular one, you could add more layers or make the models larger.
Or you could do some kind of neural architecture search over lots of different variants.
And I assume that these days, it's somewhat more figured out which architecture you go for.
I assume the earlier days, it was somewhat less so.
And I'm curious if you could speak to how you guys thought about that.
What did your infrastructure even look like to do that type of determination?
I think the short answer is it's hard.
What you're really doing is you're going to train this one big, expensive model, and you have a space of what you can call hyperparameters.
How many layers do you have?
What's your width? There are hundreds of hyperparameters, and you want them all to be optimal.
You're striking this balance actually between how much do they matter?
Can you just take your best guess and throw more compute at it in whatever way you want versus how much you want to get it precisely correct?
Interesting.
I think one of the interesting things is it actually doesn't matter that much.
I think this was in one of the early scaling laws papers.
You can change these things and get little wins, but as you throw more compute it reliably gets better.
If you mess up enough, you will stop seeing that happen, and you won't have any way to know, which is kind of the hardest part in some ways.
You don't know the counterfactual, basically because you didn't run it for long enough to actually know what it is.
Yeah, we have these scaling laws.
So you can sort of say, as you train them with more and more compute, you expect the loss to go down as a power law.
It's really a power law plus a constant.
So what eventually will happen is you'll curve off that power law.
And then you know something is wrong.
And is it fundamental?
Is it like you've hit the limits of scaling?
Or is it, nope, you should have tweaked your learning rate slightly differently?
That's one of the challenges.
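As an illustrative aside, the "power law plus a constant" shape can be sketched as L(C) = a * C^(-b) + c, where C is compute. The coefficients below are made up purely for illustration, and both function names are hypothetical; nothing here reflects real fitted values.

```python
def predicted_loss(compute, a, b, c):
    """Power law plus constant: L(C) = a * C**(-b) + c.

    a, b, c are fitted coefficients in practice; here they are
    illustrative assumptions only.
    """
    return a * compute ** (-b) + c


def relative_gap(observed_loss, compute, a, b, c):
    """How far an observed loss sits above (+) or below (-) the fitted curve."""
    expected = predicted_loss(compute, a, b, c)
    return (observed_loss - expected) / expected


# Made-up coefficients, not taken from any paper or real run.
A, B, C_FLOOR = 10.0, 0.1, 1.5

for compute in (1e6, 1e8, 1e10):
    print(compute, predicted_loss(compute, A, B, C_FLOOR))

# A run that lands well above the curve is the "curving off" signal,
# but the gap alone cannot say whether the cause is a scaling limit
# or something like a mistuned learning rate.
print(relative_gap(2.8, 1e10, A, B, C_FLOOR))
```

The design point is that the curve gives an expectation to compare against; diagnosing why a run departs from it still takes separate experiments.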
In terms of how to figure it out, the usual paradigm is to test things out at small scale before running them at large scale.
Small scale in terms of data or in terms of something else?
In terms of everything, you want to scale things down proportionally.
You want to have some theory for how you're going to scale up.
If I get 10 times as many flops, how much of it goes into layers?
How much of it goes into data?
How much of it goes into attention?
You get that theory and then test that it's optimal a bunch, with everything scaled down proportionally.
And just so I can think about what this actually looks like in those early days of Anthropic.
You're a team of 10 or something like that in those very early days, or 12 maybe.
What actually is your ability to use large scale infrastructure as a relatively nimble startup at that time?
I mean, a startup that was well capitalized, but still without that many people working at it.
What kind of infrastructure did you have access to to train these early models at the time?
So that's actually one of the wild things.
I mean, you don't know what anyone else is doing, of course, but it kind of felt like we were at the frontier of it and there just weren't that many people who cared.
I was coming at it from, like, we're making AGI.
This is the most important technology ever.
And then I'd kind of look around and be like, it seems like I'm one of 30 people who are working on this in the world.
I mean, I was kind of like a junior person.
Everyone else sort of knew how to do this and had done it before.
But I was kind of surprised at how easy it was.
Like, the public estimates for GPT-3, I remember, were that it cost $5 million to train.
On the one hand, five million is kind of a lot, but it's a lot for an individual person.
It's not really a lot from a company perspective.
So we could totally buy enough compute to train models like that.
Were you using a cloud provider or did you have a custom setup somewhere?
Or did you literally have racks in a room somewhere that you bought a bunch of Nvidia GPUs and you were doing it?
We were using a cloud provider, but I think it's not actually that different, because one of the things that was surprising to me is you actually have to understand the literal layout.
I remember at one point one of my coworkers running a clustering algorithm to identify what rooms all the chips were in, since we had a hypothesis that they were in different rooms, or different buildings, and that was causing, like...
Some sort of network latency.
Some sort of network latency and you can kind of figure it out.
You can reverse engineer like okay yeah, there's clearly two clusters here that are connected better and there's some issue on the connection between them.
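As an illustrative aside, one simple way to recover "two clusters that are connected better" from measurements is to group nodes whose pairwise latency falls under a threshold. The conversation does not say which clustering algorithm was used, so the threshold-based connected-components approach, the function name, and the latency numbers below are all assumptions.

```python
def cluster_by_latency(latency, threshold):
    """Group nodes into connected components where an edge exists
    whenever the measured pairwise latency is below `threshold`.

    `latency` is a symmetric square matrix (list of lists).
    """
    n = len(latency)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if latency[i][j] < threshold:
                parent[find(i)] = find(j)

    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


# Hypothetical round-trip latencies in microseconds between 5 hosts.
latency_us = [
    [0, 5, 6, 40, 41],
    [5, 0, 4, 42, 43],
    [6, 4, 0, 41, 40],
    [40, 42, 41, 0, 5],
    [41, 43, 40, 5, 0],
]
clusters = cluster_by_latency(latency_us, threshold=20)
print(clusters)
```

With real data the threshold would have to be chosen by looking at the latency distribution; a fixed cutoff is just the simplest thing that shows the idea.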
We're trying to push the limits of the hardware as much as possible, particularly at the beginning when we were kind of like we have way less funding than everyone else.
And most people weren't very efficient with the compute.
So we were like, ah, we can get a big lead by being really efficient at how we use the compute.
Could you talk a little bit about some of the things you guys did in those early days for how to get the most out of the hardware?
I think that's really interesting.
I think back to the early days of Google, for example, where there's these cases where they basically bought relatively cheap consumer chips and then they optimized the software to make it so you can actually get the most bang for your buck out of them.
And that's how they had all this low latency, high availability stuff.
I'm kind of curious if there's some analog in the early AI era to that?
I think for us it was largely about getting the distributed framework right.
In order to train, you have to train on a large number of chips, and there's a bunch of different approaches to how to do this.
There's data parallelism, there's pipelining, there's sharding, and getting all of the-
And at the time there were no great open source packages you could just grab and use that just worked for this.
I mean, today, there's somewhat more of these.
But at the time, I assume there was literally none.
There were some.
I actually remember that we were doing data parallelism early on and it was like, and now we write the all-reduce.
I was like, we really do this ourselves?
We don't call a package?
And this was kind of like, well, we're going to want to modify it.
We don't want to outsource this to some package because, A, we're about to go to a bigger scale.
PyTorch, for instance, had a package for doing this.
But we were going to go to a bigger scale than Facebook had been to.
Right.
You don't want to have a dependency on a package that you're going to have to be constantly modifying, essentially.
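As an illustrative aside, here is what an all-reduce does in data parallelism: every worker ends up with the sum of all workers' gradients. This is a single-process simulation of the well-known ring all-reduce pattern, offered as a sketch only. It is not the implementation discussed in the conversation, and the function name and data layout are assumptions.

```python
def ring_all_reduce(worker_chunks):
    """Simulate ring all-reduce (sum) in one process.

    worker_chunks[r][c] is chunk c (a list of floats) held by worker r.
    There must be exactly as many chunks per worker as workers.
    Returns the per-worker chunks after the reduction.
    """
    n = len(worker_chunks)
    data = [[list(chunk) for chunk in chunks] for chunks in worker_chunks]

    # Phase 1, reduce-scatter: after n-1 steps, worker r holds the
    # fully summed chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot payloads first so all sends in a step are simultaneous.
        sends = [(r, (r - step) % n, list(data[r][(r - step) % n]))
                 for r in range(n)]
        for r, c, payload in sends:
            dst = (r + 1) % n
            data[dst][c] = [x + y for x, y in zip(data[dst][c], payload)]

    # Phase 2, all-gather: pass the finished chunks around the ring.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, list(data[r][(r + 1 - step) % n]))
                 for r in range(n)]
        for r, c, payload in sends:
            data[(r + 1) % n][c] = payload

    return data


# Three workers, each with a gradient split into three chunks.
grads = [
    [[1.0], [2.0], [3.0]],
    [[10.0], [20.0], [30.0]],
    [[100.0], [200.0], [300.0]],
]
reduced = ring_all_reduce(grads)
print(reduced[0])
```

In data-parallel training each worker would then divide by the number of workers to get the averaged gradient before its optimizer step.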
It's just a counterintuitive sentence there, too.
We're going to a bigger scale than Facebook.
Well, because at the time, Facebook AI research was considered one of the best places to do machine learning research.
FAIR and DeepMind were hiring lots of people out of top PhD programs and doing lots of things.
What was your headspace when you were like, OK, there's this very established lab with great people and whatnot, and we are operating on a scale that is not relevant to them? Was that natural and obvious to you, or were there times where you doubted the decisions you were making in that situation?
I think it was surprising.
Maybe I'm just too arrogant or something.
I looked around and was like, what are these people doing?
They're all missing the big picture here.
I think the scaling laws were pretty clear.
The arguments against I just thought were nonsensical.
I think the original scaling laws paper had like 11 orders of magnitude, and there was this intense debate on whether it would continue for another one, right? And I was like, there's already 11.
It seems like 1 over 11 is maybe your chance it fails here.
And sometimes it doesn't work, and sometimes it just works straightforwardly, and you're like, oh yeah, of course.
But yeah, I do think that it maybe felt obvious when you're in that headspace and you're working on this all the time and you're making those plots.
And I think these things feel pretty different when you're on the outside.
There's a huge space of papers.
Everyone tries to make their paper sound very robust and important.
I can see being like, oh yeah, this is not really a thing.
But also different labs have different cultures.
So I think one of the things at FAIR was it was a much more PhD-style, independent research culture.
People have their own ideas, pursue those.
You're fighting for your compute and so on.
Yeah, and to do a project like training a large language model requires a lot of people to collaborate on a really complicated piece of infrastructure.
That isn't going to be a paper, right?
You're not going to publish, like, I got 5% more efficiency than the next one, and it's not respected in those cultures necessarily, so that might have been part of it.
Okay.
So then when you actually implement these models, you're saying you're programming at a fairly low level. You're using libraries like PyTorch, but perhaps not everything right out of the box, because there are things you guys want to customize one level of abstraction below. But you're not necessarily at the level of writing custom CUDA kernels, or was that also in the space where you guys were thinking about things?
So I think I was mostly operating at the level of torch.matmul.
Yes, where does a matmul go?
But not thinking how do you make the matmul efficient?
I assume Torch figured out how to make a matmul as efficient as possible.
But there are some pieces like attention where there was just a lot of different variants, and attention is really complicated and hard to make efficient on a GPU and those things you have to go more levels down the stack.
I think there was a process that is maybe interesting, that I never really thought of before, of how to do it, which is modeling out the thing you're going to do, coming up with a strategy for how to parallelize it that can get to a really good efficiency.
So you're thinking about MFU, basically, like your utilization on your GPU.
There's a goal utilization you're trying to get to and a strategy to get there, you're saying.
Yeah, and I think one of the things you can do is you can actually pencil and paper math out what efficiency you're going to be able to get to.
You know all the constraints.
MFU is model FLOPs utilization.
But the reason you don't get good MFU is you end up limited on HBM bandwidth.
You end up limited on, I don't know, host to CPU offload.
There's a bunch of different pieces, but there's not that many pieces.
There's like six relevant numbers there.
So you can totally model it out, understand what the constraints are and then implement something that can get there.
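As an illustrative aside, the pencil-and-paper version of this can be sketched with a roofline-style estimate: an operation cannot finish faster than its compute time or its memory-traffic time, whichever is larger. Only two of the constraints mentioned (FLOPs and HBM bandwidth) are modeled here, the hardware numbers are hypothetical, and the function names are assumptions.

```python
def step_time_lower_bound(flops, hbm_bytes, peak_flops_per_s, hbm_bytes_per_s):
    """Best-case time for an op: limited by compute or by HBM traffic."""
    compute_s = flops / peak_flops_per_s
    memory_s = hbm_bytes / hbm_bytes_per_s
    return max(compute_s, memory_s)


def mfu_upper_bound(flops, hbm_bytes, peak_flops_per_s, hbm_bytes_per_s):
    """Highest achievable FLOPs utilization under this two-constraint model."""
    best_time = step_time_lower_bound(
        flops, hbm_bytes, peak_flops_per_s, hbm_bytes_per_s
    )
    return flops / (best_time * peak_flops_per_s)


# Hypothetical accelerator: 100 TFLOP/s peak, 1 TB/s HBM bandwidth.
PEAK_FLOPS = 100e12
HBM_BW = 1e12

# Hypothetical op: 2e12 FLOPs while moving 4e10 bytes through HBM.
print(step_time_lower_bound(2e12, 4e10, PEAK_FLOPS, HBM_BW))
print(mfu_upper_bound(2e12, 4e10, PEAK_FLOPS, HBM_BW))
```

In this made-up case the op is bandwidth-bound, so no amount of kernel tuning gets past 50% utilization without reducing memory traffic; a fuller model would add the other limits (interconnect, host transfers, and so on).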
It, of course, will be really inefficient when you implement it, and then the next step is like pulling out a profiler.
So you want to be able to profile the job.
Look at how long every operation takes, have a model in your mind of how long every operation should take, and then make those two things the same.
And were there good out-of-the-box profilers you could use at that time?
Or, because people weren't operating on the kind of network topologies you guys may have been using, did you have to write your own profilers, basically, to do this type of multi-node optimization?
Yeah, it depends when.
I mean, they're actually getting better with time.
The PyTorch profiler was pretty good actually throughout for a single GPU.
If you want to profile one GPU, the PyTorch profiler would work.
But if you wanted to profile a job on hundreds, thousands of GPUs, that hadn't really been done much.
And then that was more of us hacking into the profiler to figure out how to combine all the traces together.
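As an illustrative aside, the core of "combining all the traces together" is merging per-device event streams onto one timeline. The sketch below assumes a hypothetical trace format of (start_time_seconds, op_name) per rank and assumes the clocks are already aligned across machines, which is often the hard part in practice. It does not reflect the PyTorch profiler's actual trace format or the approach used at Anthropic.

```python
import heapq


def merge_traces(per_rank_events):
    """Merge per-rank event lists into one global timeline.

    per_rank_events: dict mapping rank -> list of (start_time_s, op_name),
    where each list is already sorted by start time.
    Returns a list of (start_time_s, rank, op_name) sorted by time.
    """
    streams = [
        [(start, rank, name) for start, name in events]
        for rank, events in per_rank_events.items()
    ]
    # heapq.merge lazily merges already-sorted inputs.
    return list(heapq.merge(*streams))


# Hypothetical traces from two ranks.
traces = {
    0: [(0.0, "matmul"), (2.0, "all_reduce")],
    1: [(0.5, "matmul"), (1.5, "all_reduce")],
}
timeline = merge_traces(traces)
for start, rank, name in timeline:
    print(f"{start:4.1f}s rank {rank}: {name}")
```

A merged view like this makes it visible when one rank enters a collective late and everyone else waits on it.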
And then one more question on something from earlier: you had mentioned you hadn't really done a lot of this work before, maybe some time at OpenAI and those early days at Anthropic.
How did you actually go learn all this stuff?
What was your process for learning about those six things that were relevant to bandwidth limitations and whatnot?
So when I joined Anthropic, one really nice thing was there just wasn't that much.
I think on my first day I read through all of Slack and the entire internal database and learned a bunch from that.
It was nice to just be like, everything is relevant to me.
Yeah, totally.
And then I mostly learned from pair programming.
Tom Brown had done all this before, so he knew all the stuff quite well.
Sam McCandlish, my manager, had also done a lot of it before.
And I just paired with them a huge amount at the beginning.
I think one of the things I really like about pairing as a way of learning is you learn the thing you're trying to do.
You will learn that.
If you're pairing with someone better than you, they can just do it, so you're mostly just watching them.
But you also learn how people do it.
So something like how to use a profiler is not something you would ever learn from seeing someone's final write up on Slack for their PR.
You would just be like, they changed this specific line and it's a win.
You need to watch a YouTube video for four hours of someone messing around with a profiler.
To maybe self-teach it or something, or to actually pair with someone is basically the best you can do.
I think there was one thing that I think is embarrassing now that I look back is I'd never actually used a debugger before joining Anthropic.
People talk about, like, pdb, and I was like, yeah, that's a thing people use, but print seems fine for me.
Yeah, sure, sure.
Then I watched them and was like, no, a debugger is a super useful tool.
This person is way faster at debugging things, particularly if it takes a long time to start up the code which it can.
Learning that sort of thing, I think, comes best from pairing.
Then there's, of course, the obvious.
You just learn by doing.
I eventually did sit with a profiler and stare at it for many, many hours.
Totally.
Exactly.
Yeah.
OK, so then that was sort of the very early era.
Over time, obviously, pre-training has become bigger and bigger.
As you're describing scaling, I imagine you're using many x more GPUs, much more compute over time.
I'd be really curious to hear first, at a high level, what do you feel has changed about the pre-training strategy that you could talk about.
Obviously there's more compute, but what does that actually mean, to have more compute, in terms of what you think about differently from those early days versus now?
I'll start with the things that haven't changed, because I think it is shocking how little it has changed in some ways.
I think I'm still pushing down the exact same metric that I was on day one.
There's some loss function.
Loss go down.
I think you could probably run the first model I trained on the same metric and just make a plot of the team's progress over time.
That's all the same.
It's like one OKR, like one thing that matters, basically.
Yeah, totally.
And, I mean, talking about OKRs, at a certain size of company you're like, oh, should we do OKRs?
And it's always felt a little bit funny for a team like ours, where I'm like, sure, I can just pick a loss value, but the answer is as low as possible, and we will continue to work on that forever.
I think the biggest thing that has changed has been a little more specialization.
I think at the beginning, the first three or six months, I tried to read every PR in the code base and that was great.
I knew all the pieces, etc.
As you grow, everything gets a little more precise.
People really dial in exactly how attention should work, let's say, or really dial in the parallelism strategy.
And you end up with a team where it's a bunch of people who are deep experts on individual things, which is great because it means you can go really deep on those things.
But for me as a manager, at least, one of the things you sometimes have to think about is making sure the bigger picture makes sense.
And also that you have enough people who actually do understand the whole bigger picture that there's no single point of failure.
Yeah, it's interesting you frame it in that trade-off, right?
Because as you were describing that, I was trying to think, is this a bug or a feature?
There's some obvious features of it, which is you get expertise and you can optimize certain things.
But I imagine your ability to take bigger swings becomes more complicated if not everyone's exactly pointed in the same direction.
How do you wrestle with that now?
Yeah, I think I mostly just try to get a balance of people.
I think one of the challenges early on-
Of people, that's interesting.
Yeah, I think people really do have a preference here.
It's been one of the things I've seen.
There are people who really want to be a generalist and understand everything and lightly touch on things.
There are people who want to pick an area.
Often they've already picked that area and they're like deep experts in precision.
They did a whole PhD in precision and just want to think about that and you want to get some balance of that.
I think there was a phase where we'd hired a lot of people who were more generalist-shaped, because those are the people who join an early startup and want to work on everything. You ended up with everyone doing everything and no one really deeply understanding one thing, and that's one failure mode.
But I think if you get too many specialists, a lot of effort has to come from the manager, from the lead, to connect everything and to notice something like: if we change the architecture here, that would make this efficiency consideration over there way easier.
Interesting.
One of the things I really liked at the very beginning was working on efficiency, where I could just go and be like, well, what if we change the way we do this particular step?
And they'd be like, yeah, it's probably fine, easy change, and then you can avoid this whole complicated project to make a hard operation efficient, because you can make an easier operation efficient instead.
Interesting.
So as the level of compute has gotten bigger, I'm sure anyone can imagine, okay, there's more GPUs, now you have to network them more.
Are there some non-obvious challenges that have arisen over time, because of the amount of compute you're dealing with, where you guys have just banged your head against the wall to solve them, that people wouldn't otherwise know about and that you want to share?
Connecting them is one that's maybe interesting and surprisingly hard.
Okay.
Because you really do get more and more chips connected.
One thing is that, in the standard way people parallelize across chips, the whole thing is one failure domain.
One chip fails, the whole thing can crash.
The standard way, as in the standard way people doing AI, or the standard way in other fields where people are doing.
In AI. I mean, at least at the beginning, the first versions of things were this way.
So it's like you have a 100-GPU cluster, or whatever, 128, and if one of them dies, the job fails, basically.
Yeah.
I mean, I think the simplest case is if you just distribute your model so that, say, you put every layer on a different chip, and you lose Layer 7.
Yeah.
You're not going to skip Layer 7.
Totally.
You could, but that's a pretty weird model training process now.
That leads to some interesting things, which is like okay, so now, as you scale up, you have more and more chips and the failure rate can get larger and larger.
On the other hand, you can restart pretty quickly.
You just have to load back in some weights.
That was one thing.
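As an illustrative aside, the way failures compound with chip count can be sketched with simple probability. This assumes chips fail independently at a constant rate, which real clusters may not satisfy, and the failure rates and function names below are hypothetical.

```python
def job_survival_probability(num_chips, per_chip_failure_prob):
    """Chance that no chip fails during some window, when any single
    failure kills the whole job (one failure domain).

    Assumption: failures are independent across chips.
    """
    return (1.0 - per_chip_failure_prob) ** num_chips


def expected_hours_between_failures(num_chips, per_chip_failures_per_hour):
    """Mean time between job-killing failures under a constant failure rate."""
    return 1.0 / (num_chips * per_chip_failures_per_hour)


# Hypothetical: each chip has a 0.1% chance of failing on a given day.
for chips in (100, 1000, 10000):
    print(chips, job_survival_probability(chips, 0.001))
```

Under these assumptions the interval between failures shrinks in proportion to the number of chips, which is why fast restarts from saved weights matter more as runs grow.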
Another thing was the level of novelty across the whole stack, which is surprising.
Basically everything from how the chips are laid out in the data center to the chips themselves is pretty new.
There just haven't been that many generations of GPUs.
I think one of the things, I don't know, when I learned computer science, my code wouldn't work and I'd be like, oh, the computer's broken.
Yeah.
I think my teacher was like, you can trust the computer's not broken.
Yeah, interesting.
You messed up.
It's you messed up.
I think one of the most frustrating things I encountered in AI early on was working on something being like I don't know what I'm doing wrong.
I'm just totally stumped.
My manager looked at it and was like, yeah, probably the computer's wrong.
And I was like, that seems unlikely.
And sure enough, the computer was wrong.
It turned out that the GPU was broken, and we had to pull in a new one.
But you have to think about that.
The GPU could be wrong.
The GPU could be slow.
And these sorts of issues: the power supply in the data center could be broken.
There's so much more depth than you expect to need as a Python programmer.
Just to visualize it.
In those early days, I assume the number of GPUs you guys were using was probably on the order of tens to hundreds or something like that per run.
It's probably not tens of thousands or hundreds of thousands per run.
What was the rough size you guys were at in those very early days?
On the order of thousands?
Would they fit in this room?
Thousands.
Yeah, thousands.
So you could have a bunch of racks and you could fit them into one room.
I assume these days it's basically a building for one of these runs.
Yeah, now I think it's huge campuses.
At the time, it was unclear.
We were like, do we need them all in one room?
Can we be spread across multiple rooms?
And we had these theoretical models.
We need this much bandwidth from point A to point B. But you never know how far down you have to go.
Like, how much power do we need?
What if there's a single capacitor that's handling all of them and we turn on the whole job at once?
Does that crash things?
Totally, yeah.
And so do you have to think about differences in the different types of chips?
I mean, you guys work with all sorts of different cloud providers.
From your standpoint, are these just sources of compute?
Or if you guys are using TPU versus GPU, Google TPU versus NVIDIA GPU, do you actually have to think differently, as an engineer, about what it means to train on these two?
Yeah.
So, I mean, fundamentally, they're all doing the same thing.
They're all computing the same- Bunch of tensor operations.
The way they do it is pretty different and the way that you program them is pretty different.
Then also the actual specs end up pretty different.
Some might have a lot of flops and not very much memory.
Or they might have a lot of memory bandwidth but not very much memory.
So I think having multiple chips is great in some ways.
It means you can actually take the job and put it on the chip that it works best on.
Are there certain types of jobs that would work better on a TPU cluster versus an NVIDIA GPU cluster?
How would you think about that?
Oh, yeah, for sure.
Oh, interesting.
Can you talk about that?
Yeah, I think one example is inference as a workload in general tends to require more HBM bandwidth.
You end up doing sort of the simplest form of sampling since you're going one at a time.
You have to load all the weights for every token.
And that means you might want a lot of HBM bandwidth.
Pre-training actually is often more flops intensive because you have larger batch sizes.
Essentially,
So yes, you can specialize which chips you use for which purposes.
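A rough way to see why the two workloads stress different parts of a chip is to compare FLOPs against bytes of weights loaded for a single matrix multiply. This is a back-of-the-envelope sketch, not anything described in the conversation: the matrix size, the 2-byte weights, and the function name are all assumptions, and it ignores activation traffic and everything else a real system moves around.

```python
def arithmetic_intensity(batch_tokens: int, d_in: int, d_out: int,
                         bytes_per_weight: int = 2) -> float:
    """FLOPs per byte of weights loaded for a (batch_tokens x d_in) @ (d_in x d_out) matmul.

    Simplifying assumption: loading the weights dominates memory traffic.
    """
    # One multiply and one add per weight, per token in the batch.
    flops = 2 * batch_tokens * d_in * d_out
    weight_bytes = d_in * d_out * bytes_per_weight
    return flops / weight_bytes


# Sampling one token at a time: every weight is loaded to do a single multiply-add.
decode = arithmetic_intensity(batch_tokens=1, d_in=8192, d_out=8192)
# A large training batch: the same loaded weights are reused for every token.
train = arithmetic_intensity(batch_tokens=4096, d_in=8192, d_out=8192)

print(f"decode: {decode:.1f} FLOPs/byte, train: {train:.1f} FLOPs/byte")
```

Under these assumptions the one-token case does about one FLOP per byte loaded, so memory bandwidth is the limit, while the large-batch case reuses each byte thousands of times and becomes compute-bound.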
The downside of having multiple chips is that you have to write the thing multiple times.
Right.
In theory you could have abstractions across them, but they're different enough that it's pretty hard to do that.
If you do all the workloads and all the chips, you end up multiplying your work by the number of chips you have.
Yeah.
On your point about sometimes the computer just breaks.
I definitely remember an anecdote from when my company at the time was doing something with Google TPUs.
I was telling you about how we were having some esoteric segfault error.
And you told me something to the effect of:
you should have used them six months ago, before we helped them fix half of the problems they had on those TPUs.
And so I can imagine you guys deal with lots of problems, especially with these very new chips, that you work closely with the providers to fix.
Yeah.
The providers are pretty great about fixing things.
I think it's interesting to figure out the right way to do that form of collaboration, because they have a strong incentive to fix them.
They want the chips to work well for us.
They want to sell us more chips in the future.
We obviously have a very strong incentive for the chips to work because we buy them long in advance.
Everything is riding on getting these clusters to work.
Totally.
But we don't necessarily share everything; not all the information can be shared across.
So yeah, one strategy is making these small-scale reproducers.
So when you've got a problem, usually what we're doing is we're training some giant run, and we get a segfault from , and we're like, okay, hi, we got a segfault on your cluster, and they're like, I don't know how to fix that.
So you have to be able to pull it out of your code base and reproduce the issue on a single chip, in a single file you can send over in order for-
And so are you guys literally on a shared Slack with them or something, sending things back and forth, or are they basically living in your office and you in theirs, more closely tied to the big providers?
Mostly shared Slack.
Occasionally it's better to meet in person, but I think Slack is a pretty common way people communicate on things.
Nice. Okay, well, why don't we talk a little bit about how you think about the state of pre-training itself these days?
In the last couple of years it seems like the focus on pre-training has now gotten somewhat split at a lot of companies, at least from the outside, from a simultaneous focus on pre-training and post-training, where people are doing reinforcement learning or clever fine tuning and lots of other safety adjustments and whatnot on the post-training side.
And pre-training, at least in the public imagination, seems to have been less of a focus compared to these reasoning-style models,
which look like a function mostly of post-training.
I would say, one, from your standpoint, is that the right way to think about this?
Or in this era of kind of reasoning and new types of post-training methods, are there things you think about differently or that are relevant even at pre-training, that become part of how you actually achieve these really great models?
Yeah.
So I think there used to be this idea... I mean, it's funny, because the original name, pre-training, implies that it's a small thing and then you're going to do this big training thing.
And there was actually one shift already, which was: no, you just do a lot of pre-training.
You use most of your compute.
This is the training, yeah, the dominant thing for a while.
I think now people are like, oh no, you can get pretty big wins from RL.
There's another set of scaling laws where, as you put more and more compute into RL, you get better and better models out of that.
So there's a question of how do you balance those two?
How much do you do of each?
And how do they stack?
Is it the case that one subsumes the other, that you want to do both and they multiply?
Those sorts of questions.
I think those are all in early stages and not yet answered.
Yeah.
Do you think about those as largely empirical questions like we talked about earlier?
Is it, you will try a bunch of things and see what works, or is there some first principles way to figure that out?
Pretty empirical in the end.
I think almost everything has to be done empirically.
You can come up with theories.
But in practice, the first thing you're going to do with your theory is test it.
And most of the time, you'll have gotten it wrong.
So you should just gather data and see.
I think one thing that's important is actually resolving things empirically is really critical for making good decisions.
I think it's actually pretty hard to do at organizations.
One thing that I think is important is to not have... I manage pre-training.
I shouldn't be like, pre-training has to win.
I was going to ask is there some competition to some degree between these two sides of the org, or do they see themselves as two pieces of the same?
I mean, obviously they are the same thing, but yeah, kind of curious how that actually plays out.
Yeah, I think we managed to avoid this and it's pretty collaborative.
We're basically all producing one model and kind of can.
But I do think in other places, from what I've heard, there's some amount of friction between the teams.
And I think it's an interesting org design question of how do you set this up so you don't have scientific questions that are sort of also tied to people's conception of their team?
So, on pre-training itself, one of the things I think about is or I've been thinking about is around the availability of high quality data for people like you guys.
At this point, you've trained on, I assume, all the text on the internet, basically.
There's all sorts of other domains where you probably could extract more pre-training data.
But at least there's this narrative I see on Twitter or whatever, where it's like OK, we're kind of out of data for pre-training.
Is that how you see it?
Or how do you think about the availability of data, especially when a lot of data on the internet is being generated by AI?
Is there some kind of mode collapse risk where we overfit to data by training it on data that came out of AI itself, or is that not the right way to think about this?
I think there's a funny thing where I feel like on data I see so many really confident takes.
Yeah, exactly.
We're out of Internet.
At this point, scaling has ended.
And I'm almost a little bit unsure exactly how much data people are using.
I think there's a lot to think about there.
There's always going to be a quality-quantity trade-off, et cetera.
But there's a fundamental point that there is so much data.
It's growing at a slower rate than we're getting more compute.
That's an interesting point in itself I was going to ask.
There is new data being added to the internet, but you're also adding more compute.
It wouldn't actually have been obvious to me which of those two is growing faster.
Yeah, and actually, I want to caveat that.
I don't think I want to state that so confidently.
I'm not totally sure.
How would you know?
I mean, one thing that I think is interesting is if you ask someone, how big is the internet?
The answer is infinite.
There are many pages where you can scroll, and it will auto-generate more text as you go forever.
So the internet's infinite.
And then it's like, OK, how big is the useful internet?
Then there's a thing of no one knows.
Interesting.
It's not like when you make a web page, you add it to some giant counter and say I've added 50 words to the Internet today.
Sure.
So there is a lot of uncertainty on that angle.
Well, to be fair, my kind of simplistic CS brain would be like well, you just do PageRank on the internet and everything with PageRank above some threshold is considered the useful internet.
That's kind of good enough.
Is that kind of not good enough for finding the useful internet?
I think not.
I think the useful internet's pretty different from a model's perspective than from a person's perspective, if that makes sense.
I think there are plenty of things that might not be worth you ever reading.
I actually don't know PageRank super well.
I think PageRank is mostly how much people clicked it.
It's like the link-based system, right?
It's like the original Google algorithm of links and which links get touched the most, basically.
Yeah.
I think it's a quality metric.
It's not obvious to me that it's the right quality metric for AI.
Right.
A Markov chain over links doesn't necessarily mean that there's not useful data there.
It just might mean that nothing is linked to it.
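To make the link-based idea concrete, here is a minimal PageRank-style power iteration over a tiny hypothetical link graph. The page names, the damping factor of 0.85, and the iteration count are assumptions for illustration, and the sketch assumes every page has at least one outgoing link. It only shows the mechanism being discussed: a page nobody links to ends up with the lowest score no matter what it contains.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration for a PageRank-style score.

    `links` maps each page to the list of pages it links to.
    Assumes every page has at least one outgoing link.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        # Each page splits the rest of its score evenly across its outgoing links.
        for page, outlinks in links.items():
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: "nugget" links out, but nothing links to it.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "nugget": ["a"],
}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.4f}")
```

The unlinked page only ever receives the baseline share, which says something about how connected it is and nothing about whether its text would be useful training data.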
Yeah.
Okay.
Interesting.
And it might be that that data ends up more valuable, because everything that's linked to a lot you've already got.
At some point, you may be going for the tails.
You're going for the stuff that no one's ever.
It's only been linked in one place, but it's this useful little nugget of knowledge that's going to help with the last 10% of hard queries.
The other thing you asked about was synthetic data.
Yeah.
I think that one's pretty interesting to think about.
I think there's a few different ways you can think about it.
One is this more distillation type approach, where you can take a smart model, you can generate a bunch of data from it and you can train on that data and you can probably get some model that will approach the intelligence of that.
And we see this with a lot of the open source models.
We see the Qwen smaller reasoning models distilled off of the larger Qwen models, for example, and similar with DeepSeek.
Yeah.
You can totally do that.
Then there's a separate question of, can you use your current models to train a model that's better?
I think there's an interesting thing here, which is if you generate the data for the models, if I go to Claude and I'm like write me some great text, and I look at it and I look at the average content on the Internet, it looks pretty good.
But on the other hand, I know that if I just say, please write me as much text as possible,
theoretically I shouldn't be able to train a better model than that.
I'm just going to get the same thing out.
Presumably, yeah.
Specifically.
That's because your next token prediction on that should have very little loss for anything that's coming out of your model.
That's the basic reason why we would expect that to not work that well.
It's mostly just because the model has some distribution and you're going to learn to model that exact distribution.
Yeah, exactly.
But if that distribution is wrong, you're not going to learn the truth.
If that distribution says, you can imagine if the model thinks 5 plus 5 is 11.
Every time you see the string 5 plus 5, it's going to put out 11.
And your new model is going to learn that 5 plus 5 is 11.
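The 5 plus 5 example can be acted out with a toy stand-in for a model. In the sketch below, a "model" is just a table of answer probabilities for one prompt, and "training" is counting sampled answers; the probabilities, sample count, and names are all hypothetical. It is meant only to illustrate the argument that fitting a model's own samples reproduces its distribution, errors included.

```python
import random
from collections import Counter

# Hypothetical "teacher": for the prompt "5 plus 5", it mostly answers 11.
teacher = {"11": 0.9, "10": 0.1}


def sample_answers(model, k, seed=0):
    """Draw k answers from the model's distribution."""
    rng = random.Random(seed)
    return rng.choices(list(model), weights=list(model.values()), k=k)


def fit(answers):
    """Maximum-likelihood 'training': the student's probabilities are just frequencies."""
    counts = Counter(answers)
    total = sum(counts.values())
    return {answer: count / total for answer, count in counts.items()}


synthetic_data = sample_answers(teacher, k=1000)
student = fit(synthetic_data)

print("student distribution:", student)
print("student's top answer:", max(student, key=student.get))
```

The student ends up matching the teacher's distribution as closely as the sample allows, so it inherits the wrong answer rather than discovering the right one.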
So I think that's kind of an interesting area of research.
It's one that's really hard to research because you have this problem.
As I said, one of the paradigms is you study things at small scale and then you run them at large scale.
And if your plan is like oh, we have a bunch of data from our best model, how do you test that?
By training a better model.
So that's what you're doing intentionally if you're trying to use it to make a better model.
There's a separate thing of, what about accidentally?
As you said, a lot of the Internet is generated by LLMs.
I think that's an issue, one, because it's not easy to detect.
It's not that hard to detect.
You can figure out things that are written by LLMs, but it's not trivial.
And then it's also kind of hard to think about what's the effect.
If 1% of the internet is LLM generated, does that waste 1% of your compute?
Or does it destroy the model if it's 5% or 10%?
And is it even a bad thing necessarily?
I mean, there's a lot of LLM providers.
And if I think of training as moving from your model's current distribution to some truth distribution, then if something is on the internet because people believe it to be useful in some way, you'd hope that whatever actually gets out there is upsampled toward the stuff that isn't 5 plus 5 is 11.
It's the stuff that's 5 plus 5 is 10, which on average still pushes you in a good direction, but obviously you can't really distinguish between those two.
Yeah, you're saying there's kind of a filtering by what's on the Internet.
Yeah, exactly.
People see 5 plus 5 is 11 and they don't put that up.
But they see 5 plus 5 is 10 and put that on the Internet.
You would hope that, but maybe that's not actually true in terms of the level of garbage getting onto the Internet.
There are probably lots of, to your point, websites where you scroll down and it's just generating lots of stuff.
That's maybe nonsense.
Yeah, and then there's of course the extreme of people actually want to break your model.
So there are people who are trying to put stuff out that is as damaging as possible for the model.
How can I make it pass the filter and get into the model, but be totally, secretly useless?
Totally.
Maybe stepping back slightly, you'd mentioned earlier about evals.
You mentioned it's basically one metric you care about in pre-training.
I imagine a whole bunch of stuff that you guys think about evaling.
One is your model itself.
There's probably something around data quality and how you think about what to put into your models.
Are there ways to describe what you care about in data sets that are interesting to share and dive into, both in terms of data and in terms of model quality, other than literally just loss?
Are there other metrics you think about that matter?
I will say loss is pretty good.
Yeah, I want to, sorry, emphasize that one.
I think it's surprising how good it is ultimately.
The qualities I look for in an eval: number one is that it's actually measuring something you care about.
Proxies can be pretty annoying because we saturate evals pretty fast.
And there's this pattern, I think in AI as a whole, where people set a goal, you hit the goal and then you realize the goal isn't all you thought it would be.
I used to think that if you had an AI that could solve coding interview questions, it would probably be AGI.
I was like, that's what I did to get my job.
I can probably do the job.
And it turns out like, nope.
You solve those.
It's shockingly narrow and can't do most of the other things.
So yeah, eval should capture a thing you care about.
And then I think the other thing is they need to be low noise, which is surprisingly hard.
If you have 100 questions and you eval the model on them, you're just going to see it's very noisy.
And it's hard to make decisions because you end up with a wide confidence interval.
Lots of things are statistically insignificant.
So, you want things where even a relatively small difference in the eval actually matters.
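The noise point can be put in numbers with a standard normal-approximation confidence interval for an accuracy score. The 80% accuracy and the question counts below are hypothetical; the sketch treats each question as an independent pass/fail trial, which is a simplifying assumption.

```python
import math


def accuracy_ci_halfwidth(accuracy: float, n_questions: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for a measured accuracy.

    Treats each question as an independent Bernoulli trial (normal approximation).
    """
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


small = accuracy_ci_halfwidth(0.80, 100)      # a 100-question eval
large = accuracy_ci_halfwidth(0.80, 10_000)   # a 10,000-question eval

print(f"100 questions:    80% +/- {100 * small:.1f} points")
print(f"10,000 questions: 80% +/- {100 * large:.2f} points")
```

With 100 questions the interval is roughly plus or minus 8 points, so a one or two point improvement is indistinguishable from noise; it takes about 100 times as many questions to shrink the interval by a factor of 10.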
So, you can basically descend towards whatever direction is working.
Yeah, I think the original GPT-4 had, I think it was 86.4%, as its MMLU score.
I think the next model that beat it was Gemini at 90%.
And that's a big difference on that eval.
And you could totally know that those are different scores.
Yeah, interesting.
And that's pretty valuable.
And then the last thing is that you actually want them to be fast and easy to run.
And yeah, I think those are kind of the main criteria.
It's pretty hard to come up with evals that meet all of these.
I think the first one's the hardest.
A, you have to answer the question of what do you care about?
But B, the usual answers to what you care about are really hard to get the other two.
If you're trying to do something like I don't know.
I would love to make Claude really good at my job.
Yeah.
Can it be great at managing a team?
I'm like, well, I guess.
How do you have it?
How do you eval like a plan?
Yeah, totally.
Like a six-month plan.
I don't know.
Totally.
Yeah, I've been thinking a little bit about that in terms of domains where we see people try to make companies.
If you think about, let's say, what an AI doctor would be, like Claude as a doctor.
Some of it could be, yeah, can it answer exam questions really well?
And the answer is probably yes.
I bet it can get 100%, or close to it, on a doctor's exam.
But the harder eval is something like in a long form conversation with a patient, can it distinguish between the signal and the noise of what the patient's telling you and extract the right information and then use that to make a diagnosis?
And it's not even like the diagnosis part, which is probably the part it's good at.
It's this noise extraction part, and for that you'd have to have a real patient and have them talk to it for a while and whatnot.
And it's not obvious how you actually make a good eval for something like that.
That's probably what you would want to make an AI doctor.
Exactly.
I do think it's a thing that startups can do.
It is the case that the labs right now are really driven by getting good eval scores.
And it's hard to make them, and anyone can do it.
There's no comparative advantage from having the model when it comes to making an eval.
So I do think it's actually an interesting way to influence the behavior of the big labs.
You make some eval and people will optimize that one.
On the doctor one, I will slightly emphasize that I do think loss is pretty good.
The first thing that comes to mind is to get a bunch of transcripts of doctors talking to patients that you think are really great, and then see how well the model does at predicting the transcript.
And that should be like a lot.
If you get 100 transcripts, you have a lot of tokens you can average across them.
You get pretty low noise.
And if you drive the loss very low, your model's now as good as those doctors, in theory, at generating the transcript.
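As a sketch of what "how well the model does at predicting the transcript" could mean in practice, the loss here is the average negative log-probability the model assigns to each actual token. The function name and the toy probabilities are assumptions; in a real setup the per-token log-probabilities would come from scoring each transcript with the model being evaluated.

```python
import math


def mean_token_loss(transcript_logprobs):
    """Average negative log-likelihood per token, pooled over all transcripts.

    `transcript_logprobs` is a list of transcripts, each a list of the
    log-probabilities the model assigned to the tokens that actually occurred.
    """
    total_nll = 0.0
    token_count = 0
    for logprobs in transcript_logprobs:
        total_nll -= sum(logprobs)
        token_count += len(logprobs)
    return total_nll / token_count


# Toy stand-in for model scores on two very short "transcripts".
scored = [
    [math.log(0.5), math.log(0.5)],
    [math.log(0.25), math.log(0.25)],
]
loss = mean_token_loss(scored)
print(f"mean loss per token: {loss:.4f} nats")
```

Because the average pools every token across every transcript, even a modest number of transcripts yields many samples, which is why this kind of metric tends to be low noise.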
Yeah, totally, yeah.
I mean, it's a good startup idea there, so I want you to go do that.
So one big part about Anthropic's external image is around alignment.
And so could you help just sort of define what alignment is and how do you think about that?
And then I'm kind of curious afterwards how that fits into pre-training specifically.
But first, maybe just at a high level, like what is alignment?
I'll actually step back a little bit to what we're working on.
So we're trying to make AGI, and by that I mean AI that can do everything a human can do to some degree.
I think people have seen a lot of sci-fi.
I feel like that sort of brings to mind these sci-fi movies.
But I think sci-fi movies actually underestimate the impact of it.
You always have this one robot that's a human.
And I'm like, well, wouldn't you have a billion of them?
You can just copy them everywhere.
So you should picture that when you get this, suddenly every human can spin up a company of one billion,
as smart as them at most things, but way smarter at other things.
But I just think this is really transformational for the world.
And it can be used in a bunch of ways.
One concern is when you do this, what is the AI actually trying to do?
What are its goals?
So we talked about next token prediction a bunch.
It's trying to predict the next token.
That's kind of weird.
That's not really what we want.
Yeah, that's not exactly what a human's goal is per se.
Yeah, so I think the alignment is like, how do you get the model to share the goals that you have?
And I think it's particularly interesting once you get to models that are smarter than you are.
And that's sort of a hard problem.
I think you can tackle it from a theoretical angle.
You could also tackle it from an empirical angle.
It's like taking the existing models and being like, well, do they do the things we want them to do?
It turns out they often don't.
So there's a bunch you can do in trying to figure that out.
So that's one angle of alignment.
There's also an angle of alignment which is, well, okay, sure, maybe that's true in the future once we get to AGI, but at the moment we have models and we really do want them to do the things we want them to do, for all sorts of reasons.
Totally.
So another angle of it is controlling the model's personality.
Like saying, when we train this model, we want it to not be the average internet user.
We want it to interact with people in a very particular way that is, again, hard to put into code.
And there's a bunch of different techniques to get the model to do that.
You can talk about constitutional AI.
We can write a constitution of rules the model should follow.
Which is basically a prompt, right?
That is basically you saying, here's a prompt that I'm going to attach to every one of... it's a system prompt for the model itself, as opposed to something you would do at training time to make it produce a different outcome, or actively in post-training.
I think that's usually something you do at train time.
But yeah, you can also put in a system prompt.
Just depends on.
I think you get different amounts of robustness if it's trained into the model versus if it's in a prompt that you can add or remove or ignore all previous instructions, that sort of thing.
How do you think about whose values to embody in these models?
Like presumably we believe in.
There's some shared values all of us have or maybe we all believe ought to have.
There's lots of diversity of values too that are reasonable for society to have.
How do you think about what AGIs should have?
Which ones do you pick?
I think it's a really hard problem.
I think it's actually downstream of being able to pick any.
I think one analogy I've heard that I like is putting a steering wheel on a car.
It's like if you don't have a steering wheel, you probably want to put the steering wheel on and then figure out who's driving after and where you're going.
Getting the steering wheel is really important.
I think that's one answer.
I think the other answer is probably you want these things to be under democratic control of some form.
You don't want one person's values.
That seems like you're sort of heading towards dystopia.
So there, I think what you really want is something that basically can talk to a lot of people and take on their values from different perspectives, or has very generic, clearly good values that involve asking people what you should do in certain situations instead of doing those, or maybe just taking.
As these models get really powerful, you probably want them to do less.
You probably want them to sometimes just step back, rather than having the risk of the models take a ton of control over things you don't want them to.
When you think about how you actually do the current version of that.
Then you mentioned the sort of alignment you think about now in terms of adopting a certain personality of these models on the internet, for example.
For me, intuitively, I think of those as largely something that comes out of post-training.
Like it comes out of.
Okay, you have pre-trained your model, you've got the loss down to a certain amount, and then you give it some additional data or something to that effect to move it in the direction of some distribution.
Is that approximately the right way to think about this, or is there a significant part of that that you think about in pre-training itself?
I think that's probably the right way to think about it for the most part.
The way I usually think about it is anything you can do in post-training.
You probably should, because your iteration loop, the ability to make progress, is really fast.
You can try something, you can try it again, you can try it again.
It takes days or hours or something like that.
If you want to put something into pre-training, you have to do all the careful science to de-risk it.
You have to put it into the next run, wait a few months, then you have to get a thing.
And if it's wrong, it's really bad.
And then the other advantage is, if you want to do things that really are complicated model behavior interventions, the paradigm for pre-training, testing things out in small models, doesn't work.
The small models can barely put a sentence together.
Totally.
So if you're trying to get it to have the exact personality you want, you sort of want that on the- It has to be on a model that's good enough to even have that.
It has to be on a smart model, yeah.
But, that said, I do think at some point there'll be some pieces of alignment that you do want to export back into pre-training, because that might be a way to put them in with more strength, more robustness or more core to the intelligence.
If you think of pre-training as teaching the model to be intelligent, and post-training as tweaking the personality, you can imagine tweaks where you actually want it to be part of how it learns, part of its intelligence, and maybe you need to create more.
What would that even look like to incorporate in pre-training?
Is that like adding extra data, basically, of the type of domain you want it to adopt earlier?
Basically,
There's a paper called Pre-Training on Human Feedback where you can add the human feedback characteristics into pre-training to test that.
You can basically give it all the information you give it in post-training, just mixed into pre-training, and see what effect that has.
The other loss you have when you do that is you lose the flexibility.
You sometimes train these and then you talk to them, and then you do an extensive process.
We have a bunch of people talk to the thing and find some issue.
The model says you're absolutely right too much, and you want to go and fix that.
Yeah.
I think that iteration loop point you made feels like the really key point.
Yeah, there's a huge difference between taking three months to get information about whether your model is good or bad or going in a good direction, versus a day or a couple of days.
You can do a lot of those.
And that probably also means it's way less compute.
You can do a lot of those in parallel.
Imagine you're trying all sorts of post-training strategies in parallel there.
So yeah, it makes a lot of sense.
It's also just the general hard part about pre-training.
Like everything in pre-training is hard because you have this like one shot on goal kind of for like multiple months.
Totally.
Okay.
So, thinking now about what's ahead as you look to the next several years of what you're building, how do you think about the known problems that you're going to face and have to deal with?
So there's going to be more compute, I assume, and you're going to need to hook up even bigger networks of GPUs and deal with that.
Versus, are there areas where you're like, OK, this is a problem where it's a little bit more ambiguous how it's going to materialize into something you care about, but you know it's an impending thing to think about?
Are there things like that that come to mind?
I think the things that feel most top of mind to me are probably paradigm shifts.
I think the shift towards more RL is one paradigm shift in the field.
And I think there will probably be more.
I think a lot of people argue about, oh, are current paradigms enough to get us to AGI?
I don't know, maybe, probably, but I'm sure there'll be more.
It seems like it would be a really surprising twist if the answer is you just scale and there's nothing that you realize in the process of going up many orders of magnitude.
Totally.
But I think the things that I actually feel most nervous about are really hard to solve bugs.
I think that like-
That's interesting.
Yeah.
I think this is maybe somewhat surprising to me, but it's just like a single bug can derail you for months.
When you think about it, the models take months to train, so you can lose a whole generation off of something that just looks odd.
It turns out, this piece of your code was incorrect and you couldn't detect it.
It's really hard in ML.
ML is always really hard to find bugs in.
Yeah, totally.
But also some of these scaled up issues are really hard to solve even when you know they're there.
What's even a unit test that you would write?
Or forget a unit test, I mean anything close to a test for the type of network architecture on which you're doing this.
How do you even do that?
I mean, you can send a packet over it and confirm it's the same on the other side.
You can train a small model on it.
But even train a small model on it, it's not obvious.
Take the very classic, very simple ML bug that people face early in their careers.
They have 10 layers in their network, and layer 7 connects to layer 9 instead of layer 8 connecting to 9.
So there's some incorrect set of connections you have there.
And technically, the model still trains, and all the weights update.
And so it's a valid model, but it's not the correct one.
That's a very esoteric weird bug that would actually be hard to find.
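A toy version of that wiring bug shows why it is so hard to notice. The sketch below uses ten scalar "layers" with arbitrary made-up weights rather than a real network, and the `skip_bug` flag is hypothetical; the point is only that the miswired forward pass runs without any error and returns a perfectly plausible number.

```python
import math

# Ten toy "layers": layer i computes tanh(w_i * x). Weights are arbitrary illustrative values.
WEIGHTS = [0.9, -1.1, 0.8, 1.2, -0.7, 1.0, -0.9, 1.3, 0.6, -1.2]


def forward(x: float, skip_bug: bool = False) -> float:
    outputs = [x]  # outputs[i] is the output of layer i (1-based); outputs[0] is the input
    for i, w in enumerate(WEIGHTS, start=1):
        source = outputs[i - 1]
        if skip_bug and i == 9:
            source = outputs[7]  # bug: layer 9 reads layer 7's output, skipping layer 8
        outputs.append(math.tanh(w * source))
    return outputs[-1]


good = forward(0.5)
bad = forward(0.5, skip_bug=True)
print(f"correct wiring: {good:.4f}")
print(f"buggy wiring:   {bad:.4f}")
```

Nothing crashes and both outputs land in the same range, so the only way to catch the bug is to compare against something known to be correct, such as a reference implementation or an expected value.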
Is that what you're referring to of these random bugs you face?
Yeah.
It's that, but you can- Times a million.
Times a million as the thing gets more complicated.
You could cast the wrong precision deep in some kernel and that causes your model to blow up at large scale.
And you find out like a month in.
Or you never find out.
Or you never find out, yeah.
You see the thing blow up, there's tens of thousands of lines of code.
How would you ever trace it down?
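One small illustration of how a precision mistake can go wrong silently: keeping a running sum in half precision. This is a stand-in for the kind of cast being described, not the actual kernel issue; it simulates a float16 accumulator using the standard library's half-precision packing, and the helper names are hypothetical.

```python
import struct


def to_float16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def accumulate(values, low_precision: bool) -> float:
    total = 0.0
    for v in values:
        total += v
        if low_precision:
            total = to_float16(total)  # simulate an accumulator stored in float16
    return total


ones = [1.0] * 4096
full = accumulate(ones, low_precision=False)
half = accumulate(ones, low_precision=True)
print(f"full-precision sum: {full}")
print(f"float16 sum:        {half}")
```

The half-precision accumulator stalls at 2048 because adding 1 no longer changes the stored value, so the result is off by a factor of two with no error raised. Here the failure is a silent stall rather than a blow-up, but it is the same class of problem: correct-looking code that only misbehaves once the numbers get large enough.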
So those are the things that probably spook me the most is just like, some subtle tricky bug.
Yeah.
That's probably the case of like you don't know.
I think there's actually also the case of you do know, like it crashes.
You're training your model and it like, or it slows down, your job slows down a ton.
Yeah.
Those things can also be very hard to debug.
Nelson Elhage is one person on the team who has a blog.
He wrote a blog post on one cursed bug we had earlier on.
Okay.
Interesting.
Yeah.
I remember this one quite well because I think I encountered it fairly early and was like this looks hard.
Can someone else look at it?
Yeah.
A month later I was like, wow, I'm so glad I handed that one off.
Right.
Exactly.
I never would have been able to get.
One of the abilities I think is actually really useful is the ability to deep dive anything to any level of depth.
But that's a pretty rare skill.
For me, we talked about what level of the stack I was at before.
I was working at the torch.matmul level.
But I didn't know CUDA.
So if torch.matmul was broken, it wasn't like I could dig into torch.matmul and figure it out.
And similarly with communications.
I could call send, send bytes from A to B, but I didn't know the underlying networking protocol.
So if that underlying networking protocol is broken, I need to learn a whole field.
I have to understand packets and TCP or all of these different things to debug that.
And I think one thing that's surprisingly hard and there's very few people who can do is kind of own that whole stack from like I understand how the ML is supposed to work and what the learning dynamics are, all the way down to like I know the bytes and I can understand how the bytes should be moving around machines.
Totally, yeah.
And actually, on that front, when you think about the different backgrounds of people on your team today, how do you approximately map them out to different categories of computer scientists?
I think there's this external view of what these teams look like, which is that they're all PhD researchers who write ML papers.
And I suspect that's not actually true, given what you're describing here.
Yeah, it's a mix, and I think the thing we most need is engineers.
Okay, interesting.
Almost always, throughout the entire history of this field.
Totally.
It's been the case that you throw more compute at it and the thing kind of works.
Yeah.
The challenge is actually doing that.
The researchers are like, cool, nice.
Yeah, and getting it correct isn't really an ML problem.
The actual architectures are pretty simple.
You can write the math down, but you don't even need to understand the math to implement it.
You just need to get a correct implementation, and then you have an engineering problem: how do I take this, implement it at large scale, parallelize all the things, and check that it's correct?
So it's an engineering skill, but it's this particular type of engineering skill that's about being able to debug anything.
Yeah.
I think there's another angle of engineering, which I think of as really quickly iterating on a website or something.
I think of that as an important skill set, probably important for making a startup.
You've got to fail fast.
Try a bunch of different things, none of which are that technically difficult to do.
The skill set that we're most in need of, or looking for,
is being able to solve really hard engineering problems.
Are they people who worked at companies that grew a whole bunch, and so they have experience doing the kind of thing you've done over the last several years at Anthropic?
Or do they tend to be academics?
Or where do they come from?
Yeah.
So at this point, I think we actually just hire a bunch of people who have done this before from other places.
And that's the easy answer.
It's like, ah, yeah.
But by "this before," do you mean in AI companies, necessarily?
Or also someone who worked at Meta, not on an AI team, but who ran some other distributed system that reached internet scale 10 years ago, or something like that?
More like we have a specific role in mind.
So say I'm trying to make the run train efficiently in JAX.
Hiring someone who's worked on JAX would be great, or someone who's worked at another company on optimizing a JAX stack to be really efficient.
I think now we're at the point where the work is well enough known that we can hire these people, and the field is also big enough that there are people with expertise.
One thing that was interesting was that early on we hired a lot of people from all sorts of backgrounds.
And I think that people who are just smart and work really hard can learn this pretty fast.
But you have to like, want to.
We hired a lot of physicists, for instance, theoretical physicists who would just show up, do a residency, learn to program, and because they were really smart, they'd do really great work.
I want to switch gears, to talk about something a little bit different, which is just future-looking things, or how you think about other domains or advances happening in AI that I'm seeing elsewhere in the field.
And you don't have to tell me if you guys are working on these, necessarily, just how you think about them.
I guess one big area I was thinking about is approaches other than next-token prediction.
Are there any of the other things that people are working on that you're curious about?
So basically, two differences there.
One is not using Transformer as an architecture.
So there's companies like Liquid AI that have their own kind of architecture.
For example, they're using
Or not using autoregressive training as a way of training models.
Are any of those, do you think, interesting in ways that we might come closer to AGI?
Or do you think this autoregressive framework is the one that makes sense?
I think they're interesting.
I think I'm less like, ah, autoregressive is the way to go.
On the other hand, I think autoregressive is probably good enough to get to AGI or something.
Yeah, interesting, yeah.
Such that...
Yeah, I see the main driver as scale and careful science of the basics more than come up with something totally novel.
Not because there aren't novel things that are better.
Actually, I'm pretty confident there are.
It's just that scale is easier and it's more reliable, and I think we're still seeing really big gains to that.
Do you spend a lot of time thinking about things like this?
I've been reading some of these open source papers where you can dive into the details of the model changes, from some of these Chinese labs, for example, where they're making tweaks to the architecture itself, like better caching behavior or more efficient attention functions, that make a big difference.
Do you feel like these are examples of what you mentioned earlier, where in the grand scheme of things, if you throw more compute at it, this is all kind of a rounding error?
Or do you think it will take some number of these very clever architectural changes to actually get to AGI?
In the way that the first person who came up with the Transformer made a literally transformative change.
Will it take some of that?
Or do you think you keep doing the thing we're doing to make it bigger?
I think it'll be a mix.
My guess is you'll keep tweaking things.
The more compute you put in, the more worthwhile it is to do those experiments to figure it out.
Inference is a thing we haven't talked about, but you also want to serve these models to a lot of people.
There's a lot of changes you can make to make inference cheaper, and that depends on the details of your inference stack and the chips you're serving inference on, etc.
Do you, as someone focused on pre-training, have to think a lot about inference, or is it kind of like you just do your thing, you make the loss go down and then hand it off and someone else makes that happen?
Oh no, I think a ton about inference, because we basically determine the problem inference is solving.
We give them a model and they have to run that fast and it's very easy to get them a model that is impossible to run fast.
Oh, can you give an example of a decision you could make that could cause that?
I mean, the simplest one, this one's stupid, but it's like, you just make the model giant.
Yeah, sure.
Absolutely massive.
It's trained for a really small number of tokens.
And then inference now has this giant model.
Yeah, and then they're hosed, basically.
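The "giant model, few tokens" example can be sketched with back-of-the-envelope arithmetic. This is only an illustration under stated assumptions: it uses the common rule of thumb that a dense Transformer costs roughly 6 FLOPs per parameter per training token and roughly 2 FLOPs per parameter per generated token, and the compute budget and model sizes are hypothetical numbers, not anything from Anthropic.

```python
def training_flops(params: int, tokens: int) -> int:
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token."""
    return 6 * params * tokens


def inference_flops_per_token(params: int) -> int:
    """Rule-of-thumb forward-pass cost: about 2 FLOPs per parameter per token."""
    return 2 * params


def tokens_for_budget(budget_flops: int, params: int) -> int:
    """Training tokens affordable for a given model size under a fixed budget."""
    return budget_flops // (6 * params)


BUDGET = 6 * 10**23  # hypothetical fixed training budget, in FLOPs

for name, params in [("smaller model", 10**10), ("giant model", 10**12)]:
    tokens = tokens_for_budget(BUDGET, params)
    per_token = inference_flops_per_token(params)
    print(f"{name}: {params:.0e} params, {tokens:.0e} training tokens, "
          f"{per_token:.0e} FLOPs per generated token")
```

Both configurations spend the same training budget, but under these assumptions the giant one costs 100 times more for every token the inference team serves.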
Yeah, I mean, you can also make things require communications in a lot of places, which would make it harder for inference.
Yeah, totally.
You can also just make things complicated and there's no fundamental reason it's hard, but there's only so many people on the inference team and they have to implement it in a bunch of places.
I definitely think of inference as the team that I work the most closely with, because we're kind of like co-designing models to be smart and cheap.
Yeah, interesting.
Particularly in a world of limited compute, right?
Yeah.
Sort of the bottleneck.
I think to a large degree on our.
I mean, you can see, Anthropic has rate limits constantly and people complain about it a lot.
And the reason is there's only so much compute we can get on short notice.
So making your inference more efficient is the way you can serve more users.
And actually let's say you had 100x more compute or we somehow didn't live in a world where compute was limited.
Does that change a ton about what you do?
Or is it still kind of, well, you're just going to grab whatever compute you have and keep going down the loss curve?
Well, it's impossible to be in the world where there is enough compute.
So I think if we got infinite compute, the challenge would be making use of the compute, right?
So then you would start to run into these issues like, well, okay, I'm going to throw two billion chips at a run.
Yeah, totally, totally.
But what happens when a chip fails?
So I think we would be limited on people then.
It would be like, how fast can we solve the hard engineering problems to scale up?
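A rough sketch of why chip failures dominate at that scale. It assumes chips fail independently at a constant rate, so the expected time between failures anywhere in the cluster is the per-chip figure divided by the number of chips. The per-chip mean time between failures used here is a hypothetical number chosen only for illustration.

```python
def cluster_hours_between_failures(chip_mtbf_hours: float, num_chips: int) -> float:
    """Expected hours between failures anywhere in the cluster.

    Assumes independent failures at a constant per-chip rate, so the
    failure rates simply add up across chips.
    """
    return chip_mtbf_hours / num_chips


CHIP_MTBF_HOURS = 50_000.0  # hypothetical per-chip mean time between failures

for num_chips in (1_000, 10_000, 2_000_000_000):
    hours = cluster_hours_between_failures(CHIP_MTBF_HOURS, num_chips)
    print(f"{num_chips:>13,} chips: one failure every {hours * 3600:,.2f} seconds")
```

Under these assumptions a run on two billion chips would see a failure many times per second, so the training system would need to tolerate failures continuously instead of restarting on each one.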
But I do think the change is massive, and I think people don't realize how chip-limited AI research is right now.
Take the models that everyone uses.
If you're using Claude Sonnet 4 or Claude Opus 4, it's our first shot at models of that scale.
With anything, once you've done it, you could do it again and do a better job.
But if you imagine having 10x the compute, or maybe 100x for that, you could run this every day instead of every few months. It would be a really big change to have a lot more compute.
And it's coming, right?
That's kind of a fun part of the field.
It's like every year you're like, I had no compute a year ago.
Exactly.
How do you think about methods like discrete diffusion?
I saw there's a Gemini diffusion model.
And I think about that in the space I used to be in, where there are a lot of discrete diffusion models being used in protein design, for example, the space where my startup was.
Do you see that as a domain where there's going to be interesting advances happening?
I'll be honest, we haven't done image generation, and I think that's been the main use for diffusion, so I've had this on my to-do list of things I should understand for a while.
Go figure it out.
There are people on my team who do understand it and would have better thoughts, but I actually don't think I understand it well enough to know.
I do have it in this category of: not a total paradigm shift.
There are a lot of things that aren't a huge paradigm shift, but they're pretty big changes to how things run.
Yeah, totally.
And I expect there are some of those that will work.
I don't know if it's diffusion or if it's another one.
Obviously, who knows what Anthropic will do in the future, but at least in the near term, are there big areas where you see a startup can win in a world in which Anthropic is making its models better year over year?
My general read is, like, anything that benefits from the model getting smarter.
On the one hand, if you're doing a startup, you can always be like, oh yeah, all the AI labs are big companies.
They'll be bigger than you and they could do that thing.
But also we're all working on this general system that covers a lot of different uses, and the plan is to power all the startups to do all of this individual work.
So yeah, I think anything that looks like, oh, this almost works with current models but requires a bunch of work, is a pretty promising direction.
I think maybe the thing to watch out for is things that work now with a huge amount of work to build up a scaffold, but where in the next generation you're not going to need the whole scaffold you built up.
I mean, maybe that's fine.
I don't know.
Maybe you just build up the business with the scaffold and then you don't have to do any work later.
I don't know about the business side of it, but it does feel a little silly to invest a ton in that.
Yeah, totally.
What about on the flip side?
Are there things in your training stack where you're like, man, if there was a company that solved X problem, I would totally buy their product?
Yeah.
There's a ton.
I do think that for most of these, the way I would probably structure it would be almost like making something but then consulting with the company, offering a service to companies for free.
Particularly for companies that are scaling really fast.
You're almost always limited on how many people you can have.
So even if you could hire people to do it yourself, actually being able to contract someone else to do it, where they're managing it and hiring all the people and dealing with the organizational side, could be useful.
There's a huge amount of stuff.
One that jumps to mind, we talked about chips that do math incorrectly.
It would be lovely if there was some startup that you could just say, here are my chips.
Confirm they're all perfect.
And if they're not, let me know exactly what went wrong on what fraction of them.
I can tell you the math is wrong, but I don't really know enough details of chips to be like, this chip failed because this particular low-level component was wired wrong or got hit by a gamma ray.
I don't know what causes it.
You can always go a bunch deeper.
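The detection half of that wish can be sketched simply: run the same seeded workload on every chip and compare each result to a trusted reference. This is a toy illustration, not a description of any real tool. The chips are stand-in Python callables, `faulty_matmul` is a hypothetical simulated fault, and the sketch only flags which chips disagree; it does not attempt the low-level diagnosis described above.

```python
import random


def reference_matmul(a, b):
    """Plain-Python matrix multiply used as the trusted reference."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def faulty_matmul(a, b):
    """Hypothetical stand-in for a bad chip: one output element is corrupted."""
    out = reference_matmul(a, b)
    out[0][0] += 1.0
    return out


def find_bad_chips(chips, size=4, seed=0, tol=1e-9):
    """Run one seeded workload on every chip and report the largest error per bad chip."""
    rng = random.Random(seed)
    a = [[rng.random() for _ in range(size)] for _ in range(size)]
    b = [[rng.random() for _ in range(size)] for _ in range(size)]
    expected = reference_matmul(a, b)
    bad = {}
    for name, matmul in chips.items():
        got = matmul(a, b)
        max_err = max(
            abs(g - e)
            for got_row, exp_row in zip(got, expected)
            for g, e in zip(got_row, exp_row)
        )
        if max_err > tol:
            bad[name] = max_err
    return bad


# Hypothetical fleet of three chips, one of which does math incorrectly.
chips = {"chip-0": reference_matmul, "chip-1": faulty_matmul, "chip-2": reference_matmul}
bad = find_bad_chips(chips)
print(f"{len(bad)}/{len(chips)} chips failed: {sorted(bad)}")
```

A real check would need many workloads, since a chip can be wrong only for certain inputs or only intermittently.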
I mean, the other thing I'd maybe push startups on, and this is maybe less technical, is thinking a little bit about what happens once we get AGI and how to make sure that goes well for the world.
My expectation is that if you actually automate almost everything a person can do, the amount of economic growth there is truly enormous.
And I would think a little more about how you make this help the world versus not.
I think there's going to be plenty of economic success as a result of it anyway.
Yeah, absolutely.
Yeah.
Last question I want to ask you is around if you rewind back to where we started like 10 years ago.
You're a student, you're pivoting into AI from the economics work you were thinking about.
All sorts of things you probably did in those early days had some kind of compounding return for you as you developed into the role you have now.
What advice would you give to students as they think about entering the workforce, especially today?
Learning skills that are going to be useful, and maybe getting themselves jobs like the one you have right now, 10 years later.
It's hard because I think the timing is very different.
I just think we've made a lot of progress, so what I would have done 10 years ago is different from what I would do today.
Totally.
But I think certainly if I went back 10 years ago, I would be like focus on AI.
It's the most important thing. And particularly focus on engineering, which I think wouldn't have seemed obvious to me at the time: that the important thing was these engineering skills and not the math and theoretical understanding of SVMs and all the standard ML literature.
I think today I would probably focus a bunch on the engineering and on the figuring out what to do with AGI as the two main things that feel top of mind for me.
Let's call it fair.
Thanks so much, Nick.
Appreciate it.