Andrej's Blog

Training Data is Still an Open Problem

We spend most of our time working on training data. If someone told me this a year ago, it would have surprised me. The general thinking was that brute force scaling of AI models would begin to reach diminishing returns, and that data collection infrastructure like ours would be used as a tool during inference.

Teams are still scaling pretraining aggressively. Not in a brute force way, but in a much more targeted and deliberate manner. The way this is happening tends to vary between labs, but the consistent signal is a constant need for sources with high densities of “good” tokens.

There’s an over-simplified version of the AI stack that shows up in a lot of investor decks. It goes something like this:

chips -> infrastructure -> apps

Interestingly, data often ends up being a bit of a footnote in these conversations. Even some of the people closest to the largest labs grossly underestimate what “internet-scale” really means, and view data as a solved problem in the training process. A good test: if Google suddenly stopped scraping the whole internet every day, would it still be the best at pretraining in a year’s time? Unlikely.

Training data demand isn’t uniform, and it doesn’t fall into clean categories. The biggest AI labs have incredible amounts of compute, and although they still care about efficiency, in practice they tend to optimize more for coverage. The default behavior is to take as much data as they can within a certain domain and absorb the long tail.

Smaller labs that can’t afford to train on everything will focus on density instead of coverage. This means stricter filtering and getting as much signal as possible out of fewer tokens.
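A minimal sketch of what density-over-coverage filtering can look like in practice: score every document with a cheap quality heuristic and keep only the highest-signal slice. The scoring rule below (unique-word ratio times average word length) is purely illustrative, not any lab’s actual filter.

```python
# Density over coverage: rank documents by a crude quality heuristic
# and keep only the top slice. The heuristic itself is a placeholder.

def quality_score(doc: str) -> float:
    words = doc.split()
    if not words:
        return 0.0
    unique_ratio = len(set(words)) / len(words)        # penalize repetition
    avg_len = sum(len(w) for w in words) / len(words)  # penalize filler tokens
    return unique_ratio * min(avg_len / 5.0, 1.0)

def filter_for_density(docs: list[str], keep_fraction: float = 0.2) -> list[str]:
    ranked = sorted(docs, key=quality_score, reverse=True)
    cutoff = max(1, int(len(ranked) * keep_fraction))
    return ranked[:cutoff]

docs = [
    "the the the the",
    "transformers use attention to mix token representations",
]
kept = filter_for_density(docs, keep_fraction=0.5)
```

In a real pipeline the heuristic would be a trained classifier or perplexity filter, but the shape is the same: stricter filtering means a lower `keep_fraction` and more signal per token.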

Teams trying to do the same thing might want completely different data, and teams holding the same data will do very different things with it.

Regardless of what teams are looking for, the workflow tends to look similar. They start broad, absorb a large dataset, and use it to figure out where the signal is coming from. From there, requests become more specific. More filtering is introduced, and sometimes this involves deploying machine annotations at scale to properly index datasets. This is especially true for multimodal data, where filtering is very expensive and is often deferred to later stages.
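The “annotate at scale, then filter later” pattern above can be sketched as a tiny indexing step: attach cheap machine labels to every record up front, so that later, more specific requests become index lookups instead of full re-scans. The label set and the keyword “classifier” here are placeholders for illustration only.

```python
# Sketch: machine-annotate records once, then serve narrower requests
# from the resulting index. The keyword rule stands in for a real model.

from collections import defaultdict

def annotate(record: dict) -> dict:
    text = record["text"].lower()
    record["labels"] = [kw for kw in ("code", "math", "legal") if kw in text]
    return record

def build_index(records: list[dict]) -> dict[str, list[int]]:
    index = defaultdict(list)
    for rec in map(annotate, records):
        for label in rec["labels"]:
            index[label].append(rec["id"])
    return index

records = [
    {"id": 1, "text": "A proof from first principles of this math identity"},
    {"id": 2, "text": "Refactoring legacy code for readability"},
]
index = build_index(records)
```

Deferring the expensive filtering is what makes this viable for multimodal data: the cheap labels are computed once, and every later filtering pass reuses them.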

At the same time, teams don’t stop absorbing large amounts of data. Even as pipelines become more selective, coverage continues to expand in parallel. In a strange way, the data “system” never really converges. It becomes broader and narrower at the same time, all the time.

There are a few reasons why training data is still far from a solved problem (there’s probably a strong case to be made that because training data is the real alpha, it will never be “solved”). To get a great training dataset, it isn’t as simple as just crawling as many pages as possible. You have to decide what to crawl, where to crawl it, how often to revisit it, and how to deal with the fact that large parts of the web don’t want to be crawled at all. Even the largest labs have internal pipelines, but that doesn’t eliminate the problem. Coverage is always incomplete, and priorities shift quickly enough that the need for external data never really goes away.
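The scheduling decisions above (what to crawl, how often to revisit) can be sketched as a priority queue over sources, where the crawler always works on whatever is most overdue. This is a toy under stated assumptions: real schedulers also handle robots.txt, rate limits, and per-host politeness, all omitted here, and the example hostnames are made up.

```python
# Toy crawl scheduler: each source has a revisit interval, and fetch
# order is driven by how overdue each source is. Politeness, robots.txt,
# and rate limiting are deliberately left out of this sketch.

import heapq

def plan_crawl(sources: list[tuple[str, float, float]], now: float) -> list[str]:
    # sources: (url, last_crawled_hour, revisit_interval_hours)
    heap = []
    for url, last, interval in sources:
        overdue = (now - last) - interval  # hours past due; higher = sooner
        heapq.heappush(heap, (-overdue, url))
    order = []
    while heap:
        _, url = heapq.heappop(heap)
        order.append(url)
    return order

order = plan_crawl(
    [("news.example/a", 0, 1), ("docs.example/b", 0, 168)],
    now=24,
)
```

Even this toy shows why the problem never closes: every new source adds another entry whose interval has to be tuned, and intervals drift as sources degrade or spring back to life.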

None of this is “set and forget”. It requires constant iteration. Sources degrade, new ones emerge, and what counts as high-quality data today doesn’t necessarily stay that way.

At scale, even small inefficiencies compound quickly. Every day spent on data acquisition is a day not spent training, and this is time you can't get back in a race to ship better models. Most teams would rather focus on improving models than maintaining complex data pipelines.

A year ago, it was easy to think about training and inference data as separate domains. That distinction is beginning to break down. To build good web search, you need a comprehensive, high-signal crawl of the internet. The same is true for training data. The work that goes into identifying, collecting, and refining high-quality data ends up looking similar in both cases.

Good training data tends to look like good search data, and the infrastructure being built for pretraining data collection today is laying the foundation for what real-time systems will rely on in the future.

Most of this isn't visible from the outside. The scale and complexity of the training data market is easy to underestimate unless you're directly working with it.