Where Do LLMs Pull Their Information From?

Written By
Timothy Boluwatife
SEO Strategist

Large Language Models (LLMs) like GPT-3, GPT-4, and others have amazed many of us by answering all sorts of questions. They can explain quantum physics, draft a recipe, or even write poetry in Shakespearean style. It often feels like they know everything. Which naturally raises the question: Where are these models getting their information? Are they searching the internet live? Do they have a giant database stored inside? How do they seem to recall facts and details about so many topics?

Let’s demystify this by exploring how LLMs are built and where their “knowledge” comes from.

LLMs 101: Trained on Massive Datasets

First, it’s important to know that LLMs do not actively browse the web or query some external knowledge base every time you ask a question (unless they are specifically designed with that feature, which I’ll touch on later). Instead, they generate answers based on patterns they learned during a training phase.

During training, an LLM is fed a huge amount of text data. This data comes from various sources and effectively teaches the model about language and facts in the world. The training process adjusts the model’s internal parameters so that it can predict the next word in a sentence, form coherent paragraphs, and capture relationships between concepts.
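To make "predict the next word" concrete, here's a toy sketch in plain Python. It's a simple counting (bigram) model over a three-sentence corpus, nothing like a real neural network, but it illustrates the same training objective LLMs optimize at vastly larger scale: learn which word tends to follow which.

```python
from collections import Counter, defaultdict

# Toy "training data". Real LLMs see hundreds of billions of words or more.
corpus = (
    "the capital of france is paris . "
    "paris is the capital of france . "
    "the capital of italy is rome ."
).split()

# "Training": count which word follows which. A bigram model is far cruder than
# a neural network, but the goal is the same: predict the next word from context.
next_word_counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    next_word_counts[current][nxt] += 1

def predict_next(word: str) -> str:
    """Return the continuation seen most often during 'training'."""
    return next_word_counts[word].most_common(1)[0][0]

print(predict_next("capital"))  # -> "of"
print(predict_next("of"))       # -> "france" (seen twice, vs. "italy" once)
```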

So, when you ask an LLM a question, it’s pulling from this learned representation of all that training data. It’s a bit like how a person might answer a question: based on everything they’ve read and learned in the past. They aren’t looking something up in the moment (unless they need to); they’re recalling from memory. Similarly, the LLM “recalls” (in a generalized way) from its training data.

What Kind of Data Do LLMs Train On?

The exact datasets vary by model and by organization, but here are common sources:

  • The Open Internet (Web Crawl): Many LLMs are trained on a snapshot of the internet. For example, OpenAI’s models use data from Common Crawl, a publicly available dataset containing billions of words scraped from websites. This includes all sorts of content: news articles, blog posts, forums (like Reddit), Wikipedia pages, and more. If it’s text and publicly accessible, it might be in there.
  • Wikipedia: Wikipedia is often explicitly included because it’s a large, relatively clean and comprehensive resource covering many topics. It’s a quick way to inject a lot of factual knowledge.
  • Books: A lot of models train on digitized books (from various genres – literature, non-fiction, textbooks). Books provide well-edited, long-form content that helps the model learn about structure and also get information on subjects like history, science, philosophy, etc.
  • News Articles & Journals: Some datasets include archives of news articles and possibly scientific papers or journals. These help the model with more formal writing and up-to-date information (up to the cutoff of training).
  • User-Generated Content: Content from Q&A sites (StackExchange), forums, and social media discussions might also be included. For example, OpenAI’s GPT models are known to have trained on a lot of Reddit conversations (filtered for quality), which might contribute to their chatty and sometimes opinionated style.
  • Miscellaneous: Other sources can include lyrics, subtitles, legal documents, etc. Essentially, if it’s text and can give the model knowledge of language usage or world facts, it could be part of the training mix.

The total text can be astronomically large: think hundreds of billions, or even trillions, of words. By absorbing this, the model builds a statistical and contextual map of language: which words and concepts tend to appear together, how sentences are formed, how ideas progress, etc. In doing so, it also picks up tons of facts (because those facts appeared in the text it read).

Compression, Not Database Lookup

It’s useful to clarify that LLMs don’t store data like a database of facts with easy retrieval. Instead, through training, they compress information into the weights (parameters) of the model. 

For instance, the model doesn’t have a little index card that says “The capital of France is Paris.” But having seen that sentence (and many like it, in related contexts) in the training data, the association “capital of France” → “Paris” gets baked into the model’s neural connections. Later, if you prompt “The capital of France is [blank]”, the model will strongly predict “Paris” as the likely completion, because its weights were adjusted during training to reflect that association.
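If you want to see that kind of completion in action, here's a minimal sketch using the Hugging Face transformers library with the small GPT-2 model (my choice purely for illustration; it assumes transformers and torch are installed and downloads the model on first run):

```python
# A pretrained model completing a factual prompt purely from its learned weights;
# no database or web lookup is involved.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # small open model for illustration

result = generator("The capital of France is", max_new_tokens=3, do_sample=False)
print(result[0]["generated_text"])
# Often continues with "Paris", though a small model can also get facts like this
# wrong, which is exactly the fuzziness described below.
```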

Think of it like how you compress a large image into a smaller file – you lose some details, but keep the essence. LLMs sort of compress the vast text they see into a mathematical form that can still produce the essence of that text when asked.

This is why LLMs can sometimes get details wrong (like messing up a date or a name) – they didn’t memorize exact tables of facts; they absorbed patterns and sometimes those patterns are fuzzy, especially if there were conflicting or rare mentions in the data.

Examples of Known LLM Training Datasets

  • OpenAI’s GPT models: They haven’t published an exact list of all sources, but they’ve mentioned using CommonCrawl, WebText (a filtered web dataset), Wikipedia, a corpus of books, etc. The knowledge cutoff often talked about (e.g., “my knowledge is up to 2021”) is because they took a snapshot of all this data up to a certain date to train on. Anything after that, the model wouldn’t have seen (unless updated later).
  • Google’s LaMDA or PaLM: Google, with its access to the internet, also trains on a ton of web data, plus I’d imagine things like Google Books, scholarly articles, maybe dialogue datasets for conversation.
  • Meta’s LLaMA: In their paper, they list sources like Common Crawl, Wikipedia, books, GitHub, StackExchange, etc., and even specific percentages from each to create a balanced training set.
  • Anthropic’s Claude: Likely similar sources – web, Wikipedia, etc. They also emphasize some filtered data to reduce harmful content, so they might curate which parts of the web to include.

So across the board, these models have read a lot of what humans have written publicly. It’s why they often reflect not just facts but also the biases or perspectives present in that data.

Live Browsing vs. Training Data

It’s worth noting a distinction: Base LLMs vs. augmented LLMs.

  • A base LLM (like GPT-3.5 or GPT-4 out-of-the-box) only knows what it was trained on. If you ask it about something that happened after its training cutoff, it might try to guess or say it doesn’t know. For example, “Who won the World Cup in 2022?” – if the model’s training ended in 2021, it wouldn’t reliably know (and might either guess based on prior patterns or give a cutoff message).
  • Some systems integrate browsing or retrieval. For instance, ChatGPT with the Browsing feature or Bing’s AI chat actually perform a web search behind the scenes and then use the LLM to summarize or answer using what they found. In those cases, the answer is coming from both the LLM’s training and the fresh info it just looked up. But unless you’re explicitly using such a feature, the LLM is not pulling in new info.
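To make the augmented case concrete, here's a rough sketch of that retrieve-then-answer flow. Everything in it is hypothetical and simplified: the tiny in-memory "index" stands in for a real search engine, and ask_llm is a placeholder for whichever model API you'd actually call.

```python
# Hypothetical retrieve-then-answer flow: look up fresh text first, then hand it
# to the model inside the prompt, so the answer isn't limited to training data.

SEARCH_INDEX = {  # stand-in for a live web search
    "world cup": "Argentina won the 2022 FIFA World Cup final, beating France on penalties.",
}

def retrieve(query: str) -> str:
    """Return the best-matching snippet (a real system would call a search API)."""
    for key, snippet in SEARCH_INDEX.items():
        if key in query.lower():
            return snippet
    return ""

def ask_llm(prompt: str) -> str:
    """Placeholder for a real LLM call; here it just echoes the prompt it would send."""
    return f"[the model would generate an answer from: {prompt!r}]"

def answer_with_retrieval(question: str) -> str:
    context = retrieve(question)
    prompt = f"Use the context to answer.\nContext: {context}\nQuestion: {question}\nAnswer:"
    return ask_llm(prompt)

print(answer_with_retrieval("Who won the World Cup in 2022?"))
```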

So when using a vanilla LLM, remember: its knowledge is like an encyclopedia that was written at the time of its training and then sealed off. It’s not dynamically pulling in new information (again, unless designed that way with plugins or browsing).

Why Do LLMs Sometimes Get Stuff Wrong Then?

Despite having all that data in training, LLMs are not infallible:

  • Memory Limits: They can’t remember every detail from the training – it’s a generalization, not exact storage. If a piece of info was rare or not emphasized in the training data, the model might not produce it correctly.
  • Conflicting Data: The web has errors and contradictory statements. The model might have seen two different “facts” about something and it may not have a way to know which one’s correct (unless one was overwhelmingly more present). It might produce a blend or get confused.
  • No Understanding of Truth: LLMs don’t know truth; they know what texts usually say. If lots of people wrote an incorrect fact online, the model might repeat it. They lack an internal fact-checking mechanism beyond what patterns they learned.
  • Context and Prompt Influence: How you ask can affect what it pulls up from its “memory.” The model might have info somewhere in its weights, but if the prompt is vague or misleads it, it might fetch the wrong pattern. Good prompting can help it access the relevant info (“Think step by step” often helps it recall stored facts more accurately by forcing it to reason).
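As a rough illustration of how phrasing steers what the model surfaces, here's a small sketch that sends two wordings of the same question to the GPT-2 pipeline used earlier (again just an assumption for illustration; a tiny model won't show the gains that step-by-step prompting gives larger models, and exact outputs will vary):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Same underlying question, two phrasings: a bare fragment vs. a structured Q&A prompt.
prompts = [
    "capital France",
    "Question: What is the capital of France?\nAnswer:",
]

for prompt in prompts:
    out = generator(prompt, max_new_tokens=5, do_sample=False)
    print(repr(out[0]["generated_text"]))
# The structured prompt nudges the model toward the answer-shaped patterns it saw
# in training, which is the point made in the last bullet above.
```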

Summing Up

LLMs pull their information from the vast swaths of text they were trained on – basically, they learned from everything they could read: the internet, books, articles, etc. They don’t have a direct line to a database or the internet at generation time (unless augmented); instead, they generate answers from the “impressions” all that reading left on their internal neural network.

So, when ChatGPT tells you about the French Revolution or how to boil an egg, it’s basically synthesizing an answer from all the books, articles, and posts about those topics it saw during training.

Think of an LLM like a very well-read parrot that has also learned how to write original sentences: it’s heard millions of conversations and texts, and now it can chatter away about them. It’s not citing sources live (unless explicitly designed to), but it’s regurgitating and remixing all that past content in a coherent way.

Understanding this helps in using LLMs wisely. You’ll know why they might not know super recent events, why they could occasionally output a mistake (because maybe they never saw the correct info clearly in training), and why sometimes they speak so authoritatively (because they’re echoing the confidence of the texts they saw).

In summary: LLMs pull from their training data – an enormous, rich, but sometimes imperfect tapestry of human knowledge and language. They’re a mirror of what they were fed, with the ability to recombine and present it in new ways.

Timothy Boluwatife

Tim's been deep in SEO and content for over seven years, helping SaaS and high-growth startups scale with smart strategies that actually rank. He’s all about revenue-first SEO.
