Large Language Models — LLMs — are the technology behind ChatGPT, Claude, Gemini, and every AI assistant that has entered your life in the last three years. They can write code, summarise legal documents, pass medical exams, and hold conversations that feel disturbingly real. Yet most people who use them every day have no idea what they actually are, how they were built, or why they behave the way they do. This is the article that changes that.

Start here: a model that predicts the next word

At its most fundamental level, an LLM does exactly one thing: it predicts the most likely next token in a sequence. That's it. Everything else — the apparent intelligence, the reasoning, the personality — emerges from doing that one thing billions of times, at scale, on an almost incomprehensibly large amount of human-generated text.
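To make "predict the most likely next token" concrete, here is a toy sketch, nothing like a real LLM, that learns next-word statistics from a tiny made-up corpus simply by counting which word follows which:

```python
from collections import Counter, defaultdict

# Toy illustration (not a real LLM): learn next-token statistics
# from a tiny corpus by counting which token follows which.
corpus = "the cat sat on the mat the cat ate the fish".split()

counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    counts[current][nxt] += 1

def predict_next(token):
    """Return the token most frequently seen after `token`."""
    followers = counts[token]
    return followers.most_common(1)[0][0] if followers else None

print(predict_next("the"))  # "cat" -- it follows "the" most often here
```

A real model replaces the counting table with a neural network and the toy corpus with trillions of tokens, but the task being optimised is exactly this one.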

A "token" is roughly a word fragment. The sentence "artificial intelligence" becomes something like ["artif", "icial", " intel", "ligence"] — four tokens, not two words. Models don't read language the way you do. They operate on these numerical fragments, converting each token into a vector — a list of numbers — and processing those vectors through a deep neural network.

Where does the knowledge come from?

Before a model can predict anything, it must be trained. Training means exposing the model to an enormous corpus of text — books, websites, academic papers, code repositories, forums, news articles — and teaching it to predict what comes next in that text. GPT-4's training data is estimated at several trillion tokens: roughly speaking, a substantial fraction of all the publicly available text on the internet up to the model's training cutoff date.

During training, the model adjusts billions of internal parameters — the numerical weights of its neural network — to get better and better at prediction. It's a process called gradient descent: measure how wrong the prediction was, calculate the direction to adjust the weights, adjust them slightly, repeat. GPT-3 had 175 billion parameters. GPT-4 is estimated at over a trillion.
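The measure-wrong / adjust-slightly / repeat loop can be shown with a single parameter instead of billions. This sketch fits one weight w so that w * x matches a tiny dataset where the true answer is w = 2:

```python
# One-parameter gradient descent: the same loop as LLM training,
# shrunk from billions of weights to one.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # y = 2x, so ideal w is 2

w = 0.0    # start from a bad guess
lr = 0.05  # learning rate: how big each adjustment is

for step in range(200):
    # Gradient of the mean squared error with respect to w:
    # measures how wrong we are, and in which direction.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad  # nudge w in the direction that reduces the error

print(round(w, 3))  # converges toward 2.0
```

Training an LLM is this loop scaled up: billions of weights adjusted at once, over trillions of prediction errors.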

Key concept

An LLM never "looks up" facts the way a search engine does. Everything it knows is encoded implicitly in the weights of its neural network — a vast compression of statistical patterns extracted from human language.

The transformer: the architecture that changed everything

The breakthrough that made modern LLMs possible was a neural network architecture called the Transformer, introduced by Google researchers in 2017 in a paper titled "Attention Is All You Need." Before transformers, language models processed text sequentially — one token at a time — which made them slow and poor at capturing long-range dependencies in text.

Transformers introduced a mechanism called self-attention. Instead of reading a sentence left to right, a transformer model looks at every token in relation to every other token simultaneously — asking, in effect: "how relevant is each word to every other word in this context?" This is what lets a model understand that in the sentence "The trophy didn't fit in the suitcase because it was too large", the word "it" refers to the trophy, not the suitcase.

Self-attention is computed across multiple "heads" in parallel — each head learning to attend to different kinds of relationships: syntactic, semantic, coreference. The outputs are combined, passed through feed-forward layers, normalised, and stacked dozens to hundreds of times. The result is a model that builds a rich, contextualised representation of every token in the input.
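The core computation of a single attention head can be sketched in a few lines of plain Python. In a real transformer, separate learned matrices project each token into distinct query, key, and value vectors; here, to keep the idea visible, each toy embedding plays all three roles:

```python
import math

def softmax(xs):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(embeddings):
    """Scaled dot-product self-attention, one head, no learned weights."""
    d = len(embeddings[0])
    outputs = []
    for q in embeddings:  # each token asks...
        # ...how relevant is every other token to me? (dot product, scaled)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in embeddings]
        weights = softmax(scores)
        # The output is a weighted mix of every token's vector.
        outputs.append([sum(w * v[i] for w, v in zip(weights, embeddings))
                        for i in range(d)])
    return outputs

# Three toy token embeddings: tokens 1 and 3 point the same way, token 2 differs.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
for row in self_attention(tokens):
    print([round(x, 2) for x in row])
```

Notice that similar tokens end up weighting each other heavily: that is the mechanism behind resolving "it" to "the trophy" in the example above.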

How text is actually generated

When you type a prompt into an LLM, the model encodes your input tokens, runs them through its transformer layers, and produces a probability distribution over every possible next token in its vocabulary — typically 50,000 to 100,000 tokens. It then samples from that distribution according to a parameter called temperature.

At temperature zero, the model always picks the most likely next token — deterministic, repetitive, safe. At higher temperatures, it samples more broadly, producing more creative and varied but sometimes incoherent outputs. The token is appended to the context, and the whole process repeats — one token at a time — until the model produces a stop token or hits a length limit.
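Temperature is easy to see in code. This sketch uses a tiny made-up "vocabulary" of three tokens with invented raw scores (logits); dividing the logits by the temperature before the softmax is the whole trick:

```python
import math
import random

# Invented raw scores for a three-token vocabulary.
logits = {"cat": 2.0, "dog": 1.0, "pizza": 0.1}

def sample(logits, temperature):
    if temperature == 0:
        # Temperature zero: always pick the single most likely token.
        return max(logits, key=logits.get)
    # Higher temperature flattens the distribution; lower sharpens it.
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())
    exps = {t: math.exp(s - m) for t, s in scaled.items()}
    total = sum(exps.values())
    probs = {t: e / total for t, e in exps.items()}
    return random.choices(list(probs), weights=probs.values())[0]

print(sample(logits, 0))    # deterministic: "cat"
print(sample(logits, 1.0))  # usually "cat", sometimes "dog" or "pizza"
```

Generation is then just this sampling step in a loop: sample a token, append it to the context, run the model again.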

This is why LLMs are sometimes described as "stochastic parrots": they are, at their core, very sophisticated next-token predictors. But that description undersells something important — the emergent capabilities that appear as models scale up are not fully understood, even by the people who build them.

Fine-tuning and RLHF: making it behave

A base model trained purely on next-token prediction is not particularly useful as an assistant. It will complete text in whatever way it finds statistically likely — which can mean finishing a poem, writing propaganda, or providing instructions that should never be given. To make a model that actually follows instructions and behaves helpfully, developers apply a second stage of training.

The dominant technique is Reinforcement Learning from Human Feedback (RLHF). Human raters compare model outputs and rank them by quality, safety, and helpfulness. A separate model — a "reward model" — is trained to predict those human ratings. The LLM is then fine-tuned via reinforcement learning to maximise the reward model's score. This is what turns a text-completion engine into something that feels like a thoughtful assistant.
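The reward model's training signal can be sketched with the pairwise loss commonly used for this step (a Bradley-Terry-style formulation; details vary between labs). Given two responses that human raters compared, the loss is small when the preferred response scores higher, and large when the ranking is violated:

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(difference)): the pairwise preference loss.

    Small when the chosen response scores well above the rejected one;
    large when the reward model gets the human ranking backwards.
    """
    diff = reward_chosen - reward_rejected
    return -math.log(1 / (1 + math.exp(-diff)))

# Hypothetical scores a reward model assigned to two candidate answers.
print(round(preference_loss(2.0, -1.0), 4))  # low loss: ranking respected
print(round(preference_loss(-1.0, 2.0), 4))  # high loss: ranking violated
```

Minimising this loss over many human comparisons is what teaches the reward model to imitate the raters; the LLM is then tuned to score highly against it.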

Why they hallucinate

LLMs hallucinate — they confidently state things that are factually wrong — because they are not designed to retrieve facts. They are designed to produce plausible-sounding continuations of text. When the training data contains uncertainty, gaps, or conflicting information, the model fills in the gaps with statistically likely language that may bear no relationship to reality.

This is not a bug that will simply be patched. It is a structural consequence of how these models work. Retrieval-Augmented Generation (RAG) — connecting a model to a live knowledge base — helps significantly, but does not eliminate the problem. The model still has to interpret retrieved text and generate a response from it.
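The retrieval step of RAG can be sketched as follows. Production systems score documents with vector embeddings; this toy version uses simple word overlap, which is only a stand-in for the same idea — find the most relevant document, then put it in front of the model:

```python
# A sketch of Retrieval-Augmented Generation's retrieval step.
# Real systems use embedding similarity; word overlap stands in here.
documents = [
    "The Transformer architecture was introduced by Google in 2017.",
    "Gradient descent adjusts weights to reduce prediction error.",
    "Temperature controls how broadly a model samples its next token.",
]

def retrieve(question, docs):
    """Return the document sharing the most words with the question."""
    q_words = set(question.lower().split())
    return max(docs, key=lambda d: len(q_words & set(d.lower().split())))

question = "When was the Transformer introduced?"
context = retrieve(question, documents)
# The retrieved text is prepended to the prompt the model actually sees.
prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```

Even with a perfect retriever, the final answer is still generated token by token from the assembled prompt — which is why retrieval reduces hallucination without eliminating it.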

Scale and emergence: why bigger changes everything

One of the most surprising and unsettling findings in LLM research is the existence of emergent capabilities — abilities that appear suddenly, with no smooth progression, as models are scaled up. Models above a certain parameter count can perform multi-step reasoning, solve analogies, write working code, and pass professional exams. Models below that threshold cannot. These capabilities were not explicitly programmed. They emerged.

This is what makes LLMs genuinely difficult to reason about — even for AI researchers. We are building systems whose full capabilities we do not understand in advance. We discover what they can do by making them larger and watching what happens. That is not engineering. That is something closer to archaeology of a thing we are simultaneously creating.


What this means for the future

Understanding how LLMs work matters because it clarifies what questions we should be asking. Not "will AI become conscious?" — that's the wrong frame. But: what happens when systems that produce plausible-sounding outputs are placed in charge of consequential decisions? What happens when the cost of generating convincing misinformation drops to near zero? What happens when the statistical patterns encoded in these models — including every bias, every prejudice, every distortion in the training data — are amplified at scale?

These are not hypothetical questions. They are the questions of right now. And the first step toward answering them is understanding the machine.