Saurabh Shukla

Large Language Models, simply explained for beginners

April 2024

The jargon of Large Language Models can be genuinely confusing for beginners. Here are seven core concepts, explained plainly.

1. Language Model

A Large Language Model is a mathematical model that can understand and produce human language. It is built from many mathematical operations — mostly multiplications and additions. Give it a sentence; it generates a response.

2. Transformer

Much of the excitement in generative AI today traces back to one invention: the Transformer architecture, introduced in Google's paper "Attention Is All You Need." Transformers are the fundamental building block of today's most powerful LLMs. Their key advantage is the ability to process very long sequences of text — entire books, conversations, codebases.

3. Pre-training

The first stage of building an LLM is pre-training. The model is shown vast amounts of text — a large slice of the public internet, thousands of books — and trained to predict the next word in a sentence. That's the whole task. From this simple objective, language understanding emerges.
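The next-word objective can be illustrated with a toy statistical model — this is not how an LLM is actually trained (real models use neural networks over billions of documents), just a minimal sketch of the idea of learning to predict the next word from examples:

```python
from collections import Counter, defaultdict

# Toy "pre-training": count how often each word follows each other word,
# then predict the most frequent follower. The corpus below is made up.
corpus = "the cat sat on the mat the cat ran"
words = corpus.split()

next_word_counts = defaultdict(Counter)
for current, following in zip(words, words[1:]):
    next_word_counts[current][following] += 1

def predict_next(word):
    """Return the word most often seen after `word` in training."""
    counts = next_word_counts[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # → "cat" ("cat" follows "the" twice, "mat" once)
```

A real LLM does the same kind of prediction, but over subword tokens, with a neural network instead of a counting table.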

4. Fine-tuning

After pre-training, the model is fluent but not yet useful for answering questions. Fine-tuning exposes it to a large dataset of question-and-answer pairs. The model learns to respond, not just complete sentences.
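Fine-tuning data is often prepared by formatting each question-answer pair into a single training string. The template below is an illustrative assumption, not a standard format:

```python
# Hypothetical sketch of preparing fine-tuning examples. Real datasets
# use specific prompt templates defined by each model's creators.
qa_pairs = [
    ("What is the capital of France?", "Paris."),
    ("Who wrote Hamlet?", "William Shakespeare."),
]

def format_example(question, answer):
    return f"Question: {question}\nAnswer: {answer}"

training_examples = [format_example(q, a) for q, a in qa_pairs]
print(training_examples[0])
# Question: What is the capital of France?
# Answer: Paris.
```

The model is then trained on thousands of such examples, so it learns the pattern "a question deserves an answer" rather than merely continuing text.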

5. Tokenization

A model cannot read words as humans do. Before processing, text is broken into tokens — small pieces that may not correspond to whole words. A single word can become multiple tokens; punctuation, spaces, and emojis are tokens too. Each token is then converted to a number.
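A toy tokenizer makes this concrete. Real LLM tokenizers (such as byte-pair encoding) learn subword pieces from data; the rule-based splitter below is only an illustration of text becoming tokens, and tokens becoming numbers:

```python
import re

# Toy tokenizer (illustrative only): split into words and punctuation.
# Real tokenizers split into learned subword pieces instead.
def tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("LLMs can't read words!")
print(tokens)  # → ['LLMs', 'can', "'", 't', 'read', 'words', '!']

# Assign each distinct token an integer id -- the form the model consumes.
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
token_ids = [vocab[t] for t in tokens]
print(token_ids)
```

Note how "can't" splits into three tokens and "!" is a token of its own — real tokenizers behave similarly, though their exact splits differ.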

6. Embedding

Each token is represented as a long list of numbers — a vector — called an embedding. This is the format the model actually works with: everything going in and coming out is numbers. The final output embeddings are converted back to words and shown to you.
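An embedding table is essentially a lookup: token id in, vector of numbers out. In this sketch the vectors are random, purely for illustration — in a real model they are learned during training, and are far longer (often thousands of numbers per token):

```python
import random

random.seed(0)

# Toy embedding table: one vector of numbers per token id.
# Sizes and values here are made up; real models learn these vectors.
vocab_size, embedding_dim = 10, 4
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

token_ids = [3, 7, 1]  # a "sentence" as token ids
embeddings = [embedding_table[i] for i in token_ids]
print(len(embeddings), len(embeddings[0]))  # → 3 4  (3 tokens, 4 numbers each)
```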

7. Attention Mechanism

Attention is the most important operation in a Transformer. When the model processes a sentence, every token interacts with every other token — each one attending to the others to understand context and meaning. Attention is extraordinarily powerful and extraordinarily expensive to compute. It is the reason LLMs need so many GPUs.
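The "every token attends to every other token" idea can be sketched as scaled dot-product attention in plain Python. This is a minimal sketch: real implementations use matrix libraries and learned query/key/value projections, and the vectors below are made up:

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:  # every token...
        # ...scores every other token: this pairwise loop is why
        # attention cost grows with the square of sequence length.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Each output is a weighted mix of all the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Three tokens, each a 2-dimensional vector (toy numbers).
vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(vecs, vecs, vecs)
print(len(out))  # → 3  (one context-aware vector per input token)
```

The nested loop over all token pairs is the expense the section describes: doubling the sequence length quadruples the work, which is why long contexts demand so much hardware.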

Let me know if there's a concept you'd like me to cover in the next post.