:ID: F29CBBCE-BF32-4A7F-A576-A3DA674F540A
Connections: AI
Chapter 1: Understanding Large Language Models
- Input text → tokenize text → token ids → token embeddings → GPT decoder only transformer → postprocessing steps → output
- The “large” in large language models refers both to the model’s parameter count and to the immense size of its training dataset
- The first step in creating an LLM is to train it on a large corpus of text data, sometimes referred to as “raw” text
- LLM pretraining does not require labeled data, and instead uses self-supervised learning where the model generates its own labels. The “labels” come from the structure of the data itself
Transformer Architecture
- Originally developed for machine translation
- Consists of two submodules: an encoder and decoder
- Both submodules consist of many layers that use a self-attention mechanism
- The self-attention mechanism allows the model to weigh the importance of words or tokens in a sequence relative to each other
- Systems like BERT focus on the encoder, while systems like GPT focus on the decoder
- GPT is considered an “autoregressive” model, meaning it uses previous outputs as inputs for future predictions
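The autoregressive loop can be sketched in a few lines. This is a toy illustration of the feedback structure only: `toy_next_token` is a hypothetical stand-in for a real model’s forward pass, not anything from GPT itself.

```python
# Minimal sketch of autoregressive generation: each predicted token is
# appended to the context and fed back in as input for the next step.

def toy_next_token(context):
    """Toy stand-in for a model: maps the last token to a next token."""
    transitions = {"LLMs": "learn", "learn": "to", "to": "predict"}
    return transitions.get(context[-1], "<|endoftext|>")

def generate(context, max_new_tokens):
    context = list(context)
    for _ in range(max_new_tokens):
        next_tok = toy_next_token(context)  # predict from all previous outputs
        context.append(next_tok)            # feed the prediction back as input
    return context

print(generate(["LLMs"], 3))  # ['LLMs', 'learn', 'to', 'predict']
```

A real model replaces `toy_next_token` with a transformer forward pass plus a sampling step, but the loop shape is the same.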
Three Main Stages of Coding an LLM
- Implementing the LLM architecture and data preparation process
- Pretraining an LLM to create a foundation model
- Fine-tuning the foundation model to become a personal assistant or text classifier
Chapter 2: Working with Text Data
Data Preparation and Sampling
- Deep neural networks and LLMs cannot process text directly, so words are represented as continuous-valued vectors (embeddings)
- There are word, sentence, paragraph, and whole-document embeddings. Sentence and paragraph embeddings are useful for retrieval-augmented generation (RAG)
- Systems like GPT-3 use 12,288 dimensions for embeddings
- Special tokens like “<|unk|>” represent words not in the vocabulary, and “<|endoftext|>” marks the boundary between unrelated text sources
- Byte Pair Encoding (BPE) is a common tokenization scheme, available via the tiktoken package
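The core BPE idea can be shown with a toy merge loop (this is an illustration of the principle, not the tiktoken implementation): repeatedly fuse the most frequent adjacent symbol pair, so common character sequences become single subword tokens.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most common adjacent pair of symbols."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of the pair with one merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("low lower lowest")   # start from individual characters
for _ in range(3):                  # three merge steps
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)  # "low" has been merged into a single subword token
```

In practice one would just call `tiktoken.get_encoding("gpt2").encode(text)`, which applies the GPT-2 merge rules learned from a large corpus.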
Data Sampling With a Sliding Window
- LLMs learn to predict one word at a time
- Example input-target pairs (input block in [brackets] → target word):
- [LLMs] → learn
- [LLMs learn] → to
- [LLMs learn to] → predict
- During training, the model predicts the single next word following the input block shown in [brackets]
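The sliding-window sampling above can be sketched directly over token IDs. The IDs, context length of 4, and stride of 1 are illustrative choices, not values fixed by any model.

```python
# Minimal sketch of sliding-window input-target sampling:
# the target sequence is the input sequence shifted one position right.

token_ids = [290, 4920, 2241, 287, 257, 4489, 64, 402]  # example IDs

context_length = 4  # tokens per input block
stride = 1          # how far the window moves each step

inputs, targets = [], []
for i in range(0, len(token_ids) - context_length, stride):
    inputs.append(token_ids[i : i + context_length])
    targets.append(token_ids[i + 1 : i + context_length + 1])  # shifted by one

print(inputs[0])   # [290, 4920, 2241, 287]
print(targets[0])  # [4920, 2241, 287, 257]
```

Because the targets are just the inputs shifted by one, the labels come for free from the text itself, which is why pretraining needs no human annotation.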
Creating Token Embeddings
- The final step in preparing input is converting token IDs into embedding vectors
- The embedding vectors are initialized with random weights and then optimized during training
- Continuous vector representation is required because LLMs like GPT are deep neural networks trained with backpropagation
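An embedding layer is just a lookup table of randomly initialized vectors, one per vocabulary entry. The sketch below uses plain Python lists to show the mechanics; in PyTorch this is `torch.nn.Embedding(vocab_size, embed_dim)`, and the vocabulary size and dimensions here are arbitrary toy values.

```python
import random

random.seed(0)
vocab_size, embed_dim = 6, 3

# Random initial weights; these would be optimized via backpropagation
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embed_dim)] for _ in range(vocab_size)
]

def embed(token_ids):
    """A token ID simply indexes its row in the table."""
    return [embedding_table[tid] for tid in token_ids]

vectors = embed([2, 5, 1])
print(len(vectors), len(vectors[0]))  # 3 tokens, each a 3-dim vector
```

The lookup is equivalent to multiplying a one-hot vector by the weight matrix, which is why the table can be trained like any other layer.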
Chapter 2 Summary
- LLMs require text to be converted into numerical vectors since they cannot process raw text
- Embeddings transform discrete data (words, images) into continuous vector spaces compatible with neural networks
- Raw text is broken into tokens (words or characters), then converted to integer token IDs
- Special tokens (e.g., “<|unk|>”, “<|endoftext|>”) give the model extra context, such as where one text source ends and another begins
- BPE tokenizers handle unknown words by breaking them into subword units
- A sliding window approach generates input-target pairs for training
- Embedding layers in PyTorch act as a lookup, retrieving vectors for token IDs
- Positional embeddings (absolute or relative) are added to convey token position; OpenAI models use absolute positional embeddings
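Absolute positional embeddings can be sketched the same way: a second lookup table indexed by position, added element-wise to the token embeddings. The table sizes and token IDs below are toy values for illustration.

```python
import random

random.seed(1)
context_length, embed_dim, vocab_size = 4, 3, 10

# One embedding table indexed by token ID, one indexed by position
token_table = [[random.uniform(-1, 1) for _ in range(embed_dim)]
               for _ in range(vocab_size)]
pos_table = [[random.uniform(-1, 1) for _ in range(embed_dim)]
             for _ in range(context_length)]

token_ids = [7, 2, 7, 5]  # token 7 appears at positions 0 and 2

# Input embedding = token embedding + positional embedding (element-wise)
input_embeddings = [
    [t + p for t, p in zip(token_table[tid], pos_table[pos])]
    for pos, tid in enumerate(token_ids)
]

# Same token ID at different positions yields different input vectors
print(input_embeddings[0] != input_embeddings[2])  # True
```

Without the positional term, the two occurrences of token 7 would be indistinguishable to the attention layers, since self-attention by itself has no notion of order.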