:ID: F29CBBCE-BF32-4A7F-A576-A3DA674F540A

Connections: AI

Chapter 1: Understanding Large Language Models

  • Input text → tokenize text → token IDs → token embeddings → GPT decoder-only transformer → postprocessing steps → output text
  • The “large” in large language models refers to the number of parameters
  • The first step in creating an LLM is to train it on a large corpus of text data, sometimes referred to as “raw” text
  • LLM pretraining does not require labeled data; instead it uses self-supervised learning, where the model generates its own labels from the structure of the data itself (e.g., the next word in a sequence serves as the label, as sketched below)
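
A minimal sketch of this self-supervision: the targets are just the input sequence shifted one position to the left. The sentence and variable names here are illustrative, not from any real corpus:

    # Sketch: next-word prediction supplies its own labels.
    # The sentence below is illustrative, not from a training corpus.
    words = ["LLMs", "learn", "to", "predict", "one", "word", "at", "a", "time"]

    inputs = words[:-1]   # every word except the last
    targets = words[1:]   # the same sequence shifted left by one position

    for x, y in zip(inputs, targets):
        print(f"input: {x!r} -> label: {y!r}")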

Transformer Architecture

  • Originally developed for machine translation
  • Consists of two submodules: an encoder and a decoder
  • Both submodules consist of many layers connected by a self-attention mechanism
    • The self-attention mechanism allows the model to weigh the importance of words or tokens in a sequence relative to each other
  • Models like BERT use only the encoder submodule, while models like GPT use only the decoder
  • GPT is considered an “autoregressive” model, meaning it feeds its previous outputs back in as inputs for future predictions (see the generation-loop sketch below)
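
A rough sketch of that autoregressive loop in PyTorch; `model` is a hypothetical stand-in for a trained decoder-only transformer that maps token IDs to next-token logits, not a real API:

    import torch

    def generate_greedy(model, token_ids, max_new_tokens):
        # model: hypothetical trained decoder-only transformer mapping
        # token IDs of shape (batch, seq_len) to logits of shape
        # (batch, seq_len, vocab_size)
        for _ in range(max_new_tokens):
            logits = model(token_ids)
            last_logits = logits[:, -1, :]  # prediction for the next token
            next_id = torch.argmax(last_logits, dim=-1, keepdim=True)  # greedy pick
            token_ids = torch.cat([token_ids, next_id], dim=1)  # output fed back as input
        return token_ids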

Three Main Stages of Coding an LLM

  • Implementing the LLM architecture and data preparation process
  • Pretraining an LLM to create a foundation model
  • Fine-tuning the foundation model to become a personal assistant or text classifier

Chapter 2: Working with Text Data

Data Preparation and Sampling

  • Deep neural networks and LLMs cannot process text directly, so words are represented as continuous-valued vectors (embeddings)
  • There are word, sentence, paragraph, and whole-document embeddings; sentence and paragraph embeddings are useful for retrieval-augmented generation (RAG)
  • Models like GPT-3 use 12,288-dimensional token embeddings
  • Special tokens like “<|unk|>” stand in for words not in the vocabulary, and “<|endoftext|>” marks the boundary between unrelated text sources
  • Byte Pair Encoding (BPE) is a common tokenization scheme, available via the tiktoken package; see the example below
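
A short tiktoken example; the sample string (including the made-up word "Akwirw") is illustrative:

    import tiktoken  # pip install tiktoken

    tokenizer = tiktoken.get_encoding("gpt2")  # the BPE tokenizer used by GPT-2

    text = "Hello, world. <|endoftext|> Akwirw ier"
    ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
    print(ids)
    # Unknown words like "Akwirw" are split into known subword units
    # instead of being mapped to an <|unk|> token, so decoding round-trips:
    print(tokenizer.decode(ids))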

Data Sampling With a Sliding Window

  • LLMs learn to predict one word at a time
  • Example input-target pairs:
    • [LLMs] learn to predict one word at a time
    • [LLMs learn] to predict one word at a time
    • [LLMs learn to] predict one word at a time
  • During training, the model predicts the word that immediately follows the input block (shown in [brackets] above); that next word is the target, as sketched below
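
A sketch of generating such input-target pairs from token IDs; the sentence and `context_size` are illustrative:

    import tiktoken

    tokenizer = tiktoken.get_encoding("gpt2")
    token_ids = tokenizer.encode("LLMs learn to predict one word at a time")

    context_size = 4  # illustrative: how large the input block may grow
    for i in range(1, context_size + 1):
        context = token_ids[:i]   # the input block ([brackets] above)
        desired = token_ids[i]    # the target: the very next token
        print(tokenizer.decode(context), "--->", tokenizer.decode([desired]))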

Creating Token Embeddings

  • The final step in preparing input is converting token IDs into embedding vectors
  • Step 1: Initialize the embedding vectors with random weights; they are then optimized during training
    • Continuous vector representations are required because LLMs like GPT are deep neural networks trained with backpropagation (see the sketch below)
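
A minimal PyTorch sketch of a randomly initialized embedding layer; the vocabulary size and embedding dimension are illustrative toy values:

    import torch

    torch.manual_seed(123)
    vocab_size = 6   # illustrative toy vocabulary
    output_dim = 3   # illustrative embedding dimension

    # The embedding layer's weight matrix starts out randomly initialized
    # and is optimized during training via backpropagation
    embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
    print(embedding_layer.weight)  # shape (6, 3), requires_grad=True

    # Converting token IDs to embedding vectors is a simple row lookup
    token_ids = torch.tensor([2, 3, 5, 1])
    print(embedding_layer(token_ids))  # shape (4, 3)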

Chapter 2 Summary

  • LLMs require text to be converted into numerical vectors since they cannot process raw text
  • Embeddings transform discrete data (words, images) into continuous vector spaces compatible with neural networks
  • Raw text is broken into tokens (words or characters), then converted to integer token IDs
  • Special tokens (e.g., “<|unk|>”, “<|endoftext|>”) handle special cases such as unknown words and boundaries between unrelated texts
  • BPE tokenizers handle unknown words by breaking them into subword units
  • A sliding window approach generates input-target pairs for training
  • Embedding layers in PyTorch act as a lookup, retrieving vectors for token IDs
  • Positional embeddings (absolute or relative) are added to token embeddings to convey token position; OpenAI’s GPT models use absolute positional embeddings that are learned during training (see the sketch below)
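
A sketch of adding absolute positional embeddings to token embeddings; the context length, embedding dimension, and token IDs are illustrative (the vocabulary size matches GPT-2's BPE tokenizer):

    import torch

    vocab_size = 50257    # GPT-2's BPE vocabulary size
    context_length = 4    # illustrative context length
    output_dim = 256      # illustrative embedding dimension

    token_emb = torch.nn.Embedding(vocab_size, output_dim)
    pos_emb = torch.nn.Embedding(context_length, output_dim)  # absolute positions

    token_ids = torch.tensor([[40, 367, 2885, 1464]])  # illustrative IDs, shape (1, 4)
    tok = token_emb(token_ids)                    # (1, 4, 256)
    pos = pos_emb(torch.arange(context_length))   # (4, 256): one vector per position
    input_embeddings = tok + pos                  # broadcast add conveys token position
    print(input_embeddings.shape)                 # torch.Size([1, 4, 256])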