Transformers

TLDR

Now that we understand attention (lol), you should have everything you need to understand Transformers and ChatGPT. 😎

ChatGPT is in the family of decoder-only transformer models. Given some input text, the model just continually asks "what is the next most likely word?", spits that word out, and does it again.
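
If you want to see that loop written out, here's a minimal sketch in Python using the Hugging Face `transformers` library, with GPT-2 standing in for ChatGPT (ChatGPT itself is closed, but it works the same basic way). Strictly speaking the model predicts the next *token*, which is a word or a piece of a word, and this sketch uses greedy decoding: always take the single most likely next token.

```python
# Decoder-only, autoregressive generation: predict the next token, append it, repeat.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The transformer is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                        # generate 20 more tokens
        logits = model(input_ids).logits       # shape: (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()       # the single most likely next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
```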

As you know, the transformer was introduced in a 2017 machine translation paper from Google entitled "Attention is all you need". It is very famous. The paper describes an encoder-decoder model. In such models we take in some data and create a "latent" or "hidden" state with the encoder. That state is just like the hidden state in an RNN. Then, the decoder can look at that state when making its output. That's how we'd translate from English to German, or how we might add a caption to an image (the encoder makes a hidden representation of the image, and then the decoder turns that into words).
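
To make the encoder/decoder split concrete, here's a hedged sketch using a small pretrained English-to-German translation model from Hugging Face (Helsinki-NLP/opus-mt-en-de is just one convenient choice, not the only one). The encoder turns the English sentence into hidden states, and the decoder looks at those states while it writes out the German.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-de")

inputs = tokenizer("Attention is all you need.", return_tensors="pt")

# The encoder's "latent"/"hidden" state: one vector per input token.
encoder_states = model.get_encoder()(**inputs).last_hidden_state
print(encoder_states.shape)            # (1, number_of_input_tokens, hidden_size)

# The decoder generates German tokens one at a time, attending to those states.
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```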

Transformers can do many things! They can do translation, summarization, question answering, classification, and sentiment analysis, and of course, they can generate text. ChatGPT, Llama, Claude, Mixtral, and Gemini are popular "decoder-only" transformer models...the ones that make text.
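
As a rough illustration, the Hugging Face `pipeline` helper lets you try several of these tasks in a few lines. (The models it downloads below are just its defaults; any number of other models would work.)

```python
from transformers import pipeline

# Sentiment analysis / classification (typically an encoder-only, BERT-style model).
sentiment = pipeline("sentiment-analysis")
print(sentiment("This course is going great so far."))

# Extractive question answering over a short passage.
qa = pipeline("question-answering")
print(qa(question="When was the transformer introduced?",
         context="The transformer architecture was introduced by Google researchers in 2017."))

# Text generation with a decoder-only model (GPT-2 here).
generator = pipeline("text-generation", model="gpt2")
print(generator("Transformers can", max_new_tokens=15))
```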

Kyle's example code

Further reading

  1. But what is a GPT? by 3Blue1Brown. As always, incredible material from 3B1B.
  2. The Illustrated Transformer, by Jay Alammar. This is the conceptual version of the "Attention is all you need" paper. You don’t need to understand everything here. Just try to follow along.

Advanced reading

  1. Attention is All You Need by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. This is the original transformer paper. It has been cited 100k times in just 7 years.

  2. Improving Language Understanding by Generative Pre-Training by Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. This is the paper that introduced the term “GPT”.

  3. Language Models are Few-Shot Learners by OpenAI. This is the paper that introduced GPT-3. People started to get scared 😉 around this point.

  4. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. This is a super famous “encoder-only” paper. Most classification tasks on text are now BERT-based. We will use some of these next week.

  5. OpenAI's GPT-3 Language Model: A Technical Overview

  6. Watch an A.I. Learn to Write by Reading Nothing but ... by Aatish Bhatia, NYT