Reasoning LLMs
- Understanding Reasoning LLMs - Methods and Strategies for Building and Refining Reasoning Models.
LLMs Explained
- LLM Visualization. A great website that shows how an LLM works.
- DeepSeek Explanation: a 52-page slide deck.
Transformers Explained
- Explaining Transformers as Simple as Possible through a Small Language Model
- How Transformer LLMs Work (course by DeepLearning.AI). Instructors: Jay Alammar and Maarten Grootendorst.
- Attention is all you need and much more
- Mixture of Experts vs. Transformers.
- MoE uses multiple “experts” to improve Transformer models (see the sketch below).
- The two mainly differ in the decoder block:
- A standard Transformer block uses a single feed-forward network.
- An MoE block uses several expert feed-forward networks, each smaller than the Transformer's single one, with a router selecting which expert processes each token.
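A minimal PyTorch sketch of that difference, assuming a simple top-1 router; the class names (`DenseFFN`, `MoEFFN`) and dimensions are illustrative, not taken from any of the resources above:

```python
import torch
import torch.nn as nn

class DenseFFN(nn.Module):
    """The single feed-forward network inside a standard Transformer block."""
    def __init__(self, d_model=512, d_hidden=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model)
        )

    def forward(self, x):  # x: (tokens, d_model)
        return self.net(x)

class MoEFFN(nn.Module):
    """Replaces the dense FFN with several smaller experts plus a top-1 router."""
    def __init__(self, d_model=512, d_expert_hidden=512, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # scores each expert per token
        self.experts = nn.ModuleList(
            [DenseFFN(d_model, d_expert_hidden) for _ in range(n_experts)]
        )

    def forward(self, x):  # x: (tokens, d_model)
        gate = self.router(x).softmax(dim=-1)  # (tokens, n_experts)
        top1 = gate.argmax(dim=-1)             # chosen expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top1 == i
            if mask.any():
                # scale by the gate value so routing stays differentiable
                out[mask] = gate[mask, i].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(10, 512)
print(DenseFFN()(tokens).shape, MoEFFN()(tokens).shape)  # both torch.Size([10, 512])
```

Each expert only sees the tokens routed to it, which is how MoE grows total parameter count while keeping per-token compute close to the dense case.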
LSTM Explained
- Specifically designed to avoid the vanishing and exploding gradient problems.
- It achieves this through a gated structure and the “constant error carousel”: a recurrent path through the cell state with an effectively constant weight, so backpropagated errors neither vanish nor explode over many steps.
- The gated structure controls information flow via three gates: forget, input, and output (see the sketch at the end of this section).
- It separates long-term memory (the cell state c) from short-term memory (the hidden state h).
- Understand LSTM before learning about Transformers.
- Limitation: in seq2seq setups the encoder is separate from the decoder and passes it only a fixed-size state, and processing is strictly sequential, so the approach cannot scale.
- What makes Transformers better: they overcame the sequential processing limitations of RNNs/LSTMs. This enables MUCH faster training through GPU parallelization.
- LSTMs are also a bridge between NLP and time-series forecasting.
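A minimal PyTorch sketch of one LSTM step, with illustrative names and sizes (not taken from the resources below); it shows the three gates, the cell state c vs. the hidden state h, the additive cell update behind the “constant error carousel”, and the sequential loop that Transformers eliminate:

```python
import torch
import torch.nn as nn

class LSTMCellSketch(nn.Module):
    def __init__(self, d_in, d_hidden):
        super().__init__()
        # one linear layer producing all four pre-activations at once
        self.proj = nn.Linear(d_in + d_hidden, 4 * d_hidden)

    def forward(self, x, h, c):
        z = self.proj(torch.cat([x, h], dim=-1))
        f, i, o, g = z.chunk(4, dim=-1)
        f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)  # forget, input, output gates
        g = torch.tanh(g)      # candidate update
        c = f * c + i * g      # additive update: the "constant error carousel" path
        h = o * torch.tanh(c)  # short-term memory exposed to the next layer / time step
        return h, c

# The sequential loop below is exactly what Transformers remove:
# step t cannot start before step t-1, so it cannot be parallelized over time.
cell = LSTMCellSketch(d_in=8, d_hidden=16)
h = c = torch.zeros(1, 16)
for x_t in torch.randn(5, 1, 8):  # 5 time steps, batch size 1
    h, c = cell(x_t, h, c)
print(h.shape, c.shape)           # torch.Size([1, 16]) torch.Size([1, 16])
```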
LSTM Resources
- Understanding the LSTM Layer (LinkedIn post).
- Stanford course CS224N: Natural Language Processing with Deep Learning.