Reasoning LLMs
- Understanding Reasoning LLMs - Methods and Strategies for Building and Refining Reasoning Models.
LLMs Explained
- LLM Visualization. A great website that shows how an LLM works.
- DeepSeek Explanation: a 52-page slide deck.
Transformers Explained
- Explaining Transformers as Simple as Possible through a Small Language Model
- How Transformer LLMs Work (course by DeepLearning.AI). Instructors: Jay Alammar and Maarten Grootendorst.
- Attention is all you need and much more
- Mixture of Experts vs. Transformers.
- MoE uses multiple “experts” to improve Transformer models (see the sketch below).
- The two mainly differ in the decoder block:
- A standard Transformer block uses a single feed-forward network.
- An MoE block uses several expert feed-forward networks, each smaller than the Transformer's single one, with a router selecting which expert processes each token.
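A minimal PyTorch sketch of that difference, assuming a simple top-1 router; the class names (`DenseFFN`, `MoEFFN`) and dimensions are illustrative, not taken from any of the resources above:

```python
import torch
import torch.nn as nn

class DenseFFN(nn.Module):
    """The single feed-forward network inside a standard Transformer block."""
    def __init__(self, d_model=512, d_hidden=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model)
        )

    def forward(self, x):  # x: (tokens, d_model)
        return self.net(x)

class MoEFFN(nn.Module):
    """Replaces the dense FFN with several smaller experts plus a top-1 router."""
    def __init__(self, d_model=512, d_expert_hidden=512, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # scores each expert per token
        self.experts = nn.ModuleList(
            [DenseFFN(d_model, d_expert_hidden) for _ in range(n_experts)]
        )

    def forward(self, x):  # x: (tokens, d_model)
        gate = self.router(x).softmax(dim=-1)  # (tokens, n_experts)
        top1 = gate.argmax(dim=-1)             # chosen expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top1 == i
            if mask.any():
                # scale by the gate value so routing stays differentiable
                out[mask] = gate[mask, i].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(10, 512)
print(DenseFFN()(tokens).shape, MoEFFN()(tokens).shape)  # both torch.Size([10, 512])
```

Each expert only sees the tokens routed to it, which is how MoE grows total parameter count while keeping per-token compute close to the dense case.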
LSTM Explained
- Specifically designed to avoid the vanishing and exploding gradient problems.
- It achieves this through a gated structure and the “constant error carousel”: a recurrent path through the cell state with an effectively constant weight, so backpropagated errors neither vanish nor explode over many steps.
- The gated structure controls information flow via three gates: forget, input, and output (see the sketch at the end of this section).
- It separates long-term memory (the cell state c) from short-term memory (the hidden state h).
- Understand LSTM before learning about Transformers.
- Limitation: in seq2seq setups the encoder is separate from the decoder and passes it only a fixed-size state, and processing is strictly sequential, so the approach cannot scale.
- What makes Transformers better: they overcame the sequential processing limitations of RNNs/LSTMs. This enables MUCH faster training through GPU parallelization.
- LSTMs are also a bridge between NLP and time-series forecasting.
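A minimal PyTorch sketch of one LSTM step, with illustrative names and sizes (not taken from the resources below); it shows the three gates, the cell state c vs. the hidden state h, the additive cell update behind the “constant error carousel”, and the sequential loop that Transformers eliminate:

```python
import torch
import torch.nn as nn

class LSTMCellSketch(nn.Module):
    def __init__(self, d_in, d_hidden):
        super().__init__()
        # one linear layer producing all four pre-activations at once
        self.proj = nn.Linear(d_in + d_hidden, 4 * d_hidden)

    def forward(self, x, h, c):
        z = self.proj(torch.cat([x, h], dim=-1))
        f, i, o, g = z.chunk(4, dim=-1)
        f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)  # forget, input, output gates
        g = torch.tanh(g)      # candidate update
        c = f * c + i * g      # additive update: the "constant error carousel" path
        h = o * torch.tanh(c)  # short-term memory exposed to the next layer / time step
        return h, c

# The sequential loop below is exactly what Transformers remove:
# step t cannot start before step t-1, so it cannot be parallelized over time.
cell = LSTMCellSketch(d_in=8, d_hidden=16)
h = c = torch.zeros(1, 16)
for x_t in torch.randn(5, 1, 8):  # 5 time steps, batch size 1
    h, c = cell(x_t, h, c)
print(h.shape, c.shape)           # torch.Size([1, 16]) torch.Size([1, 16])
```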
LSTM Resources
- Understanding the LSTM Layer (LinkedIn post).
- Stanford course CS224N: Natural Language Processing with Deep Learning.