Resources
- On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective
- Softmax attention has become the backbone of modern transformer architectures.
- The softmax nonlinearity is applied to the inner product of the query q and key k (see the sketch after this list).
- Good: expressiveness, scalability.
- Bad: quadratic memory and computational complexity in sequence length.
- Approach 1: replace softmax attention (SA) with a linear form of attention –> but downstream accuracy suffers (see the RNN-style linear-attention sketch after this list).
- This paper breaks SA down into components that can be described in the language of RNNs –> helps explain why SA is more expressive than its linear counterparts.
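
A minimal sketch of single-head softmax attention, just to make the note about the nonlinearity and the quadratic cost concrete; the function name and shapes are illustrative, not from the paper.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Q, K, V: arrays of shape (T, d), one row per sequence position."""
    d = Q.shape[-1]
    # The softmax nonlinearity acts on the query-key inner products.
    scores = Q @ K.T / np.sqrt(d)            # (T, T): quadratic in sequence length T
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                       # (T, d)

T, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
print(softmax_attention(Q, K, V).shape)      # (8, 4)
```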
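
For contrast, a sketch of causal linear attention written as an RNN with a matrix-valued hidden state, which is the kind of recurrent description the paper uses to compare expressiveness; the feature map `phi` (elu + 1) is a common generic choice, not the paper's exact formulation.

```python
import numpy as np

def phi(x):
    # Positive feature map replacing the softmax (elu(x) + 1).
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_rnn(Q, K, V):
    """Causal linear attention as a recurrence: O(T) time/memory in sequence length."""
    T, d = Q.shape
    S = np.zeros((d, d))       # matrix-valued state: running sum of outer(phi(k), v)
    z = np.zeros(d)            # normalizer state: running sum of phi(k)
    out = np.zeros_like(V)
    for t in range(T):         # one recurrent update per position
        q, k, v = phi(Q[t]), phi(K[t]), V[t]
        S += np.outer(k, v)
        z += k
        out[t] = (q @ S) / (q @ z + 1e-9)
    return out
```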