- Yann LeCun’s point.
- LLM: 1E13 tokens x 2 bytes/token = 2E13 bytes (a token is ~0.75 word, so ~7.5E12 words).
- 4-year-old child: 16k wake hours x 3600 s/hour x 1E6 optic nerve fibers/eye x 2 eyes x ~10 bytes/s per fiber ≈ 1E15 bytes.
- In 4 years, a child has taken in roughly 50 times more data than the biggest LLMs are trained on (1E15 / 2E13 ≈ 50).
- 1E13 tokens is pretty much all the quality text publicly available on the Internet. At 250 words/minute, 8 h/day, it would take a human ~170k years to read.
- Text is simply too low-bandwidth and too scarce a modality for learning how the world works.
- Video is more redundant, but redundancy is precisely what you need for Self-Supervised Learning to work well.
- Incidentally, 16k hours of video is about 30 minutes’ worth of YouTube uploads (at roughly 500 hours uploaded per minute). The arithmetic above is rechecked in the sketch below.
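
A minimal Python sketch rechecking these back-of-envelope numbers; all constants are the rough estimates quoted in the list above, not measured values:

```python
# Back-of-envelope check of LeCun's data-volume comparison.
# Every constant below is a rough estimate from the notes above.

TOKENS = 1e13            # tokens in a big LLM's training set
BYTES_PER_TOKEN = 2      # ~2 bytes per token
WORDS_PER_TOKEN = 0.75   # ~0.75 word per token

llm_bytes = TOKENS * BYTES_PER_TOKEN  # 2e13 bytes

WAKE_HOURS = 16_000      # hours a 4-year-old has been awake (~11 h/day)
FIBERS_PER_EYE = 1e6     # optic nerve fibers per eye
EYES = 2
BYTES_PER_FIBER_S = 10   # ~10 bytes/s carried per fiber

child_bytes = WAKE_HOURS * 3600 * FIBERS_PER_EYE * EYES * BYTES_PER_FIBER_S
print(f"LLM:   {llm_bytes:.1e} bytes")           # ~2.0e13
print(f"Child: {child_bytes:.1e} bytes")         # ~1.2e15
print(f"Ratio: {child_bytes / llm_bytes:.0f}x")  # ~58x, i.e. "about 50x"

# Time for a human to read 1e13 tokens at 250 words/minute, 8 h/day.
words = TOKENS * WORDS_PER_TOKEN                 # ~7.5e12 words
years = words / 250 / 60 / 8 / 365
print(f"Reading time: {years:.0f} years")        # ~171,000 years

# 16k hours of video vs YouTube's upload rate (~500 hours per minute).
print(f"YouTube uploads: {WAKE_HOURS / 500:.0f} minutes")  # ~32 minutes
```

Running it gives a child/LLM ratio of ~58x, which rounds to the “50 times more data” claim, and ~171k years of reading, matching the ~170k figure above.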