- Yann LeCun’s point.
- LLM: 1E13 tokens x 2 bytes/token = 2E13 bytes (a token is ~0.75 word, so ~7.5E12 words).
- 4-year-old child: 16k wake hours x 3600 s/hour x 1E6 optic nerve fibers/eye x 2 eyes x ~10 bytes/s per fiber ≈ 1E15 bytes.
- In 4 years, a child has taken in roughly 50 times more data than the biggest LLMs are trained on (1E15 / 2E13 ≈ 50).
- 1E13 tokens is pretty much all the quality text publicly available on the Internet. At 250 words/minute, 8 h/day, it would take a human ~170k years to read.
- Text is simply too low-bandwidth and too scarce a modality for learning how the world works.
- Video is more redundant, but redundancy is precisely what you need for Self-Supervised Learning to work well.
- Incidentally, 16k hours of video is about 30 minutes’ worth of YouTube uploads (at roughly 500 hours uploaded per minute). The arithmetic above is rechecked in the sketch below.
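
A minimal Python sketch rechecking these back-of-envelope numbers; all constants are the rough estimates quoted in the list above, not measured values:

```python
# Back-of-envelope check of LeCun's data-volume comparison.
# Every constant below is a rough estimate from the notes above.

TOKENS = 1e13            # tokens in a big LLM's training set
BYTES_PER_TOKEN = 2      # ~2 bytes per token
WORDS_PER_TOKEN = 0.75   # ~0.75 word per token

llm_bytes = TOKENS * BYTES_PER_TOKEN  # 2e13 bytes

WAKE_HOURS = 16_000      # hours a 4-year-old has been awake (~11 h/day)
FIBERS_PER_EYE = 1e6     # optic nerve fibers per eye
EYES = 2
BYTES_PER_FIBER_S = 10   # ~10 bytes/s carried per fiber

child_bytes = WAKE_HOURS * 3600 * FIBERS_PER_EYE * EYES * BYTES_PER_FIBER_S
print(f"LLM:   {llm_bytes:.1e} bytes")           # ~2.0e13
print(f"Child: {child_bytes:.1e} bytes")         # ~1.2e15
print(f"Ratio: {child_bytes / llm_bytes:.0f}x")  # ~58x, i.e. "about 50x"

# Time for a human to read 1e13 tokens at 250 words/minute, 8 h/day.
words = TOKENS * WORDS_PER_TOKEN                 # ~7.5e12 words
years = words / 250 / 60 / 8 / 365
print(f"Reading time: {years:.0f} years")        # ~171,000 years

# 16k hours of video vs YouTube's upload rate (~500 hours per minute).
print(f"YouTube uploads: {WAKE_HOURS / 500:.0f} minutes")  # ~32 minutes
```

Running it gives a child/LLM ratio of ~58x, which rounds to the “50 times more data” claim, and ~171k years of reading, matching the ~170k figure above.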