
Model Distillation

15 Feb 2025 » ai, llms, distillation
  • Yann LeCun’s point.
    • LLM: 1E13 tokens x 0.75 words/token x 2 bytes/word ≈ 1.5E13 bytes ≈ 1E13 bytes.
    • 4-year-old child: 16k wake hours x 3600 s/hour x 1E6 optic nerve fibers x 2 eyes x 10 bytes/s ≈ 1E15 bytes.
    • In 4 years, a child has seen 50 times more data than the biggest LLMs.
    • 1E13 tokens is pretty much all the quality text publicly available on the Internet. It would take a human 170k years to read (8 h/day, 250 words/minute).
    • Text is simply too low-bandwidth and too scarce a modality for learning how the world works.
    • Video is more redundant, but redundancy is precisely what you need for Self-Supervised Learning to work well.
    • Incidentally, 16k hours of video is about 30 minutes of YouTube uploads (YouTube receives roughly 500 hours of uploads per minute); the arithmetic behind these figures is checked in the sketch after this list.
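
A quick back-of-envelope check of the numbers above, as a minimal Python sketch. The constants (words per token, bytes per word, optic-fiber count and per-fiber rate, reading speed, YouTube upload rate) are the rough assumptions quoted in the list, not precise measurements:

```python
# Back-of-envelope check of the data-volume comparison above.
# All constants are the assumptions quoted from LeCun's point, not measurements.

# LLM training data
tokens = 1e13                   # tokens in a large LLM's training set
words_per_token = 0.75          # assumed average
bytes_per_word = 2              # assumed average
llm_bytes = tokens * words_per_token * bytes_per_word        # ~1.5e13 bytes

# A 4-year-old child's visual input
wake_hours = 16_000             # waking hours in the first 4 years
seconds = wake_hours * 3600
optic_fibers = 1e6              # optic nerve fibers per eye
eyes = 2
bytes_per_fiber_per_s = 10      # assumed effective rate per fiber
child_bytes = seconds * optic_fibers * eyes * bytes_per_fiber_per_s  # ~1.15e15 bytes

print(f"LLM:   {llm_bytes:.2e} bytes")
print(f"Child: {child_bytes:.2e} bytes")
print(f"Ratio: {child_bytes / llm_bytes:.0f}x")              # on the order of 50-100x

# How long would it take a human to read 1e13 tokens?
words = tokens * words_per_token
minutes = words / 250           # 250 words/minute
days = minutes / 60 / 8         # 8 reading hours per day
print(f"Reading time: {days / 365:.0f} years")               # ~170,000 years

# 16k hours of video vs the YouTube upload rate (~500 hours uploaded per minute)
print(f"16k hours of video = {wake_hours / 500:.0f} minutes of YouTube uploads")  # ~32 min
```

Running this gives roughly 1.5E13 bytes for the LLM, about 1.2E15 bytes for the child, a ratio in the 50–100x range (depending on whether the LLM side is counted as 1E13 or 2E13 bytes), a reading time of about 170,000 years, and about 32 minutes of YouTube uploads.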
