Why you need to understand the three-segment Zipf's law if you're working on NLP
“A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?”
The answer, of course, is 5 cents. But almost everyone's first inclination is to say 10 cents, because 10 cents feels about right: it is the right order of magnitude, and it is suggested by the framing of the problem. That answer comes from the fast, intuitive side of your brain. But it's wrong. The right answer requires the slower, more calculating part of your brain.
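For the slow, calculating route, the algebra takes two lines. Here is a trivial sketch of it (the variable names are mine, not part of the puzzle):

```python
# Let the ball cost x. The bat costs x + 1.00, so together they
# cost 2x + 1.00 = 1.10, which gives x = 0.05: the ball is 5 cents.
ball = (1.10 - 1.00) / 2
bat = ball + 1.00
assert round(ball + bat, 2) == 1.10  # the pair still totals $1.10
print(f"ball = ${ball:.2f}, bat = ${bat:.2f}")
```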
That distinction should have interesting consequences for computer scientists working on natural language processing. The field has benefited from huge advances in recent years, which have come from machine-learning algorithms but also from the large databases of text gathered by companies like Google.
Back in 1935, the American linguist George Zipf made a remarkable discovery. Zipf was curious about the relationship between common words and less common ones. So he counted how often words occur in ordinary language and then ordered them according to their frequency.
This revealed a striking regularity: the frequency of a word is inversely proportional to its place in the rankings, so frequency ∝ 1/rank. A word that is second in the ranking appears half as often as the most common word, the third-ranked word appears one-third as often, and so on.
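This is easy to check on any sizable text. Here is a minimal sketch (the corpus file name and the crude tokenizer are my assumptions, not part of Zipf's original method):

```python
from collections import Counter
import re

def zipf_check(text, top=10):
    """Rank words by frequency and compare each observed count
    with the 1/rank prediction of Zipf's law."""
    words = re.findall(r"[a-z']+", text.lower())  # crude tokenizer
    ranked = Counter(words).most_common(top)
    top_freq = ranked[0][1]  # frequency of the most common word
    for rank, (word, freq) in enumerate(ranked, start=1):
        predicted = top_freq / rank  # Zipf: frequency ∝ 1/rank
        print(f"{rank:>3}  {word:<12} observed={freq:<7} predicted≈{predicted:.0f}")

# Hypothetical input file; any large plain-text corpus will do.
with open("corpus.txt") as f:
    zipf_check(f.read())
```

On a large enough corpus, the observed and predicted columns track each other surprisingly closely, which is what made Zipf's discovery so striking.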
In a recent paper, Yu and colleagues say that word frequencies in languages share a common structure, one that differs from the structure statistical errors would produce. What's more, they say this structure suggests that the brain processes common words differently from uncommon ones, an idea that has important consequences for natural language processing and the automatic generation of text.
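To make the idea of a segmented structure concrete, here is one hedged sketch of how such structure could show up in data: fit a power-law exponent separately over different bands of ranks. The band boundaries below are illustrative choices of mine, not the paper's.

```python
import numpy as np

def band_exponent(freqs, lo, hi):
    """Fit log(frequency) against log(rank) over ranks lo..hi.
    freqs is sorted descending, so freqs[r - 1] is the count at rank r.
    A pure Zipf distribution gives a slope near -1 in every band."""
    ranks = np.arange(lo, hi + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs[lo - 1:hi]), 1)
    return slope

# Illustrative usage: if the rank-frequency curve really is segmented,
# the fitted exponents will differ noticeably between bands.
# freqs = np.array(sorted(word_counts, reverse=True))
# print(band_exponent(freqs, 1, 100))       # most common words
# print(band_exponent(freqs, 100, 5000))    # mid-frequency words
# print(band_exponent(freqs, 5000, 50000))  # rare words
```

A single straight line on a log-log plot is the classic Zipf picture; slopes that change from band to band are the kind of departure a segmented structure would produce.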