mathematical linguistics for high school students

[re-posted from my Blogspot blog, May 2014]

I received the following email this weekend:

I’m a high school junior from southern California.
 
For our final project in AP Calculus class, I’m doing a
presentation on the connection between mathematics and linguistics, and I
stumbled on your blogpost “Why Linguists Should Study Math” while
researching my topic.
 
I was wondering if you could point me towards some resources
(that are relatively easy to understand) about how math is present in and
affects our written and spoken language.
Some things that I am considering are:
– the occurrences of words in our language
– how grammar uses mathematical principles
– algorithms we use to construct sentences
 
Thanks,
M.

My [edited] response (suggestions from y’all as to better resources are much appreciated; I’ll forward; I wanted to get a response out quickly because the final is presumably fast approaching):

M.,
 
Thanks for reaching out to me. Of course, I think you’ve
chosen a good topic. There are two broad ways in which linguistics and math
intersect:
  • How the human brain uses math in natural language (psycholinguistics)
  • How linguists use math to study and model languages (computational linguistics)

From your email, it appears you are mostly interested in the first.
However, in contemporary linguistics, the two are fast becoming one. Most
contemporary linguists use math as a tool.

 

 
Let me address your three areas of interest with respect to
how the human brain might use math to process and produce language:
 
The occurrences of words in our language: For the most part, this means “frequency,” which
really means counting. Linguists love to count. We use large corpora of texts
to count words and phrases. Lancaster University in the UK is a well-known
corpus linguistics school. Their web page has a lot of good introductory
information (although I find it a bit clunky looking).

UPDATE: I forgot to include the one item that most directly answers the basic question: frequency effects in language. Humans are very aware of how often they hear words. In some way, we count words automatically. Even if it’s not a specific count like 75, we somehow know which words, phonemes, and syntactic structures we hear or read more often than others. This gives rise to a variety of frequency effects in language processing. This is the clearest example of how the brain uses math for language.

For example, we recognize high frequency words much faster than low frequency words. The website for Paul Warren’s book “Introducing Psycholinguistics” has an online demo for a word frequency task you can walk through to see how linguists study this.

What do linguists count?
  • Words: I’m sure
    you’ve seen word clouds like Wordle. A word cloud is built from simple word frequency counts. One of the most enduring
    facts about word counts is Zipf’s Law, which says “the most frequent word [in a corpus of texts] will occur
    approximately twice as often as the second most frequent word, three times as
    often as the third most frequent word, etc.” Why would this be true? Linguists
    have been studying this for decades.
  • Ngrams: sets of
    two-word, three-word, four-word strings, etc. This helps provide more context
    than mere single word frequencies. Have some fun playing around with Google’s
    Ngram Viewer if you haven’t already.
    Try plotting the change in frequency of “mathematical linguistics” and “corpus
    linguistics” (paste those two phrases into the search box with no quotes and
    only a comma separating them). Scholars are trying to use this to plot changes
    in culture. For example, take a look at this PDF.
  • Other: We
    count many other things too, like parts of speech (verbs, nouns, prepositions,
    etc.). We also count the co-occurrence of linguistic items that are not right
    next to each other. If you want to dig into more frequency fun, check out the
    more advanced tools at BYU.
    You can read more about how these tools help us study language here.
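
If you know a little Python, the counting described above can be sketched in a few lines. This is only a toy illustration, not how any of the tools mentioned here actually work: the sentence, the crude tokenizer, and the function names (tokenize, ngrams, zipf_table) are all made up for the example. It counts words and two-word ngrams, then compares each word’s count with what Zipf’s Law would predict from the top word’s count.

```python
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words (a deliberately crude tokenizer)."""
    cleaned = "".join(ch if ch.isalpha() or ch == "'" else " " for ch in text.lower())
    return cleaned.split()

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def zipf_table(tokens, top=5):
    """Pair each observed count with the Zipf prediction (top count divided by rank)."""
    ranked = Counter(tokens).most_common(top)
    top_count = ranked[0][1]
    return [(rank, word, count, top_count / rank)
            for rank, (word, count) in enumerate(ranked, start=1)]

# Toy text invented for illustration; real studies use corpora with millions of words.
text = "the cat saw the dog and the dog saw the cat and the bird"
tokens = tokenize(text)

word_counts = Counter(tokens)               # single-word frequencies
bigram_counts = Counter(ngrams(tokens, 2))  # two-word ngram frequencies

for rank, word, observed, predicted in zipf_table(tokens):
    print(f"{rank}  {word:<5} observed={observed}  zipf_prediction={predicted:.1f}")
print(bigram_counts.most_common(3))
```

On a text this small the observed counts will not follow the Zipf prediction closely. The pattern only emerges in large corpora, so a fun experiment is to paste in a whole book from a public-domain source and see how close it gets.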
 
How grammar uses mathematical principles: One of the most commonly studied types of
mathematical principle in language is statistical learning. A good example of
this is transitional probabilities, which are sets of probabilities for what
linguistic item might come next given a string of items (e.g., words or
phonemes). For example, if you read “The author signed the _______”, you could
guess what the blank word is based on the previous four words (most likely,
it’s “book”). This example is based on
psycholinguistic tests called “Cloze tests”. Linguists have discovered that the brain tracks transitional probabilities
for all kinds of linguistic items. In fact, this is one of the most robust
areas of study in language acquisition. Linguists study how babies use
transitional probabilities to learn language. For example, one of the most
challenging problems is figuring out how babies learn to separate a continuous stream of audio noise coming into their ears into separate words, without any
knowledge of what words are or what they mean. One theory is that babies quickly learn transitional probabilities of sounds
that tell them where one word ends and another begins. But transitional
probabilities alone are not enough. For a challenge, try reviewing this PDF.
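
To make the word-boundary idea concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the three nonsense “words”, the fixed order they appear in, and the 0.9 cutoff are invented, and real speech is far messier than this idealized stream. The code estimates the probability of each syllable given the one before it, then guesses a word boundary wherever that probability drops.

```python
from collections import Counter

def transitional_probabilities(syllables):
    """Estimate P(next syllable | current syllable) for each adjacent pair."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])  # the last syllable has no successor
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}

def segment(syllables, probs, threshold):
    """Start a new word wherever the transitional probability falls below threshold."""
    words, current = [], [syllables[0]]
    for a, b in zip(syllables, syllables[1:]):
        if probs[(a, b)] < threshold:
            words.append("".join(current))
            current = []
        current.append(b)
    words.append("".join(current))
    return words

# Three invented nonsense words, strung together with no pauses between them.
lexicon = {"A": ["bi", "da", "ku"], "B": ["pa", "do", "ti"], "C": ["go", "la", "bu"]}
order = "ABCACBABCBACABCACB"  # fixed order so the run is reproducible
stream = [syllable for key in order for syllable in lexicon[key]]

probs = transitional_probabilities(stream)
print(probs[("bi", "da")])   # inside a word
print(probs[("ku", "pa")])   # across a word boundary
print(segment(stream, probs, threshold=0.9))
```

In this stream every within-word transition has probability 1.0 and every transition across a word boundary is 0.6 or lower, so the cutoff recovers the words exactly. That clean separation is a property of the toy data. Natural language rarely cooperates so neatly, which is one reason transitional probabilities alone are not enough.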
 
Algorithms we use to construct sentences: This is the most controversial area you’ve asked about.
The fact is, we linguists don’t really know how the brain constructs sentences.
As I mentioned above, there are models based on transitional probabilities, like
Markov models, which are algorithms designed to make the same kinds of guesses
we made about “book”. Markov models and Cloze tests are a good example of psycholinguistics and
computational linguistics coming together. As a theoretical contrast to
statistical models, there are rule-based models like formal grammars.
These are not mathematical in a typical sense, but they are based on formal
logic, which is the underlying foundation of mathematics. Linguistics is in the
middle of a war between the formal grammar camp and the statistical grammar
camp. There’s no consensus on which is the *correct* model of language.
However, in the last decade or so, the statistical side seems to have gained
the advantage. If you really want to dig into this war, here’s a challenging
read.
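
For a feel of what the statistical side looks like in practice, here is a small Python sketch of a Markov model that guesses the next word. It is a toy under stated assumptions: the five training sentences are invented, the function names (train, next_word_probabilities, predict) are my own, and nobody is claiming the brain works this way. The “order” is how many previous words the model is allowed to look at.

```python
from collections import Counter, defaultdict

def train(sentences, order):
    """Count which word follows each run of `order` consecutive words."""
    model = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - order):
            context = tuple(words[i:i + order])
            model[context][words[i + order]] += 1
    return model

def next_word_probabilities(model, context_words):
    """Turn the follow-counts for one context into probabilities."""
    counts = model.get(tuple(context_words), Counter())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}

def predict(model, context_words):
    """Return the most frequent next word, or None for an unseen context."""
    counts = model.get(tuple(context_words))
    return counts.most_common(1)[0][0] if counts else None

# Invented training sentences; a real model would be trained on a large corpus.
sentences = [
    "the author signed the book",
    "the famous author signed the book",
    "the lawyer signed the contract",
    "the student read the book",
    "the lawyer read the contract",
]

one_word = train(sentences, order=1)
three_words = train(sentences, order=3)

print(predict(one_word, ["the"]))
print(predict(three_words, ["author", "signed", "the"]))
print(predict(three_words, ["lawyer", "signed", "the"]))
print(next_word_probabilities(train(sentences, order=2), ["signed", "the"]))
```

With only one word of context, the model guesses “book” after “the” no matter who is doing the signing. With three words of context, it guesses “book” for the author and “contract” for the lawyer. More context means better guesses, but also many more contexts the model has never seen, and that trade-off is a big part of the debate.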
 
Additional Reading:
Linguists who count (the comments are especially engaging;
your teacher might be particularly interested in the calculus vs. algebra debate that
ensues).
 
 
I hope this gets you off to a good start. Please don’t
hesitate to ask for clarifications or more resources (especially let me know if
you need more intro level or more advanced level; I wasn’t sure if I hit the
level right or not). I’m happy to be of more assistance if I can. As a smart,
dedicated student, I’m sure you’re ready to dig into ngrams and Markov models.
But, as a high school junior in southern California with June fast approaching,
I’m also sure you’re ready for the beach. Both are required for a healthy life
of the mind.