mathematical linguistics for high school students

[re-posted from my Blogspot blog, May 2014]

I received the following email this weekend:

I’m a high school junior from southern California.
For our final project in AP Calculus class, I’m doing a
presentation on the connection between mathematics and linguistics, and I
stumbled on your blogpost “Why Linguists Should Study Math” while
researching my topic.
I was wondering if you could point me towards some resources
(that are relatively easy to understand) about how math is present in and
affects our written and spoken language.
Some things that I am considering are:
– the occurrences of words in our language
– how grammar uses mathematical principles
– algorithms we use to construct sentences

My [edited] response (suggestions from y’all as to better resources are much appreciated; I’ll forward; I wanted to get a response out quickly because the final is presumably fast approaching):

Thanks for reaching out to me. Of course, I think you’ve
chosen a good topic. There are two broad ways in which linguistics and math intersect:
  • How the human brain uses math in natural language (psycholinguistics)
  • How linguists use math to study and model languages (computational linguistics)

From your email, it appears you are mostly interested in the first.
However, in contemporary linguistics, the two are fast becoming one. Most
contemporary linguists use math as a tool.


Let me address your three areas of interest with respect to
how the human brain might use math to process and produce language:
The occurrences of words in our language: For the most part, this means “frequency” which
really means counting. Linguists love to count. We use large corpora of texts
to count words and phrases. Lancaster University in the UK is a well-known
corpus linguistics school. Their web page has a lot of good introductory
information (although I find it a bit clunky looking).

UPDATE: I forgot to include the one item that most directly answers the basic question: frequency effects in language. Humans are very aware of how often they hear words. In some way, we count words automatically. Even if it’s not a specific count like 75, we somehow know which words, phonemes, and syntactic structures we hear or read more than others. This gives rise to a variety of frequency effects in language processing, and it is the clearest example of how the brain uses math for language.

For example, we recognize high-frequency words much faster than low-frequency words. The website for Paul Warren’s book “Introducing Psycholinguistics” has an online demo for a word frequency task you can walk through to see how linguists study this.

What do linguists count?
  • Words: I’m sure
    you’ve seen word clouds like Wordle. These are built from simple word frequency counts. One of the most enduring
    facts about word counts is Zipf’s Law, which says “the most frequent word [in a corpus of texts] will occur
    approximately twice as often as the second most frequent word, three times as
    often as the third most frequent word, etc.” Why would this be true? Linguists
    have been studying this for decades.
  • Ngrams: sets of
    two-word, three-word, four-word strings, etc. These provide more context
    than mere single word frequencies. Have some fun playing around with Google’s
    Ngram Viewer if you haven’t already.
    Try plotting the change in frequency of “mathematical linguistics” and “corpus
    linguistics” (paste those two phrases into the search box with no quotes and
    only a comma separating them). Scholars are trying to use this to plot changes
    in culture. For example, take a look at this PDF.
  • Other: We
    count many other things too, like parts of speech (verbs, nouns, prepositions,
    etc.). We also count the co-occurrence of linguistic items that are not right
    next to each other. If you want to dig into more frequency fun, check out the
    more advanced tools at BYU.
    You can read more about how these tools help us study language here.
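To make the counting concrete, here is a minimal sketch in Python of word counts, ngram counts, and a Zipf-style comparison. The toy sentence and the function names (tokenize, ngram_counts, zipf_table) are my own illustration and are not taken from any of the tools mentioned above. A real corpus has millions of words; a one-sentence example is far too small to expect Zipf’s Law to hold closely.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (n=2 gives two-word strings)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def zipf_table(counts, top=5):
    """For the `top` most frequent words, pair the observed count with the
    count Zipf's Law would predict: (count of the top word) / rank."""
    ranked = counts.most_common(top)
    top_count = ranked[0][1]
    return [(rank, word, count, top_count / rank)
            for rank, (word, count) in enumerate(ranked, start=1)]

# Toy text, invented for illustration only.
text = "The cat saw the dog and the dog saw the bird"
tokens = tokenize(text)

counts = Counter(tokens)           # single-word frequencies
bigrams = ngram_counts(tokens, 2)  # two-word strings
table = zipf_table(counts, top=3)

for rank, word, observed, predicted in table:
    print(rank, word, observed, round(predicted, 2))
```

Swapping the toy sentence for the text of a whole book is the quickest way to see the rank-versus-frequency pattern start to emerge.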
How grammar uses mathematical principles: One of the most commonly studied types of
mathematical principle in language is statistical learning. A good example of
this is transitional probabilities, which are sets of probabilities for what
linguistic item might come next given a string of items (e.g., words or
phonemes). For example, if you read “The author signed the _______”, you could
guess what the blank word is based on the previous four words (most likely,
it’s “book”). This fill-in-the-blank format comes from
psycholinguistic tests called “Cloze tests”. Linguists have discovered that the brain tracks transitional probabilities
for all kinds of linguistic items. In fact, this is one of the most robust
areas of study in language acquisition. Linguists study how babies use
transitional probabilities to learn language. For example, one of the most
challenging problems is figuring out how babies learn to separate a continuous stream of audio noise coming into their ears into separate words, without any
knowledge of what words are or what they mean. One theory is that babies quickly learn transitional probabilities of sounds
that tell them where one word ends and another begins. But transitional
probabilities alone are not enough. For a challenge, try reviewing this PDF:
Algorithms we use to construct sentences: This is the most controversial area you’ve asked about.
The fact is, we linguists don’t really know how the brain constructs sentences.
As I mentioned above, there are models based on transitional probabilities like
Markov models, computer algorithms designed to make those same kinds of guesses
we made about “book”. Markov models and Cloze tests are a good example of psycholinguistics and
computational linguistics coming together. As a theoretical contrast to
statistical models, there are rule-based models like formal grammars.
These are not mathematical in a typical sense, but they are based on formal
logic, which is the underlying foundation of mathematics. Linguistics is in the
middle of a war between the formal grammar camp and the statistical grammar
camp. There’s no consensus on which is the *correct* model of language.
However, in the last decade or so, the statistical side seems to have gained
the advantage. If you really want to dig into this war, here’s a challenging
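For a feel of how a Markov model makes the “book” guess, here is a minimal Python sketch. It rests on my own assumptions: the six training sentences are invented, the model looks back only two words (not the four used in the example above), and train and predict are hypothetical helper names, not functions from any library. The model simply remembers which word followed each two-word context and guesses the most frequent one.

```python
from collections import Counter, defaultdict

def train(sentences, order=2):
    """Map each context (a tuple of `order` words) to a Counter of the words that followed it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - order):
            context = tuple(words[i:i + order])
            model[context][words[i + order]] += 1
    return model

def predict(model, context_words, order=2):
    """Return (most likely next word, its estimated probability), or None for an unseen context."""
    context = tuple(w.lower() for w in context_words[-order:])
    followers = model.get(context)
    if not followers:
        return None
    word, count = followers.most_common(1)[0]
    return word, count / sum(followers.values())

# Toy training corpus, invented for illustration only.
corpus = [
    "the author signed the book",
    "the author signed the contract",
    "the author signed the book",
    "the reader opened the book",
    "the author wrote the book",
    "the dog chased the ball",
]

model = train(corpus, order=2)
guess = predict(model, "The author signed the".split(), order=2)
print(guess)
```

With this corpus the context “signed the” was followed by “book” twice and “contract” once, so the guess is “book” with probability 2/3. A rule-based formal grammar would take a different route: it would say which word categories may fill the blank, without ranking the candidates by frequency.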
Additional Reading:
Linguists who count (the comments are especially engaging;
your teacher might be particularly interested in the calculus vs. algebra debate that

I hope this gets you off to a good start. Please don’t
hesitate to ask for clarifications or more resources (especially let me know if
you need more intro level or more advanced level; I wasn’t sure if I hit the
level right or not). I’m happy to be of more assistance if I can. As a smart,
dedicated student, I’m sure you’re ready to dig into ngrams and Markov models.
But, as a high school junior in southern California with June fast approaching,
I’m also sure you’re ready for the beach. Both are required for a healthy life
of the mind.

NOTE: This is re-posted from my Blogger blog, The Lousy Linguist. Originally posted there July 8.

One of the metaphor recognition papers I read this week had an interesting finding wrt inter-annotator agreement and metaphor: The Automatic Identification of Conceptual Metaphors in Hungarian Texts: A Corpus-based Analysis (Babarczy et al., LREC 2010 Workshop).

The purpose of the paper was to run a sort-of bake-off between three methods of creating source/target word lists (to be used by a selection preference metaphor recognition system): a) a word association experiment, b) a dictionary of synonyms, and c) a reference corpus.

Ultimately they found that their corpus-based method was most successful as measured by recall/precision, but there was a more striking result rather buried in the paper that I feel deserves more analysis. They created a gold standard by hand-tagging a 30,000 word “baseline” corpus. Here’s what they found:

At the first attempt, inter-annotator agreement was only 17%. After refining the annotation instructions, we made a second attempt, which resulted in an agreement level of 48%, which is still a strikingly low value. These results indicate that the definition of “metaphoricity” is problematic in itself [emphasis added].
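As a reminder of what an agreement figure measures, here is a minimal Python sketch with two hypothetical annotators labeling ten items as metaphorical (“M”) or literal (“L”). The labels are invented, and I don’t know which agreement measure the paper actually used, so this shows two common ones under that caveat: raw percent agreement, and Cohen’s kappa, which corrects for the agreement expected by chance.

```python
from collections import Counter

def observed_agreement(a, b):
    """Fraction of items on which the two annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_e is the agreement
    expected by chance given each annotator's own label frequencies."""
    n = len(a)
    p_o = observed_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    p_e = sum(counts_a[label] * counts_b[label]
              for label in set(a) | set(b)) / (n * n)
    if p_e == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Hypothetical annotations, invented for illustration only.
annotator_1 = ["M", "M", "M", "L", "L", "L", "L", "L", "L", "L"]
annotator_2 = ["M", "L", "L", "M", "L", "L", "L", "L", "L", "L"]

raw = observed_agreement(annotator_1, annotator_2)
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(raw, 2), round(kappa, 2))
```

Here the annotators agree on 7 of 10 items, but because “L” dominates, much of that agreement would be expected by chance, and kappa comes out near 0.21. When most words in a corpus are literal, a raw percentage and a chance-corrected score can tell quite different stories.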

They reported three general sources of inter-annotator DISagreement:

  • Direct vs. Indirect Reference: For example, in the case of the conceptual metaphors ANGER IS HEAT or CONFLICT IS FIRE, the source domain should be an expression referring to a sort of “heated thing”. However, in some cases, one or the other annotator included words indirectly suggesting the presence of heat, such as kiolt (‘extinguish’), kihűl (‘get cold’) etc.
  • Lexical Ambiguity: For example, the expression eljutottam a mai napig (‘I’ve gotten to this day’) may or may not represent a CHANGE IS MOTION metaphor depending on whether the Hungarian verb jut (literally: get somewhere, reach a place by moving the entire body) is taken only to denote physical movement or to be ambiguous.
  • Discrepancies in Classification: …it is difficult to make an informed decision on whether the following example contains a CHANGE IS MOTION or a PROGRESS IS MOTION FORWARD metaphor, neither of which appear to be an intuitively correct choice: a járvány végigsöpört szülővárosukon (‘the epidemic swept through their hometown’).

Of the four or five articles I’ve reviewed on automatic metaphor identification, this is the only one which reported on the results of human-tagging a corpus for metaphor. This strikes me as the sort of thing that should be a first step for anyone seriously interested in this program (certainly anyone interested in the IARPA Metaphor Program). I don’t doubt that others have done this, but it seems to be under-reported, suggesting it is not being treated as a core part of the problem.

I’ve complained in my previous posts that there is an overly restricted definition of metaphor underlying contemporary approaches to auto identification, but even within a highly restricted definition like those used by Babarczy et al. and others, there appear to be problems at the heart of identification for humans. So what exactly is being identified?

Anna Babarczy, Ildikó Bencze M., István Fekete, & Eszter Simon (2010). The Automatic Identification of Conceptual Metaphors in Hungarian Texts: A Corpus-Based Analysis. LREC 2010 Workshop Proceedings.