Friday, May 17, 2013

Lexical Complexity

Lexical simplification often requires some method of determining a word's complexity.  At first glance, this sounds like an easy task.  If I asked you to tell me which word is simpler, 'sit' or 'repose', you would probably tell me that the first was the simpler.  However, if I asked you to say why, you might find it more difficult to explain.

Many factors influence the complexity of a word.  In this post, I will identify six key factors: Length, Morphology, Familiarity, Etymology, Ambiguity and Context.  This is not an exhaustive list, and I'm sure other factors contribute too.  I have also mentioned how to measure each factor where appropriate.

1.  Length

Word length, measured in either characters or syllables, is a good indicator of complexity.  Longer words require the reader to do more work, as they must spend longer looking at the word and discerning its meaning.  In the (toy) example above, 'sit' is 3 characters and 1 syllable whereas 'repose' is 6 characters and 2 syllables.
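Both length measures can be approximated in a few lines.  The sketch below is only illustrative: the function names are my own, and the syllable count is a crude vowel-group heuristic (a pronunciation dictionary would be more reliable), so treat its output as an estimate.

```python
import re

def count_characters(word):
    """Number of alphabetic characters in the word."""
    return sum(1 for ch in word if ch.isalpha())

def estimate_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Heuristic assumption: a final 'e' is usually silent ("repose"),
    # but not in words ending in "le" such as "table".
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

for word in ("sit", "repose"):
    print(word, count_characters(word), estimate_syllables(word))
```

For the two example words this reproduces the figures given above, but the heuristic will miscount plenty of irregular English words.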

Length may also affect the following two factors.

2.  Morphology

Longer words tend to be made up of many parts, something referred to as morphological complexity.  In English, many morphemes may be put together to create one word.  For example, 'reposing' may be parsed as: re + pose + ing.  Here, three morphemes come together to give a single word, the semantics of which are influenced by each part.  Morphosemantics is outside the scope of this blog post (and probably this blog!) but let's just say that the more the reader understands about each part, the more they will understand the word itself.  Hence, the more parts there are, the more complex the word will be.
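To make the idea of counting parts concrete, here is a deliberately naive affix-stripping sketch.  The affix lists, the minimum stem length and the function name are all assumptions made for illustration; this is not a real morphological analyser, it does not restore dropped letters (so the stem of 'reposing' comes out as 'pos' rather than 'pose'), and it will wrongly split words that merely happen to start with 're'.

```python
# Tiny, hypothetical affix lists; a real analyser would need far more.
PREFIXES = ("dis", "pre", "re", "un")
SUFFIXES = ("ness", "ing", "ed", "ly", "s")

def split_morphemes(word, min_stem=3):
    """Strip at most one known prefix and any number of known suffixes."""
    word = word.lower()
    prefixes, suffixes = [], []
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_stem:
            prefixes.append(prefix)
            word = word[len(prefix):]
            break
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                suffixes.insert(0, suffix)
                word = word[:-len(suffix)]
                stripped = True
                break
    return prefixes + [word] + suffixes

for word in ("sit", "repose", "reposing"):
    parts = split_morphemes(word)
    print(word, len(parts), " + ".join(parts))
```

The number of parts returned can then serve as a rough morpheme count for a word.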

3.  Familiarity

The frequency with which we see a word is thought to be a large factor in determining lexical complexity.  We are less certain about the meaning of infrequent words, so greater cognitive load is required to assure ourselves we have correctly understood a word in its context.  In informal speech and writing (such as film dialogue or sending a text message) short words are usually chosen over longer words for efficiency.  This means that we are in contact more often with shorter words than we are with longer words, which may explain in part the correlation between length and complexity.

Familiarity is typically quantified by looking at a word's frequency of occurrence in some large corpus.  This was originally done for lexical simplification using Kucera-Francis frequency, which consists of frequency counts from the 1-million word Brown corpus.  In more recent times, frequency counts from larger corpora have been employed.  In my research I employ SUBTLEX (a word frequency count of subtitles from over 8,000 films), as I have empirically found this to be a useful resource.
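As a minimal sketch of the counting involved, the snippet below tallies word frequencies from raw text.  The tiny corpus string is invented purely for illustration (it is not SUBTLEX or Brown data), and the add-one log scaling is one common choice rather than the only one.

```python
import math
import re
from collections import Counter

def word_frequencies(text):
    """Count how often each lowercased word occurs in the text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def log_frequency(word, counts):
    """Log10 of the count, with add-one smoothing so unseen words score 0."""
    return math.log10(counts[word.lower()] + 1)

# Hypothetical toy corpus; a real study would load a large frequency list.
corpus = "Please sit down. Sit here and sit still. They lay in repose."
counts = word_frequencies(corpus)

for word in ("sit", "repose", "recline"):
    print(word, counts[word], round(log_frequency(word, counts), 3))
```

With a real resource, the raw counts would simply be read from the published frequency list instead of being computed from text.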

4.  Etymology

A word's origins and historical formations may contribute to its complexity, as meaning may be inferred from common roots.  For example, the Latin word 'sanctus' (meaning holy) is at the etymological root of both the English words 'saint' and 'sanctified'.  If the meaning of one of these words is known, then the meaning of the other may be inferred on the basis of their 'sounds-like' relationship.

In the earlier example, 'sit' is of Proto-Germanic (Anglo-Saxon) origin whereas 'repose' is of Latin origin.  Words of Latin and Greek origin are often associated with higher complexity.  This is due to a mixture of factors including the widespread influence of the Romans and the use of Latin as an academic language.

To date, I have seen no lexical complexity measures that take into account a word's etymology.

5.  Ambiguity

Certain words have a high degree of ambiguity.  For example, the word 'bow' has a different meaning in each of the following sentences:

The actors took a bow.
The bow-legged boy stood up.
I hit a bull's eye with my new carbon fibre bow.
The girl wore a bow in her hair.
They stood at the bow of the boat.

A reader must discern the correct interpretation from the context around a word.  Ambiguity can be measured empirically by looking at the number of dictionary definitions given for a word.  According to my dictionary, 'sit' has 6 forms as a noun and a further 2 as a verb, whereas 'repose' has 1 form as a noun and 2 forms as a verb.  Interestingly, 'sit' is more complex by this measure.
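The sense-counting measure amounts to a simple lookup and sum.  In the sketch below the dictionary is a hand-built stand-in containing only the counts quoted above; the names are hypothetical, and in practice the counts would come from a lexical resource such as WordNet.

```python
# Sense counts quoted in the text above, keyed by part of speech.
SENSE_COUNTS = {
    "sit": {"noun": 6, "verb": 2},
    "repose": {"noun": 1, "verb": 2},
}

def count_senses(word, sense_counts=SENSE_COUNTS):
    """Total number of listed senses across parts of speech (0 if unknown)."""
    return sum(sense_counts.get(word.lower(), {}).values())

for word in ("sit", "repose"):
    print(word, count_senses(word))
```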

6.  Context


There is some evidence to show that context also affects complexity.  For example, take the following sentences:

"The rain in Spain falls mainly on the ______"
"Why did the chicken cross the ______"
"To be or not to ___"
"The cat ____ on the mat"

In each of these sentences, you can easily guess the missing word (or, failing that, use Google's autocomplete feature).  If we placed an unexpected word in the blank slot, the sentence would require more effort from the reader.  Words in familiar contexts are simpler than words in unfamiliar contexts.  This indicates that a word's complexity is not a static notion, but is influenced by the words around it.  This can be modelled using n-gram frequencies to check how likely a word is to co-occur with the words around it.
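One simple way to turn n-gram frequencies into a context score is to estimate how often a word follows a given context.  The sketch below does this with trigram counts over a three-sentence corpus that I have made up for illustration; the function names are assumptions too, and a real model would use counts from a very large corpus plus some form of smoothing.

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))

def train(sentences, n=3):
    """Count n-grams and the (n-1)-word contexts that precede their last word."""
    ngram_counts, context_counts = Counter(), Counter()
    for sentence in sentences:
        for gram in ngrams(sentence.lower().split(), n):
            ngram_counts[gram] += 1
            context_counts[gram[:-1]] += 1
    return ngram_counts, context_counts

def context_probability(context, word, ngram_counts, context_counts):
    """Relative frequency of `word` following `context` (0.0 if context unseen)."""
    context = tuple(w.lower() for w in context)
    total = context_counts[context]
    if total == 0:
        return 0.0
    return ngram_counts[context + (word.lower(),)] / total

# Hypothetical toy corpus, invented for this example.
sentences = [
    "the cat sat on the mat",
    "the cat sat by the door",
    "the cat slept on the sofa",
]
ngram_counts, context_counts = train(sentences)

for word in ("sat", "repose"):
    print(word, context_probability(("the", "cat"), word, ngram_counts, context_counts))
```

A word that is expected in its context receives a high score, while an unexpected one scores near zero.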

Summary

So, if we put those factors into a table it looks something like this:

Word                       "sat"            "repose"
Length (characters)        3                6
Length (syllables)         1                2
Familiarity (frequency)    3383             29
Morphology (morphemes)     1                2
Etymology (origins)        Proto-Germanic   Latin
Ambiguity (senses)         8                3
Context* (frequency)       6.976            0.112
*source: Google n-grams value for query "the cat ____".  Value is percentage occurrence and is multiplied by a factor of 10^7

We see that repose is more difficult in every respect except for the number of senses.

Lexical complexity is a hard concept to work with: it is often subjective and shifts from sense to sense and context to context.  Any research into determining lexical complexity values must take into account the factors outlined here.  The most recent work on determining lexical complexity is the SemEval 2012 task on lexical simplification.  This is referenced below for further reading.

L. Specia, S. K. Jauhar, and R. Mihalcea. SemEval-2012 Task 1: English Lexical Simplification. In First Joint Conference on Lexical and Computational Semantics, 2012.

Comments:

  1. Started working on complex word identification for the first time and found your blog to be very useful for the basic understanding. Great job. Thanks for the blog.

  2. Thanks for breaking this down! I've been exploring a few different approaches to measuring lexical complexity in the context of audio vs. text (https://phonic.ai/blog).