Wednesday, November 06, 2013

Word Sense Disambiguation

Some words have more than one meaning.  The brain seems to have an innate ability to work out what a sentence means.  Take the following two sentences:

"I tied my boat to the bank"
"I put my money in the bank"

In the first sentence you probably imagine somebody tying their boat to the side of a river, yet in the second sentence you imagine somebody investing their money with a financial institution.  That same string of four characters, 'b a n k', has completely changed meaning.

Word sense disambiguation (WSD) is a well researched task in computational linguistics with an important application to lexical simplification.  The majority of previous research splits roughly into three categories:
  • Supervised: Using labelled data, a system builds a classifier which can recognise the different senses of a word, from a variety of features in the words surrounding it.
  • Unsupervised: With unlabelled data, a system learns the different senses of a word.  Classification of new data makes use of the previously learned senses.
  • Knowledge Based: A large knowledge resource such as WordNet provides information about the words which can be used during disambiguation.

 WSD is vital to the task of lexical simplification.  Consider simplifying a sentence from the previous example. If you look up the word 'bank' in a thesaurus you will have a list of synonyms that looks something like the following:

Bank:
Financial Institution; Treasury; Safe;
Edge; Beach; Riverside;

If a system does not employ WSD, then there is no method of telling which of the synonyms are correct for the context.  We do not wish to say "I tied my boat to the treasury", or "I put my money in the riverside".  These examples are at best farcical and at worst nonsensical.  WSD is paramount to selecting the correct set of synonyms.
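To make this concrete, here is a toy, Lesk-style sketch of knowledge-based WSD.  The hand-picked "signature" words for each sense and the synonym sets are my own inventions, standing in for glosses from a resource such as WordNet: the sense whose signature overlaps most with the sentence wins, bringing its synonym set with it.

```java
import java.util.*;

public class SenseChooser
{
  // Toy signature words for each sense of "bank" (hand-picked, illustrative only).
  static final Map<String, Set<String>> SIGNATURES = Map.of(
    "financial", Set.of("money", "deposit", "account", "invest", "loan"),
    "river",     Set.of("boat", "river", "water", "tied", "shore"));

  // Toy synonym sets, one per sense.
  static final Map<String, List<String>> SYNONYMS = Map.of(
    "financial", List.of("financial institution", "treasury"),
    "river",     List.of("riverside", "edge"));

  // Pick the sense whose signature overlaps most with the sentence's words.
  static String disambiguate(String sentence)
  {
    String best = null;
    int bestScore = -1;
    for (Map.Entry<String, Set<String>> sense : SIGNATURES.entrySet())
    {
      int score = 0;
      for (String word : sentence.toLowerCase().split("\\W+"))
        if (sense.getValue().contains(word))
          score++;
      if (score > bestScore)
      {
        bestScore = score;
        best = sense.getKey();
      }
    }
    return best;
  }

  public static void main(String[] args)
  {
    String s1 = "I tied my boat to the bank";
    String s2 = "I put my money in the bank";
    System.out.println(disambiguate(s1) + ": " + SYNONYMS.get(disambiguate(s1)));
    System.out.println(disambiguate(s2) + ": " + SYNONYMS.get(disambiguate(s2)));
  }
}
```

A real system would, of course, derive the signatures from a knowledge resource or from training data rather than hard-coding them.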

I will not venture a full explanation of WSD as applied to lexical simplification.  Suffice it to say that there are four papers which I have so far identified as addressing the matter.  These can be found in the lexical simplification list.

  • Can Spanish be simpler? LexSiS: Lexical simplification for Spanish. Bott et al. 2012
  • Wordnet-based lexical simplification of a document. Thomas and Anderson 2012
  • Putting it simply: a context-aware approach to lexical simplification. Biran et al. 2011
  • Lexical simplification. De Belder et al. 2010

Friday, September 06, 2013

3rd Year

On Wednesday (4/9/2013) I successfully completed my end of second year interview.  This means that I am now officially a third year PhD student.  I am now at the dead halfway point of my PhD, having completed 24 months with 24 more remaining.  It has been a long road getting here and there is still a long way to go.  Below is a brief analysis of the achievements in my PhD so far and the goals yet to come.


Completed So Far:

  • Literature Review:  This was the first thing I did as a PhD student.  Reading took up most of the first six months of my research.  I consumed, refined and categorised as much of the relevant literature as I could find.  I am currently attempting to publish this as a survey paper, since the only available text simplification survey is a technical report from 2008.
  • Lexical Simplification Errors: I recently undertook a pilot study looking at the errors thrown up by the lexical simplification pipeline.  I'm looking to publish this in an upcoming conference, so won't say too much about the results here and now.
  • Complex Word Identification: This was the first element of the lexical simplification pipeline that I studied.  I built a corpus of sentences, each with one word marked as complex for the purpose of evaluating current methods of identification.  This work was published in 2 separate workshop papers at ACL 2013.
  • Substitution Generation: Once we have identified a complex word, we must generate a set of substitutions for it.  However, those words which are complex are also those which are least likely to be found in a thesaurus, complicating the task.  To address this I spent considerable efforts learning simplifications from massive corpora with some success.  This work is also currently being written up for publication.


Still to come:

  • Word Sense Disambiguation: The next step in the pipeline is to apply some word sense disambiguation.  This has been done before, so I will be looking at the best ways to apply it and hopefully making a novel contribution here.  I am just starting out on this phase of research and am currently immersed in the WSD literature, trying to get my head round the myriad techniques that already exist there.
  • Synonym Ranking: I looked into the best way to rank synonyms according to their complexity at the start of my project.  The small amount of work that I did back then did not discover anything radical, but it did help me to better understand the structure of a lexical simplification system.  When I revisit this area it will be with the hope of making a significant contribution.  I was really interested in the work David Kauchak presented at ACL 2013 and will be keen to explore what more can be done in this area.
  • User Evaluation: Finally, I will spend some time exploring the effects of each of the modules I have developed on individual users.  It is of paramount importance to evaluate text simplification in the context of the users it is aimed at, and to this end I will be focussing my research on a specific user group, although which group is as yet undecided.
  • Thesis: This will undoubtedly take a significant portion of my final year.  The chapter titles will hopefully be the bullet points you see listed above.

So there you have it.  Although it appears that I have done a lot so far, it still feels like I have a real mountain to climb.  There are significant hurdles and vast amounts of reading, researching and writing ahead.  I look forward to the challenges that the next two years of my PhD will bring.

    Monday, August 19, 2013

    ACL 2013 - Post Blog

    It's been a little over a week since I got back from ACL.  I think it takes a certain amount of time to process a conference, and I think I'm still processing it.  It was a massively positive experience overall.  It was very encouraging to meet so many people working in similar disciplines, people who engaged with similar problems.  It was also very encouraging to present my research and to get feedback from more experienced members of the community.  Despite being somewhat terrified about the prospect of presenting, I actually really enjoyed it.  People had really good questions which made me think more about my research and get even more excited for it.

    A real highlight of the conference was the workshop on Predicting and Improving Text Readability (PITR).  This was a small workshop, with maybe 10-20 people at any one time.  During the course of the day I gave both a poster and an oral presentation.  The people there were working in very similar areas to mine and I got such valuable feedback on my work, and was able to understand and discuss other people's research with them.

    I really enjoyed the conference experience and I will definitely be looking to attend another conference in the forthcoming season (as much as time and funding might allow!).  I have some work on automatic thesaurus generation that I am looking to write up and submit to either LREC or EACL.  Their submission dates are close together (15th and 18th October respectively), so I will likely submit the same paper to both to increase my odds of acceptance.

    The next big hurdle in my academic career is my progression interview on 4th September.  According to the supporting documentation:
    "The student has been working for 18 months on research. It should be possible at this point to determine whether they are capable of achieving at the research project they are attempting"
    Which sounds terrifying.  I'm currently choosing not to stress about it: whilst they technically have the option to throw me out at this point, the chances of them doing so are very low.  I'm required to present a short (1000 word) report and give a 10 minute talk.  I already have the talk roughly planned out in my mind, although I've not put any slides together as of yet.

    Thursday, August 15, 2013

    Orthography, Phonology and Nomenclature. Making sense of word relations!

    In my role as a computational linguist, I often find myself venturing into areas of linguistics which I find nothing short of fascinating.  One such area is the complex relation between English orthography (that is, how a word is written) and phonology (how it sounds).  In English, we have a 'deep orthography', meaning that a word doesn't necessarily sound the way it looks, leading to beautiful confusions such as:
    weight vs. height

    foot vs. food
    or (my favourite):
    cough vs. enough vs. plough vs. though vs. through.
    That's right, 5 distinct sounds from the letters 'ough'.

    We also get the interesting phenomenon that one set of letters, with different pronunciations, can have totally different meanings.  For example:
    He decided to desert the army.
    Camels live in the desert.

    This is an example of a heteronym.  Heteronyms are different to homonyms, which have the same pronunciation and spelling but a different meaning.  These are different again to heterographs, homophones and synonyms.  The table below defines the potential relations between words.  It is taken mostly from this Venn diagram.

    Relation                  Meaning     Spelling    Pronunciation
    No Relation               Different   Different   Different
    Homophone                 Different   -           Same
    Heterograph               Different   Different   Same
    Heteronym                 Different   Same        Different
    Homonym                   Different   Same        Same
    Different Spelling        Same        Different   Same
    Different Pronunciation   Same        Same        Different
    Synonym                   Same        Different   Different
    Same Word                 Same        Same        Same

    • No relation: Two words which are not related in any sense.
    • Homophones:  Words which sound the same, but have different meanings.  Further split into the following two categories:
    • Heterographs: Homophones with different spellings.  "There", "Their" and "They're" is a classic example.
    • Homonyms: Homophones with the same spelling.  E.g. "Right" (direction) vs. "Right" (entitlement).
    • Heteronyms: Words that are spelt the same but have a different sound and meaning. E.g. "desert" (leave) vs. "desert" (Sahara), as in the above example.
    • Different Spelling: No technical word here, just words which mean and sound the same but are spelt differently. e.g. "Labor" (US spelling) vs. "Labour" (British Spelling).
    • Different Pronunciation: Again, no technical word, just two words which are written and mean the same, but sound different. E.g. 'the elephant' vs. 'the circus'.  ('the' takes a different sound in each).
    • Synonyms: Two words with the same meaning, but different pronunciations and written forms.  e.g. "friend" and "companion". Useful for lexical simplification as synonyms can be ranked according to their simplicity.
    • Same Word: No difference here whatsoever.
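    The table above is really just a three-way boolean lookup, so it can be sketched directly in code.  The category names are exactly those defined above; true means "same" and false means "different":

```java
public class WordRelations
{
  // Classify the relation between two words from three comparisons,
  // following the table above (true = same, false = different).
  static String classify(boolean meaning, boolean spelling, boolean pronunciation)
  {
    if (meaning)
    {
      if (spelling && pronunciation) return "Same Word";
      if (spelling)                  return "Different Pronunciation";
      if (pronunciation)             return "Different Spelling";
      return "Synonym";
    }
    if (spelling && pronunciation) return "Homonym";
    if (spelling)                  return "Heteronym";
    if (pronunciation)             return "Heterograph";
    return "No Relation";
  }

  public static void main(String[] args)
  {
    // desert (leave) vs. desert (Sahara): same spelling, different sound and meaning.
    System.out.println(classify(false, true, false));  // Heteronym
    // friend vs. companion: same meaning only.
    System.out.println(classify(true, false, false));  // Synonym
  }
}
```

    Note that the "Homophone" row is the umbrella case: it matches whenever meaning differs and pronunciation is the same, and the code refines it into Homonym or Heterograph depending on spelling.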
    So there you have it.  I hope this is a helpful contribution to the often confusing world of word relation nomenclature.  I am certainly much clearer on the distinction between these terms as a result of writing this blog.

    Monday, August 12, 2013

    The Lexical Simplification List

    Whilst putting together my Literature review, I decided it might be valuable if the references I was collecting were visible to other people who are interested in lexical simplification.  To that end, I have put together a list of all the references I know of which pertain in some way to lexical simplification.  I have tried not to overload this list, so have only included those papers which seem to be explicitly working in lexical simplification, rather than those which mention it in passing.  The list is probably incomplete in its current incarnation, so if you see any papers you think are missing, please do drop me an email and I'll be happy to add them.  To find the list you can follow the tab at the top, or click here.

    Further to this, I thought it might be nice to collect together some of the resources I have found helpful on one page.  This means that I have split the resources sections into 'my resources' and 'external resources'.  In the external resources section I have put in some links to useful resources which I have used, but have had no hand in creating.

    My idea and hope with this is that somebody wishing to start out in lexical simplification will be able to read through these two lists and find a good bed of research, and a good bed of resources to begin.  I also hope that other more established lexical simplification researchers will find the content interesting and their research will benefit from it.

    Thursday, August 01, 2013

    Randomising lines in a very large file with Java

    I came across an interesting problem today.  I have some nice results from counting data and I wanted to see if the same results would appear if I randomised the underlying dataset.  The problem?  The dataset is a 17 Gigabyte file.

    All the solutions I could find online required the file to be read into memory at some point.  Obviously, with my 8GB of RAM these were not acceptable solutions.  I needed a solution which would allow one line to be in memory at once and then to be discarded.

    I reasoned that if I wrote the lines of the file into separate files, I could create some randomisation.  I also realised that the more files there were, the greater the randomisation.

    Below is the java code I wrote, as well as a bash wrapper script. It takes a file and a numeric argument denoting how many files to write into.  It then  assigns each line at random to one of the files until it runs out of lines.  These files can then be concatenated together in  a post-processing step.  I think it's quite a neat solution.  I've commented the code for readability, so hopefully it will be reusable.  Of course this is not true randomisation as some ordering is preserved, however it should work for most purposes.

    For my 17GB file it took 22 minutes to run, writing to 1000 files.  Needless to say that most of that time was taken up by I/O.


    import java.io.PrintWriter;
    import java.io.FileWriter;
    import java.io.FileReader;
    import java.io.BufferedReader;
    import java.util.Random;

    public class RandomiseLines
    {
      public static void main(String [] args) throws Exception
      {
        if(args.length != 2)
        {
          System.out.println("Usage: java RandomiseLines <file> <No of Output Files>");
          System.exit(-1);
        }

        //the number of separate files to place lines into.
        final int FILENUM = Integer.parseInt(args[1]);

        //initialise the random number generator.
        final long SEED = 1;
        Random generator = new Random(SEED);

        //if a fixed seed isn't required, comment out the lines above and use:
        /*
          Random generator = new Random();
        */

        //initialise the file writers
        PrintWriter [] writers = new PrintWriter[FILENUM];
        for (int i = 0; i < FILENUM; i++)
         writers[i] = new PrintWriter(new FileWriter("out." + i + ".txt"));

        //read in the file
        int key;
        String line;
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        while((line = in.readLine()) != null)
        {

          //generate a random number between 0 and FILENUM - 1
          key = (int)Math.floor(FILENUM*generator.nextDouble());

          //write the line to the chosen file;
          writers[key].println(line);
        }//while

        //close IO
        in.close();
        for(int i = 0; i < FILENUM; i++)
         writers[i].close();
       
      }//main
    }//class

    The following shell script can be used as a wrapper to the programme.
    #!/bin/bash

    FileNum=10;

    java RandomiseLines $1 $FileNum

    > randomised.txt

    for i in `seq 0 $(($FileNum -1))`; do
     cat out.$i.txt >> randomised.txt;
     rm out.$i.txt
    done

    exit

    Monday, July 29, 2013

    ACL 2013 Pre-Blog.

    In under a week I will be sitting on a plane headed to Bulgaria.  This year I will be presenting at the Association for Computational Linguistics Conference in Sofia.  I have been fortunate enough to have had two papers accepted for presentation.

    The first paper is part of the Student Research Workshop.  This is a part of the main conference, but they only accept papers from PhD students (making it slightly easier to get into!).  The paper I am presenting details some experiments in attempting to establish a baseline for complex word identification.  I used the CW Corpus (see below) to test a few standard techniques in complex word identification.  It turned out that they all performed fairly similarly, but that in itself was an unexpected and hence interesting finding!  The mode of presentation will be via a poster.  I think this will be quite difficult and will require a lot of energy to stay engaged and motivated, but I'm up for the challenge.

    I'm excited to attend the student research workshop.  It will hopefully be an encouraging experience.  Whilst I don't expect there to be many (if any!) people who are experts in text simplification there, I'm sure it will be very useful to meet other PhD students and see where their work is taking them.

    The second paper is part of a co-located workshop called Predicting and Improving Text Readability for Target Reader Populations (PITR 2013).  This is a much smaller workshop with a more specialised focus.  It is very relevant to my field of research and so I'm interested to meet plenty of like-minded people there.  I have followed the work of some of the presenting authors, so it will be very exciting to meet them face to face.

    This is more than just academic celebrity spotting, of course.  The paper I will be presenting is on the CW Corpus, a resource I developed for evaluating the identification of complex words.  There are a lot of implementation details, which for the main part I will try not to bore people with.  The main thing I want to do with this conference is to get people interested in the concept of complex word identification as its own separate, evaluable subtask.  Hopefully people will respond well to this, seeing it as a valid area to be working in.  I'm presenting a poster in this workshop and will also be giving a 15 minute talk on my research.

    I'll write about the conference again soon, either whilst I'm out in Bulgaria, or when I get back.

    Wednesday, June 19, 2013

    The importance of being accurate.

    Only a short post today. I am currently writing my transfer report, which is soaking up all of my research time.  I thought I would take some time out from that to write about an interesting phenomenon that occurs in text simplification.

    Accuracy is always to be sought after.  Regardless of your domain, the more accurate your algorithm, the better.  In many domains, negative results can be tolerated.  For example, if you search for the query 'jaguar in the jungle' you are likely to receive lots of results about big cats in their natural habitat, but you may also receive some results about fancy cars in the jungle too.  This is acceptable and may even be helpful as the original query contained some ambiguity - maybe you really wanted to know about those fancy cars.

    The same thing can occur during text simplification.  Inaccurate identifications or replacements may lead to an incorrect result being present in the final text.  Some of the critical points of failure are as follows:
    • A complex word could be mislabeled as simple - meaning it is not considered for simplification.
    • No replacements may be available for an identified complex word.
    • A replacement which does not make sense in the context of the original word may be selected.
    • A complex replacement may be incorrectly selected over a simpler alternative due to the difficulty of estimating lexical complexity.
    If any of the above pitfalls occur, then either a complex word or an erroneous replacement may creep into the final text.  Unlike in web search, errors are of great detriment to the simplification process.  This is because the point is to have text which is easier to understand.  In the majority of cases, introducing errors into a text will cause it to be more difficult, completely negating any simplification made.  This is a real case of one step forwards and two steps back.  For example:

    A young couple with children will need nearly 12 years to get enough money for a deposit.
    was changed by a rudimentary lexical simplification system to:

    A young couple with children will need nearly 12 years to get enough money for a sediment.
    Not only has a synonym which is more complicated than the original word been chosen here, the synonym does not make any sense in the given context.  Through making an error, the understandability of the text is reduced, and it would have been better to make no simplification at all.

    To end this post, I will present some practical ways to mitigate this.
    1. Only simplify if you're sure.  Thresholds for deciding whether to simplify should be set high to avoid errors.
    2. Use resources which are well suited to your task, preferably built from as large a corpus as possible.
    3. Investigate these errors in resultant text.  If they are occurring, is there a specific reason?
    In summary, incomprehensible text is much more complex than understandable yet unsimplified text.  Whilst the goal of text simplification must be to simplify when and wherever possible, this must not be done at the expense of a system's accuracy.  Presenting a reader with error prone text is as bad, if not worse than presenting them with complex text.
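    As a sketch of the first point, a simplification system might gate every substitution behind a frequency-ratio threshold, only accepting a candidate that is substantially more frequent than the original word.  The threshold and the frequency values below are invented purely for illustration:

```java
public class CautiousSimplifier
{
  // Only accept a substitution if the candidate is at least THRESHOLD times
  // more frequent than the original word (all values illustrative).
  static final double THRESHOLD = 2.0;

  static boolean shouldReplace(double originalFreq, double candidateFreq)
  {
    return candidateFreq >= THRESHOLD * originalFreq;
  }

  public static void main(String[] args)
  {
    // "deposit" (freq 120) vs. candidate "sediment" (freq 15): reject.
    System.out.println(shouldReplace(120, 15));   // false
    // "repose" (freq 29) vs. candidate "rest" (freq 3000): accept.
    System.out.println(shouldReplace(29, 3000));  // true
  }
}
```

    Rejecting a substitution leaves the original (possibly complex) word in place, which, as argued above, is usually the lesser evil.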

    Friday, May 31, 2013

    muthesis.cls

    About this time last year, the first CDT cohort wrote their long reports.  Some people, myself included, chose to use the university's LaTeX thesis class, which is a great resource for thesis writing but is not quite designed for end of year reports.  To remedy this I modified some of the style information in the muthesis.cls file to make it more appropriate for an end of year report.  The changes made are as follows:

    • On the title page, the text 'A thesis submitted to the UoM for the degree of doctor of philosophy' has been changed to read 'An end of year report submitted to the UoM'
    • The word count at the end of the contents page was removed (this was a personal style choice)
    • On the Abstract page, the word 'thesis' is modified to read 'end of year report' similar to the first point above.  Also I changed the ordering here from:  Title, Author, Specification to Title, Specification, Author.
    • The declaration was removed as this felt out of tone for an end of year report and is not required by the formal specification.
    The file is available from the resources page. It will download as a tar archive, which can be unpacked using the command:

    tar -xf eoy.tar.gz

    Once this has been done the latex file should compile straight away using the command:

    pdflatex EOY.tex

    You can then view the resulting pdf in EOY.pdf

    Hope this is a helpful resource in writing end of year reports.  If you want a hand to modify the style file further then let me know.  I'm happy to assist where I can.

    Wednesday, May 29, 2013

    Identifying Complex Words

    The very first step in lexical simplification is to identify complex words (CWs).   This is the process of scanning a text and picking out the words which may cause a reader difficulty.  Getting this process right is important, as it is the first stage in the simplification pipeline.  Hence, any errors incurred at this stage will propagate through the pipeline, resulting in user misunderstanding.

    How do we define a CW?

    In my previous blog post, I gave several factors that come together to form lexical complexity.  Lexical complexity values can be inferred using the metrics given there.  Typically, word frequency is either used by itself or combined with word length to give a continuous scale on which complexity may be measured.  We can then use this scale to define and identify our CWs as described below.

    How do we identify them?

    There are a few different methods in the literature for actually identifying CWs.  I have written a paper discussing and evaluating these which is referenced at the end of this section.  For now, I'll just give a brief overview of each technique - but please do see the paper for a more in depth analysis.
    1. The most common technique unsurprisingly requires the least effort.  It involves attempting to simplify every word and doing so where possible.  The drawback of this technique is that the CWs are never identified.  This means that difficult words which can't be simplified (e.g. because there is no simpler alternative), won't be.  It also means that words which are not causing a barrier to understanding may be modified, potentially resulting in error.
    2. Lexical complexity (as explained above) can be used to determine which words are complex in a given sentence.  To do this, a threshold value must be established, which is used to indicate whether a word is complex.  Selecting a lexical complexity measure which discriminates well is very important here.  
    3. Machine learning may also be used to some effect.  Typically, Support Vector Machines (SVMs, a type of statistical classifier) have been employed for this task.  Lexical and syntactic features may be combined to give an adequate classifier.
    I am soon to publish a comparison of the above techniques at the ACL-SRW 2013.  I will put a link up to that paper here when it is available.
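    A minimal sketch of the second (threshold-based) technique, using a toy frequency list in place of a real corpus count such as SUBTLEX (all values below are invented):

```java
import java.util.*;

public class ComplexWordIdentifier
{
  // Toy frequency list standing in for a real corpus count (values invented).
  static final Map<String, Integer> FREQ = Map.of(
    "the", 1000000, "cat", 50000, "sat", 33000,
    "on", 800000, "mat", 9000, "reposed", 12);

  // Words rarer than this threshold are flagged as complex.
  static final int THRESHOLD = 1000;

  static List<String> identify(String sentence)
  {
    List<String> complex = new ArrayList<>();
    for (String word : sentence.toLowerCase().split("\\W+"))
      if (FREQ.getOrDefault(word, 0) < THRESHOLD)
        complex.add(word);
    return complex;
  }

  public static void main(String[] args)
  {
    System.out.println(identify("The cat reposed on the mat"));  // [reposed]
  }
}
```

    The whole method stands or falls on the choice of threshold and frequency resource, which is exactly the discrimination problem mentioned in point 2.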

     

    The CW corpus

    To compare different techniques in CW identification, it is necessary to have an annotated corpus.  My solution to this was to extract sentences from Simple Wikipedia edit histories which had been simplified by a revising editor.  I have a separate paper submitted on this and will write more about it in a future post.  The corpus contains 731 sentences, each with one annotated CW.  This can be used for fully automatic evaluation.  The data is available from the resources page.

    User-dependent complexity

    Complexity is a subjective measure and will vary from user group to user group and even from user to user.  For example, take the case of a class of English language learners.  They will all have different levels of English proficiency and will have differing knowledge of English, based on their experience of it to date.  A language learner who has been on holiday to England several times may have different simplification needs to a language learner who has watched many films in English, subtitled in their own language.  A language learner whose first language is Italian will find many words to be similar to their own language; similarly, a learner whose first language is German may also find many words to be similar. However, German and Italian speakers will not find the same English words familiar.  It could even be hypothesised that words which an Italian speaker found simple would need to be simplified for a German speaker and vice versa.

    E.g.

      German           English           Italian
      Verwaltung       Administration    Amministrazione
      Apfel            Apple             Mela

    The above toy example shows how one language learner's simplification needs may differ from another.  The German speaker will find the word 'Apple' familiar, yet struggle with 'Administration', the Italian speaker will experience the reverse.

    There is very little work on discerning the individual simplification needs of a user.  This is not just a problem confined to language learning (although it may be seen there very clearly) but it affects all spheres of text simplification.  A technique which could adapt to a user's needs, maybe incorporating feedback from a user where appropriate would go far.
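    One possible starting point, and purely my own speculation: estimate how familiar an English word might look to a given learner by computing the normalised edit distance between the word and its translation in the learner's first language.  A high similarity suggests a likely cognate:

```java
public class CognateSimilarity
{
  // Standard Levenshtein edit distance between two strings.
  static int editDistance(String a, String b)
  {
    int[][] d = new int[a.length() + 1][b.length() + 1];
    for (int i = 0; i <= a.length(); i++) d[i][0] = i;
    for (int j = 0; j <= b.length(); j++) d[0][j] = j;
    for (int i = 1; i <= a.length(); i++)
      for (int j = 1; j <= b.length(); j++)
      {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                           d[i - 1][j - 1] + cost);
      }
    return d[a.length()][b.length()];
  }

  // Similarity in [0,1]: 1 means identical, 0 means nothing shared.
  static double similarity(String a, String b)
  {
    int maxLen = Math.max(a.length(), b.length());
    return maxLen == 0 ? 1.0 : 1.0 - (double) editDistance(a, b) / maxLen;
  }

  public static void main(String[] args)
  {
    System.out.printf("administration/amministrazione: %.2f%n",
                      similarity("administration", "amministrazione"));
    System.out.printf("apple/apfel: %.2f%n", similarity("apple", "apfel"));
  }
}
```

    On the toy example above, administration/amministrazione scores much higher than apple/amministrazione would, matching the intuition that the Italian speaker finds 'administration' familiar.  This is only a surface heuristic; it says nothing about false friends.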

    Friday, May 17, 2013

    Lexical Complexity

    Lexical simplification often requires some method of determining a word's complexity.  At first glance, this sounds like an easy task.  If I asked you to tell me which word is simpler: 'sit' or 'repose', you would probably tell me that the first was the easiest.  However, if I asked you to say why, it may be more difficult to explain.

    Many factors influence the complexity of a word.  In this post, I will identify six key factors: Length, Morphology, Familiarity, Etymology, Ambiguity and Context.  This is not an exhaustive list, and I'm sure other factors contribute too.  I have also mentioned how to measure these where appropriate.

    1.  Length

    Word length, measured in either characters or syllables, is a good indicator of complexity.  Longer words require the reader to do more work, as they must spend longer looking at the word and discerning its meaning.  In the (toy) example above, sit is 3 characters and 1 syllable whereas repose is 6 characters and 2 syllables.
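    A rough sketch of measuring both kinds of length.  The syllable count here is a crude vowel-group heuristic with a silent-e correction, my own approximation only; real syllabification needs a pronunciation dictionary:

```java
public class WordLength
{
  // Rough syllable estimate: count groups of consecutive vowels,
  // then apply a crude silent-e correction.
  static int syllables(String word)
  {
    int count = 0;
    boolean inVowelGroup = false;
    for (char c : word.toLowerCase().toCharArray())
    {
      boolean vowel = "aeiouy".indexOf(c) >= 0;
      if (vowel && !inVowelGroup) count++;
      inVowelGroup = vowel;
    }
    if (word.toLowerCase().endsWith("e") && count > 1) count--;
    return Math.max(count, 1);
  }

  public static void main(String[] args)
  {
    System.out.println("sit: " + "sit".length() + " chars, "
                       + syllables("sit") + " syllable(s)");
    System.out.println("repose: " + "repose".length() + " chars, "
                       + syllables("repose") + " syllable(s)");
  }
}
```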

    Length may also affect the following two factors.

    2.  Morphology

    Longer words tend to be made up of many parts - something referred to as morphological complexity.  In English, many morphemes may be put together to create one word.  For example 'reposing' may be parsed as: re + pose + ing.  Here, three morphemes come together to give a single word, the semantics of which are influenced by each part.  Morphosemantics is outside the scope of this blog post (and probably this blog!) but let's just say that the more the reader understands about each part, the more they will understand the word itself.  Hence, the more parts there are, the more complex the word will be.

    3.  Familiarity

    The frequency with which we see a word is thought to be a large factor in determining lexical complexity.  We are less certain about the meaning of infrequent words, so greater cognitive load is required to assure ourselves we have correctly understood a word in its context.  In informal speech and writing (such as film dialogue or sending a text message) short words are usually chosen over longer words for efficiency.  This means that we are in contact more often with shorter words than we are with longer words, which may explain in part the correlation between length and complexity.

    Familiarity is typically quantified by looking at a word's frequency of occurrence in some large corpus.  This was originally done for lexical simplification using Kucera-Francis frequency, i.e. frequency counts from the 1-million word Brown corpus.  In more recent times, frequency counts from larger corpora have been employed. In my research I employ SUBTLEX (a word frequency count of subtitles from over 8,000 films), as I have empirically found this to be a useful resource.

    4.  Etymology

    A word's origins and historical formations may contribute to its complexity, as meaning may be inferred from common roots.  For example, the Latin word 'sanctus' (meaning holy) is at the etymological root of both the English words 'saint' and 'sanctified'.  If the meaning of one of these words is known, then the meaning of the other may be inferred on the basis of their 'sounds-like' relationship.

    In the above example, 'sit' is of Proto-Germanic, Anglo-Saxon origin whereas 'repose' is of Latin origin.  Words of Latin and Greek origin are often associated with higher complexity.  This is due to a mixture of factors including the widespread influence of the Romans and the use of Latin as an academic language.

    To date, I have seen no lexical complexity measures that take into account a word's etymology.

    5.  Ambiguity

    Certain words have a high degree of ambiguity.  For example, the word 'bow' has a different meaning in each of the following sentences:

    The actors took a bow.
    The bow-legged boy stood up.
    I hit a bull's eye with my new carbon fibre bow.
    The girl wore a bow in her hair.
    They stood at the bow of the boat.

    A reader must discern the correct interpretation from the context around a word. This can be measured empirically by looking at the number of dictionary definitions given for a word.  According to my dictionary, sit has 6 forms as a noun and a further 2 as a verb, whereas repose has 1 form as a noun and 2 forms as a verb.  Interestingly, sit is more complex by this measure. 

    6.  Context


    There is some evidence to show that context also affects complexity.  For example, take the following sentences:

    "The rain in Spain falls mainly on the ______"
    "Why did the chicken cross the ______"
    "To be or not to ___"
    "The cat ____ on the mat"

    In each of these sentences, you can easily guess the blank word (or, failing that, use Google's autocomplete feature).  If we placed an unexpected word in the blank slot, then the sentence would require more effort from the reader.  Words in familiar contexts are simpler than words in unfamiliar contexts.  This indicates that a word's complexity is not a static notion, but is influenced by the words around it.  It can be modelled using n-gram frequencies, checking how likely a word is to co-occur with the words around it.
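    As a minimal sketch of that idea (with a toy corpus standing in for a resource like Google n-grams), we can estimate how expected a word is given the word that precedes it:

```python
from collections import Counter

# Context-sensitive complexity sketch: estimate how expected a word is
# from bigram counts.  The toy corpus stands in for Google n-grams.
corpus = "the cat sat on the mat . the cat sat by the fire .".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def cooccurrence(prev, word):
    # P(word | prev), estimated by maximum likelihood from the counts.
    return bigrams[(prev, word)] / unigrams[prev]

print(cooccurrence("cat", "sat"))      # 1.0 - fully expected in context
print(cooccurrence("cat", "reposed"))  # 0.0 - unexpected in context
```

    A word with a low co-occurrence score relative to its context would, on this view, add to the reader's effort even if the word itself is common.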

    Summary

    So, if we put those factors into a table it looks something like this:

    Factor                   "sat"           "repose"
    Length (characters)      3               6
    Length (syllables)       1               2
    Familiarity (frequency)  3383            29
    Morphology (morphemes)   1               2
    Etymology (origins)      Proto-Germanic  Latin
    Ambiguity (senses)       8               3
    Context* (frequency)     6.976           0.112
    *source: Google n-grams value for query "the cat ____".  Value is percentage occurrence and is multiplied by a factor of 10^7

    We see that repose is more difficult in every respect except for the number of senses.

    Lexical complexity is a hard concept to work with: it is often subjective and shifts from sense to sense and context to context.  Any research into determining lexical complexity values must take into account the factors outlined here.  The most recent work on determining lexical complexity is the SemEval 2012 task on lexical simplification, referenced below for further reading.

    L. Specia, S. K. Jauhar, and R. Mihalcea. SemEval-2012 Task 1: English lexical simplification. In First Joint Conference on Lexical and Computational Semantics, 2012.

    Wednesday, May 08, 2013

    Lexical Simplification: Background

    My PhD work is concerned with a process called lexical simplification.  I'm interested in how natural language processing can be applied to make documents easier to read for everyday users.  Lexical simplification specifically addresses the barriers to understandability provided by the difficult words in a text.

    For example, take the following sentence:

    The workers acquiesced to their boss' request.

    It is fairly clear that the rarely used verb 'acquiesce' is going to cause understandability issues here.  Do you know what it means?  Maybe in a wider context you could guess the meaning; here, however, it is fairly difficult to work out.  Lexical simplification deals with sentences such as the one above and attempts to process them into more understandable forms.  There are several stages to the lexical simplification pipeline.  I intend to devote an entire post to each of these as I continue; for now, an overview of each should suffice.

    The first stage in any lexical simplification system is complex word identification.  There are two main approaches to this.  In the first, a system attempts to simplify every word: those for which simplifications can be found are transformed, and those which cannot be transformed are left alone.  In the second, some form of thresholding is applied.  There are various measures of lexical complexity, which often rely heavily on word frequency.  A threshold may be applied to one of these measures to distinguish between complex and simple words.  One of the major issues in this field is the lack of evaluation resources.  I have a paper on this topic accepted at the ACL Student Session 2013, so will write more at that time.

    Suppose that we can obtain lexical complexity values such as:

    worker: 300
    acquiesce:    5
    boss: 250
    request: 450

    If we also assume that our threshold (which is set on some training data) lies somewhere between 5 and 250, then we have an indicator that 'acquiesce' is a difficult word.
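    Putting those assumptions together, threshold-based complex word identification reduces to a simple filter.  The threshold of 100 below is an illustrative value, not one tuned on real training data:

```python
# Threshold-based complex word identification, using the toy
# complexity values from the text (higher value = more familiar).
complexity = {"worker": 300, "acquiesce": 5, "boss": 250, "request": 450}
THRESHOLD = 100  # illustrative; in practice tuned on training data

def complex_words(sentence):
    # Words below the threshold (including unseen words, scored 0)
    # are flagged as complex.
    return [w for w in sentence if complexity.get(w, 0) < THRESHOLD]

print(complex_words(["worker", "acquiesce", "boss", "request"]))
# ['acquiesce']
```

    Treating unseen words as complex is a design choice: a word the system has never observed is, by the familiarity argument above, likely to be unfamiliar to readers too.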

    The next step, once we have identified this complex word, is to generate a set of synonyms which could replace it.  This is typically done with a thesaurus such as WordNet, which could give us some of the following replacements for acquiesce.

    Acquiesce: accept, accommodate, adapt, agree, allow, cave in, comply, concur, conform, consent, give in, okay, submit, yield
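    The generation step can be sketched as a simple lookup.  A real system would query WordNet (for instance through NLTK's wordnet corpus reader); here a hand-built thesaurus entry keeps the example self-contained:

```python
# Substitution generation sketch.  This hand-built thesaurus stands in
# for a real lexical resource such as WordNet.
thesaurus = {
    "acquiesce": ["accept", "accommodate", "adapt", "agree", "allow",
                  "cave in", "comply", "concur", "conform", "consent",
                  "give in", "okay", "submit", "yield"],
}

def candidates(word):
    # Words with no thesaurus entry yield no candidates and are
    # left unsimplified (the first approach described above).
    return thesaurus.get(word, [])

print(candidates("acquiesce")[:3])  # ['accept', 'accommodate', 'adapt']
print(candidates("boat"))           # []
```

    Note that WordNet would group these candidates by sense; flattening them into one list, as here, is exactly what makes the disambiguation step below necessary.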

    We must then process these to discover which will be valid replacements in the given context.  This third step is called word sense disambiguation.  It is necessary because a word will typically have several senses, so some replacements will only be valid in certain contexts.  In the above example, a word sense disambiguation step may produce something like the following:

    Acquiesce: valid - accept, agree, cave in, comply, conform, give in, submit, yield; invalid - accommodate, adapt, allow, concur, consent, okay

    The valid candidates are those which could replace 'acquiesce' in this context; the invalid candidates could not.  This judgement is somewhat subjective, and word sense disambiguation remains an unsolved task in NLP.
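    One classic knowledge-based approach is Lesk-style disambiguation: score each candidate by the word overlap between its dictionary definition and the sentence context.  The definitions below are hand-written stand-ins for a real sense inventory, and a serious implementation would also filter stopwords like 'to':

```python
# Simplified Lesk-style scoring: count shared words between a
# candidate's definition and the sentence context.  The definitions
# here are illustrative stand-ins, not real dictionary glosses.
definitions = {
    "accept": "agree to a request or proposal",
    "adapt": "change something to suit new conditions",
    "cave in": "give way and agree to a request",
}

def overlap(candidate, context):
    gloss = set(definitions.get(candidate, "").split())
    return len(gloss & set(context.lower().split()))

context = "the workers agreed to their boss request"
for c in definitions:
    print(c, overlap(c, context))  # accept 2, adapt 1, cave in 2
```

    Candidates whose definitions share nothing with the context would be discarded; with stopword filtering and real glosses the scores separate the senses far more sharply than this sketch suggests.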

    The final step is to rank the resulting replacements in order of their simplicity.  The simplest will then replace the original word.  To do this, we revisit our measure of lexical complexity from before.  For example, if we have the following values for the remaining candidates:

    accept:  550
    agree:   450
    cave in: 250
    comply:   35
    conform:  50
    give in: 350
    submit:   40
    yield:    20

    Then we would choose 'accept' as our replacement, giving the simplified sentence:

    The workers accepted their boss' request.

    Which is a much more understandable sentence.
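    The ranking step above reduces to taking the candidate with the highest simplicity score:

```python
# Rank the surviving candidates by their simplicity scores
# (higher value = simpler, following the values given above).
scores = {"accept": 550, "agree": 450, "cave in": 250, "comply": 35,
          "conform": 50, "give in": 350, "submit": 40, "yield": 20}

ranking = sorted(scores, key=scores.get, reverse=True)
best = ranking[0]
print(ranking[:3])  # ['accept', 'agree', 'give in']
print(best)         # accept
```

    A real system must then re-inflect the chosen word ('acquiesced' becomes 'accepted') and adjust the surrounding syntax (here, dropping the 'to' that 'acquiesce' requires but 'accept' does not), details glossed over in this sketch.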

    There are of course some nuances of the original meaning that are lost in this simplification; however, this is a trade-off that has to be accepted.  The understandability of the sentence is dramatically increased.

    My project is currently focusing on each of these stages individually.  The hypothesis is that by examining and optimising each stage in turn, it will be possible to improve the final simplification.  Work has already taken place on the first stages mentioned above, and work will continue on the rest.

    There is much more to lexical simplification than the basic outline presented above and readers wishing to know more should look to read the following publications:


    Siobhan Devlin and John Tait. 1998. The use of a psycholinguistic database in the simplification of text for aphasic readers. Linguistic Databases, pages 161–173.


    Or Biran, Samuel Brody, and Noémie Elhadad. 2011. Putting it simply: a context-aware approach to lexical simplification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2, HLT '11, pages 496–501, Stroudsburg, PA, USA. Association for Computational Linguistics.


    Stefan Bott, Luz Rello, Biljana Drndarevic, and Horacio Saggion. 2012. Can Spanish be simpler? LexSiS: Lexical simplification for Spanish. In COLING, pages 357–374.


    S. M. Aluísio and C. Gasperin. Fostering digital inclusion and accessibility: the PorSimples project for simplification of Portuguese texts. In Proceedings of the NAACL HLT 2010 Young Investigators Workshop on Computational Approaches to Languages of the Americas, YIWCALA '10, pages 46–53, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.

    L. Feng. Text simplification: A survey, 2008.

    L. Specia, S. K. Jauhar, and R. Mihalcea. SemEval-2012 Task 1: English lexical simplification. In First Joint Conference on Lexical and Computational Semantics, 2012.

    Monday, May 06, 2013

    Me

    I thought I would put up some information about myself. So here goes:

    Name: Matthew Shardlow

    Location: Manchester, United Kingdom

    Occupation: PhD student, Graduate Teaching Assistant

    Employer: University of Manchester

    Supervisor: John McNaught

    Co-Supervisor: Simon Harper

    Project Title: Lexical Simplification

    PhD Track: 4-year centre for doctoral training

    Finish date: September 2015

    Funding body: EPSRC grant no. EP/I028099/1

    Project Description: Making difficult language easier to read by detecting and translating complex vocabulary into easy words.

    Research Interests:
    • Text simplification
    • Complex word identification
    • Substitution generation
    • Word sense disambiguation
    • Lexical complexity
    • Large scale corpus linguistics
    • The use of Wikipedia as a corpus

    About

    The simplification of the lexicon is an important task at the boundary between natural language generation and assistive technology.  It concerns the automatic replacement of complex wordforms with more easily accessible alternatives.  Complexity is of course subjective and can be interpreted differently depending upon the text and the reader.  At its broadest, a complex word may be defined as: 'any word which reduces the reader's overall understanding of the text'.

    This blog details the project outcomes of my PhD in Lexical Simplification (LS).  It will serve both as an archive for previous work and as a platform for the promotion of ongoing research.  I also intend to publish interesting datasets that I create during the course of my research.

    I intend to post here regularly, but not too often.  Hopefully like-minded researchers will find the content on here of interest.  If you do use any of the data published here or are inspired by the ideas promoted please drop me a line to encourage me!

    Matt.