Wednesday, October 29, 2014

Why Lexical?

In this post I will put forward my arguments for picking lexical simplification as a PhD topic.  I'm certainly going to face that dangerous 'why' question when writing up my thesis, not to mention in the viva. I figure that if I commit my reasoning to page now, then I can come back to it at a later date. To convince myself, if not others.

The question of 'why lexical?' needs addressing at two levels.  Firstly, why choose simplification given the wide scope of natural language processing applications to which I could have committed four years of my life?  Secondly, why lexical simplification given the wide range of simplification research out there?  I'll answer these in turn.

Why Simplification?


Let me start with a confession. I didn't start a PhD to study lexical simplification.  I didn't even come with natural language processing in mind.  About four years ago I was offered an opportunity to study at Manchester as part of the newly founded Centre for Doctoral Training (CDT).  One of the aspects of the CDT programme which appealed to me was that students were not initially tied to a supervisor or research group. We were, in effect, free for our first six months to decide upon a group, supervisor and topic.

During the first six months, I took three master's modules: two on machine learning and one on digital biology.  Both are fascinating fields, and both offer ample opportunities for a PhD.  I spoke to the relevant people about these opportunities, but I just couldn't find something that inspired me.  Resigned to the fact that I might have to choose a topic and wait for the inspiration to come, I started to focus on the machine learning research.

We had weekly seminars to acquaint us with the research in the school.  Each seminar was given by a different research group.  They varied in style and form - from the professor who brought in props to talk us through the history of hard drives to the professor who promised free cake every time their group had a major publication.  One week, it was the turn of the text mining group.  During the presentation, the idea was raised that text mining could be used to make difficult documents easier to understand.  I took the bait, started reading papers, emailed my prospective supervisor and from there I was away.  The spark of inspiration drove me to develop my first experiments, which led to a literature review and an initial study on complex word identification.

Simplification is a great field to be working in.  I like that the research I'm doing could improve a user's quality of life. OK, the technology isn't quite there yet, but the point of research is to reach for the unattained. To do something that hasn't been done before. There are lots of opportunities in simplification, and lots of unexplored avenues.  I'm going to write a long future work section in my thesis (maybe a future post?) because there is a lot to say.

Why Lexical?


When I started looking into simplification back in 2011, there were three main avenues that I could discern from the literature.  Firstly, syntactic simplification: automated methods for splitting sentences, dealing with wh- phrases and the passive voice.  Secondly, lexical simplification: the automatic conversion of complex vocabulary into easy-to-understand words.  Finally, semantic simplification: taking the meaning into account and doing some processing based on this.  I grouped both lexical elaboration and statistical machine translation under this last category, although I would probably now put elaboration under the lexical category.

I felt that my research would be best placed if it fell under one of those categories.  Although I could see viable research options in all of them, my background in machine learning and data mining was best suited to lexical simplification.  This was also around the time that the SemEval 2012 lexical simplification task was announced.  That task gave me an initial dataset, and my background in AI gave me a set of techniques to apply.  From there I ran some initial experiments, implemented a basic system and started learning the complexities of the discipline.

It's strange to think that I'm writing about a decision I made several years ago.  I've tried to be as honest as my memory allows in this post.  The PhD has not been a particularly easy road, but it has been a good one so far.

Thursday, July 24, 2014

The Choices of a Simplification System

There are many choices to make when building a lexical simplification (LS) system. In my experience there are three big decisions to take: the target audience; the source documents; and the mode of presentation. Let's look at each of these in detail.

Target Audience


Firstly, you need a well-defined group to aim your simplification at.  This group should have a clear style of language; documents written specifically for the group may be useful here.  The group's members should all require a similar form of simplification, otherwise you will end up writing several simplification systems.  The group shouldn't be too narrowly defined (e.g. Deaf children in the age range 8-11 with a below-average reading age), as this will make it difficult to find test subjects.  It also shouldn't be too broadly defined, otherwise different simplification needs may be present.

Once you have a group to simplify documents for, you're ready to consider the next step.


Source Material


You must decide what type of text to simplify.  It's easy to assume that text is text and you can just build a general model, but in fact different genres have their own peculiarities and jargon.  Consider the difference between a set of news articles, Wikipedia entries and social media posts: each will be significantly different in composition from the others.  Of course, the text genre should be one which the target audience wants you to simplify!  That's why this step comes after selecting the target group.  It's also important at this point to check whether anybody else has already tried to simplify the type of documents that you're working with.


Mode of Presentation


There are, roughly speaking, three ways of presenting the simplifications to the end user.  Which one you choose depends upon factors such as the text itself, the requirements of the user and the reason behind the simplification. Each has advantages and disadvantages as outlined below:

Fully Automated


Substitutions are applied directly to the text where possible.  The user never sees the simplifications being made and so does not need to know that they are reading a simplified document.

Advantages:
  • User is presented with a simple document
  • Requires minimal work from the author / user
  • Can be performed on the fly - e.g. whilst browsing the web / reading e-books / etc.
Disadvantages:
  • Errors in the simplification process cannot be recovered from
  • Simplification may alter the meaning away from what the original author intended
  • Some words may not be simplified - leaving difficult terms to be seen by the user
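
As a rough illustration of the fully automated mode, here is a minimal Python sketch that applies word-for-word substitutions directly to a text.  The substitution dictionary is invented purely for illustration; a real system would also need to handle word sense, inflection and the surrounding context.

    import re

    # Toy substitution dictionary; a real system would derive candidates from
    # a thesaurus or corpus and check that each one fits the surrounding context.
    SUBSTITUTIONS = {
        "utilise": "use",
        "commence": "start",
        "purchase": "buy",
    }

    def simplify(text):
        """Replace complex words with simpler ones wherever a substitution is known."""
        def replace(match):
            word = match.group(0)
            simple = SUBSTITUTIONS.get(word.lower())
            if simple is None:
                return word  # no substitution known - the difficult word stays
            # Preserve the capitalisation of the original word.
            return simple.capitalize() if word[0].isupper() else simple
        return re.sub(r"[A-Za-z]+", replace, text)

    print(simplify("Commence the meal once you purchase the ingredients."))
    # -> Start the meal once you buy the ingredients.

Note how any word without an entry in the dictionary is passed through untouched, which is exactly the third disadvantage listed above.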

Author Aid


The simplifications are presented to the author of the document, who chooses when and how to apply the simplifications.  In this case, the simplification acts similarly to a spell checker.

Advantages:
  • Author can make simplifications which retain their original intention
  • No chance of grammar mistakes - as the author has the final say on which words to use
Disadvantages:
  • Work must be simplified before being published
  • No guarantee that the author will apply the suggested simplifications

On Demand


The user has the option to look up simplifications if they find a word too difficult.  These simplifications are typically presented somewhere nearby on the screen.  For example, if a word is clicked on, the simplification may appear in a pop up box or in a dedicated area outside of the body of text.

Advantages:
  • User gets to see the original text
  • Helps language learning / recovery as the user can decide when they require simplification
Disadvantages:
  • User may struggle if a text or portion of a text has a high density of difficult words
  • The user may be distracted by the simplifications, which divert their attention away from the text

Tuesday, June 03, 2014

LREC 2014 - Post Blog

Looking back on LREC, I think I went with real apprehension. My only previous conference experience was ACL 2013 in Sofia.  I had a great time there, so my feelings towards LREC were a mixture of excitement that it might be similar and worry that it might not be as good.  I'm glad to report that it lived up to and even exceeded my expectations.  The sessions were interesting;  the posters engaging;  the networking events were even fun;  the attendees were approachable and people even engaged with my talk, asking questions and talking to me afterwards.

It was great to meet many of the people whom I have previously referenced.  Many of those who wrote the papers in the lexical simplification list were in attendance, and I made it my business to approach them and introduce myself.  It's a wonderful experience to meet the people who have written such great papers and made a real contribution to the field.

For those that might find it useful, I've collected two lists below.  The first is a list of papers which had something to do with readability / simplification.  The second is a list of papers which I found interesting for one reason or another.  The simplification papers will be making their way to the lexical simplification list soon.

Improvements to Dependency Parsing Using Automatic Simplification of Data. T Jelínek PDF

Measuring Readability of Polish Texts: Baseline Experiments. B Broda, B Nitoń, W Gruszczyński and M Ogrodniczuk PDF

Can Numerical Expressions Be Simpler? Implementation and Demonstration of a Numerical Simplification System for Spanish. S Bautista and H Saggion PDF

Text Readability and Word Distribution in Japanese. Satoshi Sato PDF

And finally, my paper:
Out in the Open: Finding and Categorising Errors in the Lexical Simplification Pipeline. M Shardlow PDF

Some other papers that I found interesting were:

Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines. M Sabou, K Bontcheva, L Derczynski and A Scharl PDF

Identification of Multiword Expressions in the brWaC. P Anick, M Verhagen and J Pustejovsky PDF

Creating a Massively Parallel Bible Corpus. T Mayer and M Cysouw PDF

Tuesday, May 27, 2014

LREC 2014

I write this post from my hotel in Reykjavik, ready to get started at LREC tomorrow.  I'm looking forward to another conference experience.  I don't think this city could contrast more with Sofia, Bulgaria, the setting of my previous conference.  I'm also particularly excited as I am presenting in the main conference this year.  It is a really great opportunity to share my research and I'm looking forward to it.

I haven't completed my presentation yet.  But when I do, the slides will be up here. I am presenting on Thursday morning, so still time to put the finishing touches to my slides. I will be presenting on an error study of the lexical simplification pipeline which I produced a short while back.  It's an interesting little side experiment which developed into some useful results.  Effectively I've been able to show that the areas in the field which are least addressed are the areas which cause the most problems.  A game changer - if people take notice.  I'm really looking forward to seeing how the work is received.  I imagine there will be a mixture of views on the work.

In terms of my PhD project, this work will feed directly into the first chapter of my thesis.  It allows me to motivate work on mitigating errors throughout the lexical simplification pipeline.  I have recently come to what I think is a final draft of my thesis structure.  But I shall save that for another post.


Friday, March 07, 2014

Accepted Papers

It has been a good start to 2014, with two papers accepted so far.

The first is a conference paper at a large conference called LREC 2014.  This will take place in Reykjavik, Iceland at the end of May this year.  The paper details some experiments that I performed last year, looking at the distribution of errors in the lexical simplification pipeline.  The hope is that if I can show the types of errors and where they're coming from, then I can motivate my research into mitigating those errors.  It will certainly make a nice introduction to my thesis.  The paper has been accepted for a main track oral presentation.  This means that I'll be presenting alongside established researchers in the main event of the conference.  It's certainly a mixture of very exciting and somewhat terrifying!

The second paper is published in a special issue of an open access journal (open access means that the publisher allows anybody to read it for free).  I was unsure at first whether to publish with them, but a number of factors led me to do so.  The danger of publishing with open access journals, particularly relatively unknown ones, is that the quality of the reviewing and the impact of the journal can be very low.  In the worst case you can end up paying thousands of pounds for somebody to essentially publish your paper on their blog.

The paper I have submitted is a survey paper.  I wrote it almost two years ago and have been updating it ever since; it is effectively the background chapter of my thesis.  I originally attempted to submit it to a few prestigious journals, but it was rejected for a few different reasons - survey papers from relatively unknown authors are tough to get published.  So, off the back of rejections and updates, it seemed sensible to submit to a less prestigious journal.  I saw the call for the special issue and thought I'd write my paper up for it. The organisation behind the journal is fairly new (founded in 2010), but they do seem to tick the right sorts of boxes: they have an ISSN, they produce print versions of the journal, they don't charge an extortionate fee, and past issues seem well populated with sensible research.  The survey paper is well suited to this type of publication, as it will appeal to a broad audience and will therefore benefit from open access publishing.


Links (and maybe some more information) will be put up when I have the PDFs.

Tuesday, January 28, 2014

XKCD and Simplification

I have been an avid reader of the webcomic xkcd since my days as an undergrad.  If you've never heard of it, I would recommend you check it out; some of the strips are laugh-out-loud funny.  There are several comics that stick out as having a simplification theme.  I'm going to use this post to look at those comics through the lens of automatic simplification.  I'll try to explain what we can do with the current technology and what we just plain can't.

Simple

Particle accelerators are complex beasts.  I can empathise with the character who has read so much Simple Wikipedia that he can only talk in that way now.  One of the techniques we use in simplification is language modelling.  A statistical model is trained over example sentences and can then score new sentences according to how likely they are to have been produced in that language.  So, for example, "I went to the bank" might receive a higher score than "I to the banking place did go", as the latter sentence is poorly formed.  An interesting property of language models is that the scores they give depend heavily on the sentences used to train them.  So if you train a model on the text of the English Wikipedia, it will favour the difficult-to-understand language found there.  If you train a model on the text of Simple Wikipedia, it will favour very simple-sounding language, just like the second character in this comic.  A great paper which explains this further (without the xkcd references) is Kauchak (2013) (see the lexical simplification list).
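
To make this concrete, here is a minimal sketch of a bigram language model with add-one smoothing, written in plain Python.  The toy training sentences are invented for illustration; a real model would be trained on a large corpus such as English Wikipedia or Simple Wikipedia.

    import math
    from collections import Counter
    from itertools import chain

    # Toy training data - a real model would be trained on millions of sentences.
    train_sentences = [
        "i went to the bank".split(),
        "i went to the shop".split(),
        "the bank was closed".split(),
    ]

    BOS, EOS = "<s>", "</s>"
    padded = [[BOS] + sent + [EOS] for sent in train_sentences]

    unigram_counts = Counter(chain.from_iterable(padded))
    bigram_counts = Counter(
        (w1, w2) for sent in padded for w1, w2 in zip(sent, sent[1:])
    )
    vocab_size = len(unigram_counts)

    def log_prob(sentence):
        """Log-probability of a sentence under the bigram model (add-one smoothed)."""
        words = [BOS] + sentence.split() + [EOS]
        score = 0.0
        for w1, w2 in zip(words, words[1:]):
            # Add-one smoothing gives unseen bigrams a small, non-zero probability.
            score += math.log((bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + vocab_size))
        return score

    # The fluent sentence scores higher (less negative) than the garbled one.
    print(log_prob("i went to the bank"))
    print(log_prob("i to the banking place did go"))

Train the same kind of model on Simple Wikipedia sentences instead and it will hand out its highest scores to simple-sounding language, which is exactly the effect the comic plays on.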

Up Goer Five

This next one is too long to put in this post - but it's worth a read.  The full comic is here: Up Goer Five (Or, right click on the image and open it in a new tab to view).

The comic presents a simplified blueprint of the Saturn V rocket.  The vocabulary has been restricted to only the thousand most common words in the English language.  There is some question as to where the statistics for the 'thousand most common words' came from.  If they were taken from NASA's technical rocket manuals then very little change may have been needed!  We'll assume they were taken from some comprehensive resource.  The best way of determining this with currently available resources would be to use the top-ranked words in the Google Web1T corpus (Google counted a trillion words of web text and recorded how often each one occurred).
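
As a rough sketch of how such a restriction could be checked automatically, the snippet below loads the top n words from a frequency list and flags every word in a text that falls outside that vocabulary.  The file name and its tab-separated word/count format are hypothetical; in practice the counts could come from a resource such as the Google Web1T corpus.

    import re

    def load_top_words(path, n=1000):
        """Read a hypothetical 'word<TAB>count' file, sorted by count, keeping the top n words."""
        top = set()
        with open(path, encoding="utf-8") as f:
            for line in f:
                word, _count = line.rstrip("\n").split("\t")
                top.add(word.lower())
                if len(top) >= n:
                    break
        return top

    def disallowed_words(text, allowed):
        """Return the words in the text that fall outside the allowed vocabulary."""
        tokens = re.findall(r"[a-z]+", text.lower())
        return sorted({t for t in tokens if t not in allowed})

    # Usage (the file path is illustrative):
    # allowed = load_top_words("word_counts.tsv", n=1000)
    # print(disallowed_words("This end should point toward the ground", allowed))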

The style of translation in this comic is phenomenally difficult to achieve, even for humans.  You can try it for yourself at The Up-Goer Five text editor.  Most words have been replaced with simpler phrases or explanations.  Some technical terms rely on outside knowledge, which actually has the effect of making the sentence more difficult to understand.  For example, one label reads: "This is full of that stuff they burned in lights before houses had power".  This refers to kerosene, which is highly understandable if you already know of kerosene but impenetrable if not.

It would be an interesting experiment to determine the lowest number of words required to produce this kind of simplification without having to draw on inferred knowledge (such as the type of fuel lights once burned).  My guess is that you would need 10,000-20,000 words before this became a reality.  It would be difficult to automatically produce text at this level of simplicity: explaining a concept requires a really deep understanding and background knowledge, which is difficult to emulate with a machine.

Winter


The above comic touches on an excellent point.  If the words we use are understandable, does it matter if they're not the correct words? Previously, I have written about lexical complexity, noting that many factors affect how difficult we find a word.  The big factor that is played on here is context.  For example, the term 'handcoats' in the second panel is understandable (as gloves) because we know from the first panel that 'the sky is cold'.  Handcoats is a word that you've probably never seen before, and out of context it would be difficult to work out its meaning.  This highlights the importance of selecting words which fit the context of a sentence: if a simple word that fits the surrounding context is chosen, the understandability of the sentence increases dramatically.
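
The same language modelling idea from the Simple comic above gives one rough way to measure contextual fit.  The sketch below ranks candidate substitutions for a slot in a sentence by how often they co-occur with their immediate neighbours; the bigram counts are invented for illustration and would come from a large corpus (or a trained language model) in practice.

    from collections import Counter

    # Toy bigram counts - in practice these would come from a large corpus,
    # or from a language model like the one sketched earlier.
    BIGRAM_COUNTS = Counter({
        ("sky", "is"): 40, ("is", "cold"): 30, ("is", "chilly"): 8,
        ("is", "frigid"): 1, ("cold", "today"): 12,
        ("chilly", "today"): 3, ("frigid", "today"): 1,
    })

    def context_score(left, candidate, right):
        """Score how well a candidate word fits between its left and right neighbours."""
        return (BIGRAM_COUNTS[(left, candidate)] + 1) * (BIGRAM_COUNTS[(candidate, right)] + 1)

    def rank_candidates(left, right, candidates):
        return sorted(candidates, key=lambda c: context_score(left, c, right), reverse=True)

    # Which substitute best fits "the sky is ___ today"?
    print(rank_candidates("is", "today", ["cold", "chilly", "frigid"]))
    # -> ['cold', 'chilly', 'frigid']

A fuller system would score whole sentences rather than single word pairs, but the principle is the same: the better a candidate fits its surrounding context, the easier it tends to be to understand, which is the point the comic makes.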