Monday, June 08, 2015

An Excellent Experimental Framework

Recently I have been watching a lecture series on Deep Learning for NLP.  It's a topic that I have long been interested in, and I'm learning a lot as I go along.  It has been a welcome relief from thesis writing.

In the first 15 minutes of the fifth lecture, Richard Socher outlines the steps that his students should take for their class projects.  I wanted to reproduce them here as I think they are really useful for anybody running machine learning experiments in natural language processing.  Although the steps are aimed at a class project in a university course, I think they are applicable to anybody starting out in NLP, and a helpful reminder to those who are established.  I had to figure these steps out myself as I went along, so it is very encouraging to see them being taught to students.  The eight steps (with my notes) are as follows; or, if you'd prefer, scroll to the bottom of the page and watch the first fifteen minutes of the video there.

Step 1 - Define Task

Before you make a start, you need to know what your task is going to be.  Read the literature.  Read the literature's background literature.  Email researchers whose work you find interesting and ask them about it.  Skype them if they'll let you.  Make sure you can state your task clearly.

Step 2 - Define Dataset

There are a lot of ready-made datasets that you can go and grab.  You'll already know what these are if you've read the literature thoroughly.  If there is no suitable dataset for your task, then you will need to build one.  This is a whole other area of work and can lead to an entire paper in itself.

Step 3 - Define Your Metric

How are you going to evaluate your results, and is that the right way to evaluate them?  Often your dataset will naturally lead you to one evaluation metric.  If not, look at what other people are using.  Try to understand a metric before using it; this will speed up your analysis later.
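
To make this concrete, here is a small sketch of my own (not from the lecture) showing how precision, recall and F1 might be computed for a binary task such as complex word identification.  It assumes scikit-learn is available, and the labels are invented purely for illustration.

```python
# A toy illustration of computing standard metrics with scikit-learn.
# The labels are invented; in practice they would come from your test set.
from sklearn.metrics import precision_score, recall_score, f1_score

gold = [1, 0, 1, 1, 0, 0, 1, 0]       # 1 = complex word, 0 = simple word
predicted = [1, 0, 0, 1, 0, 1, 1, 0]  # output of some hypothetical classifier

print("Precision: %.2f" % precision_score(gold, predicted))
print("Recall:    %.2f" % recall_score(gold, predicted))
print("F1:        %.2f" % f1_score(gold, predicted))
```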

Step 4 - Split Your Dataset

Training, validation, testing: make sure you have these three partitions.  If you're feeling really confident, don't look at the testing data until you absolutely have to, then run your algorithm on it as few times as possible.  Once should be enough.  The training set is the meat and bones of your algorithm's learning, and the validation set is for fiddling with parameters.  But be careful: the more you fiddle with parameters, the more likely you are to overfit to your validation set.
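
For illustration, here is one way to produce the three partitions with scikit-learn's train_test_split; the 80/10/10 ratio and the toy data are my own assumptions, not anything prescribed in the video.

```python
# Split a dataset into train / validation / test partitions (80/10/10 here).
from sklearn.model_selection import train_test_split

# Toy data: in practice X holds your feature vectors and y your labels.
X = [[i] for i in range(100)]
y = [i % 2 for i in range(100)]

# Hold out 20% of the data, then split that 20% in half for validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 80 10 10
# Fit on the training set, tune parameters on the validation set,
# and touch the test set as rarely as possible.
```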

Step 5 - Establish a Baseline

What would be a sensible baseline?  Random decisions?  Common-sense judgements?  Always predicting the majority class?  A competitor's algorithm?  Think carefully about this.  How will your model be different from the baseline?
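
As a sketch of the majority-class option (my example, with made-up data), scikit-learn's DummyClassifier gives you a majority-class or random baseline in a couple of lines.

```python
# A majority-class baseline using scikit-learn's DummyClassifier.
from sklearn.dummy import DummyClassifier

# Made-up training data: class 0 is three times as common as class 1.
X_train = [[0], [1], [2], [3], [4], [5], [6], [7]]
y_train = [0, 0, 0, 0, 0, 0, 1, 1]

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

# Every prediction is the majority class (0 here).
print(baseline.predict([[8], [9], [10]]))
# A random baseline is one change away: DummyClassifier(strategy="uniform").
```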

Step 6 - Implement an Existing (Neural Net) Model

Neural Net is in brackets because it is specific to the course in the video.  Go find an interesting ML model and apply it.  I have started to use WEKA for all my ML experiments and I find it really efficient.
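
WEKA is a Java toolkit, so rather than guess at its API here, this is roughly the same exercise sketched in Python with scikit-learn (my substitution, not the course's code): take an off-the-shelf model, run it under cross-validation and look at the numbers.  The iris dataset stands in for whatever data your task actually uses.

```python
# Apply an existing, off-the-shelf model under 10-fold cross-validation.
# A decision tree stands in for a WEKA classifier such as J48.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in data; swap in your own features and labels
model = DecisionTreeClassifier(random_state=0)

scores = cross_val_score(model, X, y, cv=10)
print("Mean accuracy over 10 folds: %.3f" % scores.mean())
```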

Step 7 - Always Be Close to Your Data

Go and look at your data.  If you're tagging something, look at what's been tagged.  If you're generating something, read the resulting text.  Analyse the data.  Make metadata.  Analyse the metadata.  Repeat.  Where do things go wrong?  If you have a pipeline, are some modules better than others?  Are errors propagating through?
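
One concrete habit (my own sketch, with invented data) is to dump every misclassified example to the screen and read it, then compute simple metadata over the mistakes, such as their length.

```python
# Error analysis: print every misclassified example, then look at simple metadata.
# The texts, gold labels and predictions below are invented for illustration.
texts = ["the cat sat", "stocks fell sharply today", "the dog barked", "markets rallied"]
gold = ["animals", "finance", "animals", "finance"]
predicted = ["animals", "animals", "animals", "finance"]  # pretend model output

mistakes = [(t, g, p) for t, g, p in zip(texts, gold, predicted) if g != p]
for text, g, p in mistakes:
    print(f"WRONG: predicted {p!r}, gold {g!r}: {text}")

# Simple metadata over the errors, e.g. are longer sentences harder?
if mistakes:
    avg_len = sum(len(t.split()) for t, _, _ in mistakes) / len(mistakes)
    print(f"Average length of misclassified examples: {avg_len:.1f} tokens")
```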

Step 8 - Try Different Models

Play around a bit.  Maybe another model will perform much better or worse than the first one you try.  If you analyse the differences in performance, you might start to realise something interesting about your task.  I would recommend setting up a framework for your experiments, so that you can quickly change things around and re-run with different models and parameters.
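
A minimal version of such a framework (my sketch, again using scikit-learn and stand-in data) keeps the data, splits and metric fixed and makes swapping the model a one-line change:

```python
# A minimal experimental framework: evaluate several models under identical
# conditions (same data, same folds, same metric) so results are comparable.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in data; use your own task's features

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "naive Bayes": GaussianNB(),
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)
    print("%-20s mean accuracy = %.3f" % (name, scores.mean()))
```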

Step 9 - Do Something New

OK, this isn't a step in the video per se, but it is the natural extension of the methodology.  Now that you have tried something, you can see where it was good and where it was bad.  Now is the time to figure out how to make it better.  Different models, different features and different ways of using features are all good places to start.


The video is below and also accessible at this link:
https://www.youtube.com/watch?v=I2TfdXfSOfc

If you are interested in the topic, then the course syllabus, with all videos and lecture notes, is here:
http://cs224d.stanford.edu/syllabus.html






Thursday, January 15, 2015

September 2015

Thirty-nine months ago, I started my PhD.  As I look back through logbooks, half-written papers and the accumulated file lint that clutters my desktop, I wonder where that time has gone.  But this is no time for reflection.  From here on out, my eyes are set firmly on September 2015.  They must be, if I hope to finish my PhD.

Over the coming nine months I hope to achieve all of the following goals:
  1. Finish off some experiments using machine learning to rank substitutions in different contexts. 
  2. Write up and submit this work to the main track at ACL-IJCNLP 2015.
  3. Design and implement an experiment looking at the effects of simplification for people with aphasia.
  4. Compose my thesis.
  5. Find a job.

All in all, I think that this is manageable.  I would really like to go to ACL this year, but my fate will be left in the hands of the reviewers.  I haven't formally started writing my thesis yet, but I do have lots of material written up, so I intend to start with that and work from there.  The aphasia experiments will be interesting, but I'll write more about them at a later point.  Lots to do, so best get started.

Wednesday, October 29, 2014

Why Lexical?

In this post I will put forward my arguments for picking lexical simplification as a PhD topic.  I'm certainly going to face that dangerous 'why' question when writing up my thesis, not to mention in the viva.  I figure that if I commit my reasoning to paper now, then I can come back to it at a later date, to convince myself, if not others.

The question of 'why lexical?' needs addressing at two levels.  Firstly, why choose simplification given the wide scope of natural language processing applications to which I could have committed four years of my life?  Secondly, why lexical simplification given the wide range of simplification research out there?  I'll answer these in turn.

Why Simplification?


Let me start with a confession. I didn't start a PhD to study lexical simplification.  I didn't even come with natural language processing in mind.  About four years ago I was offered an opportunity to study at Manchester on the newly founded Centre for Doctoral Training (CDT).  One of the aspects of the CDT programme which appealed to me was the fact that students were not initially tied to a supervisor or research group. We were in effect free for our first six months to decide upon a group, supervisor and topic.

During the first six months, I took three masters modules: two on machine learning and one on digital biology.  Both are fascinating fields, and both have ample opportunities for a PhD.  I spoke to the relevant people about these opportunities, but I just couldn't find something that inspired me.  Resigned to the fact that I might have to choose a topic and wait for the inspiration to come, I started to focus on the machine learning research.

We had weekly seminars to acquaint us with the research in the school.  Each seminar was given by a different research group.  They varied in style and form - from the professor who brought in props to talk us through the history of hard drives to the professor who promised free cake every time their group had a major publication.  One week, it was the turn of the text mining group.  During the presentation, the idea was raised that text mining could be used to make difficult documents easier to understand.  I took the bait, started reading papers, emailed my prospective supervisor and from there I was away.  The spark of inspiration drove me to develop my first experiments, which led to a literature review and an initial study on complex word identification.

Simplification is a great field to be working in.  I like that the research I'm doing could improve a user's quality of life.  OK, the technology isn't quite there yet, but the point of research is to reach out to the unattained, to do something that hasn't been done before.  There are lots of opportunities in simplification, and lots of unexplored avenues.  I'm going to write a long future work section in my thesis (maybe a future post?) because there is a lot to say.

Why Lexical?


When I started looking into simplification back in 2011, there were three main avenues that I could discern from the literature.  Firstly, syntactic simplification - automated methods for splitting sentences, dealing with wh-phrases and the passive voice.  Secondly, lexical simplification - the automatic conversion of complex vocabulary into easy-to-understand words.  Finally, semantic simplification - taking the meaning into account and doing some processing based on it.  I grouped both lexical elaboration and statistical machine translation under this category, although I would probably now put elaboration under the lexical category.

I felt that my research would be best placed if it fell under one of those categories.  Although I could see viable research options in all of them, my background in machine learning and data mining was best suited to lexical simplification.  This was also around the time that the SemEval 2012 lexical simplification task was announced.  The task gave me an initial dataset, and my background in AI gave me a set of techniques to apply.  From there I ran some initial experiments, implemented a basic system and started learning the complexities of the discipline.

It's strange to think that I'm writing about a decision I made several years ago.  I've tried to be as honest as I can remember in this post.  The PhD has not been a particularly easy road, but it has been a good one so far.

Thursday, July 24, 2014

The Choices of a Simplification System

There are many choices to make when building a lexical simplification (LS) system.  In my experience there are three big decisions to take: the target audience; the source documents; and the mode of presentation.  Let's look at each of these in detail.

Target Audience


Firstly, you need a well-defined group to aim your simplification at.  This group should have a clear style of language; documents written specifically for the group may be useful here.  Its members should all require a similar form of simplification, otherwise you will end up writing several simplification systems.  The group shouldn't be too narrowly defined (e.g. Deaf children in the age range 8-11 with a below-average reading age), as this will make it difficult to find test subjects.  It also shouldn't be too broadly defined, otherwise different simplification needs may be present.

Once you have a group to simplify documents for, you're ready to consider the next step.


Source Material


You must decide what type of text to simplify.  It's easy to assume that text is text and you can just build a general model, but in fact different genres have their own peculiarities and jargon.  Consider the difference between a set of news articles, Wikipedia entries and social media posts; each will be significantly different in composition from the last.  Of course, the text genre should be one which the target audience actually wants simplified!  That's why this step comes after selecting the target group.  It's also important at this point to check whether anybody else has already tried to simplify the type of documents that you're working with.


Mode of Presentation


There are, roughly speaking, three ways of presenting the simplifications to the end user.  Which one you choose depends upon factors such as the text itself, the requirements of the user and the reason behind the simplification. Each has advantages and disadvantages as outlined below:

Fully Automated


Substitutions are applied directly to the text where possible.  The user never sees the simplifications being made and so does not need to know that they are reading a simplified document.

Advantages:
  • User is presented with a simple document
  • Requires minimum work from author / user
  • Can be performed on the fly - e.g. whilst browsing the web / reading e-books / etc.
Disadvantages:
  • Errors in the simplification process cannot be recovered
  • Simplification may alter the meaning away from what the original author intended
  • Some words may not be simplified - leaving difficult terms to be seen by the user
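
To make the fully automated mode concrete, here is a deliberately naive sketch (my own, not a real system): a hand-written substitution dictionary applied directly to the text with whitespace tokenisation.  A real LS system would identify complex words, generate and rank candidates and check the surrounding context, but the presentation is the same in that the reader only ever sees the rewritten output.

```python
# A deliberately naive illustration of fully automated substitution:
# complex words are replaced in place and the reader only sees the result.
substitutions = {"utilise": "use", "commence": "begin", "terminate": "end"}

def simplify(text):
    words = text.split()  # naive whitespace tokenisation, no context checking
    return " ".join(substitutions.get(w.lower(), w) for w in words)

print(simplify("We will commence the meeting and utilise the new system"))
# -> We will begin the meeting and use the new system
```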

Author Aid


The simplifications are presented to the author of the document, who chooses when and how to apply the simplifications.  In this case, the simplification acts similarly to a spell checker.

Advantages:
  • Author can make simplifications which retain their original intention
  • No chance of grammar mistakes - the author has the final say over which words to use
Disadvantages:
  • Work must be simplified before being published
  • No guarantee that the author will apply the suggested simplifications

On Demand


The user has the option to look up simplifications if they find a word too difficult.  These simplifications are typically presented somewhere nearby on the screen.  For example, if a word is clicked on, the simplification may appear in a pop up box or in a dedicated area outside of the body of text.

Advantages:
  • User gets to see the original text
  • Helps language learning / recovery as the user can decide when they require simplification
Disadvantages:
  • User may struggle if a text or portion of a text has a high density of difficult words
  • The user may be distracted by the simplifications, which divert their attention away from the text

Tuesday, June 03, 2014

LREC 2014 - Post Blog

Looking back on LREC, I think I went with real apprehension.  My only previous conference experience was ACL 2013 in Sofia.  I had a great time there, so my feelings towards LREC were a mixture of excitement that it would be similar and worry that it might not be as good.  I'm glad to report that it lived up to and even exceeded my expectations.  The sessions were interesting; the posters engaging; even the networking events were fun; the attendees were approachable and people engaged with my talk, asking questions and talking to me afterwards.

It was great to meet many of the people whom I have previously referenced.  Many of those who wrote the papers in the lexical simplification list were in attendance, and I made it my business to approach them and introduce myself.  It's quite an experience to meet people who have written such good papers and made a real contribution to the field.

For those who might find it useful, I've collected two lists below.  The first is a list of papers which had something to do with readability or simplification.  The second is a list of papers which I found interesting for one reason or another.  The simplification papers will be making their way onto the lexical simplification list soon.

Improvements to Dependency Parsing Using Automatic Simplification of Data. T Jelínek PDF

Measuring Readability of Polish Texts: Baseline Experiments. B Broda, B Nitoń, W Gruszczyński and M Ogrodniczuk PDF

Can Numerical Expressions Be Simpler? Implementation and Demonstration of a Numerical Simplification System for Spanish. S Bautista and H Saggion PDF

Text Readability and Word Distribution in Japanese. Satoshi Sato PDF

And finally, my paper:
Out in the Open: Finding and Categorising Errors in the Lexical Simplification Pipeline. M Shardlow PDF

Some other papers that I found interesting were:

Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines. M Sabou, K Bontcheva, L Derczynski and A Scharl PDF

Identification of Multiword Expressions in the brWaC. P Anick, M Verhagen and J Pustejovsky PDF

Creating a Massively Parallel Bible Corpus. T Mayer and M Cysouw PDF






Tuesday, May 27, 2014

LREC 2014

I write this post from my hotel in Reykjavik, ready to get started at LREC tomorrow.  I'm looking forward to another conference experience.  I don't think this city could contrast more with the venue of my previous conference in Sofia, Bulgaria.  I'm also particularly excited as I am presenting in the main conference this year.  It is a really great opportunity to share my research and I'm looking forward to it.

I haven't completed my presentation yet, but when I do, the slides will be up here.  I am presenting on Thursday morning, so there is still time to put the finishing touches to my slides.  I will be presenting an error study of the lexical simplification pipeline which I produced a short while back.  It's an interesting little side experiment which developed into some useful results.  Effectively, I've been able to show that the areas of the field which are least addressed are the areas which cause the most problems.  A game changer - if people take notice.  I'm really looking forward to seeing how the work is received; I imagine there will be a mixture of views on it.

In terms of my PhD project, this work will feed directly into the first chapter of my thesis.  It allows me to motivate work on mitigating errors throughout the lexical simplification pipeline.  I have recently come to what I think is a final draft of my thesis structure.  But I shall save that for another post.


Friday, March 07, 2014

Accepted Papers

It has been a good start to 2014 so far, with two papers accepted.

The first is a paper at a large conference called LREC 2014, which will take place in Reykjavik, Iceland at the end of May this year.  The paper details some experiments that I performed last year, looking at the distribution of errors in the lexical simplification pipeline.  The hope is that if I can show the types of errors and where they're coming from, then I can motivate my research into mitigating those errors.  It will certainly make a nice introduction to my thesis.  The paper has been accepted for a main track oral presentation, which means that I'll be presenting alongside established researchers in the main event of the conference.  It's a mixture of very exciting and somewhat terrifying!

The second paper is published in a special issue of an open access journal (open access means that the publisher allows anybody to read it for free).  I was unsure at first whether to publish with them, but a number of factors led me to do so.  The danger of publishing with open access journals, particularly the relatively unknown ones, is that the quality of the reviewing and the impact of the journal can be very low.  In the worst case you can end up paying thousands of pounds for somebody to essentially publish your paper on their blog.

The paper I have submitted is a survey paper.  I wrote it almost two years ago and have been updating it ever since; it is effectively the background chapter of my thesis.  I originally attempted to submit it to a few prestigious journals, but it was rejected for a few different reasons.  Survey papers from relatively unknown authors are tough to get published.  So, off the back of the rejections and updates, it seemed sensible to submit to a less prestigious journal.  I saw the call for the special issue and thought I'd write up my paper for it.  The organisation behind the journal is fairly new (2010), but they do seem to tick the right sorts of boxes: they have an ISSN, they produce print versions of the journal, they don't charge an extortionate fee, and the past issues seem well populated with sensible research.  The survey paper is well suited to this type of publication, as it will appeal to a broad audience and will therefore benefit from open access publishing.


Links (and maybe some more information) will be put up when I have the PDFs.