Lexical Simplification: 2015

Recently I have been working through a lecture series on Deep Learning for NLP. It's a topic that I have long been interested in, and I'm learning a lot as I go along. It has been a welcome relief from thesis writing.

In the first 15 minutes of the fifth lecture, Richard Socher outlines the steps that his students should take for their class projects. I wanted to reproduce them here as I think they are really useful for anybody who is performing experiments in Machine Learning with Natural Language Processing. Although the steps are aimed at a class project in a university course, I think they are applicable to anybody starting out in NLP, and helpful as a reminder to all who are established. I had to figure these steps out myself as I went along, so it is very encouraging to see them being taught to students. The eight steps (with my notes) are as follows. Or, if you'd prefer, scroll to the bottom of the page and watch the first fifteen minutes of the video there.

Step 1 - Define Task

Before you make a start you need to know what your task is going to be. Read the literature. Read the literature's background literature. Email researchers who you find interesting and ask about their work. Skype them if they'll let you. Make sure you know clearly what your task is.

Step 2 - Define Dataset

There are a lot of ready-made datasets that you can go and grab. You'll already know what these are if you've read the literature thoroughly. If there is no suitable dataset for your task, then you will need to build one. This is a whole other area and can lead to an entire paper in itself.

Step 3 - Define Your Metric

How are you going to evaluate your results? Is it the right way? Often your dataset will naturally lead you to one evaluation metric. If not, look at what other people are using. Try to understand a metric before using it. This will speed up your analysis later.
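As a rough illustration of understanding a metric before using it, here is a minimal sketch of precision, recall and F1 written out by hand. It is not from the lecture; the function name and the toy labels are my own assumptions, and in practice your dataset may call for a different metric entirely.

```python
def precision_recall_f1(gold, predicted, positive=1):
    """Compute precision, recall and F1, treating one label as the positive class."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be the same length")
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == positive and p == positive)
    false_pos = sum(1 for g, p in pairs if g != positive and p == positive)
    false_neg = sum(1 for g, p in pairs if g == positive and p != positive)
    # Guard against division by zero when a class is never predicted or never present.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy labels, invented purely for illustration.
gold = [1, 0, 1, 1, 0, 0]
predicted = [1, 1, 1, 0, 0, 0]
p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Writing the metric out once like this makes it much easier to explain later why a score moved.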

Step 4 - Split Your Dataset

Training, Validation, Testing. Make sure you have these three partitions. If you're feeling really confident, don't look at the testing data until you absolutely have to, then run your algorithm on it as few times as possible. Once should be enough. The training set is the bread and butter of your algorithm's learning, and the validation set is for fiddling with parameters. But be careful: the more you fiddle with parameters, the more likely you are to overfit to your validation set.
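One possible way to make the three partitions is sketched below. The 80/10/10 proportions, the fixed seed and the function name are assumptions of mine rather than anything prescribed in the video; the point is only that the split is made once, reproducibly, before any experiments are run.

```python
import random


def split_dataset(examples, train_frac=0.8, valid_frac=0.1, seed=13):
    """Shuffle once with a fixed seed, then cut into train / validation / test."""
    if train_frac <= 0 or valid_frac < 0 or train_frac + valid_frac >= 1:
        raise ValueError("fractions must leave some data for the test set")
    shuffled = list(examples)
    # A dedicated Random instance keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains is held out
    return train, valid, test


train, valid, test = split_dataset(range(100))
print(len(train), len(valid), len(test))
```

Saving the three partitions to disk straight away, rather than re-splitting in every script, helps keep the test data untouched.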

Step 5 - Establish a Baseline

What would be a sensible baseline? Random decisions? Common sense judgements? Maximum class? A competitor's algorithm? Think carefully about this. How will your model be different from the baseline?
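To make the "maximum class" option concrete, here is a small sketch of a majority-class baseline. The labels and function names are hypothetical; any real baseline should be chosen with your own task in mind.

```python
from collections import Counter


def majority_class_baseline(train_labels):
    """Return a predictor that always outputs the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return lambda example: most_common_label


def accuracy(gold, predicted):
    """Fraction of positions where the prediction matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Hypothetical labels for a word-complexity task.
train_labels = ["simple", "complex", "simple"]
test_words = ["cat", "house", "ubiquitous", "tree"]
test_labels = ["simple", "simple", "complex", "simple"]

baseline = majority_class_baseline(train_labels)
baseline_predictions = [baseline(word) for word in test_words]
baseline_accuracy = accuracy(test_labels, baseline_predictions)
print(baseline_accuracy)
```

If a more elaborate model cannot beat a number like this, that is worth knowing early.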

Step 6 - Implement an Existing (Neural Net) Model

Neural Net is in brackets because it is specific to the course in the video. Go find an interesting ML model and apply it. I have started to use WEKA for all my ML experiments and I find it really efficient.

Step 7 - Always Be Close to Your Data

Go look at your data. If you're tagging something, look at what's been tagged. If you're generating something, read the resultant text. Analyse the data. Make metadata. Analyse the metadata. Repeat. Where do things go wrong? If you have a pipeline, are some modules better than others? Are errors propagating through?
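A simple way to start is to list every mistake next to the example that caused it and to count which confusions are most common. The sketch below assumes a tagging-style task with invented tokens and tags; the helper names are mine.

```python
from collections import Counter


def collect_errors(examples, gold, predicted):
    """Return (example, gold label, predicted label) for every mistake."""
    return [(x, g, p) for x, g, p in zip(examples, gold, predicted) if g != p]


def confusion_counts(gold, predicted):
    """Count how often each (gold, predicted) pair of differing labels occurs."""
    return Counter((g, p) for g, p in zip(gold, predicted) if g != p)


# Invented tokens and tags, purely for illustration.
tokens = ["run", "fast", "light", "book", "play"]
gold_tags = ["VERB", "ADV", "ADJ", "NOUN", "VERB"]
predicted_tags = ["NOUN", "ADV", "NOUN", "NOUN", "NOUN"]

errors = collect_errors(tokens, gold_tags, predicted_tags)
confusions = confusion_counts(gold_tags, predicted_tags)
for token, gold_tag, predicted_tag in errors:
    print(f"{token}: gold={gold_tag} predicted={predicted_tag}")
print(confusions.most_common())
```

The counts tell you where to look; reading the individual errors tells you why.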

Step 8 - Try Different Models

Play around a bit. Maybe another model will perform much better or worse than the first one you try. If you analyse the differences in performance, you might start to realise something interesting about your task. I would recommend setting up a framework for your experiments, so that you can quickly change things around and re-run with different models / parameters.
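The kind of framework I have in mind can be very small. The sketch below is one possible shape, not a recommendation of any particular tool: each configuration is a name mapped to a training function plus its parameters, so adding a model or changing a parameter is a one-line change. The two toy models, the word lists and all names are assumptions made up for the example.

```python
from collections import Counter


def train_majority(train):
    """Toy model: always predict the most frequent training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda word: label


def train_length_threshold(train, threshold=7):
    """Toy rule: words longer than `threshold` characters are 'complex'."""
    return lambda word: "complex" if len(word) > threshold else "simple"


def accuracy(model, data):
    """Fraction of (input, label) pairs the model gets right."""
    return sum(model(x) == y for x, y in data) / len(data)


def run_experiments(configs, train, valid):
    """Train every configured model and score it on the validation set."""
    results = {}
    for name, (trainer, params) in configs.items():
        model = trainer(train, **params)
        results[name] = accuracy(model, valid)
    return results


# Invented data for a word-complexity task.
train = [("cat", "simple"), ("dog", "simple"), ("ubiquitous", "complex")]
valid = [("house", "simple"), ("perspicacious", "complex"),
         ("tree", "simple"), ("ameliorate", "complex")]

configs = {
    "majority": (train_majority, {}),
    "length>7": (train_length_threshold, {"threshold": 7}),
    "length>10": (train_length_threshold, {"threshold": 10}),
}

results = run_experiments(configs, train, valid)
for name, score in sorted(results.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.2f}")
```

Note that the scores here come from the validation set only; the test set stays out of the loop until the very end.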

Step 9 - Do Something New

OK, this isn't a step in the video per se. But, it is the extension of the methodology. Now that you have tried something, you can see where it was good and where it was bad. Now is the time to figure out how to make it better. Different models, different features, different ways of using features, are all good places to start.

The video is below and also accessible at this link:
https://www.youtube.com/watch?v=I2TfdXfSOfc

If you are interested in the topic, then the course syllabus, with all videos and lecture notes, is here:
http://cs224d.stanford.edu/syllabus.html

Thirty-nine months ago, I started my PhD. As I look back through logbooks, half-written papers and the accumulated file lint that clutters my desktop, I wonder where that time has gone. But this is no time for reflection. From here on out, my eyes are set firmly on September 2015. They must be, if I hope to finish my PhD.

Over the coming nine months I hope to achieve all of the following goals:

Finish off some experiments using machine learning to rank substitutions in different contexts.
Write up and submit this work to the main track at ACL-IJCNLP 2015.
Design and implement an experiment looking at the effects of simplification for people with aphasia.
Compose my thesis.
Find a job.

All in all, I think that this is manageable. I would really like to go to ACL this year, but my fate is in the hands of the reviewers. I haven't formally started writing my thesis yet, but I do have lots of material written up, so I intend to start with this and work from there. The aphasia experiments will be interesting, but I'll write more about them at a later point. Lots to do, so best get started.


Monday, June 08, 2015

An Excellent Experimental Framework.