Wednesday, June 19, 2013

The importance of being accurate.

Only a short post today. I am currently writing my transfer report, which is soaking up all of my research time.  I thought I would take some time out from that to write about an interesting phenomenon that occurs in text simplification.

Accuracy is always to be sought after.  Regardless of your domain, the more accurate your algorithm, the better.  In many domains, some inaccurate results can be tolerated.  For example, if you search for the query 'jaguar in the jungle' you are likely to receive lots of results about big cats in their natural habitat, but you may also receive some results about fancy cars in the jungle too.  This is acceptable and may even be helpful as the original query contained some ambiguity - maybe you really wanted to know about those fancy cars.

The same thing can occur during text simplification.  Inaccurate identifications or replacements may lead to an incorrect result being present in the final text.  Some of the critical points of failure are as follows:
  • A complex word could be mislabeled as simple - meaning it is not considered for simplification.
  • No replacements may be available for an identified complex word.
  • A replacement which does not make sense in the context of the original word may be selected.
  • A complex replacement may be incorrectly selected over a simpler alternative due to the difficulty of estimating lexical complexity.
If any of the above pitfalls occur, then either a complex word or an erroneous replacement may creep into the final text.  Unlike in web search, errors are of great detriment to the simplification process.  This is because the point is to produce text which is easier to understand.  In the majority of cases, introducing errors into a text will make it more difficult to understand, completely negating any simplification made.  This is a real case of one step forwards and two steps back.  For example:

A young couple with children will need nearly 12 years to get enough money for a deposit.
was changed by a rudimentary lexical simplification system to:

A young couple with children will need nearly 12 years to get enough money for a sediment.
Not only has a synonym which is more complicated than the original word been chosen here, but the synonym also does not make any sense in the given context.  By making this error, the system has reduced the understandability of the text, and it would have been better to make no simplification at all.
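
To make the failure concrete, here is a minimal sketch of how a context-blind substitution step could produce this kind of error. It is not the system that generated the example above. The thesaurus entry, the function name naive_simplify and the set of flagged words are all hypothetical and exist only for illustration.

```python
# Toy thesaurus: hypothetical data, not drawn from any real resource.
# "deposit" has at least two senses (a down payment; a layer of settled
# matter), but a flat synonym list mixes them together.
TOY_SYNONYMS = {
    "deposit": ["sediment", "down payment"],
}


def naive_simplify(sentence, complex_words):
    """Replace each flagged word with its first listed synonym.

    No check is made that the synonym fits the context or that it is
    actually simpler than the word it replaces.
    """
    out = []
    for token in sentence.split():
        core = token.rstrip(".,;:!?")   # separate trailing punctuation
        tail = token[len(core):]
        key = core.lower()
        if key in complex_words and key in TOY_SYNONYMS:
            out.append(TOY_SYNONYMS[key][0] + tail)
        else:
            out.append(token)
    return " ".join(out)


original = ("A young couple with children will need nearly 12 years "
            "to get enough money for a deposit.")
result = naive_simplify(original, {"deposit"})
print(result)
```

Because the sketch always takes the first synonym listed, it picks the wrong sense of 'deposit' whenever that sense happens to come first, which is exactly the kind of replacement a reader cannot recover from.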

To end this post, I will present some practical ways to mitigate these errors.
  1. Only simplify if you're sure.  Thresholds for deciding whether to simplify should be set high to avoid errors.
  2. Use resources which are well suited to your task, preferably built from as large a corpus as possible.
  3. Investigate these errors in resultant text.  If they are occurring, is there a specific reason?
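
The first point can be sketched in code. This is only an illustration under stated assumptions: the Candidate class, the context_fit and frequency scores, the min_fit threshold of 0.9 and every number below are hypothetical, and corpus frequency is used here as a rough stand-in for simplicity rather than as a recommended measure.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    word: str
    context_fit: float  # hypothetical score in [0, 1]: how well the word fits the sentence
    frequency: float    # hypothetical corpus frequency; higher is treated as simpler


def choose_replacement(original_frequency, candidates, min_fit=0.9):
    """Return the best candidate, or None if no candidate is a safe choice.

    A candidate is only viable if it clears the context-fit threshold AND
    is more frequent (assumed simpler) than the original word.
    """
    viable = [
        c for c in candidates
        if c.context_fit >= min_fit and c.frequency > original_frequency
    ]
    if not viable:
        return None  # leave the original word untouched
    return max(viable, key=lambda c: c.frequency)


# Made-up scores for the 'deposit' sentence: neither candidate is both a
# good fit and simpler than the original, so no replacement is made.
deposit_candidates = [
    Candidate("sediment", context_fit=0.05, frequency=3.0),
    Candidate("down payment", context_fit=0.95, frequency=2.0),
]
print(choose_replacement(20.0, deposit_candidates))

# A made-up case where a replacement clears both checks.
purchase_candidates = [Candidate("buy", context_fit=0.97, frequency=50.0)]
print(choose_replacement(5.0, purchase_candidates).word)
```

Returning None rather than the least-bad candidate is the design choice that matters here: when the system is not sure, the original word stays in the text.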
In summary, incomprehensible text is much more complex than understandable yet unsimplified text.  Whilst the goal of text simplification must be to simplify when and wherever possible, this must not be done at the expense of a system's accuracy.  Presenting a reader with error-prone text is as bad as, if not worse than, presenting them with complex text.