
Why doesn't BLEU work for NLG?

source link: https://ehudreiter.com/2018/07/02/why-bleu-poor-for-nlg/

Ehud Reiter's Blog

Ehud's thoughts and observations about Natural Language Generation

I recently wrote a paper presenting a structured review of the validity of BLEU, in which I brought together evidence from previously published studies on how well BLEU correlates with human evaluations. One of my main conclusions was that BLEU was much better at evaluating MT systems than NLG systems. A few people have since asked me why I thought this was the case. Below are some thoughts; these are speculations rather than proven facts!

I should add that many of the papers I surveyed made similar points, including Espinosa et al 2010,  Liu et al 2016, and Reiter and Belz 2009.

Text quality

MT systems are getting better, but the output of a good MT system is still inferior to a human translation. NLG systems, in contrast, typically aim to produce texts of near-human, or even better-than-human, quality (eg, Reiter et al 2005). This is partially because there is little interest in using NLG to produce moderate-quality texts, since these can be generated using templates.

BLEU is based on comparing computer-generated texts to human-written “reference” texts, and assumes that the closer the computer text is to the reference text, the better. This assumption is clearly incorrect if the computer-generated texts are *better* than the human-written reference texts! More generally, I suspect that any metric based on comparing computer-generated texts to human-written texts will be dubious once the computer texts reach near-human (let alone better-than-human) quality.
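
To make the comparison mechanism concrete, here is a minimal sketch of the modified n-gram precision at the heart of BLEU. This is a simplification, not the full metric: real BLEU geometrically averages these precisions for n = 1 to 4 over a whole corpus and multiplies in a brevity penalty, and the example sentences below are invented.

```python
from collections import Counter

def modified_ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also appear in the reference,
    with each candidate n-gram's count clipped to its reference count."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    clipped = sum(min(c, ref[ng]) for ng, c in cand.items())
    return clipped / max(sum(cand.values()), 1)

candidate = "the cat sat on the mat".split()
reference = "the cat was sitting on the mat".split()
print(modified_ngram_precision(candidate, reference, 2))  # 0.6: 3 of 5 bigrams match
```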

Text variability

Information can be expressed in many different ways by an NLG system. To take a very simple example, the below are all acceptable ways of describing a “purchase” event:

Yesterday John bought a book at the bookstore.

John purchased a book at the bookstore yesterday.

The bookstore sold John a book on 1 July.

(etc)

So even with this very simple message, we can express it in many ways: by changing modifier placement (“yesterday”), replacing words with synonyms (“bought” and “purchased”), changing temporal reference strategy (“yesterday” vs “1 July”), and paraphrasing (“John bought” vs “The bookstore sold”). Even this simple message can be expressed in dozens of ways, and a narrative which communicates ten messages can probably be expressed in thousands (millions?) of different ways.

This is a problem for BLEU, since it effectively looks for matching n-grams in the generated and reference texts. Even if multiple reference texts are provided, they are unlikely to cover all or even most of the above variations.
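
As a hedged illustration (the exact numbers depend on tokenisation and smoothing choices), NLTK's sentence_bleu gives very different scores to two equally acceptable renderings of the purchase event when only one reference is available:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1  # avoids zero scores when higher-order n-grams are absent
reference = ["yesterday john bought a book at the bookstore".split()]

close_wording = "john purchased a book at the bookstore yesterday".split()
paraphrase = "the bookstore sold john a book on 1 july".split()

print(sentence_bleu(reference, close_wording, smoothing_function=smooth))  # shares many n-grams, scores well
print(sentence_bleu(reference, paraphrase, smoothing_function=smooth))     # near zero, yet equally acceptable
```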

An obvious question is why this isn't also an issue for MT; after all, there are many acceptable ways of translating a sentence. I don't have a good answer to this, although I wonder if BLEU's bias against rule-based systems is partially because their output is more variable than that of statistical/neural systems.

Variation to keep text interesting

In many contexts, human readers want texts to be varied; they do not want to see the same words and syntactic constructs repeated again and again. Hence varying the way information is communicated is appreciated by human readers and increases their satisfaction; this is also standard advice to human writers. However, such variation *decreases* ratings from BLEU and other metrics, which tend to reward systems that are repetitive and use “preferred” wording and syntax 100% of the time.

I suspect this is a relatively minor issue compared to the previous ones, but I think it is interesting because it is a very clear example of a case where human preference is pretty much the opposite of BLEU’s preferences; systems that vary texts get higher human evaluation scores but lower BLEU scores.
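
A small invented example of this effect, again sketched with NLTK (the sentences and the scenario are hypothetical): a system that repeats the reference phrasing verbatim outscores one that varies its wording, even though readers typically prefer the varied texts.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

smooth = SmoothingFunction().method1
# Hypothetical corpus where the human references happen to reuse one phrasing
references = [["the patient 's heart rate rose sharply".split()]] * 2

repetitive = ["the patient 's heart rate rose sharply".split()] * 2
varied = ["the patient 's heart rate rose sharply".split(),
          "there was a sharp rise in the patient 's heart rate".split()]

print(corpus_bleu(references, repetitive, smoothing_function=smooth))  # 1.0: verbatim repetition
print(corpus_bleu(references, varied, smoothing_function=smooth))      # lower, despite better readability
```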

Evolution

Being very speculative, I suspect that MT systems have evolved to have good BLEU scores, since a good BLEU score is very important for research success in MT; I mean this in the Darwinian sense that approaches which produce good BLEU scores get more publications and funding than approaches with poor BLEU scores, regardless of their respective human evaluations. This is one of the reasons why BLEU-human correlations for MT systems have increased over time. A good BLEU score has been much less important in NLG, and hence there has been less “evolutionary pressure” in NLG in favour of approaches that lead to good BLEU scores.

Other ideas?

If readers have other suggestions as to why BLEU is poorly suited to evaluating NLG systems, please let me know (or add a comment to this blog); I’m very interested in knowing other people’s thoughts on this!

