
Real-world utility is based on many things

source link: https://ehudreiter.com/2021/11/22/real-world-utility-is-based-on-many-things/

A few weeks ago I gave a talk on High Quality Human Evaluations of NLG at a workshop (see also related blog). One of the points I tried to make is that what we want to know is how useful NLG systems are in real-world settings, and that real-world utility depends on a range of different factors. In other words, even if we have good techniques for measuring the fluency and accuracy of a generated text, this is not sufficient to measure real-world utility.

Example: Summarising a Medical Consultation

Let me give a concrete example, based on the work of Francesco Moramarco, one of my PhD students. Francesco is trying to evaluate a summarisation system which generates a written summary of a consultation between a doctor (GP) and a patient; this summary could then be added to the patient record and perhaps shown to the patient. Below is an example from one of Francesco's papers:

Consultation (input)

Doctor: Hello? Good morning, Tim. Um, how can I help you this morning?

Patient: Um, so I’m having some, some pain, uh, in my tummy, like the lower part of my tummy. Um and I’ve just been feeling, quite, hot and sweaty.

Doctor: OK. Right, I’m sorry to hear that. When, when did your symptoms all start?

Patient: About two days ago.

Summary (output)

Two days of lower abdominal pain.

Because accuracy is of paramount importance in medicine, the summaries must be checked and post-edited by the doctor before they are saved into the medical record. The usefulness of the system is thus largely based on the post-editing process, including

  • How long does it take a doctor to post-edit the summary? Doctors' time is expensive, so we want post-editing to be quick.
  • Does post-editing distract the doctor or otherwise interfere with his/her workflow? If it requires a lot of cognitive effort to post-edit the summary, this could disrupt workflow.
  • Is the post-edited text complete and accurate?
  • Do doctors like using the system? If they don’t, then success in real-world deployment is unlikely.

Also, in this context we need to understand the distribution of outcomes (especially worst-case behaviour) as well as averages. We know that people differ widely in the time they take to post-edit (paper), because some people just fix major problems while others rewrite texts more comprehensively. So if most doctors post-edit quickly but a few take a long time, this is important. We also need to be confident that serious hallucinations or omissions are very rare.
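To make the distinction concrete, here is a minimal sketch (the post-edit times are invented purely for illustration, not data from Francesco's studies) of how one might report the distribution of post-edit times rather than just the mean:

```python
import statistics

# Hypothetical post-edit times in seconds, one value per doctor
# (invented numbers for illustration only).
post_edit_times = [35, 42, 38, 40, 310, 37, 45, 39, 41, 290]

mean_time = statistics.mean(post_edit_times)
median_time = statistics.median(post_edit_times)
# quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile.
p95_time = statistics.quantiles(post_edit_times, n=20)[18]
worst_time = max(post_edit_times)

print(f"mean:            {mean_time:.0f}s")
print(f"median:          {median_time:.0f}s")
print(f"95th percentile: {p95_time:.0f}s")
print(f"worst case:      {worst_time}s")
```

With numbers like these the mean looks tolerable while the 95th percentile and worst case are dramatically worse, which is exactly the situation (most doctors fast, a few very slow) that matters in deployment.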

Now, most existing evaluations of summarisation systems attempt to judge the quality of the generated summary. This is true for human evaluations where Turkers give Likert ratings to texts, as well as for evaluations based on ROUGE and other metrics. The quality of the generated summary is likely to influence the things I mentioned above (such as the effort required to post-edit), but so do other things, including the user interface used for post-editing. And of course existing evaluation techniques focus on average-case performance, not on the worst case.

In other words, many of the things we want to measure when evaluating the consultation summariser in real-world usage are different from what is measured in most academic evaluations. There may be a correlation between the two, but if we want to use ROUGE-like metrics (or Likert ratings from Turkers) to predict real-world utility in this use case, we should get concrete data about real-world utility (including things like post-edit time) and measure how well this correlates with ROUGE (and Turker ratings). We cannot just assume this correlation exists; we need to demonstrate it empirically.
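As a sketch of the kind of validation I have in mind (the ROUGE scores and post-edit times below are invented for illustration; a real study would use measured post-edit times and the metric scores actually computed for those summaries), one could simply check how strongly the metric correlates with the real-world measure:

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical paired data, one entry per generated summary.
# rouge_scores:      metric score for each summary (e.g. ROUGE-L F1).
# post_edit_seconds: time a doctor took to post-edit that summary.
rouge_scores = [0.62, 0.48, 0.71, 0.55, 0.40, 0.66, 0.52, 0.59]
post_edit_seconds = [38, 95, 30, 60, 140, 35, 75, 50]

pearson_r, pearson_p = pearsonr(rouge_scores, post_edit_seconds)
spearman_r, spearman_p = spearmanr(rouge_scores, post_edit_seconds)

print(f"Pearson r:  {pearson_r:.2f} (p = {pearson_p:.3f})")
print(f"Spearman r: {spearman_r:.2f} (p = {spearman_p:.3f})")
```

If the correlation turns out to be weak, the metric is not a good proxy for real-world utility in this use case; this is why the correlation has to be demonstrated rather than assumed.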

Other things which influence real-world success and utility

The above example is representative (at least in my experience) in the sense that when we deploy an NLG system in real production usage, success is usually influenced by a number of factors, many of which are specific to the use case. Some of the other factors which I have seen are

  • Response time: is a text generated quickly?
  • Brand fidelity: does a generated text conform to and support a corporate brand?
  • Control: does the NLG system reduce the user’s sense of control over what he is doing?
  • Risk: is there a risk (perceived or real) that the NLG system will produce a text that does real damage (injures people, leads to bad publicity, or opens the door to lawsuits)?

I could easily expand this list, not least by adding issues related to change management; interested readers can also look at my summary of the INLG2021 industrial panel.

What should academics measure?

I am not saying that academics should routinely try to measure the above factors when evaluating systems! But they need to be aware of them, and the field would greatly benefit from some “high quality human evaluations” which investigate the above and assess how well simple evaluations (metrics and Turkers) correlate with and predict these factors. Francesco (my student) certainly hopes to do such studies, and I encourage other researchers to do likewise!

