Do You Want to Know a Secret?

August 25, 2018

A riff on writing style and rating systems

Mark Glickman is a statistician at Harvard University. With Jason Brown of Dalhousie University and Ryan Song also of Harvard—we’ll call them GBS—he has used musical stylometry to resolve questions about which Beatle wrote which parts of which songs. He is also a nonpareil designer of rating systems for chess and other games and sports.

Today we discuss wider issues and challenges arising from this kind of work.

In fact, we’ll pose a challenge right away. Let’s call it The GLL Challenge. Many posts on this blog have both our names. In most of them the writing is split quite evenly. Others like this are by just one of us. Can you find regularities in the style of the single-author ones and match them up to parts of the joint ones?

Most Beatles songs have single authors, but some were joint. Almost all the joint ones were between John Lennon and Paul McCartney, and in a number of those there are different accounts of who wrote what and how much. Here are examples of how GBS weighed in:

Although the 1962 song, “Do You Want to Know a Secret?” was credited as “Lennon/McCartney” and even as “McCartney/Lennon” by a band who covered it in 1963, it has long been agreed as mostly by Lennon, as labeled on this authorship list. GBS confirm this.
The two composers differed, however, in their accounts of “In My Life” and it has taken GBS to credit it all to Lennon with over 98% confidence.
The song “And I Love Her” is mainly by McCartney, but GBS support Lennon’s claim to have written the 16-syllable bridge verse.
Lennon said “The Word” was mainly his, but GBS found McCartney’s tracks all over it.

Tell Me Why Baby It’s You

To convey how it works, let’s go back to the GLL Challenge. I tend to use longer words and sentences, often chaining further thoughts within a sentence when I could have stopped it at the comma. The simplest approach is just to treat my sole posts as “bags of words” and average the word lengths. Do the same for Dick’s, and then compare blocks of the joint posts. The wider the gap you find between our sole writings, the more confidently you can ascribe a block of a joint post whose mean word length approaches one of our means or the other.
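As a minimal sketch of this bag-of-words comparison, assuming toy text in place of real posts: the function names and the two sample strings are hypothetical, and a real attempt would want far more text and some measure of spread, not just the means.

```python
import re
from statistics import mean

def word_lengths(text):
    """Return the lengths of the alphabetic words in text (a bag of lengths)."""
    return [len(w) for w in re.findall(r"[A-Za-z]+", text)]

def mean_word_length(text):
    """Average word length of a text, ignoring word order entirely."""
    return mean(word_lengths(text))

def attribute(block, mean_a, mean_b):
    """Label a block 'A' or 'B' by whichever author's mean word length is closer."""
    m = mean_word_length(block)
    return "A" if abs(m - mean_a) <= abs(m - mean_b) else "B"

# Hypothetical stand-ins for each author's single-author posts.
sole_a = "Considerable sophistication characterizes mathematical exposition throughout"
sole_b = "We keep it short and to the point so you can read it fast"
mean_a, mean_b = mean_word_length(sole_a), mean_word_length(sole_b)

label = attribute("Elaborate terminology permeates complicated arguments", mean_a, mean_b)
print(label)
```

The wider the gap between `mean_a` and `mean_b`, the less often a block will land near the midpoint, which is the confidence point made above.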

For greater sophistication, you might count cases of two consecutive multisyllabic words, especially when a simple word like “long” could have replaced the second one. Then you are bagging the pairs of words while discarding information about sentence structure and sequencing. An opposite approach would be to model the probability of a word of length ${n}$ following a whole sequence of words of lengths ${n_1,n_2,\dots,n_r}$ . This retains sequencing information even if ${r}$ is small because one sequence is chained to the previous one.

GBS counted pairs—that is, transitions from one note or chord to another—but did not analyze whole musical phrases. The foremost factor, highlighted in lots of popular coverage this past month, is that McCartney’s transitions jump around whereas Lennon’s stay closer to medieval chant. Although GBS covered songs from 1962–1966 only, the contrast survives in post-1970 songs such as Lennon’s “Imagine” and “Woman” versus McCartney’s “Live and Let Die” and the refrain of “Band on the Run.”

To my ears, the verses of the last creep like Lennon, whereas Lennon’s “Watching the Wheels” has swoops like McCartney. Back when they collaborated they may have taken leaves from each other, as I sometimes channel Dick. The NPR segment ended with a query by Scott Simon about collaborative imitation to Keith Devlin, who replied:

For sure. And that’s why it’s hard for the human ear to tell the thing apart. It’s also hard for them to realize who did it and this is why actually the only reliable answer is the mathematics because no matter how much people collaborate, they’re still the same people, and they have their preferences without realizing it. [Lennon’s and McCartney’s] things come together—that works—but they were still separate little bits. The mathematics isolates those little bits that are unique to the two people.

GBS isolated 149 bits that built a confident distinguisher of Lennon versus McCartney. This raises the specter of AI revealing more about us than we ourselves can plumb, let alone already know. It leads to the wider matter of models for personnel evaluation—rating the quality of performance—and keeping them explainable.

A Paradox of Projections

Glickman created the rating system Glicko and partnered in the design of URS, the Universal Rating System. Rather than present them in detail we will talk about the problems they intend to solve.

The purpose is to predict how a player ${P}$ will do against an opponent ${O}$ from the difference in their ratings ${R_P}$ and ${R_O}$ :

$\displaystyle y = f(x) \qquad\text{where}\qquad x = R_P - R_O.$

Here ${0 \leq y \leq 1}$ gives the probability for ${P}$ to win, or more generally the expected percentage score over a series of games. The function ${f(x)}$ should obey the following axioms:

$\displaystyle \begin{array}{rcl} &&f(-x) = 1 - f(x);\\ &&f(x) \rightarrow 1 \text{ as } x \rightarrow \infty, \text{ so } f(x) \rightarrow 0 \text{ as } x \rightarrow -\infty;\\ &&f'(x) \text{ is defined and maximum at } x = 0. \end{array}$

The last says that the marginal value of extra skill tails off the more one is already superior to one’s opponent. Together these say ${f(x)}$ is some kind of sigmoidal curve, like the red or green curve in this graphic from the “Elo Win Probability Calculator” page:

To use the calculator, pop in the difference as ${x}$ , choose the red curve (for US ratings) or green curve (for international ratings), and out pops the expectation ${y}$ . What could be simpler? Such simplicity and elegance go together. But the paradox—a kind of “Murphy’s Law”—is:

Unless the players are equally rated, the projection is certainly wrong. It overestimates the chances of the stronger player. Moreover, every projection system that obeys the above axioms has the same defect.

Here’s why: We do not know each rating exactly. Hence their difference ${x}$ likewise comes with a ${\pm \epsilon}$ component. Thus our projection really needs to average ${f(x+\epsilon)}$ and ${f(x-\epsilon)}$ over a range of ${\epsilon}$ values. However, because ${f}$ is concave for ${x > 0}$ , all such averages will be below ${f(x)}$ .
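A quick numerical check of this averaging argument, as a sketch only: it assumes the standard logistic Elo curve with scale 400, without claiming which colored curve in the graphic that is. The helper names are hypothetical.

```python
def elo_logistic(x, scale=400.0):
    """Logistic expectation curve: 1 / (1 + 10^(-x/scale))."""
    return 1.0 / (1.0 + 10.0 ** (-x / scale))

def two_point_average(f, x, eps):
    """Average of the projections at x + eps and x - eps."""
    return 0.5 * (f(x + eps) + f(x - eps))

# Gap between the averaged projection and the pinpoint projection.
for x in (50, 200, 400):
    for eps in (50, 100, 300):
        gap = two_point_average(elo_logistic, x, eps) - elo_logistic(x)
        print(f"x={x:4d}  eps={eps:4d}  f(x)={elo_logistic(x):.4f}  gap={gap:+.4f}")
```

Every printed gap is negative for ${x > 0}$, including cases where ${x - \epsilon}$ dips below zero. For this curve that follows because ${f'}$ is symmetric and peaks at ${0}$, so the curve rises less steeply on the far side ${x + \epsilon}$ than on the near side ${x - \epsilon}$.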

We might think we can evade this issue by using the curves

$\displaystyle f_{\epsilon}(x) = \frac{1}{2}(f(x + \epsilon) + f(x - \epsilon)).$

This shifts the original ${f(x)}$ curve left and right and averages them. Provided ${\epsilon}$ is not too big, ${f_{\epsilon}}$ is another sigmoid curve. Now define ${f_*}$ by aggregating the functions ${f_{\epsilon}}$ , say over ${\epsilon}$ normally distributed around ${0}$ . Have we solved the problem? No: ${f_*}$ still needs to obey the axioms. It still has sigmoid shape concave above ${x = 0}$ . Thus ${f_*(x)}$ will still be too high for ${x > 0}$ and too low for ${x < 0}$ . The following "Law"—whom to name it for?—tries not to be hyperbolic:

All simple and elegant prediction models are overconfident.

Indeed, Glickman’s own explanation on page 11 of his survey paper, “A Comprehensive Guide to Chess Ratings,” is philosophically general:

At first, this consistent overestimation of the expected score formula may seem surprising [but] it is actually a statistical property of the expected score formula.

To paraphrase what he says next: In a world with total ignorance of playing skill, we would have to put ${f_0(x) = 0.5}$ for every game. Any curve ${y = f(x)}$ comes from a model purporting pinpoint knowledge of playing skill. Our real world is somewhere between such knowledge and ignorance. Hence we always get some interpolation of ${f(x)}$ and the flat line ${y = 0.5}$ . In chess this is really an issue: although both the red and green curve project a difference ${x = 200}$ to give almost 76% expectation to the stronger player, observed results are about 72% (see Figure 6 in the survey).
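To put numbers on this, here is a sketch under stated assumptions. It evaluates two standard curve shapes at ${x = 200}$ (a logistic with scale 400 and a normal curve with standard deviation ${200\sqrt{2}}$) and then asks how much normally distributed uncertainty in ${x}$ would pull the logistic projection down to 72%. Treating the uncertainty as normal is my assumption for illustration; it is not Glickman's calculation, and the function names are hypothetical.

```python
import math

def logistic_curve(x):
    """Logistic expectation curve with scale 400."""
    return 1.0 / (1.0 + 10.0 ** (-x / 400.0))

def normal_curve(x, sd=200.0 * math.sqrt(2.0)):
    """Normal-CDF expectation curve for a rating difference x."""
    return 0.5 * (1.0 + math.erf(x / (sd * math.sqrt(2.0))))

def smeared(f, x, sd, n=4001, width=8.0):
    """Average f(x + eps) over eps ~ Normal(0, sd) using a simple Riemann sum."""
    if sd == 0:
        return f(x)
    h = 2.0 * width * sd / (n - 1)
    num = den = 0.0
    for i in range(n):
        eps = -width * sd + i * h
        w = math.exp(-0.5 * (eps / sd) ** 2)
        num += w * f(x + eps)
        den += w
    return num / den

def sd_matching(f, x, target, lo=0.0, hi=1000.0):
    """Bisect for the sd at which the smeared projection at x equals target."""
    for _ in range(40):
        mid = 0.5 * (lo + hi)
        if smeared(f, x, mid) > target:
            lo = mid   # still too confident: needs more uncertainty
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(logistic_curve(200), normal_curve(200))   # both close to 0.76
sd = sd_matching(logistic_curve, 200, 0.72)
print(sd, smeared(logistic_curve, 200, sd))
```

The second line of output gives the spread in the rating difference that this toy model would need in order to account for the observed figure.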

Newtonian Ratings and Grothendieck Nulls

The Glicko system solves this problem by giving every player ${P}$ a rating ${R_P}$ and an uncertainty parameter ${\epsilon_P}$ . Instead of building the curves ${f_{\epsilon}}$ and ${f_*}$ (and so on), it keeps ${\epsilon}$ as a separate parameter. The prediction ${y}$ then becomes a function of ${(\epsilon_P,\epsilon_O)}$ as well as ${x = R_P - R_O}$ , with optional further dependence on how the ${(R_P,\epsilon_P)}$ “glob” may skew as ${R_P}$ grows into the tail of high outliers and on other dynamics of the population of rated players.
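For concreteness, here is a sketch of how such a prediction can look. It follows the expected-outcome formula in Glickman's published description of Glicko as I understand it, with the ratings deviation (RD) playing the role of ${\epsilon_P}$ here; that identification, and the function names, are my assumptions.

```python
import math

Q = math.log(10.0) / 400.0

def g(rd):
    """Attenuation factor: 1 at rd = 0, shrinking as uncertainty grows."""
    return 1.0 / math.sqrt(1.0 + 3.0 * (Q * rd) ** 2 / math.pi ** 2)

def glicko_expected(r_p, rd_p, r_o, rd_o):
    """Expected score for P against O, using the combined deviation of both players."""
    scale = g(math.hypot(rd_p, rd_o))
    return 1.0 / (1.0 + 10.0 ** (-scale * (r_p - r_o) / 400.0))

print(glicko_expected(1700, 0, 1500, 0))      # pinpoint ratings: plain logistic value
print(glicko_expected(1700, 100, 1500, 100))  # uncertain ratings: pulled toward 0.5
```

With both deviations at zero this reduces to the plain logistic curve, and any positive deviation flattens the projection toward ${0.5}$, which is the direction the paradox calls for.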

However, Newton’s laws behave as though bodies have pinpoint mass values at their centers of gravity, no matter how the mass may “glob” around them. Trying to capture an inverse-square law for chess ratings leads to a curious calculation. Put

$\displaystyle f(x) = 1/(Ax^2 + Bx + C)$

for ${x \geq 0}$ . Taking ${C = 2}$ gives ${f(0) = 0.5}$ and allows gluing ${f(-x) = 1 - f(x)}$ . Simplifying ${\frac{1}{2}(f(x+\epsilon) + f(x-\epsilon)) - f(x)}$ gives a fraction whose denominator ${D(x)}$ is the product of the three quadratics, that is ${D(x) = 1/(f(x)f(x+\epsilon)f(x-\epsilon))}$ , and whose numerator ${N(x)}$ is given by

$\displaystyle N(x) = 3A^2 \epsilon^2 x^2 + 3AB\epsilon^2 x + (B^2 \epsilon^2 - 2A\epsilon^2 - A^2 \epsilon^4).$
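For readers who want the intermediate steps, here is one way to reach that numerator. The name ${g}$ for the quadratic is introduced only for this sketch, and ${C}$ is left general until the last line.

$\displaystyle \begin{array}{rcl} g(x) &=& Ax^2 + Bx + C, \qquad f(x) = 1/g(x);\\ g(x \pm \epsilon) &=& g(x) \pm \epsilon(2Ax + B) + A\epsilon^2;\\ \frac{1}{2}(f(x+\epsilon) + f(x-\epsilon)) - f(x) &=& \frac{g(x)(g(x) + A\epsilon^2) - g(x+\epsilon)g(x-\epsilon)}{g(x)g(x+\epsilon)g(x-\epsilon)};\\ g(x+\epsilon)g(x-\epsilon) &=& (g(x) + A\epsilon^2)^2 - \epsilon^2(2Ax+B)^2;\\ N(x) &=& \epsilon^2(2Ax+B)^2 - A\epsilon^2 g(x) - A^2\epsilon^4\\ &=& 3A^2\epsilon^2 x^2 + 3AB\epsilon^2 x + (B^2\epsilon^2 - AC\epsilon^2 - A^2\epsilon^4). \end{array}$

Putting ${C = 2}$ in the last line recovers the stated ${N(x)}$ .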

Then taking ${B = \sqrt{2A}}$ cancels out the two bigger terms in the constant part, leaving the numerator as

$\displaystyle N(x) = 3A^2 \epsilon^2 x^2 + 3\sqrt{2}A^{3/2}\epsilon^2 x - A^2 \epsilon^4.$
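The algebra can be double-checked with exact rational arithmetic. This is a sketch with hypothetical values: ${A = 1/45000}$ and ${B = 1/150}$ are chosen only so that ${B^2 = 2A}$ holds exactly, not fitted to any rating curve. Note that with ${A, B > 0}$ this ${f}$ falls from ${0.5}$ toward ${0}$ as ${x}$ grows, so the sketch reads it as the lower-rated side's expectation, which is my interpretation.

```python
from fractions import Fraction as F

def quad(t, A, B):
    """The quadratic A t^2 + B t + 2 in the denominator of f (with C = 2)."""
    return A * t * t + B * t + 2

def f_newton(x, A, B):
    """The 'Newtonian' curve f(x) = 1 / (A x^2 + B x + 2), for x >= 0."""
    return 1 / quad(x, A, B)

def discrepancy(x, eps, A, B):
    """Averaged projection minus pinpoint projection, computed directly."""
    return (f_newton(x + eps, A, B) + f_newton(x - eps, A, B)) / 2 - f_newton(x, A, B)

def numerator(x, eps, A, B):
    """N(x) for general B, as in the text."""
    return (3 * A**2 * eps**2 * x**2 + 3 * A * B * eps**2 * x
            + (B**2 * eps**2 - 2 * A * eps**2 - A**2 * eps**4))

def denominator(x, eps, A, B):
    """Product of the three quadratics at x, x + eps, x - eps."""
    return quad(x, A, B) * quad(x + eps, A, B) * quad(x - eps, A, B)

A, B = F(1, 45000), F(1, 150)   # hypothetical values with B^2 == 2A
x, eps = 200, 50                # keep x - eps >= 0 so the quadratic form applies

direct = discrepancy(x, eps, A, B)
via_formula = numerator(x, eps, A, B) / denominator(x, eps, A, B)
reduced = 3 * A**2 * eps**2 * x**2 + 3 * A * B * eps**2 * x - A**2 * eps**4
print(direct == via_formula, numerator(x, eps, A, B) == reduced, float(direct))
```

Because `Fraction` arithmetic is exact, the two equality checks confirm the numerator and its reduced form for these inputs rather than merely approximating them.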

David Mumford and John Tate, in their 2015 obituary for Alexander Grothendieck, motivated Grothendieck’s use of nilpotent elements via situations where one can consider ${\epsilon^2}$ to be truly negligible—that is, to put ${\epsilon^2 = 0}$ .

Here we have an ostensibly better situation: In our original expression for ${f(x)}$ , the coefficient ${A}$ of ${x^2}$ has to stay pretty small. The linear term for ${N(x)}$ has coefficient ${A^{3/2}\epsilon^2}$ and the ${x^2}$ term has ${A^2 \epsilon^2}$ . Thus if we could work in an algebra where

$\displaystyle A^{3/2}\epsilon^2 = 0,$

then the pinpoint value ${f(x)}$ and all averages ${\frac{1}{2}(f(x+\epsilon) + f(x-\epsilon))}$ for uncertainty would exactly agree. No separate parameter ${\epsilon_P}$ would be needed.

Alas, insofar as the real world runs on real algebra rather than Grothendieck algebra, we have to keep the numerator ${N(x)}$ and the denominator ${D(x)}$ . One can choose ${A}$ to approximate the above green or red chess rating curves in various ways, and then compare the discrepancy for various combinations of ${x}$ and ${\epsilon}$ . The discrepancies for my “Newtonian” ${f(x)}$ tend to be about twice as great as for the standard curves. That is too bad. But I still wonder whether the above calculation of the prediction discrepancy ${N(x)/D(x)}$ , with its curious ${A^{3/2}}$ feature, has further uses.

Open Problems

What will AI be able to tell from our “track records” that we cannot?

Several theories of test-taking postulate a sigmoid relationship between a student’s ability ${x}$ and his/her likelihood ${f(x)}$ of getting a given exam question right. Changing the difficulty of the question shifts the curve left or right. For a multiple-choice question with ${m}$ choices the floor might be ${1/m}$ rather than ${0}$ to allow for “guessing” but otherwise, similar axioms hold. Inverting the various ${f(x)}$ gives a grading rubric for the exam. Do outcomes tend to be bunched toward the middle more than predicted? Are exam “ratings” (that is, grades) robust enough—as chess ratings are—to tell?

Aggregating the ${f(x)}$ curves for various questions on an exam involves computing weighted averages of logistic curves. Is there literature on mathematical properties of the space of such averaged curves? Is there a theory of handling discrepancy terms like my ${N(x)}$ above?
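To make the exam setting concrete, here is a sketch assuming a logistic item curve with an optional guessing floor of ${1/m}$ . The parameter names, the sample items, and the bisection-based inversion are all hypothetical choices for illustration, not a standard taken from any particular testing theory.

```python
import math

def item_curve(x, difficulty, m=None, slope=1.0):
    """Chance of a correct answer at ability x; floor is 1/m for m-way multiple choice."""
    floor = 1.0 / m if m else 0.0
    s = 1.0 / (1.0 + math.exp(-slope * (x - difficulty)))
    return floor + (1.0 - floor) * s

def exam_curve(x, items):
    """Weighted average of item curves; items is a list of (weight, difficulty, m)."""
    total_w = sum(w for w, _, _ in items)
    return sum(w * item_curve(x, d, m) for w, d, m in items) / total_w

def invert(curve, score, lo=-10.0, hi=10.0):
    """Bisect for the ability whose expected score equals the given score."""
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if curve(mid) < score:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical exam: two four-choice questions and one free-response question.
items = [(1.0, -1.0, 4), (2.0, 0.0, 4), (1.0, 1.5, None)]
exam = lambda x: exam_curve(x, items)

ability = invert(exam, 0.7)
print(ability, exam(ability))
```

Inverting the aggregate curve this way is one reading of the “grading rubric” idea: a fractional score on the exam maps back to an ability estimate.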

