
Pandemic Lag


In chess ratings and what other measures of cognitive development?

Henri Didon was a French priest and promoter of youth sports in the late 1800s. He coined the phrase Citius, Altius, Fortius, meaning faster, higher, stronger. It was associated with the Olympic Games from their reinception in 1896 and was officially proclaimed as their motto when the Games were held in Paris in 1924. For the 2020 Games, being held now in 2021, the word Communiter, meaning together, has been added; it is said to express solidarity during the pandemic.

Today we review how the official measure of being faster, higher, and stronger at chess has been impacted by the pandemic.

Didon spoke of his words as “the foundation and raison d’être of athletics” amid the progress of humanity. They have been borne out by the steady progression of athletic records over the Games’ 125-year history. Whether the Tokyo Games will continue that trend remains open. Besides the year’s delay and the pandemic’s impact on qualifying competitions and on athletic conditioning in general, there has emerged a question of mental effects amid the lack of spectators and the straitened atmosphere. The one example I’ll quote is the claim by the Hungarian swimmer Kristóf Milák that a pre-race mishap with his favorite swimming trunks cost him a record in an event he still won:

“They split 10 minutes before I entered the pool and in that moment I knew the world record was gone. I lost my focus and knew I couldn’t do it.”

At least the means of measuring athletic performances have not been disrupted. For psychometrics—a word meaning the science of measuring mental capacities and processes—the standardized tests most often used to measure aptitude have themselves been curtailed. This leaves all the more open the question of how our youth have progressed in education during the pandemic. We will examine the special case of chess, where the official instrument has been almost entirely frozen for 15 months, and where my own work carries both the ability and the responsibility to make up the difference.

Chess Ratings and Lag

The Elo rating system is simple but accurate enough to be used by sporting federations besides chess. In chess, 1000 is a typical rating for a novice player, 1600 means a good club player, 2200 is the threshold for “master,” and 2800 is world championship standard. A player’s rating measures skill in such a way that the difference from an opponent’s rating yields probabilities by which to predict the outcomes of games between them. Elo is the main prediction engine of FiveThirtyEight for basketball, baseball, and football (but not soccer).
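To make the prediction rule concrete, here is a minimal sketch in Python of the standard logistic Elo expectation formula (the function name is mine; the 400-point scale constant is the conventional one). Only the rating difference enters, which is why an additive shift in all ratings leaves the predicted chances unchanged:

```python
def elo_expected_score(rating_a, rating_b):
    """Expected score for player A under the standard Elo logistic curve.
    Only the difference rating_a - rating_b matters; 400 is the
    conventional scale constant."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

# A 200-point rating edge predicts roughly a 76% expected score:
print(round(elo_expected_score(2400, 2200), 2))  # 0.76
```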

Although the prediction formula uses only differences, so that an additive shift in all ratings would not affect the chances, I have shown that the ratings administered by the International Chess Federation (FIDE) have stayed stable in absolute terms relative to the objective quality of moves played, as measured by my own predictive model via my Intrinsic Performance Ratings (IPRs), which are geared to the FIDE rating scale. Having stable numbers is vital not only to my cheating tests but also to public understanding of the system as a whole, and not only for FIDE but also for the use of Elo by Internet gaming federations and even by Tinder.

Thus it is all the more sad for me to see things like this happen not only to FIDE’s Elo ratings but also to those of the US Chess Federation (USCF), which adopted Arpad Elo’s formulas in the 1950s:

This is the FIDE Rating Progress Chart of Annie Wang, who just won the US Junior Women’s Championship, played in person at the Saint Louis Chess Club last week. Her FIDE rating has been stuck at 2384 ever since the April 2020 rating list. One glance at the chart suffices to project her rating into the neighborhood of 2500 by now. Her USCF rating is closer at 2457, but this is offset by a long-known inflation of USCF ratings relative to FIDE, measured at about 75 points at that level in May 2020. Wang’s USCF rating has been similarly frozen. You can find the same for a plethora of young players, down to aspiring kids of single-digit age blasting out of three-digit ratings as Wang did—they have a flat line like the ones circled in blue, but where she had a sharp rise (circled in green).

The Need to Adjust

The lag mattered immediately for me as I gave daily statistical reports to the tournament’s chief arbiter last week. Using Wang’s official rating would have underestimated her true strength and biased my reports in the direction of false positives. Instead, having developed a formula that I won’t claim is anything more than Fermi-estimated, I calculated her effective FIDE rating as 2482, adding almost 100 points. I would have upped her USCF rating to 2543 by the same formula.

Wang was both the highest rated among the ten competitors and the oldest, with a long enough record of international play to have her FIDE K-factor reduced from 40 to 20. My formula adds more points for lower ratings, higher K-factor, and younger age—all reflecting the arc of many improving junior players. My average increase to the women’s ratings was 199.1 points, versus 57.4 to the ten players in the junior men’s/mixed championship, who had mostly higher ratings to begin with.

Also playing in St. Louis were ten in the US Senior Championship, including last year’s winner Joel Benjamin, whom I knew and played in the 1970s when we were kids. Their ratings have been likewise frozen. Rating points in chess are zero-sum, so the triple-digit gains I have credited to the young would in normal reality have been taken out of other players—most plausibly, us geezers. There are more of us than keen juniors, so the presumed individual losses would be less.
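To see the zero-sum point concretely, here is a minimal sketch of the standard Elo update rule (the function name is mine). When both players have the same K-factor, the winner’s gain exactly offsets the loser’s loss, because the two expected scores sum to 1:

```python
def elo_change(rating, opp_rating, score, k=20):
    """Standard Elo update: change = K * (actual score - expected score)."""
    expected = 1.0 / (1.0 + 10.0 ** ((opp_rating - rating) / 400.0))
    return k * (score - expected)

# A 2300 player upsets a 2500 player; with equal K the transfers cancel:
print(round(elo_change(2300, 2500, 1.0), 1))  # about +15.2 gained
print(round(elo_change(2500, 2300, 0.0), 1))  # about -15.2 lost
```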

Did that prove out? My IPRs furnish a way to verify. They differ from other deployed quality metrics by organically involving the difficulty of the positions a player faces, in several ways besides the complexity and temptation factors I incorporated two years ago. Here are the results—but bear in mind that these three 10-player tournaments are small data: their two-sigma error bars on the average IPRs are about {\pm 80} Elo points.

  • US Jr. W: Avg. rating 2101, adjusted 2300, avg. IPR 2337 (+37).
  • US Jr. M: Avg. rating 2492, adjusted 2550, avg. IPR 2527 (-23).
  • US Sr. M: Avg. rating 2494 (no adjustment), avg. IPR 2459 (-35).

The truly significant result is that the women performed much closer to my adjustment than to their official ratings. The men were only slightly closer, amid general insignificance that applies also to the seniors. The juniors combined came remarkably close to my projections.

Right now I am gathering data from larger Open tournaments in this first month of widespread in-person play. There have been some hits and misses, and I have not yet evaluated all (un-)controllable factors. But gathering the original large data for my adjustment formula required coping with a major factor: the 100–200x higher evident cheating rate I’ve observed in online chess.

How To Be Not Very Wrong

I first perceived the phenomenon when monitoring the European Youth Online Rapid Chess Championship last September. I compiled full analysis of all 689 competitors in the women’s and men’s/mixed sections ranging from Under-12 to Under-18. Besides four particular cases, my results said that probably at least four of another five players were cheating, but without the confidence needed to flag any one of them. Removing the high outliers did not, however, bring either the IPRs or my sharper test of conformance to the bell curve into line with my projections. The Under-12 M and W and Under-14 M sections had IPRs averaging 83, 235, and 125 points higher, respectively. The Under-14 W, U16, and U18 sections were close to my projections, so I did not suspect general modeling issues.

The online World Youth Rapid Championships in November-December, which added an under-10 division, brought the lag phenomenon out in force, on all continents. The correction I postulated even before that tournament finished was:

15 Elo {\times} (months since April 2020), higher for those under 13 (50% to 2x higher).

There are several reasons I have not tried to be more precise. There is uncertainty about how many high outliers to remove, about faster time controls, and about geographical drifts in ratings. The effect depends on how much a junior player is disposed to improve in the first place; I found it absent in the lower divisions of the UK’s junior leagues played online last winter. In an individual cheating case I take a more particular fix on the appropriate rating. The equation’s purpose is to show the fairness of my baseline relative to the field on the whole. There are also non-cheating purposes, which should come to the fore as FIDE and other federations emerge from the pandemic, and which I discuss next.

I have been using essentially this formula ever since. From large scholastic tournaments across the globe this spring, I specified the adjustment for those with birth year 2008 or later as 25 Elo {\times} months since April 2020. For players with official rating {R > 2000} I apply the rough multiplier {(3000 - R)/1000}, and for those with {K < 40} I (also) multiply by {\sqrt{K/40}}.
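Here is a minimal sketch of this Fermi-estimated adjustment as I read it from the description above (the function name is mine, and that the multipliers scale the monthly increment is my reading of the text). That reading reproduces the 2482 effective rating quoted earlier for Annie Wang, using her official 2384, K = 20, and the 15 months from April 2020 to July 2021:

```python
from math import sqrt

def adjusted_rating(rating, k_factor, months_since_apr_2020,
                    born_2008_or_later=False):
    """Sketch of the Fermi-estimated pandemic-lag adjustment:
    15 Elo per month (25 for birth year 2008 or later), scaled by
    (3000 - R)/1000 when R > 2000 and by sqrt(K/40) when K < 40."""
    per_month = 25.0 if born_2008_or_later else 15.0
    adj = per_month * months_since_apr_2020
    if rating > 2000:
        adj *= (3000 - rating) / 1000.0   # rough multiplier for higher ratings
    if k_factor < 40:
        adj *= sqrt(k_factor / 40.0)      # reduced K signals a longer record
    return rating + adj

# Annie Wang: official FIDE 2384, K = 20, 15 months after April 2020:
print(round(adjusted_rating(2384, 20, 15)))  # 2482
```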

I won’t claim the ‘15’ and ‘25’ are right compared to adding or subtracting 1 or 2 more. But the results I have been getting all year say that my 15 and 25 are most often closer than 10 or 20 or 30 would be. In almost all cases, as for the US Jr. W above, my pre-set rating calibration has come an order of magnitude closer to the IPR verification than the size of the adjustments themselves. Taking a cue from the title of the predecessor to the Jordan Ellenberg book I previewed last month, my main concern is to be not very wrong.

A Dilemma Moving Forward

Providing an accurate and stable rating system has long been recognized as a prime service of FIDE. A legal dimension has been added insofar as evaluating cheating allegations requires a prior assessment of the natural skill of the accused player. The pandemic has made me take over much of the latter responsibility, but the former presents a wider dilemma doubtless faced in some form by other impacted sporting federations and educational assessment agencies on the whole:

Is it a higher responsibility to provide the most accurate assessment of current ability obtainable now, or to maintain continuity of the official assessment mechanism?

I could go even wider to analogize this to the US Census debate over whether estimations, presuming demonstration of their greater accuracy, should be used in preference to the conducted count. The latter is enshrined in the US Constitution, while the principle that chess rating points should be won or lost only in actual combat is similarly hallowed. But I have certainly demonstrated that the current official ratings of almost all the keenest young players are very wrong.

Mathematically, the rating system will re-establish equilibrium if the current discrepancy is left alone. The trouble is that the mathematical nature of the update and the relative paucity of chess games also guarantee that the process will be slow, measured in years. FiveThirtyEight has remarked in several recent articles on the long update times in baseball as measured by Elo ratings. My cheating tests often cannot wait a day. I have to use my cross-check and validation features to detect and remove, in large quantity, mathematically the same kind of bias that is believed to afflict other currently deployed predictive models less transparently.
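To give a feel for the timescale, here is a back-of-the-envelope sketch (my own toy model, not anything FIDE uses): a player whose true strength sits 100 points above the official rating, always paired against opposition at that official rating, gains on expectation K times the surplus win probability per game. With K = 20 and four rated games per month, closing the gap to within 10 points takes about 19 months in this model:

```python
def months_to_close_gap(gap=100.0, k=20, games_per_month=4):
    """Toy model of expected Elo drift toward true strength, assuming
    every game comes against opposition rated at the player's current
    official rating, so each update expects a 50% score."""
    def expected(diff):
        return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))
    months = 0
    while gap > 10:
        drift_per_game = k * (expected(gap) - 0.5)  # expected gain per game
        gap -= drift_per_game * games_per_month
        months += 1
    return months

print(months_to_close_gap())  # 19: over a year and a half even at this pace
```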

There is precedent for a large-scale adjustment of ratings by FIDE. Women’s chess used to be even more segregated from men’s than it is today. In 1986, Arpad Elo himself—as secretary of FIDE’s Qualifications Commission—reported that women’s ratings had drifted down by about “one half of a class interval.” FIDE added 100 points to the rating of every active female player except Susan Polgar, whose rating was already ‘well-mixed’ according to the report, since she had faced many more male players than the others had.

Attempting to resolve that historical controversy by computing IPRs for Polgar and the other players in Elo’s study has never reached my front burner. But the point remains that my work is uniquely capable of informing the state of ratings in a radical manner. The pandemic has created both a need and an opportunity for a reset that could also solve other issues previously noted—while ensuring that ratings on all continents are on a common scale.

Open Problems

How pronounced is the lag of assessment in education and other competitive arenas, both physical and in mind-sports?

I had not noticed that Tyler Cowen had already used the term “psychometric test” in a post on the Marginal Revolution blog at the beginning of the pandemic, until he repeated it just today.

I have hinted at some other issues in chess but stopped short of addressing them. One is whether online play, where 5-minute “Blitz” down to 1-minute “Bullet” time controls predominate even over “Rapid” beginning at 10 minutes, has a similar effect on development in the absence of any in-person “Classical” chess. Another is whether the observed increase in the ranks of players with 2700+ elite ratings is really Fortius or merely rating inflation. A third is whether the current conditions for in-person chess will last long enough to get a good fix on the ‘post-pandemic’ state of skill, and a fourth—coming back to what I quoted about the current Olympics—is whether they are truly “normal” enough even now.

