
Wordle: Revised mathematical analysis of the first guess

 2 years ago
source link: https://withoutbullshit.com/blog/wordle-revised-mathematical-analysis-of-the-first-guess

Yesterday’s mathematical analysis of the popular word puzzle Wordle was flawed. It assumed that the words used in the puzzle were similar to other word lists, but they’re not — and that changes everything.

What word list does Wordle use — and why does that matter?

Wordle actually uses two word lists (or properly speaking, lexicons). There is a list of over 10,000 legal English words that are appropriate for guesses. And there is a shorter list of 2,315 potential solutions to the puzzle. The short list is carefully and idiosyncratically curated to include only common and familiar words, and it apparently omits plurals like TUCKS.

I had assumed that both lists were private and inaccessible, but the internet quickly informed me that they are actually in the JavaScript code that the Wordle site uses. (Spoiler warning: if you search this information out, you may find all the daily solutions listed in order, which will ruin the game for you.) Fortunately for me, others have extracted and published these lists.

It only took a small modification to my Python code to analyze the new lists. And it made a dramatic difference, because the letter frequencies in the Wordle solutions list are very different from the letter frequencies in the Scrabble list I’d been using.

Revised letter frequencies suggest a different first guess.

Here’s a chart of the letter frequencies in the list of Wordle solutions:

While E appears in the most words, this is not the same frequency list as ordinary English, which begins ETAOIN. And it is also different from the frequencies in the Scrabble list, in which S is the most popular letter because of all the plurals.

A first guess based on these letters would include the letters EAROT. ORATE comes to mind. This is a pretty good first guess, since it will tell you about the three most common vowels immediately, along with the frequently occurring R and T.
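For readers who want to reproduce this kind of count, here is a minimal sketch. It assumes the frequency in question is "how many words contain the letter at least once," which is how the text above describes it. The function name `letter_frequencies` and the five-word `SAMPLE` list are illustrative stand-ins, not the real solutions list or my actual code.

```python
from collections import Counter

def letter_frequencies(words):
    """Count, for each letter, how many words contain it at least once."""
    counts = Counter()
    for word in words:
        # set() ensures a repeated letter (the three E's in EERIE) counts once per word
        counts.update(set(word.upper()))
    return counts

# Tiny illustrative sample, NOT the real Wordle solutions list
SAMPLE = ["SCARE", "STORE", "TUCKS", "ORATE", "EERIE"]

freq = letter_frequencies(SAMPLE)
# Most frequent first; ties broken alphabetically
top = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
print(top[:5])
```

Swapping `SAMPLE` for the published solutions list would give a ranking to compare against EAROT.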

However, as we learned yesterday, if you want your guess to have a higher likelihood of direct hits (letters in the right place), you need to look at frequencies in each of the five letter positions.

Here’s how that looks for the new list of solution words:

So the “word” most likely to get direct hits is SAAEE — which is not an actual valid guess. Looking at these frequencies, a good guess word should start with S and end in E, and include an A in the middle.

SAINE (to make the sign of the cross) and SOARE (an obsolete word for a young hawk) are good candidates, if you don’t mind guessing archaic or obsolete words. Other potential choices that aren’t so arcane are SANER, SAINT, or SLATE.
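The per-position tally can be sketched the same way. This is an assumption-laden illustration: `positional_frequencies`, `most_common_per_position`, and the `SAMPLE` words are hypothetical names and data, and ties are broken alphabetically simply so the result is deterministic.

```python
from collections import Counter

def positional_frequencies(words, length=5):
    """Return one Counter per letter position (index 0 is the first letter)."""
    tables = [Counter() for _ in range(length)]
    for word in words:
        for i, ch in enumerate(word.upper()[:length]):
            tables[i][ch] += 1
    return tables

def most_common_per_position(words, length=5):
    """Build the (possibly invalid) 'word' made of each position's top letter."""
    tables = positional_frequencies(words, length)
    # max() keeps the first maximum it sees, so sorting first breaks ties alphabetically
    return "".join(max(sorted(t), key=t.get) for t in tables)

# Tiny illustrative sample, NOT the real Wordle solutions list
SAMPLE = ["SCARE", "SHARE", "STORE", "SAINT", "CRANE"]
print(most_common_per_position(SAMPLE))
```

As with SAAEE, the string this produces need not be a legal guess; it only summarizes where the most common letters sit.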

The brute force method reveals the best guesses

With a slight tweak to the code, I tested every possible guess against every possible solution. The results were enlightening.
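The core of any such brute-force test is a function that scores one guess against one solution. The sketch below is one plausible way to do it, not necessarily the exact code I ran: `score` and `score_all` are hypothetical names, and repeated letters are handled with a multiset intersection, which matches how Wordle itself colors duplicates.

```python
from collections import Counter

def score(guess, solution):
    """Return (exact, misplaced) for one guess against one solution.

    exact     = letters in the right place (green)
    misplaced = letters present but in the wrong place (yellow)
    """
    exact = sum(g == s for g, s in zip(guess, solution))
    # Letters shared regardless of position, respecting how often each occurs
    shared = sum((Counter(guess) & Counter(solution)).values())
    return exact, shared - exact

def score_all(guesses, solutions):
    """Brute force: score every guess against every solution."""
    return {g: [score(g, s) for s in solutions] for g in guesses}

print(score("SOARE", "SCORE"))  # S, R, E in place; O present but misplaced
```

With roughly 10,000 legal guesses and 2,315 solutions, `score_all` makes on the order of 25 million calls to `score`, which is slow but feasible in plain Python.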

If you want to maximize the chances of getting any hits at all, you’ll guess URAEI, the plural of “uraeus,” which is the snaky thing on top of Egyptian sarcophagi. (That’s one new fact you learned today!) You could also guess its anagram AUREI (ancient Roman gold coins). Each has a 95% chance of generating a hit, and each will give you information about the vowels A, E, I, and U. ALOES and ADIEU are close behind, with a respectable 93% hit rate.
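The "any hit" rate is the simplest of these measures: the share of solutions that have at least one letter in common with the guess. A minimal sketch, where `any_hit_rate` and the `SAMPLE` list are illustrative assumptions rather than the real data:

```python
def any_hit_rate(guess, solutions):
    """Fraction of solutions sharing at least one letter with the guess."""
    hits = sum(1 for s in solutions if set(guess) & set(s))
    return hits / len(solutions)

# Tiny illustrative sample, NOT the real Wordle solutions list
SAMPLE = ["SCARE", "STORE", "LYNCH", "PUPPY", "CRYPT"]

print(any_hit_rate("AUREI", SAMPLE))
print(any_hit_rate("ADIEU", SAMPLE))
```

In this toy list, the R in AUREI catches CRYPT, which ADIEU misses. That is the same kind of edge that separates the two words on the real list.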

If you want to maximize the chances of getting exact matches (right letter in the right place), your best choices are SOOEY (a hog call), SAREE (alternate spelling of SARI, the common Indian garment), SOREE (an obsolete name for a bird), SIREE (as in “Yes, siree!”), and SEMEE (a spotted field in heraldry). That last one is apparently just what happens when you fling as many E’s as possible into one word. Unless you get your thrills from direct hits, I wouldn’t recommend any of these words: the double letters mean you’ll get information about only three or four different letters, when you could be finding out about five.

If you want to maximize the total number of matches in the word (as opposed to the chance for getting one match), you’ll pick one of the anagrams ORATE (speak formally in front of a group), OATER (Hollywood slang for a Western), or ROATE (to learn by repetition). Each will get you an average of 1.79 hits per guess. Close behind are REALO (a German politician who is a Green, but moderate), ARTEL (a Russian crafts cooperative), TALER (old German coins), RATEL (a type of badger), TERAI (a wide-brimmed hat), and RETIA (a type of yarn). Can you feel your vocabulary growing?
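Average hits per guess can be sketched as follows, assuming a "hit" is any letter of the guess that also appears in the solution, in or out of position, with repeated letters counted no more often than they occur in both words. The names `total_hits` and `average_hits` and the sample list are hypothetical.

```python
from collections import Counter

def total_hits(guess, solution):
    """Letters matched in any position, respecting repeated letters."""
    return sum((Counter(guess) & Counter(solution)).values())

def average_hits(guess, solutions):
    """Mean number of hits for one guess across all solutions."""
    return sum(total_hits(guess, s) for s in solutions) / len(solutions)

# Tiny illustrative sample, NOT the real Wordle solutions list
SAMPLE = ["SCARE", "STORE", "LYNCH", "PUPPY"]

for word in ("ORATE", "OATER", "ROATE"):
    print(word, average_hits(word, SAMPLE))
```

Because this measure ignores position, anagrams always tie, which is why ORATE, OATER, and ROATE share the top spot.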

The best guess, according to my combined score

I value letters in position twice as much as letters out of position, since they seriously help you narrow down your guesses. Based on that score, what are the most valuable guesses? Here’s the list, best guesses first:

SOARE (young hawk)
ROATE (learn by repetition)
RAILE (to flow steadily)
SAINE (to make the sign of the cross)
ORATE (to speak formally)
STRAE (straw, in Scottish dialect)
RAINE (a kingdom)
SLANE (a spade for cutting turf)
SALET (a helmet that covers the back of your neck)
ARIEL (an African gazelle — the Disney princess is a proper noun and therefore not relevant in this context)
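A ranking like the one above can be sketched by weighting each letter in position at 2 and each letter out of position at 1, then averaging over all solutions. Only the 2-to-1 weighting comes from the text; the function names, the averaging step, and the sample lists are assumptions made for illustration.

```python
from collections import Counter

def combined_score(guess, solution, exact_weight=2, misplaced_weight=1):
    """Weighted score: in-position letters count double by default."""
    exact = sum(g == s for g, s in zip(guess, solution))
    shared = sum((Counter(guess) & Counter(solution)).values())
    return exact_weight * exact + misplaced_weight * (shared - exact)

def rank_guesses(guesses, solutions):
    """Return (guess, average score) pairs, best first; ties alphabetical."""
    n = len(solutions)
    averages = {g: sum(combined_score(g, s) for s in solutions) / n
                for g in guesses}
    return sorted(averages.items(), key=lambda kv: (-kv[1], kv[0]))

# Tiny illustrative samples, NOT the real Wordle lists
SOLUTIONS = ["SCARE", "STORE", "SHORE", "CRANE"]
GUESSES = ["ADIEU", "ORATE", "SOARE"]

ranking = rank_guesses(GUESSES, SOLUTIONS)
print(ranking)
```

Changing `exact_weight` is an easy way to see how sensitive the ranking is to the choice of weighting.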

Here’s why I like SOARE. It includes four of the five most popular letters, and will tell you right away about the three most popular vowels. It begins with S and ends with E, the two highest-likelihood letters in position, so there’s a decent chance you’ll know how your solution word begins and ends. It is a complete miss only 8% of the time. Half the time it will get you a letter in the right place, and will generate an average of 1.77 hits. And finally, it will get you a shot at 11 near matches (that is, either four hits with three in the right place, or five hits with two in the right place): SCARE, SCORE, SHARE, SHORE, SNARE, SNORE, SPARE, SPORE, STARE, STORE, and SWORE.

If you’re allergic to obsolete words, I suppose you could try ORATE. But surprisingly, none of these ten highest-scoring words is in the list of solutions, not even ORATE, which ought to be. In fact, none of the top 800 best guesses are in the list of solutions, because many of them are plurals, end in “ed,” or end in “er.”

I look forward to your insights on this revision based on the real Wordle data. What has been your experience?

