Wordle Letter Statistics
You’ve likely already heard of Wordle, the simple but excellent word game that’s sweeping the internet right now. If not, the gist of it is that you’re given six attempts to guess a five-letter word, and every incorrect guess gives you clues pointing you towards the correct word. If a letter is in the right place, it’s marked in green, and if it’s present but in the wrong place, it’s marked in yellow.
I’m not too interested in “solving” the game programmatically, but upon playing I was immediately curious about the statistics behind the words used in the game. I have the mnemonic ETAOIN SHRDLU for the 12 most common letters permanently stuck in my head, but these frequencies will differ from Wordle in a few important ways:
- Wordle only contains five-letter words, whereas standard letter frequency analysis is based on a corpus containing words of arbitrary length. Many short, common words such as “the”, “be”, “to”, and “of” are missing.
- Speaking of common words, word frequency is mostly irrelevant. Wordle picks a word uniformly from its list of words, without placing more weight on common words.
Luckily, Wordle’s word list is easy to find in the page’s JavaScript, so we can extract it and do our own analysis. Wordle actually includes two separate word lists – one that only includes more common words (2314 words total), which is used for answers, and one that includes more obscure words (10656 words total), which is combined with the first list for checking whether a guess is permissible. The analysis here is done on the smaller word list, as that’s what we care about when guessing. This does mean that word frequency matters somewhat, since only fairly common words make the list, but within the list no word is weighted higher than any other.
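The snippets in this post assume the answer list is available as a Python list called ANSWER_WORDS. Here is a minimal sketch of one way to build it, assuming the words have been copied out of the page’s JavaScript into a plain-text file with one word per line. The file name `wordle_answers.txt` and the helpers `parse_words` and `load_words` are placeholders of mine, not anything Wordle provides.

```python
from pathlib import Path


def parse_words(text):
    """Return the lowercase five-letter alphabetic words in `text`, one per line."""
    words = []
    for line in text.splitlines():
        word = line.strip().lower()
        # Skip blank lines and anything that is not a plain five-letter word.
        if len(word) == 5 and word.isalpha():
            words.append(word)
    return words


def load_words(path):
    """Read a word list from a text file (hypothetical path, one word per line)."""
    return parse_words(Path(path).read_text(encoding='utf-8'))


# Hypothetical usage; the file name is a placeholder:
# ANSWER_WORDS = load_words('wordle_answers.txt')
```

Filtering on length and `isalpha()` is just a cheap sanity check against stray punctuation left over from copying.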
Questions
I had three main questions I wanted to answer when doing this analysis:
Overall letter frequency
For each letter in the alphabet, what proportion of words contain that letter at least once?
This will help guide us towards guessing words with more common letters first, so we can get yellow letters sooner.
Computed via:
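from collections import Counter
# ANSWER_WORDS is the answer word list extracted from the page's JavaScript.
letter_freqs = Counter()
for word in ANSWER_WORDS:
    # set(word) counts each letter at most once per word.
    for letter in set(word):
        letter_freqs[letter] += 1
for letter, count in letter_freqs.most_common():
    print('{0}: {1:>5.1%}'.format(letter, count / len(ANSWER_WORDS)))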
Result:
Letter | Frequency |
---|---|
E | 45.6% |
A | 39.3% |
R | 36.2% |
O | 29.1% |
T | 28.8% |
L | 28.0% |
I | 27.9% |
S | 26.7% |
N | 23.8% |
U | 19.7% |
C | 19.4% |
Y | 18.0% |
H | 16.4% |
D | 16.0% |
P | 14.9% |
G | 13.0% |
M | 12.9% |
B | 11.5% |
F | 8.9% |
K | 8.7% |
W | 8.4% |
V | 6.4% |
X | 1.6% |
Z | 1.5% |
Q | 1.3% |
J | 1.2% |
Observations:
- The vowels (E, A, O, I, U) are still very frequent, and even appear in the same relative order as in ETAOIN SHRDLU.
- “R” is way up the list compared to normal letter frequency, possibly because the word list includes lots of agent nouns which usually end in “-er”.
- “Y” is higher up the list, possibly because of lots of adverbs ending in “-ly”.
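The two suffix guesses above are easy to test. As a rough sketch (the `suffix_share` helper and the tiny sample list are mine, purely for illustration), this computes what share of the words containing a given letter also end in a given suffix:

```python
def suffix_share(words, letter, suffix):
    """Of the words containing `letter`, return the fraction ending in `suffix`."""
    with_letter = [w for w in words if letter in w]
    if not with_letter:
        return 0.0
    return sum(w.endswith(suffix) for w in with_letter) / len(with_letter)


# Tiny stand-in list for illustration only; substitute the real ANSWER_WORDS.
SAMPLE_WORDS = ['baker', 'crane', 'truly', 'yacht', 'moist']
print('{0:.0%}'.format(suffix_share(SAMPLE_WORDS, 'r', 'er')))
print('{0:.0%}'.format(suffix_share(SAMPLE_WORDS, 'y', 'ly')))
```

If “-er” words only account for a small share of the words containing “R”, the agent noun explanation can’t be the whole story.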
Per-position distributions
For each of the five positions in a word, what is the distribution of letters appearing in that position?
This will help us place known letters in the right position early on, when we have little information, and perhaps pick early guesses that are more likely to produce green letters.
Computed via:
from collections import Counter
import string
# One Counter per position, pre-filled with every letter so that zero counts are printed too.
pos_freqs = [Counter({letter: 0 for letter in string.ascii_lowercase}) for _ in range(5)]
for word in ANSWER_WORDS:
    for i, letter in enumerate(word):
        pos_freqs[i][letter] += 1
for i, counter in enumerate(pos_freqs):
    print('Position #{0}: {1}'.format(i + 1, '_' * i + 'X' + '_' * (4 - i)))
    for letter, count in counter.most_common():
        print('{0}: {1:>5.1%}'.format(letter, count / len(ANSWER_WORDS)))
Result:
X____ | _X___ | __X__ | ___X_ | ____X |
---|---|---|---|---|
S: 16% | A: 13% | A: 13% | E: 14% | E: 18% |
C: 9% | O: 12% | I: 12% | N: 8% | Y: 16% |
B: 8% | R: 12% | O: 11% | S: 7% | T: 11% |
T: 6% | E: 11% | E: 8% | A: 7% | R: 9% |
P: 6% | I: 9% | U: 7% | L: 7% | L: 7% |
A: 6% | L: 9% | R: 7% | I: 7% | H: 6% |
F: 6% | U: 8% | N: 6% | C: 7% | N: 6% |
G: 5% | H: 6% | L: 5% | R: 7% | D: 5% |
D: 5% | N: 4% | T: 5% | T: 6% | K: 5% |
M: 5% | T: 3% | S: 4% | O: 6% | A: 3% |
R: 5% | P: 3% | D: 3% | U: 4% | O: 3% |
L: 4% | W: 2% | G: 3% | G: 3% | P: 2% |
W: 4% | C: 2% | M: 3% | D: 3% | M: 2% |
E: 3% | M: 2% | P: 3% | M: 3% | G: 2% |
H: 3% | Y: 1% | B: 3% | K: 2% | S: 2% |
V: 2% | D: 0.9% | C: 2% | P: 2% | C: 1% |
O: 2% | B: 0.7% | V: 2% | V: 2% | F: 1% |
N: 2% | S: 0.7% | Y: 1% | F: 2% | W: 0.7% |
I: 2% | V: 0.6% | W: 1% | H: 1% | B: 0.5% |
U: 1% | X: 0.6% | F: 1% | W: 1% | I: 0.5% |
Q: 1% | G: 0.5% | K: 0.5% | B: 1.0% | X: 0.3% |
J: 0.9% | K: 0.4% | X: 0.5% | Z: 0.9% | Z: 0.2% |
K: 0.9% | F: 0.3% | Z: 0.5% | X: 0.1% | U: 0.0% |
Y: 0.3% | Q: 0.2% | H: 0.4% | Y: 0.1% | J: 0.0% |
Z: 0.1% | J: 0.1% | J: 0.1% | J: 0.1% | Q: 0.0% |
X: 0.0% | Z: 0.1% | Q: 0.0% | Q: 0.0% | V: 0.0% |
Vowel: 14% | Vowel: 52% | Vowel: 50% | Vowel: 37% | Vowel: 24% |
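The “Vowel” row at the bottom of the table isn’t produced by the snippet above. One way it could be computed, as a sketch that assumes “vowel” means A, E, I, O, U with “Y” treated as a consonant (which matches the first column: 6 + 3 + 2 + 2 + 1 = 14%); the helper name and sample list are illustrative only:

```python
VOWELS = set('aeiou')  # assumption: "Y" is counted as a consonant


def vowel_share_by_position(words):
    """Return, for each of the five positions, the fraction of words with a vowel there."""
    return [sum(w[i] in VOWELS for w in words) / len(words) for i in range(5)]


# Tiny stand-in list for illustration only; substitute the real ANSWER_WORDS.
SAMPLE_WORDS = ['crane', 'audio', 'moist', 'truly']
print(' | '.join('Vowel: {0:.0%}'.format(share)
                 for share in vowel_share_by_position(SAMPLE_WORDS)))
```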
We can also break this down by letter instead, to show where a letter is most likely to appear given that it appears at least once:
Computed via:
import string
# For each letter: [words containing it, count at position 1, ..., count at position 5].
letter_position_freqs = {letter: [0] * 6 for letter in string.ascii_lowercase}
for word in ANSWER_WORDS:
    for letter in set(word):
        letter_position_freqs[letter][0] += 1
    for i, letter in enumerate(word):
        letter_position_freqs[letter][i + 1] += 1
# A row can sum to more than 100%, because a repeated letter occupies several positions.
for letter in string.ascii_lowercase:
    words_with_letter, *counts = letter_position_freqs[letter]
    print('{0}: '.format(letter) + ' '.join(
        '{0:>5.0%}'.format(count / words_with_letter)
        for count in counts))
Result:
Letter | X____ | _X___ | __X__ | ___X_ | ____X |
---|---|---|---|---|---|
A | 16% | 33% | 34% | 18% | 7% |
B | 65% | 6% | 21% | 9% | 4% |
C | 44% | 9% | 12% | 34% | 7% |
D | 30% | 5% | 20% | 19% | 32% |
E | 7% | 23% | 17% | 30% | 40% |
F | 66% | 4% | 12% | 17% | 13% |
G | 38% | 4% | 22% | 25% | 14% |
H | 18% | 38% | 2% | 7% | 37% |
I | 5% | 31% | 41% | 24% | 2% |
J | 74% | 7% | 11% | 7% | 0% |
K | 10% | 5% | 6% | 27% | 56% |
L | 14% | 31% | 17% | 25% | 24% |
M | 36% | 13% | 20% | 23% | 14% |
N | 7% | 16% | 25% | 33% | 24% |
O | 6% | 41% | 36% | 20% | 9% |
P | 41% | 18% | 17% | 14% | 16% |
Q | 79% | 17% | 3% | 0% | 0% |
R | 13% | 32% | 19% | 18% | 25% |
S | 59% | 3% | 13% | 28% | 6% |
T | 22% | 12% | 17% | 21% | 38% |
U | 7% | 41% | 36% | 18% | 0% |
V | 29% | 10% | 33% | 31% | 0% |
W | 43% | 23% | 13% | 13% | 9% |
X | 0% | 38% | 32% | 8% | 22% |
Y | 1% | 6% | 7% | 1% | 87% |
Z | 9% | 6% | 31% | 57% | 11% |
These tables are kinda hard to read, but we can make out some patterns:
- Vowels are most common in 2nd and 3rd position, and least in first position. Bear in mind that these aren’t independent probabilities – I haven’t run the numbers but I would strongly suspect that a vowel in 2nd or 3rd position dramatically reduces the chance of a vowel in the other one. A vowel in 3rd position most likely corresponds to a vowel in 1st, or consonants in both 1st and 2nd.
- Distributions are dramatically different between each position. For example “S” is the most common by a long shot in 1st position, but appears less than 1% of the time in 2nd position.
- “S” very rarely appears at the end, which suggests that plural nouns aren’t present (which is confirmed by manually inspecting the word list).
- This confirms as suspected above that “Y” appears much more often at the end of words.
- “R” on the other hand appears most often in 2nd position, so my agent noun theory may be nonsense.
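The dependence between vowel positions suspected in the first bullet could be checked directly. A rough sketch, again assuming vowels are A, E, I, O, U, with a hypothetical helper `vowel_given_vowel` and a made-up sample list, that estimates how likely a vowel is in one position given a vowel in another:

```python
VOWELS = set('aeiou')  # assumption: "Y" is counted as a consonant


def vowel_given_vowel(words, pos, given_pos):
    """Estimate P(vowel at `pos` | vowel at `given_pos`); positions are 0-indexed."""
    given = [w for w in words if w[given_pos] in VOWELS]
    if not given:
        return 0.0
    return sum(w[pos] in VOWELS for w in given) / len(given)


# Tiny stand-in list for illustration only; substitute the real ANSWER_WORDS.
SAMPLE_WORDS = ['crane', 'audio', 'moist', 'truly', 'bacon']
# Chance of a vowel in 3rd position, given a vowel in 2nd position.
print('{0:.0%}'.format(vowel_given_vowel(SAMPLE_WORDS, 2, 1)))
```

Comparing this figure against the unconditional 50% for 3rd position would show how strong the dependence really is.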
Multiply-appearing letters
How often do different letters appear more than once?
When you see a letter in green or yellow, all you know is that it appears at least once. You are given no information about how many times it appears. But we can get a prior on the count of each individual letter.
Computed via:
from collections import Counter
import string
# letter_counts[letter][n] = number of words containing `letter` exactly n times.
letter_counts = {letter: Counter() for letter in string.ascii_lowercase}
for word in ANSWER_WORDS:
    for letter in set(word):
        letter_counts[letter][word.count(letter)] += 1
for letter in string.ascii_lowercase:
    counter = letter_counts[letter]
    words_with_letter = sum(counter.values())
    print('{0}: '.format(letter) + ' | '.join(
        '{0:>5.0%}'.format(counter[i] / words_with_letter)
        for i in range(1, 4)))
Result:
Letter | 1 | 2 | 3 |
---|---|---|---|
A | 92% | 8% | 0% |
B | 95% | 4% | 0% |
C | 94% | 6% | 0% |
D | 94% | 6% | 0% |
E | 84% | 16% | 0% |
F | 89% | 10% | 0% |
G | 96% | 4% | 0% |
H | 97% | 3% | 0% |
I | 96% | 4% | 0% |
J | 100% | 0% | 0% |
K | 96% | 4% | 0% |
L | 89% | 11% | 0% |
M | 95% | 4% | 1% |
N | 96% | 4% | 0% |
O | 88% | 12% | 0% |
P | 95% | 5% | 1% |
Q | 100% | 0% | 0% |
R | 93% | 7% | 0% |
S | 92% | 8% | 0% |
T | 91% | 9% | 0% |
U | 98% | 2% | 0% |
V | 97% | 3% | 0% |
W | 99% | 1% | 0% |
X | 100% | 0% | 0% |
Y | 98% | 2% | 0% |
Z | 86% | 14% | 0% |
Observations:
- Only “J”, “Q”, and “X” never appear more than once.
- Only “M” and “P” ever appear three times (“poppy” and “puppy” for “P”, and “mammy”, “mommy” and “mummy” for “M”, since you’re wondering).
- The most common letters to appear twice are “E”, “Z”, “O”, “L”, and “F”, which isn’t surprising, as those are all clearly common digraphs when doubled.
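The digraph explanation in the last bullet assumes the two copies usually sit next to each other. That could be checked with something like the following sketch (the helper name and sample words are mine, for illustration), which looks only at words containing a letter exactly twice and reports how often the pair is adjacent:

```python
def double_letter_adjacency(words, letter):
    """Of the words with exactly two of `letter`, return the fraction where they are adjacent."""
    doubled = [w for w in words if w.count(letter) == 2]
    if not doubled:
        return 0.0
    return sum(letter * 2 in w for w in doubled) / len(doubled)


# Tiny stand-in list for illustration only; substitute the real ANSWER_WORDS.
SAMPLE_WORDS = ['steel', 'elder', 'floor', 'crane']
print('{0:.0%}'.format(double_letter_adjacency(SAMPLE_WORDS, 'e')))
```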
Conclusion
There’s a lot more that could be analysed here, particularly conditional probabilities (e.g. “If we know we have an E, what is the most likely second vowel?”). But even what I’ve computed so far is obviously too much to remember in its entirety and apply when playing the game (and referring to these tables while playing feels way too cheaty). The number one takeaway for me is the relative frequency of “R”; I’ll be sure to include that earlier on in my guesses.
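For anyone who wants to pick up the conditional question, here is a starting sketch. It counts which other vowels co-occur with a known vowel, assuming vowels are A, E, I, O, U; the `second_vowel_counts` helper and the sample list are illustrative, not part of the analysis above.

```python
from collections import Counter

VOWELS = set('aeiou')  # assumption: "Y" is counted as a consonant


def second_vowel_counts(words, known):
    """Count the other vowels appearing in words that contain the vowel `known`."""
    counts = Counter()
    for word in words:
        if known in word:
            # Use a set so each vowel is counted at most once per word;
            # a repeat of `known` itself is deliberately not counted.
            counts.update((set(word) & VOWELS) - {known})
    return counts


# Tiny stand-in list for illustration only; substitute the real ANSWER_WORDS.
SAMPLE_WORDS = ['crane', 'audio', 'house', 'steel', 'query']
print(second_vowel_counts(SAMPLE_WORDS, 'e').most_common())
```

Dividing each count by the number of words containing the known vowel would turn these into conditional frequencies.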