# Keyphrase autosolver

I just made a breakthrough in solving Key Phrase ciphers. I’ve written a recursive program that finds solutions by trying words of the right length to match the patterns of the ciphertext words. The program orders the ciphertext words by length and works from the longest down to the shortest, eliminating combinations that create conflicts.

I got the program working a few days ago, but when I tested it on E-6 in the ND19 issue, it ran painfully slowly, too slowly for any practical use. Today I made the breakthrough by realizing that the trick was to have the program try common words first rather than go through the word lists in alphabetical order. I first had to write a program that takes each separate length list, e.g. A3.txt, looks up the frequency of each word, sorts the list, and writes the words out in frequency order, e.g. Fq3.txt. This way the solver gets to the most likely combinations much more quickly. Once it gets a conflict-free solution, it displays the solution in the correct word order. I’ve found that it can find hundreds or thousands of “valid” solutions of this type, but the correct one is near the top.
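The reordering step might look something like the following Python sketch. It assumes each list file holds one word per line and that the frequencies have already been loaded into a dictionary called `counts`. The function names and the file layout are assumptions for illustration, not a description of the real files.

```python
def order_by_frequency(words, counts):
    """Return `words` sorted most common first.

    Words missing from `counts` are treated as frequency 0 and sort last;
    ties fall back to alphabetical order so the output is stable.
    """
    return sorted(words, key=lambda w: (-counts.get(w, 0), w))


def rewrite_list(src_path, dst_path, counts):
    """Read an alphabetical list (e.g. A3.txt) and write it back out in
    frequency order (e.g. Fq3.txt). Assumes one word per line."""
    with open(src_path, encoding="utf-8") as src:
        words = [line.strip().upper() for line in src if line.strip()]
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(order_by_frequency(words, counts)) + "\n")
```

With made-up counts such as `{"THE": 100, "AND": 80}`, the list `["AND", "ANT", "THE"]` comes back as THE, AND, ANT, so the solver would try THE before anything else of that length.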

I tested it on E-6 and it solved it perfectly in less than 30 seconds. I’ve since tested it on two more (published below for you to solve), and in each case it solved the cipher in less than ten seconds. The biggest problem is that it doesn’t judge whether the key is a valid sentence, so it may first find several valid solutions in which one or a few words are more common than the correct plaintext words. For example, in E-6 it found solutions placing the word BY where the correct word should be UP. I may fine-tune the program to score the solutions in some fashion to solve this. In the second example below, the fifth word proved to be the weak point: the program found 129 valid solutions before finding the correct one, all in a matter of seconds. As a practical matter this weakness is probably unimportant most of the time, because once the program has correctly identified the longest words, it is obvious to the user which words are valid, and it is easy to reconstruct the key by hand and finish the dubious words.
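The by-hand key reconstruction can also be sketched in a few lines of Python. This assumes the key phrase is written under the alphabet A to Z, so the key letter in position 1 is the ciphertext for A, position 2 for B, and so on. The function name `partial_key` is hypothetical.

```python
import string


def partial_key(plain_words, ct_words):
    """Build the 26-letter key implied by a proposed solution.

    Plaintext letters not yet seen are shown as '-'. Raises ValueError if
    one plaintext letter would need two different ciphertext letters.
    """
    key = {}
    for p_word, c_word in zip(plain_words, ct_words):
        for p, c in zip(p_word, c_word):
            if key.setdefault(p, c) != c:
                raise ValueError(f"conflict on plaintext letter {p}")
    return "".join(key.get(ch, "-") for ch in string.ascii_uppercase)
```

Feeding in only the long words the solver got right leaves dashes for the unknown letters, and reading the partial key as a phrase usually suggests what the missing letters must be.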

Another weakness is its dependence on all the words being in my word lists. Proper nouns, new slang terms, hyphenated words, etc. may foul it up. I haven’t tested that type of plaintext, but I’ll be experimenting over the next few days. Okay, here are the two test encryptions I used. All the words are in normal word lists. No cribs are provided.

Test Keyphrase 1: FNENEA HNNEAEFE HGGRSHITAE EEEIHIT ENE TE IEF ENSSRIHETAHNI ETAEGGHAE TETBE ANFTFB HAE THTT NFHHA HI ELTEE

Test Keyphrase 2: HNYDIR ONIR H HWDDNDR RHND NWRWRAWN OKHO OKW YWRNWO IO H KHDDR RHNNNHRW NY OKNWW ENOOEW HINDY RINNW NNRKO DWHN

## 3 thoughts on “Keyphrase autosolver”

1. The Rat says:

I forgot to mention: I sorted the word lists using frequencies from Google’s N-gram data files. That data can be downloaded from the Ngram site. By the way, the program also solves Aristocrats.

2. BION says:

I was able to solve both test ciphers, but it took a lot longer than 10 seconds. I used a hill-climber in combination with a pattern search program. The pattern search program allows only three or fewer word patterns; I don’t put in the entire ciphertext. I only use it for long word patterns — usually ten letters or more — and it searches its word list in alphabetical order. The search program does allow you to enter letter substitutions you’ve already discovered which cuts down on the number of words that it outputs. But your program sounds quite a bit better than mine.

1. The Rat says:

I’ve been testing my program. It only solved about two-thirds of the Aristocrats in the ND19 issue. The main reason was that many proper names, e.g. Carlin, are not in my word lists. Another reason is that the files I was using for the A’s didn’t parse hyphenated words/phrases correctly. I’ve corrected that. A third reason is that Google N-grams, I discovered, doesn’t preserve frequency data on contractions. “Isn’t” is parsed as “is not” for frequency purposes. Even “cannot” is treated as “can not”. This means that all those contractions either don’t appear at all or show up with a very low frequency. The word “can’t”, for example, reflects the frequency of the valid word “cant”. Possessives pose a similar problem: the ’s is split off from the word in N-grams. Some of this can be overcome by editing the word lists by hand. I may be fine-tuning it for a while. Or fine tuning it.
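To decide which entries need hand editing, a small helper along these lines could list the contractions and possessives in a word list. It is only a sketch: the name `flag_for_review` is hypothetical, and it assumes the word lists keep the apostrophe in such words.

```python
APOSTROPHES = "'\u2019"  # straight and curly apostrophes


def strip_apostrophes(word):
    """Return `word` with any apostrophes removed."""
    return "".join(ch for ch in word if ch not in APOSTROPHES)


def flag_for_review(words):
    """Return (word, bare_form, collides) for every word with an apostrophe.

    `collides` is True when the apostrophe-free form is itself an entry in
    the list, as with CAN'T and CANT, so its frequency may really belong
    to the other word.
    """
    plain = {w for w in words if strip_apostrophes(w) == w}
    flagged = []
    for w in words:
        bare = strip_apostrophes(w)
        if bare != w:
            flagged.append((w, bare, bare in plain))
    return flagged
```

Entries flagged with `collides` set to True are the ones most likely to carry a misleading frequency; the others probably have little or no frequency data at all.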