For the course Experimental Mathematics, with Doron Zeilberger, we read and discussed this interesting article by Persi Diaconis. We were assigned the task of making an algorithm in a computer algebra system similar to the one described in the paper.

By using a technique called simulated annealing, we can start with encrypted text and assign that text a "weight" or "score" based on how likely it is to be English text. Our assignment was to use two-letter pairs (also called "transitions") to score the word. Each transition was a conditional probability, like P(h|t) would be the probability of "h" following "t." Something like P(x|q) would be virtually zero, while P(h|t) would be relatively high. The cipher is then perturbed, meaning a small change is made in the encryption. If the new text "looks more like English" (meaning has a higher score), it becomes the working version of the text. If it is not higher, we have a small chance (based roughly on how "good" it is, even if it is not "better") of jumping there anyway. This heuristic algorithm attempts to perturb the ciphertext until it "settles" into the "best" configuration, which should be (hopefully!) the plaintext message.

0. The development of the algorithm was simple, but several steps were taken to improve its efficacy. The zeroth step was to compile the transition matrix, where each row is a set of probabilities for the second letter (for a fixed first letter). This was done with a large sample of text from varied English-langugage sources. Below is a graphical representation (first letter for the row, second letter column). Red indicates high probability, purple is low.

image

For example, d is most commonly followed by a space, t is the most common beginning of a word, and q is (almost) exclusively followed by u. (The discrepancy there is because some of the text is taken from news sources, or other sources, which contain some non-English or obscure words -- the obvious word that does not have a q followed by a u is Iraq, which accounts somewhat for a nonzero probability P(space|q), but this is negligbile).

1. After this initial development, the algorithm proved to be mostly ineffective. Careful examination of the output of the program revealed that it was "stuck" on ciphertext with score 0. The problem is that a random configuration of the text tends to generate at least one "impossible" transition, and since the score is computed by multiplying entries from the transition matrix -- if any one of these is zero, we lose the ability to distinguish this from a "better" text that happens to retain the 0 term. The solution? Add ε to every multiplicand.

2. In order to help speed things along, an "initial guess" was added. The basic way to guess a decryption (without any iteration) is to simply analyze single-letter frequencies. We use this to start the annealing process in a better place, presumably somewhat closer to the plaintext than the initial ciphertext would be.

3. The next step was some minor tweaking: instead of using single transpositions, we initially start with many letter-swaps. This helps push us closer to a moderately good region -- once there, we can switch to single transpositions to hone in on the "best" looking text possible. A parameter was added to adjust the preference towards "jumping" from a good text to a slightly less good text.

4. The final, and most significant improvement, was generalizing this to three-letter patterns. Using this analysis helps significantly in making the annealing effective (in the sense that it is accurate). It avoids the problem, which happened frequently, that with only two-letter analysis, there was a fairly good chance that the annealing will get "stuck" with something that somehow looks like English, but is really gibberish.

Below is the code and a sample of the output it generates.

image
image
image
image
image

(This code will hopefully be replaced by text that you can highlight and, more importantly, copy/paste into Mathematica. But that may take some time.) This code is released under my license.