basic (crypto)analysis: IV friedman and kasiski
After examining about 50 different languages (http://scriptsource.org/) and the interesting timeline of how languages evolved, I decided that the text is enciphered using some sort of key.
The simplest cipher I can think of is the use of ligatures. A ligature is a character that combines more than one letter. Of course we can look at the EVA-defined ligatures:
Currier B | repeats | avg. distance in graph-lengths | note
cfh | 28 | 2560,37 |
cph | 84 | 882,18 |
cth | 329 | 223,48 |
ckh | 536 | 136,67 |
ch | 6229 | 15,46 | highest repeats
sh | 2862 | 33,64 | very high
..but it does not compare to anything in another language at this moment.
Yesterday I looked at the language DNA after removing y, m and n, because those letters occur almost exclusively at the end of words. It shows that ‘no y, m’ makes no difference compared to ‘no y’. I had already guessed that m is probably a reading sign, and this supports my theory a little more that ‘m’ is probably an end-of-line or text marker.
Removing y and n (‘no y, n’) has an extra effect on only one other letter, and that is ‘i’.
Today, 20 February 2014, I read about the Italian Alberti, who lived and worked around the date of the VMS, perhaps a little later. Looking at his cipher and the cipher disk makes me wonder whether he will give the needed clue.
Alberti is Italian and also uses for his plaintext abcdefghiklmnopqrstvxyz&, that is 24 chars.
The omitted letters are: j, u, w.
This resulted in a polyalphabetic (monoperiodic) cipher where frequency analysis does not help the attacker, because the frequency distribution is flatter. The letters used are also spread more widely than in normal text, with fewer peaks.
In the VMS the spread of used letters seems to be narrowed down, so how did the author do that? The frequency distribution is also not flat, and peaks such as those in normal text appear to be present.
Does this mean the author used a cipher that did not quite work out? Or a partially executed cipher mixed with letter substitution or ligatures?
The VMS uses 20 most common characters (18 active), of which the letters y, m and n occur only as the last letter of a word. Why is that, I asked myself, because there is no other language that has one or more letters always at the end of a word and never anywhere else. The letters ‘y+m+n’ occupy 10%+3% = 13% of the text. The Alberti cipher, with a constantly changing key letter, could explain that behaviour. Let us assume that key changes are marked at the beginning and at the end of each word. So AsecretB tells us that the key is A for the ciphertext ‘secret’ and that at the end of the word the key changes to B. For the word that follows we would then not need a new key, because we already know that the key is B. On the other hand, if we want a new key we could write Canothertext to make it clear that we changed the key again. In the ciphertext those capitals are not used; lowercase ciphertext is used instead. Suppose the VMS author wanted to use only a few key changes. Then the obvious place for them is the last letter of each word, or perhaps the first letter as well.
Currier B is used; g and x are not used here.
- Letters that occur mainly or exclusively as last letter: y, m, n
- Letters that never occur as last letter: a, c, e, f, h, i, k, p, q, t
- Letters that can occur as last letter: y, m, n, d, l, o, r, s
- Letters that occur mainly or exclusively as first letter: q, o
- Letters that never occur as first letter: e, f, h, i, m, n
- Letters that can occur as first letter: a, c, d, k, l, o, p, q, r, s, t, y
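The position lists above can be recomputed from any transliteration with a small counting routine. This is only a minimal sketch: the function name `position_profile` and the short sample line are my own illustration, not the full Currier B text, so the numbers it prints say nothing about the real corpus.

```python
from collections import Counter

def position_profile(words):
    """Return {letter: (share_as_first, share_as_last)} over all occurrences."""
    total, first, last = Counter(), Counter(), Counter()
    for w in words:
        if not w:
            continue
        total.update(w)      # count every letter of the word
        first[w[0]] += 1     # letter in first position
        last[w[-1]] += 1     # letter in last position
    return {c: (first[c] / total[c], last[c] / total[c]) for c in total}

# Tiny illustrative sample (EVA words quoted elsewhere in this post).
sample = "daiin sheol chedy qotyl rar kodaiin cthy qokeey s ol".split()
profile = position_profile(sample)
for letter in sorted(profile):
    f, l = profile[letter]
    print(f"{letter}: first {f:.2f}  last {l:.2f}")
```

A letter with a last-share close to 1 belongs in the "mainly or exclusively as last letter" list, a letter with a first-share of 0 in the "never as first letter" list.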
A few thoughts:
If a key is used at the end and that same letter occurs at the beginning of a word, or at the end of a couple of words in ciphertext (ct), then this could mean that all of that plaintext (pt) is coded with the same key.
For example, if I would type: ct: abcdef9 lkjsdfsd9 8768769 jjdhsjkahd9 gjhgads9
this could indicate that the entire piece of text uses key 9.
Also, when a word looks like: ct: yhahahay nextline or words of text
this could mean that the key is y for the first word and is probably not changed for the remainder of the text, because no new (special) key was used.
The Friedman index of coincidence
See also Wikipedia. Related: chi-square test.
Index of coincidence = IC
On a Caesar cipher (a substitution cipher) the ciphertext has the same IC as the plaintext.
For such a monoalphabetic cipher, or for plaintext, the IC = 0,066 for English.
The index can also indicate a polyalphabetic cipher (if the language is known).
A polyalphabetic cipher has a kappa IC of:
alphabets | kappa IC
10 | 0,038
5 | 0,044
2 | 0,052
1 | 0,066
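The values in this table can be approximated with the usual long-text estimate: with p alphabets, a random pair of letters falls in the same alphabet about 1/p of the time. The sketch below assumes that approximation, an English plaintext kappa of 0,066 and a 26-letter alphabet; the function name `expected_kappa` is my own.

```python
def expected_kappa(period, kappa_plain=0.066, kappa_random=1 / 26):
    """Approximate kappa IC of a long text enciphered with `period` alphabets."""
    same_alphabet = 1 / period            # chance two letters share an alphabet
    return kappa_plain * same_alphabet + kappa_random * (1 - same_alphabet)

for p in (1, 2, 5, 10):
    print(p, round(expected_kappa(p), 3))
```

Under these assumptions 10 alphabets come out at about 0,041, slightly above the 0,038 in the table; 0,038 (1/26) is the value the estimate approaches for a very large number of alphabets.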
I did such a test on:
Latin text
latin kappa IC = 0,071914286
total alphabet lettercount 29
voynich currier B
kappa IC = 0,076692732
alphabet lettercount 20
A uniform (random) distribution of letters would give an IC of 1,0, because each letter
would then appear 1/26th of the time, which gives a kappa IC of 1/26 = 0,038461538.
For English the IC is 1,73, and thus the kappa IC = 1,73/26 = 0,066538462.
I write kappa IC for the value that does not take into account the number of letters
in the alphabet (what I now call the alphabet lettercount).
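After reading many contradictory sources I decided to define:
kappa IC = sum( f * (f - 1) ) / ( N * (N - 1) )
where f is the count of each letter and N is the total number of letters counted, and
IC = kappa IC * alphabet lettercount (the used, expected or compared number of letters in the alphabet)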
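The two definitions translate directly into code. This is a minimal sketch, assuming that only alphabetic characters are counted (spaces and the & sign are dropped by `str.isalpha`); the function names `kappa_ic` and `ic` are my own.

```python
from collections import Counter

def kappa_ic(text):
    """kappa IC = sum(f * (f - 1)) / (N * (N - 1)) over the letters of `text`."""
    letters = [ch for ch in text if ch.isalpha()]
    n = len(letters)
    if n < 2:
        return 0.0                       # undefined for fewer than two letters
    counts = Counter(letters)
    return sum(f * (f - 1) for f in counts.values()) / (n * (n - 1))

def ic(text, alphabet_size):
    """IC = kappa IC multiplied by the alphabet lettercount."""
    return kappa_ic(text) * alphabet_size

line = "daiin sheol chedy qotyl rar"
print(kappa_ic(line), ic(line, 20))
```

The alphabet lettercount is passed in by hand on purpose, so the same text can be compared against 20, 24, 26 or 29 letters.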
text | IC | kappa IC
random text | 1,000 | 0,03850
latin (29 letters) | 2,086 | 0,07191
latin (26 letters) | 1,871 | 0,07197
latin (24 as Alberti) | 1,865 | 0,07771
latin (20 letters) | 1,558 | 0,07789
currier B (20 letters) | 1,534 | 0,07669
currier B (24 letters) | 1,841 | 0,07669
english | 1,730 | 0,06654
french | 2,020 | 0,07769
german | 2,050 | 0,07885
italian, spanish and portuguese | 1,940 |
russian | 1,760 |
What causes a lower index of coincidence, such as the 1,534?
It could indicate a polyalphabetic cipher if we know the language (see above).
But the Currier B text has a slightly higher kappa IC than Latin: that is because
the kappa IC does not reflect the difference in alphabet size.
If we took only 20 letters in Latin we would get a kappa IC of 0,07789,
which is almost the same as the 0,07669 of the VMS, so there would be no real difference.
That means the text is probably monoalphabetic if it is indeed Latin.
It seems now that also Trithemius (around 1500) used Latin alphabets with 25 letters (omit the J and W as last letter) or 22 letters (no Y, V,W,J).
Monoalphabetic means, according to the theory, a simple substitution cipher such as the Caesar cipher.
If so, why is it so difficult to recover the text? A monoalphabetic
cipher can be solved with frequency analysis, paper and pencil!
Anyway, let me try to find a key length, if any, in the VMS by this method.
We can take fragments of text that seem to use the same cipher
and calculate the IC for the probable key lengths on that text.
First, I did a quick IC on Currier A and B, passing every line of text separately.
In cA: highest IC (4.5) on “ykshy ytchy dol ytydy yky”
lowest on “kodaiin cthy qokeey s ol”
cB has the lowest (0.853) on “daiin sheol chedy qotyl rar”
Now have a look at the Alberti letters (I used the & sign and removed other occurrences such as j, u, w) and compare them with the Currier B IC, in the table coloured orange. Once again, very close.
I ran a Friedman index of coincidence test on ALL text and calculated an average kappa IC
to see if there is a key length that pops out. I ran it for key lengths 1 to 45.
There is no change in the IC across key lengths within a Currier type whatsoever.
keylength | avg kappa IC (24) cA | IC cA | avg kappa IC (24) cB | IC cB | avg kappa IC (24) cAB | IC cAB
1 | 0,0815 | 1,9557 | 0,0767 | 1,8397 | 0,0770 | 1,8469
2 | 0,0815 | 1,9556 | 0,0767 | 1,8396 | 0,0770 | 1,8469
etc | etc | etc | etc | etc | etc | etc
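For readers who want to repeat the key-length run: the standard way to do it is to cut the text into as many columns as the assumed key length and average the kappa IC of the columns. The sketch below assumes that procedure and ignores spaces; it is not necessarily identical to the program used for the table, and the function names are my own.

```python
from collections import Counter

def kappa_ic(text):
    """kappa IC = sum(f * (f - 1)) / (N * (N - 1))."""
    n = len(text)
    if n < 2:
        return 0.0
    return sum(f * (f - 1) for f in Counter(text).values()) / (n * (n - 1))

def avg_kappa_for_keylength(text, keylength):
    """Average kappa IC over the `keylength` columns of the letter stream."""
    letters = "".join(ch for ch in text if ch.isalpha())
    columns = [letters[i::keylength] for i in range(keylength)]
    values = [kappa_ic(col) for col in columns if len(col) > 1]
    return sum(values) / len(values) if values else 0.0

sample = "daiin sheol chedy qotyl rar kodaiin cthy qokeey s ol"
for k in range(1, 6):
    print(k, round(avg_kappa_for_keylength(sample, k), 4))
```

With a periodic polyalphabetic cipher the average jumps up at the true key length and its multiples; a flat row of values, as in the table above, is what a text without such a period gives.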
The strange thing is that Currier A has a different (higher) IC than Currier B.
They have the same number of letters. I have no explanation for that at this point.
Kasiski test
A Kasiski test could show the key length used in both a poly- and a monoalphabetic cipher.
I did a thorough Kasiski test on 2-, 3-, 4-, 5-, 6-, 10- and 12-grams.
The counts of di-, tri-, quad-, quint-, hexa-, 10- and 12-grams show no common factor,
because only the normal repetitions are found, on a normal sliding scale.
None of the factors examined (2 to 25) gives an indication of a possible keyword length.
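The factor counting behind such a Kasiski test can be sketched as follows. The assumptions are mine: distances are measured in letters with spaces removed, only consecutive occurrences of a repeated n-gram are paired, and the function name `kasiski_factors` is hypothetical.

```python
from collections import Counter, defaultdict

def kasiski_factors(text, n, max_factor=25):
    """Count how often each factor 2..max_factor divides a repeat distance."""
    letters = "".join(ch for ch in text if ch.isalpha())
    positions = defaultdict(list)
    for i in range(len(letters) - n + 1):
        positions[letters[i:i + n]].append(i)     # where each n-gram starts
    factor_counts = Counter()
    for pos in positions.values():
        for a, b in zip(pos, pos[1:]):            # consecutive repeats only
            distance = b - a
            for f in range(2, max_factor + 1):
                if distance % f == 0:
                    factor_counts[f] += 1
    return factor_counts

print(kasiski_factors("abcxxabcyyabc", 3))
```

A keyword of length k would make factor k (and its multiples) stand out clearly above the smoothly decreasing counts of the other factors.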
Looking at the tables separately, I found only this:
cB 8-graph
There is a slight increase on factor 11, also on 14 and 16.
However at 22 there is no peak.
cA 7-graph
There is a slight increase on factor 12, also on 17.
However at 24 there is no peak.
cA 8-graph
There is a slight increase on factor 6. However at 12 there is no peak.
Looking at all the tables combined (2- to 9-graph):
cA: nothing but a normal decreasing logarithmic line on the factors
cB: almost the same line, but not quite. There is a very small bump at factor 6 or 7 and perhaps at 8. I also took small pieces of text, and the same pattern can be seen there as well.
Something happens when the distance is 7 or when a word becomes 7 characters long.
Abbreviation or chunking? What does it mean for our decipherment? At this point I have no idea.
Displayed are five rows for each graph length, each sorted by increasing average distance. The smallest distance is 7!
2-graph | rep. | avg dist. | 3-graph | rep. | avg dist. | 4-graph | rep. | avg dist. | 5-graph | rep. | avg dist.
ch | 6229 | 22 | ehd | 2 | 14 | okae | 2 | 13 | hedas | 2 | 7
he | 5991 | 23 | htl | 2 | 18 | kehd | 2 | 14 | okehd | 2 | 14
dy | 5658 | 25 | dlo | 2 | 32 | ehdy | 2 | 14 | kehdy | 2 | 14
ed | 4844 | 29 | edy | 4048 | 35 | dyts | 2 | 15 | dytsh | 2 | 15
ai | 4448 | 31 | keh | 3 | 38 | shsa | 2 | 17 | shaik | 2 | 78
6-graph | rep. | avg dist. | 7-graph | rep. | avg dist. | 8-graph | rep. | avg dist. | 9-graph | rep. | avg dist.
okehdy | 2 | 14 | keeodar | 2 | 36 | qopcheos | 2 | 357 | ofchedaii | 2 | 706
shaikh | 2 | 78 | opchear | 2 | 184 | tchedair | 2 | 572 | chedaiiin | 2 | 954
oedair | 2 | 165 | dchedai | 2 | 350 | ofchedai | 2 | 706 | tchedaiin | 2 | 1016
ssheey | 2 | 174 | opcheos | 2 | 357 | fchedaii | 2 | 706 | sholkeedy | 2 | 1254
checfh | 2 | 305 | tchedai | 5 | 428 | shecthed | 2 | 914 | qolsheedy | 2 | 1900
Conclusion
There is probably no keyword. The text is not polyalphabetically coded, nor is it monoalphabetically enciphered with an obvious key. Otherwise we would have found the key length or would have seen a big variation in the factors.
These Kasiski tables are huge, too huge to display. The biggest x-grams that could be made on the text were 9- and 10-grams. The table for cB is still quite big (53 graphs), but the one for cA is the smallest table there is and is displayed here (the factors tested are shown in the columns 2…25, rep. = repeated):
9-graph cA | rep. | avg dist. | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
qotchoiin | 2 | 243 | 1 | 1 | ||||||||||||||||||||||
cheodaiin | 8 | 6946 | 2 | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | ||||||||
dchodaiin | 2 | 28087 | ||||||||||||||||||||||||
pchodaiin | 4 | 17788 | 1 | 3 | 1 | 1 | 2 | 1 | 1 | |||||||||||||||||
cthodaiin | 2 | 20192 | 1 | 1 | 1 | 1 | ||||||||||||||||||||
qotcheaii | 2 | 2408 | 1 | 1 | 1 | 1 | 1 | |||||||||||||||||||
otcheaiin | 2 | 2408 | 1 | 1 | 1 | 1 | 1 | |||||||||||||||||||
qotchaiin | 2 | 1733 | ||||||||||||||||||||||||
qokchaiin | 2 | 317 | ||||||||||||||||||||||||
chokeeody | 2 | 10037 | ||||||||||||||||||||||||
tchodaiin | 3 | 9406 | 2 | 1 | 1 | 1 | 1 | |||||||||||||||||||
qokcheody | 2 | 12113 | ||||||||||||||||||||||||
keeodaiin | 2 | 5887 | 1 | |||||||||||||||||||||||
sheockhey | 2 | 4636 | 1 | 1 | 1 | |||||||||||||||||||||
sum: | 9 | 7 | 5 | 2 | 2 | 5 | 4 | 4 | 1 | 1 | 1 | 1 | 3 | 1 | 2 | 2 | 2 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | ||
processed 1817 lines | ||||||||||||||||||||||||||
processed 11415 words and 373 graphs | ||||||||||||||||||||||||||
unique graphs repeated 15 |
Based on the letter freq. analysis, as well as the language dna, there is no monographic similarity. Also there seems to be no basis for a polygraphic substitution.
Now what? Some thoughts:
- A transposition of the text, where pieces of text are taken, as for example in Trithemius’ Steganographia book I and II: skip 1 letter, get 1 letter till the end of the word. Then skip a word. Then skip a letter etc… or a variation on that: skip no words only letters, skip multiple letters. Would it scramble the freq. analysis? Yes it would. Test that and look at a language DNA each time.
- Make a symmetrical frequency matrix (SFM) on the VMS words and compare them with Latin words and Italian.
- Use that SFM and change it on the VMS so that it matches Latin or It. words.
- Make a vowel identification routine. Run it on the VMS. Also try deleting endings of words and see if that gives a good freq. analysis.
- Make language DNAs on the VMS using each of the 20 letters as word-spacing and compare that with other language DNA.
- Get the ‘list of good words’ (words that stick out and for which there are word guesses) such as the planets and the zodiacs, ‘dairol’, ‘otaim dam alam’ etc. and try word pattern matches.
- Try with the Alberti cipher possibilities to make a cipher that resembles the VMS. Try coding each word. Try coding lines. Try coding parts of words. Are keys combined keys?
- Make a language DNA for other languages such as Greek.
- Make a number display of the text: are there obvious patterns to be seen?
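The first idea in this list (skip 1 letter, get 1 letter, then skip a word) is easy to try out. The sketch below is my own reading of that description and not a reconstruction of what Trithemius actually prescribes; the function name `skip_extract` and its parameters are hypothetical.

```python
def skip_extract(line, skip_words=True, skip_letters=1):
    """Keep every (skip_letters + 1)-th letter of each processed word.

    With skip_words=True every second word is skipped entirely.
    """
    words = line.split()
    if skip_words:
        words = words[::2]                 # process a word, skip a word, ...
    step = skip_letters + 1
    # Skip `skip_letters` letters, take one, repeat until the word ends.
    picked = [w[skip_letters::step] for w in words]
    return " ".join(p for p in picked if p)

print(skip_extract("abcdef ghij klmnop"))                    # bdf lnp
print(skip_extract("abcdef ghij klmnop", skip_words=False))  # bdf hj lnp
```

The output of each variation can then be fed into the frequency analysis or the language DNA to see how much the distribution changes.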
Word Letter Rhythm
Are there startkeys and endkeys inside the words or together in combination with patterns within lines, within words with the same characters or in lines in the immediate neighbourhood? I called this the rhythm of letters in the words, because in this analysis we can see quickly if there are patterns in the letters of words and/or if something is going on with the letter positions or sequences in the words.
The line (sentence) is taken. Words are analysed. Every letter in the word has a position.
If the word is chedy, then the c is on 1, the h on 2, the e on 3, the d on 4 and the y is on 5e. The y is at the end of the word; that is the reason for the e-addition.
It looks like this:
No obvious repeating patterns can be found, except for some digraphs and things like:
- if there is an m, then the letter before that is an a
- if there is an h, then the letter before that is probably a c
- if there is an n, then it is probable that we see daiin
- if there is a q, then it is probable that it is followed by an o
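Rules like these can be checked by counting, for every letter, which letters stand directly before and after it inside a word. A minimal sketch, with a made-up function name `neighbour_counts` and a tiny illustrative word list instead of the real text:

```python
from collections import Counter, defaultdict

def neighbour_counts(words):
    """Return (before, after): counts of the letter preceding / following each letter."""
    before = defaultdict(Counter)
    after = defaultdict(Counter)
    for w in words:
        for prev, cur in zip(w, w[1:]):    # adjacent pairs inside one word
            before[cur][prev] += 1
            after[prev][cur] += 1
    return before, after

sample = "daiin qokeey chedy dam sheol qotyl".split()
before, after = neighbour_counts(sample)
print("before m:", before["m"].most_common())
print("before h:", before["h"].most_common())
print("after q:", after["q"].most_common())
```

On the full text the share of the top entry tells how strict a rule is: "always" for a-before-m, "probably" for c-before-h.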
Language DNA shifts with different delimiters
Using my quickly programmed language DNA function I tried every letter in the VMS as delimiter and treated the space as a character in its own right.
Also tried and tested are the most frequently occurring di- and trigrams such as cfh, cph, cth, ckh, ch, sh and some other combinations: he, dy, ed, ai, space y, ehd, htl, dlo, edy, keh
No sudden changes in the DNA peaks or cohesion can be found.
Special attention was given to the sudden appearance of A-pos and/or Z-pos on the letters.
The only visible change on those was given by y as delimiter and by space o.
However, those are not satisfactory either.
Wordlengths
Because I could not find a good vowel match, nor a good digraph match, and because of some other known problems, I again compared the wordlengths of Italian, Latin, Cur. A & B and German. This time as a percentage of the total number of wordlengths found in each.
As can be seen, both cA and cB build up to an average length of 5, which is a very remarkable build-up when you compare it to the lines of the other languages.
Comparing it to Latin, still my favourite also in this graph, we see that the words in the VMS have to be lengthened (for example at length 2, 8, 9, 10, 11, 12), and on other occasions the words are too long and need to be shortened (at length 4, much at 5, 6). The length seems all right at 1, 3 and 7.
How would we practically do that? Make words of length 3 and 4 longer and those of length 6 shorter?
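The percentages per wordlength used for this comparison can be computed as below. It is a small sketch with a function name of my own (`wordlength_percentages`); the sample words only show the output format.

```python
from collections import Counter

def wordlength_percentages(words):
    """Return {length: percentage of all words having that length}."""
    lengths = Counter(len(w) for w in words if w)
    total = sum(lengths.values())
    return {n: 100 * c / total for n, c in sorted(lengths.items())}

sample = "daiin sheol chedy qotyl rar kodaiin cthy qokeey s ol".split()
for length, pct in wordlength_percentages(sample).items():
    print(f"{length:2d}: {pct:5.1f}%")
```

Using percentages instead of raw counts makes texts of very different size (the VMS, a Latin book, an Italian book) directly comparable in one graph.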
Vowel identification
Based on a sloppy algorithm the prominent languages have been compared.
This is the graph.
Once again we see an ordinary pattern for cB.
There are also other graphs I made, and what pops out is:
- German has one very high peak at e. Makes 1 big one.
- Latin has 2 major peaks on e and i, and one a little lesser on a. Makes 3 big ones.
- Italian has 2 major peaks on a and e, and two half their size on i and o. Makes 4 big ones.
- cB has 2 major peaks on a and o, and 3 lesser ones. Makes 5 big ones.
Something is going on, in cB, with the: c, d, e, h, i, k, o, (rstu), and y.
Use Alberti to change the language DNA
I implemented Alberti and set up the plaintext (base) and ciphertext (rotor) so that it uses 4 keys, as instructed by Alberti. I displayed the key and index letter at the first and last position of a word. Of course we now see only 4 letters exclusively prominent on the A-pos and the same on the Z-pos.
Then I made some more experiments. Since the method can be applied more or less strictly and with a varying number of keys, an Alberti-type cipher is still a possibility.
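To show the effect on the first letter of each word, here is a strongly simplified sketch of such an experiment. It is not my full Alberti implementation: the rotor is an unmixed copy of the 24-letter alphabet, so each key is just a shift, the key letter is written only in front of each word, and the four keys "bdfh" are an arbitrary choice. All function names are hypothetical.

```python
ALPHABET = "abcdefghiklmnopqrstvxyz&"   # the 24 characters quoted in this post

def encipher_word(word, key):
    """Shift every letter by the index of `key` and prefix the key letter."""
    shift = ALPHABET.index(key)
    body = "".join(ALPHABET[(ALPHABET.index(ch) + shift) % len(ALPHABET)]
                   for ch in word)
    return key + body

def decipher_word(token):
    """Read the key from the first letter and undo the shift on the rest."""
    shift = ALPHABET.index(token[0])
    return "".join(ALPHABET[(ALPHABET.index(ch) - shift) % len(ALPHABET)]
                   for ch in token[1:])

def encipher(text, keys="bdfh"):
    """Encipher word by word, cycling through the key letters."""
    return " ".join(encipher_word(w, keys[i % len(keys)])
                    for i, w in enumerate(text.split()))

def decipher(ciphertext):
    return " ".join(decipher_word(t) for t in ciphertext.split())

ct = encipher("secret text in latin")   # plaintext may not contain j, u or w
print(ct)
print(decipher(ct))
```

Every ciphertext word now starts with one of the 4 key letters, which is exactly the kind of positional peak that would show up on the A-pos in the language DNA.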
Some links
http://ixoloxi.com/voynich/ekt.txt
http://www.ciphermysteries.com/category/historical-ciphers
Caesar’s cipher: A letter is shifted in the alphabet see wikipedia
Online Vigenère Calculator: http://www.asecuritysite.com/security/Coding/vigcalc
Online decrypto:
http://www.blisstonia.com/software/WebDecrypto/index.php
http://www.cryptool.org/en/cryptool2-en