Basic (crypto)analysis: IV Friedman and Kasiski

After examining about 50 different languages (http://scriptsource.org/) and the interesting timeline of how languages evolved, I decided that the text is enciphered using some sort of key.

The simplest cipher I can think of is the use of ligatures. A ligature is a character that combines more than one letter. Of course we can look at the EVA-defined ligatures:

Currier B ligature   repeats   avg. distance (in graph-lengths)
cfh                  28        2560,37
cph                  84        882,18
cth                  329       223,48
ckh                  536       136,67
ch                   6229      15,46     (highest repeats)
sh                   2862      33,64     (very high)

...but at this moment it does not compare to anything in another language.

Yesterday I looked at the language DNA after removing the y, m and n, because those letters occur almost exclusively at the end of words. It shows that 'no y, m' makes no difference compared to 'no y'. I had already guessed that m is probably a reading sign, and this gives a little more support to my theory that 'm' is an end-of-line or text marker.
The 'no y, n' variant has an extra effect on only one other letter, and that is 'i'.

Today, 20 February 2014, I read about the Italian Alberti, who lived and worked around the date of the VMS, perhaps a little later. Looking at his particular cipher and the disk makes me wonder whether he will give the needed clue.
Alberti is Italian and also uses for his plaintext abcdefghiklmnopqrstvxyz&, that is 24 chars.
The omitted letters are: j, u, w.

His disk resulted in a polyalphabetic (monoperiodic) cipher, where frequency analysis does not help the attacker because the frequency distribution is flatter. The letters used are also spread more widely than in normal text, with lower peaks.

In the VMS the spread of used letters seems to be narrowed down, so how did the author do that? The frequency is also not flat, and the peaks we know from normal text appear to be present.
Does this mean the author used a cipher that did not quite work out? Or a partially executed cipher, mixed with a letter substitution or ligatures?

The VMS uses the 20 most common characters (18 active), of which the letters y, m and n occur only as the last letter of a word. Why is that, I asked myself, because I know of no other language in which one or more letters always stand at the end of a word and never anywhere else. The letters 'y+m+n' occupy 10%+3% = 13% of the text. The Alberti cipher, with a constantly changing key letter, could explain that behaviour. Let us assume that key changes are marked at the beginning and at the end of each word. So AsecretB tells us that the key is A for the ciphertext 'secret' and that at the end of the word the key changes to B. For the word that follows we would then not need a new key, because we already know that the key is B. On the other hand, if we want a new key we could write Canothertext to make it clear that we changed the key again. In the ciphertext those capitals are not used; the key letters are written in lowercase like the rest of the ciphertext. Suppose the VMS author wanted to use only a few key changes, not many. Then the obvious place to mark them is the last letter of each word, or perhaps the first letter as well.

Currier B is used; g and x are not used here.

  1. Letters that occur mainly or exclusively as last letter: y, m, n
  2. Letters that never occur as last letter: a, c, e, f, h, i, k, p, q, t
  3. Letters that can occur as last letter: y, m, n, d, l, o, r, s
  1. Letters that occur mainly or exclusively as first letter: q, o
  2. Letters that never occur as first letter: e, f, h, i, m, n
  3. Letters that can occur as first letter: a, c, d, k, l, o, p, q, r, s, t, y
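Lists like the ones above can be derived mechanically from a transliteration. The following is a minimal Python sketch, not the program used for this post: the function name `positional_classes`, the 90% threshold for "mainly" and the sample words are assumptions made for illustration.

```python
from collections import Counter

def positional_classes(words, threshold=0.9):
    """Classify each letter by how often it is the first or last letter of a word.

    `threshold` is an assumed cut-off for "mainly": a letter counts as mainly
    final when at least that share of its occurrences is word-final.
    """
    total, first, last = Counter(), Counter(), Counter()
    for word in words:
        if not word:
            continue
        total.update(word)      # every occurrence of every letter
        first[word[0]] += 1     # occurrences in first position
        last[word[-1]] += 1     # occurrences in last position
    classes = {}
    for letter, n in total.items():
        classes[letter] = {
            "first_share": first[letter] / n,
            "last_share": last[letter] / n,
            "never_first": first[letter] == 0,
            "never_last": last[letter] == 0,
            "mainly_first": first[letter] / n >= threshold,
            "mainly_last": last[letter] / n >= threshold,
        }
    return classes

# Tiny illustrative sample, far too small for real conclusions.
sample = "daiin sheol chedy qotyl rar".split()
classes = positional_classes(sample)
for letter in sorted(classes):
    print(letter, classes[letter])
```

On a full Currier B transliteration the same function would give the shares behind "mainly", "never" and "can occur"; the threshold decides where "mainly" starts.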

A few thoughts:

If a key is used at the end of a word, and that same letter occurs at the beginning of a word or at the end of several words in the ciphertext (ct), then this could mean that all of that plaintext (pt) is coded with the same key.

For example, if I were to type: ct: abcdef9 lkjsdfsd9 8768769 jjdhsjkahd9 gjhgads9

this could indicate that the entire piece of text uses key 9.

Also, when a word looks like: ct: yhahahay, followed by the next line or more words of text,

this could mean that the key is y for the first word and probably does not change for the remainder of the text, because no new (special) key was used.
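The key-marking idea can be tried out on ordinary text. The sketch below is only an illustration under loud assumptions: the cipher alphabet is a plain rotation of the 24 characters quoted earlier (Alberti's own movable ring was not a plain rotation), and the names `shift`, `encipher` and `decipher` are made up for this example. The first word carries its key letter up front, and every word ends with the key letter for the word that follows.

```python
ALPHABET = "abcdefghiklmnopqrstvxyz&"  # the 24 characters quoted in the post

def shift(word, key, direction=1):
    """Rotate every letter by the index of the key letter (assumed: plain rotation)."""
    k = direction * ALPHABET.index(key)
    return "".join(ALPHABET[(ALPHABET.index(ch) + k) % len(ALPHABET)] for ch in word)

def encipher(words, keys):
    """keys[i] is the key letter for words[i]; markers are written in lowercase."""
    out = []
    for i, (word, key) in enumerate(zip(words, keys)):
        lead = key if i == 0 else ""  # only the first word announces its own key
        next_key = keys[i + 1] if i + 1 < len(keys) else key
        out.append(lead + shift(word, key) + next_key)
    return out

def decipher(ct_words):
    """Read the first key from the first letter, then follow the end markers."""
    if not ct_words:
        return []
    key = ct_words[0][0]
    plain = []
    for i, word in enumerate(ct_words):
        body = word[1:-1] if i == 0 else word[:-1]
        plain.append(shift(body, key, -1))
        key = word[-1]  # the last letter sets the key for the next word
    return plain

ct = encipher(["secret", "text"], ["a", "b"])
print(ct)            # ['asecretb', 'vfyvb']
print(decipher(ct))  # ['secret', 'text']
```

With key a the rotation is zero, so the first word comes out as 'asecretb', the lowercase form of the AsecretB example. In such a scheme every word ends in a key letter, so a small set of keys would show up as a small set of word-final letters.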

 

 


The Friedman index of coincidence

See also Wikipedia. Related: chi-square test.

Index of coincidence = IC
For a Caesar cipher (a monoalphabetic substitution cipher) the ciphertext has the same IC as the plaintext.
For such monoalphabetic ciphertext, or for plaintext, the IC = 0,066 for English.
A lower index can therefore indicate a polyalphabetic cipher (if the language is known).

A polyalphabetic cipher on English text has a kappa IC of about:
10 alphabets   0,041 (approaching 0,038 for many alphabets)
5 alphabets    0,044
2 alphabets    0,052
1 alphabet     0,066

I did such a test on:

Latin text
latin kappa IC = 0,071914286
total alphabet lettercount 29

voynich currier B
kappa IC = 0,076692732
alphabet lettercount 20

A uniform (flat) distribution of letters would give an IC of 1,0, because each letter
would appear 1/26th of the time, which gives a kappa IC of 1/26 = 0,038461538.
For English the IC is 1,73 and thus the kappa IC = 1,73/26 = 0,066538462.

I write kappa IC when the index is not scaled by the number of letters
in the alphabet (which I call the alphabet lettercount from now on).

After many contradictory readings I decided to define:

kappa IC = sum( f * (f - 1) ) / ( N * (N - 1) )
where f is the count of each letter and N is the total number of letters counted,
and IC = kappa IC * lettercount of the alphabet (the used, the expected or the compared one)
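The two definitions translate directly into code. This is a minimal Python sketch, not the program used for this post: the function names `kappa_ic` and `scaled_ic` are assumptions, and whitespace is simply ignored.

```python
from collections import Counter

def kappa_ic(text):
    """kappa IC = sum(f * (f - 1)) / (N * (N - 1)), whitespace ignored."""
    letters = [ch for ch in text if not ch.isspace()]
    n = len(letters)
    if n < 2:
        return 0.0  # not enough letters to form a single pair
    counts = Counter(letters)
    return sum(f * (f - 1) for f in counts.values()) / (n * (n - 1))

def scaled_ic(text, alphabet_size):
    """IC = kappa IC * alphabet lettercount."""
    return kappa_ic(text) * alphabet_size

line = "daiin sheol chedy qotyl rar"
print(round(kappa_ic(line), 5))       # 0.03557
print(round(scaled_ic(line, 24), 3))  # 0.854
```

For a single short line the value is very noisy, so it only becomes meaningful on longer stretches of text.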

                                   IC       kappa IC
random text                        1,000    0,03850
latin (29 letters)                 2,086    0,07191
latin (26 letters)                 1,871    0,07197
latin (24 as Alberti)              1,865    0,07771
latin (20 letters)                 1,558    0,07789
currier B (20 letters)             1,534    0,07669
currier B (24 letters)             1,841    same
english                            1,730    0,06654
french                             2,020    0,07769
german                             2,050    0,07885
italian, spanish and portuguese    1,940
russian                            1,760

 

What causes a lower index of coincidence, such as the 1,534?

It could indicate a polyalphabetic cipher if we know the language (see above).

But the Currier B text has a slightly higher kappa IC than Latin: that is because
the number of letters did not reflect the difference in alphabets.

If we took only 20 letters in Latin we would get a kappa IC of 0,07789,
which is almost the same as the 0,07669 of the VMS, so there would be hardly any difference.

That means the text is probably monoalphabetic if it is indeed Latin.

It seems now that also Trithemius (around 1500) used Latin alphabets with 25 letters (omit the J and W as last letter) or 22 letters (no Y, V,W,J).

Monoalphabetic means that, according to the theory, it is a simple substitution cipher such as a Caesar cipher.
If so, why is it so difficult to read the text, given that a monoalphabetic
cipher can be solved with frequency analysis, paper and pencil!?

Anyway, let me try to find a key length, if any, in the VMS by this method.
We can take fragments of text that seem to have the same cipher
and calculate the IC for the probable key lengths on those fragments.

First, I did a quick IC on Currier A and B, passing every line of text separately.

cA has the highest IC (4.5) on “ykshy ytchy dol ytydy yky”
and the lowest on “kodaiin cthy qokeey s ol”.
cB has the
lowest (0.853) on “daiin sheol chedy qotyl rar”.

[Figure: ic_first]

Now have a look at the Alberti letters (I used the & sign and removed the other occurrences such as j, u, w) and compare them with the Currier B IC, in the table coloured orange. Once again, very close.

 

I ran a Friedman index of coincidence test on ALL text and calculated an average kappa IC
to see if there is a key length that pops out. I ran it for key lengths 1 to 45.
There is no change in the IC within a Currier type whatsoever.
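One common way to run such a key-length test is to split the text into k interleaved columns (every k-th letter), compute the kappa IC of each column and average them. The sketch below assumes that approach; it is not necessarily the exact routine used for the table in this post, and the name `avg_kappa_by_keylength` is made up for this example.

```python
from collections import Counter

def kappa_ic(letters):
    """kappa IC of a sequence of letters."""
    n = len(letters)
    if n < 2:
        return 0.0
    counts = Counter(letters)
    return sum(f * (f - 1) for f in counts.values()) / (n * (n - 1))

def avg_kappa_by_keylength(text, max_keylength):
    """Average kappa IC over the k interleaved columns, for k = 1..max_keylength."""
    letters = [ch for ch in text if not ch.isspace()]
    result = {}
    for k in range(1, max_keylength + 1):
        columns = [letters[i::k] for i in range(k)]  # column i holds letters i, i+k, i+2k, ...
        result[k] = sum(kappa_ic(col) for col in columns) / k
    return result

# Artificial demo: a text with an obvious period of 3.
demo = "abc" * 10
for k, value in avg_kappa_by_keylength(demo, 6).items():
    print(k, round(value, 4))
```

For a periodic polyalphabetic cipher the average jumps up at the true key length and its multiples, as it does at 3 and 6 in the artificial demo. A flat line over all key lengths is what a text without a periodic key gives.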

keylength   avg kappa IC cA   IC (24) cA   avg kappa IC cB   IC (24) cB   avg kappa IC cAB   IC (24) cAB
1           0,0815            1,9557       0,0767            1,8397       0,0770             1,8469
2           0,0815            1,9556       0,0767            1,8396       0,0770             1,8469
etc.

The strange thing is that Currier A has a different (higher) IC than Currier B.
They use the same number of letters. I have no explanation for that at this point.

Kasiski test

A Kasiski test could show the key length used in both a poly- and a monoalphabetic cipher.
I did a thorough Kasiski test on 2, 3, 4, 5, 6, 10 and 12-grams.
The counts of di-, tri-, quad-, quint-, hexa-, 10- and 12-grams show no common factor,
because only the normal repetitions are found, on a normal sliding scale.
None of the factors examined (2 to 25) gives an indication of a possible keyword length.
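The core of a Kasiski test fits in a few lines. The sketch below is a hedged illustration rather than the program behind the tables in this post: it removes spaces, records the distance between consecutive occurrences of every repeated n-gram and counts which factors from 2 to 25 divide those distances. The name `kasiski_factors` and the treatment of spaces are assumptions.

```python
from collections import Counter, defaultdict

def kasiski_factors(text, n, min_factor=2, max_factor=25):
    """Count how often each factor divides the distance between repeated n-grams."""
    letters = "".join(text.split())  # assumed: word spaces are ignored
    positions = defaultdict(list)
    for i in range(len(letters) - n + 1):
        positions[letters[i:i + n]].append(i)
    factors = Counter()
    for pos in positions.values():
        for a, b in zip(pos, pos[1:]):  # consecutive occurrences only
            distance = b - a
            for f in range(min_factor, max_factor + 1):
                if distance % f == 0:
                    factors[f] += 1
    return factors

# Artificial demo with a repeat distance of 6.
demo = "abcdefabcdefabc"
print(sorted(kasiski_factors(demo, 3).items()))  # [(2, 7), (3, 7), (6, 7)]
```

A real key length would show up as one factor (and its multiples) standing clearly above the smoothly decreasing counts of the others.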

 

Looking at the tables separately, I found only this:

cB 8-graph
There is a slight increase on factor 11, also on 14 and 16.
However, at 22 there is no peak.
cA 7-graph
There is a slight increase on factor 12, also on 17.
However, at 24 there is no peak.
cA 8-graph
There is a slight increase on factor 6. However, at 12 there is no peak.

[Figure: kasiski_graph]

Looking at all the tables combined (2- to 9-graph):
cA: nothing but a normal, logarithmically decreasing line on the factors
cB: almost the same line, but not quite. There is a very small bump at factor 6 or 7 and perhaps at 8. I also took small pieces of text, and the same pattern can be seen there.

Something happens when the distance is 7 or when a word becomes 7 characters long.
Abbreviation or chunking? What does it mean for our decipherment? At this point I have no idea.

Displayed are five rows of each of the graphs, sorted by increasing average distance. The smallest distance is 7!

2-graph  rep.  avg dist.   3-graph  rep.  avg dist.   4-graph  rep.  avg dist.   5-graph  rep.  avg dist.
ch       6229  22          ehd      2     14          okae     2     13          hedas    2     7
he       5991  23          htl      2     18          kehd     2     14          okehd    2     14
dy       5658  25          dlo      2     32          ehdy     2     14          kehdy    2     14
ed       4844  29          edy      4048  35          dyts     2     15          dytsh    2     15
ai       4448  31          keh      3     38          shsa     2     17          shaik    2     78

6-graph  rep.  avg dist.   7-graph  rep.  avg dist.   8-graph   rep.  avg dist.   9-graph    rep.  avg dist.
okehdy   2     14          keeodar  2     36          qopcheos  2     357         ofchedaii  2     706
shaikh   2     78          opchear  2     184         tchedair  2     572         chedaiiin  2     954
oedair   2     165         dchedai  2     350         ofchedai  2     706         tchedaiin  2     1016
ssheey   2     174         opcheos  2     357         fchedaii  2     706         sholkeedy  2     1254
checfh   2     305         tchedai  5     428         shecthed  2     914         qolsheedy  2     1900

Conclusion

There is probably no keyword. The text is not polyalphabetically coded, nor is it monoalphabetically enciphered with an obvious key. Otherwise we would have found the key length or would have seen a big variation in the factors.

These Kasiski tables are huge, too huge to display; the biggest x-grams that could be made on the text were 9- and 10-grams. The table on cB is still quite big (53 graphs), but the one on cA is the smallest table there is, so it is displayed here (the factors tested are in the columns 2…25, rep. = repeated):

9-graph cA rep. avg dist. 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
qotchoiin 2 243 1 1
cheodaiin 8 6946 2 3 1 1 1 1 1 1 1 1 1 2 1 1 1 1
dchodaiin 2 28087
pchodaiin 4 17788 1 3 1 1 2 1 1
cthodaiin 2 20192 1 1 1 1
qotcheaii 2 2408 1 1 1 1 1
otcheaiin 2 2408 1 1 1 1 1
qotchaiin 2 1733
qokchaiin 2 317
chokeeody 2 10037
tchodaiin 3 9406 2 1 1 1 1
qokcheody 2 12113
keeodaiin 2 5887 1
sheockhey 2 4636 1 1 1
 sum: 9 7 5 2 2 5 4 4 1 1 1 1 3 1 2 2 2 1 0 0 1 1 1 0
processed 1817 lines
processed 11415 words and 373 graphs
unique graphs repeated 15

 

Based on the letter freq. analysis, as well as the language dna, there is no monographic similarity. Also there seems to be no basis for a polygraphic substitution.

Now what? Some thoughts:

  1. A transformation of the text, where only pieces of the text are taken, as for example in Trithemius' Steganographia book I and II: skip 1 letter, get 1 letter, until the end of the word. Then skip a word. Then skip a letter, etc. Or a variation on that: skip no words, only letters; skip multiple letters. Would it scramble the frequency analysis? Yes, it would. Test that and look at the language DNA each time.
  2. Make a symmetrical frequency matrix (SFM) on the VMS words and compare it with Latin and Italian words.
  3. Use that SFM and change it on the VMS so that it matches Latin or Italian words.
  4. Make a vowel identification routine. Run it on the VMS. Also try deleting the endings of words and see if that gives a good frequency analysis.
  5. Make language DNAs on the VMS using each of the 20 letters as word spacing and compare them with the DNA of other languages.
  6. Get the 'list of good words' (words that stick out and for which there are word guesses), such as the planets and the zodiac signs, 'dairol', 'otaim dam alam' etc., and try word pattern matches.
  7. Try, with the possibilities of the Alberti cipher, to make a cipher that resembles the VMS. Try coding each word. Try coding lines. Try coding parts of words. Are the keys combined keys?
  8. Make a language DNA for other languages such as Greek.
  9. Make a number display of the text: are there obvious patterns to be seen?
  10. Are there start keys and end keys inside the words, or in combination with patterns within lines, within words with the same characters, or in lines in the immediate neighbourhood?
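The first idea in the list, skipping letters and words, is easy to prototype. The sketch below follows only the description given in item 1 (skip a letter, take a letter, until the end of the word; then skip a word). It makes no claim about Trithemius' actual procedure, and the name `skip_take` and its parameters are assumptions for illustration.

```python
def skip_take(text, skip_letters=1, skip_words=1):
    """Keep every (skip_words + 1)-th word and, inside each kept word,
    skip `skip_letters` letters before taking one, until the word ends."""
    words = text.split()
    kept_words = words[::skip_words + 1]
    return [w[skip_letters::skip_letters + 1] for w in kept_words]

print(skip_take("abcdef ghij klmnop"))                # ['bdf', 'lnp']
print(skip_take("abcdef ghij klmnop", skip_words=0))  # ['bdf', 'hj', 'lnp']
```

The output of each variation can then be fed into the frequency count or the language DNA to see whether the picture changes.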

 

Word Letter Rhythm

Are there start keys and end keys inside the words, or in combination with patterns within lines, within words with the same characters, or in lines in the immediate neighbourhood? I call this the rhythm of letters in the words, because this analysis quickly shows whether there are patterns in the letters of words and/or whether something is going on with the letter positions or sequences in the words.

The line (sentence) is taken and its words are analysed. Every letter in a word has a position.
If the word is chedy, then the c is on 1, the h on 2, the e on 3, the d on 4 and the y on 5e. The y is at the end of the word; that is the reason for the added e.
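The position labelling can be written down as a small helper. This is a minimal Python sketch of the labelling just described; the function names `letter_positions` and `line_rhythm` are made up for this example.

```python
def letter_positions(word):
    """Label every letter with its 1-based position; the last one gets an 'e' suffix."""
    last = len(word)
    return [(ch, f"{i}e" if i == last else str(i)) for i, ch in enumerate(word, start=1)]

def line_rhythm(line):
    """Apply the labelling to every word of a line."""
    return [letter_positions(word) for word in line.split()]

print(letter_positions("chedy"))
# [('c', '1'), ('h', '2'), ('e', '3'), ('d', '4'), ('y', '5e')]
```

Laying these labels out per letter and per line gives the kind of overview in which repeating position patterns would stand out.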

It looks like this:

[Figure: letter_rhythm_perword]

No obvious repeating patterns can be found, except for some digraphs and things like:

  1. if there is an m, then the letter before it is an a
  2. if there is an h, then the letter before it is probably a c
  3. if there is an n, then it is probable that we see daiin
  4. if there is a q, then it is probable that it is followed by an o
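Rules of this kind can be checked by counting, for every letter, which letter most often stands directly before it inside a word. The following is a small Python sketch under the assumption that words are plain EVA strings; the name `predecessor_shares` and the sample words are illustrative only.

```python
from collections import Counter, defaultdict

def predecessor_shares(words):
    """For each letter, return its most frequent predecessor and that predecessor's share."""
    before = defaultdict(Counter)
    for word in words:
        for prev, cur in zip(word, word[1:]):
            before[cur][prev] += 1
    shares = {}
    for letter, counter in before.items():
        prev, count = counter.most_common(1)[0]
        shares[letter] = (prev, count / sum(counter.values()))
    return shares

# Tiny illustrative sample, far too small for real conclusions.
sample = "dam alam daiin chedy qotyl".split()
shares = predecessor_shares(sample)
print(shares["m"])  # ('a', 1.0)
print(shares["h"])  # ('c', 1.0)
```

A share close to 1,0 on the full text would back up a rule such as "before an m there is an a"; the same count on successors instead of predecessors would test the q-o rule.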

 

Language DNA shifts with different delimiters

Using my quickly programmed language DNA function I tried every letter in the VMS as delimiter and treated the space as a character in its own right.

I also tested the most frequently occurring di- and trigrams, such as cfh, cph, cth, ckh, ch, sh and some other combinations: he, dy, ed, ai, space y, ehd, htl, dlo, edy, keh.

No sudden changes in the DNA peaks or cohesion can be found.
Special attention was given to the sudden appearance of A-pos and/or Z-pos on the letters.
The only visible change in those came with y as delimiter and with space o.
However, those are not satisfactory either.

 

Wordlengths

Because I could not find a good vowel match, nor a good digraph match, and because of some other known problems, I again compared the word lengths of Italian, Latin, Currier A & B and German, this time as a percentage of the total number of words found in each.

As can be seen, both cA and cB build up to an average length of 5, which is a very remarkable build-up if you compare it to the lines of the other languages.
[Figure: wordlength]

Compared to Latin, still my favourite in this graph as well, we see that some words in the VMS would have to be lengthened (for example at lengths 2, 8, 9, 10, 11 and 12), while on other occasions the words are too long and need to be shortened (at length 4, and much more so at 5 and 6). The length seems all right at 1, 3 and 7.
How would we do that in practice? Make words of length 3 and 4 longer and those of length 6 shorter?
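The word-length comparison comes down to counting lengths and expressing each count as a percentage of all words. This is a minimal Python sketch, assuming words are separated by whitespace; the name `wordlength_percentages` is made up for this example.

```python
from collections import Counter

def wordlength_percentages(text):
    """Share of words per word length, as a percentage of all words in the text."""
    lengths = Counter(len(word) for word in text.split())
    total = sum(lengths.values())
    return {n: 100.0 * count / total for n, count in sorted(lengths.items())}

print(wordlength_percentages("daiin sheol chedy qotyl rar"))
# {3: 20.0, 5: 80.0}
```

Running this on each language sample and on cA and cB separately gives the curves that are compared in the graph.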

Vowel identification 

Based on a sloppy algorithm the prominent languages have been compared.
This is the graph.

[Figure: vowelasumptions]

 

Once again we see an ordinary pattern for cB.
There are also other graphs I made, and what pops out is:

  • German has one very high peak at e. That makes 1 big one.
  • Latin has 2 major peaks at e and i, and a slightly lower one at a. That makes 3 big ones.
  • Italian has 2 major peaks at a and e, and two of half their size at i and o. That makes 4 big ones.
  • cB has 2 major peaks at a and o, and 3 lesser ones. That makes 5 big ones.

Something is going on in cB with the c, d, e, h, i, k, o, (rstu) and y.

 

Use Alberti to change the language DNA

I implemented Alberti and changed the plaintext (base) and ciphertext (rotor) so that it uses 4 keys, as instructed by Alberti. I displayed the key and index letter at the first and last position of a word. Of course we now see only 4 letters, exclusively prominent at the A-pos, and the same at the Z-pos.

[Figure: latin_dna_alberti_4keys]

Then I made some more experiments. Given the freedom in how the method can be applied and the number of possible keys, an Alberti-like cipher is still possible.

 


 

Some links

We put the “Brute” in the “Force”

Numerical Coding of Word Sections


http://ixoloxi.com/voynich/ekt.txt
http://www.ciphermysteries.com/category/historical-ciphers

Caesar’s cipher: a letter is shifted in the alphabet, see Wikipedia
Online Vigenère Calculator: http://www.asecuritysite.com/security/Coding/vigcalc

Online decrypto:
http://www.blisstonia.com/software/WebDecrypto/index.php
http://www.cryptool.org/en/cryptool2-en

 

 

 
