Basic analysis II: language DNA

I took Currier A, Currier B and the entire VMS (I call it Currier AB, although it is more than Currier A and B together) and counted the words:

Singlets are defined as words that occur only once.

Currier A  count       % | Currier B  count       % | Currier AB count       %
words      11389         | words      23205         | words      37839
singlets    2437   21,40 | singlets    3338   14,38 | singlets    5646   14,92
word     counted         | word     counted         | word     counted
daiin        511    4,49 | chedy        491    2,12 | daiin        863    2,28
chol         280    2,46 | ol           421    1,81 | ol           537    1,42
chor         182    1,60 | shedy        417    1,80 | chedy        501    1,32
s            162    1,42 | aiin         351    1,51 | aiin         469    1,24
dy           124    1,09 | daiin        315    1,36 | shedy        426    1,13
shol         118    1,04 | qokeedy      305    1,31 | chol         396    1,05
sho          106    0,93 | qokain       275    1,19 | or           363    0,96
chy          104    0,91 | qokedy       271    1,17 | ar           349    0,92
ol           101    0,89 | qokeey       264    1,14 | chey         344    0,91
cthy         101    0,89 | or           250    1,08 | dar          318    0,84
or            96    0,84 | chey         250    1,08 | qokeey       308    0,81
dar           95    0,83 | ar           249    1,07 | qokeedy      305    0,81
dain          93    0,82 | qokaiin      240    1,03 | shey         283    0,75
dal           85    0,75 | shey         204    0,88 | qokain       279    0,74
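The counts in such a table can be reproduced with a few lines of Python. This is a minimal sketch, not the tool that produced the figures above: the function name `word_stats` and the toy sample are assumptions made for illustration, and the text is assumed to be already split into words by whitespace. The singlet percentage is taken relative to the total number of words, which is how the percentages in the table work out (for example 2437 / 11389 = 21,40%).

```python
from collections import Counter

def word_stats(text, top=3):
    """Count words, singlets and the most frequent words of a text."""
    words = text.split()
    counts = Counter(words)
    total = len(words)
    # A singlet is a word form that occurs exactly once in the text.
    singlets = sum(1 for c in counts.values() if c == 1)
    # Each ranking entry is (word, count, percentage of all words).
    ranking = [(w, c, 100 * c / total) for w, c in counts.most_common(top)]
    return {
        "words": total,
        "singlets": singlets,
        "singlet_pct": 100 * singlets / total,
        "top": ranking,
    }

# Toy sample, invented for illustration (not a real VMS line).
sample = "daiin chol daiin chor s daiin chol dy"
stats = word_stats(sample)
print(stats["words"], stats["singlets"], stats["singlet_pct"])
```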

It is quite reassuring to see that the percentage of the word ‘daiin’ is not really off the chart compared with 25 chapters of Latin (Genesis).

In cA the word ‘daiin’ appears most often (4.49%), whereas ‘chedy’ accounts for only 0.02%.
In cB ‘daiin’ drops to 1.36%, while ‘chedy’ climbs to no. 1 with 2.12%.

A singlet share of 14-15% is a lot for the VMS. Why cA has as many as 21.4% unique words (singlets) can be explained if:

  • there are many null words
  • the words are enciphered in such a way that a given word is not always enciphered the same way

I also took some other languages to compare the word frequencies.

Source for all three samples: Bible, Genesis, 25 chapters.

Latin      count       % | IT         count       % | NL         count       %
words      11419         | words      15406         | words      17338
singlets    1775   15,54 | singlets    1252    8,13 | singlets     862    4,97
word     counted         | word     counted         | word     counted
ET           960    8,41 | E           1139    7,39 | EN          1556    8,97
IN           292    2,56 | DI           400    2,60 | DE           629    3,63
EST          198    1,73 | LA           303    1,97 | HET          401    2,31
AD           182    1,59 | CHE          277    1,80 | VAN          395    2,28
DE            99    0,87 | IL           270    1,75 | HIJ          313    1,81
DEUS          97    0,85 | L            231    1,50 | ZIJN         287    1,66
AUTEM         92    0,81 | A            208    1,35 | TOT          239    1,38
EIUS          81    0,71 | DEL          141    0,92 | DEN          230    1,33
QUAE          80    0,70 | I            140    0,91 | ZIJ          225    1,30
QUI           76    0,67 | ETERNO       137    0,89 | IN           216    1,25
SUNT          75    0,66 | GLI          129    0,84 | IK           207    1,19
TERRAM        74    0,65 | IN           122    0,79 | EEN          193    1,11
DOMINUS       70    0,61 | PER          122    0,79 | DAT          172    0,99
UT            69    0,60 | TERRA        121    0,79 | ZEIDE        172    0,99

Well, that was a nice exercise for comparison.

Counting and defining bigraphs and trigraphs (and higher) is very nice, but it does not really tell us a great deal about the letter positions within the word. Nor does it tell us anything about the possible vowel-consonant position combinations within a word. To get all that we need a new analysis; commonly one would use vowel and consonant counts and positions as well. So now I want to try something new, also because we still have a possible difference in how to handle the text of ‘Currier A’ and ‘Currier B’.

Let’s call this the ABCmidXYZ method, because it shows the positions of the letters within the words and counts those positions. A quick table with an example will explain this; words of length 1 are skipped:

A pos B pos C pos mid X pos Y pos Z pos
de d e
des d e s e s
dest d e s e s t
desti d e s s t i
destin d e s t i n
destina d e s t i n a
destinat d e s t, i n a t
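The slot assignment in the example table can be sketched in Python. This is a minimal sketch under stated assumptions, not the program behind the graphs: the function name `abc_mid_xyz` and the dictionary layout are invented for illustration. It takes the first three letters as A, B, C, the last three as X, Y, Z (they may overlap with A, B, C in short words, but the first letter is never counted again from the end), and everything in between as mid. That reproduces the rows from ‘dest’ to ‘destinat’; how the two shortest example words are spread over the columns is an assumption.

```python
def abc_mid_xyz(word):
    """Assign the letters of one word to the slots A, B, C, mid, X, Y, Z."""
    slots = {"A": "", "B": "", "C": "", "mid": "", "X": "", "Y": "", "Z": ""}
    if len(word) < 2:
        return slots  # words of length 1 are skipped
    # The first three letters fill A, B, C from the left.
    for name, letter in zip("ABC", word[:3]):
        slots[name] = letter
    # The last three letters fill Z, Y, X from the right.
    # Assumption: the first letter is never counted again from the end.
    tail = word[1:][-3:]
    for name, letter in zip("ZYX", reversed(tail)):
        slots[name] = letter
    # Whatever lies between the first three and the last three is "mid".
    slots["mid"] = word[3:-3]
    return slots

for example in ("dest", "desti", "destin", "destina", "destinat"):
    print(example, abc_mid_xyz(example))
```

Keeping the overlap for words of four and five letters means that a letter such as the ‘s’ in ‘desti’ is counted both as C and as X, exactly as in the table.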

If we process Currier A and B, we get a stacked graph that shows us this:



This shows that the position and usage of the letters did NOT change between Currier A and B. Therefore it is the same language and the same cipher.

There are of course differences in the frequencies (percentages), which can be caused by the choice of words, a changed grammar, a changed key, etc. Some letters also have different position preferences.

Looking from A to B, the preferred positions shift:

  • letter a shifts from Bpos to Ypos
  • letter d shifts from Apos to Ypos
  • letter k shifts from Bpos to Cpos
  • letter o shifts from Ypos to Apos and Bpos
  • letter x occurs 34 times in Currier B
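A shift in preferred position can be read off mechanically once each letter has a percentage per position. The following is a minimal sketch: the function names and the two toy profiles are invented for illustration (the numbers are not measured values), and "preferred" is assumed to mean the position with the highest share.

```python
def preferred_slot(profile):
    """Return the position with the highest share for one letter."""
    return max(profile, key=profile.get)

def preference_shifts(dna_a, dna_b):
    """List the letters whose preferred position differs between two texts."""
    shifts = {}
    for letter in sorted(set(dna_a) & set(dna_b)):
        before = preferred_slot(dna_a[letter])
        after = preferred_slot(dna_b[letter])
        if before != after:
            shifts[letter] = (before, after)
    return shifts

# Toy percentages, invented for illustration only.
toy_a = {"k": {"B": 60.0, "C": 40.0}, "n": {"Z": 95.0, "Y": 5.0}}
toy_b = {"k": {"B": 45.0, "C": 55.0}, "n": {"Z": 96.0, "Y": 4.0}}
shifts = preference_shifts(toy_a, toy_b)
print(shifts)
```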

We can now read the position preferences rather clearly in the graph or in the generated table.
I harvested about 1 MB of Latin text from several medieval documents and took all the unique words from it. I then analysed the letters of those words with the ABCmidXYZ method; the result is displayed here:


As you can see, the title is changed to ‘latin % dna’ because this represents the Latin language, where I took the percentage of each letter instead of the text counts.

Please note that the % graph will change when you have a smaller piece of Latin text, because some letter positions only show up when you have enough occurrences.

A somewhat more solid DNA is a plain Latin DNA based on a file with 25.000 (unique) letter counts:


Compare the Latin DNA graph with the first Currier B graph.

The language DNA  shows the positions, the letters and the frequency of those and is useful for identification of languages.

The language % DNA shows the positional occupation within the language itself and is a lesser candidate for comparing  languages.
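The difference between the two kinds of DNA can be sketched as follows. This is only an illustration under assumptions: the function names `slot_pairs`, `language_dna` and `percent_dna` are invented, the slot rule is the reading of the example table used earlier (first three letters as A, B, C; last three as X, Y, Z without re-counting the first letter; the rest as mid), and the % DNA is assumed to be each letter's distribution over the positions.

```python
from collections import Counter, defaultdict

def slot_pairs(word):
    """Yield (slot, letter) pairs for one word (hypothetical helper)."""
    if len(word) < 2:
        return  # words of length 1 are skipped
    yield from zip("ABC", word[:3])
    yield from zip("ZYX", reversed(word[1:][-3:]))
    for letter in word[3:-3]:
        yield "mid", letter

def language_dna(words, unique=True):
    """Count, per letter, how often it occupies each slot."""
    vocabulary = set(words) if unique else list(words)
    dna = defaultdict(Counter)
    for word in vocabulary:
        for slot, letter in slot_pairs(word):
            dna[letter][slot] += 1
    return dna

def percent_dna(dna):
    """Turn the counts of each letter into percentages over its slots."""
    result = {}
    for letter, counts in dna.items():
        total = sum(counts.values())
        result[letter] = {slot: 100 * n / total for slot, n in counts.items()}
    return result

words = ["daiin", "daiin", "chol", "ol"]
dna = language_dna(words)      # counts over unique words
pct = percent_dna(dna)         # the same data as percentages per letter
print(dict(dna["l"]), pct["l"])
```

With `unique=False` every occurrence in the running text is counted, so frequent words such as ‘daiin’ weigh more heavily.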

We get the best language DNA, meaning the visually clearest DNA, when we use unique words.
For identification of the languages published here I will use that method. Currier B will then look like this:



Of course, if life were simple, I could compare Currier B and Latin and shift and substitute the letters of Currier B in order to get a good Latin text.

Unfortunately we have some big problems:

  1. Currier B has only 20 letters, of which g and x are dormant, so 18 are active,
    whereas Latin has 26 letters, of which k, w and z are dormant, so 23 are active,
    still leaving a gap of 5 letters
  2. we do not know the plaintext language

Therefore I will try two things now:

a) Is there another language with a DNA like that of Currier B?

b) Is there a way to change the Currier B script so that I can get a match with, for example, Latin?

DNA of other languages

Possible languages are: Italian, a Scandinavian language, German (although we already saw at the letter f.a. this is hopeless), French, English (just for fun), Spanish (thank you Juan-José Marcos!), Macedonian, Dutch and perhaps Greek.







Update May 2016: about 50 languages are now done:

Afrikaans; Albanian; Amharic; Arabic; Aramaic (Estrangela, Madnhaya, Serta); Armenian; Avesta Yasna; Azerbaijani (Azeri); Catalan; Cebuano; Coptic Bohairic; Coptic Sahidic; Croatian; Danish; Dutch; Elbasan; English; Middle English (John Wycliffe 1380); mix of Middle English and Dutch; DNA mix of Middle English and French; Finnish; French; Georgian; German / Swiss; Glagolitic; Gothic; Greek; Hawaiian; Hebrew; Hindi (Devanagari); Hindi (transliterated); Hungarian; Icelandic; Indonesian; Italian (1649); Kurdish (Kurmanji); Latin (Vulgata); Latvian; Lithuanian; Maori; Mandaic (Mandaean); Portuguese; Romanian (Cornilescu); Romanian: Romani NT: E Lashi Viasta (Gypsy); Slovak; Slovene; Spanish (1569) -> see also Catalan; Syriac -> see Aramaic; Tagalog; Turkish (Latin alphabet for Turkish, türk alfabesi); Uzbek; Welsh (Cymraeg); Swedish; Norwegian; … still working



Adjusting CurrierB towards … Latin dna

If you look closely there are many differences:

  • cB has 12 letters for posA; in Latin a word can begin with almost every letter
  • cB has no real letter for mid; Latin has the vowels a, e, i and u, but also l, m, n, o, p, r, s, t, v
  • cB has m, n and y almost exclusively on posZ; y also occurs on posA (and very rarely in other positions)
  • cB has 8 possible letters (d, l, m, n, o, r, s, y) for posZ; Latin also has 8 possible letters: a, e, i, m, o, r, s, t (n, u and y very few)
  • the cB letters a and h resemble Latin u and n

and so on and so on….

In order to see if we can redesign the VMS alphabet, we first have to look at the transcription of e, c and h.

In the linear transcription, some decisions have been made about letters.

To calculate the percentage of occurrence I use the total number of letters in cAB, which is approximately 127859; half of that is 63929. The percentages below are the counts divided by that half (for example 8817 / 63929 = 13,792%).

EVA | Voynich | in cB    | word             | occurrences in cAB |      %
cc  | cc      | not      | kccky (f13r.P.8) |                  3 |  0,005
ce  | ce      | not      |                  |                  0 |  0,000
ch  | ch      | f31r.p3  | chehey           |               8817 | 13,792
ec  | ec      | f31r.p7  | checkhey         |                 32 |  0,050
ee  | ee      | f31r.p9  | sheeo            |               3647 |  5,705
eh  | eh      | f43r.p12 | qokehdy          |                  4 |  0,006
hc  | hc      | f43r.p14 | chckhhhy         |                534 |  0,835
he  | he      | see ch   |                  |               6559 | 10,260
hh  | hh      | see hc   |                  |                 26 |  0,041
cth | cth     |          |                  |                805 |  1,259
cph | cph     |          |                  |                185 |  0,289
ckh | ckh     |          |                  |                779 |  1,219
cfh | cfh     |          |                  |                 63 |  0,099
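The percentage column can be checked with a short calculation. This is a minimal sketch: the names `TOTAL_LETTERS`, `BASE` and `pct` are invented for illustration, and the use of integer division to get 63929 from 127859 is an assumption that matches the figure quoted in the text. Only a few of the counts from the table are repeated here.

```python
TOTAL_LETTERS = 127859          # approximate number of letters in cAB
BASE = TOTAL_LETTERS // 2       # 63929, the base used for the percentages

# A few counts copied from the table above.
counts = {"cc": 3, "ch": 8817, "ee": 3647, "he": 6559}

def pct(count, base=BASE):
    """Percentage of a digraph count relative to half the letter total."""
    return round(100 * count / base, 3)

for digraph, count in counts.items():
    print(digraph, count, pct(count))
```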

Look at the 3 occurrences of ‘cc’:

f13r.P.8, the text you see is: ..rcey kccky

f68r1.P.3, the text you see is:
1 shokchy chteey choteey cphol cheor opcheeol otor choctheeey okchoal
2 tochso otchl qokeeedy cheey cheeteey yteody chpor cheo!korchey chod
3 ykor shey qocheey chokal okeey ror cckheor daram
4 dchor okaii!n

f100r.P2.6, the text of paragraph P2:
5 folshody chol daiin fchod!y!cheol cphol qotees shey oreso alcfhy
6 soiin chol cphol shol shol qockhol chor chol sho keey cckhhy ykeeam
7 saii!chor sheor qockhody odeor yksheey chol sheody sai cheol raiin
8 sheor qkeeody chol daiin ctheol olcheol chek!y cheol cheockhy okeol
9 yaiin chekeey chol cholody chos olchor qokeol okeeol cheol!s al
10 chol cheol cho chckheody otolchey

As you can see above, these 3 occurrences of ‘cc’ really must have been ‘ch’ or ‘ce’ all along.

I also examined the 4 occurrences of ‘eh’ on folio f43r, at p12.w6, p12.w8, p13.w4 and p14.w3. See for yourself; in my opinion every c, eh and e is almost open for discussion here.

12 dor shol qokol shedy qotedy qokehdy qokody okehdy otedy shedy oty yty dy saiin
13 tshed qosheckhhy odeedy qeokehy qotedy daiin shodody shochol chckhy ykedy dy
14 ykeody checkhy chotehy odain chckhhhy choko!r aiin
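The word positions quoted for ‘eh’ can be found again with a small search over these three lines. This is a minimal sketch: the function name `locate` and the `p<line>.w<word>` label format are assumptions modelled on the notation used in the text, and the search is a plain substring test on the EVA transcription.

```python
# Lines 12 to 14 of f43r, copied from the transcription above.
lines = {
    12: "dor shol qokol shedy qotedy qokehdy qokody okehdy otedy shedy oty yty dy saiin",
    13: "tshed qosheckhhy odeedy qeokehy qotedy daiin shodody shochol chckhy ykedy dy",
    14: "ykeody checkhy chotehy odain chckhhhy choko!r aiin",
}

def locate(lines, digraph):
    """Return (position label, word) for every word containing the digraph."""
    hits = []
    for number, text in sorted(lines.items()):
        # Words are numbered from 1 within each line.
        for index, word in enumerate(text.split(), start=1):
            if digraph in word:
                hits.append((f"p{number}.w{index}", word))
    return hits

eh_hits = locate(lines, "eh")
hh_hits = locate(lines, "hh")
print(eh_hits)
print(hh_hits)
```

For ‘eh’ this returns exactly the four positions named above; for ‘hh’ it returns qosheckhhy at p13.w2 and chckhhhy at p14.w5.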



Also, look at the word qosheckhhy (p13.w2) and at the ‘hh’ in that word.
As you can see, that is not an ‘hh’ but rather a ‘cc’!

Conclusion: the linear transcription of these digraphs is doubtful: cc, ce, eh.



Word end:  posZ

The letters ‘m’ (85%) and ‘n’ (95%) behave very peculiarly: they almost always end a word, and in 28 words so does ‘g’. The ‘y’ also ends a word in 79% of cases and starts a word in 13% (elsewhere 8%).

In Latin the highest percentages on posZ are reached by m (43%) and s (32%). In fact, the overall high scores in Latin are for p on posA (51%), x on posB (53%) and q on posX (52%).

Word start:  posA

In cB we see almost 0% on posA for the letters e, h, i, m and n.
In Latin the lowest percentage is for u, with 3%.

When I was struggling with the fact that I needed to “create” 5 more letters in the VMS in order to approach Latin, I realized that perhaps I should remove the nonsense letters first, and then I could perhaps see a piece of the real VMS DNA.

But which letters are the nonsense letters?

Update 2016: that is now under intensive investigation!

