basic analysis II: language dna
I took Currier A, Currier B and the entire VMS (I call it Currier AB, although it is more than Currier A and B together) and counted the words.
Singlets are defined as words that occur only once.
currier A | count | % | currier B | count | % | c AB | count | %
words | 11389 | | words | 23205 | | words | 37839 |
singlets | 2437 | 21,40 | singlets | 3338 | 14,38 | singlets | 5646 | 14,92
word | counted | % | word | counted | % | word | counted | %
daiin | 511 | 4,49 | chedy | 491 | 2,12 | daiin | 863 | 2,28
chol | 280 | 2,46 | ol | 421 | 1,81 | ol | 537 | 1,42
chor | 182 | 1,60 | shedy | 417 | 1,80 | chedy | 501 | 1,32
s | 162 | 1,42 | aiin | 351 | 1,51 | aiin | 469 | 1,24
dy | 124 | 1,09 | daiin | 315 | 1,36 | shedy | 426 | 1,13
shol | 118 | 1,04 | qokeedy | 305 | 1,31 | chol | 396 | 1,05
sho | 106 | 0,93 | qokain | 275 | 1,19 | or | 363 | 0,96
chy | 104 | 0,91 | qokedy | 271 | 1,17 | ar | 349 | 0,92
ol | 101 | 0,89 | qokeey | 264 | 1,14 | chey | 344 | 0,91
cthy | 101 | 0,89 | or | 250 | 1,08 | dar | 318 | 0,84
or | 96 | 0,84 | chey | 250 | 1,08 | qokeey | 308 | 0,81
dar | 95 | 0,83 | ar | 249 | 1,07 | qokeedy | 305 | 0,81
dain | 93 | 0,82 | qokaiin | 240 | 1,03 | shey | 283 | 0,75
dal | 85 | 0,75 | shey | 204 | 0,88 | qokain | 279 | 0,74
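The counts in the table above can be reproduced with a few lines of code. This is only a minimal sketch: the function name `word_stats` and the toy sample are my own illustration, and it assumes the transcription is a plain string with words separated by whitespace. Percentages are taken against the total number of words, which is how the singlet percentages in the table work out (for example 2437 / 11389 = 21.40%).

```python
from collections import Counter

def word_stats(text, top=14):
    """Count words, singlets and the most frequent words of a text."""
    words = text.split()                 # assumption: whitespace separates words
    counts = Counter(words)
    total = len(words)
    # A singlet is a word that occurs exactly once in the whole text.
    singlets = sum(1 for n in counts.values() if n == 1)
    return {
        "words": total,
        "singlets": singlets,
        "singlet_pct": round(100 * singlets / total, 2),
        "top": [(w, n, round(100 * n / total, 2))
                for w, n in counts.most_common(top)],
    }

# Toy sample for illustration only, not a real VMS line.
sample = "daiin chol daiin chor daiin chol s"
stats = word_stats(sample)
print(stats["words"], stats["singlets"], stats["singlet_pct"])
print(stats["top"][0])
```

Running the same function on Currier A, Currier B and the whole text separately gives the three column groups of the table.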
It is quite reassuring to see that the % of the word ‘daiin’ is not really off the chart compared with 25 chapters of Latin (25 chapters of Genesis).
In cA the word ‘daiin’ appears most often (4.49%), whereas ‘chedy‘ accounts for only 0.02%.
In cB ‘daiin’ drops to 1.36%, whereas ‘chedy‘ climbs to no. 1 with 2.12%.
The 14-15% of singlets is a lot for the VMS. That cA has as many as 21.4% unique words (singlets) can be explained if:
- there are many null words
- the words are enciphered in such a way that the same word is not always enciphered the same way
I also took some other languages to compare the word frequencies.
latin | count | % | IT | count | % | NL | count | %
bible 25 chapters genesis | | | bible 25 chapters genesis | | | bible 25 chapters genesis | |
words | 11419 | | words | 15406 | | words | 17338 |
singlets | 1775 | 15,54 | singlets | 1252 | 8,13 | singlets | 862 | 4,97
word | counted | % | word | counted | % | word | counted | %
ET | 960 | 8,41 | E | 1139 | 7,39 | EN | 1556 | 8,97
IN | 292 | 2,56 | DI | 400 | 2,6 | DE | 629 | 3,63
EST | 198 | 1,73 | LA | 303 | 1,97 | HET | 401 | 2,31
AD | 182 | 1,59 | CHE | 277 | 1,8 | VAN | 395 | 2,28
DE | 99 | 0,87 | IL | 270 | 1,75 | HIJ | 313 | 1,81
DEUS | 97 | 0,85 | L | 231 | 1,5 | ZIJN | 287 | 1,66
AUTEM | 92 | 0,81 | A | 208 | 1,35 | TOT | 239 | 1,38
EIUS | 81 | 0,71 | DEL | 141 | 0,92 | DEN | 230 | 1,33
QUAE | 80 | 0,7 | I | 140 | 0,91 | ZIJ | 225 | 1,3
QUI | 76 | 0,67 | ETERNO | 137 | 0,89 | IN | 216 | 1,25
SUNT | 75 | 0,66 | GLI | 129 | 0,84 | IK | 207 | 1,19
TERRAM | 74 | 0,65 | IN | 122 | 0,79 | EEN | 193 | 1,11
DOMINUS | 70 | 0,61 | PER | 122 | 0,79 | DAT | 172 | 0,99
UT | 69 | 0,6 | TERRA | 121 | 0,79 | ZEIDE | 172 | 0,99
Well, that was a nice exercise for comparison.
Counting and defining bigraphs and trigraphs (and higher) is all very nice, but it does not really tell us a great deal about the letter positions within the word. Nor does it tell us anything about the possible vowel-consonant position combinations within a word. To get all that we need a new analysis, and commonly one would use vowel and consonant counts and positions as well. So now I want to try something new, also because we still have a possible difference in how to handle the text of ‘Currier A’ and ‘Currier B’.
Let’s call this the ABCmidXYZ method, because it shows the positions of the letters in the words and counts those positions. A quick table with an example will explain this; letters/words that have length=1 are skipped:
A pos | B pos | C pos | mid | X pos | Y pos | Z pos | |
d | |||||||
de | d | e | |||||
des | d | e | s | e | s | ||
dest | d | e | s | e | s | t | |
desti | d | e | s | s | t | i | |
destin | d | e | s | t | i | n | |
destina | d | e | s | t | i | n | a |
destinat | d | e | s | t, i | n | a | t |
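The table can be turned into a small routine. The sketch below is my own reading of it, not a definitive implementation: the first three letters go to A, B and C, the last three to X, Y and Z, and anything in between goes to mid. For words shorter than six letters the front and back slots overlap. I assume here that the first letter is never counted again as X, Y or Z, which matches the rows from ‘des’ upwards; the two-letter case is a guess. The names `abcmidxyz` and `position_counts` are made up for this example.

```python
from collections import Counter, defaultdict

def abcmidxyz(word):
    """Return (slot, letter) pairs for one word; slots are A, B, C, mid, X, Y, Z."""
    if len(word) < 2:
        return []  # words of length 1 are skipped, as stated in the post
    pairs = list(zip("ABC", word[:3]))            # first three letters
    pairs += [("mid", ch) for ch in word[3:-3]]   # whatever lies in between
    # Assumption: the first letter is never counted again as X, Y or Z.
    tail = word[1:][-3:]
    pairs += list(zip("XYZ"[3 - len(tail):], tail))  # right-aligned on Z
    return pairs

def position_counts(words):
    """Count, for each letter, how often it falls into each slot."""
    counts = defaultdict(Counter)
    for word in words:
        for slot, letter in abcmidxyz(word):
            counts[letter][slot] += 1
    return counts

for w in ("desti", "destina", "destinat"):
    print(w, abcmidxyz(w))
```

Feeding all words of a text to `position_counts` gives, per letter, the seven numbers that make up one stacked bar of the graph.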
If we process Currier A and B we get a stacked graph that shows us this:
This shows that the position and usage of the letters did NOT change between Currier A and B. Therefore it is the same language and the same cipher.
Of course there are differences in the frequencies (percentages) used, which can be caused by the choice of words, a changed grammar, a changed key etc. Some letters have different position preferences.
Looking from A to B, the preferred positions shift:
- letter a shifts from Bpos to Ypos
- letter d shifts from Apos to Ypos
- letter k shifts from Bpos to Cpos
- letter o shifts from Ypos to Apos and Bpos
- letter x occurs 34 times in Currier B
We can now read the position preferences rather clearly in the graph or in the generated table.
I harvested about 1 MB of Latin text from several medieval documents and took all the unique words from that text. Then I analyzed the letters from that text with the ABCmidXYZ method, and the result is displayed now:
As you can see, the title is renamed to ‘latin % dna’ because this represents the language Latin, where I took the percentage of each letter instead of the text counts.
Please note that the % graph will change when you have a smaller piece of Latin text, because some letter positions only show up when you have enough occurrences.
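Going from counts to the % DNA is a simple normalisation per letter. The sketch below is my reading of "the percentage of each letter": for every letter, the counts of its positions are scaled so that they add up to 100. The function name `to_percent` and the example numbers are invented for illustration, they are not measured values.

```python
def to_percent(counts):
    """Turn per-letter slot counts into per-letter percentages (each letter sums to 100)."""
    result = {}
    for letter, slots in counts.items():
        total = sum(slots.values())
        if total:  # letters that never occur are left out
            result[letter] = {slot: 100.0 * n / total for slot, n in slots.items()}
    return result

# Made-up counts, for illustration only.
example = {"m": {"A": 1, "Z": 3}, "q": {"A": 2}}
print(to_percent(example))
```

To work on unique words only, deduplicate the word list first (for instance with `set(words)`) before counting the positions.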
A somewhat more solid DNA is a plain Latin DNA based on a file with 25.000 (unique) letter counts:
Compare the Latin DNA graph with the first Currier B graph.
The language DNA shows the positions, the letters and the frequency of those and is useful for identification of languages.
The language % DNA shows the positional occupation within the language itself and is a lesser candidate for comparing languages.
The best language DNA, meaning “the most clearly visible DNA”, is what we get when we use unique words.
For identification of the languages published here I will use that method. Currier B will then look like this:
Of course, if life were simple, I could compare Currier B and Latin and shift & substitute the letters of Currier B in order to get a good Latin text.
Unfortunately we have some big problems:
- Currier B has only 20 letters, of which g and x are dormant, so 18 are active,
whereas Latin has 26 letters, of which k, w and z are dormant, so 23 are active,
still leaving a gap of 5 letters
- we do not know the plaintext language
Therefore I will try two things now:
a) is there another language with a DNA like that of Currier B?
b) is there a way to change the Currier B script so I can get a match with, for example, Latin?
Dna of other Languages
Possible languages are: Italian, a Scandinavian language, German (although we already saw at the letter f.a. this is hopeless), French, English (just for fun), Spanish (thank you Juan-José Marcos!), Macedonian, Dutch and perhaps Greek.
…………
Update May 2016: about 50 languages are now done:
- Afrikaans
- Albanian
- Amharic
- Arabic
- Aramaic (Estrangela, Madnhaya, Serta)
- Armenian
- Avesta Yasna
- Azerbaijani (Azeri)
- Catalan
- Cebuano
- Coptic Bohairic – Coptic Sahidic
- Croatian
- Danish
- Dutch
- Elbasan
- English – Middle English (John Wycliffe 1380) – mix middle English and Dutch – dna mix middle English and French
- Finnish
- French
- Georgian
- German / Swiss
- Glagolitic
- Gothic
- Greek
- Hawaiian
- Hebrew
- Hindi (Devanagari)
- Hindi (transliterated)
- Hungarian
- Icelandic
- Indonesian
- Italian (1649)
- Kurdish (Kurmandji)
- Latin (vulgata)
- Latvian
- Lithuanian
- Maori
- Mandaic (Mandaean)
- Portuguese
- Romanian (Cornilescu)
- Romanian:Romani NT: E Lashi Viasta (Gypsy)
- Slovak
- Slovene
- Spanish (1569) -> see also Catalan
- Syriac -> see Aramaic
- Tagalog
- Turkish (Latin alphabet for Turkish (türk alfabesi))
- Uzbek
- Welsh (Cymraeg)
- Swedish
- Norwegian
- …still working
Adjusting CurrierB towards … Latin dna
If you look closely there are many differences:
- cB has 12 letters for posA, whereas in Latin a word can begin with almost every letter
- cB has no real letter for mid, whereas Latin has the vowels a, e, i and u, but also l, m, n, o, p, r, s, t, v
- cB has m, n and y almost exclusively on posZ; y also occurs on posA (and very rarely on some other positions)
- cB has 8 possible letters (d, l, m, n, o, r, s, y) for posZ; Latin also has 8 possible letters: a, e, i, m, o, r, s, t (n, u and y only very rarely)
- cB letters a and h resemble Latin u and n
and so on and so on….
In order to see if we can redesign the VMS alphabet, we first have to look at the transcription of e, c and h.
In the linear transcription some decisions have been made about these letters.
To calculate the % of occurrence: the total number of letters in cAB is approximately
127859, and half of that is 63929.
Look at the 3 occurrences of ‘cc’:
f13r.P.8 | the text you see is: ..rcey kccky
f68r1.P.3 | the text you see is: 1 shokchy chteey choteey cphol cheor opcheeol otor choctheeey okchoal 2 tochso otchl qokeeedy cheey cheeteey yteody chpor cheo!korchey chod 3 ykor shey qocheey chokal okeey ror cckheor daram 4 dchor okaii!n
f100r.P2.6 | the text you see is: 5 folshody chol daiin fchod!y!cheol cphol qotees shey oreso alcfhy 6 soiin chol cphol shol shol qockhol chor chol sho keey cckhhy ykeeam 7 saii!chor sheor qockhody odeor yksheey chol sheody sai cheol raiin 8 sheor qkeeody chol daiin ctheol olcheol chek!y cheol cheockhy okeol 9 yaiin chekeey chol cholody chos olchor qokeol okeeol cheol!s al 10 chol cheol cho chckheody otolchey
As you can see above, these 3 occurrences of ‘cc’ really must have been ‘ch’ or ‘ce’ all along.
I also examined the 4 occurrences of ‘eh’ on folio f43r at p12.w6, p12.w8, p13.w4 and p14.w3. See for yourself; in my opinion every c, eh and e is more or less open for discussion here.
12 dor shol qokol shedy qotedy qokehdy qokody okehdy otedy shedy oty yty dy saiin
13 tshed qosheckhhy odeedy qeokehy qotedy daiin shodody shochol chckhy ykedy dy
14 ykeody checkhy chotehy odain chckhhhy choko!r aiin
Also, look at the word qosheckhhy on p13.w2, and then at the ‘hh‘ in that word.
As you can see, that is not an ‘hh’ but rather a ‘cc’!
Conclusion: the linear transcription of these digraphs is doubtful: cc, ce, eh.
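Rare digraphs such as these are easy to locate automatically before checking them against the images. The following is only a sketch: it assumes the transcription is available as numbered lines of space-separated words, and the function name `find_digraph` is hypothetical. The sample lines are the three lines of f43r quoted above.

```python
def find_digraph(lines, digraph):
    """Return (line_no, word_no, word) for every word containing the digraph.

    `lines` maps a line number to its transcribed text; word numbers start at 1.
    """
    hits = []
    for line_no, text in sorted(lines.items()):
        for word_no, word in enumerate(text.split(), start=1):
            if digraph in word:
                hits.append((line_no, word_no, word))
    return hits

# Lines 12-14 of f43r as quoted in the post.
f43r = {
    12: "dor shol qokol shedy qotedy qokehdy qokody okehdy otedy shedy oty yty dy saiin",
    13: "tshed qosheckhhy odeedy qeokehy qotedy daiin shodody shochol chckhy ykedy dy",
    14: "ykeody checkhy chotehy odain chckhhhy choko!r aiin",
}

for hit in find_digraph(f43r, "eh"):
    print(hit)
```

On these three lines the search for ‘eh’ reports the same four places listed above (line 12 words 6 and 8, line 13 word 4, line 14 word 3). The same call with ‘cc’ or ‘hh’ can be run over a whole transcription file.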
Word end: posZ
The letters ‘m’ (85%) and ‘n’ (95%) behave very peculiarly: they almost always end a word, and in 28 words so does ‘g’. The ‘y’ also ends a word in 79% of its occurrences and starts a word in 13% (elsewhere 8%).
In Latin the highest percentages on posZ are reached by m (43%) and s (32%). In fact, the overall high scores in Latin are for p on posA (51%), x on posB (53%) and q on posX (52%).
Word start: posA
In cB we see almost 0% on posA for the letters e, h, i, m, n.
In Latin the lowest percentage is u with 3%.
When I was struggling with the fact that I needed to “create” 5 more letters in the VMS in order to approach Latin, I realized that perhaps I should remove the nonsense-letters first, and then I could perhaps see a piece of the real VMS DNA.
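The word-start and word-end percentages quoted above can be checked with a short routine. This is a sketch under my own assumptions: the percentage is taken over all occurrences of the letter, words of length 1 are skipped as in the ABCmidXYZ table, and the function name `edge_shares` and the toy word list are made up.

```python
def edge_shares(words, letter):
    """Return (start %, end %, elsewhere %) of all occurrences of `letter`."""
    start = end = total = 0
    for word in words:
        if len(word) < 2:
            continue                      # assumption: skip words of length 1
        total += word.count(letter)
        start += word[0] == letter        # True counts as 1
        end += word[-1] == letter
    if total == 0:
        return None                       # the letter does not occur at all
    elsewhere = total - start - end
    return (100 * start / total, 100 * end / total, 100 * elsewhere / total)

# Toy word list for illustration only.
print(edge_shares(["daiin", "aiin", "qokain", "nar"], "n"))
```

A letter that ends a word in nearly all of its occurrences, like ‘n’ in cB, shows up here with an end share close to 100.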
But which letters are the nonsense-letters ?
update 2016: That is now under intensive investigation!