basic analysis I : letters
Some facts about the text.
We have two text types, according to Mr. Currier, but let's check that for ourselves.
Labels after the folio number in the linear VMS transcription:
P = paragraph text, L = label text, T = title text, C = circular text, R = radial text, S = star text
X/Y/Z are sometimes label-like text embedded within drawings; C1/C2/K/W were just arbitrary
For the analysis I took the 'H' transcription, which is the Takeshi Takahashi transcription.
(lines with <3 chars removed) | total lines | total words | longest lines | word count | biggest words (length) | location
Currier A&B (127859 chars) | 5135 | 37839 | f68r3.C1.1 | 38 | dolchsyckheol (13) | f89r2.P1.2
 | | | f70v2.R3.1 | 45 | otcholcheaiin (13) | f19v.T.13
 | | | f72v3.R1.1 | 40 | shapchedyfeey (13) | f26r.P.7
 | | | f68v3.O.1 | 49 | ycheeytydaiin (13) | f86v5.P.18
 | | | f72r3.R1.1 | 46 | cheoltchedaiin (14) | f114r.P2.39
 | | | | | chesokeeoteody (14) | f68v1.C.1
 | | | | | ypchocpheosaiin (15) | f87r.P.1
 | total lines | total words | longest lines | word count | biggest words (length) | location
Currier A (66701 chars) | 1790 | 11389 | f101r1.P.7 | 20 | ararchodaiin (12) | f89r1.t.5
 | | | f101v2.P.3 | 22 | chokchodaiin (12) | f49v.P.15
 | | | f101v2.P.4 | 26 | ctheockhosho (12) | f101v2.P.6
 | | | | | otcholocthol (12) | f15v.P.9
 | | | | | chotcheytchol (13) | f56r.P.9
 | | | | | dolchsyckheol (13) | f89r2.P1.2
 | | | | | otcholcheaiin (13) | f19v.T.13
 | | | | | ypchocpheosaiin (15) | f87r.P.1
AVERAGES | chars per line (incl. spaces) | words per line | chars per word (incl. spaces) | chars per word (excl. spaces)
Currier A&B | 24.9 | 4 | 6.17 | 5.3
Currier A | 37.2 | 6.4 | 5.86 | 5.2
Currier B | 53 | 5.2
avg chars excl space: 45
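A minimal sketch of how averages like these could be computed, not necessarily the script used for the tables above. It assumes each transcription line is plain EVA text with words separated by single spaces, and it assumes "chars per word (incl. spaces)" means characters including spaces divided by the number of words; the function name is only illustrative.

```python
def line_averages(lines):
    """Return (chars per line incl. spaces, words per line,
    chars per word incl. spaces, chars per word excl. spaces)."""
    # Assumption: words on a line are separated by single spaces.
    word_lists = [line.split() for line in lines if line.strip()]
    n_lines = len(word_lists)
    n_words = sum(len(words) for words in word_lists)
    n_letters = sum(len(word) for words in word_lists for word in words)
    # Characters including the spaces between the words of each line.
    n_chars = n_letters + sum(len(words) - 1 for words in word_lists)
    return (n_chars / n_lines, n_words / n_lines,
            n_chars / n_words, n_letters / n_words)
```

Called on the list of extracted lines, this returns the four averages in the same order as the table columns.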
To compare the script with possible decipherments we need characteristics.
I already collected some on other (previous) pages, but here I use my own (limited) analytical capabilities and present the results as objectively as possible.
word length | Currier A count | Currier AB count | Currier A % of total | Currier AB % of total
1 | 342 | 726 | 3 | 2 | |
2 | 566 | 2283 | 5 | 6 | |
3 | 1273 | 3543 | 11 | 9 | |
4 | 2430 | 6843 | 21 | 18 | |
5 | 2938 | 9621 | 26 | 25 | |
6 | 1934 | 7583 | 17 | 20 | |
7 | 1131 | 4484 | 10 | 12 | |
8 | 508 | 1800 | 4 | 5 | |
9 | 190 | 669 | 2 | 2 | |
10 | 57 | 204 | 1 | 1 | |
11 | 12 | 47 | 0 | 0 | |
12 | 4 | 25 | 0 | 0 | |
13 | 3 | 8 | 0 | 0 | |
14 | 0 | 2 | 0 | 0 | |
15 | 1 | 1 | 0 | 0 | |
total | 11389 | 37839 | 100% | 100% | |
Currier A: processed 1795 lines, 11389 words
Currier AB: processed 5139 lines, 37839 words
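A word-length table of this kind can be produced with a few lines of Python. This is only a sketch under the assumption that words are separated by spaces; the function name is illustrative and not taken from the original analysis.

```python
from collections import Counter

def word_length_table(lines):
    """Count words per length and give each length's rounded share in percent."""
    lengths = Counter(len(word) for line in lines for word in line.split())
    total = sum(lengths.values())
    return {n: (count, round(100 * count / total))
            for n, count in sorted(lengths.items())}
```

Each entry maps a word length to a (count, percentage) pair, matching the count and "% of total" columns.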
On the previous page (decipher start) we already saw that the average word length lies between 5 and 6 characters. Besides the slight difference at lengths 3 and 4, there is no real difference between Currier A and Currier A + B in that respect. At this point I see no reason to split the analysis into three sections (Currier A, Currier B, and Currier A + B), neither from the word-count view nor from the character-count view.
The average word lengths in English, French, Spanish and German are approximately 5.10, 5.13, 5.22 and 6.26. It seems for most European languages to be around 5 characters.
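For comparison, the mean word length can be recomputed directly from the Currier AB counts in the word-length table above. The sketch below only does the weighted average; the names are illustrative.

```python
# Word-length counts for Currier AB, copied from the word-length table above.
AB_COUNTS = {1: 726, 2: 2283, 3: 3543, 4: 6843, 5: 9621, 6: 7583, 7: 4484,
             8: 1800, 9: 669, 10: 204, 11: 47, 12: 25, 13: 8, 14: 2, 15: 1}

def mean_word_length(counts):
    """Return (mean letters per word, total letters, total words)."""
    total_words = sum(counts.values())
    total_letters = sum(length * n for length, n in counts.items())
    return total_letters / total_words, total_letters, total_words

mean, letters, words = mean_word_length(AB_COUNTS)
```

With these counts the totals are 191709 letters in 37839 words, which is roughly 5.07 letters per word (spaces not counted).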
Next I counted the letters in the words:
letter | Currier A | Currier AB | Currier B
* | 117 | 252 | 41 |
a | 3577 | 14279 | 9240 |
c | 5056 | 13313 | 7261 |
d | 3157 | 12966 | 8876 |
e | 3761 | 20067 | 14295 |
f | 155 | 500 | 289 |
g | 42 | 96 | 30 |
h | 6413 | 17854 | 10183 |
i | 3614 | 11732 | 7398 |
k | 2707 | 10931 | 7365 |
l | 3004 | 10512 | 6569 |
m | 391 | 1116 | 604 |
n | 1825 | 6141 | 4028 |
o | 8864 | 25448 | 14011 |
p | 437 | 1627 | 1078 |
q | 1131 | 5422 | 4207 |
r | 2382 | 7447 | 4342 |
s | 2421 | 7376 | 4187 |
t | 2238 | 6942 | 3891 |
v | 9 | ||
x | 32 | 28 | |
y | 4508 | 17645 | 11599 |
z | 2 | 2 | |
20 | letters | 22 | letters |
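Counting letters like this is straightforward; the following is a sketch of one way to do it, assuming the words have already been extracted from the transcription. The '*' placeholder for an unreadable glyph is counted like any other character, as in the table.

```python
from collections import Counter

def letter_counts(words):
    """Count every character over a list of transcribed words."""
    counts = Counter()
    for word in words:
        counts.update(word)  # '*' (unreadable glyph) is counted like a letter
    return dict(sorted(counts.items()))
```

The number of keys in the result gives the number of distinct characters in use.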
If you notice that Currier A and B together do not add up to the AB total, that is because lines with trash, or with too many stars and only a few letters, were thrown out. I used the extractor from here with these settings:
Page range: none, Takeshi Takahashi, Remove comments, Remove inline comments, Remove parsable information
* for A: Currier: A only; result: 1790 lines
* for B: Currier: B only; result: 2647 lines
* for A and B: I selected none and used everything (Currier A and B together are only 4462 lines). If no Currier language is selected you get 5214 lines, perhaps because some pages were not classified by Currier?
As you can see, the letter occurrences behave differently in A and B.
In the first peak we see a difference in 'acdef', and because of that I want to see percentages as well:
The letters c, e, h
Yes, there is some difference in the usage of the c, the e, the h and the o.
There appears to be a difference between Currier A and AB together, but it is not very significant for the other letters. It is possible that the cipher shifted or switched letters. For example, look at the heights of the h (purple) and the c (blue). But if I then look at the VMS alphabet for h and c, these letters look almost the same in Voynichese writing!
Perhaps this means that the transcription can be changed: could we merge the letters c and h, and e and h?
The o
It is also obvious that o was the most used letter in A, but in B the top spot is shared between o and e.
The biggest differences between A and B are in e (5.2 percentage points), o (4.2) and c and h (about 3 each).
Then I made a table of the letter frequency shift from A to B:
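A sketch of how such a shift table could be computed from the Currier A and Currier B letter counts listed earlier. It is an assumption of this sketch that '*', v, x and z are left out, because they are rare or missing from one of the columns; the names are illustrative.

```python
# Letter counts copied from the letter table above: (Currier A, Currier B).
# Assumption: '*', v, x and z are left out (rare or incomplete in the table).
COUNTS = {
    "a": (3577, 9240), "c": (5056, 7261), "d": (3157, 8876),
    "e": (3761, 14295), "f": (155, 289), "g": (42, 30),
    "h": (6413, 10183), "i": (3614, 7398), "k": (2707, 7365),
    "l": (3004, 6569), "m": (391, 604), "n": (1825, 4028),
    "o": (8864, 14011), "p": (437, 1078), "q": (1131, 4207),
    "r": (2382, 4342), "s": (2421, 4187), "t": (2238, 3891),
    "y": (4508, 11599),
}

def frequency_shift(counts):
    """Percentage-point change of each letter's share from A to B."""
    total_a = sum(a for a, _ in counts.values())
    total_b = sum(b for _, b in counts.values())
    return {letter: 100 * b / total_b - 100 * a / total_a
            for letter, (a, b) in counts.items()}

shift = frequency_shift(COUNTS)
# The four letters whose share changes most, in either direction.
largest = sorted(shift, key=lambda k: abs(shift[k]), reverse=True)[:4]
```

Under these assumptions the four largest shifts are for e (up), and o, c and h (down), in line with the differences quoted above.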
Suppose the letters are enciphered with a key or table, and the author of the VMS changed it for Currier B. By looking at the frequency shift we can see which letters were shifted in that key or table, which would be somewhere between 15 and 20 characters long (15 being more probable).
Given the frequency shift, there must be another possible key with that same shift that still produces words of those lengths. Words that are anagrams perhaps, or something like paternoster in the Sator square?
Or it is something that can be changed easily, for example feminas et herba (women and plants), which could quickly be changed into herba et feminas (plants and women).
20 letters
Another observation is that only 20 real letters are used in the words. There are no capital letters, no punctuation marks, and no end-of-line marks.
The two letters v and x occur so rarely in Currier AB that I consider them negligible.
The z occurs only twice in the entire text, but both times in real words: zepchy (f17r.P.7) and tazain (f58r.P.24). Therefore I do not want to discard that character.
We all know that our Latin script uses 26 letters. Could the lack of 6 letters be explained by the vowels plus one other letter: a, e, i, o, u + another letter?
Perhaps the writer somehow deleted or enciphered these in the VMS,
or the written language has only 20 letters.
The old Latin alphabet was written without the u and without the w, but these letters would have been added around the 7th century already.
I could now try to decipher the text by:
1. placing a vowel at one place (or more) in or around an EVA word
2. replacing an EVA letter by another Latin letter (like an ordinary Caesar cipher)
The number of combinations for an average word is not so huge, and an examining eye could perhaps spot language-recognisable fragments.
For a 5-letter word, the places where a vowel (v) could go are:
v12345 12345v 1v2345 12v345 123v45 1234v5
= (word character count + 1) places for each of the 5 vowels
We could of course also add multiple vowels (a, o, u, e, i) per word.
Combinations of vowels = (word character count + 1)^2 * 5 vowels,
so for a 5-letter word that is 6^2 * 5 = 180 combinations for the vowels.
The Caesar decipherment itself depends on the replacement of the 20 letters, so that is 20^2 * 180 = 72000 combinations for one word. That is doable, I think.
The only problem is that it is unthinkable that the VMS only has words with separate vowels, nicely spread, and only one of each vowel at a time. Or did I already account for a word like 'abacaadroo', which has more than one of the same vowel in the same word? I will find out later in an experiment.
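The single-vowel step described above can be sketched as follows. This is only an illustration of the enumeration, using the same placeholder word 12345 as in the list of places; the function name and the default vowel set are assumptions of this sketch.

```python
VOWELS = "aeiou"

def insert_one_vowel(word, vowels=VOWELS):
    """Return every string made by inserting one vowel at one slot of word."""
    candidates = []
    for vowel in vowels:
        # Slots before, between and after the letters: len(word) + 1 of them.
        for slot in range(len(word) + 1):
            candidates.append(word[:slot] + vowel + word[slot:])
    return candidates

candidates = insert_one_vowel("12345")
```

For a 5-letter word this gives 6 slots times 5 vowels, so 30 candidates. With a real EVA word the list can contain duplicates, because inserting a vowel next to an identical letter gives the same string from two slots.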
Character positions
I examined the positions of the letters in the words. Because the tables are rather big, only the summary is displayed here:
currierAB | letter | total count | pos 1 | pos 2 | pos 3 | pos 4 | pos 5 | pos 6 | pos 7 | pos 8 | pos 9 | pos 10 | pos 11 | pos 12 | pos 13 | pos 14 | pos 15 |
* | 252 | 67 | 35 | 46 | 29 | 22 | 24 | 14 | 6 | 6 | 2 | 1 | |||||
a | 14279 | 1958 | 3709 | 3330 | 2794 | 1363 | 710 | 297 | 79 | 26 | 10 | 2 | 1 | ||||
c | 13313 | 6920 | 2016 | 2217 | 1427 | 463 | 171 | 67 | 23 | 4 | 5 | ||||||
d | 12966 | 3664 | 399 | 912 | 2659 | 2664 | 1752 | 648 | 196 | 54 | 11 | 5 | 1 | 1 | |||
e | 20067 | 140 | 926 | 7951 | 6233 | 2999 | 1200 | 419 | 142 | 42 | 8 | 6 | 1 | ||||
f | 500 | 121 | 160 | 81 | 62 | 42 | 13 | 15 | 5 | 1 | |||||||
g | 96 | 16 | 8 | 9 | 13 | 16 | 16 | 10 | 6 | 2 | |||||||
h | 17854 | 2 | 9219 | 3390 | 1958 | 1980 | 846 | 302 | 108 | 31 | 12 | 6 | |||||
i | 11732 | 15 | 876 | 2830 | 3161 | 2433 | 1353 | 646 | 283 | 89 | 24 | 14 | 5 | 2 | 1 | ||
k | 10931 | 1162 | 3855 | 3944 | 1201 | 527 | 175 | 50 | 11 | 5 | 1 | ||||||
l | 10512 | 1366 | 2186 | 1666 | 2039 | 1764 | 873 | 403 | 150 | 42 | 14 | 5 | 2 | 2 | |||
m | 1116 | 13 | 127 | 186 | 269 | 230 | 156 | 75 | 38 | 17 | 4 | 1 | |||||
n | 6141 | 4 | 10 | 132 | 971 | 1851 | 1493 | 922 | 446 | 207 | 74 | 15 | 10 | 4 | 1 | 1 | |
o | 25448 | 8519 | 6936 | 3751 | 3290 | 1755 | 793 | 272 | 84 | 33 | 9 | 2 | 4 | ||||
p | 1627 | 544 | 608 | 238 | 141 | 53 | 23 | 12 | 2 | 6 | |||||||
q | 5422 | 5388 | 25 | 2 | 3 | 1 | 1 | 1 | 1 | ||||||||
r | 7447 | 501 | 1144 | 1269 | 1695 | 1438 | 771 | 406 | 157 | 49 | 12 | 4 | 1 | ||||
s | 7376 | 4542 | 787 | 482 | 523 | 504 | 308 | 153 | 42 | 25 | 8 | 1 | 1 | ||||
t | 6942 | 978 | 3565 | 1415 | 590 | 244 | 99 | 29 | 14 | 7 | 1 | ||||||
v | 9 | 9 | |||||||||||||||
x | 32 | 13 | 7 | 4 | 1 | 2 | 2 | 2 | 1 | ||||||||
y | 17645 | 1896 | 515 | 974 | 2228 | 4093 | 4044 | 2497 | 962 | 311 | 91 | 21 | 10 | 2 | 1 | ||
z | 2 | 1 | 1 | ||||||||||||||
currierAB | |||||||||||||||||
totals | 22 | 191709 | 37839 | 37113 | 34830 | 31287 | 24444 | 14823 | 7240 | 2756 | 956 | 287 | 83 | 36 | 11 | 3 | 1 |
What is interesting is that some letters never occur at a particular position. You can also read 1 or 2 occurrences as 'almost never'. If this is a simple cipher, we could match this analysis against language-specific information.
For example, the EVA letter 'v' occurs only 9 times, always at the first position. And 'q', with almost 5400 occurrences, appears only 25 times at position 2:
letter | pos 1 | pos 2 | pos 3 | pos 4 | pos 5 | pos 6 | pos 7 | pos 8 |
q | 5388 | 25 | 2 | 3 | 1 | 1 | 1 | 1 |
Here it is visible, for example, that any letter can be at position 1, but the letter v can only be at position 1 and nowhere else.
There actually is a slight difference in letter positions between A and B. The most important difference is for the letter d: in Currier A it occurs almost only at position 7.
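A position table like the one above could be built along these lines. This is a sketch that assumes a plain list of words and counts positions from 1, as in the table; the function name is illustrative.

```python
from collections import defaultdict

def position_counts(words):
    """Map each letter to {1-based position in the word: occurrences}."""
    table = defaultdict(lambda: defaultdict(int))
    for word in words:
        for position, letter in enumerate(word, start=1):
            table[letter][position] += 1
    return {letter: dict(sorted(positions.items()))
            for letter, positions in sorted(table.items())}
```

Positions that never occur for a letter are simply absent from its dictionary, which makes the "never at this position" cases easy to spot.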
Below is another view:
As a quick test I compared the frequency percentages of the VMS text with several other languages:
As you can see, the VMS stays within the frequency range of all the other languages that were examined.
The image below shows only the 23 most frequent letters of some languages that I saw people discuss:
Even if I display only languages whose frequency analysis (f.a.) has its highest-% letter between 10 and 17% (the percentage range of A, B and AB), Currier still looks out of tune:
(The red one is Currier A and beside it is Currier B.)
In the graph with only four selected close languages I wanted to show how strange the f.a. is for Currier AB, especially in the low regions. But this is what one would expect from a language that has too few letters (about 15) to form a Latin or Italian word!
The f.a. comparisons with other languages above are based on current languages.
Old German
Now I took German from the Bible ('deutsch' in the graph) and the language of the Gart der Gesundheit from 1485 (also German) and analysed a total of 85129 characters. It is surprising to me that the frequency analysis actually differs to a noticeable degree:
Some more characters had percentages of their own, but I could not easily match them to current German; that is why the table on the left shows some blanks. The graph, however, displays everything up to the row with the u-umlaut.
2016: total count table, sorted
letter | counted
o | 28712
e | 25054
h | 22317
a | 18117
y | 17565
i | 15078
d | 13682
c | 13664
k | 12905
l | 11025
t | 8373
r | 7692
s | 7324
n | 6142
q | 5427
p | 1727
m | 1123
f | 528
* | 166
g | 88
x | 35
v | 2
z | 2
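To compare a sorted count table like this with other languages, the counts first have to be turned into percentages. A minimal sketch, assuming the counts are available as a dictionary; the function name is illustrative.

```python
def ranked_percentages(counts):
    """Sort letters by count (descending) and attach each letter's share in percent."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(letter, count, round(100 * count / total, 2))
            for letter, count in ranked]
```

The resulting list of (letter, count, percentage) rows can be plotted rank by rank against the same kind of list for another language.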