Basic analysis I: letters

Some facts about the text.

According to Mr. Currier we have two text types, but let's find out and double-check.

Labels after the folio number in the linear VMS transcription:

P = paragraph text, L = label text, T = title text, C = circular text, R = radial text, S = star text.
X/Y/Z are sometimes label-like text embedded within drawings; C1/C2/K/W were just arbitrary.

For the analysis I took the 'H' transcription, which is the Takeshi Takahashi transcription.

Lines with fewer than 3 chars were removed.

Currier A&B: 5135 total lines, 37839 total words (127859 chars)

longest lines   wordcount      biggest words (length)   location
f68r3.C1.1      38             dolchsyckheol (13)       f89r2.P1.2
f70v2.R3.1      45             otcholcheaiin (13)       f19v.T.13
f72v3.R1.1      40             shapchedyfeey (13)       f26r.P.7
f68v3.O.1       49             ycheeytydaiin (13)       f86v5.P.18
f72r3.R1.1      46             cheoltchedaiin (14)      f114r.P2.39
                               chesokeeoteody (14)      f68v1.C.1
                               ypchocpheosaiin (15)     f87r.P.1

Currier A: 1790 total lines, 11389 total words (66701 chars)

longest lines   wordcount      biggest words (length)   location
f101r1.P.7      20             ararchodaiin (12)        f89r1.t.5
f101v2.P.3      22             chokchodaiin (12)        f49v.P.15
f101v2.P.4      26             ctheockhosho (12)        f101v2.P.6
                               otcholocthol (12)        f15v.P.9
                               chotcheytchol (13)       f56r.P.9
                               dolchsyckheol (13)       f89r2.P1.2
                               otcholcheaiin (13)       f19v.T.13
                               ypchocpheosaiin (15)     f87r.P.1
AVERAGES      chars per line    words       chars per word    chars per word
              (incl. spaces)    per line    (incl. spaces)    (excl. spaces)
Currier A&B   24.9              4           6.17              5.3
Currier A     37.2              6.4         5.86              5.2
Currier B     53                                              5.2

Currier B, average chars per line excluding spaces: 45

To compare the script with possible decipherments we need characteristics.
I already collected some on other (previous) pages. Here I use my own (limited) analytical capabilities and display the results as objectively as possible.

wordlength   Currier A    Currier AB   Currier A     Currier AB
             count        count        % of total    % of total
1            342          726          3             2
2            566          2283         5             6
3            1273         3543         11            9
4            2430         6843         21            18
5            2938         9621         26            25
6            1934         7583         17            20
7            1131         4484         10            12
8            508          1800         4             5
9            190          669          2             2
10           57           204          1             1
11           12           47           0             0
12           4            25           0             0
13           3            8            0             0
14           0            2            0             0
15           1            1            0             0
total        11389        37839        100%          100%

Currier A: processed 1795 lines, 11389 words
Currier AB: processed 5139 lines, 37839 words

wordlength

percwordcounts

 

On the previous page (decipher start) we already saw that the average word length lies between 5 and 6 characters. Besides a slight difference at lengths 3 and 4, there is no real difference between Currier A and Currier A + B in that respect. At this point I see no reason to split the analysis into three sections (Currier A, Currier B, and Currier A + B), neither from the word-count view nor from the character-count view.

The average word lengths in English, French, Spanish and German are approximately 5.10, 5.13, 5.22 and 6.26. For most European languages the average seems to be around 5 characters.
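A minimal sketch of how a word-length table like the one above could be produced. It assumes the transcription lines have already been extracted as plain strings and that words are separated by '.' or whitespace; the function names and the sample lines are only illustrative, not the actual script used for this page.

```python
from collections import Counter
import re

def split_words(line):
    """Split one transcription line into words.
    Assumption: words are separated by '.' or whitespace."""
    return [w for w in re.split(r"[.\s]+", line.strip()) if w]

def length_distribution(lines):
    """Return {word length: (count, percentage of all words)}."""
    counts = Counter(len(w) for line in lines for w in split_words(line))
    total = sum(counts.values())
    return {n: (c, 100.0 * c / total) for n, c in sorted(counts.items())}

# Illustrative sample lines, not real folio lines.
sample = ["daiin.chol.shol", "qokeedy.ol"]
dist = length_distribution(sample)
for n, (count, pct) in dist.items():
    print(n, count, round(pct))
```

Running the same function once on the Currier A lines and once on all lines would give the two count columns and the two percentage columns.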

Now I counted the letters in the words:

letter, Currier A count, Currier AB count, Currier B count
* 117 252 41
a 3577 14279 9240
c 5056 13313 7261
d 3157 12966 8876
e 3761 20067 14295
f 155 500 289
g 42 96 30
h 6413 17854 10183
i 3614 11732 7398
k 2707 10931 7365
l 3004 10512 6569
m 391 1116 604
n 1825 6141 4028
o 8864 25448 14011
p 437 1627 1078
q 1131 5422 4207
r 2382 7447 4342
s 2421 7376 4187
t 2238 6942 3891
v 9
x 32 28
y 4508 17645 11599
z 2 2
20 letters 22 letters
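A small sketch of how such letter counts could be gathered, assuming a list of already extracted words. The names and sample words are illustrative only.

```python
from collections import Counter

def letter_counts(words):
    """Count every character over a list of words."""
    return Counter(ch for w in words for ch in w)

def letter_percentages(counts):
    """Convert counts to percentages of all counted characters."""
    total = sum(counts.values())
    return {ch: 100.0 * n / total for ch, n in counts.items()}

# Illustrative sample, not the full transcription.
counts = letter_counts(["daiin", "chol", "qokeedy"])
for ch, n in counts.most_common():
    print(ch, n)
print(len(counts), "different letters")
```

The number of distinct keys in the counter corresponds to the "letters" figure under each column.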

 

If you notice that Currier A and B together do not add up to the AB total, that is because lines with trash, or with too many stars and only a few letters, were thrown out. I also used the extractor from here with these settings:

Page range: none, Takeshi Takahashi, Remove comments, Remove inline comments, Remove parsable information

* for A: Currier: A only; result: 1790 lines

* for B: Currier: B only; result: 2647 lines

* for A and B: I selected none and used everything (Currier A and B together are only 4462 lines). If no Currier type is selected you get 5214 lines, perhaps because some pages were not analysed by Currier?
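The line filtering described above could be sketched as follows. The 3-character minimum comes from the text; the share of stars that counts as "too many" is not stated anywhere, so `max_star_ratio` is a purely hypothetical threshold.

```python
def keep_line(text, min_chars=3, max_star_ratio=0.5):
    """Decide whether a transcription line is kept for counting.
    min_chars follows the 'fewer than 3 chars' rule from the text.
    max_star_ratio is an assumed threshold for 'too many stars'."""
    chars = text.replace(".", "").replace(" ", "")
    if len(chars) < min_chars:
        return False
    return chars.count("*") / len(chars) <= max_star_ratio

# Illustrative lines only.
lines = ["daiin.chol", "ol", "***a", "qo*eedy"]
kept = [line for line in lines if keep_line(line)]
print(kept)
```

With a different threshold the number of kept lines changes, which is one possible reason why line totals differ between runs.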

 

countingletters

As you can see, the letter occurrences behave differently in A and B.
In the first mountain we see a difference in 'acdef', and because of that I want to see percentages as well:

percletterocc

The letters c, e, h

Yes, there is some difference in usage of the c, the e, the h and the o.

There appears to be a difference between Currier A and AB together, but it is not very significant for the other letters. It is possible that the cipher shifted or switched letters. For example, look at the height of the h (purple) and the c (blue). But if I then look at the VMS alphabet for h and c, you notice that these letters look almost the same in Voynichese writing!
Perhaps this means that the transcription could be changed: could we merge the letters c and h, and e and h?

The o

It is also obvious that o was the most used letter in Currier A, but that in Currier B this usage is spread between o and e.

fanforaandb

The biggest differences in share between A and B can be seen in e (5.2 percentage points), o (4.2) and c and h (both about 3).

Then I made a table of the letter frequency shift from A to B:

shift ca to cb
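As a sketch of how such a shift table can be computed, the block below turns the A and B counts into percentages and subtracts them. The counts are copied from the letter table earlier on this page; leaving out '*' and the rare v, x and z is an assumption made here to keep the two columns comparable.

```python
# Letter counts copied from the letter table (19 letters present in both columns).
# Assumption: '*' and the rare v, x, z are left out of the totals.
COUNTS_A = {"a": 3577, "c": 5056, "d": 3157, "e": 3761, "f": 155, "g": 42,
            "h": 6413, "i": 3614, "k": 2707, "l": 3004, "m": 391, "n": 1825,
            "o": 8864, "p": 437, "q": 1131, "r": 2382, "s": 2421, "t": 2238,
            "y": 4508}
COUNTS_B = {"a": 9240, "c": 7261, "d": 8876, "e": 14295, "f": 289, "g": 30,
            "h": 10183, "i": 7398, "k": 7365, "l": 6569, "m": 604, "n": 4028,
            "o": 14011, "p": 1078, "q": 4207, "r": 4342, "s": 4187, "t": 3891,
            "y": 11599}

def percentages(counts):
    """Share of each letter in percent of all counted letters."""
    total = sum(counts.values())
    return {ch: 100.0 * n / total for ch, n in counts.items()}

def frequency_shift(counts_a, counts_b):
    """Percentage-point change per letter going from A to B."""
    pa, pb = percentages(counts_a), percentages(counts_b)
    letters = sorted(set(pa) | set(pb))
    return {ch: pb.get(ch, 0.0) - pa.get(ch, 0.0) for ch in letters}

shift = frequency_shift(COUNTS_A, COUNTS_B)
# Print the four letters with the largest absolute shift.
for ch, delta in sorted(shift.items(), key=lambda kv: abs(kv[1]), reverse=True)[:4]:
    print(ch, round(delta, 1))
```

A positive value means the letter is relatively more frequent in B, a negative value that it is more frequent in A.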

Suppose the letters are ciphered with a key or table, and the author of the VMS changed it in Currier B. By looking at the frequency shift we would then know which letters were shifted in the key or table, whose length is at most 20 and more probably around 15 characters.

Based on the frequency shift, in order to make new words of those lengths there must be another possible key that has the same key shift. Perhaps words that are anagrams, or something like paternoster in the sator square?

Or it is something that can be changed easily, for example 'feminas et herba' (women and plants), which could quickly be changed into 'herba et feminas' (plants and women).

20 letters


Another observation is that only 20 real letters are used in the words. There are no capital letters, no punctuation marks, and no end-of-line markers.

The 2 letters v and x have, in my opinion, a negligible number of occurrences in Currier AB.
The z occurs only twice in the entire text, but both times in real words: zepchy (f17r.P.7) and tazain (f58r.P.24). Therefore I do not want to drop that character.

We all know that our Latin script uses 26 letters. Could the lack of 6 letters be explained by the vowels plus another letter: a e i o u + another letter?
Perhaps the writer somehow deleted/ciphered these from the VMS,
or the written language has only 20 letters.
The old Latin alphabet was written without the u and without the w, but these letters would have been added around the 7th century already.

lettersabortedoncount

I could now try to decipher the text by:

1. placing a vowel in one place (or more) in or around an EVA word
2. replacing an EVA letter by another Latin letter (like an ordinary Caesar cipher)
The number of combinations for an average word is not so huge, and an examining eye could perhaps spot whether there are recognisable language fragments.

The places where the vowels could be are, for a 5-letter word:

v12345 12345v 1v2345 12v345 123v45 1234v5
= (word length + 1) positions for each of the 5 vowels

We could of course also add multiple vowels (a, o, u, e, i) per word.
As a rough estimate I take word length^2 × 5 vowels,
so that is 5^2 × 5 = 125 combinations for the vowels.

The Caesar decipherment itself depends on the replacement of the 20 letters, so that is 20^2 × 125 = 50000 combinations for one word. That is doable, I think.

The only problem is that it is unthinkable that the VMS only has words with separate vowels with a nice spread and only one of the same vowel at a time. Or did I already account for a word like 'abacaadroo', that is, with more than one of the same vowel in the same word? I will find out later in an experiment.
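To get a feeling for the size of the search, here is a sketch that enumerates candidates under a narrower set of assumptions than the estimate above: exactly one inserted vowel, and a pure rotation (Caesar shift) over a 20-letter alphabet. The alphabet order used here is hypothetical; it is simply the 20 EVA letters sorted alphabetically.

```python
VOWELS = "aeiou"
# Hypothetical ordering: the 20 EVA letters in alphabetical order.
ALPHABET = "acdefghiklmnopqrstyz"

def insert_one_vowel(word):
    """All ways to insert exactly one vowel at any of len(word)+1 positions.
    Duplicates are kept, so the list always has (len(word)+1) * 5 entries."""
    return [word[:i] + v + word[i:] for i in range(len(word) + 1) for v in VOWELS]

def caesar_shifts(word, alphabet=ALPHABET):
    """All rotations of the word over the given alphabet (shift 0 first).
    Raises KeyError for characters that are not in the alphabet."""
    n = len(alphabet)
    index = {ch: i for i, ch in enumerate(alphabet)}
    return ["".join(alphabet[(index[ch] + k) % n] for ch in word) for k in range(n)]

def candidates(word):
    """Shift first, then insert one vowel into every shifted form."""
    return [c for shifted in caesar_shifts(word) for c in insert_one_vowel(shifted)]

# 'chkdy' is a made-up 5-letter example, not a word from the transcription.
print(len(insert_one_vowel("chkdy")), len(caesar_shifts("chkdy")), len(candidates("chkdy")))
```

Under these assumptions a 5-letter word gives 6 × 5 = 30 vowel placements and 20 shifts. Allowing several vowels per word, or a free substitution table instead of a rotation, makes the number of candidates grow quickly.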

 


Character positions

I examined the positions of the letters in the words. Because the tables are a bit big, only the summary is displayed here:

currierAB letter total count pos 1 pos 2 pos 3 pos 4 pos 5 pos 6 pos 7 pos 8 pos 9 pos 10 pos 11 pos 12 pos 13 pos 14 pos 15
* 252 67 35 46 29 22 24 14 6 6 2 1
a 14279 1958 3709 3330 2794 1363 710 297 79 26 10 2 1
c 13313 6920 2016 2217 1427 463 171 67 23 4 5
d 12966 3664 399 912 2659 2664 1752 648 196 54 11 5 1 1
e 20067 140 926 7951 6233 2999 1200 419 142 42 8 6 1
f 500 121 160 81 62 42 13 15 5 1
g 96 16 8 9 13 16 16 10 6 2
h 17854 2 9219 3390 1958 1980 846 302 108 31 12 6
i 11732 15 876 2830 3161 2433 1353 646 283 89 24 14 5 2 1
k 10931 1162 3855 3944 1201 527 175 50 11 5 1
l 10512 1366 2186 1666 2039 1764 873 403 150 42 14 5 2 2
m 1116 13 127 186 269 230 156 75 38 17 4 1
n 6141 4 10 132 971 1851 1493 922 446 207 74 15 10 4 1 1
o 25448 8519 6936 3751 3290 1755 793 272 84 33 9 2 4
p 1627 544 608 238 141 53 23 12 2 6
q 5422 5388 25 2 3 1 1 1 1
r 7447 501 1144 1269 1695 1438 771 406 157 49 12 4 1
s 7376 4542 787 482 523 504 308 153 42 25 8 1 1
t 6942 978 3565 1415 590 244 99 29 14 7 1
v 9 9
x 32 13 7 4 1 2 2 2 1
y 17645 1896 515 974 2228 4093 4044 2497 962 311 91 21 10 2 1
z 2 1 1
currierAB
totals 22 191709 37839 37113 34830 31287 24444 14823 7240 2756 956 287 83 36 11 3 1

What is interesting is that some letters never occur at a particular position. You can also read 1 or 2 occurrences as 'almost never'. If it is a simple cipher, we could match this analysis against language-specific information.

For example, the EVA letter 'v' occurs only 9 times, always in the first position. And of the roughly 5400 occurrences of 'q', 5388 are in position 1 and only 25 in position 2:

letter pos 1 pos 2 pos 3 pos 4 pos 5 pos 6 pos 7 pos 8
q 5388 25 2 3 1 1 1 1
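A position table of this kind could be built with a nested counter, as in the sketch below. It assumes a list of extracted words; the names and the sample words are illustrative.

```python
from collections import Counter, defaultdict

def position_counts(words):
    """Return {letter: Counter({position: count})}, positions starting at 1."""
    table = defaultdict(Counter)
    for w in words:
        for pos, ch in enumerate(w, start=1):
            table[ch][pos] += 1
    return table

# Illustrative sample, not the full transcription.
table = position_counts(["qokeedy", "qol", "daiin"])
for letter in sorted(table):
    row = table[letter]
    print(letter, sum(row.values()), [row[p] for p in range(1, 8)])
```

A letter that "never" occurs at a position simply shows a 0 in that column of its row.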

curabletterposgraph

 

Here you can see, for example, that any letter can be at position 1, but the letter v can only be at position 1 and nowhere else.
There actually is a slight difference in letter positions between A and B. The most important difference is for the letter d: in Currier A it occurs almost only at position 7.

Below is another view:

currierabantoherview

 

As a quick test I compared the letter frequency percentages of the VMS text with those of several other languages:

letterfa1

As you can see, the VMS stays within the frequency range of all the other languages that were examined.

letterfa2a

 

The image below displays only the 23 most frequent letters of some languages that I saw people discuss:

fasome

 

 

Even if I display only languages whose most frequent letter has a frequency (f.a.) between 10 and 17% (the percentage range of A, B and AB), Currier still looks out of tune:

(The red one is Currier A, and beside it is Currier B.)
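The selection step could be sketched like this: compute a ranked frequency profile per text and keep only the texts whose top letter falls in the 10 to 17% range. The function names are illustrative, and the reference texts for the other languages would have to be supplied separately; none are included here.

```python
from collections import Counter

def ranked_profile(text):
    """Letter percentages sorted from most to least frequent."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    total = sum(counts.values())
    return sorted((100.0 * n / total for n in counts.values()), reverse=True)

def top_share_in_range(text, low=10.0, high=17.0):
    """True if the most frequent letter has a share between low and high percent."""
    profile = ranked_profile(text)
    return bool(profile) and low <= profile[0] <= high

# Toy strings only, to show the mechanics.
print(ranked_profile("aabb"))
print(top_share_in_range("abcdefgh"), top_share_in_range("aabb"))
```

Comparing the ranked profiles position by position (most frequent letter against most frequent letter, and so on) avoids having to decide which VMS letter corresponds to which Latin letter.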

 

fa_limited

 

letterfa2b

In the graph with only four selected close languages I wanted to show how strange the f.a. is for Currier AB, especially in the low regions. But this is what one would expect from a language that has too few letters (about 15) to form a Latin or Italian word!

The f.a. comparisons with other languages above are based on current languages.

Old German

Now I took German from the Bible ('deutsch' in the graph) and the language of the Gart der Gesundheit from 1485 (also German) and analysed a total of 85129 characters. It is surprising to me to see that the f.a. actually differs to a fair degree:

oldgerman

There were some more characters that had some percentages, but I could not easily match them to current German; that is why the table on the left shows some blanks. The graph, however, displays up to the row with the u-umlaut.

2016: total count table, sorted

letter   counted
o        28712
e        25054
h        22317
a        18117
y        17565
i        15078
d        13682
c        13664
k        12905
l        11025
t        8373
r        7692
s        7324
n        6142
q        5427
p        1727
m        1123
f        528
*        166
g        88
x        35
v        2
z        2
