Word classes Ocay!

The words in the VMS can be put in categories in many ways, for example based on length, on n-gram, or on how often they repeat (the more repeats, the higher the class). One of the simplest ways is to sort them alphabetically, but that does not bring us any closer to decryption. What could bring us closer, if any cryptographic element is present, is finding an order in the mess.

EVA characters sorted from the highest number of occurrences to the lowest (see the “language dna” graph) are:

o e h a y i d c k l t r s n q p m f

high rank ———————– low rank

If we try to define a word class, the most obvious way is to use one or more of these characters. Eventually we should perhaps use pairs or triples of characters as well.

If you take the “full CAB txt incl Ros.” text and want to define a word class with a minimum number of characters, that definition must be simple. So I split up the text and replaced SH by C5H. Then I parsed the words: every time a high-rank letter occurs in a word, I stop and count it.


Suppose the letter sequence is “ohxxx” and the VMS word is ‘ehola’.
We process the word from left to right, so the first letter we check is the ‘e’, which is not in the sequence.
Then comes the ‘h’, which gives a hit on h (h=1). Further processing of the word, at the ‘o’, does not change the count of ‘o’ (o=0), because only the first matching letter gets a count.

So there is only 1 point per word: the first letter that hits wins the point.
This means the order of the letters in the tested sequence is not relevant, because every letter of the word is matched against the whole sequence.
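
The counting rule can be sketched in a few lines of Python. This is only a minimal illustration of the rule as described, not the program actually used; the function names `first_hit` and `count_first_hits` are made up for this example.

```python
from collections import Counter

def first_hit(word, letters):
    """Return the first character of `word` (left to right) found in `letters`, or None."""
    for ch in word:
        if ch in letters:
            return ch
    return None

def count_first_hits(words, letters):
    """One point per word, given to its first matching letter; None collects words without a hit."""
    return Counter(first_hit(w, letters) for w in words)

# The example from the text: sequence "ohxxx" and word 'ehola', plus one word without a hit.
counts = count_first_hits(["ehola", "dl"], "ohxxx")
print(counts["h"], counts["o"], counts[None])  # 1 0 1
```

Because each word is scanned letter by letter and every letter is compared with the whole sequence, `first_hit("ehola", "oh")` and `first_hit("ehola", "ho")` give the same result.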

It looks like this:


I was not immediately satisfied with this: first, all combinations of the high-ranked letters had to be tested to find the optimal configuration.

In the end, the best combination for 5 letters turned out to be: o c a y e

and for 4 letters:  o c a y
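
A brute-force search over letter combinations could look like the sketch below. The text does not say how a configuration was scored, so this assumes one plausible criterion: the number of words that contain at least one letter of the set. The names `RANKED`, `coverage` and `best_sets` are hypothetical, and the word list is a toy sample built from words quoted in this post.

```python
from itertools import combinations

RANKED = "oehayidckltrsnqpmf"  # the rank order quoted above

def coverage(words, letters):
    """Assumed score: number of words containing at least one letter from `letters`."""
    chosen = set(letters)
    return sum(1 for w in words if chosen & set(w))

def best_sets(words, size, candidates=RANKED):
    """Score every `size`-letter combination; return (best_score, list_of_best_sets)."""
    scored = [(coverage(words, combo), "".join(combo))
              for combo in combinations(candidates, size)]
    top = max(score for score, _ in scored)
    return top, [s for score, s in scored if score == top]

# Toy input only; the real run would use the full word list.
words = ["ehola", "dl", "keed", "teed"]
print(best_sets(words, 1))  # (3, ['e', 'd'])
```

With 18 candidate letters there are only 8568 combinations of 5 and 3060 of 4, so an exhaustive test over the full text is cheap.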


With these 4 letters, O C A Y, you can make a quick classification of all 37608 words: every word contains at least one of these letters, except for 170 (small) words.

The unclassed words (under the 4-letter classification) are:

word -> repeated x times
dl -> 20
lr -> 13
ls -> 10
eees -> 9
lkl -> 9
ees -> 6
ld -> 4
lkeed -> 3
dm -> 2
ds -> 2
in -> 2
keed -> 2
kl -> 2
rl -> 2
teed -> 2
lg -> 2
iir -> 2

and then some more words that occur only once. Most (at least 52) words contain an ‘e’.
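
A list like the one above can be produced with a short script. This is a sketch under the assumption that the text is already split into words; `unclassed` is a hypothetical helper name and the input is a toy sample.

```python
from collections import Counter

def unclassed(words, letters="ocay"):
    """Count the words that contain none of `letters`, most frequent first."""
    chosen = set(letters)
    leftovers = Counter(w for w in words if not chosen & set(w))
    return leftovers.most_common()

# Toy input only; 'ehola' contains o and a, so it is classed and drops out.
sample = ["dl", "ehola", "dl", "lr", "eees"]
rows = unclassed(sample)
print(rows)  # [('dl', 2), ('lr', 1), ('eees', 1)]

# How many distinct unclassed words contain an 'e'?
print(sum(1 for word, _ in rows if "e" in word))  # 1
```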

Only today I noticed another remarkable thing when I looked at the letter counts derived from the “language dna”. I made a quick graph to show it, but it is not very clear.


To see it better, here is the “zoomed2” graph, for every possible “letter match”.


Below are the total counts per letter over the entire text. For each letter, the difference with the nearest letter total is taken (delta) and expressed as a percentage.

So what you see is that the difference in total count between neighbouring letters is in fact very small.
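
The delta calculation can be sketched as follows. Two things are assumptions here, since the text does not spell them out: the “nearest” letter is taken to be the next lower one in rank order, and the percentage is taken relative to the larger of the two totals. `letter_deltas` is a hypothetical name and the input is a toy sample.

```python
from collections import Counter

def letter_deltas(words):
    """Return (letter, total, delta, delta_pct) rows, highest total first.

    delta is the difference with the next lower total (assumption);
    delta_pct is that difference as a percentage of the letter's own total (assumption).
    The lowest-ranked letter has no lower neighbour and gets no row.
    """
    totals = Counter("".join(words)).most_common()
    rows = []
    for (letter, total), (_, lower) in zip(totals, totals[1:]):
        delta = total - lower
        rows.append((letter, total, delta, 100.0 * delta / total))
    return rows

for letter, total, delta, pct in letter_deltas(["eees", "dl"]):
    print(letter, total, delta, round(pct, 1))
```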

If any order of letters is defined, it is probably the order displayed above:

Of course, all of this could tell us nothing if the manuscript is incomplete and many pages are missing.


Another method is shown here:

The simplest grouping can be done by an [a]  & [o] check

About 25% of the words contain neither an [a] nor an [o].

Y => yes: the count of words on a line that contain an [a] or an [o]
N => no: the count of words on a line that contain neither

What does it mean?

It means that we checked whether an [a] or an [o] is in a word.
If so, we count Yes=+1 for that word. If neither is present, then No=+1 for that word.

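The counts give 7717/29891 = 25.8%, so it seems that almost 26% of the words (that is about 1/4) contain neither an [a] nor an [o]. Note that 29891 + 7717 = 37608, the total word count used above; if 29891 is the Yes count rather than the total, the share of all words without an [a] or an [o] is 7717/37608 = 20.5%, about 1/5.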
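
The Yes/No check itself is simple to reproduce. Below is a minimal sketch, with `ao_check` and `pct` as hypothetical helper names and a toy word list; only the last line uses the two counts quoted above.

```python
def ao_check(words, letters="ao"):
    """Return (yes, no): words with at least one of `letters`, and words with none."""
    yes = sum(1 for w in words if any(ch in w for ch in letters))
    return yes, len(words) - yes

def pct(part, whole):
    """Percentage of `part` relative to `whole`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)

# Toy input only: 'ehola' contains both letters, the other three contain neither.
yes, no = ao_check(["ehola", "dl", "keed", "teed"])
print(yes, no)  # 1 3

# The ratio quoted in the text.
print(pct(7717, 29891))  # 25.8
```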