Words as tokens 1

Reading Nick’s mail and posting made me reflect on the path of my investigations. Here is Nick Pelling’s posting: http://ciphermysteries.com/2016/10/08/voynichese-letters-vs-glyph-groups. And here is a related posting.

So in my paperwork this was called Nick’s Tokens, and although the idea is his, the investigation here is 100% mine; hence the name of this page, “Words as Tokens”.

The aim is to define the text based on specific text strings, called tokens.
We want an optimal set of tokens, as minimal as possible.

I took the total CAB text (incl. Rosettes), changed Sh into c5h and then into cbh, because numbers will have a special meaning later on.
The total set of unique words was harvested and sorted on highest repeat first.
Words with a * in front or at the end were removed.
Words of length 1 were removed.
We now have 8188 unique words with a highest repeat of 866 times for ‘daiin’ and a low repeat of 1 for ‘ypchocpheosaiin’.
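To make the preprocessing concrete, here is a minimal Python sketch of these steps. The file name voynich_cab.txt and the word delimiters are assumptions; only the steps themselves (Sh to cbh, dropping words with a * at the front or end, dropping 1-letter words, sorting the unique words on repeat) come from the description above:

from collections import Counter

# Hypothetical input file: the CAB transliteration (incl. Rosettes) as plain
# text, words separated by periods or spaces.
with open("voynich_cab.txt", encoding="utf-8") as f:
    text = f.read()

# 'Sh' is rewritten (via 'c5h') as 'cbh', because digits get a special
# meaning later on.
text = text.replace("Sh", "cbh")

# Harvest the words and count their repeats.
words = text.replace(".", " ").split()
counts = Counter(words)

# Remove words with a '*' at the front or the end, and words of length 1.
counts = Counter({w: n for w, n in counts.items()
                  if not w.startswith("*") and not w.endswith("*") and len(w) > 1})

# Unique words, sorted on highest repeat first.
wordlist = [w for w, _ in counts.most_common()]
print(len(wordlist), "unique words; most repeated:", wordlist[:3])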

Then the words were split up into groups according to their length.
Each group counts as 100% by itself, and the top words of each group were used.

The top words of each group behave a bit differently each time, because there are many words in the groups of length 3, 4 and 5 but fewer in groups 6, 7 and 8; in group length 9, for example, the top word ‘cbhodaiin’ has a repeat of only 23 occurrences.
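A sketch of how these length groups and within-group percentages can be computed; the counts dictionary here is a small stand-in for the real repeat counts (only the 866 for ‘daiin’ is an actual figure from this page):

from collections import Counter, defaultdict

# Stand-in for the repeat counts harvested above.
counts = Counter({"daiin": 866, "chedy": 500, "ol": 450, "cbhodaiin": 23,
                  "qopchedy": 80, "dl": 35, "cbhedaiin": 10})

# Group the unique words by length; each length group counts as 100% by itself.
groups = defaultdict(list)
for word, n in counts.items():
    groups[len(word)].append((word, n))

for length, members in sorted(groups.items()):
    total = sum(n for _, n in members)
    members.sort(key=lambda wn: wn[1], reverse=True)
    top_word, top_n = members[0]
    print(f"length {length}: {len(members)} words, top word '{top_word}' "
          f"= {100 * top_n / total:.1f}% of its group")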

In group length 10 the highest repeated word has a repeat of 5, at a very low word-group percentage (only 1.5%), so that group is skipped altogether.

ex_group10

We start by looking at the unique words in group length 9.
The top 10% of that group is compared to the entire list of 8187 unique words.

Run 1: first big tokens, then smaller

The top words, down to a lowest repeat of about 20, were taken and replaced in the list. Every such word is assigned a capital letter plus a number, and that code is called the token. The group percentage is the sum of the percentages of those words within their length group.

Starting from the bigger words (length 9) and working towards the smaller words, we check whether they are absorbed by the VMS text.

token-summary-table

First the bigger tokens are checked, for example ‘qopchedy’ = G5, and only at the end do the tokens of length 2 get a go, for example ‘dl’ = A12.
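A hedged sketch of this longest-first substitution. The mini token table is made up, so the output will not match the published codes; it only illustrates the mechanism of big tokens being replaced before the 2-letter ones get their turn:

import re

# Made-up mini token table (token string -> code); the real letter/number
# assignment of the run is not reproduced here.
tokens = {"qopchedy": "G5", "cbh": "B14", "ar": "A3", "or": "A2", "dl": "A12"}

def tokenize(word, tokens):
    """Replace token strings inside a word, longest tokens first.
    The codes use capitals and digits, which never occur in the EVA words,
    so an already placed code cannot be hit again by a shorter token."""
    for tok in sorted(tokens, key=len, reverse=True):
        word = word.replace(tok, tokens[tok])
    return word

tokenized = tokenize("tchoarorcbhy", tokens)
# A word is fully absorbed when no lowercase letters are left over.
print(tokenized, "fully absorbed:", not re.search(r"[a-z]", tokenized))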

Results of run 1

It is quite remarkable to see how many words have been hit by the tokens. This means that many words are subject to the rules defined by the 297 words, or tokens.
That set defines the entire VMS text.

The highest number of tokens found within one word is only 4. They are:

7393 -> aldalosam -> A5B2A9A6

8262 -> tchoarorcbhy -> tB14A3A2C15

The numbers on the left are the positions in the original, repeat-sorted word list of >8000 words; position 1 = ‘daiin’, the highest repeated word.

Unfortunately, not all words have been tokenized. As you can see, 1097 words have no hit, which is 13% of the total number of words. Below you see the amounts per word length.

run1-result-graph

It is quite interesting to see that so many 2-letter words are not a token; or is that because bigger words already hold those letters? For example, untokenized 2-letter words are ty, do, lo, ly, ro, but also da, ot, an, oy, in, etc. ‘ty’ exists in ‘dary’ and ‘ary’; ‘do’ exists in ‘dol’ and ‘dor’.
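That containment question can be checked mechanically; a small sketch with stand-in data:

# Stand-in word list and a few of the untokenized 2-letter words mentioned above.
wordlist = ["daiin", "dary", "dol", "dor", "chedy"]
untokenized_2 = ["do", "da", "lo"]

# For each untokenized 2-letter word, list the longer words that contain it.
for short in untokenized_2:
    hosts = [w for w in wordlist if len(w) > 2 and short in w]
    print(short, "->", hosts or "not contained in a longer word")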

absorption-run1

It is remarkable to see that the tokens of length 2 have a high absorption (blue dotted graph), while the number of tokens of that length responsible for it is very small: twelve. As a matter of fact, those are all the tokens of that length.

Of course we want an optimal set of token words, as minimal as possible.
The set in run 1 holds 297 tokens and left 1097 words untokenized.

Run 2: sole impact of the twelve

Here we want to see the sole impact of some groups.

The first test is the group of twelve 2-letter words (ol or ar dy al am sy qo os ky om dl).
Used on its own, this group leaves 35% of the words untokenized, which means the remaining 65% are absorbed.
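A sketch of how the sole impact of such a group can be measured, where “absorbed” is read as: the word contains at least one of the twelve strings (the word list here is a tiny stand-in for the full list of unique words):

# Stand-in word list; in the real test this is the full list of unique words.
wordlist = ["daiin", "chedy", "qokeedy", "olkeedy", "ardy", "cbh", "keey", "ees"]
twelve = ["ol", "or", "ar", "dy", "al", "am", "sy", "qo", "os", "ky", "om", "dl"]

absorbed = [w for w in wordlist if any(t in w for t in twelve)]
untokenized = [w for w in wordlist if not any(t in w for t in twelve)]
print(f"{100 * len(untokenized) / len(wordlist):.0f}% untokenized,"
      f" {100 * len(absorbed) / len(wordlist):.0f}% absorbed")
print("untokenized words:", untokenized)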

On visual inspection some patterns in the untokenized words can be spotted at once, such as ‘ai  aii  cbh daiin keey ees’.


Run 3: all 2-letter tokens

All the 2-letter tokens were now thrown in. There are 78 of them, after removal of some dubious ones.

Now only 260 words (3%) are untokenized.

tokenized-words-run3

On visual inspection, it is especially the sequences ‘air, ai, cbhc’ that do not catch a token.

A quick look at the absorption shows only a single (self) hit for ‘gm, vo, vs’ at the bottom of the usage, and high usage for ‘ch, ol, dy’. Those three are good for 34% of the absorption of the tokens.

90% of the total token coverage is performed by 23 of the 2-letter tokens:

ch ol dy al in ar or qo ey ok ot da od te am yk os oc es op ky ek ka
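One way to arrive at such a 90% figure is to sort the tokens on their hit counts and take the shortest top slice whose hits add up to 90% of all hits; a sketch with made-up hit counts:

# Hypothetical hit counts per 2-letter token (a dry-run style count of how
# many words each token hits); the numbers are made up.
hits = {"ch": 900, "ol": 850, "dy": 700, "al": 400, "ok": 300, "gm": 1, "vo": 1}

def coverage_prefix(hits, target=0.90):
    """Smallest set of top tokens whose summed hits reach `target` of all hits."""
    total = sum(hits.values())
    picked, running = [], 0
    for tok, n in sorted(hits.items(), key=lambda kv: kv[1], reverse=True):
        picked.append(tok)
        running += n
        if running >= target * total:
            break
    return picked

print(coverage_prefix(hits))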

 

Run 4: everything

Now the decision was made to throw everything at the wordlist:

  1. all the words themselves, lengths 2, 3, 4, 5, 6, 7 and 8 (a total of 7178 word-tokens)
  2. the 2-grams and the 3-, 4-, 5-, 6- and 7-grams (a total of 9382 word-tokens)

Putting this to the test will show, by natural selection, what will survive of the 16560 possible tokens.
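A sketch of how such a candidate pool could be built; the exact n-gram harvesting used in the run is not spelled out here, so this simply takes every substring of length 2 to 7 of every unique word (wordlist is a tiny stand-in):

# Stand-in word list; in the real run this is the full list of unique words.
wordlist = ["daiin", "qokeedy", "chedy", "qopchedy", "ol"]

def ngrams(word, n):
    """All substrings of length n occurring in the word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

# 1. the words themselves, lengths 2..8
word_tokens = {w for w in wordlist if 2 <= len(w) <= 8}

# 2. every 2- to 7-gram occurring inside the unique words
gram_tokens = set()
for w in wordlist:
    for n in range(2, 8):
        gram_tokens |= ngrams(w, n)

print(len(word_tokens), "word-tokens and", len(gram_tokens), "n-gram tokens")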

After the run it is not really a surprise to see that the longest word, and the last one in the repeat-sorted list of unique words, ‘ypchocpheosaiin’, has been hit 79 times by tokens.

There are 3 words not really hit, except by themselves: ‘vs, vo, gm’.
If they were converted to tokens they would be identified as ‘A59, A62, A74’.

However, there is now a problem: because everything hits everything, there is no realistic way of telling which tokens are the “best”.

Let me first take care of the problem of doubles: because the found n-grams contain the word-tokens as well, there is a need to split them up and remove the doubles.
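Removing those doubles then comes down to simple set operations; a sketch based on the candidate pool above, with stand-in sets:

# Stand-in sets; word_tokens and gram_tokens come from the candidate pool above.
word_tokens = {"daiin", "chedy", "ol"}
gram_tokens = {"da", "ai", "iin", "chedy", "ol", "ched"}

doubles = gram_tokens & word_tokens        # n-grams that are also whole words
all_tokens = word_tokens | gram_tokens     # the union keeps each token once
print(len(doubles), "doubles removed,", len(all_tokens), "tokens remain")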

The 16560 tokens became 13775 tokens in total with the following distribution:

absorption-run4

The main question is: which tokens should we use?
If we use the top tokens, that is no guarantee of 100% coverage.

If we use the small tokens first, we are left with untokenized partial words such as
daiin -> d J4 A44 or qokeedy -> q A31 J20 A4.

Of the 13775 tokens there are 4321 that are used only once, which means they only had a hit on themselves. They are mostly 8, 7 and 6 characters long (3536 pieces), but there are also length 5 (551), length 4 (195) and some smaller ones. The smallest ones with a single hit are ‘gm’ and ‘ax’. These are now removed.

There are 3508 tokens with only two hits, 1497 with three hits, 892 with four, 621 with five, 98 with six, and so on. The problem is that only the top tokens hit very often; the rest hit almost nothing:

run4_absorp-flatliner

If the longest word-token is any indication, it is almost certain that any token of length 8 is senseless. Looking for that length in the current list shows ‘chedaiin’ = G4 with a usage (read: a kind of absorption) of only 12 hits.

The deviation of the absorption values was taken, together with some visual aids:

run4_dev1

If we divide the token absorption of the current list of 5947 tokens into a high part and the rest:

  • high part: the top 60 tokens, responsible for the absorption of 33% of the words
  • middle: tokens 60-150, responsible for the absorption of 15% of the words
  • rest: tokens 150 until the end, forming the absorption of the remaining 52%,
    ending in an absorption of 3 words.

So, the numbers are not so low that “the rest” can be neglected.
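The split can be reproduced by accumulating the absorption over the rank-sorted token list; a sketch with a made-up, Zipf-like stand-in for the real hit counts (the percentages it prints will therefore not match the 33/15/52 above, only the mechanism is shown):

# Stand-in: a heavily skewed list of hit counts for 5947 tokens, sorted from
# highest to lowest (the real per-token counts are not reproduced here).
absorption = [int(2000 / (rank + 1)) + 1 for rank in range(5947)]

def share(hits, start, stop):
    """Percentage of all hits contributed by the tokens ranked start..stop-1."""
    return 100 * sum(hits[start:stop]) / sum(hits)

print(f"high (top 60):       {share(absorption, 0, 60):.0f}%")
print(f"middle (60-150):     {share(absorption, 60, 150):.0f}%")
print(f"rest (150 till end): {share(absorption, 150, len(absorption)):.0f}%")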

 

Looking at the token length over this same list:

run4_tokenlength

Just to stay focussed:

‘daiin’ is at about position 124, with an absorption of 170 and a deviation of 3.

Below are separate absorption ranges  with respect to the length of the tokens.

Based on their absorption, it should be possible to see whether the token length is an indication of a badly absorbed token, or whether there is no specific “hard” relation.

The first occurrences of the separate lengths in the rank-sorted token list are:

  • length 3 at position 8, 1230 hits: cbh, followed by che and iin
  • length 4 at position 32, 576 hits: cbhe, followed by aiin and cheo
  • length 5 at position 123, 170 hits: daiin, followed by cbheo
  • length 6 at position 282, 72 hits: odaiin, followed by cbhedy
  • length 7 at position 625, 31 hits: hodaiin, followed by cheodai
  • length 8 at position 1426, 12 hits: chedaiin

That seems quite interestingly focussed on -daiin-.
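Such first-occurrence positions per length are easy to extract from the rank-sorted token list; a sketch in which the hit counts are taken from the list above but the positions in the stand-in list are not the real ones:

# Stand-in for the rank-sorted (token, hits) list of this run.
ranked = [("ol", 2500), ("ch", 2400), ("cbh", 1230), ("cbhe", 576),
          ("daiin", 170), ("odaiin", 72), ("hodaiin", 31), ("chedaiin", 12)]

first_by_length = {}
for position, (token, hits) in enumerate(ranked, start=1):
    length = len(token)
    if length not in first_by_length:
        first_by_length[length] = (position, token, hits)

for length, (position, token, hits) in sorted(first_by_length.items()):
    print(f"length {length}: first at position {position}, {hits} hits ({token})")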
Let’s split the table up into the different token lengths.

 

211table_with_lengths

You can clearly see that the bigger the length, the smaller the absorption hit counts are. That is only to be expected, because a small token will get more hits than a bigger one. The impact can be seen in the following table:

211-per-length

You can also see that the impact of the 163 hits of length 8 can be neglected.

Oh, and before you are totally into this page and understand this table: this is not a true representation of what happened in this run, because this was a ‘dry’ run, a fake run.

The tokens were compared to the words but nothing was touched; no words were replaced by tokens. By using this “dry run” we can see how much damage the tokens could do, as a theoretical maximum. In reality, once a word is “hit” by a token and absorbed 100% by it, it cannot be hit again. That is the opposite of this dry run, where hitting a word is possible over and over again.
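The difference between the dry run and a “real” run can be made explicit in code; a sketch in which the dry run counts every possible hit, while the real run lets the first (longest) matching token consume the word so it cannot be hit again. The longest-first order in the real run is an assumption, one possible reading of the rule described above:

def dry_run(words, tokens):
    """Count, per token, how many words it could hit; no word is ever consumed,
    so a word can be hit over and over again."""
    return {tok: sum(tok in w for w in words) for tok in tokens}

def real_run(words, tokens):
    """Let each word be absorbed by the first matching token (longest first);
    once a word is hit, it is consumed and cannot be hit again."""
    hits = {tok: 0 for tok in tokens}
    untokenized = []
    for w in words:
        for tok in sorted(tokens, key=len, reverse=True):
            if tok in w:
                hits[tok] += 1
                break                    # word consumed, stop trying other tokens
        else:
            untokenized.append(w)
    return hits, untokenized

words = ["daiin", "odaiin", "chedy", "qokeedy", "gm"]
tokens = ["daiin", "chedy", "dy"]
print("dry :", dry_run(words, tokens))
print("real:", real_run(words, tokens))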

Now that this can be seen, we will have to locate the unique words & n-grams inside those tokens, in order to get the minimum set.

This means:

  • the table with tokens (A, B, etc.) & n-grams (J, K, L, etc.) is sorted on length
  • coloring is applied to differentiate between ngrams (green) and tokens (white)
  • coloring is applied to see which tokens have a high absorption (orange) and which low (white)

tokens_colors

Now, per length, the columns are sorted in alphabetical order, and for each possible start letter every possible token (5947!) is placed in a token tree manually.

A small piece of the first token tree:

tokentree_a
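The token tree built by hand here is essentially a prefix tree; purely as an illustration of that structure, a sketch of how such a tree could be built and printed programmatically:

def build_token_tree(tokens):
    """Nest the tokens letter by letter; a '#' entry marks a complete token."""
    tree = {}
    for tok in tokens:
        node = tree
        for letter in tok:
            node = node.setdefault(letter, {})
        node["#"] = tok
    return tree

def print_tree(node, indent=0):
    for key, child in sorted(node.items()):
        if key == "#":
            print(" " * indent + f"<{child}>")
        else:
            print(" " * indent + key)
            print_tree(child, indent + 2)

# A few tokens starting with 'a', purely as an illustration of the structure.
print_tree(build_token_tree(["al", "am", "ar", "ain", "aiin"]))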

Although we already know some are very small, nevertheless everything will be done up to token length 6 (that length will only be used partly in the token tree).

 

Continue here to 2 >>
