Words as tokens 2

On the previous page 1 the “word tokens” and “ngram tokens (green color)” were merged into one big pool of tokens.

On this page:

  • the link to the dendrogram sources
  • startletters discussed: which trees do we need to look at


All hierarchical (token) trees are published and placed here:


In the files you will find the following sheets:

  • run4classes sheet: this sheet shows the tokens. On the left are the lengths 2 up to 8, sorted per length in alphabetical order.
    On the right you find exactly the same information, but there the columns are sorted by usage.
  • sheets a to y show the hierarchical trees of the tokens: the dendrograms of the letters in the Voynich manuscript
  • the last sheet contains ‘tokens not in tree’: tokens that could be integrated in the future. At this point they were of little significance.

The coloring:


A last note on the difference between “word tokens” and “ngram tokens”. As you can see, the ngram tokens are artificial tokens: they do not exist as words in their own right, but are defined by the software as a part of a word that is of importance. The “word tokens” are real words in the text: they exist as presented and were upgraded to tokens as explained in the text on this page.

The term “tokens” is the group name and refers to both flavours in general.
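The distinction can be sketched in a few lines of Python. This is only an illustration under a simplifying assumption: a token counts as a “word token” when it occurs as a complete word, and as an “ngram token” when it only occurs inside other words. The real software also judges whether a part of a word is of importance, which is not reproduced here. The function name classify_token and the token ‘pch’ are chosen for this example only.

```python
def classify_token(token, words):
    """Return 'word token', 'ngram token' or 'not found' for a token.

    Simplified assumption: a word token exists as a complete word,
    an ngram token only exists as a part of other words.
    """
    if token in words:
        return "word token"
    if any(token in w for w in words):
        return "ngram token"
    return "not found"


# Words taken from examples elsewhere on this page.
words = ["chap", "chapchy", "pchedy", "pchey", "qofchedy", "cheam"]

print(classify_token("cheam", words))  # occurs as a complete word
print(classify_token("pch", words))    # occurs only inside other words
```

With a full transcription as the word list, the same check tells the two flavours apart for any candidate token.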


background read more:
hierarchical clusters https://en.wikipedia.org/wiki/Hierarchical_clustering
dendrogram https://en.wikipedia.org/wiki/Dendrogram


The following dendrograms are in the 21 sheets, based on the start letter:

a b c d e f g h i k o l m n p q r s t x y

Which dendrograms are relevant?

The letter b is a fake letter: it always lies between c and h (c_h), and checking the dendrogram shows that this is indeed true. cbh is always the correct configuration for that letter.
The letters g and x show only a handful of hits (as first letter) and are not of interest. The same goes for the dendrograms on first letter m and n. The remaining 16 letters will be discussed now:  a c d e f h i k o l p q r s t y.
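The claim about the letter b can be checked mechanically. The sketch below assumes the transcription is available as a plain list of words; the function name misplaced_b is made up for this example, and ‘bol’ is an invented counterexample, not a word from the manuscript.

```python
import re

# A 'b' is misplaced when it is not preceded by 'c' or not followed by 'h'.
MISPLACED_B = re.compile(r"(?<!c)b|b(?!h)")


def misplaced_b(words):
    """Return the words in which a 'b' occurs outside the sequence 'cbh'."""
    return [w for w in words if MISPLACED_B.search(w)]


# Two words quoted on this page, plus one invented counterexample ('bol').
words = ["cbhapchedyfeey", "xasacbhe", "bol"]

print(misplaced_b(words))
```

An empty result on the full word list would confirm that cbh is the only configuration in which b occurs.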

Dendrogram h

The most obvious question here is:  can this dendrogram be laid over the dendrogram cbh?

The letter h is not a real startletter of any word (two occurrences were detected, which are flaws); it is always present at Bpos (second letter) at the earliest, or at a higher position in the word.
That immediately rules out the letter existing on its own, so it can be ignored as well. 15 sheets remain.



Dendrograms on letters which are not startletters 

In fact any letter that is not an Apos (startletter) can be ignored, because all parts of its dendrogram will occur in the dendrograms of the other letters. But such a dendrogram might give us a quick insight into the structure of the tokens for that letter!

Any of the letters shown here in descending order of usage (c o q y d l t s k p a r e f i x g h m v z) can be the startletter of a word; b and n cannot.

Words of two letters or longer are taken into account here, not “words” of 1 letter.
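Counting startletters under this rule could look like the sketch below. It assumes the words are available as a plain list; the sample is a small selection of words quoted on this page, so the counts say nothing about the manuscript as a whole. The function name start_letter_counts is chosen for this example.

```python
from collections import Counter


def start_letter_counts(words, min_length=2):
    """Count first letters, skipping 'words' shorter than min_length."""
    return Counter(w[0] for w in words if len(w) >= min_length)


# Illustrative sample built from words quoted on this page,
# plus one single-letter "word" that has to be skipped.
sample = ["zepchy", "haiin", "mol", "mar", "gal", "xar", "xor", "xol", "v"]

counts = start_letter_counts(sample)
for letter, n in counts.most_common():
    print(letter, n)
```

Letters with only one or two hits in such a count are the candidates for a closer look at the page images.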

The letter z

For z the single word is zepchy (f17r).

* is this letter placed intentionally?
* if not, what letter would it have been?

It cannot be the letter k, because then the word would have been kepchy, and that cannot exist within the Voynichese ruleset: ke cannot exist as a word, and kep as a word is also impossible.

The most obvious answer is that it should have been an -epchy word, like the only variations presented:


That is not the only word that starts with the letter z, but the second z, on page 58r, is obviously a mistake.   https://www.jasondavies.com/voynich/#f58r/0.256/0.477/5.00

This probably means that both z occurrences are flaws.

The letter h

Furthermore, the letter h occurs 2 times as startletter: as h*s and as haiin (1x on f26r). The latter can be seen as a flaw.



The letter v

The letter v only occurs on 57v as a single character, not as a character in a sentence or a word.


The letter u

This seems to exist only in ‘uochs’, which was a transcription of the spiralling text coming out of the tower on the upper right rosette:

ros.f86r6.C10 :   uochs.oeteey.osar.aram.askeeody.ochdor.al.oekairy.ytodaro.opalshy.

Clearly this u is wrong and will be removed.
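Finding such a stray letter in a transcription line can be sketched as follows. The line format (a locus, a colon, then words separated by dots) is inferred from the single line quoted above and is an assumption; the function name parse_line is made up for this example.

```python
def parse_line(line):
    """Split 'locus : word.word.word.' into (locus, [words])."""
    locus, _, text = line.partition(":")
    words = [w for w in text.strip().split(".") if w]
    return locus.strip(), words


line = ("ros.f86r6.C10 :   uochs.oeteey.osar.aram.askeeody."
        "ochdor.al.oekairy.ytodaro.opalshy.")

locus, words = parse_line(line)

# Flag every word that contains the doubtful letter u.
suspect = [w for w in words if "u" in w]

print(locus)
print(suspect)
```

The flagged words can then be checked against the page image before the letter is removed.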

The letter m

M occurs as ‘mol’ (1x on f17r)


and ‘mar’ (1x on f23v)


Both words can be seen as statistical anomalies. To me they represent errors in these cases, and the m must have been a Voynichese d, which looks very similar in the written characters.

mol -> dol
mar -> dar


The word mar is also used in the word ‘cheamar’ (1x on f111r)



Here the author did not write the m-character clearly, and perhaps he wrote two words: cheam and ar.

The word cheam already occurs 5 times, which makes it plausible that this was the intention.
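The same reasoning can be applied to other one-off words: try every split position and keep the splits where both halves are already known words. The sketch below assumes a dictionary of word counts. The count for cheam (5) comes from the text above; the count for ‘ar’ is a placeholder and not a measured value. The function name plausible_splits is chosen for this example.

```python
def plausible_splits(word, counts, min_count=1):
    """Return (left, right) pairs where both halves are known words."""
    splits = []
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if counts.get(left, 0) >= min_count and counts.get(right, 0) >= min_count:
            splits.append((left, right))
    return splits


counts = {
    "cheam": 5,  # from the text above
    "ar": 1,     # placeholder, NOT a measured count
}

print(plausible_splits("cheamar", counts))
```

Raising min_count filters out splits that lean on rare words, which may themselves be errors.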

Conclusion: m does not occur as the first character of a word.


The letter g

These are the words that have the g as startletter

gm  gal  geedy  gaiin  ge**eem


gm (1x on f111v)

These two characters are not written very clearly, but they resemble gm.

gal (1x on f26r)

This word should probably have been dal. Look three lines below, just one word to the left. There you see three 8’s (VMS d’s) beneath each other. Compare the last two, with the stroke of the pen to the bottom, with this g in gal.

geedy (1x on 26r)

It seems that on this page the d is often written as g. If you look two lines up again there is another g which is very doubtful: it should have been a d in that word.

gaiin (1x on 36r)

The character g is tilted to the right and does not really resemble a g, nor does it resemble any other letter. Perhaps the writer intended to write cSh ?

It is difficult to say that the letter g does not exist as startletter, because the evidence is not decisive.
However, in many cases it is very doubtful that the character g was intended. In most cases it is probably the letter d.

The letter g occurs only about 63 times in the total text, 52 of them as the last letter of a word.
Let’s ignore this letter as startletter in the token research.


The letter x

As start of a word this letter can be found in:

xar xor xol xsl xdar xoiin xoltedy xaloeees xasacbhe

xar (2x f112r and f55r) also  as part of a word 6 times (f111r, f66r, f112r…)



This already shows clearly, visually, that this letter is a real start character.



Alright, we now only need to assess the dendrograms for the 15 letters:

a c d e f i k o l p q r s t y


for example:

Dendrogram p

This seems rather straightforward for po (161 hits) and py (47 hits).

We also clearly need ap for the 3 words chap (1x), chapchy (1x),  cbhapchedyfeey (1x). Unfortunately, that leaves the word ‘apy’ (1x) partially unsolved.

We see a lot of chedy and variants such as qofchedy, pchedy and pchey, which need a p in front and a dy or y at the back.
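The “p in front, dy or y at the back” pattern can be written down as a regular expression. This is only a sketch of the idea and not how the dendrograms were built; the function name split_p_word and the three-part split are assumptions made for this example.

```python
import re

# p in front, a non-empty core, and dy or y at the back.
P_WORD = re.compile(r"^p(.+?)(dy|y)$")


def split_p_word(word):
    """Split a word into ('p', core, ending), or return None if it does not fit."""
    match = P_WORD.match(word)
    if match is None:
        return None
    return ("p", match.group(1), match.group(2))


for word in ["pchedy", "pchey", "qofchedy", "py"]:
    print(word, split_p_word(word))
```

Words like qofchedy do not fit this pattern because they do not start with p, so they would need their own front token.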


Continue here  to 3 >>