Words as tokens 3
On the previous page you can find words as tokens 2, the source links and more.
Here we are investigating the dendrograms for the 15 letters: a c d e f i k o l p q r s t y
Any Dendrogram
The small tokens can be analysed quickly in any tree by looking at three key parts:
- cbh
- dy
- ai, al, ain, air and variants
If we see one of those parts on the right side of a word, we can be assured that the left part of that same word is at least one token, and sometimes two or more combined tokens.
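The check described above can be sketched in a few lines of Python. This is only an illustration: KEY_ENDINGS holds the key parts named in the list (the "variants" are not included), and the function name split_on_ending is hypothetical.

```python
# Key right-hand parts named in the text; longest first so "ain" wins over "ai".
KEY_ENDINGS = sorted(["cbh", "dy", "ai", "al", "ain", "air"], key=len, reverse=True)

def split_on_ending(word):
    """Return (left, ending) if the word ends in a key part, else (word, None)."""
    for ending in KEY_ENDINGS:
        # Require a non-empty left part, since that is the part we want to inspect.
        if word.endswith(ending) and len(word) > len(ending):
            return word[: -len(ending)], ending
    return word, None

for w in ["chain", "chair", "chcbh", "aiindy", "kchs"]:
    print(w, split_on_ending(w))
```

The left part that comes back is then the candidate for one or more tokens; whether it really is a token still has to be judged in the tree.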
Dendrogram c
Looking at dendrogram c, the first token cd is green. This means it is an ngram token and should be avoided if possible. The word tokens, the white unmarked ones, are preferred. For example, looking down the tree, the ch token with a huge usage of 2951 is a true candidate: on the right we see ch_ain, ch_air and also ch_cbh. Browsing through the list, we can quickly write down these tokens:
cbh, ckh, cth, cph, cfh and ch.
If we search for these in the unique word list, we get 1230, 240, 252, 117, 58 and 2951 hits respectively. The list remaining afterwards still contains these single letters: b: 0, h: 0, c: 3, k: 116, t: 139, p: 143, f: 51.
The three c hits come from the words ccbhe, cckhhy and cckheor (f100r, f68r1 and f13r), which are regarded as flaws and ignored.
The letters f, p, k and t always form a combination such as fchdy, fchol, fchor, fcholdy, pcheedy, pchdol, kchs, kchar, tch, tchd, tchty, tchom etc.
These letters now seem to stand alone as first-position letters (Apos), but we will see about that later.
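The counting step can be sketched as follows. This is a minimal sketch, not the actual tool used for the numbers above: the greedy left-to-right matching, the function name strip_tokens and the small sample list are all assumptions, and the sample only reuses a handful of words mentioned in the text.

```python
from collections import Counter

# The c-tokens written down above; three-letter tokens come before "ch".
TOKENS = ["cbh", "ckh", "cth", "cph", "cfh", "ch"]

def strip_tokens(word, tokens=TOKENS):
    """Greedy left-to-right scan: returns (token hits, leftover letters)."""
    hits, leftover, i = Counter(), [], 0
    while i < len(word):
        for tok in tokens:
            if word.startswith(tok, i):
                hits[tok] += 1
                i += len(tok)
                break
        else:
            # No token starts here, so this letter remains in the list.
            leftover.append(word[i])
            i += 1
    return hits, "".join(leftover)

# Tiny illustrative sample, not the full unique word list.
sample = ["fchdy", "pchdol", "kchar", "tchty", "ckhedy", "cthedy", "ccbhe"]
total_hits, total_left = Counter(), Counter()
for w in sample:
    hits, left = strip_tokens(w)
    total_hits += hits
    total_left += Counter(left)

print(dict(total_hits))
print({ltr: total_left[ltr] for ltr in "bhcktpf"})
```

In this sample the word ccbhe leaves a single c behind, which is the same kind of leftover as the three c hits described above.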
A possible extra token is che (1226) and then ch (1820). This gives a slightly better result.
The letter h now occurs four times at the end of these words:
cbh-ckh-h
o-ckh-h
so-cth-h
o-ckh-h
These four words each occur only once in the text and they are obviously a repetition of the letter h.
Dendrogram a
Immediately visible when looking at this tree are the double i’s.
We already chose cbh as a token, so let’s use that here. Found are: a-cbh, al-cbh, ar-cbh.
Let’s find dy: a-dy ai-dy aii-dy aiin-dy ail-dy air-dy airo-dy alch-dy al-dy alke-dy alo-dy aral-dy arch-dy ar-dy aro-dy
Glueing these into a progressive token list, we get an incomplete list:
a | al | ar | |
ai | ail | ||
aii | alo | airo | |
aiin | aro |
It is probable that if ail exists, air also exists, but let’s try these for now. We remove a from the list, run the search and view the results:
aiin: 497, airo: 32, aii: 163, ail: 22, alo: 106, aro: 87, al: 863, ar: 763, ai: 601
We see words like qokain that leave an n, so adding ain seems wise, also because it is indeed a token word. The same applies to am, air, as, an, ad and alo, and perhaps aly.
The words not covered with an ` are ak-ar ak-al ak-ain/air and 35 similar words.
Also added are ag (5x), ae (14x) and at (11x).
These tokens can now be grouped as:
Going through the other dendrograms visually, it becomes clear that most words have an even length and can be divided into 2-grams, although many have an odd length.
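The even/odd observation is easy to check on any word list. A minimal sketch, assuming a plain list of words; the helper names two_grams and even_share and the sample words are made up for illustration:

```python
def two_grams(word):
    """Split a word into consecutive 2-letter chunks; an odd word leaves 1 letter."""
    return [word[i:i + 2] for i in range(0, len(word), 2)]

def even_share(words):
    """Fraction of words with an even number of letters."""
    return sum(len(w) % 2 == 0 for w in words) / len(words)

sample = ["chdy", "chol", "daiin", "qokain", "ckhedy"]
for w in sample:
    print(w, two_grams(w))
print("even share:", even_share(sample))
```

Note that a blind split into 2-grams ignores the word tokens found so far (ckhedy becomes ck-he-dy here), so it only says something about length, not about where the real token boundaries are.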
Back to dendrogram c
With the word token cbh a lot of odd lengths can be solved, but there are also words that look like:
ngram – letter – ngram = 2 – 1 – 2
token – letter – token = x – 1 – x
This does not feel right, considering that most of those “letters that stand alone” are an e or an o. Examples of such words are cbh_o_ckh, cbh_o_ko, cbh_ede, ch_e_dy_t and ch_ee_oy.
Here the boundary between assumed tokens is marked with an underscore _.
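With the underscore notation, the length pattern of a word and its lone letters can be read off mechanically. A small sketch; the function names length_pattern and lone_letters are hypothetical:

```python
def length_pattern(segmented):
    """Lengths of the assumed tokens, e.g. 'cbh_o_ckh' -> [3, 1, 3]."""
    return [len(part) for part in segmented.split("_")]

def lone_letters(segmented):
    """The parts that consist of a single letter."""
    return [part for part in segmented.split("_") if len(part) == 1]

for w in ["cbh_o_ckh", "cbh_o_ko", "cbh_ede", "ch_e_dy_t", "ch_ee_oy"]:
    print(w, length_pattern(w), lone_letters(w))
```

Run over a full segmented word list, a tally of the lone letters would show directly how often they are an e or an o.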
A decision has to be made whether trying to make big tokens will be more fruitful than using smaller tokens and perhaps merging some of them later.
Because many words actually show their end form in dendrogram c, we can see which is the most plausible. For c, the lengths 6, 7 and 8 are all added, and these are the branches that have finished growing.
Only a few are still growing to length 9, but there are still a few problematic words: ch_e_daiin, ch_e_ckh_ey. Shorter words like cbh_ckh_o and cbh_ckh_h, with a one-letter ending, and also cbh_o_te make it difficult. Perhaps a solution for those problems will present itself later.
First, it seems rather interesting to present this: the tree parts of cf and cp look alike:
Here are both merged:
There are nine differences:
If we take the mutually similar word tokens (not the green ngrams), they are:
-dy -eo -ey -hy -ol -or and -a -d -e -h -o -y
Both cp and cf have 6 first tree branches: cph_ and cfh_: a d e h o y
But wait! The tree part of ck is almost exactly that of ct: not only the branches but also the usage is almost identical! Here are the two merged:
ck_ a c e h o y and ct_ a c e h o y
are the first branches. Then
cth_ : adehoy
ckh_ : acdehoy (where c has 4 word occurrences)
The usage is almost identical and no apparent anomalies can be seen; in 52 lines only these 17 differ (33%):
The question is why cth and ckh words have an almost identical word tree and, even more interesting, why they occur equally frequently.
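Such a comparison of two tree parts can also be done without looking at the picture, by comparing what follows each stem. A minimal sketch, assuming a plain word list; the helper names and the short sample are illustrative only and do not reproduce the 52 lines of the merged tree:

```python
def continuations(words, stem):
    """Everything that follows the stem in words that start with it."""
    return {w[len(stem):] for w in words if w.startswith(stem)}

def compare_stems(words, a, b):
    """Return (shared, only after a, only after b) as sorted lists."""
    ca, cb = continuations(words, a), continuations(words, b)
    return sorted(ca & cb), sorted(ca - cb), sorted(cb - ca)

# Illustrative sample built from word forms of the kind discussed here.
sample = ["cthedy", "ckhedy", "ctheey", "ckheey", "cthody",
          "ckhody", "ctheol", "ckheol", "ctheoly"]
shared, only_cth, only_ckh = compare_stems(sample, "cth", "ckh")
print("shared:", shared)
print("only cth:", only_cth)
print("only ckh:", only_ckh)
```

Adding the usage counts per continuation would allow the same comparison for frequency, which is the more puzzling part of the similarity.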
Two word variants are now investigated:
cth_edy & ckh_edy and cth_eey & ckh_eey
These are the words in the entire text, where ckh is made blue for comparison reasons, sorted on page number:
What is very strange is that the words on the left often resemble each other. That can be seen when sorted on the left word:
On the left (as prefix for ckhedy), ch- or sh- occurs very often.
Also things like
lche- & lshe
daiin.ch & daiin.sh
chalkain.ch & chckhal.she
okedy.ch & olkedy.she
qokeedy.sh & qokeey.che
are curious. It almost looks like ch = sh.
On these 105 words, the only eight unique prefixes (only 69 prefixes) are:
a (once) ch che cheo cho dch keo lche lshe o q qo sh she sheo
This is 8/105 ≈ 7.6%. So about 92% of the 105 words have an identical (predictable) prefix.
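Counting prefixes like this can be sketched as below. The sample words are made up for illustration and are not the real 105 words; the function name prefix_counts is hypothetical, and only the part of the same word before the core is treated as prefix here.

```python
from collections import Counter

def prefix_counts(words, cores):
    """Count what precedes the first matching core inside each word."""
    counts = Counter()
    for word in words:
        for core in cores:
            pos = word.find(core)
            if pos != -1:
                counts[word[:pos]] += 1
                break
    return counts

# Made-up sample, not the real word list.
sample = ["chckhedy", "shckhedy", "chcthedy", "qockhedy", "shcthedy", "chckhedy"]
counts = prefix_counts(sample, ["cthedy", "ckhedy"])
print(counts.most_common())
print(f"{len(counts)} unique prefixes on {len(sample)} words")

# The ratio quoted in the text, as a percentage with one decimal.
print(round(8 / 105 * 100, 1))
```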
Let’s do this also for:
cth_eol & ckh_eol
cth_ody & ckh_ody
Except for 1 occurrence (cth_eoly), these words exist without any suffix.
The 13 unique prefixes on 70 words are about the same:
ch- chee cho d dolchsy etol o q qo sh sho so y-
Looking at all 1856 words with cth or ckh on any position, the patterns do not really change.
As an indication, counts based on the first letters of the prefixes and suffixes:
prefix (1157x): a… (3) c- (2) ch- (370) che- (104) chee-(2) cheo-(23) cho-(64) sh-(140) y-(7) y…(13) etc.
Thus cheo and cho and che are counted as their first letter c.
prefix (1157): first ltr count
a – 3
c – 573
d – 23
e – 3
f – 2
k – 3
l – 12
o – 113
p – 4
q – 122
s – 270
t – 7
y – 20
suffix (1840): first ltr count
a – 26
c – 1
d – 33
e – 172
h – 26
i – 1
l – 1
o – 232
r – 2
s – 5
y – 433
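Grouping prefixes by their first letter, as done for the tables above, can be sketched like this. The input holds only the prefix counts that are spelled out in the text; since that list ends with "etc.", it is partial and the totals stay below the table values. The function name by_first_letter is hypothetical.

```python
from collections import Counter

# Prefix counts quoted in the text; the list there is partial.
prefix_hits = {"c": 2, "ch": 370, "che": 104, "chee": 2,
               "cheo": 23, "cho": 64, "sh": 140, "y": 7}

def by_first_letter(hits):
    """Sum the counts of all prefixes that share a first letter."""
    totals = Counter()
    for prefix, n in hits.items():
        totals[prefix[0]] += n
    return totals

totals = by_first_letter(prefix_hits)
for letter in sorted(totals):
    print(letter, totals[letter])
```

The same function works unchanged for the suffixes, given the full list of suffix counts.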
It is peculiar that ckh_ and cth_ are so similar in size and tree, but with the current knowledge no conclusion can be drawn.
Back to the original idea. Can we define what will follow ckh_ or cth_?
Yes, there seem to be some rules.
For the c-token, if a letter follows it and that letter is
- a then another a-token will follow (ai, al, am, aiin etc.)
- c ,, c-token will follow (ch, cbh, ckh, cph etc)
- d ,, d-token will follow (daiin, dal, dar, dy etc.)
- e ,, e-token will follow, but unsure how that looks
- o ,, o-token will follow, but unsure how that looks
- k ,, k-token will follow, but unsure how that looks
- r then nothing follows
- s ,, s-token will follow, but unsure how that looks
- t ,, t-token will follow: ta, taii, tal, tch, teo, tod or ty
- x then nothing follows
- y ,, y-token will follow, but unsure how that looks
These are exactly 10 following letters. Are these rules specific, or do they (already) apply to the entire text ?
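The rules listed above can be written down as a simple lookup table, which makes it possible to test them against other parts of the text. This is only a sketch of the rules as stated: the names RULES and expected_after are hypothetical, and the function only reports which family of token is expected, since for most letters it is still unsure what that token looks like.

```python
# Letter after the c-token -> what the rules above say will follow.
# None means "nothing follows".
RULES = {
    "a": "a-token", "c": "c-token", "d": "d-token", "e": "e-token",
    "o": "o-token", "k": "k-token", "r": None, "s": "s-token",
    "t": "t-token", "x": None, "y": "y-token",
}

def expected_after(word, c_token):
    """Describe what the rules expect after c_token at the start of word."""
    if not word.startswith(c_token):
        raise ValueError(f"{word!r} does not start with {c_token!r}")
    rest = word[len(c_token):]
    if not rest:
        return "end of word"
    letter = rest[0]
    if letter not in RULES:
        return "no rule listed"
    return RULES[letter] or "nothing follows"

for w, tok in [("ckhedy", "ckh"), ("cthody", "cth"), ("chdy", "ch"), ("cth", "cth")]:
    print(w, "->", expected_after(w, tok))
```

Words that come back as "no rule listed" are the ones to inspect first when checking whether the rules hold for the entire text.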
* This angle is now abandoned and replaced by new research. *
Can we find the unique tokens that define the text, and define the primary set ?
If we want to reflect from the perspective of the Markov model, the next obvious step is to find out if we can define a rule or method to create these words.
During the setup of the tokens it was already obvious that there is a pattern if you reverse the words of one length but not those of the other, but it seems this does not work for all words. What is the method to create these basic words?
In order to do that, we need to make some educated guesses or assumptions.
The next step is to combine this research with the findings of “letters per wordlength” and try to find a deciphering system based on found tokens.