Words as tokens 3

On the previous page you can find Words as tokens 2, with the source links and more.

 

Here we are investigating the dendrograms for the 15 letters:  a c d e f i k o l p q r s t y

Any Dendrogram

The small tokens can be analysed quickly in any tree by looking at three key parts:

  • cbh
  • dy
  • ai, al, ain, air and variants

If we see those parts at the right end of a word, we can be assured that the left part of that same word is at least one token, sometimes two or more combined tokens.
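As a minimal sketch of this check (in Python; the key-part list and example words are taken from this page, everything else is an assumption):

```python
# Sketch: test whether a word ends in one of the key parts; if so, the
# left remainder should consist of one or more tokens.
KEY_PARTS = ["aiin", "ain", "air", "ail", "cbh", "dy", "ai", "al"]

def split_on_key_part(word):
    for part in sorted(KEY_PARTS, key=len, reverse=True):  # longest match first
        if word.endswith(part) and len(word) > len(part):
            return word[: -len(part)], part
    return None

for w in ["chain", "chair", "chcbh", "qokain"]:  # example words from this page
    print(w, "->", split_on_key_part(w))
```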

Dendrogram c

Looking at dendrogram c, the first token cd is green. This means it is an n-gram token and should be avoided if possible. Preferred are the word tokens, the white unmarked ones. For example, looking down the tree, the ch token, with a huge usage of 2951, is a true candidate: on the right we see ch_ain, ch_air and also ch_cbh. Browsing through the list, we can quickly write down these tokens:

cbh, ckh, cth, cph, cfh and ch.

If we search for these in the unique word list, we get 1230, 240, 252, 117, 58 and 2951 hits respectively. The list remaining afterwards still contains these single letters: b and h: 0, c: 3 times, k: 116, t: 139, p: 143, f: 51.
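A sketch of how these hit counts could be reproduced (the token list is from the text above; the file name unique_words.txt is an assumption):

```python
# Sketch: count how many unique words contain each c-token, then tally
# the single letters remaining after the tokens are stripped out.
from collections import Counter

C_TOKENS = ["cbh", "ckh", "cth", "cph", "cfh", "ch"]

with open("unique_words.txt") as f:
    words = [line.strip() for line in f if line.strip()]

hits = {tok: sum(tok in w for w in words) for tok in C_TOKENS}
print(hits)  # per the counts above, roughly: cbh 1230 ... ch 2951

# Strip tokens longest-first (so "ch" does not eat into "cbh" etc.)
# and count the leftover single letters of interest.
leftovers = Counter()
for w in words:
    for tok in sorted(C_TOKENS, key=len, reverse=True):
        w = w.replace(tok, "")
    leftovers.update(c for c in w if c in "bchktpf")
print(leftovers)
```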

The three c hits come from these words: ccbhe, cckhhy, cckheor (f100r, f68r1 and f13r), which will be regarded as flaws and are ignored.
The letters f, p, k and t always form a combination like fchdy, fchol, fchor, fcholdy, pcheedy, pchdol and kchs, kchar, tch, tchd, tchty, tchom, etc.

These letters now seem to stand alone as first-position letters (Apos), but we will see about that later.

A possible extra token is che (1226) and then ch (1820).  This gives a slightly better result.

The letter h now occurs four times at the end of these words:

cbh-ckh-h
o-ckh-h
so-cth-h
o-ckh-h

These four words each occur only once in the text, and they are obviously a repetition of the letter h.

 

Dendrogram a

Immediately visible when looking at this tree are the double i's.
We already chose cbh as a token, so let's use that here; found are: a-cbh, al-cbh, ar-cbh.

Let’s find dy: a-dy  ai-dy  aii-dy aiin-dy ail-dy  air-dy airo-dy alch-dy  al-dy alke-dy alo-dy  aral-dy arch-dy ar-dy aro-dy

Gluing these into a progressive token list, we get an incomplete list:

a al ar
ai ail
aii alo airo
aiin aro
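A sketch of how these prefixes could be collected automatically, assuming the same word-list file as before:

```python
# Sketch: collect the left parts of all words ending in "dy" or "cbh",
# the raw material for the progressive a-token list above.
from collections import Counter

SUFFIXES = ["cbh", "dy"]

with open("unique_words.txt") as f:
    words = [line.strip() for line in f if line.strip()]

prefixes = Counter()
for w in words:
    for suf in SUFFIXES:
        if w.endswith(suf) and len(w) > len(suf):
            prefixes[w[: -len(suf)]] += 1
            break

# Only the a-initial prefixes, as used above:
for p, n in sorted(prefixes.items()):
    if p.startswith("a"):
        print(p, n)
```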

 

It is probable that if ail exists, air also exists, but let's try these now. We remove a, run, and view the results.

aiin: 497, airo: 32, aii: 163, ail: 22, alo: 106, aro: 87, al: 863, ar: 763, ai: 601

We see words like qokain that leave an n, so adding ain seems wise, also because it is indeed a token word. The same goes for am, air, as, an and ad, and perhaps alo and aly?

The words not yet covered are ak-ar, ak-al, ak-ain/air and 35 similar words.
Also added are ag (5x), ae (14x) and at (11x).

These tokens can now be grouped as:

group_a


Going through the other dendrograms visually, it becomes clear that most words have an even length and can be divided into 2-grams, although there are many that have an odd length.
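This even/odd observation is easy to quantify; a sketch, under the same word-list assumption:

```python
# Sketch: tally even versus odd word lengths in the unique word list.
from collections import Counter

with open("unique_words.txt") as f:
    words = [line.strip() for line in f if line.strip()]

parity = Counter("even" if len(w) % 2 == 0 else "odd" for w in words)
print(parity)

# An even-length word splits cleanly into 2-grams:
def bigrams(word):
    return [word[i:i + 2] for i in range(0, len(word), 2)]

print(bigrams("chedy"))  # ['ch', 'ed', 'y'] – an odd length leaves a single letter
```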

 

Back to dendrogram c

With the word-token cbh a lot of odd lengths can be solved, but there are also words that look like:

ngram – letter – ngram = 2 – 1 – 2

token – letter – token = x – 1 – x

which does not feel right, considering that most of those “letters that stand alone” are an e or an o. Examples of such words are cbh_o_ckh, cbh_o_ko, cbh_ede, ch_e_dy_t, ch_ee_oy.

Here the space between assumed tokens is filled with an underscore _.
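A sketch of the segmentation itself: greedy longest-match against a token list (the matching strategy is an assumption; no particular one is prescribed here). The two example words reproduce the splits shown above:

```python
# Sketch: greedily segment a word into known tokens (longest match first);
# anything that cannot be matched stays as a single letter, so patterns
# like token_letter_token become visible.
TOKENS = sorted(["cbh", "ckh", "cth", "cph", "cfh", "ch", "dy",
                 "aiin", "ain", "air", "ail", "ai", "al", "ar"],
                key=len, reverse=True)

def segment(word):
    parts = []
    i = 0
    while i < len(word):
        for tok in TOKENS:
            if word.startswith(tok, i):
                parts.append(tok)
                i += len(tok)
                break
        else:                       # no token matched: single letter
            parts.append(word[i])
            i += 1
    return "_".join(parts)

for w in ["cbhockh", "chedyt"]:     # examples from the text
    print(w, "->", segment(w))      # cbh_o_ckh and ch_e_dy_t
```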

A decision has to be made whether trying to make big tokens will be more fruitful than using smaller tokens and perhaps merging some of them later.

Because many words do actually show their end form in dendrogram c, we can see what is most plausible. For c the lengths 6, 7 and 8 are all added, and these are the branches which have finished growing.

There are only a few still growing to length 9, but there are still a few of those problematic words: ch_e_daiin, ch_e_ckh_ey. Shorter words like cbh_ckh_o and cbh_ckh_h, with a one-letter ending, and also cbh_o_te make it difficult. Perhaps a solution for those problems will present itself later.

First, it seems rather interesting to present this: the tree parts of cf and cp look alike:

cf-and-cp

Here are both merged:

cf-and-cp-merged

There are nine differences:

differences-f-en-p

If we take the mutually similar word tokens (not the green n-grams), they are:

-dy -eo -ey -hy -ol -or and -a -d -e -h -o -y

Both cp and cf have 6 first tree branches:  cph_-cfh_ :  a d e h o y

 

But wait! The tree part of ck is almost exactly that of ct; not only the branches but also the usage is almost identical! Here are the two merged:

cth-en-ckh-merged

ck_ : a c e h o y    and    ct_ : a c e h o y

are the first branches. Then

cth_ :  a d e h o y
ckh_ :  a c d e h o y  (where c has 4 word occurrences)
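How such a branch comparison could be computed, assuming a “branch” is simply the letter that follows the token, and a file all_words.txt with the full text, one word per line:

```python
# Sketch: compare the sets of letters that follow cth and ckh, and how
# often each follower occurs; near-identical sets mirror the merged trees.
from collections import Counter

with open("all_words.txt") as f:
    words = [line.strip() for line in f if line.strip()]

def followers(token):
    counts = Counter()
    for w in words:
        i = w.find(token)
        if i != -1 and i + len(token) < len(w):
            counts[w[i + len(token)]] += 1
    return counts

cth, ckh = followers("cth"), followers("ckh")
print("cth_:", sorted(cth))
print("ckh_:", sorted(ckh))
print("shared followers:", sorted(set(cth) & set(ckh)))
```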

 

The usage is almost identical and no apparent anomalies can be seen; in 52 lines there are only these 17 differences (33%):

kh-and-th-different

The question is why cth and ckh words have an almost identical word-tree, and even more interesting: why do these occur equally frequently?
Two word variants are now investigated:

cth_edy & ckh_edy  and  cth_eey & ckh_eey

These are the words in the entire text, where ckh is coloured blue for comparison, sorted on page number:

words-ckh-en-cth

What is very strange is that the words on the left often resemble each other. That can be seen when sorted on the left part:

ckh-left-sorted

On the left (as a prefix for ckhedy), ch- or sh- occurs very often.

Also things like

lche- & lshe
daiin.ch & daiin.sh
chalkain.ch & chckhal.she
okedy.ch & olkedy.she
qokeedy.sh & qokeey.che

are curious. It almost looks like ch = sh.

On these 105 words, the only eight unique prefixes (only 69 prefixes in total) are:

a (once)  ch  che   cheo   cho   dch keo  lche lshe o  q  qo sh she sheo

This is 8/105 = almost 7%, so roughly 92% of the 105 words have an identical (predictable) prefix.

Let’s do this also for:

cth_eol & ckh_eol
cth_ody & ckh_ody

Except for 1 occurrence (cth_eoly), these words exist without any suffix.
The 13 unique prefixes on 70 words are about the same:

ch- chee cho d dolchsy etol o q qo sh sho so y-

Looking at all 1856 words with cth or ckh in any position, the patterns do not really change.

An indication based on the first letters of the prefixes and suffixes:

prefix (1157×): a… (3), c- (2), ch- (370), che- (104), chee- (2), cheo- (23), cho- (64), sh- (140), y- (7), y… (13), etc.

Thus cheo, cho and che are all counted under their first letter c.

prefix (1157): first letter count
a – 3
c – 573
d – 23
e – 3
f – 2
k – 3
l – 12
o – 113
p – 4
q – 122
s – 270
t – 7
y – 20

suffix (1840): first letter count
a – 26
c – 1
d – 33
e – 172
h – 26
i – 1
l – 1
o – 232
r – 2
s – 5
y – 433
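These tallies are simple to produce; a sketch, where the prefixes and suffixes lists are placeholders for the strings actually collected around the cth/ckh words:

```python
# Sketch: tally prefixes/suffixes by their first letter, as in the
# tables above.
from collections import Counter

prefixes = ["ch", "che", "cheo", "sh", "qo", "o"]   # placeholder data
suffixes = ["edy", "eey", "ody", "y", "ol"]         # placeholder data

def first_letter_counts(items):
    return Counter(s[0] for s in items if s)

print(first_letter_counts(prefixes))   # cheo, cho, che all count as c
print(first_letter_counts(suffixes))
```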

It is peculiar that ckh_ and cth_ are so similar in size and tree, but with the current knowledge no conclusion can be made.

Back to the original idea. Can we define what will follow ckh_ or cth_?

Yes, there seem to be some rules.

For the c-token, if there is a letter after it and it is

  • a, then another a-token will follow (ai, al, am, aiin, etc.)
  • c, then a c-token will follow (ch, cbh, ckh, cph, etc.)
  • d, then a d-token will follow (daiin, dal, dar, dy, etc.)
  • e, then an e-token will follow, but unsure how that looks
  • o, then an o-token will follow, but unsure how that looks
  • k, then a k-token will follow, but unsure how that looks
  • r, then nothing follows
  • s, then an s-token will follow, but unsure how that looks
  • t, then a t-token will follow: ta, taii, tal, tch, teo, tod or ty
  • x, then nothing follows
  • y, then a y-token will follow, but unsure how that looks

These are exactly 11 possible following letters. Are these rules specific to these tokens, or do they (already) apply to the entire text?
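Testing whether the rules hold for the entire text could look like this (token lists and file name are assumptions):

```python
# Sketch: for every occurrence of a c-token, record the letter right
# after it and the token that starts there, to test the rules above
# against the entire text.
from collections import defaultdict, Counter

C_TOKENS = ["cbh", "ckh", "cth", "cph", "cfh"]
FOLLOW_TOKENS = sorted(["aiin", "ain", "air", "al", "ai", "ch", "cbh", "ckh",
                        "daiin", "dal", "dar", "dy", "ta", "tch", "ty"],
                       key=len, reverse=True)

rules = defaultdict(Counter)
with open("all_words.txt") as f:
    for w in (line.strip() for line in f):
        for ctok in C_TOKENS:
            i = w.find(ctok)
            if i == -1 or i + len(ctok) >= len(w):
                continue
            rest = w[i + len(ctok):]
            follower = next((t for t in FOLLOW_TOKENS if rest.startswith(t)),
                            rest[0])    # fall back to the bare letter
            rules[rest[0]][follower] += 1

for letter, counts in sorted(rules.items()):
    print(letter, "->", counts.most_common(3))
```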

 

* This angle is now abandoned and replaced by new research. *

 

Can we find the unique tokens that define the text, and define the primary set?

If we want to reflect on this from the perspective of a Markov model, we want to find π₀ (pi-null), the initial state distribution.
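A sketch of estimating π₀ from segmented words, i.e. the probability that a word starts with a given token (the segmented list is a placeholder; in practice it would come from a segmentation like the one sketched earlier):

```python
# Sketch: estimate pi_0, the initial token distribution of a Markov
# model over tokens, by counting which token each word starts with.
from collections import Counter

segmented = [["ch", "e", "dy"], ["qo", "k", "aiin"], ["cbh", "o", "ckh"]]
# ^ placeholder: in practice, the output of the greedy segmentation above

starts = Counter(word[0] for word in segmented if word)
total = sum(starts.values())
pi_0 = {tok: n / total for tok, n in starts.items()}
print(pi_0)
```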

The next obvious step is to find out if we can define a rule or method to create these words.
During setup of the tokens it was already obvious that there is a pattern if you reverse the words of one length but not those of the other length, but it seems this does not work for all words. What is the method to create these basic words?

In order to do that, we need to make some educated guesses or assumptions.

The next step is to combine this research with the findings of “letters per wordlength” and try to find a deciphering system based on found tokens.

 

 
