Horizontal repeats

06-Feb-2017

Many people have suggested that the text in the VMS was made by some sort of repetitive process. That is based on the fact that we see many repeated words on the same line, and that we also see words on the preceding and following lines that are identical or closely resemble each other.

On this page I will show my detailed research on the horizontal and vertical repeats.
The text used is the CAB text, including the rosette but without labels.

Based on a visual inspection you can see many patterns, but we need to see some numbers. So I investigated whether the Levenshtein word edit distance and the Jaro-Winkler distance would reveal anything. Perhaps they would show that there is, or isn’t, a pattern. But alas! Compared to many other corpora, these methods only showed that the text stays within “normal” boundaries.
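To give an idea of what such a measurement looks like, here is a minimal sketch of a Levenshtein comparison between neighbouring words on one line. It is not the routine used for the results above: the function names, the "." word separator and the choice to compare characters of adjacent words are assumptions made for illustration, and the Jaro-Winkler distance is not covered.

```python
def levenshtein(a, b):
    """Character-level edit distance; insert, delete and substitute cost 1 each."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def adjacent_distances(line, sep="."):
    """Edit distance between each word and its right-hand neighbour (assumed layout)."""
    words = [w for w in line.split(sep) if w]
    return [levenshtein(x, y) for x, y in zip(words, words[1:])]


# Hypothetical sample line, not a quotation from the transcription.
print(adjacent_distances("chol.chor.chol"))
```

A distance of 0 marks an exact side-by-side repeat; small values mark near-repeats.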

Let’s have a look at this graph. It shows all exactly repeated words, taking only one paragraph into account at a time (paragraphs as commonly defined throughout this website).

Words are counted. Each time the word or piece is actually found, the count increases by 1. The count for any repeated piece is therefore always 2 or higher.
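The counting rule can be sketched as follows. This is only an illustration under assumptions: the function names, the "." separator and the sample line are hypothetical and not taken from the actual routine or transcription.

```python
from collections import Counter


def repeat_counts(line, sep="."):
    """Count every word on a line and keep only those seen 2 or more times."""
    words = [w for w in line.split(sep) if w]
    counts = Counter(words)
    return {w: n for w, n in counts.items() if n >= 2}


def max_repeat(line, sep="."):
    """Highest repeat count on the line, or 0 when nothing repeats."""
    return max(repeat_counts(line, sep).values(), default=0)


# Hypothetical sample line, for illustration only.
sample = "daiin.chol.daiin.shol.chol.daiin"
print(repeat_counts(sample), max_repeat(sample))
```

Because a word only enters the result once it has been found a second time, every reported count is 2 or higher.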

Here is exactly the same routine, but performed on the Latin Genesis (Nova Vulgata).

The average number of words per line is 16.8 in Latin and 8.3 in the VMS. So we must expect more identical words per line in Latin, at least in the low character regions 1, 2 and 3. Yet we see that already in 4R and 4L the number of identical word parts is the same in both corpora. That is strange; we would expect far fewer hits in the VMS.
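A rough way to see why fewer hits would be expected in the VMS is to count how many word pairs a line offers for comparison. The sketch below plugs the two averages into n(n-1)/2; using an average line length in that formula is a simplification and an assumption of this example, not part of the original routine.

```python
def word_pairs(n_words):
    """Number of distinct word pairs on a line of n_words words."""
    return n_words * (n_words - 1) / 2


latin_pairs = word_pairs(16.8)  # average Latin line
vms_pairs = word_pairs(8.3)     # average VMS line
ratio = latin_pairs / vms_pairs

print(round(latin_pairs, 2), round(vms_pairs, 2), round(ratio, 1))
```

An average Latin line offers roughly four times as many pairs that could match by chance, which is why equal hit counts in both corpora are surprising.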

Looking at the VMS graph of identical words, there are two lines with a count of 8 and two lines with a count of 7. These are the repeated words:

SIDE by SIDE words

Of course there are other words on a horizontal line that touch each other, with this as the best example:



——.—–.qokedy 1.qokedy 2.qokedy 3.qokedy 4.——.——-


There are 264 such words (almost 8%) of the total 3905 words.
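Finding such side-by-side words amounts to looking for runs of identical neighbours. Here is a minimal sketch, assuming "." as word separator; the function name and the filler words in the sample are hypothetical (the example above masks them), and how the total of touching words was tallied is not specified here.

```python
from itertools import groupby


def side_by_side_runs(line, sep="."):
    """Return (word, run length) for every run of 2 or more identical neighbours."""
    words = [w for w in line.split(sep) if w]
    runs = [(w, len(list(group))) for w, group in groupby(words)]
    return [(w, n) for w, n in runs if n >= 2]


# Only the fourfold qokedy comes from the example; the other words are invented.
print(side_by_side_runs("shedy.ol.qokedy.qokedy.qokedy.qokedy.chedy.daiin"))
```

Words that repeat on a line without touching, such as the two chol on f2v.P.6, produce no run and are left out by this check.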

This graph shows the number of words on a line and the exactly matching words:

The next graph was made for the Latin Genesis, which has a higher count of words per line and a more even spread of word counts than the VMS.

In the VMS, identical words on a page are sometimes scarce: for example, on the lily page 2v we find only:   —.chol.—-.chol.—-.—.—-.—–.——.—- (f2v.P.6).



If we look at the start letters (left) and end letters (right) of a word, we can also make such a list for a horizontal line. Looking for positions in that line that are the same gives the results below, where:

2L = the 2 leftmost characters of every word are compared
2R = the 2 rightmost characters of every word are compared
and so on

Of course, the last graph (for 7 characters) closely resembles the total word graph and is almost identical to it: if a word is shorter than the number of characters to be compared, the whole word is taken.
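The nL / nR comparison, including the rule that a short word is taken whole, can be sketched like this. The function names, the "." separator and the sample line are assumptions for illustration only.

```python
from collections import Counter


def piece(word, n, side):
    """Leftmost (side='L') or rightmost (side='R') n characters; short words are taken whole."""
    if len(word) <= n:
        return word
    return word[:n] if side == "L" else word[-n:]


def piece_hits(line, n, side, sep="."):
    """Pieces that occur 2 or more times on one line, with their counts."""
    words = [w for w in line.split(sep) if w]
    counts = Counter(piece(w, n, side) for w in words)
    return {p: c for p, c in counts.items() if c >= 2}


# Hypothetical sample line, for illustration only.
line = "qokedy.qokain.chedy.ol"
print(piece_hits(line, 2, "L"))  # 2L
print(piece_hits(line, 2, "R"))  # 2R
```

With n = 7 nearly every word is shorter than or equal to the piece length, so the comparison falls back to whole words, which is why that graph looks like the total word graph.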

What is really interesting is that the left and right patterns are very much alike.
The possibilities considered are:

a) many small words are similar, or
b) the word starts and word endings are similar but the rest of the word is different, or
c) the similar words are often so short that the 3 letters from the left equal the 3 letters from the right.

Especially when looking at 2 or 3 characters we see that the graphs are similar. Looking at the particular pieces of words shows us that the words concerned are not the same.

Often the matches are, of course, whole words that match fully, but there are also many words that have only a partial match. The pieces that match are often small, such as .ar. or .ol.

If we removed the words that are fully the same, we would see that there are many words that are similar but not exactly the same.

Here the word parts on each line were compared, but exactly identical whole words were not considered:
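One way to leave identical whole words out of the comparison is to group the words on a line by their piece and report a hit only when at least two different words share it. That reading is an assumption of this sketch, as are the function name, the "." separator and the sample lines.

```python
from collections import defaultdict


def partial_hits(line, n, side, sep="."):
    """Pieces shared by 2 or more *different* words on a line."""
    groups = defaultdict(set)
    for w in line.split(sep):
        if not w:
            continue
        # Short words are taken whole, as in the full comparison.
        p = w if len(w) <= n else (w[:n] if side == "L" else w[-n:])
        groups[p].add(w)  # a set, so exact repeats of a word collapse into one entry
    return {p: sorted(ws) for p, ws in groups.items() if len(ws) >= 2}


# Hypothetical sample lines, for illustration only.
print(partial_hits("qokedy.qokedy.qokain.chol", 2, "L"))
print(partial_hits("chol.chol.daiin", 2, "L"))
```

The repeated qokedy alone would not count; the hit comes from qokedy and qokain sharing qo.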


As expected, the longer the pieces become, the emptier the graphs are.
Here are the equivalent routines on the Latin text:


Let’s take some lines from the VMS graphs.

Here are the top words in 7L and 7R:

Looking at some in 7L:

You see that no words are exactly the same, and yet many pieces remain the same.

We see that on f115r.p29 there is a repeat that is not exactly the same word; only 7 characters match:






In Latin it is interesting to see the high left repeat count of 5 in 7L-Latin (Excel line 302):



It was already noticed that the small portions at the end of words have specific patterns (see dendrograms).

It is remarkable that, even though all exactly identical words were removed from the equation, the left and right patterns still look similar. Why?

Let’s have a closer look in the Latin example.

I took 6L and 6R

line 757


As you can see:

maledixerit and maledictus are not the same and

benedixerit and benedictus are not the same either.

But they share the common “prefix” and the common “suffix”.
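The four words can be checked directly with the 6L and 6R pieces. In this small sketch only the four words come from the example above; the rest of line 757 is not reproduced and the variable names are arbitrary.

```python
from collections import Counter

words = ["maledixerit", "maledictus", "benedixerit", "benedictus"]

left_6 = Counter(w[:6] for w in words)    # 6L pieces
right_6 = Counter(w[-6:] for w in words)  # 6R pieces

print(dict(left_6))
print(dict(right_6))
print(len(set(words)))  # all four words are different
```

No word is repeated, yet every 6L piece and every 6R piece occurs twice: the left matches come from the shared stems, the right matches from the shared inflected endings.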

If the VMS text does not contain a complex cipher, this could mean that on the pages where pieces of words are the same on the left as well as on the right, the text contains similar words. Of course, in a classical language such as Latin or Greek, word inflection can quite well be the cause of that.

Variations on a theme could also be responsible for the “similarity of words”.  Take for example this piece from Culpeper:

“These are not medicines which breed good blood, nor which correct the intemperature of the place afflicted, but which defend the blood and the ulcer itself from corruption in breeding flesh.”

If one wants to talk about some then something is everybody’s best friend.


Looking at the first graphs, the high peak just to the right of the middle of graph L1 is formed by f101v2 and f101r1, and the slightly lower peak by f89r2 and f89v2.



What’s going on there? The f101 pages are root pages, and f89r/v are root pages as well.

There are 18 such pages (see the pharma page; pharma pages = pages with roots):

f88r – f89v1 and

f99r – f102v1

After inspection it became clear that these pages are part of a foldout, and that the longest paragraph in the transcription is actually text spread across up to 3 pages of the foldout. The pages concerned are:

f101r1 => the transcript is actually of f100v + r1 + r2
f101v2 => the transcript is actually of 101v2 + v1
f89r2 => 89r2 + r3
f89v2 => 89v3 + v2

Page 89v2 has 15 words, which is average. f89r2, with 17 to 18 words, is still acceptable. I have now split up 101r1 and 101v2 into 2 lines per paragraph.
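Splitting an over-long transcribed line into two can be sketched as below. Cutting at the midpoint is purely an assumption of this example; a split at the real page boundary of the foldout would need that position from the transcription. The function name and the "." separator are hypothetical as well.

```python
def split_line(line, sep="."):
    """Split one long line into two shorter ones at the (assumed) midpoint."""
    words = [w for w in line.split(sep) if w]
    mid = (len(words) + 1) // 2  # the first half gets the extra word on odd counts
    return sep.join(words[:mid]), sep.join(words[mid:])


# Placeholder words, for illustration only.
print(split_line("a.b.c.d.e"))
```

After the split, each half is compared on its own, so repeats that only arose because text from several pages was joined into one line no longer count.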

Because 89r still holds 16 words, which is now the highest count, it forms the high peak visible in the new graphs:

pieces of word hits after splitting up root pages


If you compare this with the Latin graphs, it is clear that the spread in Latin is much more random, while in the VMS the letters seem to have some sort of coherence. An explanation for such a relation between words could be the use of similar n-grams per line. Similar n-grams are also visible if you use a numbering system, or in other words, a cipher system.

Let us look at the count of the unique pieces of words that we found in our (character)matches.

What is apparent is that there are 14 letters in the left position and 13 in the right position of a word that are repetitive. But we already knew that: the DNA of these letters and words is the same as for the whole text, with a difference for the letter i. That letter is now the most frequent letter, instead of o.
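Counting the unique pieces that take part in matches can be sketched by summing the per-line hits over all lines. The function name, the "." separator and the two sample lines are assumptions for illustration; they do not reproduce the figures above.

```python
from collections import Counter


def matched_piece_totals(lines, n, side, sep="."):
    """Total occurrences of each piece, counting only lines where it repeats."""
    totals = Counter()
    for line in lines:
        words = [w for w in line.split(sep) if w]
        pieces = [w if len(w) <= n else (w[:n] if side == "L" else w[-n:])
                  for w in words]
        for p, c in Counter(pieces).items():
            if c >= 2:  # only pieces that actually repeat on this line
                totals[p] += c
    return totals


# Hypothetical sample lines, for illustration only.
lines = ["okal.otedy.chol", "qokal.shedy.qotal"]
left = matched_piece_totals(lines, 1, "L")
right = matched_piece_totals(lines, 1, "R")
print(len(left), left.most_common())
print(len(right), right.most_common())
```

The number of keys is the count of distinct repetitive letters for that position, and `most_common()` ranks them by frequency.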



A possible problem in this detection method could lie in the fact that we consider the start and end letters of words. Let us assume that certain start letters and certain end letters must be kept outside the comparisons. Of course we cannot strip the start and end letters from every word, because we would end up with almost nothing: ol, or, ar and al would become useless, for example.

Perhaps it would be best if we consider removal of:

startletters:  q. o. ch

endletters:  y. n.



At the start of words we removed qo and ch, and at the end the letter y. Here are the results:
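The stripping step can be sketched as follows. The function name and the guard that never strips a word down to nothing are assumptions of this example; only the removed pieces (qo and ch at the start, y at the end) come from the text.

```python
def strip_affixes(word, prefixes=("qo", "ch"), suffixes=("y",)):
    """Remove one listed start piece and one listed end piece, never emptying the word."""
    for p in prefixes:
        if word.startswith(p) and len(word) > len(p):
            word = word[len(p):]
            break
    for s in suffixes:
        if word.endswith(s) and len(word) > len(s):
            word = word[:-len(s)]
            break
    return word


# Sample words, for illustration only.
for w in ["qokedy", "chol", "daiin", "chy"]:
    print(w, "->", strip_affixes(w))
```

The stripped words can then be fed into the same left and right piece comparison as before.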

These results show what was expected: all patterns are based on “known” dependencies between letters.  The pattern of the horizontal repeats did not change compared to the previous “full” exercise.


The horizontal repeats indeed show that many word starts and word endings are repeated within a line. Such behaviour would not be this consistent if the language had been created randomly.
The statistical number of those repeats follows an unnatural pattern.
However, the pattern comes close to the frequency of horizontal repeats in a standard written language.



To do: examine vertical repeats and the combination of horizontal and vertical repeats.