Jaro-Winkler distance

If you saw my page on repeats and Levenshtein distance, you will agree that those were nice exercises, but the statistical information never quite got to the point where differences between an “ordinary” text from around 1450 and the VMS could be seen. The main reason is that you can only compare strings (read: words) of specific lengths.

The Jaro-Winkler distance promises the following:

The Jaro-Winkler distance is a measure of similarity between two strings.
It is a variant of the Jaro distance metric (Jaro, 1989, 1995), a type of string edit distance, and was developed in the area of record linkage (duplicate detection) (Winkler, 1990).

The higher the Jaro–Winkler (JW) distance for two strings is, the more similar the strings are. A low number shows almost no similarity.

The score is normalized such that 0 equates to no similarity and 1 is an exact match. It should work even when you compare strings of different length.
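To make the definition concrete, here is a minimal Python sketch of the textbook Jaro and Jaro-Winkler formulas. It is not the code that produced the graphs on this page. The prefix scale `p=0.1` and the maximum prefix length of 4 are the commonly used defaults and are assumptions here, as are the function names.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: 0.0 means no similarity, 1.0 means identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching when they are close enough together.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2  # number of transpositions
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity with a bonus for a shared prefix (assumed defaults)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


# Strings of different length still get a score between 0 and 1.
print(round(jaro_winkler("qo", "qokam"), 2))
```

Because of the prefix bonus, words that start the same way score higher than words that merely share letters, which matters for a text with many words starting in qo- or ch-.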

To test this, I followed the exact same working method as for the Levenshtein research.

jw-on-cab-all-words

Top graph
The top graph shows the JW value of the unique words, sorted from high to low. The first blue peak is formed by the two most repeated words, ‘daiin’ and ‘ol’, at about position 5324. It is directly followed by a much lower peak for the word ‘qo’, which stands on its own at repeat position 205.

The next high peak combines the words at repeat positions 3, 4, 5, 6 and 7. It is also followed by a lower peak, at 5937, which combines the words (she, kor, qokam, chety, ykedy, chedal, om, ldy, aiir), all with only about 25 repeats and occupying repeat positions 228…244.

To the left of ‘daiin’ there are three clusters of words. The leftmost, at a height of about 17…20 repeats, holds a lot of words such as (otaly, chdaiin, qokair, keol, kchor, okees, chain, etc.),
then words like (ches, sheky, tchedy, dshedy, olkain, qokor, etc.),
then words like (shal, lkchedy, chokchy, ychor, lo, okeeody).

Lower graph
The lower graph zooms in on the repeat count and word length, and indeed reveals that JW gives a better distribution of the word distances!

Of course the frequent qo-, -dy, ch- and other highly repeated ngrams have to emerge the way they do here. It looks promising, and I decided to remove all words that pollute the picture. This means that all words with repeat=1 will have to go.

First, let us have a look at the same exercise but then for the Canter text.

canter-all-words

The peaks in the repeat count sorted on JW from high to low (the first graph of the series) show that the most repeated words are scattered throughout the graph. The first peak is rep.pos. 3, the next is rep.pos. 9, etc., and the last peaks are rep.pos. 7 and then rep.pos. 4. Rep.pos. 1 has a repeat of 8422 and can easily be found because it is the highest.

Because Canter is a large Bible text, the progression is a smooth slide when we sort on JW distance (grey). A smaller text, or one with fewer repeated words, will show more sudden drops in value when sorted on JW distance.

Let’s use these same texts and also remove the words with repeat=1 for a clearer picture.
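Removing the words with repeat=1 can be sketched as follows. This is a minimal illustration that assumes the text has already been split into a list of words; the function name `drop_single_repeats` and the sample word list are made up for the example.

```python
from collections import Counter


def drop_single_repeats(words):
    """Return {word: repeat count} for words that occur more than once."""
    counts = Counter(words)
    return {word: n for word, n in counts.items() if n > 1}


# Hypothetical sample; a real run would use the full transcription.
sample = "daiin ol daiin qo ol chedy".split()
print(drop_single_repeats(sample))
```

Only the unique words that remain after this step are plotted in the graphs that follow.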

jw-cab-removed-1

canter-jw-all-words-no-1

jw-culpeper

The highest peak is the most repeated word. It can be seen that this word has a high JW distance in CAB, but a relatively much lower one in Canter and Culpeper, where it sits at the right end of the graphs.

The next graphs show the flow of the JW value with the most repeated words on the left and the words with repeat=2 on the far right:

cab-jw-vs-repeat-no-1

canter-jw-vs-repeat-no-1

When we look at Culpeper, the graph shows that the words have less variation than in the VMS or Canter.

culpeper-jw-vs-repeat-no-1

Why is that? It seems that there is less variation in the JW distance than in the other texts.
The possible reason is that the Culpeper text is rather repetitive: the chapters follow a distinct pattern, and the text uses almost the same words each time to fill us in about the properties, the usage and the “thynges” of a specific herb. Those words therefore repeat tediously, but they also have the same JW distance towards the other tedious words.

It follows that the Canter Bible text has more variation throughout.

We must now conclude that the VMS is not a text with repeating words or fixed chapters with a repeating description of herbs or stars. Far from it: the VMS seems to contain a text with a layout similar to that of the religious corpus that was examined. It could be any other type of text, but it does not follow the repeating pattern that can be expected from a herbal.

To back this up, the herbal pages were analysed in the same way. They should follow the same pattern as the main text of the CAB, and they do:

cab-herbal-no1

Now this only needs to be tested against a repetitive text.

<..>
