Levenshtein or Edit Distance

Levenshtein Distance

2-10-2016

The Levenshtein Distance or Edit Distance (E.D.) shows the number of times a letter in a word has to be “inserted, deleted or changed” to become another word.

It shows the minimum number of  single-character edits required to change one word into the other.  The maximum of the E.D. is the length of a word, because when you change all letters you will have another word.

  • ABC -> ABC has a E.D. of 0 because they are equal
  • ABC -> ABF has a E.D. of 1 because only 1 letter is different.
  • ABC -> ZYC has a E.D. of 2 because 2 letters are different.
  • ABC -> XYZW has a E.D. of 4 because all 4 letters are different (always use the longest word as upper bound)

That number as maximum is not very useful if you compare words with different lengths. It also does not tell you much about a text if a wordA compared to wordB has an E.D.=1 if you can not consider all words.

These graphs show the Levenshtein Distance values for the words on the stars pages and also the pharma pages. From all these values no conclusion whatsoever can be made from this.

levenshtein-distance-top-repeated-words-pharma levenshtein-distance-top-repeated-words-stars

That is why the method used is the following:

  1. find the unique words in a text and count the number of times these repeat (word repeats)
  2. sort from high repeat to low and assign word repeat order 1 for the word with the highest repeat, the next one will become number 2  and so on
  3. remove words of length 1 and sort the words alphabetically
  4. compare all words in a matrix and calculate the E.D. but show only where the E.D.=1
  5. per word calculate the sum of those E.D.=1

This results in an graph where can be seen that some words have an direct inflection from a word in the same alphabetical region. This is accented by the blue triangle “hits”.

levenshtein-distance-top-repeated-words-only-1-whole-cab

Since there are over 8000 words in the VMS, this results in an extensive calculation matrix which can not be showed visually (operations like 8000*8000*wordlengths word 1*wordlengths word 2). The “sum of ED=1” will show how many “hits” there were with other words based on only an Edit Distance of one.

Another major problem is that we can only calculate the E.D. per word length and compare it for that length alone. Since we do not know if a word of for example 5 letters really is a 5 letter word, it could be 3 or 7 or anything, this method only is a rough comparison, but accurate enough to show major anomalies.

 

Let’s compare

sumed-canter-vs-repeat-counts

 

sumed-cab-total-vs-repeat-counts

 

Remember:
higher E.D. = lower resemblance with other words.
low E.D. = good resemblance. (E.D.=0 when equal)

Of course: small words have an Sum(ED=1) quicker because the chance that they have a match with another word is much higher than a longer word.

Conclusion Edit Distance comparison

What does this mean? It means that

  • the number of unique words follow a normal pattern in this comparison
  • the Distance for the words in the VMS is a little bit more dense, but they follow a normal pattern for E.D.=1

When the sum of the Edit Distance =1 on all words in a text tells something about the words used, but as an average it is specific for the total text examined.

The morphology of words in a language is fixed to that language. This means that the number of morphs of any word depend on the inflections sustained in that language. The number of inflections for a word are defined in grammar rules and all words in the text follow those rules.

It is assumed as logical conclusion that when there are many inflections, the E.D.=1 will occur also more often than when there are lesser inflections.

 

Below you see the orange line reflects the number of repeats for an unique word divided by the sum of (E.D.=1). Only words with length 4 are used.

 

rep-div-ed-sum-for-cab-length-4

The highest peak is of course the word “daiin”.

sum-ed-sorted-on-length

As you can see in the graph above, every time when there is a length-change in the table, the highest repeated word has also the highest Sum(ED=1).  Any other text will show the same graph.

Let’s show it again, by zooming in on length (len) and repeats in the graph where the sum(ED) was calculated, and then sorted High-to-Low on Sum(ED).

graph-cab-zoomed-in-on-repeats-and-length

This is the graph of the sum(ED)  sorted HL using the exact same information.

cab-sum-ed

You can clearly see that the higher length words, generate a higher sum(ED).
Words that occur only once in the text at a length of 14 generate a higher sum(ED) than a word like ‘daiin’ which has a huge repeat and a modest length.

Note: Calculation of sum(ED) or sum(ED=1) does not matter regards to the generated graph.

This little research shows that for different word lengths, where we do not know if the length transcribed and analyzed is a realistic word length, this technique is worthless.

There seems to be another technique which does a better job there, the Jaro-Winkler distance. Let’s see.

Loading