RuG/L04

Manuals

leven

Description

using the Levenshtein algorithm, derive a difference matrix for a set of locations based on sets of strings for each location

Synopsis

leven -n number [-d filename | [-i filename | -s filename | -g ] [-l filename] [-o filename] [-p filename] [-B] [-e] [-f number] [-F] [-m] [-N number] [-t] datafile(s)

Options

-B
binary comparison instead of Levensthein
-d filename
output file for standard deviation
-e
equal weight for indel and subst
-f int
minimum number of occurrences required for each variant
-F
skip files with less than two variants, depends on option -f
-g
character-based G.I.W.
-i filename
file with indel values
-l filename
file with location labels
-m
calculate median difference instead of mean difference
-n int
number of locations
-N int
select another normalisation function
-o filename
output file
-p filename
file defining a linkage partition
-s filename
file with substitution and indel values
-t
test all input files

Testing

Before doing any time-consuming calculations, it is recommended to run the program in test mode, using option -t, to check if all input files are syntactically correct.

Number of locations

The total number of locations used in the datafiles must be specified with the option -n. This number must be identical to the highest index number in the location file, if used.

Weights for indels and substitutions

By default, all indels have the weight 1, and all substitutions have weight 2. There are two ways to alter this behaviour:

Specifying weights for indels only

Use the option -i to specify a file that lists the indel weights. Substitution weights are calculated by adding the indel weights for each pair of character codes.

The indel file should be written as follows:

    PRINT max
    NEWLINE
    FOR i = 1 TO max
        PRINT w[i]
        NEWLINE
max is the highest character code number used in the dialect data.

w[i] is the weight for inserting or deleting a character with code i.

Note: All weights should be integers in the range 0 through 65535 (or 255 if compiled with LEVEN_SMALL defined).

Specifying weights for indels and substitutions

Use the option -s to specify a file that lists both indel weights and substitution weights.

The substitution file should be written as follows:

    PRINT max
    NEWLINE
    FOR i = 1 TO max
         FOR j = 0 TO i - 1
             PRINT w[i,j]
             NEWLINE
max is the highest character code number used in the dialect data.

w[i,j], for j > 0, is the weight for substituting a character with code i for a character with code j, or vice versa.

w[i,0] is the weight for inserting or deleting a character with code i.

Note: All weights should be integers in the range 0 through 65535 (or 255 if compiled with LEVEN_SMALL defined).

Specifying labels for locations

Use the option -l to specify a file with labels for locations. If this option is absent, numbers are used in the output file instead. You can't use labels in the data files if you omit this option.

Datafiles

Each additional datafile should hold dialect data, one file for each dialect item. All words in a single file are compared for each pair of locations.

A weight can be specified to a datafile (default 1), preceded by an asterisk. This can be used to alter the relative importance of one datafile compared to another. Only the last weight specified in a particular file will be used, and it will be used for all of that datafile, even for those strings preceding the weight specification.

Each location is introduced with a location number (counting from 1 upward) or a colon followed by a location label.

Words can be specified in two ways, with a leading minus as an ascii string, or with a leading plus as a sequence of decimal numbers.

Example:

    * 1.2
    5
    - fiets
    - fIts
    + 102 116 115
    : GR7
    - fietse
    - fiets
    - fiets
    5
    - flits
As you can see, each location can have more than one word. In this example there are two locations: location number 5 (note that this location is mentioned twice), which has four words, of which the third is specified as a sequence of numbers; and the location with label GR7, which has three words.

The order of the locations is immaterial. Not every location needs to be present in each data file. In the end, the sum of all differences between two locations is divided by the number of times that averaged difference was calculated, taking any weights specified in the datafiles into account.

The rest of this section is a bit technical. You can skip to the next section, if you want.

For this file, the actual difference between location number 5 and location GR7 is calculated by choosing a (near-)optimal one-by-one alignment of the strings of these locations. Optimal, meaning: the sum of the pair-wise differences is minimised. The actual difference is the average of these pair-wise differences. If two locations have different numbers of strings, all strings of one or both locations are duplicated once or more, until the locations have the same number of strings. In this case, the strings for location number 5 are duplicated three times, and those of location GR7 four times, resulting in twelve strings at each location.

The rationale for this procedure is as follows. Look at this example:

What would happen if you calculated the average difference of all possible pairs? You would get:
Q:ABBAAA
P:
B100111
A011000
A011000
gives 8 / 18 = .444 between P and Q for this word. But if you look at how the variant strings of this word are distributed among P and Q you can see that they are in identical proportions. So the difference should be zero. The (near-)optimal one-by-one alignment gives you this value:
P:
 
Q:
B
|
B
A
|
A
A
|
A
B
|
B
A
|
A
A
|
A
(variants repeated twice)
 
(variants sorted to line up with first row)
differences: 0 0 0 0 0 0
gives 0 / 6 = 0.

Note that determining the optimal one-by-one alignment is a variant of the Travelling Salesman Problem, which is computationally expensive. A heuristic is used that gives you a reasonable approximation.

Defining a linkage partition

If you are interested in only a small subset of the difference matrix, you can use a linkage partition file. Such a file looks like this:
    : 1 2
    : 2 3
Lines not starting with a colon are ignored.

The numbers are location index numbers, numbered from 1 to max. You can not use labels in this file. In the example, the differences between locations 1 and 2, and between locations 2 and 3 are calculated, but not between the locations with numbers 1 and 3.

Output files

By default, results are written to the output file in the form of a difference matrix. If no labels are available, numbers from 1 to max are used as labels, where max is the number of locations.

By default, the difference between two locations is the (weighted) mean of all differences calculated between those two locations. If the option -m is used, the difference between two locations is the median of the calculated differences. Weights in the data files will be ignored. Note: this requires a lot of memory.

If a linkage partition is used, differences are written to file as a value followed by two index numbers, followed by -- if available -- two location labels. Example:

    .487308    2    1    Calcutta      Bombay
    .835438    3    2    "New Delhi"   Calcutta
Note: the order used in the linkage partition file is lost.

If the option -d is used, then, for each pair of locations, the standard deviation of the calculated differences between those two locations will be returned in the output file specified by the option -d in the form of a difference matrix. Weights in the data files will be ignored.

Normalisation

At the top of the source code leven.c is a function normalise(), which is currently defined as follows:
    double normalise (double f, int len1, int len2, int minlen, int maxlen)
    {
        double
            result = 0;

        switch (normalise_function) {
            case 1:
                result = f / (len1 + len2);
                break;
            case 2:
                result = f / ((len1 > len2) ? len1 : len2);
                break;
            case 3:
                result = sqrt (f / (len1 + len2));
                break;
            case 4:
                result = f / minlen;
                break;
            case 5:
                result = f / maxlen;
                break;
            default:
                errit ("Normalisation function number %i undefined", normalise_function);
        }
        
        return result;
    }
This function is called as soon as the Levenshtein algorithm is applied to two strings. The arguments to normalise() are the value calculated by the Levenshtein algorithm, the lengths of both strings, and the minimum and maximum length of the least-cost alignment(s). (There may be several such alignments, of unequal length.)

The purpose of the function normalise() is to normalise the Levenshtein distance. This is especially needed in cases where not all pairs of strings are available for all pairs of locations, and the longer strings that are available would skew the overall results, if normalisation were omitted.

By default (case 1), the normalisation function normalises by dividing by the sum of the lengths of the two strings. This makes sense, because in cases where the two strings don't have any character in common, the Levenshtein distance is proportional to the sum of the lengths of the two strings.

You might want to experiment with alternate normalisation functions. You can select another predefined normalisation with the command line option -N (case 2 ... case 5). Check the source code to see what types of normalisation are available. You can easily add your own types of normalisation (case 6, etc), and select these with the -N option. If you do add to the definition of the function, make sure you save and compile the source under a different file name.

Equal weight for indel and substitution

Usually, with a plain Levenshtein comparison, the weight of an indel is half the weight of a substitution. With the option -e the weight of an indel is set equal to the weight of a substitution.

Character-based G.I.W.

G.I.W. or Gewichteter Identitätswert is a measure to express distance based on frequency of elements. If an element is rare, the co-occurrence of that element in two places suggests a strong correlation between those two places.

A word-based distance of G.I.W. can be calculated with the giw program. That method is only suitable when there are relatively few word variants in a dataset.

Character-based G.I.W. (cGIW) is a modification of the plain Levenshtein distance. The costs of a substitution are listed below. (The cost of an indel is 0.5 in both cases, or 1 if the option -e was used.)

plaincGIW
equal 0 f / F
different 1 1

f
The number of times the character in question is present, in the datafile being processed
F
The total count of all characters, in the datafile being processed

Binary comparison

If the option -B is used a binary comparison will be made, instead of a Levenshtein distance calculation. If two strings are not the same, the difference is set to 1 (or the weight of the datafile, if specified).

Minimum number of occurrences required for each variant

The option -f can be used to set a minimum number each variant form should be present in the data before it is actually used. If you set this values to 2, each variant that occurs only once will be ignored. The rational behind this is that infrequent variants may represent just noise.

This option is mainly useful for comparison of lexical variants. With phonetic data, variants that occur only once are very common, and usually don't represent noise.

Skip files with less than two variants

If the option -F is used, all data files that contain only one variant are skipped. This not only effects processing speed, but calculated differences as well. If there is only one variant form, it won't add to the sum of individual differences, but these data are counted, and used in the calculation of the average of the differences. This effect is notable unless the datafile contains an equal number of strings for each and every location.

Note that this option is effected by the option -f. A datafile may have more than one variant, but only one if infrequent variants are removed with the option -f. If you use option -F as well, the file will be skipped completely.