
Specifying weights for indels only
Use the option -i to specify a file that lists the indel weights. Substitution weights are calculated by adding the indel weights for each pair of character codes.
The indel file should be written as follows:
PRINT max
NEWLINE
FOR i = 1 TO max
PRINT w[i]
NEWLINE
max is the highest character code number used in the dialect
data.
w[i] is the weight for inserting or deleting a character with code i.
Note: All weights should be integers in the range 0 through 65535 (or 255 if compiled with LEVEN_SMALL defined).
Specifying weights for indels and substitutions
Use the option -s to specify a file that lists both indel weights and substitution weights.
The substitution file should be written as follows:
PRINT max
NEWLINE
FOR i = 1 TO max
FOR j = 0 TO i - 1
PRINT w[i,j]
NEWLINE
max is the highest character code number used in the dialect
data.
w[i,j], for j > 0, is the weight for substituting a character with code i for a character with code j, or vice versa.
w[i,0] is the weight for inserting or deleting a character with code i.
Note: All weights should be integers in the range 0 through 65535 (or 255 if compiled with LEVEN_SMALL defined).
A weight can be specified to a datafile (default 1), preceded by an asterisk. This can be used to alter the relative importance of one datafile compared to another. Only the last weight specified in a particular file will be used, and it will be used for all of that datafile, even for those strings preceding the weight specification.
Each location is introduced with a location number (counting from 1 upward) or a colon followed by a location label.
Words can be specified in two ways, with a leading minus as an ascii string, or with a leading plus as a sequence of decimal numbers.
Example:
* 1.2
5
- fiets
- fIts
+ 102 116 115
: GR7
- fietse
- fiets
- fiets
5
- flits
As you can see, each location can have more than one word. In this
example there are two locations: location number 5 (note that this
location is mentioned twice), which has four words, of which the third
is specified as a sequence of numbers; and the location with label GR7,
which has three words.
The order of the locations is immaterial. Not every location needs to be present in each data file. In the end, the sum of all differences between two locations is divided by the number of times that averaged difference was calculated, taking any weights specified in the datafiles into account.
The rest of this section is a bit technical. You can skip to the next section, if you want.
For this file, the actual difference between location number 5 and location GR7 is calculated by choosing a (near-)optimal one-by-one alignment of the strings of these locations. Optimal, meaning: the sum of the pair-wise differences is minimised. The actual difference is the average of these pair-wise differences. If two locations have different numbers of strings, all strings of one or both locations are duplicated once or more, until the locations have the same number of strings. In this case, the strings for location number 5 are duplicated three times, and those of location GR7 four times, resulting in twelve strings at each location.
The rationale for this procedure is as follows. Look at this example:
| Q: | A | B | B | A | A | A | |
| P: | |||||||
| B | 1 | 0 | 0 | 1 | 1 | 1 | |
| A | 0 | 1 | 1 | 0 | 0 | 0 | |
| A | 0 | 1 | 1 | 0 | 0 | 0 |
| P: Q: |
B | B |
A | A |
A | A |
B | B |
A | A |
A | A |
(variants repeated twice) (variants sorted to line up with first row) |
| differences: | 0 | 0 | 0 | 0 | 0 | 0 |
Note that determining the optimal one-by-one alignment is a variant of the Travelling Salesman Problem, which is computationally expensive. A heuristic is used that gives you a reasonable approximation.
: 1 2
: 2 3
Lines not starting with a colon are ignored.
The numbers are location index numbers, numbered from 1 to max. You can not use labels in this file. In the example, the differences between locations 1 and 2, and between locations 2 and 3 are calculated, but not between the locations with numbers 1 and 3.
By default, the difference between two locations is the (weighted) mean of all differences calculated between those two locations. If the option -m is used, the difference between two locations is the median of the calculated differences. Weights in the data files will be ignored. Note: this requires a lot of memory.
If a linkage partition is used, differences are written to file as a value followed by two index numbers, followed by -- if available -- two location labels. Example:
.487308 2 1 Calcutta Bombay
.835438 3 2 "New Delhi" Calcutta
Note: the order used in the linkage partition file is lost.
If the option -d is used, then, for each pair of locations, the standard deviation of the calculated differences between those two locations will be returned in the output file specified by the option -d in the form of a difference matrix. Weights in the data files will be ignored.
double normalise (double f, int len1, int len2, int minlen, int maxlen)
{
double
result = 0;
switch (normalise_function) {
case 1:
result = f / (len1 + len2);
break;
case 2:
result = f / ((len1 > len2) ? len1 : len2);
break;
case 3:
result = sqrt (f / (len1 + len2));
break;
case 4:
result = f / minlen;
break;
case 5:
result = f / maxlen;
break;
default:
errit ("Normalisation function number %i undefined", normalise_function);
}
return result;
}
This function is called as soon as the Levenshtein algorithm is applied
to two strings. The arguments to normalise() are the
value calculated by the Levenshtein algorithm, the lengths of both
strings, and the minimum and maximum length of the least-cost alignment(s).
(There may be several such alignments, of unequal length.)
The purpose of the function normalise() is to normalise the Levenshtein distance. This is especially needed in cases where not all pairs of strings are available for all pairs of locations, and the longer strings that are available would skew the overall results, if normalisation were omitted.
By default (case 1), the normalisation function normalises by dividing by the sum of the lengths of the two strings. This makes sense, because in cases where the two strings don't have any character in common, the Levenshtein distance is proportional to the sum of the lengths of the two strings.
You might want to experiment with alternate normalisation functions. You can select another predefined normalisation with the command line option -N (case 2 ... case 5). Check the source code to see what types of normalisation are available. You can easily add your own types of normalisation (case 6, etc), and select these with the -N option. If you do add to the definition of the function, make sure you save and compile the source under a different file name.
A word-based distance of G.I.W. can be calculated with the giw program. That method is only suitable when there are relatively few word variants in a dataset.
Character-based G.I.W. (cGIW) is a modification of the plain Levenshtein distance. The costs of a substitution are listed below. (The cost of an indel is 0.5 in both cases, or 1 if the option -e was used.)
| plain | cGIW | |
|---|---|---|
| equal | 0 | f / F |
| different | 1 | 1 |
This option is mainly useful for comparison of lexical variants. With phonetic data, variants that occur only once are very common, and usually don't represent noise.
Note that this option is effected by the option -f. A datafile may have more than one variant, but only one if infrequent variants are removed with the option -f. If you use option -F as well, the file will be skipped completely.