Roughly speaking, the dialectometric processing (reading in different variations, calculating dialect distances in one of many possible ways) yields a difference matrix of dialect distances. The data in the difference matrix can be grouped into a dendrogram, or turned via clustering into generalized distances, often as a step towards creating a map:
Note that this hides variations and a lot of functionality. Also, there is an intermediate file format between every two programs, so at any point you can take the data out and use it another way if you wish.
This is meant for people who have seen the above diagram, have done some basic reading (particularly at least skimmed the L04 tutorials) and have a basic idea of what the package can do, but don't know where to start putting in data.
This graph is rather packed with information, but is still an incomplete summary. Some things, mostly miscellaneous abilities, are omitted from the graph to make the graph more readable; see details under the images.
Made with Graphviz. The legend and the main diagram are also available as PDF.
There are two main intermediate points for the data - the dialect files and the difference matrix. Generally, you'll probably want this package for its dialectometric part, which produces numbers expressing distance.
It is of course possible to inject data into the process, and extract data from it, at various points. For example, the graph might suggest you should use 'perfiles', while it's probably more likely that you'll preformat your data into dialect files yourself - I've never used perfiles myself. You may only want the Levenshtein distances for a number of words. You may already have a difference matrix, possibly unrelated to dialects, and wish to do, say, a clustering analysis on it.
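To make the "you may already have a difference matrix" case concrete, here is a minimal sketch of average-linkage clustering in plain Python. It is not the package's own clustering program and does not read its file formats; the function name, the place names and the toy distances are all made up for illustration. It only shows the idea of repeatedly merging the two closest groups, which is the information a dendrogram is drawn from.

```python
from itertools import combinations


def average_linkage(labels, dist):
    """Agglomerative clustering (average linkage) on a symmetric distance table.

    labels: list of item names
    dist:   dict mapping frozenset({a, b}) -> distance between items a and b
    Returns the merge steps as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [(name,) for name in labels]
    merges = []

    def between(c1, c2):
        # Average of all pairwise item distances between two clusters.
        pairs = [dist[frozenset((a, b))] for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the closest pair of current clusters and merge it.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: between(*p))
        merges.append((c1, c2, between(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Hypothetical toy matrix for four places (values invented for illustration).
places = ["A", "B", "C", "D"]
toy = {
    frozenset(("A", "B")): 0.10,
    frozenset(("A", "C")): 0.40,
    frozenset(("A", "D")): 0.50,
    frozenset(("B", "C")): 0.30,
    frozenset(("B", "D")): 0.60,
    frozenset(("C", "D")): 0.20,
}

for left, right, d in average_linkage(places, toy):
    print(left, "+", right, "at", round(d, 3))
```

With this toy data, A and B merge first, then C and D, and finally the two pairs merge. Other linkage choices (single, complete, and so on) only change the `between` function.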
The part that turns word data into dialect files is probably more confusing than it should be. I myself have always transformed the data into something leven or features (I haven't used xstokens) can understand, when necessary doing what perfiles does myself.
In any case, the point is making files with word data in such a way that when you run other programs over it, they do exactly what you were originally planning -- which is not what this cluster suggests to the casual glancer.
The general purpose is to go from words to distances, and there are numerous ways of doing this. You'll usually start with dialect files - files that each contain a number of words in a single dialect. Unless you're specifically interested in the Gewichteter Identitätswert, you'll then use the leven program to get the Levenshtein distance. Even then there are options.
The dialect file needs some elaboration. The general format can take a string either as-is, or as numbers indicating the byte values of the ASCII (or extended ASCII) representation of the string, e.g. "97 32 98" for "a b". The same format is also used for the intermediate files generated by xstokens and features, which only use the numerical format, but there the numbers represent an enumeration (shared with the cost file they also generate).
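As a small illustration of the numeric form, the following sketch converts between a string and its space-separated decimal byte values. The helper names are hypothetical, and Latin-1 is only an assumption standing in for "extended ASCII". Only the value sequence is shown; whatever line prefixes or item labels the real dialect file format uses are left out here.

```python
def to_numeric(text, encoding="latin-1"):
    """Return the space-separated decimal byte values of text."""
    return " ".join(str(b) for b in text.encode(encoding))


def from_numeric(numbers, encoding="latin-1"):
    """Inverse of to_numeric: decode a string of decimal byte values."""
    return bytes(int(n) for n in numbers.split()).decode(encoding)


print(to_numeric("a b"))          # 97 32 98
print(from_numeric("97 32 98"))   # a b
```

Note that a is 97 and a space is 32 in decimal (20 in hexadecimal).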
You can use the leven utility on strings directly, in which case it does not conceptually weight letters at all; this is the simplest form of Levenshtein distance, the one you probably read up on somewhere, based solely on whether two characters are identical or not.
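For reference, the unweighted form can be sketched in a few lines of Python. This is the textbook algorithm, not leven's source; the actual program may differ in details such as normalisation, so treat it only as a picture of the idea.

```python
def levenshtein(a, b):
    """Edit distance with unit cost for insertion, deletion and substitution."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if identical
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Every character is either the same or different here; nothing in between.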
You can also let the package interpret letters perceptually, and calculate
the distance between each pair. You can do this by preprocessing in one of two
ways - with xstokens (a little simpler) or features (more flexible), which
both basically act as preprocessors for leven.
The supplied example configurations try to resolve X-SAMPA into
further abstracted perceptual units.
For example, the features config defines
how to split the input into tokens (say j\ as one token, or an a with :\ directly
after it to modify it, which is also essentially one token, a modified a),
and how to figure out the distance between each pair of uniquely featured tokens.
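The two jobs described above (tokenizing, then assigning a distance to each pair of tokens) can be sketched as follows. This does not use the syntax of the features config file or the package's real feature tables: the token pattern, the feature values, the length weights and the insertion/deletion cost are all invented for illustration.

```python
import re

# Hypothetical token pattern: a base character, optionally a backslash that
# belongs to the symbol (as in j\), optionally a length mark (: or :\).
TOKEN = re.compile(r"(.\\?)(:\\?)?")

# Made-up values on two abstract feature dimensions per base symbol.
BASE = {"a": (0.0, 0.5), "e": (0.5, 0.0), "i": (1.0, 0.0), "j\\": (1.0, 0.2)}
# Made-up length feature: none, half-long (:\), long (:).
LENGTH = {"": 0.0, ":\\": 0.5, ":": 1.0}
INDEL = 1.0  # assumed cost of inserting or deleting one token


def tokenize(word):
    """Split a word into tokens, keeping modifiers attached to their base."""
    return [m.group(0) for m in TOKEN.finditer(word)]


def features(token):
    """Feature vector of a token: base features plus a length feature."""
    base, mark = TOKEN.fullmatch(token).groups()
    return BASE[base] + (LENGTH[mark or ""],)


def token_distance(t1, t2):
    """Sum of absolute feature differences between two tokens."""
    return sum(abs(x - y) for x, y in zip(features(t1), features(t2)))


def weighted_levenshtein(s, t):
    """Levenshtein over token lists with graded substitution costs."""
    previous = [j * INDEL for j in range(len(t) + 1)]
    for i, ts in enumerate(s, start=1):
        current = [i * INDEL]
        for j, tt in enumerate(t, start=1):
            current.append(min(
                previous[j] + INDEL,                       # delete ts
                current[j - 1] + INDEL,                    # insert tt
                previous[j - 1] + token_distance(ts, tt),  # substitute
            ))
        previous = current
    return previous[-1]


print(tokenize("a:\\j\\a"))
print(weighted_levenshtein(tokenize("a:\\i"), tokenize("ai")))
```

With these invented values, a half-long a and a plain a differ only a little, so the two example words end up much closer than they would under the all-or-nothing comparison.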
See also, perhaps, more details on the features config file and related behaviour.