Details on what features and leven do with your data.

There are a few details to (mostly) features: it produces two inseparable output files, one in which the tokens are numbered, and a difference matrix listing the distance between each pair of token numbers. The distances run from 0.0 to 1.0, eg. 0.0, 0.5 and 0.7, but are stored as integers (resp. 0, 32767, 45874). To do this, features divides and clamps the token distances it calculates; see the documentation of features under the header SUBSTMAX.
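
As a rough illustration of that divide-and-clamp step, the Python sketch below maps a distance onto the integer scale. The function name encode_distance is made up for this example, and truncating (rather than rounding) is an assumption, chosen because it reproduces the figures above; features itself may handle rounding differently.

```python
def encode_distance(distance, substmax=1.0, top=65535):
    """Clamp a distance to 0.0..substmax and map it onto the integers 0..top."""
    clamped = min(max(distance, 0.0), substmax)
    # Truncation is an assumption; it matches the 0 / 32767 / 45874 example.
    return int(clamped / substmax * top)


if __name__ == "__main__":
    print([encode_distance(d) for d in (0.0, 0.5, 0.7)])
    # A value above SUBSTMAX is clamped before it is encoded.
    print(encode_distance(5.0, substmax=4.0))
```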

In an effort to make the distances from features similar to normal Levenshtein distances, if all values lie under (<, not ≤) 1.0 (65535) the numbers are scaled up (by features, after SUBSTMAX scaling, before file output) so that the largest number present becomes 1.0 (65535). If only those 0.0, 0.5 and 0.7 from before existed, they would become 0.0, 0.71 and 1.0, or rather, their integer equivalents.
This isn't noticeable when you have impossible comparisons built in (like consonant vs. vowel, costing a lot and truncated to 1.0) but may be when you're comparing very simple strings.
The reason it does this is so that you can use any scale of numbers inside features and still have it calculate distances roughly comparable to other Levenshtein distances, scaled between 0 and 1. I say roughly because two models are never directly comparable, and in the worst case separate runs with the same model may give results on a slightly different scale.
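
The scale-up described above can be sketched in Python as follows. This is only an illustration of the arithmetic, not the code features uses: the function name rescale_to_top is hypothetical, and integer (floor) division is an assumption about rounding.

```python
def rescale_to_top(values, top=65535):
    """Stretch integer distances so the largest becomes top.

    Nothing changes when some value already reaches top,
    or when all values are zero.
    """
    largest = max(values, default=0)
    if largest == 0 or largest >= top:
        return list(values)
    # Floor division is an assumption about how rounding is handled.
    return [v * top // largest for v in values]


if __name__ == "__main__":
    scaled = rescale_to_top([0, 32767, 45874])  # 0.0, 0.5 and 0.7
    print([round(v / 65535, 2) for v in scaled])
```

Run on the integer forms of 0.0, 0.5 and 0.7, this gives values that read back as 0.0, 0.71 and 1.0. A table that already contains 65535 is left alone, which is why the effect disappears once an impossible comparison is present.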

In fact, leven does a similar thing, though only when such a cost file is supplied, as when you use features in combination with leven. It appears to sum the costs as integers, divide by the normalisation number (depending on choice), and then by a scaling number (in the range 0 to 65535), to make it a floating point number again.
The reasoning is similar to the previous one, that of scaling roughly(!) up to the sort of distance that a regular Levenshtein distance would give, but the calculation is unlike the one features does: it looks at all the substitution values, takes the maximum and divides it by two; it also takes the largest indel value. Of the two figures it takes the larger, and uses that to divide the integer-scale cost mentioned before.
More symbolically, the scaling number is: max( max(substs)/2 , max(indels) ).
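
In Python the scaling number and its use could be sketched like this. The names leven_scale and leven_distance are invented for the example, and normaliser is a stand-in for whichever normalisation number leven was asked to use; none of these are actual leven internals.

```python
def leven_scale(substs, indels):
    """Scaling number: the larger of half the top substitution cost
    and the top indel cost."""
    return max(max(substs) / 2, max(indels))


def leven_distance(raw_cost, normaliser, substs, indels):
    """Turn a summed integer cost back into a floating point distance."""
    return raw_cost / normaliser / leven_scale(substs, indels)


if __name__ == "__main__":
    substs = [0, 32767, 65535]  # substitution costs from the table
    indels = [32767]            # indel costs
    print(leven_scale(substs, indels))
    # One maximal substitution, with no normalisation (normaliser 1):
    print(leven_distance(65535, 1, substs, indels))
```

With these figures a maximal substitution comes out at 2.0 and a single indel at just under 1.0, which is the sort of scale a plain Levenshtein distance with substitutions counted as two operations would give.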

This is not often a problem if you do a single run of features over all data and a single run of leven. However, if you want to ensure that the Levenshtein distances between runs of features and/or leven are directly comparable, you need to trick the package. The simplest way is to add a location with a string that contains all possible characters (or rather, one that produces the largest possible distance in your model, in most cases at least as an indel) and to ignore that location in your results.

Notes on features' configuration file

This is meant for potential model makers (for features). You can get most or all of this from the documentation, but since some of it isn't explicitly or clearly mentioned, reading this may save you from a headache or two.

The following is a somewhat altered, somewhat shortened version of the configuration file that comes with L04, with added commentary. It is not complete enough to work, just complete enough to give you an impression of the possibilities.

I've yet to explain the STATE part, but it's fairly straightforward.

#
# Further documented version of the almost-X-SAMPA example config that comes
# with L04. Also has a few modifications; don't depend on this config.
#
#
# Terms used:
#  - input:  a string of characters (eg. "be:\y")
#
#  - tokens: 
#       either 'input tokens' (units of meaning taken from the input, 
#         eg. "b", "e", ":\", "y")
#       or 'output tokens', which in the above would probably represent
#         a 'b', an altered 'e', and a 'y'.
#    The documentation doesn't explicitly make the input/output distinction,
#    but it does exist due to the pre/post-modifiers.  See also the 
#    head/premodifier/postmodifier description later.
#
#  - features: (typed) values assignable to each output token
#
#  - output: each uniquely featured, non-ignored output token gets a 
#            unique number, used in the table and .ftr file. 
#            Example: "ebe:\ye" would likely become "1 2 3 4 1",
#            and in the distance file there would probably be a
#            relatively small cost between 1 and 3.
#
#  Things work by setting the features for each (non-modifier) token
#  that is read (can be none);
#  - If all features are equal, it is taken to be the same token
#  - If not, the tokens are separated, and indel and subst values
#      calculated based on feature definitions and further configuration.
#
#  The calculated costs between the generated *token* identifiers are
#  stored in a file containing these costs (features.table.out) and can
#  be used with the -s option on leven. Note it uniquely belongs with the
#  <inputFileName>.ftr file that the same run of features generated, which 
#  stores just the token enumeration; the combination is fed to the 'leven'
#  program.
#  The general purpose of 'features' is to read in tokens (possibly
#  multi-character, as in eg. X-SAMPA or even unicode), assign feature
#  values to them and calculate the distances between them on a per-run 
#  basis, depending on the configuration and depending a little on the data.
#
#  'Features' is for a large part a difference table generator for the
#  comparison of input strings, which is already simpler and more flexible 
#  than tokenizing and making a full distance table by hand, especially
#  when considering modifiers such as diacritics.
#  It can also carry simple state (usable for eg. stress) so is potentially
#  more than that. (State, incidentally, means something can change a single
#   built-in state after the current and before the next token)
#
#
# General file syntax and structure:
#  Comments start with a hash (#), at the start of the line,
#    and apparently later on it too - with the exception of the H lines in
#    the TOKENS section, so that you can use #'s in token string recognition.
#  Generally, things are space or tab delimited, also meaning you
#    can't use spaces in strings.
#    (except, probably, when you use the tokenstring raw and escape it)
#  Everything's case sensitive. 
#    (hex strings can be written in both cases, though)
#  The file has five sections, which must be present and in the order:
#    DEFINES, FEATURES, TEMPLATES, INDELS, TOKENS
#    These are detailed below.



#########################################################################
DEFINES
########
#   Can optionally have each of the following, in any order. Defaults are:
#      VERSION 0,
#      TOP 65535,
#      SUBSTMAX 1.0,
#      INDEL becomes 0.5*SUBSTMAX unless it is defined 
#      METHOD SUM,
#      TOKENSTRING RAW,
#      START 0.

VERSION 2    
#   Can also be 0 or 1. The versions differ in the semantics of
#   feature set comparisons, and also the definition of the features.
#   The documentation notes 0 has a methodological bug, 1 and
#   2 don't. See below somewhere for details.
#   I suppose it would be best to write everything in the same
#   version to avoid confusion, probably version 2, but use as
#   you see logical.

TOP 65535
#   Can be either 255 or 65535, and is the latter by default.
#
#   Affects the features.table.out file, which encodes the token
#   distances (always in the range 0.0 to 1.0, see SUBSTMAX below)
#   as either 0 to 255 or 0 to 65535.
#
#   The point is that the generated tables file can get large, and for
#   leven to read it in means a fair bunch of memory use. Using values 
#   0..255 means leven can use an 8-bit int instead of a 16-bit int, 
#   saving memory. (This applies only to the version of leven 
#   specifically compiled that way. It isn't really an issue on 
#   modern computers.)
#   Choose 65535, or 255 if you use the alternative leven.

SUBSTMAX 8.0    
#   Defines range (0.0 to this) in which distances between feature sets
#   work. Because of this, it essentially defines what float value TOP
#   corresponds to - the file contains integer values on the
#   scale of 0 .. TOP that correspond to the internal calculated distances
#   in range 0.0 .. SUBSTMAX. (Higher values get clamped to SUBSTMAX) 
#
#   Examples:
#   If SUBSTMAX is 4.0 (and TOP is the usual 65535), indel/subst values
#   of eg. 0.0, 1.0, 2.0, 4.0 and 5.0 are encoded as roughly (there
#   are some rounding details; these are reasoned figures, not tested) 
#   0, 16383, 32767, 65535 and 65535, ie. 0/4, 1/4, 2/4 and 4/4 of TOP,
#   with 5.0 clamped to SUBSTMAX. (With the SUBSTMAX 8.0 set above, the
#   same values would instead represent 0/8, 1/8, 2/8, 4/8 and 5/8.)


INDEL 1.00
# Cost of insertion or deletion of tokens. Overrides the default, 
# which is SUBSTMAX/2. If there is a featureset defined in the INDELS
# section, INDEL is overridden in turn.

#   Note: SUBSTMAX and INDEL can take float values. 


METHOD SUM      
# Given two feature sets, defines how the distance is
# to be calculated.  Default is SUM; possible values:
#   MINKOWSKI rhoValue    (where rhoValue is a float)
#   SUM,                  (equivalent to MINKOWSKI 1)
#   EUCLID,               (equivalent to MINKOWSKI 2)
#   SQUARE,
# See documentation for more detail.


TOKENSTRING RAW 
#  Possible values: RAW and ESC.
# When ESC, you can use \033, \x0f, or \d047 type strings to
# use octal, hex and decimal escaped characters to be able to
# accept non-ASCII characters cleanly.   The only drawback
# seems to be that a backslash (used in eg. XSAMPA) then 
# must be escaped itself: \\ 


START 0       # STATE at start of string (default: 0)
# START defines what state you are in at the start of each input
# token string. In processing, you can accept things in a state 
# machine sort of way, by altering features based on conditions, 
# by making conditions depend on the special-purpose, always
# defined feature STATE.



#########################################################################
FEATURES
########
# Defines the features available to be set for each token. 
#
# Features are essentially typed variables with a possible 'undefined'
# value. The types are:
#    B: bitmap   (integer used as bitmap)
#    N: numeric  (float)
#    D: discrete (integer numbers)
#
# 
# In VERSION 0, these should be used in one of these two forms:
#    B|N|D label
#    B|N|D weight label
# eg. "B 3 type" for a type that weighs three times as much,
#  or "B sonorant" to be used, for example, as a boolean value.
# 
# In VERSION 1 and VERSION 2, there are three forms:
#    B|N|D label
#    B|N|D default_diff label
#    B|N|D default_diff weight label
# eg. "D 1 0.5 length"
#
# Note that when they are not set (by an F line below) to a value, they
#   stay undefined for that token, and the distance calculation semantics
#   are based on the version defined.
# The default difference applies at a per-feature level, when a feature is
#   undefined in both or one of the two compared tokens (depends on version)
#
#
# Version behaviour differences:
# Version 0
# - No common features defined:
#         distance is SUBSTMAX
# - Individual feature defined in only one token: 
#         distance is 0
#
# Version 1
# - No common features defined:
#         distance calculated with default distances (***checkme)
# - Individual feature defined in zero or one of the two:
#         distance = defaultDifference*weight   (both 1 by default)
#
# Version 2 
# - No common features defined:
#         distance calculated with default distances (checkme)
# - Individual feature defined in zero of two:
#         distance = 0
#   Individual feature defined in one of two: 
#         distance = defaultDifference*weight   (both 1 by default)
#
#
#
#  Type choice considerations:
#   B: When at least one bit overlaps (eg, between 2 and 3, 00000010 and 
#      00000011 in binary), the distance is zero, else it is the weight.
#   D: When these are equal the distance is zero, else it is the weight.
#   N: The distance is <the absolute value of the difference> times 
#      <the feature weight>.
#
# When you don't change the weights (1 by default), these three simplify 
# to '0 or 1', '0 or 1', 'the absolute difference,' respectively.
# Weights multiply those distances.
#
# Numeric differences (N) weigh on the same scale as weights - a
# difference of 3 and a weight of 1 has in itself the same effect as a
# difference of 1 and a weight of 3, but there are often style,
# readability and other factors to consider - eg. weights of different
# features looking and working comparably.
# 


#  Type as in vowel/consonant. A bit-set; bit 1 indicates a vowel, bit 2 a consonant.
B 1 3 type

#  vowel features
N 4 v_advancement       # front .. back : -2 .. 2
N 3 v_high              # open .. close : -1.5 .. 1.5
N 1 v_long
N 1 v_rounded           # unrounded .. rounded : -.5 .. .5
#  consonant features
N 1 c_place
N 1 c_high
N 1 c_distributed
N 1 c_voice
N 1 c_nasal
N 1 c_stop
N 1 c_glide
N 1 c_lateral
N 1 c_fricative
N 1 c_trill
N 1 c_aspire
#  general features
D 1 .7 breathy
N 1 .7 stress

# some diacritics and other explicitly named features.
N 1 .5 _w
N 1 .5 _m
N 1 .5 _r
N 1 .5 _X
N 1 .5 _>
N 1 .5 _G\
N 1 .5 _k
N 1 .5 _"
N 1 .5 _R
N 1 .5 _v
N 1 .5 _<
N 1 .5 _j
N 1 .5 =
N 1 .5 .
N 1 .5 ~
N 1 .5 `\
N 1 .5 -
N 1 .5 +
N 1 .5 *\
N 1 .5 *


#########################################################################
TEMPLATES
########
# Templates act like procedures - you call them by their name (from the TOKENS), and they
# set values, possibly depending on conditions. They don't add functionality
# as such, but you can avoid redundancy and keep things more organized.
#
# The format is: 
#   "T templatename"
# followed by the definition, any number of lines of the form
#   "[condition] F featurename operation value"
# 
# The value is required for all operations except U (undefine feature),
# which takes none.
#
# The condition is optional, and possibly fairly rare. 
# This tests the single, special, internal (and alterable) feature STATE
# using one of the four possible condition operators:
#   ":  value"    Bitwise AND, condition is true if result is nonzero
#   "^: value"    Inverse of the last: condition is true if result is zero
#   "=  value"    Value equality test.
#   "^= value"    Value inequality test.
#
# (Note, incidentally, that when you change state that state will only be
#  changed by the time you process the next token - this eliminates some
#  ways you could use this)
#
#
# The F lines are feature operations, often setting a B or D value,
# or adding to an N value. There are a few possible operations:
#   
# Available operators for...          B  N  D
#   =       assignment                y  y  y
#   +       bitset/addition(*1)       y  y
#   -       bitclear/subtraction(*2)  y  y
#   *       multiplication               y 
#   !       xor                       y
#   U       make undefined            y  y  y
#
# (*1) for B, this sets the bits that are set in the given value (bitwise OR)
#      for N, it is numerical addition.
# (*2) for B, this clears the bits in the given value,
#      for N, it is numerical subtraction
#


T vowel
F type = 1
F v_long = 1
F v_rounded = -.5
F c_place = 3
F c_voice = 0
F breathy = 0


F stress = 0     # no stress
: 1 F stress = 2   # primary stress
: 2 F stress = 1   # secondary stress

F STATE - 3


F _w = 0
F _m = 0
F _r = 0
F _X = 0
F _> = 0
F _G\ = 0
F _k = 0
F _" = 0
F _R = 0
F _v = 0
F _< = 0
F _j = 0
F = = 0
F . = 0
F ~ = 0
F `\ = 0
F - = 0
F + = 0
F *\ = 0
F * = 0

T v_close
F v_high = 1.5

T v_near-close
F v_high = 1

T v_close-mid
F v_high = .5

T v_mid
F v_high = 0

T v_open-mid
F v_high = -.5

T v_near-open
F v_high = -1

T v_open
F v_high = -1.5

T v_front
F v_advancement = -2

T v_central
F v_advancement = 0

T v_back
F v_advancement = 2

T v_rounded
F v_rounded = .5

#=======

T consonant
F type = 2
F c_place = 3
F c_voice = 0
F c_nasal = 0
F c_stop = 0
F c_glide = 0
F c_lateral = 0
F c_fricative = 0
F c_trill = 0
F c_high = 0
F c_distributed = 0
F c_aspire = 0
F v_long = 1
F v_rounded = 0
F v_advancement = 0
F breathy = 0

F stress = 0
F _w = 0
F _m = 0
F _r = 0
F _X = 0
F _> = 0
F _G\ = 0
F _k = 0
F _" = 0
F _R = 0
F _v = 0
F _< = 0
F _j = 0
F = = 0
F . = 0
F ~ = 0
F `\ = 0
F - = 0
F + = 0
F *\ = 0
F * = 0

T c_voice
F c_voice = 1

T c_nasal
F c_nasal = 1

T c_stop
F c_stop = 1

T c_glide
F c_glide = 1

T c_lateral
F c_lateral = 1

T c_fricative
F c_fricative = 1

T c_trill
F c_trill = 1

T c_bilabial
F c_place = 1
F c_distributed = 1

T c_labiodental
F c_place = 1

T c_dental
F c_place = 1.5

T c_alveolar
F c_place = 2

T c_postalveolar
F c_place = 2
F c_high = 1
F c_distributed = 1

T c_retroflex
F c_place = 2.5

T c_palatal
F c_place = 3
F c_high = 1
F c_distributed = 1

T c_velar
F c_place = 4
F c_high = 1

T c_uvular
F c_place = 4

T c_pharyngeal
F c_place = 4.5

T c_glottal
F c_place = 5


#set both vowel and consonant type (bitmap feature)
#may leave a lot of default distances - test me.
T both
F type = 3


#=======

T extra
F type = 8


#########################################################################
INDELS
########
# Instead of using a fixed indel for everything, you can here specify 
# what is essentially a 'neutral' token feature set, and the token 
# distance (cost) will be used for each indel, instead of the INDEL
# defined in DEFINES
# 
# Essentially, this is a featureset with no related token. In this
# example, it calls both consonant and vowel templates, and the 
# sort-of-general case for both, so that all common features are defined;
# two features are specifically set too; type mostly to match both 
# consonant and vowel (because each template assigns the type, it would
# otherwise end up set to vowel. 255 has all bits set, so it would match
# new types too)
#
# In the end, this is a slightly rough way to make more extreme vowels
# and consonants (eg. y, ) cost more to insert and delete than relatively
# neutral (mouth-central) ones (eg. @, h).

T consonant c_glottal c_fricative   # h
T vowel v_mid v_central         # @
F v_rounded = 0                 
F type = 255                    # matches all possible types - consonants, vowels, both, etc.



#########################################################################
TOKENS
########
# Defines tokens to be recognized, and what to do when they are met.
# conditions and actions ("F" lines) are explained further in TEMPLATES, above.
#
# The full format of the below is any number of:
#           [condition] P|H|M|PI|HI|MI tokenstring
# followed by zero or more of (note that zero will still mean an output token):
#           [condition] F featurename operator [value]
# and/or:   [condition] T templateName
# 
# A tokenstring is a number of characters. If more than one tokenstring can match,
# (eg. the next two characters to be accepted are ":\" and there is an entry
#  for ":" as well as ":\") the longest tokenstring that accepts it is chosen.
#
# Most of these lines will probably call templates, and occasionally fine tune
# a feature - ie. an H line followed by a T line, and sometimes also by an F
# line.
#
#
# P, H and M (premodifier, head, (post)modifier) all do the same thing, accept
# token strings. The different letters help in processing in a particular
# order; there are two major rules to the order things are done in. 
#
# The following applies to each section of the input that has P's, H's
# and M's, in that order.
# Often enough, the input will be mostly a sequence of H's (eg. directly
# phonetic symbols) and have the occasional M, for example diacritics that
# change some aspect of the features just set by the H.
#
# To steal the example from the documentation, if you have input that
# has two P's, an H and two M's, you can pretend they will be numbered like:
#   P1 P2 H M1 M2
#
# Based on this numbering:
# - H's are applied first
# - M's are done in ascending order
# - P's are done in descending order
#
# So for the example:
#   H M1 M2 P2 P1
# Yes, that's a little counterintuitive from the naming. It's the way it is.
#
# MI, HI and PI are used to Ignore tokens. Technically, one would have
# sufficed; I'm guessing there are three so that you can just add an I 
# while developing a config file.

PI <
HI >

P "
F STATE + 1

P %
F STATE + 2

#======= vowels

H i
T vowel v_close v_front both

H y
T vowel v_close v_front v_rounded

H I
T vowel v_near-close v_front

H Y
T vowel v_near-close v_front v_rounded

H e
T vowel v_close-mid v_front

H 2
T vowel v_close-mid v_front v_rounded

H E
T vowel v_open-mid v_front

H 9
T vowel v_open-mid v_front v_rounded

H {
T vowel v_near-open v_front

H a
T vowel v_open v_front

H &
T vowel v_open v_front v_rounded

H 1
T vowel v_close v_central

H }
T vowel v_close v_central v_rounded

H 8
T vowel v_close-mid v_central v_rounded

H @
T vowel v_mid v_central
F v_rounded = 0

H 6
T vowel v_near-open v_central

H M
T vowel v_close v_back

H u
T vowel v_close v_back v_rounded both

H U
T vowel v_near-close v_back v_rounded

H 7
T vowel v_close-mid v_back

H o
T vowel v_close-mid v_back v_rounded

H V
T vowel v_open-mid v_back

H O
T vowel v_open-mid v_back v_rounded

H A
T vowel v_open v_back

H Q
T vowel v_open v_back v_rounded

#======= consonants

H p
T consonant c_bilabial c_stop

H b
T consonant c_bilabial c_stop c_voice

H m
T consonant c_bilabial c_nasal c_voice

H p\
T consonant c_bilabial c_fricative

H B
T consonant c_bilabial c_fricative c_voice

H w
T consonant both

H F
T consonant c_labiodental c_nasal c_voice

H f
T consonant c_labiodental c_fricative

H v
T consonant c_labiodental c_fricative c_voice

H v\
T consonant c_labiodental c_glide c_voice

H t
T consonant c_alveolar c_stop

H d
T consonant c_alveolar c_stop c_voice

H n
T consonant c_alveolar c_nasal c_voice

H r
T consonant c_alveolar c_trill c_voice

H 4
T consonant c_alveolar c_voice

H T
T consonant c_dental c_fricative

H D
T consonant c_dental c_fricative c_voice

H s
T consonant c_alveolar c_fricative

H z
T consonant c_alveolar c_fricative c_voice

H S
T consonant c_postalveolar c_fricative

H Z
T consonant c_postalveolar c_fricative c_voice

H K
T consonant c_alveolar c_lateral c_fricative

H K\
T consonant c_alveolar c_lateral c_fricative c_voice

H r\
T consonant c_alveolar c_glide c_voice

H l
T consonant c_alveolar c_lateral c_voice

# velarized
H l_G   # ????
T consonant c_velar c_lateral c_voice

H c
T consonant c_palatal c_stop

H J\
T consonant c_palatal c_stop c_voice

H J
T consonant c_palatal c_nasal c_voice

H C
T consonant c_palatal c_fricative

H j
T consonant c_palatal c_glide c_voice both

H L
T consonant c_palatal c_lateral c_voice

H k
T consonant c_velar c_stop

H g
T consonant c_velar c_stop c_voice

H N
T consonant c_velar c_nasal c_voice

H x
T consonant c_velar c_fricative

H G
T consonant c_velar c_fricative c_voice

H M\
T consonant c_velar c_glide c_voice

H q
T consonant c_uvular c_stop

H G\
T consonant c_uvular c_stop c_voice

H N\
T consonant c_uvular c_nasal c_voice

H R\
T consonant c_uvular c_trill c_voice

H X
T consonant c_uvular c_fricative

H R
T consonant c_uvular c_fricative c_voice

H ?
T consonant c_glottal c_stop

H h
T consonant c_glottal c_fricative

H h\
T consonant c_glottal c_fricative c_voice

H H
T consonant

H W
T consonant

H s\
T consonant

H z\
T consonant

# interpreted as a sound between postalveolar S and retroflex s`
H S`
T consonant

H 4\`
T consonant

H r\`
T consonant c_retroflex c_glide

#=======

# labialized
M _w
F _w = 1

M _h
F c_aspire + 1

M _h\
F c_aspire + 1

M _0
F c_voice = 0

# more rounded
M _O
F v_rounded + .2

# lowered
M _o

# less rounded
M _c
F v_rounded - .2

M _+
F v_advancement - 1
F c_place - .5

M _-
F v_advancement + 1
F c_place + .5

M _d
F c_place = 1.5

# laminal
M _m
F _m = 1

# raised
M _r
F _r = 1

# extra short
M _X
F _X = 1

# ejective
M _>
F _> = 1

M _t
F breathy = 1

M _G\
F _G\ = 1

M _k
F _k = 1

M _"
F _" = 1

M _R
F _R = 1

M _v
F _v = 1

# implosive
M _<
F _< = 1

# palatalised!
M _j
F _j = 1

M =
F = = 1

M :
F v_long + 1

M .
F . = 1

M ~
F ~ = 1

# rhoticity
M `\
F `\ = 1

M -
F - = 1

M +
F + = 1

#=======

P *\
F *\ = 1

M *
F * = 1

#############################################
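
The processing order described in the TOKENS comments (heads first, then postmodifiers in ascending order, then premodifiers in descending order) can be sketched in Python as below. This is only a model of the ordering rule as stated there: the function name application_order and the (kind, tokenstring) pair representation are assumptions made for the example, not anything features exposes.

```python
def application_order(group):
    """Return one P.. H M.. group in the order its tokens are applied.

    group is a list of (kind, tokenstring) pairs in input order,
    where kind is "P", "H" or "M".
    """
    pres = [t for t in group if t[0] == "P"]
    heads = [t for t in group if t[0] == "H"]
    posts = [t for t in group if t[0] == "M"]
    # Heads first, postmodifiers as read, premodifiers reversed.
    return heads + posts + pres[::-1]


if __name__ == "__main__":
    group = [("P", "P1"), ("P", "P2"), ("H", "H"), ("M", "M1"), ("M", "M2")]
    print(" ".join(name for _, name in application_order(group)))
```

For the P1 P2 H M1 M2 example this prints the tokens in the order H M1 M2 P2 P1.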