Lexurgy Sound Changer Reference
This is an in-depth explanation of the Lexurgy SC rules language. For a gentler introduction, see the tutorial.
Overall Structure
Lexurgy sound changes consist of any number of declarations followed by any number of rules.
Declarations define concepts that can be used in the rest of the file: features, symbols, diacritics, classes, and elements.
Most rules define how input words change into output words. Each word in the input lexicon is passed through all rules in the order they're declared, transforming it one step at a time into an output word. There are also a few special rule types that alter the program behaviour rather than directly transforming words: deferred rules, cleanup rules, syllable rules, and intermediate romanizers.
Comments
The # character indicates the start of a comment. Lexurgy ignores everything between the # and the end of the line (including other # characters).
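As a small illustration (the rule name and sounds are made up for this example), a comment can take up a whole line or trail an expression:

```lsc
# This whole line is a comment.
vowel-raising:
    e => i / _ n    # ignored from the first # onward, including this second #
```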
Escapes
The characters \ , = > ( ) [ ] { } * + ? / - _ : ! $ @ # &, as well as the digits 0 to 9, are part of Lexurgy's syntax, and can't be used as sounds by themselves. If you want to use one of these as a sound, you have to put a backslash (\) in front of it:
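A minimal sketch, assuming an input lexicon that spells a glottal stop with a literal question mark and a front rounded vowel with a literal 2 (both the spelling convention and the rule name are hypothetical):

```lsc
# Hypothetical cleanup rule: each escaped character is treated as a plain sound
respell-ascii:
    \? => ʔ
    \2 => ø
```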
Whitespace
Whitespace is sometimes significant in the SC rules language.
Lexurgy generally ignores blank lines, indentation, and trailing whitespace:
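For instance, the following oddly laid-out snippet (rule names are placeholders) should be read as two ordinary rules:

```lsc
rule1:
a => b


      rule2:
           b => c
```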
(Lexurgy can tell where one rule ends and another begins because rule names must end with a colon.)
However, line breaks are significant. Wrapping a line at an arbitrary point is likely to cause a syntax error:
Line wrapping is allowed after the =>, /, and // operators:
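A sketch of a wrapped expression, with every break placed immediately after one of those three operators (the rule itself is hypothetical):

```lsc
# Same meaning as: {k, ɡ} => {c, ɟ} / _ {i, e} // s _
palatalization:
    {k, ɡ} =>
        {c, ɟ} /
        _ {i, e} //
        s _
```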
Names
There are several kinds of structures in Lexurgy that you have to name, including rules, features, and classes.
For most structure types, the name must consist entirely of plain Latin letters and numbers (uppercase or lowercase), and must have at least one letter. So x, eEeEeE, l3xur9y, and 4EVA are valid names, while 12, my_rule, and bõò are not.
Rule names follow the above restrictions, except that they also allow any number of non-consecutive hyphens between the letters and numbers. So my-rule, a-b-c-d-e, and easy-as-1-2-3 are valid rule names, while my--rule, -abcde-, and 1-2-3 are not. These hyphens are meant to be used as word separators; indeed, the web interface replaces them with spaces in the output table.
Case Sensitivity
Generally, Lexurgy is case-sensitive: a and A are different sounds, and you can have a rule called lenition followed by another rule called Lenition. However, all keywords can be written either with an initial capital or all lowercase, with no change in meaning: classes can be declared with class or Class, syllables can be set to explicit or Explicit, and so on.
Declarations
Feature Declarations
Features allow breaking down sounds into simpler components, which rules can manipulate separately. A feature declaration consists of the keyword feature followed by a definition of one or more features.
A feature represents a dimension across which sounds can vary; individual sounds must be assigned a single value for each feature. For example, you might declare a feature called height, representing the vertical location of the tongue within the mouth when pronouncing a vowel. Then you might assign the sound /a/ the value low for its height feature, and the sound /i/ the value high for its height feature.
Simple Features
Binary features are declared using just their name: feature b declares a binary feature called b. Binary features have three values: plus (written +b), minus (written -b), and absent (written *b). Any sound that isn't explicitly assigned a value for binary feature b has the value *b by default.
Univalent features are declared using a plus sign before the name: feature +u declares a univalent feature called u. Univalent features have only two values: plus (written +u) and minus (written either -u or *u). Any sound that isn't explicitly assigned a value for univalent feature u has the value -u by default.
Multiple binary or univalent features can be declared on the same line, separated by commas. For example, feature a, +b, c declares two binary features a and c, and a univalent feature b. This has the same meaning as declaring them on separate lines; it's just more compact.
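A short sketch of the three forms side by side; the feature names are illustrative and not part of Lexurgy itself:

```lsc
# Binary: values are +voiced, -voiced, and *voiced (absent, the default)
Feature voiced
# Univalent: values are +nasal and -nasal (also written *nasal, the default)
Feature +nasal
# Two binary features (round, back) and one univalent feature (long) on one line
Feature round, +long, back
```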
Multivalent Features
Multivalent features can have any number of values. They're declared by writing the feature's name, followed by a comma-separated list of the feature's values in parentheses: feature m(v1, v2, v3) declares a multivalent feature called m with the three values v1, v2, v3.
Multivalent features also always have an absent value, written as the feature name with an asterisk in front of it. So the feature above actually has four values: v1, v2, v3, and *m. Any sound that isn't explicitly assigned a value for multivalent feature m has the value *m by default.
While you can't disable the absent value, you can give it a name: after feature a(*v1, v2, v3), the feature a has only the three values v1, v2, and v3. It's still valid to write *a, but this means the same thing as v1. Since v1 is the absent value, it's also the default value for sounds that don't have a different value for a assigned.
Unlike binary and univalent features, each multivalent feature must be declared on its own line.
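A minimal sketch with made-up feature and value names, showing one multivalent feature with an unnamed absent value and one with a named absent value:

```lsc
# Four values in total: labial, alveolar, velar, and the absent value *place
Feature place(labial, alveolar, velar)
# The absent value is named low, so *height and low mean the same thing
Feature height(*low, mid, high)
```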
Syllable-Level Features
A syllable-level feature is assigned to an entire syllable, rather than only one sound. This can be used to model phenomena that operate on syllables, like stress and syllable weight.
Any feature can be marked as a syllable-level feature by adding the keyword (syllable) before the feature definition. So feature (syllable) a declares a binary syllable-level feature, while feature (syllable) b(v1, v2, v3) declares a multivalent syllable-level feature.
Note that the (syllable) modifier only applies to the immediately following definition: feature (syllable) a, b, c declares one syllable-level feature a and two ordinary features b and c. To declare three syllable-level features, you need to use feature (syllable) a, (syllable) b, (syllable) c.
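The scope of the modifier can be sketched as follows; all feature and value names here are assumptions chosen for the example:

```lsc
# A multivalent syllable-level feature
Feature (syllable) stress(*unstressed, secondary, primary)
# Only heavy is syllable-level; long is an ordinary sound-level feature
Feature (syllable) +heavy, +long
# Both hightone and creaky are syllable-level
Feature (syllable) +hightone, (syllable) +creaky
```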
Symbol Declarations
By default, Lexurgy treats each character in a word as a separate sound, with all features set to their default values. You can change this behaviour with symbol declarations.
Multi-Character Symbols
A multi-character symbol declaration tells Lexurgy to treat a particular sequence of characters as a single sound. For example, if symbol tʃ is declared, any tʃ found in input words is unaffected by rules that change t or ʃ alone, and rules that count sounds count tʃ as one sound.
Several multi-character symbols can be declared in one line by separating them with commas: symbol tʃ, dʒ declares both tʃ and dʒ as symbols.
Note that Lexurgy doesn't automatically combine adjacent characters that match a symbol declaration when they appear as the result of rules:
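A sketch of the kind of setup being described. The devoicing rule and the exact form of the ts-frication rule are assumptions made for illustration, not the original example:

```lsc
Symbol ts

# Hypothetical rule that produces a t sound followed by an s sound
devoicing:
    {d, z} => {t, s}

# Assumed form of the rule: it targets the single sound ts
ts-frication:
    ts => s
```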
The ts-frication rule applies to tsatsa, whose initial ts sequences were interpreted as the sound ts. But dzadza only acquires a ts sequence piecemeal, by the d becoming a t and then the z becoming an s, so Lexurgy still sees it as starting with a t sound followed by an s sound. If you want arbitrary sequences of a t and an s to become the sound ts, you have to add a rule that does that:
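One way to write such a rule, assuming ts has been declared as a symbol (the rule name is hypothetical):

```lsc
# The input is a sequence of two sounds (note the space); the output is the single sound ts
fuse-ts:
    t s => ts
```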
When breaking up input words into sounds, Lexurgy always chooses the longest possible symbols, moving from left to right. For example, if ts and sh are both declared as symbols, tsh will be treated as ts followed by h, not t followed by sh.
Feature Matrix Symbols
A feature matrix symbol declaration assigns feature values to a sound. For example, symbol e [mid front vowel] assigns the three feature values mid, front, and vowel to the sound e. Naturally, all values used in the feature matrix have to be declared in feature declarations first!
If the symbol contains multiple characters, the declaration will also act as a multi-character symbol declaration: symbol ts [-voiced alveolar affricate] simultaneously makes ts a single sound and assigns it the values -voiced, alveolar, and affricate.
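A minimal sketch of features declared first and then used in symbol declarations. The feature inventory is an assumption chosen to fit the examples in the text:

```lsc
Feature type(*cons, vowel)
Feature height(low, mid, high)
Feature frontness(front, central, back)
Feature voiced
Feature place(labial, alveolar, velar)
Feature manner(stop, fricative, affricate)

Symbol a [low central vowel]
Symbol e [mid front vowel]
# Also makes ts a single sound
Symbol ts [-voiced alveolar affricate]
```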
Diacritic Declarations
Like a feature matrix symbol declaration, a diacritic declaration assigns feature values to characters. But diacritics modify other symbols; once a character is declared as a diacritic, it can't stand on its own.
A basic diacritic declaration looks like a feature matrix symbol declaration, but with the diacritic keyword instead of the symbol keyword. The declaration diacritic ʼ [+ejective] makes ʼ a diacritic that gives the preceding symbol the +ejective value; so if t is declared as [-voiced alveolar stop], tʼ automatically gets the feature matrix [-voiced alveolar stop +ejective].
Diacritic feature values overwrite any values of the same features on their host symbol. For example, if a diacritic defined as [-voiced] is applied to a sound defined as [+voiced alveolar stop], the result is [-voiced alveolar stop].
Diacritics can be attached to sounds that don't have feature matrices, in which case the resulting sound has only the diacritic's features. So if the above [-voiced] diacritic is attached to a t, but t doesn't have a symbol declaration, the result is a t with the -voiced value attached to it.
Diacritic Order
If multiple diacritics are attached to the same sound, they're always displayed in the order they're declared. In this example, the declaration order of the diacritics causes long nasal vowels to display weirdly:
Switching the two declarations fixes the problem:
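A sketch of the corrected order, assuming two univalent features named long and nasal and assuming the intended spelling puts the tilde before the length mark (the feature names and the diacritic characters are illustrative):

```lsc
Feature +long, +nasal

# Declared first, so the tilde is written first: a long nasal a comes out as ãː
Diacritic ̃ [+nasal]
Diacritic ː [+long]
# With the two Diacritic lines in the opposite order, the length mark would be
# written first and the tilde would land on it instead of on the vowel.
```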
Diacritic Position
By default, diacritics are written after their host symbols:
You can change this by adding a position modifier to the diacritic declaration. A diacritic marked (before) is written before its host symbol:
A diacritic marked (first) is written after the first character of its host symbol:
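Both position modifiers can be sketched together; the features and the choice of diacritic characters are assumptions for this example:

```lsc
Feature +stress, +aspirated
Symbol ts

# Written before the host symbol: a stressed a is spelled ˈa
Diacritic ˈ (before) [+stress]
# Written after the first character of the host symbol: an aspirated ts is spelled tʰs
Diacritic ʰ (first) [+aspirated]
```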
Floating Diacritics
A diacritic marked with (floating) is interpreted as creating a superficial variant of the base sound, meaning rules that apply to the base sound should apply to the modified sound too. With an ordinary hightone diacritic, this rule affects a but not á:
But with a floating diacritic, both versions are considered to be types of a, so both are affected:
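A minimal sketch of the floating case, assuming a univalent hightone feature marked with an acute accent (the rule is hypothetical):

```lsc
Feature +hightone
Diacritic ́ (floating) [+hightone]

# Because the diacritic is floating, the a matcher also accepts á,
# and the accent is carried over to the output sound.
# Dropping (floating) from the declaration would leave á untouched.
a-rounding:
    a => o
```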
Syllable-Level Diacritics
If a diacritic is assigned syllable-level features, it's written in the designated position relative to the entire syllable:
It's an error to assign a mix of sound-level and syllable-level features to the same diacritic.
Class Declarations
Classes group together similar sounds so that rules affect all of them together. A class declaration consists of the keyword class, the class name, and a list of sounds in the class. For example, class nasal {m, n, ŋ} creates a class called nasal, which contains the sounds m, n, and ŋ. Then the class can be referenced from rules by putting a @ character before the class name, e.g. @nasal.
The order in which sounds appear in the class matters, since corresponding members of a class or alternative list are paired up:
The same sound can even appear multiple times in the same class to ensure alignment with other classes:
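Both points can be sketched with made-up classes and rules. In the first rule the members pair up by position; in the second, ə is repeated so that every vowel has a partner:

```lsc
Class stop {p, t, k}
Class fricative {f, θ, x}

# p becomes f, t becomes θ, k becomes x
spirantization:
    @stop => @fricative

Class vowel {a, e, i, o, u}
# ə is listed three times so that i and u keep their own partners
Class reduced {ə, ə, i, ə, u}

final-reduction:
    @vowel => @reduced / _ $
```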
A class definition can itself contain references to previously defined classes, in which case the sounds from the referenced classes are inserted into the class. So if class stop {p, t, k} and class fricative {f, s} are already defined, class obstruent {@stop, @fricative} defines the obstruent class as {p, t, k, f, s}.
Element Declarations
New in 1.1.0. An element declaration defines a reusable element and gives it a name. Then, as with classes, the element can be referenced from rules by putting a @ character before the element name. When the sound changes run, the element reference is replaced by the element's definition.
Element declarations are much more flexible than class declarations, allowing arbitrary element syntax. But classes have one special property that can't be replicated with elements: nested classes are flattened, while nested elements stay nested. The following works:
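A sketch of the working class-based version; the two inner class names and the rule name are assumptions, while the stop and fricative classes follow the description in the text:

```lsc
Class unvcdstop {p, t, k}
Class vcdstop {b, d, ɡ}
# Flattened to {p, t, k, b, d, ɡ}
Class stop {@unvcdstop, @vcdstop}
Class fricative {f, θ, x, v, ð, ɣ}

lenition:
    @stop => @fricative
```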
But replace the stop class with an identical element declaration, and you get an error:
The class reference @stop acts like {p, t, k, b, d, ɡ}, allowing it to line up with the six-element fricative class. But the element reference acts like {{p, t, k}, {b, d, ɡ}}, which doesn't line up.
Rules
A normal rule consists of a rule name ending in a colon, followed by a nested structure of blocks. The innermost blocks contain one or more expressions, and each expression is made out of elements.
Expressions
Expressions are the key to Lexurgy rules. Each expression does a specific transformation on each input word.
Input and Output
At minimum, an expression consists of an input pattern, a => symbol, and an output pattern. The input pattern describes which sounds are affected by the change, while the output pattern describes what changes need to be applied. Take this rule:
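A minimal rule of this shape, with a placeholder rule name:

```lsc
i-lowering:
    i => e
```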
The i is the input pattern, indicating that this expression only applies to the sound i. The e is the output pattern, indicating that any sounds matching the input pattern have to become e. The result is that all instances of i in each input word are replaced by e.
Environments
An expression may also include an environment: statements that must be true of nearby sounds in order for the change to happen. The environment consists of a condition, an exception, or both. The condition starts with /, and gives a pattern that must match nearby sounds; the exception starts with //, and gives a pattern that must not match nearby sounds. This expression has a condition, so that the change applies only to i followed by n:
This expression has an exception, so that the change applies only to i not preceded by k:
This expression has both, so the change only applies to i that are both followed by n and not preceded by k:
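The three cases just described might be written as follows; each expression is placed in its own rule, and the rule names are placeholders:

```lsc
# Condition only: i followed by n
before-n:
    i => e / _ n

# Exception only: i not preceded by k
not-after-k:
    i => e // k _

# Both: i followed by n, unless preceded by k
before-n-not-after-k:
    i => e / _ n // k _
```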
The underscore in a condition or exception marks where the sounds matched by the input pattern are in the word. The portion before the underscore is called the before environment, while the portion after is the after environment.
Alternative Environments
Within a condition or exception, more than one environment can be listed, enclosed in braces ({}) and separated by commas.
If multiple conditions are listed, the change will happen if at least one of them is true:
If multiple exceptions are listed, the change will only happen if all of them are false:
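A sketch of both cases with hypothetical rules:

```lsc
# Applies if i is followed by n OR preceded by k
either-condition:
    i => e / {_ n, k _}

# Applies only if i is neither followed by n nor preceded by k
neither-exception:
    i => e // {_ n, k _}
```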
Evaluation Order
When evaluating an expression, Lexurgy always proceeds in the following order:
- It checks whether the input pattern matches at a specific location, working from left to right within the input pattern.
- It checks whether the before environment matches the sounds immediately before what the input pattern matched, working from right to left.
- It checks whether the after environment matches the sounds immediately after what the input pattern matched, working from left to right.
- It evaluates the output pattern from left to right to produce the output sounds.
This matters for some kinds of elements. For example, captures must save sounds before they can copy sounds, so if Lexurgy encounters a capture reference before its capture binding, it will produce an error.
Unchanged
There is one special expression that doesn't follow the above form: unchanged. As the name suggests, unchanged never makes any changes, i.e. the output word is always the same as the input word.
Use unchanged when Lexurgy's syntax demands an expression, but you don't actually want anything to happen to the words; for example:
- When you've added a rule, but haven't decided what changes to apply yet.
- When you need the first sub-block in a sequential block to have modifiers.
- When you want to dump out an intermediate stage exactly as it is.
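For the first case, a placeholder rule might look like this (the rule name is made up):

```lsc
# Changes still to be decided; words pass through untouched
great-vowel-shift:
    unchanged
```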
Elements
Each part of an expression (input pattern, output pattern, before environment, and after environment) must be an element. There are many types of elements, some of which contain other elements. This section gives a tour of all the different element types Lexurgy provides.
Elements actually come in two flavours: matchers and emitters. Matchers look for sounds with specific properties, while emitters apply transformations to matched sounds, or even create sounds out of thin air. Any element in the output pattern is treated as an emitter, while any element anywhere else is treated as a matcher.
The difference is important because the same element type can have very different behaviour depending on whether it's a matcher or an emitter. Some element types are even invalid if used as the wrong flavour.
Symbol Elements
A symbol element consists of one or more characters that aren't part of Lexurgy's syntax. It represents literal sounds.
As a matcher, a symbol element matches the exact sequence of sounds that it's made of; for example, s matches only the sound s, and ho matches only the sound h followed by the sound o. The only exception is that sounds with additional floating diacritics count as matches too: if ^ is declared as a floating diacritic, then s matches both the sound s and the sound s^.
As an emitter, a symbol element emits the exact sequence of sounds that it's made of, regardless of what its corresponding matcher matched. The only exception is if the corresponding matcher is also a symbol matcher, and it found additional floating diacritics; in that case, the additional floating diacritics are copied to the output sound. For example, in the expression s => h, if ^ is declared as a floating diacritic, and the s matcher found an s^, then the h emitter will emit an h^.
Exact Symbol Elements
An exact symbol element consists of a symbol element followed by !. It causes the element to treat all diacritics as non-floating.
As a matcher, an exact symbol element matches the exact sequence of sounds that it's made of, and only that exact sequence, regardless of any floating diacritics that might be declared. So s! matches only the sound s; even if ^ is declared as a floating diacritic, s! doesn't match s^.
As an emitter, an exact symbol element emits the exact sequence of sounds that it's made of, regardless of any floating diacritics that might be declared. It refuses to copy any floating diacritics found by the corresponding matcher. So the expression s => h! will still affect s^ if ^ is declared as a floating diacritic, but it will turn into h, not h^.
Empty Elements
An empty element, written as *, represents no sounds at all.
As a matcher, an empty element matches the point between any two adjacent sounds. This allows the expression to insert sounds at that point.
As an emitter, an empty element emits zero sounds, regardless of what its corresponding matcher matched. This causes the expression to delete all sounds matched by the corresponding matcher.
Word Edges
A word edge element, written as $, represents the beginning or end of a word. It can be used in environments to apply specific changes only at the edges of the word.
The placement of word edge elements is highly restricted. Word edge elements are always invalid in the input pattern and output pattern. Even in an environment, a word edge element can only appear at the very beginning of the before environment (to match the beginning of a word), or at the very end of the after environment (to match the end of a word). Any other usage of word edge elements causes errors.
Word Boundaries
A word boundary element, written as $$, represents the space between two words.
As a matcher, a word boundary element matches the space between words. If connected to an emitter that doesn't contain a word boundary element, the words get fused together.
As an emitter, a word boundary element replaces whatever the corresponding matcher matched with a new space between words, i.e. the word gets split apart.
Syllable Breaks
A syllable break element, written as a period (.), represents the edge of a syllable: either an actual syllable break, or the edge of a word. Note that syllable break elements only work if syllables are enabled (i.e. after a syllable rule, but before a clear-syllables rule); otherwise, the period is just a symbol element.
As a matcher, a syllable break element fails the match if there isn't a syllable edge at its location. If connected to an emitter that doesn't contain a syllable break element, the syllable break at that location is deleted, merging the two adjacent syllables together.
As an emitter, a syllable break element inserts a syllable break at its location.
Note that inserting and deleting syllable breaks is usually only helpful when using manual syllables. If automatic syllables are turned on, the syllable breaks are still inserted and deleted (which may affect subsequent expressions in a sequential block or iterations in a propagation), but those changes are swept away as soon as the rule finishes and the syllable rule is reapplied.
Class References
A class reference, written as @ followed by the class name, invokes a declared class.
As a matcher, a class reference matches any of the sounds in the class. As with symbol elements, sounds with additional floating diacritics count as matches too.
As an emitter, a class reference is only valid if paired up with another class matcher (or alternative list) of the same length. It transforms the matched sound into the sound at the same position in the emitter class. So the expression {p, t, k} => {b, d, ɡ} transforms a matched p into b, a matched t into d, and a matched k into ɡ. As with symbol elements, additional floating diacritics found by the matcher are copied to the output sound.
Element References
An element reference looks the same as a class reference (a @ followed by the element name), but refers to a declared element.
As a matcher or emitter, an element reference is replaced by the element's definition.
Matrix Elements
A matrix element consists of a pair of square brackets ([]) containing zero or more feature values, feature variables, or negated feature values.
As a matcher, a matrix element matches any single sound with at least the listed feature values. Sounds with additional values count as matches; sounds with values that contradict those in the matrix don't count as matches. The case of zero values (i.e. an empty pair of brackets) is called a "wildcard", since it matches any sound.
As an emitter, a matrix element adds the listed features to all the sounds matched by the corresponding matcher, overwriting any contradictory values. If there is no corresponding matcher, the matrix element produces the sound represented by the matrix itself.
Feature Variables
A feature variable consists of the name of the feature (not the value) with a $ in front of it. Feature variables allow you to copy feature values from one sound to another.
The first time a feature variable appears in a matcher, the value of the specified feature is saved. Then every subsequent use of the same feature variable in the same expression (whether in an emitter or another matcher) is replaced with the saved value.
Negated Feature Values
A negated feature value consists of the name of the feature value with a ! in front of it.
A matrix matcher containing a negated feature value only matches sounds that don't have that value.
Negated feature values are invalid in emitters.
Captures
A capture consists of a $ followed by any positive integer: $1, $2, $3, etc.
If a capture is attached directly to a matcher (a capture binding), it saves whatever sounds that matcher matches. A capture can't be attached directly to an emitter in this way.
If a capture on its own (a capture reference) is used as a matcher, it matches exactly the sounds saved by the matcher with the same number.
If a capture on its own is used as an emitter, it produces exactly the sounds saved by the matcher with the same number.
Either way, if nothing has been saved by a capture binding with the same number, the rule fails with an error.
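A sketch of capture bindings and references in use; the classes and rule names are assumptions for the example. In both rules the binding sits in the input pattern, which is evaluated before environments and output:

```lsc
Class vowel {a, e, i, o, u}
Class cons {p, t, k, s, m, n}

# Binding in the input, reference in the environment:
# delete a consonant that stands before an identical consonant
degemination:
    @cons$1 => * / _ $1

# Bindings in the input, references in the output:
# swap a vowel with the consonant that follows it
metathesis:
    @vowel$1 @cons$2 => $2 $1
```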
Inexact Captures
An inexact capture ignores floating diacritics when used as a matcher. It's written with a preceding ~: ~$1, ~$2, ~$3, etc.
Inexact captures can't be used as emitters.
Syllable Captures
New in 1.2.0. Normal captures only copy sounds, not syllable information. To copy syllable information, you need a syllable capture, written with a . between the $ and number: $.1, $.2, $.3, etc.
As an emitter, a syllable capture produces exactly the sounds and syllable information saved by the capture binding with the same number.
Syllable captures can't be used as matchers.
Sequences
A sequence combines several elements, expecting them to be adjacent in the word. It's written by putting spaces between the individual elements.
As a matcher, a sequence checks each of its elements. The match only succeeds if all of its elements succeed, on adjacent parts of the input word. Note that in the before environment, the elements are checked from right to left.
What a sequence does as an emitter depends on its corresponding matcher.
If the matcher is also a sequence, and it has exactly the same number of elements, then the elements are paired off one to one, with each element of the emitter sequence transforming the corresponding element of the matcher sequence.
Otherwise, all the elements of the emitter sequence will try to produce sounds out of thin air, with the resulting string of sounds replacing everything that the matcher matched. Certain kinds of elements (e.g. class references) require a corresponding matcher, so putting one in such an emitter sequence will result in an error.
Repeaters
A repeater represents some number of copies of an element. Repeaters can only be used as matchers.
A general repeater is written as element*(min-max): the repeated element, followed by a *, followed by the minimum and maximum allowed number of repetitions, in parentheses and separated by a -. For example, b*(2-5) matches bb, bbb, bbbb, and bbbbb, but not b or bbbbbb.
Either the minimum or maximum can be omitted. If the minimum is omitted, it's treated as 0; if the maximum is omitted, there's no upper limit to the number of repetitions. So b*(-5) matches any number of b characters up to 5 (even zero b characters, i.e. no sounds at all), and b*(2-) matches any string of two or more b characters.
There are three kinds of repeaters with special syntax:
- An element followed by + matches one or more copies of the element: b+ is equivalent to b*(1-).
- An element followed by * matches any number of copies of the element: b* is equivalent to b*(0-).
- An element followed by ? matches zero or one copies of the element: b? is equivalent to b*(0-1). This expresses that the element is optional.
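A few hypothetical rules using repeaters as matchers (the class and rule names are made up for this sketch):

```lsc
Class cons {p, t, k, s, m, n}

# Delete a word-final cluster of two or more consonants
final-cluster-loss:
    @cons*(2-) => * / _ $

# Raise e when an i follows, across any number of consonants (including none)
long-distance-raising:
    e => i / _ @cons* i

# The same change, but across at most one consonant
local-raising:
    e => i / _ @cons? i
```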
Alternatives
An alternative list is like a class, but it can contain any elements, not just single sounds. The syntax is similar to a class declaration: a list of elements separated by commas, and wrapped in braces ({}). For example, {$1, [+voiced], k} is an alternative list with three elements: $1, [+voiced], and k.
As a matcher, an alternative list matches anything that at least one of its elements matches.
As an emitter, an alternative list is only valid if paired up with another alternative list (or class reference) of the same length. It transforms the matched sounds using the emitter at the same position in the list as the matcher that matched those sounds.
Intersections
An intersection, written as a list of elements joined with &, matches only what all of its elements match. For example, @fricative&[+voiced] matches only sounds that both belong to the fricative class and have the +voiced feature value. This contrasts with the alternative list {@fricative, [+voiced]}, which matches everything in the fricative class, and also everything with the +voiced feature value. Using an intersection makes the rule more selective about what it applies to; using an alternative list makes it more permissive.
Intersections can't be used as emitters.
Negated Elements
Negation behaved erratically prior to 1.2.0. In older versions, use it only to negate single sounds (e.g. !a) or in intersections (e.g. @fricative&!@voiced).
Adding ! before an element negates it, only matching things that don't match the element. For example, !@fricative matches any sound that isn't in the fricative class.
Elements that always match exactly one sound (such as matrix elements and class references) can be freely negated. The negation matches any single sound that doesn't match the element.
Since 1.2.0, syllable break elements can also be freely negated. The negation matches zero sounds, only if there isn't a syllable edge at that location.
Negating anything else is restricted. You can't write an expression like !abc => x; it's clear what !abc shouldn't match (namely, abc), but it isn't clear what it should match. Older versions of Lexurgy would accept this rule, but it would turn the word abcabc into xxxxxx. That first a isn't abc, and neither is the first b, and so on, so everything would get turned into x, even though the entire word is made of copies of abc.
So negation of elements like this is only allowed in specific situations:
- In an intersection, after &. Once the first element in an intersection has matched something, there's a definite sequence of sounds to check against the negated element. For example, ([]*3)&!abc is valid, and matches any three sounds that aren't the sequence abc; the first element in the intersection provides the needed context to make the interpretation of the negated element clear.
- At the very beginning of a before environment or the very end of an after environment. In this case, the negated element can simply check whether anything matching the element is present in the environment, and stop the rule from running if it is. For example, e => f / !abc d _ changes bcde to bcdf, but leaves abcde unchanged. Since there's nothing beyond the !abc in the rule, how many sounds to match is irrelevant.
Negated elements can't be used as emitters.
Environment Elements
An environment can be added not just to an entire expression, but to an individual element:
In fact, this is how all environments work under the hood: an expression like i => e / _ n // k _ gets transformed internally into i / _ n // k _ => e.
As a matcher, an environment element matches its input pattern, only if the surroundings of the matched sounds satisfy the condition and don't satisfy the exception.
An environment element can't be used as an emitter.
Precedence
In case of ambiguity, elements have the following precedence:
If you need to change the precedence, you can use parentheses. For example, by default captures have a higher precedence than sequences, so the following capture only captures one sound:
If you want to capture both matched sounds, wrap the [] [] in parentheses.
Blocks
Blocks let you organize multiple expressions within a rule and control how they apply.
Simultaneous Blocks
You can write several expressions on consecutive lines within a rule. When you do, Lexurgy automatically wraps them in a simultaneous block.
In a simultaneous block, all the expressions apply at the same time, everywhere they can apply. The procedure looks like this:
- Lexurgy looks for all the places where any of the expressions could apply: parts of the word that match the input pattern and whose surroundings satisfy the environment. It compiles this into a list of application sites.
- Lexurgy resolves conflicts between application sites, i.e. places where two expression applications would try to change overlapping parts of the word:
  - If two application sites were produced by the same expression and start at the same location in the word, the one that ends earlier in the word (i.e. the shorter match) is discarded.
  - If two overlapping application sites were produced by different expressions, the one produced by the later expression in the block is discarded.
  - If two overlapping application sites were produced by the same expression and start at different locations in the word, the one that starts later in the word is discarded.
- Lexurgy applies the expressions at the remaining application sites.
Look at this example:
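A rule along the following lines fits the walkthrough below. Its name and exact patterns are assumptions made for illustration, not a confirmed listing.

```
simultaneous-example:
    {é, è} ó => x *
    ({á, à} {é, è})+ => y
    {á, à} {à, ä} => z *
```

In this sketch the first expression rewrites é or è followed by ó as x, the second rewrites one or more pairs of á/à plus é/è as y, and the third rewrites á or à followed by à or ä as z.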
When the rule applies to áéàè, the second expression matches áé, àè, and áéàè. First, the match to áé is discarded because it starts at the same location as áéàè but ends earlier. Then the match to àè is discarded because it starts later in the word than áéàè. So the expression applies to the entire word, and the result is y, not áéy or yàè.
When the rule applies to áéó, the first expression matches éó while the second expression matches áé. Since these overlap, the match from the second expression is discarded, and only the first expression applies. So the result is áx, not yó.
When the rule applies to áàä, the third expression matches both áà and àä, and no other expressions match anything. Since these overlap, and áà starts earlier in the word than àä, the match to àä is discarded, and the expression only applies to áà. So the result is zä, not áz.
When the rule applies to áéàèó, you might think that the result would be yàx, but it's actually áéàx. That's because shorter matches are discarded before expression order is resolved. The rule produces the following application sites: èó from the first expression, and áé, àè, and áéàè from the second expression. First, the áé match is discarded because áéàè is a longer match at the same location. Then àè and áéàè are discarded because they overlap with the match to èó from the first expression. The result is that only èó ever gets changed.
Sequential Blocks
A sequential block consists of a list of other blocks separated by the then: keyword. Each nested block is applied in sequence, as if they were separate rules, except that cleanup and syllable rules don't run in between them.
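As a minimal sketch (the rule name is hypothetical), the second block below sees the output of the first:

```
two-steps:
    a => b
    then:
    b => c
```

Under the semantics described above, aba first becomes bbb and then ccc. Without the then: line, the two expressions would form a simultaneous block, and aba would become bcb instead.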
Hierarchical Blocks
A hierarchical block consists of a list of other blocks separated by the else: keyword. The first nested block is applied; the second block is applied only if the first fails to make any changes to the word; and so on.
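A small sketch of the fallback behaviour, with a hypothetical rule name:

```
final-or-fallback:
    a => e / _ $
    else:
    a => o
```

Going by the description above, pata becomes pate: the first block changes the word, so the second block is skipped. In pat there is no word-final a, so the first block changes nothing and the second block applies, giving pot.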
This is particularly useful for writing stress rules. For example, the following rule stresses a word on the last syllable if it contains a long vowel, the second-last syllable otherwise:
When nesting hierarchical blocks inside sequential blocks or vice versa, put the nested blocks in parentheses. This is a hierarchical block inside a sequential block:
This is a rule with the same expressions, but with the sequential block inside the hierarchical block:
Modifiers
Modifiers alter the behaviour of a single block. They can be added directly after the rule name to affect the whole rule, or after a then or else to affect only that block. To add a modifier to the first nested block in a sequential or hierarchical block, add an extra unchanged expression before it.
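The following sketch shows one way this could look, assuming that a modifier is written between then and the colon. The rule name is hypothetical.

```
shorten-then-raise:
    unchanged
    then propagate:
    a a => a *
    then:
    a => e / _ $
```

The leading unchanged expression is only a placeholder, so that the block meant to carry the modifier comes after a then. The last block has no modifier and applies once.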
Propagation
A block tagged with propagate is applied repeatedly, with the output from each application being fed into the next, until the word stops changing. Take this rule without modifiers:
This turns pairs of a characters into a single a, halving the total number. Now add the propagate modifier:
After halving the number of a characters once, the rule applies again to the output, halving the number again, and then again. The result is a single a, no matter how many the word started with.
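The two variants can be sketched as follows. The rule names are hypothetical, and a a => a * (keep the first a, delete the second) is just one way to write "a pair of a becomes a single a".

```
# Without a modifier: each pair of a collapses once, so aaaaaaaa becomes aaaa.
halve-a:
    a a => a *

# With propagate: the block reapplies until nothing changes, so aaaa becomes a.
collapse-a propagate:
    a a => a *
```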
Left-To-Right
A block tagged with ltr is applied once starting at the first character in the word, then once starting at the second character, and so on, with each application seeing the result of all previous applications. This ensures that effects only propagate from the start of the word to the end, not the other way around.
Take this propagate rule:
The dd in the middle turns into xx, and then the other expressions spread the line of x in both directions. Now change the propagate modifier to ltr:
This rule only spreads the line of x to the right, leaving the initial abc untouched.
Right-To-Left
A block tagged with rtl is applied once starting at the last character in the word, then once starting at the second-last character, and so on, with each application seeing the result of all previous applications. As ltr rules propagate effects from the start of the word to the end, rtl rules propagate them from the end to the start.
Take the same propagating example from before and replace propagate with rtl:
This rule only spreads the line of x to the left, leaving the final cba untouched.
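A rule of the kind discussed in the last two sections might look like the sketch below. The rule name, the expressions, and the test word abcddcba are assumptions chosen to match the descriptions.

```
# Assumed test word: abcddcba
# With propagate, the run of x spreads both ways: xxxxxxxx
x-spread propagate:
    d d => x x
    [] => x / x _
    [] => x / _ x

# Replacing the header with "x-spread ltr:" should give abcxxxxx.
# Replacing the header with "x-spread rtl:" should give xxxxxcba.
```

Only the modifier in the header changes between the three variants; the expressions stay the same.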
Filters
A filter is a modifier that can be added to a rule, right after a rule name. It causes the rule to only consider certain sounds when matching. Any class reference or feature matrix can be used as a filter condition.
When a rule has a filter:
- The input pattern and environment can only match sounds that satisfy the filter condition, as if they all had &<condition> after them.
- Two sounds are considered adjacent if none of the sounds between them satisfy the filter condition; sounds that don't match the filter are skipped entirely when determining matches.
- Expressions are forbidden from inserting sounds.
Here's a rule with a class reference as a filter condition:
This rule turns each vowel into a copy of the previous vowel, even if there are consonants between them.
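A filtered rule of this kind could be sketched as follows. The class contents and rule name are assumptions, and the expression is one possible formulation rather than a confirmed listing.

```
class vowel {a, e, i, o, u}

copy-vowel @vowel:
    []$1 [] => $1 $1
```

Because of the filter, each [] can only match a vowel, and intervening consonants are skipped. In a word like pate, the a and the e count as adjacent, so the e should be rewritten as a copy of the a.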
Here's the same rule written with features instead:
Special Rule Types
Deferred Rules
New in 1.1.0
A deferred rule, marked with the modifier defer after the rule name, doesn't apply at the point where it's defined. Instead, it can be referenced later as part of other rules, using the rule name with a colon attached to the beginning.
The following rule does nothing:
But with a reference from another rule, the change is applied:
The same deferred rule can be referenced multiple times:
A rule containing deferred references can also contain ordinary blocks and expressions. But if the deferred rule contains nested blocks, it can't be referenced from inside a sequential block.
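Putting the pieces together, a deferred rule and two references to it might look like this sketch (all names are hypothetical):

```
class vowel {a, e, i, o, u}

# Does nothing here; it only defines the change.
drop-final-vowel defer:
    @vowel => * / _ $

first-pass:
    :drop-final-vowel

second-pass:
    :drop-final-vowel
```

With this sketch, a word like patae should pass through the definition unchanged, become pata after first-pass, and become pat after second-pass.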
Cleanup Rules
New in 1.1.0
A cleanup rule, marked with the modifier cleanup after the rule name, applies once at the point where it's defined, and then again after every subsequent rule:
A cleanup rule can be deactivated by defining another rule with the same name, containing only the word off:
The cleanup rule runs one last time immediately before being deactivated, but no longer runs after subsequent rules.
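The lifecycle described above can be sketched like this, with hypothetical rule names:

```
no-h cleanup:
    h => *

s-to-h:
    s => h

no-h:
    off

k-to-h:
    k => h
```

For a word like sika, s-to-h produces hika and the cleanup rule then removes the h, giving ika. After no-h is switched off, k-to-h produces iha, and this h stays because the cleanup rule no longer runs.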
Syllables
By default, syllables are disabled; the character . is treated as a sound like any other. To enable syllables, you need at least one syllable rule.
Manual Syllables
To enable manual syllables, use an explicit syllable rule:
With manual syllables, you have to manage syllables yourself. Rules will generally leave syllable breaks and syllable-level features where they are, even if this no longer makes sense:
You can deal with this by adding and removing syllable breaks as appropriate within the rules:
A clear syllables rule turns manual syllables back off, and discards all the syllable breaks:
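As a rough sketch, the two kinds of syllable rule from this section could be written as below. The lowercase syllables: header and the rule in between are assumptions for illustration.

```
# Turn on manual syllables: breaks typed in the input are kept as they are.
syllables:
    explicit

drop-final-a:
    a => * / _ $

# Turn manual syllables back off and discard all syllable breaks.
syllables:
    clear
```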
Simple Automatic Syllables
To enable automatic syllables, use a syllable rule containing one or more syllable patterns.
A syllable rule automatically breaks up each word into syllables that match the syllable patterns. For example, the following syllable rule creates syllables with an optional consonant onset, a vowel as the nucleus, and an optional consonant coda:
The syllable rule always ends each syllable as early as possible, so that consonants go into the onset rather than the coda. So the above rule breaks up apat as a.pat, not ap.at, since a is shorter than ap. But kiski is broken up as kis.ki, since ki.ski would have an invalid syllable ski.
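A syllable rule matching this description could be sketched as follows. The class names and their contents are assumptions.

```
class consonant {p, t, k, s}
class vowel {a, i}

syllables:
    @consonant? @vowel @consonant?
```

With these classes, apat and kiski should be divided as a.pat and kis.ki, as explained above.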
Once a syllable rule is defined, it runs again after each subsequent rule, similar to cleanup rules. This ensures that the syllables continue to make sense:
You can disable the syllable rule again using an explicit syllable rule, or remove the syllable breaks altogether with a clear syllable rule. Unlike a cleanup rule, a syllable rule does not run one more time immediately before being disabled.
You can also change the syllable rule by defining a new one. Again, the previous rule doesn't apply one more time, so this can be used to adapt the syllable structure to a rule that produces previously forbidden syllables:
If a word can't be broken up into valid syllables, the rule produces an error.
Structured Automatic Syllables
New in 1.7.0
As mentioned, simple syllable rules always stop each syllable as early as possible. This is intended to assign consonants to the onset of a syllable instead of the coda of the previous syllable, as is common when dividing words into syllables. But this has two major problems:
- Simple syllable rules don't actually know where the "onset" and "coda" are, so they apply the early stopping rule to the nucleus as well. This makes it hard to allow multiple vowels in the same nucleus:
- Sometimes you don't want to assign consonants to the onset, especially in word-internal clusters. This rule assigns the internal sk to the onset, even if your intention is to divide the syllables between the s and k:
To make both scenarios easier, Lexurgy 1.7.0 introduced structured syllable rules. These are now the preferred way of defining syllable rules.
A structured syllable rule lets you explicitly define where the onset, nucleus, and coda are by separating them with :: symbols:
Structured syllable rules take this into account when resolving ambiguity. Codas still yield to following onsets: if a consonant could go in either the coda of one syllable or the onset of the next without creating invalid syllables, it will be assigned to the onset. But the nucleus is greedy, consuming as many sounds as possible. The nucleus pattern @vowel @vowel? will always consume a second vowel if it's available, preventing a syllable break from appearing between the vowels.
The coda is optional. This is also a valid pattern:
But the onset is mandatory. If for some reason you need a syllable pattern with no onset, use * as the onset:
There's another optional part of a structured syllable pattern: the reluctant onset. This is written before the regular onset, separated from it by the ?: symbol. The reluctant onset yields to the coda in the previous syllable, allowing you to force consonant clusters to be split between syllables:
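The patterns described in this section might be written as in the sketch below. The class contents are assumptions, and the exact onset patterns are a guess at a typical setup.

```
class consonant {p, t, k, s}
class vowel {a, i, u}

# Onset :: nucleus :: coda, with a greedy nucleus of one or two vowels.
syllables:
    @consonant? :: @vowel @vowel? :: @consonant?

# A later syllable rule replaces the earlier one. Here s is a reluctant
# onset, so an internal sk should be split between coda and onset.
syllables:
    s? ?: @consonant? :: @vowel :: @consonant?
```

Under the second rule, a word like kiski would be expected to come out as kis.ki, while a word-initial sk can still form an onset.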
Multiple Syllable Patterns
If a syllable rule contains more than one syllable pattern, then it uses all of them when breaking up a word into syllables. A syllable is valid if it matches at least one of the syllable patterns.
Assigning Syllable Features
You can assign syllable-level features in a syllable pattern by putting the feature matrix after =>:
When assigning syllable-level features, the order of the patterns can matter. Only the first syllable pattern that matches a syllable assigns its feature:
In this example, each syllable gets either +kiki or +bouba, never both, even though anything that matches the first pattern also matches the second.
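A syllable rule with this behaviour could be sketched as follows. The feature declarations, class contents, and patterns are assumptions; only the feature names +kiki and +bouba come from the text.

```
feature (syllable) +kiki
feature (syllable) +bouba
class consonant {p, t, k, b, d, g}
class vowel {a, i, u}

syllables:
    {p, t, k} i => [+kiki]
    @consonant @vowel => [+bouba]
```

A syllable like ki matches both patterns but only receives +kiki from the first one, while bu matches only the second pattern and receives +bouba.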
Romanization
Lexurgy sound changes are normally written in phonetic notation, like IPA. Lexurgy provides special rule names designed to convert between this notation and an easier-to-type romanization system for the language.
The first rule in a sound change file can be named deromanizer, indicating that it's meant to translate the initial language's romanization system into phonetic notation. Similarly, the last rule can be named romanizer, indicating that it translates phonetic notation into the final language's romanization system.
For the most part, romanizers and deromanizers behave like any other rule. They can contain blocks, and can reference classes, elements, and deferred rules. However:
- Other rules can't come before the deromanizer or after the romanizer.
- Romanizers and deromanizers can't have modifiers or filters (but blocks inside them can have modifiers).
- Any active syllable and cleanup rules don't apply again after the romanizer.
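As a minimal sketch, a file with both special rules might be laid out like this. The romanization (x for ʃ) and the middle rule are invented for illustration.

```
deromanizer:
    x => ʃ

depalatalize:
    ʃ => s / _ i

romanizer:
    ʃ => x
```

The deromanizer comes first and the romanizer last, with ordinary rules in between operating on the phonetic notation.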
Notation Conflicts
Occasionally, you may find that your romanization system conflicts with the declarations. Suppose you have a language with ejective consonants, marked with the IPA diacritic ʼ. But in the romanization, you use the same character to indicate the glottal stop. Here's an attempt to write this up:
This produces an error before you even run it. The ʼ character is declared as a diacritic, but both the romanizer and deromanizer try to use it on its own, without a symbol to attach it to.
As a workaround, romanizers and deromanizers have the special modifier literal, which lets you temporarily ignore all declarations.
If the top-level block in a deromanizer is a sequential block, then everything up to the first then: ignores declarations. Otherwise, the entire deromanizer ignores declarations. Similarly, if the top-level block in a romanizer is a sequential block, then everything after the last then: ignores declarations; otherwise the entire romanizer ignores declarations.
Here's the above example fixed using literal:
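A version along these lines would fit the description. The feature and diacritic declarations are assumptions about how the ejective marking is set up.

```
feature +ejective
diacritic ʼ [+ejective]

deromanizer literal:
    ʼ => ʔ

romanizer literal:
    ʔ => ʼ
```

Because both rules carry literal and neither has a sequential block at the top level, the whole body of each ignores the diacritic declaration and treats ʼ as a plain character.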
Intermediate Romanizers
Intermediate romanizers don't affect the final output of the sound changes, but they add intermediate stages to the output, letting you see what each word looks like partway through.
An intermediate romanizer looks like an ordinary rule, except that its name must start with romanizer-. Here's an example with some intermediate romanizers:
Notice how the output table now has extra columns for the intermediate stages. The first intermediate romanizer contains the unchanged expression, so it writes out the word exactly as it is at that stage. The second intermediate romanizer contains a real expression c => k, so even though the word is cccc at that stage, it shows up as kkkk. But this change has no lasting effect on the word; the next rule still sees cccc, as if the intermediate romanizer wasn't there.
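A set of rules consistent with this description could look like the sketch below. The rule names and the input word aaaa are assumptions.

```
a-to-b:
    a => b

romanizer-first:
    unchanged

b-to-c:
    b => c

romanizer-second:
    c => k

c-to-d:
    c => d
```

For the input aaaa, the first intermediate stage should show bbbb, the second should show kkkk (while the word itself is cccc), and the final output should be dddd.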