man ragel (Commandes) - compile regular languages into executable state machines

NAME

ragel - compile regular languages into executable state machines

SYNOPSIS

ragel [options] file

DESCRIPTION

Note: this is the frontend component of Ragel, which generates an intermediate file format that must be processed by rlcodegen(1).

Ragel compiles finite state machines from regular languages into executable C/C++/Objective-C/D code. Ragel state machines can not only recognize byte sequences as regular expression machines do, but can also execute code at arbitrary points in the recognition of a regular language. This code embedding is done using inline operators that do not disrupt the regular language syntax.

The core language consists of standard regular expression operators (such as union, concatenation and kleene star) and action embedding operators. Ragel also provides operators that let you control any non-determinism that you create, construct scanners using the longest match paradigm and build state machines using the statechart model. It is also possible to influence the execution of a state machine from inside an embedded action by jumping or calling to other parts of the machine and reprocessing input.

Ragel provides a very flexibile interface to the host language that attempts to place minimal restrictions on how the generated code is used and integrated into the application. The generated code has no dependencies.

OPTIONS

-h, -H, -?, --help
Display help and exit.
-o file
Write output to file. If -o is not given, a default file name is chosen by replacing the suffix of the input. For source files ending in .rh the suffix .h is used. For all other source files a suffix based on the output language is used (.c, .cpp, .m, .dot)
-m
Find the minimal FSM accepting the language. This is the default behaviour. Minimization reduces the number of states in the final FSM. The algorithm implemented is similar to Hopcroft's state minimization algorithm. Differences are due to Ragel supporting arbitrary integer alphabets. Though exact analysis is very difficult, it runs close to O(n*log(n)) and requires O(n) temporary storage where n is the number of states.
-n
Do not perform state minimization.
-S <spec>
FSM specification to output for -V
-M <machine>
Machine definition/instantiation to output for -V

RAGEL INPUT

NOTE: This is a very brief description of Ragel input. Ragel is described in more detail in the user guide available from the homepage (see below).

Ragel normally passes input files straight to output. When it sees a FSM specification it stops to generate FSM code and then continues on passing input through. There can be any number of FSM specifications in an input file. An FSM specification always starts with '%%', followed by an optional machine name and then either a single FSM statement or a block of statements enclosed in {}.

FSM STATEMENTS

Alphabet Type:
Set the data type of the alphabet.
GetKey:
Specify how to retrieve the alphabet character from the element type.
Interface:
Cause the machine's interface to be emitted.
Include:
Include a machine of same name as the current or of a different name in either the current file or some other file.
Action Definition:
Define an action that can be invoked by the FSM.
Fsm Definition and Instantiation:
Used to build FSMs. Syntax description in next few sections.
Access:
Specify how to access the persistent state machine variables.
Write:
Write some component of the machine.

Ragel generates C/C++/Objective-C code. Sometimes you want to generate only the header portion of the FSM, sometimes you want to generate only the code portion and sometimes you may want both. Ragel decides what to generate in place of the FSM specification based on what type of statements are in the specification. If Ragel sees the interface statement then it generates the machine's interface. If it sees a definition of the main machine then it generates the code portion of the FSM. This lets you choose exactly what type of output each FSM specification produces.

BASIC MACHINES

The basic machines are the base operands of the regular language expressions.

'hello'
Concat literal. Produces a concatenation of the characters in the string. Supports escape sequences with '\'. The result will have a start state and a transition to a new state for each character in the string. The last state in the sequence will be made final.
hello
Identical to single quote version.
[hello]
Or literal. Produces a union of characters. Supports character ranges with '-', negating the sense of the union with an initial '^' and escape sequences with '\'. The result will have two states with a transition between them for each character or range.

NOTE: '', "", and [] produce null FSMs. Null machines have one state that is both a start state and a final state and match the zero length string. A null machine may be created with the null builtin machine.

integer
Makes a two state machine with one transition on the given integer number.
hex
Makes a two state machine with one transition on the given hexidecimal number.
/simple_regex/
A simple regular expression. Supports the notation '.', '*' and '[]', character ranges with '-', negating the sense of an OR expression with and initial '^' and escape sequences with '\'.
lit .. lit
Specifies a range. The allowable upper and lower bounds are concat literals of length one and number machines. For example, 0x10..0x20, 0..63, and 'a'..'z' are valid ranges.
variable_name
References the machine definition assigned to the variable name given.
builtin_machine
There are several builtin machines available. They are all two state machines for the purpose of matching common classes of characters. They are:
any
Any character in the alphabet.
ascii
Ascii characters 0..127.
extend
Ascii extended characters. This is the range -128..127 for signed alphabets and the range 0..255 for unsigned alphabets.
alpha
Alphabetic characters /[A-Za-z]/.
digit
Digits /[0-9]/.
alnum
Alpha numerics /[0-9A-Za-z]/.
lower
Lowercase characters /[a-z]/.
upper
Uppercase characters /[A-Z]/.
xdigit
Hexidecimal digits /[0-9A-Fa-f]/.
cntrl
Control characters 0..31.
graph
Graphical characters /[!-~]/.
print
Printable characters /[ -~]/.
punct
Punctuation. Graphical characters that are not alpha-numerics /[!-/:-@\[-`{-~]/.
space
Whitespace /[\t\v\f\n\r ]/.
null
Zero length string. Equivalent to '', "" and [].
empty
Empty set. Matches nothing.

BRIEF OPERATOR REFERENCE

Operators are grouped by precedence, group 1 being the lowest and group 6 the highest.

GROUP 1:

expr , expr
Join machines together without drawing any transitions, setting up a start state or any final states. Start state must be explicitly specified with the "start" label. Final states may be specified with the an epsilon transitions to the implicitly created "final" state.

GROUP 2:

expr | expr
Produces a machine that matches any string in machine one or machine two.
expr & expr
Produces a machine that matches any string that is in both machine one and machine two.
expr - expr
Produces a machine that matches string that is in machine one but not in machine two.

GROUP 3:

expr . expr
Produces a machine that matches all the strings in machine one followed by all the strings in machine two.

NOTE: Concatenation is the default operator. Two machines next to each other with no operator between them results in the concatenation operation.

GROUP 4:

label: expr
Attaches a label to an expression. Labels can be used by epsilon transitions and fgoto and fcall statements in actions. Also note that the referencing of a machine definition causes the implicit creation of label by the same name.

GROUP 5:

expr -> label
Draws an epsilon transition to the state defined by label. Label must be a name in the current scope. Epsilon transitions are resolved when comma operators are evaluated and at the root of the expression tree of machine assignment/instantiation.

GROUP 6: Actions

An action may be a name predefined with an action statement or may be specified directly with '{' and '}' in the expression.

expr > action
Embeds action into starting transitions.
expr @ action
Embeds action into transitions that go into a final state.
expr $ action
Embeds action into all transitions. Does not include pending out transitions.
expr % action
Embeds action into pending out transitions from final states.

GROUP 6: EOF Actions

When a machine's finish routine is called the current state's EOF actions are executed.

expr >/ action
Embed action into the EOF action list of the start state.
expr @/ action
Embed action into the EOF action list of all states that are not the start state and that are not final.
expr %/ action
Embed action into the EOF action list of all final states.
expr $/ action
Embed action into the EOF action list of all states.

GROUP 6: Global Error Actions

Global error actions are stored in states until the final state machine has been fully constructed. They are then transferred to error transitions, giving the effect of a default action.

expr >! action
Embed action into the global error transition list of the start state.
expr @! action
Embed action into the global error transition list of all states that are not the start state and that are not final.
expr %! action
Embed action into the global error transition list of the final states.
expr $! action
Embed action into the global error transition list of all states.

GROUP 6: Local Error Actions

Local error actions are stored in states until the named machine is fully constructed. They are then transferred to error transitions, giving the effect of a default action for a section of the total machine. Note that the name may be omitted, in which case the action will be transferred to error actions upon construction of the current machine.

expr >^ (name,action)
Embeds action into the error transition taken when the machine fails while in the start state.
expr @^ (name,action)
Embeds action into the error transition taken when the machine fails while in any state that is not the start state and is not final.
expr %^ (name,action)
Embed action into the error transition taken when the machine fails while in a final state.
expr $^ (name,action)
Embeds action into the error transition taken when the machine fails while in any state.

GROUP 6: Priority Assignment

Priorities are assigned to names within transitions. Only priorities on the same name are allowed to interact. In the first form of priorities the name defaults to the name of the machine definition the priority is assigned in. Transitions do not have default priorities.

expr > int
Assigns the priority int in all transitions entering into the machine.
expr @ int
Assigns the priority int in all transitions that go into a final state.
expr $ int
Assigns the priority int in all existing transitions.
expr % int
Assigns the priority int in all pending out transitions.

A second form of priority assignment allows the programmer to specify the name to which the priority is assigned, allowing interactions to cross machine definition boundaries.

expr > (name,int)
Assigns the priority int to name in all transitions entering into the machine.
expr @ (name, int)
Assigns the priority int to name in all transitions that go into a final state.
expr $ (name, int)
Assigns the priority int to name in all existing transitions.
expr % (name, int)
Assigns the priority int to name in all pending out transitions.

GROUP 7:

expr *
Produces the kleene star of a machine. Matches zero or more repetitions of the machine.
expr **
Longest Matching Kleene Star. This version of kleene star puts a higher priority on staying in the machine over wrapping around and starting over. This operator is equivalent to ( ( expr ) $0 %1 )*.
expr ?
Produces a machine that accepts the machine given or the null string. This operator is equivalent to ( expr | '' ).
expr +
Produces the machine concatenated with the kleen star of itself. Matches one or more repetitions of the machine. This operator is equivalent to ( expr . expr* ).
expr {n}
Produces a machine that matches exactly n repetitions of expr.
expr {,n}
Produces a machine that matches anywhere from zero to n repetitions of expr.
expr {n,}
Produces a machine that matches n or more repetitions of expr.
expr {n,m}
Produces a machine that matches n to m repetitions of expr.

GROUP 8:

! expr
Produces a machine that matches any string not matched by the given machine. This operator is equivalent to ( *extend - expr ).

GROUP 9:

( expr )
Forces precedence on operators.

VALUES AVAILABLE IN CODE BLOCKS

fc
The current character.
fpc
A pointer to the current character.
fcurs
An integer value representing the current state.
ftargs
An integer value representing the target state.
fentry(<label>)
An integer value representing the entry point <label>.

STATEMENTS AVAILABLE IN CODE BLOCKS

fhold;
Do not advance over the current character.
fgoto <label>;
Jump to the machine defined by <label>.
fgoto *<expr>;
Jump to the entry point given by <expr>. The expression must evaluate to an integer value representing a state.
fnext <label>;
Set the next state to be the entry point defined by <label>. The fnext statement does not immediately jump to the specified state. Any action code following the statement is executed.
fnext *<expr>;
Set the next state to be the entry point given by <expr>. The expression must evaluate to an integer value representing a state.
fcall <label>;
Call the machine defined by <label>. The next fret will jump to the target of the transition on which the action is invoked.
fcall *<expr>;
Call the entry point given by <expr>. The next fret will jump to the target of the transition on which the action is invoked.
fret;
Return to the target state of the transition on which the last fcall was made.

BUGS

Ragel is still under development and has not yet matured. There are probably many bugs.

CREDITS

Ragel was written by Adrian Thurston <thurston@cs.queensu.ca>. Objective-C output contributed by Eric Ocean. D output contributed by Alan West.

SEE ALSO

rlcodegen(1), re2c(1), flex(1)

Homepage: http://www.elude.ca/ragel/