man re2c (Commandes) - convert regular expressions to C/C++
NAME
re2c - convert regular expressions to C/C++
SYNOPSIS
[-efsbvhd] [-o output] file
DESCRIPTION
is a preprocessor that generates C-based recognizers from regular expressions. The input to consists of C/C++ source interleaved with comments of the form /*!re2c ... */ which contain scanner specifications. In the output these comments are replaced with code that, when executed, will find the next input token and then execute some user-supplied token-specific code.
For example, given the following code
#define NULL ((char*) 0) char *scan(char *p){ char *q; #define YYCTYPE char #define YYCURSOR p #define YYLIMIT p #define YYMARKER q #define YYFILL(n) /*!re2c [0-9]+ {return YYCURSOR;} [\000-\377] {return NULL;} */ }
will generate
/* Generated by re2c on Sat Apr 16 11:40:58 1994 */ #line 1 "simple.re" #define NULL ((char*) 0) char *scan(char *p){ char *q; #define YYCTYPE char #define YYCURSOR p #define YYLIMIT p #define YYMARKER q #define YYFILL(n) { YYCTYPE yych; unsigned int yyaccept; goto yy0; yy1: ++YYCURSOR; yy0: if((YYLIMIT - YYCURSOR) < 2) YYFILL(2); yych = *YYCURSOR; if(yych <= '/') goto yy4; if(yych >= ':') goto yy4; yy2: yych = *++YYCURSOR; goto yy7; yy3: #line 10 {return YYCURSOR;} yy4: yych = *++YYCURSOR; yy5: #line 11 {return NULL;} yy6: ++YYCURSOR; if(YYLIMIT == YYCURSOR) YYFILL(1); yych = *YYCURSOR; yy7: if(yych <= '/') goto yy3; if(yych <= '9') goto yy6; goto yy3; } #line 12
}
After the /*!re2c */ blocks you can place a /*!max:re2c */ block that will output a define (YYMAXFILL) that holds the maximum number of characters required to parse the input. That is the maximum value YYFILL() will receive.
OPTIONS
provides the following options:
- -e
- Cross-compile from an ASCII platform to an EBCDIC one.
- -f
- Generate a scanner with support for storable state. For details see below at SCANNER WITH STORABLE STATES.
- -s
- Generate nested ifs for some switches. Many compilers need this assist to generate better code.
- -b
- Implies -s. Use bit vectors as well in the attempt to coax better code out of the compiler. Most useful for specifications with more than a few keywords (e.g. for most programming languages).
- -d
- Creates a parser that dumps information about the current position and in which state the parser is while parsing the input. This is useful to debug parser issues and states. If you use this switch you need to define a macro YYDEBUG that is called like a function with two parameters: void YYDEBUG(int state, char current). The first parameter receives the state or -1 and the second parameter receives the input at the current cursor.
- -i
- Do not output #line information. This is usefull when you want use a CMS tool with the re2c output which you might want if you do not require your users to have re2c themselves when building from your source.
- -h
- -? Invoke a short help.
- -v
- Show version information.
- -V
- Show the version as a number XXYYZZ.
- -o output
- Specify the output file.
INTERFACE CODE
Unlike other scanner generators, does not generate complete scanners: the user must supply some interface code. In particular, the user must define the following macros:
- YYCTYPE
- Type used to hold an input symbol. Usually char or unsigned char.
- YYCURSOR
- of type *YYCTYPE that points to the current input symbol. The generated code advances YYCURSOR as symbols are matched. On entry, YYCURSOR is assumed to point to the first character of the current token. On exit, YYCURSOR will point to the first character of the following token.
- YYLIMIT
- Expression of type *YYCTYPE that marks the end of the buffer (YYLIMIT[-1] is the last character in the buffer). The generated code repeatedly compares YYCURSOR to YYLIMIT to determine when the buffer needs (re)filling.
- YYMARKER
- of type *YYCTYPE. The generated code saves backtracking information in YYMARKER.
- YYFILL(n)
- The generated code "calls" YYFILL when the buffer needs (re)filling: at least n additional characters should be provided. YYFILL should adjust YYCURSOR, YYLIMIT and YYMARKER as needed. Note that for typical programming languages n will be the length of the longest keyword plus one.
- YYGETSTATE()
- The user only needs to define this macro if the -f flag was specified. In that case, the generated code "calls" YYGETSTATE at the very beginning of the scanner in order to obtain the saved state. YYGETSTATE must return a signed integer. The value must be either -1, indicating that the scanner is entered for the first time, or a value previously saved by YYSETSTATE. In the second case, the scanner will resume operations right after where the last YYFILL was called.
- YYSETSTATE(n)
- The user only needs to define this macro if the -f flag was specified. In that case, the generated code "calls" YYSETSTATE just before calling YYFILL. The parameter to YYSETSTATE is a signed integer that uniquely identifies the specific instance of YYFILL that is about to be called. Should the user wish to save the state of the scanner and have YYFILL return to the caller, all he has to do is store that unique identifer in a variable. Later, when the scannered is called again, it will call YYGETSTATE() and resume execution right where it left off.
- YYDEBUG(state,current)
- This is only needed if the -d flag was specified. It allows to easily debug the generated parser by calling a user defined function for every state. The function should have the following signature: void YYDEBUG(int state, char current). The first parameter receives the state or -1 and the second parameter receives the input at the current cursor.
- YYMAXFILL
- This will be automatically defined by /*!max:re2c */ blocks as explained above.
SCANNER WITH STORABLE STATES
When the -f flag is specified, re2c generates a scanner that can store its current state, return to the caller, and later resume operations exactly where it left off.
The default operation of re2c is a "pull" model, where the scanner asks for extra input whenever it needs it. However, this mode of operation assumes that the scanner is the "owner" the parsing loop, and that may not always be convenient.
Typically, if there is a preprocessor ahead of the scanner in the stream, or for that matter any other procedural source of data, the scanner cannot "ask" for more data unless both scanner and source live in a separate threads.
The -f flag is useful for just this situation : it lets users design scanners that work in a "push" model, i.e. where data is fed to the scanner chunk by chunk. When the scanner runs out of data to consume, it just stores its state, and return to the caller. When more input data is fed to the scanner, it resumes operations exactly where it left off.
At this point, the -f option only works with "mono-block" re2c scanners: if the scanner is described with more than one /*!re2c ... */ block, re2c -f fails with an error.
Changes needed compared to the "pull" model.
1. User has to supply macros YYSETSTATE() YYGETSTATE(state)
2. The -f option inhibits declaration of yych and yyaccept. So the user has to declare these. Also the user has to save and restore these. In the example examples/push.re these are declared as fields of the (C++) class of which the scanner is a method, so they do not need to be saved/restored explicitly. For C they could e.g. be made macros that select fields from a structure passed in as parameter. Alternatively, they could be declared as local variables, saved with YYFILL(n) when it decides to return and restored at entry to the function. Also, it could be more efficient to save the state from YYFILL(n) because YYSETSTATE(state) is called unconditionally. YYFILL(n) however does not get state as parameter, so we would have to store state in a local variable by YYSETSTATE(state).
3. Modify YYFILL(n) to return (from the function calling it) if more input is needed.
4. Modify caller to recognise "more input is needed" and respond appropriately.
Please see examples/push.re for push-model scanner.
SCANNER SPECIFICATIONS
Each scanner specification consists of a set of rules and name definitions. Rules consist of a regular expression along with a block of C/C++ code that is to be executed when the associated regular expression is matched. Name definitions are of the form ``name = regular expression;''.
SUMMARY OF RE2C REGULAR EXPRESSIONS
- "foo"
- the literal string foo. ANSI-C escape sequences can be used.
- 'foo'
- the literal string foo (characters [a-zA-Z] treated case-insensitive). ANSI-C escape sequences can be used.
- [xyz]
- a "character class"; in this case, the matches either an 'x', a 'y', or a 'z'.
- [abj-oZ]
- a "character class" with a range in it; matches an 'a', a 'b', any letter from 'j' through 'o', or a 'Z'.
- [^class]
- an inverted "character class".
- r\s
- match any r which isn't an s. r and s must be regular expressions which can be expressed as character classes.
- r*
- zero or more r's, where r is any regular expression
- r+
- one or more r's
- r?
- zero or one r's (that is, "an optional r")
- name
- the expansion of the "name" definition (see above)
- (r)
- an r; parentheses are used to override precedence (see below)
- rs
- an r followed by an s ("concatenation")
- r|s
- either an r or an s
- r/s
- an r but only if it is followed by an s. The s is not part of the matched text. This type of is called "trailing context".
- r{n}
- matches r exactly n times.
- r{n,}
- matches r at least n times.
- r{n,m}
- matches r at least n but not more than m times.
- .
- match any character except newline (\n).
- def
- matches named definition as specified by def.
Character classes and string literals may contain octoal or hexadecimal character definitions and the following set of escape sequences (\n, \t, \v, \b, \r, \f, \a, \\). An octal character is defined by a backslash followed by its three octal digits and a hexadecimal character is defined by backslash, a lower cased 'x' and its two hexadecimal digits.
The regular expressions listed above are grouped according to precedence, from highest precedence at the top to lowest at the bottom. Those grouped together have equal precedence.
A LARGER EXAMPLE
#include <stdlib.h> #include <stdio.h> #include <fcntl.h> #include <string.h>
#define ADDEQ 257 #define ANDAND 258 #define ANDEQ 259 #define ARRAY 260 #define ASM 261 #define AUTO 262 #define BREAK 263 #define CASE 264 #define CHAR 265 #define CONST 266 #define CONTINUE 267 #define DECR 268 #define DEFAULT 269 #define DEREF 270 #define DIVEQ 271 #define DO 272 #define DOUBLE 273 #define ELLIPSIS 274 #define ELSE 275 #define ENUM 276 #define EQL 277 #define EXTERN 278 #define FCON 279 #define FLOAT 280 #define FOR 281 #define FUNCTION 282 #define GEQ 283 #define GOTO 284 #define ICON 285 #define ID 286 #define IF 287 #define INCR 288 #define INT 289 #define LEQ 290 #define LONG 291 #define LSHIFT 292 #define LSHIFTEQ 293 #define MODEQ 294 #define MULEQ 295 #define NEQ 296 #define OREQ 297 #define OROR 298 #define POINTER 299 #define REGISTER 300 #define RETURN 301 #define RSHIFT 302 #define RSHIFTEQ 303 #define SCON 304 #define SHORT 305 #define SIGNED 306 #define SIZEOF 307 #define STATIC 308 #define STRUCT 309 #define SUBEQ 310 #define SWITCH 311 #define TYPEDEF 312 #define UNION 313 #define UNSIGNED 314 #define VOID 315 #define VOLATILE 316 #define WHILE 317 #define XOREQ 318 #define EOI 319
typedef unsigned int uint; typedef unsigned char uchar;
#define BSIZE 8192
#define YYCTYPE uchar #define YYCURSOR cursor #define YYLIMIT s->lim #define YYMARKER s->ptr #define YYFILL(n) {cursor = fill(s, cursor);}
#define RET(i) {s->cur = cursor; return i;}
typedef struct Scanner { int fd; uchar *bot, *tok, *ptr, *cur, *pos, *lim, *top, *eof; uint line; } Scanner;
uchar *fill(Scanner *s, uchar *cursor){ if(!s->eof){ uint cnt = s->tok - s->bot; if(cnt){ memcpy(s->bot, s->tok, s->lim - s->tok); s->tok = s->bot; s->ptr -= cnt; cursor -= cnt; s->pos -= cnt; s->lim -= cnt; } if((s->top - s->lim) < BSIZE){ uchar *buf = (uchar*) malloc(((s->lim - s->bot) + BSIZE)*sizeof(uchar)); memcpy(buf, s->tok, s->lim - s->tok); s->tok = buf; s->ptr = &buf[s->ptr - s->bot]; cursor = &buf[cursor - s->bot]; s->pos = &buf[s->pos - s->bot]; s->lim = &buf[s->lim - s->bot]; s->top = &s->lim[BSIZE]; free(s->bot); s->bot = buf; } if((cnt = read(s->fd, (char*) s->lim, BSIZE)) != BSIZE){ s->eof = &s->lim[cnt]; *(s->eof)++ = '\n'; } s->lim += cnt; } s->cur = cursor; return cursor; }
int scan(Scanner *s){ uchar *cursor = s->cur; std: s->tok = cursor; /*!re2c any = [\000-\377]; O = [0-7]; D = [0-9]; L = [a-zA-Z_]; H = [a-fA-F0-9]; E = [Ee] [+-]? D+; FS = [fFlL]; IS = [uUlL]*; ESC = [\\] ([abfnrtv?'"\\] | "x" H+ | O+); */
/*!re2c "/*" { goto comment; } "auto" { RET(AUTO); } "break" { RET(BREAK); } "case" { RET(CASE); } "char" { RET(CHAR); } "const" { RET(CONST); } "continue" { RET(CONTINUE); } "default" { RET(DEFAULT); } "do" { RET(DO); } "double" { RET(DOUBLE); } "else" { RET(ELSE); } "enum" { RET(ENUM); } "extern" { RET(EXTERN); } "float" { RET(FLOAT); } "for" { RET(FOR); } "goto" { RET(GOTO); } "if" { RET(IF); } "int" { RET(INT); } "long" { RET(LONG); } "register" { RET(REGISTER); } "return" { RET(RETURN); } "short" { RET(SHORT); } "signed" { RET(SIGNED); } "sizeof" { RET(SIZEOF); } "static" { RET(STATIC); } "struct" { RET(STRUCT); } "switch" { RET(SWITCH); } "typedef" { RET(TYPEDEF); } "union" { RET(UNION); } "unsigned" { RET(UNSIGNED); } "void" { RET(VOID); } "volatile" { RET(VOLATILE); } "while" { RET(WHILE); } L (L|D)* { RET(ID); } ("0" [xX] H+ IS?) | ("0" D+ IS?) | (D+ IS?) | (['] (ESC|any\[\n\\'])* [']) { RET(ICON); } (D+ E FS?) | (D* "." D+ E? FS?) | (D+ "." D* E? FS?) { RET(FCON); } (["] (ESC|any\[\n\\"])* ["]) { RET(SCON); } "..." { RET(ELLIPSIS); } ">>=" { RET(RSHIFTEQ); } "<<=" { RET(LSHIFTEQ); } "+=" { RET(ADDEQ); } "-=" { RET(SUBEQ); } "*=" { RET(MULEQ); } "/=" { RET(DIVEQ); } "%=" { RET(MODEQ); } "&=" { RET(ANDEQ); } "^=" { RET(XOREQ); } "|=" { RET(OREQ); } ">>" { RET(RSHIFT); } "<<" { RET(LSHIFT); } "++" { RET(INCR); } "--" { RET(DECR); } "->" { RET(DEREF); } "&&" { RET(ANDAND); } "||" { RET(OROR); } "<=" { RET(LEQ); } ">=" { RET(GEQ); } "==" { RET(EQL); } "!=" { RET(N); } ";" { RET(';'); } "{" { RET('{'); } "}" { RET('}'); } "," { RET(','); } ":" { RET(':'); } "=" { RET('='); } "(" { RET('('); } ")" { RET(')'); } "[" { RET('['); } "]" { RET(']'); } "." { RET('.'); } "&" { RET('&'); } "!" { RET('!'); } "~" { RET('~'); } "-" { RET('-'); } "+" { RET('+'); } "*" { RET('*'); } "/" { RET('/'); } "%" { RET('%'); } "<" { RET('<'); } ">" { RET('>'); } "^" { RET('^'); } "|" { RET('|'); } "?" { RET('?'); }
[ \t\v\f]+ { goto std; }
"\n" { if(cursor == s->eof) RET(EOI); s->pos = cursor; s->line++; goto std; }
any { printf("unexpected character: %c\n", *s->tok); goto std; } */
comment: /*!re2c "*/" { goto std; } "\n" { if(cursor == s->eof) RET(EOI); s->tok = s->pos = cursor; s->line++; goto comment; } any { goto comment; } */ }
main(){ Scanner in; int t; memset((char*) &in, 0, sizeof(in)); in.fd = 0; while((t = scan(&in)) != EOI){ /* printf("%d\t%.*s\n", t, in.cur - in.tok, in.tok); printf("%d\n", t); */ } close(in.fd); }
FEATURES
does not provide a default action: the generated code assumes that the input will consist of a sequence of tokens. Typically this can be dealt with by adding a rule such as the one for unexpected characters in the example above.
The user must arrange for a sentinel token to appear at the end of input (and provide a rule for matching it): does not provide an <<EOF>> expression. If the source is from a null-byte terminated string, a rule matching a null character will suffice. If the source is from a file then the approach taken in the example can be used: pad the input with a newline (or some other character that can't appear within another token); upon recognizing such a character check to see if it is the sentinel and act accordingly.
does not provide start conditions: use a separate scanner specification for each start condition (as illustrated in the above example).
BUGS
Only fixed length trailing context can be handled.
Difference only works for character sets.
The internal algorithms need documentation.
SEE ALSO
AUTHORS
Peter Bumbulis <peter@csg.uwaterloo.ca>
Brian Young <bayoung@acm.org>
Dan Nuffer <nuffer@users.sourceforge.net>
Marcus Boerger <helly@users.sourceforge.net>
Hartmut Kaiser <hkaiser@users.sourceforge.net>
Emmanuel Mogenet <mgix@mgix.com> added storable state
VERSION INFORMATION
This manpage describes re2c, version 0.9.12.