This section is excerpted from The GNU C Library reference manual by Sandra Loosemore with Richard M. Stallman, Roland McGrath, and Andrew Oram.
The GNU C library supports the standard POSIX.2 interface. Programs using this interface should include the header file `rxposix.h'.
Before you can actually match a regular expression, you must compile it. This is not true compilation--it produces a special data structure, not machine instructions. But it is like ordinary compilation in that its purpose is to enable you to "execute" the pattern fast. (See section 4.3 Matching a Compiled POSIX Regular Expression, for how to use the compiled regular expression for matching.)
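There is a special data type for compiled regular expressions, regex_t. It is a structure with just one field that your programs should look at:
re_nsub
This field holds the number of parenthetical subexpressions in the regular expression that was compiled.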
There are several other fields, but we don't describe them here, because only the functions in the library should use them.
After you create a regex_t object, you can compile a regular
expression into it by calling regcomp.
regcomp "compiles" a regular expression into a
data structure that you can use with regexec to match against a
string. The compiled regular expression format is designed for
efficient matching. regcomp stores it into *compiled.
The parameter pattern points to the regular expression to be
compiled. When using regcomp, pattern must be
0-terminated. When using regncomp, pattern must be
len characters long.
regncomp is not a standard function; strictly POSIX programs
should avoid using it.
It's up to you to allocate an object of type regex_t and pass its
address to regcomp.
Before freeing the object of type regex_t, you must pass it
to regfree. Not doing so may cause subsequent calls to Rx
functions to behave strangely.
The argument cflags lets you specify various options that control the syntax and semantics of regular expressions. See section 4.2 Flags for POSIX Regular Expressions.
If you use the flag REG_NOSUB, then regcomp omits from
the compiled regular expression the information necessary to record
how subexpressions actually match. In this case, you might as well
pass 0 for the matchptr and nmatch arguments when
you call regexec.
If you don't use REG_NOSUB, then the compiled regular expression
does have the capacity to record how subexpressions match. Also,
regcomp tells you how many subexpressions pattern has, by
storing the number in compiled->re_nsub. You can use that
value to decide how long an array to allocate to hold information about
subexpression matches.
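As a minimal sketch of the steps above, the following compiles a pattern, reads re_nsub, and sizes a match array from it. The helper names count_subexpressions and alloc_match_array are hypothetical, and the sketch assumes a system that declares the POSIX interface in regex.h; with Rx, include rxposix.h instead.

```c
#include <regex.h>   /* assumption: POSIX header; with Rx use rxposix.h */
#include <stdlib.h>

/* Hypothetical helper: compile PATTERN as a basic regular expression and
   return its number of parenthetical subexpressions, or -1 if it does
   not compile.  */
long
count_subexpressions (const char *pattern)
{
  regex_t compiled;
  long count;

  if (regcomp (&compiled, pattern, 0) != 0)
    return -1;

  count = (long) compiled.re_nsub;
  regfree (&compiled);
  return count;
}

/* Hypothetical helper: allocate one element for the whole match plus one
   per subexpression.  Stores the element count in *NMATCH; the caller
   frees the result.  */
regmatch_t *
alloc_match_array (const regex_t *compiled, size_t *nmatch)
{
  *nmatch = compiled->re_nsub + 1;
  return malloc (*nmatch * sizeof (regmatch_t));
}
```

The extra element is for index 0 of the array, which describes the match of the entire regular expression.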
regcomp returns 0 if it succeeds in compiling the regular
expression; otherwise, it returns a nonzero error code (see the table
below). You can use regerror to produce an error message string
describing the reason for a nonzero value; see section 4.6 POSIX Regexp Matching Cleanup.
Here are the possible nonzero values that regcomp can return:
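REG_BADBR
There was an invalid `\{...\}' construct in the regular expression. A valid `\{...\}' construct must contain either a single number, or two numbers in increasing order separated by a comma.
REG_BADPAT
There was a syntax error in the regular expression.
REG_BADRPT
A repetition operator such as `?' or `*' appeared in a bad position (with no preceding subexpression to act on).
REG_ECOLLATE
The regular expression referred to an invalid collating element (one not defined in the current locale for string collation).
REG_ECTYPE
The regular expression referred to an invalid character class name.
REG_EESCAPE
The regular expression ended with `\'.
REG_ESUBREG
There was an invalid number in the `\digit' construct.
REG_EBRACK
There were unbalanced square brackets in the regular expression.
REG_EPAREN
An extended regular expression had unbalanced parentheses, or a basic regular expression had unbalanced `\(' and `\)'.
REG_EBRACE
The regular expression had unbalanced `\{' and `\}'.
REG_ERANGE
One of the endpoints in a range expression was invalid.
REG_ESPACE
regcomp ran out of memory.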
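One way a program might act on these codes is to separate resource exhaustion from mistakes in the pattern itself. The sketch below does only that; the enumeration and the function check_pattern are hypothetical names, and regex.h is assumed as the header (with Rx, rxposix.h).

```c
#include <regex.h>   /* assumption: POSIX header; with Rx use rxposix.h */

/* Hypothetical classification of regcomp results.  */
enum compile_status { COMPILE_OK, COMPILE_BAD_PATTERN, COMPILE_NO_MEMORY };

/* Hypothetical helper: compile PATTERN with CFLAGS only to find out
   whether it is acceptable.  */
enum compile_status
check_pattern (const char *pattern, int cflags)
{
  regex_t compiled;
  int errcode = regcomp (&compiled, pattern, cflags);

  if (errcode == 0)
    {
      regfree (&compiled);      /* the compiled form is not needed here */
      return COMPILE_OK;
    }

  /* Every nonzero code other than REG_ESPACE reports a problem with
     the pattern text.  */
  return errcode == REG_ESPACE ? COMPILE_NO_MEMORY : COMPILE_BAD_PATTERN;
}
```

A caller could report COMPILE_BAD_PATTERN to the user who supplied the pattern, and treat COMPILE_NO_MEMORY like any other allocation failure.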
These are the bit flags that you can use in the cflags operand when
compiling a regular expression with regcomp.
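REG_EXTENDED
Treat the pattern as an extended regular expression, rather than as a basic regular expression.
REG_ICASE
Ignore case when matching letters.
REG_NOSUB
Don't bother storing the contents of the matchptr array.
REG_NEWLINE
Treat a newline in string as dividing string into multiple lines, so that `$' can match before the newline and `^' can match after. Also, don't permit `.' to match a newline, and don't permit `[^...]' to match a newline. Otherwise, newline acts like any other ordinary character.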
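The flags are combined with bitwise OR. As an illustration, this sketch compiles an extended regular expression that ignores case and records no subexpression information, then tests a string against it. The function name contains_ignoring_case is hypothetical, and regex.h is assumed as the header (with Rx, rxposix.h).

```c
#include <regex.h>   /* assumption: POSIX header; with Rx use rxposix.h */
#include <stddef.h>

/* Hypothetical helper: return 1 if TEXT contains a match for the extended
   regular expression PATTERN, ignoring case; 0 if regexec does not report
   a match; -1 if PATTERN does not compile.  */
int
contains_ignoring_case (const char *pattern, const char *text)
{
  regex_t compiled;
  int status;

  if (regcomp (&compiled, pattern,
               REG_EXTENDED | REG_ICASE | REG_NOSUB) != 0)
    return -1;

  /* REG_NOSUB was used, so pass 0 and NULL for the match array.  */
  status = regexec (&compiled, text, 0, NULL, 0);
  regfree (&compiled);
  return status == 0;
}
```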
Once you have compiled a regular expression, as described in section 4.1 POSIX Regular Expression Compilation, you can match it against strings using
regexec. A match anywhere inside the string counts as success,
unless the regular expression contains anchor characters (`^' or
`$').
regexec tries to match the compiled regular expression
*compiled against string.
regexec returns 0 if the regular expression matches;
otherwise, it returns a nonzero value. See the table below for
what nonzero values mean. You can use regerror to produce an
error message string describing the reason for a nonzero value;
see section 4.6 POSIX Regexp Matching Cleanup.
The parameter string points to the text to search. When using
regexec, string must be 0-terminated. When using
regnexec, string must be len characters long.
regnexec is not a standard function; strictly POSIX programs
should avoid using it.
The argument eflags is a word of bit flags that enable various options.
If you want to get information about what part of string actually
matched the regular expression or its subexpressions, use the arguments
matchptr and nmatch. Otherwise, pass 0 for
nmatch, and NULL for matchptr. See section 4.4 Match Results with Subexpressions.
You must match the regular expression with the same set of current locales that were in effect when you compiled the regular expression.
The function regexec accepts the following flags in the
eflags argument:
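REG_NOTBOL
Do not regard the beginning of the specified string as the beginning of a line; more generally, don't make any assumptions about what text might precede it.
REG_NOTEOL
Do not regard the end of the specified string as the end of a line; more generally, don't make any assumptions about what text might follow it.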
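The effect of these two flags is easiest to see with anchored patterns. In this sketch, the hypothetical helper match_with_eflags compiles a basic regular expression and runs regexec with the given eflags; regex.h is assumed as the header (with Rx, rxposix.h).

```c
#include <regex.h>   /* assumption: POSIX header; with Rx use rxposix.h */
#include <stddef.h>

/* Hypothetical helper: return 1 if PATTERN matches somewhere in TEXT when
   regexec is given EFLAGS, 0 if regexec does not report a match, and -1
   if PATTERN does not compile.  */
int
match_with_eflags (const char *pattern, const char *text, int eflags)
{
  regex_t compiled;
  int status;

  if (regcomp (&compiled, pattern, REG_NOSUB) != 0)
    return -1;

  status = regexec (&compiled, text, 0, NULL, eflags);
  regfree (&compiled);
  return status == 0;
}
```

With these definitions, match_with_eflags ("^abc", "abcdef", 0) reports a match, but the same call with REG_NOTBOL does not, because `^' may no longer match at the start of the string. Likewise `def$' stops matching `abcdef' once REG_NOTEOL is given.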
Here are the possible nonzero values that regexec can return:
REG_NOMATCH
The pattern didn't match the string. This isn't really an error.
REG_ESPACE
regexec ran out of memory.
When regexec matches parenthetical subexpressions of
pattern, it records which parts of string they match. It
returns that information by storing the offsets into an array whose
elements are structures of type regmatch_t. The first element of
the array (index 0) records the part of the string that matched
the entire regular expression. Each other element of the array records
the beginning and end of the part that matched a single parenthetical
subexpression.
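regmatch_t is the data type of the matchptr array that you pass to
regexec. It contains two structure fields, as follows:
rm_so
The offset in string of the beginning of a substring. Add this value to string to get the address of that part.
rm_eo
The offset in string of the end of the substring.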
regoff_t is an alias for another signed integer type.
The fields of regmatch_t have type regoff_t.
The regmatch_t elements correspond to subexpressions
positionally; the element at index 1 records where the first
subexpression matched, the element at index 2 records the second
subexpression, and so on. The order of the subexpressions is the order
in which they begin.
When you call regexec, you specify how long the matchptr
array is, with the nmatch argument. This tells regexec how
many elements to store. Because element 0 holds the match for the entire
regular expression, an array of nmatch elements has room for
nmatch - 1 subexpressions; if the regular expression has more than that,
then you won't get offset information about the rest of them. But this
doesn't alter whether the pattern matches a particular string or not.
If you don't want regexec to return any information about where
the subexpressions matched, you can either supply 0 for
nmatch, or use the flag REG_NOSUB when you compile the
pattern with regcomp.
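To illustrate a short match array, the sketch below passes 1 for nmatch, so regexec stores only element 0 (the match of the whole expression) however many subexpressions the pattern has. The function copy_whole_match is a hypothetical helper, and regex.h is assumed as the header (with Rx, rxposix.h).

```c
#include <regex.h>   /* assumption: POSIX header; with Rx use rxposix.h */
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a newly allocated copy of the part of TEXT
   that matches the basic regular expression PATTERN, or NULL if there is
   no match or an error occurs.  The caller frees the result.  */
char *
copy_whole_match (const char *pattern, const char *text)
{
  regex_t compiled;
  regmatch_t whole[1];          /* nmatch is 1: only element 0 is stored */
  char *copy = NULL;

  if (regcomp (&compiled, pattern, 0) != 0)
    return NULL;

  if (regexec (&compiled, text, 1, whole, 0) == 0)
    {
      size_t length = (size_t) (whole[0].rm_eo - whole[0].rm_so);

      copy = malloc (length + 1);
      if (copy != NULL)
        {
          /* rm_so is an offset from the start of TEXT.  */
          memcpy (copy, text + whole[0].rm_so, length);
          copy[length] = '\0';
        }
    }

  regfree (&compiled);
  return copy;
}
```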
Sometimes a subexpression matches a substring of no characters. This
happens when `f\(o*\)' matches the string `fum'. (It really
matches just the `f'.) In this case, both of the offsets identify
the point in the string where the null substring was found. In this
example, the offsets are both 1.
Sometimes the entire regular expression can match without using some of
its subexpressions at all--for example, when `ba\(na\)*' matches the
string `ba', the parenthetical subexpression is not used. When
this happens, regexec stores -1 in both fields of the
element for that subexpression.
Sometimes matching the entire regular expression can match a particular
subexpression more than once--for example, when `ba\(na\)*'
matches the string `bananana', the parenthetical subexpression
matches three times. When this happens, regexec usually stores
the offsets of the last part of the string that matched the
subexpression. In the case of `bananana', these offsets are
6 and 8.
But the last match is not always the one that is chosen. It's more
accurate to say that the last opportunity to match is the one
that takes precedence. What this means is that when one subexpression
appears within another, then the results reported for the inner
subexpression reflect whatever happened on the last match of the outer
subexpression. For an example, consider `\(ba\(na\)*s \)*' matching
the string `bananas bas '. The last time the inner expression
actually matches is near the end of the first word. But it is
considered again in the second word, and fails to match there.
regexec reports nonuse of the "na" subexpression.
Another place where this rule applies is when the regular expression
`\(ba\(na\)*s \|nefer\(ti\)* \)*' matches `bananas nefertiti '.
The "na" subexpression does match in the first word, but it doesn't
match in the second word because the other alternative is used there.
Once again, the second repetition of the outer subexpression overrides
the first, and within that second repetition, the "na" subexpression
is not used. So regexec reports nonuse of the "na"
subexpression.
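The simpler cases described in this section (an empty match, an unused subexpression, and a subexpression that matches several times) can be checked with a small helper such as the following. The name subexpression_offsets is hypothetical, and regex.h is assumed as the header (with Rx, rxposix.h). The nested examples are deliberately left out of this sketch.

```c
#include <regex.h>   /* assumption: POSIX header; with Rx use rxposix.h */
#include <stdlib.h>

/* Hypothetical helper: match the basic regular expression PATTERN against
   TEXT and store the offsets recorded for element INDEX of the match
   array in *START and *END.  Return 0 on success, or -1 if the pattern
   does not compile, does not match, or has no such element.  */
int
subexpression_offsets (const char *pattern, const char *text,
                       size_t index, long *start, long *end)
{
  regex_t compiled;
  regmatch_t *matches;
  size_t nmatch;
  int result = -1;

  if (regcomp (&compiled, pattern, 0) != 0)
    return -1;

  /* Element 0 is the whole match; elements 1..re_nsub are the
     subexpressions.  */
  nmatch = compiled.re_nsub + 1;
  matches = malloc (nmatch * sizeof *matches);

  if (matches != NULL && index < nmatch
      && regexec (&compiled, text, nmatch, matches, 0) == 0)
    {
      *start = (long) matches[index].rm_so;
      *end = (long) matches[index].rm_eo;
      result = 0;
    }

  free (matches);
  regfree (&compiled);
  return result;
}
```

Asking for index 1 with `f\(o*\)' and `fum' yields the offsets 1 and 1; with `ba\(na\)*' and `ba' it yields -1 and -1; and with `ba\(na\)*' and `bananana' it yields 6 and 8.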
When you are finished using a compiled regular expression, you must
free the storage it uses by calling regfree.
regfree frees all the storage that *compiled
points to. This includes various internal fields of the regex_t
structure that aren't documented in this manual.
regfree does not free the object *compiled itself.
You should always free the space in a regex_t structure with
regfree before using the structure to compile another regular
expression.
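As a sketch of this rule, the following loop reuses a single regex_t object for several patterns and calls regfree after each use, before the next regcomp. The function count_matching_patterns is a hypothetical helper; regex.h is assumed as the header (with Rx, rxposix.h), and the sketch also assumes that a failed regcomp leaves nothing to free.

```c
#include <regex.h>   /* assumption: POSIX header; with Rx use rxposix.h */
#include <stddef.h>

/* Hypothetical helper: return how many of the N basic regular expressions
   in PATTERNS match somewhere in TEXT.  */
int
count_matching_patterns (const char *const *patterns, size_t n,
                         const char *text)
{
  regex_t compiled;             /* one object, reused for every pattern */
  size_t i;
  int hits = 0;

  for (i = 0; i < n; i++)
    {
      /* Assumption: after a failed regcomp there is nothing to free.  */
      if (regcomp (&compiled, patterns[i], REG_NOSUB) != 0)
        continue;

      if (regexec (&compiled, text, 0, NULL, 0) == 0)
        hits++;

      /* Release the storage before the object is compiled into again.  */
      regfree (&compiled);
    }

  return hits;
}
```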
When regcomp or regexec reports an error, you can use
the function regerror to turn it into an error message string.
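regerror produces an error message string for the error code
errcode, and stores the string in length bytes of memory
starting at buffer. For the compiled argument, supply the
same compiled regular expression structure that regcomp or
regexec was working with when it got the error. Alternatively,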
you can supply NULL for compiled; you will still get a
meaningful error message, but it might not be as detailed.
If the error message can't fit in length bytes (including a
terminating null character), then regerror truncates it.
The string that regerror stores is always null-terminated
even if it has been truncated.
The return value of regerror is the minimum length needed to
store the entire error message, including the terminating null character.
If this is no greater than length, then the error message was not
truncated, and you can use it. Otherwise, you should call regerror
again with a larger buffer.
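With a fixed-size buffer, the truncation check might look like the following sketch. The function format_regerror is a hypothetical wrapper, and regex.h is assumed as the header (with Rx, rxposix.h). The comparison relies on the returned length counting the terminating null character, so a value equal to the buffer size still fits.

```c
#include <regex.h>   /* assumption: POSIX header; with Rx use rxposix.h */
#include <stddef.h>

/* Hypothetical wrapper: store the message for ERRCODE in BUFFER, which
   holds LENGTH bytes.  Return 1 if the whole message fit, or 0 if it was
   truncated (BUFFER is still null-terminated in that case).  */
int
format_regerror (int errcode, const regex_t *compiled,
                 char *buffer, size_t length)
{
  size_t needed = regerror (errcode, compiled, buffer, length);

  return needed <= length;
}
```

A caller that gets 0 back can either use the truncated text as it is or retry with a larger buffer.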
Here is a function which uses regerror, but always dynamically
allocates a buffer for the error message:
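     /* Return a newly allocated error message for errcode.  xmalloc is
        assumed to be a malloc wrapper that never returns NULL; the
        caller must free the result.  */
     char *get_regerror (int errcode, regex_t *compiled)
     {
       /* With a zero-length buffer, regerror only reports the size needed.  */
       size_t length = regerror (errcode, compiled, NULL, 0);
       char *buffer = xmalloc (length);
       (void) regerror (errcode, compiled, buffer, length);
       return buffer;
     }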