Change to the specification of match.
Regex is an ANS Forth program to compile regular expressions written in the conventional format and to match ASCII text against them. The regular expression flavour is Perl-like. This release is rich in features but is still in development with features to be added. The package includes a String Builder module to make use of the results from a successful regular expression match. Regex is intended for use on 32 bit or larger desktop computers rather than 16 bit or embedded processors and so little attention has been paid to code size - ease of development and maintenance have been the main drivers. To this end the tools LexGen and Grace have been used for the regular expression scanner and parsers and an object oriented approach has been used in the design. The availability of this program has not yet been publicly announced, if you stumble upon it you are welcome to use it but be warned that it is still under development so various things may change until version 1.0 is released.
The features currently provided are:
abc
a|b
? * + {n,m} {n,} and {n}
for specification of repetition
*? +? ?? {n,m}? {n,}? and
{n}?
*+ ++ ?+ {n,m}+ {n,}+ and
{n}+
[abc] or [a-zA-Z]
[^abc] or
[^a-zA-Z]
\n using .
\{ \} \[ \( \) \. \| \* \+ \? \] \^
\$ \\
\a \e \f \n \r \t \v \0 \bl \#
\b \- \] \^ \$
\\
\xF \xFF \x{F...}
\cH for CTRL-H
(a(bc)def)gh
(?:abc)
(?<alice>...)^ $ \A \z \Z
\w \W \d \D \s \S both
inside and outside character classes
\1 \2 etc and named
using \g{alice}
(?{forth code})
(?(1)abc|def) (?(name)abc|def)
(?(?{forth-code})abc|def) (?(?=xyz)abc|def)
(?(?!xyz)abc|def) (?(?<=xyz)abc|def)
(?(?<!xyz)abc|def)
(?i) (?-i) (?i:...) (?-i:...) for case
(in)sensitivity(?m) (?-m) (?m:...) (?-m:...) for enhanced line anchor
match mode(?s) (?-s) (?s:...) (?-s:...) for "dot matches all"
mode(?x) (?-x) (?x:...) (?-x:...) for free-form layout and
comments mode terminated with (?e) or (?end).
# is used for comments running to the end of the current
line(?=...) (?!...) and look behind
(?<=...) (?<!...)
(?>...)
\U \L ... \E\u \l
For String Builder features see String Builder.
Regex 0.7 has been tested successfully using GForth, SwiftForth, VFX Forth, Win32 Forth and BigForth all running under MS Windows XP and Windows 7 (32 bit version) and is, therefore, likely to be ANS Forth compliant. Version 0.2 was reported as working on GForth under Mac OSX and Linux, and on PFE; there is no reason to suppose the latest version won't work on these systems. It has not been tested under other operating systems or on 64 bit systems and feedback on the use of Regex with such systems will be welcomed.
This release will not work on 16 bit Forth systems - that would require using Grace to re-generate the parsers for a 16 bit system.
Regex version 0.7 released 11 August 2010. This contains the following files:
Regex files:
xmini_oof.fth - a library file
sets.fth - a library file
regex.fth - the Regex and String Builder source code file
Testing:
regextest.fth - the test program
Documentation:
regex.html - this file
Note that, from version 0.3 onwards, Regex and String Builder files have been concatenated until one large file - regex.fth. Similarly regextest.fth. The state transition tables and parsers were generated by LexGen and Grace in 'ugly' mode and hence are unreadable (being computer generated, they were never intended to be readable anyway).
This is configured to work with GForth 0.7.0. To test Regex with GForth,
unzip the files into a suitable directory under GForth called, say,
regex. Then type:
s" regex/regextest.fth" included
or similar, which should give the following output:
Gforth 0.7.0, Copyright (C) 1995-2008 Free Software Foundation, Inc. Gforth comes with ABSOLUTELY NO WARRANTY; for details type `license' Type `bye' to exit s" regex/regextest.fth" included Loading extended mini oof ... Extended mini oof loaded. <0> Loading Sets.fth ... redefined cell redefined decode Sets.fth loaded. <0> redefined 1Cell with 1cel l redefined ?name-found redefined noop redefined ~ redefined testsym? redef ined this-parser redefined ~ redefined testsym? redefined >string redefined Format redefined <metachar> redefined <forth> redefined <group> redefined <e scapedchar> redefined <reference> redefined <conditional> redefined <modemod> redefined <casefolder> redefined <modifier> redefined this-parser redefined ~ Regex version 0.7 Copyright (C) Gerry Jackson 2010 redefined TESTING with Testing Running Regex tests * 1 * 2 * 3 * 4 * 5 * 6 * 7 * 8 * 9 * 10 * 11 * 12 * 13 * 14 * 15 * 16 * 17 * 18 * 19 * 20 * 21 * 22 * 23 * 24 * 25 * 26 * 27 * 28 * 29 * 30 * 31 * 32 * 33 * 34 * 35 * 36 * 37 * 38 * 39 * 40 * 41 * 42 * 43 * 44 * 45 * 46 * 47 * 48 * 49 * 50 * 51 * 52 * 53 * 54 * 55 * 56 * 57 * 58 * 59 * 60 * 61 * 62 * 63 * 64 Regex & String Builder tests completed ok
With other systems, directory paths may need to be added to the
s" filename" included
lines in files regex.fth and regextest.fth.
This is system dependent as there is no ANS Forth standard way to specify
directory paths.
The output from other systems will vary, ignore any redefinition messages (unless they cause an error). The last six lines are the important ones that prove the software works.
If a system fails with an error message during the tests, please send details (the regular expression, any output and the system used) to the contact email address below.
This section describes features relevant to Regex; it is not a primer or tutorial on regular expressions - there are many such resources available on the internet or books available for purchase. Knowledge of regular expressions is assumed.
Assuming the directory paths in regex.fth are correct for
your Forth system, simply type:
s" regex.fth" included
These words are provided to compile regular expressions and to match
strings against the compiled regular expression:
regex ( "ccc..." -- rgx ) compiles the following text as a
regular expression until one of (?e), (?end) or
the end of the line is reached. The multi-line free-format modifiers
(?x) or (?x: can be used in the first line to
compile a multi-line regular expression with # comments (note that this
free-format mode is ignored inside a character class specification where
white space and # are treated as characters in the usual way).
regex$ ( caddr u -- rgx ) compiles the regular expression
held as a string at (caddr u) returning the address of the
compiled regular expression
parse-regex ( char "ccc<char>" -- rgx ) parse ccc in
the input buffer delimited by the delimiter char. If the
delimiter is not present, it parses to the end of the line. The parsed
text is treated as a regular expression and compiled to address
rgx. This word enables a delimiter of choice to be used, for
example a user may define:
: parse/ [char] / parse-regex ;
to delimit a regular expression with the character /
match ( caddr1 u1 rgx -- caddr u -1 | caddr1 u1 0 ) attempts
to match the text at (caddr1 u1) against the compiled
regular expression rgx. If a match is found (caddr u
-1) is returned where (caddr u) is the rest of the
input string following the leftmost matching string, otherwise the input
string and 0 (caddr1 u1 0) are returned to indicate failure
to match. (Note that the returned string may have length u =
0 if, say, the match was at the end of the input string. The matching
string may be obtained using get-match. match was changed
in version 0.5 because experience using Regex has
shown that returning the rest of the string being searched is far more
useful than returning the matching string.
get-match ( -- caddr u -1 | 0 0 0 ) returns a matching
string from the previous execution of match if a match was
found, otherwise (0 0 0).
get-subex ( n -- caddr u ) returns the contents of the nth
pair of capturing parentheses, where the count starts at 1 from the start
of the regular expression. 0 get-subex is the same as
get-match
substitute and head&tail were removed in
version 0.2 as they have been superseded by the string builder.
No description is provided for most of the features listed above as they behave in a standard manner - examples of usage can be found in the test program provided. This section notes a few items that may not be standard.
subex-limit in
file regexmatch.fth
(?{forth code}), the embedded forth
code is a sequence of one or more Forth words. Restrictions on the Forth
code are:
$ \z \Z treat both CRLF and LF as
the end of the line.(?i) and friends operate until
one of: the corresponding (?-i); the end of the regular
expression; or until the end of the immediate enclosing parenthesessave-idefault
save-mdefault save-sdefault save-xdefault, all with stack effect
( flag -- ). For example:true save-idefault(?<name>...) and named references
\g{name} must have been previously declared using the
defining word refname. e.g.refname alicealice which will return the matched string
as ( caddr u ) if matched or ( 0 0 ) otherwise.2variable as was the case
in previous versions. If an invalid name is used the system will report
an error.
(?<=(\d+)\s+abc\s+def.*)\1\s+abc\s+\?(?=xyz)(\d+)\s+abc\s+\?(?=xyz)(?<=\1\s+abc\s+def.*)
This version of Regex has been tested with the 7 bit ASCII character set.
Limited tests have been carried out with 8 bit characters and there
should be no problems. The character size can be changed to 8 bit by
setting the constant maxchar to 255 in file
regex.fth. Do not set it larger, e.g. for 16 bit characters,
as character classes will then use large amounts of memory.
This has been introduced to the Regex package to simplify using the
results from matching a regular expression. The String Builder has syntax
in this form (alternatives use stringer and
parse-stringer - see below):
pattern-string stringer$
where the pattern-string is a string containing a sequence
of:
\n to copy a line-feed into the
concatenation buffer
\1 for the contents of the first pair of capturing
parentheses. Also references to Forth variables. These values are then
copied to the concatenation buffer.
(?x...) e.g. (?-c) to stop the concatenation
buffer from being cleared. Note that this is a change between versions
0.2 and 0.3. The commands include a sprintf formatter.
The word stringer$ parses the pattern-string
and acts on the contents to build up a string in the concatenation
buffer. The typical use is to replace the matching text from a regular
expression match with some other text. Note that the necessary code to
read/write strings from/to files is outside the scope of the Regex
package at present - these operations can clearly be written in standard
Forth. A possible future extension could handle some of this in the
String Builder command language.
For examples of usage refer to the test program included in the download.
stringer ( "ccc..." -- caddr u ) Accepts the
pattern-string from the input source until terminated by the
end of line or (?e) in free-layout mode. The concatenation buffer is
cleared, unless the command (?-c) is first in
pattern-string. The pattern-string is then
parsed and acted upon according to the contents. The string built up in
the concatenation buffer is returned as (caddr u) on completion, unless
the command (?-g) has been used.
stringer$ ( caddr u -- caddr2 u2 ) As for
stringer except that the pattern-string is
already available as (caddr u).
parse-stringer ( char -- caddr u ) Parse the input until the
delimiting character. The parsed string is then used as the
pattern-string and stringer$ is then executed.
stringify ( -- caddr u) Accepts text from the input source
in free-layout mode until an (?end) or (?e)
command is encountered. White space, including new line characters, are
stripped out and the remaining characters concatenated into a single
string. The (caddr u) pair of the resulting string is
returned to the caller. The aim of this word is to create a pattern
string from a readable free-layout representation that may be used inside
a colon definition; alternatively to easily create strings containing
'difficult' characters such as ". Use the following word
constant$ to save the (caddr u) of the
string.
The following points apply to stringify:
(?x) is not needed at the
start of the free-layout text - if present it is ignored and not copied
to the resulting string.
(-?x) and
(?x) are obeyed to switch free-layout mode off/on
respectively. The (-?x) and (?x) commands are
not copied to the resulting string.
(?-x) the
concatenation of text is terminated at the end of the current line; a
terminating (?end) or (?e) is not required in
this case. Free-layout mode may be switched on again with
(?x) before the end of the line.
(?-c) when
all text will be appended to the contents of the buffer. The first
(?-c) is not copied to the resulting string. Subsequent
occurrences of (?c) and (?-c) are not acted
upon, but simply copied to the buffer.
stringer commands not already described are not
acted upon, but simply copied to the buffer. This is true for the
spanning commands (?x: ... ) and (?-x: ... )/code>.
constant$ ( caddr u "spaces<name>" -- ) creates a
constant called <name> that, when executed, places
(caddr u) on the stack. This word is included for use with
stringify.
These are preceded by the \ character and append the
following character or value to the concatenation buffer. The list of
these is:
\\ appends character \ \( appends character (
\) character ) \$ character $
\. character . \{ character {
\} character } \# character #
\| character | \% character %
\a value 7 \e value 27
\f value 12 \n value 10
\r value 13 \s value 32
\t value 9 \v value 11
\0 value 0
\cX value (following character mod 32)
e.g. \cH appends CTRL H value 8 as does \ch and \c(
The pattern string can include various references to the results from the last regular expression match; these append the associated value to the concatenation buffer. The types of reference are:
\1 for the contents of the first capturing parentheses,
\2 for the second, and so on. These are equivalent to $1,
$2 etc in Perl.
\H refers to the part of the string before the match\M is the matching string itself\T refers to the part of the string after the match
\g{name} refers to a named capture where name
has been defined by refname (changed in version 0.4
onwards).
The commands that can be included in the pattern string are:
(?c) and (?-c), where (?c)
clears the concatenation buffer and (?-c) means do not
clear the buffer. (?-c) is only useful as the first
command in a pattern string to retain the results from a previous use
of the string builder. (?c) is the default.
(?g) and (?-g), where (?g) means
return the contents of the concatenation buffer as (caddr
u) at the end of the pattern string, and (?-g)
means do not return the string. (?g) is the default.
(?x) and (?-x) to indicate free-form layout
in exactly the same way as for regular expressions, (?e)
or (?end) is used to terminate the pattern string.
(?{forth-code}) to execute the enclosed forth-code. As
shown in the test program this may be used to match regular
expressions. It can also be used to insert text into the concatenation
buffer e.g.s" Name: (?{s" Mary" concat}) Smith" stringer$Name: Mary Smith
\U \L \u \l \E for case folding. These behave in exactly
the same way as in regular expressions.
(?(<test>)<then-part>|<else-part>)
conditional expression where <test> may be one
of:TRUE if the corresponding capturing
parentheses participated in the match.
TRUE if the
corresponding capturing parentheses participated in the match.
(?{forth code
sequence}) where the forth code must return FALSE
or a non-zero value taken as TRUE.
TRUE the <then-part> is
interpreted otherwise the <else-part> is interpreted. The
<then-part> and <else-part>'s can be any valid String
Builder sequence, hence conditionals may be nested.
%<flags><minwidth><precision><long><conversion>(?%<flags><minwidth><precision><long><conversion>)- for left
justification, 0 to use 0 as a padding character,
+ to include a + sign for positive numbers,
space to leave a space for a positive number,
# for a variant of the main operation.* to indicate the width is on the stack.. followed by a
number or * to indicate the width is on the stack.l to indicate a double length number.d u c s x X o % to indicate single integer, unsigned
integer, character, string, hexadecimal using abcdef for
x or ABCDEF for X, octal or a
% character respectively.BASE. If
* is used for either of these, the corresponding number
must be above the other arguments on the stackf e E g G characters
remain to be done. A useful addition might be conversion using the
current value of the Forth word BASE.There is plenty of scope to add to the set of commands in the String Builder and this will probably happen as experience is gained in its use. Examples are the addition of other control structures e.g. loops but it is unclear where the most cost-effective boundary lies between doing this in the pattern string or in Forth.
The floating point options of the format command are outstanding.
For commands the String Builder has kept mostly to the regular expression syntax of Perl. This may change in future versions.
A simple concatenation buffer is provided to build up replacement strings
for text matched with regular expressions. Words to use this are:
clear-concat ( -- ) Empties the buffer.
concat ( caddr u -- ) Appends the string (caddr
u) to the contents of the buffer
concat-char ( char -- )Appends a character to the
buffer.
get-concat ( -- caddr u )Returns the contents of the
buffer
For an example of use, see the test program.
The size of the concatenation buffer is set by constant
cb-size in file regex.fth and therefore may be
changed by the user.
The concatenation buffer may be deleted with: cb delete
For more complex string processing a package such as David Williams' Dynamic Strings Package can be used, where embedded code calling words in the package can be inserted into the pattern string.
These features may be added in future versions, but not necessarily in this order:
\g{-1}
(?i-msx)
(?'name'...)
There are no plans to include the following features:
\num
(?#...)
The regular expression engine is based on the ideas given in the paper Regular Expression Matching Can Be Simple And Fast by Russ Cox.
The regular expression and string builder parsers were generated by Grace from a BNF specification. In all there are three separate parsers in the package.
The state transition tables were generated by LexGen. There are four tables compacted into a one.
Regex requires a case-insensitive ANS Forth system. The following words from the various ANS Forth word sets are used.
! #> #s ' ( * + +! +loop , - . ." /
/mod 0< 0= 1+ 1- 2! 2@ 2drop 2dup 2over 2swap : ; < <# = >
>body >in >number >r ?dup @ abort" abs align aligned allot
and base begin bl c! c, c@ cell+ cells char char+ chars constant count cr
create decimal depth do does> drop dup else emit environment? evaluate
execute exit find here hold i if immediate invert j literal loop lshift
max min mod move negate or over postpone r> r@ recurse repeat rot
rshift s" source space spaces swap then type u< unloop until variable
while word xor [ ['] [char] ]
.( .r 0<> 0> 2>r
2r> 2r@ :noname <> ?do again case compile, endcase endof false
hex nip of pad parse pick refill roll to true tuck u> value within
\
2constant 2literal 2variable d=
dabs
included
allocate free
.s
[if] [else]
[then]
/string cmove compare
Any comments, feedback (good or bad), error reports etc will be gratefully received, email Gerry Jackson.
v0.7 11 August 2010
stringify and constant$ added.
(1) not (\1).
v0.6 31 July 2010
{n,m}? {n,}? {n}? added
v0.5 27 July 2010
?? *? +? added
MATCH changed to return the rest of the
subject string
v0.4 10 June 2010
2variable.
a(b*)*c no longer crash.
v0.3 24 May 2010
(?!) always fails to match
v0.2 7 May 2010
v0.1 23 April 2010
match changed.
regexutils.fth added to provide words such as
substitute and a simple string concatenation buffer.
v0.0 2 April 2010