

11 Regular Expressions

This chapter describes regular expression pattern matching and string processing based on regular expression substitutions. Tcl commands described: regexp and regsub.

Regular expressions are a formal way to describe string patterns. They provide a powerful and compact way to specify patterns in your data. Even better, there is a very efficient implementation of the regular expression mechanism due to Henry Spencer. If your script does much string processing, it is worth the effort to learn about regular expressions. Your Tcl scripts will be compact and efficient. This chapter includes a large set of canned regular expressions you can pick up and use.

Regular expression substitution is a mechanism that lets you rewrite a string based on regular expression matching. This is another powerful tool, and this chapter includes several examples that do a lot of work in just a few Tcl commands. Stephen Uhler has shown me several ways to transform input data into a Tcl script and then use subst or eval to process the data. The idea takes a moment to get used to, but it provides a very efficient way to process strings.

One of the stumbling blocks with regular expressions is that they use some of the same special characters as Tcl. When you construct regular expressions in Tcl programs, you must be aware of this. In many cases you can group the regular expression with curly braces so Tcl pays no attention to it. In other cases you need Tcl to do substitutions on part of the pattern, and then you need to worry about quoting the special characters in the regular expression. The first section of this chapter ignores this issue. The working examples take into account any necessary quoting.
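A quick sketch illustrates the quoting issue before we get into the details. The input string here is made up for illustration:

```tcl
# Braces hide the pattern from the Tcl parser entirely:
regexp {[0-9]+} "abc 42 def" match
# match is now 42

# With double quotes, Tcl performs substitutions first, so the
# square brackets must be escaped to keep Tcl from treating
# [0-9] as a command:
set digits 0-9
regexp "\[$digits\]+" "abc 42 def" match
# same pattern after substitution: [0-9]+
```

The brace-quoted form is preferred whenever the pattern needs no Tcl substitutions.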

Regular Expression Syntax

A regular expression is a sequence of the following items: literal characters, which match themselves; a period (.), which matches any single character; character sets in square brackets; the repetition operators *, +, and ?; alternation with |; the anchors ^ and $; and subpatterns grouped with parentheses. Each of these is described below.

The matching character can be restricted to a set of characters with the [xyz] syntax. Any of the characters between the two brackets is allowed to match. For example, the following matches either Hello or hello:

[Hh]ello
The matching set can be specified as a range over the ASCII character set with the [x-y] syntax. There is also the ability to specify the complement of a set. That is, the matching character can be anything except what is in the set. This is achieved with the [^xyz] syntax. Ranges and complements can be combined. The following matches anything except the uppercase and lowercase letters:

[^a-zA-Z]
Repetition is specified with *, for zero or more, +, for one or more, and ?, for zero or one. These operators apply to the previous item, which is either a matching character, which could involve the set syntax, or a subpattern grouped with parentheses. The following matches a string that contains b followed by zero or more a's:

ba*
The following matches a string that has one or more sequences of ab:

(ab)+
The pattern that matches anything is:

.*
Alternation is specified with |, a pipe symbol. Another way to match either Hello or hello is:

hello|Hello
By default a pattern does not have to match the whole string. There can be unmatched characters before and after the match. You can anchor the match to the beginning of the string by starting the pattern with ^, or to the end of the string by ending the pattern with $. You can force the pattern to match the whole string by using both. All strings that begin with spaces or tabs are matched with:

^[ \t]+
If a pattern can match several parts of a string, the matcher takes the match that occurs earliest in the input string. Then, if there is more than one match from that same point, the matcher takes the longest possible match. The rule of thumb is: first, then longest.
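A short experiment (the input strings here are made up) makes the first-then-longest rule concrete:

```tcl
# "First": the match that starts earliest in the string wins,
# even though a later match ("longer") would be longer.
regexp {[a-z]+} "ab longer" match
# match is "ab"

# "Then longest": from that earliest starting point, the
# longest possible match is taken.
regexp {a*} "aaab" match
# match is "aaa"
```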

A Pattern to Match URLs

Let's try one complex pattern to match a URL. Our sample input string will be:

http://www.sun.com:80/index.html

The regular expression is:

[^:]+://[^:/]+(:[0-9]+)?/.*

Let's look at the pattern one piece at a time. The first part looks for the protocol, which is separated by a colon from the rest of the URL. The first part of the pattern is one or more characters that are not a colon, followed by a colon. This matches the http: part of the URL:

[^:]+:
The next part of the pattern looks for the server name, which comes after two slashes. The server name is either followed by a colon and a port number, or by a slash. The pattern uses a complementary set that specifies one or more characters that are not a colon or a slash. This matches the //www.sun.com part of the URL:

//[^:/]+
The port number is optional, so a subpattern is delimited with parentheses and followed by a question mark. This matches the :80 part of the URL:

(:[0-9]+)?
The last part of the pattern is everything else, starting with a slash. This matches the /index.html part of the URL:

/.*
Use subpatterns to parse strings.

To make this pattern really useful we delimit several subpatterns with parentheses:

([^:]+)://([^:/]+)(:([0-9]+))?(/.*)
These parentheses do not change the way the pattern matches. Only the optional port number really needs the parentheses in this example. However, the regexp command described next gives us access to the strings that match these subpatterns. In one step regexp can test for a valid URL and divide it into the protocol part, the server, the port, and the trailing path. This is shown later in Example 11-2 on page 126.

The regexp Command

The regexp command provides direct access to the regular expression matcher. Not only does it tell you if a string matches a pattern, it can also extract one or more matching substrings. The return value is 1 if some part of the string matches the pattern; it is 0 otherwise. Its syntax is:

regexp ?flags? pattern string ?match sub1 sub2...?
The flags are optional and constrain the match as follows:

-nocase	Uppercase characters in string are treated as lowercase during the match.
-indices	The match variables each contain a pair of character indices into string, instead of the matching substring.
--	Signals the end of the flags, which is useful if the pattern begins with a dash.
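The -nocase and -indices flags are easy to demonstrate on a sample string:

```tcl
# -nocase ignores case differences when matching
regexp -nocase {hello} "Hello, world" match
# match is "Hello"

# -indices stores the range of the match instead of its text
regexp -indices {world} "Hello, world" range
# range is "7 11"
```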

Using regular expressions to parse a string.
set env(DISPLAY) sage:0.1
regexp {([^:]*):} $env(DISPLAY) match host
=> 1
set match
=> sage:
set host
=> sage

The pattern involves a complementary set, [^:], to match anything except a colon. It uses repetition, *, to repeat that zero or more times. It groups that part into a subexpression with parentheses. The literal colon ensures that the DISPLAY value matches the format we expect. The part of the string that matches the complete pattern is stored into the match variable. The part that matches the subpattern is stored into host. The whole pattern has been grouped with braces to quote the square brackets. Without braces it would be:

regexp (\[^:\]*): $env(DISPLAY) match host
This is quite a powerful statement, and it is efficient. If we had only had the string command to work with, we would have needed to resort to the following, which takes roughly twice as long to interpret:

set i [string first : $env(DISPLAY)]
if {$i >= 0} {
	set host [string range $env(DISPLAY) 0 [expr $i-1]]
}
Example 11-2 demonstrates a pattern with several subpatterns that extract the different parts of a URL. There are lots of subpatterns, and you can determine which match variable is associated with which subpattern by counting the left parentheses. Note that the port number uses a nested subpattern. The subpattern (:([0-9]+))? uses the zero-or-one pattern character ? because the port specification is optional. However, this puts a colon into the match variable x, so another subpattern is used to just get the digits of the port number into the port match variable:

A pattern to match URLs.
set url http://www.sun.com:80/index.html
regexp {([^:]+)://([^:/]+)(:([0-9]+))?(/.*)} $url \
	match protocol server x port path
=> 1
set match
=> http://www.sun.com:80/index.html
set protocol
=> http
set server
=> www.sun.com
set x
=> :80
set port
=> 80
set path
=> /index.html

Useful Regular Expressions

Table 11-1 lists regular expressions as you would use them in Tcl commands. Most are quoted with curly braces to turn off the special meaning of square brackets and dollar sign. Other patterns are grouped with double quotes and use backslash quoting because the patterns include backslash sequences like \n and \t that must be substituted by Tcl before the regexp command is called.
Simple regular expressions
{^[yY]} Begins with y or Y, as in a Yes answer.
{^(yes|YES|Yes)$} Exactly "yes", "Yes", or "YES".
"^\[^ \t:\]+:" Begins with colon-delimited field that has no spaces or tabs.
"^\[ \t]*$" A blank line.
{^[A-Za-z]+$} Only letters.
{^[A-Za-z0-9_]+$} Letters, digits, and the underscore.
{[][${}\\]} The set of Tcl special characters: ] [ $ { } \
"\[^\n\]*\n" Everything up to a newline.
{\.} A period.
{[][$^?+*()|\\]} The set of regular expression special characters: ] [ $ ^ ? + * ( ) | \
"(^|\n)to:\[^\n\]+\n" A line that begins with "to:". The (^|\n) matches the beginning of string or a newline before the line.
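The table entries can be dropped directly into regexp commands. For example, on some made-up input strings:

```tcl
# Test a yes/no answer, using the first entry in the table
regexp {^[yY]} "Yes, please"       ;# returns 1
regexp {^[yY]} "no thanks"         ;# returns 0

# Pull out a colon-delimited field with no spaces or tabs
regexp "^\[^ \t:\]+:" "sage:0.1" match
# match is "sage:"
```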

The regsub Command

The regsub command does string substitution based on pattern matching. Its syntax is:

regsub ?switches? pattern string subspec varname
The regsub command returns the number of matches and replacements, or 0 if there was no match. regsub copies string to varname, replacing occurrences of pattern with the substitution specified by subspec. If the pattern does not match, then string is copied to varname without modification. The optional switches include:

-all	Substitute all occurrences of the pattern, instead of just the first.
-nocase	Uppercase characters in string are treated as lowercase during the match.
--	Signals the end of the switches, which is useful if the pattern begins with a dash.

For example, the following replaces a user's home directory with a ~:

regsub ^$env(HOME)/ $pathname ~/ newpath
The following constructs a C compile command line given a filename:

regsub {([^\.]*)\.c} file.c {cc -c & -o \1.o} ccCmd
The \. is used to specify a match against period. The & is replaced with file.c, and \1 is replaced with file, which matches the pattern between the parentheses. The value assigned to ccCmd is:

cc -c file.c -o file.o
The regsub command can count things for us. The following command counts the newlines in some text. In this case the substitution is not important:

set numLines [regsub -all \n $text {} ignore]

Transforming Data to Tcl with regsub

One of the most powerful combinations of Tcl commands is regsub and subst. This section describes a few examples that use regsub to transform data into Tcl commands, and then use subst to replace those commands with a new version of the data. This technique is very efficient because it relies on two subsystems that are written in highly optimized C code: the regular expression engine and the Tcl parser. These examples are primarily written by Stephen Uhler.

URL Decoding

When a URL is transmitted over the network, it is encoded by replacing special characters with a %xx sequence, where xx is the hexadecimal code for the character. In addition, spaces are replaced with a plus (+). It would be tedious and very inefficient to scan a URL one character at a time with Tcl statements to undo this encoding. It would be more efficient to do this with a custom C program, but still very tedious. Instead, a combination of regsub and subst can do it efficiently in just a few Tcl commands.

Replacing the + with spaces requires quoting the + because it is the one-or-more special character in regular expressions:

regsub -all {\+} $url { } url
The %xx are replaced with a format command that will generate the right character:

regsub -all {%([0-9a-fA-F][0-9a-fA-F])} $url \
	{[format %c 0x\1]} url
The %c directive to format tells it to generate the character from a character code number. We force a hexadecimal interpretation with a leading 0x. The resulting string is passed to subst to get the format commands substituted:

set url [subst $url]
For example, if the input is %7ewelch, the result of the regsub is:

[format %c 0x7e]welch
And then subst generates:

~welch
Example 11-3 encapsulates this trick in the Url_Decode procedure.

The Url_Decode procedure.
proc Url_Decode {url} {
	regsub -all {\+} $url { } url
	regsub -all {%([0-9a-fA-F][0-9a-fA-F])} $url \
		{[format %c 0x\1]} url
	return [subst $url]
}
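As a quick sanity check, here is the same pair of substitutions applied inline to a made-up encoded string:

```tcl
set url "a%7eb+c%2Fd"
# Replace + with space; the + must be quoted in the pattern
regsub -all {\+} $url { } url
# Replace each %xx with a format command, then substitute
regsub -all {%([0-9a-fA-F][0-9a-fA-F])} $url {[format %c 0x\1]} url
set url [subst $url]
# url is now "a~b c/d"
```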

CGI Argument Parsing

Example 11-4 builds upon Url_Decode to decode the inputs to a CGI program that processes data from an HTML form. Each form element is identified by a name, and the value is URL encoded. All the names and encoded values are passed to the CGI program in the following format:

name1=value1&name2=value2&name3=value3
Example 11-4 uses split to get back a list of names and values, and then it decodes them with Url_Decode. The values are stored in the cgi array with the name as the key:

The Cgi_Parse and Cgi_Value procedures.
proc Cgi_Parse {} {
	global cgi env
	if [info exists env(QUERY_STRING)] {
		# This comes from GET-style forms
		set query $env(QUERY_STRING)
	} elseif {[info exists env(CONTENT_LENGTH)]} {
		# This comes from POST-style forms
		set query [read stdin $env(CONTENT_LENGTH)]
	} else {
		# No content-length so it is only safe to read one line
		gets stdin query
	}
	foreach {name value} [split $query &=] {
		lappend cgi([Url_Decode $name]) [Url_Decode $value]
	}
}
proc Cgi_Value {name} {
	global cgi
	if ![info exists cgi($name)] {
		return {}
	} elseif {[llength $cgi($name)] == 1} {
		# Strip a level of listing for convenience
		return [lindex $cgi($name) 0]
	} else {
		# Return the whole list of values
		return $cgi($name)
	}
}
An HTML form can have several form elements with the same name, and this can result in more than one value for each name. This is why Cgi_Parse uses lappend to assign to the array. The list structure makes it a bit awkward to get the values back, so the Cgi_Value procedure is added to make this more convenient. In the normal case of a single value for a name, Cgi_Value eliminates the list structure.

Decoding HTML Entities

The next example is a decoder for HTML entities. In HTML, special characters are encoded as entities. If you want a literal < or > in your document, you encode them to avoid conflict with the <tag> syntax used in HTML. HTML syntax is briefly described in Chapter 3 on page 30. The < and > characters are encoded as the entities &lt; and &gt;, respectively. Characters with codes above 127 like copyright © and egrave è are also encoded. There are named entities, like &lt; for < and &egrave; for è. You can also use decimal-valued entities like &#169; for ©. Finally, the trailing semicolon is optional, so &lt or &lt; can both be used to encode <.

The entity decoder is similar to Url_Decode. In this case, however, we need to be more careful with subst. The text passed to the decoder could contain special characters like a square bracket or dollar sign. With Url_Decode we can rely on those special characters being encoded as, for example %24. Entity encoding is different (do not ask me why URLs and HTML have different encoding standards), and dollar signs and square brackets are not necessarily encoded. This requires an additional pass to quote these characters. This regsub puts a backslash in front of all the brackets, dollar signs, and backslashes.

regsub -all {([][$\\])} $text {\\\1} new
The decimal encoding (e.g., &#169;) is also more awkward than the hexadecimal encoding used in URLs. We cannot force a decimal interpretation of a number in Tcl. In particular, if the entity has a leading zero (e.g., &#010;) then Tcl interprets the value (e.g., 010) as octal. The scan command is used to do a decimal interpretation:

regsub -all {&#([0-9][0-9]?[0-9]?);?}  $new \
	{[format %c [scan \1 %d tmp;set tmp]]} new
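The scan idiom is easy to demonstrate on its own:

```tcl
# scan %d forces a decimal interpretation, even with a
# leading zero that expr would otherwise treat as octal
scan 010 %d value
# value is 10

# the idiom from the regsub: scan, then return the scanned value
set char [format %c [scan 169 %d tmp; set tmp]]
# char is "\xa9", the copyright character
```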
The named entities are converted with an array that maps from the entity names to the special character. The only detail is that unknown entity names (e.g., &foobar;) are not converted. This mapping is done inside HtmlMapEntity, which guards against invalid entities.

regsub -all {&([a-zA-Z]+)(;?)} $new \
		{[HtmlMapEntity \1 \\\2 ]} new
If the input text contained:

[x &lt; y]
then the regsub would transform this into:

\[x [HtmlMapEntity lt \; ] y\]
Finally, subst will result in:

[x < y]
The Html_DecodeEntity procedure.
proc Html_DecodeEntity {text} {
	if {![regexp & $text]} {return $text}
	regsub -all {([][$\\])} $text {\\\1} new
	regsub -all {&#([0-9][0-9]?[0-9]?);?} $new \
		{[format %c [scan \1 %d tmp;set tmp]]} new
	regsub -all {&([a-zA-Z]+)(;?)} $new \
		{[HtmlMapEntity \1 \\\2 ]} new
	return [subst $new]
}
proc HtmlMapEntity {text {semi {}}} {
	global htmlEntityMap
	set result $text$semi
	catch {set result $htmlEntityMap($text)}
	return $result
}

# Some of the htmlEntityMap
array set htmlEntityMap {
	lt	<	gt	>	amp	&
	aring	\xe5	atilde	\xe3
	copy	\xa9	ecirc	\xea	egrave	\xe8
}

A Simple HTML Parser

The following example is the brainchild of Stephen Uhler. It uses regsub to transform HTML into a Tcl script. When the script is evaluated, it calls a procedure to handle each tag in an HTML document. This provides a general framework for processing HTML. Different callback procedures can be applied to the tags to achieve different effects. For example, the html_library package on the CD-ROM uses Html_Parse to display HTML in a Tk text widget.

proc Html_Parse {html cmd {start {}}} {

	# Map braces and backslashes into HTML entities
	regsub -all \{ $html {\&ob;} html
	regsub -all \} $html {\&cb;} html
	regsub -all {\\} $html {\&bsl;} html

	# This pattern matches the parts of an HTML tag
	set w " \t\r\n"					;# white space
	set exp <(/?)(\[^$w>]+)\[$w]*(\[^>]*)>

	# This generates a call to cmd with HTML tag parts
	# \1 is the leading /, if any
	# \2 is the HTML tag name
	# \3 is the parameters to the tag, if any
	# The curly braces at either end group all the text
	# after the HTML tag, which becomes the last arg to $cmd.
	set sub "\}\n$cmd {\\2} {\\1} {\\3} \{"
	regsub -all $exp $html $sub html

	# This balances the curly braces,
	# and calls $cmd with $start as a pseudo-tag 
	# at the beginning and end of the script.
	eval "$cmd {$start} {} {} {$html}"
	eval "$cmd {$start} / {} {}"
}
An example will help visualize the transformation. Given this HTML:

<Title>My Home Page</Title>

<Body bgcolor=white text=black>
<H1>My Home</H1>
This is my <b>home</b> page.

and a call to Html_Parse that looks like this:

Html_Parse $html {Render .text} hmstart
then the generated program is this:

Render .text {hmstart} {} {} {}

Render .text {Title} {} {} {My Home Page}
Render .text {Title} {/} {} {
}
Render .text {Body} {} {bgcolor=white text=black} {
}
Render .text {H1} {} {} {My Home}
Render .text {H1} {/} {} {
This is my }
Render .text {b} {} {} {home}
Render .text {b} {/} {} { page.
}
Render .text {hmstart} / {} {}

One overall point to make about this example is the difference between using eval and subst with the generated script. The decoders shown in Examples 11-3 and 11-5 use subst to selectively replace encoded characters while ignoring the rest of the text. In Html_Parse we must process all the text. The main trick is to replace the matching text (e.g., the HTML tag) with some Tcl code that ends in an open curly brace and starts with a close curly brace. This effectively groups all the unmatched text.

When eval is used this way, you must do something with any braces and backslashes in the unmatched text. Otherwise the resulting script does not parse correctly. In this case these special characters are encoded as HTML entities. We can afford to do this because the cmd that is called must deal with encoded entities already. It is not possible to quote these special characters with backslashes because all this text is inside curly braces, so no backslash substitution is performed. If you try that, the backslashes will be seen by the cmd callback.

Finally, I must admit that I am always surprised that this works:

eval "$cmd {$start} {} {} {$html}"
I always forget that $start and $html are substituted in spite of the braces. This is because double quotes are being used to group the argument, so the quoting effect of braces is turned off. Try this:

	set x hmstart
	set y "foo {$x} bar"
=> foo {hmstart} bar

Stripping HTML Comments

The Html_Parse procedure does not correctly handle HTML comments. The problem is that the syntax for HTML comments allows tags inside them, so there can be > characters inside the comment. HTML comments are also used to hide JavaScript inside pages, which can also contain >. We can fix this with a pass that eliminates the comments.

The comment syntax is this:

<!-- HTML comment -->
It is awkward to match the closing --> without getting stuck on embedded > characters, or without matching too much and going all the way to the end of the last comment. This is because the regular expression matcher uses a greedy algorithm. Time for another trick:

regsub -all -- --> $html \x81 html
This replaces all the end comment sequences with a single character that is not allowed in HTML. Now you can delete the comments like this:

regsub -all "<!--\[^\x81\]*\x81" $html {} html
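Putting both substitutions together on a small made-up fragment (note the -- switch, which marks the end of options so the --> pattern is not mistaken for a switch):

```tcl
set html "before <!-- a comment with > inside --> after"
# Replace every end-comment sequence with a character
# that cannot appear in HTML
regsub -all -- --> $html \x81 html
# Delete everything from <!-- up to that marker
regsub -all "<!--\[^\x81\]*\x81" $html {} html
# html is now "before  after"
```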

Other Commands That Use Regular Expressions

Several other Tcl commands use regular expressions. The switch command accepts a -regexp flag so that its branches are selected by pattern matching, lsearch can search a list with -regexp, and the Tk text widget's search operation accepts a -regexp option.


Copyright © 1997, Brent Welch. All rights reserved. This will be published by Prentice Hall in the 2nd edition of Practical Programming in Tcl and Tk.