Searching with regular expressions

Layers that support regular expressions have pattern input fields labeled Regular expression.

A regular expression (or ‘regex’) is a compact way of finding a pattern in a string of text. In APLS, for example, searching the orthography layer for `do.*` will return all words that start with the characters `do` (including the word `do` itself).

Regular expressions are a feature of many programming languages, so you may have used them before. Even seasoned regex users can use a refresher once in a while. For new users, regex can be a little intimidating because it is so compact and can use a lot of special symbols.

If you are…

hoping to learn regex, read the introduction to regex
hoping to use some simple patterns (but you don’t need to learn regex), read about common regex “idioms”
just looking for a refresher on what the symbols mean, read the cheat sheet
experienced with regex in other contexts, read about how regex is different in APLS


Regex cheat sheet

Character	Meaning
Letters & numbers (`A`-`Z`, `a`-`z`, `0`-`9`)	Themselves
Most special characters used in layer notation systems	Themselves
`.`	Any single character
`?`	The character before `?` is optional; i.e. it can occur 0 or 1 times
`+`	The character before `+` can repeat; i.e. it can occur 1 or more times
`*`	The character before `*` can be optional or repeat; i.e. it can occur any number of times, including 0
`()`	Characters inside the parentheses are treated as a unit
`[abc]`	Match `a` or `b` or `c`
`[^abc]`	Match anything other than `a`, `b`, and `c`
`ab|cd`	Match `ab` or `cd`
`\`	The character after `\` is not treated as a metacharacter; i.e., the character is treated literally

Because of the notation systems used by certain layers, the characters in APLS that you may need to use \ to match literally are:

Layer(s)	Character	Meaning
morphemes	`+`	Morpheme boundary
word	`.`	Short pause
word	`-`	Long pause
word	`?`	Question mark
foll_segment	`.`	Following pause
pronounce	`-`	Syllable boundary
segment, foll_segment, syllables, phonemes, dictionary_phonemes, pronounce	`{`	DISC symbol for the IPA /æ/ vowel
segment, foll_segment, syllables, phonemes, dictionary_phonemes, pronounce	`$`	DISC symbol for the IPA /ɔ/ vowel
part_of_speech	`$`	Used in the part-of-speech tags `$` (currency symbols), `PRP$` (possessive pronoun), `WP$` (possessive wh- pronoun)
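
As a rough illustration of why escaping matters, the Python sketch below compares an unescaped and an escaped pattern. It uses Python's `re.fullmatch` as a stand-in for APLS's whole-annotation matching, which is an assumption made for demonstration only (APLS itself evaluates regex in MySQL). The annotation lists and the `whole_matches` helper are made up for the example.

```python
import re

# Hypothetical word-layer annotations: "." marks a short pause, "-" a long pause.
annotations = [".", "-", "a", "I", "?"]

def whole_matches(pattern, labels):
    # re.fullmatch approximates matching a pattern against a whole annotation
    return [label for label in labels if re.fullmatch(pattern, label)]

# Unescaped "." is a wildcard, so it matches every one-character annotation.
unescaped = whole_matches(r".", annotations)

# Escaped "\." matches only the literal short-pause symbol.
escaped = whole_matches(r"\.", annotations)

# Escaping "$" lets a part-of-speech pattern match the literal tag PRP$.
tags = whole_matches(r"PRP\$", ["PRP", "PRP$", "WP$"])

print(unescaped, escaped, tags)
```

Here the unescaped pattern returns all five annotations, while the escaped one returns only the pause symbol.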

Entering an invalid regex will make the text turn red. If you hover over the red text, a tooltip will pop up with a brief explanation of why the regex is invalid. In addition, if you click the Search button with an invalid regex, the search will return an error message.

Common regex idioms

These are the patterns you’re most likely to need when searching APLS:

Dot-asterisk (.*): Match anything or nothing
Dot-plus (.+): Match one or more of any character (doesn’t have to be the same character)
Square brackets ([]): Match one of the characters in the brackets
Square brackets with a caret ([^]): Match anything other than the characters in the brackets
Vertical pipe (|): Match one of the strings on either side of the pipe

Dot-asterisk (`.*`)

Dot-asterisk .* can be used at the beginning, end, and/or middle of a pattern to denote “anything (or nothing at all)”. APLS searches look for whole word matches, so using .* is useful at the edge (beginning or end) of patterns when you are looking for strings that can appear in multiple words.

Pattern	Explanation	Matches	Does not match
`do.*`	Any word that starts with `do`	do dot dotted	dig ado boot
`.*king`	Any word that ends with `king`	king seeking asking	kings kin doing
`.*tend.*`	Any word that contains `tend`	tend tender extend Nintendo	ten tinder
`a.*n`	Any word that starts with `a` and ends with `n`	an again attention	in apple and
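
If it helps to experiment outside APLS, the dot-asterisk idiom can be approximated in Python. This is only a sketch: `re.fullmatch` is assumed here as an analogue of APLS's whole-word matching, and the word lists are small samples rather than real search results.

```python
import re

words = ["do", "dot", "dotted", "dig", "ado", "boot"]

# "do.*" : "do" followed by anything, including nothing at all
starts_with_do = [w for w in words if re.fullmatch(r"do.*", w)]

more_words = ["an", "again", "attention", "in", "apple", "and"]

# "a.*n" : starts with "a", ends with "n", anything (or nothing) in between
a_to_n = [w for w in more_words if re.fullmatch(r"a.*n", w)]

print(starts_with_do)
print(a_to_n)
```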

Dot-plus (`.+`)

Dot-plus .+ can be used at the beginning, end, and/or middle of a pattern to denote “at least one of any character”. The important distinction between .+ and .* is that .+ requires there to be at least one character in that position, as shown with the examples below.

Pattern	Explanation	Matches	Does not match
`do.+`	Any word that starts with `do` except `do`	dot dotted	do ado
`.+king`	Any word that ends with `king` except `king`	seeking asking	king kings kin
`.+tion.+`	Any word that contains `tion` and has characters on both sides of `tion`	questions recreational	action question
`a.+n`	Any word that starts with `a` and ends with `n` except `an`	again attention	an in and
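
The difference between `.*` and `.+` is easiest to see side by side. The following Python sketch assumes `re.fullmatch` behaves like APLS's whole-word matching for these simple patterns; the word list is illustrative.

```python
import re

words = ["do", "dot", "dotted", "ado"]

# ".*" allows zero characters after "do", so the bare word "do" is included
with_star = [w for w in words if re.fullmatch(r"do.*", w)]

# ".+" requires at least one character after "do", so "do" is excluded
with_plus = [w for w in words if re.fullmatch(r"do.+", w)]

print(with_star)
print(with_plus)
```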

Square brackets (`[]`)

Square brackets [] can be used to match any one of the characters within the brackets. The characters within the brackets are a character set.

Pattern	Explanation	Matches	Does not match
`do[tg]`	Any 3-letter word that starts with `do` and ends with `t` or `g`	dot dog	do doe dotted
`[dw]on't`	Any 5-character word that starts with `d` or `w` and ends with `on't`	don't won't	donut want
`.*[oau][td]`	Any word whose last two characters are `o` or `a` or `u` followed by `t` or `d`	cut glad float	cost cots

Importantly, square brackets always match a single character:

Pattern	Explanation	Matches	Does not match
`r[oau]t`	Any 3-letter word that starts with `r`, then `o` or `a` or `u`, and ends with `t`	rot rat rut	rout root
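
To see that a character set stands for exactly one character, the Python sketch below compares `r[oau]t` with a variant that adds the `+` quantifier from the cheat sheet. As elsewhere, `re.fullmatch` is an assumed stand-in for APLS's whole-word matching, and the word list is a small sample.

```python
import re

words = ["rot", "rat", "rut", "rout", "root"]

# One character from the set: only 3-letter words can match
single = [w for w in words if re.fullmatch(r"r[oau]t", w)]

# Adding "+" lets the set repeat, so two-vowel words match too
repeated = [w for w in words if re.fullmatch(r"r[oau]+t", w)]

print(single)
print(repeated)
```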

Square brackets with a caret (`[^]`)

Square brackets can be inverted with ^ to match anything other than the characters in the brackets. Inverted square brackets will still only match a single character, similar to normal square brackets.

Pattern	Explanation	Matches	Does not match
`do[^t]`	Any 3-letter word that starts with `do` and does not end with `t`	doe don	do dot dotted
`b[^eu]t`	Any 3-letter word that starts with `b`, then any character other than `e` or `u`, and ends with `t`	bot bat bit	bet but boat bots
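
A common surprise with inverted sets is that they still require a character to be present. The Python sketch below illustrates this under the assumption that `re.fullmatch` mirrors APLS's whole-word matching; the words are sample data.

```python
import re

words = ["doe", "don", "do", "dot", "dotted"]

# "[^t]" must consume one character that is not "t",
# so the two-letter word "do" cannot match
not_t = [w for w in words if re.fullmatch(r"do[^t]", w)]

print(not_t)
```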

Vertical pipe (`|`)

Vertical pipes | can be used to specify multiple possible patterns at once. These can be useful when you’d like to find multiple words with the same search.

Pattern	Explanation	Matches	Does not match
`steeler|penguin`	The word `steeler` or the word `penguin`	steeler penguin	steel penguins pirate
`steel.*|pen.*`	Any word that starts with `steel` or `pen`	steelers pencil	steer bullpen pirate
`b(oo|ea|u)t`	Any word that starts with `b`, then `oo` or `ea` or `u`, and ends with `t`	boot beat but	boat bet bit
`(wh|th|sh).*`	Any word that starts with `wh` or `th` or `sh`	what though short	other slush
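
Parentheses matter when a pipe is only meant to cover part of a pattern. The Python sketch below contrasts the grouped pattern with an ungrouped version; it assumes `re.fullmatch` as an approximation of APLS's whole-word matching and uses a made-up word list.

```python
import re

words = ["boot", "beat", "but", "boat", "bet", "bit"]

# Grouped: the alternatives apply only to the middle of the word
grouped = [w for w in words if re.fullmatch(r"b(oo|ea|u)t", w)]

# Ungrouped: the pipe splits the whole pattern into "boo", "ea", or "ut",
# so none of the sample words match in full
ungrouped = [w for w in words if re.fullmatch(r"boo|ea|ut", w)]

print(grouped)
print(ungrouped)
```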

Introduction to regex

In regular expressions, letters and numbers are literal characters that match themselves – the regex apples will match the literal text “apples”.

What makes regular expressions more powerful than normal searches are metacharacters that have special functions. These special functions allow regexes to match patterns that are more flexible than searches for literal words.

Metacharacters are denoted using punctuation symbols, which means that regexes can look confusing and hard to read at first. However, regex patterns get easier to decode once you understand the meanings of the different metacharacters.

Metacharacters can be grouped into a few categories based on their functions:

A wildcard can match any character
- . is a wildcard that will match any single character
A quantifier controls the number of times a pattern can match
- The most useful quantifiers are ? (match 0 or 1 times), + (match 1 or more times), and * (match 0 or more times)
A character set lets you specify a set of possible characters for a match
- Character sets are created with []
A group treats multiple characters as a unit, which lets you control the scope and order of operations in a regex
- Groups are created with ()
An alternator lets you match either the pattern before or the pattern after the alternator
- The vertical pipe | is the alternator symbol in regex

The most useful metacharacters for searching APLS are provided in the Regex cheat sheet.

The table below gives some examples of how characters and metacharacters can be combined to form regex search patterns for the orthography layer:

Pattern	Explanation	Matches	Does not match
`do.`	Any three-letter word that begins with `do`	dot dog	dig do dogs
`pittsburgh(ese)?`	The word `pittsburgh` must be matched but it can optionally end with `ese`	pittsburgh pittsburghese	pittsburgher pitt pittsburgh’s
`to+`	The letter `t` followed by one or more `o`	to too	tot two do
`pittsburgh.*`	Any word that begins with `pittsburgh`	pittsburgh pittsburghese pittsburgh’s pittsburgher	pitt philadelphia
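
The quantifiers in the table can be tried out in Python as a rough check. This sketch assumes `re.fullmatch` approximates APLS's whole-word matching; the word lists are illustrative samples.

```python
import re

def whole_word_matches(pattern, words):
    # Keep only words that the pattern matches from start to end
    return [w for w in words if re.fullmatch(pattern, w)]

# "+" repeats the preceding character one or more times
t_then_os = whole_word_matches(r"to+", ["to", "too", "tot", "two", "do"])

# "?" makes the preceding group optional
pitt = whole_word_matches(
    r"pittsburgh(ese)?",
    ["pittsburgh", "pittsburghese", "pittsburgher", "pitt", "pittsburgh's"],
)

print(t_then_os)
print(pitt)
```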

To search for every instance of “town”, “downtown”, and “hometown” in APLS:

Go to the Search page.

Enter (down|home)?town into the orthography input field.

Click the Search button.
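
The pattern from the steps above can also be checked in Python before running the search. This is a sketch only: it assumes `re.fullmatch` behaves like APLS's whole-word matching, and the candidate words are invented for the example.

```python
import re

candidates = ["town", "downtown", "hometown", "towns", "uptown"]

# The optional group "(down|home)?" allows "down", "home", or nothing before "town"
town_words = [w for w in candidates if re.fullmatch(r"(down|home)?town", w)]

print(town_words)
```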

For more examples of building regex search patterns, go to Common regex idioms.

Practicing regex in APLS

Now that you’re familiar with the basics of regex, it’s time for you to try searching APLS with regex patterns on your own! You can find some practice search questions below, with solutions included as “Try it!”s.

How would you find all words that (1) start with vowels and (2) are at least 2 letters long?

Go to the Search page.

Enter [aeiou].+ into the orthography input field.

Click the Search button.

How would you find all words that end with either -ing or -ize?

Go to the Search page.

Enter .+i(ng|ze) into the orthography input field.

Click the Search button.

How would you find all words that have three consecutive orthographic vowels?

Go to the Search page.

Enter .*[aeiou][aeiou][aeiou].* into the orthography input field.

Click the Search button.

How would you find all words that either start with q or end with x, excluding instances of q and x on their own?

Go to the Search page.

Enter q.+|.+x into the orthography input field.

Click the Search button.
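
One way to sanity-check the four practice solutions is to run them over a small word list in Python. The sketch assumes `re.fullmatch` is a reasonable analogue of APLS's whole-word matching; the `sample` list and the `solve` helper are hypothetical.

```python
import re

sample = ["apple", "a", "quiz", "q", "box", "x", "going",
          "realize", "ring", "beautiful", "queue", "tree"]

def solve(pattern):
    # Return the sample words matched in full by the pattern
    return [w for w in sample if re.fullmatch(pattern, w)]

vowel_initial = solve(r"[aeiou].+")                  # starts with a vowel, 2+ letters
ing_or_ize = solve(r".+i(ng|ze)")                    # ends with -ing or -ize
three_vowels = solve(r".*[aeiou][aeiou][aeiou].*")   # three vowels in a row
q_or_x = solve(r"q.+|.+x")                           # starts with q or ends with x

print(vowel_initial, ing_or_ize, three_vowels, q_or_x)
```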

Regex in APLS vs. regex elsewhere

APLS’s search regex syntax is very similar to standard “flavors” of regex in high-level programming languages (e.g., Python, R, JavaScript), with some important differences.

Search: Regex match whole annotations, not parts

Unlike regex elsewhere, APLS’s search matches patterns to whole annotations, whether those are words, part-of-speech tags, segments, etc.

If you’ve used regex in other contexts, you are probably used to patterns matching any part of a string. For example, in Python (with `re.search`), the regex `the` matches the strings `the`, `these`, and `other`. However, `the` will only ever match `the` in APLS. To search for patterns that occur within annotations, use `.*` at the beginning and/or end of the regex. For example, searching APLS with `.*the.*` will find matches for `the`, `these`, and `other`. (This also means that the anchors `^` and `$` aren’t necessary because patterns always match entire annotations.) The table below shows the APLS equivalent of standard regex anchor expressions, along with example matches.

Standard regex	How it matches elsewhere	Translated for APLS	How it matches in APLS
`the`	the these other	`.*the.*`	the these other
`^the`	the these theater	`the.*`	the these theater
`the$`	the soothe bathe	`.*the`	the soothe bathe
`^the$`	the	`the`	the
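
The translations in the table can be mimicked in Python, where `re.search` matches any part of a string and `re.fullmatch` matches the whole string. Treating `re.fullmatch` as equivalent to an APLS search is an assumption made for illustration; the word list is sample data.

```python
import re

words = ["the", "these", "other", "soothe", "theater"]

# Partial matching, as in many programming contexts
partial = [w for w in words if re.search(r"the", w)]
partial_start = [w for w in words if re.search(r"^the", w)]
partial_end = [w for w in words if re.search(r"the$", w)]

# Whole-string matching, approximating an APLS search
whole = [w for w in words if re.fullmatch(r"the", w)]
whole_contains = [w for w in words if re.fullmatch(r".*the.*", w)]
whole_start = [w for w in words if re.fullmatch(r"the.*", w)]
whole_end = [w for w in words if re.fullmatch(r".*the", w)]

print(partial == whole_contains, partial_start == whole_start, partial_end == whole_end)
```

Each anchored partial match gives the same words as its whole-string translation, while the bare pattern `the` matches only the word `the` under whole-string matching.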

In the context of regular expressions beyond APLS, “anchors” are the metacharacters ^ and $. In APLS searches, “anchoring” refers to anchoring a search match to a larger annotation, such as finding words at the start of a turn. This is covered on the Anchoring searches page.

Participants and Transcripts pages: Regex match parts

When browsing participants or transcripts, you can specify filters to narrow down the list of matches. Some of these filters accept regex, like the participant or transcript name. Unlike search regex, these regex do match parts of the label (no need for `.*` at the beginning and/or end of the regex). For example, on the Participants page, entering `1` in the Participant box will match participants with `1` anywhere in their speaker code. This also means that `^` and `$`, which anchor patterns to the start and end (respectively) of the match, do work in these filters. For example, on the Participants page, entering `1$` in the Participant box will match participants whose speaker codes end with `1`.
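
The filter behavior described above resembles partial matching with Python's `re.search`. The sketch below rests on that assumption, and the speaker codes are hypothetical examples rather than a real participant list.

```python
import re

# Hypothetical speaker codes, for illustration only
speaker_codes = ["CB01", "CB11", "HD02", "LV10"]

# "1" matches anywhere in the code
contains_1 = [code for code in speaker_codes if re.search(r"1", code)]

# "1$" anchors the match to the end of the code
ends_with_1 = [code for code in speaker_codes if re.search(r"1$", code)]

print(contains_1)
print(ends_with_1)
```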

Regex use POSIX Extended Regular Expressions

APLS implements regex in MySQL (i.e., with the MySQL REGEXP operator), which uses POSIX Extended Regular Expressions. This means that, compared to PCRE regular expressions that are currently standard in high-level programming languages, features like backreferences and lookarounds are not supported.
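
When a pattern would normally use a lookaround, it can sometimes be rewritten using only the features in the cheat sheet. The Python sketch below shows one such rewrite for "words that start with `do` where the next character is not `t`". Python's `re` module supports lookarounds, so it is used here only to confirm that the two patterns agree on a sample word list; whether a given rewrite behaves identically in APLS is an assumption you should verify with a search.

```python
import re

words = ["do", "dot", "dog", "dotted", "done"]

# Uses a negative lookahead, which the text above says APLS does not support
with_lookahead = [w for w in words if re.fullmatch(r"do(?!t).*", w)]

# Rewrite using only a character set, a group, and "?"
without_lookahead = [w for w in words if re.fullmatch(r"do([^t].*)?", w)]

print(with_lookahead)
print(without_lookahead)
```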

On the Search page, entering an invalid regex pattern will make the text turn red. Hovering over the red text will make a tooltip appear that gives a brief explanation of why the regex is invalid. This error-checking is powered not by MySQL’s regex engine but by JavaScript’s `new RegExp()` with the "u" (Unicode) flag. The error messages are mostly self-explanatory, but you can look them up here if needed.

Regex tips and tricks

Here we’ve collected some tips and tricks for using regex with APLS that are broadly helpful for users at any level of regex proficiency!

Checking your search matches

When you perform a search and the Search results page contains a handful of matches, it can be easy to manually check how many unique items your search matched. APLS contains over 400,000 word tokens, so often a search will result in a lot more than a handful of matches! Even if you export your search results as a .csv, it can be difficult to identify all the unique item matches if one or two items are much more frequent than the rest.

The Dictionary export option on the Search results page lets you download a .txt file of every unique individual item that your search matched. This can be helpful for quickly checking that the matches you got line up with what you were expecting!

Go to the Search page.

Enter don.* into the orthography input field.

Click the Search button.

On the Search results page generated by your search, click the Dictionary button.

Save and view the file to see what words in APLS match the don.* orthography search pattern.
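
Conceptually, checking unique matches amounts to reducing a long list of tokens to its distinct items. The Python sketch below shows that idea on a made-up token list; it does not reproduce the actual format of the Dictionary export, and it assumes `re.fullmatch` as a stand-in for APLS's whole-word matching.

```python
import re
from collections import Counter

# Hypothetical tokens, with one item far more frequent than the rest
tokens = ["don't", "done", "don't", "donut", "do", "don", "don't"]

# Keep tokens that match "don.*" in full, then count each distinct item
matching = [t for t in tokens if re.fullmatch(r"don.*", t)]
counts = Counter(matching)

# The distinct items, similar in spirit to a list of unique matches
unique_items = sorted(counts)

print(unique_items)
print(counts.most_common(1))
```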

Other resources for learning about regex

This page is only meant to be an introduction to regular expressions (and the ways that regex is different in APLS). There are lots of resources out there for learning and practicing regex if you would like more than this page can provide!

A few places that we would recommend are:

regexone.com or regexlearn.com for learning regex by practicing with simple, interactive exercises.
regular-expressions.info tutorial for more details about basic regex concepts.
- The regular-expressions website also has resources for learning about more advanced regex concepts!
regex crosswords for practicing regex skills with crossword-like puzzles.
regex101.com for testing out different regex patterns.