Searching with regular expressions
Layers that support regular expressions are the ones whose pattern input fields are Regular expression text fields.
A regular expression (or "regex") is a compact way of finding a pattern in a string of text. In APLS, for example, searching the orthography layer for do.* will return all words that start with the characters do (including the word do itself).
Regular expressions are a feature of many programming languages, so you may have used them before. Even seasoned regex users can use a refresher once in a while. For new users, regex can be a little intimidating because patterns are so compact and can use a lot of special symbols.
If you are...
- hoping to learn regex, read the introduction to regex
- hoping to use some simple patterns (but you don't need to learn regex), read about common regex "idioms"
- just looking for a refresher on what the symbols mean, read the cheat sheet
- experienced with regex in other contexts, read about how regex is different in APLS
Regex cheat sheet
Character | Meaning |
---|---|
Letters & numbers (A-Z, a-z, 0-9) | Themselves |
Most special characters used in layer notation systems | Themselves |
. | Any single character |
? | The character before ? is optional; i.e. it can occur 0 or 1 times |
+ | The character before + can repeat; i.e. it can occur 1 or more times |
* | The character before * can be optional or repeat; i.e. it can occur any number of times, including 0 |
() | Characters inside the parentheses are treated as a unit |
[abc] | Match a or b or c |
[^abc] | Match anything other than a, b, and c |
ab|cd | Match ab or cd |
\ | The character after \ is not treated as a metacharacter; i.e., the character is treated literally |
Because of the notation systems used by certain layers, the characters in APLS that you may need to use \ to match literally are:
Layer(s) | Character | Meaning |
---|---|---|
morphemes | + | Morpheme boundary |
word | . | Short pause |
word | - | Long pause |
word | ? | Question mark |
foll_segment | . | Following pause |
pronounce | - | Syllable boundary |
segment, foll_segment, syllables, phonemes, dictionary_phonemes, pronounce | { | DISC symbol for the IPA /æ/ vowel |
segment, foll_segment, syllables, phonemes, dictionary_phonemes, pronounce | $ | DISC symbol for the IPA /ɔ/ vowel |
part_of_speech | $ | Used in the part-of-speech tags $ (currency symbols), PRP$ (possessive pronoun), WP$ (possessive wh- pronoun) |
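As an illustration of escaping, here is a minimal sketch in Python. It is only an approximation: Python's re module is not the engine APLS uses, and re.fullmatch stands in for APLS's whole-annotation matching. The sample annotations, including the morpheme strings, are invented for the example.

```python
import re

# Hypothetical part_of_speech annotations; PRP$ and WP$ contain a literal "$".
tags = ["PRP", "PRP$", "WP$", "NN"]

# "\$" matches a literal dollar sign instead of acting as a metacharacter.
possessive = [t for t in tags if re.fullmatch(r"(PRP|WP)\$", t)]
print(possessive)

# Hypothetical morphemes-layer annotations, where "+" marks a morpheme boundary.
morphemes = ["walk+ed", "walked", "dog+s"]

# "\+" matches a literal plus sign; the unescaped "+" after each "." is a quantifier.
with_boundary = [m for m in morphemes if re.fullmatch(r".+\+.+", m)]
print(with_boundary)
```

Without the backslash, the $ and + in these patterns would be read as metacharacters rather than as the symbols used in the layer's notation.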
Entering an invalid regex will make the text turn red. If you hover over the red text, a tooltip will pop up with a brief explanation of why the regex is invalid. In addition, if you click the Search button with an invalid regex, the search will return an error message.
Common regex idioms
These are the patterns you're most likely to need when searching APLS:
- Dot-asterisk (.*): Match anything or nothing
- Dot-plus (.+): Match one or more of any character (doesn't have to be the same character)
- Square brackets ([]): Match one of the characters in the brackets
- Square brackets with a caret ([^]): Match anything other than the characters in the brackets
- Vertical pipe (|): Match one of the strings on either side of the pipe
Dot-asterisk (.*)
Dot-asterisk .* can be used at the beginning, end, and/or middle of a pattern to denote "anything (or nothing at all)". APLS searches look for whole word matches, so using .* is useful at the edge (beginning or end) of patterns when you are looking for strings that can appear in multiple words.
Pattern | Explanation | Matches | Does not match |
---|---|---|---|
do.* | Any word that starts with do | do, dot, dotted | dig, ado, boot |
.*king | Any word that ends with king | king, seeking, asking | kings, kin, doing |
.*tend.* | Any word that contains tend | tend, tender, extend, Nintendo | ten, tinder |
a.*n | Any word that starts with a and ends with n | an, again, attention | in, apple, and |
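The whole-word behavior described above can be approximated outside APLS with Python's re.fullmatch. This is a sketch under that assumption, not a description of how APLS runs searches; the helper name whole_word_matches is made up for the example.

```python
import re

def whole_word_matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [w for w in candidates if re.fullmatch(pattern, w)]

# .* at the end: anything (or nothing) may follow "do".
starts_with_do = whole_word_matches(r"do.*", ["do", "dot", "dotted", "dig", "ado", "boot"])
print(starts_with_do)

# .* at the beginning: anything (or nothing) may precede "king".
ends_with_king = whole_word_matches(r".*king", ["king", "seeking", "asking", "kings", "kin", "doing"])
print(ends_with_king)
```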
Dot-plus (.+)
Dot-plus .+ can be used at the beginning, end, and/or middle of a pattern to denote "at least one of any character". The important distinction between .+ and .* is that .+ requires there to be at least one character in that position, as shown with the examples below.
Pattern | Explanation | Matches | Does not match |
---|---|---|---|
do.+ | Any word that starts with do except do | dot, dotted | do, ado |
.+king | Any word that ends with king except king | seeking, asking | king, kings, kin |
.+tion.+ | Any word that contains tion and has characters on both sides of tion | questions, recreational | action, question |
a.+n | Any word that starts with a and ends with n except an | again, attention | an, in, and |
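To see the .* versus .+ distinction side by side, the sketch below runs both versions of a pattern over the same words, using Python's re.fullmatch as a stand-in for APLS's whole-word matching (an assumption; APLS does not use Python's engine).

```python
import re

words = ["an", "again", "attention", "in", "and"]

# a.*n allows nothing at all between "a" and "n", so "an" matches.
star = [w for w in words if re.fullmatch(r"a.*n", w)]

# a.+n needs at least one character between "a" and "n", so "an" drops out.
plus = [w for w in words if re.fullmatch(r"a.+n", w)]

print(star)
print(plus)
```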
Square brackets ([])
Square brackets [] can be used to match any one of the characters within the brackets. The characters within the brackets are a character set.
Pattern | Explanation | Matches | Does not match |
---|---|---|---|
do[tg] | Any 3-letter word that starts with do and ends with t or g | dot, dog | do, doe, dotted |
[dw]on't | Any 5-character word that starts with d or w and ends with on't | don't, won't | donut, want |
.*[oau][td] | Any word whose last two characters are o or a or u followed by t or d | cut, glad, float | cost, cots |
Importantly, square brackets always match a single character:
Pattern | Explanation | Matches | Does not match |
---|---|---|---|
r[oau]t | Any 3-letter word that starts with r, then o or a or u, and ends with t | rot, rat, rut | rout, root |
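The single-character rule can be checked with a short Python sketch (re.fullmatch approximates APLS's whole-word matching; this is an assumption, not APLS's own engine). The second pattern, which adds the + quantifier from the cheat sheet, is included only for contrast.

```python
import re

words = ["rot", "rat", "rut", "rout", "root"]

# The character set stands for exactly one character.
one_vowel = [w for w in words if re.fullmatch(r"r[oau]t", w)]

# Adding + lets the set repeat, so two-vowel words now match too.
one_or_more = [w for w in words if re.fullmatch(r"r[oau]+t", w)]

print(one_vowel)
print(one_or_more)
```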
Square brackets with a caret ([^])
Square brackets can be inverted with ^ to match anything other than the characters in the brackets. Inverted square brackets will still only match a single character, similar to normal square brackets.
Pattern | Explanation | Matches | Does not match |
---|---|---|---|
do[^t] | Any 3-letter word that starts with do and does not end with t | doe, don | do, dot, dotted |
b[^eu]t | Any 3-letter word that starts with b, then any character other than e or u, and ends with t | bot, bat, bit | bet, but, boat, bots |
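A point that is easy to miss is that an inverted set still has to consume one character. The sketch below illustrates this with Python's re.fullmatch, used here as an approximation of APLS's whole-word matching rather than as APLS's actual engine.

```python
import re

words = ["bot", "bat", "bit", "bet", "but", "boat", "bots"]

# Exactly one character between "b" and "t", and it may not be e or u.
not_e_or_u = [w for w in words if re.fullmatch(r"b[^eu]t", w)]
print(not_e_or_u)

# [^t] means "one character that is not t", not "no t here":
# bare "do" has no third character for the set to match.
do_alone = bool(re.fullmatch(r"do[^t]", "do"))
doe = bool(re.fullmatch(r"do[^t]", "doe"))
print(do_alone, doe)
```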
Vertical pipe (|)
Vertical pipes | can be used to specify multiple possible patterns at once. These can be useful when you'd like to find multiple words with the same search.
Pattern | Explanation | Matches | Does not match |
---|---|---|---|
steeler|penguin | The word steeler or the word penguin | steeler, penguin | steel, penguins, pirate |
steel.*|pen.* | Any word that starts with steel or pen | steelers, pencil | steer, bullpen, pirate |
b(oo|ea|u)t | Any word that starts with b, then oo or ea or u, and ends with t | boot, beat, but | boat, bet, bit |
(wh|th|sh).* | Any word that starts with wh or th or sh | what, though, short | other, slush |
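Parentheses matter when a pipe sits inside a longer pattern, because without them the pipe splits the whole pattern. The Python sketch below shows the difference; re.fullmatch is an assumed stand-in for APLS's whole-word matching, and the ungrouped pattern is included only as a counterexample.

```python
import re

words = ["boot", "beat", "but", "boat", "bet", "bit", "boo", "ea", "ut"]

# Grouped: b, then one of oo/ea/u, then t.
grouped = [w for w in words if re.fullmatch(r"b(oo|ea|u)t", w)]

# Ungrouped: the pipe splits the entire pattern into "boo", "ea", or "ut".
ungrouped = [w for w in words if re.fullmatch(r"boo|ea|ut", w)]

print(grouped)
print(ungrouped)
```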
Introduction to regex
In regular expressions, letters and numbers are literal characters that match themselves: the regex apples will match the literal text "apples".
What makes regular expressions more powerful than normal searches are metacharacters that have special functions. These special functions allow regexes to match patterns that are more flexible than searches for literal words.
Metacharacters are denoted using punctuation symbols, which means that regexes can look confusing and hard to read at first. However, regex patterns get easier to decode once you understand the meanings of the different metacharacters.
Metacharacters can be grouped into a few categories based on their functions:
- A wildcard can match any character
  - . is a wildcard that will match any single character
- A quantifier controls the number of times a pattern can match
  - The most useful quantifiers are ? (match 0 or 1 times), + (match 1 or more times), and * (match 0 or more times)
- A character set lets you specify a set of possible characters for a match
  - Character sets are created with []
- A group treats multiple characters as a unit, which lets you control the scope and order of operations in a regex
  - Groups are created with ()
- An alternator lets you match either the pattern before or the pattern after the alternator
  - The vertical pipe | is the alternator symbol in regex
The most useful metacharacters for searching APLS are provided in the Regex cheat sheet.
The table below gives some examples of how characters and metacharacters can be combined to form regex search patterns for the orthography layer:
Pattern | Explanation | Matches | Does not match |
---|---|---|---|
do. | Any three-letter word that begins with do | dot, dog | dig, do, dogs |
pittsburgh(ese)? | The word pittsburgh must be matched but it can optionally end with ese | pittsburgh, pittsburghese | pittsburgher, pitt, pittsburgh's |
to+ | The letter t followed by one or more o | to, too | tot, two, do |
pittsburgh.* | Any word that begins with pittsburgh | pittsburgh, pittsburghese, pittsburgh's, pittsburgher | pitt, philadelphia |
To search for every instance of "town", "downtown", and "hometown" in APLS:
- Go to the Search page.
- Enter (down|home)?town into the orthography input field.
- Click the Search button.
For more examples of building regex search patterns, go to Common regex idioms.
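If you want to sanity-check a pattern like (down|home)?town before searching, a few lines of Python can help. This is only a sketch: re.fullmatch approximates APLS's whole-word matching, Python's regex engine is not the one APLS uses, and the word list is invented.

```python
import re

pattern = r"(down|home)?town"
words = ["town", "downtown", "hometown", "towns", "uptown"]

# The group (down|home) is optional because of the ? that follows it.
town_matches = [w for w in words if re.fullmatch(pattern, w)]
print(town_matches)
```

Here towns fails because nothing in the pattern allows a trailing s, and uptown fails because up is not one of the grouped alternatives.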
Practicing regex in APLS
Now that you're familiar with the basics of regex, it's time for you to try searching APLS with regex patterns on your own! You can find some practice search questions below, with solutions included as "Try it!"s.
- How would you find all words that (1) start with vowels and (2) are at least 2 letters long?
  - Go to the Search page.
  - Enter [aeiou].+ into the orthography input field.
  - Click the Search button.
- How would you find all words that end with either -ing or -ize?
  - Go to the Search page.
  - Enter .+i(ng|ze) into the orthography input field.
  - Click the Search button.
- How would you find all words that have three consecutive orthographic vowels?
  - Go to the Search page.
  - Enter .*[aeiou][aeiou][aeiou].* into the orthography input field.
  - Click the Search button.
- How would you find all words that either start with q or end with x, excluding instances of q and x on their own?
  - Go to the Search page.
  - Enter q.+|.+x into the orthography input field.
  - Click the Search button.
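The practice solutions above can be tried on a small word list with the following Python sketch. The word list is invented, and re.fullmatch is an assumed approximation of APLS's whole-word matching, so treat the results as a rough check rather than a preview of APLS output.

```python
import re

words = ["a", "an", "eat", "ring", "realize", "ing",
         "beautiful", "queue", "q", "x", "box", "quiz"]

solutions = {
    "vowel-initial, 2+ letters": r"[aeiou].+",
    "ends with -ing or -ize": r".+i(ng|ze)",
    "three consecutive vowels": r".*[aeiou][aeiou][aeiou].*",
    "starts with q or ends with x": r"q.+|.+x",
}

# Map each question to the words its solution pattern matches in full.
results = {
    question: [w for w in words if re.fullmatch(pattern, w)]
    for question, pattern in solutions.items()
}

for question, matched in results.items():
    print(question, "->", matched)
```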
Regex in APLS vs. regex elsewhere
APLS's search regex syntax is very similar to standard "flavors" of regex in high-level programming languages (e.g., Python, R, JavaScript), with some important differences.
Search: Regex match whole annotations, not parts
Unlike regex elsewhere, APLS's search matches patterns to whole annotations, whether those are words, part-of-speech tags, segments, etc.
If you've used regex in other contexts, you are probably used to patterns matching any part of a string. For example, in Python, the regex the matches the strings the, these, and other. However, the will only ever match the in APLS. To search for patterns that occur within annotations, use .* at the beginning and/or end of the regex. For example, searching APLS with .*the.* will find matches for the, these, and other. (This also means that the anchors ^ and $ aren't necessary because patterns always match entire annotations.) The table below shows the APLS equivalent of standard regex anchor expressions, along with example matches.
Standard regex | How it matches elsewhere | Translated for APLS | How it matches in APLS |
---|---|---|---|
the | the these other | .*the.* | the these other |
^the | the these theater | the.* | the these theater |
the$ | the soothe bathe | .*the | the soothe bathe |
^the$ | the | the | the |
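The translations in the table can be checked in Python by comparing partial matching (re.search) on the standard pattern with whole-string matching (re.fullmatch) on the translated one. Using re.fullmatch to stand in for APLS is an assumption made for this sketch; APLS does not run Python's regex engine.

```python
import re

words = ["the", "these", "other", "theater", "soothe", "bathe"]

# (standard regex, translation for whole-annotation matching)
pairs = [
    (r"the", r".*the.*"),
    (r"^the", r"the.*"),
    (r"the$", r".*the"),
    (r"^the$", r"the"),
]

comparison = {}
for standard, translated in pairs:
    partial = [w for w in words if re.search(standard, w)]
    whole = [w for w in words if re.fullmatch(translated, w)]
    comparison[standard] = (partial, whole)
    print(standard, "->", partial, "|", translated, "->", whole)
```

For each row, the two lists come out identical, which is the sense in which the translated pattern is equivalent.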
In the context of regular expressions beyond APLS, "anchors" are the metacharacters ^ and $. In APLS searches, "anchoring" refers to anchoring a search match to a larger annotation, such as finding words at the start of a turn. This is covered on the Anchoring searches page.
Participants and Transcripts pages: Regex match parts
When browsing participants or transcripts, you can specify filters to narrow down the list of matches. Some of these filters accept regex, like the participant or transcript name. Unlike search regex, these regex do match parts of the label (no need for .* at the beginning and/or end of the regex). For example, on the Participants page, entering 1 in the Participant box will match participants with 1 anywhere in their speaker code. This also means that ^ and $, which anchor patterns to the start and end (respectively) of the match, do work in these filters. For example, on the Participants page, entering 1$ in the Participant box will match participants whose speaker codes end with 1.
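The partial-matching behavior of these filters resembles Python's re.search. The sketch below uses that as an assumed analogy; the speaker codes are made up for illustration and are not taken from APLS.

```python
import re

# Hypothetical speaker codes, invented for this example.
codes = ["CB01", "CB10", "HD21", "LV03"]

# Partial matching: "1" anywhere in the code is enough.
contains_1 = [c for c in codes if re.search(r"1", c)]

# With the $ anchor, the 1 must be the last character.
ends_with_1 = [c for c in codes if re.search(r"1$", c)]

print(contains_1)
print(ends_with_1)
```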
Regex use POSIX Extended Regular Expressions
APLS implements regex in MySQL (i.e., with the MySQL REGEXP operator), which uses POSIX Extended Regular Expressions. This means that, compared to PCRE regular expressions that are currently standard in high-level programming languages, features like backreferences and lookarounds are not supported.
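For readers unsure what a backreference is, the Python sketch below shows one (Python supports them) next to a rewrite that relies only on grouping and alternation. The rewrite is a suggested workaround for a small, fixed set of cases, not a documented APLS feature, and the word list is invented.

```python
import re

words = ["boot", "seen", "bat", "aardvark"]

# Backreference: \1 repeats whatever the first group matched (a doubled vowel).
# Per the text above, this feature is not available in APLS searches.
with_backref = [w for w in words if re.fullmatch(r".*([aeiou])\1.*", w)]

# The same idea spelled out with alternation only.
with_alternation = [w for w in words if re.fullmatch(r".*(aa|ee|ii|oo|uu).*", w)]

print(with_backref)
print(with_alternation)
```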
On the Search page, entering an invalid regex pattern will make the text turn red. Hovering over the red text will make a tooltip appear that gives a brief explanation of why the regex is invalid. This error-checking is powered not by MySQL's regex engine but by JavaScript's new RegExp() with the "u" (Unicode) flag. The error messages are mostly self-explanatory, but you can look them up here if needed.
Regex tips and tricks
Here we've collected some tips and tricks for using regex with APLS that are broadly helpful for users at any level of regex proficiency!
Checking your search matches
When you perform a search and the Search results page contains a handful of matches, it can be easy to manually check how many unique items your search matched. APLS contains over 400,000 word tokens, so often a search will result in a lot more than a handful of matches! Even if you export your search results as a .csv, it can be difficult to identify all the unique item matches if one or two items are much more frequent than the rest.
The Dictionary export option on the Search results page lets you download a .txt file of every unique individual item that your search matched. This can be helpful for quickly checking that the matches you got line up with what you were expecting!
- Go to the Search page.
- Enter don.* into the orthography input field.
- Click the Search button.
- On the Search results page generated by your search, click the Dictionary button.
- Save and view the file to see what words in APLS match the don.* orthography search pattern.
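To illustrate why a list of unique items is easier to check than raw results, here is a small Python sketch. It does not read an actual APLS export (the layout of the exported files is not assumed here); the token list is invented to mimic a search where one item dominates.

```python
from collections import Counter

# Invented tokens standing in for the matches of a don.* search.
tokens = ["don't", "don't", "done", "don't", "donut", "don't", "done"]

# Raw results are dominated by one frequent item...
most_common_item, count = Counter(tokens).most_common(1)[0]
print(most_common_item, count)

# ...while the set of unique items is short and easy to scan.
unique_items = sorted(set(tokens))
print(unique_items)
```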
Other resources for learning about regex
This page is only meant to be an introduction to regular expressions (and the ways that regex is different in APLS). There are lots of resources out there for learning and practicing regex if you would like more than this page can provide!
A few places that we would recommend are:
- regexone.com or regexlearn.com for learning regex by practicing with simple, interactive exercises.
- regular-expressions.info tutorial for more details about basic regex concepts.
- The regular-expressions website also has resources for learning about more advanced regex concepts!
- regex crosswords for practicing regex skills with crossword-like puzzles.
- regex101.com for testing out different regex patterns.