Custom dictionary

On the APLS GitHub repository, the folder /files/custom-dictionary/ contains files for APLS’s custom dictionary.

Intro

APLS uses pronouncing dictionaries to match individual words in a transcript to their phonemic representations. Its default dictionary is the Unisyn lexicon for American English, which covers the vast majority of the words in any given transcript. The custom dictionary supplements Unisyn with words in the following categories:

Names from Pittsburgh/western PA physical geography, human geography, and/or culture (e.g., neighborhoods like Shadyside, municipalities like Sewickley, streets like Baum, schools like Milliones, notable local figures like Stargell)
Brand/business names, whether local (e.g., Pelusi) or not (e.g., Highmark, Panera)
Pittsburgh lexical features, whether stereotypical (e.g., redd, gumband) or not (e.g., Trib, WAMO)
Non-Pittsburgh-specific words that are absent from Unisyn, sometimes unexpectedly (artsy, bachelorette, homie, Kwanzaa, microbrew, stepdad, tarp, y’all)
Restricted mini-lexicons specified in the APLS transcription convention: colloquial spellings (e.g., gonna), interjections (e.g., yup), and single-phoneme hesitation codes (e.g., f~ for [f])

Dictionary file format

The file APLS-dict.csv must conform to the following format:

Lines must consist of one of the following:
- An entry of the form <word-form>,<phonemes>
- A comment of the form // <comment>
- A blank line
Word-forms must match the regular expression /^[A-Za-z'-]+~?$/
- That is, they may consist only of letters of the English alphabet and/or the characters ' (apostrophe) or - (hyphen), and they may end with an optional ~ (tilde)
  - Only literal apostrophes or hyphens are acceptable, not lookalikes like ‘ (Unicode U+2018 “curly opening single quote”) or – (Unicode U+2013 “en dash”)
  - Tildes (hesitation markers) may only come at the end of the word-form
Phonemic representations may only use the APLS subset of the DISC phonemic alphabet, including syllabification/stress markers: pbtdkgNmnlrfvTDszSZjhwJ_FHPiIE{Q$VUu@78#3912645'"-
After the header (first 7 rows), comments are interpreted as separating dictionary sections
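
The rules above can be sketched as a small checker. This is a minimal illustration, not the validation that updateDict.sh actually performs. The function names check_line and check_file are hypothetical, and the sketch assumes that any line starting with // counts as a comment and that phonemes are everything after the first comma.

```python
import re

# Word-form pattern quoted from the format rules above.
WORD_FORM = re.compile(r"^[A-Za-z'-]+~?$")
# APLS subset of DISC, including syllabification/stress markers.
DISC_CHARS = set("pbtdkgNmnlrfvTDszSZjhwJ_FHPiIE{Q$VUu@78#3912645'\"-")


def check_line(line):
    """Return a list of problems for one line; an empty list means it conforms."""
    text = line.rstrip("\r\n")
    if text == "" or text.startswith("//"):
        return []  # blank line or comment (assumption: any // prefix is a comment)
    word, comma, phonemes = text.partition(",")
    if not comma:
        return ["not an entry, a comment, or a blank line"]
    problems = []
    if not WORD_FORM.match(word):
        problems.append(f"invalid word-form: {word!r}")
    if not phonemes:
        problems.append("missing phonemes")
    bad = sorted(set(phonemes) - DISC_CHARS)
    if bad:
        problems.append(f"non-DISC characters: {''.join(bad)!r}")
    return problems


def check_file(lines):
    """Return (line_number, problem) pairs for every non-conforming line."""
    return [(n, p) for n, line in enumerate(lines, start=1) for p in check_line(line)]
```

For example, check_line applied to the entry Wilkinsburg,'wIl-kInz-"b3rg returns an empty list, while a word-form containing a curly apostrophe or a tilde in the middle is reported as invalid.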

(For corpus admins) How to add dictionary entries

This section is meant for users with admin access to APLS. However, other users may find this section useful to understand the processes that go into making APLS.

Decide whether you even need to add a new entry
Add the word, plus any inflectional forms like plurals or -ing, to APLS-dict.csv
Update the Elan file checker’s dictionary (updateDict.sh + commit/push changes)
Update APLS’s internal dictionary

To add or not to add?

Transcribers will suggest new dictionary entries as they work on transcriptions, typically because a word got flagged by the Elan file checker’s step 2. It’s up to the corpus maintainer whether to add these entries, or whether transcribers should specify the phonemic transcription inline (see transcription convention sec. 3.4).

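Add the word only if it meets all of the following criteria. Under each criterion, “Transcribers should” says what to do when a word fails that criterion.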

It’s likely to be used by more than one speaker
- Examples: Monongahela, Panera (counterexample: a word made up on the spot)
- Rationale: If a word gets added to the dictionary, it’ll save future transcribers time and ensure quality control
- Transcribers should: Use an inline pronounce code
- Note: This criterion is by far the most subjective! If the word satisfies all the other criteria, err on the side of adding it
It’s not a new colloquial spelling or interjection
- Examples: dunno er ahem
- Rationale: We want to avoid “coding while transcribing”. If transcribers have the choice between (e.g.) don’t know and dunno, they’ll have to spend time deciding which one the speaker uttered. That’s really a question for future sociolinguistic investigation, not something to be decided at the transcription stage
- Transcribers should: Depending on the situation, use a standard spelling, a hesitation code, or a noise code
- Note: We can consider waiving this criterion only if the evidence is really, really strong. See below for an example of this principle in action.
It doesn’t violate existing rules in the transcription convention
- Examples: 412 IBM colllect jumpin’ résumé Picksburgh
- Rationale: We have those rules for good reasons! Plus, allowing “eye dialect” forms like Picksburgh leads to “coding while transcribing”
- Transcribers should: Correct their spelling
It’s not already in the Unisyn dictionary
- Rationale: LaBB-CAT won’t update a dictionary entry if it’s in Unisyn (it will update custom dictionary entries if needed)
- Transcribers should: Do nothing
- Notes:
  - Don’t assume that pluralized forms are in the Unisyn dictionary!
  - If the Elan file checker flags a form, it’s not in either dictionary.
  - If you accidentally add a word that turns out to have been in Unisyn, you’ll get the message “word in the read-only part of the dictionary” when you try to upload it to APLS. If that happens, simply delete the word from the custom dictionary
It’s not a vulgar and/or objectionable word
- Rationale: Since the APLS dictionary gets published on the open web, we don’t want it to show up in search engine results associated with racial slurs, etc.
- Transcribers should: Use an inline pronounce code
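
As a rough illustration, the criteria above can be read as a checklist in which a word is added only when every answer is "yes", and each failed criterion points to what transcribers should do instead. The sketch below is hypothetical: the criterion keys and the triage function are made-up names, the yes/no answers still have to come from human judgment, and the possible waiver for colloquial spellings is not modeled.

```python
# What transcribers should do when a word FAILS each criterion
# (actions taken from the list above; the keys are hypothetical labels).
FAILURE_ACTIONS = {
    "likely_used_by_multiple_speakers": "Use an inline pronounce code",
    "not_new_colloquial_spelling_or_interjection":
        "Use a standard spelling, a hesitation code, or a noise code",
    "follows_transcription_convention": "Correct the spelling",
    "absent_from_unisyn": "Do nothing",
    "not_vulgar_or_objectionable": "Use an inline pronounce code",
}


def triage(answers):
    """Map yes/no answers (criterion -> bool) to a decision.

    Returns ("add", []) if every criterion is satisfied; otherwise
    ("do not add", [(criterion, action), ...]) for each failed criterion.
    """
    failed = [
        (name, action)
        for name, action in FAILURE_ACTIONS.items()
        if not answers[name]
    ]
    if failed:
        return ("do not add", failed)
    return ("add", [])
```

Listing every failed criterion, rather than stopping at the first, keeps the sketch neutral about which criterion takes precedence, since the criteria above do not specify an order.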

Non-criteria:

Words do not have to be Pittsburgh-specific to be added to the dictionary
Words do not have to conform to prescriptive notions of what is or isn’t a word—as long as they’re not colloquial or “eye dialect” spellings

Example: lemme

In May 2023, an RA asked:

I’ve had a speaker use lemme and was wondering if it’s a potential candidate to be added to the conventions alongside gonna and the like?

In other words, is it worth adding lemme as an alternative to let me?

lemme clearly satisfies most criteria: it’s not vulgar or objectionable, it’s not already in Unisyn (we know because it was flagged by the Elan file checker), it doesn’t violate any existing rules, and it seems likely to be used by more than one speaker. However, adding it would mean expanding a restricted mini-lexicon (colloquial spellings), so it deserves extra scrutiny.

If we added lemme to the dictionary, we would have to think about whether existing tokens of let me should remain let me or be re-transcribed as lemme. I performed a search in APLS for tokens of let me (searching the orthography layer for let followed by me); the results are in results_(let)_(me).csv. I sent the RAs the wav and TextGrid files for these tokens, along with this message:

There are currently 19 tokens of let me in APLS. Give a listen to some of them, and we can consider adding “lemme” only if (1) it’s obvious that there’s variation between lemme vs. let me and (2) it’s obvious which is which. In general, there’s lots of expressions that get phonologically reduced in spontaneous speech beyond their orthographic representation, so I want to be conservative with which ones get added to the list

After listening to the files, the RAs found that it wasn’t always obvious whether a token was lemme or let me. In other words, having to decide between lemme or let me would have required transcribers to “code while transcribing”. We decided not to add lemme to the dictionary.

Dialect pronunciations

In rare cases, a word that exists across dialects will have a unique pronunciation in just one speech community, one that is unrelated to the more general phonological features that characterize the dialect. For example, Carnegie is /ˈkɑɹ-nə-gi/ in “general American” but /kɑɹ-ˈneɪ-gi/ in Pittsburgh, and the difference doesn’t seem to be the result of a phonological process affecting other words in Pittsburgh English.

We have to handle these words in a special way. If a word is already in Unisyn, LaBB-CAT won’t update the word’s dictionary entry; you’ll get an error like “Line 1 word water ignored: word in the read-only part of the dictionary”. So rather than adding the word to the custom dictionary, we have to change Unisyn’s representation of the word; if you have access to the APLS-Admin repo, see here for directions: https://github.com/djvill/APLS-Admin/blob/main/doc/dictionary-phonemes/README.md.

Note: These “dialect pronunciations” are not the same as idiosyncratic pronunciations. The former are shared by a speech community (or a decent subset thereof); the latter are isolated to a single speaker and should be transcribed with an inline pronounce code. Of course, what might appear at first blush to be an idiosyncratic pronunciation may, upon further examination, turn out to be more common in the speech community; if that happens, we can always revisit the issue and add a new dialect pronunciation.

Add the word

Add the word to APLS-dict.csv. Edit the file in a text editor, not in Excel, as Excel will mess up the formatting by adding extra characters. Simply add a new line under the appropriate category heading: <word>,<phonemes>. The category headings are semi-arbitrary and don’t affect anything meaningful, so just pick whichever one seems right.
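
Editing by hand in a text editor is the documented workflow. Purely as an illustration of where the new line goes, here is a minimal sketch that appends an entry to the end of a section. It assumes that category headings are comment lines of the form // <heading> after the 7-row header; the function name add_entry and any heading names you pass to it are hypothetical.

```python
HEADER_ROWS = 7  # per the file format: comments after the first 7 rows separate sections


def add_entry(lines, heading, word, phonemes):
    """Return a copy of `lines` with `word,phonemes` added at the end of a section.

    `lines` is the file content without trailing newlines (e.g., from
    str.splitlines()). The section is assumed to start at the comment line
    "// <heading>" and to run until the next comment line or the end of file.
    """
    target = "// " + heading
    start = None
    for i in range(HEADER_ROWS, len(lines)):
        if lines[i].strip() == target:
            start = i
            break
    if start is None:
        raise ValueError(f"no section heading {target!r} after the header")
    # Find the end of the section: the next comment line, or end of file.
    end = start + 1
    while end < len(lines) and not lines[end].startswith("//"):
        end += 1
    # Step back over blank lines so the entry joins the existing entries.
    insert_at = end
    while insert_at > start + 1 and lines[insert_at - 1].strip() == "":
        insert_at -= 1
    return lines[:insert_at] + [f"{word},{phonemes}"] + lines[insert_at:]
```

Because the category headings don’t affect anything meaningful, placing the entry at the end of the section is just one reasonable choice; any position under the heading would do.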

Use the DISC phonemic alphabet, as described here. Please pay attention to the extra considerations for suggesting new dictionary entries: multiple phonemic representations per word, the speech community’s pronunciation(s), and syllabification/stress.

You’ll also need to add inflectional forms, such as plurals and verbal forms (-s, -ing, etc.). You do not need to add cliticized forms (e.g., Pittsburgh’s); APLS automatically adjusts pronunciations to account for clitics, and the Elan file checker is programmed to ignore X’s if X is in the dictionary. (The transcription convention says certain clitics can be added to any noun, but the implementations of clitics in APLS and the file checker aren’t actually restricted to nouns.) If you supplied multiple entries for the base form of the word, you should also add multiple entries for each inflected form.
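
To illustrate why cliticized forms need no entries but inflected forms do, here is a minimal sketch of the lookup behavior described above. It is not the Elan file checker’s actual code: the function name is_covered is hypothetical, only the ’s clitic is handled (written with a literal apostrophe), and matching is assumed to be case-sensitive.

```python
def is_covered(token, dictionary):
    """True if `token` is in the dictionary, or is X's where X is in the dictionary."""
    if token in dictionary:
        return True
    if token.endswith("'s"):
        # Strip the clitic and look up the base form, whatever its part of speech.
        return token[:-2] in dictionary
    return False
```

With a dictionary containing only Pittsburgh and redd, the token Pittsburgh's is covered, but redds and redding are not, so those inflected forms would each need their own line in APLS-dict.csv.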

Update the Elan file checker’s dictionary

Pull updates from the GitHub repository
Run updateDict.sh
- If necessary, fix any formatting errors to ensure APLS-dict.csv conforms to APLS specifications, and re-run updateDict.sh
Commit and push your changes to GitHub.

That’s it!

The Elan file checker reads aplsDict.txt from this repository; in turn, aplsDict.txt gets created from APLS-dict.csv when you run updateDict.sh. That means if you update APLS-dict.csv but don’t do this step, the Elan file checker will throw errors in step 2.

Requirements:

Software
- Git
- Bash (if not included as Git Bash in Git install)
- R
  - Packages readr, stringr, purrr, and dependencies
  - R must be in your PATH (and it probably is, if you’ve installed R). You can tell that R is in your PATH if running Rscript -e R.version.string at the command line prints your R version. If not, follow directions for Windows, macOS, or Unix
GitHub account
Push access to https://github.com/djvill/APLS
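
Before running updateDict.sh, you can check that the command-line tools are on your PATH. The sketch below is a hypothetical helper (missing_tools is not part of the APLS repository) built on Python’s standard shutil.which; it does not check the R packages, your GitHub account, or push access.

```python
import shutil

# Command names for the software requirements listed above.
REQUIRED_TOOLS = ("git", "bash", "Rscript")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH.

    `which` is injectable so the function can be exercised without
    depending on what is installed on the current machine.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("git, bash, and Rscript were all found on PATH")
```

If Rscript is reported missing, R is either not installed or not in your PATH.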

FYI, updateDict.sh also updates custom-entries.md. The sole purpose of this file is creating a nice user-facing custom dictionary page on the APLS documentation site.

Update APLS’s internal dictionary

Currently, you have to update the custom dictionary for two separate layers in APLS: dictionary_phonemes and phonemes_no_clitic. Go to the word layers page (note: you must have admin access, or else you’ll get “ERROR 403”). For each layer:

Click on the dictionary icon
Drag and drop APLS-dict.csv onto the “Choose File” button, and click “Import From CSV”
You’ll see a ton of lines indicating entries that weren’t added, such as
- Line 1 word // APLS custom dictionary not deleted: isn’t in the dictionary
- Line 9 word Wilkinsburg definition 'wIl-kInz-"b3rg ignored: definition already in the dictionary
These are normal. At the very bottom, you should see Entries added: 17 (or some other number), with the new entries listed below.
If you see any lines in red like “Line 1 word water ignored: word in the read-only part of the dictionary”, then either:
- Delete the word (if you didn’t mean to provide a dialect pronunciation), or
- Follow the steps for dialect pronunciations
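
Because the import output is long, it can help to pick out only the lines that need attention. The following is a minimal sketch that assumes you have copied the import messages from the page into a string; the function name summarize_import is hypothetical, and the patterns are based solely on the example messages quoted above, so the real output may differ in details.

```python
import re

# Patterns modeled on the example messages quoted above.
READ_ONLY = re.compile(
    r"^Line (\d+) word (\S+) ignored: word in the read-only part of the dictionary$"
)
ADDED = re.compile(r"^Entries added: (\d+)$")


def summarize_import(log_text):
    """Extract the entries-added count and any read-only conflicts from pasted output."""
    added = None
    read_only = []
    for raw in log_text.splitlines():
        line = raw.strip()
        match = READ_ONLY.match(line)
        if match:
            read_only.append((int(match.group(1)), match.group(2)))
            continue
        match = ADDED.match(line)
        if match:
            added = int(match.group(1))
    return {"added": added, "read_only": read_only}
```

Every word reported under read_only is already in Unisyn, so it should either be deleted from APLS-dict.csv or handled as a dialect pronunciation. All other “ignored” and “not deleted” lines are skipped as normal.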