Custom dictionary
On the APLS GitHub repository, the folder /files/custom-dictionary/ contains files for APLS's custom dictionary.
Intro
APLS uses pronouncing dictionaries to match individual words in a transcript to their phonemic representations. Its default dictionary is the Unisyn lexicon for American English, which covers the vast majority of the words in any given transcript. The custom dictionary supplements Unisyn with words in a few categories:
- Names from Pittsburgh/western PA physical geography, human geography, and/or culture (e.g., neighborhoods like Shadyside, municipalities like Sewickley, streets like Baum, schools like Milliones, notable local figures like Stargell)
- Brand/business names, whether local (e.g., Pelusi) or not (e.g., Highmark, Panera)
- Pittsburgh lexical features, whether stereotypical (e.g., redd, gumband) or not (e.g., Trib, WAMO)
- Non-Pittsburgh-specific words that are absent from Unisyn, sometimes unexpectedly (artsy, bachelorette, homie, Kwanzaa, microbrew, stepdad, tarp, y'all)
- Restricted mini-lexicons specified in the APLS transcription convention: colloquial spellings (e.g., gonna), interjections (e.g., yup), and single-phoneme hesitation codes (e.g., f~ for [f])
Dictionary file format
The file APLS-dict.csv must conform to the following format:
- Lines must consist of one of the following:
  - An entry of the form <word-form>,<phonemes>
  - A comment of the form // <comment>
  - A blank line
- Word-forms must match the regular expression /^[A-Za-z'-]+~?$/
  - That is, they may consist only of letters of the English alphabet and/or the characters ' (apostrophe) or - (hyphen), and they may end with an optional ~ (tilde)
    - Only literal apostrophes or hyphens are acceptable, not lookalikes like ‘ (Unicode U+2018 "curly opening single quote") or – (Unicode U+2013 "en dash")
    - Tildes (hesitation markers) may only come at the end of the word-form
- Phonemic representations may only use the APLS subset of the DISC phonemic alphabet, including syllabification/stress markers: pbtdkgNmnlrfvTDszSZjhwJ_FHPiIE{Q$VUu@78#3912645'"-
- After the header (first 7 rows), comments are interpreted as separating dictionary sections
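The format rules above are mechanical enough to check line by line. Here is a minimal sketch in Python, not the check that updateDict.sh actually performs: the function name check_line is made up for illustration, and it assumes a "blank line" means a completely empty line and that each entry contains exactly one comma. It does not treat the 7-row header specially.

```python
import re

# Word-form pattern quoted from the format rules above.
WORD_FORM = re.compile(r"^[A-Za-z'-]+~?$")

# APLS subset of DISC, copied from the rules above
# (includes the syllabification/stress markers ' " and -).
DISC_CHARS = set("pbtdkgNmnlrfvTDszSZjhwJ_FHPiIE{Q$VUu@78#3912645'\"-")


def check_line(line):
    """Return None if the line is acceptable, else a short description of the problem.

    Hypothetical helper: the name and the error messages are assumptions.
    """
    line = line.rstrip("\r\n")
    if line == "" or line.startswith("//"):
        return None  # blank line or comment
    if line.count(",") != 1:
        return "expected exactly one comma"
    word, phonemes = line.split(",")
    if not WORD_FORM.match(word):
        return f"bad word-form: {word!r}"
    if phonemes == "":
        return "missing phonemes"
    bad = sorted(set(phonemes) - DISC_CHARS)
    if bad:
        return f"non-DISC characters: {''.join(bad)}"
    return None
```

A lookalike apostrophe, a tilde in the middle of a word-form, or a stray character in the phonemes all come back as a problem string, so you can run every line of the file through check_line and print whatever isn't None.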
(For corpus admins) How to add dictionary entries
This section is meant for users with admin access to APLS. However, other users may find this section useful to understand the processes that go into making APLS.
- Decide whether you even need to add a new entry
- Add the word, plus any inflectional forms like plurals or -ing, to APLS-dict.csv
- Update the Elan file checker's dictionary (updateDict.sh + commit/push changes)
- Update APLS's internal dictionary
To add or not to add?
Transcribers will suggest new dictionary entries as they work on transcriptions, typically because a word got flagged by the Elan file checker's step 2. It's up to the corpus maintainer whether to add these entries, or whether transcribers should specify the phonemic transcription inline (see transcription convention sec. 3.4).
Add the word if:
- It's likely to be used by more than one speaker
  - Examples: Monongahela and Panera (add); a word made up on the spot (don't add)
  - Rationale: If a word gets added to the dictionary, it'll save future transcribers time and ensure quality control
  - Transcribers should: Use an inline pronounce code
  - Note: This criterion is by far the most subjective! If the word satisfies all the other criteria, err on the side of adding it
- It's not a new colloquial spelling or interjection
  - Examples (don't add): dunno, er, ahem
  - Rationale: We want to avoid "coding while transcribing". If transcribers have the choice between (e.g.) don't know and dunno, they'll have to spend time deciding which one the speaker uttered. That's really a question for future sociolinguistic investigation, not something to be decided at the transcription stage
  - Transcribers should: Depending on the situation, use a standard spelling, a hesitation code, or a noise code
  - Note: We can consider waiving this criterion only if the evidence is really, really strong. See below for an example of this principle in action.
- It doesn't violate existing rules in the transcription convention
  - Examples (don't add): 412, IBM, colllect, jumpin', résumé, Picksburgh
  - Rationale: We have those rules for good reasons! Plus, allowing "eye dialect" forms like Picksburgh leads to "coding while transcribing"
  - Transcribers should: Correct their spelling
- It's not already in the Unisyn dictionary
  - Rationale: LaBB-CAT won't update a dictionary entry if it's in Unisyn (it will update custom dictionary entries if needed)
  - Transcribers should: Do nothing
  - Notes:
    - Don't assume that pluralized forms are in the Unisyn dictionary!
    - If the Elan file checker flags a form, it's not in either dictionary.
    - If you accidentally add a word that turns out to have been in Unisyn, you'll get the message "word in the read-only part of the dictionary" when you try to upload it to APLS. If that happens, simply delete the word from the custom dictionary
- It's not a vulgar and/or objectionable word
  - Rationale: Since the APLS dictionary gets published on the open web, we don't want it to show up in search engine results associated with racial slurs, etc.
  - Transcribers should: Use an inline pronounce code
Non-criteria:
- Words do not have to be Pittsburgh-specific to be added to the dictionary
- Words do not have to conform to prescriptive notions of what is or isn't a word, as long as they're not colloquial or "eye dialect" spellings
Example: lemme
In May 2023, an RA asked:
I've had a speaker use lemme and was wondering if it's a potential candidate to be added to the conventions alongside gonna and the like?
In other words, is it worth adding lemme as an alternative to let me?
lemme clearly satisfies most criteria: it is not vulgar or objectionable, it's not already in Unisyn (we know because it was flagged by the Elan file checker), it doesn't violate any existing rules, and it seems likely to be used by more than one speaker. However, it would mean expanding a restricted mini-lexicon, colloquial spellings, so it deserves extra scrutiny.
If we added lemme to the dictionary, we would have to think about whether existing tokens of let me should remain let me or be re-transcribed as lemme. I performed a search in APLS for tokens of let me (searching the orthography layer for let followed by me); the results are in results_(let)_(me).csv. I sent the RAs the wav and TextGrid files for these tokens, along with this message:
There are currently 19 tokens of let me in APLS. Give a listen to some of them, and we can consider adding "lemme" only if (1) it's obvious that there's variation between lemme vs. let me and (2) it's obvious which is which. In general, there's lots of expressions that get phonologically reduced in spontaneous speech beyond their orthographic representation, so I want to be conservative with which ones get added to the list
After listening to the files, the RAs found that it wasn't always obvious whether a token was lemme or let me. In other words, having to decide between lemme or let me would have required transcribers to "code while transcribing". We decided not to add lemme to the dictionary.
Dialect pronunciations
In rare cases, a word that exists across dialects will have a unique pronunciation in just one speech community that is unrelated to more general phonological features that characterize the dialect. For example, Carnegie is /ˈkɑɹ-nə-gi/ in "general American" but /kɑɹ-ˈneɪ-gi/ in Pittsburgh, which doesn't seem to be the result of a phonological process affecting other words in Pittsburgh English.
We have to handle these words in a special way. If a word is already in Unisyn, LaBB-CAT won't update the word's dictionary entry; you'll get an error like "Line 1 word water ignored: word in the read-only part of the dictionary". So rather than adding the word to the custom dictionary, we have to change Unisyn's representation of the word; if you have access to the APLS-Admin repo, see here for directions: https://github.com/djvill/APLS-Admin/blob/main/doc/dictionary-phonemes/README.md.
Note: These "dialect pronunciations" are not the same as idiosyncratic pronunciations. The former are shared by a speech community (or a decent subset thereof); the latter are isolated to a single speaker and should be transcribed with an inline pronounce code. Of course, what might appear at first blush to be an idiosyncratic pronunciation may, upon further examination, turn out to be more common in the speech community; if that happens, we can always revisit the issue and add a new dialect pronunciation.
Add the word
Add the word to APLS-dict.csv. Edit the file in a text editor, not in Excel, as Excel will mess up the formatting by adding extra characters. Simply add a new line under the appropriate category heading: <word>,<phonemes>. The category headings are semi-arbitrary and don't affect anything meaningful, so just pick whichever one seems right.
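If you'd rather script the edit than do it by hand, here is one possible sketch in Python. Nothing like this exists in the repository: add_entry and its parameters are hypothetical, and it assumes category headings are comment lines of the form "// <heading>" that appear after the 7-row header. It places the new entry at the end of the chosen section, before any blank lines that trail it.

```python
def add_entry(lines, heading, word, phonemes, header_rows=7):
    """Return a copy of `lines` with `word,phonemes` added at the end of a section.

    `lines` is the file content as a list of strings without newlines.
    `heading` is the comment text that opens the section; the caller must
    supply a heading that actually exists in the file.
    """
    target = "// " + heading
    # Comments inside the header (first 7 rows) are not section headings,
    # so start searching after it.
    try:
        start = lines.index(target, header_rows)
    except ValueError:
        raise ValueError(f"no section heading {target!r} after the header") from None
    # The section runs until the next comment line or the end of the file.
    end = start + 1
    while end < len(lines) and not lines[end].startswith("//"):
        end += 1
    # Insert before any blank lines that trail the section.
    while end > start + 1 and lines[end - 1].strip() == "":
        end -= 1
    return lines[:end] + [f"{word},{phonemes}"] + lines[end:]
```

The function returns a new list rather than writing to disk, so you can inspect the result before saving it over APLS-dict.csv.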
Use the DISC phonemic alphabet, as described here. Please pay attention to the extra considerations for suggesting new dictionary entries: multiple phonemic representations per word, the speech community's pronunciation(s), and syllabification/stress.
You'll also need to add inflectional forms, like plurals and verbal forms (-s, -ing, etc.). You do not need to add cliticized forms (e.g., Pittsburgh's); APLS automatically adjusts pronunciations to account for clitics, and the Elan file checker is programmed to ignore X's if X is in the dictionary. (The transcription convention says certain clitics can be added to any noun, but the implementation of clitics in APLS and the file checker isn't actually restricted to nouns.) If you supplied multiple entries for the base form of the word, you should also add multiple entries for each inflected form.
Update the Elan file checker's dictionary
- Pull updates from the GitHub repository
- Run updateDict.sh
  - If necessary, fix any formatting errors to ensure APLS-dict.csv conforms to APLS specifications, and re-run updateDict.sh
- Commit and push your changes to GitHub.
That's it!
The Elan file checker reads aplsDict.txt from this repository; in turn, aplsDict.txt gets created from APLS-dict.csv when you run updateDict.sh. That means if you update APLS-dict.csv but don't do this step, the Elan file checker will throw errors in step 2.
Requirements:
- Software
  - Git
  - Bash (if not included as Git Bash in Git install)
  - R
- GitHub account
  - Push access to https://github.com/djvill/APLS
FYI, updateDict.sh also updates custom-entries.md. The sole purpose of this file is creating a nice user-facing custom dictionary page on the APLS documentation site.
Update APLS's internal dictionary
Currently, you have to update the custom dictionary for two separate layers in APLS: dictionary_phonemes and phonemes_no_clitic. Go to the word layers page (note: you must have admin access, or else you'll get "ERROR 403"). For each layer:
- Click on the dictionary icon
- Drag and drop APLS-dict.csv to the "Choose File" button, and click "Import From CSV"
- You'll see a ton of lines indicating entries that weren't added, such as
  - Line 1 word // APLS custom dictionary not deleted: isn't in the dictionary
  - Line 9 word Wilkinsburg definition 'wIl-kInz-"b3rg ignored: definition already in the dictionary
  - These are normal. At the very bottom, you should see "Entries added: 17" (or some other number), with the new entries listed below.
- If you see any lines in red like "Line 1 word water ignored: word in the read-only part of the dictionary", then either:
  - Delete the word (if you didn't mean to provide a dialect pronunciation), or
  - Follow the steps for dialect pronunciations
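Because the import output is long, it can help to paste it into a small script that pulls out only the lines you need to act on. The sketch below, in Python, is an illustration and not part of APLS or LaBB-CAT: summarize_import_log is a made-up name, and the patterns are based solely on the example messages quoted above, so the real output may be worded differently.

```python
import re

# Message shapes copied from the examples above; the exact wording that
# LaBB-CAT prints is an assumption.
READ_ONLY = re.compile(
    r"^Line (\d+) word (\S+) ignored: word in the read-only part of the dictionary$"
)
ADDED = re.compile(r"^Entries added: (\d+)$")


def summarize_import_log(text):
    """Return (read_only_words, entries_added) from pasted import output.

    read_only_words is a list of (line_number, word) pairs;
    entries_added is None if no "Entries added" line was found.
    """
    read_only = []
    added = None
    for raw in text.splitlines():
        line = raw.strip()
        m = READ_ONLY.match(line)
        if m:
            read_only.append((int(m.group(1)), m.group(2)))
            continue
        m = ADDED.match(line)
        if m:
            added = int(m.group(1))
    return read_only, added
```

Each word in the first returned list is one you need to either delete from APLS-dict.csv or handle as a dialect pronunciation; the "not deleted" and "definition already in the dictionary" lines are skipped because they're normal.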