Data organization in LaBB-CAT
APLS data is organized using the data structures provided by the open-source linguistic corpus software LaBB-CAT. The most important organizational units in LaBB-CAT corpora are participants, transcripts, attributes, layers, and annotations.
- Participants are speakers in the audio files (both interviewers and interviewees), along with metadata like demographic info.
- Transcripts are data objects for individual audio files and all of their annotations, plus metadata like when the audio file was recorded.
- Attributes are metadata about participants and transcripts.
- Layers are series of time-aligned annotations in transcripts corresponding to a single type of linguistic data (e.g., pronunciations, part-of-speech tags).
- Annotations are individual bits of data aligned to specific timestamps in audio files.
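To make these relationships concrete, here is a minimal Python sketch of how the five units fit together. It is an illustration only, not LaBB-CAT's actual schema or API: the class and field names are hypothetical, and the example values come from the HD07interview3.eaf example discussed further down this page.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    """One bit of data aligned to a time span (in seconds) in the audio."""
    label: str
    start: float
    end: float


@dataclass
class Layer:
    """A series of annotations that all hold the same type of data."""
    name: str
    annotations: List[Annotation] = field(default_factory=list)


@dataclass
class Participant:
    """A speaker, with attributes holding metadata such as gender."""
    code: str
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Transcript:
    """One audio file's annotations, grouped into layers, plus metadata."""
    name: str
    participants: List[Participant] = field(default_factory=list)
    layers: Dict[str, Layer] = field(default_factory=dict)
    attributes: Dict[str, str] = field(default_factory=dict)


# Values from the HD07interview3.eaf example; the end time is approximate
# (utterance start of 7.92 s plus a duration of roughly 3.29 s).
speech_rate = Layer("speech_rate", [Annotation("6.5068", 7.92, 11.21)])
transcript = Transcript(
    name="HD07interview3.eaf",
    participants=[Participant("HD07")],
    layers={"speech_rate": speech_rate},
)

for layer in transcript.layers.values():
    for ann in layer.annotations:
        print(f"{layer.name}: {ann.label} ({ann.start}-{ann.end} s)")
```

Attributes are modeled here as plain dictionaries on both participants and transcripts, mirroring the fact that the same kind of metadata attaches to either unit.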
Participants
The participants in APLS are the interviewees, the interviewers, and occasionally a bystander whose speech is captured in the recording. Interviewees in APLS are identified by an anonymized speaker code that includes their neighborhood abbreviation (e.g., `CB01`, `HD17`).[1]
Transcripts
Each interview is divided into multiple transcripts (corresponding to the original recording files), named after the interviewee and interview section. For example, the file `FH10pairs.eaf` contains the minimal pairs task from the interviewee FH10.[2] Some sections are split into multiple transcripts (e.g., `interview1`, `reading2`).
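The naming scheme can be sketched as a small parser. This is a rough illustration under stated assumptions: the pattern (two uppercase letters and two digits for the speaker code, a lowercase section name, an optional part number, then `.eaf`) is inferred only from the examples on this page, and `parse_transcript_name` is a hypothetical helper, not part of APLS or LaBB-CAT. Real transcript names may not all fit this pattern.

```python
import re
from typing import NamedTuple, Optional


class TranscriptName(NamedTuple):
    speaker: str          # e.g., "FH10"
    section: str          # e.g., "pairs", "interview"
    part: Optional[int]   # e.g., 3 for "interview3"; None if the section is not split


# Assumed pattern, based only on the examples in this documentation.
_NAME_PATTERN = re.compile(r"([A-Z]{2}\d{2})([a-z]+)(\d*)\.eaf")


def parse_transcript_name(filename: str) -> TranscriptName:
    """Split a transcript filename into speaker code, section, and part number."""
    match = _NAME_PATTERN.fullmatch(filename)
    if match is None:
        raise ValueError(f"Unrecognized transcript name: {filename!r}")
    speaker, section, part = match.groups()
    return TranscriptName(speaker, section, int(part) if part else None)


print(parse_transcript_name("FH10pairs.eaf"))
print(parse_transcript_name("HD07interview3.eaf"))
```

Raising an error on names that do not match, rather than guessing, keeps unexpected filenames visible instead of silently mislabeling them.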
Attributes
Each participant and transcript has a set of attributes that describe metadata about the participant or transcript. For example, the participant `HD17` has gender `Female` and year_of_birth `1964`. The transcript `FH10pairs.eaf` has duration `188.61` and recording_date September 26, 2023.
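If attribute values like these are exported as text, they usually need converting before analysis. The following sketch assumes, hypothetically, that each value arrives as a string in the form shown above; the dictionary layout and variable names are illustrative and do not reflect how APLS or LaBB-CAT actually stores attributes.

```python
from datetime import datetime

# Attribute values as shown above, assumed here to arrive as plain strings.
participant_attributes = {"gender": "Female", "year_of_birth": "1964"}
transcript_attributes = {"duration": "188.61", "recording_date": "September 26, 2023"}

# Convert each value to a type suited to analysis.
year_of_birth = int(participant_attributes["year_of_birth"])
duration = float(transcript_attributes["duration"])
recording_date = datetime.strptime(
    transcript_attributes["recording_date"], "%B %d, %Y"
).date()

print(year_of_birth, duration, recording_date.isoformat())
```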
Layers and annotations
To illustrate layers and annotations in APLS, let’s look at a screen-grab of a single line of speech (aka an utterance) from the transcript `HD07interview3.eaf`:
On the left-hand side of the image is `HD07`, the speaker code for the participant who uttered this speech.
Layers
To the right of the speaker code are three layers. From top to bottom, these are speech_rate, part_of_speech, and word.
- speech_rate (top)
  - This layer contains a measurement of how quickly HD07 uttered this line: `6.5068` syllables per second.
  - APLS measures speech rate by lines in the transcript, so there is just one speech_rate annotation for this line (as indicated by the curved bracket).
- part_of_speech (middle)
  - This layer encodes each word’s part of speech using Penn Treebank part-of-speech tags (e.g., `UH` for interjections, `CC` for coordinating conjunctions).
  - Most words have a single part_of_speech annotation. The word don’t has two annotations (`VBP RB`), since it consists of both a present-tense verb (do) and an adverb (not).
- word (bottom)
  - This layer contains the words that HD07 spoke, spelled in normal English.
  - Each word has a single annotation on the word layer.
Layers are covered in more detail in the Layers and attributes section of the APLS documentation.
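The difference between the word and part_of_speech layers (one annotation per word versus one or more) can be sketched in a few lines of Python. The container and the `gloss` helper below are hypothetical, and the glosses simply repeat the descriptions given above; this is not how LaBB-CAT represents layers internally.

```python
from typing import Dict, List, Tuple

# Glosses for the Penn Treebank tags mentioned above.
TAG_GLOSSES: Dict[str, str] = {
    "UH": "interjection",
    "CC": "coordinating conjunction",
    "VBP": "present-tense verb",
    "RB": "adverb",
}

# Hypothetical container: each word paired with its part_of_speech annotations.
tagged_words: List[Tuple[str, List[str]]] = [("don't", ["VBP", "RB"])]


def gloss(tags: List[str]) -> List[str]:
    """Return a readable gloss per tag, falling back to the tag itself."""
    return [TAG_GLOSSES.get(tag, tag) for tag in tags]


for word, tags in tagged_words:
    print(f"{word}: {len(tags)} part_of_speech annotation(s): {gloss(tags)}")
```

Storing the tags as a list per word, rather than a single string, reflects that a contraction carries two separate annotations on the part_of_speech layer while still being a single annotation on the word layer.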
Annotations
The cursor in the screen-grab is hovering over the `NN` annotation, which brings up a tooltip with several pieces of information:
- The selected annotation is on the part_of_speech layer.
- This annotation is part of an utterance that begins at 7.92 seconds into the transcript and lasts around 3.29 seconds.
- Clicking on the annotation will bring up a menu with additional options.
[1] Here’s the demographic breakdown of APLS interviewees:

| Neighborhood | Count |
| --- | --- |
| Cranberry Township | 6 |
| Forest Hills | 12 |
| Hill District | 10 |
| Lawrenceville | 12 |

| Gender | Count |
| --- | --- |
| Female | 24 |
| Male | 16 |

| Education | Count |
| --- | --- |
| High school | 16 |
| Undergraduate | 13 |
| Graduate | 11 |

| Year of birth | Minimum | 1st quartile | Median | 3rd quartile | Maximum |
| --- | --- | --- | --- | --- | --- |
| | 1920 | 1941 | 1955 | 1967 | 1986 |

[2] The `.eaf` part of the transcript name reflects the original transcript file, which was created in the transcription program ELAN.