Data organization in LaBB-CAT

APLS data is organized using the data structures provided by the open-source linguistic corpus software LaBB-CAT. The most important organizational units in LaBB-CAT corpora are participants, transcripts, attributes, layers, and annotations.

- Participants are speakers in the audio files (both interviewers and interviewees), along with metadata like demographic info.
- Transcripts are data objects for individual audio files and all of their annotations, plus metadata like when the audio file was recorded.
- Attributes are metadata about participants and transcripts.
- Layers are series of time-aligned annotations in transcripts corresponding to a single type of linguistic data (e.g., pronunciations, part-of-speech tags).
- Annotations are individual bits of data aligned to specific timestamps in audio files.
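
One way to picture how these units fit together is the minimal Python sketch below. It is an illustration only, not LaBB-CAT's actual data model or API: the class and field names are assumptions, and a layer is modeled simply as all of a transcript's annotations that share a layer name. The example times are approximate.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    layer: str    # name of the layer this annotation belongs to
    label: str    # the annotation's content, e.g. a word or a tag
    start: float  # seconds from the start of the audio file
    end: float


@dataclass
class Participant:
    code: str                                       # e.g. "HD07"
    attributes: dict = field(default_factory=dict)  # participant metadata


@dataclass
class Transcript:
    name: str                                       # e.g. "HD07interview3.eaf"
    attributes: dict = field(default_factory=dict)  # transcript metadata
    participants: list = field(default_factory=list)
    annotations: list = field(default_factory=list)

    def layer(self, name):
        """Return this transcript's annotations on one layer, in time order."""
        return sorted(
            (a for a in self.annotations if a.layer == name),
            key=lambda a: a.start,
        )


transcript = Transcript(
    name="HD07interview3.eaf",
    participants=[Participant("HD07")],
    annotations=[
        # Approximate times: an utterance starting at 7.92 s, lasting about 3.29 s
        Annotation("speech_rate", "6.5068", start=7.92, end=11.21),
    ],
)
```

Calling `transcript.layer("speech_rate")` returns the single annotation above, while a layer with no annotations in this sketch (such as `"word"`) returns an empty list.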

Participants

The participants in APLS are the interviewees, the interviewers, and occasionally a bystander whose speech is captured in the recording. Interviewees in APLS are identified by an anonymized speaker code that includes their neighborhood abbreviation (e.g., CB01, HD17).¹

Transcripts

Each interview is divided into multiple transcripts (corresponding to the original recording files), named after the interviewee and interview section. For example, the file FH10pairs.eaf contains the minimal pairs task from the interviewee FH10.² Some sections are split into multiple transcripts (e.g., interview1, reading2).
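
The naming scheme can be unpacked with a short Python sketch. The pattern below is an assumption inferred only from the examples on this page (two capital letters and two digits for the speaker code, a lowercase section name, an optional part number, then .eaf); it is not an official specification, and the function name is hypothetical.

```python
import re

# Assumed pattern, inferred from names like FH10pairs.eaf and HD07interview3.eaf
TRANSCRIPT_NAME = re.compile(
    r"^(?P<speaker>[A-Z]{2}\d{2})"  # speaker code, e.g. FH10
    r"(?P<section>[a-z]+)"          # interview section, e.g. pairs
    r"(?P<part>\d+)?"               # optional part number, e.g. 3
    r"\.eaf$"
)


def parse_transcript_name(name):
    """Split a transcript name into (speaker code, section, part number or None)."""
    match = TRANSCRIPT_NAME.match(name)
    if match is None:
        raise ValueError(f"Unrecognized transcript name: {name!r}")
    part = match.group("part")
    return match.group("speaker"), match.group("section"), int(part) if part else None


print(parse_transcript_name("FH10pairs.eaf"))       # ('FH10', 'pairs', None)
print(parse_transcript_name("HD07interview3.eaf"))  # ('HD07', 'interview', 3)
```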

Attributes

Each participant and transcript has a set of attributes that hold its metadata. For example, the participant HD17 has gender Female and year_of_birth 1964, and the transcript FH10pairs.eaf has duration 188.61 and recording_date September 26, 2023.
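
As a rough illustration, attributes can be thought of as name/value pairs. The Python sketch below stores the example values above in plain dictionaries and converts two of them into more convenient types. It assumes that duration is measured in seconds and that recording_date is written as shown here; neither is confirmed on this page, and the variable names are hypothetical.

```python
from datetime import datetime

# Attribute values copied from the examples above
hd17_attributes = {"gender": "Female", "year_of_birth": 1964}
fh10pairs_attributes = {"duration": 188.61, "recording_date": "September 26, 2023"}

# Parse the recording date (assumes the "Month day, year" format shown above)
recorded = datetime.strptime(
    fh10pairs_attributes["recording_date"], "%B %d, %Y"
).date()

# Split the duration into minutes and seconds (assumes duration is in seconds)
minutes, seconds = divmod(fh10pairs_attributes["duration"], 60)

print(recorded.isoformat())
print(f"{int(minutes)} min {seconds:.2f} s")
```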

Layers and annotations

To illustrate layers and annotations in APLS, let’s look at a screen-grab of a single line of speech (aka an utterance) from the transcript HD07interview3.eaf:

On the left-hand side of the image is HD07, the speaker code for the participant who uttered this speech.

Layers

To the right of the speaker code are three layers. From top to bottom, these are speech_rate, part_of_speech, and word.

speech_rate (top)
- This layer contains a measurement of how quickly HD07 uttered this line: 6.5068 syllables per second.
- APLS measures speech rate by lines in the transcript, so there is just one speech_rate annotation for this line (as indicated by the curved bracket).
part_of_speech (middle)
- This layer encodes each word’s part of speech using Penn Treebank part-of-speech tags (e.g., UH for interjections, CC for coordinating conjunctions).
- Most words have a single part_of_speech annotation. The word don’t has two annotations (VBP RB), since it consists of both a present-tense verb (do) and an adverb (not).
word (bottom)
- This layer contains the words that HD07 spoke, spelled in normal English.
- Each word has a single annotation on the word layer.
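
The three layers differ in how many annotations they hold: one speech_rate annotation per line, exactly one word annotation per word, and one or more part_of_speech annotations per word. The Python sketch below encodes that difference. The class names are hypothetical, and only the word don’t is filled in, since its tags are given above.

```python
from dataclasses import dataclass


@dataclass
class WordToken:
    word: str                  # exactly one annotation on the word layer
    part_of_speech: list[str]  # one or more Penn Treebank tags


@dataclass
class Utterance:
    speaker: str
    speech_rate: float         # one annotation for the whole line
    words: list[WordToken]

    def word_layer(self):
        """All word annotations in this line, one per word."""
        return [token.word for token in self.words]

    def part_of_speech_layer(self):
        """All part-of-speech annotations in this line, possibly several per word."""
        return [tag for token in self.words for tag in token.part_of_speech]


line = Utterance(
    speaker="HD07",
    speech_rate=6.5068,  # syllables per second
    words=[WordToken("don't", ["VBP", "RB"])],
)

print(line.word_layer())            # ["don't"]
print(line.part_of_speech_layer())  # ['VBP', 'RB']
```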

Layers are covered in more detail in the Layers and attributes section of the APLS documentation.

Annotations

The cursor in the screen-grab is hovering over the NN annotation, which brings up a tooltip with several pieces of information:

- The selected annotation is on the part_of_speech layer.
- This annotation is part of an utterance that begins at 7.92 seconds into the transcript and lasts around 3.29 seconds.
- Clicking on the annotation will bring up a menu with additional options.


¹ Here’s the demographic breakdown of APLS interviewees:

Neighborhood	Count
Cranberry Township	6
Forest Hills	12
Hill District	10
Lawrenceville	12

Gender	Count
Female	24
Male	16

Education	Count
High school	16
Undergraduate	13
Graduate	11

Year of birth
Minimum	1st quartile	Median	3rd quartile	Maximum
1920	1941	1955	1967	1986

² The .eaf part of the transcript name reflects the original transcript file, which was created in the transcription program Elan.
