1 Introduction

This tutorial is a companion to the paper “Sociolinguistic auto-coding has fairness problems too: Measuring and mitigating overlearning bias”, published open-access in Linguistics Vanguard in 2024: https://doi.org/10.1515/lingvan-2022-0114. It walks readers through the process of running all the code to replicate the paper’s analysis. Check out the README for a higher-level overview of this repository and sociolinguistic auto-coding (SLAC) more generally.

You can use this tutorial to…

  1. Assess fairness for sociolinguistic auto-coders
  2. Mitigate unfairness for sociolinguistic auto-coders

1.1 Running this code on your own machine

This tutorial was written in R Markdown. R Markdown is an extension of the R statistical programming language that allows users to interweave formatted text with R commands and outputs (in other words, a literate programming approach to R). The webpage version of this tutorial is the output generated by running the code with R Markdown. The tutorial code, in turn, relies on a larger set of data and scripts that do the dirty work so this tutorial can be a nice clean summary of the analysis.

To run the tutorial code on your own machine, you’ll need a suitable computing environment and software, as described here. See below for more information about machine specs, how long it took this script to run, and how much disk space auto-coder files take up.

In addition, you can customize some of this tutorial’s behaviors via the parameters in its YAML header. This is especially useful if you want to adapt this code to your own projects.

  • extract_metrics (default TRUE): Extract fairness/performance metrics from auto-coder files and save them to Outputs/Performance/? If you use a two-computer setup, use TRUE for the computer where the auto-coder files are stored and FALSE for the other computer; likewise, use FALSE if you’ve already extracted metrics.
    • TRUE: Run code chunks that extract & save metrics.
    • FALSE: Skip these code chunks.
  • extract_only_metrics (default FALSE): Useful if you’re using this code just to update your performance files (e.g., you are testing out new UMSs and want to analyze them separately). Note: if extract_metrics is FALSE, extract_only_metrics will be overridden to FALSE (see the sketch after this list).
    • TRUE: Run only code chunks that extract & save metrics.
    • FALSE: Run other code chunks as well.
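
For concreteness, here’s a minimal sketch of how these two parameters might interact. The stand-in values and the eval= chunk-option gating are my own illustration of one possible implementation, not necessarily how this tutorial’s chunks are actually wired up.

##Minimal sketch of the parameter logic (stand-in values keep this runnable anywhere)
extract_metrics      <- TRUE    ##would be params$extract_metrics
extract_only_metrics <- FALSE   ##would be params$extract_only_metrics

##If extract_metrics is FALSE, extract_only_metrics is overridden to FALSE
if (!extract_metrics) extract_only_metrics <- FALSE

##Chunks that extract & save metrics could then be gated with the chunk option
##  eval=extract_metrics, and the remaining chunks with eval=!extract_only_metrics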

2 RQ2: Assessing fairness for SLAC

This section assesses gender fairness in the auto-coder reported on in Villarreal et al.’s 2020 Laboratory Phonology article and “How to train your classifier” auto-coding tutorial.

Read the auto-coder:

##N.B. Copy of https://github.com/nzilbb/How-to-Train-Your-Classifier/blob/main/LabPhonClassifier.Rds, 
##  but with Gender added to internal representation of training data
##  (trainingData element) to facilitate fairness measurement
LabPhonClassifier <- readRDS("Input-Data/LabPhonClassifier.Rds")

If you originally ran your auto-coder using the scripts in this repository, then it’s ready for fairness measurement. If not, you may need to manually modify the auto-coder or re-run your auto-coder using these scripts (see here).
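
As a quick sanity check (my own sketch, not part of the repository’s pipeline), you can confirm that the auto-coder’s internal training data contains the Gender column that the fairness functions below rely on:

##Check that the trainingData element includes Gender alongside the outcome
##  column (caret stores the outcome as .outcome)
"Gender" %in% names(LabPhonClassifier$trainingData)
str(LabPhonClassifier$trainingData[, c(".outcome", "Gender")])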

2.1 Measuring fairness: UMS-Utils.R functions

Next, we measure fairness using functions in the UMS-Utils.R script: cls_fairness() and cls_summary().

source("R-Scripts/UMS-Utils.R", keep.source=TRUE)

cls_fairness() is useful in exploratory data analysis— that stage where you’re poking around the data, trying to wrap your head around it. You can customize its output using different arguments.

##Overall accuracy
cls_fairness(LabPhonClassifier)
##Class accuracies
cls_fairness(LabPhonClassifier, byClass=TRUE)
##Confusion matrix
cls_fairness(LabPhonClassifier, output="cm")
$Female
         Actual
Predicted     Absent    Present
  Absent  1228.33333  118.00000
  Present   46.66667  125.00000

$Male
         Actual
Predicted    Absent   Present
  Absent  1921.0000  375.6667
  Present  188.0000  686.3333
##Chi-sq test: Overall accuracy
cls_fairness(LabPhonClassifier, output="chisq")
##Chi-sq tests: Class accuracies
cls_fairness(LabPhonClassifier, output="chisq", byClass=TRUE)
##Raw predictions
head(cls_fairness(LabPhonClassifier, output="pred"))

    2-sample test for equality of proportions with continuity correction

data:  hitMiss
X-squared = 37.029, df = 1, p-value = 1.164e-09
alternative hypothesis: two.sided
95 percent confidence interval:
 0.04825599 0.09030539
sample estimates:
   prop 1    prop 2 
0.8915239 0.8222432 

$Absent

    2-sample test for equality of proportions with continuity correction

data:  .x[[i]]
X-squared = 33.179, df = 1, p-value = 8.403e-09
alternative hypothesis: two.sided
95 percent confidence interval:
 0.03596963 0.06911130
sample estimates:
   prop 1    prop 2 
0.9633987 0.9108582 


$Present

    2-sample test for equality of proportions with continuity correction

data:  .x[[i]]
X-squared = 14.065, df = 1, p-value = 0.0001766
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.20349686 -0.06022637
sample estimates:
   prop 1    prop 2 
0.5144033 0.6462649 

cls_summary() returns a one-row dataframe of fairness and (optionally) performance info:

cls_summary(LabPhonClassifier)
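
Before reshaping, it can help to peek at the column names. Judging from how they’re used in the code below (inferred from this tutorial’s code rather than from documentation), performance metrics are named Acc, ClassAcc_Absent, and ClassAcc_Present; per-gender versions get _Female/_Male suffixes; fairness differences get _Diff; and chi-squared results appear in columns containing “Chisq”.

##Peek at cls_summary()'s column names before reshaping
cls_summary(LabPhonClassifier) %>% names()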

You can reshape the outputs of cls_fairness() and cls_summary() to create more nicely formatted results tables. For example, this is (roughly) how I generated Table 4 in the Linguistics Vanguard paper, using cls_summary():

##Summary
smry <- cls_summary(LabPhonClassifier)

##Get fairness dataframe: Metric in rows, Female/Male/Diff in columns
fmDiff <- 
  smry %>% 
  select(matches("_(Female|Male)")) %>%
  pivot_longer(everything(), names_to=c("Metric","name"), 
               names_pattern="(.+)_(.+)") %>% 
  pivot_wider() %>% 
  mutate(Difference = Female-Male)

##Get chisq dataframe: Metric in rows, Chisq stat/df/p in columns
chiCols <- 
  smry %>% 
  select(contains("Chisq")) %>% 
  pivot_longer(everything(), names_to=c("Metric","name"), 
               names_pattern="(.+)_(Chisq.+)") %>% 
  pivot_wider()

##Join together & recode Metric w/ nicer labels
recodeMetric <- c("Overall accuracy" = "Acc", 
                  "Absent class accuracy" = "ClassAcc_Absent",
                  "Present class accuracy" = "ClassAcc_Present")
left_join(fmDiff, chiCols, by="Metric") %>% 
  mutate(across(Metric, ~ fct_recode(.x, !!!recodeMetric)))

This is (roughly) how I generated Table 5 in the Linguistics Vanguard paper, using cls_fairness() and the auto-coder’s trainingData element:

##Get each token's gender & majority-vote prediction
LabPhonPred <-
  ##Prediction dataframe (one row per token * resample)
  cls_fairness(LabPhonClassifier, "pred") %>% 
  ##Get counts of each token's Rpresent votes (plus tokens' unique Gender values)
  count(rowIndex, Gender, Rpresent = Predicted) %>% 
  ##Only take most frequent prediction per token
  slice_max(n, by=rowIndex)
##Put together table
list(Actual = LabPhonClassifier$trainingData %>% 
       rename(Rpresent = .outcome), 
     Predicted = LabPhonPred) %>% 
  ##Get Gender & Rpresent counts for each data source
  imap(~ count(.x, Gender, Rpresent, name=.y)) %>% 
  ##Put into a single dataframe
  reduce(left_join, by=c("Gender","Rpresent")) %>% 
  ##Calculate Under/overprediction
  mutate("Under/overprediction" = Predicted / Actual - 1)

3 RQ3: Mitigating SLAC unfairness

In this section, we attempt to produce a fair auto-coder that does not suffer from the fairness issues in the preceding auto-coder (aka the LabPhon auto-coder). To do this, we run and analyze additional auto-coders under different unfairness mitigation strategies (UMSs).

3.1 How to generate auto-coders

To generate auto-coders, run the shell scripts from a Bash command line. The shell scripts are written to be compatible with Slurm, the job scheduler used by Pitt’s CRC clusters, using the command sbatch to submit jobs to Slurm. For example, to run UMS-Round1.sh with Slurm:

##Assuming you are in Shell-Scripts working directory
sbatch UMS-Round1.sh

If you don’t need to use Slurm, you can run the scripts directly using the command bash. In this case, you should explicitly specify where script outputs & errors should go:

##Assuming you are in Shell-Scripts working directory
bash UMS-Round1.sh &> ../Outputs/Shell-Scripts/UMS-Round1.out

The following R code returns TRUE if the command sbatch will work on your system:

unname(nchar(Sys.which("sbatch")) > 0)
[1] TRUE

For Slurm users, note that the shell scripts pre-define several additional arguments to sbatch (e.g., --partition=smp). Unfortunately, it’s not possible to override these hard-coded defaults by passing arguments to sbatch in the command line (e.g., sbatch UMS-Round1.sh --partition=htc will still use the hard-coded --partition=smp). If you need different sbatch arguments, either hard-code new arguments (if you don’t need them to change each time you execute the script) or write a wrapper script.

See the README for more info about how the shell scripts work.

3.2 Baseline

The auto-coders that we run for RQ3 will not undergo the time-consuming process of optimization for performance: hyperparameter tuning and outlier dropping. Applying these steps to each UMS would dramatically increase the amount of time it would take to run this analysis. Instead of comparing UMSs to the LabPhon auto-coder in RQ2, which was optimized for performance, we’ll run an un-optimized baseline auto-coder so we get an apples-to-apples comparison.

3.2.1 Run auto-coder

Load Bash, navigate to the Shell-Scripts/ directory, and run one of the following:

##Run with Slurm
sbatch Baseline.sh

##OR

##Run directly
bash Baseline.sh &> ../Outputs/Shell-Scripts/Baseline.out

Once that script is done running, you should have a new auto-coder file: Outputs/Diagnostic-Files/Temp-Autocoders/Run-UMS_UMS0.0.Rds.
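
If you want to confirm the job finished before moving on, here’s a quick check from R (just a convenience, not part of the pipeline):

##Confirm the baseline auto-coder file was written
file.exists("Outputs/Diagnostic-Files/Temp-Autocoders/Run-UMS_UMS0.0.Rds")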

3.2.2 Extract fairness and performance metrics

Before proceeding, we’ll extract fairness/performance metrics from this auto-coder (using cls_summary()) and save them to Outputs/Performance/. This lets us bridge the two-computer split: auto-coders can be run on a high-performance system, while the saved metrics can be analyzed anywhere.

To extract and save fairness/performance metrics from the baseline auto-coder, switch back to R (on the same computer the auto-coder was run on), and run the following code:

##Get list of UMS descriptions
umsList <- read.csv("Input-Data/UMS-List.txt", sep="\t")

##Read auto-coder file
file_baseline <- "Run-UMS_UMS0.0.Rds"
cls_baseline <- readRDS(paste0("Outputs/Diagnostic-Files/Temp-Autocoders/", 
                               file_baseline))

##Extract performance
cls_baseline %>% 
  cls_summary() %>%
  ##Add name and long description
  mutate(Classifier = str_remove_all(file_baseline, ".+_|\\.Rds"),
         .before=1) %>% 
  left_join(umsList %>% 
              mutate(Classifier = paste0("UMS", UMS)) %>% 
              select(-UMS),
            by="Classifier") %>% 
  ##Save data
  write_csv("Outputs/Performance/Perf_Baseline.csv")

We won’t analyze the baseline’s fairness here, because its whole purpose is to serve as a point of comparison for the UMSs (with neither the baseline nor the UMS auto-coders optimized for performance). However, it’s worth noting that there are small differences in fairness/performance between this un-optimized baseline and the performance-optimized LabPhon auto-coder analyzed for fairness above:

##Read baseline performance
perf_baseline <- read_csv("Outputs/Performance/Perf_Baseline.csv")

##Combine LabPhon & Baseline metrics into a single dataframe
list(LabPhon = cls_summary(LabPhonClassifier),
     Baseline = perf_baseline %>% 
       select(-c(Classifier, Description))) %>% 
  ##One dataframe with just the necessary columns
  map_dfr(select, Acc, Acc_Diff, matches("ClassAcc_(Present|Absent)$"),
          matches("ClassAcc_(Present|Absent)_Diff"),
          .id="Version") %>%
  ##LabPhon/Baseline in separate columns, one row per metric * type
  pivot_longer(contains("Acc"), names_to="Metric") %>%
  mutate(Type = if_else(str_detect(Metric, "Diff"), "Fairness", "Performance"),
         Metric = fct_inorder(if_else(str_detect(Metric, "Absent|Present"),
                                      paste(str_extract(Metric, "Absent|Present"), "class accuracy"),
                                      "Overall accuracy"))) %>%
  pivot_wider(names_from=Version) %>%
  ##Put rows in nicer order
  arrange(Metric, desc(Type))

3.3 UMS round 1

The auto-coders in UMS round 1 include downsampling, valid predictor selection, and normalization UMSs (see here for more info).

3.3.1 Run auto-coders and extract metrics

Load Bash, navigate to Shell-Scripts/, and run one of the following:

##Run with Slurm
sbatch UMS-Round1.sh

##OR

##Run directly
bash UMS-Round1.sh &> ../Outputs/Shell-Scripts/UMS-Round1.out

Once that script is done running, you should have a bunch more files in Outputs/Diagnostic-Files/Temp-Autocoders/.

To extract and save fairness/performance metrics from the round 1 auto-coders, switch back to R (staying on the same computer), and run the following code:

##Get auto-coder filenames (exclude UMS 0.x precursor auto-coders)
files_round1 <- list.files("Outputs/Diagnostic-Files/Temp-Autocoders/", 
                           "Run-UMS_UMS[1-3]", full.names=TRUE)

##Read auto-coder files
cls_round1 <-
  files_round1 %>% 
  ##Better names
  set_names(str_remove_all(files_round1, ".+_|\\.Rds")) %>% 
  map(readRDS)

##Extract performance
cls_round1 %>% 
  map_dfr(cls_summary, .id="Classifier") %>% 
  ##Add long description
  left_join(umsList %>% mutate(Classifier = paste0("UMS", UMS)) %>% select(-UMS),
            by="Classifier") %>% 
  ##Save data
  write_csv("Outputs/Performance/Perf_UMS-Round1.csv")

We’ll also extract and save variable importance data from particular auto-coders. Several UMSs are “valid predictor selection” strategies: they remove acoustic measures that could inadvertently signal gender. To determine which measures those are, we run an auto-coder predicting speaker gender rather than rhoticity and discard the measures that were “too helpful” in predicting gender. The following code pulls variable importance from these “precursor” auto-coders.

readRDS("Outputs/Diagnostic-Files/Temp-Autocoders/Run-UMS_UMS0.1.1.Rds") %>% 
  pluck("finalModel", "variable.importance") %>% 
  {tibble(Measure=names(.), Importance=.)} %>% 
  write.csv("Outputs/Other/Var-Imp_UMS0.1.1.csv", row.names=FALSE)
readRDS("Outputs/Diagnostic-Files/Temp-Autocoders/Run-UMS_UMS0.2.Rds") %>% 
  map_dfr(~ .x %>% 
            pluck("finalModel", "variable.importance") %>% 
            {tibble(Measure=names(.), value=.)},
          .id="name") %>% 
  pivot_wider(names_prefix="Importance_") %>% 
  write.csv("Outputs/Other/Var-Imp_UMS0.2.csv", row.names=FALSE)

This step isn’t strictly necessary for the R code in this document, but it allows us to run umsData() for all UMSs without needing to be on the computer where the auto-coders are saved.
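
If you’d like to eyeball which acoustic measures the gender-predicting precursor auto-coder leaned on most heavily, you can sort the variable importance file saved above (a quick sketch; it doesn’t reproduce the paper’s actual “too helpful” cutoff):

##Top measures for predicting gender in the UMS 0.1.1 precursor auto-coder
read.csv("Outputs/Other/Var-Imp_UMS0.1.1.csv") %>% 
  arrange(desc(Importance)) %>% 
  head(10)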

3.3.2 Analyze fairness and performance

Now we can analyze metrics on a more user-friendly system. Read fairness/performance data for the round 1 auto-coders, and add baseline data:

##Read
perf_baseline <- read_csv("Outputs/Performance/Perf_Baseline.csv")
perf_round1 <- rbind(perf_baseline,
                     read_csv("Outputs/Performance/Perf_UMS-Round1.csv"))

##Decode first digit of UMS code
categories <- c("Baseline", "Downsampling", "Valid pred selection", 
                "Normalization", "Combination") %>% 
  set_names(0:4)

##Shape performance dataframe for plotting: Fairness/Performance in separate
##  columns, one row per UMS * metric, Category factor, shorter Classifier label
perfPlot_round1 <- perf_round1 %>%
  select(Classifier,
         Acc, Acc_Diff, matches("ClassAcc_(Present|Absent)$"),
         matches("ClassAcc_(Present|Absent)_Diff")) %>%
  ##Fairness/Performance in separate columns, one row per UMS * metric
  pivot_longer(contains("Acc")) %>%
  mutate(Metric = fct_inorder(if_else(str_detect(name, "Absent|Present"),
                                      paste(str_extract(name, "Absent|Present"), "class accuracy"),
                                      "Overall accuracy")),
         name = if_else(str_detect(name, "Diff"), "Fairness", "Performance")) %>%
  pivot_wider() %>%
  ##Add Category column, shorter Classifier label
  mutate(Category = recode_factor(str_extract(Classifier, "\\d"), !!!categories),
         across(Classifier, ~ str_remove(.x, "UMS")))
##Plot
perfPlot_round1 %>% 
  mutate(across(Fairness, abs)) %>% 
  ggplot(aes(x=Fairness, y=Performance, color=Category, label=Classifier)) +
  ##Points w/ ggrepel'd labels
  geom_point(size=3) +
  geom_text_repel(show.legend=FALSE, max.overlaps=20) +
  ##Each metric in its own facet
  ##  (N.B. use arg scales="free" to have each facet zoom to fit data)
  facet_wrap(~ Metric, nrow=1) +
  ##Lower fairness on the right (so top-right is optimal)
  scale_x_reverse() +
  ##Theme
  theme_bw()

To make the baseline stand out, we can plot it with separate aesthetics:

perfPlot_round1 %>%
  mutate(across(Fairness, abs)) %>% 
  ##Exclude Baseline from points & labels
  filter(Classifier != "0.0") %>% 
  ggplot(aes(x=Fairness, y=Performance, color=Category, label=Classifier)) +
  ##Points w/ ggrepel'd labels
  geom_point(size=3) +
  geom_text_repel(show.legend=FALSE, max.overlaps=20) +
  ##Dotted line for baseline
  geom_vline(data=perfPlot_round1 %>% filter(Classifier=="0.0"), 
             aes(xintercept=abs(Fairness)), linetype="dashed") +
  geom_hline(data=perfPlot_round1 %>% filter(Classifier=="0.0"), 
             aes(yintercept=Performance), linetype="dashed") +
  ##Each metric in its own facet
  facet_wrap(~ Metric, nrow=1) +
  ##Lower fairness on the right (so top-right is optimal)
  scale_x_reverse() +
  ##Theme
  theme_bw()

While numerous UMSs improve fairness relative to the baseline, there is no single obvious winner, often because a UMS does well on some metrics but poorly on others. For example, while UMS 1.3.2 improves fairness and performance for overall accuracy and Absent class accuracy, it has dismal performance for Present class accuracy (under 40%). In other instances the disparity is more dramatic; for example, UMS 1.5 is clearly superior for the fairness/performance tradeoff when it comes to Present class accuracy, but its Absent class accuracy fairness is worse than the baseline.

3.4 UMS round 2

Since the round 1 results weren’t completely satisfactory, I decided to attempt combination strategies: combining downsampling with either valid predictor selection or normalization. Combining these strategies is feasible because different categories of UMS affect the data in different ways: downsampling UMSs remove tokens (rows), valid predictor selection UMSs remove acoustic measures (columns), and normalization UMSs transform acoustic measures. This could theoretically produce better results if the strengths and weaknesses of the combined UMSs hedge against one another (e.g., the improvements in Present class accuracy performance for UMS 2.2 could balance out the decline in Present class accuracy performance for UMS 1.3.2).

I chose 8 combination UMSs based on round 1 results: 2 downsampling UMSs (1.3.1, 1.3.2) × 4 other UMSs (2.1.1, 2.2, 2.3, 3.1). I chose these UMSs because they were relatively balanced across all 3 metrics for fairness and performance (i.e., excluding UMSs like 1.5 that performed very poorly on at least one metric). In your own projects, it may be appropriate to choose different UMSs to combine depending on how the round 1 results shake out.
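
For concreteness, here’s one way to enumerate those eight pairings in R (an illustration only; the authoritative list of combination UMSs, numbered 4.x, is the one in Input-Data/UMS-List.txt):

##The 8 combination pairings: downsampling UMS x other UMS
library(tidyr)
crossing(Downsampling = c("1.3.1", "1.3.2"),
         Other        = c("2.1.1", "2.2", "2.3", "3.1"))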

3.4.1 Run auto-coders and extract metrics

Load Bash, navigate to Shell-Scripts/, and run one of the following:

##Run with Slurm
sbatch UMS-Round2.sh

##OR

##Run directly
bash UMS-Round2.sh &> ../Outputs/Shell-Scripts/UMS-Round2.out

Once that script is done running, you should have additional files in Outputs/Diagnostic-Files/Temp-Autocoders/.

To extract and save fairness/performance metrics from the round 2 auto-coders, switch back to R (staying on the same computer), and run the following code:

##Get auto-coder filenames (all combo UMSs start with the digit 4)
files_round2 <- list.files("Outputs/Diagnostic-Files/Temp-Autocoders/", 
                           "Run-UMS_UMS4", full.names=TRUE)

##Read auto-coder files
cls_round2 <-
  files_round2 %>% 
  ##Better names
  set_names(str_remove_all(files_round2, ".+_|\\.Rds")) %>% 
  map(readRDS)

##Extract performance
cls_round2 %>% 
  map_dfr(cls_summary, .id="Classifier") %>% 
  ##Add long description
  left_join(umsList %>% mutate(Classifier = paste0("UMS", UMS)) %>% select(-UMS),
            by="Classifier") %>% 
  ##Save data
  write_csv("Outputs/Performance/Perf_UMS-Round2.csv")

3.4.2 Analyze fairness and performance

Now we can analyze these metrics on a more user-friendly system. Read fairness/performance data for the round 2 auto-coders, and add baseline data:

##Read
perf_baseline <- read_csv("Outputs/Performance/Perf_Baseline.csv")
perf_round2 <- rbind(perf_baseline,
                     read_csv("Outputs/Performance/Perf_UMS-Round2.csv"))

##Shape performance dataframe for plotting: Fairness/Performance in separate
##  columns, one row per UMS * metric, Category factor, shorter Classifier label
perfPlot_round2 <- perf_round2 %>%
  select(Classifier,
         Acc, Acc_Diff, matches("ClassAcc_(Present|Absent)$"),
         matches("ClassAcc_(Present|Absent)_Diff")) %>%
  ##Fairness/Performance in separate columns, one row per UMS * metric
  pivot_longer(contains("Acc")) %>%
  mutate(Metric = fct_inorder(if_else(str_detect(name, "Absent|Present"),
                                      paste(str_extract(name, "Absent|Present"), "class accuracy"),
                                      "Overall accuracy")),
         name = if_else(str_detect(name, "Diff"), "Fairness", "Performance")) %>%
  pivot_wider() %>%
  ##Add Category column, shorter Classifier label
  mutate(Category = recode_factor(str_extract(Classifier, "\\d"), !!!categories),
         across(Classifier, ~ str_remove(.x, "UMS")))

Plot (using dashed lines to mark the baseline):

perfPlot_round2 %>%
  mutate(across(Fairness, abs)) %>% 
  ##Exclude Baseline from points & labels
  filter(Classifier != "0.0") %>% 
  ggplot(aes(x=Fairness, y=Performance, color=Category, label=Classifier)) +
  ##Points w/ ggrepel'd labels
  geom_point(size=3) +
  geom_text_repel(show.legend=FALSE, max.overlaps=20, color="black") +
  ##Dotted line for baseline
  geom_vline(data=perfPlot_round2 %>% filter(Classifier=="0.0"), 
             aes(xintercept=abs(Fairness)), linetype="dashed") +
  geom_hline(data=perfPlot_round2 %>% filter(Classifier=="0.0"), 
             aes(yintercept=Performance), linetype="dashed") +
  ##Each metric in its own facet
  facet_wrap(~ Metric, nrow=1) +
  ##Lower fairness on the right (so top-right is optimal)
  scale_x_reverse() +
  ##Theme
  theme_bw() +
  ##Rotate x-axis labels to avoid clash
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

We can also plot round 1 & 2 together:

##Put perfPlot dfs together
perfPlot <- rbind(perfPlot_round1,
                  perfPlot_round2) %>% 
  ##Remove duplicate Baseline rows
  distinct()

perfPlot %>%
  mutate(across(Fairness, abs)) %>% 
  ##Exclude Baseline from points
  filter(Classifier != "0.0") %>% 
  ggplot(aes(x=Fairness, y=Performance, color=Category, label=Classifier)) +
  ##Points w/ ggrepel'd labels
  geom_point(size=3) +
  geom_text_repel(show.legend=FALSE, max.overlaps=40) +
  ##Dotted line for baseline
  geom_vline(data=perfPlot %>% filter(Classifier=="0.0"), 
             aes(xintercept=abs(Fairness)), linetype="dashed") +
  geom_hline(data=perfPlot %>% filter(Classifier=="0.0"), 
             aes(yintercept=Performance), linetype="dashed") +
  ##Each metric in its own facet
  facet_wrap(~ Metric, nrow=1) +
  ##Lower fairness on the right (so top-right is optimal)
  scale_x_reverse() +
  ##Theme
  theme_bw() +
  ##Rotate x-axis labels to avoid clash
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

3.5 Identify optimal auto-coder

Now that we’ve got all our performance data, we need to choose which auto-coder to actually use for auto-coding data that hasn’t previously been coded (i.e., to scale up our dataset of coded tokens without more manual coding). The previous plot tells us that some UMSs are better than others (e.g., we obviously won’t be using UMS 1.2), but no UMS clearly stands out from the rest. Furthermore, even if we eliminate the obviously bad options, there seems to be a tradeoff between performance and fairness. How do we winnow down the space of options? One technique is to find the UMSs that are Pareto-optimal: a UMS is Pareto-optimal if no other UMS beats it on both dimensions at once; any alternative that is fairer performs worse, and any that performs better is less fair. In this sense, the best UMS for our purposes might be neither the fairest nor the best-performing, but a UMS that strikes a good fairness–performance tradeoff.
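
To make the definition concrete, here’s a toy example with made-up numbers (not real UMSs), checking Pareto-optimality by brute force:

##Toy example: B dominates A (better performance AND better fairness), so A is
##  not Pareto-optimal; B and C are, since neither beats the other on both dimensions
toy <- data.frame(UMS         = c("A", "B", "C"),
                  Performance = c(0.80, 0.85, 0.90),
                  Fairness    = c(0.05, 0.04, 0.06))  ##lower = fairer
##A UMS is dominated if some other UMS is at least as good on both dimensions
##  and strictly better on one
dominated <- sapply(seq_len(nrow(toy)), function(i) {
  any(toy$Performance >= toy$Performance[i] & toy$Fairness <= toy$Fairness[i] &
        (toy$Performance > toy$Performance[i] | toy$Fairness < toy$Fairness[i]))
})
toy[!dominated, ]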

In R, we can use psel() from the rPref package to find Pareto-optimal auto-coders. For example, the following auto-coders are Pareto-optimal for Overall accuracy:

perfPlot %>% 
  mutate(across(Fairness, abs)) %>% 
  filter(Metric=="Overall accuracy") %>% 
  psel(high(Performance) * low(Fairness))

Here’s that same info represented in a plot:

perfPlot %>% 
  mutate(across(Fairness, abs)) %>% 
  filter(Metric=="Overall accuracy",
         Classifier != "0.0") %>%
  mutate(`Pareto-optimal` = Classifier %in% psel(., high(Performance) * low(Fairness))$Classifier) %>% 
  ggplot(aes(x=Fairness, y=Performance, color=Category, label=Classifier, alpha=`Pareto-optimal`)) +
  ##Points w/ ggrepel'd labels
  geom_point(size=3) +
  geom_text_repel(show.legend=FALSE, max.overlaps=40) +
  ##Dotted line for baseline
  geom_vline(data=perfPlot %>% 
               filter(Metric=="Overall accuracy", Classifier=="0.0"), 
             aes(xintercept=abs(Fairness)), linetype="dashed") +
  geom_hline(data=perfPlot %>% 
               filter(Metric=="Overall accuracy", Classifier=="0.0"), 
             aes(yintercept=Performance), linetype="dashed") +
  ##Lower fairness on the right (so top-right is optimal)
  scale_x_reverse() +
  ##Theme
  theme_bw() +
  ##Rotate x-axis labels to avoid clash
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

In this case, we have 3 metrics, so for each metric we find the UMSs that are Pareto-optimal for the fairness–performance tradeoff:

##Get list of dataframes, one per Metric, w/ only Pareto-optimal UMSs
paretoOpt <- 
  perfPlot %>% 
  mutate(across(Fairness, abs)) %>% 
  group_by(Metric) %>% 
  ##Identify Pareto-optimal UMSs
  group_map(~ psel(.x, high(Performance) * low(Fairness)),
            .keep=TRUE)
##Display as a single dataframe
bind_rows(paretoOpt)

In this particular dataset, we get really lucky: There is one UMS, 4.2.1, that is shared among these 3 sets of Pareto-optimal UMSs. This is certainly not a guaranteed outcome!

##Get Classifier value that is in all 3 dataframes (if any)
paretoOpt %>% 
  map("Classifier") %>% 
  reduce(intersect)
[1] "4.2.1"

Incidentally, UMS 4.2.1 also happens to be the fairest UMS on all 3 metrics; again, this is not a guaranteed outcome!

##Get fairest UMS for each Metric
perfPlot %>% 
  mutate(across(Fairness, abs)) %>% 
  group_by(Metric) %>%
  filter(Fairness==min(Fairness))
##Could also do
# perfPlot %>%
#   mutate(across(Fairness, abs)) %>%
#   group_by(Metric) %>%
#   group_modify(~ psel(.x, low(Fairness))) 

Thus, we choose 4.2.1 as the optimal UMS. In fact, this is the UMS that was used to grow an /r/ dataset fivefold for the 2021 Language Variation and Change article “Gender separation and the speech community: Rhoticity in early 20th century Southland New Zealand English” by me, Lynn Clark, Jen Hay, and Kevin Watson. (The auto-coder used for that analysis was optimized for performance, so the fairness we report in that paper is slightly worse than UMS 4.2.1’s fairness here.)

4 Script meta-info

4.1 R session info

sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS:   /usr/lib64/libblas.so.3.4.2 
LAPACK: /usr/lib64/liblapack.so.3.4.2

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] ROCR_1.0-11       rlang_1.1.1       benchmarkme_1.0.8 rPref_1.4.0      
 [5] ggrepel_0.9.3     knitr_1.43        magrittr_2.0.3    lubridate_1.9.2  
 [9] forcats_1.0.0     stringr_1.5.0     dplyr_1.1.2       purrr_1.0.2      
[13] readr_2.1.4       tidyr_1.3.0       tibble_3.2.1      ggplot2_3.4.2    
[17] tidyverse_2.0.0  

loaded via a namespace (and not attached):
 [1] benchmarkmeData_1.0.4 gtable_0.3.3          xfun_0.40            
 [4] bslib_0.5.0           lattice_0.21-8        tzdb_0.4.0           
 [7] vctrs_0.6.3           tools_4.3.0           generics_0.1.3       
[10] parallel_4.3.0        fansi_1.0.4           highr_0.10           
[13] pkgconfig_2.0.3       Matrix_1.6-0          RcppParallel_5.1.7   
[16] lifecycle_1.0.3       compiler_4.3.0        farver_2.1.1         
[19] munsell_0.5.0         codetools_0.2-19      htmltools_0.5.5      
[22] sass_0.4.7            yaml_2.3.7            lazyeval_0.2.2       
[25] pillar_1.9.0          crayon_1.5.2          jquerylib_0.1.4      
[28] cachem_1.0.8          iterators_1.0.14      foreach_1.5.2        
[31] tidyselect_1.2.0      digest_0.6.33         stringi_1.7.12       
[34] labeling_0.4.2        fastmap_1.1.1         grid_4.3.0           
[37] colorspace_2.1-0      cli_3.6.1             utf8_1.2.3           
[40] withr_2.5.0           scales_1.2.1          bit64_4.0.5          
[43] timechange_0.2.0      rmarkdown_2.23        httr_1.4.6           
[46] igraph_1.5.1          bit_4.0.5             hms_1.1.3            
[49] evaluate_0.21         doParallel_1.0.17     Rcpp_1.0.11          
[52] glue_1.6.2            renv_1.0.1            vroom_1.6.3          
[55] jsonlite_1.8.7        R6_2.5.1             

4.2 Disk space used

These figures are only shown if params$extract_metrics is TRUE (because otherwise it’s assumed that you’re not working on the same system where the auto-coders were run).

Temporary auto-coders:

tmpAuto <- 
  list.files("Outputs/Diagnostic-Files/Temp-Autocoders/", 
             # "^(Run-UMS|(Hyperparam-Tuning|Outlier-Dropping)_UMS0\\.0).*Rds$") %>%
             "^Run-UMS.*Rds$", full.names=TRUE) %>%
  file.info()
# cat("Disk space: ", round(sum(tmpAuto$size)/2^30, 1), " Gb (",
cat("Disk space: ", round(sum(tmpAuto$size)/2^20, 1), " Mb (",
    nrow(tmpAuto), " files)", sep="")
Disk space: 188.6 Mb (28 files)

Complete repository:

if (.Platform$OS.type=="windows") {
  shell("dir /s", intern=TRUE) %>% 
    tail(2) %>% 
    head(1) %>% 
    str_trim() %>% 
    str_squish()
}
if (.Platform$OS.type=="unix") {
  system2("du", "-sh", stdout=TRUE) %>% 
    str_remove("\\s.+") %>% 
    paste0("b")
}
[1] "366Mb"

4.3 Machine specs

System:

Sys.info()
                              sysname                               release 
                              "Linux"         "3.10.0-1160.99.1.el7.x86_64" 
                              version                              nodename 
"#1 SMP Thu Aug 10 10:46:21 EDT 2023"                 "login0.crc.pitt.edu" 
                              machine                                 login 
                             "x86_64"                               "dav49" 
                                 user                        effective_user 
                              "dav49"                               "dav49" 

Processor:

get_cpu()
$vendor_id
[1] "GenuineIntel"

$model_name
[1] "Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz"

$no_of_cores
[1] 12

RAM:

get_ram()
135 GB

4.4 Running time

Total running time for shell scripts:

##Parse HH:MM:SS and print as nicer time
printDur <- function(x) {
  library(lubridate)
  library(magrittr)
  x %>% 
    lubridate::hms() %>% 
    as.duration() %>% 
    sum() %>% 
    seconds_to_period()
}
scripts <- c("Baseline", "UMS-Round1", "UMS-Round2")
cat("Total running time for", paste0(scripts, ".sh", collapse=", "), fill=TRUE)
paste0("Outputs/Shell-Scripts/", paste0(scripts, ".out")) %>% 
  map_chr(~ .x %>% 
            readLines() %>% 
            str_subset("RunTime")) %>% 
  str_extract("[\\d:]{2,}") %>%
  printDur()
Total running time for Baseline.sh, UMS-Round1.sh, UMS-Round2.sh
[1] "21M 16S"

Total running time for R code in this document (with params$extract_metrics set to TRUE), in seconds:

timing$stop <- proc.time()

timing$stop - timing$start
   user  system elapsed 
 40.212   3.112  45.113 

5 Acknowledgements

I would like to thank Chris Bartlett, the Southland Oral History Project (Invercargill City Libraries and Archives), and the speakers for sharing their data and their voices. Thanks are also due to Lynn Clark, Jen Hay, Kevin Watson, and the New Zealand Institute of Language, Brain and Behaviour for supporting this research. Valuable feedback was provided by audiences at NWAV 49, the Penn Linguistics Conference, Pitt Computer Science, and the Michigan State SocioLab. Other resources were provided by a Royal Society of New Zealand Marsden Research Grant (16-UOC-058) and the University of Pittsburgh Center for Research Computing (specifically, the H2P cluster supported by NSF award number OAC-2117681). Any errors are mine entirely.