class: top, title-slide

.title[
# PCA in Sociolinguistics: A Tutorial
]
.author[
### Joshua Wilson Black
]
.date[
### 27 July 2022
]

---

# Overview

1. What is PCA?
  1. A simple example in three dimensions
  2. A more general description

--

2. A Worked Example in R
  1. Preparing data
  2. Applying PCA with `prcomp`
  3. Interpreting PCA output (via `factoextra`)

--

3. When is PCA Appropriate?
  - Rules of thumb
  - Examples from linguistics literature

---

# PCA with Three Variables
Data: Mean first formant values in Hz for <span style="font-variant: small-caps;">dress, kit</span> and <span style="font-variant: small-caps;">trap</span> from 100 random [ONZE](https://www.canterbury.ac.nz/nzilbb/research/onze/) speakers.

???

1. find the centre of the cloud,
2. draw the line which stays in the cloud for the longest (PC1),
3. draw the line _at right angles to PC1_ which stays in the cloud for the longest (PC2)

---

# From 3D to 2D

.pull-left[
- (Intuitive) PCA:
  1. Find the centre,
  2. draw the line which captures maximum variance (PC1),
  3. draw the line orthogonal to PC1 which captures maximum variance, and
  4. repeat until you run out of options!
- We can plot our original three-variable data against PC1 and PC2 (see right).
]

.pull-right[
<img src="pca_presentation_files/figure-html/unnamed-chunk-3-1.png" width="100%" />
]

???

Note that this is not that impressive for three dimensions, but it is _great_ for 20 or 30! PCA can allow us to see patterns in the data which would not otherwise be visible to creatures like us!

---

# What do the PCs Mean?

.pull-left[
- PC1: decrease in all variables.
- PC2: decrease in .sc[kit] and increase in .sc[dress] and .sc[trap].
- _Supplementary variables_ can help interpretation (see right).
- Interpretation:
  - PC1: vocal tract length
  - PC2: short front vowel shift
- We've lost information, but our new description captures deeper phenomena.
]

.pull-right[
<img src="pca_presentation_files/figure-html/unnamed-chunk-4-1.png" width="100%" />
]

???

Take stock:
- We've seen:
  - a method for PCA in 3D space,
    - Not _mathematical_: haven't told you _how_ to maximise variance
  - PCA as _dimensionality reduction_
  - PCs as _interpretable_
    - both in terms of the relationship between variables PCs represent _and_
    - in terms of supplementary variables.

---

# PCA in General

**From 'the textbook':**

> The central idea of principal component analysis (PCA) is to reduce the dimensionality of a data set consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. This is achieved by transforming to a new set of variables, the principal components (PCs), which are uncorrelated, and which are ordered so that the first _few_ retain most of the variation present in _all_ of the original variables. [(Jolliffe, 2002)](https://doi.org/10.1007/b98835)

--

- 'Reduce dimensionality' = we end up with fewer variables.
- 'Interrelated variables' = there is structure in the correlations between variables.
  - Another intuition: PCA finds structure in the correlation matrix.
- 'First few': _few_ is not mathematically defined.
  - Often, we want _interpretable_ PCs.
- PCs can capture _meaningful information_ present, but not immediately obvious, in the original data.
  - PCA is thus an important tool for _exploratory data analysis_.
- PCA is also useful for hypothesis testing: PCs can be used as independent variables.

???

Correlation matrix point in more detail:
- PCs correspond to eigenvectors of the correlation matrix; the variance each PC explains is the corresponding eigenvalue.
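---

# The Three-Variable Example in Code

A minimal sketch of the opening example using `prcomp`, assuming `library(tidyverse)` and `library(nzilbb.vowels)` (both loaded properly in the worked example below). The exact values needn't match the figures above, which use their own sample and plotting code.

```r
# Sketch: PCA on mean F1 of DRESS, KIT, and TRAP per speaker.
toy_means <- onze_vowels %>%
  filter(vowel %in% c("DRESS", "KIT", "TRAP")) %>%
  group_by(speaker, vowel) %>%
  summarise(F1 = mean(F1_50), .groups = "drop") %>%
  pivot_wider(names_from = vowel, values_from = F1)

toy_pca <- toy_means %>%
  select(-speaker) %>%
  prcomp(scale = TRUE)

# If the interpretation above holds, PC1 should weight all three vowels
# with the same sign, and PC2 should oppose KIT to DRESS and TRAP.
toy_pca$rotation
```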
---

# Worked Example: Patterns in ONZE

.left-column[
![](images/mobileunit.png)
.tiny[NZBC mobile recording unit.]
]

.right-column[
- **Target:** explore patterns in the behaviour of monophthongs in the ONZE corpus.
  - A simplification of the analysis in [Brand et al. (2021)](https://doi.org/10.1016/j.wocn.2021.101096).
- 100 random speakers, 50 born in 1920 or before, 50 born after 1920.
- First and second formant data for 10 monophthongs, filtered to exclude stopwords, outliers, and formant tracking errors ([details](https://osf.io/zeb3g)).
- We move from readings of _individual vowels_ to _systems of vowels_.
- This is PCA in its _exploratory_ use.

**Steps:**
1. preprocessing,
2. application of PCA, and
3. interpretation of PCA output.
]

---

# Libraries

**Time for some R:**

```r
# We're working in the tidyverse 'dialect'.
library(tidyverse)

# factoextra is used for PCA visualisation.
library(factoextra)

# nzilbb.vowels is used to share the data and some useful functions.
# Install via GitHub using the following commented line.
# remotes::install_github('https://github.com/JoshuaWilsonBlack/nzilbb_vowels')
library(nzilbb.vowels)

# We will fit GAMM models as part of our preprocessing example.
library(mgcv)
library(itsadug)
```

--

**NB:** No special package is required for running PCA.

---

# Preprocessing

> Doing data analysis, in good mathematics, is simply searching for eigenvectors; all the science of it (the art) is to find the right matrix to diagonalize (Benzécri, cited by [Husson](https://www.youtube.com/playlist?list=PLnZgp6epRBbRX8TEp1HlFGqfMf_AxYEj7))

--

PCA does this for the correlation matrix. But careful data selection is also required _before_ PCA. What do we need from our data?

--

1. Technical requirements:
  - a row for each speaker, and
  - a column for each variable
    - i.e. two for each vowel (F1 and F2 for each).
2. Research requirements:
  - patterns which are present _across_ the development of NZE,
  - patterns not explained by other known sources of covariation.

--

**Thoughtful preprocessing is required for meaningful results.**

???

My use of the quote is a bit loose - we will be processing matrices _derived from_ our data.

---

# Raw Data

```r
# onze_vowels comes from the nzilbb.vowels package.
onze_vowels %>% head(5)
```

```
##    speaker   vowel F1_50 F2_50 speech_rate gender  yob       word
## 1 IA_f_065 THOUGHT   514   868      4.3131      F 1891 word_09539
## 2 IA_f_065  FLEECE   395  2716      4.3131      F 1891 word_22664
## 3 IA_f_065     KIT   653  2413      4.3131      F 1891 word_02705
## 4 IA_f_065   DRESS   612  2372      4.3131      F 1891 word_23651
## 5 IA_f_065   GOOSE   445  2037      4.3131      F 1891 word_06222
```

--

1. Technical requirements:
  - Make the data 'wider' so that:
    - each vowel-formant pair has a column,
    - each speaker has a row.
2. Research requirements:
  - Control for `yob` (year of birth), `speech_rate`, and `gender`.
  - Normalise vowel space to enable across-speaker comparisons.

---

# Preprocessing

After normalisation, we can kill two birds with one stone:

--

1. Fit a mixed-effects regression model for each vowel-formant pair, where:
  - sources of unwanted variation are the fixed effects, and
  - there are by-speaker random intercepts.
2. Extract the random intercepts for each speaker from each model.
  - For exploitation of by-speaker random effects see [Drager and Hay (2012)](https://doi.org/10.1017/S0954394512000014).
3. Create a matrix with a row for each speaker and a column for each random intercept value.

--

The result satisfies both the technical and research requirements.
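---

# Aside: Classic Lobanov Normalisation

For intuition: classic Lobanov normalisation is just a per-speaker z-score of each formant. A hand-rolled sketch (illustrative only; on the next slide we use `lobanov_2`, the modified version from Brand et al. (2021)):

```r
# Sketch: classic Lobanov normalisation as per-speaker z-scores.
# Not the version used below; shown only to build intuition.
onze_lob1 <- onze_vowels %>%
  group_by(speaker) %>%
  mutate(
    F1_lob = (F1_50 - mean(F1_50)) / sd(F1_50),
    F2_lob = (F2_50 - mean(F2_50)) / sd(F2_50)
  ) %>%
  ungroup()
```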
---

# Preprocessing: Normalisation

We apply Lobanov 2.0 normalisation using `lobanov_2` from `nzilbb.vowels`.

```r
onze_ints <- onze_vowels %>%
  # Add F1_lob2 and F2_lob2 columns
  lobanov_2() %>%
  # Remove F1_50 and F2_50 columns
  select(-(F1_50:F2_50))

onze_ints %>% head(5)
```

```
## # A tibble: 5 × 8
##   speaker  vowel   speech_rate gender   yob word       F1_lob2 F2_lob2
##   <fct>    <fct>         <dbl> <fct>  <int> <fct>        <dbl>   <dbl>
## 1 IA_f_065 THOUGHT        4.31 F       1891 word_09539  -0.708 -2.45
## 2 IA_f_065 FLEECE         4.31 F       1891 word_22664  -1.76    1.41
## 3 IA_f_065 KIT            4.31 F       1891 word_02705   0.518   0.773
## 4 IA_f_065 DRESS          4.31 F       1891 word_23651   0.157   0.688
## 5 IA_f_065 GOOSE          4.31 F       1891 word_06222  -1.32   -0.0111
```

---

# Preprocessing: Modelling

**Step 1:** reshape the data so we have a row for each model we want to fit.

--

```r
onze_ints <- onze_ints %>%
  pivot_longer(
    cols = F1_lob2:F2_lob2,     # Select columns with formant data.
    names_to = "formant_type",  # Name column to distinguish F1 & F2.
    values_to = "formant_value" # Name column for formant values.
  ) %>%
  group_by(vowel, formant_type) %>%
  nest() %>%
  ungroup()

onze_ints %>% head(3)
```

```
## # A tibble: 3 × 3
##   vowel   formant_type data
##   <fct>   <chr>        <list>
## 1 THOUGHT F1_lob2      <tibble [6,942 × 6]>
## 2 THOUGHT F2_lob2      <tibble [6,942 × 6]>
## 3 FLEECE  F1_lob2      <tibble [11,896 × 6]>
```

--

Column `data` contains data frames ('tibbles') with the data for each model.

---

# Preprocessing: Modelling

**Step 2:** Fit models for each vowel-formant pair and extract random effects.

--

```r
# gam version
onze_ints <- onze_ints %>%
  # For each vowel-formant pair, fit a GAMM with bam().
  mutate(
    model = map( # map takes...
      data,      # ... each entry in a column, and ...
      ~ bam(     # applies a function to it.
        formant_value ~ gender + s(yob, by=gender) + s(speech_rate) +
          s(speaker, bs="re"),
        discrete = TRUE,
        nthreads = 2,
        data = .x # where .x is a pronoun referring to the column entry.
      )
    ),
    # For each model, extract random effects and turn them into a dataframe
    random_intercepts = map(
      model,
      ~ get_random(.x)$`s(speaker)` %>%
        as_tibble(rownames = "speaker") %>%
        rename(intercept = value)
    )
  )
```

---

# Preprocessing: Modelling

**Step 3:** Reshape the random intercept data for use in PCA.

--

```r
# Select the vowel, formant type, and random intercepts columns and 'unnest'.
onze_ints <- onze_ints %>%
  select(vowel, formant_type, random_intercepts) %>%
  unnest(random_intercepts)

onze_ints %>% head(5)
```

```
## # A tibble: 5 × 4
##   vowel   formant_type speaker  intercept
##   <fct>   <chr>        <chr>        <dbl>
## 1 THOUGHT F1_lob2      CC_f_020  -0.114
## 2 THOUGHT F1_lob2      CC_f_084   0.00552
## 3 THOUGHT F1_lob2      CC_f_170  -0.0661
## 4 THOUGHT F1_lob2      CC_f_186   0.284
## 5 THOUGHT F1_lob2      CC_f_210   0.242
```

--

Almost there! We just need to reshape this so we have a single row for each speaker, and a column for each vowel-formant pair.

---

# Preprocessing: Modelling

**Step 3 (cont):** Reshape the random intercept data for use in PCA.

```r
onze_ints <- onze_ints %>%
  # Create a column to store our eventual column names.
  mutate(
    # Combine the 'vowel' and 'formant_type' columns as a string.
    vowel_formant = str_c(vowel, '_', formant_type),
    # Remove '_lob2' for cleaner column names.
    vowel_formant = str_replace(vowel_formant, '_lob2', '')
  ) %>%
  # Remove old 'vowel' and 'formant_type' columns.
  select(-c(vowel, formant_type)) %>%
  # Make the data 'wider', by...
  pivot_wider(
    names_from = vowel_formant, # naming columns using 'vowel_formant'...
    values_from = intercept     # ... and taking values from the intercept column.
  )
```
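---

# Preprocessing: Sanity Checks

Before inspecting the result, a quick sketch of checks worth running after a reshape like this (not part of the original pipeline; the expected count assumes 10 vowels × 2 formants plus the speaker column):

```r
# One row per speaker?
stopifnot(nrow(onze_ints) == n_distinct(onze_ints$speaker))

# One column per vowel-formant pair, plus the speaker column?
ncol(onze_ints) # expect 21 here.

# PCA will require complete observations (more on this shortly):
stopifnot(!anyNA(onze_ints))
```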
---

# Preprocessing: Modelling

**Step 3 (cont):** Reshape the random intercept data for use in PCA.

--

What does the data look like now?

--

```r
onze_ints %>% head(5)
```

```
## # A tibble: 5 × 21
##   speaker  THOUGHT_F1 THOUGHT…¹ FLEECE…² FLEEC…³  KIT_F1  KIT_F2 DRESS…⁴ DRESS…⁵
##   <chr>         <dbl>     <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 CC_f_020   -0.114     -0.331   -0.106  -0.196   0.178   0.145   0.232  -0.0675
## 2 CC_f_084    0.00552    0.202   -0.112   0.167  -0.0660 -0.129  -0.0905  0.0647
## 3 CC_f_170   -0.0661     0.0271   0.270  -0.0465 -0.233  -0.0840 -0.146  -0.101
## 4 CC_f_186    0.284     -0.0704  -0.0491 -0.0812  0.0424 -0.239   0.0361  0.0651
## 5 CC_f_210    0.242     -0.0819   0.00588 -0.0770 -0.162   0.168   0.0635 -0.0108
## # … with 12 more variables: GOOSE_F1 <dbl>, GOOSE_F2 <dbl>, TRAP_F1 <dbl>,
## #   TRAP_F2 <dbl>, START_F1 <dbl>, START_F2 <dbl>, STRUT_F1 <dbl>,
## #   STRUT_F2 <dbl>, NURSE_F1 <dbl>, NURSE_F2 <dbl>, LOT_F1 <dbl>, LOT_F2 <dbl>,
## #   and abbreviated variable names ¹THOUGHT_F2, ²FLEECE_F1, ³FLEECE_F2,
## #   ⁴DRESS_F1, ⁵DRESS_F2
## # ℹ Use `colnames()` to see all variable names
```

--

**Yay!** A row for each speaker, and a column for each variable.

---

# Preprocessing Tips

--

- PCA requires _complete observations_.
  - i.e. each row has a value for each variable.
- If we are missing data, we have a few options:
  1. delete the column with missing data,
  2. delete the row with missing data, or
  3. impute values.
- Which option to choose depends on context.
  - e.g. all three were used in Wilson Black et al. (under review):
    1. both <span style="font-variant: small-caps;">foot</span> columns removed,
    2. rows missing more than 3 values removed, and
    3. mean values imputed for the remaining rows.

???

Taking stock:
- The specific steps carried out above were _problem specific_.
- The details are not vital for understanding PCA.
- At this point all that matters is that we have a row for each speaker, and a column for each variable.

i.e. we now have a multivariate space to reduce!

---

# Apply PCA

--

We have a multivariate dataset containing information about 10 New Zealand English monophthongs.

**We can now apply PCA** to find meaningful patterns in this data:

--

```r
onze_pca <- onze_ints %>%
  # Remove the speaker column as this is not wanted by PCA.
  select(-speaker) %>%
  # We scale the variables (more on this in a moment).
  prcomp(scale = TRUE)
```

--

- `prcomp` is the recommended function for PCA (rather than `princomp`).
- For historical reasons, `prcomp` does not scale variables by default.
- But variable scaling is pretty much always recommended.
  - Why? Because variables with higher variance would otherwise dominate variables with lower variance.

--

**What now?**

1. How many PCs should we look at?
2. What do the PCs mean?

---

# How Many PCs?: Scree Plots

--

.pull-left[
```r
fviz_eig(onze_pca)
```

<img src="pca_presentation_files/figure-html/unnamed-chunk-15-1.png" height="450px" />
]

.pull-right[
- The `fviz_eig` function from `factoextra` plots the variance explained by each PC.
- Each PC explains less (or equal) variance than the previous PC.
- The scree plot is used to select which PCs we will use.
- **How to use:** two options:
  1. use a cut-off value (e.g. 10%), or
  2. 'the elbow rule': ignore PCs after the 'elbow' in the scree plot.
- In this case, _both_ methods can motivate selecting the first two PCs (see the numerical check on the next slide).
- The second method _might_ allow you to pick the first six PCs.
]

???

One always picks the _first_ _n_ PCs in standard use of PCA.

Function name `fviz_eig`: under the hood, these are eigenvalues of matrices.

The function produces ggplot objects, which can be modified in the usual ggplot way, e.g. `+ labs(...)`.
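---

# How Many PCs?: A Numerical Check

The 10% cut-off rule can also be applied numerically. A minimal sketch reading the proportions of variance off the `prcomp` summary (exact values depend on the speaker sample):

```r
# Proportion of variance explained by each PC.
prop_var <- summary(onze_pca)$importance["Proportion of Variance", ]
prop_var[1:5]

# Which PCs clear a 10% cut-off?
which(prop_var >= 0.10)
```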
---

# Interpretation: Variable Plots

--

.pull-left[
```r
fviz_pca_var(
  onze_pca,
  repel = TRUE,
  select.var = list(cos2 = 10)
)
```

<img src="pca_presentation_files/figure-html/unnamed-chunk-16-1.png" height="400px" />
]

.pull-right[
- Variable plots show the relationship between the original variables and the selected PCs (in this case, PC1 and PC2).
- **Positives:** we get a sense of the PC space as a whole, and of relations between variables which might be invisible from a single PC.
- **Negatives:** they can be very messy!
- Some technical points:
  - This plot restricts the variables shown to the 10 best represented by PC1 and PC2 (`list(cos2 = 10)`).
  - `repel = TRUE` prevents labels from overlapping.
  - If you want to look at, e.g., PC3 and PC4, use `axes = c(3, 4)`.
]

???

PC1 seems to match Brand 2021 PC2, while PC2 gets the DRESS F2, GOOSE F2 relationship in Brand's PC3 and PC1.

Let's wait for the contribution plots.

---

# Interpretation: Contribution Plots

.pull-left[
```r
pca_contrib_plot(
  onze_pca,
  pc_no = 1,
  cutoff = 50
)
```

![](pca_presentation_files/figure-html/unnamed-chunk-17-1.png)<!-- -->
]

.pull-right[
- This kind of plot was used to interpret PCs in Brand et al. (2021).
- We interpret the PC with reference to the variables which together explain 50% of its variance.
- Here we are given the percentage contribution of the original variables, and their relationship to one another.
- **NB:** the signs can be flipped without affecting the meaning of the PCs!
- Don't let yourself be bullied by strict cutoffs, but they can help to focus interpretation.
]

---

# Interpretation: Contribution Plots

.pull-left[
```r
pca_contrib_plot(
  onze_pca,
  pc_no = 2,
  cutoff = 50
)
```

![](pca_presentation_files/figure-html/unnamed-chunk-18-1.png)<!-- -->
]

.pull-right[
- PC1 has loadings which slowly trail off.
- Often it is easier to interpret PCs which involve only a handful of the original variables.
- We avoid getting into the weeds with these particular plots!
- **The point:** PCs can be interpreted _one-by-one_ as in these plots, or _together_ as in the variable plots.
]

---

# Interpretation: Theory

--

- The standard deviations of the PCs, i.e. the square roots of the eigenvalues of the correlation matrix, are stored in `onze_pca$sdev`.
  - To turn them into percentages of variance explained: `onze_pca$sdev^2/sum(onze_pca$sdev^2) * 100`
- Interpretation of PCs usually focuses on two quantities:
  1. the _loadings_: the contribution of each original variable to a given PC, and
  2. the _scores_: the position of individuals in PC space.
- It can be useful to see these directly:
  - loadings can be accessed by adding `$rotation` to the name of your PCA object (e.g. `onze_pca$rotation`), and
  - scores can be accessed by adding `$x` (e.g. `onze_pca$x`).
- **Scores** can be useful as either response or predictor in regression models (see the sketch on the next slide).

--

```r
onze_pca$rotation[1:5, 1:5]
```

```
##                    PC1         PC2         PC3         PC4         PC5
## THOUGHT_F1   0.3572963 -0.04267274  0.08444889 -0.14809462  0.34717131
## THOUGHT_F2   0.1412604 -0.31325372  0.03986780  0.16732278  0.31588925
## FLEECE_F1   -0.3153993 -0.16235352 -0.21469142 -0.32770244 -0.04562523
## FLEECE_F2    0.3106982 -0.05124337 -0.37692977  0.18536913 -0.05316580
## KIT_F1      -0.1130919  0.05917746  0.45148138  0.04650166 -0.17518187
```
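---

# Using Scores Downstream

The sketch promised on the last slide: PC scores as a response in a regression. Illustrative only; the speaker lookup rebuilt here from `onze_vowels` is an assumption, not part of the analysis above.

```r
# Speaker-level metadata: one row per speaker.
speakers <- onze_vowels %>%
  distinct(speaker, yob) %>%
  # Match the character speaker column in onze_ints.
  mutate(speaker = as.character(speaker))

# Scores live in onze_pca$x, with rows in the same order as onze_ints.
scores <- onze_pca$x %>%
  as_tibble() %>%
  mutate(speaker = onze_ints$speaker) %>%
  left_join(speakers, by = "speaker")

# Does PC1 vary with year of birth?
summary(lm(PC1 ~ yob, data = scores))
```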
---

# Testing: Permutation

```r
permutation_results <- permutation_test(
  onze_ints %>% select(-speaker),
  pc_n = 5,
  n = 100
)

plot_permutation_test(permutation_results)
```

![](pca_presentation_files/figure-html/unnamed-chunk-20-1.png)<!-- -->

---

# Testing: Permutation

- Permutation tests offer a way to discern whether there are real patterns in your data.
- The method: each variable is shuffled independently.
- Left panel: the number of significant pairwise correlations in the data.
  - Remember: PCA as investigation of the correlation matrix.
- Right panel: the percentage of variance explained by each PC in our permuted and actual data.
- If `n` is large, a permutation test can take a long time.

---

# Summary

--

- PCA is an important tool for _exploratory data analysis_ in sociolinguistics.
- It is well suited to 'quantitative abstraction': we move from lots of related variables to the structural variation between them.

--

- We've covered:
  1. the geometric intuition behind PCA,
  2. pre-PCA data processing,
  3. the application of PCA with `prcomp`,
  4. visual interpretation of PCA with `factoextra` and `nzilbb.vowels`,
  5. how to access loadings, scores, and variances explained, and
  6. testing for meaningful structure with permutation tests.

--

- For more, see: <http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/112-pca-principal-component-analysis-essentials/>