
PCA in Sociolinguistics: A Tutorial

Joshua Wilson Black

27 July 2022

1 / 27


Overview

  1. What is PCA?

    1. A simple example in three dimensions
    2. A more general description
  2. A Worked Example in R

    1. Preparing data
    2. Applying PCA with prcomp
    3. Interpreting PCA output (via factoextra)
  3. When is PCA Appropriate?

    • Rules of thumb
    • Examples from linguistics literature
2 / 27

PCA with Three Variables

Data: Mean first formant values in Hz for dress, kit and trap from 100 random ONZE speakers.

3 / 27
  1. find the centre of the cloud,
  2. draw the line which stays in the cloud for the longest (PC1), and
  3. draw the line at right angles to PC1 which stays in the cloud for the longest (PC2).

From 3D to 2D

  • (Intuitive) PCA:

    1. Find the centre,
    2. draw the line which captures maximum variance (PC1),
    3. draw the line orthogonal to PC1 which captures maximum variance, and
    4. repeat until you run out of options!
  • We can plot our original three-variable data against PC1 and PC2 (see right).

4 / 27

Note that this is not that impressive for three dimensions, but it is great for 20 or 30!

PCA can allow us to see patterns in the data which would not otherwise be visible to creatures like us!
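
To make this concrete, here is a minimal sketch of the three-variable example in R: per-speaker mean F1 for the three vowels, then PCA via prcomp. It uses the onze_vowels data and functions introduced later in the tutorial.

# Sketch of the 3D example: mean F1 for DRESS, KIT, and TRAP per
# speaker, then PCA. Assumes onze_vowels from nzilbb.vowels.
library(tidyverse)
library(nzilbb.vowels)

three_vowel_means <- onze_vowels %>%
  filter(vowel %in% c("DRESS", "KIT", "TRAP")) %>%
  group_by(speaker, vowel) %>%
  summarise(F1_mean = mean(F1_50), .groups = "drop") %>%
  pivot_wider(names_from = vowel, values_from = F1_mean)

three_vowel_pca <- three_vowel_means %>%
  select(-speaker) %>%
  prcomp(scale = TRUE)

# PC1 and PC2 scores for each speaker (the 2D view):
head(three_vowel_pca$x[, 1:2])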

What do the PCs Mean?

  • PC1: decrease in all variables.
  • PC2: decrease in kit and increase in dress and trap.
  • Supplementary variables can help interpretation (see right).
  • Interpretation:
    • PC1: vocal tract length
    • PC2: short front vowel shift
  • We've lost information, but our new description captures deeper phenomena.

5 / 27

Take stock:

  • We've seen:
    • an intuitive method for PCA in 3D space (not the mathematics: we haven't said how variance is maximised),
    • PCA as dimensionality reduction, and
    • PCs as interpretable:
      • both in terms of the relationships between variables the PCs represent, and
      • in terms of supplementary variables.


PCA in General

From 'the textbook':

The central idea of principal component analysis (PCA) is to reduce the dimensionality of a data set consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. This is achieved by transforming to a new set of variables, the principal components (PCs), which are uncorrelated, and which are ordered so that the first few retain most of the variation present in all of the original variables. (Jolliffe, 2002)

  • 'Reduce dimensionality' = we end up with fewer variables.
  • 'Interrelated variables' = there is structure in the correlations between variables.
  • Another intuition: PCA finds structure in the correlation matrix.
  • 'First few': 'few' is not mathematically defined.
  • Often, we want interpretable PCs.
    • PCs can capture meaningful information present, but not immediately obvious, in the original data.
    • PCA is thus an important tool for exploratory data analysis.
  • PCA also useful for hypothesis testing: PCs can be used as independent variables.
6 / 27

The correlation matrix point in more detail:

  • PCs correspond to the eigenvectors of the correlation matrix; the variance each PC explains comes from the corresponding eigenvalue.
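
A quick sketch of this point, using a built-in dataset (any numeric data will do): the eigenvalues of the correlation matrix are the squared standard deviations of the PCs, and the eigenvectors match prcomp's rotation matrix up to sign.

# PCA as eigendecomposition of the correlation matrix.
eig <- eigen(cor(mtcars[, 1:5]))
pca <- prcomp(mtcars[, 1:5], scale = TRUE)
all.equal(eig$values, pca$sdev^2)                        # TRUE
all.equal(abs(eig$vectors), abs(unname(pca$rotation)))   # TRUE (up to sign)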

Worked Example: Patterns in ONZE

NZBC mobile recording unit.

  • Target: explore patterns in the behaviour of monophthongs in the ONZE corpus.
  • A simplification of the analysis in Brand et al. (2021).
    • 100 random speakers, 50 born in 1920 or before, 50 born after 1920.
    • first and second formant data for 10 monophthongs filtered to exclude stopwords, outliers, and formant tracking errors (details).
  • We move from readings of individual vowels to systems of vowels.
  • This is PCA in its exploratory use.

Steps:

  1. preprocessing,
  2. application of PCA, and
  3. interpretation of PCA output.
7 / 27


Libraries

Time for some R:

# We're working in the tidyverse 'dialect'.
library(tidyverse)
# Factoextra used for PCA visualisation
library(factoextra)
# nzilbb.vowels used to share the data and some useful functions.
# Install from GitHub using the following commented line.
# remotes::install_github('https://github.com/JoshuaWilsonBlack/nzilbb_vowels')
library(nzilbb.vowels)
# We will fit GAMM models as part of our preprocessing example
library(mgcv)
library(itsadug)

NB: No special package required for running PCA.

8 / 27


Preprocessing

Doing data analysis, in good mathematics, is simply searching for eigenvectors; all the science of it (the art) is to find the right matrix to diagonalize (Benzécri, cited by Husson)

PCA does this for the correlation matrix. But careful data selection is also required before PCA.

What do we need from our data?

  1. Technical requirements:
    • a row for each speaker, and
    • a column for each variable
      • i.e. two for each vowel (F1 and F2 for each).
  2. Research requirements:
    • patterns which are present across the development of NZE,
    • patterns not explained by other known sources of covariation.

Thoughtful preprocessing is required for meaningful results.

9 / 27

My use of the quote is a bit loose - we will be processing matrices derived from our data.


Raw Data

# onze_vowels comes from the nzilbb.vowels package.
onze_vowels %>%
  head(5)
## speaker vowel F1_50 F2_50 speech_rate gender yob word
## 1 IA_f_065 THOUGHT 514 868 4.3131 F 1891 word_09539
## 2 IA_f_065 FLEECE 395 2716 4.3131 F 1891 word_22664
## 3 IA_f_065 KIT 653 2413 4.3131 F 1891 word_02705
## 4 IA_f_065 DRESS 612 2372 4.3131 F 1891 word_23651
## 5 IA_f_065 GOOSE 445 2037 4.3131 F 1891 word_06222
  1. Technical requirements:
    • Make the data 'wider' so that:
      • each vowel-formant pair has a column,
      • each speaker has a row.
  2. Research requirements:
    • Control for yob (year of birth), speech_rate, and gender.
    • Normalise vowel space to enable across-speaker comparisons.
10 / 27


Preprocessing

After normalisation, we can kill two birds with one stone:

  1. Fit a mixed-effects regression model for each vowel-formant pair, where:
    • sources of unwanted variation are the fixed effects, and
    • there are by-speaker random intercepts.
  2. Extract random intercepts for each speaker from each model.
  3. Create matrix with row for each speaker, and column for each random intercept value.

The result satisfies both the technical and research requirements.

11 / 27

Preprocessing: Normalisation

We apply Lobanov 2.0 normalisation using lobanov_2 from nzilbb.vowels.

onze_ints <- onze_vowels %>%
  # Add F1_lob2 and F2_lob2 columns.
  lobanov_2() %>%
  # Remove F1_50 and F2_50 columns.
  select(-(F1_50:F2_50))
onze_ints %>%
  head(5)
## # A tibble: 5 × 8
## speaker vowel speech_rate gender yob word F1_lob2 F2_lob2
## <fct> <fct> <dbl> <fct> <int> <fct> <dbl> <dbl>
## 1 IA_f_065 THOUGHT 4.31 F 1891 word_09539 -0.708 -2.45
## 2 IA_f_065 FLEECE 4.31 F 1891 word_22664 -1.76 1.41
## 3 IA_f_065 KIT 4.31 F 1891 word_02705 0.518 0.773
## 4 IA_f_065 DRESS 4.31 F 1891 word_23651 0.157 0.688
## 5 IA_f_065 GOOSE 4.31 F 1891 word_06222 -1.32 -0.0111
12 / 27


Preprocessing: Modelling

Step 1: reshape data so we have a row for each model we want to fit.

onze_ints <- onze_ints %>%
  pivot_longer(
    cols = F1_lob2:F2_lob2,      # Select columns with formant data.
    names_to = "formant_type",   # Name column to distinguish F1 & F2.
    values_to = "formant_value"  # Name column for formant values.
  ) %>%
  group_by(vowel, formant_type) %>%
  nest() %>%
  ungroup()
onze_ints %>% head(3)
## # A tibble: 3 × 3
## vowel formant_type data
## <fct> <chr> <list>
## 1 THOUGHT F1_lob2 <tibble [6,942 × 6]>
## 2 THOUGHT F2_lob2 <tibble [6,942 × 6]>
## 3 FLEECE F1_lob2 <tibble [11,896 × 6]>

The data column contains data frames ('tibbles') holding the data for each model.

13 / 27


Preprocessing: Modelling

Step 2: Fit models for each vowel-formant pair and extract random effects.

# GAMM version, fitted with bam() from mgcv.
onze_ints <- onze_ints %>%
  # For each vowel-formant pair, fit a GAMM with a by-speaker random intercept.
  mutate(
    model = map(   # map takes...
      data,        # ... each entry in a column, and ...
      ~ bam(       # ... applies a function to it.
        formant_value ~ gender + s(yob, by = gender)
          + s(speech_rate) + s(speaker, bs = "re"),
        discrete = TRUE, nthreads = 2,
        data = .x  # ... where .x is a pronoun referring to the column entry.
      )
    ),
    # For each model, extract random intercepts and turn them into a dataframe.
    random_intercepts = map(
      model,
      ~ get_random(.x)$`s(speaker)` %>%
        as_tibble(rownames = "speaker") %>%
        rename(intercept = value)
    )
  )
14 / 27
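
The slides describe these as mixed-effects regression models; the bam() call above implements that idea as a GAMM with a by-speaker random intercept. For readers more familiar with lme4, here is a hedged sketch of an equivalent linear mixed model (an assumption on my part, not the model used in the original analysis):

library(lme4)

# Hypothetical lme4 alternative: linear terms instead of smooths,
# with the same by-speaker random intercept.
fit_lmer <- function(df) {
  lmer(formant_value ~ gender + yob + speech_rate + (1 | speaker), data = df)
}
# By-speaker random intercepts, analogous to get_random() above:
# ranef(fit_lmer(df))$speaker %>% as_tibble(rownames = "speaker")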


Preprocessing: Modelling

Step 3: Reshape random intercept data for use in PCA.

# Select the vowel, formant type, and random intercepts columns and 'unnest'.
onze_ints <- onze_ints %>%
  select(vowel, formant_type, random_intercepts) %>%
  unnest(random_intercepts)
onze_ints %>%
  head(5)
## # A tibble: 5 × 4
## vowel formant_type speaker intercept
## <fct> <chr> <chr> <dbl>
## 1 THOUGHT F1_lob2 CC_f_020 -0.114
## 2 THOUGHT F1_lob2 CC_f_084 0.00552
## 3 THOUGHT F1_lob2 CC_f_170 -0.0661
## 4 THOUGHT F1_lob2 CC_f_186 0.284
## 5 THOUGHT F1_lob2 CC_f_210 0.242

Almost there! We just need to reshape this so we have a single row for each speaker, and a column for each vowel-formant pair.

15 / 27

Preprocessing: Modelling

Step 3 (cont): Reshape random intercept data for use in PCA.

onze_ints <- onze_ints %>%
  # Create a column to store our eventual column names.
  mutate(
    # Combine the 'vowel' and 'formant_type' columns as a string.
    vowel_formant = str_c(vowel, '_', formant_type),
    # Remove '_lob2' for cleaner column names.
    vowel_formant = str_replace(vowel_formant, '_lob2', '')
  ) %>%
  # Remove old 'vowel' and 'formant_type' columns.
  select(-c(vowel, formant_type)) %>%
  # Make data 'wider', by...
  pivot_wider(
    names_from = vowel_formant,  # ... naming columns using 'vowel_formant'...
    values_from = intercept      # ... and taking values from the intercept column.
  )
16 / 27


Preprocessing: Modelling

Step 3 (cont): Reshape random intercept data for use in PCA.

What does the data look like now?

onze_ints %>%
  head(5)
## # A tibble: 5 × 21
## speaker THOUGHT_F1 THOUGHT…¹ FLEECE…² FLEEC…³ KIT_F1 KIT_F2 DRESS…⁴ DRESS…⁵
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 CC_f_020 -0.114 -0.331 -0.106 -0.196 0.178 0.145 0.232 -0.0675
## 2 CC_f_084 0.00552 0.202 -0.112 0.167 -0.0660 -0.129 -0.0905 0.0647
## 3 CC_f_170 -0.0661 0.0271 0.270 -0.0465 -0.233 -0.0840 -0.146 -0.101
## 4 CC_f_186 0.284 -0.0704 -0.0491 -0.0812 0.0424 -0.239 0.0361 0.0651
## 5 CC_f_210 0.242 -0.0819 0.00588 -0.0770 -0.162 0.168 0.0635 -0.0108
## # … with 12 more variables: GOOSE_F1 <dbl>, GOOSE_F2 <dbl>, TRAP_F1 <dbl>,
## # TRAP_F2 <dbl>, START_F1 <dbl>, START_F2 <dbl>, STRUT_F1 <dbl>,
## # STRUT_F2 <dbl>, NURSE_F1 <dbl>, NURSE_F2 <dbl>, LOT_F1 <dbl>, LOT_F2 <dbl>,
## # and abbreviated variable names ¹​THOUGHT_F2, ²​FLEECE_F1, ³​FLEECE_F2,
## # ⁴​DRESS_F1, ⁵​DRESS_F2
## # ℹ Use `colnames()` to see all variable names

Yay! A row for each speaker, and a column for each variable.

17 / 27


Preprocessing Tips

  • PCA requires complete observations.
    • i.e. each row has a value for each variable.
  • If we are missing data, we have a few options (sketched after this slide):
    1. delete the column with missing data,
    2. delete the row with missing data, or
    3. impute values.
  • Which option to choose depends on context.
    • e.g. all three were used in Wilson Black et al. (under review):
      1. both foot columns removed,
      2. rows missing more than three values removed (very few), and
      3. mean values imputed for remaining rows.
18 / 27
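
A hedged sketch of the three options in tidyverse style. speaker_data and the FOOT columns are hypothetical; the data in this tutorial has no missing values.

# speaker_data is a hypothetical wide data frame with some NAs.
# 1. Delete columns with missing data (here, hypothetical FOOT columns):
speaker_data <- speaker_data %>% select(-FOOT_F1, -FOOT_F2)
# 2. Delete rows with more than three missing values:
speaker_data <- speaker_data[rowSums(is.na(speaker_data)) <= 3, ]
# 3. Impute column means for any remaining NAs:
speaker_data <- speaker_data %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, mean(.x, na.rm = TRUE))))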

Taking stock:

  • The specific preprocessing steps carried out above were problem-specific.
  • The details are not vital for understanding PCA.
  • At this point all that matters is that we have a row for each speaker and a column for each variable, i.e. we now have a multivariate space to reduce!


Apply PCA

We have a multivariate dataset containing information about 10 New Zealand English monophthongs.

We can now apply PCA to find meaningful patterns in this data:

onze_pca <- onze_ints %>%
  # Remove the speaker column as this is not wanted by PCA.
  select(-speaker) %>%
  # We scale the variables (more on this in a moment).
  prcomp(scale = TRUE)
  • prcomp is the recommended function for PCA (rather than princomp).
  • For historical reasons, prcomp does not scale variables by default.
  • But variable scaling is pretty much always recommended.
    • Why? Because variables with higher variance would otherwise dominate those with lower variance (see the sketch below).

What now?

  1. How many PCs should we look at?
  2. What do the PCs mean?
19 / 27
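
A small sketch of what scale = TRUE does: it standardises each column to mean 0 and sd 1, so pre-scaling by hand gives the same components.

# prcomp(scale = TRUE) standardises each column first; doing it by
# hand with scale() gives the same components (up to sign).
manual_pca <- onze_ints %>%
  select(-speaker) %>%
  scale() %>%
  prcomp()
all.equal(abs(manual_pca$rotation), abs(onze_pca$rotation))  # TRUE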


How Many PCs?: Scree Plots

fviz_eig(onze_pca)

  • The fviz_eig function from factoextra plots the variance explained by each PC.
  • Each PC explains less (or equal) variance than the previous PC.
  • The scree plot is used to select which PCs we will use.
  • How to use it: two options:
    1. use a cut-off value (e.g. 10%), or
    2. 'the elbow rule': ignore PCs after the 'elbow' in the scree plot.
  • In this case, both methods can motivate selecting the first two PCs.
  • The second method might also allow you to pick the first six PCs.
20 / 27

One always picks the first n PCs in standard use of PCA.

The function name fviz_eig hints at what is under the hood: these are the eigenvalues of the correlation matrix.

The function produces ggplot objects, which can be modified in the usual ggplot way, e.g. + labs(...).
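
A sketch of that last point:

# fviz_eig() returns a ggplot object, so the usual layers apply:
fviz_eig(onze_pca) +
  labs(title = "Variance explained by each PC")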


Interpretation: Variable Plots

fviz_pca_var(
  onze_pca, repel = TRUE,
  select.var = list(cos2 = 10)
)

  • Variable plots show the relationship between the original variables and the selected PCs (here, PC1 and PC2).
  • Positives: you get a sense of the PC space as a whole, and of relations between variables which might be invisible from a single PC.
  • Negatives: they can be very messy!
  • Some technical points:
    • This plot restricts the variables shown to the 10 best represented by PC1 and PC2 (select.var = list(cos2 = 10)).
    • repel = TRUE prevents labels from overlapping.
    • If you want to look at, e.g., PC3 and PC4, use axes = c(3, 4) (sketched below).
21 / 27

PC1 seems to match Brand et al. (2021)'s PC2, while our PC2 captures the DRESS F2 and GOOSE F2 relationship found in Brand's PC3 and PC1. Let's wait for the contribution plots.
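
A small usage sketch of the axes argument mentioned above:

# Plot PC3 against PC4 instead of the default PC1 and PC2:
fviz_pca_var(
  onze_pca, axes = c(3, 4),
  repel = TRUE, select.var = list(cos2 = 10)
)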

Interpretation: Contribution Plots

pca_contrib_plot(
  onze_pca, pc_no = 1, cutoff = 50
)

  • This kind of plot was used to interpret PCs in Brand et al. (2021).
    • We interpret the PC with reference to the variables which together explain 50% of its variance.
    • Here we are given the percentage contribution of each original variable, and their relationship to one another.
    • NB: the signs can be flipped without affecting the meaning of the PCs!
    • Don't let yourself be bullied by strict cutoffs, but they can help to focus interpretation.
22 / 27

Interpretation: Contribution Plots

pca_contrib_plot(
  onze_pca, pc_no = 2, cutoff = 50
)

  • PC1 has loadings which slowly trail off.
  • Often it is easier to interpret PCs which involve only a handful of original variables.
  • We avoid getting into the weeds with these particular plots!
  • The point: PCs can be interpreted one-by-one as in these plots, or together as in the variable plots.
23 / 27


Interpretation: Theory

  • The standard deviations of the PCs (in onze_pca$sdev) are the square roots of the eigenvalues of the correlation matrix.
    • To turn them into percentages of variance explained: onze_pca$sdev^2 / sum(onze_pca$sdev^2) * 100.
  • Interpretation of PCs usually focuses on two quantities:
    1. the loadings: the contribution of each original variable to a given PC, and
    2. the scores: the position of individuals in PC space.
  • It can be useful to see these directly:
    • loadings can be accessed by adding $rotation to the name of your PCA object (e.g. onze_pca$rotation), and
    • scores by adding $x (e.g. onze_pca$x).
  • Scores can be useful as either response or predictor in regression models (sketched below).
onze_pca$rotation[1:5, 1:5]
## PC1 PC2 PC3 PC4 PC5
## THOUGHT_F1 0.3572963 -0.04267274 0.08444889 -0.14809462 0.34717131
## THOUGHT_F2 0.1412604 -0.31325372 0.03986780 0.16732278 0.31588925
## FLEECE_F1 -0.3153993 -0.16235352 -0.21469142 -0.32770244 -0.04562523
## FLEECE_F2 0.3106982 -0.05124337 -0.37692977 0.18536913 -0.05316580
## KIT_F1 -0.1130919 0.05917746 0.45148138 0.04650166 -0.17518187
24 / 27
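
To sketch that last point: PC scores can be attached back to speakers and regressed on, e.g., year of birth. The rejoin to speaker metadata is a workflow assumption on my part, not part of the slides.

# Hedged sketch: PC1 scores as the response in a regression.
speaker_scores <- onze_ints %>%
  select(speaker) %>%
  bind_cols(as_tibble(onze_pca$x))

speaker_meta <- onze_vowels %>%
  distinct(speaker, yob, gender) %>%
  mutate(speaker = as.character(speaker))

pc1_model <- speaker_scores %>%
  left_join(speaker_meta, by = "speaker") %>%
  lm(PC1 ~ yob + gender, data = .)
summary(pc1_model)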

Testing: Permutation

permutation_results <- permutation_test(
  onze_ints %>% select(-speaker),
  pc_n = 5, n = 100
)
plot_permutation_test(permutation_results)

25 / 27

Testing: Permutation

  • Permutation offers a way to discern whether there are real patterns in your data.
  • The method: each variable is shuffled independently (see the sketch below).
  • Left panel: the number of significant pairwise correlations in the data.
    • Remember: PCA as investigation of the correlation matrix.
  • Right panel: the percentage of variance explained by each PC in the permuted and actual data.
  • If n is large, a permutation test can take a long time.
26 / 27
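
What 'shuffled independently' means, as a minimal sketch (the real test is permutation_test from nzilbb.vowels):

# One permuted copy of the data: sample() shuffles each column
# separately, destroying between-variable correlations while keeping
# each variable's own distribution.
permuted <- onze_ints %>%
  select(-speaker) %>%
  mutate(across(everything(), sample))
# Variance explained per PC on permuted data, for comparison:
prcomp(permuted, scale = TRUE)$sdev^2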


Summary

  • PCA is an important tool for exploratory data analysis in sociolinguistics.
  • It is well suited to 'quantitative abstraction': we move from lots of related variables to structural variation between them.

  • We've covered:

    1. the geometric intuition behind PCA,
    2. pre-PCA data processing,
    3. the application of PCA with prcomp,
    4. visual interpretation of PCA with factoextra and nzilbb.vowels,
    5. how to access loadings, scores, and variances explained, and
    6. testing for meaningful structure with permutation tests.
27 / 27
