What is PCA?

- What is PCA?
- A Worked Example in R (`prcomp`, `factoextra`)
- When is PCA Appropriate?
Data: mean first formant values in Hz for DRESS, KIT and TRAP from 100 random ONZE speakers.
(Intuitive) PCA:
- find the centre of the cloud,
- draw the line which stays in the cloud for the longest (PC1),
- draw the line at right angles to PC1 which stays in the cloud for the longest (PC2).
We can plot our original three-variable data against PC1 and PC2 (see right).
Note that this is not that impressive for three dimensions, but it is great for 20 or 30!
PCA can allow us to see patterns in the data which would not otherwise be visible to creatures like us!
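As a minimal sketch of this intuition (using simulated data in place of the actual ONZE formant means; the variable names and values here are invented for illustration):

```r
library(tidyverse)

set.seed(1)
# Simulate mean F1 values (Hz) for three vowels across 100 speakers,
# with some built-in correlation between the variables.
f1_dress <- rnorm(100, mean = 600, sd = 50)
sim <- tibble(
  dress = f1_dress,
  kit   = 0.8 * f1_dress + rnorm(100, sd = 30),
  trap  = 0.5 * f1_dress + rnorm(100, sd = 40)
)

# prcomp centres the cloud and finds the orthogonal directions of
# greatest variance (PC1, then PC2 at right angles to it).
sim_pca <- prcomp(sim, scale = TRUE)

# Plot the speakers against PC1 and PC2.
as_tibble(sim_pca$x) %>%
  ggplot(aes(x = PC1, y = PC2)) +
  geom_point()
```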
Take stock:
From 'the textbook':
The central idea of principal component analysis (PCA) is to reduce the dimensionality of a data set consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. This is achieved by transforming to a new set of variables, the principal components (PCs), which are uncorrelated, and which are ordered so that the first few retain most of the variation present in all of the original variables. (Jolliffe, 2002)
The correlation matrix point in more detail:
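As a quick illustration (reusing the simulated data from the sketch above), the correlation matrix is easy to inspect directly:

```r
# Pairwise correlations between the three simulated variables:
# this structure of correlations is what PCA exploits.
cor(sim) %>% round(2)
```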
Image: NZBC mobile recording unit.
Steps:
Time for some R:

```r
# We're working in the tidyverse 'dialect'.
library(tidyverse)

# factoextra is used for PCA visualisation.
library(factoextra)

# nzilbb.vowels is used to share the data and some useful functions.
# Install via GitHub using the following commented line.
# remotes::install_github('https://github.com/JoshuaWilsonBlack/nzilbb_vowels')
library(nzilbb.vowels)

# We will fit GAMM models as part of our preprocessing example.
library(mgcv)
library(itsadug)
```
NB: No special package required for running PCA.
Doing data analysis, in good mathematics, is simply searching for eigenvectors; all the science of it (the art) is to find the right matrix to diagonalize (Benzécri, cited by Husson)
PCA does this for the correlation matrix. But careful data selection is also required before PCA.
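A small sketch of the connection, again using the simulated data from above: the loadings `prcomp` returns are eigenvectors of the correlation matrix (possibly with flipped signs), and the eigenvalues are the PC variances.

```r
# Diagonalise the correlation matrix directly...
eig <- eigen(cor(sim))
eig$vectors  # ...eigenvectors match prcomp(sim, scale = TRUE)$rotation
eig$values   # ...eigenvalues match prcomp(sim, scale = TRUE)$sdev^2
```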
What do we need from our data?
Thoughtful preprocessing is required for meaningful results.
My use of the quote is a bit loose: we will be processing matrices derived from our data.
```r
# onze_vowels comes from the nzilbb.vowels package.
onze_vowels %>% head(5)
```

```
##    speaker   vowel F1_50 F2_50 speech_rate gender  yob       word
## 1 IA_f_065 THOUGHT   514   868      4.3131      F 1891 word_09539
## 2 IA_f_065  FLEECE   395  2716      4.3131      F 1891 word_22664
## 3 IA_f_065     KIT   653  2413      4.3131      F 1891 word_02705
## 4 IA_f_065   DRESS   612  2372      4.3131      F 1891 word_23651
## 5 IA_f_065   GOOSE   445  2037      4.3131      F 1891 word_06222
```
We also have `yob` (year of birth), `speech_rate`, and `gender`.

After normalisation, we can kill two birds with one stone:
The result satisfies both the technical and research requirements.
We apply Lobanov 2.0 normalisation using `lobanov_2()` from `nzilbb.vowels`.
```r
onze_ints <- onze_vowels %>%
  # Add F1_lob2 and F2_lob2 columns.
  lobanov_2() %>%
  # Remove F1_50 and F2_50 columns.
  select(-(F1_50:F2_50))

onze_ints %>% head(5)
```

```
## # A tibble: 5 × 8
##   speaker  vowel   speech_rate gender   yob word       F1_lob2 F2_lob2
##   <fct>    <fct>         <dbl> <fct>  <int> <fct>        <dbl>   <dbl>
## 1 IA_f_065 THOUGHT        4.31 F       1891 word_09539  -0.708 -2.45
## 2 IA_f_065 FLEECE         4.31 F       1891 word_22664  -1.76   1.41
## 3 IA_f_065 KIT            4.31 F       1891 word_02705   0.518  0.773
## 4 IA_f_065 DRESS          4.31 F       1891 word_23651   0.157  0.688
## 5 IA_f_065 GOOSE          4.31 F       1891 word_06222  -1.32  -0.0111
```
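As a quick sanity check (not part of the original workflow): Lobanov-style normalisation is speaker-intrinsic, so each speaker's normalised values should sit near zero on average.

```r
# Mean normalised F1 and F2 by speaker: values should be close to 0.
onze_ints %>%
  group_by(speaker) %>%
  summarise(across(F1_lob2:F2_lob2, mean)) %>%
  head(3)
```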
Step 1: Reshape the data so we have a row for each model we want to fit.
```r
onze_ints <- onze_ints %>%
  pivot_longer(
    cols = F1_lob2:F2_lob2,     # Select columns with formant data.
    names_to = "formant_type",  # Name column to distinguish F1 & F2.
    values_to = "formant_value" # Name column for formant values.
  ) %>%
  group_by(vowel, formant_type) %>%
  nest() %>%
  ungroup()

onze_ints %>% head(3)
```

```
## # A tibble: 3 × 3
##   vowel   formant_type data
##   <fct>   <chr>        <list>
## 1 THOUGHT F1_lob2      <tibble [6,942 × 6]>
## 2 THOUGHT F2_lob2      <tibble [6,942 × 6]>
## 3 FLEECE  F1_lob2      <tibble [11,896 × 6]>
```
The `data` column contains data frames ('tibbles') with the data for each model.
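To see what one of these nested tibbles holds, we can pull a single entry out of the list column (a quick check, not part of the pipeline):

```r
# The first entry of 'data' is the tibble of THOUGHT F1 observations.
onze_ints$data[[1]] %>% head(3)
```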
Step 2: Fit models for each vowel-formant pair and extract random effects.
```r
# gam version
onze_ints <- onze_ints %>%
  mutate(
    # For each vowel-formant pair, fit a GAMM.
    model = map(    # map takes...
      data,         # ...each entry in a column, and...
      ~ bam(        # ...applies a function to it.
        formant_value ~ gender +
          s(yob, by = gender) +
          s(speech_rate) +
          s(speaker, bs = "re"),
        discrete = TRUE,
        nthreads = 2,
        data = .x   # ...where .x is a pronoun referring to the column entry.
      )
    ),
    # For each model, extract random effects and turn them into a data frame.
    random_intercepts = map(
      model,
      ~ get_random(.x)$`s(speaker)` %>%
        as_tibble(rownames = "speaker") %>%
        rename(intercept = value)
    )
  )
```
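Before moving on, one might sanity-check a single fit, e.g. the first model in the list column (a quick sketch, not part of the original workflow):

```r
# Summarise the first fitted GAMM (THOUGHT F1).
summary(onze_ints$model[[1]])

# itsadug offers inspection helpers, e.g. a smooth plot over year of
# birth, with random effects removed.
plot_smooth(
  onze_ints$model[[1]],
  view = "yob",
  plot_all = "gender",
  rm.ranef = TRUE
)
```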
Step 3: Reshape random intercept data for use in PCA.
```r
# Select the vowel, formant type, and random intercepts columns and 'unnest'.
onze_ints <- onze_ints %>%
  select(vowel, formant_type, random_intercepts) %>%
  unnest(random_intercepts)

onze_ints %>% head(5)
```

```
## # A tibble: 5 × 4
##   vowel   formant_type speaker  intercept
##   <fct>   <chr>        <chr>        <dbl>
## 1 THOUGHT F1_lob2      CC_f_020 -0.114
## 2 THOUGHT F1_lob2      CC_f_084  0.00552
## 3 THOUGHT F1_lob2      CC_f_170 -0.0661
## 4 THOUGHT F1_lob2      CC_f_186  0.284
## 5 THOUGHT F1_lob2      CC_f_210  0.242
```
Almost there! We just need to reshape this so we have a single row for each speaker, and a column for each vowel-formant pair.
Step 3 (cont): Reshape random intercept data for use in PCA.
```r
onze_ints <- onze_ints %>%
  # Create a column to store our eventual column names.
  mutate(
    # Combine the 'vowel' and 'formant_type' columns as a string.
    vowel_formant = str_c(vowel, '_', formant_type),
    # Remove '_lob2' for cleaner column names.
    vowel_formant = str_replace(vowel_formant, '_lob2', '')
  ) %>%
  # Remove old 'vowel' and 'formant_type' columns.
  select(-c(vowel, formant_type)) %>%
  # Make data 'wider', by...
  pivot_wider(
    names_from = vowel_formant, # ...naming columns using 'vowel_formant'...
    values_from = intercept    # ...and taking values from the intercept column.
  )
```
What does the data look like now?

```r
onze_ints %>% head(5)
```

```
## # A tibble: 5 × 21
##   speaker  THOUGHT_F1 THOUGHT…¹ FLEECE…² FLEEC…³  KIT_F1  KIT_F2 DRESS…⁴ DRESS…⁵
##   <chr>         <dbl>     <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 CC_f_020   -0.114     -0.331  -0.106   -0.196   0.178   0.145   0.232  -0.0675
## 2 CC_f_084    0.00552    0.202  -0.112    0.167  -0.0660 -0.129  -0.0905  0.0647
## 3 CC_f_170   -0.0661     0.0271  0.270   -0.0465 -0.233  -0.0840 -0.146  -0.101
## 4 CC_f_186    0.284     -0.0704 -0.0491  -0.0812  0.0424 -0.239   0.0361  0.0651
## 5 CC_f_210    0.242     -0.0819  0.00588 -0.0770 -0.162   0.168   0.0635 -0.0108
## # … with 12 more variables: GOOSE_F1 <dbl>, GOOSE_F2 <dbl>, TRAP_F1 <dbl>,
## #   TRAP_F2 <dbl>, START_F1 <dbl>, START_F2 <dbl>, STRUT_F1 <dbl>,
## #   STRUT_F2 <dbl>, NURSE_F1 <dbl>, NURSE_F2 <dbl>, LOT_F1 <dbl>, LOT_F2 <dbl>,
## #   and abbreviated variable names ¹THOUGHT_F2, ²FLEECE_F1, ³FLEECE_F2,
## #   ⁴DRESS_F1, ⁵DRESS_F2
## # ℹ Use `colnames()` to see all variable names
```
Yay! A row for each speaker, and a column for each variable.
Taking stock:
We have a multivariate dataset containing information about 10 New Zealand English monophthongs.
We can now apply PCA to find meaningful patterns in this data:
We have a multivariate dataset containing information about 10 New Zealand English monophthongs.

We can now apply PCA to find meaningful patterns in this data:

```r
onze_pca <- onze_ints %>%
  # Remove the speaker column as this is not wanted by PCA.
  select(-speaker) %>%
  # We scale the variables (more on this in a moment).
  prcomp(scale = TRUE)
```

`prcomp` is the recommended function for PCA (rather than `princomp`). Note that `prcomp` does not scale variables by default.
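Since scaling matters, a quick comparison sketch: without `scale = TRUE`, variables with larger variances dominate the leading PCs.

```r
# Proportion of variance explained by the first three PCs,
# with and without scaling the variables.
unscaled_pca <- onze_ints %>%
  select(-speaker) %>%
  prcomp()

summary(unscaled_pca)$importance[2, 1:3]
summary(onze_pca)$importance[2, 1:3]
```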
What now?

```r
fviz_eig(onze_pca)
```
The `fviz_eig` function from `factoextra` plots the variance explained by each PC. One always picks the first n PCs in standard use of PCA.
Function name `fviz_eig`: under the hood, these are eigenvalues of matrices.
The function produces ggplot objects, which can be modified in the usual `ggplot` way, e.g. `+ labs(...)`.
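For instance, to relabel the scree plot:

```r
# fviz_eig returns a ggplot object, so standard ggplot layers apply.
fviz_eig(onze_pca) +
  labs(
    title = "Variance explained by each PC",
    y = "Percentage of explained variance"
  )
```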
```r
fviz_pca_var(
  onze_pca,
  repel = TRUE,
  select.var = list(cos2 = 10)
)
```
- Only the 10 variables best represented by the plotted PCs are shown (`select.var = list(cos2 = 10)`).
- `repel = TRUE` prevents labels from overlapping.
- Other PCs can be plotted by setting, e.g., `axes = c(3, 4)` (see the sketch after this list).

PC1 seems to match Brand et al. (2021)'s PC2, while our PC2 captures the DRESS F2 and GOOSE F2 relationship from Brand's PC3 and PC1. Let's wait for the contribution plots.
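The sketch referenced above, plotting PC3 and PC4 instead:

```r
# Variable plot for the third and fourth PCs, again showing only the
# ten best-represented variables.
fviz_pca_var(
  onze_pca,
  axes = c(3, 4),
  repel = TRUE,
  select.var = list(cos2 = 10)
)
```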
```r
pca_contrib_plot(onze_pca, pc_no = 1, cutoff = 50)
pca_contrib_plot(onze_pca, pc_no = 2, cutoff = 50)
```
- The standard deviations of the PCs are available as `onze_pca$sdev`.
- Convert these to percentage of variance explained with `onze_pca$sdev^2 / sum(onze_pca$sdev^2) * 100`.
- The loadings are accessed by adding `$rotation` to the name of your PCA object (e.g. `onze_pca$rotation`).
- The scores are accessed by adding `$x` to the name of your PCA object (e.g. `onze_pca$x`).
```r
onze_pca$rotation[1:5, 1:5]
```

```
##                   PC1         PC2         PC3         PC4         PC5
## THOUGHT_F1  0.3572963 -0.04267274  0.08444889 -0.14809462  0.34717131
## THOUGHT_F2  0.1412604 -0.31325372  0.03986780  0.16732278  0.31588925
## FLEECE_F1  -0.3153993 -0.16235352 -0.21469142 -0.32770244 -0.04562523
## FLEECE_F2   0.3106982 -0.05124337 -0.37692977  0.18536913 -0.05316580
## KIT_F1     -0.1130919  0.05917746  0.45148138  0.04650166 -0.17518187
```
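The scores and variance percentages can be pulled out the same way (a quick sketch):

```r
# Scores: each speaker's position on the first two PCs.
onze_pca$x[1:5, 1:2]

# Percentage of variance explained by each PC.
round(onze_pca$sdev^2 / sum(onze_pca$sdev^2) * 100, 1)
```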
```r
permutation_results <- permutation_test(
  onze_ints %>% select(-speaker),
  pc_n = 5,
  n = 100
)

plot_permutation_test(permutation_results)
```
Depending on `n`, a permutation test can take a long time.

PCA is well suited to 'quantitative abstraction': we move from lots of related variables to structural variation between them.
We've covered:
- running PCA with `prcomp`,
- visualising the results with `factoextra` and `nzilbb.vowels`.