class: top, title-slide

.title[
# PCA in Sociolinguistics: A Tutorial
]
.author[
### Joshua Wilson Black
]
.date[
### 27 July 2022
]

---

# Overview

1. What is PCA?
  1. A simple example in three dimensions
  2. A more general description

--

2. A Worked Example in R
  1. Preparing data
  2. Applying PCA with `prcomp`
  3. Interpreting PCA output (via `factoextra`)

--

3. When is PCA Appropriate?
  - Rules of thumb
  - Examples from linguistics literature

---

# PCA with Three Variables
Data: Mean first formant values in Hz for <span style="font-variant: small-caps;">dress, kit</span> and <span style="font-variant: small-caps;">trap</span> from 100 random [ONZE](https://www.canterbury.ac.nz/nzilbb/research/onze/) speakers.

???

1. find the centre of the cloud,
2. draw the line which stays in the cloud for the longest (PC1),
3. draw the line _at right angles to PC1_ which stays in the cloud for the longest (PC2)

---

# From 3D to 2D

.pull-left[
- (Intuitive) PCA:
  1. Find the centre,
  2. draw the line which captures maximum variance (PC1),
  3. draw the line orthogonal to PC1 which captures maximum variance, and
  4. repeat until you run out of options!
- We can plot our original three-variable data against PC1 and PC2 (see right).
]

.pull-right[
<img src="pca_presentation_files/figure-html/unnamed-chunk-3-1.png" width="100%" />
]

???

Note that this is not that impressive for three dimensions, but it is _great_ for 20 or 30! PCA can allow us to see patterns in the data which would not otherwise be visible to creatures like us!

---

# What do the PCs Mean?

.pull-left[
- PC1: decrease in all variables.
- PC2: decrease in .sc[kit] and increase in .sc[dress] and .sc[trap].
- _Supplementary variables_ can help interpretation (see right).
- Interpretation:
  - PC1: vocal tract length
  - PC2: short front vowel shift
- We've lost information, but our new description captures deeper phenomena.
]

.pull-right[
<img src="pca_presentation_files/figure-html/unnamed-chunk-4-1.png" width="100%" />
]

???

Take stock:
- We've seen:
  - a method for PCA in 3D space,
    - Not _mathematical_: haven't told you _how_ to maximise variance
  - PCA as _dimensionality reduction_
  - PCs as _interpretable_
    - both in terms of the relationship between variables PCs represent _and_
    - in terms of supplementary variables.

---

# PCA in General

**From 'the textbook':**

> The central idea of principal component analysis (PCA) is to reduce the dimensionality of a data set consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. This is achieved by transforming to a new set of variables, the principal components (PCs), which are uncorrelated, and which are ordered so that the first _few_ retain most of the variation present in _all_ of the original variables. [(Jolliffe, 2002)](https://doi.org/10.1007/b98835)

--

- 'Reduce dimensionality' = we end up with fewer variables.
- 'Interrelated variables' = there is structure in the correlations between variables.
  - Another intuition: PCA finds structure in the correlation matrix.
- 'First few': _few_ is not mathematically defined.
  - Often, we want _interpretable_ PCs.
- PCs can capture _meaningful information_ present, but not immediately obvious, in the original data.
  - PCA is thus an important tool for _exploratory data analysis_.
- PCA is also useful for hypothesis testing: PCs can be used as independent variables.

???

Correlation matrix point in more detail:
- PCs correspond to eigenvectors of the correlation matrix; the variance each PC explains is the corresponding eigenvalue.
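---

# The Three-Variable Example in Code

A minimal sketch of the opening example using `prcomp`, assuming `library(tidyverse)` and `library(nzilbb.vowels)` (both loaded properly in the worked example below). The exact values needn't match the figures above, which use their own sample and plotting code.

```r
# Sketch: PCA on mean F1 of DRESS, KIT, and TRAP per speaker.
toy_means <- onze_vowels %>%
  filter(vowel %in% c("DRESS", "KIT", "TRAP")) %>%
  group_by(speaker, vowel) %>%
  summarise(F1 = mean(F1_50), .groups = "drop") %>%
  pivot_wider(names_from = vowel, values_from = F1)

toy_pca <- toy_means %>%
  select(-speaker) %>%
  prcomp(scale = TRUE)

# If the interpretation above holds, PC1 should weight all three vowels
# with the same sign, and PC2 should oppose KIT to DRESS and TRAP.
toy_pca$rotation
```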
---

# Worked Example: Patterns in ONZE

.left-column[
![](images/mobileunit.png)
.tiny[NZBC mobile recording unit.]
]

.right-column[
- **Target:** explore patterns in the behaviour of monophthongs in the ONZE corpus.
  - A simplification of the analysis in [Brand et al. (2021)](https://doi.org/10.1016/j.wocn.2021.101096).
- 100 random speakers, 50 born in 1920 or before, 50 born after 1920.
- First and second formant data for 10 monophthongs, filtered to exclude stopwords, outliers, and formant tracking errors ([details](https://osf.io/zeb3g)).
- We move from readings of _individual vowels_ to _systems of vowels_.
- This is PCA in its _exploratory_ use.

**Steps:**
1. preprocessing,
2. application of PCA, and
3. interpretation of PCA output.
]

---

# Libraries

**Time for some R:**

```r
# We're working in the tidyverse 'dialect'.
library(tidyverse)

# factoextra is used for PCA visualisation.
library(factoextra)

# nzilbb.vowels is used to share the data and some useful functions.
# Install via GitHub using the following commented line.
# remotes::install_github('https://github.com/JoshuaWilsonBlack/nzilbb_vowels')
library(nzilbb.vowels)

# We will fit GAMM models as part of our preprocessing example.
library(mgcv)
library(itsadug)
```

--

**NB:** No special package is required for running PCA.

---

# Preprocessing

> Doing data analysis, in good mathematics, is simply searching for eigenvectors; all the science of it (the art) is to find the right matrix to diagonalize (Benzécri, cited by [Husson](https://www.youtube.com/playlist?list=PLnZgp6epRBbRX8TEp1HlFGqfMf_AxYEj7))

--

PCA does this for the correlation matrix. But careful data selection is also required _before_ PCA. What do we need from our data?

--

1. Technical requirements:
  - a row for each speaker, and
  - a column for each variable
    - i.e. two for each vowel (F1 and F2 for each).
2. Research requirements:
  - patterns which are present _across_ the development of NZE,
  - patterns not explained by other known sources of covariation.

--

**Thoughtful preprocessing is required for meaningful results.**

???

My use of the quote is a bit loose - we will be processing matrices _derived from_ our data.

---

# Raw Data

```r
# onze_vowels comes from the nzilbb.vowels package.
onze_vowels %>% head(5)
```

```
##    speaker   vowel F1_50 F2_50 speech_rate gender  yob       word
## 1 IA_f_065 THOUGHT   514   868      4.3131      F 1891 word_09539
## 2 IA_f_065  FLEECE   395  2716      4.3131      F 1891 word_22664
## 3 IA_f_065     KIT   653  2413      4.3131      F 1891 word_02705
## 4 IA_f_065   DRESS   612  2372      4.3131      F 1891 word_23651
## 5 IA_f_065   GOOSE   445  2037      4.3131      F 1891 word_06222
```

--

1. Technical requirements:
  - Make the data 'wider' so that:
    - each vowel-formant pair has a column,
    - each speaker has a row.
2. Research requirements:
  - Control for `yob` (year of birth), `speech_rate`, and `gender`.
  - Normalise vowel space to enable across-speaker comparisons.

---

# Preprocessing

After normalisation, we can kill two birds with one stone:

--

1. Fit a mixed-effects regression model for each vowel-formant pair, where:
  - sources of unwanted variation are the fixed effects, and
  - there are by-speaker random intercepts.
2. Extract the random intercepts for each speaker from each model.
  - For exploitation of by-speaker random effects see [Drager and Hay (2012)](https://doi.org/10.1017/S0954394512000014).
3. Create a matrix with a row for each speaker and a column for each random intercept value.

--

The result satisfies both the technical and research requirements.
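---

# Aside: Classic Lobanov Normalisation

For intuition: classic Lobanov normalisation is just a per-speaker z-score of each formant. A hand-rolled sketch (illustrative only; on the next slide we use `lobanov_2`, the modified version from Brand et al. (2021)):

```r
# Sketch: classic Lobanov normalisation as per-speaker z-scores.
# Not the version used below; shown only to build intuition.
onze_lob1 <- onze_vowels %>%
  group_by(speaker) %>%
  mutate(
    F1_lob = (F1_50 - mean(F1_50)) / sd(F1_50),
    F2_lob = (F2_50 - mean(F2_50)) / sd(F2_50)
  ) %>%
  ungroup()
```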
---

# Preprocessing: Normalisation

We apply Lobanov 2.0 normalisation using `lobanov_2` from `nzilbb.vowels`.

```r
onze_ints <- onze_vowels %>%
  # Add F1_lob2 and F2_lob2 columns
  lobanov_2() %>%
  # Remove F1_50 and F2_50 columns
  select(-(F1_50:F2_50))

onze_ints %>% head(5)
```

```
## # A tibble: 5 × 8
##   speaker  vowel   speech_rate gender   yob word       F1_lob2 F2_lob2
##   <fct>    <fct>         <dbl> <fct>  <int> <fct>        <dbl>   <dbl>
## 1 IA_f_065 THOUGHT        4.31 F       1891 word_09539  -0.708 -2.45
## 2 IA_f_065 FLEECE         4.31 F       1891 word_22664  -1.76    1.41
## 3 IA_f_065 KIT            4.31 F       1891 word_02705   0.518   0.773
## 4 IA_f_065 DRESS          4.31 F       1891 word_23651   0.157   0.688
## 5 IA_f_065 GOOSE          4.31 F       1891 word_06222  -1.32   -0.0111
```

---

# Preprocessing: Modelling

**Step 1:** reshape the data so we have a row for each model we want to fit.

--

```r
onze_ints <- onze_ints %>%
  pivot_longer(
    cols = F1_lob2:F2_lob2,     # Select columns with formant data.
    names_to = "formant_type",  # Name column to distinguish F1 & F2.
    values_to = "formant_value" # Name column for formant values.
  ) %>%
  group_by(vowel, formant_type) %>%
  nest() %>%
  ungroup()

onze_ints %>% head(3)
```

```
## # A tibble: 3 × 3
##   vowel   formant_type data
##   <fct>   <chr>        <list>
## 1 THOUGHT F1_lob2      <tibble [6,942 × 6]>
## 2 THOUGHT F2_lob2      <tibble [6,942 × 6]>
## 3 FLEECE  F1_lob2      <tibble [11,896 × 6]>
```

--

Column `data` contains data frames ('tibbles') with the data for each model.

---

# Preprocessing: Modelling

**Step 2:** Fit models for each vowel-formant pair and extract random effects.

--

```r
# gam version
onze_ints <- onze_ints %>%
  # For each vowel-formant pair, fit a GAMM with bam().
  mutate(
    model = map( # map takes...
      data,      # ... each entry in a column, and ...
      ~ bam(     # applies a function to it.
        formant_value ~ gender + s(yob, by=gender) + s(speech_rate) +
          s(speaker, bs="re"),
        discrete = TRUE,
        nthreads = 2,
        data = .x # where .x is a pronoun referring to the column entry.
      )
    ),
    # For each model, extract random effects and turn them into a dataframe
    random_intercepts = map(
      model,
      ~ get_random(.x)$`s(speaker)` %>%
        as_tibble(rownames = "speaker") %>%
        rename(intercept = value)
    )
  )
```

---

# Preprocessing: Modelling

**Step 3:** Reshape the random intercept data for use in PCA.

--

```r
# Select the vowel, formant type, and random intercepts columns and 'unnest'.
onze_ints <- onze_ints %>%
  select(vowel, formant_type, random_intercepts) %>%
  unnest(random_intercepts)

onze_ints %>% head(5)
```

```
## # A tibble: 5 × 4
##   vowel   formant_type speaker  intercept
##   <fct>   <chr>        <chr>        <dbl>
## 1 THOUGHT F1_lob2      CC_f_020  -0.114
## 2 THOUGHT F1_lob2      CC_f_084   0.00552
## 3 THOUGHT F1_lob2      CC_f_170  -0.0661
## 4 THOUGHT F1_lob2      CC_f_186   0.284
## 5 THOUGHT F1_lob2      CC_f_210   0.242
```

--

Almost there! We just need to reshape this so we have a single row for each speaker, and a column for each vowel-formant pair.

---

# Preprocessing: Modelling

**Step 3 (cont):** Reshape the random intercept data for use in PCA.

```r
onze_ints <- onze_ints %>%
  # Create a column to store our eventual column names.
  mutate(
    # Combine the 'vowel' and 'formant_type' columns as a string.
    vowel_formant = str_c(vowel, '_', formant_type),
    # Remove '_lob2' for cleaner column names.
    vowel_formant = str_replace(vowel_formant, '_lob2', '')
  ) %>%
  # Remove old 'vowel' and 'formant_type' columns.
  select(-c(vowel, formant_type)) %>%
  # Make the data 'wider', by...
  pivot_wider(
    names_from = vowel_formant, # naming columns using 'vowel_formant'...
    values_from = intercept     # ... and taking values from the intercept column.
  )
```
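---

# Preprocessing: Sanity Checks

Before inspecting the result, a quick sketch of checks worth running after a reshape like this (not part of the original pipeline; the expected count assumes 10 vowels × 2 formants plus the speaker column):

```r
# One row per speaker?
stopifnot(nrow(onze_ints) == n_distinct(onze_ints$speaker))

# One column per vowel-formant pair, plus the speaker column?
ncol(onze_ints) # expect 21 here.

# PCA will require complete observations (more on this shortly):
stopifnot(!anyNA(onze_ints))
```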
---

# Preprocessing: Modelling

**Step 3 (cont):** Reshape the random intercept data for use in PCA.

--

What does the data look like now?

--

```r
onze_ints %>% head(5)
```

```
## # A tibble: 5 × 21
##   speaker  THOUGHT_F1 THOUGHT…¹ FLEECE…² FLEEC…³  KIT_F1  KIT_F2 DRESS…⁴ DRESS…⁵
##   <chr>         <dbl>     <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 CC_f_020   -0.114     -0.331   -0.106  -0.196   0.178   0.145   0.232  -0.0675
## 2 CC_f_084    0.00552    0.202   -0.112   0.167  -0.0660 -0.129  -0.0905  0.0647
## 3 CC_f_170   -0.0661     0.0271   0.270  -0.0465 -0.233  -0.0840 -0.146  -0.101
## 4 CC_f_186    0.284     -0.0704  -0.0491 -0.0812  0.0424 -0.239   0.0361  0.0651
## 5 CC_f_210    0.242     -0.0819   0.00588 -0.0770 -0.162   0.168   0.0635 -0.0108
## # … with 12 more variables: GOOSE_F1 <dbl>, GOOSE_F2 <dbl>, TRAP_F1 <dbl>,
## #   TRAP_F2 <dbl>, START_F1 <dbl>, START_F2 <dbl>, STRUT_F1 <dbl>,
## #   STRUT_F2 <dbl>, NURSE_F1 <dbl>, NURSE_F2 <dbl>, LOT_F1 <dbl>, LOT_F2 <dbl>,
## #   and abbreviated variable names ¹THOUGHT_F2, ²FLEECE_F1, ³FLEECE_F2,
## #   ⁴DRESS_F1, ⁵DRESS_F2
## # ℹ Use `colnames()` to see all variable names
```

--

**Yay!** A row for each speaker, and a column for each variable.

---

# Preprocessing Tips

--

- PCA requires _complete observations_.
  - i.e. each row has a value for each variable.
- If we are missing data, we have a few options:
  1. delete the column with missing data,
  2. delete the row with missing data, or
  3. impute values.
- Which option to choose depends on context.
  - e.g. all three were used in Wilson Black et al. (under review):
    1. both <span style="font-variant: small-caps;">foot</span> columns removed,
    2. rows missing more than 3 values removed, and
    3. mean values imputed for the remaining rows.

???

Taking stock:
- The specific steps carried out above were _problem specific_.
- The details are not vital for understanding PCA.
- At this point all that matters is that we have a row for each speaker, and a column for each variable.

i.e. we now have a multivariate space to reduce!

---

# Apply PCA

--

We have a multivariate dataset containing information about 10 New Zealand English monophthongs.

**We can now apply PCA** to find meaningful patterns in this data:

--

```r
onze_pca <- onze_ints %>%
  # Remove the speaker column as this is not wanted by PCA.
  select(-speaker) %>%
  # We scale the variables (more on this in a moment).
  prcomp(scale = TRUE)
```

--

- `prcomp` is the recommended function for PCA (rather than `princomp`).
- For historical reasons, `prcomp` does not scale variables by default.
- But variable scaling is pretty much always recommended.
  - Why? Because variables with higher variance would otherwise dominate variables with lower variance.

--

**What now?**

1. How many PCs should we look at?
2. What do the PCs mean?

---

# How Many PCs?: Scree Plots

--

.pull-left[
```r
fviz_eig(onze_pca)
```

<img src="pca_presentation_files/figure-html/unnamed-chunk-15-1.png" height="450px" />
]

.pull-right[
- The `fviz_eig` function from `factoextra` plots the variance explained by each PC.
- Each PC explains less (or equal) variance than the previous PC.
- The scree plot is used to select which PCs we will use.
- **How to use:** two options:
  1. use a cut-off value (e.g. 10%), or
  2. 'the elbow rule': ignore PCs after the 'elbow' in the scree plot.
- In this case, _both_ methods can motivate selecting the first two PCs (see the numerical check on the next slide).
- The second method _might_ allow you to pick the first six PCs.
]

???

One always picks the _first_ _n_ PCs in standard use of PCA.

Function name `fviz_eig`: under the hood, these are eigenvalues of matrices.

The function produces ggplot objects, which can be modified in the usual ggplot way, e.g. `+ labs(...)`.
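---

# How Many PCs?: A Numerical Check

The 10% cut-off rule can also be applied numerically. A minimal sketch reading the proportions of variance off the `prcomp` summary (exact values depend on the speaker sample):

```r
# Proportion of variance explained by each PC.
prop_var <- summary(onze_pca)$importance["Proportion of Variance", ]
prop_var[1:5]

# Which PCs clear a 10% cut-off?
which(prop_var >= 0.10)
```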
---

# Interpretation: Variable Plots

--

.pull-left[
```r
fviz_pca_var(
  onze_pca,
  repel = TRUE,
  select.var = list(cos2 = 10)
)
```

<img src="pca_presentation_files/figure-html/unnamed-chunk-16-1.png" height="400px" />
]

.pull-right[
- Variable plots show the relationship between the original variables and the selected PCs (in this case, PC1 and PC2).
- **Positives:** we get a sense of the PC space as a whole, and of relations between variables which might be invisible from a single PC.
- **Negatives:** they can be very messy!
- Some technical points:
  - This plot restricts the variables shown to the 10 best represented by PC1 and PC2 (`list(cos2 = 10)`).
  - `repel = TRUE` prevents labels from overlapping.
  - If you want to look at, e.g., PC3 and PC4, use `axes = c(3, 4)`.
]

???

PC1 seems to match Brand 2021 PC2, while PC2 gets the DRESS F2, GOOSE F2 relationship in Brand's PC3 and PC1.

Let's wait for the contribution plots.

---

# Interpretation: Contribution Plots

.pull-left[
```r
pca_contrib_plot(
  onze_pca,
  pc_no = 1,
  cutoff = 50
)
```

![](pca_presentation_files/figure-html/unnamed-chunk-17-1.png)<!-- -->
]

.pull-right[
- This kind of plot was used to interpret PCs in Brand et al. (2021).
- We interpret the PC with reference to the variables which together explain 50% of its variance.
- Here we are given the percentage contribution of the original variables, and their relationship to one another.
- **NB:** the signs can be flipped without affecting the meaning of the PCs!
- Don't let yourself be bullied by strict cutoffs, but they can help to focus interpretation.
]

---

# Interpretation: Contribution Plots

.pull-left[
```r
pca_contrib_plot(
  onze_pca,
  pc_no = 2,
  cutoff = 50
)
```

![](pca_presentation_files/figure-html/unnamed-chunk-18-1.png)<!-- -->
]

.pull-right[
- PC1 has loadings which slowly trail off.
- Often it is easier to interpret PCs which involve only a handful of the original variables.
- We avoid getting into the weeds with these particular plots!
- **The point:** PCs can be interpreted _one-by-one_ as in these plots, or _together_ as in the variable plots.
]

---

# Interpretation: Theory

--

- The standard deviations of the PCs, i.e. the square roots of the eigenvalues of the correlation matrix, are stored in `onze_pca$sdev`.
  - To turn them into percentages of variance explained: `onze_pca$sdev^2/sum(onze_pca$sdev^2) * 100`
- Interpretation of PCs usually focuses on two quantities:
  1. the _loadings_: the contribution of each original variable to a given PC, and
  2. the _scores_: the position of individuals in PC space.
- It can be useful to see these directly:
  - loadings can be accessed by adding `$rotation` to the name of your PCA object (e.g. `onze_pca$rotation`), and
  - scores can be accessed by adding `$x` (e.g. `onze_pca$x`).
- **Scores** can be useful as either response or predictor in regression models (see the sketch on the next slide).

--

```r
onze_pca$rotation[1:5, 1:5]
```

```
##                    PC1         PC2         PC3         PC4         PC5
## THOUGHT_F1   0.3572963 -0.04267274  0.08444889 -0.14809462  0.34717131
## THOUGHT_F2   0.1412604 -0.31325372  0.03986780  0.16732278  0.31588925
## FLEECE_F1   -0.3153993 -0.16235352 -0.21469142 -0.32770244 -0.04562523
## FLEECE_F2    0.3106982 -0.05124337 -0.37692977  0.18536913 -0.05316580
## KIT_F1      -0.1130919  0.05917746  0.45148138  0.04650166 -0.17518187
```
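---

# Using Scores Downstream

The sketch promised on the last slide: PC scores as a response in a regression. Illustrative only; the speaker lookup rebuilt here from `onze_vowels` is an assumption, not part of the analysis above.

```r
# Speaker-level metadata: one row per speaker.
speakers <- onze_vowels %>%
  distinct(speaker, yob) %>%
  # Match the character speaker column in onze_ints.
  mutate(speaker = as.character(speaker))

# Scores live in onze_pca$x, with rows in the same order as onze_ints.
scores <- onze_pca$x %>%
  as_tibble() %>%
  mutate(speaker = onze_ints$speaker) %>%
  left_join(speakers, by = "speaker")

# Does PC1 vary with year of birth?
summary(lm(PC1 ~ yob, data = scores))
```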
---

# Testing: Permutation

```r
permutation_results <- permutation_test(
  onze_ints %>% select(-speaker),
  pc_n = 5,
  n = 100
)

plot_permutation_test(permutation_results)
```

![](pca_presentation_files/figure-html/unnamed-chunk-20-1.png)<!-- -->

---

# Testing: Permutation

- Permutation tests offer a way to discern whether there are real patterns in your data.
- The method: each variable is shuffled independently.
- Left panel: the number of significant pairwise correlations in the data.
  - Remember: PCA as investigation of the correlation matrix.
- Right panel: the percentage of variance explained by each PC in our permuted and actual data.
- If `n` is large, a permutation test can take a long time.

---

# Summary

--

- PCA is an important tool for _exploratory data analysis_ in sociolinguistics.
- It is well suited to 'quantitative abstraction': we move from lots of related variables to the structural variation between them.

--

- We've covered:
  1. the geometric intuition behind PCA,
  2. pre-PCA data processing,
  3. the application of PCA with `prcomp`,
  4. visual interpretation of PCA with `factoextra` and `nzilbb.vowels`,
  5. how to access loadings, scores, and variances explained, and
  6. testing for meaningful structure with permutation tests.

--

- For more, see: <http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/112-pca-principal-component-analysis-essentials/>