nzilbb.vowels

A new tool for quantitative research on vocalic covariation

Joshua Wilson Black
joshua.black@canterbury.ac.nz

Te Kāhui Roro Reo | New Zealand Institute of Language, Brain and Behaviour

Te Whare Wānanga o Waitaha | University of Canterbury

Overview

  1. What is it?
  2. Design strategy
  3. From research script to package

What is nzilbb.vowels?

What is it?

  • A package for R (R Core Team 2024).
  • A normalisation function lobanov_2()
  • A series of functions for testing PCA (and MDS)
  • Functions to plot test results
  • A shortcut for smallcaps formatting in RStudio
  • Website: https://nzilbb.github.io/nzilbb_vowels

lobanov_2()

  • Brand et al. (2021) adopt a mean-of-means approach to Lobanov normalisation.
  • Why?
    • It’s better for unbalanced data across vowel categories.
    • i.e. any corpus work!
  • Matches the normalisation functions in the vowels package.
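
The mean-of-means idea can be sketched in base R. This is an illustration only, not the package's implementation: lobanov_2_sketch() and the toy data are hypothetical, and the sketch assumes that the centre is the mean of the per-vowel means and the scale is the standard deviation of the per-vowel means, within each speaker.

```r
# Hypothetical helper (not the package's code): mean-of-means Lobanov
# normalisation for a single formant column.
lobanov_2_sketch <- function(df, formant = "F1") {
  normalised <- rep(NA_real_, nrow(df))
  for (spk in unique(df$speaker)) {
    rows <- df$speaker == spk
    # One mean per vowel category, so frequent vowels do not dominate.
    vowel_means <- tapply(
      df[[formant]][rows],
      as.character(df$vowel[rows]),
      mean
    )
    centre <- mean(vowel_means) # mean of the vowel means
    spread <- sd(vowel_means)   # standard deviation of the vowel means
    normalised[rows] <- (df[[formant]][rows] - centre) / spread
  }
  normalised
}

# Toy, deliberately unbalanced data: three FLEECE tokens, one TRAP token.
toy <- data.frame(
  speaker = "s1",
  vowel = c("FLEECE", "FLEECE", "FLEECE", "TRAP"),
  F1 = c(300, 310, 320, 700)
)
toy$F1_lob2 <- lobanov_2_sketch(toy)
# Classic Lobanov, for comparison, uses the token-level mean and sd.
toy$F1_lob <- (toy$F1 - mean(toy$F1)) / sd(toy$F1)
toy
```

In the toy data the token-level mean is pulled towards FLEECE because it has more tokens, while the mean of means sits halfway between the two vowel categories.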

lobanov_2() usage

library(nzilbb.vowels)

# Adds F1_lob2 and F2_lob2 columns to the data.
onze_vowels <- onze_vowels |>
  lobanov_2()

head(onze_vowels)

speaker   vowel    F1_50  F2_50  speech_rate  gender  yob   word        F1_lob2     F2_lob2
IA_f_065  THOUGHT  514    868    4.3131       F       1891  word_09539  -0.7212895  -1.9212428
IA_f_065  FLEECE   395    2716   4.3131       F       1891  word_22664  -1.6603467   1.4915434
IA_f_065  KIT      653    2413   4.3131       F       1891  word_02705   0.3755925   0.9319794
IA_f_065  DRESS    612    2372   4.3131       F       1891  word_23651   0.0520517   0.8562628
IA_f_065  GOOSE    445    2037   4.3131       F       1891  word_06222  -1.2657848   0.2376030
IA_f_065  GOOSE    443    2258   4.3131       F       1891  word_06222  -1.2815673   0.6457338

‘model-to-PCA’

  • Functions to aid a common NZILBB workflow (Brand et al. 2021; Hurring et al. Under review; Wilson Black et al. 2022)
  • Basic story:
    • Statistical models control which variance is left for PCA to explain.
    • Apply PCA to model outputs (esp. random intercepts)
  • Some similarity with Redundancy Analysis in ecology (Legendre and Legendre 2012).

correlation_test()

  • PCA finds structure in correlations between variables.
  • Q: Is there robust structure in my data?
  • If ‘yes’, then we expect more and stronger pairwise correlations than for random data.
  • Determine this with a permutation test.
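
A minimal base R sketch of this kind of permutation test follows. It is not the code behind correlation_test(): count_sig_cors() and the toy data are hypothetical, and the sketch assumes the null is built by shuffling each column independently.

```r
set.seed(1)

# Count pairwise correlations that are significant at `alpha`.
count_sig_cors <- function(data, alpha = 0.05, method = "spearman") {
  pairs <- combn(ncol(data), 2)
  p_values <- apply(pairs, 2, function(ij) {
    cor.test(
      data[[ij[1]]], data[[ij[2]]],
      method = method, exact = FALSE
    )$p.value
  })
  sum(p_values < alpha)
}

# Toy data: four variables which share one latent source of variation.
latent <- rnorm(100)
toy <- as.data.frame(replicate(4, latent + rnorm(100, sd = 0.5)))

original_count <- count_sig_cors(toy)

# Null distribution: shuffle each column independently. This keeps each
# variable's distribution but breaks the links between variables.
permuted_counts <- replicate(50, {
  shuffled <- as.data.frame(lapply(toy, sample))
  count_sig_cors(shuffled)
})

original_count
summary(permuted_counts)
```

If the count for the original data sits well above the range of the permuted counts, there is structure for PCA to find.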

correlation_test() usage

pca_data <- onze_intercepts |>
  dplyr::select(-speaker)

cor_test <- correlation_test(
  pca_data, 
  n = 100, 
  cor.method = 'spearman'
)
summary(cor_test)

correlation_test() usage (cont.)

Correlation test results.
Count of significant pairwise correlations in original data at alpha = 0.05: 60
Mean significant pairwise correlations in permuted data (n = 100) at alpha = 0.05: 9.4
Min = 3, Max = 17.

Top 5 pairwise correlations in original data:
F2_FLEECE, F2_NURSE: -0.57
F1_FLEECE, F1_START: 0.57
F2_STRUT, F2_THOUGHT: -0.51
F1_LOT, F1_START: -0.49
F2_START, F2_THOUGHT: -0.48

plot_correlation_magnitudes()

plot_correlation_magnitudes(cor_test)

plot_correlation_counts()

plot_correlation_counts(cor_test)

pca_test()

  • PCA is sensitive to noise and outliers.
  • Q: How stable is my PCA?
  • Estimate confidence intervals using bootstrapping.
  • Estimate our ‘null’ using permutation.
  • Again borrowing ideas from biology (Björklund 2019; Camargo 2022; Vieira 2012).
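
The two resampling ideas can be sketched with base R's prcomp(). This is an illustration under assumptions rather than the implementation of pca_test(): pc1_variance() and the toy data are hypothetical, and only the variance explained by PC1 is tracked.

```r
set.seed(1)

# Proportion of variance explained by PC1 of a scaled PCA.
pc1_variance <- function(data) {
  pca <- prcomp(data, scale. = TRUE)
  pca$sdev[1]^2 / sum(pca$sdev^2)
}

# Toy data: four variables which share one latent source of variation.
latent <- rnorm(100)
toy <- as.data.frame(replicate(4, latent + rnorm(100, sd = 0.5)))

n_iter <- 200

# Bootstrap: resample rows with replacement to see how stable PC1 is.
boot_values <- replicate(n_iter, {
  pc1_variance(toy[sample(nrow(toy), replace = TRUE), ])
})

# Permutation: shuffle each column independently to remove covariation.
null_values <- replicate(n_iter, {
  pc1_variance(as.data.frame(lapply(toy, sample)))
})

boot_ci <- quantile(boot_values, c(0.025, 0.975))
null_upper <- quantile(null_values, 0.95)

boot_ci
null_upper
```

A PC looks worth keeping when its bootstrapped interval lies above what the permuted data produce.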

pca_test() usage

onze_test <- pca_test(
  pca_data,
  n = 500,
  scale = TRUE,
  variance_confint = 0.95,
  loadings_confint = 0.9
)
summary(onze_test)

pca_test() usage (cont.)

PCA Permutation and Bootstrapping Test

Iterations: 500

Significant PCs at 0.05 level: PC1, PC2, PC3, PC4, PC5.

Significant loadings at 0.1 level: 
    PC1: F1_FLEECE
    PC1: F1_GOOSE
    PC1: F1_START
    PC1: F1_THOUGHT
    PC1: F1_TRAP
    PC1: F2_FLEECE
    PC1: F2_NURSE
    PC1: F2_THOUGHT
    PC2: F1_NURSE
    PC2: F2_DRESS
    PC2: F2_KIT
    PC2: F2_LOT
    PC2: F2_STRUT
    PC2: F2_THOUGHT
    PC2: F2_TRAP
    PC3: F2_FLEECE
    PC3: F2_GOOSE
    PC3: F2_LOT
    PC4: F1_KIT
    PC4: F1_LOT
    PC5: F1_STRUT
    PC6: F1_NURSE
    PC6: F2_START

plot_variance_explained()

  • How many PCs should I use?

plot_variance_explained(onze_test, pc_max = 10)

plot_loadings()

plot_loadings(onze_test, pc_no = 1)

plot_pc_vs()

library(ggplot2) # provides coord_fixed()

plot_pc_vs(onze_vowels, onze_test, pc_no = 1, is_sig = TRUE) +
  coord_fixed()

pc_flip()

  • The direction of a PC is arbitrary.
  • i.e., we can swap ‘+’ and ‘-’ in:

plot_loadings(onze_test, pc_no = 1)

pc_flip() (cont.)

  • It’s the relationship between the variables that matters.
  • We might find it easiest to think of, say, a positive PC score as ‘innovative’.
  • We can flip PC loadings with pc_flip() to make this happen.
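
A minimal sketch of why flipping is harmless, using base R's prcomp() and the built-in USArrests data rather than the package's own objects (an assumption made to keep the example self-contained): negating both the loadings and the scores of a PC leaves the reconstructed data unchanged.

```r
pca <- prcomp(USArrests, scale. = TRUE)

# Flip PC1: negate its loadings and its scores together.
flipped <- pca
flipped$rotation[, 1] <- -flipped$rotation[, 1]
flipped$x[, 1] <- -flipped$x[, 1]

# Reconstruct the (scaled) data from scores and loadings.
reconstruction <- pca$x %*% t(pca$rotation)
reconstruction_flipped <- flipped$x %*% t(flipped$rotation)

# The signs of the loadings differ, but the reconstruction does not.
cbind(original = pca$rotation[, 1], flipped = flipped$rotation[, 1])
all.equal(reconstruction, reconstruction_flipped)
```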

pc_flip() usage

onze_test <- onze_test |> 
  pc_flip(pc_no = 1)

# Or: flip PC1 only if needed, so that F1_TRAP has a positive loading.
onze_test <- onze_test |> 
  pc_flip(pc_no = 1, flip_var = "F1_TRAP")

pc_flip() usage (cont.)

plot_pc_vs(onze_vowels, onze_test, pc_no = 1, is_sig = TRUE) + 
  coord_fixed()

mds_test()

  • Moving from production to perception.
  • Multidimensional Scaling (MDS) is similar to PCA.
  • Q: How many dimensions do I need?
  • A: Test whether adding a dimension reduces ‘stress’ more than we would expect from random data.
  • Used in Sheard et al. (2024).
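
The logic of comparing stress against random data can be sketched in base R. Note the swap: this sketch uses classical MDS via cmdscale(), not the ordinal MDS used in the slides, and stress_1(), stress_by_dim() and the toy data are hypothetical. The null is assumed to come from shuffling each column of the underlying data.

```r
set.seed(1)

# Stress-1: mismatch between original and fitted distances.
stress_1 <- function(d, coords) {
  fitted <- dist(coords)
  sqrt(sum((d - fitted)^2) / sum(d^2))
}

# Stress for 1..max_dim dimensions of a classical MDS solution.
stress_by_dim <- function(data, max_dim = 4) {
  d <- dist(data)
  sapply(seq_len(max_dim), function(k) stress_1(d, cmdscale(d, k = k)))
}

# Toy data: six variables driven by a single latent dimension.
latent <- rnorm(40)
toy <- replicate(6, latent + rnorm(40, sd = 0.3))

real_stress <- stress_by_dim(toy)

# Null: shuffle each column independently, then average over permutations.
null_stress <- rowMeans(
  replicate(20, stress_by_dim(apply(toy, 2, sample)))
)

# Reduction in stress from adding each dimension (stress is 1 with none).
real_reduction <- -diff(c(1, real_stress))
null_reduction <- -diff(c(1, null_stress))

round(rbind(real_reduction, null_reduction), 3)
```

A dimension earns its place when its reduction in stress for the real data is larger than the reduction seen in the permuted data.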

mds_test() usage

mds_res <- mds_test(
  sim_matrix,
  n_boots = 50,
  n_perms = 50,
  test_dimensions = 5,
  principal = TRUE,
  mds_type = "ordinal",
  spline_degree = 2,
  spline_int_knots = 2
)
plot_mds_test(mds_res)

Packaged data

  • onze_vowels(_complete)
  • onze_intercepts(_complete)
  • qb_vowels(_complete)
  • qb_intervals
  • sim_matrix

RStudio add-in

  • Formatting text as small caps can be very annoying in RMarkdown/Quarto documents.
  • Create a shortcut to make it easy (in ‘Addins’ menu of RStudio).
  • No more typing ‘<span style="font-variant: small-caps;"></span>’ etc.

Design strategy

Challenges

  • PCA has many advantages but is sensitive to noise.
  • This is especially relevant for phonetic data.
  • We need a way to check the health of our PCAs visually.
  • ‘Flipping’ PCs is mentally hard, especially with the vowel space involved!

What’s out there?

  • Look at current packages and try to match what people use.
  • lobanov_2() matches normalisation functions in vowels package.
  • The PCAtest package (Camargo 2022) is good, but it merges data generation and plotting (cf. Dunnington n.d.).

Data and plots

  • Want people to be able to do their own plots:
    • So create a function to generate the data for a plot.
  • Want to provide a good first option for plots:
    • So create plotting functions which use the data.
  • Want plots to be customisable:
    • So output ggplot objects.
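
The three steps above can be sketched as follows. The function names are hypothetical and the example uses base R's prcomp() with the built-in USArrests data, so it shows the design pattern rather than any function in the package.

```r
library(ggplot2)

# Step 1: a function which only generates the data behind a plot.
variance_plot_data <- function(data) {
  pca <- prcomp(data, scale. = TRUE)
  data.frame(
    pc = seq_along(pca$sdev),
    variance_explained = pca$sdev^2 / sum(pca$sdev^2)
  )
}

# Step 2: a plotting function which uses that data and returns a ggplot.
plot_variance_sketch <- function(plot_data) {
  ggplot(plot_data, aes(x = pc, y = variance_explained)) +
    geom_point() +
    geom_line()
}

plot_data <- variance_plot_data(USArrests)

# Step 3: because the result is a ggplot object, users can add layers.
p <- plot_variance_sketch(plot_data) +
  labs(x = "Principal component", y = "Proportion of variance explained") +
  theme_minimal()
p
```

Users who want a completely different plot can stop after step 1 and work from plot_data directly.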

Customising plots

  • Example: annotating an existing plot using annotate()

plot_pc_vs(onze_vowels, onze_test, pc_no = 1, is_sig = TRUE) +
  coord_fixed() +
  # Add a text label at a chosen position.
  annotate(
    "text",
    x = 1, y = 0.2,
    label = "Low vowel movement"
  ) +
  # Add a shaded box; min and max are given in increasing order.
  annotate(
    "rect",
    xmin = -1, xmax = 1.2, ymin = 0.4, ymax = 1.9,
    alpha = 0.2, colour = "black"
  )

Interactive vs. publication

  • Some plots are provided for fast interactive work.
  • In most cases, you should produce your own plot for publication!

From scripts to packages

Code sharing

  • Our usual approach to code sharing: copy-and-paste from another person’s script.
  • This is how most of us start to learn R.
  • But research code is typically only run once
    • … and not designed with future users in mind.
  • Alternative approach: produce a well-documented package.
  • This requires explicit thought about the range of situations in which the code should work, and it requires documentation.
  • Frequently copied code is a good candidate for making a package.
  • Even if you don’t make a package, it’s worth keeping potential future users in mind!

How to write a package

  • Best resource is Wickham and Bryan (2023) (free at https://r-pkgs.org/).
  • Write functions and documentation at the same time.
  • Write tests to ensure the functions are performing as expected in all relevant cases.
  • Write code to generate data for the package.
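
A minimal sketch of writing a function, its documentation and its tests together, using roxygen2-style comments and the testthat package. The small_caps() function is a hypothetical example, not the package's add-in code.

```r
library(testthat)

#' Wrap text in small caps markup
#'
#' @param text A character vector to format.
#' @return A character vector of HTML spans using small caps.
small_caps <- function(text) {
  # Fail early on input the function was not designed for.
  stopifnot(is.character(text))
  paste0('<span style="font-variant: small-caps;">', text, "</span>")
}

test_that("small_caps() wraps text in a span", {
  expect_equal(
    small_caps("kit"),
    '<span style="font-variant: small-caps;">kit</span>'
  )
  # Vectorised input gives one span per element.
  expect_length(small_caps(c("kit", "dress")), 2)
})

test_that("small_caps() rejects non-character input", {
  expect_error(small_caps(1))
})
```

In a package, the function would live in the R/ directory and the tests in tests/testthat/, as described in Wickham and Bryan (2023).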

Getting on to CRAN

  • Packages are significantly easier to use if they are on CRAN.
  • CRAN: “The Comprehensive R Archive Network”.
  • Code is checked by volunteers to ensure that it meets (minimal) quality standards.
  • Code must be well documented.
  • There’s good advice in Wickham and Bryan (2023).

Summary

  1. What’s in the package?
  2. Design strategy
  3. Scripts vs. packages

References

Allaire, JJ, Yihui Xie, Christophe Dervieux, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, et al. 2024. rmarkdown: Dynamic Documents for R. https://github.com/rstudio/rmarkdown.
Björklund, Mats. 2019. “Be Careful with Your Principal Components.” Evolution 73 (10): 2151–58. https://doi.org/10.1111/evo.13835.
Brand, James, Jen Hay, Lynn Clark, Kevin Watson, and Márton Sóskuthy. 2021. “Systematic Co-Variation of Monophthongs Across Speakers of New Zealand English.” Journal of Phonetics 88: 101096. https://doi.org/10.1016/j.wocn.2021.101096.
Camargo, Arley. 2022. “PCAtest: Testing the Statistical Significance of Principal Component Analysis in R.” PeerJ 10: e12967. https://doi.org/10.7717/peerj.12967.
Dunnington, Dewey. n.d. “Best Practices for Programming with ggplot2.” Posit. Accessed December 2, 2024. https://www.posit.co/.
Hurring, Gia, Joshua Wilson Black, Jen Hay, and Lynn Clark. Under review. “How Stable Are Patterns of Covariation Across Time?” Under review.
Legendre, Pierre, and Louis Legendre. 2012. Numerical Ecology. Amsterdam: Elsevier. http://ebookcentral.proquest.com/lib/canterbury/detail.action?docID=982554.
Müller, Kirill. 2020. here: A Simpler Way to Find Your Files. https://CRAN.R-project.org/package=here.
R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Sheard, Elena, Jen Hay, Robert Fromont, Joshua Wilson Black, and Lynn Clark. 2024. “Covarying New Zealand Vowels Interact with Speech Rate to Create Social Meaning for NZ Listeners.” Presented at the 19th Conference on Laboratory Phonology, Hanyang University, June 29.
Vieira, Vasco. 2012. “Permutation Tests to Estimate Significances on Principal Components Analysis.” Computational Ecology and Software 2 (June): 103–23.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, and Jennifer Bryan. 2023. R Packages: Organize, Test, Document, and Share Your Code. 2nd edition. Sebastopol: O’Reilly Media. https://r-pkgs.org/.
Wilson Black, Joshua, James Brand, Jen Hay, and Lynn Clark. 2022. “Using Principal Component Analysis to Explore Co‐variation of Vowels.” Language and Linguistics Compass 17 (1): e12479. https://doi.org/10.1111/lnc3.12479.
Xie, Yihui, J. J. Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman and Hall/CRC. https://bookdown.org/yihui/rmarkdown.
Xie, Yihui, Christophe Dervieux, and Emily Riederer. 2020. R Markdown Cookbook. Boca Raton, Florida: Chapman and Hall/CRC. https://bookdown.org/yihui/rmarkdown-cookbook.