class: top, title-slide

.title[
# The overlooked effect of amplitude on within-speaker vowel variation
]
.author[
### Joshua Wilson Black, Jen Hay, Lynn Clark, James Brand
]
.date[
### 11 July 2022
]

---

# The Team

This research was carried out by **Joshua Wilson Black**, **Jen Hay**, **Lynn Clark**, and **James Brand**.

This work is part of the project _Towards an Improved Theory of Language Change_, based at the New Zealand Institute of Language, Brain and Behaviour | Te Kāhui Roro Reo at the University of Canterbury | Te Whare Wānanga o Waitaha.

We gratefully acknowledge the support of the Marsden Fund | Te Pūtea Rangahau a Marsden.

.left[![](images/Marsden-logo-RGB-150.png)]

_Application number: 17-UOC-049_

???

Contributions:

- Joshua Wilson Black: primary analysis, write-up.
- Jen Hay: research direction, write-up.
- Lynn Clark: research direction.
- James Brand: pilot work on interval representation of data.

---

# Overview

Our **problem**: Research on style _predicts_ systematic covariation of linguistic variables as a result of shifts in speaker style (e.g. Podesva 2008), and _invites_ the development of quantitative methods to explore any such covariation (e.g. Tamminga 2021).

Our **method**: To extend Brand et al.'s [(2021)](https://www.sciencedirect.com/science/article/pii/S0095447021000711#ak005) use of PCA from _across_-speaker variation to _within_-speaker variation.

Our **results**:

1. Relative amplitude has a significant and under-appreciated impact on speaker vowel spaces.
    - Revealed by covariation in first formant values.
    - Known from experimental environments with the introduction of loud background noise; not previously shown for naturalistic recordings in a quiet environment.
2. Amplitude is used, potentially agentively, to mark position within topical subsections of monologues.

???

The literature on style both _predicts_ covariation of linguistic variables and _invites_ methodological developments so that we can get a quantitative handle on such covariation. Evaluating the predictions depends on the development of appropriate methods.

Previous work on the project (by all authors but Wilson Black) showed that Principal Component Analysis is an appropriate method for finding vowel covariation across speakers. We push a similar method further in order to explore within-speaker covariation.

We did not find stylistic covariation. However, we did find a surprising and, we argue, overlooked source of covariation in amplitude. We find that first formant values pattern together with amplitude. While this is known in certain laboratory environments, and is particularly associated with loud speech and/or distant interlocutors, it has not been shown in the kind of naturalistic recordings which sociolinguists are often interested in.

Second, we found that variation in amplitude is being used, possibly on purpose, by speakers to mark their position in sub-topic segments of our recordings. This is a potential link with the literature on, say, turn-taking, where amplitude is a variable which can be used to signal whether or not one has finished speaking in a dialogue.

We didn't solve our problem, but our results are of interest nonetheless! We will discuss the first of these results here, but are happy to talk about the second if that is of interest.

---

# Our Data: QuakeBox

.pull-left[
![](images/QuakeBox.jpg)
<figcaption>Source: Waynne Williams (Port Hills Productions)</figcaption>
]
.pull-right[
- 431 single-speaker monologues.
- Prompt: "tell us your earthquake story".
- 277 speakers with topic tagging.
- Stored in a LaBB-CAT instance, with forced alignment via HTK.
- Extracted using the LaBB-CAT interface to Praat:
    - F1 and F2 for all monophthongs at the midpoint of the _token_,
    - maximum amplitude of the _word_ in which the monophthong token appears,
    - articulation rate of the _utterance_ in which the monophthong token appears.
]

???

Our data comes from the QuakeBox corpus (Clark et al., 2016). The corpus has 431 speakers who freely respond to the prompt 'tell us your earthquake story'. The recordings are made with high-quality audio and video equipment.

The corpus contains topic tagging for a subset of speakers. These tags indicate whether the speaker is talking about, say, the September 2010 earthquake, or the larger February 2011 earthquake, or their life before the earthquakes, etc.

The corpus is stored in LaBB-CAT and run through HTK forced alignment (Fromont and Hay, 2008). LaBB-CAT also interfaces with Praat.

If it comes up, **filtering details** are given as an additional slide at the end of the deck.

---

# PCA Analysis

.pull-left[
![](images/var_plot.png)
<figcaption>Data from Brand et al. (2021)</figcaption>
]
.pull-right[
- Principal Component Analysis (PCA) finds structural relationships (PCs) between variables.
- PCs are _new variables_ which more compactly represent the original data.
    - e.g. PC1 and PC2 in the plot capture relationships between the named variables.
- PCs can reveal systemic patterns in sound change.
- Plot on left is a _variable plot_.
    - Axes are PCs, arrows are original variables.
    - e.g. PC1: <span style="font-variant: small-caps;">strut</span> and <span style="font-variant: small-caps;">fleece</span> go one way, <span style="font-variant: small-caps;">thought</span> goes the other.
]

???

The example is a simplified version of two PCs in Brand et al. (2021). These are from the ONZE corpus and represent structure in large-scale across-speaker change in NZE.

Can the same methods work _within_ speakers?

---

# Within-Speaker PCA: Intervals

.pull-left[
![](images/interval_detail.png)
<figcaption>Data from two monophthongs against 240 second intervals.</figcaption>
]
.pull-right[
- **Problem:** to move _within_ monologues, we need multiple observations per speaker.
- **Simple solution:** divide each monologue into intervals and take means.
- How long?
    - We need enough tokens of each monophthong, but
    - long intervals will miss covariation at shorter time scales.
- Two lengths: 60s and 240s.
]

???

The observations need to be _complete_. That is, we need an entry for each of our variables (formants, amplitude, and articulation rate for each vowel).

The interval solution is not the only possible solution. We tried more sophisticated methods (representing each vowel with a GAMM, for instance), but these proved very difficult to get a handle on (for a sense of why, see Meredith Tamminga's recent paper in _Social Meaning and Linguistic Variation_ (2021)).

The figure on the left is a zoomed-in section of a larger plot of all of the data for a single speaker. The two rows are the first formant values for two different vowels. The points are the frequencies of tokens of the vowel. The filled-in squares are the intervals, with values represented by shade: redder intervals have higher mean values and bluer intervals have lower mean values.

Note that different vowels have quite different frequencies. These are 240 second intervals, and one of the vowels only has one token!

More detail on the **imputation process** is given in an additional slide at the end of the deck; a minimal sketch of the interval construction follows.
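As a rough illustration (not our actual pipeline), the interval means and the imputation rule from the imputation slide might look like the following, assuming a `tokens` data frame with columns `speaker`, `vowel`, `time` (in seconds), and `F1`, and reading 'the mean value' as the speaker's mean for that vowel (an assumption):

```r
library(dplyr)

# Assign each token to a fixed-length interval, then take per-vowel means.
# `interval_length` is in seconds; we used 60 and 240.
interval_means <- function(tokens, interval_length = 60) {
  tokens %>%
    mutate(interval = floor(time / interval_length)) %>%
    group_by(speaker, interval, vowel) %>%
    summarise(n_tokens = n(), F1 = mean(F1), .groups = "drop")
}

# Imputation rule, per speaker and vowel: no tokens -> speaker mean;
# one or two tokens -> add two copies of the speaker mean, then average.
impute_mean <- function(f1_values, speaker_mean) {
  if (length(f1_values) == 0) return(speaker_mean)
  if (length(f1_values) <= 2) return(mean(c(f1_values, rep(speaker_mean, 2))))
  mean(f1_values)
}

# The completed intervals (one row per interval, one column per variable)
# can then be passed to prcomp(..., scale. = TRUE) for the PCA.
```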
Worth noting here that the 240 second intervals require almost no imputation, whereas the 60 second intervals require quite a bit.

---

# Speaker Example

![](images/QB_NZ_F_369_combined.png)

???

This depicts all of the data (with the exception of articulation rate) for a single speaker. We have both interval lengths, both F1 and F2, all vowels, and amplitude.

Note that amplitude seems to be gradually reducing.

An intuitive way to think of our PCA analysis is that it seeks to find associations between the patterns of colour in these variables.

---

# PCA Results: 60 Seconds

.center[.large-image[
![](images/var_plot_60.png)
]]

- **PC1:** F1s and amplitude pattern together.
- **PC2:** a strong F1-F2 relationship in <span style="font-variant: small-caps;">goose</span>.
    - F1-F2 relationships are not covariation between monophthongs.

???

We jump straight to the PCA. This analysis is the result of putting all of the 60 second intervals into PCA. It is thus an attempt to find _within-speaker_ patterns which can be characterised _across_ speakers. Any patterns we find are patterns in the corpus at large, but they are patterns which apply _within_ monologues.

We will here present only the results for the 60 second intervals. The 240 second intervals give the same kind of **PC1**.

PC2 is the easiest to use to indicate how to read variable plots. PC2 indicates one pattern in the data: when GOOSE F2 goes up, GOOSE F1 goes down, and vice versa. We are not interested in PC2 otherwise, though: it is not a relationship _between_ vowels.

PC1 is our main phenomenon. Note all the F1s on the right, along with amplitude. PC1 says that one ingredient in the data is F1s moving together with amplitude.

Incidentally, our initial PCA _did not include_ amplitude. Adding amplitude was an attempt to _explain_ the F1 variation.

**240s intervals and results of the permutation test are given as an additional slide at the end.**

---

# Amplitude and F1

.center[.medium-image[
![](images/f1_amp.png)
]]

- GAMM model of F1 with predictors:
    - amplitude, articulation rate, pitch, time through monologue.
    - Random effects: speaker, and vowel by speaker.
- Amplitude is by far the largest predictor of variation.
    - In general: F1 increases with amplitude.
    - Some vowels are discernible at one amplitude but not another.

???

To further confirm our claim that amplitude drives systematic covariation of F1, we fit a GAMM model with:

- Response: F1 (non-scaled).
- Predictors:
    - Amplitude (scaled by speaker),
    - Articulation rate (same),
    - Pitch (same),
    - Time through monologue (on scale [0, 1]).
- Random effects:
    - Speaker: fits the mean F1 for the speaker.
    - Vowel by speaker: fits the mean F1 for each vowel for a given speaker.
    - **i.e.**: the random effects capture each speaker's mean vowel space.
- Each predictor has a smooth fit for each vowel.

All variables come up as significant by a model-comparison significance test, but amplitude is **far and away** the strongest effect.

**Addition at end: significance test scores.**

With the last point we already see some consequences for the nature of the vowel space. We see this more in the next slide.

---

# Amplitude and Vowel Space

.medium-image[.center[
![](images/amp_art.gif)
]]

- The same GAMM was fit to F2 and used to plot model predictions.
- Articulation rate is known to contract the vowel space...
    - ...and is often controlled for.
- But relative amplitude has a larger effect!
- Note that <span style="font-variant:small-caps;">goose</span> and <span style="font-variant:small-caps;">thought</span> F2 are also involved.

???
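In mgcv terms, the model just described might look roughly like the sketch below. This is a hedged reconstruction, not our exact specification; `vowel_tokens` and its column names are illustrative.

```r
library(mgcv)

# `vowel` and `speaker` are factors; amplitude, articulation rate, and
# pitch are z-scored by speaker; `time` is position in the monologue
# on [0, 1]. Each predictor gets a smooth per vowel.
f1_model <- bam(
  F1 ~ vowel +
    s(amplitude, by = vowel) +
    s(articulation_rate, by = vowel) +
    s(pitch, by = vowel) +
    s(time, by = vowel) +
    s(speaker, bs = "re") +        # speaker mean F1
    s(speaker, vowel, bs = "re"),  # each speaker's mean per vowel
  data = vowel_tokens
)

# Predictions across amplitude, holding the other predictors at their
# average (z-scored) values and time at the monologue midpoint.
new_data <- expand.grid(
  vowel = levels(vowel_tokens$vowel),
  amplitude = seq(-2, 2, by = 0.5),
  articulation_rate = 0,
  pitch = 0,
  time = 0.5,
  speaker = levels(vowel_tokens$speaker)[1]
)
# Keep the full level set so predict() can build the (excluded)
# random-effect terms.
new_data$speaker <- factor(new_data$speaker,
                           levels = levels(vowel_tokens$speaker))
preds <- predict(
  f1_model, newdata = new_data,
  exclude = c("s(speaker)", "s(speaker,vowel)")  # zero out random effects
)
```

Excluding the random-effect smooths means the predictions describe an average speaker rather than any particular one.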
These plots assume that the non-varied variables maintain average values and that we are at the midpoint of a monologue.

**Addition at end:** embedded vowel space Shiny for exploration of different intervals.

---

# Consequences

1. Natural amplitude variation in quiet environments correlates with F1.
    - Previously shown in lab environments and with the introduction of background noise.
    - Amplitude variation in natural speech is _larger_ than in controlled environments.
2. Challenges for investigation of within-speaker stylistic covariation.
    - Amplitude variation needs to be controlled to find other sources of covariation.
    - Reported claims of F1 covariation which don't include amplitude are challenged.
3. Challenges for investigation of across-speaker covariation.
    - Handling differences in amplitude when comparing speakers is non-trivial!
    - A minimal step: include a measure of amplitude in regression models.
4. Amplitude variation is related to topic structure.
    - Amplitude variation is part of a phonetic account of discourse structure.
    - Apparent increase in amplitude at the beginning of a topic and decline at the end.

???

Regarding 1: 24 dB variation in our data vs. 10 dB variation in a quiet vs. loud environment study (Liénard & Di Benedetto 1999).

Regarding 2: Priming work by Villarreal & Clark is mentioned as a study which might have benefited from including amplitude.

Regarding 3: Differences in recording equipment make this very difficult. No objective measure is forthcoming for most of the data we work with. The proposal to include _relative_ amplitude is also insufficient if speakers have been recorded entirely within a high or low amplitude span of speech. This creates serious problems for comparing speakers with one another.

Regarding 4:
- Link to the turn-taking literature ('manage the flow of talk', Local 2007).
- Very hard to model, given that the effect we're claiming is a reduction in amplitude over time _in addition to_ the reduction which occurs in the monologue as a whole. We apply a few different methods to this question, all of which point in the same direction. But this needs to be explored further.

---

class: title-slide

# Thank you!

---

# Additional: Filtering

.center[.full-image[![](images/filtering_flow.png)]]

---

# Additional: Imputation

.medium-image[
![](images/missing_interval_rate_plot.png)
]

Exclude <span style="font-variant: small-caps;">foot</span>, then

1. accept 60 second intervals with seven monophthong types (and 240 second intervals with nine),
2. if a monophthong has no data, we impute the mean value,
3. if it has one or two tokens, we add two mean tokens to the interval, then take the interval mean.

???

**NB:** No imputation is required for amplitude or articulation rate.

---

# Additional: 240s Intervals

.full-image[
![](images/var_plot_240.png)
]

---

# Additional: PCA Permutation Test

.medium-image[
![](images/permutation_test.png)
]

- Permutation method: scramble the time variable in the original dataset.
    - Maintains the F1-F2 link _within_ vowels.
    - Removes temporal covariation _between_ vowels.
- y-axis: amount of variance explained by each PC.
- Blue: PCA runs on 1000 permuted datasets.
- Red: real PCA results.

---

# Additional: Model Significance

- We generate p-values by a `\(\chi^2\)` model comparison (sketched below; the resulting p-values follow).
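A hedged sketch of one such comparison, testing amplitude and reusing the illustrative model from the earlier GAMM sketch. `compareML()` from the itsadug package performs a `\(\chi^2\)` test on the ML scores:

```r
library(mgcv)
library(itsadug)

# Refit with method = "ML" (required when comparing fixed effects),
# drop the amplitude smooths, and compare the two models.
m_full <- bam(
  F1 ~ vowel + s(amplitude, by = vowel) + s(articulation_rate, by = vowel) +
    s(pitch, by = vowel) + s(time, by = vowel) +
    s(speaker, bs = "re") + s(speaker, vowel, bs = "re"),
  data = vowel_tokens, method = "ML"
)
m_no_amp <- bam(
  F1 ~ vowel + s(articulation_rate, by = vowel) +
    s(pitch, by = vowel) + s(time, by = vowel) +
    s(speaker, bs = "re") + s(speaker, vowel, bs = "re"),
  data = vowel_tokens, method = "ML"
)
compareML(m_no_amp, m_full)  # chi-squared test on ML scores
```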
<div>
<table>
<thead>
<tr>
<th style="text-align:left;"> Variable </th>
<th style="text-align:left;"> p-value </th>
</tr>
</thead>
<tbody>
<tr> <td style="text-align:left;">Time</td> <td style="text-align:left;">6.382e-08</td> </tr>
<tr> <td style="text-align:left;">Articulation Rate</td> <td style="text-align:left;">1.051e-13</td> </tr>
<tr> <td style="text-align:left;">Amplitude</td> <td style="text-align:left;">&lt; 2e-16</td> </tr>
<tr> <td style="text-align:left;">Pitch</td> <td style="text-align:left;">&lt; 2e-16</td> </tr>
</tbody>
</table>
</div>

---

# Additional: Shiny Interactive

<iframe src="https://joshuawilsonblack.shinyapps.io/space_similarity_app/" height="550" width="100%" style="border: 1px solid #464646;" data-external="1"></iframe>

---

# Additional: Topics

.medium-image[.center[
![](images/topic_examples.png)
]]

- Is amplitude variation within a monologue just fatigue-driven gradual decline, or is it *used* by speakers for other purposes?
- We use the QuakeBox topic tags to investigate whether amplitude marks changes in topic.
- We explore this by dividing topics into beginning, middle, and end and fitting a linear mixed model.

???

We do not find any strong evidence that our F1 covariation is agentive (as, say, stylistic covariation would be). But is amplitude variation itself being used agentively?

We don't have much in our corpus to get a handle on this question, but we _do_ have the topic tags. These give us structure within our monologues which speakers ought to be somewhat aware of, and which might have some connection with amplitude variation.

The plot here shows amplitude data for six distinct topics within the same speaker. Note that we ignore the specific content of these topics.

We do a bit of additional filtering not discussed here, but note that topic 5 does not have enough data in the middle to be included. (We require five points per section.)

---

# Additional: Topics

- We control for the overall decrease in amplitude over the monologue by adding a term for time through monologue.
- We fit a random slope for each speaker and for the parts of each topic in the dataset.

<div>
<table>
<thead>
<tr>
<th style="text-align:left;"> Variable </th>
<th style="text-align:left;"> Estimate </th>
<th style="text-align:left;"> Std. Error </th>
<th style="text-align:left;"> t-value </th>
</tr>
</thead>
<tbody>
<tr> <td style="text-align:left;">Beginning</td> <td style="text-align:left;">0.05</td> <td style="text-align:left;">0.01</td> <td style="text-align:left;">3.46</td> </tr>
<tr> <td style="text-align:left;">Middle</td> <td style="text-align:left;">-0.024</td> <td style="text-align:left;">0.02</td> <td style="text-align:left;">-1.10</td> </tr>
<tr> <td style="text-align:left;">End</td> <td style="text-align:left;">-0.10</td> <td style="text-align:left;">0.02</td> <td style="text-align:left;">-5.28</td> </tr>
<tr> <td style="text-align:left;">Time</td> <td style="text-align:left;">-0.24</td> <td style="text-align:left;">0.06</td> <td style="text-align:left;">-3.99</td> </tr>
</tbody>
</table>
</div>

- The coefficient for 'Time' indicates a gradual decline in amplitude.
- The other coefficients indicate a shift from the mean value.
- We have an increase in amplitude at the beginning of topics, and a larger decrease at the end.
- t-values of magnitude greater than 2 are taken to indicate 'surprise'.
- The rise in amplitude at the beginning and decline at the end of topics is not explained by the gradual decrease in amplitude.

???

We don't fit random intercepts because the speaker data is z-scored.
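A hedged lme4 sketch of this kind of model; `interval_data` and its column names are illustrative, and the exact random-effects structure in the paper may differ:

```r
library(lme4)

# `part` is a factor (beginning / middle / end), `time` is position in
# the monologue on [0, 1], and amplitude is z-scored within speaker,
# hence no global intercept (the `0 +` in the formula).
amp_model <- lmer(
  amplitude ~ 0 + part + time +
    (0 + part | topic) +   # each part of each topic varies independently
    (0 + time | speaker),  # per-speaker decline over the monologue
  data = interval_data
)
summary(amp_model)  # t-values; we read |t| > 2 as 'surprising'
```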
Our random effect structure treats each topic as a random effect, allowing each part to vary independently. It also treats each speaker as having a more or less extreme change in amplitude over the course of the monologue.

I haven't discussed here the t-value test we carry out in the paper; it is given as an **additional slide**.

---

# Additional: Topics

.full-image[.center[
![](images/lmm-topics.png)
]]

???

This plot displays predictions from the model for each topic in the dataset. The effect does look quite subtle, but it is there.

---

# Additional: Fake Topic Test

.large-image[.center[
![](images/t_values.png)
]]

- The t-values obtained for real topics are of extreme magnitude compared to the distribution from 1000 'fake topics'.

???

We generate fake topics by, for each speaker, collecting the lengths of their topics and then selecting random chunks of their monologue of the same lengths. This is done, and the modelling step repeated, 1000 times.

The distributions depicted here are the distributions of t-values from models fit on the fake topics.
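A minimal sketch of the fake-topic construction for a single speaker; names are illustrative, and the modelling step is the lme4 sketch above:

```r
# Given a speaker's real topic lengths (seconds) and total monologue
# length, draw random spans of the same lengths.
fake_topics <- function(topic_lengths, monologue_length) {
  starts <- runif(length(topic_lengths), min = 0,
                  max = monologue_length - topic_lengths)
  data.frame(start = starts, end = starts + topic_lengths)
}

# Repeating this for every speaker, relabelling the data by fake topic
# and part, and refitting the mixed model 1000 times yields the
# t-value distributions in the plot.
```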