class: top, title-slide .title[ # Topical structure, amplitude variation, and F1 in single-speaker narrative recordings ] .author[ ### Joshua Wilson Black, Jen Hay, Lynn Clark, James Brand ] .date[ ### 8 December 2022 ] --- # The Team This research was carried out by **Joshua Wilson Black**, **Jen Hay**, **Lynn Clark**, and **James Brand**. This work is part of the project _Towards an Improved Theory of Language Change_, based at the New Zealand Institute of Language, Brain and Behaviour | Te Kāhui Roro Reo at the University of Canterbury | Te Whare Wānanga o Waitaha. We gratefully acknowledge the support of the Marsden Fund | Te Pūtea Rangahau a Marsden. .left[![](data:image/png;base64,#images/Marsden-logo-RGB-150.png)] _Application number: (17-UOC-049)_ ??? Contributions: Joshua Wilson Black: primary analysis, write-up. Jen Hay: research direction, write-up. Lynn Clark: research direction. James Brand: pilot work on interval representation of data. --- # The Big Picture Our **problem**: Research on style _predicts_ systematic covariation of linguistic variables as a result of shifts in speaker style (e.g. Podesva 2008), and _invites_ the development of quantitative methods to explore any such covariation (e.g. Tamminga 2021). -- Our **method**: To extend Brand et al.'s [(2021)](https://www.sciencedirect.com/science/article/pii/S0095447021000711#ak005) use of PCA from _across_-speaker variation to _within_-speaker variation. -- Our **results**: 1. Relative amplitude has a significant and under-appreciated impact on speaker vowel spaces. - Revealed by covariation in first formant values. - Known in experimental environments with the introduction of loud background noise, not known for naturalistic recordings in a quiet environment. 2. Amplitude is used, potentially agentively, to mark position within topical subsections of single-speaker narrative recordings. ??? The literature on style both _predicts_ covariation of linguistic variables and _invites_ methodological developments so that we can get a quantitative handle on such covariation. Evaluating the predictions depends on the development of appropriate methods. Previous work on the project (by all authors but Wilson Black) showed that Principal Component Analysis is an appropriate method for finding vowel covariation across speakers. We attempt to extend a similar method to explore within-speaker covariation. We did not find stylistic covariation. However, we did find a surprising and, we argue, overlooked source of covariation in amplitude. We find that first formant values pattern together with amplitude. While this is known in certain laboratory environments, and is particularly associated with loud speech and/or distant interlocutors, it has not been shown in the kind of naturalistic recordings which sociolinguists are often interested in. Second, we found that variation in amplitude is being used, possibly on purpose, by speakers to mark their position in sub-topic segments of our recordings. This is a potential link with the literature on, say, turn-taking, where amplitude can be used to signal whether or not one has finished speaking in a dialogue. **Mismatch:** We didn't solve our problem, but our results are of interest nonetheless! Today we focus on the second point, and we don't talk about the specific **PCA** aspects. Instead, we'll go straight to a series of GAMM models to show the influence of amplitude on the F1 of NZE monophthongs.
--- # Our Data: the QuakeBox Corpus .pull-left[ ![](data:image/png;base64,#images/QuakeBox.jpg) <figcaption>Source: Waynne Williams (Port Hills Productions)</figcaption> ] -- .pull-right[ - 431 single-speaker monologues - Prompt: "tell us your earthquake story". - 277 speakers with topic tagging. - Stored as a LaBB-CAT instance, with forced alignment via HTK. {{content}} ] -- - We extract (via the LaBB-CAT interface to Praat): {{content}} -- - F1 and F2 for all monophthongs at the midpoint of the _token_, {{content}} -- - Maximum amplitude of the _word_ in which the monophthong token appears, {{content}} -- - Articulation rate of the _utterance_ in which the monophthong token appears. {{content}} ??? Our data comes from the QuakeBox corpus (Clark et al., 2016). The corpus has 431 speakers who freely respond to the prompt 'tell us your earthquake story'. The recordings are made with high-quality audio and video equipment. The corpus contains topic tagging for a subset of speakers. These tags indicate whether the speaker is talking about, say, the September 2010 Earthquake, or the larger February 2011 Earthquake, or their life before the earthquakes, etc. The corpus is stored in LaBB-CAT and run through HTK forced alignment (Fromont and Hay, 2008). LaBB-CAT also interfaces with Praat. **Filtering details** are given as an additional slide at the end of the deck if anyone is interested. --- # GAMM: Amplitude and F1 -- .center[.full-image[ ![](data:image/png;base64,#images/figure_3.png) ]] ??? To further confirm our claim that amplitude drives systematic covariation of F1, we fit a GAMM with: - Response: F1 (non-scaled) - Predictors: - Amplitude (scaled by speaker), - Articulation rate (same), - Pitch (same), - Time through monologue (on the scale [0, 1]) - Random effects: - Speaker: fits the mean F1 for the speaker. - Vowel by Speaker: fits the mean F1 for each vowel for a given speaker. - **i.e.**: the random effects capture each speaker's mean vowel space. - Each predictor has a separate smooth for each vowel. All variables come up as significant in model comparison significance tests, but amplitude has the strongest effect. If desired, significance test results are given as an additional slide. The fact that some vowels are discernible at one amplitude but not at another already indicates that this effect will have an interesting impact on vowel spaces. --- # Amplitude and the Vowel Space -- .full-image[.center[ ![](data:image/png;base64,#images/amp-art.png) ]] ??? Here we're looking at the effect in the vowel space as a whole. - We fit an equivalent model of F2 and plot model predictions from both models. - Articulation rate is known to contract the vowel space and is often controlled for. - But relative amplitude has a larger effect! - Note that <span style="font-variant:small-caps;">goose</span> and <span style="font-variant:small-caps;">thought</span> F2 are also involved. In an earlier phase of this research, we thought we were picking up expansion and contraction of speaker vowel spaces over the course of monologues. **Addition at end:** embedded vowel space Shiny app for exploration of different intervals. --- # Consequences -- 1. Natural amplitude variation in quiet environments correlates with F1. - Previously shown in lab environments and with the introduction of background noise. - Amplitude variation in natural speech is _larger_ than in controlled environments. -- 2. Challenges for the investigation of within-speaker stylistic covariation. - Amplitude variation needs to be controlled to find other sources of covariation.
- Reported claims of F1 covariation which don't include amplitude are challenged. -- 3. Challenges for the investigation of across-speaker covariation. - Handling differences in amplitude when comparing speakers is non-trivial! - A minimal step: include a measure of amplitude in regression models. ??? Regarding 1: 24 dB variation in our data vs. 10 dB variation in a quiet vs. loud environment study (Liénard & Di Benedetto 1999). Regarding 2: Priming work by Villarreal & Clark mentioned as a study which might have benefited from including amplitude. Regarding 3: Differences in recording equipment make this very difficult. No objective measure is forthcoming for most of the data we work with. The proposal to include _relative_ amplitude is also insufficient if speakers have been recorded entirely within a high or low amplitude span of speech. This creates serious problems for comparing speakers with one another. --- # Amplitude and Topical Structure .pull-left[ ![](data:image/png;base64,#images/topic-ex.png) ] -- .pull-right[ **Further question**: Do speakers use amplitude to do structural work within narrative recordings? {{content}} ] -- **We find**: Amplitude is increased at the start of topical segments within monologues and reduced at the end. {{content}} -- **How?**: Using QuakeBox topic tags, we: - Divide topics into 'start', 'middle', and 'end'. - Model amplitude by part of topic and time through monologue. {{content}} -- **A problem**: Amplitude drops over the course of a monologue. {{content}} ??? So amplitude variation is important. It's still an open question whether it is being used by speakers, intentionally or unintentionally, to do structural work in narrative monologues. Our data gives us some purchase on this question in the form of a series of topic tags. We find some evidence that the answer to our question here is 'yes'. We find that there is a statistically significant increase in amplitude at the start of topical segments and a statistically significant reduction at the end. In order to do this, we ignore the specific content of the topics in our corpus (these include, e.g., which earthquake is being discussed, or which of a series of other common topics in the corpus are being covered). The figure shows four example topics from an example speaker. We can see clear differences in the length of topics and their density. But, I would also say, some evidence (especially in the top panels) of the phenomenon we're claiming. The model I'll report to you today is very simple: we divide topics into 'start', 'middle', and 'end', and use this, plus time-through-monologue and by-speaker and by-topic random effects, to model amplitude (one way to implement the three-way division is sketched at the end of these notes). One problem we run into: amplitude reliably drops over the course of a monologue and, by definition, the start of a topic comes before the end in the monologue. So we need to be sure that our apparent increase at the start and decrease at the end of a topic is not an artefact of this global reduction in amplitude. We'll look at a method for avoiding this which compares our results with those generated from random, non-topical, divisions in the monologues.
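A minimal sketch of one way the three-way topic division could be implemented (the exact criterion isn't specified in the talk; the data frame `qb_words` and the columns `token_time`, `topic_start_time`, and `topic_end_time` are hypothetical names used for illustration):

```
# Hypothetical sketch, not the analysis code: label each token as coming from
# the 'start', 'middle', or 'end' third of its topic, by time within the topic.
qb_words$topic_part <- cut(
  (qb_words$token_time - qb_words$topic_start_time) /
    (qb_words$topic_end_time - qb_words$topic_start_time),
  breaks = c(0, 1 / 3, 2 / 3, 1),
  labels = c("start", "middle", "end"),
  include.lowest = TRUE
)
```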
But, before that, a specific example: --- # Example .center[ ![](data:image/png;base64,#images/example_speaker_topic_amp_2.png) ] <audio controls><source src="data:image/png;base64,#audio/example_1.wav" type="audio/wav"/></audio> --- # Results -- - Linear mixed model structure: `speaker_scaled_amp_max ~ topic_part + speaker_scaled_time + (0 + speaker_scaled_time | Speaker) + (0 + topic_part | speaker_topic)` - t-values of magnitude > 2 indicate 'statistical significance' (in a looser, exploratory sense). -- | Variable | Estimate | Std. Error | t Value | |----------|----------|-------|---------| |topic_start| 0.06293 | 0.01437| 4.378| |topic_middle| -0.02741 | 0.01644 | -1.668| |topic_end| -0.11829 | 0.01889 | -6.263| |scaled_time| -0.27295| 0.05683| -4.803| -- Significant *rise* in amplitude at the start, and *reduction* at the end. Also a significant reduction over the course of the monologue. -- Let's visualise: ??? This table shows the results of a linear mixed model with a simple structure. The predictors are where in the topic a token comes from (start, middle, or end), and where in the monologue the token is (scaled from -0.5 to 0.5). The model fits a random slope for each speaker's monologue --- i.e. the extent to which amplitude drops over time is allowed to vary for each speaker. This is sensible, given that the monologues can be of quite different lengths. Each topic in the corpus is also given a random slope, in the sense that the 'start', 'middle', and 'end' levels for each topic are allowed to vary. The results are exactly as stated above: an increase at the start and a reduction at the end. We take t-values of magnitude greater than 2 to indicate 'statistical significance' here. This is definitely an exploratory project, so we're in the 'expressions of surprise' interpretation of t-values and p-values. (A sketch of the fitting call, and of the fake-topic construction used below, is given as an additional slide at the end of the deck.) We can look at the predictions for each topic in the corpus: --- # Results .full-image[![](data:image/png;base64,#images/mod_preds.png)] ??? Here we see model predictions for each topic, assuming we are in the middle of a monologue. We see a lot of variation between topics, with a stable shift in the central tendency of each part. The start and middle are definitely sitting above 0, with the end sitting below. --- # Fake Topic Test -- The problem (again): if amplitude decreases over the course of a monologue, the apparent topic-part effect might have nothing to do with topic. -- Solution: 1. Create fake topical segments from monologues: - Copy the distribution of topic lengths from real speakers. - Choose the start time of each fake topic randomly (with the restriction that there is enough room left in the monologue). 2. Repeat the analysis and compare results with those obtained from actual topics. -- We repeat this process 1000 times to estimate the results we would expect in the absence of an effect of topic part (start, middle, or end) on amplitude. --- # Fake Topics: Coefficient Estimates .full-image[![](data:image/png;base64,#images/coefs.png)] ??? Nothing _too_ surprising here. But the next slide is good: --- # Fake Topics: t Values .full-image[![](data:image/png;base64,#images/t_values.png)] ??? - This is important, because it shows that 'significant' t-values (outside [-2, 2]) are often obtained from random segments of monologues. But our model is much more confident that there is something happening in the real data than in the fake topics. --- # Key points - Amplitude variation has a strong, uncontrolled-for influence on formant readings. - Variation in amplitude is not random.
- In our data, the beginning of topics is marked by increased amplitude and the end by decreased amplitude. - In, e.g., Local (2007), amplitude variation in 'so' (standalone vs. trailing) is linked to turn-taking role. - Amplitude variation should be taken into account in sociophonetics. - This is not always easy. - In within-speaker contexts, adding a measure of relative amplitude to models is valuable. ??? As noted at the start, this is somewhat of a 'B-side' to the overall project. But, we think, it opens out to some important questions about the sociophonetics of amplitude variation. --- class: title-slide # Thank you! --- # Additional: GAMM structure

```
gamm_fit <- bam(
  F1_50 ~ participant_gender + Vowel +
    # a separate smooth for each vowel for every continuous predictor
    s(speaker_scaled_time, by=Vowel) +
    s(speaker_scaled_art_rate, by=Vowel) +
    s(speaker_scaled_amp_max, by=Vowel) +
    s(speaker_scaled_pitch, by=Vowel) +
    # random intercepts: by-speaker and by-vowel-within-speaker mean F1
    s(Speaker, bs="re") +
    s(Speaker, Vowel, bs="re"),
  data = qb_vowels,
  method="fREML",
  discrete = TRUE,
  nthreads = 8
)
```

--- # Additional: Filtering .center[.full-image[![](data:image/png;base64,#images/filtering_flow.png)]] --- # Additional: Imputation .medium-image[ ![](data:image/png;base64,#images/missing_interval_rate_plot.png) ] Exclude <span style = "font-variant: small-caps;">foot</span>, then 1. accept 60-second intervals with seven monophthong types (and 240-second intervals with nine). 2. if there is no data for a monophthong, we impute the mean value. 3. if there are only one or two tokens, we add two mean tokens to the interval, then take the interval mean. ??? **NB:** No imputation is required for amplitude or articulation rate. --- # Additional: 240s Intervals .full-image[ ![](data:image/png;base64,#images/var_plot_240.png) ] --- # Additional: PCA Permutation Test .medium-image[ ![](data:image/png;base64,#images/permutation_test.png) ] - Permutation method: scramble the time variable in the original dataset. - Maintains the F1-F2 link _within_ each vowel. - Removes temporal covariation _between_ vowels. - y-axis: amount of variance explained by each PC. - Blue: 1000 permuted datasets -> PCA runs. - Red: real PCA results. --- # Additional: Model Significance - We generate p-values for the GAMM models of F1 by amplitude using a `\(\chi^2\)` model comparison. <div> <table> <thead> <tr> <th style="text-align:left;"> Variable </th> <th style="text-align:left;"> p-value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;">Time</td> <td style="text-align:left;">2.215e-07</td> </tr> <tr> <td style="text-align:left;">Articulation Rate</td> <td style="text-align:left;">< 2e-16</td> </tr> <tr> <td style="text-align:left;">Amplitude</td> <td style="text-align:left;">< 2e-16</td> </tr> <tr> <td style="text-align:left;">Pitch</td> <td style="text-align:left;">< 2e-16</td> </tr> </tbody> </table> </div> --- # Additional: Shiny Interactive <iframe src="https://joshuawilsonblack.shinyapps.io/space_similarity_app/" height="550" width="100%" style="border: 1px solid #464646;" data-external="1"></iframe> --- <!-- ## Amplitude and the Vowel Space --> <!-- - Some further consequences: --> <!-- - Normalisation methods which assume a consistent vowel space for each speaker might be affected by within-speaker amplitude variation. --> <!-- - Sociolinguistic effects might look different depending on amplitude of sample speech. --> <!-- - Control is difficult: --> <!-- - In almost all cases, true amplitude can't be retrieved from signal. --> <!-- - Even if true amplitude known, difference in effect on each vowel is also a challenge.
--> <!-- ## Amplitude and Topical Structure --> <!-- <div class="columns-2"> --> <!-- <p class='forceBreak'></p> --> <!-- - Do amplitude changes correspond to structural features of monologues? --> <!-- - QuakeBox allows us to explore this for topical spans within monologues. --> <!-- - We take topical sections on monologues ignoring their specific content. --> <!-- - We divide each topic into three: beginning, middle, and end. --> <!-- - Topics with less than five tokens in any part are removed. --> <!-- </div> --> <!-- ## Amplitude and Topic Model {.build} --> <!-- - We fit a linear mixed model. --> <!-- - Amplitude as dependent variable. --> <!-- - Part of topic, and time through monologue as fixed effects, --> <!-- - Random slope for each part of each topic and on time through monologue for each speaker. --> <!-- - Results: --> <!-- <div> --> <!-- <table> --> <!-- <thead> --> <!-- <tr> --> <!-- <th style="text-align:left;"> --> <!-- Variable --> <!-- </th> --> <!-- <th style="text-align:left;"> --> <!-- Estimate --> <!-- </th> --> <!-- <th style="text-align:left;"> --> <!-- t-value --> <!-- </th> --> <!-- </tr> --> <!-- </thead> --> <!-- <tbody> --> <!-- <tr> --> <!-- <td style="text-align:left;">Beginning</td> --> <!-- <td style="text-align:left;">0.034</td> --> <!-- <td style="text-align:left;">2.142</td> --> <!-- </tr> --> <!-- <tr> --> <!-- <td style="text-align:left;">Middle</td> --> <!-- <td style="text-align:left;">-0.022</td> --> <!-- <td style="text-align:left;">-1.154</td> --> <!-- </tr> --> <!-- <tr> --> <!-- <td style="text-align:left;">End</td> --> <!-- <td style="text-align:left;">-0.116</td> --> <!-- <td style="text-align:left;">-5.283</td> --> <!-- </tr> --> <!-- <tr> --> <!-- <td style="text-align:left;">Time through monologue</td> --> <!-- <td style="text-align:left;">-0.273</td> --> <!-- <td style="text-align:left;">-4.150</td> --> <!-- </tr> --> <!-- </tbody> --> <!-- </table> --> <!-- </div> --> <!-- - We treat t-values of magnitude greater than two as surprising. --> <!-- - A small increase in amplitude at the start and a decrease at the end of topics. --> <!-- ## Amplitude and Topic Model --> <!-- - Problem: it's hard to disentangle this from amplitude reduction over the course of a monologue. --> <!-- - This is true even when we factor in time through monologue as a variable. --> <!-- - A (partial) solution: another permutation test! --> <!-- - For each speaker, collect topic lengths. --> <!-- - Collect random span of speaker monologue of each topic length. --> <!-- - Repeat model collecting t-values and coefficients. --> <!-- --- --> <!-- <div class = "fullslide-nohead"> --> <!-- ![](images/t_vals.png) --> <!-- </div> --> <!-- --- --> <!-- <div class = "fullslide-nohead"> --> <!-- ![](data:image/png;base64,#images/coefs.png) --> <!-- </div> --> <!-- ## Amplitude and Topic Model --> <!-- - Results are consistent with --> <!-- - Decrease in amplitude at end of topic over and above decrease over monologue. --> <!-- - Increase in amplitude at start of topic over and above decrease over monologue. --> <!-- # Summary --> <!-- ## Summary --> <!-- - **Analysis** --> <!-- - An attempt to extend Brand et al. (2021) to within speaker covariation of monophthongs. --> <!-- - Data representation difficult: --> <!-- - variation in monophthong frequency. --> <!-- - variation in monologue length. --> <!-- - **Finding** --> <!-- - F1s pattern together. --> <!-- - **Explanation** --> <!-- - In line with Lombard literature: F1 increase with vocal effort. 
--> <!-- - We find a strong effect of amplitude on F1. --> <!-- - A consistent increase for all monophthongs but KIT. --> <!-- - This effect is of greater magnitude than effect of articulation rate. --> <!-- - **Consequences** --> <!-- - Vowel space effects of amplitude uncontrolled for and affect speaker clustering. --> <!-- - Amplitude changes seem to mark topical structures within monologues. --> <!-- ## Bibliography -->
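---

# Additional: Topic Model Code (sketch)

A minimal sketch of the topic-part model and the fake-topic construction, assuming lme4. This is not the exact analysis code: the data frame name `qb_words`, the helper `make_fake_topics()`, and its arguments are placeholders for illustration; the model formula is as reported on the Results slide.

```
library(lme4)

# Topic-part model: amplitude predicted by part of topic and time through the
# monologue, with a by-speaker slope for time and a by-topic slope for topic part.
topic_fit <- lmer(
  speaker_scaled_amp_max ~ topic_part + speaker_scaled_time +
    (0 + speaker_scaled_time | Speaker) +
    (0 + topic_part | speaker_topic),
  data = qb_words
)
summary(topic_fit)  # |t| > 2 read as 'surprising' in this exploratory setting

# Fake-topic test: reuse a speaker's real topic durations, but place each fake
# topic at a random start time that still fits inside the monologue.
make_fake_topics <- function(topic_durations, monologue_length) {
  starts <- runif(
    length(topic_durations),
    min = 0,
    max = monologue_length - topic_durations
  )
  data.frame(start = starts, end = starts + topic_durations)
}
# Re-divide the fake topics into start/middle/end, refit the model, and repeat
# 1000 times; compare the resulting coefficients and t values with the real topics.
```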