class: top, title-slide .title[ # Topical structure, amplitude variation, and F1 in single-speaker narrative recordings ] .author[ ### Joshua Wilson Black, Jen Hay, Lynn Clark, James Brand ] .date[ ### 8 December 2022 ] --- # The Team This research was carried out by **Joshua Wilson Black**, **Jen Hay**, **Lynn Clark**, and **James Brand**. This work is part of the project _Towards an Improved Theory of Language Change_, based at the New Zealand Institute of Language, Brain and Behaviour | Te Kāhui Roro Reo at the University of Canterbury | Te Whare Wānanga o Waitaha. We gratefully acknowledge the support of the Marsden Fund | Te Pūtea Rangahau a Marsden. .left[![](data:image/png;base64,#images/Marsden-logo-RGB-150.png)] _Application number: (17-UOC-049)_ ??? Contributions: Joshua Wilson Black: primary analysis, write-up. Jen Hay: research direction, write-up. Lynn Clark: research direction. James Brand: pilot work on interval representation of data. --- # The Big Picture Our **problem**: Research on style _predicts_ systematic covariation of linguistic variables as a result of shifts in speaker style (e.g. Podesva 2008), and _invites_ the development of quantitative methods to explore any such covariation (e.g. Tamminga 2021). -- Our **method**: To extend Brand et al.'s [(2021)](https://www.sciencedirect.com/science/article/pii/S0095447021000711#ak005) use of PCA from _across_-speaker variation to _within_-speaker variation. -- Our **results**: 1. Relative amplitude has a significant and under-appreciated impact on speaker vowel spaces. - Revealed by covariation in first formant values. - Known in experimental environments with the introduction of loud background noise, not known for naturalistic recordings in a quiet environment. 2. Amplitude is used, potentially agentively, to mark position within topical subsections of single-speaker narrative recordings. ??? The literature on style both _predicts_ covariation of linguistic variables and _invites_ methodological developments so that we can get a quantitative handle on such covariation. Evaluating the predictions depends on the development of appropriate methods. Previous work on the project (by all authors but Wilson Black) showed that Principal Component Analysis is an appropriate method for finding vowel covariation across speakers. We attempt to extend a similar method to explore within-speaker covariation. We did not find stylistic covariation. However, we did find a surprising and, we argue, overlooked source of covariation in amplitude. We find that first formant values pattern together with amplitude. While this is known in certain laboratory environments, and is particularly associated with loud speech and/or distant interlocutors, it has not been shown in the kind of naturalistic recordings which sociolinguists are often interested in. Second, we found that variation in amplitude is being used, possibly on purpose, by speakers to mark their position in sub-topic segments of our recordings. This is a potential link with the literature on, say, turn-taking, where amplitude can be used to signal whether or not one has finished speaking in a dialogue. **Mismatch:** We didn't solve our problem, but our results are of interest nonetheless! Today we focus on the second point, and we don't talk about the specific **PCA** aspects. Instead, we'll go straight to a series of GAMM models to show the influence of amplitude on the F1 of NZE monophthongs.
--- # Our Data: the QuakeBox Corpus .pull-left[ ![](data:image/png;base64,#images/QuakeBox.jpg) <figcaption>Source: Waynne Williams (Port Hills Productions)</figcaption> ] -- .pull-right[ - 431 single-speaker monologues - Prompt: "tell us your earthquake story". - 277 speakers with topic tagging. - Stored as a LaBB-CAT instance, with forced alignment via HTK. {{content}} ] -- - We extract (via the LaBB-CAT interface to Praat): {{content}} -- - F1 and F2 for all monophthongs at the midpoint of the _token_, {{content}} -- - Maximum amplitude of the _word_ in which the monophthong token appears, {{content}} -- - Articulation rate of the _utterance_ in which the monophthong token appears. {{content}} ??? Our data comes from the QuakeBox corpus (Clark et al., 2016). The corpus has 431 speakers who freely respond to the prompt 'tell us your earthquake story'. The recordings are made with high-quality audio and video equipment. The corpus contains topic tagging for a subset of speakers. These tags indicate whether the speaker is talking about, say, the September 2010 Earthquake, or the larger February 2011 Earthquake, or their life before the earthquakes, etc. The corpus is stored in LaBB-CAT and run through HTK forced alignment (Fromont and Hay, 2008). LaBB-CAT also interfaces with Praat. **Filtering details** are given as an additional slide at the end of the deck if anyone is interested. --- # GAMM: Amplitude and F1 -- .center[.full-image[ ![](data:image/png;base64,#images/figure_3.png) ]] ??? To further confirm our claim that amplitude drives systematic covariation of F1, we fit a GAMM with: - Response: F1 (non-scaled) - Predictors: - Amplitude (scaled by speaker), - Articulation rate (same), - Pitch (same), - Time through monologue (on the scale [0, 1]) - Random effects: - Speaker: fits the mean F1 for the speaker. - Vowel by Speaker: fits the mean F1 for each vowel for a given speaker. - **i.e.**: the random effects capture each speaker's mean vowel space. - Each predictor has a separate smooth for each vowel. All variables come up as significant in model comparison significance tests, but amplitude has the strongest effect. If desired, significance test results are given as an additional slide. The fact that some vowels are discernible at one amplitude but not at another already indicates that this effect will have an interesting impact on vowel spaces. --- # Amplitude and the Vowel Space -- .full-image[.center[ ![](data:image/png;base64,#images/amp-art.png) ]] ??? Here we're looking at the effect in the vowel space as a whole. - We fit an equivalent model of F2 and plot model predictions from both models. - Articulation rate is known to contract the vowel space and is often controlled for. - But relative amplitude has a larger effect! - Note that <span style="font-variant:small-caps;">goose</span> and <span style="font-variant:small-caps;">thought</span> F2 are also involved. In an earlier phase of this research, we thought we were picking up expansion and contraction of speaker vowel spaces over the course of monologues. **Addition at end:** embedded vowel space Shiny app for exploration of different intervals. --- # Consequences -- 1. Natural amplitude variation in quiet environments correlates with F1. - Previously shown in lab environments and with the introduction of background noise. - Amplitude variation in natural speech is _larger_ than in controlled environments. -- 2. Challenges for the investigation of within-speaker stylistic covariation. - Amplitude variation needs to be controlled to find other sources of covariation.
- Reported claims of F1 covariation which don't include amplitude are challenged. -- 3. Challenges for the investigation of across-speaker covariation. - Handling differences in amplitude when comparing speakers is non-trivial! - A minimal step: include a measure of amplitude in regression models. ??? Regarding 1: 24 dB variation in our data vs. 10 dB variation in a quiet vs. loud environment study (Liénard & Di Benedetto 1999). Regarding 2: Priming work by Villarreal & Clark mentioned as a study which might have benefited from including amplitude. Regarding 3: Differences in recording equipment make this very difficult. No objective measure is forthcoming for most of the data we work with. The proposal to include _relative_ amplitude is also insufficient if speakers have been recorded entirely within a high or low amplitude span of speech. This creates serious problems for comparing speakers with one another. --- # Amplitude and Topical Structure .pull-left[ ![](data:image/png;base64,#images/topic-ex.png) ] -- .pull-right[ **Further question**: Do speakers use amplitude to do structural work within narrative recordings? {{content}} ] -- **We find**: Amplitude is increased at the start of topical segments within monologues and reduced at the end. {{content}} -- **How?**: Using QuakeBox topic tags, we: - Divide topics into 'start', 'middle', and 'end'. - Model amplitude by part of topic and time through monologue. {{content}} -- **A problem**: Amplitude drops over the course of a monologue. {{content}} ??? So amplitude variation is important. It's still an open question whether it is being used by speakers, intentionally or unintentionally, to do structural work in narrative monologues. Our data gives us some purchase on this question in the form of a series of topic tags. We find some evidence that the answer to our question here is 'yes'. We find that there is a statistically significant increase in amplitude at the start of topical segments and a statistically significant reduction at the end. In order to do this, we ignore the specific content of the topics in our corpus (these include, e.g., which earthquake is being discussed, or which of a series of other common topics in the corpus are being covered). The figure shows four example topics from an example speaker. We can see clear differences in the length of topics and their density. But, I would also say, some evidence (especially in the top panels) of the phenomenon we're claiming. The model I'll report to you today is very simple: we divide topics into 'start', 'middle', and 'end', and use this, plus time-through-monologue and by-speaker and by-topic random effects, to model amplitude (one way to implement the three-way division is sketched at the end of these notes). One problem we run into: amplitude reliably drops over the course of a monologue and, by definition, the start of a topic comes before the end in the monologue. So we need to be sure that our apparent increase at the start and decrease at the end of a topic is not an artefact of this global reduction in amplitude. We'll look at a method for avoiding this which compares our results with those generated from random, non-topical, divisions in the monologues.
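A minimal sketch of one way the three-way topic division could be implemented (the exact criterion isn't specified in the talk; the data frame `qb_words` and the columns `token_time`, `topic_start_time`, and `topic_end_time` are hypothetical names used for illustration):

```
# Hypothetical sketch, not the analysis code: label each token as coming from
# the 'start', 'middle', or 'end' third of its topic, by time within the topic.
qb_words$topic_part <- cut(
  (qb_words$token_time - qb_words$topic_start_time) /
    (qb_words$topic_end_time - qb_words$topic_start_time),
  breaks = c(0, 1 / 3, 2 / 3, 1),
  labels = c("start", "middle", "end"),
  include.lowest = TRUE
)
```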
But, before that, a specific example: --- # Example .center[ ![](data:image/png;base64,#images/example_speaker_topic_amp_2.png) ] <audio controls><source src="data:image/png;base64,#audio/example_1.wav" type="audio/wav"/></audio> --- # Results -- - Linear mixed model structure: `speaker_scaled_amp_max ~ topic_part + speaker_scaled_time + (0 + speaker_scaled_time | Speaker) + (0 + topic_part | speaker_topic)` - t-values of magnitude > 2 indicate 'statistical significance' (in a looser, exploratory sense). -- | Variable | Estimate | Std. Error | t Value | |----------|----------|-------|---------| |topic_start| 0.06293 | 0.01437| 4.378| |topic_middle| -0.02741 | 0.01644 | -1.668| |topic_end| -0.11829 | 0.01889 | -6.263| |scaled_time| -0.27295| 0.05683| -4.803| -- Significant *rise* in amplitude at the start, and *reduction* at the end. Also a significant reduction over the course of the monologue. -- Let's visualise: ??? This table shows the results of a linear mixed model with a simple structure. The predictors are where in the topic a token comes from (start, middle, or end), and where in the monologue the token is (scaled from -0.5 to 0.5). The model fits a random slope for each speaker's monologue --- i.e. the extent to which amplitude drops over time is allowed to vary for each speaker. This is sensible, given that the monologues can be of quite different lengths. Each topic in the corpus is also given a random slope, in the sense that the 'start', 'middle', and 'end' levels for each topic are allowed to vary. The results are exactly as stated above: an increase at the start and a reduction at the end. We take t-values of magnitude greater than 2 to indicate 'statistical significance' here. This is definitely an exploratory project, so we're in the 'expressions of surprise' interpretation of t-values and p-values. (A sketch of the fitting call, and of the fake-topic construction used below, is given as an additional slide at the end of the deck.) We can look at the predictions for each topic in the corpus: --- # Results .full-image[![](data:image/png;base64,#images/mod_preds.png)] ??? Here we see model predictions for each topic, assuming we are in the middle of a monologue. We see a lot of variation between topics, with a stable shift in the central tendency of each part. The start and middle are definitely sitting above 0, with the end sitting below. --- # Fake Topic Test -- The problem (again): if amplitude decreases over the course of a monologue, the apparent topic-part effect might have nothing to do with topic. -- Solution: 1. Create fake topical segments from monologues: - Copy the distribution of topic lengths from real speakers. - Choose the start time of each fake topic randomly (with the restriction that there is enough room left in the monologue). 2. Repeat the analysis and compare results with those obtained from actual topics. -- We repeat this process 1000 times to estimate the results we would expect in the absence of an effect of topic part (start, middle, or end) on amplitude. --- # Fake Topics: Coefficient Estimates .full-image[![](data:image/png;base64,#images/coefs.png)] ??? Nothing _too_ surprising here. But the next slide is good: --- # Fake Topics: t Values .full-image[![](data:image/png;base64,#images/t_values.png)] ??? - This is important, because it shows that 'significant' t-values (outside [-2, 2]) are often obtained from random segments of monologues. But our model is much more confident that there is something happening in the real data than in the fake topics. --- # Key points - Amplitude variation has a strong, uncontrolled-for influence on formant readings. - Variation in amplitude is not random.
- In our data, the beginning of topics is marked by increased amplitude and the end by decreased amplitude. - In, e.g., Local (2007), amplitude variation in 'so' (standalone vs. trailing) is linked to turn-taking role. - Amplitude variation should be taken into account in sociophonetics. - This is not always easy. - In within-speaker contexts, adding a measure of relative amplitude to models is valuable. ??? As noted at the start, this is somewhat of a 'B-side' to the overall project. But, we think, it opens out to some important questions about the sociophonetics of amplitude variation. --- class: title-slide # Thank you! --- # Additional: GAMM structure

```
gamm_fit <- bam(
  F1_50 ~ participant_gender + Vowel +
    # a separate smooth for each vowel for every continuous predictor
    s(speaker_scaled_time, by=Vowel) +
    s(speaker_scaled_art_rate, by=Vowel) +
    s(speaker_scaled_amp_max, by=Vowel) +
    s(speaker_scaled_pitch, by=Vowel) +
    # random intercepts: by-speaker and by-vowel-within-speaker mean F1
    s(Speaker, bs="re") +
    s(Speaker, Vowel, bs="re"),
  data = qb_vowels,
  method="fREML",
  discrete = TRUE,
  nthreads = 8
)
```

--- # Additional: Filtering .center[.full-image[![](data:image/png;base64,#images/filtering_flow.png)]] --- # Additional: Imputation .medium-image[ ![](data:image/png;base64,#images/missing_interval_rate_plot.png) ] Exclude <span style = "font-variant: small-caps;">foot</span>, then 1. accept 60-second intervals with seven monophthong types (and 240-second intervals with nine). 2. if there is no data for a monophthong, we impute the mean value. 3. if there are only one or two tokens, we add two mean tokens to the interval, then take the interval mean. ??? **NB:** No imputation is required for amplitude or articulation rate. --- # Additional: 240s Intervals .full-image[ ![](data:image/png;base64,#images/var_plot_240.png) ] --- # Additional: PCA Permutation Test .medium-image[ ![](data:image/png;base64,#images/permutation_test.png) ] - Permutation method: scramble the time variable in the original dataset. - Maintains the F1-F2 link _within_ each vowel. - Removes temporal covariation _between_ vowels. - y-axis: amount of variance explained by each PC. - Blue: 1000 permuted datasets -> PCA runs. - Red: real PCA results. --- # Additional: Model Significance - We generate p-values for the GAMM models of F1 by amplitude using a `\(\chi^2\)` model comparison. <div> <table> <thead> <tr> <th style="text-align:left;"> Variable </th> <th style="text-align:left;"> p-value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;">Time</td> <td style="text-align:left;">2.215e-07</td> </tr> <tr> <td style="text-align:left;">Articulation Rate</td> <td style="text-align:left;">< 2e-16</td> </tr> <tr> <td style="text-align:left;">Amplitude</td> <td style="text-align:left;">< 2e-16</td> </tr> <tr> <td style="text-align:left;">Pitch</td> <td style="text-align:left;">< 2e-16</td> </tr> </tbody> </table> </div> --- # Additional: Shiny Interactive <iframe src="https://joshuawilsonblack.shinyapps.io/space_similarity_app/" height="550" width="100%" style="border: 1px solid #464646;" data-external="1"></iframe> --- <!-- ## Amplitude and the Vowel Space --> <!-- - Some further consequences: --> <!-- - Normalisation methods which assume a consistent vowel space for each speaker might be affected by within-speaker amplitude variation. --> <!-- - Sociolinguistic effects might look different depending on amplitude of sample speech. --> <!-- - Control is difficult: --> <!-- - In almost all cases, true amplitude can't be retrieved from signal. --> <!-- - Even if true amplitude known, difference in effect on each vowel is also a challenge.
--> <!-- ## Amplitude and Topical Structure --> <!-- <div class="columns-2"> --> <!-- <p class='forceBreak'></p> --> <!-- - Do amplitude changes correspond to structural features of monologues? --> <!-- - QuakeBox allows us to explore this for topical spans within monologues. --> <!-- - We take topical sections on monologues ignoring their specific content. --> <!-- - We divide each topic into three: beginning, middle, and end. --> <!-- - Topics with less than five tokens in any part are removed. --> <!-- </div> --> <!-- ## Amplitude and Topic Model {.build} --> <!-- - We fit a linear mixed model. --> <!-- - Amplitude as dependent variable. --> <!-- - Part of topic, and time through monologue as fixed effects, --> <!-- - Random slope for each part of each topic and on time through monologue for each speaker. --> <!-- - Results: --> <!-- <div> --> <!-- <table> --> <!-- <thead> --> <!-- <tr> --> <!-- <th style="text-align:left;"> --> <!-- Variable --> <!-- </th> --> <!-- <th style="text-align:left;"> --> <!-- Estimate --> <!-- </th> --> <!-- <th style="text-align:left;"> --> <!-- t-value --> <!-- </th> --> <!-- </tr> --> <!-- </thead> --> <!-- <tbody> --> <!-- <tr> --> <!-- <td style="text-align:left;">Beginning</td> --> <!-- <td style="text-align:left;">0.034</td> --> <!-- <td style="text-align:left;">2.142</td> --> <!-- </tr> --> <!-- <tr> --> <!-- <td style="text-align:left;">Middle</td> --> <!-- <td style="text-align:left;">-0.022</td> --> <!-- <td style="text-align:left;">-1.154</td> --> <!-- </tr> --> <!-- <tr> --> <!-- <td style="text-align:left;">End</td> --> <!-- <td style="text-align:left;">-0.116</td> --> <!-- <td style="text-align:left;">-5.283</td> --> <!-- </tr> --> <!-- <tr> --> <!-- <td style="text-align:left;">Time through monologue</td> --> <!-- <td style="text-align:left;">-0.273</td> --> <!-- <td style="text-align:left;">-4.150</td> --> <!-- </tr> --> <!-- </tbody> --> <!-- </table> --> <!-- </div> --> <!-- - We treat t-values of magnitude greater than two as surprising. --> <!-- - A small increase in amplitude at the start and a decrease at the end of topics. --> <!-- ## Amplitude and Topic Model --> <!-- - Problem: it's hard to disentangle this from amplitude reduction over the course of a monologue. --> <!-- - This is true even when we factor in time through monologue as a variable. --> <!-- - A (partial) solution: another permutation test! --> <!-- - For each speaker, collect topic lengths. --> <!-- - Collect random span of speaker monologue of each topic length. --> <!-- - Repeat model collecting t-values and coefficients. --> <!-- --- --> <!-- <div class = "fullslide-nohead"> --> <!-- ![](images/t_vals.png) --> <!-- </div> --> <!-- --- --> <!-- <div class = "fullslide-nohead"> --> <!-- ![](data:image/png;base64,#images/coefs.png) --> <!-- </div> --> <!-- ## Amplitude and Topic Model --> <!-- - Results are consistent with --> <!-- - Decrease in amplitude at end of topic over and above decrease over monologue. --> <!-- - Increase in amplitude at start of topic over and above decrease over monologue. --> <!-- # Summary --> <!-- ## Summary --> <!-- - **Analysis** --> <!-- - An attempt to extend Brand et al. (2021) to within speaker covariation of monophthongs. --> <!-- - Data representation difficult: --> <!-- - variation in monophthong frequency. --> <!-- - variation in monologue length. --> <!-- - **Finding** --> <!-- - F1s pattern together. --> <!-- - **Explanation** --> <!-- - In line with Lombard literature: F1 increase with vocal effort. 
--> <!-- - We find a strong effect of amplitude on F1. --> <!-- - A consistent increase for all monophthongs but KIT. --> <!-- - This effect is of greater magnitude than effect of articulation rate. --> <!-- - **Consequences** --> <!-- - Vowel space effects of amplitude uncontrolled for and affect speaker clustering. --> <!-- - Amplitude changes seem to mark topical structures within monologues. --> <!-- ## Bibliography -->
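---

# Additional: Topic Model Code (sketch)

A minimal sketch of the topic-part model and the fake-topic construction, assuming lme4. This is not the exact analysis code: the data frame name `qb_words`, the helper `make_fake_topics()`, and its arguments are placeholders for illustration; the model formula is as reported on the Results slide.

```
library(lme4)

# Topic-part model: amplitude predicted by part of topic and time through the
# monologue, with a by-speaker slope for time and a by-topic slope for topic part.
topic_fit <- lmer(
  speaker_scaled_amp_max ~ topic_part + speaker_scaled_time +
    (0 + speaker_scaled_time | Speaker) +
    (0 + topic_part | speaker_topic),
  data = qb_words
)
summary(topic_fit)  # |t| > 2 read as 'surprising' in this exploratory setting

# Fake-topic test: reuse a speaker's real topic durations, but place each fake
# topic at a random start time that still fits inside the monologue.
make_fake_topics <- function(topic_durations, monologue_length) {
  starts <- runif(
    length(topic_durations),
    min = 0,
    max = monologue_length - topic_durations
  )
  data.frame(start = starts, end = starts + topic_durations)
}
# Re-divide the fake topics into start/middle/end, refit the model, and repeat
# 1000 times; compare the resulting coefficients and t values with the real topics.
```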