Pitch cues to hierarchical metric structure in children’s poetry

We investigated whether speakers use pitch to signal hierarchical metric structure in productions of Dr. Seuss’s The Cat in the Hat, by modeling fundamental frequency (F0) of monosyllabic words as a function of metric strength and a set of control parameters. We modeled maximum F0 of ~25,000 words in a corpus of book productions from 17 speakers, comparing a 3-level musical metric model and a 5-level linguistic metric model. Results demonstrate that speakers consistently realized two levels of musical metric strength, as words corresponding with downbeats were produced with higher maximum F0 than words on all other beats. In addition, speakers simultaneously realized three levels of linguistic metric strength, as maximum F0 decreased linearly across the three highest linguistic metric levels. These results are consistent with previous work in both prose speech production and Western music composition, demonstrating that poetic speech uses pitch variation in ways that are consistent with both music and speech, and they complement prior demonstrations that duration and intensity variation signal musical and linguistic metric structure in the same corpus.


Introduction
Children's poetry shares many structural features with music (Lerdahl, 2001), but there has been little empirical investigation of how these features are realized in production. This project assesses how speakers use pitch variation to signal both musical and linguistic metric structure in children's poetry by analyzing productions of The Cat in the Hat (Dr. Seuss, 1957), a quintessential example of children's poetry with regular hierarchical metric structure.
Prior work demonstrates that music performers signal hierarchical metric structure with duration (Palmer, 1996; Todd, 1985) and intensity variation (Drake & Palmer, 1993). Our previous work demonstrates that, in productions of The Cat in the Hat, speakers systematically signal hierarchical musical metric structure with intensity variation (Fitzroy & Breen, 2020) and hierarchical linguistic metric structure with duration variation (Breen, 2018). The goal of the current study is to investigate how a third acoustic variable, pitch, is manipulated by speakers to signal hierarchical musical and linguistic metric structure in productions of The Cat in the Hat.
There is considerable evidence from psycholinguistics that speakers signal metric structure with pitch variation. Pitch accents, which are aligned with metrically strong syllables, are generally signaled by a local increase in pitch (Breen et al., 2010). Music cognition makes similar claims about how metric structure is realized in Western music; the Generative Theory of Tonal Music predicts that metrically strong positions are more likely to correspond with pitch excursions (Lerdahl & Jackendoff, 1983). Moreover, analyses of Western music reveal that pitch accents frequently occur at metrically strong positions (Huron & Royal, 1996). In addition, musical phrase boundaries, which typically coincide with metric unit boundaries (Temperley, 2003), are often signaled with falling pitch (Huron, 2006) and, in Western music, with "late phrase compression," in which pitch interval size tends to decline toward the end of a phrase (Shanahan & Huron, 2011).
The current paper investigates how speakers use pitch variation to signal both musical and linguistic hierarchical metric structure in productions of child-directed poetic speech. To do this, we modeled the pitch of each word using two hierarchical models of metric structure (Figure 1): a 3-level musical metric model based on a 6/8 measure structure, and a 5-level linguistic metric model based on cross-linguistic metrical poetry (e.g., Fabb & Halle, 2008). Based on prior work in music and speech production, we predict that speakers will signal metrically strong syllables with higher pitch than metrically weak syllables, and that speakers will signal ends of metric units with decreases in pitch.

Participants
In the current study, we analyzed productions by 17 female native speakers of American English from The Cat in the Hat corpus (Breen, 2018).

Stimuli
Participants read aloud from a hardcover copy of The Cat in the Hat (Dr. Seuss, 1957), a 61-page illustrated children's book written in rhyming anapestic tetrameter that is widely read by English-speaking caregivers to children aged 0-3 years (Hudson Kam & Matthewson, 2017). The book consists of 1625 words (1576 monosyllabic words, 236 unique lexemes) organized primarily into 70 stanzas; each stanza contains two lines of four anapests each (as in (1)). The first line in each stanza ends with a rhyme prime and the second line ends with a phonologically predictable rhyme target.
(1) "Put me down!" said the fish. "This is no fun at all!
Put me down!" said the fish. "I do NOT wish to fall!"

Acoustic Measures
Word and silence boundaries (Figure 2) were identified by automatic forced alignment of the audio productions with the text in Praat (Boersma & Weenink, 2018) using the Prosodylab-Aligner (Gorman et al., 2011), then manually adjusted as needed. F0 values were identified using Praat's autocorrelation algorithm. The maximum F0 value of each word was defined as the parabolically interpolated maximum pitch (Figure 2). Multisyllabic words were excluded from analysis, because unstressed syllables have reduced pitch for reasons unrelated to metric structure (Fry, 1958). Limiting investigation to one-syllable words resulted in exclusion of 16 of 236 unique lexemes (49/1625 words). Disfluent and incorrect word productions were also excluded, resulting in the exclusion of 473 of 26,792 possible monosyllabic word productions (1.77%). The remaining maximum F0 values of monosyllabic words were centered and scaled to standard deviation units (i.e., converted to z-scores) separately within each participant.
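The within-participant normalization described above can be illustrated with a minimal Python sketch. This is not the authors' analysis code: the function name, the record layout, and the example F0 values are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not specify which estimator was used.

```python
from collections import defaultdict
from statistics import mean, stdev


def zscore_within_speaker(records):
    """Convert maximum F0 values to z-scores separately for each speaker.

    records: list of (speaker_id, max_f0_hz) tuples (hypothetical layout).
    Returns z-scores in the same order as the input records.
    Assumes the sample standard deviation (n - 1); the source does not say.
    """
    # Group F0 values by speaker so each speaker gets their own mean and SD.
    by_speaker = defaultdict(list)
    for speaker, f0 in records:
        by_speaker[speaker].append(f0)

    stats = {s: (mean(v), stdev(v)) for s, v in by_speaker.items()}

    # Center on the speaker's mean and scale by the speaker's SD.
    return [(f0 - stats[s][0]) / stats[s][1] for s, f0 in records]


# Made-up values for two speakers with different pitch ranges.
example = [
    ("s1", 200.0), ("s1", 220.0), ("s1", 240.0),
    ("s2", 180.0), ("s2", 190.0), ("s2", 200.0),
]
z_scores = zscore_within_speaker(example)
```

Normalizing within speaker removes differences in overall pitch level and range between participants, so that a given z-score reflects a word's pitch relative to that speaker's own distribution.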

Text Annotation
All words were first annotated for a set of control factors: a) number of phonemes (M = 2.98, SD = 0.8), b) word class (542 open-class, 1034 closed-class), c) log lexical frequency (M = 5.82, SD = 2.15), d) syntactic dependency structure, e) text emphasis (26 words in SMALL CAPS), and f) intra-stanza repetition. Next, metric structure was annotated in two ways (Figure 1): with a 3-level musical metric structure based on 6/8 musical meter, where performers signal greatest prominence on beat 1, intermediate prominence on beat 4, and lowest prominence on beats 2, 3, 5, and 6 (Drake & Palmer, 1993); and a linguistic metric structure where metric feet are iteratively grouped in pairs, creating a 5-level hierarchical structure (Fabb & Halle, 2008).
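As a minimal sketch of the musical annotation only, the mapping from beat position in a 6/8 measure to musical metric strength can be written as a lookup table. The numbering (3 = strongest, 1 = weakest) follows the level labels used in the Analysis and Results; the function and variable names are hypothetical, and the 5-level linguistic structure is not sketched here because its assignment rules are not spelled out in the text.

```python
# Musical metric strength by beat position in a 6/8 measure:
# beat 1 is strongest (level 3), beat 4 is intermediate (level 2),
# and beats 2, 3, 5, and 6 are weakest (level 1).
MUSICAL_STRENGTH_BY_BEAT = {1: 3, 2: 1, 3: 1, 4: 2, 5: 1, 6: 1}


def musical_metric_strength(beat):
    """Return the musical metric strength level (1-3) for a beat (1-6)."""
    if beat not in MUSICAL_STRENGTH_BY_BEAT:
        raise ValueError(f"beat must be an integer from 1 to 6, got {beat!r}")
    return MUSICAL_STRENGTH_BY_BEAT[beat]


# Strength levels for one full measure, beats 1 through 6.
measure_strengths = [musical_metric_strength(b) for b in range(1, 7)]
```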

Analysis
Using linear mixed-effects regression, we fit a model of maximum F0 using the control factors. The data were fit on a word-by-word basis. The fully saturated model included all control fixed effects, a random effect of speaker, and random slopes over speaker for each fixed effect. This model did not converge, so we iteratively removed the random slopes accounting for the least variance and refit the model until it converged. Fixed effects were then individually removed and the simpler model was compared to the more complex model using a likelihood ratio test (Baayen et al., 2008). Factors accounting for significantly more variance in the more complex model remained in the final control model. We then fit an experimental model by adding fixed effects corresponding to the predictions of musical meter and linguistic meter. We added musical metric strength as both a fixed effect and random slope over participant, coded using simple contrast coding with metric strength level 2 (intermediate) as the reference level. We added linguistic metric strength as both a fixed effect and random slope over participant, coded using backward difference coding where each higher level is contrasted with the level below.
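To make the two coding schemes concrete, the following Python sketch builds the contrast matrices using their standard textbook definitions. It is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, and the reconstruction helper assumes the intercept equals the unweighted mean of the level means, which holds for these codings when working with cell means but ignores the random-effects structure of the actual model.

```python
from fractions import Fraction


def simple_coding(k, reference):
    """Simple contrast coding for k levels (levels numbered from 1).

    Each column compares one non-reference level with the reference level;
    entries are treatment (0/1) codes shifted by -1/k.
    """
    columns = [lvl for lvl in range(1, k + 1) if lvl != reference]
    return [
        [(Fraction(1) if row == col else Fraction(0)) - Fraction(1, k)
         for col in columns]
        for row in range(1, k + 1)
    ]


def backward_difference_coding(k):
    """Backward difference coding: column j compares level j + 1 with level j."""
    return [
        [Fraction(j, k) if row > j else Fraction(j - k, k)
         for j in range(1, k)]
        for row in range(1, k + 1)
    ]


def cell_means(grand_mean, contrasts, coefs):
    """Reconstruct level means from an intercept and contrast coefficients."""
    return [
        grand_mean + sum(c * b for c, b in zip(row, coefs))
        for row in contrasts
    ]


# 3-level musical strength with level 2 as reference:
# coefficients estimate (level 1 - level 2) and (level 3 - level 2).
musical_contrasts = simple_coding(3, reference=2)

# 5-level linguistic strength: coefficients estimate
# (level 2 - level 1), (level 3 - level 2), (level 4 - level 3), (level 5 - level 4).
linguistic_contrasts = backward_difference_coding(5)
```

Under both schemes, each coefficient is a difference between two level means, which is why a significant coefficient in the final model can be read directly as a pitch difference between a specific pair of metric strength levels.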

Results
Normalized maximum F0 is shown for words at each level of the musical metric hierarchy in Figure 3, and at each level of the linguistic metric hierarchy in Figure 4. The final model parameters appear in Table 1. Results demonstrate that speakers consistently signal two levels of musical metric strength with pitch: words aligned with metric strength level 3 (beat 1 in a 6/8 measure structure) are produced with higher pitch than words aligned with the other levels. Although words aligned with metric strength level 2 were produced with numerically higher F0 than words aligned with level 1, this contrast did not reach significance in the model when control parameters were included. Results also demonstrate that speakers use pitch to signal the end of large linguistic metric constituents, as evidenced by significant pitch decreases from metric strength level 3 to level 4, and from level 4 to level 5. Control parameters indicate that speakers use higher pitch to cue open-class words (as opposed to closed-class words), lower frequency words, words written in SMALL CAPS, and the first mention of words (as opposed to repeated mentions).

Discussion
The current study was designed to investigate the signaling of metric structure through pitch variation in child-directed productions of highly metric poetry. Results demonstrate that speakers use pitch variation to signal metric structure in poetic production in multiple ways, consistent with work in both music performance and prose speech production. Specifically, readers cue metrically strong syllables (i.e., beat 1 in a 6/8 musical meter) with higher pitch than metrically weak syllables (all other beats). In addition, speakers cue three hierarchical levels of metric units by decreasing pitch. Although it has long been observed that poetry shares features with both music and language (Lerdahl, 2001), there have been few empirical investigations of this claim. The current study provides such empirical support by demonstrating that models of both musical and linguistic metric structure simultaneously account for pitch variation in child-directed productions of poetry. Interestingly, this hybrid realization of musical and linguistic metric structure in pitch differs from the patterns we have previously demonstrated for word intensity and duration in this corpus, which predominantly signal musical and linguistic metric structure, respectively.
For musical metric structure, pitch signaled a two-level metric hierarchy with sole accent on the downbeat (strength level 3). This contrasts with our previous investigations of both intensity, which signaled a three-level hierarchy with primary accent on the downbeat and secondary accent on strength level 2 (Fitzroy & Breen, 2020), and word duration, which signaled a three-level hierarchy with primary accent on strength level 2 and secondary accent on the downbeat (Fitzroy & Breen, 2018). The realization of downbeat accent in pitch suggests that musical metric structure is signaled via pitch variation, as it is via intensity variation. However, the realization of only two levels of musical metric strength in pitch indicates that this aspect of metric structure is realized with lower fidelity in pitch than in intensity.
For linguistic metric structure, pitch marked differences between the three highest strength levels. This clearly contrasts with our previous investigation of duration, which unambiguously signaled five levels of linguistic metric strength (Breen, 2018), and differs somewhat from our previous findings for intensity, which marked differences between strength levels 2 and 3, and between strength levels 4 and 5 (Fitzroy & Breen, 2018). The similarity of linguistic metric structure realization in pitch and intensity is consistent with prior findings that physical aspects of these prosodic channels lead them to correlate somewhat in speech production (e.g., Gramming et al., 1988). However, the clearer distinction between linguistic metric levels in pitch than in intensity suggests that linguistic metric structure is realized more clearly in pitch. Taken together, our present pitch results and prior investigations of metric realization in word intensity and duration in this same corpus demonstrate that metric structure is realized in poetic production in a manner that reflects both musical and linguistic features of poetry, but that the balance of these features differs across prosodic channels.
The realization of metric structure through pitch and other acoustic cues provides important temporal information to child listeners of poetic texts like The Cat in the Hat. Specifically, listeners can use the metric structure to generate temporal expectations about upcoming information. For example, event-related potential studies demonstrate that listeners use metric structure, cued in part by pitch variation, to direct their attention to metrically strong events in both music (Fitzroy & Sanders, 2015) and speech (Breen et al., 2014). Child listeners who hear metrically regular poetry like The Cat in the Hat with clear prosodic cues to metric structure can direct their attention to metrically strong moments, facilitating phonological learning.

Conclusion
The current study demonstrates that speakers of child-directed poetic text use pitch variation to consistently signal two levels of musical metric structure and three levels of linguistic metric structure. These results are consistent with previous work in both prose speech production and Western music composition. Moreover, these results complement prior demonstrations that duration and intensity variation signal musical and linguistic metric structure in the same corpus, while providing further evidence that metric structure is realized differently across these prosodic channels.