Audio samples for "Emotion-Coherent Speech Data Augmentation and Self-Supervised Contrastive Style Training for Enhancing Kids's Story Speech Synthesis"

Audio samples

Author: Raymond Chung


Kid-storybook Page-level Storytelling utterances

Ground truth utterances
M1: 1-sentence utterances
M2: 1-sentence utterances + 2-sentence utterances
M4: 1-sentence utterances + 2-sentence utterances of augmented data + contrastive loss
Sample 1
"Build strong, safe houses!" So they packed their bags and waved goodbye. "Watch out for the Big Bad Wolf," called Mother Pig. "We will!"
Sample 2
...they met a man selling straw. "Can I buy some straw?" asked Pinky Pig. "I'm going to build a house."
Sample 3
The next day, the Big Bad Wolf went to the straw house. "Little pig, little pig, let me come in," he called. "No!" cried Pinky Pig. "Not by the hair on my chinny-chin-chin."

Longer example

This one-minute kid storytelling speech was generated by the proposed TTS model (M4) in one step, with each sentence conditioned on its predicted speaking style.

The story is from this page.