SpeechCraft is a large-scale expressive bilingual speech dataset with natural language descriptions produced by an automatic speech annotation system. It encompasses over 2,000,000 audio clips annotated with two versions of text prompts: speech-Descriptions (which exclude the transcript) and speech-Instructions (which include the transcript).

We are planning to open-source SpeechCraft, making it the largest natural language stylistic dataset, encompassing the most fine-grained attributes and the most diverse natural language descriptions available.

New update: The SpeechCraft Dataset is now available here!

Contents in the Demo Webpage

  1. Introducing the SpeechCraft Dataset
    1. Overview: SpeechCraft Dataset
    2. Automatic Speech Annotation System
    3. Constructing Emphasis Speech Data (ref Sec. 4.2)
  2. Experimental Results: Enhancing Speech-Related Tasks with the SpeechCraft Dataset
    1. Expressive Speech Synthesis (ref Sec. 5.1)
    2. Fine-Grained Speech Emphasis Control (ref Sec. 5.2)
    3. Automated Speech Style Captioning (ref Sec. 5.3)

1. Introducing the SpeechCraft Dataset

1.1 Overview: SpeechCraft Dataset

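| Audio | Transcript | speech-Descriptions | speech-Instructions |
| --- | --- | --- | --- |
| (audio) | ‘Come into the water, Marcus’, said Jean peremptorily, as she put her foot against the edge of the raft. | Entertaining us with her storytelling skills, a natural youth female with high pitch and normal volume speaks rapidly, enthralling us. | Entertaining us with her storytelling skills, a natural youth female with high pitch and normal volume speaks rapidly, enthralling us:"Come into the water, Marcus’, said Jean peremptorily, as she put her foot against the edge of the raft." |
| (audio) | Is it not that it is their fashion of investing themselves with importance? | This audiobook features a calm, steady-paced speaking male adult with a low pitch and high volume, reflecting on the style of investing. | "Is it not that it is their fashion of investing themselves with importance?" This audiobook features a calm, steady-paced speaking male adult with a low pitch and high volume, reflecting on the style of investing. |
| (audio) | Well, you know, life is holistic, Dave. | Reflecting on a topic in the fields of Health and Fitness, a sad youth with low pitch and normal volume states. She speaks at a fast pace, signifying her sadness. | Reflecting on a topic in the fields of Health and Fitness, a sad youth with low pitch and normal volume states, "Well, you know, life is holistic, Dave." She speaks at a fast pace, signifying her sadness. |
| (audio) | And it’s very, very important to me that our family doesn’t operate like that. | Expressing happiness, a high-pitched and high-volume female teenager speaker enthusiastically states, in a fast-paced manner. Speaking in the context of News and Politics, she reflects upon a particular topic, expressing excitement about her words. | Expressing happiness, a high-pitched and high-volume female teenager speaker enthusiastically states, "And it’s very, very important to me that our family doesn’t operate like that." in a fast-paced manner. Speaking in the context of News and Politics, she reflects upon a particular topic, expressing excitement about her words. |
| (audio) | I say, i enjoyed your film. That’s why. | Expressing joy in the context of Entertainment, a happy adult male with normal pitch and volume speaks rapidly and says. His words reflect a positive attitude and amiable mood, evoking delight in the listener. | Expressing joy in the context of Entertainment, a happy adult male with normal pitch and volume speaks rapidly and says, "I say, i enjoyed your film. That’s why." His words reflect a positive attitude and amiable mood. |
| (audio) | 这个铜牌可以当作生日礼物。 (translation: This bronze medal can serve as a birthday gift.) | 这位年轻女士的音调中等,音量低沉,语速很快。她的语气中透露着内心的自信,还有些得意。 (translation: This young lady has a medium pitch, a low volume, and a fast speaking rate. Her tone reveals inner confidence and a touch of smugness.) | “这个铜牌可以当作生日礼物。”这位年轻女士的音调中等,音量低沉,语速很快。她的语气中透露着内心的自信,还有些得意。 (translation: “This bronze medal can serve as a birthday gift.” This young lady has a medium pitch, a low volume, and a fast speaking rate. Her tone reveals inner confidence and a touch of smugness.) |
| (audio) | 很多著名的流行音乐歌星都因使用毒品而毁了自己。 (translation: Many famous pop music stars have ruined themselves through drug use.) | 这位少女的音调中等,音量适中,语速很快,语气坚定,语气中带着怀疑和不相信的态度。 (translation: This teenage girl has a medium pitch, a moderate volume, and a fast speaking rate; her tone is firm and carries doubt and disbelief.) | “很多著名的流行音乐歌星都因使用毒品而毁了自己。”这位少女的音调中等,音量适中,语速很快,语气坚定,语气中带着怀疑和不相信的态度。 (translation: “Many famous pop music stars have ruined themselves through drug use.” This teenage girl has a medium pitch, a moderate volume, and a fast speaking rate; her tone is firm and carries doubt and disbelief.) |
| (audio) | 自被列入十二五规划后。 (translation: Since being included in the Twelfth Five-Year Plan.) | 男孩的声音很低沉,语气很认真,语气比较平静,有点内敛的感觉,用较高的音量,以较快的语速说着。 (translation: The boy's voice is very low; his tone is earnest and fairly calm, with a somewhat reserved feel, and he speaks at a relatively high volume and a relatively fast rate.) | 男孩的声音很低沉,语气很认真,语气比较平静,有点内敛的感觉,用较高的音量,以较快的语速说:“自被列入十二五规划后。” (translation: The boy's voice is very low; his tone is earnest and fairly calm, with a somewhat reserved feel, and at a relatively high volume and a relatively fast rate he says: “Since being included in the Twelfth Five-Year Plan.”) |
| (audio) | 全年将有望突破三千亿。 (translation: The full-year figure is expected to exceed 300 billion.) | 一位中年女性,她的音调低沉,音量高,语速适中,语气沉稳,镇定得让人感觉安心。她信心满满地说着。 (translation: A middle-aged woman with a low pitch, a high volume, and a moderate speaking rate; her tone is steady, and her composure is reassuring. She speaks with full confidence.) | 一位中年女性,她的音调低沉,音量高,语速适中,语气沉稳,镇定得让人感觉安心。她信心满满地说:“全年将有望突破三千亿。” (translation: A middle-aged woman with a low pitch, a high volume, and a moderate speaking rate; her tone is steady, and her composure is reassuring. She says with full confidence: “The full-year figure is expected to exceed 300 billion.”) |
| (audio) | 中证房天下大数据指数的推出。 (translation: The launch of the CSI Fang.com Big Data Index.) | 中年男子高分贝,快速地高声说道。他充满兴奋的语气,反映出他对这个话题话题热衷的态度。 (translation: A middle-aged man speaks loudly and quickly at a high decibel level. His excited tone reflects his enthusiasm for the topic.) | 中年男子高分贝,快速地高声说道:“中证房天下大数据指数的推出。”他充满兴奋的语气,反映出他对这个话题话题热衷的态度。 (translation: A middle-aged man says loudly and quickly at a high decibel level: “The launch of the CSI Fang.com Big Data Index.” His excited tone reflects his enthusiasm for the topic.) |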
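
The relationship between the two prompt versions can be sketched in a few lines of Python. This is only an illustration under assumptions: the class name `SpeechCraftSample`, its field names, and the `instruction()` helper are hypothetical and do not reflect the released data schema. The helper uses the simplest composition (description, colon, quoted transcript), whereas the samples above also place the transcript at the start or in the middle of the sentence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechCraftSample:
    """Hypothetical record layout; names are illustrative, not the released schema."""

    transcript: str
    description: str  # speech-Description: speaking style only, no transcript

    def instruction(self) -> str:
        """Compose a speech-Instruction by appending the quoted transcript.

        Assumption: a plain 'description: "transcript"' pattern. The real
        annotations embed the transcript at varying positions.
        """
        stem = self.description.rstrip().rstrip(".")
        return f'{stem}: "{self.transcript}"'


# Values taken from the first row of the overview table.
sample = SpeechCraftSample(
    transcript=(
        "‘Come into the water, Marcus’, said Jean peremptorily, "
        "as she put her foot against the edge of the raft."
    ),
    description=(
        "Entertaining us with her storytelling skills, a natural youth female "
        "with high pitch and normal volume speaks rapidly, enthralling us."
    ),
)

print(sample.description)    # style only
print(sample.instruction())  # style plus quoted transcript
```

Keeping the transcript as a separate field means either prompt version can be derived from the same record.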

1.2 Automatic Speech Annotation System

SpeechCraft is obtained by applying an automatic speech annotation system to four open-source speech datasets. The annotation system combines several kinds of speech style recognition with LLM rewriting to produce detailed, customized descriptions that interpret expressiveness. The system framework is illustrated in the video.
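
As a rough sketch of the "recognize attributes, then rewrite with an LLM" idea, the snippet below turns a set of recognized style labels into a rewriting prompt. Everything in it is an assumption made for illustration: the attribute set is inferred from the sample descriptions on this page, and `StyleLabels`, `build_rewrite_prompt`, and the prompt wording are hypothetical rather than the system's actual interface. No LLM is called; the function only builds the text that would be sent to one.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class StyleLabels:
    """Hypothetical output of the style recognizers (attribute set is assumed)."""

    gender: str
    age: str
    pitch: str
    volume: str
    speed: str
    emotion: str
    topic: Optional[str] = None  # not every clip has a topic label


def build_rewrite_prompt(labels: StyleLabels, transcript: Optional[str] = None) -> str:
    """Build an LLM prompt asking for a natural language style description.

    With a transcript, the prompt asks for a speech-Instruction;
    without one, it asks for a speech-Description.
    """
    # Skip attributes that were not recognized (None).
    attrs = {name: value for name, value in asdict(labels).items() if value is not None}
    lines = [f"- {name}: {value}" for name, value in attrs.items()]
    task = (
        "Rewrite the following speech attributes as one fluent natural "
        "language description of the speaking style."
    )
    if transcript is not None:
        task += f' Quote this transcript verbatim: "{transcript}"'
    else:
        task += " Do not include the transcript."
    return task + "\n" + "\n".join(lines)


# Labels matching the "life is holistic" sample in the overview table.
labels = StyleLabels(
    gender="female", age="youth", pitch="low", volume="normal",
    speed="fast", emotion="sad", topic="Health and Fitness",
)
description_prompt = build_rewrite_prompt(labels)
instruction_prompt = build_rewrite_prompt(labels, "Well, you know, life is holistic, Dave.")
print(description_prompt)
```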

1.2.1 Compared with the Previous Works

In this section, we compare the descriptions generated by our annotation system with those of TextrolSpeech, the largest existing speech description dataset. Each type of highlight represents a unique speech attribute. All speech samples are from the TextrolSpeech dataset.

| Given Audio | Transcript | TextrolSpeech Dataset | By Our Automatic Speech Annotation System |
| --- | --- | --- | --- |
| (audio) | “Hurry up, hurry up!” | Speaking slowly with a high tone, she articulates her amazed words with normal energy. | Urging something with urgency, a surprised teenage female with a high pitch and normal volume impatiently asks. |
| (audio) | A few years later the dome fell in. | Speaking rapidly and in a normal pitch, the mad man’s energy during communication is low. | In a tense and furious tone, a high-pitched teenager with a normal volume and fast speech says. This conversation revolves around a topic related to time, as the speaker expresses his anger. |
| (audio) | Our King George is labourers. | Her low-energy voice carried her sad words gradually, maintaining a normal pitch. | Speaking slowly and plaintively, a woman remarks. With a normal pitch and low volume, she emphasizes the significance of this statement. |
| (audio) | No. The man was not drunk, he wondered how he got tied up with this stranger. | A terrified male speaker engages the crowd with dynamic speech, regular pitch, and a quick tempo. | Speaking in a normal pitch with high volume, an elderly male exclaims. Jumping into the conversation quickly, he seems to express fear. |
| (audio) | We can die too, we can die like real people. People never live forever. | A disconsolate man's voice is low-pitched, speaking at an average rate, yet exuding an overall sense of diminished vitality. | Murmur a sad, older male, his voice low-pitched, low-volume, and medium-speed. Reflecting on the potential for mortality, he speaks with melancholy, evidence of his sadness. |

1.3 Constructing Emphasis Speech Data

Here we display samples of the emphasis speech data regenerated from AISHELL-3 and LibriTTS-R, paired with the instructions generated by the annotation system (ref Sec. 4.2).

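| Transcript | Word Emphasis | Regenerated Audio | speech-Instructions |
| --- | --- | --- | --- |
| ‘It is a story,’ Sara would answer. | story | (audio) | Speaking with a natural tone and at a normal speed, a young girl with normal pitch and low volume says, “‘It is a story,’ Sara would answer.”, adding a touch of charm to the conversation, highlighting “story” with pronounced emphasis. |
| That was something over thirteen years ago. | years | (audio) | In an environment where naturalness rules, a calm adult male with normal pitch and low volume speaks rapidly, expressing: “That was something over thirteen years ago.”, projecting “years” with significant stress. |
| Here I can cheaply purchase a delicious self-approval. | self | (audio) | A youthful male with normal pitch and low volume explosively states, “Here I can cheaply purchase a delicious self-approval.” He speaks rapidly in a natural manner, drawing attention to “self” by stressing it significantly. |
| Were you born in Spain, Pablo? | Spain | (audio) | A fast-paced conversation with a youth female with low pitch and low volume: “Were you born in Spain, Pablo?”, uttering “Spain” with particular stress. |
| 不可以叫住院医师 (translation: You must not call the resident physician) | 叫 (translation: call) | (audio) | 少女声音略带高昂,音量适中,以缓慢的语速,表达了自己内心的不相信和怀疑,说:“不可以叫住院医师!”,在说“叫”时加大了语气。 (translation: A teenage girl, her voice slightly raised and her volume moderate, speaks slowly, expressing her inner disbelief and doubt, and says: “You must not call the resident physician!”, putting extra force on “call”.) |
| 进入前一集 (translation: Go to the previous episode) | 进入 (translation: go to) | (audio) | 中年女性,声音低沉带有些许忧伤,以低沉的音调,低声说道:“进入前一集。”,确保“进入”被突出地读出。 (translation: A middle-aged woman, her voice low and tinged with sadness, says quietly in a low pitch: “Go to the previous episode.”, making sure “go to” is read out prominently.) |
| 男人哭吧不是罪 (translation: Men, go ahead and cry; it is no sin) | 男人 (translation: men) | (audio) | 一位青年男性,声音中等音量,音调中等,语气充满愤怒的发怒,毫不留情地说:“男人哭吧不是罪。”,在“男人”这个词上特别强调。 (translation: A young man with a medium volume and a medium pitch, his tone full of anger, says mercilessly: “Men, go ahead and cry; it is no sin.”, placing special emphasis on the word “men”.) |
| 如果当时没被抱错 (translation: If I had not been switched at birth back then) | 被 (translation: the passive marker, here “been”) | (audio) | 年轻女孩的音调很高,音量也非常高,更快速的说出:“如果当时没被抱错。”她的声音中透露着一种不耐烦的情感,在“被”字上进行了强调发音。 (translation: A young girl with a very high pitch and a very high volume says rather quickly: “If I had not been switched at birth back then.” Her voice reveals impatience, and she stresses the character “被”.) |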

2. Experimental Results: Enhancing Speech-Related Tasks with the SpeechCraft Dataset

2.1 Expressive Speech Synthesis (ref Sec. 5.1)

In this section, we compare the SpeechCraft dataset with the TextrolSpeech dataset on expressive speech synthesis. We trained the Salle model on each dataset for the same number of steps. Notably, the first six prompts and audio clips for TextrolSpeech are copied from its official demo page, whose model was also trained with Salle.

| Style Prompt | Transcript | Synthesized Speech (Trained on TextrolSpeech Dataset) | Synthesized Speech (Trained on speech-Descriptions) |
| --- | --- | --- | --- |
| The man employs a deep tone and average speaking speed, projecting an overall low vitality. | A doctor believes this boy to be mad. | (audio) | (audio) |
| The male speaker’s energetic discourse is accompanied by a normal pitch and speed. | A doctor believes this boy to be mad. | (audio) | (audio) |
| The man employs a low-pitched voice, keeping a regular rhythm and usual energy in conversation. | A doctor believes this boy to be mad. | (audio) | (audio) |
| Rapidly speaking, the despair man’s deep voice resonates with a sense of normal energy. | A doctor believes this boy to be mad. | (audio) | (audio) |
| The despair woman’s high-pitched voice carried a slow speech. | A doctor believes this boy to be mad. | (audio) | (audio) |
| The woman’s voice is vibrant, high-pitched, and delivered rapidly. | A doctor believes this boy to be mad. | (audio) | (audio) |

| Style Prompt | Transcript | Salle Synthesized Speech (Trained on TextrolSpeech Dataset) | Salle Synthesized Speech (Trained on speech-Descriptions) | Parler-TTS Synthesized Speech (Trained on speech-Descriptions) |
| --- | --- | --- | --- | --- |
| In the context of News and Politics, a calm youth female with normal pitch and high energy describes the details of Felix Sater’s forty million dollars pump-and-dump scheme and his cooperation with the government, highlighting their confidential nature. | Like, everything you just heard about felix sater’s forty million dollars pump-and-dump scheme and his cooperation with the government goes into a vault. | (audio) | (audio) | (audio) |
| Surprised by the information, an adult male with normal pitch and energy speaks rapidly, exclaiming. His fast speech reflects his astonishment. In the context of Crime, he expresses his surprise. | Oh, wow! What, what age did that start? | (audio) | (audio) | (audio) |
| In the midst of a calm and composed atmosphere of Sports, an old male with high pitch and high energy speaks slowly, highlighting the profound emphasis placed on family before the commencement of a race. | You see just how much he was thinking about family before the start of this race. | (audio) | (audio) | (audio) |
| In a somber tone, an adult male with normal pitch and energy speaks slowly about the snow piling up on the streets. | The snow was piling waist high upon the streets. | (audio) | (audio) | (audio) |
| With a low pitch and high energy, a happy adult male enjoying an educational moment exclaimed. His words were spoken at a slow pace, expressing his joy and excitement. This falls under the category of Education. | He was blowing excitedly and running his fingers through his hair. | (audio) | (audio) | (audio) |

2.2 Fine-Grained Speech Emphasis Control (ref Sec. 5.2)

In this section, we demonstrate the effectiveness of SpeechCraft on the task of Fine-Grained Speech Emphasis Control.

(ref Fig. 5) The first table shows a case study using a series of identical speech instructions that vary only in the word to be emphasized.
Instruction: A youthful male with normal pitch and low volume explosively states, “Winsome Waitress Wins Wealthy Wisconsin Woodsman.” He speaks rapidly in a natural manner, drawing attention to “*” by stressing it significantly.
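
The series of instructions used in this case study can be reproduced by substituting each target word for the “*” placeholder. The snippet below is a minimal sketch of that substitution; the function name `emphasis_instructions` and the check that the word occurs in the transcript are illustrative assumptions, not part of the released tooling.

```python
import string
from typing import List

TRANSCRIPT = "Winsome Waitress Wins Wealthy Wisconsin Woodsman."
TEMPLATE = (
    "A youthful male with normal pitch and low volume explosively states, "
    f"“{TRANSCRIPT}” He speaks rapidly in a natural manner, "
    "drawing attention to “*” by stressing it significantly."
)
PLACEHOLDER = "“*”"  # the quoted asterisk marks where the emphasized word goes


def emphasis_instructions(template: str, transcript: str, words: List[str]) -> List[str]:
    """Return one instruction per emphasized word, identical otherwise."""
    if template.count(PLACEHOLDER) != 1:
        raise ValueError("template must contain the placeholder exactly once")
    # Words of the transcript with surrounding punctuation removed.
    vocabulary = {token.strip(string.punctuation) for token in transcript.split()}
    instructions = []
    for word in words:
        if word not in vocabulary:
            raise ValueError(f"{word!r} does not occur in the transcript")
        instructions.append(template.replace(PLACEHOLDER, f"“{word}”"))
    return instructions


variants = emphasis_instructions(TEMPLATE, TRANSCRIPT, ["Waitress", "Wealthy", "Woodsman"])
for text in variants:
    print(text)
```

Because only the quoted word changes between variants, any difference in the synthesized audio can be attributed to the emphasis target.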

| Transcript | Word Emphasis | Synthesized Speech (Trained on speech-Instructions) |
| --- | --- | --- |
| Winsome Waitress Wins Wealthy Wisconsin Woodsman. | Waitress | (audio) |
| Winsome Waitress Wins Wealthy Wisconsin Woodsman. | Wealthy | (audio) |
| Winsome Waitress Wins Wealthy Wisconsin Woodsman. | Woodsman | (audio) |

In the second table, we compare speech-Descriptions and speech-Instructions on the effectiveness of fine-grained speech emphasis control.

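| Transcript | Word Emphasis | Synthesized Speech (Trained on speech-Descriptions) | Synthesized Speech (Trained on speech-Instructions) |
| --- | --- | --- | --- |
| ‘It is a story,’ Sara would answer. | story | (audio) | (audio) |
| That was something over thirteen years ago. | years | (audio) | (audio) |
| Here I can cheaply purchase a delicious self-approval. | self | (audio) | (audio) |
| Were you born in Spain, Pablo? | Spain | (audio) | (audio) |
| 进入前一集。 (translation: Go to the previous episode.) | 进入 (translation: go to) | (audio) | (audio) |
| 男人哭吧不是罪。 (translation: Men, go ahead and cry; it is no sin.) | 男人 (translation: men) | (audio) | (audio) |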

2.3 Automated Speech Style Captioning (ref Sec. 5.3)

In this section, we compare SECap trained on the SpeechCraft dataset with the original SECap, which was trained on an internal Chinese emotion-annotated dataset called EMOSpeech, to demonstrate the performance of SpeechCraft on automated speech style captioning. All audio samples are from EMOSpeech.

| Audio | Human Annotation from EMOSpeech | Caption (Trained on original EMOSpeech) | Caption (Trained on speech-Descriptions) |
| --- | --- | --- | --- |
| (audio) | (translation: Overjoyed and in high spirits.) | (translation: Felt happiness and joy.) | (translation: A young woman, voice high, pace swift, revealed joy and delight in her emotion.) |
| (audio) | (translation: The tone is full of surprise, the mood extremely pleasant, barely containing the joy within.) | (translation: The voice was full of curiosity, and the tone carried a careful anticipation.) | (translation: A young female, with a high-pitched voice and a moderate pace, spoke with an air of confusion and misunderstanding.) |
| (audio) | (translation: A sense of constant complaining.) | (translation: Appears to be very skillful.) | (translation: A young gentleman, with an elevated pitch and rapid speed, articulated in anger.) |
| (audio) | (translation: The tone is lively and animated, with varying intonations, filled with surprise and curiosity.) | (translation: The tone is calm, inquiring in nature, expressing confusion and puzzlement.) | (translation: A young woman speaks with a moderate pitch and relatively fast pace, her speech marked by rising and falling intonations.) |
| (audio) | (translation: Filled with self-blame inside, her words carry displeasure, very sad.) | (translation: The speech is earnest, the tone sorrowful, the mood grief-stricken.) | (translation: A young woman speaks in a high pitch and moderate pace, carrying an unhappy mood.) |
| (audio) | (translation: Driven by curiosity, remaining puzzled, wanting to know the answer.) | (translation: Suspicious and puzzled about something.) | (translation: A young female’s tone was high-pitched and the pace was moderate, speaking with a sense of doubt.) |

For English speech style captioning, we showcase the descriptions produced by the automatic annotation system alongside the captions produced by SECap trained on speech-Descriptions.

| Audio | Annotation from the Automatic System | Caption (Trained on speech-Descriptions) |
| --- | --- | --- |
| (audio) | Delving into the world of Education, a cheerful young woman with low pitch and high energy enthusiastically explains. | A happy teenage girl with normal pitch and high volume speaks slowly, expressing her thoughts in an educational setting. |
| (audio) | Embarking on a narration in an audiobook, a sad teenage female with a normal pitch and normal energy sets the stage with a poignant line. | In the context of an audiobook, a teenage girl with normal pitch and volume speaks at a moderate speed, conveying her thoughts. |
| (audio) | With a sense of angry, an adult male with normal pitch and energy speaks slowly in an audiobook setting, describing a scene. | In the context of an audiobook, a natural adult male with normal pitch and volume speaks at a slow pace. |
| (audio) | Reflecting on the alleviation or disappearance of symptoms after a fast, a calm elderly male with a high pitch and slow speaking speed shares the observation. | expresses a natural old male with normal pitch and high volume, speaking at a slow pace. |
| (audio) | Expressing sadness in the context of News and Politics, a calm adult female with normal pitch and energy speaks slowly about racial anguish, saying. | says a sad adult female with normal pitch and volume, speaking at a slow pace in the context of News and Politics. |
| (audio) | With high energy and a slow pace, a happy female youth with normal pitch conveys her thoughts. Her words reflect a positive and optimistic outlook. (Category. News and Politics) | expresses a sad adult female with normal pitch and high volume, speaking at a slow pace in the context of News and Politics. |
| (audio) | Expressing angry in the domain of news and politics, an old male with a normal pitch and energy speaks rapidly. | says an angry adult male with normal pitch and volume, speaking at a fast pace. This conversation takes place in the context of News and Politics. |