Getting Started with Speech Synthesis Markup Language (SSML)
SSML is an XML-based markup language designed by the W3C to help generate natural-sounding synthesized speech. Plivo <Speak> XML now supports SSML-based speech generation. SSML speech generation on Plivo is powered by Amazon Polly.
With basic text-to-speech, developers can only choose between a male and a female voice in a subset of languages. Plivo SSML supports 27 languages and over 40 voices, and also lets developers control pronunciation, pitch, volume, and more. Plivo’s root XML element for SSML tags is <Speak>, the same element used for basic TTS. For example:
<Response>
    <!-- Basic text-to-speech -->
    <Speak voice="MAN">Go Green, Go Plivo</Speak>
    <!-- Text-to-speech using SSML -->
    <Speak voice="Polly.Joey">
        <emphasis level="moderate">Go Green, Go Plivo</emphasis>
    </Speak>
</Response>
Amazon Polly voices can process text-to-speech for a maximum of 3000 characters using the <Speak> tag. For more information about SSML, see the W3C specifications.
Amazon Polly is a service that provides life-like text-to-speech across several languages and locales. SSML support on Plivo is powered by Amazon Polly.
To synthesize SSML speech on Plivo, specify one of the Amazon Polly voices in the voice attribute of Plivo’s <Speak> XML. Note that Polly voices must be namespaced with the "Polly." prefix, for example voice="Polly.Joey":
<Response>
    <Speak voice="Polly.Joey">
        <emphasis level="moderate">Go Green, Go Plivo</emphasis>
    </Speak>
</Response>
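The same XML can be generated from application code. The following is a minimal sketch using only Python's standard library, not the Plivo SDK; the helper name `build_speak_response` is hypothetical, and the "Polly." prefix is inferred from the examples on this page.

```python
import xml.etree.ElementTree as ET

# Assumption: the namespace prefix is "Polly.", as in voice="Polly.Joey" above.
POLLY_PREFIX = "Polly."


def build_speak_response(text: str, voice: str, emphasis: str = "moderate") -> str:
    """Return Plivo XML that speaks `text` with a Polly voice inside <emphasis>."""
    # Add the prefix if the caller passed a bare voice name such as "Joey".
    if not voice.startswith(POLLY_PREFIX):
        voice = POLLY_PREFIX + voice

    response = ET.Element("Response")
    speak = ET.SubElement(response, "Speak", {"voice": voice})
    emph = ET.SubElement(speak, "emphasis", {"level": emphasis})
    emph.text = text  # ElementTree escapes &, <, > when serializing

    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    print(build_speak_response("Go Green, Go Plivo", "Joey"))
```

Building the document with an XML library rather than string concatenation keeps the output well-formed even when the spoken text contains characters such as & or <.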
The supported Polly voices are listed in the voices table later on this page.
The following SSML tags are supported for use in Plivo’s XML:
| Action | Tag | Description |
|---|---|---|
| Adding a pause | <break> | Use this tag to include a pause in the speech. |
| Emphasizing words | <emphasis> | Use this tag to change the rate and volume of the speech. |
| Specifying another language for specific words | <lang> | Use this tag to set the natural language of the text. |
| Adding a pause between paragraphs | <p> | Use this tag to represent a paragraph. |
| Controlling volume, speaking rate, and pitch | <prosody> | Use this tag to modify the volume, pitch, and rate of the tagged text. |
| Adding a pause between sentences | <s> | Use this tag to represent a sentence. This adds a strong break before and after the tag. |
| Controlling how special types of words are spoken | <say-as> | Use this tag to describe how to interpret the text. |
| Pronouncing acronyms and abbreviations | <sub> | Use this tag to pronounce the specified words or phrases as different words or phrases. |
| Improving pronunciation by specifying parts of speech | <w> | Use this tag to customize the pronunciation of words by specifying the part of speech. |
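Several of these tags can be combined inside one <Speak> element. The sketch below builds such a prompt with Python's standard library. The function name `build_menu_prompt` is hypothetical, and the attribute values (`time="1s"`, `rate="slow"`, `volume="loud"`, `alias`) are standard SSML values accepted by Amazon Polly; that Plivo passes them through unchanged is an assumption.

```python
import xml.etree.ElementTree as ET


def build_menu_prompt(voice: str = "Polly.Joey") -> str:
    """Build a <Speak> element that mixes text with <break>, <prosody>, and <sub>."""
    response = ET.Element("Response")
    speak = ET.SubElement(response, "Speak", {"voice": voice})
    speak.text = "Welcome to Plivo."

    # <break> is an empty element; the text that follows it goes in .tail.
    pause = ET.SubElement(speak, "break", {"time": "1s"})
    pause.tail = "Please listen "

    # <prosody> wraps the words whose rate and volume should change.
    slow = ET.SubElement(speak, "prosody", {"rate": "slow", "volume": "loud"})
    slow.text = "carefully"
    slow.tail = ". This markup follows the "

    # <sub> speaks the alias instead of the written abbreviation.
    sub = ET.SubElement(speak, "sub", {"alias": "World Wide Web Consortium"})
    sub.text = "W3C"
    sub.tail = " specification."

    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    print(build_menu_prompt())
```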
Note: The following Amazon Polly-specific tags are not supported for use with Plivo XML:
- <amazon:effect name="drc">
- <amazon:effect phonation="soft">
- <amazon:effect vocal-tract-length>
- <amazon:effect name="whispered">
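Before sending SSML to Plivo, it can help to check that it uses only the supported tags. The following is a rough sketch, not part of any Plivo SDK: the function name `unsupported_tags` is hypothetical, and it scans tag names with a regular expression instead of fully parsing the XML, so that an undeclared prefix such as `amazon:` does not cause a parser error.

```python
import re
from typing import List

# Tags listed as supported in the table on this page.
SUPPORTED_TAGS = {"break", "emphasis", "lang", "p", "prosody", "s", "say-as", "sub", "w"}

# Matches the name of every opening (or self-closing) tag; closing tags
# start with "</" and are skipped because "/" is not a letter.
_OPEN_TAG = re.compile(r"<([A-Za-z][\w:.-]*)")


def unsupported_tags(ssml_body: str) -> List[str]:
    """Return the sorted tag names in `ssml_body` that are not in SUPPORTED_TAGS."""
    found = _OPEN_TAG.findall(ssml_body)
    return sorted({name for name in found if name not in SUPPORTED_TAGS})


if __name__ == "__main__":
    body = '<amazon:effect name="whispered">Hello</amazon:effect> <break time="1s"/>'
    print(unsupported_tags(body))
```

Pass only the content placed inside <Speak>; the <Response> and <Speak> elements themselves are Plivo XML, not SSML tags, and would be reported by this check.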
The following SSML voices are supported for use with Plivo XML:
| Language | Female voice | Male voice |
|---|---|---|
| Australian English (en-AU) | Nicole | Russell |
| Brazilian Portuguese (pt-BR) | Vitória | Ricardo |
| Canadian French (fr-CA) | Chantal | - |
| Indian English (en-IN) | Raveena | - |
| Mandarin Chinese (cmn-CN) | Zhiyu | - |
| Portuguese - Iberic (pt-PT) | Ines | Cristiano |
| Spanish - Castilian (es-ES) | Conchita | Enrique |
| UK English (en-GB) | Amy | Brian |
| US English (en-US) | Joanna | Matthew |
| US Spanish (es-US) | Penelope | Miguel |
| Welsh English (en-GB-WLS) | - | Geraint |
To ensure quick synthesis, an upper cap of 3000 characters is enforced on the text that can be synthesized in one <Speak> XML.
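Longer text therefore has to be split across several <Speak> elements. The sketch below shows one possible way to do that in Python, breaking at sentence boundaries where it can. The function name `split_for_speak` is hypothetical, and the sketch counts only the plain text; whether SSML markup also counts toward the 3000-character cap is not stated here, so leave headroom if your text is heavily tagged.

```python
import re
from typing import List

MAX_CHARS = 3000  # cap stated in this document


def split_for_speak(text: str, limit: int = MAX_CHARS) -> List[str]:
    """Split `text` into chunks of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-wrapped.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate      # sentence still fits in the current chunk
        else:
            chunks.append(current)   # flush and start a new chunk
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    long_text = " ".join(f"This is sentence number {i}." for i in range(400))
    print([len(chunk) for chunk in split_for_speak(long_text)])
```

Each returned chunk can then be placed in its own <Speak> element within the same <Response>.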
Support for SSML-based speech synthesis is currently in Beta. While in Beta, SSML-based speech synthesis is free.
SSML-based speech synthesis will eventually be charged on the basis of the number of characters synthesized.
SSML Support In Plivo Server SDKs
SSML tags are supported in all of our Server SDKs. To get started, follow the SDK setup guide for your preferred language.
The examples below use the Joey voice for US English (en-US). Use the voice attribute of the <Speak> tag to specify the voice for your text.
The say-as tag describes how to interpret the text.
<Response>
    <Speak voice="Polly.Joey">
        The date is <say-as interpret-as="date">20180626</say-as>
    </Speak>
</Response>
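When the date comes from application data, it can be formatted into the same eight-digit form before being wrapped in <say-as>. This is a small sketch with a hypothetical helper name, `say_date`; it reproduces the markup of the example above and does not use the Plivo SDK.

```python
from datetime import date


def say_date(d: date, voice: str = "Polly.Joey") -> str:
    """Return Plivo XML that reads `d` aloud using <say-as interpret-as="date">."""
    digits = d.strftime("%Y%m%d")  # e.g. 26 June 2018 becomes "20180626"
    return (
        "<Response>"
        f'<Speak voice="{voice}">'
        f'The date is <say-as interpret-as="date">{digits}</say-as>'
        "</Speak>"
        "</Response>"
    )


if __name__ == "__main__":
    print(say_date(date(2018, 6, 26)))
```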
The w tag is used to customize the pronunciation of words by specifying the part of speech.
<Response>
    <Speak voice="Polly.Joey">
        The word <say-as interpret-as="characters">read</say-as>
        <s>may be interpreted as either the present simple form</s>
        <w role="amazon:VB">read</w>
        <s>or the past participle form</s>
        <w role="amazon:VBD">read</w>
    </Speak>
</Response>