Getting Started with Speech Synthesis Markup Language (SSML)

    Speech Synthesis Markup Language (SSML) is an XML-based markup language designed by the W3C to help generate natural-sounding synthesized speech. Plivo’s <Speak> XML element now supports SSML-based speech generation. SSML speech generation on Plivo is powered by Amazon Polly, the leader in SSML speech synthesis.

    With basic text-to-speech (TTS), developers can choose only between a basic male or female voice in a subset of languages. Plivo SSML supports 27 languages and more than 40 voices, and also lets developers control pronunciation, pitch, volume, and other speech properties. Plivo’s root XML element for SSML tags is <Speak>, the same element used for basic TTS. For example:

    <Response>
        <!-- Basic text-to-speech -->
        <Speak voice="MAN">Go Green, Go Plivo</Speak>
        <!-- Text-to-speech using SSML -->
        <Speak voice="Polly.Joey">
            <emphasis level="moderate">Go Green, Go Plivo</emphasis>
        </Speak>
    </Response>
    

    Amazon Polly voices can process text-to-speech for a maximum of 3000 characters using the <Speak> tag. For more information about SSML, see the W3C specifications.

    Amazon Polly

    Amazon Polly is a service that provides life-like text-to-speech across several languages and locales. SSML support on Plivo is powered by Amazon Polly.

    To synthesize SSML speech on Plivo, specify one of the Amazon Polly voices in the ‘voice’ attribute of Plivo’s <Speak> XML element. Note that Polly voice names must be prefixed with the "Polly." namespace, as in Polly.Joey.

    For example:

    <Response>
        <Speak voice="Polly.Joey">
            <emphasis level="moderate">Go Green, Go Plivo</emphasis>
        </Speak>
    </Response>
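
    The same response can also be generated programmatically. The following is a minimal sketch using only Python's standard library rather than a Plivo SDK; the helper names polly_voice and build_speak_response are hypothetical and exist only for illustration. It adds the "Polly." prefix when it is missing and builds the XML shown above.

```python
import xml.etree.ElementTree as ET

POLLY_PREFIX = "Polly."


def polly_voice(name: str) -> str:
    """Return the voice name with the 'Polly.' prefix, adding it if missing."""
    return name if name.startswith(POLLY_PREFIX) else POLLY_PREFIX + name


def build_speak_response(voice: str, text: str, level: str = "moderate") -> str:
    """Build a <Response> holding one <Speak> with an <emphasis> child."""
    response = ET.Element("Response")
    speak = ET.SubElement(response, "Speak", {"voice": polly_voice(voice)})
    emphasis = ET.SubElement(speak, "emphasis", {"level": level})
    emphasis.text = text
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(response, encoding="unicode")


xml_out = build_speak_response("Joey", "Go Green, Go Plivo")
print(xml_out)
```

    Building the XML with an XML library instead of string concatenation means that characters such as & and < in the spoken text are escaped automatically.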
    

    A complete list of supported Polly voices is available here.

    SSML Tags

    The following SSML tags are supported for use in Plivo’s XML:

    | Action | SSML Tag | Description |
    |---|---|---|
    | Adding a pause | <break> | Use this tag to include a pause in the speech. |
    | Emphasizing words | <emphasis> | Use this tag to change the rate and volume of the speech. |
    | Specifying another language for specific words | <lang> | Use this tag to set the natural language of the text. |
    | Adding a pause between paragraphs | <p> | Use this tag to represent a paragraph. |
    | Controlling volume, speaking rate, and pitch | <prosody> | Use this tag to modify the volume, pitch, and rate of the tagged text. |
    | Adding a pause between sentences | <s> | Use this tag to represent a sentence. This adds a strong break before and after the tag. |
    | Controlling how special types of words are spoken | <say-as> | Use this tag to describe how to interpret the text. |
    | Pronouncing acronyms and abbreviations | <sub> | Use this tag to pronounce the specified words or phrases as different words or phrases. |
    | Improving pronunciation by specifying parts of speech | <w> | Use this tag to customize the pronunciation of words by specifying the part of speech. |
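
    To show how several of these tags can be combined in one <Speak> element, here is a hedged sketch. The sentences and attribute values (time="500ms", rate="slow", volume="loud", the alias text) are illustrative choices based on the W3C SSML attributes, not values taken from Plivo's documentation. The Python wrapper only checks that the XML is well-formed and lists the tags used; it does not call Plivo.

```python
import xml.etree.ElementTree as ET

# Illustrative Plivo XML that combines several of the tags from the table.
PLIVO_XML = """
<Response>
    <Speak voice="Polly.Joey">
        <p>
            <s>Welcome to the demo.</s>
            <s>Please hold <break time="500ms"/> while we connect you.</s>
        </p>
        <prosody rate="slow" volume="loud">This part is slower and louder.</prosody>
        <sub alias="World Wide Web Consortium">W3C</sub> defines SSML.
        <lang xml:lang="fr-FR">Bonjour</lang>
    </Speak>
</Response>
"""

# fromstring raises xml.etree.ElementTree.ParseError if the XML is malformed.
root = ET.fromstring(PLIVO_XML)
speak = root.find("Speak")

# Collect the distinct SSML tag names nested inside <Speak>.
used_tags = sorted({el.tag for el in speak.iter() if el is not speak})
print(used_tags)
```

    Parsing the document locally before returning it from an application catches unclosed tags early, which is where hand-written SSML most often goes wrong.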

    Note: The following Amazon Polly-specific tags are not supported for use with Plivo XML:

    • <amazon:auto-breaths>
    • <amazon:effect name="drc">
    • <amazon:effect phonation="soft">
    • <amazon:effect vocal-tract-length>
    • <amazon:effect name="whispered">
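
    A simple pre-flight check can flag these tags before the XML is returned to Plivo. The sketch below is an assumption-laden illustration: the function name find_unsupported_tags is hypothetical, and it treats every <amazon:effect> tag as unsupported because all listed variants use that tag name. It uses a regular expression instead of an XML parser because the amazon: prefix is not a declared XML namespace and would make a strict parser fail.

```python
import re
from typing import List

# Tag names from the list above. Assumption: every <amazon:effect> variant
# is unsupported, so matching on the tag name alone is enough here.
UNSUPPORTED_TAGS = {"amazon:auto-breaths", "amazon:effect"}

# Matches the name of an opening tag; closing tags start with "</" and are skipped.
TAG_NAME = re.compile(r"<\s*([A-Za-z][\w:.-]*)")


def find_unsupported_tags(ssml: str) -> List[str]:
    """Return unsupported tag names found in ssml, in order, without duplicates."""
    found: List[str] = []
    for name in TAG_NAME.findall(ssml):
        if name in UNSUPPORTED_TAGS and name not in found:
            found.append(name)
    return found


sample = (
    '<Speak voice="Polly.Joey">'
    '<amazon:effect name="whispered">Quiet, please.</amazon:effect>'
    '<break time="300ms"/>Thank you.'
    "</Speak>"
)
print(find_unsupported_tags(sample))  # ['amazon:effect']
```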

    SSML Voices

    The following SSML voices are supported for use with Plivo XML:

    | Language | Female | Male |
    |---|---|---|
    | Australian English (en-AU) | Nicole | Russell |
    | Brazilian Portuguese (pt-BR) | Vitória | Ricardo |
    | Canadian French (fr-CA) | Chantal | - |
    | Danish (da-DK) | Naja | Mads |
    | Dutch (nl-NL) | Lotte | Ruben |
    | French (fr-FR) | Lea, Celine | Mathieu |
    | German (de-DE) | Vicki, Marlene | Hans |
    | Hindi (hi-IN) | Aditi | - |
    | Icelandic (is-IS) | Dora | Karl |
    | Indian English (en-IN) | Raveena, Aditi | - |
    | Italian (it-IT) | Carla | Giorgio |
    | Japanese (ja-JP) | Mizuki | Takumi |
    | Korean (ko-KR) | Seoyeon | - |
    | Mandarin Chinese (cmn-CN) | Zhiyu | - |
    | Norwegian (nb-NO) | Liv | - |
    | Polish (pl-PL) | Ewa, Maja | Jacek, Jan |
    | Portuguese - Iberic (pt-PT) | Ines | Cristiano |
    | Romanian (ro-RO) | Carmen | - |
    | Russian (ru-RU) | Tatyana | Maxim |
    | Spanish - Castilian (es-ES) | Conchita | Enrique |
    | Swedish (sv-SE) | Astrid | - |
    | Turkish (tr-TR) | Filiz | - |
    | UK English (en-GB) | Amy, Emma | Brian |
    | US English (en-US) | Joanna, Salli, Kendra, Kimberly, Ivy | Matthew, Justin, Joey |
    | US Spanish (es-US) | Penelope | Miguel |
    | Welsh (cy-GB) | Gwyneth | - |
    | Welsh English (en-GB-WLS) | - | Geraint |
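
    An application that serves several locales may want to pick a voice from this table by language code. The following sketch encodes only a small subset of the table, and the names VOICES and pick_voice are hypothetical; it returns the voice in the "Polly." form expected by the voice attribute, or None when the table lists no voice for that combination.

```python
from typing import Optional

# Subset of the voice table: language code -> (female voices, male voices).
VOICES = {
    "en-US": (["Joanna", "Salli", "Kendra", "Kimberly", "Ivy"],
              ["Matthew", "Justin", "Joey"]),
    "en-GB": (["Amy", "Emma"], ["Brian"]),
    "de-DE": (["Vicki", "Marlene"], ["Hans"]),
    "hi-IN": (["Aditi"], []),
    "en-GB-WLS": ([], ["Geraint"]),
}


def pick_voice(language: str, gender: str) -> Optional[str]:
    """Return the first listed voice as 'Polly.<Name>', or None if none is listed."""
    female, male = VOICES.get(language, ([], []))
    names = female if gender == "female" else male
    return f"Polly.{names[0]}" if names else None


print(pick_voice("en-US", "male"))  # Polly.Matthew
print(pick_voice("hi-IN", "male"))  # None
```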

    Character Limit

    To ensure quick synthesis, an upper cap of 3000 characters is enforced on the text that can be synthesized in one <Speak> XML.
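
    Longer text therefore has to be spread across several <Speak> elements. The sketch below shows one possible way to split plain text into chunks of at most 3000 characters, preferring sentence boundaries. It is an illustration only: the function name split_for_speak is hypothetical, and it counts plain text characters because this page does not state whether SSML markup counts toward the limit.

```python
import re
from typing import List

MAX_CHARS = 3000  # limit stated above for one <Speak> element


def split_for_speak(text: str, limit: int = MAX_CHARS) -> List[str]:
    """Split text into chunks of at most `limit` characters.

    Chunks break at sentence boundaries where possible; a single sentence
    longer than the limit is cut at the limit.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        while len(sentence) > limit:  # hard-cut an oversized sentence
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


long_text = "A sentence. " * 400
chunks = split_for_speak(long_text)
print([len(chunk) for chunk in chunks])
```

    Each returned chunk can then be placed in its own <Speak> element within the same <Response>.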

    Pricing

    Support for SSML-based speech synthesis is currently in Beta. While in Beta, SSML-based speech synthesis is free of charge.

    SSML-based speech synthesis will eventually be charged based on the number of characters synthesized.

    SSML Support In Plivo Server SDKs

    SSML tags are supported in all of our Server SDKs. To get started, check the setup guides available here for your preferred language.

    Examples

    The examples below use the Joey voice for US English (en-US). Use the voice attribute of the <Speak> tag to specify the voice for your text.

    • Say-as

    The say-as tag describes how to interpret the text.

    <Response>
        <Speak voice="Polly.Joey">
            The date is
            <say-as interpret-as="date" format="yyyymmdd">20180626</say-as>
        </Speak>
    </Response>
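
    When the date comes from application data, the digits inside <say-as> can be generated rather than typed by hand. The following sketch uses only Python's standard library; the helper name date_say_as is hypothetical. It adds a format="yyyymmdd" attribute, which follows Amazon Polly's SSML documentation for date values; whether Plivo requires that attribute is an assumption.

```python
import datetime
import xml.etree.ElementTree as ET


def date_say_as(day: datetime.date) -> ET.Element:
    """Return a <say-as> element that reads `day` as a date in yyyymmdd form."""
    element = ET.Element("say-as", {"interpret-as": "date", "format": "yyyymmdd"})
    element.text = day.strftime("%Y%m%d")
    return element


response = ET.Element("Response")
speak = ET.SubElement(response, "Speak", {"voice": "Polly.Joey"})
speak.text = "The date is "
speak.append(date_say_as(datetime.date(2018, 6, 26)))

xml_out = ET.tostring(response, encoding="unicode")
print(xml_out)
```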
    
    • W

    The w tag is used to customize the pronunciation of words by specifying the part of speech.

    <Response>
        <Speak voice="Polly.Joey">
            The word
            <say-as interpret-as="characters">read</say-as>
            <s>
                may be interpreted as either the present simple form
            </s>
            <w role="amazon:VB">read</w>
            <s>or the past participle form</s>
            <w role="amazon:VBD">read</w>
        </Speak>
    </Response>