Receive DTMF/Speech Input using Java

    Overview

    Capturing user input is a core capability of any phone system. Input arrives either as dual-tone multi-frequency (DTMF) digit presses or as speech, and it drives use cases such as IVR phone systems, conversational IVRs, virtual assistants, and voice-based forms and surveys. The Plivo Voice platform offers features that let you implement business use cases involving the secure capture of DTMF and speech inputs.

    Set Up Your Java Dev Environment

    You must install Java (1.8 or higher) and Plivo’s Java SDK to handle incoming calls and callbacks. Here’s how.

    Install Java

    Operating System    Instructions
    macOS & Linux       To see if you already have Java installed, run the command java -version in the terminal. If you do not have it installed, you can install it from here.
    Windows             To install Java on Windows, follow the instructions listed here.

    Install Spring and the Plivo Java Package Using IntelliJ IDEA

    • Use Spring Initializr to create a boilerplate project with the Spring Boot framework.

    • Choose the “Spring Web” dependency, give the project a friendly name, and click “Generate” to download the boilerplate code. Then open it in IntelliJ IDEA.

      Note: Set the Java target to 8.
    • Install the Plivo Java package by adding the dependency to pom.xml:

        <dependency>
            <groupId>com.plivo</groupId>
            <artifactId>plivo-java</artifactId>
            <version>4.14.0</version>
        </dependency>
      

    Detect DTMF inputs

    Outline

    In this section, we show how to implement a multi-level IVR phone system that captures digit press inputs (DTMF) on the Plivo Voice platform.

    Receive DTMF

    The example IVR phone tree below has been implemented using the GetInput XML feature:

    1. Caller dials a phone number, and a virtual assistant answers the call.
    2. The first branch of the IVR phone tree will include three choices, such as “Press 1 for your account balance. Press 2 for your account status. Press 3 to speak to a representative.”
    3. Options 1 and 2 will automatically retrieve the information and play the caller a text-to-speech message, and option 3 will redirect the caller to the second branch of the IVR.
    4. The second branch of the IVR will have two options, such as “Press 1 for Sales. Press 2 for Support.”
    5. If the caller presses “1”, the call is connected to the sales representative. If the caller presses “2”, the call is connected to the support representative.

    Create a Spring App to Detect DTMF Inputs

    Now, locate the PlivoVoiceApplication.java file in the src/main/java/com.example.demo/ folder and paste the following code.

    Note: Here, the demo application name is PlivoVoiceApplication.java because the friendly name provided in the Spring Initializr was Plivo Voice.
    
    package com.example.Plivo;
    
    import com.plivo.api.exceptions.PlivoXmlException;
    import com.plivo.api.xml.*;
    import com.plivo.api.xml.Number;
    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.web.bind.annotation.*;
    
    @SpringBootApplication
    @RestController
    public class PlivoVoiceApplication {
        public static void main(String[] args) {
            SpringApplication.run(PlivoVoiceApplication.class, args);
        }
    
        // Welcome message - firstbranch
        String WelcomeMessage = "Welcome to the demo app, Press 1 for your account balance. Press 2 for your account status. Press 3 to talk to our representative";
        // Message for Second branch
        String RepresentativeBranch = "Press 1 to talk to our Sales representative. Press 2 to talk to our Support representative";
        // This is the message that Plivo reads when the caller does nothing at all
        String NoInput = "Sorry, I didn't catch that. Please hangup and try again later.";
        // This is the message that Plivo reads when the caller inputs a wrong digit.
        String WrongInput = "Sorry, it's a wrong input.";
    
        @GetMapping(value = "/response/ivr/", produces = { "application/xml" })
    
        public Response getInput() throws PlivoXmlException {
            Response response = new Response().children(
                    new GetInput().action("http://1e69c9096712.ngrok.io/multilevelivr/firstbranch/").method("POST")
                            .inputType("dtmf").digitEndTimeout(5).redirect(true).children(new Speak(WelcomeMessage)))
                    .children(new Speak(NoInput));
            System.out.println(response.toXmlString());
            return response;
        }
    
        @RequestMapping(value = "/multilevelivr/firstbranch/", method = RequestMethod.POST, produces = {
                "application/xml" })
        public Response speak(@RequestParam("Digits") String digit) throws PlivoXmlException {
            System.out.println("Digit pressed:" + digit);
            Response response = new Response();
            if (digit.equals("1")) {
                response.children(new Speak("Your account balance is $20."));
            } else if (digit.equals("2")) {
                response.children(new Speak("Your account status is active."));
            } else if (digit.equals("3")) {
                response.children(new GetInput().action("http://1e69c9096712.ngrok.io/multilevelivr/secondbranch/")
                        .method("POST").inputType("dtmf").digitEndTimeout(5).redirect(true)
                        .children(new Speak(RepresentativeBranch))).children(new Speak(NoInput));
            } else {
                response.children(new Speak(WrongInput));
            }
            System.out.println(response.toXmlString());
            return response;
        }
    
        // Path must match the action URL of the second-branch GetInput above.
        @RequestMapping(value = "/multilevelivr/secondbranch/", produces = { "application/xml" }, method = RequestMethod.POST)
        public Response callforward(@RequestParam("Digits") String digit, @RequestParam("From") String from_number)
                throws PlivoXmlException {
            System.out.println("Digit pressed:" + digit);
            Response response = new Response();
            if (digit.equals("1")) {
                response.children(new Dial().action("http://1e69c9096712.ngrok.io/multilevelivr/action/").method("POST")
                        .redirect(false).children(new Number("<number_1>")));
            } else if (digit.equals("2")) {
                response.children(new Dial().action("http://1e69c9096712.ngrok.io/multilevelivr/action/").method("POST")
                        .redirect(false).children(new Number("<number_2>")));
            } else {
                response.children(new Speak(WrongInput));
            }
            System.out.println(response.toXmlString());
            return response;
        }
    }
    

    Control the gathering of DTMF inputs

    You can fine-tune DTMF collection by using the attributes available for GetInput XML, such as digitEndTimeout, numDigits, finishOnKey, and executionTimeout.

    digitEndTimeout: You can use this attribute to set the time interval between successive digit inputs. The default value is auto and the allowed values are 2 to 10 seconds or auto. If the end-user has not provided any new digit input within the digitEndTimeout period, the digits entered to that point will be processed.

    numDigits: You can use this attribute to set the maximum number of digits to collect from the end-user in the current operation. The default value is 32 and the allowed values are 1 to 32.

    If the end-user provides more digit inputs than the numDigits allows, Plivo will only send the maximum number of digits specified as numDigits to the action URL and the rest of the digit inputs will be ignored. For example, if numDigits is specified as ‘4’ and if the user provides 5 digits, then the last digit input will be ignored.

    finishOnKey: You can use this attribute to define the key that end-users press to submit their digit input. The default value is # and the allowed values are 0-9, *, #, an empty string, or ‘none’. When you set the value to an empty string or ‘none’, DTMF input collection ends based on the timeout or the numDigits attribute.

    Note: The above three attributes apply to the input types dtmf and dtmf speech, and do not apply to the speech input type. If all three attributes are specified, finishOnKey takes priority.
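
    To make the interaction between numDigits and finishOnKey concrete, the following plain-Java sketch mimics the rules described above. It is an illustration only, not Plivo's implementation: DtmfCollectionSketch and collect are hypothetical names, timeouts are not modelled, and the sketch assumes the finish key itself is not included in the collected digits.

```java
// Illustrative only: mimics the documented numDigits / finishOnKey rules.
// DtmfCollectionSketch and collect(...) are hypothetical names, not part of the Plivo SDK.
final class DtmfCollectionSketch {

    /**
     * Returns the digits that would be kept for a sequence of key presses.
     *
     * @param keysPressed keys in the order the caller pressed them, e.g. "12345#"
     * @param numDigits   maximum number of digits to keep (1 to 32)
     * @param finishOnKey key that submits the input, or "" / "none" to disable it
     */
    static String collect(String keysPressed, int numDigits, String finishOnKey) {
        if (numDigits < 1 || numDigits > 32) {
            throw new IllegalArgumentException("numDigits must be between 1 and 32");
        }
        boolean finishKeyEnabled = !finishOnKey.isEmpty() && !finishOnKey.equals("none");
        StringBuilder digits = new StringBuilder();
        for (char key : keysPressed.toCharArray()) {
            if (finishKeyEnabled && finishOnKey.charAt(0) == key) {
                break; // assumption: the finish key submits the input and is not part of it
            }
            if (digits.length() < numDigits) {
                digits.append(key); // keys beyond numDigits are ignored
            }
        }
        return digits.toString();
    }

    public static void main(String[] args) {
        System.out.println(collect("12345", 4, "#")); // prints 1234 (fifth digit ignored)
        System.out.println(collect("12#34", 4, "#")); // prints 12 (finish key ends the input)
    }
}
```

    The first call reproduces the numDigits example from the text: five digits are pressed, and only the first four are kept.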

    executionTimeout: You can use this attribute to configure the maximum time during which input detection is performed. This lets you move on to the next element in the XML response when the end-user does not provide any input on the call. The default value is 15 seconds, and the allowed values are 5 to 60 seconds.
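
    Since each attribute has a documented range, it can help to check values in your own application before generating XML. The sketch below is one possible approach in plain Java. GetInputDtmfOptions and its methods are hypothetical helpers (not part of the Plivo SDK), and the action URL is assumed to contain no characters that need XML escaping.

```java
// Illustrative only: checks GetInput DTMF attributes against the ranges listed above.
// GetInputDtmfOptions is a hypothetical helper, not part of the Plivo SDK.
final class GetInputDtmfOptions {

    private static final String FINISH_KEYS = "0123456789*#";

    /** digitEndTimeout: "auto" or a whole number of seconds from 2 to 10. */
    static boolean isValidDigitEndTimeout(String value) {
        if ("auto".equals(value)) {
            return true;
        }
        try {
            int seconds = Integer.parseInt(value);
            return seconds >= 2 && seconds <= 10;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    /** numDigits: 1 to 32. */
    static boolean isValidNumDigits(int value) {
        return value >= 1 && value <= 32;
    }

    /** finishOnKey: one key from 0-9, *, #, or "" / "none" to disable it. */
    static boolean isValidFinishOnKey(String value) {
        if (value.isEmpty() || value.equals("none")) {
            return true;
        }
        return value.length() == 1 && FINISH_KEYS.indexOf(value.charAt(0)) >= 0;
    }

    /** executionTimeout: 5 to 60 seconds. */
    static boolean isValidExecutionTimeout(int seconds) {
        return seconds >= 5 && seconds <= 60;
    }

    /** Builds the opening GetInput tag once every value has passed its check. */
    static String openingTag(String actionUrl, String digitEndTimeout, int numDigits,
                             String finishOnKey, int executionTimeout) {
        if (!isValidDigitEndTimeout(digitEndTimeout) || !isValidNumDigits(numDigits)
                || !isValidFinishOnKey(finishOnKey) || !isValidExecutionTimeout(executionTimeout)) {
            throw new IllegalArgumentException("GetInput attribute out of range");
        }
        return String.format(
                "<GetInput action=\"%s\" method=\"POST\" inputType=\"dtmf\" digitEndTimeout=\"%s\" "
                        + "numDigits=\"%d\" finishOnKey=\"%s\" executionTimeout=\"%d\">",
                actionUrl, digitEndTimeout, numDigits, finishOnKey, executionTimeout);
    }
}
```

    Rejecting out-of-range values early keeps configuration mistakes in your own logs instead of surfacing them during a live call.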

    Detect speech inputs

    In this segment, you can learn how to use the GetInput XML feature to capture speech inputs and implement a simple IVR phone system.

    Outline

    Receive Speech

    Let’s consider the simple IVR phone tree below:

    1. Caller dials a phone number, and a virtual assistant answers the call.
    2. The first branch of the IVR phone tree will include two choices, such as “Say Sales to talk to our Sales representative. Say Support to talk to our Support representative”.
    3. If the caller says “sales” then the call will be connected to the sales representative or if the caller says “support” then the call will be connected to the support representative.

    Create a Spring App to Detect Speech Inputs

    Now, locate the PlivoVoiceApplication.java file in the src/main/java/com.example.demo/ folder and paste the following code.

    Note: Here, the demo application name is PlivoVoiceApplication.java because the friendly name provided in the Spring Initializr was Plivo Voice.
    
    package com.example.Plivo;
    
    import com.plivo.api.exceptions.PlivoXmlException;
    import com.plivo.api.xml.*;
    import com.plivo.api.xml.Number;
    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.web.bind.annotation.*;
    
    @SpringBootApplication
    @RestController
    public class PlivoVoiceApplication {
        public static void main(final String[] args) {
            SpringApplication.run(PlivoVoiceApplication.class, args);
        }
    
        // Welcome message - firstbranch
        String welcomeMessage = "Welcome to the demo app, Say Sales to talk to our Sales representative. Say Support to talk to our Support representative";
        // This is the message that Plivo reads when the caller does nothing at all
        String noInput = "Sorry, I didn't catch that. Please hangup and try again later.";
        // This is the message that Plivo reads when the caller says something other than the expected keywords.
        String wrongInput = "Sorry, it's a wrong input.";
    
        @GetMapping(value = "/response/ivr/", produces = { "application/xml" })
    
        public Response getInput() throws PlivoXmlException {
            final Response response = new Response().children(
                    new GetInput().action("http://1e69c9096712.ngrok.io/multilevelivr/firstbranch/").method("POST")
                            .interimSpeechResultsCallback("https://3273948bbc57.ngrok.io/ivrspeech/firstbranch/")
                            .interimSpeechResultsCallbackMethod("POST").inputType("speech").redirect(true)
                            .children(new Speak(welcomeMessage)))
                    .children(new Speak(noInput));
            System.out.println(response.toXmlString());
            return response;
        }
    
        @RequestMapping(value = "/multilevelivr/firstbranch/", produces = {
                "application/xml" }, method = RequestMethod.POST)
        public Response callforward(@RequestParam("Speech") final String speech,
                @RequestParam("From") final String fromNumber) throws PlivoXmlException {
            System.out.println("Speech Input is:" + speech);
            final Response response = new Response();
            // Compare case-insensitively and ignore surrounding whitespace in the transcription.
            if (speech.trim().equalsIgnoreCase("sales")) {
                response.children(
                        new Dial().callerId(fromNumber).action("http://1e69c9096712.ngrok.io/multilevelivr/action/")
                                .method("POST").redirect(false).children(new Number("<number_1>")));
            } else if (speech.trim().equalsIgnoreCase("support")) {
                response.children(
                        new Dial().callerId(fromNumber).action("http://1e69c9096712.ngrok.io/multilevelivr/action/")
                                .method("POST").redirect(false).children(new Number("<number_2>")));
            } else {
                response.children(new Speak(wrongInput));
            }
            System.out.println(response.toXmlString());
            return response;
        }
    }
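
    Transcribed speech can differ in casing or spacing, or include extra words such as “sales please”, so a strict string comparison may miss valid answers. The sketch below shows one way to normalize the Speech value before routing. SpeechKeywordMatcher and Choice are hypothetical names, and matching on a substring is an assumption that may be too permissive for some use-cases.

```java
import java.util.Locale;

// Illustrative only: maps a transcription to one of the two menu choices.
// SpeechKeywordMatcher and Choice are hypothetical names, not part of the Plivo SDK.
final class SpeechKeywordMatcher {

    enum Choice { SALES, SUPPORT, UNKNOWN }

    /** Returns the menu choice named in the transcription, or UNKNOWN if it is missing or ambiguous. */
    static Choice match(String speech) {
        if (speech == null) {
            return Choice.UNKNOWN;
        }
        String normalized = speech.trim().toLowerCase(Locale.ROOT);
        boolean saidSales = normalized.contains("sales");
        boolean saidSupport = normalized.contains("support");
        if (saidSales == saidSupport) {
            return Choice.UNKNOWN; // neither keyword, or both: do not guess
        }
        return saidSales ? Choice.SALES : Choice.SUPPORT;
    }
}
```

    In the handler above, the result of match(speech) could decide which Dial element to return, with UNKNOWN falling through to the wrong-input message.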
    

    Speech recognition model & hints

    Speech Model

    You can select the type of Automatic Speech Recognition (ASR) model using the speechModel attribute. Choose the model that best matches your use-case.

    • You can set the speechModel as “command_and_search” for shorter audio clips. For example, if you expect callers to use voice commands or voice search, then you can use this model.
    • If you want to transcribe the audio from a phone call, you can set the model as “phone_call”.
    • You can explore both these models and see which one is best suited to your use-case.
    • You can set the model as “default” if your use-case does not suit the above models.

    Example XML:

    <Response>
    <GetInput action="https://example.com/action/" method="POST" inputType="speech" speechModel="command_and_search" redirect="true">
    <Speak>Welcome to the demo app, Say Sales to talk to our Sales representative. Say Support to talk to our Support representative</Speak>
    </GetInput>
    <Speak>Sorry, I didn't catch that. Please hangup and try again later.</Speak>
    </Response>
    

    Hints

    You can use the hints attribute to improve speech transcription results. Using this attribute, you can define the words and phrases that are common in your use-case. For example, if your use-case is a call center where callers mostly use voice commands to reach support or sales, you can use the keywords “support” and “sales” as hints.

    • Allowed values: a non-empty string of comma-separated phrases.
    • Limitations are:
      • Phrases per request: 500.
      • Characters per request: 10000.
      • Characters per phrase: 100.
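
    The limits above can be checked before a hints value is placed in the XML. The following plain-Java sketch shows one way to do that. HintsValidator is a hypothetical helper, not part of the Plivo SDK. It assumes that the per-request character limit counts the whole string including commas, and that empty phrases (for example "sales,,support") are not acceptable. Neither detail is stated in the text.

```java
// Illustrative only: checks a hints string against the documented limits.
// HintsValidator is a hypothetical helper, not part of the Plivo SDK.
final class HintsValidator {

    static final int MAX_PHRASES = 500;
    static final int MAX_TOTAL_CHARS = 10000;
    static final int MAX_PHRASE_CHARS = 100;

    /** Returns true when hints is a non-empty, comma-separated list within all three limits. */
    static boolean isValid(String hints) {
        if (hints == null || hints.trim().isEmpty()) {
            return false; // must be a non-empty string
        }
        if (hints.length() > MAX_TOTAL_CHARS) {
            return false; // assumption: commas count toward the per-request limit
        }
        String[] phrases = hints.split(",", -1); // -1 keeps trailing empty phrases
        if (phrases.length > MAX_PHRASES) {
            return false;
        }
        for (String phrase : phrases) {
            String trimmed = phrase.trim();
            if (trimmed.isEmpty() || trimmed.length() > MAX_PHRASE_CHARS) {
                return false; // assumption: empty phrases are rejected
            }
        }
        return true;
    }
}
```

    For example, HintsValidator.isValid("sales,support") returns true, while a single phrase of 101 characters is rejected.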

    Example XML:

    <Response>
    <GetInput action="https://example.com/action/" method="POST" inputType="speech" hints="sales,support" redirect="true">
    <Speak>Welcome to the demo app, Say Sales to talk to our Sales representative. Say Support to talk to our Support representative</Speak>
    </GetInput>
    <Speak>Sorry, I didn't catch that. Please hangup and try again later.</Speak>
    </Response>
    

    Control the gathering of speech inputs

    You can fine-tune speech input collection by using the attributes available for GetInput XML, such as speechEndTimeout, language, profanityFilter, and executionTimeout.

    speechEndTimeout: You can use this attribute to set the time that Plivo has to wait for more speech inputs once silence is detected. The default value is auto and the allowed values are 2 to 10 seconds or auto. If the end-user has not provided any new speech input within the speechEndTimeout period, the speech collected to that point will be processed.

    language: You can use this attribute to specify the language (along with the national or regional dialect) of the audio to be recognized on calls. The default language for speech detection is en-US. You can choose your preferred language from the language list in the languages section of this document.

    profanityFilter: Set this attribute to “true” to have Plivo filter profane words out of the transcription of end-users' speech inputs. The filter applies to single words and does not work for a combination of words. The default value is “false”, so profane words are not filtered unless you enable the attribute.

    Note: The above three attributes apply to input types speech and dtmf speech and do not apply to the dtmf input type.

    executionTimeout: You can use this attribute to configure the maximum time during which speech detection is performed. This lets you move on to the next element in the XML response when the end-user does not provide any input on the call. The default value is 15 seconds, and the allowed values are 5 to 60 seconds.

    Example XML:

    <Response>
    <GetInput action="https://example.com/action/" method="POST" inputType="speech" speechEndTimeout="5" language="en-US" profanityFilter="true" executionTimeout="25" redirect="true">
    <Speak>Welcome to the demo app, Say Sales to talk to our Sales representative. Say Support to talk to our Support representative</Speak>
    </GetInput>
    <Speak>Sorry, I didn't catch that. Please hangup and try again later.</Speak>
    </Response>
    

    Real-time Speech Recognition

    You can use the interimSpeechResultsCallback attribute to perform real-time speech recognition. Set this attribute to the URL of your application server to receive real-time callbacks with the user’s recognized speech while the user is still speaking on the call. Plivo sends the transcribed result to your server URL with attributes such as StableSpeech, UnstableSpeech, Stability, and SequenceNumber.

    • UnstableSpeech: This will hold the interim transcribed result of the user’s speech, which may be refined when more speech is collected from the user.
    • StableSpeech: This will hold the stable transcribed result of the user’s speech.
    • Stability: This field holds the UnstableSpeech stability score. Values range from 0.0 to 1.0, with 0.0 being completely unstable and 1.0 being completely stable. This value depicts the estimation of the probability that the recognizer will not change its guess about the interim speech result.
    • SequenceNumber: This argument will hold the sequence number of the interim speech callback that will help you to order the incoming callback requests.
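
    Because interim callbacks are separate requests, your server may need to pick the most recent one by SequenceNumber. The sketch below shows one way to do that in plain Java. InterimSpeechSketch and InterimResult are hypothetical names, the fields simply mirror the callback attributes listed above, and treating UnstableSpeech as a continuation of StableSpeech is an assumption made for illustration.

```java
import java.util.List;

// Illustrative only: picks the newest interim callback and derives a working transcript.
// InterimSpeechSketch and InterimResult are hypothetical names, not part of the Plivo SDK.
final class InterimSpeechSketch {

    /** One interim callback. Fields mirror the attributes described above. */
    static final class InterimResult {
        final int sequenceNumber;
        final String stableSpeech;
        final String unstableSpeech;
        final double stability; // 0.0 (completely unstable) to 1.0 (completely stable)

        InterimResult(int sequenceNumber, String stableSpeech, String unstableSpeech, double stability) {
            this.sequenceNumber = sequenceNumber;
            this.stableSpeech = stableSpeech;
            this.unstableSpeech = unstableSpeech;
            this.stability = stability;
        }
    }

    /** Returns the callback with the highest SequenceNumber, or null if none were received. */
    static InterimResult latest(List<InterimResult> received) {
        InterimResult newest = null;
        for (InterimResult result : received) {
            if (newest == null || result.sequenceNumber > newest.sequenceNumber) {
                newest = result;
            }
        }
        return newest;
    }

    /**
     * Returns the stable text of the newest callback, followed by its unstable text
     * only when the stability score reaches minStability.
     */
    static String bestTranscript(List<InterimResult> received, double minStability) {
        InterimResult newest = latest(received);
        if (newest == null) {
            return "";
        }
        String text = newest.stableSpeech;
        if (newest.stability >= minStability) {
            // assumption: the unstable text continues the stable text
            text = text + " " + newest.unstableSpeech;
        }
        return text.trim();
    }
}
```

    A higher minStability makes the transcript more conservative, because low-confidence interim text is left out until the recognizer settles.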

    Example XML:

    <Response>
    <GetInput action="https://example.com/action/" method="POST" interimSpeechResultsCallback="https://example.com/interimcallback/" interimSpeechResultsCallbackMethod="POST" inputType="speech" redirect="true">
    <Speak>Welcome to the demo app, Say Sales to talk to our Sales representative. Say Support to talk to our Support representative</Speak>
    </GetInput>
    <Speak>Sorry, I didn't catch that. Please hangup and try again later.</Speak>
    </Response>
    

    Data logging preferences

    You can use the log attribute of the GetInput XML to manage input logging preferences. If you define this attribute as “false” then logging will be disabled and Plivo will not log the digit and speech inputs. The default value for this is “true”.