Receive DTMF/Speech Input using Node.js

    Overview

    Capturing user input is a critical capability in any phone system. User input, captured either as dual-tone multi-frequency (DTMF) digit presses or as speech, is useful in many use cases, such as IVR phone systems, conversational IVRs, virtual assistants, and voice-based forms and surveys. Plivo's Voice platform provides features that you can use to implement business use cases that involve the secure capture of DTMF and speech input.

    Set Up Your Node.js Dev Environment

    In this section, we’ll walk you through setting up an Express server in under five minutes so that you can start handling incoming calls and callbacks.

    Install Node.js

    Operating System | Instructions
    OS X & Linux | To see if you already have Node.js installed, run the command node --version in the terminal. If you don't have it installed, you can install it from the Node.js website (nodejs.org).
    Windows | Download the Node.js installer from the Node.js website (nodejs.org) and run it.

    Install Plivo Node.js Package

    • Create a project directory by running the following command:

      $ mkdir mynodeapp
      
    • Change to the project directory in the command line:

      $ cd mynodeapp
      
    • Install the Plivo SDK and Express (which the examples below require) using npm:

      $ npm install plivo express
      

    Detect DTMF inputs

    Outline

    In this section, we will show you how to implement a multi-level IVR phone system and capture digit press inputs (DTMF) on the Plivo voice platform.

    Receive DTMF

    The example IVR phone tree below has been implemented using the GetInput XML feature:

    1. Caller dials a phone number, and a virtual assistant answers the call.
    2. The first branch of the IVR phone tree will include three choices, such as “Press 1 for your account balance. Press 2 for your account status. Press 3 to speak to a representative.”
    3. Options 1 and 2 will automatically retrieve the information and play the caller a text-to-speech message, and option 3 will redirect the caller to the second branch of the IVR.
    4. The second branch of the IVR will have two options, such as “Press 1 for Sales. Press 2 for Support.”
    5. If the caller presses “1”, the call is connected to the sales representative; if the caller presses “2”, the call is connected to the support representative.
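
    Before wiring the phone tree above to Plivo XML, it can help to model it as a small lookup table. The following is a minimal, dependency-free sketch: the ivrTree object, its say/goto/dial action names, and the nextStep helper are hypothetical names chosen for illustration and are not part of the Plivo SDK.

    ```javascript
    // Hypothetical routing table for the two-level phone tree described above.
    const ivrTree = {
      first: {
        '1': { action: 'say', text: 'Your account balance is $20.' },
        '2': { action: 'say', text: 'Your account status is active' },
        '3': { action: 'goto', branch: 'second' },
      },
      second: {
        '1': { action: 'dial', target: 'sales' },
        '2': { action: 'dial', target: 'support' },
      },
    };

    // True only for keys defined directly on the object (ignores inherited ones).
    function has(obj, key) {
      return Object.prototype.hasOwnProperty.call(obj, key);
    }

    // Look up what to do for a digit pressed in a given branch.
    // Unknown branches or digits fall back to a "wrong input" step.
    function nextStep(branch, digit) {
      if (has(ivrTree, branch) && has(ivrTree[branch], digit)) {
        return ivrTree[branch][digit];
      }
      return { action: 'say', text: "Sorry, it's a wrong input." };
    }

    console.log(nextStep('first', '3'));
    console.log(nextStep('second', '2'));
    console.log(nextStep('first', '9'));
    ```

    Keeping the tree in one table means that adding a menu option is a data change; the request handlers only need to translate each step into the matching XML element.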

    Create an Express App to Detect DTMF Inputs

    
    var plivo = require('plivo');
    var express = require('express');
    var app = express();
    app.set('port', (process.env.PORT || 5000));
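    app.use(express.static(__dirname + '/public'));
    // Parse URL-encoded POST bodies so that Plivo's callback parameters (such as Digits and From) are available on request.body
    app.use(express.urlencoded({ extended: true }));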
    //  Welcome message - firstbranch
    var WelcomeMessage = "Welcome to the demo app, Press 1 for your account balance. Press 2 for your account status. Press 3 to talk to our representative"
    // Message for Second branch
    var RepresentativeBranch = "Press 1 to talk to our Sales representative. Press 2 to talk to our Support representative"
    // This is the message that Plivo reads when the caller does nothing at all
    var NoInput = "Sorry, I didn't catch that. Please hangup and try again later."
    // This is the message that Plivo reads when the caller inputs a wrong digit.
    var WrongInput = "Sorry, it's a wrong input."
    
    app.all('/response/ivr/', function (request, response) {
    	if (request.method == "GET") {
    		var r = new plivo.Response();
    		const get_input = r.addGetInput(
    			{
    				'action': 'http://809927bea504.ngrok.io/multilevelivr/firstbranch/',
    				"method": 'POST',
    				'inputType': 'dtmf',
    				'digitEndTimeout': '5',
    				'language': 'en-US',
    				'redirect': 'true',
    			});
    		get_input.addSpeak(WelcomeMessage);
    		r.addSpeak(NoInput);
    		console.log(r.toXML());
    		response.set({ 'Content-Type': 'text/xml' });
    		response.end(r.toXML());
    	}
    });
    
    app.all('/multilevelivr/firstbranch/', function (request, response) {
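    	// The action URL is called with POST, so read the body first and fall back to the query string
    	var digits = (request.body && request.body.Digits) || request.query.Digits;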
    	console.log("Digit pressed", digits)
    	var r = new plivo.Response();
    	if (digits == "1") {
    		var BalMessage = "Your account balance is $20.";
    		r.addSpeak(BalMessage);
    	}
    	else if (digits == "2") {
    		var StatMessage = "Your account status is active"
    		r.addSpeak(StatMessage);
    	}
    	else if (digits == "3") {
    		const get_input = r.addGetInput(
    			{
    				'action': 'http://809927bea504.ngrok.io/multilevelivr/secondbranch/',
    				"method": 'POST',
    				'inputType': 'dtmf',
    				'digitEndTimeout': '5',
    				'language': 'en-US',
    				'redirect': 'false',
    				'profanityFilter': 'true'
    			});
    		// Speak attributes are passed as an object (JavaScript has no keyword arguments)
    		get_input.addSpeak(RepresentativeBranch, { 'voice': 'Polly.Salli', 'language': 'en-US' });
    		r.addSpeak(NoInput);
    		console.log(r.toXML());
    	}
    	else {
    		r.addSpeak(WrongInput);
    	}
    	response.set({ 'Content-Type': 'text/xml' });
    	response.end(r.toXML());
    });
    
    app.all('/multilevelivr/secondbranch/', function (request, response) {
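    	// The action URL is called with POST, so read the body first and fall back to the query string
    	var from_number = (request.body && request.body.From) || request.query.From;
    	var digits = (request.body && request.body.Digits) || request.query.Digits;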
    	console.log("Digit pressed", digits)
    	var r = new plivo.Response();
    	var params = {
    		'action': "http://809927bea504.ngrok.io/multilevelivr/action/",
    		'method': "POST",
    		'redirect': "false",
    		'callerId': from_number
    	};
    	// Add the Dial element only for a valid choice, so that a wrong input does not produce an empty Dial
    	if (digits == "1") {
    		r.addDial(params).addNumber("<number_1>");
    		console.log(r.toXML());
    	}
    	else if (digits == "2") {
    		r.addDial(params).addNumber("<number_2>");
    		console.log(r.toXML());
    	}
    	else {
    		r.addSpeak(WrongInput);
    	}
    	response.set({ 'Content-Type': 'text/xml' });
    	response.end(r.toXML());
    });
    
    app.listen(app.get('port'), function () {
    	console.log('Node app is running on port', app.get('port'));
    });
    

    Save this code in a file (for example, detect_dtmf.js). Replace the ngrok URLs and the <number_1> and <number_2> placeholders with your own public server URL and phone numbers. To run the file, go to the folder that contains it and use the following command:

    $ node detect_dtmf.js
    

    You should then see your basic server app in action at http://localhost:5000/response/ivr/ (the app listens on port 5000 unless the PORT environment variable is set).

    Control the gathering of DTMF inputs

    You can fine-tune DTMF collection by using the attributes available for GetInput XML, such as digitEndTimeout, numDigits, finishOnKey, and executionTimeout.

    digitEndTimeout: You can use this attribute to set the time interval between successive digit inputs. The default value is auto and the allowed values are 2 to 10 seconds or auto. If the end-user has not provided any new digit input within the digitEndTimeout period, the digits entered to that point will be processed.

    numDigits: You can use this attribute to set the maximum number of digits the end-user has to provide on the call in the current operation. The default value is 32 and the allowed values are 1 to 32.

    If the end-user provides more digit inputs than the numDigits allows, Plivo will only send the maximum number of digits specified as numDigits to the action URL and the rest of the digit inputs will be ignored. For example, if numDigits is specified as ‘4’ and if the user provides 5 digits, then the last digit input will be ignored.

    finishOnKey: You can use this attribute to define the key that end-users need to press to submit their digit input. The default value is # and the allowed values are 0-9, *, #, <empty string>, or ‘none’. When you set the value to <empty string> or ‘none’, DTMF input collection ends based on the timeout or the numDigits attribute.

    Note: The above three attributes apply to the dtmf and dtmf speech input types and do not apply to the speech input type. If all three attributes are specified, finishOnKey takes precedence.
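
    To make the interaction between numDigits and finishOnKey concrete, here is a simplified local model of the rules stated above. It is only a sketch for reasoning about the attributes, not Plivo's implementation: the collectDigits function is a hypothetical name, timing (digitEndTimeout) is not modelled, and it assumes that the finish key itself is not included in the collected digits.

    ```javascript
    // Simplified model: walk through the keys a caller presses and return
    // the digits that would be collected under the rules described above.
    function collectDigits(keys, { numDigits = 32, finishOnKey = '#' } = {}) {
      // An empty string or 'none' disables the finish key.
      const finishDisabled = finishOnKey === '' || finishOnKey === 'none';
      let digits = '';
      for (const key of keys) {
        // The finish key submits whatever has been entered so far.
        if (!finishDisabled && key === finishOnKey) break;
        // Inputs beyond numDigits are ignored.
        if (digits.length >= numDigits) break;
        digits += key;
      }
      return digits;
    }

    console.log(collectDigits('12345', { numDigits: 4 }));                      // 1234
    console.log(collectDigits('12#45'));                                        // 12
    console.log(collectDigits('12#45', { finishOnKey: 'none', numDigits: 4 })); // 12#4
    ```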

    executionTimeout: You can use this attribute to configure the maximum time during which input detection is performed. This lets you process the next element in the XML response when the end-user does not provide any input on the call. The default value is 15 seconds, and the allowed values are 5 to 60 seconds.
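
    Since each of these attributes has a documented range, you may want to check your values before generating XML. The sketch below encodes only the ranges stated in this section; validateDtmfOptions and isIntInRange are hypothetical helper names and are not part of the Plivo SDK.

    ```javascript
    // True when value is an integer (number or numeric string) within [min, max].
    function isIntInRange(value, min, max) {
      const n = Number(value);
      return Number.isInteger(n) && n >= min && n <= max;
    }

    // Returns a list of problems; an empty list means every supplied value is in range.
    function validateDtmfOptions(opts) {
      const errors = [];
      if ('digitEndTimeout' in opts && opts.digitEndTimeout !== 'auto' &&
          !isIntInRange(opts.digitEndTimeout, 2, 10)) {
        errors.push('digitEndTimeout must be "auto" or 2 to 10 seconds');
      }
      if ('numDigits' in opts && !isIntInRange(opts.numDigits, 1, 32)) {
        errors.push('numDigits must be 1 to 32');
      }
      // Allowed: a single key from 0-9, * or #, an empty string, or "none".
      if ('finishOnKey' in opts && !/^([0-9*#]|none)?$/.test(String(opts.finishOnKey))) {
        errors.push('finishOnKey must be 0-9, *, #, an empty string, or "none"');
      }
      if ('executionTimeout' in opts && !isIntInRange(opts.executionTimeout, 5, 60)) {
        errors.push('executionTimeout must be 5 to 60 seconds');
      }
      return errors;
    }

    console.log(validateDtmfOptions({ digitEndTimeout: '5', numDigits: 4, finishOnKey: '#' }));
    console.log(validateDtmfOptions({ digitEndTimeout: '1', executionTimeout: 90 }));
    ```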

    Detect speech inputs

    In this segment, you can learn how to use the GetInput XML feature to capture speech inputs and implement a simple IVR phone system.

    Outline

    Receive Speech

    Let’s consider the simple IVR phone tree below:

    1. Caller dials a phone number, and a virtual assistant answers the call.
    2. The first branch of the IVR phone tree will include two choices, such as “Say Sales to talk to our Sales representative. Say Support to talk to our Support representative”.
    3. If the caller says “sales”, the call is connected to the sales representative; if the caller says “support”, the call is connected to the support representative.
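
    Transcribed speech rarely arrives as an exact lowercase keyword, so the routing step above usually needs some normalization. The following dependency-free sketch shows one way to do that; the routes table and the routeForSpeech helper are hypothetical, and the <number_1> and <number_2> values are the same placeholders used in the example app.

    ```javascript
    // Hypothetical mapping from spoken keyword to the number to dial.
    const routes = { sales: '<number_1>', support: '<number_2>' };

    // Lowercase the transcript, split it into words, and return the first
    // word that matches a known keyword (or null when nothing matches).
    function routeForSpeech(transcript) {
      const words = String(transcript || '')
        .toLowerCase()
        .split(/[^a-z]+/)
        .filter(Boolean);
      const keyword = words.find((w) => Object.prototype.hasOwnProperty.call(routes, w));
      return keyword ? { keyword: keyword, number: routes[keyword] } : null;
    }

    console.log(routeForSpeech('Sales'));
    console.log(routeForSpeech('I need support, please.'));
    console.log(routeForSpeech('billing'));
    ```

    Matching on individual words lets a caller say a full sentence; a null result corresponds to the "wrong input" branch.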

    Create an Express App to Detect Speech Inputs

    
    var plivo = require('plivo');
    var express = require('express');
    var app = express();
    app.set('port', (process.env.PORT || 5000));
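    app.use(express.static(__dirname + '/public'));
    // Parse URL-encoded POST bodies so that Plivo's callback parameters (such as Speech and From) are available on request.body
    app.use(express.urlencoded({ extended: true }));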
    //  Welcome message - firstbranch
    var WelcomeMessage = "Welcome to the demo app, Say Sales to talk to our Sales representative. Say Support to talk to our Support representative"
    // This is the message that Plivo reads when the caller does nothing at all
    var NoInput = "Sorry, I didn't catch that. Please hangup and try again later."
    // This is the message that Plivo reads when the caller says something the app does not recognize.
    var WrongInput = "Sorry, it's a wrong input."
    
    app.all('/response/ivr/', function (request, response) {
    	if (request.method == "GET") {
    		var r = new plivo.Response();
    		const get_input = r.addGetInput(
    			{
    				'action': 'http://809927bea504.ngrok.io/multilevelivr/firstbranch/',
    				'method': 'POST',
    				'interimSpeechResultsCallback': 'https://3273948bbc57.ngrok.io/ivrspeech/firstbranch/',
    				'interimSpeechResultsCallbackMethod': 'POST',
    				'inputType': 'speech',
    				'redirect': 'true',
    			});
    		get_input.addSpeak(WelcomeMessage);
    		r.addSpeak(NoInput);
    		console.log(r.toXML());
    		response.set({ 'Content-Type': 'text/xml' });
    		response.end(r.toXML());
    	}
    });
    
    app.all('/multilevelivr/firstbranch/', function (request, response) {
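    	// The action URL is called with POST, so read the body first and fall back to the query string
    	var from_number = (request.body && request.body.From) || request.query.From;
    	// Normalize the transcript so that "Sales" or " sales " still matches the comparisons below
    	var speech = String((request.body && request.body.Speech) || request.query.Speech || "").trim().toLowerCase();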
    	console.log("Speech Input is:", speech)
    	var r = new plivo.Response();
    	var params = {
    		'action': 'http://809927bea504.ngrok.io/multilevelivr/action/',
    		'method': 'POST',
    		'redirect': 'false',
    		'callerId': from_number
    	};
    	// Add the Dial element only for a recognized keyword, so that other input does not produce an empty Dial
    	if (speech == "sales") {
    		r.addDial(params).addNumber("<number_1>");
    		console.log(r.toXML());
    	}
    	else if (speech == "support") {
    		r.addDial(params).addNumber("<number_2>");
    		console.log(r.toXML());
    	}
    	else {
    		r.addSpeak(WrongInput);
    	}
    	response.set({ 'Content-Type': 'text/xml' });
    	response.end(r.toXML());
    });
    
    app.listen(app.get('port'), function () {
    	console.log('Node app is running on port', app.get('port'));
    });
    

    Save this code in a file (for example, detect_speech.js). Replace the ngrok URLs and the <number_1> and <number_2> placeholders with your own public server URL and phone numbers. To run the file, go to the folder that contains it and use the following command:

    $ node detect_speech.js
    

    You should then see your basic server app in action at http://localhost:5000/response/ivr/ (the app listens on port 5000 unless the PORT environment variable is set).

    Speech recognition model & hints

    Speech Model

    You can select the Automatic Speech Recognition (ASR) model using the speechModel attribute. Choose the model that best matches your use case:

    • You can set the speechModel as “command_and_search” for shorter audio clips. For example, if you expect callers to use voice commands or voice search, then you can use this model.
    • If you want to transcribe the audio from a phone call, you can set the model as “phone_call”.
    • You can explore both these models and see which one is best suited to your use-case.
    • You can set the model as “default” if your use-case does not suit the above models.

    Example XML:

    <Response>
    <GetInput action="https://example.com/action/" method="POST" inputType="speech" speechModel="command_and_search" redirect="true">
    <Speak>Welcome to the demo app, Say Sales to talk to our Sales representative. Say Support to talk to our Support representative</Speak>
    </GetInput>
    <Speak>Sorry, I didn't catch that. Please hangup and try again later.</Speak>
    </Response>
    

    Hints

    You can use the hints attribute to improve speech transcription results. With this attribute, you define the words and phrases that are common in your use case. For example, if your use case is a call center where callers mostly use voice commands to reach support or sales, you can use the keywords “support” and “sales” as hints.

    • Allowed values: a non-empty string of comma-separated phrases.
    • Limitations are:
      • Phrases per request: 500.
      • Characters per request: 10000.
      • Characters per phrase: 100.

    Example XML:

    <Response>
    <GetInput action="https://example.com/action/" method="POST" inputType="speech" hints="sales,support" redirect="true">
    <Speak>Welcome to the demo app, Say Sales to talk to our Sales representative. Say Support to talk to our Support representative</Speak>
    </GetInput>
    <Speak>Sorry, I didn't catch that. Please hangup and try again later.</Speak>
    </Response>
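
    The limits on hints listed above can be checked before the attribute value is built. The sketch below is one possible way to do it: buildHints and HINT_LIMITS are hypothetical names, and it makes two assumptions that this document does not state, namely that the per-request character limit is applied to the whole comma-separated string and that a phrase must not itself contain a comma.

    ```javascript
    // Limits as listed in this section.
    const HINT_LIMITS = { phrasesPerRequest: 500, charsPerRequest: 10000, charsPerPhrase: 100 };

    // Build the comma-separated hints value, throwing when a limit is exceeded.
    function buildHints(phrases) {
      const cleaned = phrases.map((p) => String(p).trim()).filter((p) => p.length > 0);
      if (cleaned.length === 0) {
        throw new Error('hints must contain at least one phrase');
      }
      if (cleaned.length > HINT_LIMITS.phrasesPerRequest) {
        throw new Error('too many phrases: ' + cleaned.length);
      }
      for (const phrase of cleaned) {
        // Assumption: a comma inside a phrase would be read as a separator.
        if (phrase.includes(',')) {
          throw new Error('phrase must not contain a comma: ' + phrase);
        }
        if (phrase.length > HINT_LIMITS.charsPerPhrase) {
          throw new Error('phrase is longer than ' + HINT_LIMITS.charsPerPhrase + ' characters');
        }
      }
      const hints = cleaned.join(',');
      // Assumption: the per-request limit covers the full string, separators included.
      if (hints.length > HINT_LIMITS.charsPerRequest) {
        throw new Error('hints value is longer than ' + HINT_LIMITS.charsPerRequest + ' characters');
      }
      return hints;
    }

    console.log(buildHints(['sales', ' support ']));
    ```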
    

    Control the gathering of speech inputs

    You can fine-tune speech input collection by using the attributes available for GetInput XML, such as speechEndTimeout, language, profanityFilter, and executionTimeout.

    speechEndTimeout: You can use this attribute to set the time that Plivo has to wait for more speech inputs once silence is detected. The default value is auto and the allowed values are 2 to 10 seconds or auto. If the end-user has not provided any new speech input within the speechEndTimeout period, the speech collected to that point will be processed.

    language: You can use this attribute to specify the language (along with the national or regional dialect) of the audio to be recognized on calls. The default language for speech detection is en-US. You can choose your preferred language from the list of supported languages.

    profanityFilter: If any profane words are used by end-users while providing speech inputs, Plivo will filter them out during transcription if you define this attribute as “true”. The profanity filter is used for single words and does not work for a combination of words. If you set this attribute to “false” or do not define this attribute, Plivo will not filter profane words by default, as the default value is “false.”

    Note: The above three attributes apply to input types speech and dtmf speech and do not apply to the dtmf input type.

    executionTimeout: You can use this attribute to configure the maximum time during which speech detection is performed. This lets you process the next element in the XML response when the end-user does not provide any input on the call. The default value is 15 seconds, and the allowed values are 5 to 60 seconds.

    Example XML

    <Response>
    <GetInput action="https://example.com/action/" method="POST" inputType="speech" speechEndTimeout="5" language="en-US" profanityFilter="true" executionTimeout="25" redirect="true">
    <Speak>Welcome to the demo app, Say Sales to talk to our Sales representative. Say Support to talk to our Support representative</Speak>
    </GetInput>
    <Speak>Sorry, I didn't catch that. Please hangup and try again later.</Speak>
    </Response>
    

    Real-time Speech Recognition

    You can use the interimSpeechResultsCallback attribute to perform real-time speech recognition. Set this attribute to a URL on your application server to receive real-time callbacks with the user’s recognized speech while the user is still speaking on the call. Plivo sends the transcribed result to your server URL with attributes such as StableSpeech, UnstableSpeech, Stability, and SequenceNumber.

    • UnstableSpeech: This will hold the interim transcribed result of the user’s speech, which may be refined when more speech is collected from the user.
    • StableSpeech: This will hold the stable transcribed result of the user’s speech.
    • Stability: This field holds the UnstableSpeech stability score. Values range from 0.0 to 1.0, with 0.0 being completely unstable and 1.0 being completely stable. This value depicts the estimation of the probability that the recognizer will not change its guess about the interim speech result.
    • SequenceNumber: This argument will hold the sequence number of the interim speech callback that will help you to order the incoming callback requests.

    Example XML

    <Response>
    <GetInput action="https://example.com/action/" method="POST" interimSpeechResultsCallback="https://example.com/interimcallback/" interimSpeechResultsCallbackMethod="POST" inputType="speech" redirect="true">
    <Speak>Welcome to the demo app, Say Sales to talk to our Sales representative. Say Support to talk to our Support representative</Speak>
    </GetInput>
    <Speak>Sorry, I didn't catch that. Please hangup and try again later.</Speak>
    </Response>
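
    Interim callbacks may not reach your server in the order they were sent, which is what SequenceNumber is for. The following sketch keeps only the newest interim result for a call. InterimTranscript is a hypothetical class, not part of the Plivo SDK, and it assumes that sequence numbers increase over the course of a single call and that the callback parameters arrive as strings.

    ```javascript
    // Tracks the most recent interim speech result for one call.
    class InterimTranscript {
      constructor() {
        this.lastSequence = -1;
        this.stable = '';
        this.unstable = '';
        this.stability = 0;
      }

      // Apply one callback payload. Returns false when the payload is
      // malformed or older than a result that was already applied.
      update(params) {
        const sequence = Number(params.SequenceNumber);
        if (!Number.isFinite(sequence) || sequence <= this.lastSequence) {
          return false;
        }
        this.lastSequence = sequence;
        this.stable = params.StableSpeech || '';
        this.unstable = params.UnstableSpeech || '';
        this.stability = Number(params.Stability) || 0;
        return true;
      }
    }

    const transcript = new InterimTranscript();
    transcript.update({ SequenceNumber: '2', StableSpeech: 'sales', UnstableSpeech: '', Stability: '0.9' });
    // A late callback with a lower sequence number is ignored.
    transcript.update({ SequenceNumber: '1', StableSpeech: '', UnstableSpeech: 'say', Stability: '0.1' });
    console.log(transcript.stable, transcript.lastSequence);
    ```

    In an Express app, an instance like this could be kept per call and updated from the handler that serves the interimSpeechResultsCallback URL.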
    

    Data logging preferences

    You can use the log attribute of the GetInput XML to manage input logging preferences. If you define this attribute as “false” then logging will be disabled and Plivo will not log the digit and speech inputs. The default value for this is “true”.
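
    To show where the log attribute sits in the XML, here is a dependency-free sketch that renders a GetInput element by hand. The renderGetInput and escapeXml helpers and the PIN prompt are hypothetical and exist only for this illustration; in an application you would normally generate the XML with the Plivo SDK, as in the Express examples earlier in this guide.

    ```javascript
    // Escape characters that are not allowed inside XML attribute values or text.
    function escapeXml(value) {
      return String(value)
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;');
    }

    // Render a Response containing one GetInput element with a Speak prompt.
    function renderGetInput(attributes, prompt) {
      const attrs = Object.entries(attributes)
        .map(([name, value]) => name + '="' + escapeXml(value) + '"')
        .join(' ');
      return '<Response><GetInput ' + attrs + '><Speak>' + escapeXml(prompt) +
        '</Speak></GetInput></Response>';
    }

    // log="false" asks Plivo not to log the collected digit and speech inputs.
    const xml = renderGetInput(
      { action: 'https://example.com/action/', method: 'POST', inputType: 'dtmf', log: 'false' },
      'Please enter your PIN.'
    );
    console.log(xml);
    ```

    Note that this attribute controls logging on the Plivo side only; any console.log calls in your own handlers still write the input to your application logs.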