Building a chatbot you can call using Bot Framework, Direct Line API and Twilio. Part 1

I couldn’t find any .NET tutorials that showed me how to create a configurable chatbot that I could call on the phone and interact with. That’s not surprising, as a lot of the technology is still cutting edge.

In the future, Azure Communication Services will offer a lot of what you’re about to read out of the box, but for the time being that service is restricted to the US.

I went through a lot of trial and error to join the dots. In this blog post I share how I pieced together the various technologies, and the earlier blogs I read, to make this possible.

This blog is a way for me to remember how to do this.

I read a few blog posts that were fundamental in piecing together this end-to-end solution so first – full credit to the following blogs in helping form parts of the solution:

 

These blogs were very helpful (thank you!).

This blog will be split into two or more parts, and by the time the series is complete you’ll understand how to create a simple bot that you can interact with over the phone.

Tools and Technology

The following technologies and services are used:

  • Azure Speech to Text – converts the human’s voice to text. This is sent to the bot.
  • Azure Text to Speech – converts the chatbot’s response to speech. This is sent to the human.
  • Bot Framework SDK – used to orchestrate messages/activities/dialogs in the chatbot.
  • Direct Line API – lets you programmatically interact with the chatbot.
  • GStreamer CODEC – needed to parse audio.
  • ngrok – for creating a tunnel and exposing our development machine to the internet.
  • Twilio – used to provide a telephone number for the chatbot and route calls.
  • Twilio Media Streams – used to create socket connections and transmit audio data.

There are many moving parts, so a diagram helps to visualise the flow of information and how each component interacts with the others.

Architecture

In this diagram we can see how the main components communicate with each other. The types of data and messages exchanged are also detailed.

Here are the main steps from the diagram above:

  1. Person calls Twilio Number.
  2. Twilio Webhook intercepts call.
  3. Twilio Media Stream is invoked in parallel.
  4. As the person speaks, the Media Stream sends a Base64 string of the audio to the custom Media Handler.
  5. The Base64 is parsed to bytes and pushed into a byte array that contains the audio.
  6. The byte array is sent to Azure Speech to Text.
  7. The transcribed text is sent to the chatbot.
  8. The chatbot processes the text like any other activity.
  9. The chatbot’s response is converted to speech using Azure Text to Speech.
  10. The chatbot audio is converted to the MULAW audio format that Twilio requires.
  11. The MULAW audio is sent back upstream to the telephone.
  12. Rinse and repeat.

There’s quite a lot going on. Here is an overview of some of the main concepts mentioned:

Web Hook

This is an endpoint that you configure in the Twilio developer dashboard when you buy a phone number. The URL points to an endpoint you build. You can see this below. You’ll see mine is pointing to the ngrok URL I have running (ngrok is used to expose my development machine over the web).
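
The post doesn’t show the webhook endpoint itself, but as a rough sketch it’s just an action that returns TwiML telling Twilio to open a media stream to your web socket endpoint. The controller shape, the /voice route, the /media path and the ngrok host below are illustrative assumptions; a bidirectional Connect/Stream is one way to get audio flowing both ways:

using Microsoft.AspNetCore.Mvc;

public class VoiceController : Controller
{
    // Twilio posts here when the number is called (this is the webhook URL set in the dashboard)
    [HttpPost("/voice")]
    public ContentResult Voice()
    {
        // TwiML that tells Twilio to open a bidirectional media stream to our web socket endpoint
        string twiml =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
            "<Response>" +
            "  <Connect>" +
            "    <Stream url=\"wss://<your-ngrok-subdomain>.ngrok.io/media\" />" +
            "  </Connect>" +
            "</Response>";

        return Content(twiml, "application/xml");
    }
}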

Media Stream and Web Sockets

These facilitate the transmission of bidirectional data between the person and the bot. Or in plain English – they let you have a normal conversation with the chatbot over the phone!

MULaw, GStreamer, Azure Speech Services and Twilio

This stuff was a pain. Twilio and Azure don’t play that well together, and there isn’t anything in the Twilio developer documentation to show you what’s required either. It turns out the Twilio Media Stream encodes the audio in a format known as MULAW.

This isn’t supported natively by Azure due to (I’m told) a licensing agreement. You need to download a CODEC to handle it, which is part of a multimedia framework called GStreamer.

Sample Chatbot

We need a chatbot to interact with. The chatbot is simple for this example: it sends a greeting message, asks what we’d like to order, then repeats our selection.

You can see this modelled in the simple chatbot here:

protected override async Task OnMembersAddedAsync(IList<ChannelAccount> membersAdded, ITurnContext<IConversationUpdateActivity> turnContext, CancellationToken cancellationToken)
{
    var welcomeText = "Hello! I am the Twilio and Bot Framework Voice Bot. What would you like to order?";

    foreach (var member in membersAdded)
    {
        if (member.Id != turnContext.Activity.Recipient.Id)
        {
            await turnContext.SendActivityAsync(MessageFactory.Text(welcomeText, welcomeText), cancellationToken);
        }
    }
}

protected override async Task OnMessageActivityAsync(ITurnContext<IMessageActivity> turnContext, CancellationToken cancellationToken)
{
    if (turnContext.Activity.Type == ActivityTypes.Message && !string.IsNullOrEmpty(turnContext.Activity.Text))
    {
        // Check to see if the user sent a simple "quit" message.
        if (turnContext.Activity.Text.Contains("quit", StringComparison.InvariantCultureIgnoreCase))
        {
            // Send a reply.
            await turnContext.SendActivityAsync("Goodbye!", cancellationToken: cancellationToken);
            System.Environment.Exit(0);
        }
        else
        {
            await turnContext.SendActivityAsync($"Your order for {turnContext.Activity.Text} has been processed.", cancellationToken: cancellationToken);
            await turnContext.SendActivityAsync("What do you want to do now?", cancellationToken: cancellationToken);
        }
    }
}

 

This chatbot would be useless in the real world but is enough to test the concepts. We can test the chatbot out using the Bot Framework Emulator:

This chatbot is published to Azure and the Direct Line channel is activated in the Bot Channels Registration blade.

Direct Line

We need a way to programmatically interact with this chatbot and this is where the Direct Line API comes into play. Luckily, there is a NuGet package which makes this easy. It’s called Microsoft.Bot.Connector.DirectLine.

At a simple level you only need three methods to interact with your chatbot (a minimal sketch follows the list below):

  • Start a conversation.
  • Send a message to the chatbot.
  • Get messages from the chatbot.
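
Under the hood those three calls map onto the Microsoft.Bot.Connector.DirectLine client roughly as follows. This is a minimal sketch rather than the exact connector class used in this project; the secret placeholder and the wrapper shape are assumptions:

using System.Threading.Tasks;
using Microsoft.Bot.Connector.DirectLine;

public class DirectLineConnector
{
    private readonly DirectLineClient _client = new DirectLineClient("<direct line secret>");

    // 1. Start a conversation and keep hold of the conversation id
    public Task<Conversation> StartConversationAsync() =>
        _client.Conversations.StartConversationAsync();

    // 2. Send the transcribed text to the chatbot as a message activity
    public Task<ResourceResponse> SendMessageAsync(string conversationId, string text) =>
        _client.Conversations.PostActivityAsync(conversationId, new Activity
        {
            Type = ActivityTypes.Message,
            From = new ChannelAccount("DirectLineClientUser"),
            Text = text
        });

    // 3. Read the activities the bot has sent back (the watermark avoids re-reading old ones)
    public Task<ActivitySet> GetMessagesAsync(string conversationId, string watermark = null) =>
        _client.Conversations.GetActivitiesAsync(conversationId, watermark);
}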

The NuGet package supports all of these out of the box. For a deeper dive into connecting to chatbots using the Direct Line API, see my earlier blog post here. With the chatbot done, let’s look at the key events and processing you need to perform to stitch the chatbot together with Twilio.

Twilio and Telephony Events – Under the Hood

These are some of the key components that handle audio data from the telephony side of the architecture. They belong to a WebSocket Manager project by Radu Matei:

  • WebSocketConnectionManager – Manages all web socket connections.
  • WebSocketHandler – Adds/removes sockets to/from the WebSocketConnectionManager.
  • WebSocketManagerMiddleware – Injected into the .NET pipeline (Startup.cs) to accept data over the web socket (ws) protocol (see the registration sketch after this list).
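
As a rough illustration, wiring the middleware up in Startup.cs looks something like this. The AddWebSocketManager and MapWebSocketManager extension methods come from Radu Matei’s WebSocket Manager project; the "/media" path and the way the MediaHandler is resolved here are assumptions:

using System;
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        // registers the WebSocketConnectionManager and any WebSocketHandler implementations
        services.AddWebSocketManager();
    }

    public void Configure(IApplicationBuilder app, IServiceProvider serviceProvider)
    {
        app.UseWebSockets();

        // route Twilio's web socket traffic to the MediaHandler
        app.MapWebSocketManager("/media", serviceProvider.GetService<MediaHandler>());
    }
}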

These are used in conjunction with a MediaHandler class.

The MediaHandler orchestrates the processing of incoming and outgoing audio across socket connections. The following sequence diagram gives you an overview of the flow of events:

Three of the key methods in the MediaHandler are OnConnected, ReceiveAsync and SendMessageAsync.

OnConnected

Here we get a socket id from the WebSocketConnectionManager. Using the Direct Line API we also start a conversation with our chatbot, which returns a conversation id.

public override async Task OnConnected(WebSocket socket)
{
    try
    {
        await base.OnConnected(socket);
        string socketId = WebSocketConnectionManager.GetId(socket);

        _conversation = _directLineConnector.StartConversation();
        AddSocketTranscriptionEngine(socketId, _conversation.ConversationId);
    }
    catch (Exception ex)
    {
        Debug.WriteLine(ex.Message);
    }
}

 

We take the socket id and conversation id and assign these values to a class that performs the speech to text transcription.
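
AddSocketTranscriptionEngine and GetSocketTranscriptionEngine aren’t shown in the post; a plausible implementation simply keeps one transcription engine per socket id in a dictionary. The TranscriptionEngine constructor and the event hookup below are assumptions:

// using System.Collections.Concurrent;

// one transcription engine per active call, keyed on the web socket id
private readonly ConcurrentDictionary<string, TranscriptionEngine> _engines =
    new ConcurrentDictionary<string, TranscriptionEngine>();

private void AddSocketTranscriptionEngine(string socketId, string conversationId)
{
    var engine = new TranscriptionEngine(socketId, conversationId);

    // the MediaHandler listens for this event to know when speech has been transcribed (shown later)
    engine.VoiceParsed += TranscriptionEngine_VoiceParsed;

    _engines.TryAdd(socketId, engine);
}

private TranscriptionEngine GetSocketTranscriptionEngine(string socketId) => _engines[socketId];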

ReceiveAsync

Our webhook is invoked when the Twilio number is called. At the same time, a Twilio Media Stream is also activated. This in turn activates a web socket as part of the .NET middleware start-up.

Twilio then sends a series of events which the MediaHandler subscribes to. We can see ReceiveAsync here:

public override async Task ReceiveAsync(WebSocket socket, WebSocketReceiveResult result, byte[] buffer)
{
    string socketId = WebSocketConnectionManager.GetId(socket);

    using (JsonDocument jsonDocument = JsonDocument.Parse(Encoding.UTF8.GetString(buffer, 0, result.Count)))
    {
        string eventMessage = jsonDocument.RootElement.GetProperty("event").GetString();

        // the initial "connected" event doesn't carry a streamSid, so read it defensively
        string sid = jsonDocument.RootElement.TryGetProperty("streamSid", out JsonElement sidElement)
            ? sidElement.GetString()
            : null;

        if (string.IsNullOrEmpty(_streamSid))
        {
            _streamSid = sid;
        }

        switch (eventMessage)
        {
            case "connected":
                break;
            case "start":
                await StartSpeechTranscriptionEngine(socketId);
                break;
            case "media":
                string payload = jsonDocument.RootElement.GetProperty("media").GetProperty("payload").GetString();
                MediaChunkModel mediaChunk = JsonSerializer.Deserialize<MediaChunkModel>(jsonDocument.RootElement.GetProperty("media").ToString());

                if (mediaChunk.track == "inbound")
                {
                    await ProcessAudioForTranscriptionAsync(socketId, payload);
                }
                break;
            case "stop":
                await OnConnectionFinishedAsync(socket, socketId);
                break;
        }
    }
}

 

In the code above we grab the streamSid (this is a unique identifier for the Media Stream). We also handle the connected, start, media and stop events, which Twilio sends via the web socket.

Start Event

When the start event is raised, a transcription engine is created for the given socket id. The speech to text service then starts listening for audio on that socket.

private async Task StartSpeechTranscriptionEngine(string socketId)
{
    var transcriptionEngine = GetSocketTranscriptionEngine(socketId);
    await transcriptionEngine.Start();
}
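
The transcription engine itself isn’t listed in the post. Assuming it wraps the Azure Speech SDK’s push-audio input stream, its Start method might look roughly like this (the 8 kHz / 16-bit mono format matches the PCM you get from decoding Twilio’s 8 kHz MULAW audio; the key, region and field names are placeholders):

// using Microsoft.CognitiveServices.Speech;
// using Microsoft.CognitiveServices.Speech.Audio;

private PushAudioInputStream _pushStream;
private SpeechRecognizer _recognizer;

public async Task Start()
{
    var speechConfig = SpeechConfig.FromSubscription("<speech key>", "<region>");

    // push stream we can write raw PCM bytes into as they arrive from Twilio
    var format = AudioStreamFormat.GetWaveFormatPCM(8000, 16, 1);
    _pushStream = AudioInputStream.CreatePushStream(format);

    _recognizer = new SpeechRecognizer(speechConfig, AudioConfig.FromStreamInput(_pushStream));

    // raised when the service has finalised a phrase of recognised text
    _recognizer.Recognized += RecognizerRecognized;

    await _recognizer.StartContinuousRecognitionAsync();
}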

Media Event

When the media event is received, we extract a base64 string from the payload property. This is the raw MULAW audio of the person’s voice. We have a final check to ensure the audio being processed is on the inbound track; if it is, we process the audio for transcription.

string payload = jsonDocument.RootElement.GetProperty("media").GetProperty("payload").GetString();
MediaChunkModel mediaChunk = JsonSerializer.Deserialize<MediaChunkModel>(jsonDocument.RootElement.GetProperty("media").ToString());

if (mediaChunk.track == "inbound")
{
    await ProcessAudioForTranscriptionAsync(socketId, payload);
}
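
MediaChunkModel isn’t listed in the post. Based on the fields Twilio includes in a media message, a plausible shape (property names kept lowercase to match the JSON) is:

public class MediaChunkModel
{
    public string track { get; set; }       // "inbound" (caller) or "outbound" (bot) audio
    public string chunk { get; set; }       // sequence number of this chunk within the stream
    public string timestamp { get; set; }   // milliseconds since the stream started
    public string payload { get; set; }     // base64 encoded MULAW audio
}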

Transcribing the Audio to Text

Speech is transcribed using Azure Cognitive Services Speech services. We take the raw MULAW audio from the media event payload and decode it into PCM bytes that Azure Speech to Text can consume:

private async Task ProcessAudioForTranscriptionAsync(string socketId, string payload)
{
    byte[] payloadByteArray = Convert.FromBase64String(payload);
    byte[] decoded;
    MuLawDecoder.MuLawDecode(payloadByteArray, out decoded);

    var transcriptionEngine = GetSocketTranscriptionEngine(socketId);
    await transcriptionEngine.Transcribe(decoded);
}
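
Transcribe isn’t shown either; assuming the engine uses the push stream created in the Start sketch above, it probably just pushes the decoded PCM bytes into that stream for the recognizer to consume:

public Task Transcribe(byte[] pcmAudio)
{
    // hand the decoded audio to the Speech SDK; recognition results arrive via the Recognized event
    _pushStream.Write(pcmAudio);
    return Task.CompletedTask;
}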

The transcription engine raises several events during the transcription process. One key event is Recognized, which is raised to let you know that text has been identified in the audio.

private void RecognizerRecognized(object sender, SpeechRecognitionEventArgs e)
{
    currentText = e.Result.Text;
    this.VoiceParsed(this._socketId, "1", e.Result.Text);
}

In this handler another event, VoiceParsed, is raised, as you can see above. It contains the transcribed text and the socket being processed.

Identifying When Transcription Has Occurred, Getting the Bot’s Response and Sending Audio to the Human (VoiceParsed)

The VoiceParsed event is the glue between the speech to text transcriber, the chatbot and the MediaHandler. We can see an overview of this entire process in the diagram here:

The MediaHandler listens for the VoiceParsed event. It takes the identified text (the message parameter), then sends it to the chatbot via the Direct Line API using the SendMessage method (comment 1 in the code).

private async void TranscriptionEngine_VoiceParsed(string socketid, string message)
{
    // 1. Send the human's message to the bot via the Direct Line API
    ResourceResponse resourceResponse = _directLineConnector.SendMessage(message);

    // 2. Get the response from the chatbot
    var response = await _directLineConnector.GetMessagesAsync();

    // 3. Azure text to speech wrapper to get the Base64 representation of the bot response
    foreach (Bot.Connector.DirectLine.Activity resp in response)
    {
        // this condition will ignore messages from the human
        if (resp.From.Id != "DirectLineClientUser")
        {
            string base64AudioResponse = await GetBase64String(resp.Text);
            WebSocket webSocket = WebSocketConnectionManager.GetSocketById(socketid);

            // 4. Create an object that contains the audio for the media stream SID
            OutboundChunk mediaToSend = new OutboundChunk
            {
                @event = "media",
                streamSid = _streamSid,
                media = new Media { payload = base64AudioResponse }
            };

            string jsonToSend = JsonSerializer.Serialize(mediaToSend);

            // 5. Send the bot audio response back to the media stream/phone line
            await this.SendMessageAsync(webSocket, jsonToSend);
        }
    }
}

When the text has been sent to the chatbot, we need to get the chatbot’s response. We do this by using the Direct Line API again and calling the GetMessagesAsync method (comment 2 in the code).

This can contain one or more Bot Framework Activities, so we cycle through those and only take the messages the bot has sent (comment 3 in the code).

When we get a response from the chatbot, we convert it to audio using a wrapper class for Azure Text to Speech. This outputs the MULAW audio as a base64 string that can then be used to hydrate the objects Twilio needs (comments 3 and 4).

In comment 5 we send the bot’s voice audio back to the person on the phone.
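
The GetBase64String wrapper isn’t shown in the post. A minimal sketch using Azure Text to Speech, assuming the synthesis output format is set to the 8 kHz MULAW that Twilio Media Streams expects (the key, region and wrapper shape are placeholders):

// using Microsoft.CognitiveServices.Speech;
// using Microsoft.CognitiveServices.Speech.Audio;

private async Task<string> GetBase64String(string text)
{
    var config = SpeechConfig.FromSubscription("<speech key>", "<region>");

    // raw 8 kHz 8-bit MULAW is the format Twilio Media Streams plays back over the call
    config.SetSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Raw8Khz8BitMonoMULaw);

    // passing a null AudioConfig keeps the synthesised audio in memory instead of playing it
    using (var synthesizer = new SpeechSynthesizer(config, null as AudioConfig))
    {
        SpeechSynthesisResult result = await synthesizer.SpeakTextAsync(text);
        return Convert.ToBase64String(result.AudioData);
    }
}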

Summary

We’ve covered quite a lot in part one of this blog post, so now is a good time to stop. In Part 2 of the series, we’ll call the chatbot using Skype. We’ll then place the order, the chatbot will process it and then return a response.
