Audio Notes: Using Azure AI Speech to Perform Speech-to-Text Conversion

In an earlier blog post, I introduced Audio Notes.

This is a new SaaS experiment that uses artificial intelligence, speech-to-text, and text analytics to automatically summarise audio and create notes from your recordings.

In this blog post, I share how you can use Azure AI Speech Services to perform continuous real-time speech-to-text transcription.

You will also learn about:

  • speech-to-text transcription options
  • the JavaScript SDK
  • the main capabilities of the service
  • key objects and events

 

A video demo is available along with full source code.

~

Azure Speech to Text Capabilities

Real-time speech-to-text lets you transcribe speech from a microphone, a file, or a memory stream (each maps to an AudioConfig factory method; see the sketch after this list). It can be used in scenarios such as:

  • Creating captions or subtitles in meetings
  • Contact centre agent assistance
  • Dictation
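
As a quick sketch of how these inputs map to the JavaScript SDK (assuming the browser SDK bundle is loaded and exposes a global SpeechSDK object; someWavFile is a placeholder for a File obtained elsewhere):

// Default microphone input
var micConfig = SpeechSDK.AudioConfig.fromDefaultMicrophoneInput();

// WAV file input (someWavFile is a placeholder, e.g. from an <input type="file">)
var fileConfig = SpeechSDK.AudioConfig.fromWavFileInput(someWavFile);

// In-memory stream input: push audio bytes into the stream as they arrive
var pushStream = SpeechSDK.AudioInputStream.createPushStream();
var streamConfig = SpeechSDK.AudioConfig.fromStreamInput(pushStream);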

~

Single-Shot Recognition

Single-shot recognition recognizes a single utterance. The end of the utterance is determined by listening for silence, or after a maximum of 15 seconds of audio has been processed.
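
As a minimal sketch (assuming the SDK bundle is loaded, and with placeholder key/region values swapped for your own), single-shot recognition looks like this:

var speechConfig = SpeechSDK.SpeechConfig.fromSubscription("<your-key>", "<your-region>");
speechConfig.speechRecognitionLanguage = "en-US";

var audioConfig = SpeechSDK.AudioConfig.fromDefaultMicrophoneInput();
var recognizer = new SpeechSDK.SpeechRecognizer(speechConfig, audioConfig);

// Returns after a single utterance: trailing silence, or ~15 seconds of audio
recognizer.recognizeOnceAsync(
  function (result) {
    console.log(result.text);
    recognizer.close();
  },
  function (err) {
    console.error(err);
    recognizer.close();
  });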

Continuous Recognition

Continuous recognition lets you control when to stop recognizing audio. It requires you to subscribe to the Recognizing, Recognized, and Canceled events to get the recognition results. To stop recognition, you must call stopContinuousRecognitionAsync.

For the purposes of developing Audio Notes, continuous recognition will be used.
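
As a minimal sketch of the start/stop lifecycle (assuming a recognizer created as in the single-shot example above):

// Start transcribing; results arrive via the events covered later in this post
recognizer.startContinuousRecognitionAsync(
  function () { console.log("Continuous recognition started."); },
  function (err) { console.error(err); });

// ...later, when the user is finished:
recognizer.stopContinuousRecognitionAsync(
  function () { console.log("Continuous recognition stopped."); },
  function (err) { console.error(err); });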

~

Consuming Azure AI Speech Services

Like other Azure AI services, you can consume speech capabilities by using the client SDKs (including the JavaScript SDK) or by making REST API calls.

Audio Notes will run on the web and needs to capture audio from a webpage. This means the speech service will be consumed using the JavaScript SDK.
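
The examples in this post load the SDK from a script tag, which exposes a global SpeechSDK object. If you bundle your front end instead, the same SDK is available on npm as microsoft-cognitiveservices-speech-sdk and can be imported:

// Equivalent import when bundling, rather than loading the browser script tag
import * as SpeechSDK from "microsoft-cognitiveservices-speech-sdk";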

~

Creating the Service

You create the service directly in the Azure Portal. After a few moments, the service is created. Take a note of the location (region), as well as the API key; both are needed to call the service from code.

~

Events and Process Overview

We’re using continuous capture and JavaScript to consume Azure AI Speech. To do this, we need to initialise the recognizer and subscribe to several events.

The following events are used (a minimal wiring sketch follows the list):

  • recognizing: Signal for events that contain intermediate (partial) recognition results.
  • recognized: Signal for events that contain final recognition results, indicating a successful recognition attempt.
  • sessionStopped: Signal for events that indicate the end of a recognition session (operation).
  • canceled: Signal for events that contain canceled recognition results. These results indicate a recognition attempt that was canceled as a result of a direct cancellation request, or a transport or protocol failure.
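
As a minimal wiring sketch (assuming a speechRecognizer has been created as shown in the code later in this post), the handlers are assigned directly as properties:

speechRecognizer.recognizing = (s, e) => console.log(`Interim: ${e.result.text}`);

speechRecognizer.recognized = (s, e) => {
  if (e.result.reason == SpeechSDK.ResultReason.RecognizedSpeech) {
    console.log(`Final: ${e.result.text}`);
  }
};

speechRecognizer.canceled = (s, e) => console.log(`Canceled: ${e.reason}`);

speechRecognizer.sessionStopped = (s, e) => speechRecognizer.stopContinuousRecognitionAsync();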

~

Example Code

Example code is available on GitHub that shows you how to perform single-shot recognition using Azure AI Speech:

<!DOCTYPE html>
<html>
<head>
<title>Microsoft Cognitive Services Speech SDK JavaScript Quickstart</title>
<meta charset="utf-8" />
</head>
<body style="font-family:'Helvetica Neue',Helvetica,Arial,sans-serif; font-size:13px;">
<!-- <uidiv> -->
<div id="warning">
<h1 style="font-weight:500;">Speech Recognition Speech SDK not found (microsoft.cognitiveservices.speech.sdk.bundle.js missing).</h1>
</div>

<div id="content" style="display:none">
<table width="100%">
<tr>
<td></td>
<td><h1 style="font-weight:500;">Microsoft Cognitive Services Speech SDK JavaScript Quickstart</h1></td>
</tr>
<tr>
<td align="right"><a href="https://docs.microsoft.com/azure/cognitive-services/speech-service/get-started" target="_blank">Subscription</a>:</td>
<td><input id="subscriptionKey" type="text" size="40" value="subscription"></td>
</tr>
<tr>
<td align="right">Region</td>
<td><input id="serviceRegion" type="text" size="40" value="YourServiceRegion"></td>
</tr>
<tr>
<td></td>
<td><button id="startRecognizeOnceAsyncButton">Start recognition</button></td>
</tr>
<tr>
<td align="right" valign="top">Results</td>
<td><textarea id="phraseDiv" style="display: inline-block;width:500px;height:200px"></textarea></td>
</tr>
</table>
</div>
<!-- </uidiv> -->

<!-- <speechsdkref> -->
<!-- Speech SDK reference sdk. -->
<script src="https://aka.ms/csspeech/jsbrowserpackageraw"></script>
<!-- </speechsdkref> -->

<!-- <quickstartcode> -->
<!-- Speech SDK USAGE -->
<script>
// status fields and start button in UI
var phraseDiv;
var startRecognizeOnceAsyncButton;

// subscription key and region for speech services.
var subscriptionKey, serviceRegion;
var SpeechSDK;
var recognizer;

document.addEventListener("DOMContentLoaded", function () {
startRecognizeOnceAsyncButton = document.getElementById("startRecognizeOnceAsyncButton");
subscriptionKey = document.getElementById("subscriptionKey");
serviceRegion = document.getElementById("serviceRegion");
phraseDiv = document.getElementById("phraseDiv");

startRecognizeOnceAsyncButton.addEventListener("click", function () {
startRecognizeOnceAsyncButton.disabled = true;
phraseDiv.innerHTML = "";

if (subscriptionKey.value === "" || subscriptionKey.value === "subscription") {
alert("Please enter your Microsoft Cognitive Services Speech subscription key!");
// re-enable the button so the user can try again after entering a key
startRecognizeOnceAsyncButton.disabled = false;
return;
}
var speechConfig = SpeechSDK.SpeechConfig.fromSubscription(subscriptionKey.value, serviceRegion.value);

speechConfig.speechRecognitionLanguage = "en-US";
var audioConfig = SpeechSDK.AudioConfig.fromDefaultMicrophoneInput();
recognizer = new SpeechSDK.SpeechRecognizer(speechConfig, audioConfig);

recognizer.recognizeOnceAsync(
function (result) {
startRecognizeOnceAsyncButton.disabled = false;
phraseDiv.innerHTML += result.text;
window.console.log(result);

recognizer.close();
recognizer = undefined;
},
function (err) {
startRecognizeOnceAsyncButton.disabled = false;
phraseDiv.innerHTML += err;
window.console.log(err);

recognizer.close();
recognizer = undefined;
});
});

if (!!window.SpeechSDK) {
SpeechSDK = window.SpeechSDK;
startRecognizeOnceAsyncButton.disabled = false;

document.getElementById('content').style.display = 'block';
document.getElementById('warning').style.display = 'none';
}
});
</script>
<!-- </quickstartcode> -->
</body>
</html>

 

The code renders a simple HTML form that lets you press a button and capture a single phrase.

We need to adapt this, however, to perform continuous speech-to-text recognition. I’ve done something similar on another project, where I built a call centre in a box.

Read more about that here.

~

Adapting the Quick Start

Some things need to be changed in the original code. The main points are:

  • a button to start continuous capture
  • a button to stop continuous capture
  • a way to intercept the speech-to-text conversion

 

A way to emit the transcription, in real time, to the console or an HTML element on the screen is also required. These changes are detailed next.

~

Handling Starting and Stopping Continuous Speech to Text Transcription

Two buttons, startRecognizeOnceAsyncButton and stopRecognizeOnceAsyncButton, are added to the form to let you start and stop the speech-to-text transcription:

<div id="content" style="display:none">
<table width="100%">
<tr>
<td></td>
<td><h1 style="font-weight:500;">Microsoft Cognitive Services Speech SDK JavaScript Quickstart</h1></td>
</tr>
<tr>
<td align="right"><a href="https://docs.microsoft.com/azure/cognitive-services/speech-service/get-started" target="_blank">Subscription</a>:</td>
<td><input id="subscriptionKey" type="text" size="40" value="subscription"></td>
</tr>
<tr>
<td align="right">Region</td>
<td><input id="serviceRegion" type="text" size="40" value="UK South"></td>
</tr>
<tr>
<td></td>
<td><button id="startRecognizeOnceAsyncButton">Start recognition</button>|<button id="stopRecognizeOnceAsyncButton">Stop recognition</button></td>
</tr>
<tr>
<td align="right" valign="top">Results</td>
<td><textarea id="phraseDiv" style="display: inline-block;width:500px;height:200px"></textarea></td>
</tr>
</table>
</div>

 

Handling Speech to Text Events for Continuous Transcription

The following code contains all the required JavaScript to handle continuous transcription:

<script src="https://aka.ms/csspeech/jsbrowserpackageraw"></script>

<script>
// status fields and start/stop buttons in UI
var phraseDiv;
var startRecognizeOnceAsyncButton;
var stopRecognizeOnceAsyncButton;

// subscription key and region for speech services.
var subscriptionKey, serviceRegion;
var SpeechSDK;
var speechRecognizer;

document.addEventListener("DOMContentLoaded", function () {

startRecognizeOnceAsyncButton = document.getElementById("startRecognizeOnceAsyncButton");

subscriptionKey = document.getElementById("subscriptionKey");
serviceRegion = document.getElementById("serviceRegion");

phraseDiv = document.getElementById("phraseDiv");

stopRecognizeOnceAsyncButton = document.getElementById("stopRecognizeOnceAsyncButton");
stopRecognizeOnceAsyncButton.disabled = true;

startRecognizeOnceAsyncButton.addEventListener("click", function () {

startRecognizeOnceAsyncButton = document.getElementById("startRecognizeOnceAsyncButton");
stopRecognizeOnceAsyncButton = document.getElementById("stopRecognizeOnceAsyncButton");

startRecognizeOnceAsyncButton.disabled = true;
stopRecognizeOnceAsyncButton.disabled = false;

phraseDiv.innerHTML = "";

if (subscriptionKey.value === "" || subscriptionKey.value === "subscription") {
alert("Please enter your Microsoft Cognitive Services Speech subscription key!");
// reset the buttons so the user can try again after entering a key
startRecognizeOnceAsyncButton.disabled = false;
stopRecognizeOnceAsyncButton.disabled = true;
return;
}

var speechConfig = SpeechSDK.SpeechConfig.fromSubscription(subscriptionKey.value, serviceRegion.value);

speechConfig.speechRecognitionLanguage = "en-US";
var audioConfig = SpeechSDK.AudioConfig.fromDefaultMicrophoneInput();

speechRecognizer = new SpeechSDK.SpeechRecognizer(speechConfig, audioConfig);

speechRecognizer.startContinuousRecognitionAsync();

speechRecognizer.recognizing = (s, e) => {
// fires repeatedly with interim (partial) results while speech is processed
console.log(`RECOGNIZING: Text=${e.result.text}`);
phraseDiv.value += e.result.text + '\r\n';
};

speechRecognizer.recognized = (s, e) => {
if (e.result.reason == SpeechSDK.ResultReason.RecognizedSpeech) {
console.log(`RECOGNIZED: Text=${e.result.text}`);
}
else if (e.result.reason == SpeechSDK.ResultReason.NoMatch) {
console.log("NOMATCH: Speech could not be recognized.");
}
};

speechRecognizer.canceled = (s, e) => {
console.log(`CANCELED: Reason=${e.reason}`);

if (e.reason == SpeechSDK.CancellationReason.Error) {
console.log(`CANCELED: ErrorCode=${e.errorCode}`);
console.log(`CANCELED: ErrorDetails=${e.errorDetails}`);
console.log("CANCELED: Did you set the speech resource key and region values?");
}

speechRecognizer.stopContinuousRecognitionAsync();
};

speechRecognizer.sessionStopped = (s, e) => {
console.log("\n Session stopped event.");
speechRecognizer.stopContinuousRecognitionAsync();
};
});

stopRecognizeOnceAsyncButton.addEventListener("click", function () {
// Make the following call at some point to stop recognition:
speechRecognizer.stopContinuousRecognitionAsync();

startRecognizeOnceAsyncButton = document.getElementById("startRecognizeOnceAsyncButton");
stopRecognizeOnceAsyncButton = document.getElementById("stopRecognizeOnceAsyncButton");

startRecognizeOnceAsyncButton.disabled = false;
stopRecognizeOnceAsyncButton.disabled = true;
});

if (!!window.SpeechSDK) {
SpeechSDK = window.SpeechSDK;
startRecognizeOnceAsyncButton.disabled = false;
stopRecognizeOnceAsyncButton.disabled = true; // stop stays disabled until recognition starts

document.getElementById('content').style.display = 'block';
document.getElementById('warning').style.display = 'none';
}
});
</script>


Main things to note in the above code are:

1. The JS library reference for Azure AI Speech Services:

<script src="https://aka.ms/csspeech/jsbrowserpackageraw"></script>

 

2. Setting the click event handler to start recognition:

startRecognizeOnceAsyncButton.addEventListener("click", function () {

startRecognizeOnceAsyncButton = document.getElementById("startRecognizeOnceAsyncButton");
stopRecognizeOnceAsyncButton = document.getElementById("stopRecognizeOnceAsyncButton");

startRecognizeOnceAsyncButton.disabled = true;
stopRecognizeOnceAsyncButton.disabled = false;

phraseDiv.innerHTML = "";

if (subscriptionKey.value === "" || subscriptionKey.value === "subscription") {
alert("Please enter your Microsoft Cognitive Services Speech subscription key!");
// reset the buttons so the user can try again after entering a key
startRecognizeOnceAsyncButton.disabled = false;
stopRecognizeOnceAsyncButton.disabled = true;
return;
}

 

3. Instantiating the SpeechConfig object using the JavaScript SDK, setting the speech recognition language, and then creating the SpeechRecognizer object. The startContinuousRecognitionAsync method on the recognizer is then called:

var speechConfig = SpeechSDK.SpeechConfig.fromSubscription(subscriptionKey.value, serviceRegion.value);

speechConfig.speechRecognitionLanguage = "en-US";
var audioConfig = SpeechSDK.AudioConfig.fromDefaultMicrophoneInput();

speechRecognizer = new SpeechSDK.SpeechRecognizer(speechConfig, audioConfig);

speechRecognizer.startContinuousRecognitionAsync();
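
One caveat: fromSubscription embeds your key in the page, which is fine for a local demo but shouldn’t ship to production. The SDK also accepts short-lived authorization tokens via fromAuthorizationToken; the token endpoint below is hypothetical and would be implemented by your own back end:

// Sketch only: fetch a short-lived token from your own back end
// (/api/speech-token is a hypothetical endpoint; run this inside an async function)
const resp = await fetch("/api/speech-token");
const { token, region } = await resp.json();
const speechConfig = SpeechSDK.SpeechConfig.fromAuthorizationToken(token, region);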

 

4. We subscribe to the recognizing event and emit the real-time transcription to the console and the text box in the web form. This event fires repeatedly with interim (partial) results whilst transcription is taking place; the recognized event (shown in the full listing above) delivers the final result for each utterance:

speechRecognizer.recognizing = (s, e) => {
// fires repeatedly with interim (partial) results while speech is processed
console.log(`RECOGNIZING: Text=${e.result.text}`);

phraseDiv.value += e.result.text + '\r\n';
};

 

5. The canceled and sessionStopped events are handled to detect any possible errors or when the session ends:

speechRecognizer.canceled = (s, e) => {
console.log(`CANCELED: Reason=${e.reason}`);

if (e.reason == SpeechSDK.CancellationReason.Error) {
console.log(`CANCELED: ErrorCode=${e.errorCode}`);
console.log(`CANCELED: ErrorDetails=${e.errorDetails}`);
console.log("CANCELED: Did you set the speech resource key and region values?");
}

speechRecognizer.stopContinuousRecognitionAsync();
};

speechRecognizer.sessionStopped = (s, e) => {
console.log("\n Session stopped event.");
speechRecognizer.stopContinuousRecognitionAsync();
};

 

6. The following script stops continuous speech recognition:

stopRecognizeOnceAsyncButton.addEventListener("click", function () {
// Make the following call at some point to stop recognition:
speechRecognizer.stopContinuousRecognitionAsync();

startRecognizeOnceAsyncButton = document.getElementById("startRecognizeOnceAsyncButton");
stopRecognizeOnceAsyncButton = document.getElementById("stopRecognizeOnceAsyncButton");

startRecognizeOnceAsyncButton.disabled = false;
stopRecognizeOnceAsyncButton.disabled = true;
});

 

The script can now be tested.

~

Demo

View a demo of this in action on YouTube.

~

Summary

In this blog post, you’ve learned about Azure AI Speech. You’ve seen how to perform continuous speech recognition using the JavaScript SDK.

This will be used to build a prototype for Audio Notes.

In a future blog post, you’ll see how Azure AI Language can be used to perform document summarization.
