Speech Recognition with JavaScript

Have you ever wondered how quickly technology is advancing and how much it has improved our lives? The rise of Artificial Intelligence and Machine Learning suggests that robots and machines will take on more and more human tasks in the future. In a world like this, we cannot afford to fall behind on technology. So let's get started with one of the most popular AI technologies: speech recognition.

Voice is the present as well as the future, so almost anything you build around voice has a good chance of being a hit, because people are fascinated by machines that can understand them. Here are some statistics that show how widely voice recognition is already used.

According to Adobe Analytics, 71% of owners of smart speakers like Amazon Echo and Google Home use voice assistants at least daily, and 44% use them multiple times a day.

Over 76% of smart speaker owners increased their usage of voice assistants in the last year.

It has also been predicted that 50% of searches will be voice searches by 2020.

A little while ago, I got the opportunity to add speech recognition to one of our applications. I was excited to work on such a talked-about technology, and I started by searching Google for possible ways to implement it. I came across several npm modules, most of which looked good to work with. Among all the modules and technologies I found, the JavaScript Web Speech API stood out as simple and accurate for speech recognition.

It is a powerful browser interface that gives JavaScript access to the browser's audio stream and converts that audio to text. The Web Speech API is split into two independent interfaces: SpeechRecognition, which understands the human voice and turns it into text (Speech -> Text), and SpeechSynthesis, which reads strings out loud in a computer-generated voice (Text -> Speech).

The first thing to do is check the browser compatibility table carefully before using this API in your work, or you will run into errors in browsers that do not support it.
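Besides reading the compatibility table, the page can also check for support at runtime. The following is a minimal sketch of that idea, not an official recipe: `detectSpeechSupport`, `SpeechCapableScope` and `SpeechSupport` are hypothetical names made up for this example, and the helper takes the global scope as a parameter so it can be tried outside a browser too.

```typescript
// Minimal shape of the global scope we care about. Every key is optional
// because support varies by browser.
interface SpeechCapableScope {
  SpeechRecognition?: unknown;
  webkitSpeechRecognition?: unknown;
  speechSynthesis?: unknown;
}

interface SpeechSupport {
  recognition: boolean;
  synthesis: boolean;
}

// Hypothetical helper: reports which halves of the Web Speech API are
// present on the given scope (pass `window` in a browser).
function detectSpeechSupport(scope: SpeechCapableScope): SpeechSupport {
  return {
    // Recognition is exposed as a constructor, prefixed or unprefixed.
    recognition:
      typeof scope.SpeechRecognition === 'function' ||
      typeof scope.webkitSpeechRecognition === 'function',
    // Synthesis is exposed as a ready-made object.
    synthesis:
      typeof scope.speechSynthesis === 'object' &&
      scope.speechSynthesis !== null,
  };
}
```

In a browser you would call it as detectSpeechSupport(window) and hide the microphone button when recognition comes back false.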

Note: The examples below are taken from an Angular application, but I don't think implementing this in other JavaScript frameworks will be much different.

Voice to Text

Step 1:

Let's create an interface for webkitSpeechRecognition like this:

export interface IWindow extends Window {
    webkitSpeechRecognition: any;
}

Step 2:

Now create an instance of webkitSpeechRecognition and assign it to a variable like this:

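// Cast through unknown so the compiler accepts the cast even when it
// considers the two window types too different.
const { webkitSpeechRecognition }: IWindow = window as unknown as IWindow;
this.recognition = new webkitSpeechRecognition();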
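Some browsers may expose the constructor without the webkit prefix, so it can be worth checking both names before calling new. Here is a small sketch under that assumption; `createRecognition`, `RecognitionCtor` and `RecognitionScope` are hypothetical names for this example and are not part of the API.

```typescript
// Minimal constructor type. Real recognition instances expose far more
// members than `object` suggests.
type RecognitionCtor = new () => object;

interface RecognitionScope {
  SpeechRecognition?: RecognitionCtor;
  webkitSpeechRecognition?: RecognitionCtor;
}

// Hypothetical helper: prefer the unprefixed constructor, fall back to the
// webkit-prefixed one, and return null when neither exists.
function createRecognition(scope: RecognitionScope): object | null {
  const Ctor = scope.SpeechRecognition ?? scope.webkitSpeechRecognition;
  return Ctor ? new Ctor() : null;
}
```

Returning null instead of throwing lets the calling component decide how to tell the user that voice input is unavailable.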

Step 3:

A few properties control how the speech API behaves.

The continuous property controls whether results are returned continuously for each recognition or only a single result is returned. It defaults to a single result (false).

The interimResults property controls whether interim results are returned. If it is set to true, the text streams in live as the user speaks. The default value is false, which returns only the final text at the end.

We can also set the language, the grammars, and other options through its properties.

this.recognition.continuous = true;
this.recognition.interimResults = true;
this.recognition.lang = 'en-US';

All set. We're done with the initial setup without installing anything. It really is that simple, and now it's good to go!
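If the same settings are needed in more than one place, they can be gathered into a small helper. This is only a sketch: `configureRecognition`, `RecognitionConfig` and `SKETCH_DEFAULTS` are hypothetical names, and 'en-US' is this example's own choice of default language rather than something the API prescribes.

```typescript
interface RecognitionConfig {
  continuous: boolean;
  interimResults: boolean;
  lang: string;
  maxAlternatives: number;
}

// Defaults used by this sketch only.
const SKETCH_DEFAULTS: RecognitionConfig = {
  continuous: false,
  interimResults: false,
  lang: 'en-US',
  maxAlternatives: 1,
};

// Hypothetical helper: copies the defaults, then the caller's overrides,
// onto the recognition object and returns that same object.
function configureRecognition<T extends Partial<RecognitionConfig>>(
  recognition: T,
  overrides: Partial<RecognitionConfig> = {},
): T & RecognitionConfig {
  return Object.assign(recognition, SKETCH_DEFAULTS, overrides);
}
```

The setup from Step 3 would then read configureRecognition(this.recognition, { continuous: true, interimResults: true }).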

Event Handlers

This API has many event handlers, so you can track each event and respond however you like. The following are the most used events.

  • onstart: This event is triggered when speech recognition starts. This is a good place to notify the user that they can start speaking.
  • onresult: This is the most important event. It is where we read the text that was recognized from the user's speech.
    this.recognition.onresult = (event) => {
      let interimTranscript = ''; // text that may still change
      let finalTranscript = ''; // text the recognizer has settled on
      // Only look at the results that changed since the last event.
      for (let i = event.resultIndex; i < event.results.length; ++i) {
        if (event.results[i].isFinal) {
          finalTranscript += event.results[i][0].transcript;
        } else {
          interimTranscript += event.results[i][0].transcript;
        }
      }
      this.message += finalTranscript; // Processed message
    };
  • onerror: Handling this is as important as handling onresult, because the user might not realize that recognition has stopped for some reason. Notify the user about the error, and then restart or stop the recognition as appropriate.
  • onend: A simple event that is triggered when recognition stops.
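The logic inside onresult and onerror can be pulled into plain functions, which makes it easier to test without a microphone. The sketch below assumes that approach. All of the names (`collectTranscripts`, `describeRecognitionError` and the `...Like` interfaces) are hypothetical, and the wording of the messages is made up for this example. The error codes themselves are the ones the API reports on the error event.

```typescript
// Structural stand-ins for the browser's result objects.
interface AlternativeLike {
  transcript: string;
}
interface ResultLike {
  isFinal: boolean;
  readonly [index: number]: AlternativeLike;
}
interface ResultEventLike {
  resultIndex: number;
  results: { length: number; readonly [index: number]: ResultLike };
}

interface Transcripts {
  finalText: string;
  interimText: string;
}

// Walks only the results that changed in this event and sorts the top
// alternative of each one into final or interim text.
function collectTranscripts(event: ResultEventLike): Transcripts {
  let finalText = '';
  let interimText = '';
  for (let i = event.resultIndex; i < event.results.length; ++i) {
    const result = event.results[i];
    const best = result ? result[0] : undefined;
    if (!result || !best) {
      continue; // nothing usable in this slot
    }
    if (result.isFinal) {
      finalText += best.transcript;
    } else {
      interimText += best.transcript;
    }
  }
  return { finalText, interimText };
}

// A few of the error codes the recognition error event can carry, mapped
// to messages written for this sketch.
const ERROR_MESSAGES = new Map<string, string>([
  ['no-speech', 'No speech was detected. Please try again.'],
  ['audio-capture', 'No microphone was found.'],
  ['not-allowed', 'Microphone permission was denied.'],
  ['network', 'A network error interrupted recognition.'],
]);

function describeRecognitionError(code: string): string {
  return ERROR_MESSAGES.get(code) ?? `Speech recognition stopped (${code}).`;
}
```

Inside onresult you would append collectTranscripts(event).finalText to this.message, and inside onerror you would show describeRecognitionError(event.error) to the user.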

Methods:

The Web Speech API has three important methods for controlling speech recognition. Keep in mind that the respective events are triggered only after these methods are called, so let's look at when to call each one.

  • start(): Starts the speech recognition service listening to incoming audio, with the intent to recognize the grammars associated with the current SpeechRecognition instance.
  • stop(): Stops the speech recognition service from listening to incoming audio and attempts to return a result using the audio captured so far.
  • abort(): Stops the speech recognition service from listening to incoming audio and does not attempt to return a result.
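One thing to watch out for is that calling start() on a recognition object that is already running raises an error, so it helps to track whether we are currently listening. The following is a rough sketch of such a guard, assuming the caller wires markEnded() to the onend event. `RecognitionController` and `RecognizerLike` are hypothetical names for this example.

```typescript
// Only the three methods discussed above are needed here.
interface RecognizerLike {
  start(): void;
  stop(): void;
  abort(): void;
}

// Hypothetical wrapper that remembers whether recognition is running so
// that start(), stop() and abort() are only called when they make sense.
class RecognitionController {
  private recognizer: RecognizerLike;
  private listening = false;

  constructor(recognizer: RecognizerLike) {
    this.recognizer = recognizer;
  }

  get isListening(): boolean {
    return this.listening;
  }

  // Start listening, unless we already are.
  begin(): void {
    if (this.listening) {
      return;
    }
    this.recognizer.start();
    this.listening = true;
  }

  // Stop listening but still ask for a result from the audio so far.
  finish(): void {
    if (this.listening) {
      this.recognizer.stop();
    }
  }

  // Stop listening and discard whatever was captured.
  cancel(): void {
    if (this.listening) {
      this.recognizer.abort();
    }
  }

  // Call this from the onend handler, since recognition can also end by
  // itself (for example after a period of silence).
  markEnded(): void {
    this.listening = false;
  }
}
```

In the component this could be wired up with this.recognition.onend = () => controller.markEnded(), with a microphone button calling begin() or finish() depending on isListening.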

Text to Speech

Text to speech is very easy. It is available through the speechSynthesis object, which has methods for playing and pausing speech, while properties on the utterance let you change the pitch, the rate, and even the voice of the reader. For the simplest case, all we need is the speak() method. Here is the entire code needed to read out a string.

textToSpeech(message) {
   const speech = new SpeechSynthesisUtterance();
   // Set the text and voice attributes.
   speech.text = message;
   speech.volume = 1; // 0 to 1
   speech.rate = 1; // 0.1 to 10
   speech.pitch = 1; // 0 to 2
   window.speechSynthesis.speak(speech);
}

When this function is called, a robot voice will read out the given string, doing its best human impression.
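If the volume, rate, or pitch come from user input such as sliders, it is safer to keep them inside the ranges the utterance accepts (volume 0 to 1, rate 0.1 to 10, pitch 0 to 2). Below is a minimal sketch of that idea. `normalizeUtteranceSettings`, `UtteranceSettings` and `clamp` are hypothetical names for this example, and falling back to 1 for a missing value is this sketch's own choice.

```typescript
interface UtteranceSettings {
  volume: number;
  rate: number;
  pitch: number;
}

// Keeps a value between min and max (both inclusive).
function clamp(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, value));
}

// Hypothetical helper: fills in missing settings with 1 and clamps each
// setting to the range the utterance accepts.
function normalizeUtteranceSettings(
  requested: Partial<UtteranceSettings> = {},
): UtteranceSettings {
  return {
    volume: clamp(requested.volume ?? 1, 0, 1),
    rate: clamp(requested.rate ?? 1, 0.1, 10),
    pitch: clamp(requested.pitch ?? 1, 0, 2),
  };
}
```

The result can be copied onto the utterance before speaking, for example with Object.assign(speech, normalizeUtteranceSettings({ rate: 1.5 })).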

Conclusion:

Cool, isn't it? Conversational user interfaces and voice assistants are becoming more popular, and we can build them with this simple Web Speech API. I hope this blog has shown you some quick and simple ways to implement speech recognition. Adding voice-based interaction to your application can greatly improve the user experience. So what are you waiting for? Go ahead and do some wonders with voice.

References:

  1. https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecognition
  2. https://www.google.com/intl/en/chrome/demos/speech.html
  3. https://shapeshed.com/html5-speech-recognition-api/