Tutorial #37
Web Speech Recognition Advanced   2014-01-04

Introduction

Speech Recognition is starting to appear in more and more applications. Siri from Apple has had a lot of attention, but speech recognition is also being used in a wide range of phone applications for dictation and translation. Speech Recognition in the Browser has been lagging, but Google is putting in a lot of work to make it happen.

You can find it in the Google Search page if you use the Chrome browser:

Image 1 for this tutorial

Speech Recognition is part of the Web Speech API Specification, which also covers Speech Synthesis. Implementation of this API is currently limited, with Google being the only vendor to have made it available to developers and users.

As of January 2014, Speech Recognition is only supported in the Google Chrome Browser.

While this is a serious limitation, the technology has so much potential that it is worth looking at right now.

Google has made a nice Demonstration Page available, which you should look at.

Image 2 for this tutorial

The code for that page is not particularly clear for a beginner and contains no comments. So in this tutorial I have written a minimal version that should help you get started with the technology. The demo looks like this:

Demo 1 screenshot for this tutorial


Step by Step

1: When the demo starts you will see an empty Text Box and a 'Click to Start' button.

Image 3 for this tutorial

2: Clicking the button will bring up a browser notification bar that asks you to Allow or Deny the request to access your microphone.

Image 4 for this tutorial

3: When the button text changes, start talking clearly and slowly. As you do this you should see text appear in the text box, which hopefully matches the words that you spoke. When you stop speaking, or after a suitable pause, the code will decide that you are finished. The text will not change any further and the microphone is disabled until you start again.

Image 5 for this tutorial


Understanding the Code

Web Speech Recognition, as implemented here by Google, is a Hybrid solution in that it involves Client-side code, built into the browser, which captures the input audio, and Server-side code at Google which performs the actual speech recognition and returns the results.

The Web Speech API Specification is, at least in principle, a Browser API, but here we see it communicating with a remote server running proprietary software owned by Google. This raises a bunch of concerns and practical issues. Take a look at the Caveats section below where I discuss these. But for now, let's just see how the code works.

The HTML code has an empty div with the id transcript, into which we will output the recognized text, a div for user instructions and a control button.
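A minimal version of that markup might look something like this (the id transcript and the button label come from the text above; the other ids are my assumptions and may differ from the downloadable code):

    <div id="transcript"></div>
    <div id="instructions">Click the button, allow microphone access, then speak.</div>
    <button id="control">Click to Start</button>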

In the script, we test for the presence of the function webkitSpeechRecognition and alert the user if it is not available in their browser. If it is, we create a new webkitSpeechRecognition object and set three attributes.
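In outline, that test looks like this (the alert text is mine):

    // webkitSpeechRecognition is only exposed by Chrome as of early 2014
    if (!('webkitSpeechRecognition' in window)) {
        alert('Speech Recognition is not available in this browser - try Google Chrome');
    } else {
        var recognition = new webkitSpeechRecognition();
        // attributes and callbacks are set on this object, as described below
    }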

recognition.continuous = true instructs it to keep processing speech input until no more words are being spoken. recognition.interimResults = true instructs it to output the text of the current best interpretation of the input audio, even before the audio has finished.

recognition.lang = 'en-GB' specifies the language and dialect in which the text is spoken. Here I am using British English. There are a number of options here - look at the source for the Google Demonstration Page for the complete list.

The correct language and dialect are important. I speak English with a British accent/dialect and the code performs better if I specify this rather than American English (en-US).
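Putting those three attributes together:

    recognition.continuous = true;     // keep recognizing across pauses in speech
    recognition.interimResults = true; // report the current best guess before the audio ends
    recognition.lang = 'en-GB';        // British English - substitute your own language/dialect code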

Next, the code defines four callback functions which are called at different stages of the recognition process; a minimal skeleton follows the list:

onstart - when the start function is called

onerror - if recognition fails

onend - when the audio input stream ends

onresult - when a recognition result (interim or final) is received from the server
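Here is that skeleton (the comments describe when each callback fires; the bodies are placeholders):

    recognition.onstart = function () {
        // fires after recognition.start() - the microphone is now live
    };

    recognition.onerror = function (event) {
        // fires if recognition fails; event.error describes the problem
    };

    recognition.onend = function () {
        // fires when the audio input stream ends
    };

    recognition.onresult = function (event) {
        // fires when interim or final results arrive from the server
    };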

Note that these do not do anything with the actual audio stream or Web Audio nodes. All of the actual recognition process takes place behind the scenes. These callbacks allow us to update the web page and, in particular, to process and display the text we get back from the server.

The code shown here in recognition.onresult takes the event passed into it and builds a string using the text elements (typically the individual words) that have been recognized so far. Once the audio input has ended, the recognition software will finish its processing and mark its results as final by setting isFinal to true. You can choose to display the interimTranscript or the finalTranscript, or both.

Here I log the current transcripts to the browser's JavaScript console every time onresult is called. This helps you to understand the back and forth involved in recognition and shows how the early 'best guess' result may change as more audio becomes available to the speech recognition engine.
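A minimal sketch of such a handler, with the console logging included (variable names are mine and may differ from the tutorial's downloadable code):

    var finalTranscript = '';

    recognition.onresult = function (event) {
        var interimTranscript = '';
        // event.results holds every result so far in this session;
        // event.resultIndex is the first result that changed in this event
        for (var i = event.resultIndex; i < event.results.length; i++) {
            var transcript = event.results[i][0].transcript;
            if (event.results[i].isFinal) {
                finalTranscript += transcript;   // this text will not change again
            } else {
                interimTranscript += transcript; // this text is still a 'best guess'
            }
        }
        console.log('interim: ' + interimTranscript);
        console.log('final:   ' + finalTranscript);
        document.getElementById('transcript').textContent = finalTranscript + interimTranscript;
    };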

Image 6 for this tutorial

The process is controlled by the button, which calls recognition.start() when first clicked and updates the instructions. Clicking it again, once recognition has started or finished, calls recognition.stop(), which stops sending audio to the Google server and disables the user's microphone.
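A simple toggle along those lines (the button id and labels are my assumptions; a more robust version would track state in the onstart and onend callbacks):

    var button = document.getElementById('control');
    var recognizing = false;

    button.onclick = function () {
        if (recognizing) {
            recognition.stop();  // stop sending audio; the microphone is released
            button.textContent = 'Click to Start';
        } else {
            recognition.start(); // prompts for microphone permission on first use
            button.textContent = 'Recognizing... Click to Stop';
        }
        recognizing = !recognizing;
    };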

In comparison to the Web Audio API tutorials, such as Visualizing Audio #1 Time Domain, this code is relatively simple. You don't need to worry about audio buffers or any of those details. It all happens behind the scenes. On one hand this makes it easy to work with, but on the other it makes it hard to troubleshoot when something goes wrong with complex text - it is a Black Box.




Caveats / Issues

Speech Recognition is clearly a work in progress. Over time it will appear in other browsers in this, or a similar, form - but not yet.

Perhaps the biggest issue right now is that the recognition backend is owned and operated by Google. Speech recognition requires sophisticated algorithms, large reference datasets and a lot of computation. Right now this can only be supported by an organization like Google - the computational expense alone is too large to build into the browser itself.

I have no problem with Google providing this - and I would be willing to pay for the service, in a model similar to that used with the Google Translate API.

But this is not set up like a Web Service API ... the Web Speech API is a Client-side API. This combination of a hidden back end behind a client API is, I think, unique.

At the very least, I should have a way to select which recognition server I want to use (right now there is only Google). Google has a history of creating great web services, but also of killing off services that no longer fit their business goals.


There are also some big technical issues that need to be dealt with. During a recognition 'session', the audio stream from my browser is being sent over the network to Google, where a process is buffering it and applying recognition algorithms.

Google has to apply limits to the length of audio that it will accept - the process does not scale otherwise. The current limit appears to be around 60 seconds. Recognition will just stop if you try to feed it a continuous stream longer than this.

One consequence of this is that you need to initiate a speech recognition session with, say, a mouse click. That prevents truly hands free applications that might listen continuously to an audio stream and then take actions based on spoken commands within that stream. That just won't work with the Speech API as it stands.

Accuracy is always going to be an issue, and one that will vary according to the input language. You will see examples of this when you try the demo. Quantifying the accuracy is an important issue. If you are using it to create transcriptions of, say, court proceedings, you need a statistical measure of confidence for each word so you can flag ones that might need manual confirmation. None of this is in the current specification.

The Black Box nature of the code is a big issue for some developers - for both practical and philosophical reasons. I don't have a fundamental issue with proprietary code, but I do need to see a very clear description of its API and at this stage in its development we don't have that yet.



Of course, the fact that it only runs on Google Chrome right now means that developers and designers can only offer it to a fraction of their potential users. For many projects this is a complete deal breaker. This should improve over time, but the limitation will keep this as an experimental technology until that happens.


The potential for speech recognition is so large that I am sure these issues will get worked out over the next year or so. API specifications can and do change - and I think that will be necessary in this case.

But let me stress, in spite of its shortcomings, for the "right" application, the API works today and works well. I use it in an application for language learning where I can test the user's pronunciation of words in one of many languages, using speech recognition in that language. For individual words and short phrases it works pretty well most of the time.

Given the potential applications, I recommend that you experiment with the API that we have today and follow its evolution closely.


Code for this Tutorial

