How to use Google’s Speech-to-Text in a web application. A working example.

Tomasz Kiełbowicz
SoftwareMill Tech Blog
6 min read · Jan 4, 2021


What is Speech-to-Text

Google’s Speech-to-Text (STT) API is an easy way to integrate voice recognition into your application. The idea of the service is straightforward: it receives an audio stream and responds with the recognized text. At the time of writing, the first 60 minutes of speech recognition each month are free of charge, so you can give it a try at no cost. Above the limit, each 15 seconds of audio costs about $0.006, with usage rounded up to the nearest 15 seconds.

Prerequisites

To follow this tutorial you have to enable Speech-to-Text:

  1. Open GCP console https://console.cloud.google.com/
  2. Select or create a new project
  3. Search for “Cloud Speech-to-Text API” and enable it
  4. Search for “Service accounts” and create a new service account
  5. Add a key to the service account, choose JSON format, download and safely save the key file

Overview

It is possible to send the audio stream directly from the browser, but as far as I know, there is no way to authorize the client (browser) to use our account without exposing the service credentials. Therefore we are going to send the audio stream from the browser to the backend over a WebSocket, forward it to STT, and send the responses back.

On the client side we’re using TypeScript without additional dependencies; on the backend it will be http4s configured with tapir. For STT calls we’ll use the client library provided by Google.

Speech-to-Text API

The API is the central point of our solution, so first we have to understand how we can use the service and what requirements and restrictions it imposes on the rest of the solution.

The documentation describes three typical usage scenarios: short file transcription, long file transcription, and transcription of streaming audio input. We are interested in the third scenario, as we want to recognize the user’s speech on the fly.

To achieve the best recognition results, the documentation recommends the following properties of the audio stream:

  • sampling at 16 kHz
  • 16-bit signed sample format
  • lossless compression
  • single channel
  • 100 ms length of the audio chunk in each request in the stream

Also, any pre-processing such as gain control, noise reduction, or resampling is discouraged.

Web browser

The common choice for audio (and video) capture in a browser is the MediaStream Recording API. Unfortunately, it supports only compressed formats, and worse, the supported formats depend on the browser and platform. The better choice is the Web Audio API, which can be used for custom audio stream processing. Both technologies are built on the Media Capture and Streams API, which provides access to the client’s audio devices.

First, we have to obtain a handle to the user’s microphone audio stream using the Media Capture and Streams API:

const sampleRate = 16000

// getUserMedia returns a Promise, so we await the resulting MediaStream
const stream: MediaStream = await navigator.mediaDevices.getUserMedia({
  audio: {
    deviceId: "default",
    sampleRate: sampleRate,
    sampleSize: 16,
    channelCount: 1
  },
  video: false
})

Here we use the “default” device, though it’s possible to enumerate the available devices and select a specific one (see the sketch below). We also set the required parameters of the stream.
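
For completeness, here is how such an enumeration could look; the selectMicrophone helper is not part of the article’s code, just an illustrative sketch:

const selectMicrophone = async (): Promise<string | undefined> => {
  const devices = await navigator.mediaDevices.enumerateDevices()
  const microphones = devices.filter(device => device.kind === "audioinput")
  // the returned deviceId could replace "default" in the constraints above
  return microphones[0]?.deviceId
}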

Next, we are going to process the stream with the Web Audio API. This API allows us to build a network of audio processing nodes. The API provides a set of nodes for common processing tasks. We are interested in two of them:

  • MediaStreamAudioSourceNode — connects a media stream to the network
  • AudioWorkletNode — allows custom processing of the audio stream

All nodes exist in AudioContext which we have to create first:

const audioContext = new window.AudioContext({sampleRate: sampleRate})

Then we can create MediaStreamAudioSourceNode from the stream obtained earlier:

const source: MediaStreamAudioSourceNode = audioContext.createMediaStreamSource(stream)

The creation of the worklet node is a bit more complicated. The worklet node has to perform its job in a separate thread; to achieve that, the Web Audio API utilizes the Worker API. Fortunately, the API handles most of the process for us. We have to do two things:

  • create the processing script and register it under a name
  • create the worklet node in the main context using the registered name

Audio stream processing

Our processing node is responsible for 2 tasks:

  • transcoding of the stream
  • combining frames into 100 ms audio chunks

Nodes of the Web Audio API process the audio stream in frames of 128 samples. The specification calls such a frame a render quantum. Each sample is represented as a 32-bit floating point number, so the transcoding is simply a remapping of a 32-bit float sample to a 16-bit signed integer sample.

The full source of the processing script:

const quantumSize = 128

class TestProcessor extends AudioWorkletProcessor {
  constructor(options) {
    super()
    // 12 render quanta of 128 samples at 16 kHz ≈ 96 ms per chunk
    this.quantaPerFrame = 12
    this.quantaCount = 0
    this.frame = new Int16Array(quantumSize * this.quantaPerFrame)
  }

  process(inputs, outputs, parameters) {
    const offset = quantumSize * this.quantaCount
    // remap each 32-bit float sample (-1..1) to a 16-bit signed integer
    inputs[0][0].forEach((sample, idx) => this.frame[offset + idx] = Math.floor(sample * 0x7fff))
    this.quantaCount = this.quantaCount + 1
    if (this.quantaCount === this.quantaPerFrame) {
      // a full chunk is ready, send it to the main thread
      this.port.postMessage(this.frame)
      this.quantaCount = 0
    }
    return true
  }
}

registerProcessor('pcm-worker', TestProcessor)

The number of render quanta in each stream chunk is 12, so the length of the chunk will be (1/16 kHz) × 128 × 12 = 96 ms, close to the recommended 100 ms.

The 32-bit float sample is in the range (-1; 1), and we need a 16-bit signed integer in the range (-32,768; 32,767). To transcode, we multiply the input sample by 0x7fff (32,767) and round the result down: Math.floor(sample * 0x7fff).
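
A quick sanity check of this mapping (a standalone snippet, not part of the worklet; the floatTo16BitPcm name is ours):

const floatTo16BitPcm = (sample: number): number => Math.floor(sample * 0x7fff)

console.log(floatTo16BitPcm(1.0))  // 32767
console.log(floatTo16BitPcm(0.5))  // 16383
console.log(floatTo16BitPcm(-1.0)) // -32767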

After a full chunk is completed, it is sent to the main context through the worklet’s port: this.port.postMessage(this.frame). We will soon see how it is received at the other end.

Before we create the worklet node, we have to register the worklet script in our audio context (addModule returns a Promise, so we wait for it to resolve):

await audioContext.audioWorklet.addModule('/pcmWorker.js')

Now we can create the worklet node in the main thread and connect it with the stream audio source node:

const pcmWorker = new AudioWorkletNode(audioContext, 'pcm-worker', {
  outputChannelCount: [1]
})
source.connect(pcmWorker)

WebSocket

To route the audio stream from the worklet node to the backend we have to make a WebSocket connection:

const conn = new WebSocket("ws://localhost:8080/ws/stt")

and then we can redirect the audio stream from the PCM worker to the connection (we use AudioWorkletNode’s port to receive data from the processing script):

pcmWorker.port.onmessage = event => conn.send(event.data)
pcmWorker.port.start()
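
Putting the client-side pieces together: below is a minimal sketch of the whole setup under the names used so far ('/pcmWorker.js', 'pcm-worker', the localhost WebSocket URL); the startRecognition function name is ours and error handling is omitted. Call it from a user gesture (e.g. a button click) so the browser allows the AudioContext to start.

async function startRecognition(): Promise<void> {
  const sampleRate = 16000

  const stream = await navigator.mediaDevices.getUserMedia({
    audio: { deviceId: "default", sampleRate: sampleRate, sampleSize: 16, channelCount: 1 },
    video: false
  })

  const audioContext = new window.AudioContext({ sampleRate: sampleRate })
  // the worklet script has to be registered before the node is created
  await audioContext.audioWorklet.addModule('/pcmWorker.js')

  const source = audioContext.createMediaStreamSource(stream)
  const pcmWorker = new AudioWorkletNode(audioContext, 'pcm-worker', { outputChannelCount: [1] })
  source.connect(pcmWorker)

  const conn = new WebSocket("ws://localhost:8080/ws/stt")
  conn.onopen = () => {
    // forward the ~96 ms PCM chunks only once the socket is open
    pcmWorker.port.onmessage = event => conn.send(event.data)
    pcmWorker.port.start()
  }
}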

Backend

We will start the backend implementation with the WebSocket endpoint. The definition of the endpoint in tapir:

val wsEndpoint =
  endpoint.get
    .in("stt")
    .out(webSocketBody[WebSocketFrame, CodecFormat.TextPlain, WebSocketFrame, CodecFormat.TextPlain](Fs2Streams[Task]))

To create the http4s route we have to provide handleWebSocket, an fs2 Pipe transforming the input stream of WebSocketFrames into the output stream of WebSocketFrames:

wsEndpoint.toRoutes(_ => Task.pure(Right(handleWebSocket)))

Before we start sending the audio stream to STT we have to create the SpeechClient and establish the gRPC connection:

val speechClient = SpeechClient.create
val sttStream =
  speechClient
    .streamingRecognizeCallable.splitCall(new RecognitionObserver(queue))

Our RecognitionObserver will receive the responses from STT and push them to an fs2 Queue after converting them to a simple JSON message:

class RecognitionObserver(queue: Queue[Task, String]) extends ResponseObserver[StreamingRecognizeResponse] {

  override def onResponse(response: StreamingRecognizeResponse) = {
    val result = response.getResultsList.get(0)
    val isFinal = result.getIsFinal
    val transcript = result.getAlternativesList.get(0).getTranscript
    val msg = s"""{"final": $isFinal, "text" : "$transcript"}"""
    queue.enqueue1(msg).runSyncUnsafe()
  }

  override def onError(t: Throwable): Unit = {}
  override def onComplete(): Unit = {}
  override def onStart(controller: StreamController): Unit = {}
}

The first message sent to STT after connecting has to be the configuration. We have to provide the parameters of the audio stream (encoding and sample rate), and we can configure some parameters of the recognition process, like the recognition model, the language, or whether we want to receive interim results:

object SpeechRecognitionConfig {
  private val recognitionConfig = RecognitionConfig.newBuilder
    .setEncoding(RecognitionConfig.AudioEncoding.LINEAR16)
    .setLanguageCode("en-US")
    .setSampleRateHertz(16000)
    .setModel("command_and_search")
    .build

  def apply(): StreamingRecognitionConfig = StreamingRecognitionConfig.newBuilder
    .setConfig(recognitionConfig)
    .setInterimResults(true)
    .build

  val configRequest: StreamingRecognizeRequest = StreamingRecognizeRequest.newBuilder
    .setStreamingConfig(SpeechRecognitionConfig())
    .build
}

Sending the configuration:

sttStream.send(SpeechRecognitionConfig.configRequest)

Then we can start sending audio stream chunks to STT, wrapping them in StreamingRecognizeRequest messages:

private def sendAudio(sttStream: ClientStream[StreamingRecognizeRequest], data: Array[Byte]) =
  Task(StreamingRecognizeRequest.newBuilder.setAudioContent(ByteString.copyFrom(data)).build)
    .flatMap(req => Task(sttStream.send(req)))

And finally, the handleWebSocket Pipe that connects the WebSocket with the STT stream:

def handleWebSocket: Pipe[Task, WebSocketFrame, WebSocketFrame] = audioStream =>
  for {
    queue <- eval(unbounded[Task, String])
    sttStream <- bracket(connectStt(queue))(stt => Task(stt.closeSend()))
    audioChunk <- audioStream.collect {
      case binary: WebSocketFrame.Binary => binary.payload
    }
    sttResultStream = queue.dequeue
    transcript <- eval(sendAudio(sttStream, audioChunk)).drain.mergeHaltBoth(sttResultStream)
  } yield WebSocketFrame.text(transcript)
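
Back on the client, the text frames produced by this Pipe arrive on the WebSocket opened earlier (conn). Here is a minimal sketch of consuming them; the SttResult interface name is ours, but the fields match the JSON built in RecognitionObserver:

interface SttResult {
  final: boolean
  text: string
}

conn.onmessage = event => {
  const result: SttResult = JSON.parse(event.data)
  console.log(`${result.final ? "final" : "interim"} transcript: ${result.text}`)
}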

Working example

The working example can be found here: https://github.com/gobio/bootzooka-speech-to-text

It’s based on SoftwareMill’s Bootzooka; see its documentation for how to start the application. Remember to set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to the downloaded service account JSON key. The example contains only the essential elements required for it to work; in particular, it lacks proper error handling.

All STT-related changes were introduced in this commit.
