Google Cloud Speech

The Cloud Speech API is a speech recognition service that transcribes speech in over 80 languages. It also handles regional accents and noisy conditions.


Digital platforms: Cross-platform software

Versions: Cloud/On-Premise 

Use cases

Examples of the use of speech-to-text


Video

Use this model for transcribing audio from video clips or other sources (such as podcasts) that have multiple speakers. This model is also often the best choice for audio that was recorded with a high-quality microphone or that has lots of background noise.

Phone call

Use this model for transcribing audio from a phone call.

ASR: Command and search

Use this model for transcribing shorter audio clips. Some examples include voice commands or voice search.

ASR: Default

Use this model if your audio does not fit any of the other models described above. For example, you can use it for long-form audio recordings that feature a single speaker. The default model produces transcription results for any type of audio, including audio (such as video clips) for which a specifically tailored model exists.
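As a rough illustration of how the models above might be chosen in practice, the sketch below maps a use case to a model name and builds a recognition config as a plain dictionary. The model identifiers (video, phone_call, command_and_search, default) and the camelCase field names are assumed to match the v1 REST API; the USE_CASE_TO_MODEL table and the recognition_config helper are hypothetical names introduced here and are not part of any Google library.

```python
# Hypothetical mapping from the use cases described above to model names.
USE_CASE_TO_MODEL = {
    "video": "video",
    "phone_call": "phone_call",
    "voice_command": "command_and_search",
    "voice_search": "command_and_search",
}


def recognition_config(use_case, language_code="en-US"):
    """Build a recognition config dict for the given use case.

    Use cases that are not in the table fall back to the "default" model.
    """
    return {
        "languageCode": language_code,
        "model": USE_CASE_TO_MODEL.get(use_case, "default"),
    }


if __name__ == "__main__":
    print(recognition_config("phone_call"))
    print(recognition_config("lecture"))  # not in the table, so "default"
```

Falling back to the default model mirrors the description above: it accepts any type of audio, so it is a safe choice when no tailored model applies.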


Google Cloud Speech, also known as the Cloud Speech API or the Speech-to-Text API, supports the following actions:

  • Asynchronously transcribe a local audio file
  • Asynchronously transcribe an audio file in Cloud Storage
  • Asynchronously transcribe an audio file with time offsets
  • Create asynchronous speech file
  • Recognise streaming speech
  • Recognise streaming speech with punctuation
  • Simultaneously recognise words
  • Transcribe a local multi-channel file, etc.

Synchronous recognition (REST and gRPC) sends audio data to the Speech-to-Text API, performs recognition on that data, and returns the results after all audio has been processed.
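To make the synchronous flow more concrete, here is a minimal sketch that assembles (but does not send) a REST request with the audio embedded as base64. The endpoint URL, the LINEAR16 encoding, the 16 kHz sample rate, and bearer-token authentication are assumptions for illustration; build_recognize_request is a hypothetical helper, not an official client function.

```python
import base64
import json
import urllib.request

# Assumed v1 REST endpoint for synchronous recognition.
RECOGNIZE_URL = "https://speech.googleapis.com/v1/speech:recognize"


def build_recognize_request(audio_bytes, access_token, language_code="en-US"):
    """Return an unsent urllib Request carrying the audio inline as base64."""
    body = {
        "config": {
            "encoding": "LINEAR16",    # assumed: 16-bit PCM input
            "sampleRateHertz": 16000,  # assumed sample rate
            "languageCode": language_code,
        },
        "audio": {"content": base64.b64encode(audio_bytes).decode("ascii")},
    }
    return urllib.request.Request(
        RECOGNIZE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder token and a few milliseconds of silence as sample audio.
    request = build_recognize_request(b"\x00\x00" * 160, "YOUR_ACCESS_TOKEN")
    print(request.get_method(), request.full_url)
    # Sending it would look like: urllib.request.urlopen(request)
```

Because the whole clip travels in a single request and the response arrives only after processing finishes, this style suits short recordings.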

Asynchronous recognition (REST and gRPC) sends audio data to the Speech-to-Text API and initiates a long-running operation. You can poll this operation periodically to retrieve the recognition results.
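The polling step can be sketched roughly as follows. The operation is modelled as a plain dictionary with done, error, and response fields, which is an assumption about the shape of the long-running operation resource; wait_for_operation and the fetch_operation callable are hypothetical names, and the actual HTTP or gRPC call is left to the caller.

```python
import time


def wait_for_operation(fetch_operation, name, interval_seconds=5.0, max_polls=60):
    """Poll a long-running operation until it is done and return its response.

    fetch_operation is a caller-supplied callable (hypothetical) that takes the
    operation name and returns the operation resource as a dict.
    """
    for attempt in range(max_polls):
        operation = fetch_operation(name)
        if operation.get("done"):
            if "error" in operation:
                raise RuntimeError(f"Recognition failed: {operation['error']}")
            return operation.get("response", {})
        # Sleep between polls, but not after the final attempt.
        if attempt < max_polls - 1:
            time.sleep(interval_seconds)
    raise TimeoutError(f"Operation {name} not finished after {max_polls} polls")


if __name__ == "__main__":
    # Stand-in for a real API call: reports "done" on the third poll.
    replies = iter([
        {"name": "op-1"},
        {"name": "op-1"},
        {"name": "op-1", "done": True,
         "response": {"results": [{"alternatives": [{"transcript": "hello"}]}]}},
    ])
    result = wait_for_operation(lambda _name: next(replies), "op-1", interval_seconds=0)
    print(result["results"][0]["alternatives"][0]["transcript"])
```

Passing the fetch function in as a parameter keeps the loop independent of the transport, so the same logic applies whether the operation is queried over REST or gRPC.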

Streaming recognition (gRPC only) performs recognition on audio data provided within a bidirectional gRPC stream. Streaming requests are designed for real-time recognition, such as capturing live audio from a microphone.
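The request side of a streaming call can be sketched as a generator: the first message carries only configuration, and each later message carries only a chunk of audio. Plain dictionaries stand in for the real gRPC messages here; the field names, the 3200-byte chunk size (about 100 ms of 16 kHz, 16-bit mono audio), and the streaming_requests helper itself are assumptions made for illustration.

```python
def streaming_requests(audio_bytes, chunk_size=3200, language_code="en-US"):
    """Yield request messages for one streaming recognition call.

    The first message carries only configuration; every later message carries
    only a chunk of audio.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    yield {
        "streaming_config": {
            "config": {
                "encoding": "LINEAR16",      # assumed: 16-bit PCM input
                "sample_rate_hertz": 16000,  # assumed sample rate
                "language_code": language_code,
            },
            "interim_results": True,
        }
    }
    # Split the audio into fixed-size chunks; the last one may be shorter.
    for start in range(0, len(audio_bytes), chunk_size):
        yield {"audio_content": audio_bytes[start:start + chunk_size]}


if __name__ == "__main__":
    messages = list(streaming_requests(b"\x00" * 8000))
    print(len(messages), "messages, first keys:", list(messages[0]))
```

In a live setting the audio would come from a microphone buffer rather than a byte string, with chunks yielded as they are captured.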