OpenAI Whisper | Speech-to-text model

OpenAI's Whisper is a powerful tool that converts spoken language into written text with great accuracy. It works well even in tough conditions, like when speakers have accents or there is background noise. Whisper can also translate spoken content from different languages into English. This versatility makes it a valuable resource for many users, including business professionals, researchers, and everyday individuals.

Before you can use Whisper, you'll need to install a few prerequisites on your computer. Let's get them installed.

1. Install ffmpeg

ffmpeg is essential for managing audio file operations and conversions.

On Ubuntu or Debian:

sudo apt update && sudo apt install ffmpeg

On macOS using Homebrew (https://brew.sh/):

brew install ffmpeg

On Windows using Chocolatey (https://chocolatey.org/):

choco install ffmpeg
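Whichever package manager you use, you can confirm that ffmpeg is available on your PATH by checking its version:

ffmpeg -version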

2. Install openai-whisper

To use OpenAI Whisper, you need to install it via pip. Make sure you have Python installed (Whisper requires Python 3.8 or later):

pip install -U openai-whisper

Alternatively, the following command will pull and install the latest commit from Whisper's GitHub repository, along with its Python dependencies:

pip install git+https://github.com/openai/whisper.git
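Either command puts a whisper executable on your PATH. You can verify the installation by printing the CLI's help text:

whisper --help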

3. Test Whisper from the command line

Once installed, you can use Whisper to transcribe speech from audio files.

Open your favorite terminal. The command below transcribes speech in an audio file using the medium model:

whisper audio.mp3 --model medium
  • If you do not specify a model, Whisper uses the small model by default.

To transcribe an audio file containing non-English speech, you can specify the language using the --language option. For example:

whisper hindi.mp3 --model medium --language Hindi
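Whisper can also translate non-English speech into English, as mentioned earlier. The CLI exposes this through the --task translate option; the following command should produce an English transcript of the same Hindi audio:

whisper hindi.mp3 --model medium --language Hindi --task translate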

You can also transcribe audio directly from Python:

import whisper

# load the base model (it is downloaded on first use)
model = whisper.load_model("base")

# transcribe the audio file and print the recognized text
result = model.transcribe("audio.mp3")
print(result["text"])

Internally, the transcribe() method reads the entire file and processes the audio with a sliding 30-second window, performing autoregressive sequence-to-sequence predictions on each window.

When you run the code provided, the output will display the transcribed text from the audio file audio.mp3.

By following these steps, you can easily transcribe audio files into text using Python.
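Note that the result dictionary returned by transcribe() contains more than the final text: its "segments" list carries per-segment timestamps produced by the windowed decoding described above. Here is a minimal sketch of reading them, assuming the same audio.mp3 file:

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")

# each segment has start/end times (in seconds) and its piece of the text
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s -> {segment['end']:.2f}s] {segment['text']}")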

Let's use the Whisper model in a more detailed way. The whisper.detect_language() and whisper.decode() functions let you work directly with the model at a lower level.

import whisper

model = whisper.load_model("small")

# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("hindi.mp3")
audio = whisper.pad_or_trim(audio)

# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# decode the audio
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)

# print the recognized text
print(result.text)

After running the above code, you will see that Whisper has automatically detected the language of the audio and transcribed the Hindi speech to text.
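whisper.DecodingOptions() also accepts parameters that change how decoding behaves. As a minimal sketch, reusing the model and mel from the example above, you can fix the language to skip detection, switch the task to translation, or disable fp16 when running on a CPU:

# fix the language, translate to English, and decode in fp32 (useful on CPU)
options = whisper.DecodingOptions(language="hi", task="translate", fp16=False)
result = whisper.decode(model, mel, options)
print(result.text)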

If the base model doesn't do a great job of understanding the audio, you can try a larger, more powerful model. The larger models have many more parameters, so they are often better at understanding what's being said. Here is the list of models available right now: