OpenAI Whisper | Speech to text model
OpenAI's Whisper is a powerful tool that converts spoken language into written text with great accuracy. It works well even in tough conditions, like when speakers have accents or there is background noise. Whisper can also translate spoken content from different languages into English. This versatility makes it a valuable resource for many users, including business professionals, researchers, and everyday individuals.
Before you can use Whisper, you'll need to install a few prerequisites on your computer. Let's get them installed.
1. Install ffmpeg
ffmpeg is essential for managing audio file operations and conversions.
On Ubuntu or Debian:
sudo apt update && sudo apt install ffmpeg
On macOS, using Homebrew (https://brew.sh/):
brew install ffmpeg
On Windows, using Chocolatey (https://chocolatey.org/):
choco install ffmpeg
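Whisper shells out to ffmpeg at runtime to decode audio, so the binary must be on your PATH. A quick way to verify this from Python, using only the standard library:

```python
import shutil

# shutil.which returns the full path to the executable, or None if it
# is not found anywhere on the PATH
ffmpeg_path = shutil.which("ffmpeg")
if ffmpeg_path is None:
    print("ffmpeg not found - install it with your package manager")
else:
    print(f"ffmpeg found at {ffmpeg_path}")
```

If this prints "not found", revisit the install command for your platform above before moving on.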
2. Install openai-whisper
To use OpenAI Whisper, you need to install it via pip. Make sure you have Python installed (Whisper requires Python 3.8 or newer):
pip install -U openai-whisper
Alternatively, the following command pulls and installs the latest commit from the Whisper GitHub repository, along with its Python dependencies:
pip install git+https://github.com/openai/whisper.git
3. Test Whisper from the command line
Once installed, you can use Whisper to transcribe speech from audio files.
Open your favorite terminal. The command below transcribes the speech in an audio file using the medium model:
whisper audio.mp3 --model medium
- If you do not specify a model, Whisper falls back to the default model, small.
To transcribe an audio file containing non-English speech, specify the language with the --language option. For example:
whisper hindi.mp3 --model medium --language Hindi
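If you want to drive the CLI from a script rather than typing commands by hand, the invocations above can be assembled with the standard library's subprocess module. This is a minimal sketch; build_whisper_cmd is a hypothetical helper, and only the --model and --language flags shown in the examples above are used:

```python
import subprocess

def build_whisper_cmd(audio_path, model="medium", language=None):
    """Assemble a whisper CLI invocation like the examples above."""
    cmd = ["whisper", audio_path, "--model", model]
    if language is not None:
        cmd += ["--language", language]
    return cmd

cmd = build_whisper_cmd("hindi.mp3", model="medium", language="Hindi")
print(" ".join(cmd))
# To actually run it (requires whisper and ffmpeg to be installed):
# subprocess.run(cmd, check=True)
```

Building the command as a list (rather than one string) avoids shell-quoting problems with file names that contain spaces.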
Transcribing audio from Python is just as straightforward:
import whisper

# load the multilingual base model
model = whisper.load_model("base")

# transcribe the file and print the recognized text
result = model.transcribe("audio.mp3")
print(result["text"])
Internally, the transcribe() method reads the entire file and processes the audio with a sliding 30-second window, performing autoregressive sequence-to-sequence predictions on each window.
When you run the code provided, the output will display the transcribed text from the audio file audio.mp3.
By following these steps, you can easily transcribe audio files into text using Python.
Let's use the Whisper model in a more detailed way. The lower-level functions whisper.detect_language() and whisper.decode() let you work directly with the model.
import whisper
model = whisper.load_model("small")
# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("hindi.mp3")
audio = whisper.pad_or_trim(audio)
# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)
# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")
# decode the audio
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)
# print the recognized text
print(result.text)
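To see what the language-detection line is doing, note that detect_language() returns a dictionary mapping language codes to probabilities, and max() with key=probs.get simply picks the code with the highest one. The values below are illustrative, not real model output:

```python
# an illustrative probability dictionary, shaped like the probs returned
# by model.detect_language(mel)
probs = {"en": 0.03, "hi": 0.91, "ur": 0.06}

# pick the language code with the highest probability
detected = max(probs, key=probs.get)
print(f"Detected language: {detected}")
```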
After running the above code, you will see that Whisper has automatically detected the language of the audio and transcribed the Hindi speech to text.
If the model you chose doesn't do a great job of understanding the audio, you can try a larger, more powerful model. The larger models have many more parameters, so they are generally more accurate at the cost of speed and memory. Here is the list of models available right now: