Automatically transcribing DSD FME wav files

tumble_up

Newbie
Joined
Sep 29, 2016
Messages
3
I realized I have more monitors than ears and thought it would be nice to transcribe the wav files that DSD-FME saves instead of having to listen to them in real time. Here is a short script to do that. It watches for file handles to close and then runs the audio through OpenAI's Whisper library.
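
If you're curious how it works, here's a stripped-down sketch of the idea. The actual script watches for the wav file handle to close; this version just polls until the file stops growing, which is close enough to illustrate it (model choice and file names here are placeholders):

```python
# Stripped-down sketch (not the actual script): poll a directory for new
# per-call wav files and transcribe each one with Whisper once it stops growing.
import os
import sys
import time

import whisper

watch_dir = sys.argv[1]
model = whisper.load_model("base")  # larger models are slower but more accurate
seen = set()

while True:
    for entry in os.scandir(watch_dir):
        if not entry.name.endswith(".wav") or entry.path in seen:
            continue
        size = entry.stat().st_size
        time.sleep(1)  # crude check that the decoder has finished writing
        if os.path.getsize(entry.path) != size:
            continue  # still growing; pick it up on the next pass
        seen.add(entry.path)
        result = model.transcribe(entry.path, fp16=False)
        print(f"{entry.name}: {result['text'].strip()}")
    time.sleep(2)
```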

To use it:
- Make sure dsd-fme is writing per-call WAV files.
- Run `pip install -r requirements.txt`
- Run `python dsd-fme-transcribe.py /path/to/WAV/files`

Once it's started you should see text in the terminal like:

TG Bus Dispatch RID 8703: 237 to central.
TG Bus Dispatch RID 8761: 2-3-7 standby just a moment. 2-3-2 go ahead.
TG Bus Dispatch RID 8784: I just wanted to update you that I'm route complete. Thank you.
TG Bus Dispatch RID 8761: Okay, 10-4. Thank you. Go ahead. Two, three, seven.
TG Bus Dispatch RID 8761: 274 standby just a moment. I got one more route calling in here. 237, I'm ready for you. Go ahead 237.
TG Bus Dispatch RID 8703: So it is closed and I have three stops down there. What do I do?

The script will work for any directory that is getting audio files dropped in it, but DSD-FME is the only thing I've tested it with. Cheers
 

lwvmobile

DSD-FME
Joined
Apr 26, 2020
Messages
1,294
Location
Lafayette County, FL
Cool, I'll have to give it a try when I have a chance. I do have some questions. Could this work with any speech in a wav file, and not just things specifically from DSD-FME? Also, can openai-whisper easily be configured in the python script to do translation as well as transcription? I'm unfamiliar with openai-whisper, so I suppose I could just google the answer, but figured I'd ask. I often have samples (and remote access to locations) in languages I don't speak, so it would be a nice thing if it's doable. The other day I found myself resorting to piping audio into Google Translate to see if it would translate on the fly, but it didn't do so well, considering the talkers weren't doing a good job of speaking clearly directly into the mic.
 

tumble_up

Newbie
Joined
Sep 29, 2016
Messages
3
@lwvmobile It'll work with any speech in a wav file. The only thing tying it to DSD-FME is that I parse the talkgroup CSV file so the talkgroup names can be printed in the output.
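
The lookup itself is nothing fancy, roughly like the sketch below. It assumes a simple "tgid,name" layout; the real DSD-FME group file has more columns, so treat the indices and file name here as placeholders:

```python
# Sketch of the talkgroup-label lookup, assuming a simple "tgid,name" CSV.
# The real DSD-FME group file has more columns, so adjust the indices to match.
import csv

def load_talkgroups(path):
    names = {}
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if len(row) >= 2 and row[0].strip().isdigit():
                names[row[0].strip()] = row[1].strip()
    return names

tg_names = load_talkgroups("group.csv")   # hypothetical file name
print(tg_names.get("1234", "TG 1234"))    # fall back to the raw ID if unknown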

Whisper will translate audio, including detecting the source language. You can run it right off the CLI if you want to experiment with it outside of Python. See some examples here: GitHub - openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision
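
From Python it's just one extra argument; `task="translate"` makes it output English no matter the spoken language, and the detected language comes back in the result (file name below is made up):

```python
import whisper

model = whisper.load_model("small")
# task="translate" outputs English regardless of the source language;
# Whisper auto-detects the language and reports it in the result.
result = model.transcribe("some_call.wav", task="translate", fp16=False)
print(result["language"], result["text"].strip())
```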

There are also many examples of people doing similar stuff but they all seemed really complex and I just wanted something simple and fast since my computer is kinda slow.

By the way, thanks for your work on DSD-FME. The audio I'm decoding is NXDN and I didn't see a lot of SDR options for that so I was happy to come across your work.
 

lwvmobile

DSD-FME
Joined
Apr 26, 2020
Messages
1,294
Location
Lafayette County, FL
I'd love to see @lwvmobile make it part of the install package for the windows version :) Do you think it would be able to do that @tumble_up?

Just looking at the stuff on their GitHub page, it seems like the software could be used for virtually any speech with some degree of setup. As was mentioned earlier, openai-whisper can be loaded and run independently of DSD-FME and of the operating system, so it doesn't sound like something I would package in by default (the voice models are hefty downloads, requirements like PyTorch need to be installed locally, etc.), but I may consider adding some form of support for it once I work out a good way to do it. It would be nice to just skip the .wav file middleman and go from decoded audio straight to text in some fashion. I'll tinker with it in my downtime and see if I can come up with an efficient setup for it.
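
In the meantime, here's a very rough idea of what the no-wav route might look like on the Python side. Purely a sketch: it assumes you can get 16 kHz mono signed 16-bit PCM onto stdin somehow, and it just chops the stream into fixed-length chunks instead of gating on actual transmissions:

```python
# Hypothetical sketch: read raw 16 kHz mono signed 16-bit PCM from stdin
# and hand fixed-length chunks to Whisper as float32 arrays.
import sys

import numpy as np
import whisper

model = whisper.load_model("base")
SAMPLE_RATE = 16000
CHUNK_SECONDS = 8

while True:
    raw = sys.stdin.buffer.read(SAMPLE_RATE * CHUNK_SECONDS * 2)  # 2 bytes/sample
    if not raw:
        break
    audio = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
    result = model.transcribe(audio, fp16=False)
    print(result["text"].strip(), flush=True)
```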
 

RaleighGuy

Member
Premium Subscriber
Joined
Jul 15, 2014
Messages
14,198
Location
Raleigh, NC
It would be nice to just skip the .wav file middleman and go from decoded audio straight to text in some fashion. I'll tinker with it in my downtime and see if I can come up with an efficient setup for it.

You are awesome, that would be great.
 

lwvmobile

DSD-FME
Joined
Apr 26, 2020
Messages
1,294
Location
Lafayette County, FL
By the way, thanks for your work on DSD-FME. The audio I'm decoding is NXDN and I didn't see a lot of SDR options for that so I was happy to come across your work.

I'm glad you're finding it useful. NXDN decoding can be a bit of a mixed bag in DSD-FME, but as long as you're using one of the prescribed methods for NXDN, it usually works out just fine.
 

lwvmobile

DSD-FME
Joined
Apr 26, 2020
Messages
1,294
Location
Lafayette County, FL
You are awesome, that would be great.
Also, I can't say for certain what the hard requirements are, as I can't seem to find any minimum or recommended builds, but when you work with AI, a modern GPU is usually recommended, and I don't mean an integrated one, either. Usually you will want a modern Nvidia RTX, AMD RX, or perhaps one of the Intel Arc GPUs. Not saying it can't be done on CPU, but the GPU will absolutely smoke the CPU in terms of speed when working on parallel processing in many compute-heavy applications.

Edit: I should note, I am only referring to 'affordable' home market GPUs geared toward the end user, and not super computers and server farms, or whatever may be tailor made to execute specific high end tasks, etc.
 

scannerloser

im a loser, baby.
Joined
Dec 3, 2023
Messages
20
Location
USA
Hell yes! This is amazing, thank you! I'll be using this for other things as well...

How has your success rate been so far? As I'm listening right now, I'm betting that if there are issues, slowing the recordings down by 25%-50% might help with success rates because of how quickly dispatch and other people in the field speak over the radios. Then again, whisper might already handle that since it's AI based... I can't wait to play with this.

Also, just an FYI for anyone that doesn't have a massive GPU but wants to play with the AI stuff without a huge Amazon or Azure bill: there's a place I've known about for a long time called vast.ai that's way cheaper. It suddenly becomes possible and even easy to deploy docker containers to someone's beastmode computer in Texas with 4 GPUs for a few days because they're renting it out. It's good stuff; if you don't mind using my referral link I'd appreciate it. https://cloud.vast.ai/?ref_id=5423#

-L
 

lwvmobile

DSD-FME
Joined
Apr 26, 2020
Messages
1,294
Location
Lafayette County, FL
My initial testing was that the voice transcription wasn't very accurate at all when feeding the created wav files into the software using its CLI directly. I rarely got text back that was more accurate than not, and that was on a crystal clear NXDN system where I feel the voices are very easy to understand. I realize there could be errors in the implementation on my end, like what the preferred sampling rate for wav files is, etc., but just running a bunch of wav files into the software, I found that no matter which voice model I used, it wasn't super good.

On that note, though, I found out the software's GPU support only runs on CUDA (NVIDIA), so no AMD support for GPU. That being said, it will run on the CPU, and unlike a lot of other computational tasks, it was still reasonably fast at transcribing wav files, even on a Ryzen 5 2600 CPU. Still, I can't help but wonder whether the CPU implementation was part of the poor transcription accuracy vs. using native CUDA. I don't know enough about Whisper's innards to give an answer on that end.
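
If anyone wants to rule that out, the Python library makes it easy to check which device it's actually using; something along these lines should work (fp16 only applies on the GPU, the CPU path falls back to fp32 anyway):

```python
import torch
import whisper

# Pick the GPU if PyTorch can see one, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
print("running on", device)

model = whisper.load_model("small", device=device)
# fp16 is only meaningful on the GPU; on the CPU Whisper uses fp32 regardless.
result = model.transcribe("call.wav", fp16=(device == "cuda"))
print(result["text"].strip())
```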

Other than that, I haven't been able to do more testing or any coding on it; I've had other things come up lately that have severely hampered my time. I did attempt to pipe audio directly into the software, but I don't think the software really liked that. I did see they have some projects that do live audio dictation, but I never got around to doing much with that either.
 

KCoax

Member
Premium Subscriber
Joined
Apr 6, 2009
Messages
390
I've run Whisper several times on Google Colab with a T4 GPU and on a PC with an NVIDIA 1080. Even with the large models there isn't much of a difference in the accuracy rate. Unfortunately, the models Whisper is trained on don't quite get "radio speak" yet.
 

scannerloser

im a loser, baby.
Joined
Dec 3, 2023
Messages
20
Location
USA
I've run Whisper several times on Google Colab with a T4 GPU and on a PC with an NVIDIA 1080. Even with the large models there isn't much of a difference in the accuracy rate. Unfortunately, the models Whisper is trained on don't quite get "radio speak" yet.
Yeah, that's kinda what I figured, sadly. I think it's a great challenge to try to get under the eyeballs of the "higher ups" at the various AI shops, because once they realize the po$$ibilities of providing a transcription service to government potentially... they will take much more notice and maybe invest effort into assisting scanner groups for now, then market it later after it's been tested here... just a thought
 

Unitrunker2

Member
Premium Subscriber
Joined
Oct 28, 2017
Messages
285
once they realize the po$$ibilities of providing a transcription service to government potentially...
The companies that provide recording systems are already on it. Example:


This is geared primarily towards call centers but you get the idea. A bit more ...

 

tumble_up

Newbie
Joined
Sep 29, 2016
Messages
3
My original post was a copy/paste from the terminal where it was running, so it shows the transcription can be accurate, but I agree it's not good enough to use day-to-day, and I haven't messed with it since I posted here either.

When I was first looking for existing solutions, I found a thesis where they trained their own model on air traffic control audio and got it down to a ~13% error rate on what must be pretty challenging audio. So, for significant improvements, training on radio audio like that would probably be the way to go.
 

RobDLG

Member
Joined
Mar 10, 2024
Messages
33
Location
USA
I've been using DSD-FME on Fedora 40, and when I saw that the Fedora 40 repositories contain whisper-cpp, I thought this software would provide a convenient means to test speech recognition on recorded DSD-FME audio output.

I soon learned my expectations were wrong, or at least overly optimistic:
  1. The Fedora package contains only a library; the example binaries included in the source distribution are not included.
  2. The software version packaged is long outdated (1.5.4 vs 1.6.2).
  3. The developer describes the project as a "hobby". "It does not strive to provide a production ready implementation."
So, installing the Fedora package did not enable me to run a speech recognition program, and any test results would have had a major caveat, anyway.

I decided to try building from source (v1.6.1), and it was trivial. The common cmake;make;make install command sequence yielded the libraries and executable files I needed to run my tests. After downloading the appropriate Whisper models, I was all set!

A review of the initial text file results revealed they were better than I expected. I then focused on evaluating whether or not street names were accurate enough for matching via partial string search.

For the street name test, I chose a 10.8 second recording of Police Central Dispatch with the phrase "Monte Sombra and McGuffey".

The test platform was a Dell Optiplex Micro with 8GB of memory - a typical office PC from ten years ago. The following table summarizes the results, sorted by encoding time.

Model               Encoding Time (s)   Encoding Buffer Size (MB)   Text Result
ggml-base                  3.96                 132.07              Montessomber and MacGuffee
ggml-base.en               3.99                 132.07              month of summer in my coffee
ggml-small                 7.19                 280.20              Monday Summer in McGuffey
ggml-small.en             14.27                 280.20              month of summer in McGuffey
ggml-medium               23.44                 594.22              month of December in McGuffey
ggml-large-v3             44.81                 926.66              Montezuma and McGuffey
ggml-large-v3-q5_0        46.04                 926.66              Montezuma and McGuffey
ggml-medium.en            46.86                 594.22              Munted Summer in McGuffey

The base model was not only fast enough for real-time encoding, it was accurate enough for case-insensitive matches on "monte" and "guff". This is the model I'll use for further testing.
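
For the curious, the matching itself is just a case-insensitive substring check against a short list of fragments (the fragment list below is made up for illustration):

```python
# Illustration of the partial-match idea: case-insensitive substring checks
# against a (made-up) list of street-name fragments.
STREET_FRAGMENTS = ["monte", "guff"]

def match_streets(transcript):
    text = transcript.lower()
    return [frag for frag in STREET_FRAGMENTS if frag in text]

print(match_streets("Montessomber and MacGuffee"))    # -> ['monte', 'guff']
print(match_streets("month of summer in my coffee"))  # -> []
```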

The two large models, as one might expect, were accurate enough to match "Monte" and "McGuff", but they ran far too slowly and used too much memory for the modest hardware I was using. Still, I could use them for tasks such as offline generation of video subtitles.

The output of the base.en and medium.en models was crap! The encoding time for the latter is also an outlier - running so long that I should double-check to see if I made an error somewhere.

The remainder of the models didn't provide compelling output during this short test.

I'm not an AI expert, but I think it's safe to say that any major improvement in recognition accuracy for the case I examined would require additional training data and new models. Still, the ggml-base model performed well enough to merit additional investigation.
 

RobDLG

Member
Joined
Mar 10, 2024
Messages
33
Location
USA
example binaries included in the source distribution are not included

Turns out that there is a "stream" example that isn't built by default, but a simple make stream from the project's base directory produces the binary forthwith.

As the name suggests, stream performs transcription on live audio. I'd hoped to create a video with DSD-FME running in one terminal, and stream transcribing in a second terminal, but stream takes audio from the microphone input, and it wasn't immediately clear to me how to configure the Fedora 40 system audio to make this work.

I got impatient and set a USB webcam with built-in microphone in front of my BCD325P2's speaker. Using the program defaults with the "base" AI model, stream began producing output. It was clear, though, that I should test different settings to potentially improve the performance for this particular use case. Still, stream transcription worked!
 

lwvmobile

DSD-FME
Joined
Apr 26, 2020
Messages
1,294
Location
Lafayette County, FL
I got impatient and set a USB webcam with built-in microphone in front of my BCD325P2's speaker. Using the program defaults with the "base" AI model, stream began producing output. It was clear, though, that I should test different settings to potentially improve the performance for this particular use case. Still, stream transcription worked!

I don't have the whisper setup currently (or if I still do, I don't remember most of how to use it right now, or where I put it), but you might try making a null sink, routing the audio output from dsd-fme into the null sink, and having stream listen to the monitor of the null sink.

Here is something that I use for DSD-FME and M17-FME, but you could just use it as information for performing the above operation.


EDIT: Actually, you may not even need to use null sinks; you can probably just have stream listen to the monitor of the hardware that dsd-fme is playing out of.
 