Trunk-Recorder Transcription?

dbuttry · Jul 1, 2021

Has anyone tried to automate transcription of the audio files that Trunk-Recorder creates? I was thinking that if we could make this work, you could then set alerts for keywords, etc. Thoughts? Is there something else that can do this already?

kcams · Aug 3, 2021

Just started looking at that. Looks like Trunk-Recorder outputs a .wav at 8000 Hz mono, and 16000 Hz would be a much better sample rate for speech/dictation software. Poking at the code a bit, I can't figure out how to make that change. Not sure how the digital recorder works.

Specifically this: /trunk-recorder/p25_recorder_decode.cc
wav_sink = gr::blocks::nonstop_wavfile_sink_impl::make(1, 8000, 16, true);

It would appear it's much more complex than just this line (double speed audio output if you change it, so there must be other hacks required). I'm bad at C, and I haven't figured out how the end .wav gets dumped out. Suggestions? Am I looking at the wrong thing?
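If a speech tool insists on 16000 Hz input, one option that avoids touching the recorder is to resample the finished file. The double-speed effect is what you would expect when the header claims 16000 Hz but the decoder still produces 8000 samples per second. Below is a minimal sketch using only the Python standard library; the function names are made up for illustration, and it assumes 16-bit mono input. Upsampling this way only satisfies the input format, it does not add any detail that wasn't in the 8000 Hz audio.

```python
import sys
import wave
from array import array


def upsample_2x(samples):
    """Double the sample count by linear interpolation between neighbours."""
    out = array("h")
    for i, s in enumerate(samples):
        out.append(s)
        # For the last sample there is no neighbour, so repeat it.
        nxt = samples[i + 1] if i + 1 < len(samples) else s
        out.append((s + nxt) // 2)
    return out


def resample_wav_2x(src_path, dst_path):
    """Read a 16-bit mono WAV and write a copy at twice the sample rate."""
    with wave.open(src_path, "rb") as src:
        if src.getnchannels() != 1 or src.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono input")
        rate = src.getframerate()
        samples = array("h")
        samples.frombytes(src.readframes(src.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    doubled = upsample_2x(samples)
    if sys.byteorder == "big":
        doubled.byteswap()
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(rate * 2)
        dst.writeframes(doubled.tobytes())
```

Something like `resample_wav_2x("call.wav", "call_16k.wav")` would then produce the file to feed to the recognizer (both file names are placeholders).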

So far DeepSpeech (Linux) looks a bit more promising and active than Julius (Linux) for the wav-to-text part. I'd imagine you'll be working on the "model" a bunch, trying to train it to make it work. Multiple voices and unclear audio are going to be an issue.

Anyone else try this?

kcams · Aug 11, 2021

After a little more time: the 8000 Hz recording is just how the P25 voice is decoded into the wav file, and that's how it works. The real battle is the modeling, audio cleanup/filtering, and speech recognition.

They speak fast on the radio, the audio isn't normalized, and a lot of calls aren't very clear. A default model is weighted toward normal speech patterns, and radio traffic isn't exactly that, so I get some really awesome and funny results from DeepSpeech. It's trying. It just doesn't know about sentence fragments and radio "banter".

It is going to be a keyword, machine learning, and model training exercise, so making a model is going to require a machine with a lot of horsepower and a decent graphics card whose GPU you can leverage. Nvidia looks to be the best supported for this. Maybe it's possible to get close(r).
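On the point about the audio not being normalized, a simple first cleanup step is peak normalization before the file goes to the recognizer. This is only a sketch: the function name and the 0.9 target are arbitrary choices for illustration, it assumes 16-bit samples, and it won't help with noise or garbled speech.

```python
from array import array


def normalize_peak(samples, target=0.9):
    """Scale 16-bit samples so the loudest one sits at target * full scale."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return array("h", samples)  # silence or empty input: nothing to scale
    gain = target * 32767 / peak
    # Clamp after rounding so the result always fits in a signed 16-bit value.
    return array(
        "h",
        (max(-32768, min(32767, round(s * gain))) for s in samples),
    )
```

A quiet call and a loud call then reach the recognizer at roughly the same level, which at least removes one variable when comparing results.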

polkaroo · Aug 13, 2021

There's also Microsoft Azure Speech to Text and Google Cloud Speech-to-Text, though I'm sure (obscure) street names will give some interesting results.

they4kman · Sep 30, 2021

I wanted to do this as well, as a method for filtering down the calls I choose to listen to from my system. I recall trying DeepSpeech and not getting great results from it. IIRC, making adjustments meant retraining a new model, which I really don't know enough about, have enough annotated training data for, or have the patience for.

I tried a bunch of other things (that I've forgotten now — I did this Dec 2020) with mixed results. I found Google's Speech-to-Text API to give the best results. It also offers the ability to make adjustments (i.e. adding and/or weighting keywords) and provides position, duration, and confidence for each word it transcribes. I had the best luck using the phone call model. Enhanced mode had mixed results. But even with all my tinkering, it was never great.
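Per-word position and confidence are what make keyword alerts practical. The sketch below is not tied to any particular API: the `Word` record and `find_alerts` function are hypothetical, and you would fill the records from whatever word-level output your transcription service returns.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One transcribed word (hypothetical record, not a library type)."""
    text: str
    start: float       # seconds from the start of the call
    end: float
    confidence: float  # 0.0 to 1.0


def find_alerts(words, phrases, min_confidence=0.5):
    """Return (phrase, start, end, confidence) for each phrase found.

    A multi-word phrase matches consecutive words; its confidence is the
    lowest confidence among the words that make it up.
    """
    tokens = [w.text.lower() for w in words]
    hits = []
    for phrase in phrases:
        target = phrase.lower().split()
        if not target:
            continue
        n = len(target)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == target:
                window = words[i:i + n]
                conf = min(w.confidence for w in window)
                if conf >= min_confidence:
                    hits.append((phrase, window[0].start, window[-1].end, conf))
    return hits
```

The start and end times let you jump straight to the relevant part of the recording, and the confidence threshold keeps low-quality guesses from triggering alerts.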

Here's an example from my notes:

# Hand-transcribed
143-alpha copy, 342-bravo to back on a Call For Officer
14<garble ...> 21 and 4
on a call<?>
This is 40 3rd Avenue South unit 310, Beacon 430 apartments
430 3rd Avenue South unit 310, Beacon 430 apartments
Complainant heard a slamming noise and a female saying "no, no" at approximately a minute ago
Advised that the screaming has stopped, and complainant can hear talking at this time.
Waiting on further at 1:12

# Without enhanced
143 Alpha copy 342 Provident Bank on a call for officer one for you go from 21 and 4 hello this is 400 3rd Avenue South unit 310 to Deacon 4:30 Apartments 430 3rd Avenue South unit 310 Beacon 430 Apartments I'm playing her to Fleming noise and a female saying no no approximately a minute ago if I the screaming as soft and cleaning can here talking at this time waiting on further 112

confidence: 0.6370497941970825

# With enhanced
143 off the coffee 3:40 to probably back on the call for officer 148304 from 21 and 4 on the Park Avenue South unit 310 a month Apartments 430 3rd Avenue South unit 310 you can 4:30 Apartments complain heard a slamming noise and a female saying no no approximately a minute ago the screaming. And running here talking at this time waiting on further $112

confidence: 0.8535746932029724
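The reported confidence is the recognizer's own estimate, so it is worth scoring the output against a hand transcript as well. One common measure is word error rate (WER): the word-level edit distance divided by the number of words in the reference. Below is a minimal sketch; the function name is made up, and it does no cleanup beyond lowercasing, so punctuation and number formatting ("4:30" vs "430") count as errors unless you normalize them first.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] = edits needed to turn the ref words seen so far into hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            cur.append(min(
                prev[j] + 1,         # word missing from the hypothesis
                cur[j - 1] + 1,      # extra word in the hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)
```

Running it over the hand transcript and each machine transcript gives a number you can compare across models and settings, independent of what the service says about itself.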

For reference, I've attached the script I was playing with — it needs Python 3.10 ('cause I use generic typing on builtin collections), and was developed with dependencies colorama==0.4.4 and google-cloud-speech==2.0.1
