Trunk-Recorder Transcription?

Status
Not open for further replies.

dbuttry

Newbie
Premium Subscriber
Joined
Jun 26, 2010
Messages
4
Location
Champaign, IL
Has anyone tried to automate transcription of the audio files that Trunk-Recorder creates? I was thinking that if we could make this work, you could then set alerts for keywords, etc. Thoughts? Is there something else that can do this already?
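To make the keyword-alert idea concrete, here is a minimal sketch that assumes you already have a transcript string for each call from some speech-to-text step. The watch list and the function name `find_keywords` are hypothetical, not part of Trunk-Recorder.

```python
import re

# Hypothetical watch list; pick terms that matter on your own system.
KEYWORDS = ["call for officer", "screaming", "shots fired"]


def find_keywords(transcript: str, keywords: list[str]) -> list[str]:
    """Return the keywords that occur in the transcript as whole words or phrases."""
    # Lowercase and drop punctuation so "Call For Officer!" still matches.
    normalized = " ".join(re.findall(r"[a-z0-9']+", transcript.lower()))
    hits = []
    for keyword in keywords:
        needle = " ".join(re.findall(r"[a-z0-9']+", keyword.lower()))
        # Pad with spaces so "fire" does not match inside "fired".
        if needle and f" {needle} " in f" {normalized} ":
            hits.append(keyword)
    return hits
```

Calling `find_keywords(text, KEYWORDS)` on each new transcript and sending a notification when the list is non-empty would be one way to build the alert. Exact matching like this only works as well as the transcription does, so a misheard keyword is simply missed.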
 

kcams

Member
Joined
Aug 3, 2021
Messages
16
Just started looking at that. Looks like Trunk-Recorder outputs a .wav at 8000 Hz mono, and 16000 Hz would be a much better sample rate for speech/dictation software. Poking at the code a bit, I can't figure out how to make that change. Not sure how the digital recorder works.

Specifically this: /trunk-recorder/p25_recorder_decode.cc
wav_sink = gr::blocks::nonstop_wavfile_sink_impl::make(1, 8000, 16, true);

It would appear it's much more complex than just this line (the audio plays back at double speed if you change it, so other changes must be required). I'm bad at C, and I haven't figured out how the final .wav gets written out. Suggestions? Am I looking at the wrong thing?

So far DeepSpeech (Linux) looks a bit more promising and more active than Julius (Linux) for the wav-to-text part. I'd imagine you'll spend a lot of time training the "model" to make it work. Multiple voices and unclear audio are going to be an issue.

Anyone else try this?
 

kcams

Member
Joined
Aug 3, 2021
Messages
16
After a little more time: the 8000 Hz sample rate is simply what the P25 voice decoder produces, so that is what ends up in the .wav file. The real battle is the modeling, audio cleanup/filtering, and speech recognition.
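If a speech model insists on 16000 Hz input, one option is to leave the recorder alone and resample the finished file afterwards. Below is a rough sketch using only the Python standard library, assuming 16-bit mono PCM input. The function names are hypothetical, and linear interpolation is a crude stand-in for a proper resampler such as sox or ffmpeg.

```python
import array
import sys
import wave


def upsample_2x(samples: array.array) -> array.array:
    """Double the sample rate of 16-bit mono PCM by linear interpolation."""
    out = array.array("h")
    for i, current in enumerate(samples):
        # The last sample has no right-hand neighbour, so repeat it.
        nxt = samples[i + 1] if i + 1 < len(samples) else current
        out.append(current)
        out.append((current + nxt) // 2)  # midpoint between neighbours
    return out


def upsample_wav(src_path: str, dst_path: str) -> None:
    """Read a 16-bit mono WAV and write a copy at twice the sample rate."""
    with wave.open(src_path, "rb") as src:
        if src.getnchannels() != 1 or src.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM")
        rate = src.getframerate()
        samples = array.array("h")
        samples.frombytes(src.readframes(src.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    doubled = upsample_2x(samples)
    if sys.byteorder == "big":
        doubled.byteswap()
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(rate * 2)
        dst.writeframes(doubled.tobytes())
```

Usage would be something like `upsample_wav("call.wav", "call_16k.wav")`, with placeholder file names. Upsampling adds no new audio information; it only puts the file in the format a 16000 Hz model expects, so it will not by itself make an unclear call clearer.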

They speak fast on the radio, the audio isn't normalized, and a lot of calls aren't very clear. The words a default model is weighted toward don't match radio speech patterns, so I get some really awesome and funny results from DeepSpeech. It's trying. It just doesn't know about sentence fragments and radio "banter".
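On the normalization point, a simple peak normalizer is easy to try before feeding audio to a recognizer. This is only a sketch under the assumption of 16-bit PCM samples already loaded into an `array` (for example via the `wave` module). The name `peak_normalize` and the default target level are made up for illustration, and whether this helps recognition at all is untested here.

```python
import array


def peak_normalize(samples: array.array, target_peak: int = 29000) -> array.array:
    """Scale 16-bit samples so the loudest one reaches target_peak."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return array.array("h", samples)  # silence or empty: nothing to scale
    gain = target_peak / peak
    # Clamp to the 16-bit range in case rounding pushes a sample over the edge.
    return array.array(
        "h", (max(-32768, min(32767, round(s * gain))) for s in samples)
    )
```

Peak normalization only evens out level differences between calls. It does nothing for noise or distortion, so it covers just one small piece of the cleanup/filtering problem.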

It is going to be a keyword, machine learning, and model training exercise, so making a model is going to require a machine with a lot of horsepower and a decent graphics card whose GPU you can leverage. Nvidia looks to be the best supported for this. Maybe it's possible to get close(r).
 

they4kman

Member
Premium Subscriber
Joined
Nov 22, 2020
Messages
7
Location
St Petersburg, FL
I wanted to do this as well, as a method for filtering down the calls I choose to listen to from my system. I recall trying DeepSpeech and not getting great results from it. IIRC, making adjustments meant training a new model, which I really don't know enough about, don't have enough annotated training data for, and don't have the patience for :p

I tried a bunch of other things (that I've forgotten now — I did this Dec 2020) with mixed results. I found Google's Speech-to-Text API to give the best results. It also offers the ability to make adjustments (i.e. adding and/or weighting keywords) and provides position, duration, and confidence for each word it transcribes. I had the best luck using the phone call model. Enhanced mode had mixed results. But even with all my tinkering, it was never great.
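To illustrate the settings described above, here is a sketch of what a request body for the Speech-to-Text v1 REST `speech:recognize` call might look like. It is not taken from the attached script, which uses the `google-cloud-speech` client library. The field names are as I understand the v1 REST API and should be checked against the current docs, and `build_recognize_request` and the default boost value are hypothetical.

```python
import base64
import json


def build_recognize_request(
    wav_bytes: bytes, phrases: list[str], boost: float = 10.0
) -> dict:
    """Build a JSON-serializable body for a v1 `speech:recognize` request."""
    return {
        "config": {
            "encoding": "LINEAR16",
            "sampleRateHertz": 8000,        # rate of the Trunk-Recorder .wav files
            "languageCode": "en-US",
            "model": "phone_call",          # the model that worked best above
            "useEnhanced": False,           # enhanced mode gave mixed results
            "enableWordTimeOffsets": True,  # per-word position and duration
            "enableWordConfidence": True,   # per-word confidence
            # Phrase hints: weight words the default model rarely expects.
            "speechContexts": [{"phrases": phrases, "boost": boost}],
        },
        # Audio is sent inline as base64 text.
        "audio": {"content": base64.b64encode(wav_bytes).decode("ascii")},
    }
```

Something like `json.dumps(build_recognize_request(data, ["call for officer", "Beacon 430 apartments"]))` would give the body to POST. Authentication and the HTTP call itself are left out on purpose.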

Here's an example from my notes:
# Hand-transcribed
143-alpha copy, 342-bravo to back on a Call For Officer
14<garble ...> 21 and 4
on a call<?>
This is 40 3rd Avenue South unit 310, Beacon 430 apartments
430 3rd Avenue South unit 310, Beacon 430 apartments
Complainant heard a slamming noise and a female saying "no, no" at approximately a minute ago
Advised that the screaming has stopped, and complainant can hear talking at this time.
Waiting on further at 1:12



# Without enhanced
143 Alpha copy 342 Provident Bank on a call for officer one for you go from 21 and 4 hello this is 400 3rd Avenue South unit 310 to Deacon 4:30 Apartments 430 3rd Avenue South unit 310 Beacon 430 Apartments I'm playing her to Fleming noise and a female saying no no approximately a minute ago if I the screaming as soft and cleaning can here talking at this time waiting on further 112

confidence: 0.6370497941970825


# With enhanced
143 off the coffee 3:40 to probably back on the call for officer 148304 from 21 and 4 on the Park Avenue South unit 310 a month Apartments 430 3rd Avenue South unit 310 you can 4:30 Apartments complain heard a slamming noise and a female saying no no approximately a minute ago the screaming. And running here talking at this time waiting on further $112

confidence: 0.8535746932029724
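The confidence values are the recognizer's own estimate, so one way to compare outputs like these against the hand transcription is a word error rate. A minimal sketch follows. This was not part of my original notes, and the function names are hypothetical.

```python
import re


def words(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = words(reference), words(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    # Classic edit-distance table, kept one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # reference word missing from hypothesis
                current[j - 1] + 1,      # extra word in hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)
```

Lower is better, and 0.0 means the word sequences match exactly. Because the tokenizer splits on punctuation, "1:12" and "112" count as different words, so numbers and times would need their own cleanup before the score means much.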

For reference, I've attached the script I was playing with — it needs Python 3.10 ('cause I use generic typing on builtin collections), and was developed with dependencies colorama==0.4.4 and google-cloud-speech==2.0.1
 

Attachments

  • trunk_recorder_cloud_tts_transcribe.py.txt
    7.6 KB · Views: 13