Tom_Neverwinter
Member
While radio software capably converts radio waves to audio, the real advantage lies in turning that audio into immediately scannable text. Imagine receiving an alert from your home system, capturing vital over-the-air communications even while you're away, and instantly knowing whether a concerning event, like an incident on your block or at your residence, is unfolding. This lets you quickly separate actionable intelligence from background noise and respond fast in urgent situations. Unlike other solutions such as Whisper, which can be resource-intensive and slow for real-time use, Parakeet TDT 0.6B is a robust and efficient alternative: it handles both faster-than-real-time transcription and efficient batch processing while requiring fewer resources, making it usable even on older or less powerful systems, so no critical detail is missed in time-sensitive scenarios.
Hopefully something better comes along like this, but with timestamp support and translation.
I have been outputting DSD+ audio to an input folder, which Parakeet then parses, writing results to an output folder and making a subtitle (SRT) file that other software can check for keywords or context. The word error rate isn't fantastic (about 6 words per 100), but that's roughly on par with human transcribers, so why not.
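For the keyword check, here's a minimal sketch of what the downstream software could look like (the keyword list, folder name, and alert hook are my own placeholders, not part of my actual setup):

```python
import glob
import os

# Hypothetical keyword list -- tune to whatever matters on your block
KEYWORDS = {"fire", "shots", "pursuit", "ambulance"}

def scan_transcripts(output_dir="output"):
    """Return {filename: [matched keywords]} for every transcript found."""
    hits = {}
    for path in glob.glob(os.path.join(output_dir, "*.txt")):
        with open(path, encoding="utf-8") as f:
            words = set(f.read().lower().split())
        matched = sorted(KEYWORDS & words)
        if matched:
            hits[os.path.basename(path)] = matched
    return hits

if __name__ == "__main__":
    for fname, kws in scan_transcripts().items():
        print(f"ALERT {fname}: {', '.join(kws)}")
```

Swap the print for email, push notifications, or whatever alerting you already have.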
I use this model:

nvidia/parakeet-tdt-0.6b-v2 · Hugging Face
nvidia/parakeet-rnnt-1.1b · Hugging Face is a multilingual model, but it's slower and sadly doesn't translate; it just outputs the text in whatever language was spoken.
I installed the NeMo framework like this:
Windows install:
install Python 3.11, then:
py -3.11 -m venv .venv
.venv\Scripts\activate
pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -U "nemo_toolkit[asr]" soundfile
Then I made a bat file and a Python file to run the thing and perform chunking inside the main folder.
The Python file loads my model, chunks each recording from the input folder, processes it, and writes the result to the output folder:
"long_batch_transcribe_chunked.py"
import os
import shutil
import tempfile
import subprocess
import glob

from nemo.collections.asr.models import ASRModel

# ─── CONFIG ─────────────────────────────────────────────────────────────
INPUT_DIR = "input"
OUTPUT_DIR = "output"
MODEL_DIR = "models/"
CHECKPOINT = os.path.join(MODEL_DIR, "parakeet-tdt-0.6b-v2.nemo")
# CHECKPOINT = os.path.join(MODEL_DIR, "Parakeet-RNNT-XXL-1.1b_merged_universal_spe8.5k_1.0.nemo")
CHUNK_SECONDS = 60  # chunk length in seconds

# ─── PREPARE DIRS & OFFLINE ─────────────────────────────────────────────
os.makedirs(INPUT_DIR, exist_ok=True)
os.makedirs(OUTPUT_DIR, exist_ok=True)
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

# ─── LOAD & CONFIGURE MODEL FOR STREAMING ───────────────────────────────
print(f"Loading model from {CHECKPOINT}…")
model = ASRModel.restore_from(CHECKPOINT).to("cuda")
model.change_attention_model("rel_pos_local_attn", [128, 128])
model.change_subsampling_conv_chunking_factor(1)

# ─── PROCESS EACH AUDIO FILE IN INPUT_DIR ───────────────────────────────
for fname in os.listdir(INPUT_DIR):
    src = os.path.join(INPUT_DIR, fname)
    if not os.path.isfile(src):
        continue
    print(f"\n⏳ Processing {fname}…")

    # 1) use ffmpeg to split into CHUNK_SECONDS-long mono 16 kHz WAVs
    tmpdir = tempfile.TemporaryDirectory()
    pattern = os.path.join(tmpdir.name, "chunk_%06d.wav")
    cmd = [
        "ffmpeg", "-y", "-i", src,
        "-ac", "1", "-ar", "16000",
        "-f", "segment", "-segment_time", str(CHUNK_SECONDS),
        "-c:a", "pcm_s16le", pattern,
    ]
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

    # 2) transcribe each chunk in order
    transcripts = []
    for chunk in sorted(glob.glob(os.path.join(tmpdir.name, "chunk_*.wav"))):
        print(f"  Transcribing {os.path.basename(chunk)}…")
        text = model.transcribe([chunk])[0].text
        transcripts.append(text)

    # 3) combine & write out the full transcript
    full_text = "\n".join(transcripts)
    base = os.path.splitext(fname)[0]
    out_txt = os.path.join(OUTPUT_DIR, f"{base}.txt")
    with open(out_txt, "w", encoding="utf-8") as f:
        f.write(full_text)
    print(f"▶ Transcript saved to {out_txt}")

    # 4) move original audio into OUTPUT_DIR
    shutil.move(src, os.path.join(OUTPUT_DIR, fname))
    print(f"✔ Moved {fname} → {OUTPUT_DIR}")

    # 5) clean up temp chunks
    tmpdir.cleanup()

print("\n✅ All files processed.")

Then the bat file to tell it what to do and make it easier to interface with:
@echo off
REM ─── Navigate to script directory ─────────────────────────────────────
cd /d %~dp0
REM ─── Environment & Config ─────────────────────────────────────────────
set HF_HUB_OFFLINE=1
set TRANSFORMERS_OFFLINE=1
REM (These values are informational only; the Python script currently hardcodes its own config)
set INPUT_DIR=input
set OUTPUT_DIR=output
set MODEL_DIR=models
set CHECKPOINT=%MODEL_DIR%\Parakeet-RNNT-XXL-1.1b_merged_universal_spe8.5k_1.0.nemo
set CHUNK_SECONDS=60
REM ─── Create folders if they don’t exist ───────────────────────────────
if not exist "%INPUT_DIR%" mkdir "%INPUT_DIR%"
if not exist "%OUTPUT_DIR%" mkdir "%OUTPUT_DIR%"
REM ─── Run the Python transcription script ──────────────────────────────
echo.
echo Loading model from %CHECKPOINT% and processing all audio in %INPUT_DIR%…
echo.
call .venv\Scripts\activate
python long_batch_transcribe_chunked.py
echo.
echo DONE.
pause
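Since the chunks are a fixed 60 seconds, turning the per-chunk texts into the SRT file I mentioned is mostly timestamp arithmetic. A sketch (fmt_ts and write_srt are my own illustrative names, not part of the script above):

```python
def fmt_ts(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def write_srt(transcripts, path, chunk_seconds=60):
    """Write one SRT cue per chunk, numbered from 1."""
    with open(path, "w", encoding="utf-8") as f:
        for i, text in enumerate(transcripts):
            start = i * chunk_seconds
            end = start + chunk_seconds
            f.write(f"{i + 1}\n{fmt_ts(start)} --> {fmt_ts(end)}\n{text}\n\n")
```

In the main script, calling write_srt(transcripts, os.path.join(OUTPUT_DIR, f"{base}.srt")) alongside the .txt write would produce both outputs. Note the timestamps are only chunk-level, not word-level.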
To run the software I just have Windows Task Scheduler run the bat file every few minutes.
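If Task Scheduler is awkward on your machine, a tiny always-on polling loop gets the same effect; run_every, the bat filename, and the 300-second interval below are all my own placeholders:

```python
import subprocess
import time

def run_every(cmd, interval_seconds=300, iterations=None):
    """Run cmd repeatedly; iterations=None means run forever (like the scheduler)."""
    count = 0
    while iterations is None or count < iterations:
        subprocess.run(cmd, check=False)  # don't abort the loop on one failure
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval_seconds)
    return count
```

For example, run_every(["cmd", "/c", "run_transcribe.bat"], interval_seconds=300) would mimic the every-few-minutes schedule.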
I hope this helps people and makes the world a little better. Sorry if this tutorial is a little sloppy; please feel free to make improvements.