SDRTrunk Keep Alives on Broadcastify Calls

sdrohio

Newbie
Joined
Nov 21, 2014
Messages
4
SDRTrunk has an option for calls streaming to "send periodic keep-alive". This option is on by default with a setting of 15 minutes.

I noticed on the manage node page for my node that this option doesn't seem to update the last seen timestamp. I verified this at a quiet time of night: the page showed my node as last seen 8 minutes ago even though SDRTrunk had sent a keep-alive in the meantime.

I am guessing that this means that the keep-alive feature in SDRTrunk really doesn't do anything to let the Broadcastify system know the node is still alive even though the node hasn't sent any uploads recently.

I checked the Calls API reference and didn't see a mention of a keep-alive method in it. ( Broadcastify-Calls-API - The RadioReference Wiki )

If my assumptions are correct, maybe Broadcastify should consider doing something with keep-alives that are being sent by SDRTrunk?

Additionally maybe the Calls API could have a "heartbeat" call that gets touched when a node's streaming application is closed. That way the Broadcastify system would know to flip to a different node for the talkgroups that the exiting node was uploading. That would help "heal" the calls platform when a node stops running instead of having to wait the 15 minutes or so.

We could go further with this idea and have the heartbeat api call have 3 options, a startup for when the node is first online, a keep-alive/heartbeat to let Broadcastify know that the node is still there even though nothing has been uploaded, and a shutdown for when the node decides to stop streaming for whatever reason.

The only case for node health left is when the node's internet connection goes down. That could be solved by adding a next expected keep-alive time to the keep-alive/heartbeat call. If the node misses that keep-alive time (plus or minus 15 seconds for skew), it could then be marked as suspect. The Broadcastify system could be the one to send the next expected keep-alive time to the node software; that way Broadcastify could control the frequency of the keep-alive. It could shorten the time for a more active node, and lengthen it for a less active node or when the Broadcastify system needs to manage traffic.
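
To make the proposal above concrete, here is a minimal Python sketch of how a server might track the three heartbeat types and the "next expected keep-alive" deadline. None of this exists in the Broadcastify Calls API; `HeartbeatKind`, `NodeState`, `handle_heartbeat`, and `is_suspect` are hypothetical names, and the only value taken from the post is the 15-second skew allowance.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

# Hypothetical sketch only: the Calls API has no heartbeat call today.


class HeartbeatKind(Enum):
    STARTUP = "startup"        # node has just come online
    KEEP_ALIVE = "keep-alive"  # node is idle but still running
    SHUTDOWN = "shutdown"      # node is stopping on purpose


SKEW_SECONDS = 15.0  # allowance for clock skew suggested in the post


@dataclass
class NodeState:
    online: bool = False
    next_expected: Optional[float] = None  # absolute time in seconds


def handle_heartbeat(state: NodeState, kind: HeartbeatKind,
                     now: float, interval: float) -> Optional[float]:
    """Update server-side state and return the deadline to send back to the node.

    `interval` is chosen by the server, so it can be shortened for busy nodes
    or lengthened to manage traffic.
    """
    if kind is HeartbeatKind.SHUTDOWN:
        state.online = False
        state.next_expected = None
        return None
    state.online = True
    state.next_expected = now + interval
    return state.next_expected


def is_suspect(state: NodeState, now: float) -> bool:
    """A node is suspect once it is past its deadline plus the skew allowance."""
    return (state.online
            and state.next_expected is not None
            and now > state.next_expected + SKEW_SECONDS)
```

A clean shutdown clears the deadline, so only nodes that vanish without notice (for example, a dropped internet connection) ever become suspect.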

Thoughts @blantonl ?
 

jtwalker

Member
Premium Subscriber
Joined
Dec 3, 2012
Messages
2,175
Location
Gettysburg, PA & Fenwick Island, DE
The keep-alive feature was removed from Broadcastify according to @blantonl. It was still in SDRTrunk last I looked, but it doesn't do anything.

 

blantonl

Founder and CEO
Staff member
Super Moderator
Joined
Dec 9, 2000
Messages
11,340
Location
San Antonio, Whitefish, New Orleans
Yes, the keep alive feature was disabled from updating your node status since we had tons of nodes that would just send the keep-alives but no calls.

We already automatically reset the node ranking process after 5 minutes of no calls received for a talkgroup and let the nodes "battle it out again" when transmissions resume on a group.

We already do a "first-online" check for nodes.

It's tough to reliably implement a shutdown call, because as you pointed out, in the vast majority of cases when a node goes offline it isn't going to notify us beforehand.

We already run a process similar to what you describe to mark active nodes suspect (for instance, when they disappear). It's called the "reconsideration" feature: the current active node gets 4 opportunities (4 calls received across all nodes for a talkgroup) within 30 secs, and if it misses all of them we demote that node and promote the next best node.

Given all of the above, if a node disappears for one reason or another, it takes the system 300 seconds (5 minutes) to heal in cases where there is little activity on the talkgroup, or 30 seconds if the talkgroup is very active.
 

blantonl

Also note that we have to run this reconsideration feature on a per group basis, not a per node basis, because radios will roam between sites on wide area systems, and a group may no longer be active on a site because all radios roamed away from it. The only reliable way we can detect that case is when other nodes are capturing a group but the active node isn't anymore. Even then we've got to give the active node some leniency, or we'll get into a race condition where nodes are flipping back and forth constantly. So we mark it suspect only if other nodes captured calls on a group but the active node didn't, 4 times in 30 secs.
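
For readers who want to see the rules in one place, the following Python sketch models the per-talkgroup behavior described in the posts above: 4 missed calls within 30 seconds demote the active node, and 5 minutes without calls resets the ranking. It is an illustration, not Broadcastify's code. The class and method names are made up, the sliding-window treatment of misses is an assumption, and so is the "nodes that captured the first call after a reset rank first" rule, since the real ranking criteria are not described in this thread.

```python
from collections import deque

MISS_LIMIT = 4       # missed calls before the active node is demoted
MISS_WINDOW = 30.0   # seconds in which those misses must occur
IDLE_RESET = 300.0   # seconds of silence before the ranking is reset


class TalkgroupState:
    """Tracks the active node for ONE talkgroup (hypothetical model)."""

    def __init__(self, ranked_nodes):
        self.ranked = list(ranked_nodes)  # best first; ranked[0] is active
        self.misses = deque()             # times of calls the active node missed
        self.last_call = None

    @property
    def active(self):
        return self.ranked[0] if self.ranked else None

    def on_call(self, now, capturing_nodes):
        """Record one call on this talkgroup and return the active node."""
        if self.last_call is not None and now - self.last_call >= IDLE_RESET:
            # Talkgroup was idle: forget old misses and re-rank from scratch.
            # Assumption: nodes that captured this call go to the front.
            self.misses.clear()
            rest = [n for n in self.ranked if n not in capturing_nodes]
            self.ranked = list(capturing_nodes) + rest
        self.last_call = now

        if self.active in capturing_nodes:
            return self.active

        # Another node captured a call that the active node missed.
        self.misses.append(now)
        while self.misses and now - self.misses[0] > MISS_WINDOW:
            self.misses.popleft()
        if len(self.misses) >= MISS_LIMIT:
            # Demote the active node and promote the next best one.
            self.ranked.append(self.ranked.pop(0))
            self.misses.clear()
        return self.active
```

Because the state lives on the talkgroup rather than the node, a node can be demoted for one group (say, after its radios roam to another site) while remaining active for others.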
 

sdrohio

Cool. All good information to know. Thank you for the explanation.

I thought I read somewhere on the forums that the heal time was more like 15 minutes. That's part of what prompted me to make this post.

I guess my only other question then is whether it would be worth having SDRTrunk remove the keep-alive feature altogether, or make it disabled by default instead of enabled. It's a tiny waste of bandwidth/resources, but if it helps drive efficiency then why not consider changing it?

I have experience in development and I am considering making a couple of PRs to the SDRTrunk project. Simple stuff not complex as I don't have a huge amount of free time at the moment to take on the big stuff.
 

blantonl

The keep-alive function really serves no purpose anymore, because it appeared that nodes would send it but still be in an unstable state. I want to be able to notify node providers that their node isn't sending calls. Node providers want to know that as well. We had lots of instances where node providers had no idea that their node was in an unstable state because the keep-alive kept kicking.

The heartbeat at this point is essentially if your node is sending ANY calls whatsoever or not.
 

blantonl

sdrohio said: "I thought I read somewhere on the forums that the heal time was more like 15 minutes."
We've probably tweaked it a little bit.

Originally the reconsideration timeframe was 5 minutes. I can't remember what prompted me to change it down to 30 seconds.

If a group is very active on a system, the healing time will be 30 secs or so.

If a group isn't very active, or becomes inactive, the healing time will be 5 minutes.
 

sdrohio

That makes sense. Glad you were able to come up with a method to solve that on the system end.

blantonl said: "Keep alive function really serves no purpose anymore, because it appeared that nodes would send it but still be in an unstable state. I want to be able to notify node providers that their node isn't sending calls. Node providers want to know that as well."
This brings up another question: maybe instead of a heartbeat/keep-alive you could use a node health check to provide diagnostic information to the node application, or perhaps notify the node provider in some way, such as a push notification that could be written to the node application log or displayed on the node screen.

Diagnostic information could be something like the number of upload requests Broadcastify received from the node and/or the number of files successfully uploaded from the node. The node application could then cross-check these against a set of counters it keeps itself, and if they don't line up within a certain range it could raise an alert in the log file or on the screen in some way.
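
As a rough illustration of that cross-check, here is a small Python sketch. It assumes a health-check response that reports two counters per node; no such call exists in the Calls API today, and `UploadCounters`, `cross_check`, and the 5% default tolerance are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class UploadCounters:
    requests: int      # upload requests sent (local) or received (server)
    files_stored: int  # files that completed upload successfully


def cross_check(local: UploadCounters, remote: UploadCounters,
                tolerance: float = 0.05) -> list:
    """Compare the node's own counters with server-reported ones.

    Returns a list of warning strings, one per counter that differs by more
    than `tolerance` (a fraction of the local count, with a floor of 1).
    """
    warnings = []
    for name in ("requests", "files_stored"):
        mine = getattr(local, name)
        theirs = getattr(remote, name)
        allowed = max(1, int(mine * tolerance))
        if abs(mine - theirs) > allowed:
            warnings.append(
                f"{name}: node counted {mine}, server reported {theirs}")
    return warnings
```

A client could run this on a timer and write any returned warnings to its log, which keeps the alerting decision on the node side and the server response simple.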

To me, the node offline emails seem to be sufficient notification assuming it gets seen in the provider's inbox and doesn't wind up in spam.

I'm just thinking out loud at this point.

If none of this is of interest, I guess my thought would be to recommend setting the keep-alive in SDRTrunk to off by default.
 

blantonl

All good suggestions.

Keep in mind though, that Calls is designed to be platform agnostic in terms of sending clients - we're not just designed around SDRTrunk.

We would have to come up with a core, common standard for node health checks AND get all the sending clients, developers, broadcasters etc to participate for it to work properly. Not impossible, but a daunting task and project indeed.
 

sdrohio

Yep. I understand there are more clients than just SDRTrunk. You as the platform definitely want that to be true.

If Broadcastify released an optional API health check call that client software could use if desired, then the onus would be on the client software developer to write the code for it, if they even wanted to. Some would, some wouldn't. The node provider could then decide whether that feature would be useful to them and use the appropriate client for their needs.

Coming up with a common standard for what the API would report that is useful to all client software is probably the hardest part. Without Broadcastify releasing something, the client software developers have nothing to code for. Since there isn't an API call that returns something like the node statistics on the manage calls system page, the software developers have no information they can use to close the performance feedback loop of the node.

It's possible that the current API responses for uploads are good enough for node providers. I'm probably just over-engineering things in my own mind.

All good future thoughts, or not. :)

Thank you for the discussion and the great platform you have created.
 