The encoding methods you refer to generally don't need a discriminator tap to decode.
I'm referring to straight FSK, where typically two or four RF frequencies (+/- shifts from the center frequency) are used to directly represent two or four logic states / symbols, e.g. 0 or 1, or 00, 01, 10 or 11.
While the RF transmitter is sending a particular symbol, the receiver's discriminator point will (for the most part) maintain a matching voltage level. As shown in the attached image, when the transmitter sends a string of zero bits, the tap point is held low for an extended period. A 0.01 uF cap is not going to pass that signal: it quickly starts charging, and the load (your soundcard) sees a signal that, instead of staying flat (as shown), ramps towards the zero / center line as the cap's voltage approaches the discriminator's voltage.
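To put rough numbers on that ramp, here is a minimal sketch that treats the series cap and the soundcard input as a simple first-order RC high-pass. Only the 0.01 uF value comes from this thread. The 10 kOhm load resistance and the 1200 baud symbol rate are assumptions picked for illustration, so plug in your own figures.

```python
import math

def remaining_fraction(c_farads, r_load_ohms, hold_seconds):
    """Fraction of a held DC level still visible at the load after
    hold_seconds, for a series capacitor feeding a resistive load
    (first-order high-pass step response: exp(-t / RC))."""
    tau = r_load_ohms * c_farads
    return math.exp(-hold_seconds / tau)

C = 0.01e-6        # 0.01 uF series cap (value from the post)
R_LOAD = 10_000.0  # ASSUMED soundcard input resistance, ohms
BAUD = 1200.0      # ASSUMED symbol rate, symbols per second

tau = R_LOAD * C                         # RC time constant, seconds
corner_hz = 1.0 / (2.0 * math.pi * tau)  # -3 dB corner of the high-pass

print(f"time constant: {tau * 1e6:.0f} us, corner: {corner_hz:.0f} Hz")
for n_bits in (1, 2, 4, 8):
    hold = n_bits / BAUD  # how long the tap point sits at one level
    frac = remaining_fraction(C, R_LOAD, hold)
    print(f"{n_bits} identical bit(s): {frac * 100:.3f}% of the level left")
```

With those assumed values the time constant is 100 us and the corner sits around 1.6 kHz, so the level has all but vanished before a single 1200 baud bit period is over. A higher load resistance stretches the time constant, but a long run of identical bits still droops.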
That will cause a decoding application to see the zero crossing too early, which leads to errors in tracking bit edge timing, which leads to flipped (incorrect) bits. It doesn't take too many of them to defeat the FEC used in some of the common protocols. Some don't even have FEC, just error detection, so a single flipped bit means a lost data block.
That image shows what my sound card sees. Its hardware, like that of the various sound cards / laptop internal audio devices I've used over the last ten years, has never had a problem dealing with the low frequency components found in FSK signals. That's not to say that there aren't a few stinkers out there, but they seem to be rare.