Digital Audio - bits are bits, what could possibly go wrong?

This week Darko posted a very interesting podcast on the topic of digital audio:

Nothing in the podcast is a new revelation, but it pulls together many important concepts and ideas all in one place. It’s not overly technical, but you do have to stop and think about the content to digest it.

Here’s my too-long TL;DR of the podcast (with a little commentary on my part):

Digital Audio Waveform Clock / Timing Generation:

Digital audio file transfer only guarantees that the amplitude data (the 16-bit or 24-bit sample words) arrives exactly right: no bits are lost or missing. The original timing information between audio samples is completely lost in the storage and transfer of the digital file. It must be regenerated to play back the original audio waveform.

Once the necessary clock has been generated and synchronized to the data, this digital bitstream (containing both clock and data) behaves similarly to an analog waveform, containing both amplitude (bits) and frequency (clock) information. And just as an analog audio waveform is distorted by any modulation of its amplitude or frequency, so a digital audio waveform is also distorted by modulation of its amplitude or frequency.

It is unlikely for noise in the synchronized data stream to be so severe that it corrupts the data bits. If this happens the auditory effect is painfully obvious. But frequency information is very susceptible to modulation due to noise, resulting in timing shifts in each bit. The amplitude is absolutely correct, but the frequency may be wrong.
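To put rough numbers on how sensitive the timing is, here’s a back-of-the-envelope sketch (hypothetical Python using the standard slew-rate approximation; the function name and numbers are my own, not from the podcast):

```python
import math

def max_jitter_error(freq_hz, jitter_s, full_scale=1.0):
    """Worst-case amplitude error caused by sampling-clock jitter.

    A full-scale sine A*sin(2*pi*f*t) has a maximum slope of
    2*pi*f*A, so sampling dt seconds early or late picks an
    amplitude wrong by up to 2*pi*f*A*dt (first-order estimate).
    """
    return 2 * math.pi * freq_hz * full_scale * jitter_s

half_lsb_16bit = (1 / 2**16) / 2   # smallest error 16-bit audio can resolve

# 1 ns of clock jitter on a full-scale 10 kHz tone:
err = max_jitter_error(10_000, 1e-9)
print(err > half_lsb_16bit)  # True: 1 ns already exceeds half a 16-bit LSB
```

The point isn’t the exact figures, just that nanosecond-scale timing errors are already on the same order as the amplitude resolution of CD audio.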

SPDIF, TOSLINK, AES, and I2S are all synchronous transfer methods - meaning the clock travels with the data (embedded in the bitstream for SPDIF/TOSLINK/AES, on dedicated lines for I2S). As such, any distortion of the digital waveform has the potential to impact sonic quality.

Transferring digital audio over Ethernet or USB is an asynchronous activity - meaning no clock is present, only data bits. This allows for error correction, re-ordering of packets, and requesting retransmission of missed packets. Once an asynchronous transfer is completed (or buffered sufficiently), the data is converted to a synchronous stream with a locally generated clock and passed on to the digital-to-analog converter block (typically via internal I2S) for conversion to an analog waveform.
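A toy model of that buffer-then-reclock step (hypothetical Python; the function and the packet/arrival numbers are made up purely for illustration):

```python
from collections import deque

def reclock(packets, arrival_times, start_time, period):
    """Toy model of async-to-sync conversion: samples arrive in bursts
    at irregular times, but are released one at a time on a fixed local
    clock. Returns (sample, release_time) pairs, or raises on underrun."""
    fifo = deque()
    out = []
    t = start_time  # first tick of the locally generated clock
    events = sorted(zip(arrival_times, packets))
    i = 0
    total = sum(len(p) for p in packets)
    while len(out) < total:
        # absorb every packet that has arrived by this clock tick
        while i < len(events) and events[i][0] <= t:
            fifo.extend(events[i][1])
            i += 1
        if not fifo:
            raise RuntimeError("buffer underrun at t=%.3f" % t)
        out.append((fifo.popleft(), t))
        t += period  # output spacing is set by the local clock alone
    return out

# Packets arrive with network-ish irregularity...
pkts = [[1, 2, 3], [4, 5], [6, 7, 8]]
arrivals = [0.0, 2.7, 5.1]
# ...but samples leave on a rigid 1.0-unit clock after a small buffer delay.
stream = reclock(pkts, arrivals, start_time=1.0, period=1.0)
print([t for _, t in stream])  # evenly spaced, regardless of arrival jitter
```

The arrival jitter never reaches the output; all that matters downstream is the quality of the local clock and that the buffer never runs dry.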

Having a high quality (ie low jitter, low drift) clock at the digital-to-analog converter is fundamental to achieving a high quality analog output. Creation of this high quality clock is done in stages, starting with assembly of the data from various asynchronous transfers (streaming over the internet, streaming from a media server, transfer via USB), going through various synchronization steps (generation of the reference clock, alignment of the clock to the data, filtering / buffering of the synchronized data), and ending with the synchronized bit stream entering the D2A. To the extent that this clock matches the original clock from when the data was recorded, the result can be accurate.

Power and Ground Noise
The second half of this story is power and ground noise. Digital data transmission and processing can bring with them a lot of analog power and ground noise. Digital systems tend to be very noise-immune, and designs take advantage of this by using inexpensive power supplies and board/component designs. But D2A converters are VERY noise sensitive. Any noise accumulated through the digital data transmission process, or the clock/data synchronization process, can adversely impact the analog output of the D2A conversion. This is where high quality power supplies, galvanic isolation, quality cables, and other “low noise” devices and best practices come into play.

For example, using an LPS on a network streamer won’t reduce the likelihood of a bit being lost or improve the quality of the clock being generated. But it will reduce the amount of analog noise riding along with the digital bit stream. How much influence this analog noise has on the end result depends a lot on the internal DAC architecture and layout. Maybe the amount of noise is inconsequential compared to other noise already present. Or maybe the DAC internally filters this noise very well, and it doesn’t matter. Whatever the case, less noise in is always a good thing.

Another example is using a network server such as a Roon Core to stream Qobuz. The server buffers the asynchronously transmitted data packets and sequences them, so what is sent over your home network to the Roon endpoint is better ordered (fewer retransmit requests, fewer error corrections, etc.). The digital activity in the endpoint is reduced compared to streaming directly from Qobuz to the endpoint, thereby reducing some of the digital supply/ground noise generated. When the endpoint is the DAC itself (i.e. a streaming DAC), this additional digital activity (streaming directly over the internet instead of using a quality media streamer on the local network) can add up to adversely influence the sonic quality.

Some of these effects are certainly “mouse nuts”. But when getting to the absolute top end of resolving systems, even “mouse nuts” can become significant.

11 Likes

Thanks for this summary. I find all the technical things fascinating. However,
my particular auditory system and gear are at the level of moose and squirrel.
:wink:

4 Likes

:pinched_fingers:

Never underestimate how good “moose nuts” can sound! :rofl:

I’m not far away from that myself. I just take the philosophy of doing the best I can with what I have, following whatever best practices I can afford. Then, as my gear improves, I bring the entire digital chain up with it.

3 Likes

Kinda wrong, kinda depends.

Via Ethernet, all sorts of transfers are possible. That begins at TCP or UDP and then works its way up the entire software stack.
One aspect Ethernet is good at is timing (and isolation); both have pretty tight specs, even for consumer goods :wink:

With USB, it gets similarly complicated. USB can work in a mode where correct transmission is guaranteed (bulk transfers), for example when moving files. In cases where a broken/missing packet is non-critical but timing is (mouse movement, an audio stream), USB can also work in modes that favor timing over guaranteed delivery (interrupt and isochronous transfers).


The general problem with “digital” is that it is analogue; the devices only care whether certain voltage levels are hit.
For those interested:

5 Likes

This all makes sense to me, and it’s a good explanation of why/how a good quality DDC or streamer can improve things.

3 Likes

Really interesting topic and great summary, thanks for writing this up. I’ve been thinking about this lately, since one of the benefits of Linux for me is that (usually) I can use a sound card on another device as if it were a local one, easily and in several ways. The method I was using broke recently when I changed the sound stack to gain other advantages, and I had to decide how to replace it, i.e. which method of moving data from computer to DAC to use instead. The details of the different protocols can, as @MazeFrame says, get incredibly involved when you have access to more choices than Windows lays bare.

Audio control and fidelity were the main reasons I switched to Linux some years ago, and I still learn new things every time I tweak the sound stack. I always find it fascinating that so many people get hung up on signal fidelity but never consider how the signal is assembled in the first place, or just how much digital processing is involved in interpreting that signal, as if it were all standardised.

2 Likes

Yes, certainly a complex subject. The important point is that there is no clock associated with the audio data embedded in the digital transmission at this point. Packets can go out of order, go through error correction, etc. If there’s an error, the receiving device can either correct it or request a retransmission. It’s certainly not synchronous in terms of the audio data contained within the packets.

but yes, your point is very well taken.

3 Likes

This is actually something I’ve never thought about as a user of Roon, but it makes a great point about the workload of direct-streaming DACs, where the processing of the streamed digital file happens at the DAC/streamer itself.

2 Likes

This point is fantastically important. An analog(ue) waveform can be completely defined in terms of amplitude and frequency. Our digital audio waveforms are also fully defined by amplitude (16-bit, 24-bit words) and frequency (sample rate). This means that digital audio waveforms are susceptible to the same distortions, just via different physical/electrical mechanisms.

The thing that became clear to me from listening to this podcast is that part of why we go round-and-round with people on noise in digital audio is that we mix up our understanding of digital files and packets with the digital audio waveforms contained within them. They’re really two different things, just described with common words (making it more confusing). The files and packets are not digital audio; they are just digital data to be manipulated and moved around like any other digital data. But once we get the digital data to the place where we want to play the digital audio waveform contained within it, the waveform is extracted and reconstructed locally, becoming very analog(ue)-like in nature at that point.

Maybe a useful way to think about this is Ethernet and USB are truly digital signals. Amplifier outputs are truly analog(ue) signals. But the intermediate states of SPDIF, TOSLINK, AES/EBU, I2S are mixed-signal. Digital amplitude information with analog(ue) frequency information. The amplitudes retain a good degree of digital robustness, but the frequency is just as susceptible to distortion as any other analog(ue) waveform.

2 Likes

If it were a parallel transmission, maybe. Since most modern communication is serial, this is mostly incorrect.

“Contained” is maybe the wrong way to say this. It is more like presentation. A table full of ingredients is a cake, just not presented as one.

Example of Pulse Density Modulation:

In oversimplified terms, the more often the switch is on, the more the signal climbs towards high.
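A minimal sketch of that idea (hypothetical Python: a first-order delta-sigma loop, not any real DSD encoder):

```python
def pdm_encode(samples):
    """First-order delta-sigma modulator: turn samples in [-1, 1] into a
    1-bit stream whose pulse density tracks the input level."""
    bits = []
    integrator = 0.0
    for x in samples:
        bit = 1 if integrator >= 0 else 0
        feedback = 1.0 if bit else -1.0
        # accumulate the error between the input and the 1-bit output
        integrator += x - feedback
        bits.append(bit)
    return bits

# A constant +0.5 input: the switch is "on" about 75% of the time,
# so the running average of the bitstream sits at the input level
# (density d maps to level 2*d - 1).
bits = pdm_encode([0.5] * 1000)
density = sum(bits) / len(bits)
print(round(density, 2))  # 0.75
```

The low-pass filter at the DAC output is what turns that pulse density back into a smooth voltage.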

1 Like

The way it’s described here, isn’t PDM just another way of saying DSD?

This is exactly my point. This is the case for SPDIF and I2S. Clock and data are transmitted concurrently (SPDIF has the clock embedded in the data stream; I2S uses dedicated bit clock and word clock lines).

Maybe I didn’t say it very well here (or I’m just wrong… it does happen quite regularly). But Ethernet packets contain many types of digital information. Once the payload is extracted, it is used according to its type. A document sent to a printer uses the same network packets as an audio file sent to a DAC. At the printer, the digital document information is extracted, formatted, and sent to the print head to print each line of the document. For audio, the digital waveform is extracted, formatted, and sent to the DAC for conversion to analog. There is a big difference between the containers perfectly moving digital information around and the digital content within those containers.

DSD uses PDM to encode the audio, yes.

I’m almost exclusively thinking about this from a PCM standpoint, not DSD. That’s an entirely different game because of how sigma-delta modulation works and the reconstruction filter used at the output.

PCM vs PDM only matters in the last step.

It is like counting to 16 before setting the level vs bumping the level up on every pulse.

I think I have an animation somewhere.

1 Like

That’s not entirely true either, because a single bit error in PDM is likely inaudible, but a single bit error in PCM most definitely will be audible (unless you’re lucky).

There’s really no point in getting pedantic about this.
Or debating the meaning of asynchronous vs synchronous (because they mean different things at different levels of the stack).
The article is very good and covers pretty much everything of any relevance. Yes, I can get pedantic about how some of it is presented, but it’s not really “wrong”.

The one thing that does interest me is the whole question of noise introduced by the work done in the decoding process. It’s annoyed me since I first discovered that almost all streamers are at least Raspberry Pi levels of complexity, often running Linux. It’s such gross overkill for what they actually do. And I wonder how much it has to do with CD transports being considered superior for so long, since they have MUCH simpler microcontrollers.

2 Likes

At these sample rates, I would argue single bit errors are never audible. In PCM, you need at least 2 samples to describe a valid waveform.

Trust me, having spent years listening to looped sine waves while writing sound drivers, I can tell you that if you flip a random bit in a PCM signal, you won’t miss it. It will be a very noticeable, loud click. The difference between -8388608 and 0 is one bit of a 24-bit signal.
For most DACs, though, it’s my understanding there is usually enough information to identify the error, and they mute the output for at least the entire packet, which would be even more noticeable.
In contrast, for PDM no bit is more significant than any other, and all you’re doing is impacting what amounts to a running weighted average. Thinking about it, timing errors on the output are probably much less audible as well; you don’t really have jitter in the conventional sense, unless part of your pipeline is converting to PCM.
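To see the magnitudes involved, here’s a quick sketch (hypothetical Python; 24-bit two’s-complement samples assumed):

```python
import math

def flip_bit_s24(sample, bit):
    """Flip one bit of a 24-bit two's-complement PCM sample."""
    raw = sample & 0xFFFFFF        # view the sample as a raw 24-bit word
    raw ^= 1 << bit                # flip the chosen bit
    # re-interpret the word as signed 24-bit
    return raw - 0x1000000 if raw & 0x800000 else raw

FULL_SCALE = 2**23  # 8388608

# Flipping the top (sign) bit of digital silence gives a full-scale jump:
print(flip_bit_s24(0, 23))  # -8388608

# Flipping the LSB instead is ~138 dB below full scale:
print(20 * math.log10(abs(flip_bit_s24(0, 0)) / FULL_SCALE))  # ≈ -138.5 dBFS
```

Which bit flips matters enormously in PCM: the same single-bit error ranges from a full-scale spike to something far below the noise floor.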

But again we’re being pedantic for no value.

4 Likes

It cannot be a loud click. There is a low-pass filter on the output (the Nyquist theorem requires limited bandwidth for the math to work out), so assuming a high-significance bit is flipped, it will result in this:

Source

Additionally, a single sample at a 44.1 kHz sample rate is 1/44,100 of a second (about 23 µs) long. Human nerves have (depending on type) roughly 1 ms rise times; the pulse length (ignoring ringing) is a small fraction of that.

If the LSB were to flip in a 24-bit DAC running off a 5 V reference voltage, you are looking at roughly a 0.3 µV deviation, which is probably below the noise floor of anything after the DAC (and probably the DAC as well).
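For reference, the LSB step size of an ideal 24-bit DAC off a 5 V reference works out as (hypothetical Python, just the arithmetic):

```python
vref_volts = 5.0                   # assumed DAC reference voltage
bits = 24
lsb_volts = vref_volts / 2**bits   # one LSB step of an ideal 24-bit DAC

print(round(lsb_volts * 1e6, 3))   # 0.298 (microvolts per step)
```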

1 Like

I’m done with the argument: you can hear it, it’s a click. You can definitely hear even a sudden mute as a click.
As I said, I’m not arguing theory here. I’ve heard the issue when actually working on low-level sound drivers at 44.1 kHz, 48 kHz, and probably higher sample rates. A single bit error is quite obviously audible in a PCM signal.

1 Like