It’s the same thing you get when connecting gear in a studio: both devices are clocking data out and in at exactly the same time. In effect you’re changing from a protocol like AES with an embedded clock to a protocol like I2S, where the clocks are separate.
Without it, one of the two devices acts as the clock provider (usually the streamer) and the other tries to match that provider, so there isn’t significant drift between them. It can do this by simply accepting the provided clock, or it can try to “smooth” the clock variations with something like a PLL.
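To make the “smoothing” idea a bit more concrete, here is a rough software sketch of a PLL-style loop following a jittery provider clock. It is only an illustration: the function names, the loop gains `kp` and `ki`, and the 48 kHz / 2 ns figures in the usage lines are all made up for the example, and a real receiver does this in hardware rather than in Python.

```python
import random
import statistics

def jittery_edges(n, period, jitter, seed=0):
    """Ideal edges at k*period, each displaced by uniform noise of +/- jitter."""
    rng = random.Random(seed)
    return [k * period + rng.uniform(-jitter, jitter) for k in range(n)]

def pll_smooth(edges, kp=0.1, ki=0.005):
    """Follow the incoming edges with a proportional-integral loop.

    kp and ki are illustrative loop gains, not values from any real device.
    """
    period = edges[1] - edges[0]   # first guess at the provider's period
    local = edges[0]               # time of our last regenerated edge
    out = [local]
    for t in edges[1:]:
        predicted = local + period
        error = t - predicted            # how far the incoming edge is from our guess
        period += ki * error             # slow correction: track the provider's rate
        local = predicted + kp * error   # small phase nudge toward the incoming edge
        out.append(local)
    return out

def interval_jitter(edges, skip=200):
    """Standard deviation of edge-to-edge intervals, ignoring the lock-in period."""
    settled = edges[skip:]
    return statistics.pstdev([b - a for a, b in zip(settled, settled[1:])])
```

Something like this compares the two cases:

```python
edges = jittery_edges(2000, 1 / 48000, 2e-9)   # hypothetical 48 kHz clock, +/- 2 ns jitter
print(interval_jitter(edges), interval_jitter(pll_smooth(edges)))
```

The regenerated clock still follows the provider’s long-term rate, so there is no drift, but most of the edge-to-edge variation is filtered out. Lower gains smooth more and take longer to lock.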
There is always some clock jitter in a digital system with more than one clock domain. Even with a “perfect” external clock signal, the best case (depending on how consistent the interrupt handling is) would be jitter of around 50% of one device clock period (the device’s own clock, not the audio clock) in each device.
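One way to read that 50% figure, and this is an interpretation rather than a measurement: a device can only act on an external edge at its own next clock tick, so the delay between the edge and the tick lands anywhere between zero and one device clock period, which is roughly plus or minus half a period around the average. The sketch below assumes edges arrive at a random phase relative to the device clock, ignores any interrupt latency variation on top, and uses a made-up 100 MHz device clock (10,000 ps period). Times are integer picoseconds to keep the arithmetic exact.

```python
import random

def latch_delay_ps(edge_ps, device_period_ps):
    """Delay from an incoming edge to the next device clock tick, in picoseconds."""
    return (-edge_ps) % device_period_ps

def delay_spread_ps(device_period_ps, n_edges=100_000, seed=0):
    """Mean latch delay and worst deviation from that mean over random edge times."""
    rng = random.Random(seed)
    delays = [
        latch_delay_ps(rng.randrange(10**12), device_period_ps)
        for _ in range(n_edges)
    ]
    mean = sum(delays) / n_edges
    worst = max(abs(d - mean) for d in delays)
    return mean, worst

# Hypothetical 100 MHz device clock: one period is 10,000 ps.
mean, worst = delay_spread_ps(10_000)
print(f"mean delay {mean:.0f} ps, worst deviation {worst:.0f} ps")
```

With these assumptions the worst deviation comes out close to half the device clock period, whatever the audio sample rate is.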
Then you get into the whole “so a higher frequency is better” argument, but higher-frequency devices will generally introduce more noise, so…