Magnum Opus

A new draft arrived this month for the Opus specification, also known as the Internet Audio Codec, whose home page you can find here. This creditable effort will deliver a codec capable of both narrowband and wideband performance, and it will scale to carry not just speech but even music across the internet. Some commonly used codecs (like AMR, used in 3G, and MP3, used, well, everywhere) are encumbered with patents, so it’s not always possible to build what you want, especially if your business model cannot support high per-station licensing fees at the start.

It’s quite cool the way the scalability works, too. Most codecs today are not pure waveform coders (like the storage of bits on a CD), but nor do they entirely synthesize speech. They are somewhere in the middle, and this is helpfully known as “hybrid”. Using a mathematical model of the human speech system, a hybrid codec can make a very compact description of the sounds, and send this instead of a slavish description of the audio waveform. This part of the Opus algorithm is provided by the Skype SILK algorithm (something we already know well at Voxygen), and this provides wideband performance for speech, up to 8kHz of bandwidth, double that of a “normal” telephone call. In some literature, “wideband” is reserved for a higher range, around 16kHz, but actually 8kHz is pretty good compared to what telephone users are used to.
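To make the "mathematical model" idea a little more concrete: speech codecs of this kind typically lean on linear prediction, where each sample is estimated from the few samples before it, so only the predictor settings and a small leftover error need to be described. The Python sketch below is a toy illustration of that idea and not SILK's actual code; the function names (`lpc`, `residual`) and the test tone are made up for the example.

```python
import math

def autocorr(x, lag):
    """Sum of x[i] * x[i - lag] over the samples available in x."""
    return sum(x[i] * x[i - lag] for i in range(lag, len(x)))

def lpc(x, order):
    """Levinson-Durbin recursion.

    Returns coefficients a such that x[n] is approximated by
    a[0]*x[n-1] + a[1]*x[n-2] + ... (hypothetical helper, for illustration).
    """
    r = [autocorr(x, k) for k in range(order + 1)]
    a = [0.0] * (order + 1)          # a[0] is unused inside the recursion
    err = r[0]                       # prediction error energy so far
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                # reflection coefficient for this stage
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:]

def residual(x, a):
    """What is left after subtracting the prediction from each sample."""
    return [
        x[n] - sum(a[j] * x[n - 1 - j] for j in range(len(a)) if n - 1 - j >= 0)
        for n in range(len(x))
    ]

# A pure tone stands in for a (very) simple voiced sound.
omega = 2 * math.pi * 0.05
tone = [math.sin(omega * n) for n in range(400)]
coeffs = lpc(tone, 2)
leftover = residual(tone, coeffs)
print("predictor:", coeffs)
print("energy kept in residual:",
      sum(e * e for e in leftover) / sum(s * s for s in tone))
```

For this tone, two predictor coefficients account for almost all of the signal, which is the sense in which a model-based description can be far more compact than the waveform itself. Real speech needs a higher order and a model of the excitation as well.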

When even further quality is needed, Opus engages a parallel channel using something called the MDCT (Modified Discrete Cosine Transform). The MDCT is used in encoding MP3 music, and its close relative, the DCT, is used in encoding JPEG images. It’s a computationally expensive process, but it has the useful property that it makes it easy to keep the most important perceptual information and discard the rest. When you look at a JPEG image, it is not a faithful representation of the original bits (say from the CCD in the camera), but what has been discarded is information you would notice the least in any case. The same is true for MP3: relative to the music on a CD, MP3 is much more compact (somewhere between 1/3 and 1/10 the size), and this compression comes in part from (perceptually unimportant) information that has been discarded. Running the MDCT algorithm in realtime was not always a possibility; that it is now is testament to the plentiful and cheap CPU power available in our mobile phones, tablets and computers.
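For the curious, here is a rough Python sketch of the transform-and-discard idea. It is a toy and not the Opus or MP3 algorithm: frames overlap by half, each goes through a sine-windowed MDCT, and all but the largest few coefficients are thrown away. Keeping the largest coefficients is a crude stand-in (my assumption) for a real perceptual model, and the names `keep_largest` and `codec_roundtrip` are hypothetical.

```python
import math

def sine_window(length):
    """Sine window; satisfies the condition needed for overlap-add to cancel aliasing."""
    return [math.sin(math.pi * (t + 0.5) / length) for t in range(length)]

def mdct(block):
    """Forward MDCT: 2N input samples -> N coefficients."""
    n = len(block) // 2
    return [
        sum(x * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
            for t, x in enumerate(block))
        for k in range(n)
    ]

def imdct(coeffs):
    """Inverse MDCT: N coefficients -> 2N samples (still aliased until overlap-added)."""
    n = len(coeffs)
    return [
        (2.0 / n) * sum(c * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
                        for k, c in enumerate(coeffs))
        for t in range(2 * n)
    ]

def keep_largest(coeffs, keep):
    """Zero everything except the `keep` largest-magnitude coefficients."""
    order = sorted(range(len(coeffs)), key=lambda k: abs(coeffs[k]), reverse=True)
    kept = set(order[:keep])
    return [c if k in kept else 0.0 for k, c in enumerate(coeffs)]

def codec_roundtrip(signal, n, keep):
    """Encode and decode `signal` (length a multiple of n), keeping `keep` of n coefficients per frame."""
    win = sine_window(2 * n)
    padded = [0.0] * n + list(signal) + [0.0] * n
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * n + 1, n):
        frame = [padded[start + t] * win[t] for t in range(2 * n)]
        coeffs = keep_largest(mdct(frame), keep)
        for t, y in enumerate(imdct(coeffs)):
            out[start + t] += y * win[t]   # overlap-add with the neighbouring frame
    return out[n:n + len(signal)]

tone = [math.sin(2 * math.pi * 3 * t / 64) for t in range(64)]
exact = codec_roundtrip(tone, 16, 16)   # keep every coefficient
lossy = codec_roundtrip(tone, 16, 4)    # keep only a quarter of them
print("max error, all coefficients:", max(abs(a - b) for a, b in zip(tone, exact)))
print("max error, a quarter kept:  ", max(abs(a - b) for a, b in zip(tone, lossy)))
```

With every coefficient kept, the overlapping frames reconstruct the input to within rounding error; dropping coefficients trades accuracy for size. A real encoder decides what to drop using a model of human hearing rather than raw magnitude.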

So the MDCT encodes high frequencies (above 8kHz), and the hybrid encoder handles the lower 8kHz. The decoder combines the two streams to give something like the original audio experience at the receiving end. Opus is pretty interesting in embracing everything from low quality voice all the way to maximal music quality, depending on application needs, network capacity and available CPU. Wideband audio research was in a pretty dead period until the last few years, but the convergence of factors like wireless, open source and the internet has brought a lot of brilliant new ideas forward.
