Audio quality

Sample rate:
There are many algorithms for performing sample rate conversion (a.k.a. resampling) of an audio recording. These vary widely in both complexity and audio quality. The simplest (e.g. the 'polynomial interpolation algorithms' commonly used in synthesizers) are very CPU-efficient but do not perform any filtering of the sound before downsampling (converting from a higher to a lower sample rate). Any frequencies from the original recording that lie above the frequency limit of the downsampled recording (half the new sample rate) will be folded back below that limit as very undesirable 'aliasing noise'. It is therefore very important to filter out those high frequencies before downsampling. The ideal filter would be a 'brick-wall filter' that cuts off everything above the new frequency limit and leaves everything below it intact. Unfortunately such a filter cannot be realized in practice - every real world filter will have a 'slope' at the cut-off frequency. So while you don't want to leave any information above the new frequency limit, you don't want to remove any more than necessary of the 'useful' frequency information below the limit either. This is one of the two challenges of sample rate conversion! The other is minimizing distortion. For both problems, in general you can say that the more CPU power you throw at the problem, the better audio quality you can get! Let's have a look at a real world example:

Here white noise (noise with a flat spectrum) generated at 48000 Hz has first been downsampled to 32000 Hz (which means a frequency limit of 16000 Hz) and then upsampled to 48000 Hz again using various algorithms. All calculations have been done using 32-bit floating point in order to avoid quantization issues.
- The blue line is the original spectrum with no resampling.
- The yellow line shows a typical implementation with a nicely designed trade-off between computation time and audio quality for resampling 16-bit audio data. This algorithm can be run in real-time. The yellow 'hills' above 16000 Hz are aliasing noise, but note that they are all below -96dB and so would be entirely cut away if the sound were to be converted to 16-bit PCM format - which is what the implementation was intended for. Also note the attenuation of high frequencies above 14000 Hz. This is actually a fairly well designed filter and many audio applications do much, much worse than this - but no names mentioned!
- The red line shows the high-quality resampling algorithm implementation of a very well known audio editor. Not much to comment on really, a nice steep cut-off and no aliasing noise.
- The green line, finally, shows the results with Any Time. The filter cut-off is extremely steep - thus preserving as much as possible of the original sound - and there is no aliasing noise visible. Of course it also takes much longer to compute than the other algorithms. It should be noted that the cut-off is in reality much steeper than what is shown here - more than 80% of the slope for the green line is simply due to a side-effect of the analysis algorithm used to generate the graph (spectral spreading due to finite Fourier-analysis window length). This is also the explanation as to why both the green and red lines seem to extend above 16000 Hz - in reality they do not.
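
The importance of filtering before downsampling can be illustrated with a minimal Python sketch. It is not the algorithm used by Any Time or by any of the products in the graph; the 101-tap Hann-windowed sinc filter, the 7000 Hz cut-off, the test tone and all function names are assumptions chosen purely for illustration. A 10000 Hz tone sampled at 48000 Hz is reduced to 16000 Hz (frequency limit 8000 Hz), once by simply dropping samples and once after low-pass filtering, and the level of the resulting 6000 Hz alias is measured.

```python
import math

def windowed_sinc_lowpass(cutoff_hz, sample_rate, num_taps=101):
    """Hann-windowed sinc low-pass FIR filter, normalized to unity gain at 0 Hz."""
    fc = cutoff_hz / sample_rate              # cut-off in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.5 - 0.5 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]

def fir_filter(signal, taps):
    """Direct-form convolution; samples before the start are treated as zero."""
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k in range(min(len(taps), i + 1)):
            acc += taps[k] * signal[i - k]
        out.append(acc)
    return out

def tone_amplitude(signal, freq_hz, sample_rate):
    """Amplitude of a single frequency component (one DFT bin by correlation)."""
    w = 2 * math.pi * freq_hz / sample_rate
    re = sum(s * math.cos(w * i) for i, s in enumerate(signal))
    im = sum(s * math.sin(w * i) for i, s in enumerate(signal))
    return 2 * math.hypot(re, im) / len(signal)

SRC_RATE, DST_RATE = 48000, 16000
FACTOR = SRC_RATE // DST_RATE                 # keep every 3rd sample

# 10000 Hz lies above the 8000 Hz limit of the target rate.
tone = [math.sin(2 * math.pi * 10000 * i / SRC_RATE) for i in range(4800)]
taps = windowed_sinc_lowpass(7000, SRC_RATE)

# No filtering at all: just drop two out of every three samples.
naive = tone[::FACTOR]
# Filter first, skip the filter's start-up transient, then drop samples.
filtered = fir_filter(tone, taps)[102::FACTOR]

alias_naive = tone_amplitude(naive, 6000, DST_RATE)
alias_filtered = tone_amplitude(filtered, 6000, DST_RATE)
print(f"alias without filter: {alias_naive:.3f}, with filter: {alias_filtered:.4f}")
```

Without the filter the tone reappears at full level at 16000 - 10000 = 6000 Hz; with the filter it is strongly attenuated. A longer filter gives a steeper slope at the price of more computation, which is the trade-off described above.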
High frequency range extrapolation:
In many applications a low sample rate is used out of necessity, to reduce file size or transmission bandwidth requirements (e.g. normal phone switches use 8000 Hz). The idea of trying to 'recreate' lost higher frequency components is not new, but it is not an easy problem. Of course, information theory says that this is impossible in the 'generic' case where nothing is assumed about the properties of the signal. But it is certainly possible to use different heuristic and statistical methods to try to come up with an algorithm for high frequency extrapolation (a.k.a. synthetic bandwidth extension) that subjectively improves the sound even though it may not provide a perfect 'recreation' of the original! A well-known example is the 'Spectral Band Replication' algorithm (SBR) used in 'mp3pro'. There does not seem to be much released information about how the SBR algorithm actually works, but from the public information that we have been able to obtain, we guess that it works by matching the low-frequency spectrum against a database of spectra collected from many types of material. The decoder then selects and adds the high-frequency components from one entry in the database. The mp3pro file (compared to a normal mp3 file) would then add 'hints' that do not take up much space, but which help the decoder make a better selection from the database. One potential problem with this approach is that it may not ensure a harmonic relationship between the added high frequency components and the 'real' lower frequency spectrum (musical tones tend to have harmonic overtones - that's what makes them sound good to our ears!). The proprietary algorithm that we have developed for Any Time uses an entirely different approach and does not have anything in common with the SBR algorithm.
Very broadly speaking, it analyzes the existing frequency spectrum and tries to identify the 'fundamental frequencies' of the sound sources (much like our instrument tuner software); it then adds harmonic series of overtones to them. The real tricky issues are how to identify these frequencies, how to determine the proper amplitudes for the overtones, how to handle non-harmonic content (e.g. a drum crash), and how to do this in a manner that has good stability over time.
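
To make the idea of 'adding a harmonic series of overtones' concrete, here is a toy Python sketch. It is emphatically not the proprietary Any Time algorithm: it assumes the fundamental frequency is already known, and the amplitude rule (a constant ratio between neighbouring harmonics) as well as the function name extend_harmonics are hypothetical choices made only for illustration. None of the tricky issues listed above are addressed.

```python
def extend_harmonics(f0, known_amps, new_limit_hz):
    """Extrapolate overtones of f0 beyond the known harmonics.

    known_amps[k] is the measured amplitude of harmonic (k + 1) * f0;
    at least two known harmonics are required.
    Returns (frequency, amplitude) pairs for the added overtones below new_limit_hz.
    """
    # Hypothetical amplitude rule: assume a constant ratio between neighbouring
    # harmonics, estimated from the first and the last known harmonic.
    ratio = (known_amps[-1] / known_amps[0]) ** (1 / (len(known_amps) - 1))
    added = []
    amp = known_amps[-1]
    k = len(known_amps) + 1                   # number of the first missing harmonic
    while k * f0 < new_limit_hz:
        amp *= ratio
        added.append((k * f0, amp))
        k += 1
    return added

# Harmonics of a 900 Hz tone that survive below a 4000 Hz frequency limit:
# 900, 1800, 2700 and 3600 Hz. Overtones are added up to a new 8000 Hz limit.
overtones = extend_harmonics(900, [0.8, 0.4, 0.2, 0.1], new_limit_hz=8000)
for freq, amp in overtones:
    print(f"{freq} Hz: {amp:.4f}")
```

Because every added component is an exact multiple of the fundamental, the harmonic relationship with the 'real' lower spectrum is kept by construction. How well the amplitudes match the original depends entirely on the amplitude rule, which is where this sketch is at its crudest.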

The graph shows a 44100 Hz synth pad that has first been downsampled to 8000 Hz (i.e. only frequencies below 4000 Hz have been retained from the original) and then 'restored' to 44100 Hz using Any Time's high frequency range extrapolation option.
- The blue line is the original spectrum - from this only the frequencies below 4000 Hz were passed on to Any Time.
- The green line is the reconstruction by Any Time. The results are quite satisfactory, especially considering that the algorithm only had 18% of the original frequency content to work with! Deviations of course increase with the extension 'range' - it is apparent from the graph that the reconstruction in this case is closest to the original in the first octave above the 4000 Hz frequency limit (i.e. the 4000-8000 Hz range), while in the second octave (8000-16000 Hz) the peaks are still fairly well aligned but the amplitudes no longer match as well. Interestingly, the amplitudes are good in the third and final octave even though the original spectrum shape there has very little similarity to the spectrum below 4000 Hz! Anyhow, the important thing is: how does it sound? Rather than put up sample clips here, we are confident enough to invite you to download and try the software yourself for up to 30 days!
Pitch scaling and formant correction:
Pitch scaling moves the 'pitch' of a recording up or down without changing the sample rate or the recording length. This is done by breaking down the sound into separate frequency components, scaling them by the desired factor, then 're-synthesizing' the sound. Simply changing the playback speed has the same effect on the pitch - but unlike pitch scaling, it also changes the playback time by the same factor as the pitch. One side-effect that both have in common is the so-called "Mickey Mouse" effect - i.e. speech (and music too) sounds 'tinny' when pitched up, or 'boomy' when pitched down. This happens because most instruments - the human voice included - work by in one way or another exciting a 'resonance body', and this resonance body has a set of 'resonance frequencies' near which sound is better amplified. These are seen as definite 'hills' - also called 'formants' - if you look at a frequency-amplitude graph. When playing notes of different pitches on an instrument (or whatever is producing the sound), the resonance frequencies remain fixed. But when pitch-scaling, you also change the resonance frequencies by the same factor as you scale all the frequency components - this is why the 'character' of the instrument changes (e.g. from a normal voice character to a Mickey Mouse character voice). The solution is to apply what is commonly called 'formant correction'. This entails somehow analyzing the 'frequency envelope' of the sound (the overall 'shape' of the frequency graph - a good example BTW of something that is much, much easier for a human to do than for a computer!) and then re-imposing it on the pitch-scaled sound. The result is very often a sound with a much more 'natural' character!

The graph shows a short segment of a violin recording that has been pitch scaled by a factor 1.2x.
- The blue line is the original (unscaled) spectrum.
- The yellow line is the spectrum after scaling in Any Time without formant correction.
- The green line is the spectrum after scaling in Any Time with formant correction. As you can see it much more closely follows the 'envelope' of the blue line.
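
The principle behind formant correction can be sketched in a few lines of Python. This is a simplified illustration, not Any Time's implementation: the sound is reduced to a list of harmonics, and the 'frequency envelope' is a made-up resonance curve with a single formant at 1000 Hz that is assumed to be known exactly. Estimating the envelope from a real recording, which is the hard part, is left out.

```python
import math

def envelope(freq_hz):
    """Hypothetical resonance curve with a single formant at 1000 Hz."""
    return math.exp(-((freq_hz - 1000.0) / 300.0) ** 2)

F0, FACTOR = 200.0, 1.2
# Original sound: harmonics of 200 Hz whose amplitudes follow the resonance curve.
original = [(k * F0, envelope(k * F0)) for k in range(1, 11)]

# Plain pitch scaling: every component moves up by FACTOR and keeps its amplitude,
# so the formant moves up by the same factor.
scaled = [(f * FACTOR, a) for f, a in original]

# Formant correction: divide out the shifted envelope and re-apply the original one.
corrected = [(f, a / envelope(f / FACTOR) * envelope(f)) for f, a in scaled]

def peak_frequency(spectrum):
    """Frequency of the strongest component in a list of (frequency, amplitude) pairs."""
    return max(spectrum, key=lambda component: component[1])[0]

print(round(peak_frequency(original)),
      round(peak_frequency(scaled)),
      round(peak_frequency(corrected)))
```

In this toy case the strongest component moves from 1000 Hz to 1200 Hz with plain scaling, while after correction it sits at 960 Hz, the scaled harmonic closest to the fixed 1000 Hz formant.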
Time-stretching:
Time stretching changes the length of a recording without changing the sample rate or the pitch of the sound. Mathematically this is equivalent to doing both pitch scaling and resampling at the same time - so the same comments apply as for those two operations. A mathematical necessity when time-stretching - or when pitch scaling - is that different frequencies get different amounts of 'phase shift' when they are scaled. When time-stretching by non-integer factors this may sometimes be especially noticeable because you can get a pronounced 'wah-wah' effect (amplitude modulation due to cancellation). Any Time employs two methods to try to deal with this. One is a feature to 're-synchronize' the phase at larger amplitude increases. The other is an optional feature to 'enforce the original volume envelope'. This will analyze the overall 'volume envelope' of the original recording and 'force' this back onto the processed audio - much like the formant correction but in the time domain instead of in the frequency domain. The side effect is that it will not preserve the spectral characteristics of the sound, since 'soft frequency components' are lifted up to compensate whenever 'loud frequency components' cancel each other out. Whether you prefer this or not is a personal choice - but first try without this option and if you are bothered by a 'wah-wah', then try turning it on!
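
The idea of 'enforcing the original volume envelope' can be sketched as follows in Python. This is only an assumed, simplified reading of the feature, not Any Time's actual code: the envelope is taken to be the RMS level of fixed-size blocks, the processed signal is a synthetic stand-in with an artificial 3 Hz 'wah-wah', and the function name and block size are hypothetical.

```python
import math

def rms(block):
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(s * s for s in block) / len(block))

def enforce_volume_envelope(original, processed, block=200):
    """Scale each block of `processed` to the RMS level of the matching part of `original`."""
    stretch = len(processed) / len(original)
    out = []
    for start in range(0, len(processed), block):
        chunk = processed[start:start + block]
        # Map the block back to the corresponding time span in the original.
        src_start = int(start / stretch)
        src_end = max(src_start + 1, int((start + len(chunk)) / stretch))
        target = rms(original[src_start:src_end])
        current = rms(chunk)
        gain = target / current if current > 0 else 1.0
        out.extend(s * gain for s in chunk)
    return out

RATE = 8000
original = [0.5 * math.sin(2 * math.pi * 440 * i / RATE) for i in range(4000)]
# Stand-in for a 2x time-stretched result that suffers from a 3 Hz 'wah-wah'.
processed = [0.5 * (1 + 0.5 * math.sin(2 * math.pi * 3 * i / RATE))
             * math.sin(2 * math.pi * 440 * i / RATE) for i in range(8000)]

restored = enforce_volume_envelope(original, processed)
before = [rms(processed[i:i + 200]) for i in range(0, 8000, 200)]
after = [rms(restored[i:i + 200]) for i in range(0, 8000, 200)]
print(f"level spread before: {max(before) - min(before):.3f}, "
      f"after: {max(after) - min(after):.3f}")
```

A gain that jumps from block to block would itself be audible, so a practical version would presumably smooth the gain over time. The sketch also shows why the spectral balance is not preserved: the gain lifts whatever is left in a block, whether or not it was the component that got cancelled.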
Audio mastering:
In the high quality audio studios of today, recording, mixing and effects processing are often done at a high sample rate and bit depth (e.g. 96000 Hz, 32-bit floating point). When 'mastering' the material, i.e. preparing it for the final lower resolution distribution medium (e.g. 44100 Hz, 16-bit audio CD format), it is important to get everything right in order to maintain maximum audio quality.
- Volume normalization: When preparing a CD or other 'package', you are often given sound clips with different 'loudness'. The relative volume of each clip then needs to be carefully adjusted so that the listener won't have to jump to the volume control every so often (e.g. at each track change on a CD). Any Time provides a heuristic algorithm for automatically performing such 'volume normalization' that tries to achieve a result very close to what a sound engineer would do. The algorithm also employs psycho-acoustical corrections for the ear's different sensitivity to different frequencies. The same algorithm as is used in Any Time was also selected by a large national radio station after evaluating several different methods.
- Noise floor clean-up: If you have a noisy recording, then Any Time can selectively filter out any frequency components with amplitudes lower than a selectable threshold value. This preserves the full resolution of all other frequency components.
- Noise shaped dithering: Bit-depth quantization is the process of constructing output sample value numbers from incoming, often higher precision data (e.g. 24-bit to 16-bit PCM). The internal processing in Any Time is always done at a precision of 64-bit floating point and the processing of the audio data means that the output may actually contain information at lower levels than the input had - in other words, doing the bit-depth quantization right is truly important! The most common - and very simple - method is to round off the samples to the nearest value in the output data format. Any Time can of course do this if you want. Unfortunately the quantization noise introduced by this is fairly objectionable to the human brain. Any Time can also do state-of-the-art 'noise shaped dithering'. This method adds a very low level noise before rounding the sample values - i.e. instead of always rounding to the nearest value, a random element is introduced. Studies have shown that this noise is much more pleasing for the brain! That is the 'dithering' part. The 'noise shaped' part means that the noise is very carefully filtered so that its spectral characteristics are the inverse of the ear's sensitivity curve - i.e. we get a noise that has most of its energy in the regions where the ear is the least sensitive. In other words a fairly bad type of noise is traded for a much better type of noise! The shaping is achieved by carrying the quantization errors for one sample over to the next sample. The dithering taken together with the noise shaping has a beneficial side effect - a phenomenon called 'stochastic resonance'. This phenomenon 'subjectively' preserves some of the audio information below the lowest level that can 'truly' be represented in the output data format, resulting in improved 'perceived' sound quality. If you have ever seen regular audio CDs bearing '20-bit' stickers, then it is often this phenomenon that they are referring to!
(note: if you see an HDCD sticker then that is something entirely different).
To explain it in rather simplified terms: the noise being truly random (i.e. 'stochastic'), it will every now and then amplify ('resonate' with) a low level signal - just for a short while. That short while may be enough if it occurs repeatedly - if the ear can hear short bits of a sound sort of 'sticking out' on top of the noise - just a little bit here and a little bit there - then the brain is smart enough to 'fill in the blanks' so to speak and perceives it as a steady tone at a lower level than the noise (the perceived amplitude level of the tone is proportional to how often it resonates, which statistically is proportional to the real tone level). Even though the subjective sound quality is improved by this method, and 'perception' of sounds below the output bit-depth is thus possible, to actually claim '20-bit resolution' may be a bit misleading...
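
The effect can be reproduced with a small Python experiment. It is a generic textbook-style sketch, not the noise shaper used in Any Time: the triangular (TPDF) noise of up to one step in each direction, the first-order error feedback and the function names are all assumptions, and no psycho-acoustic filter curve is involved. A 1000 Hz tone with an amplitude of only 0.3 output steps is quantized three ways, and the level of the tone in the result is measured.

```python
import math
import random

def quantize(samples, dither=False, shape=False, seed=0):
    """Round samples to whole steps (1.0 = one step of the output format)."""
    rng = random.Random(seed)
    out, err = [], 0.0
    for s in samples:
        # Error feedback: subtract the previous sample's quantization error.
        wanted = s - err if shape else s
        # Triangular (TPDF) noise between -1 and +1 step.
        noise = rng.random() - rng.random() if dither else 0.0
        q = round(wanted + noise)
        err = q - wanted
        out.append(q)
    return out

def tone_amplitude(signal, freq_hz, sample_rate):
    """Amplitude of a single frequency component (one DFT bin by correlation)."""
    w = 2 * math.pi * freq_hz / sample_rate
    re = sum(s * math.cos(w * i) for i, s in enumerate(signal))
    im = sum(s * math.sin(w * i) for i, s in enumerate(signal))
    return 2 * math.hypot(re, im) / len(signal)

RATE = 44100
# A 1000 Hz tone of only 0.3 steps: too small to ever reach the rounding threshold.
quiet = [0.3 * math.sin(2 * math.pi * 1000 * i / RATE) for i in range(RATE)]

rounded = quantize(quiet)
dithered = quantize(quiet, dither=True)
shaped = quantize(quiet, dither=True, shape=True)

for name, data in (("rounded", rounded), ("dithered", dithered), ("shaped", shaped)):
    print(name, round(tone_amplitude(data, 1000, RATE), 3))
```

Plain rounding turns the whole tone into silence, while the dithered versions still carry it at close to its true level, buried in noise. With error feedback enabled, the noise is additionally pushed towards high frequencies; a real noise shaper would use a filter matched to the ear's sensitivity curve rather than this simplest possible one.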
