AVI timing and audio sync
Last time I promised I would write up some information about how VBR audio is popularly implemented in an AVI file; I'm going to generalize this slightly and talk about the timing of AVI streams. I'm not going to speak on the properness of VBR audio in AVI because almost everyone knows how I feel about this and that doesn't change the fact that VBR files are out in the wild and will be encountered by applications that accept the AVI format. Instead, here are the technicals so you will at least know how it works and what issues arise as a result.
I should note that I didn't devise the VBR scheme; I simply reverse engineered it from the Nandub output when I started receiving reports that newer versions of VirtualDub suddenly were not handling audio sync properly on some files. The technique I describe below varies slightly from Nandub's output, as I omit some settings that, as far as I can tell, are not necessary to get VBR-in-AVI working.
As usual, any and all corrections are welcome.
AVI streams, both audio and video, are composed of a series of samples which are evenly spaced in time. For a video stream, a stream sample is a video frame, and the stream rate is the frame rate; for an audio stream, a stream sample is an audio block, which for PCM is equivalent to an audio sample. These stream samples are in turn stored in chunks, where there is generally one sample per chunk for a video stream, and multiple samples per chunk for an audio stream. These chunks are then pointed to by the index, which lists all chunks in the file in their stream order.
Timing of an AVI stream is governed by several variables:
- The sample rate (dwRate/dwScale) of the stream determines the spacing in time between the samples. For instance, a video stream might have a sample rate of 25 samples per second, so each sample is 1/25th of a second or 40 milliseconds apart. This is actually stored as a fraction of two 32-bit values so many values that would ordinarily have to be approximated in integer or floating-point math can still be represented exactly; for instance, NTSC frame rate is 30000/1001. VirtualDub tries to use fraction math whenever possible to preserve the exact rate.
- The start (dwStart) of the stream determines when the first sample in the stream starts. A start value of 2 for a 25 sample/sec stream would mean 80ms of dead time before the first sample starts. Generally this is filled in by extending the first sample backwards in time. VirtualDub does not yet support non-zero start values. (I need to fix this.)
- The sample size (dwSampleSize) determines the sample-to-chunk mapping. If it is zero, one sample is stored per chunk, and each sample can have a different size. This is used for video streams. If it is non-zero, each sample is the same number of bytes in size, and the number of samples in a chunk is determined from the size of the chunk. Audio streams use this mode, with the sample size being the same as the block size in the audio format (WAVEFORMATEX::nBlockAlign).
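As a rough illustration of how these three fields interact, here is a minimal Python sketch. The class and method names are made up for this example and only mirror the header field names; it is not how VirtualDub or any Microsoft parser is written. It keeps the rate as an exact fraction, and it simply rejects chunk sizes that do not divide evenly by the sample size.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class StreamTiming:
    """Timing fields of an AVI stream header (hypothetical helper class)."""
    dwRate: int
    dwScale: int
    dwStart: int = 0
    dwSampleSize: int = 0

    @property
    def sample_rate(self) -> Fraction:
        # Samples per second, kept as an exact fraction.
        return Fraction(self.dwRate, self.dwScale)

    @property
    def sample_period(self) -> Fraction:
        # Seconds between consecutive samples.
        return Fraction(self.dwScale, self.dwRate)

    @property
    def start_delay(self) -> Fraction:
        # Dead time in seconds before the first sample starts.
        return self.dwStart * self.sample_period

    def samples_in_chunk(self, chunk_bytes: int) -> int:
        # dwSampleSize == 0: one variable-size sample per chunk (video).
        if self.dwSampleSize == 0:
            return 1
        # Non-zero: fixed-size samples. This sketch only handles chunk
        # sizes that are an exact multiple of the sample size.
        if chunk_bytes % self.dwSampleSize:
            raise ValueError("chunk size is not a multiple of dwSampleSize")
        return chunk_bytes // self.dwSampleSize


pal = StreamTiming(dwRate=25, dwScale=1, dwStart=2)
ntsc = StreamTiming(dwRate=30000, dwScale=1001)
pcm = StreamTiming(dwRate=44100, dwScale=1, dwSampleSize=4)  # 16-bit stereo PCM

print(pal.sample_period * 1000)    # 40 (milliseconds)
print(pal.start_delay * 1000)      # 80 (milliseconds)
print(ntsc.sample_rate)            # 30000/1001
print(pcm.samples_in_chunk(4096))  # 1024
```

Using Fraction here is just a convenient way to show the point about exact rates: 30000/1001 never gets rounded to 29.97.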
There is one last tidbit missing: where exactly each sample starts and ends. The standard set by DirectShow is that the start time for the initial sample is zero, so assuming dwStart=0, the first sample in a 25/sec stream would occupy [0ms, 40ms), the second [40ms, 80ms), etc. This can be interpreted as nearest neighbor sampling, which means that an interpolator would consider the samples to be in the center of each interval at 20ms, 60ms, and so on.
Note that, based on the above, the timing of a sample is determined solely by its position in the stream -- that is, a sample N always has a start time of (dwStart + N)*(dwScale/dwRate) seconds regardless of its position in the file. In particular, the grouping of samples into chunks or the position of a stream's chunks relative to another stream's chunks doesn't matter. This means that interleaving of a file doesn't affect synchronization between two streams. That doesn't mean that interleaving doesn't affect performance, and if a player has strict playback constraints as hardware devices often do, poor interleave may render a player unable to maintain correct sync or even uninterrupted playback. However, a non-realtime conversion on a hard disk (or other random access medium) on a PC should not have such constraints.
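The timing rule above can be sketched in a few lines of Python. The function names are invented for this example; the only inputs are the sample index and the three header fields, which is the whole point: nothing about chunks or file position enters into it.

```python
from fractions import Fraction


def sample_interval(n, dwRate, dwScale, dwStart=0):
    """Return (start, end) in seconds of stream sample n, as exact fractions.

    The interval is half-open: [start, end).
    """
    period = Fraction(dwScale, dwRate)
    start = (dwStart + n) * period
    return start, start + period


def sample_center(n, dwRate, dwScale, dwStart=0):
    """Midpoint of sample n's interval, where an interpolator would place it."""
    start, end = sample_interval(n, dwRate, dwScale, dwStart)
    return (start + end) / 2


# First three samples of a 25 samples/sec stream, printed in milliseconds.
for n in range(3):
    start, end = sample_interval(n, dwRate=25, dwScale=1)
    center = sample_center(n, dwRate=25, dwScale=1)
    print(n, float(start * 1000), float(end * 1000), float(center * 1000))
```

For the 25/sec stream this gives the intervals 0-40ms, 40-80ms, and 80-120ms with centers at 20ms, 60ms, and 100ms.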
Now, about VBR....
You might think that setting dwSampleSize=0 for an audio stream would allow it to be encoded as variable bitrate (VBR) like a video stream, where each sample has a different size. Unfortunately, this is not the case -- Microsoft AVI parsers simply ignore dwSampleSize for audio streams and use nBlockAlign from the WAVEFORMATEX audio structure instead, which cannot be zero. Nuts. So how is it done, then?
The key is in the translation from chunks to samples.
Earlier, I said that the number of samples in a chunk is determined from the size of the chunk in bytes, since samples are a fixed size. But what happens if the chunk size is not evenly divisible by the sample size? Well, DirectShow, the engine behind Windows Media Player and a number of third-party video players that run on Windows, rounds up. This means that if you set nBlockAlign to be higher than the size of any chunk in the stream, DirectShow will regard all of them as holding one sample, even though they are all different sizes. Thus, to encode VBR MP3, you simply have to set nBlockAlign to at least 960, the maximum frame size for MPEG layer III, and then store each MPEG audio frame in its own chunk. Since each audio frame encodes a constant amount of audio data -- 1152 samples for 32kHz or higher, 576 samples for 24kHz or lower -- this permits proper timing and seeking despite the variable bitrate. This can also be done for other compressed audio formats, provided that the encoding application is able to determine the compressed block boundaries and the maximum block size, and the decoders accept non-standard values for the nBlockAlign header field.
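Here is a small Python sketch of the round-up trick. The function names and the frame sizes are made up for illustration. It also assumes that the stream rate is set to the audio sampling rate divided by 1152, so that one stream sample lasts exactly one MPEG audio frame; that is my reading of how the timing works out rather than something spelled out above.

```python
from fractions import Fraction


def samples_in_chunk_round_up(chunk_bytes, nBlockAlign):
    """Stream samples in a chunk when the division rounds up."""
    return -(-chunk_bytes // nBlockAlign)  # integer ceiling division


def vbr_frame_start_times(chunk_sizes, nBlockAlign, audio_rate, pcm_per_frame=1152):
    """Start time in seconds of each chunk, one MPEG audio frame per chunk.

    Assumption: the stream rate is audio_rate / pcm_per_frame stream
    samples per second, so each stream sample spans one MPEG frame.
    """
    period = Fraction(pcm_per_frame, audio_rate)
    times = []
    sample_index = 0
    for size in chunk_sizes:
        times.append(sample_index * period)
        sample_index += samples_in_chunk_round_up(size, nBlockAlign)
    return times


# Hypothetical frame sizes in bytes for a 44.1kHz VBR MP3 stream.
frame_sizes = [417, 626, 208, 835, 522]
counts = [samples_in_chunk_round_up(s, 960) for s in frame_sizes]
print(counts)  # [1, 1, 1, 1, 1]

for t in vbr_frame_start_times(frame_sizes, 960, 44100):
    print(round(float(t) * 1000, 3), "ms")
```

Every chunk counts as exactly one stream sample no matter how many bytes it holds, so the start times advance by a constant 1152/44100 seconds per chunk.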
The advantages of this VBR encoding:
- Better size-to-quality ratio. Because the encoding is variable bitrate, bits are more efficiently used overall. Well, assuming your VBR encoder is a good one.
- No runts fed to the decoder. Since the AVI stream sample boundaries are now exactly between the MPEG audio frames, and not between every byte as is indicated by the normal nBlockAlign=1 for MP3, the player always feeds data to the decoder starting on MPEG audio frame boundaries. This means that, barring errors in the MPEG data, the decoder never has to hunt through the audio data to find a valid MPEG header. Since MPEG audio headers do not have unique bit patterns, such hunting can lead to false hits within the encoded audio data and thus to decoding errors. (This is one of the reasons why editing an MP3-encoded audio stream within an AVI file is not recommended -- a generic editor basically ends up chopping the stream apart as would a hex editor and leaving fragments of frames all over the place, because blocks that are supposed to be atomic aren't.)
- More reasonable latency. Normally, because MP3 streams are encoded with nBlockAlign=1, the MP3 decoder has an enormous latency in AVI stream samples -- it eats thousands of stream samples before outputting a single decoded sample, whereas most decoders only have to consume one or two. With one stream sample per MPEG frame, this drops to a much more reasonable 1-2 stream sample latency. The high latency of the nBlockAlign=1 case can also be avoided with more intelligent buffering logic, however.
Now, the downsides:
- AVI overhead. Depending on the indexing scheme, each chunk requires anywhere from 8-33 bytes of overhead. This is exacerbated by the increased number of chunks in a VBR-encoded stream and can add up to several additional megabytes on a highly compressed AVI file, compared to one that uses constant-bitrate (CBR) with a sparse interleave. As Alexander Noé pointed out last time, though, you can place multiple samples per chunk in the VBR scheme, effectively sparsing out the sync points in the audio stream; this trades seek precision for lower overhead.
- AVIFile incompatibility. The Microsoft AVIFile APIs in the Video for Windows API round down instead of up when computing samples-per-chunk, so any AVIFile-based program will see zero samples in a VBR audio stream and thus be unable to read the audio data. Despite its deprecated status, there are still a surprisingly high number of apps using the Video for Windows API, some of which are even new. Unfortunately, it is not always apparent whether an application is Video for Windows or DirectShow based.
- DirectShow incompatibility. The debug build of DirectShow, at least in DX7, will assert when decoding an AVI sample. Also, if you try to rewrite a VBR stream into a new AVI file using GraphEdit, the AVI Mux filter will balk with an error. I do not know exactly where the problem lies; it may either be an inability of the AVI Mux filter to handle variable-size samples (not surprising), or it may be a problem with the media type produced by the AVI Splitter filter.
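To put rough numbers on the first two downsides, here is a back-of-the-envelope Python sketch. The function names, the frame sizes, and the two-hour 44.1kHz stream are all made up for illustration; only the rounding directions and the 8-33 byte range come from the points above.

```python
def samples_round_up(chunk_bytes, nBlockAlign):
    # Rounding direction attributed to DirectShow above.
    return -(-chunk_bytes // nBlockAlign)


def samples_round_down(chunk_bytes, nBlockAlign):
    # Rounding direction attributed to AVIFile above.
    return chunk_bytes // nBlockAlign


def chunk_overhead_bytes(duration_s, audio_rate, per_chunk_bytes,
                         pcm_per_frame=1152, frames_per_chunk=1):
    """Approximate container overhead for a VBR audio stream."""
    frames = duration_s * audio_rate // pcm_per_frame
    chunks = -(-frames // frames_per_chunk)  # round up to whole chunks
    return chunks * per_chunk_bytes


frame_sizes = [417, 626, 208, 835, 522]  # hypothetical MPEG frame sizes
print(sum(samples_round_up(s, 960) for s in frame_sizes))    # 5
print(sum(samples_round_down(s, 960) for s in frame_sizes))  # 0

# Two hours of 44.1kHz audio, one frame per chunk, at both ends of the
# 8-33 byte per-chunk overhead range.
low = chunk_overhead_bytes(7200, 44100, 8)
high = chunk_overhead_bytes(7200, 44100, 33)
print(low, high)  # 2205000 9095625
```

So a round-down parser sees an empty audio stream, and the overhead for this example lands somewhere between roughly 2MB and 9MB. Passing frames_per_chunk=5 cuts the overhead to a fifth, at the cost of coarser seek points.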
As I mentioned in the introduction, I will refrain from saying whether VBR audio should or shouldn't be used, as I've already done the subject to death. Hopefully now those of you trying to write AVI parsers will have some idea about how to read and detect VBR files, however.
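For parser writers, one possible detection heuristic follows from the description above. This is only a sketch of my own, not what VirtualDub or DirectShow actually does, and the function names are invented: it flags a stream as VBR when nBlockAlign is larger than 1 and some chunk is not a whole multiple of it, and it counts stream samples with round-up division.

```python
def looks_like_vbr_audio(chunk_sizes, nBlockAlign):
    """Heuristic guess at whether an audio stream uses the VBR scheme.

    Assumption: in a normal stream every chunk is a whole multiple of
    nBlockAlign, so any leftover bytes suggest the round-up trick.
    A truncated final chunk in a damaged file would be a false positive.
    """
    sizes = [s for s in chunk_sizes if s > 0]
    if not sizes or nBlockAlign <= 1:
        return False
    return any(s % nBlockAlign for s in sizes)


def stream_length_in_samples(chunk_sizes, nBlockAlign):
    """Total stream samples, rounding each chunk up as DirectShow does."""
    return sum(-(-s // nBlockAlign) for s in chunk_sizes)


vbr_chunks = [417, 626, 208, 835, 522]  # hypothetical MPEG frame sizes
pcm_chunks = [4096, 4096, 4096]         # 16-bit stereo PCM, nBlockAlign=4

print(looks_like_vbr_audio(vbr_chunks, 960))      # True
print(looks_like_vbr_audio(pcm_chunks, 4))        # False
print(stream_length_in_samples(vbr_chunks, 960))  # 5
```

A real parser would probably want to check the format tag as well and tolerate a short last chunk, but that goes beyond what this sketch tries to show.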