SDL2 Audio Programming
I've been following along with Casey Muratori's Handmade Hero series where he live codes a game from scratch[1]. Casey covers audio in days 7, 8, and 9. While it's not conceptually much harder than trigonometry, it is fiddly and terminology is easy to get mixed up.
There are several barriers in practice. First, Casey uses Windows. Non-Windows systems, such as GNU/Linux, can't use the same API. Others who have followed along on non-Windows systems use the Simple DirectMedia Layer 2.0 (SDL2). And while SDL2 works great, its documentation doesn't. The SDL2 documentation focuses on what, leaving how as an exercise (in frustration). The second problem is that all the audio implementations and explanations I've found are buggy or needlessly complex[2].
This document presents my understanding of SDL2 audio in an effort to solidify my learning. I also hope it will assist others who feel similarly confused by the currently available resources.
Oscillation within a medium produces displacement. For example, the back and forth vibration of a taut string causes the surrounding air to move. The disturbance which spreads across the medium is called a wave. Sound is the auditory perception of a wave. It has several representations. Mathematically, a wave is a cyclic function (often of time) which measures displacement of the medium.
Several characteristics describe a wave. The characteristics fall into two groups depending on whether they're considered relative to time or space. For Handmade Hero, we only need to consider the time related characteristics of period and frequency[3].
[Figure: displacement vs. time for two waves. The top plot shows a high-frequency wave with its period (one full cycle) and amplitude (distance from the resting position to a peak) labeled; the bottom plot shows a low-frequency wave with longer, more drawn-out cycles.]
Period, T, is the time it takes for one wave cycle to complete. Frequency, f, is the number of waves passing by a fixed point per unit time (where the wave is measured by a particular feature, such as a peak or trough). A high frequency results in a high pitch, or shriller sound. Low frequencies result in a low pitch, or bassier sound. Frequency is the reciprocal of period and is measured in Hertz, abbreviated Hz, which stands for 1/s (a "one-per-second"). Humans typically can only hear between 20Hz and 20,000Hz.
frequency = 1/period
f = 1/T
Another characteristic we need, related to space, concerns loudness, or volume. Amplitude, A, is the distance between the resting position and the maximum displacement of the wave. It's a measure of how pronounced the wave is. The bigger the amplitude, the louder the sound. Volume is directly related to a wave's amplitude.
There are many different waveforms and they can be combined in
arbitrary ways. Two of the simplest waveforms are the square and sine
waves. A square wave is the simple alternation between high and low
values. A sine wave is given by the sine function,
[Figure: a square wave, alternating abruptly between a constant high value and a constant low value each half period.]
We represent sound on a computer using samples. From a continuous waveform, we select discrete points to represent the whole. Amazingly, this discrete representation is sufficient to completely reconstruct a continuous waveform[4].
[Figure: displacement vs. time for a continuous wave (drawn with dots) overlaid with the sampled values (drawn with o's) taken at regular intervals along the curve.]
Each sample value must be represented by a binary number. The size of the number is the bit depth. We can represent two channels of sound, a left and a right speaker, each 16-bits wide, using a 32-bit number, such as an int. The two channel values form a 32-bit sample frame. Bit depth controls how much noise is encoded in the sample. Sampling at 16-bits is sufficient to remove any noise resulting from the encoding[5].
The sample rate, or number of samples per second, determines the ability to recreate the original waveform. The Nyquist–Shannon Sampling Theorem governs the conversion between discrete (digital) samples and continuous (analog) waveforms:
Nyquist–Shannon Sampling Theorem
If a function x(t) contains no frequencies higher than B hertz, then it can be completely determined from its ordinates at a sequence of points spaced less than 1/(2B) seconds apart.
This means a sample rate higher than 2B samples per second is sufficient to exactly recreate the sound being sampled. Since the typical range of human hearing is 20Hz to 20,000Hz, any sample rate over 40,000 samples per second is sufficient to reconstruct waveforms audible to the average human[6]. The standard rate of 44100Hz comes from the sample rate available to common audio equipment at the time the standard was codified. A higher sample rate, such as 48000Hz, captures a wider frequency range (8000Hz above the typical range). This provides more flexibility when transforming audio, such as bringing sounds outside normal hearing into the audible range.
Minimal square wave implementation
The simplest implementation which outputs sound uses a square wave.
Include the SDL2 library
The first step is to include the SDL2 library. Be warned: SDL2, aside from not having much documentation, has misleading documentation. Chris Wellons has a nice write up of some common gotchas[7]. He recommends including "SDL.h" rather than "<SDL2/SDL.h>" in order to avoid importing the wrong library.
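Following that advice, the include is just one line. This assumes your compiler flags come from `sdl2-config --cflags` (or pkg-config), which put SDL2's own directory on the include path:

```c
/* Use "SDL.h", not <SDL2/SDL.h>: the flags from `sdl2-config --cflags`
 * already point the compiler at the correct SDL2 include directory. */
#include "SDL.h"
```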
Initialize the audio subsystem
The audio subsystem needs to be initialized. Someone, somewhere may say, "Duh! Of course you need to initialize it first!" Between you and me, it's not obvious (at all) that this is required by the user. "Init" is too vague a term to be meaningful without context. I couldn't find anywhere in the documentation or tutorials which explicitly explains it[8]. If your audio logic is perfect yet no sound comes out, check if you've initialized the audio subsystem.
Technically, you can confirm the subsystem is initialized by checking
SDL_GetError. If we thought to look within the bowels of the
implementation for opening an audio device, we would see that an error
is set when the audio subsystem isn't initialized. SDL2 doesn't stop
you; you must manually check for this error using SDL_GetError.
In reality, just make sure your code initializes the audio subsystem
with SDL_Init (or SDL_InitSubSystem) before calling any other audio
function.
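A minimal initialization sketch (the logging and early return are my own choices, not something SDL2 prescribes):

```c
#include "SDL.h"

int main(int argc, char *argv[]) {
    (void)argc; (void)argv;

    /* Without this call, opening an audio device fails with an error
     * you only see if you remember to ask SDL_GetError. */
    if (SDL_Init(SDL_INIT_AUDIO) != 0) {
        SDL_Log("Failed to initialize audio subsystem: %s", SDL_GetError());
        return 1;
    }

    /* ... open an audio device and play sound here ... */

    SDL_Quit();
    return 0;
}
```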
Set up the device
Once the subsystem is initialized, an audio device needs to be opened. From the perspective of SDL2, an audio device is a data structure pointing to a section of heap memory, the audio buffer, along with various parameters to manage interactions with it. The audio hardware reads from the buffer and translates the values it finds into displacements of a speaker membrane, producing sound.
SDL2 has two APIs for audio, legacy and not legacy. The non-legacy API is a generalization of the legacy API. We'll use the non-legacy API here.
The audio device needs to know how to understand the bits we place in
the audio buffer. The
SDL_AudioSpec structure provides this
information, such as bit depth, number of channels, sample rate, and
the size of the buffer itself.
The API is a bit confusing. Sample rate is given as "frequency". This frequency refers to samples per second, not wave cycles per second (pitch). The "samples" member is also unclear. What it really means is the size of the audio buffer in sample frames. That is, if there are two channels, then one (sample) frame consists of two (individual) samples. The documentation says that the "samples" member should be a power of 2 between 512 and 8096. If you're confused by this (because 8096 is not a power of two), then join the club. Maybe it should be 2^13 = 8192? The documentation also says,
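Filling in the structure might look like the following sketch. The specific values are my own choices, and the callback is assumed to be a function defined elsewhere in your program (callbacks are covered below):

```c
SDL_AudioSpec want;
SDL_zero(want);                  /* zero the fields we don't set */
want.freq = 48000;               /* sample frames per second, NOT pitch */
want.format = AUDIO_S16LSB;      /* bit depth: signed 16-bit, little endian */
want.channels = 2;               /* left and right */
want.samples = 4096;             /* buffer size in sample *frames*, power of two */
want.callback = audio_callback;  /* your function that fills the buffer */
```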
the number of sample frames is directly related to time by the following formula:

    ms = (sampleframes * 1000) / freq
I read this to mean that if the sample rate is 48000Hz then there are 24000 frames per second (since there are two individual samples per frame). This corresponds to (24000*1000)/48000 = 500ms of sound (half a second).
Beware! As far as I can tell (and deduce), the size of the audio buffer does not directly correspond to the duration of playback. It's merely a pool of data that is processed sequentially by the sound device.
SDL2 uses a callback function which fills the audio buffer with new data when the sound device needs it. When this happens is opaque to the user. The callback may be (and in fact is) called multiple times over the course of playback, even for short durations. For example, if the audio buffer contains 500ms of data, the callback will still be executed several times for a 32ms long playback. The size of the audio buffer instead determines the latency, or responsiveness of playback to changes in the audio data. A smaller audio buffer results in more callback requests (calls for data).
Opening an audio device requires specifying the settings you want…and seeing if that request was fulfilled. If the device is found and opened, it starts paused. You must set the paused state to 0 (unpaused) to start playing. Finally, the program needs to execute long enough for the audio to actually play.
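Put into code, a sketch of those steps might look like this (it assumes a `want` spec filled in as above; the two-second delay is an arbitrary choice, just long enough to hear something):

```c
SDL_AudioSpec have;
/* NULL asks for the default device; the final 0 means SDL2 may not
 * silently change the spec we requested. Compare `want` and `have`. */
SDL_AudioDeviceID device = SDL_OpenAudioDevice(NULL, 0, &want, &have, 0);
if (device == 0) {
    SDL_Log("Failed to open audio device: %s", SDL_GetError());
    return 1;
}

SDL_PauseAudioDevice(device, 0); /* devices start paused; 0 unpauses */
SDL_Delay(2000);                 /* keep running so ~2s of audio plays */
SDL_CloseAudioDevice(device);
```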
Fill the audio buffer with the sound you want
All this does nothing if there's no data to play. Data comes from the
callback specified by the SDL_AudioSpec. The callback is a function
that SDL2 invokes whenever the sound device needs more data.
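The callback's shape is fixed by SDL2; only the name is yours to choose:

```c
/* userdata: the pointer you stored in SDL_AudioSpec's userdata field
 * stream:   the audio buffer to fill with sample data
 * len:      the size of that buffer in *bytes* (not samples or frames) */
void audio_callback(void *userdata, Uint8 *stream, int len);
```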
To play a square wave, we alternate the signal from low to high each period. A quick, yet incorrect, implementation is to keep a running total of all samples written in order to determine the period. Note that each channel is written to the audio buffer in left-right-left-right order.
This implementation is buggy because the running total will
eventually roll over. When this happens, if the signal does not align
with the previous set of data (high to high or low to low), a glitch
occurs in the audio.
Simple (yet buggy) square wave implementation
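Here is a sketch of that buggy callback. It's written without the SDL headers so the wave logic stands on its own (SDL's Uint8 becomes uint8_t); the tone frequency and volume are arbitrary choices of mine:

```c
#include <stdint.h>
#include <stddef.h>

#define SAMPLE_RATE 48000  /* sample frames per second */
#define TONE_HZ     256    /* pitch of the square wave (arbitrary choice) */
#define VOLUME      3000   /* wave amplitude (arbitrary choice) */

/* The bug: this running total eventually rolls over, and when it does,
 * the wave's phase jumps. */
static uint64_t sample_index = 0;

/* Square wave value at a given sample index: high for half a period,
 * low for the other half. */
static int16_t square_sample(uint64_t index) {
    uint64_t half_period = SAMPLE_RATE / (TONE_HZ * 2); /* samples per half cycle */
    return (int16_t)(((index / half_period) % 2) ? -VOLUME : VOLUME);
}

/* Same shape as SDL2's audio callback. Channels are written in
 * left-right-left-right order. */
static void audio_callback(void *userdata, uint8_t *stream, int len) {
    (void)userdata;
    int16_t *buffer = (int16_t *)stream;
    size_t frames = (size_t)len / (2 * sizeof(int16_t)); /* 2 samples per frame */
    for (size_t i = 0; i < frames; i++) {
        int16_t value = square_sample(sample_index++);
        *buffer++ = value; /* left channel */
        *buffer++ = value; /* right channel */
    }
}
```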
Putting it all together, we get:
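Assembled into one file, a sketch of the whole (still buggy) program might read as follows. The tone, volume, and two-second playback duration are my arbitrary choices, and error handling is kept minimal:

```c
#include "SDL.h"

#define SAMPLE_RATE 48000
#define TONE_HZ     256
#define VOLUME      3000

static Uint64 sample_index = 0; /* running total: the eventual roll-over bug */

static void audio_callback(void *userdata, Uint8 *stream, int len) {
    (void)userdata;
    Sint16 *buffer = (Sint16 *)stream;
    int frames = len / (2 * (int)sizeof(Sint16)); /* 2 channels per frame */
    int half_period = SAMPLE_RATE / (TONE_HZ * 2);
    for (int i = 0; i < frames; i++) {
        Sint16 value = ((sample_index / half_period) % 2) ? -VOLUME : VOLUME;
        sample_index++;
        *buffer++ = value; /* left */
        *buffer++ = value; /* right */
    }
}

int main(int argc, char *argv[]) {
    (void)argc; (void)argv;

    if (SDL_Init(SDL_INIT_AUDIO) != 0) {
        SDL_Log("SDL_Init failed: %s", SDL_GetError());
        return 1;
    }

    SDL_AudioSpec want, have;
    SDL_zero(want);
    want.freq = SAMPLE_RATE;
    want.format = AUDIO_S16LSB;
    want.channels = 2;
    want.samples = 4096;
    want.callback = audio_callback;

    SDL_AudioDeviceID device = SDL_OpenAudioDevice(NULL, 0, &want, &have, 0);
    if (device == 0) {
        SDL_Log("SDL_OpenAudioDevice failed: %s", SDL_GetError());
        SDL_Quit();
        return 1;
    }

    SDL_PauseAudioDevice(device, 0); /* devices start paused */
    SDL_Delay(2000);                 /* play for ~2 seconds */

    SDL_CloseAudioDevice(device);
    SDL_Quit();
    return 0;
}
```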
Complete square wave implementation
The previous implementation is faulty because it keeps track of where it's at in the wave being output by using a running index which eventually rolls over. When that happens, the index jumps to another location which likely doesn't align with the end of the previous set of data.
Think of it this way: the index acts as the input to the wave function. The index is, in a way, time.
Since sound is cyclical, we know that two different times produce the same output value. In fact, high school trigonometry taught us about phase shift. Phase shift is the horizontal displacement required for two functions to be identical. We should be able to calculate the phase shift for our sound waves.
We also know the length of the audio buffer (we're manually setting it). Therefore, it ought to be possible to calculate two index values that produce the same output.
SDL2 reads sound from the audio buffer one buffer length at a time. The issue is that we need the wave to continue smoothly from one buffer to the next.
One way to solve this is to reset the index counter at a period boundary, where the wave's value repeats.
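A sketch of that approach: wrap the index at a whole number of periods, so the value just after the reset is exactly the value the wave would have produced anyway. The constants are the same arbitrary ones as before; note that integer division makes the period slightly inexact (a tiny pitch error), but the phase stays continuous:

```c
#include <stdint.h>

#define SAMPLE_RATE 48000  /* sample frames per second */
#define TONE_HZ     256    /* pitch of the square wave (arbitrary choice) */

/* Number of samples in one full wave cycle. */
#define SAMPLES_PER_PERIOD (SAMPLE_RATE / TONE_HZ)

/* Advance the running index, wrapping at a period boundary. Because the
 * wave repeats every SAMPLES_PER_PERIOD samples, index 0 produces the
 * same value the wave would have at index SAMPLES_PER_PERIOD, so the
 * reset is inaudible and the counter can never roll over. */
static uint64_t next_index(uint64_t index) {
    return (index + 1) % SAMPLES_PER_PERIOD;
}
```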
- Day 7: https://davidgow.net/handmadepenguin/ch7.html
- Day 8: https://davidgow.net/handmadepenguin/ch8.html
- Day 9: https://davidgow.net/handmadepenguin/ch9.html
- Getting Circular with SDL Audio: https://ericscrivner.me/2017/10/getting-circular-sdl-audio/
Space-related characteristics include things such as wavelength and wave velocity.
Wavelength, λ, is the distance over which a wave's shape repeats. Wave velocity, or speed if ignoring direction of travel, v, is the rate at which the wave propagates through the medium.
velocity = wavelength / period
v = λ/T
We can relate frequency, velocity, and wavelength by combining the previous relationships.
Solving for T...
    f = 1/T
    T = 1/f

Solving for T...
    v = λ/T
    T = λ/v

Equating...
    T = T
    1/f = λ/v

Solving for v...
    v = f * λ
To understand what it means to init within SDL2, you need to look at the definition of the subsystem. Initialization appears to mean setting up reference counting for related resources.