SDL2 Audio Programming

Table of Contents

I've been following along with Casey Muratori's Handmade Hero series where he live codes a game from scratch1. Casey covers audio in days 7, 8, and 9. While it's not conceptually much harder than trigonometry, it is fiddly and the terminology is easy to mix up.

There are several barriers in practice. First, Casey uses Windows. Non-Windows systems, such as GNU, can't use the same API. Others who have followed along on non-Windows systems use the Simple DirectMedia Layer 2.0 (SDL2). And while SDL2 works great, its documentation doesn't. The SDL documentation focuses on what, leaving how as an exercise (in frustration). The second problem is that all the audio implementations and explanations I've found are buggy or needlessly complex2.

This document presents my understanding of SDL2 audio in an effort to solidify my learning. I also hope it will assist others who feel similarly confused by the currently available resources.

Sound basics

Oscillation within a medium produces displacement. For example, the back and forth vibration of a taut string causes the surrounding air to move. The disturbance which spreads across the medium is called a wave. Sound is the auditory perception of a wave. It has several representations. Mathematically, a wave is a cyclic function of time which measures displacement of the medium.

Several characteristics describe a wave. The characteristics fall into two groups depending on whether they're considered relative to time or space. For Handmade Hero, we only need to consider the time related characteristics of period and frequency3.

High frequency

d
i      <-------- Period -------->
s|    ...                       ...                       ...                       .
p| .... ....                 .... ....                 .... ....                 ....
l|..       ..               ..       ..              ...       ..               ..
a+----------..------------..----------..-----^------..----------..------------...----time
c|           ..         ...            ..    | Amplitude         ..         ...
e|            ...     ...               ...  |  ...               ...     ...
m|               ......                    ..V....                    ......
e
n
t


Low frequency

d
i         ............                                            .........
s|     .. ..       .. ..                                      . ...    .  .....
p|   . .               . .                                   ..                ...
l| . .                   ...                                ..                   ...
a+..----------------------..-----------------------------....----------------------.. time
c|                         ...                        ....                          .
e|                           ....                  . ..
m|                               .................
e
n
t

Period, T, is the time it takes for one wave cycle to complete. Frequency, f, is the number of waves passing by a fixed point per unit time (where the wave is measured by a particular feature, such as a peak or trough). A high frequency results in a high pitch, or shriller sound. Low frequencies result in a low pitch, or bassier sound. Frequency is the reciprocal of period and is measured in Hertz, abbreviated Hz, which stands for 1/s (a "one-per-second"). Humans typically can only hear between 20Hz and 20,000Hz.

frequency = 1/period
f = 1/T

Another characteristic we need, related to space, concerns loudness, or volume. Amplitude, A, is the distance between the resting position and the maximum displacement of the wave. It's a measure of how pronounced the wave is. The bigger the amplitude, the louder the sound. Volume is directly related to a wave's amplitude.

There are many different waveforms and they can be combined in arbitrary ways. Two of the simplest waveforms are the square and sine waves. A square wave is the simple alternation between high and low values. A sine wave is given by the sine function, sin(t), from trigonometry.

           Square wave
|
|.......     .......     .......
|      .     .     .     .     .
|      .     .     .     .     .
+------.-----.-----.-----.-----.------
|      .     .     .     .     .
|      .     .     .     .     .
|      .......     .......     .......
|

We represent sound on a computer using samples. From a continuous waveform, we select discrete points to represent the whole. Amazingly, this discrete representation is sufficient to completely reconstruct a continuous waveform4.

d|                         . = continuous wave
i|                         o = sampled value
s|       .o.
p|    .o. | .o.
l| .o. |  |  | .o
a|o |  |  |  |  |.
c+----------------o----------------o---time
e|                 . |  |  |  |  |.
m|                  .o. |  |  | .o
e|                     .o. | .o
n|                        .o.
t|

Each sample value must be represented by a binary number. The size of the number is the bit depth. We can represent two channels of sound, a left and a right speaker, each 16-bits wide, using a 32-bit number, such as an int. The two channel values form a 32-bit sample frame. Bit depth controls how much noise is encoded in the sample. Sampling at 16-bits is sufficient to remove any noise resulting from the encoding5.

The sample rate, or number of samples per second, determines the ability to recreate the original waveform. The Nyquist–Shannon Sampling Theorem governs the conversion between discrete (digital) samples and continuous (analog) waveforms:

Nyquist–Shannon Sampling Theorem

If a function x(t) contains no frequencies higher than B hertz, then it can be completely determined from its ordinates at a sequence of points spaced less than 1/(2B) seconds apart.

This means a sample rate higher than 2B samples per second is sufficient to exactly recreate the sound being sampled. Since the typical range of human hearing is 20Hz to 20,000Hz, any sample rate over 40,000 samples per second is sufficient to reconstruct waveforms audible to the average human6. The standard rate of 44100Hz comes from the sample rate available to common audio equipment at the time the standard was codified. A higher sample rate, such as 48000Hz, captures frequencies up to 24000Hz, 4000Hz above the typical range of hearing. This provides more flexibility when transforming audio, such as bringing sounds outside normal hearing into the audible range.

Minimal implementation

The simplest implementation which outputs sound uses a square wave.

Include the SDL library

The first step is to include the SDL library. Be warned: SDL, aside from not having much documentation, has misleading documentation. Chris Wellons has a nice write-up of some common gotchas7. He recommends including "SDL.h" rather than "<SDL2/SDL.h>" in order to avoid importing the wrong library.

#include "SDL.h"

Initialize the audio subsystem

The audio subsystem needs to be initialized. Someone, somewhere may say, "Duh! Of course you need to initialize it first!" Between you and me, it's not obvious (at all) that this is required of the user. "Init" is too vague a term to be meaningful without context. I couldn't find anywhere in the documentation or tutorials that explicitly explains it8. If your audio logic is perfect yet no sound comes out, check whether you've initialized the audio subsystem.

Technically, you can confirm the subsystem is initialized by checking SDL_GetError. For example, if we thought to look within the bowels of the implementation for opening an audio device, we would see that an error is set when the audio subsystem isn't initialized:

static SDL_AudioDeviceID
open_audio_device(const char *devname, int iscapture,
                  const SDL_AudioSpec * desired, SDL_AudioSpec * obtained,
                  int allowed_changes, int min_id)
{
  // ...some code...

  if (!SDL_WasInit(SDL_INIT_AUDIO)) {
    SDL_SetError("Audio subsystem is not initialized");
    return 0;
  }

  // ...some code...
}

You must manually check for this error using SDL_GetError():

SDL_AudioDeviceID device_id =
  SDL_OpenAudioDevice (NULL, 0, &requested_settings, &obtained_settings, 0);

SDL_Log ("%s\n", SDL_GetError ());

In reality, just make sure your code has something like the following:

if (0 != SDL_Init(SDL_INIT_AUDIO)) {
  SDL_Log ("SDL_Init failed: %s\n", SDL_GetError ());
  return (1);
}

Set up the device

Once the subsystem is initialized, an audio device needs to be opened. From the perspective of SDL, an audio device is a data structure pointing to a section of heap memory, the audio buffer, along with various parameters to manage interactions with it. The audio hardware reads from the buffer and translates the values it finds into displacements of a speaker membrane, producing sound.

SDL2 has two APIs for audio, legacy and not legacy. The non-legacy API is a generalization of the legacy API. We'll use the non-legacy API here.

The audio device needs to know how to understand the bits we place in the audio buffer. The SDL_AudioSpec structure provides this information, such as bit depth, number of channels, sample rate, and the size of the buffer itself.

The API is a little confusing because sample rate is given as "frequency". This frequency refers to sample frames per second, not wave cycles per second (pitch). The "samples" member is also confusing. Here, "samples" means the size of the audio buffer in sample frames. That is, if there are two channels, then one (sample) frame consists of two (individual) samples. The "samples" member should be a power of 2, typically between 512 and 8192. The documentation says,

the number of sample frames is directly related to time by the following formula: ms = (sampleframes*1000)/freq

For example, with a sample rate of 48000Hz (sample frames per second) and a buffer of 4096 frames, the buffer holds (4096*1000)/48000 ≈ 85ms of sound.

Beware! The size of the audio buffer does not directly correspond to the duration of playback. It's merely a pool of data that is processed sequentially. In Handmade Hero, Casey implements the audio buffer as a circular buffer, meaning that the play cursor (which indicates which byte the sound card is currently reading) returns to the start of the buffer when it reaches the end. SDL implements a different mechanism. It uses a callback function which fills the audio buffer with new data when the sound device needs it. When and how that happens is opaque to the user. The callback may (and does) get called many times over the course of playback, even when the total playback duration is short. The size of the audio buffer instead determines the latency, or responsiveness of playback, to changes in the audio data. A smaller buffer size means that more frequent requests will be made for new data.

SDL_AudioSpec requested_settings = { };

requested_settings.freq     = 48000;                     // sample frames per second (Hz)
requested_settings.format   = AUDIO_S16LSB;              // bit depth, 16-bit amplitude values
requested_settings.channels = 2;                         // stereo, left and right
requested_settings.samples  = 4096;                      // size of the audio buffer in
                                                         // sample frames (two samples
                                                         // per frame, L and R).
requested_settings.callback = &fill_audio_device_buffer; // called when sound device needs data

Opening an audio device requires specifying the settings you want…and checking whether that request was fulfilled. If the device is found, it opens in the paused state. You must set the paused state to 0 (unpaused) to start playing. Finally, the program needs to execute long enough for the audio to actually play.

SDL_AudioSpec obtained_settings = { };
SDL_AudioDeviceID device_id =
  SDL_OpenAudioDevice (NULL, 0, &requested_settings, &obtained_settings, 0);

SDL_PauseAudioDevice (device_id, 0);

SDL_Delay (500);  // ms

Fill the audio buffer with the sound you want

All this does nothing if there's no data to play. Data comes from the callback specified by the SDL_AudioSpec. The callback is a function with the signature:

void SDL_AudioCallback(void*  userdata,  // whatever you want
                       Uint8* stream,    // pointer to the audio buffer
                       int    len)       // length of audio buffer in bytes

To play a square wave, we alternate the signal between high and low every half period: high for the first half of each cycle, low for the second. We keep a running total of all samples written in order to track where we are within the current cycle. Each channel is written to the audio buffer in left-right-left-right order.

void
fill_audio_device_buffer (void *user_data, Uint8 * device_buffer, int length)
{
  static int total_sample_count = 0;
  Sint16 *audio_buffer = (Sint16 *) device_buffer;

  int bytes_per_sample = 2 * sizeof (Sint16);        // 2 channels of 16-bit audio depth, 32-bit frame
  int samples_to_write = length / bytes_per_sample;  // sample frames (LR) to write

  int samples_per_second = 48000;                    // sample rate in Hz
  int tone_volume = 3000;                            // amplitude
  int tone_hz = 262;                                 // pitch, approximately middle C
  int half_period = (samples_per_second / tone_hz) / 2;  // samples per half cycle

  for (int i=0; i < samples_to_write; i++)
    {
      Sint16 sample_value = tone_volume;

      // alternate high/low every half period (two half periods make one full cycle)
      if ((total_sample_count / half_period) % 2)
        {
          sample_value = -tone_volume;
        }
      *audio_buffer++ = sample_value; // left channel value
      *audio_buffer++ = sample_value; // right channel value
      total_sample_count++;
    }
}

Complete square wave implementation

Putting it all together, we get:

#include "SDL.h"

void
fill_audio_device_buffer (void *user_data, Uint8 * device_buffer, int length)
{
  static int total_sample_count = 0;
  Sint16 *audio_buffer = (Sint16 *) device_buffer;

  int bytes_per_sample = 2 * sizeof (Sint16);        // 2 channels of 16-bit audio depth, 32-bit frame
  int samples_to_write = length / bytes_per_sample;  // sample frames (LR) to write

  int samples_per_second = 48000;                    // sample rate in Hz
  int tone_volume = 3000;                            // amplitude
  int tone_hz = 262;                                 // pitch, approximately middle C
  int half_period = (samples_per_second / tone_hz) / 2;  // samples per half cycle

  for (int i=0; i < samples_to_write; i++)
    {
      Sint16 sample_value = tone_volume;

      // alternate high/low every half period (two half periods make one full cycle)
      if ((total_sample_count / half_period) % 2)
        {
          sample_value = -tone_volume;
        }
      *audio_buffer++ = sample_value; // left channel value
      *audio_buffer++ = sample_value; // right channel value
      total_sample_count++;
    }
}

int
main (void)
{
  if (0 != SDL_Init(SDL_INIT_AUDIO)) {
    SDL_Log ("SDL_Init failed: %s\n", SDL_GetError ());
    return (1);
  }
  SDL_AudioSpec requested_settings = { };

  requested_settings.freq     = 48000;                     // sample frames per second (Hz)
  requested_settings.format   = AUDIO_S16LSB;              // bit depth, 16-bit amplitude values
  requested_settings.channels = 2;                         // stereo, left and right
  requested_settings.samples  = 4096;                      // size of the audio buffer in
                                                           // sample frames (two samples
                                                           // per frame, L and R).
  requested_settings.callback = &fill_audio_device_buffer; // called when sound device needs data
  SDL_AudioSpec obtained_settings = { };
  SDL_AudioDeviceID device_id =
    SDL_OpenAudioDevice (NULL, 0, &requested_settings, &obtained_settings, 0);

  SDL_PauseAudioDevice (device_id, 0);

  SDL_Delay (500);  // ms

  return (0);
}

Footnotes:

3

Space related characteristics include things such as wavelength and wave velocity.

Wavelength, λ, is the distance over which a wave's shape repeats. Wave velocity, or speed if ignoring direction of travel, v, is the rate at which the wave propagates through the medium.

velocity = wavelength / period
v = λ/T

We can relate frequency, velocity, and wavelength by combining the previous relationships.

Solving for T...
f = 1/T
T = 1/f

Solving for T...
v = λ/T
T = λ/v

Equating...
T = T
1/f = λ/v

Solving for v...
v = f * λ

8

To understand what it means to init within SDL, you need to look at the definition of the subsystem. Initialization appears to mean setting up reference counting for related resources.

2023-03-16

Powered by peut-publier

©2023 Excalamus.com