Vectorizing IIR Filters: What are you Recursing?
Posted on Wed 12 February 2025 in signal-processing
Disclaimer: This article took quite a while to prepare. Although I’ve made every effort to fact-check and ensure the accuracy of the content, there may still be errors. If you notice any mistakes, please feel free to reach out and let me know!
I like writing programs that run fast (and accurately). Knowing when to compromise accuracy for performance is a very difficult task, but it can be achieved with enough time, dedication, and a sprinkle of Zen. Over the course of my career, I have had the pleasure of taking algorithms and CPU ISA documents and mashing them together into something spectacular. By "spectacular," I mean innovation that led to longer battery life, cooler CPU temperatures, and the satisfaction of fully utilizing every transistor on the chip. One algorithm that I always had to port across various CPU architectures is the IIR filter. It is primarily used for audio equalization, a crucial algorithm in most audio products today.
IIR, or Infinite Impulse Response, filters are mathematically defined as:
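$$y(n) = \sum_{k=0}^{M} b_k\,x(n-k) - \sum_{k=1}^{N} a_k\,y(n-k)$$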
Note: Wikipedia uses different notation. I am redefining them to be consistent with the rest of the post.
In the equation above:
- \(n\) is the discrete-time index: \(n \in \mathbb{Z}\)
- \(x(n)\) is the discrete-time input signal
- \(y(n)\) is the output
- \(a_k\) are feedback or denominator coefficients
- \(b_k\) are feed-forward or numerator coefficients
- \(a_k\) and \(b_k\) shape the filter's behavior (lowpass, highpass, etc.)
I won’t dive too deep into IIR fundamentals here as there are plenty of excellent resources available:
- Selesnick, Infinite-Impulse Response Digital Filters: Lecture notes covering the basics of IIR filter design.
- CCRMA, Elementary Filter Sections: A discussion on various elementary filter sections for digital filtering.
- WPI, L5: IIR Filters and Their Implementation: Lecture notes focusing on the practical aspects of implementing IIR filters.
In summary, here are some distilled guidelines for efficient IIR filter implementations:
- Second-Order Sections (SOS): Break the filter into second-order stages to mitigate numerical instability. While other topologies exist, such as the State Variable Filter (SVF), this discussion focuses on SOS implementations.
- Floating-Point Implementations: Opt for the transposed direct form II topology to enhance numerical stability. There's a wealth of literature on this topic, so further exploration can yield additional insights.
- Fixed-Point Implementations: Use the direct form I topology for fixed-point designs, as there are no internal overflows. Reserve direct form II for cases where memory constraints are particularly severe.
- Regular Coefficient Testing: IIR filters can (and will) drift into instability over time with certain inputs. It's wise to periodically test and, if necessary, reset the filter ~~coefficients~~ states[^1] to ensure continued stability.
Most modern amplifier SoCs include dedicated hardware accelerators for IIR filters. Memory-mapped registers can be configured with the desired filter coefficients. Some DSPs even offer dual accumulators to separately maintain feed-forward and feedback states before merging the results. In contrast, general-purpose CPUs are designed for parallel execution using pipeline stages and SIMD instruction sets. Standard C implementations of IIR filters rarely exploit this parallelism due to inherent data dependencies. To bridge this gap, we must leverage compiler intrinsic functions and some DSP math, which is the main focus of this post.
Biquads: 2nd Order IIR Filters (SOS)
Biquads are second-order IIR filters commonly used as building blocks for higher-order designs. They are named "biquad" because their transfer function in the Z-domain consists of two quadratic polynomials:
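$$H(z) = \frac{b_0 + b_1 z^{-1} + b_2 z^{-2}}{a_0 + a_1 z^{-1} + a_2 z^{-2}}$$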
In the time domain, the filter is described by the difference equation:
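$$y(n) = \frac{1}{a_0}\Big(b_0\,x(n) + b_1\,x(n-1) + b_2\,x(n-2) - a_1\,y(n-1) - a_2\,y(n-2)\Big)$$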
Typically, \(a_0\) is normalized to \(1\), because divisions are computationally expensive.
For floating-point implementations, the transposed direct form II topology is preferred over direct form I due to its superior numerical stability. It also reduces memory usage by consolidating four delay elements into two. In this form, the filter equations can be reformulated as:
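$$
\begin{aligned}
y(n)   &= b_0\,x(n) + w_1(n-1) \\
w_1(n) &= b_1\,x(n) - a_1\,y(n) + w_2(n-1) \\
w_2(n) &= b_2\,x(n) - a_2\,y(n)
\end{aligned}
$$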
Below is a sample C function that processes a block of samples using this formulation:
void process_biquad(const float *coeff, float *state, const float *input,
float *output, size_t num_frames) {
// Load coefficients
const float b0 = coeff[0];
const float b1 = coeff[1];
const float b2 = coeff[2];
const float a1 = coeff[3];
const float a2 = coeff[4];
// Load states
float w1 = state[0];
float w2 = state[1];
float x, y;
for (size_t i = 0; i < num_frames; ++i) {
x = input[i];
y = b0 * x + w1;
w1 = (b1 * x) - (a1 * y) + w2;
w2 = (b2 * x) - (a2 * y);
output[i] = y;
}
// Save states
state[0] = w1;
state[1] = w2;
}
Memory Map for the Function Arguments:
`coeff`
: An array of floats representing the filter coefficients in the order \(\{b_0, b_1, b_2, a_1, a_2\}\). \(a_0\) is assumed to be \(1\) and is therefore omitted.

`state`
: The delay buffer containing the state variables \(\{w_1, w_2\}\).

`input`
: Input buffer containing `num_frames` samples.

`output`
: Output buffer that will hold `num_frames` processed samples.

`num_frames`
: The number of samples (or frames) to process.
SIMD Level 1: Process Multiple Channels
A straightforward SIMD optimization is to utilize 2-lane SIMD registers to process multiple biquad channels concurrently. This technique is implemented in my A5Eq.lv2 plugin project, where I use double-precision SIMD with both SSE (for x86) and NEON (for ARM) to process two channels simultaneously. The general idea is shown in the figure below:
graph LR;
  input_left --> H0L(H0,L)
  input_right --> H0R(H0,R)
  H0L --> H1L(H1,L)
  H0R --> H1R(H1,R)
  H1L --> HKL(...)
  H1R --> HKR(...)
  HKL --> HNL(HM,L)
  HKR --> HNR(HM,R)
  HNL --> output_left
  HNR --> output_right
Since LV2 audio buffers are non-interleaved, an extra step is required to interleave and later de-interleave the buffer for SIMD processing. However, for filters with more than four stages (i.e., 8th order filters or higher), the overhead from interleaving is negligible compared to the performance benefits. The basic idea of the implementation is shown below:
graph LR;
  input_left --> Interleave
  input_right --> Interleave
  Interleave --> H0(H0,L / H0,R)
  H0 --> H1(H1,L / H1,R)
  H1 -.-> H2(HM,L / HM,R)
  H2 --> D(De-interleave)
  D --> output_left
  D --> output_right
For implementation details, check out the stereo biquad filter code in A5Eq.lv2/src/dsp_biquad.hpp:77. A5Eq uses the same set of coefficients for both channels, but it can easily be adapted into a true multi-channel filter by changing how the tuning coefficients are stored. The memory map of the coefficients and states for stereo SIMD is as follows. Note that the values are interleaved:
coeff =
{
// Left, Right
b00, b01
b10, b11
b20, b21
a10, a11
a20, a21
};
state =
{
// Left, Right
w10, w11
w20, w21
};
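With this layout, a minimal sketch of such a two-channel kernel using SSE2 double-precision intrinsics might look like the following. This is a simplified stand-in for illustration, not the actual A5Eq.lv2 code; a NEON version would follow the same structure.

```c
#include <emmintrin.h> /* SSE2 */
#include <stddef.h>

/* Sketch only: two-channel biquad (transposed direct form II), one channel
 * per double-precision SIMD lane. Assumes the interleaved coeff/state layout
 * shown above and interleaved {L, R} input/output frames. */
void process_biquad_stereo(const double *coeff, double *state,
                           const double *input, double *output,
                           size_t num_frames) {
  const __m128d b0 = _mm_loadu_pd(&coeff[0]); /* {b0_L, b0_R} */
  const __m128d b1 = _mm_loadu_pd(&coeff[2]);
  const __m128d b2 = _mm_loadu_pd(&coeff[4]);
  const __m128d a1 = _mm_loadu_pd(&coeff[6]);
  const __m128d a2 = _mm_loadu_pd(&coeff[8]);

  __m128d w1 = _mm_loadu_pd(&state[0]); /* {w1_L, w1_R} */
  __m128d w2 = _mm_loadu_pd(&state[2]);

  for (size_t i = 0; i < num_frames; ++i) {
    const __m128d x = _mm_loadu_pd(&input[2 * i]);          /* {x_L, x_R}            */
    const __m128d y = _mm_add_pd(_mm_mul_pd(b0, x), w1);    /* y  = b0*x + w1        */
    w1 = _mm_add_pd(_mm_sub_pd(_mm_mul_pd(b1, x),
                               _mm_mul_pd(a1, y)), w2);     /* w1 = b1*x - a1*y + w2 */
    w2 = _mm_sub_pd(_mm_mul_pd(b2, x), _mm_mul_pd(a2, y));  /* w2 = b2*x - a2*y      */
    _mm_storeu_pd(&output[2 * i], y);
  }

  _mm_storeu_pd(&state[0], w1);
  _mm_storeu_pd(&state[2], w2);
}
```

The loop body is identical to the scalar version; the only difference is that every arithmetic operation now works on a {left, right} pair at once.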
SIMD Level 2: Crossover Filters
Modern CPUs feature multiple SIMD lanes. For example, on the x86/AMD64 architecture, SSE offers 4 lanes of single-precision floats, AVX offers 8, and AVX-512 offers 16. On ARM, NEON offers 4 lanes, and the optional SVE extension can scale up to 64. We can exploit this to perform parallel filtering. Although the underlying math remains unchanged, we can restructure the code and data into parallel nodes that align with the SIMD lanes. One effective approach is to use a crossover structure that scales to any power of two. For example, consider this four-band crossover filter:
graph LR;
  input --> LP0
  input --> HP0
  LP0 --> LP1L
  LP0 --> HP1L
  HP0 --> LP1H
  HP0 --> HP1H
  LP1L --> output_ll
  HP1L --> output_lh
  LP1H --> output_hl
  HP1H --> output_hh
In this design, reusing the crossover section produces a tree-like hierarchy, enabling a divide-and-conquer approach. This allows each branch of the tree to be processed concurrently, effectively leveraging the full width of the SIMD lanes.
For instance, suppose you have a biquad filter implementation that processes four lanes simultaneously:
void crossover_kernel(const float *coeff, float *state,
const float *input, float *output,
size_t num_frames);
This function processes the following four biquad filters, denoted \(H_{0..3}\):
graph LR;
  x_l --> H_0
  x_l --> H_1
  x_r --> H_2
  x_r --> H_3
  H_0 --> y_1
  H_1 --> y_2
  H_2 --> y_3
  H_3 --> y_4
One possible implementation is available in this Github Gist, where a stereo signal is processed concurrently by four filters. The input is arranged as \(\{x_l, x_l, x_r, x_r\}\).
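As a rough sketch (not the Gist's exact code), such a kernel could be written with SSE intrinsics as follows. The buffer layout here is an assumption for illustration: the coefficients and states follow the 5 × 4 and 2 × 4 layout described later, the input is a 2-channel interleaved buffer, and the four lane outputs are written interleaved per frame.

```c
#include <xmmintrin.h> /* SSE */
#include <stddef.h>

/* Sketch only: four biquads (transposed direct form II), one per SIMD lane.
 * Each 2-channel input sample is duplicated so the lanes see
 * {x_l, x_l, x_r, x_r}; the four lane outputs are stored interleaved. */
void crossover_kernel(const float *coeff, float *state,
                      const float *input, float *output,
                      size_t num_frames) {
  const __m128 b0 = _mm_loadu_ps(&coeff[0]);
  const __m128 b1 = _mm_loadu_ps(&coeff[4]);
  const __m128 b2 = _mm_loadu_ps(&coeff[8]);
  const __m128 a1 = _mm_loadu_ps(&coeff[12]);
  const __m128 a2 = _mm_loadu_ps(&coeff[16]);

  __m128 w1 = _mm_loadu_ps(&state[0]);
  __m128 w2 = _mm_loadu_ps(&state[4]);

  for (size_t i = 0; i < num_frames; ++i) {
    /* Build {x_l, x_l, x_r, x_r}; _mm_set_ps takes arguments high-to-low. */
    const __m128 x = _mm_set_ps(input[2 * i + 1], input[2 * i + 1],
                                input[2 * i + 0], input[2 * i + 0]);
    const __m128 y = _mm_add_ps(_mm_mul_ps(b0, x), w1);    /* y  = b0*x + w1        */
    w1 = _mm_add_ps(_mm_sub_ps(_mm_mul_ps(b1, x),
                               _mm_mul_ps(a1, y)), w2);    /* w1 = b1*x - a1*y + w2 */
    w2 = _mm_sub_ps(_mm_mul_ps(b2, x), _mm_mul_ps(a2, y)); /* w2 = b2*x - a2*y      */
    _mm_storeu_ps(&output[4 * i], y);                      /* {y0, y1, y2, y3}      */
  }

  _mm_storeu_ps(&state[0], w1);
  _mm_storeu_ps(&state[4], w2);
}
```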
Building on that foundation, we can construct the entire crossover structure by cascading multiple stages of this kernel. Each stage splits the incoming signal into low and high-frequency components, with the outputs from one stage serving as the inputs for the next.
In the code below, the scratch buffer is used to store intermediate results, ensuring that data is properly aligned for SIMD operations. This structure dramatically improves performance by processing multiple bands in parallel. There are ways to minimize scratch buffer usage by utilizing a queue with a worst-case period, but that is left as an exercise for the reader.
void process_crossover(const float *coeff, float *state,
const float *input, float *output, float *scratch,
size_t num_frames, size_t radix)
{
/*
* `radix` is defined as log2(output_bands). Since there are four filters
* utilizing 4-SIMD lanes, number of stages is half of radix.
*/
assert(radix >= 2);
size_t num_stages = radix / 2;
/*
* Preprocess the input buffer:
* Since the crossover kernel processes a stereo pair (2 samples) per frame,
* duplicate each mono sample into two lanes.
*/
for (size_t frame = 0; frame < num_frames; ++frame) {
scratch[frame * 2 + 0] = input[frame];
scratch[frame * 2 + 1] = input[frame];
}
/*
* Each biquad filter section is defined by 5 groups of coefficients,
* with 4 coefficients per SIMD lane (thus 5 * 4 coefficients per set).
*/
const size_t coeffs_per_set = 5 * 4;
/*
* Similarly, each biquad filter uses 2 sets for its state,
* with 4 state values per set.
*/
const size_t states_per_set = 2 * 4;
/*
* The kernel outputs 4 channels (SIMD lanes) per filter stage, while
* the input to the kernel has 2 channels (from the duplicated mono input).
*
* Hence, the "block" sizes for each set are defined as:
* - block_per_set_in: 2 channels * num_frames
* - block_per_set_out: 4 channels * num_frames
*/
const size_t block_per_set_out = 4 * num_frames;
const size_t block_per_set_in = 2 * num_frames;
/* Some offset counters to keep track of pointers */
size_t coeff_offset = 0;
size_t state_offset = 0;
size_t scratch_offset = 0;
/* Num sets start at 1, and then doubles every stage */
size_t num_sets = 1;
for (size_t stage = 0; stage < num_stages; ++stage) {
for (size_t set = 0; set < num_sets; ++set) {
const float *coeff_ptr = coeff + coeff_offset;
float *state_ptr = state + state_offset;
float *scratch_ptr = scratch + scratch_offset;
/*
* Compute input pointer for this set:
* Each set uses a contiguous block of 'block_per_set_in' samples in the
* scratch buffer.
*/
const float *input_ptr =
(const float *)(scratch_ptr + (set * block_per_set_in));
float *output_ptr = NULL;
if (stage == num_stages - 1) {
/*
* For the final stage, write the filter output directly into the output
* buffer.
*/
output_ptr = output + (set * block_per_set_out);
} else {
/*
* For intermediate stages, write the output back into the scratch
* buffer. This output region is located after the current input block.
*/
output_ptr = scratch_ptr + (set * block_per_set_out) +
(num_sets * block_per_set_in);
}
crossover_kernel(coeff_ptr, state_ptr, input_ptr, output_ptr, num_frames);
/*
* Advance the coefficient and state offsets to point to the next filter set.
*/
coeff_offset += coeffs_per_set;
state_offset += states_per_set;
}
/*
* After processing all sets for this stage, move the scratch offset
* to the next available block (i.e. after all input blocks of this stage).
*/
scratch_offset += num_sets * block_per_set_in;
num_sets *= 2;
}
}
Parameter Breakdown
| Parameter | Description |
|---|---|
| `coeff` | Pointer to filter coefficients (size: `5 * 4 * TOTAL_SETS`). |
| `state` | Pointer to filter state buffer (size: `2 * 4 * TOTAL_SETS`). |
| `input` | Pointer to the mono input signal (`num_frames` elements). |
| `output` | Pointer to the processed output buffer (`num_frames * bands` elements). |
| `scratch` | Pointer to a temporary buffer for intermediate results. |
| `num_frames` | The number of input frames (samples). |
| `radix` | Defines the structure and number of stages (`>= 2`). |
For a binary tree of height `RADIX`, the total number of nodes (filters) is given by \(2^{\text{RADIX}+1} - 1\). In C, this can be defined as:
#define TOTAL_SETS ((1 << (RADIX + 1)) - 1)
The coefficient array is organized as a sequence of "sets," each corresponding to a node in the crossover filter tree. The memory map of the coefficient array is shown below. Note that the initial low-pass and high-pass filters are each repeated twice to take advantage of the SIMD architecture:
; Set 0
; LP0, LP0, HP0, HP0
b0(0), b0(1), b0(2), b0(3),
b1(0), b1(1), b1(2), b1(3),
b2(0), b2(1), b2(2), b2(3),
a1(0), a1(1), a1(2), a1(3),
a2(0), a2(1), a2(2), a2(3),
; Set 1
; LP1L, HP1L, LP1H, HP1H
b0(0), b0(1), b0(2), b0(3),
b1(0), b1(1), b1(2), b1(3),
b2(0), b2(1), b2(2), b2(3),
a1(0), a1(1), a1(2), a1(3),
a2(0), a2(1), a2(2), a2(3),
...
The scratch buffer size is given by:
SCRATCH_SIZE = 4 * TOTAL_SETS * FRAME_SIZE
For instance, with `TOTAL_SETS = 7` (from `RADIX = 2`) and `FRAMES = 8`, the scratch buffer contains `7 * 4 * 8 = 224` floats.
`radix` is \(\log_2(\text{bands})\), so for a 4-band crossover, `radix = 2`.
Tuning the Crossover
For a crossover filter with \(M\) bands (where \(M\) is a power of two), there will be \(M - 1\) critical crossover frequencies that define the transitions between adjacent bands. The classic fence vs. post problem, thanks Matt Parker! We are not concerned about the posts at the two ends, since they correspond to DC and the Nyquist frequency. These frequencies can be represented as:
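$$\{f_0, f_1, \ldots, f_{M-2}\}$$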
For example, in a 4-band crossover, the critical frequencies are:
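$$\{f_0, f_1, f_2\}$$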
The crossover process follows a hierarchical structure, where the first split occurs at \(f_1\), dividing the signal into two bands. Each subsequent stage refines these bands further, processing \(f_0\) and \(f_2\) in a top-down order within the crossover tree.
A systematic way to determine the processing sequence is by traversing the crossover tree in top-down order, where the root node (highest-level split) is processed first, followed by its child nodes. Implementing this traversal method is left as an exercise for the reader. I believe in you!
SIMD Level 3: Parallel SOS
Alright, final stretch! We have explored some niche ways of utilizing SIMD instruction sets, but this method is my favorite because it transforms a recursive filter into a parallel structure, which is mathematically elegant. We want to take something recursive and convert it into something parallel.
Our goal is to take a cascade of second‐order sections (SOS):
graph LR;
  x --> H0(H0)
  H0 --> H1(H1)
  H1 --> H2(H2)
  H2 --> H3(H3)
  H3 --> y
and recast it as a parallel combination:
graph LR;
  x --> H0(H'0)
  x --> H1(H'1)
  x --> H2(H'2)
  x --> H3(H'3)
  H0 --> A(+)
  H1 --> A
  H2 --> A
  H3 --> A
  A --> y
We know that an IIR filter's transfer function can be written as a ratio of two polynomials:
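$$H(z) = \frac{B(z)}{A(z)} = \frac{\sum_{k=0}^{M} b_k\,z^{-k}}{1 + \sum_{k=1}^{N} a_k\,z^{-k}}$$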
We also know that any polynomial of degree \(N\) has \(N\) roots. Setting the numerator to zero yields the zeros of the system; likewise, setting the denominator to zero gives the poles. We can express the transfer function in its factored form as:
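$$H(z) = g\,\frac{\prod_{k=1}^{M} \left(1 - q_k\,z^{-1}\right)}{\prod_{k=1}^{N} \left(1 - p_k\,z^{-1}\right)}$$

where \(q_k\) are the zeros, \(p_k\) are the poles, and \(g\) is an overall gain.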
We can rearrange the transfer function into a partial-fraction expansion. For a strictly proper filter:
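$$H(z) = \sum_{k=1}^{N} \frac{r_k}{1 - p_k\,z^{-1}}$$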
where each \(r_k\) is the residue corresponding to the pole \(p_k\).
Fortunately, both SciPy and MATLAB offer the `residuez` function to compute these coefficients, so we don't have to do the pre-calculus by hand.
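A minimal SciPy sketch (the coefficients below are arbitrary placeholders, not from a real design):

```python
import numpy as np
from scipy.signal import residuez

# Placeholder strictly proper transfer function H(z) = B(z) / A(z).
b = np.array([0.2, 0.3])        # numerator (feed-forward) coefficients
a = np.array([1.0, -0.9, 0.5])  # denominator (feedback) coefficients

# r: residues, p: poles, k: direct (polynomial) terms
r, p, k = residuez(b, a)
print(r, p, k)
```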
For real-valued filters, the poles (and corresponding residues) occur in complex–conjugate pairs. Suppose we have a conjugate pair given by
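$$\frac{r}{1 - p\,z^{-1}} \quad\text{and}\quad \frac{r^*}{1 - p^*\,z^{-1}}$$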
Because the overall system is linear, we add these terms:
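$$\frac{r}{1 - p\,z^{-1}} + \frac{r^*}{1 - p^*\,z^{-1}} = \frac{2\,\Re\{r\} - 2\,\Re\{r\,p^*\}\,z^{-1}}{1 - 2\,\Re\{p\}\,z^{-1} + |p|^2\,z^{-2}}$$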
Given our original formulation of SOS:
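$$H_k(z) = \frac{b_0 + b_1 z^{-1} + b_2 z^{-2}}{1 + a_1 z^{-1} + a_2 z^{-2}}$$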
After doing a lot of algebra, we derive:
- \(b_0 = 2\,\Re\{r\}\)
- \(b_1 = -2\,\Re\{r\,p^*\}\)
- \(b_2 = 0\) (since there is no \(z^{-2}\) term in the numerator)
- \(a_1 = -2\,\Re\{p\}\)
- \(a_2 = |p|^2\)
If two real poles are encountered instead, they can be grouped similarly by adding the corresponding first-order terms. And if an unpaired real pole remains, it can be padded to form a first-order branch represented in second-order form.
This parallel SOS structure allows you to process multiple second-order sections simultaneously using SIMD, effectively converting a recursive (cascade) IIR filter into a parallel structure. For an implementation example, check out my Github Gist.
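For illustration, here is a hedged sketch of what a four-branch parallel-SOS kernel might look like with SSE intrinsics (not the Gist's exact code): every branch sees the same input sample, and the branch outputs are summed into one output. The coefficient/state layout (5 × 4 and 2 × 4 floats) is an assumption, and any direct term returned by `residuez` is omitted.

```c
#include <xmmintrin.h> /* SSE */
#include <stddef.h>

/* Sketch only: four parallel second-order branches, one per SIMD lane.
 * Every branch is fed the same input sample; the branch outputs are summed
 * into a single output. Coefficient layout: 5 groups of 4 floats
 * ({b0 x4, b1 x4, b2 x4, a1 x4, a2 x4}); state: 2 groups of 4 floats. */
void process_parallel_sos(const float *coeff, float *state,
                          const float *input, float *output,
                          size_t num_frames) {
  const __m128 b0 = _mm_loadu_ps(&coeff[0]);
  const __m128 b1 = _mm_loadu_ps(&coeff[4]);
  const __m128 b2 = _mm_loadu_ps(&coeff[8]);
  const __m128 a1 = _mm_loadu_ps(&coeff[12]);
  const __m128 a2 = _mm_loadu_ps(&coeff[16]);

  __m128 w1 = _mm_loadu_ps(&state[0]);
  __m128 w2 = _mm_loadu_ps(&state[4]);

  for (size_t i = 0; i < num_frames; ++i) {
    /* Broadcast the same input sample to all four branches. */
    const __m128 x = _mm_set1_ps(input[i]);
    const __m128 y = _mm_add_ps(_mm_mul_ps(b0, x), w1);
    w1 = _mm_add_ps(_mm_sub_ps(_mm_mul_ps(b1, x), _mm_mul_ps(a1, y)), w2);
    w2 = _mm_sub_ps(_mm_mul_ps(b2, x), _mm_mul_ps(a2, y));

    /* Horizontal sum of the four branch outputs. */
    float lanes[4];
    _mm_storeu_ps(lanes, y);
    output[i] = lanes[0] + lanes[1] + lanes[2] + lanes[3];
  }

  _mm_storeu_ps(&state[0], w1);
  _mm_storeu_ps(&state[4], w2);
}
```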
Conclusion
I genuinely enjoyed exploring and writing about these concepts. While I dedicated a lot of time to fact-checking, proofs, and testing, I cannot guarantee that every detail is perfect. If you spot any errors or have suggestions for improvement, please let me know via email. And if you appreciate nerdy deep dives into technical topics like this, consider supporting my work via GitHub Sponsors.
[^1]: Anonymous: Thanks for catching that typo! Phew, that was close.