Fixed-Point Tutorial #2

Posted on Thu 27 March 2025 in signal-processing

Introduction

I've received a lot of feedback on my blog posts over the past few months. I'm humbled and grateful for the suggestions, and I always look forward to hearing from readers. You'll often see that I leave my mistakes in the posts along with the corrections, giving credit where it's due. I'm thrilled (and a little nervous) to know that people are using my tutorial to learn about fixed-point signal processing! This post is a continuation of my original tutorial; it covers a few things about fixed-point signal processing that I feel are essential but may not have been clear before.

The Fundamental Rules of Binary Arithmetic

Computers work with finite, computable numbers, a notion formalized by Alan Turing. That means we must reserve a finite number of bits to represent and compute these numbers. Fixed-point computation, commonly implemented using two's complement integers, follows certain rules that govern how bits are allocated for arithmetic operations:

  • Given two integers \(a_1\) and \(a_2\) with bit widths of \(N_1\) and \(N_2\) respectively, the product, \(a_1 \times a_2\), will require \(N_1 + N_2\) bits.

  • The sum, \(a_1 + a_2\), will require \(\max(N_1, N_2) + 1\) bits.

  • The difference, \(a_1 - a_2\), will require \(\max(N_1, N_2) + 1\) bits (same as addition, since subtraction is addition of a negated value).

  • The quotient, \(a_1 \div a_2\), yields roughly \(N_1 - N_2\) significant bits when the denominator uses its full range; in the worst case (a denominator of \(\pm 1\)) the quotient is as wide as the numerator, so the width you reserve depends on the scaling and precision you need.
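
A minimal C sketch of the first two rules, assuming 16-bit operands (the variable names are mine, for illustration): widening to a 32-bit type before the operation guarantees the result fits.

#include <stdint.h>

int16_t a1 = 30000, a2 = 30000;      // two 16-bit operands
int32_t product = (int32_t)a1 * a2;  // needs 16 + 16 = 32 bits
int32_t sum     = (int32_t)a1 + a2;  // needs max(16, 16) + 1 = 17 bits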

In the previous fixed-point post, I discussed the Qm.n format, where n is the number of bits reserved for the fractional part and m is the number of bits reserved for the integer part. To summarize: in a 16-bit signed integer (int16_t), if we set the scaling point at bit 14 (i.e., Q2.14), it means 2 bits are used for the integer part (including the sign), and 14 bits are used for the fractional part.

Also note: in the Qm.n notation, the sum of m and n should equal the total bit-width of the type.
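
As a concrete example, converting between floating-point and Q2.14 is just a scale by \(2^{14} = 16384\). A minimal sketch (the function names are mine; saturation is omitted for brevity):

#include <stdint.h>
#include <math.h>

int16_t double_to_q2_14(double x)  { return (int16_t)lrint(x * 16384.0); }
double  q2_14_to_double(int16_t q) { return (double)q / 16384.0; }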

With that in mind, we can now expand on the fundamentals using the Q notation:

  • The product of two fixed-point numbers, Qa.b and Qc.d, results in a number of format Q(a + c).(b + d).

    Example: Q1.15 × Q1.15 → Q2.30 (a C sketch of this rule follows this list)

  • The sum of two fixed-point numbers requires aligning their Q points first; the result takes the larger fractional width, the larger integer width, and one extra integer bit for the potential carry.

    Example: Q1.15 + Q2.14 → Q3.15

  • Subtraction follows the same rule as addition.

  • Division is... well, we’ll get to that soon.
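
Here's that sketch: a minimal Q1.15 multiply in C, truncating rather than rounding (the helper name q15_mul is mine):

#include <stdint.h>

// Q1.15 x Q1.15 -> Q2.30; shifting right by 15 truncates back to Q1.15.
// Note: -1.0 * -1.0 overflows Q1.15; saturation is omitted for brevity.
static inline int16_t q15_mul(int16_t a, int16_t b) {
  int32_t product = (int32_t)a * (int32_t)b;  // Q2.30
  return (int16_t)(product >> 15);            // Q1.15, truncated
}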

Division

I didn’t cover division in the previous post because, in real-time DSP environments, division is generally discouraged. It’s slow and can be costly on embedded systems. Also, it introduces the risk of divide-by-zero errors. But, if you’re stuck having to do it, here’s how to think about it.

Remember long division from grade school? Suppose you wanted three decimal places in the result. Take the example: 1234 ÷ 1000.

If we’re working in base-10 and want three decimal places of precision, we scale the numerator by 10³: 1234 → 1234000.

Now divide: 1234000 ÷ 1000 = 1234, which we interpret as 1.234.

In binary (base-2), the principle is exactly the same. Suppose you're dividing a Q1.15 number by another Q1.15 number, and you want to maintain 15 bits of fractional precision in the result. To achieve that, you first left-shift the numerator by 15 bits (i.e., multiply it by \(2^{15}\)) before performing the integer division. That means the intermediate value needs 31 bits (a 16-bit value shifted left by 15 bits), so it must be held in a wider type such as int32_t:

int32_t quotient = ((int32_t)numerator << 15) / denominator;

This will give you a Q17.15-formatted result.

But why shift the numerator?

Dividing by a small number like 0.001 is equivalent to multiplying by a large number; in this case, \(\frac{1}{0.001} = 1000\). To see this more concretely, let's continue with the decimal analogy:

Suppose we’re working with a decimal scaling factor of \(10^2\), and we want to compute 123.45 ÷ 67.89. Here’s how we’d do it:

  • Scale both numbers by \(10^2\) to preserve precision:

    123.45 → 12345
    67.89 → 6789

  • To ensure the result has the same scaling factor, we need to shift the numerator by another factor of \(10^2\):

    12345 × 100 = 1234500

  • Now divide:

    1234500 ÷ 6789 = 181 (truncating any fractional part)

  • Rescale the result:

    181 → 1.81, which matches the actual division result.
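
In C, the same steps collapse to a couple of integer operations (the variable names are mine, for illustration):

#include <stdint.h>

int32_t num = 12345;              // 123.45 scaled by 10^2
int32_t den = 6789;               //  67.89 scaled by 10^2
int32_t q   = (num * 100) / den;  // 1234500 / 6789 = 181, i.e., 1.81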

Rounding

Rounding is a big topic, and I won't pretend to be an expert. There are many rounding behaviors; the one illustrated below is round-half-up (add half of the digit you're about to drop, then truncate), which is probably the most common in fixed-point code. Banker's rounding, which rounds halves to the nearest even value, is another popular mode, but it needs extra logic. I'll illustrate round-half-up using fixed-point arithmetic.

Let’s revisit the division example, but this time with rounding:

  1. After scaling the numerator and denominator, we add additional scaling to the numerator. Instead of scaling by \(10^2\), let’s scale by \(10^3\) for an additional rounding digit:

    12345 × 1000 = 12345000

  2. Now divide:

    12345000 ÷ 6789 = 1818 (truncating the fractional part)

  3. To apply rounding, add the rounding factor, which is 5 at this scale:

    1818 + 5 = 1823

  4. Remove the extra scaling by dividing by 10:

    1823 ÷ 10 = 182 (truncating the fractional part)

  5. Rescale the result back to fixed-point:

    182 → 1.82

    Which gives us the correct result with rounding applied.
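
In general, with a guard scale of \(b\) (10 for one decimal guard digit, 2 for one guard bit), round-half-up truncation is

\[ \text{result} = \left\lfloor \frac{q + b/2}{b} \right\rfloor \]

where \(q\) is the quotient computed with the extra factor of \(b\) applied. With \(b = 10\): \(\lfloor (1818 + 5) / 10 \rfloor = 182\). With \(b = 2\), this becomes the add-one-and-shift idiom used in the binary version below.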

This can be extended to binary using a few additional operations, as shown below for Q1.15 values:

// 15 fractional bits for the Q15 result, plus 1 extra guard bit for rounding
int32_t scaled_numerator = (int32_t)numerator << (15 + 1);
// dividing out the Q1.15 denominator leaves 16 fractional bits
int32_t intermediate_quotient = scaled_numerator / denominator;
// add half (1) and shift out the guard bit: a Q17.15 result, rounded
int32_t quotient_rounded = (intermediate_quotient + 0x1) >> 1;

For a more generic divide, we can write a function like this:

#include <stdint.h>

#define Q_NOT_REAL INT16_MIN  // 0x8000 bit pattern; sentinel for invalid input

int16_t q15_divide(int16_t numerator, int16_t denominator,
                   unsigned int result_q) {
  // result_q up to 15 keeps the shifted numerator within 32 bits
  if (result_q > 15 || denominator == 0) {
    return Q_NOT_REAL;
  }

  int32_t num = (int32_t)numerator;
  int32_t den = (int32_t)denominator;

  // Left shift the numerator to preserve precision during division
  int32_t scaled_num = num << (result_q + 1);

  int32_t intermediate = scaled_num / den;

  // Round half up: we kept one extra guard bit, so add half (1)
  // and shift the guard bit back out
  int32_t rounded = (intermediate + 1) >> 1;

  // Optionally clamp to int16_t range
  if (rounded > INT16_MAX)
    return INT16_MAX;
  else if (rounded < INT16_MIN)
    return INT16_MIN;
  else
    return (int16_t)rounded;
}
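
A quick usage sketch (the operand values are hypothetical, chosen so the result is easy to verify by hand):

int16_t a = 16384;                 // 0.5 in Q1.15
int16_t b = 26214;                 // ~0.8 in Q1.15
int16_t q = q15_divide(a, b, 15);  // 20480, i.e., 0.625 in Q1.15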

Practical Considerations of Fixed-Point Processing

This section focuses on some general rules and lessons learned when deploying algorithms in fixed-point environments, including practical considerations such as project timelines, expectations, and hardware constraints.

  • Double your project timeline when targeting fixed-point
    Developing a fixed-point algorithm is effectively a two-step process:

    1. Implement the algorithm in floating-point.
    2. Re-implement it in fixed-point.

    Once both versions are complete, you'll need to run regression tests to ensure the results do not diverge significantly. If you've budgeted six months for a floating-point implementation, be realistic and plan for a full year when fixed-point is involved.

  • Claim: "Fixed-point is always faster than floating-point"
    Not necessarily! It depends entirely on the hardware architecture.

    For example, a microcontroller with a hardware FPU (Floating Point Unit) may actually run floating-point code faster than fixed-point emulation. If your target lacks DSP-style features (e.g., wide accumulators, single-cycle multiply-accumulate, barrel shifters, etc.), then porting to fixed-point might not give you the performance or power savings you're expecting.

  • Be prepared to dive into assembly or use compiler intrinsics
    This cannot be overstated. To fully optimize fixed-point performance, you often need to work closer to the metal, either by writing assembly for your target or by leveraging compiler intrinsics that map directly to specialized instructions.

    Some toolchains support the Embedded C extensions for fixed-point arithmetic (ISO/IEC TR 18037, draft N1169). If your compiler supports them, take advantage of it. They can significantly improve code clarity and portability, while retaining performance.
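
For instance, with a toolchain that implements these extensions, Q15-style code can be written directly in C. Support and the exact Q format of each type are target-dependent, so treat this as a sketch:

#include <stdfix.h>

// _Fract types come from TR 18037; on many targets a plain signed
// fract behaves like a Q15 value.
fract half    = 0.5r;         // fixed-point literal (note the 'r' suffix)
fract product = half * half;  // the compiler emits the fixed-point multiply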

Conclusion

Thank you again for your feedback! I hope you found this post valuable and insightful. I would love to keep this two-way communication going. So, feel free to reach out with suggestions, corrections, or ideas for future posts. I’m not perfect, and I’m never afraid to be wrong. Every bit of feedback helps me (and others) learn and grow.